AI 每日进展速报 / Daily AI Digest - 2026-05-13

图像生成/编辑 / Image Generation/Editing

arXiv

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping 🆕NEW
- 赛道归属: 文生图（扩散模型）/ 强化学习后训练（RLHF/GRPO）
- 核心创新点: 指出GRPO类后训练中“归一化”会导致优势/奖励失配，从而诱发reward hacking；提出“超线性优势塑形”(super-linear advantage shaping) 的后训练策略，通过对优势函数进行非线性重标定来放大高质量样本的学习信号、抑制利用奖励偏置的投机解，并避免直接移除prompt相关项带来的校准问题，从机制上提升对齐增益的真实性与稳定性。
- Track: Text-to-Image (diffusion) / RL post-training (RLHF/GRPO)
- Core innovation: Identifies that normalization in GRPO-style post-training can miscalibrate advantages/rewards and trigger reward hacking; introduces super-linear advantage shaping to nonlinearly rescale advantages—amplifying learning from genuinely good samples while suppressing exploitative reward-bias shortcuts—without bluntly dropping prompt-related terms, improving alignment stability and real quality gains.
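
The shaping idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the digest does not give the actual shaping function, so `super_linear_shape` (a signed power with exponent `power` > 1) is a hypothetical stand-in, and all function names are illustrative. Only the group-wise reward standardization reflects common GRPO practice.

```python
import math
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def super_linear_shape(advantages, power=2.0):
    """Hypothetical shaping: sign(a) * |a|**power with power > 1.

    Advantages with |a| > 1 are amplified and those with |a| < 1 are shrunk,
    while the sign (and therefore the ranking) is preserved.
    """
    return [math.copysign(abs(a) ** power, a) for a in advantages]


# Toy group of rewards for samples drawn from a single prompt.
rewards = [0.1, 0.2, 0.3, 0.9]
adv = group_normalized_advantages(rewards)
shaped = super_linear_shape(adv, power=2.0)
```

With these toy rewards, the clear outlier (0.9) receives a larger shaped advantage than its normalized one, while near-average samples contribute less.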

Masked Generative Transformer Is What You Need for Image Editing 🆕NEW
- 赛道归属: 图像编辑（基于生成式Transformer的局部编辑）
- 核心创新点: 用Masked Generative Transformer替代扩散模型做编辑，利用“掩码token预测”的局部生成范式天然实现编辑区域的空间隔离，避免扩散全局去噪导致的改动外溢；提出EditMGT框架，将编辑建模为受mask约束的token重生成，并配套多阶段/多粒度的训练与推理策略以兼顾局部可控性与全局一致性，实现“只改该改的地方”。
- Track: Image editing (generative Transformer, localized editing)
- Core innovation: Replaces diffusion-based global denoising with a Masked Generative Transformer that performs masked token prediction, inherently confining changes to the intended region and preventing edit leakage; proposes EditMGT that formulates editing as mask-constrained token regeneration with multi-stage/multi-granularity training/inference to balance strict locality and global coherence.
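
To make the locality argument concrete, here is a minimal sketch of mask-constrained token regeneration. It is an assumption-laden toy, not EditMGT itself: `regenerate_masked`, `MASK_ID`, and the `predict(canvas, i) -> (token, confidence)` callback are hypothetical names, and the confidence-ordered commit schedule is a generic masked-decoding pattern rather than the paper's confirmed procedure.

```python
import math

MASK_ID = -1  # hypothetical sentinel; real token ids are assumed non-negative


def regenerate_masked(tokens, edit_mask, predict, steps=2):
    """Re-predict only the positions flagged in edit_mask.

    predict(canvas, i) is a stand-in for the transformer and must return
    (token_id, confidence) for position i given the current canvas.
    """
    canvas = [MASK_ID if m else t for t, m in zip(tokens, edit_mask)]
    for step in range(steps):
        masked = [i for i, t in enumerate(canvas) if t == MASK_ID]
        if not masked:
            break
        proposals = {i: predict(canvas, i) for i in masked}
        # Commit the most confident proposals first; the last step commits all.
        n_commit = math.ceil(len(masked) / (steps - step))
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            canvas[i] = proposals[i][0]
    return canvas


def toy_predict(canvas, i):
    """Toy predictor: deterministic token and confidence per position."""
    return 100 + i, 1.0 / (1 + i)


original = [5, 6, 7, 8, 9]
mask = [False, True, True, False, True]
edited = regenerate_masked(original, mask, toy_predict)
```

Positions outside the mask are copied verbatim by construction, which is the property the entry describes as avoiding edit leakage.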

Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models 🆕NEW
- 赛道归属: 文生图（扩散/Flow-Matching）/ 强化学习后训练（可扩展对齐）
- 核心创新点: 提出Reinforce Adjoint Matching，将RL后训练改写为与扩散/flow-matching预训练同构的“回归式”目标：通过伴随(Adjoint)匹配把奖励信号注入到可解析/可回归的训练靶中，避免昂贵的SDE rollout、显式reward梯度或不稳定的替代损失；从而在保持原有可扩展训练结构的同时，实现大规模偏好对齐与可控提升。
- Track: Text-to-Image (diffusion/flow-matching) / scalable RL post-training
- Core innovation: Introduces Reinforce Adjoint Matching, reformulating RL post-training into a regression-like objective structurally aligned with diffusion/flow-matching pretraining; uses adjoint matching to inject reward into tractable regression targets, avoiding costly SDE rollouts, explicit reward gradients, or unstable surrogate losses—enabling scalable preference alignment while preserving the pretraining-friendly training structure.

Qwen-Image-2.0 Technical Report 🆕NEW
- 赛道归属: 统一文生图与图像编辑（基础模型/多模态系统工程）
- 核心创新点: 提出单框架统一“高保真生成+精确编辑”的全能图像基础模型，通过将强视觉语言理解/指令跟随能力（Qwen3-VL）与生成模型耦合，强化复杂构图、文本密集场景下的可控生成与编辑一致性；面向超长文本渲染、多语言排版与高分辨率写实等痛点，给出系统级训练配方与部署优化，使模型在可用性（指令遵循、编辑精度）与工程效率（推理/部署）上同时提升。
- Track: Unified text-to-image + image editing (foundation model / multimodal system)
- Core innovation: Presents an omni-capable foundation model unifying high-fidelity generation and precise editing in one framework by coupling strong vision-language understanding/instruction following (Qwen3-VL) with the generator; targets hard cases like ultra-long text rendering, multilingual typography, text-heavy compositional scenes, and high-res photorealism, and provides system-level training and deployment optimizations to improve controllability, edit fidelity, and serving efficiency.

LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency 🆕NEW
- 赛道归属: 图像编辑（分层/Layered资产编辑与结构一致性）
- 核心创新点: 面向真实创作流程中的分层图像资产，提出LimeCross：在“上下文条件化”的分层表示上进行编辑，显式建模层间结构关系与接触/遮挡/光照一致性；通过跨层约束与结构一致性机制，避免传统“先压平再编辑再分解”导致的层间不一致与重组伪影，实现可重组、非破坏式的可控分层编辑。
- Track: Image editing (layered assets / structural consistency)
- Core innovation: Proposes LimeCross for context-conditioned layered image editing, explicitly modeling inter-layer structure and enforcing consistency (e.g., contact, occlusion, illumination) across layers; avoids the common flatten-edit-redecompose pipeline that breaks layer coherence, enabling controllable, non-destructive edits that remain recomposable with fewer cross-layer artifacts.

Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models 🆕NEW
- 赛道归属: 文生图安全（概念擦除/模型去偏）/ 扩散模型机制改造
- 核心创新点: 提出Empty SPACE：利用“跨注意力稀疏化”实现闭式(无需反传)概念擦除，在SDXL等大模型上保持擦除强度；核心在于定位并稀疏化与目标概念强相关的cross-attention通路，使概念触发在注意力层面被系统性削弱，同时尽量保持非目标概念与整体生成质量，解决闭式擦除在大架构上失效的问题。
- Track: T2I safety (concept erasure / debiasing) for diffusion models
- Core innovation: Introduces Empty SPACE, a closed-form (no backprop) concept erasure method via cross-attention sparsity that remains effective when scaling to large models like SDXL; identifies and sparsifies cross-attention pathways strongly tied to the target concept to suppress its activation while preserving non-target capabilities and overall image quality, addressing the degradation of prior closed-form erasure at scale.

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers 🆕NEW
- 赛道归属: 文生图安全（风险概念检测与抑制）/ DiT架构防护
- 核心创新点: 针对Diffusion Transformer而非U-Net，提出“模型内”风险概念的可解释检测与抑制框架：先探测模型表征/注意力中隐含的风险概念通路或子空间，再在生成过程中对这些通路进行定向抑制（而非仅靠外部过滤器）；从而把安全机制迁移到DiT主流架构上，并提升对性/暴力/版权等风险内容的覆盖与可控性。
- Track: T2I safety (risky concept detection & suppression) for Diffusion Transformers
- Core innovation: Develops an in-model, interpretable pipeline tailored to Diffusion Transformers: first detects latent pathways/subspaces (e.g., in representations/attention) associated with risky concepts, then applies targeted suppression during generation rather than relying solely on external filters; brings safety mechanisms to DiT-era architectures with improved controllability over sexual/violent/copyright-related content.

A Real-Calibrated Synthetic-First Data Engine 🆕NEW
- 赛道归属: 数据引擎/合成数据生成与校准（Synthetic data for vision）
- 核心创新点: 提出Real-Calibrated Synthetic-First数据引擎：以可控扩散生成作为“合成优先”数据来源，但引入真实数据校准与闭环反馈，解决纯合成增强带来的数据集级质量漂移与收益不稳定；通过模块化管线对合成数据分布、质量与任务指标进行对齐（real-calibration），实现可持续迭代的数据生成-筛选-评估闭环，提高数据稀缺场景下的稳定增益。
- Track: Data engine / synthetic data generation & calibration for vision
- Core innovation: Proposes a Real-Calibrated Synthetic-First data engine that uses controllable diffusion for scalable synthetic data but adds real-data calibration and closed-loop feedback to prevent dataset-level drift and unstable gains; a modular generate-filter-evaluate loop aligns synthetic distribution/quality with real-world targets, delivering more reliable improvements in data-scarce regimes.

When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation 🆕NEW
- 赛道归属: 个性化/身份保持生成（推理加速与部署优化）
- 核心创新点: 提出训练无关的身份保持加速：证明身份适配器InfuseNet可从多步FLUX骨干直接迁移到蒸馏的schnell骨干而无需重训；通过“替换骨干路径+关闭CFG”两处极简改动，在保持/提升身份一致性的同时显著减少采样步数与延迟（5.9×），揭示身份条件在蒸馏模型上的可迁移性与低成本部署路径。
- Track: Personalized / identity-preserved generation (inference acceleration & deployment)
- Core innovation: Shows a training-free acceleration for identity-preserved generation: a frozen InfuseNet identity adapter transfers from a multi-step FLUX backbone to a distilled schnell backbone without retraining; with a minimal two-line change (swap backbone path, disable CFG), it cuts latency by ~5.9× while maintaining/improving identity fidelity, demonstrating strong adapter transferability to distilled backbones.

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs 🆕NEW
- 赛道归属: 文生图对齐（离线偏好优化）/ Rectified Flow
- 核心创新点: 提出面向Rectified Flow的离线偏好优化：构建“噪声可追踪的成对样本”(noise-tracked pairs)，在数据中显式保存与赢家/输家对应的同一先验噪声索引，使偏好学习与RF近直线的真实反向轨迹一致；避免扩散DPO常用的独立前向加噪来估计轨迹所带来的动力学失配与偏差，从而更稳定、有效地对RF模型进行离线对齐。
- Track: T2I alignment (offline preference optimization) for Rectified Flow
- Core innovation: Proposes offline preference optimization tailored to Rectified Flow by using noise-tracked winner/loser pairs that preserve the underlying prior noise identity; this aligns preference learning with RF’s true near-straight reverse trajectory and avoids the dynamics mismatch/bias introduced when estimating trajectories via an independent forward noising process (common in diffusion DPO), yielding more stable and effective offline alignment for RF models.

GitHub

[2026-05-13] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐11956

🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...

[2026-05-13] Light-Heart-Labs/DreamServer ⭐523 🆕NEW

Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.

[2026-05-13] vibheksoni/free-ai ⭐423

Free OpenAI-compatible AI API with 16,000+ models, image generation, tool calling, and Discord key signup.

[2026-05-12] etkecc/baibot ⭐224

🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...

[2026-05-12] facebookresearch/wmar ⭐63 🆕NEW

Official implementation of the paper "Watermarking Autoregressive Image Generation" (NeurIPS'25)

HuggingFace Models

HiDream-ai/HiDream-O1-Image

SeeSee21/Z-Anime

sensenova/SenseNova-U1-8B-MoT

HiDream-ai/HiDream-O1-Image-Dev

视频生成/编辑 / Video Generation/Editing

arXiv

PhyGround: Benchmarking Physical Reasoning in Generative World Models 🆕NEW
- 赛道归属: 视频生成评测 / 物理推理基准
- 核心创新点: 提出面向“生成式世界模型物理一致性”的细粒度评测基准，将物理规律遵循从粗粒度主观打分拆解为可定位的“定律级失败”检测；通过更结构化的评估协议降低人评的响应偏差与疲劳带来的噪声，使模型在不同物理规则维度（如碰撞、支撑、运动连续性等）的失真可被系统性暴露与对比。
- Track: Video generation evaluation / physical reasoning benchmark
- Key innovation: Introduces a fine-grained benchmark for physical consistency in generative world models, decomposing evaluation into law-specific failure detection rather than coarse overall ratings; uses a more structured evaluation protocol to reduce human bias/fatigue noise, enabling systematic diagnosis and comparison of violations across different physical-rule dimensions (e.g., collision, support, motion continuity).

AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State 🆕NEW
- 赛道归属: 视频生成 / 长视频生成效率优化（资源分配与一致性）
- 核心创新点: 将长时程音乐视频生成建模为多选背包问题(MCKP)的最优资源分配：先由全局规划器生成“结构化持久状态”（角色实体、场景先验、共享图）作为跨镜头一致性的紧凑载体，再基于对片段饱和度/收益的估计动态分配算力与生成配置，在计算预算受限下最大化整体质量并维持跨shot一致性。
- Track: Video generation / efficient long-video generation (resource allocation & consistency)
- Key innovation: Formulates long-horizon music video generation as an MCKP for optimal compute allocation: a global planner first produces a compact structured persistent state (character entities, scene priors, sharing graphs) to carry cross-shot consistency, then allocates compute/configurations per segment based on estimated saturation/utility to maximize overall quality under a fixed budget while preserving consistency.
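
The multiple-choice knapsack (MCKP) formulation can be illustrated with a small dynamic program. This is a sketch under assumptions, not AllocMV's solver: each segment offers a few hypothetical `(cost, utility)` generation configs, costs are assumed to be small integers, and exactly one config must be chosen per segment within a total budget.

```python
def solve_mckp(segments, budget):
    """Pick one (cost, utility) option per segment to maximize total utility.

    segments: list of option lists, each option is (int cost, float utility).
    Returns (best_utility, chosen_option_indices), or None if infeasible.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (budget + 1)  # best[b]: max utility at exact cost b
    best[0] = 0.0
    picks = []  # picks[s][b]: (option index, previous cost) for segment s
    for options in segments:
        new_best = [neg_inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg_inf:
                continue
            for j, (cost, utility) in enumerate(options):
                nb = b + cost
                if nb <= budget and best[b] + utility > new_best[nb]:
                    new_best[nb] = best[b] + utility
                    pick[nb] = (j, b)
        best = new_best
        picks.append(pick)
    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        return None
    total, choices, b = best[end], [], end
    for pick in reversed(picks):  # walk back through the stored decisions
        j, b = pick[b]
        choices.append(j)
    choices.reverse()
    return total, choices


# Three shots, each with a cheap and an expensive config (toy numbers).
shots = [
    [(1, 1.0), (3, 2.5)],
    [(1, 0.5), (2, 2.0)],
    [(1, 1.0), (4, 1.2)],
]
result = solve_mckp(shots, budget=6)
```

In this toy case the solver spends extra compute on the first two shots, where the expensive config adds the most utility, and keeps the cheap config for the third, which is close to saturated.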

TIE: Time Interval Encoding for Video Generation over Events 🆕NEW
- 赛道归属: 视频生成 / 时序建模（多事件时间对齐）
- 核心创新点: 针对“并发/重叠事件”的生成需求，提出时间区间编码(TIE)，用区间而非离散时间点来表示事件的起止与重叠关系，缓解DiT等点位位置编码在多事件并行时的表达瓶颈；使模型能在同一时间轴上对多个事件进行可控的时序落位与持续时间建模，从而支持导演式多事件提示与交互式代理场景。
- Track: Video generation / temporal modeling (multi-event temporal grounding)
- Key innovation: Proposes Time Interval Encoding (TIE) to represent events as intervals (start–end) instead of discrete time points, addressing the representational bottleneck of point-wise positional encodings (e.g., in DiT) under overlapping events; enables controllable placement and duration modeling of concurrent events on a shared timeline for director-style prompting and interactive agent scenarios.
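
As a rough illustration of representing events by intervals rather than points, the sketch below encodes an event by sinusoidal features of both its start and end time. This is an assumed, generic construction: the digest does not specify TIE's actual encoding, and `sinusoid`, `interval_encoding`, and `overlap` are hypothetical helper names.

```python
import math


def sinusoid(t, dim=4, base=10000.0):
    """Standard sinusoidal features for a scalar time t (dim must be even)."""
    feats = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        feats += [math.sin(t * freq), math.cos(t * freq)]
    return feats


def interval_encoding(start, end, dim=4):
    """Hypothetical interval code: concatenate encodings of start and end."""
    if end < start:
        raise ValueError("end must not precede start")
    return sinusoid(start, dim) + sinusoid(end, dim)


def overlap(a, b):
    """Length of the overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


# Two concurrent events on a shared timeline (seconds, toy values).
wave = (0.0, 3.0)
speak = (2.0, 5.0)
codes = [interval_encoding(*wave), interval_encoding(*speak)]
shared = overlap(wave, speak)
```

Because each code carries both endpoints, duration and overlap between events remain recoverable, which a single time-point encoding cannot express.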

Improving Human Image Animation via Semantic Representation Alignment 🆕NEW
- 赛道归属: 图生视频 / 人体动画（语义条件对齐）
- 核心创新点: 提出“语义表征对齐”框架来提升人体图像动画的稳定性：在引入dense pose、ID等人体语义条件的同时，通过对齐机制缓解语义条件与生成表征之间的分布/尺度不匹配，减少长视频与大幅动作下的肢体扭曲、面部畸变；并试图降低强条件带来的可生成性/灵活性损失，实现更稳与更自由的折中。
- Track: Image-to-video / human animation (semantic conditioning alignment)
- Key innovation: Introduces a semantic representation alignment framework for more stable human image animation: while leveraging human-specific conditions (dense pose, identity embeddings), an alignment mechanism mitigates distribution/scale mismatch between semantic conditions and generative representations, reducing limb twisting and facial distortion in long or intense-motion videos, and alleviating the rigidity typically introduced by strong conditioning.

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors 🆕NEW
- 赛道归属: 视频生成评测 / 世界状态预测与因果推理基准
- 核心创新点: 将视频生成评估重构为“未来世界状态预测”：给定初始状态与动作，要求生成的视频在状态演化上与人类预期对齐；通过压力测试式任务设计直接检验模型对动作后果、时序因果与状态一致性的推理能力，而非仅评估画质或文本对齐，从而更贴近“世界模拟器”能力的核心指标。
- Track: Video generation evaluation / world-state prediction & causal reasoning benchmark
- Key innovation: Reframes evaluation as future world-state prediction: given an initial state and an action, the generated video must match human-expected state evolution; uses stress-test task design to directly probe reasoning about action consequences, temporal causality, and state consistency rather than primarily visual fidelity or text alignment—targeting the core “world simulator” capability.

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation 🆕NEW
- 赛道归属: 视频生成 / 可控多人物交互（训练免控制）
- 核心创新点: 提出训练免(Training-Free)的多人物社交交互控制方法，将“谁在何时对谁做什么”的交互结构显式注入生成过程，解决多人物生成中常见的角色错配与动作归因错误；通过对交互关系与时序的可控编排，实现对对话、手势、协同行为等社会互动的细粒度导演式控制，而无需重新训练基础视频模型。
- Track: Video generation / controllable multi-person interactions (training-free control)
- Key innovation: Presents a training-free control method for multi-person social interactions, explicitly injecting interaction structure—who does what, when, and toward whom—into the generation process to reduce actor/action misbinding; enables fine-grained director-style control over conversations, gestures, and coordinated behaviors without retraining the base video model.

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models 🆕NEW
- 赛道归属: 推理优化 / 自回归视频扩散的KV缓存压缩
- 核心创新点: 面向流式自回归视频扩散的注意力冗余问题，提出混合式KV Cache压缩(Forcing-KV)：利用历史帧KV高度重复的结构，对不同时间尺度/重要性采用差异化压缩策略，在尽量保持生成质量的前提下显著降低显存占用与注意力计算复杂度，从而提升长视频AR扩散的可扩展性与实时性。
- Track: Inference optimization / KV-cache compression for autoregressive video diffusion
- Key innovation: Proposes Forcing-KV, a hybrid KV-cache compression scheme for streaming autoregressive video diffusion, exploiting high redundancy of historical-frame KV states; applies differentiated compression across time/importance to substantially reduce memory and attention compute while preserving quality, improving scalability and real-time long-video generation.
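
A toy sketch of time-differentiated cache compression is shown below. It is a deliberately simple assumption (keep recent frames in full, subsample older ones by a fixed stride) and not the Forcing-KV algorithm; `compress_kv`, `recent`, and `stride` are hypothetical names.

```python
def compress_kv(frames, recent=2, stride=2):
    """Keep the last `recent` frame entries intact and every `stride`-th older one.

    frames: per-frame KV entries ordered oldest first (any Python objects).
    """
    if recent < 0 or stride < 1:
        raise ValueError("recent must be >= 0 and stride >= 1")
    cut = max(0, len(frames) - recent)
    history, tail = frames[:cut], frames[cut:]
    return history[::stride] + tail


# Ten cached frames, identified here by their index.
cache = list(range(10))
compressed = compress_kv(cache, recent=3, stride=3)
```

Here the cache shrinks from ten entries to six while the three most recent frames stay untouched; an importance-aware scheme would replace the fixed stride with a learned or measured score.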

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation 🆕NEW
- 赛道归属: 推理优化 / 流式长视频生成的自适应记忆管理
- 核心创新点: 提出SWIFT的“提示词自适应记忆”(Prompt-Adaptive Memory)：针对交互式长视频中频繁语义切换，设计能随prompt更新而重组/选择性保留的记忆机制，避免在提示边界反复重建缓存或受限于固定记忆预算造成的冗余计算与适配迟滞；在保持视觉连续性的同时提升语义切换响应效率。
- Track: Inference optimization / adaptive memory for streaming long-video generation
- Key innovation: Introduces SWIFT with prompt-adaptive memory: for interactive long videos with frequent semantic switches, it reorganizes/selectively retains memory in response to prompt updates, avoiding cache rebuilds at prompt boundaries and inefficiencies of fixed memory budgets; improves responsiveness to semantic changes while maintaining visual continuity.

EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation 🆕NEW
- 赛道归属: 视频生成 / 多镜头脚本化生成（教育内容一致性）
- 核心创新点: 提出面向STEM教学的多镜头生成统一框架：引入“教学状态建模”跟踪跨镜头的持久知识与概念依赖，并用脚本引导的结构化控制组织叙事与镜头编排，解决长视频中知识一致性、讲解连贯性与多镜头衔接问题；将“内容正确性/教学一致性”作为生成过程的核心约束而非事后筛选。
- Track: Video generation / multi-shot script-driven generation (educational consistency)
- Key innovation: Proposes a unified framework for multi-shot STEM instructional video generation: models a pedagogical state to track persistent knowledge and concept dependencies across shots, and uses script-guided structured control to organize narrative and shot composition; addresses knowledge consistency and pedagogical coherence as first-class generation constraints rather than post-hoc filtering.

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models 🆕NEW
- 赛道归属: 多模态推理 / 结合视频生成的协同推理框架
- 核心创新点: 提出VLM+视频生成模型的协同推理(CollabVR)：用VLM承担显式规划、校验与纠错，将VGM生成的短时“Chain-of-Frames”作为可视化推理草稿；通过迭代式的生成—评估—修正闭环，缓解长任务的时序漂移与中段模拟错误累积，把视频生成从单纯输出器提升为可被语言推理约束与修正的“可视化思维工具”。
- Track: Multimodal reasoning / collaborative reasoning with video generation
- Key innovation: Proposes CollabVR, a VLM+VGM collaborative reasoning framework: the VLM performs explicit planning, verification, and correction while the VGM produces short-horizon Chain-of-Frames as visual reasoning drafts; an iterative generate–evaluate–revise loop mitigates long-horizon drift and mid-clip simulation error accumulation, turning video generation into a language-guided, correctable visual thinking tool.

GitHub

[2026-05-12] Anil-matcha/Open-Generative-AI ⭐13018

Open-source alternative to AI video platforms — Free AI image & video generation studio with 200+ models (Flux, Midjourney, Kling, Sora, Veo). No cont...

[2026-05-13] hao-ai-lab/FastVideo ⭐3469

A unified inference and post-training framework for accelerated video generation.

[2026-05-12] ModelTC/LightX2V ⭐2258

Light Image Video Generation Inference Framework

[2026-05-12] ZeroLu/awesome-seedance ⭐1717

The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...

[2026-05-12] YouMind-OpenLab/awesome-seedance-2-prompts ⭐998

🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...

HuggingFace Models

SulphurAI/Sulphur-2-base

音频生成 / Audio Generation

arXiv

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation 🆕NEW
- 赛道归属: 人物中心音视频生成（Audio-Video Generation，多模态联合生成：动作-语音-音效）
- 核心创新点: 提出统一框架在生成阶段显式约束“动作-语音-环境音效”三模态的时序一致性与语义协同，针对三者异质时间尺度与对齐难题，通过跨模态协同建模/对齐机制减少常见的口型-语音、动作-音效错配，实现更连贯的人物中心音视频联合生成。
- Track: Human-centric audio-video generation (multimodal joint generation: motion–speech–sound)
- Key innovation: Introduces a unified generation framework that explicitly enforces temporal alignment and semantic coherence across motion, speech, and environmental sound effects. By addressing heterogeneous temporal dynamics with cross-modal coordination/alignment mechanisms, it reduces typical mismatches (e.g., lip–speech and action–sound desynchronization) and improves coherent human-centric audio-video generation.

[2026-05-07] Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
- 赛道归属: 音频生成评测（Audio Generation Evaluation）/ 最优传输距离度量
- 核心创新点: 提出OTAD以替代/修正FAD的两大结构性缺陷：在“代价项”上学习残差黎曼地面度量适配器（Riemannian ground-metric adapter）以避免冻结嵌入的不变性掩盖伪影；在“耦合项”上用离散OT（带熵正则）替代高斯拟合近似，提升对局部污染与细粒度失真的敏感性，从而得到更可信的生成音频距离度量。
- Track: Audio generation evaluation / Optimal transport metrics
- Key innovation: OTAD fixes FAD by (1) learning a residual Riemannian ground-metric adapter for the OT cost instead of relying on a frozen embedding pullback, and (2) replacing Gaussian coupling with discrete entropic OT, improving sensitivity to artifacts and rank-1/contaminated distortions.
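
The "discrete entropic OT" ingredient can be sketched with plain Sinkhorn iterations. This is a minimal illustration, not OTAD: the cost here is an assumed squared Euclidean distance on toy 1-D "embeddings" rather than a learned Riemannian ground metric, and `sinkhorn` and `squared_cost` are illustrative names.

```python
import math


def squared_cost(xs, ys):
    """Pairwise squared distances between two lists of scalar embeddings."""
    return [[(x - y) ** 2 for y in ys] for x in xs]


def sinkhorn(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized OT between histograms a and b.

    Returns (plan, transport_cost). Column marginals match b after every
    v-update; row marginals match a only once the iterations have converged.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total


reference = [0.0, 1.0]
weights = [0.5, 0.5]
plan_same, d_same = sinkhorn(squared_cost(reference, reference), weights, weights)
plan_shift, d_shift = sinkhorn(squared_cost(reference, [0.5, 1.5]), weights, weights)
```

The shifted set yields a clearly larger distance than the identical one. Because the coupling is computed between individual samples, a few corrupted samples change the plan directly instead of being averaged into a Gaussian fit.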

[2026-05-06] Stage-adaptive audio diffusion modeling
- 赛道归属: 音频生成（扩散模型训练优化 / 自适应训练策略）
- 核心创新点: 提出“阶段自适应（stage-adaptive）”的音频扩散模型训练框架，针对扩散训练中不同阶段（如噪声水平/时间步、不同条件信号）的学习难度与贡献随训练进程变化这一现象，不再采用固定不变的优化配方，而是动态调整训练信号的重要性与采样/加权策略，使模型在训练早期与后期分别聚焦更关键的学习目标，从而在不改变生成范式的前提下提升训练效率与最终生成/复原质量。
- Track: Audio generation (diffusion model training optimization / adaptive training strategy)
- Key innovation: Introduces a stage-adaptive training framework for audio diffusion models. Instead of using static optimization recipes, it dynamically rebalances the importance of training signals (e.g., across noise levels/timesteps and heterogeneous conditioning regimes) as learning progresses, aligning optimization focus with stage-dependent difficulty and utility. This improves training efficiency and final generation/restoration quality without changing the underlying diffusion generation paradigm.

GitHub

[2026-05-12] huggingface/diffusers ⭐33602

🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

[2026-05-11] Lightricks/LTX-2 ⭐6617

Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.

[2026-05-12] apocas/restai ⭐504

RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...

[2026-05-07] Saganaki22/ComfyUI-Woosh ⭐98

Text-to-audio and video-to-audio using Sony AI's Woosh foundation model.

语言大模型 / Large Language Models

arXiv

ELF: Embedded Language Flows 🆕NEW
- 赛道归属: 语言生成建模（连续空间扩散/流模型用于文本）
- 核心创新点: 将语言建模从离散token扩散迁移到连续嵌入空间的“语言流/扩散”框架：在嵌入空间进行连续生成与去噪/流匹配，再通过最小化对离散域的适配（如嵌入-词表映射与训练目标设计）实现有效的连续DLM；关键突破在于证明连续生成范式在文本上可行，并用嵌入空间的动力学建模缓解离散扩散的结构性限制。
- Track: Language generation modeling (continuous diffusion/flow models for text)
- Core innovation: Moves diffusion/flow language modeling from discrete tokens to a continuous embedding space: generation and denoising/flow-matching are performed in the embedding domain, then mapped back to the vocabulary with minimal discrete-domain adaptation (embedding–vocab interface and objective design). The key methodological advance is demonstrating effective continuous DLMs for language and leveraging embedding-space dynamics to bypass limitations of discrete token diffusion.

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning 🆕NEW
- 赛道归属: Agentic RL（技能库/工具使用的生命周期管理与策略学习）
- 核心创新点: 提出“技能生命周期管理”视角：不再假设外部技能只会单调累积或最终被完全内化为零技能推理，而是根据参数容量受限与技能边际贡献不均衡，动态决定技能的引入、保留、淘汰与再利用；方法上强调以任务收益/贡献为依据的技能管理机制，使agent在长期学习中维持可控的技能集合并避免无效技能膨胀或过早遗忘。
- Track: Agentic RL (skill/tool library lifecycle management and policy learning)
- Core innovation: Introduces a “skill lifecycle management” paradigm: instead of assuming skills monotonically accumulate or are fully distilled into the policy (zero-skill inference), it dynamically adds, retains, prunes, and reuses skills based on limited parametric capacity and uneven marginal utility. Methodologically, it centers on contribution/return-driven skill management to keep a compact, effective skill set over long-horizon learning.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation 🆕NEW
- 赛道归属: 智能体评测基准（真实环境/长时程/多模态CLI代理）
- 核心创新点: 构建原生运行时（native-runtime）的长时程代理基准：以真实CLI环境而非合成沙盒/模拟API为载体，提供人类编写的双语、多模态任务，并强调过程执行与最终产物的真实可验证性；方法论突破在于把评测从“短回合+最终答案”提升到“真实工具链+长链路工作流”的端到端能力测量。
- Track: Agent evaluation benchmarks (real-world runtime, long-horizon, multimodal CLI agents)
- Core innovation: Builds a native-runtime, long-horizon benchmark: tasks run in real CLI environments rather than synthetic sandboxes or mock APIs, with human-authored bilingual multimodal tasks and verification grounded in actual execution artifacts. The key advance is shifting evaluation from short-horizon final-answer checks to end-to-end, real-toolchain workflow completion.

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers 🆕NEW
- 赛道归属: LLM安全与对齐（Guardrail分类器的形式化验证）
- 核心创新点: 将“有害行为”验证从离散输入空间转移到分类器的预激活（pre-activation）连续空间，在该空间中定义可验证的鲁棒性/安全属性并给出形式化保证；突破点在于绕开离散token空间中缺乏语义一致的ε-邻域定义问题，使guardrail分类器的安全性从经验红队测试提升到可证明的保证框架。
- Track: LLM safety & alignment (formal verification of guardrail classifiers)
- Core innovation: Shifts verification of “harmful behavior” from discrete input space to the classifier’s continuous pre-activation space, where meaningful, verifiable robustness/safety properties can be defined and proven. The methodological breakthrough is bypassing the lack of semantically meaningful ε-ball notions in token space, upgrading guardrails from empirical red-teaming to formal guarantees.

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking 🆕NEW
- 赛道归属: 多模态理解（LVLM置信度/视觉扎根性评估）
- 核心创新点: 提出BICR（Blind-Image Contrastive Ranking）用于检测“视觉不扎根”的高置信回答：通过对比正常图像与“盲图像/去视觉信息”条件下的模型输出差异，以对比排序方式估计预测是否真正由图像驱动；方法上模型无关、无需改动主模型，通过构造反事实视觉输入来分离语言先验与视觉证据贡献。
- Track: Multimodal understanding (confidence estimation / visual grounding for LVLMs)
- Core innovation: Proposes BICR (Blind-Image Contrastive Ranking) to detect visually ungrounded yet confident answers: it contrasts model behavior under the real image vs a “blind-image” (vision removed) condition and uses contrastive ranking to estimate whether the prediction is image-driven. Methodologically, it is model-agnostic and leverages counterfactual visual inputs to disentangle language priors from visual evidence.

Count Anything at Any Granularity 🆕NEW
- 赛道归属: 视觉计数与开放词汇理解（多粒度目标计数）
- 核心创新点: 将开放世界计数重新定义为“多粒度计数”问题：显式建模用户意图的计数粒度（身份/属性/实例类型/类别/抽象概念等），而非把“要数什么”简化为单一类别匹配；核心突破在于引入粒度可控的表示与推断/匹配机制，使模型能在不同语义层级上稳定对齐计数目标。
- Track: Visual counting & open-vocabulary understanding (multi-granularity counting)
- Core innovation: Reframes open-world counting as a multi-granularity problem: it explicitly models the intended counting granularity (identity, attribute, instance type, category, abstract concept) instead of treating “what to count” as a single category-level matching task. The key advance is granularity-controllable representations and inference/matching that robustly align counting targets across semantic levels.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale 🆕NEW
- 赛道归属: 训练系统与推理优化（推荐模型FP8/低精度训练加速）
- 核心创新点: 面向大规模推荐模型（LRM）提出低精度内核应用框架：针对LRM“小GEMM+归一化+数值敏感+通信密集”的特性，设计更稳健的FP8/低精度算子与训练策略，避免直接FP8导致的精度下降与训练变慢；突破在于把LLM中成熟的低精度收益迁移到LRM并解决其数值与系统瓶颈。
- Track: Training systems & inference optimization (FP8/low-precision acceleration for recommendation models)
- Core innovation: Proposes a low-precision kernel application framework tailored to large recommendation models (LRMs): it addresses LRM-specific issues—small GEMMs followed by normalization, high numerical sensitivity, and communication-heavy training—via more robust FP8/low-precision kernels and training strategies, avoiding quality drops and slowdowns seen with naive FP8. The key advance is transferring low-precision gains from LLMs to LRMs by resolving their distinct numerical/system constraints.

Compute Where it Counts: Self Optimizing Language Models 🆕NEW
- 赛道归属: 推理优化（自适应计算/动态解码预算分配）
- 核心创新点: 提出“自优化语言模型”用于逐token动态分配计算量：不再对每个解码步使用固定计算预算，而是让单一模型学习判断token难度并自适应选择计算强度（如更深计算、更高精度或更多路径），在保证质量的同时减少易token的冗余计算；方法论突破在于把“预算分配策略”内生化到模型解码过程中，实现按需计算。
- Track: Inference optimization (adaptive compute / dynamic per-token decoding budgets)
- Core innovation: Introduces “self-optimizing language models” that allocate compute dynamically per token: instead of a uniform budget each decoding step, a single model learns to estimate token difficulty and adapt computation intensity (e.g., deeper compute, higher precision, or more routes), reducing wasted compute on easy tokens while preserving quality. The key methodological shift is internalizing budget allocation into the decoding process for compute-on-demand.

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD 🆕NEW
- 赛道归属: 多模态代码生成与评测（程序化CAD生成基准）
- 核心创新点: 提供面向工业标准的程序化CAD综合评测：不仅评估外形相似度，还强调3D结构理解、工程参数反推与CAD操作序列的可制造性/可执行性，推动MLLM从“看图说形”走向“生成可运行的参数化建模程序”；方法论突破在于以工业工作流为导向定义任务与指标，检验模型的真实工程建模能力。
- Track: Multimodal code generation & evaluation (programmatic CAD benchmark)
- Core innovation: Delivers an industry-oriented benchmark for programmatic CAD: beyond shape similarity, it evaluates 3D structural understanding, inference of engineering parameters, and generation of executable, manufacturable CAD operation sequences. The key advance is defining tasks/metrics aligned with real industrial workflows, testing whether MLLMs can produce runnable parametric modeling programs rather than superficial descriptions.

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization 🆕NEW
- 赛道归属: 对齐训练（偏好优化/RLHF替代：组级多候选优化）
- 核心创新点: 提出DGPO（Directional Consistent Groupwise Optimization）：从成对偏好扩展到组级多候选比较，聚合组内监督信号并显式建模“方向一致性”（正向/反向问答组织）以兼顾对齐与推理多样性；突破点在于用方向感知的组级目标缓解pairwise方法在一致性与多样性之间的张力，并以轻量框架提升偏好学习效率与稳定性。
- Track: Alignment training (preference optimization beyond pairwise / RLHF alternatives)
- Core innovation: Proposes DGPO (Directional Consistent Groupwise Optimization): it generalizes pairwise preference learning to groupwise multi-candidate comparisons, aggregating supervision at the group level and explicitly modeling direction-aware consistency (via forward/reverse QA organization) to balance alignment with reasoning diversity. The key advance is a direction-aware groupwise objective that mitigates the consistency–diversity tradeoff and improves efficiency/stability with a lightweight framework.

GitHub

[2026-05-13] sgl-project/sglang ⭐27721

SGLang is a high-performance serving framework for large language models and multimodal models.

[2026-05-13] NVIDIA-NeMo/NeMo ⭐17199

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech ...

[2026-05-13] stanford-crfm/helm ⭐2786 🆕NEW

Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Sta...

[2026-05-13] flagos-ai/FlagGems ⭐996

FlagGems is an operator library for large language models implemented in the Triton Language.

[2026-05-13] NVIDIA-NeMo/Skills ⭐950

A project to improve skills of large language models

HuggingFace Datasets

[2026-05-03] iletisim/dezenformasyon-bultenleri

İletişim Başkanlığı Dezenformasyon Bültenleri (Disinformation Bulletins of the Turkish Directorate of Communications)

Source API: llm.iletisim.gov.tr; source bulletins: iletisim.gov.tr/turkce/dezenformasyon-bulten...

HuggingFace Spaces

AdithyaSK/rl-environments-guide

多模态大模型 / Multimodal Models

arXiv

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation 🆕NEW
- 赛道归属: 智能体评测基准（真实环境/长时程/多模态CLI代理）
- 核心创新点: 构建原生运行时（native-runtime）的长时程代理基准：以真实CLI环境而非合成沙盒/模拟API为载体，提供人类编写的双语、多模态任务，并强调过程执行与最终产物的真实可验证性；方法论突破在于把评测从“短回合+最终答案”提升到“真实工具链+长链路工作流”的端到端能力测量。
- Track: Agent evaluation benchmarks (real-world runtime, long-horizon, multimodal CLI agents)
- Core innovation: Builds a native-runtime, long-horizon benchmark: tasks run in real CLI environments rather than synthetic sandboxes or mock APIs, with human-authored bilingual multimodal tasks and verification grounded in actual execution artifacts. The key advance is shifting evaluation from short-horizon final-answer checks to end-to-end, real-toolchain workflow completion.

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking 🆕NEW
- 赛道归属: 多模态理解（LVLM置信度/视觉扎根性评估）
- 核心创新点: 提出BICR（Blind-Image Contrastive Ranking）用于检测“视觉不扎根”的高置信回答：通过对比正常图像与“盲图像/去视觉信息”条件下的模型输出差异，以对比排序方式估计预测是否真正由图像驱动；方法上模型无关、无需改动主模型，通过构造反事实视觉输入来分离语言先验与视觉证据贡献。
- Track: Multimodal understanding (confidence estimation / visual grounding for LVLMs)
- Core innovation: Proposes BICR (Blind-Image Contrastive Ranking) to detect visually ungrounded yet confident answers: it contrasts model behavior under the real image vs a “blind-image” (vision removed) condition and uses contrastive ranking to estimate whether the prediction is image-driven. Methodologically, it is model-agnostic and leverages counterfactual visual inputs to disentangle language priors from visual evidence.

Count Anything at Any Granularity 🆕NEW
- 赛道归属: 视觉计数与开放词汇理解（多粒度目标计数）
- 核心创新点: 将开放世界计数重新定义为“多粒度计数”问题：显式建模用户意图的计数粒度（身份/属性/实例类型/类别/抽象概念等），而非把“要数什么”简化为单一类别匹配；核心突破在于引入粒度可控的表示与推断/匹配机制，使模型能在不同语义层级上稳定对齐计数目标。
- Track: Visual counting & open-vocabulary understanding (multi-granularity counting)
- Core innovation: Reframes open-world counting as a multi-granularity problem: it explicitly models the intended counting granularity (identity, attribute, instance type, category, abstract concept) instead of treating “what to count” as a single category-level matching task. The key advance is granularity-controllable representations and inference/matching that robustly align counting targets across semantic levels.

Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding 🆕NEW
- 赛道归属: 图表理解（Chart Understanding）/ 多模态理解数据效率提升
- 核心创新点: 利用“图表可程序化生成”的特性，引入反事实（counterfactual）样本构造：通过对生成代码做微小可控改动，制造视觉变化小但语义/答案变化大的样本对，迫使VLM学习对关键语义因素的敏感性与判别能力；相较单纯扩增SFT数据规模，该方法以更少数据获得更强的语义鲁棒性与泛化，核心在于把“反事实敏感性”作为训练信号显式注入。
- Track: Chart Understanding / Data-efficient multimodal understanding
- Core innovation: Exploits the programmatic nature of charts to generate counterfactual training pairs: small, controlled edits to the chart-generating code yield samples that differ little visually but substantially in semantics/answer. This explicitly trains VLMs to discriminate the key semantic factors (“counterfactual sensitivity”), improving robustness and generalization with less SFT data than brute-force dataset scaling.
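To make the counterfactual-pair idea concrete, here is a minimal sketch that treats a chart as a plain spec dictionary and skips rendering. The spec layout, the helper names, and the question are all assumptions for illustration; the paper's generation pipeline is not described in this digest.

```python
import copy

def chart_answer(spec: dict) -> str:
    """Answer to 'Which category has the highest value?' for a bar-chart spec."""
    return max(spec["values"], key=spec["values"].get)

def make_counterfactual(spec: dict, category: str, delta: float) -> dict:
    """Return a copy of the spec with one bar nudged by `delta`."""
    edited = copy.deepcopy(spec)
    edited["values"][category] += delta
    return edited

original = {"type": "bar", "values": {"A": 41.0, "B": 40.0, "C": 12.0}}
counterfactual = make_counterfactual(original, "B", 2.0)  # small visual change

# The two charts look almost identical, yet the correct answer flips.
pair = (chart_answer(original), chart_answer(counterfactual))
```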

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection 🆕NEW
- 赛道归属: 工业场景视频多模态理解 / 工业异常检测（Video Anomaly Detection）数据集与基准
- 核心创新点: 提出面向真实连续巡检流程的多视角、多任务工业异常视频数据集与评测基准：用“连续视频 + 多视角”覆盖静态图像/稀疏视角数据难以表达的时序线索与跨视角一致性问题，并以多任务设置统一评估异常检测与理解能力（如定位/分类/描述等任务组合），推动模型从单点缺陷识别走向面向流程的时空理解。
- Track: Industrial video understanding / Video anomaly detection (dataset & benchmark)
- Core innovation: Introduces the first continuous multi-view industrial inspection video dataset with a multi-task benchmark, capturing temporal dynamics and cross-view consistency absent in prior static/sparse-view datasets. The key contribution is a unified multi-task evaluation setting (e.g., combinations of localization, classification, and description) that stresses end-to-end spatiotemporal anomaly understanding beyond single-image defect detection.

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization 🆕NEW
- 赛道归属: 多模态安全 / VLM越狱攻击（Jailbreak）与鲁棒性评估
- 核心创新点: 在严格“非定向（untargeted）”威胁模型下重审跨模型可迁移的图像越狱：提出以“熵最大化”为目标的通用扰动生成思路，基于观察到拒答行为在自回归解码中集中于高熵token区域，通过提升解码不确定性来打破安全拒答机制，而不依赖固定前缀或特定输出模式，从而提升通用越狱在不同VLM间的迁移潜力。
- Track: Multimodal security / VLM jailbreak attacks & robustness
- Core innovation: Revisits transferable image jailbreaks under a strictly untargeted threat model and proposes entropy maximization as the attack objective. Motivated by the finding that refusals concentrate around high-entropy tokens during decoding, the method increases decoding uncertainty to bypass refusal without enforcing fixed prefixes or response templates, improving cross-model transfer potential.

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs 🆕NEW
- 赛道归属: 长视频多模态理解 / 测试时自适应计算（Test-time compute）与帧选择
- 核心创新点: 提出GridProbe：通过“后验探测（posterior probing）”在测试时自适应分配计算量，避免对上千帧做一次性高成本前向；不同于依赖对比预训练相似度的训练无关帧选择，GridProbe直接利用模型对候选帧网格/片段的输出后验信号来评估其对当前问题（尤其是否定、计数、跨帧推理、整体总结等推理型查询）的贡献，并据此动态扩展/收缩计算与选帧范围，实现更强推理相关性与更低总体开销。
- Track: Long-video VLMs / Adaptive test-time compute & frame selection
- Core innovation: GridProbe performs test-time posterior probing to adaptively allocate compute, avoiding a monolithic quadratic-attention pass over thousands of frames. Unlike training-free selectors based on contrastive-pretraining similarity in encoder space, which are often weak for reasoning-heavy queries (negation, counting, cross-frame reasoning, holistic summarization), it uses the model's posterior signals over a grid of candidate segments to estimate query-specific utility and dynamically expand or shrink selection and compute, improving reasoning relevance at lower cost.
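The coarse-to-fine flavor of this kind of adaptive probing can be sketched as below. This is only an illustration under stated assumptions: `probe` stands in for a cheap model call that returns a query-specific confidence for a segment, and the split-the-best-segment rule, `budget`, and `min_len` are hypothetical choices, not GridProbe's algorithm.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # [start_frame, end_frame)

def adaptive_select(num_frames: int, probe: Callable[[Segment], float],
                    grid: int = 4, budget: int = 8, min_len: int = 16) -> List[Segment]:
    """Probe a coarse grid, then repeatedly split the highest-scoring segment.

    Assumes num_frames is divisible by grid; stops when the probe budget is spent.
    """
    step = num_frames // grid
    cells = [(i * step, (i + 1) * step) for i in range(grid)]
    scored = [(probe(seg), seg) for seg in cells]
    calls = grid
    while calls + 2 <= budget:
        scored.sort(reverse=True)
        _, (start, end) = scored[0]
        if end - start <= min_len:
            break
        mid = (start + end) // 2
        scored[0:1] = [(probe((start, mid)), (start, mid)),
                       (probe((mid, end)), (mid, end))]
        calls += 2
    scored.sort(reverse=True)
    return [seg for _, seg in scored[:2]]

def toy_probe(seg: Segment) -> float:
    """Stand-in posterior signal: fraction of the segment covering frames 300..339."""
    start, end = seg
    overlap = max(0, min(end, 340) - max(start, 300))
    return overlap / (end - start)

selected = adaptive_select(1024, toy_probe)
```

With the toy probe, eight probe calls narrow 1024 frames down to the two 64-frame segments around the relevant span.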

TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection 🆕NEW
- 赛道归属: 多模态OOD检测 / 测试时学习（Test-time adaptation）
- 核心创新点: 提出TINS：在测试时扩展“负语义（negative semantics）”以覆盖不断变化的OOD概念，同时通过“ID原型分离（ID-prototype-separated）”机制抑制从潜在OOD样本学习时引入的ID污染；核心在于将ID语义用原型/锚点显式建模并与负语义学习解耦，使得负标签集合可在推理阶段安全扩展，从而提升开放环境下的OOD检出能力与稳定性。
- Track: Multimodal OOD detection / Test-time adaptation
- Core innovation: TINS expands negative semantics at test time to better cover evolving OOD concepts, while preventing ID contamination via an ID-prototype-separated mechanism. By explicitly anchoring ID semantics with prototypes and decoupling them from negative-semantics learning, it enables safer on-the-fly negative expansion and improves OOD detection robustness in open-world settings.
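One way to picture the ID-prototype gating is the toy sketch below, in which a test feature joins the negative set only if it is far from every ID prototype. The cosine threshold `max_id_sim`, the score definition, and all function names are assumptions made for illustration; TINS itself may differ substantially.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def maybe_add_negative(feature, id_protos, negatives, max_id_sim=0.5):
    """Add `feature` as a negative only if it is far from every ID prototype."""
    if max(cosine(feature, p) for p in id_protos) < max_id_sim:
        negatives.append(feature)
        return True
    return False

def ood_score(feature, id_protos, negatives):
    """Higher means more OOD: closeness to negatives minus closeness to ID prototypes."""
    neg = max((cosine(feature, n) for n in negatives), default=0.0)
    return neg - max(cosine(feature, p) for p in id_protos)

id_protos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
negatives = []
added_far = maybe_add_negative([0.1, 0.1, 1.0], id_protos, negatives)   # far from ID
added_near = maybe_add_negative([0.9, 0.1, 0.1], id_protos, negatives)  # ID-like, rejected
```

The gate is what keeps ID-like test samples from contaminating the negative set as it grows.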

C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving 🆕NEW
- 赛道归属: 自动驾驶多模态决策 / 安全推理与因果反思（VLM-based planning）
- 核心创新点: 提出C-CoT（反事实链式思维）：在交叉口等高风险场景中，引入“反事实生成—风险归因—决策修正”的推理流程，让VLM不仅描述当前观测，还能构造关键参与者/事件的反事实变化并评估其对风险与可行动作的因果影响；通过将反事实推理嵌入CoT，增强对罕见危险情形的反思能力与安全裕度，提升规划决策的可靠性与可解释性。
- Track: Autonomous driving multimodal planning / Safety reasoning & causal reflection
- Core innovation: C-CoT integrates counterfactual chain-of-thought into VLM-based driving decision-making: the model generates counterfactual scene variations, attributes risk causally, and revises actions accordingly. Embedding counterfactual reasoning into CoT improves reflective safety reasoning in rare, high-risk intersection scenarios, yielding more reliable and interpretable planning.

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models 🆕NEW
- 赛道归属: 视觉语言模型压缩 / 知识蒸馏与高效部署
- 核心创新点: 提出自底向上的级联知识蒸馏（Bottom-Up Cascaded KD）：针对Teacher-Student容量差距过大导致蒸馏困难的问题，采用分阶段/级联的蒸馏路径，从底层表征到高层对齐逐步迁移知识，缓解一次性对齐带来的优化不稳定与信息丢失；在保持VQA等多任务能力的同时显著降低学生模型的显存与计算需求，提升可部署性。
- Track: VLM compression / Knowledge distillation for efficient deployment
- Core innovation: LLaVA-CKD proposes bottom-up cascaded knowledge distillation to bridge large teacher–small student capacity gaps. By distilling in stages (progressively from lower-level representations to higher-level alignment/behavior), it stabilizes optimization and reduces information loss compared to one-shot KD, retaining VLM capabilities (e.g., VQA) with substantially lower memory/compute for deployment.

GitHub

[2026-05-12] Blaizzy/mlx-vlm ⭐4706

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

[2026-05-11] waybarrios/vllm-mlx ⭐1153

OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...

[2026-05-08] zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs-CLIP ⭐772

A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.

[2026-05-10] dongyangli-del/EEG_Image_decode ⭐203

Using vision-language models to decode natural image perception from non-invasive brain recordings.

[2026-05-12] ydyhello/Awesome-VLM-Streaming-Video ⭐155

📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.

强化学习 / Reinforcement Learning

arXiv

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping 🆕NEW
- 赛道归属: 文生图（扩散模型）/ 强化学习后训练（RLHF/GRPO）
- 核心创新点: 指出GRPO类后训练中“归一化”会导致优势/奖励失配，从而诱发reward hacking；提出“超线性优势塑形”(super-linear advantage shaping) 的后训练策略，通过对优势函数进行非线性重标定来放大高质量样本的学习信号、抑制利用奖励偏置的投机解，并避免直接移除prompt相关项带来的校准问题，从机制上提升对齐增益的真实性与稳定性。
- Track: Text-to-Image (diffusion) / RL post-training (RLHF/GRPO)
- Core innovation: Identifies that normalization in GRPO-style post-training can miscalibrate advantages/rewards and trigger reward hacking; introduces super-linear advantage shaping to nonlinearly rescale advantages—amplifying learning from genuinely good samples while suppressing exploitative reward-bias shortcuts—without bluntly dropping prompt-related terms, improving alignment stability and real quality gains.
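For intuition, the sketch below contrasts a GRPO-style group-normalized advantage with one possible super-linear reshaping. The power form `sign(a) * |a|**p` and the exponent `p = 2` are assumptions chosen for illustration; the digest does not state the paper's actual shaping function.

```python
import statistics

def group_normalized(rewards):
    """GRPO-style advantage: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero for flat groups
    return [(r - mean) / std for r in rewards]

def power_shape(advantages, p=2.0):
    """Hypothetical super-linear shaping: sign(a) * |a|**p with p > 1."""
    return [(1 if a >= 0 else -1) * abs(a) ** p for a in advantages]

rewards = [0.50, 0.52, 0.51, 0.90]  # one clearly better sample in the group
adv = group_normalized(rewards)
shaped = power_shape(adv)
```

Under this toy form, advantages with magnitude above 1 are amplified and those below 1 are shrunk, so the standout sample dominates the update.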

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning 🆕NEW
- 赛道归属: Agentic RL（技能库/工具使用的生命周期管理与策略学习）
- 核心创新点: 提出“技能生命周期管理”视角：不再假设外部技能只会单调累积或最终被完全内化为零技能推理，而是根据参数容量受限与技能边际贡献不均衡，动态决定技能的引入、保留、淘汰与再利用；方法上强调以任务收益/贡献为依据的技能管理机制，使agent在长期学习中维持可控的技能集合并避免无效技能膨胀或过早遗忘。
- Track: Agentic RL (skill/tool library lifecycle management and policy learning)
- Core innovation: Introduces a “skill lifecycle management” paradigm: instead of assuming skills monotonically accumulate or are fully distilled into the policy (zero-skill inference), it dynamically adds, retains, prunes, and reuses skills based on limited parametric capacity and uneven marginal utility. Methodologically, it centers on contribution/return-driven skill management to keep a compact, effective skill set over long-horizon learning.
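A minimal sketch of contribution-driven skill management under a capacity limit follows. The class name `SkillLibrary`, the moving-average utility, and the prune rule are hypothetical and only illustrate the add/retain/prune cycle described above.

```python
from dataclasses import dataclass, field

@dataclass
class SkillLibrary:
    capacity: int
    decay: float = 0.9
    utility: dict = field(default_factory=dict)  # skill name -> running utility

    def record(self, skill: str, task_gain: float) -> None:
        """Update a skill's exponential moving average of task gain."""
        old = self.utility.get(skill, 0.0)
        self.utility[skill] = self.decay * old + (1 - self.decay) * task_gain

    def prune(self) -> list:
        """Drop the lowest-utility skills until the library fits its capacity."""
        ranked = sorted(self.utility, key=self.utility.get, reverse=True)
        dropped = ranked[self.capacity:]
        for skill in dropped:
            del self.utility[skill]
        return dropped

lib = SkillLibrary(capacity=2)
lib.record("search", 1.0)
lib.record("calc", 0.2)
lib.record("legacy_parser", 0.0)
dropped = lib.prune()
```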

Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis 🆕NEW
- 赛道归属: 强化学习 + 量子电路综合（对称/等变表示学习）
- 核心创新点: 将Clifford量子电路综合表述为RL序列决策：智能体通过选择基本Clifford门，将Clifford电路的辛矩阵（symplectic matrix）表示逐步化简到单位阵；提出基于“从单位阵出发的随机游走”的课程学习生成训练分布，稳定覆盖不同难度实例；引入利用问题结构对称性的等变神经网络架构，使策略/价值网络对相关群作用保持等变，从而提升泛化与样本效率，减少对特定表示/排列的过拟合。
- Track: Reinforcement Learning + Quantum circuit synthesis (symmetry/equivariant representation learning)
- Core innovation: Formulates Clifford circuit synthesis as an RL problem where the agent selects elementary Clifford gates to reduce a symplectic-matrix representation to the identity; introduces a simple curriculum via random walks starting from the identity to generate progressively harder training instances; proposes an equivariant neural architecture that respects the problem’s symmetries (group actions), improving generalization and sample efficiency by avoiding dependence on arbitrary representations/permutations.
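The random-walk curriculum can be sketched with binary symplectic matrices as below. Each row is a Pauli in (x | z) form, gates act as column operations over GF(2), and phases are ignored. The gate set {H, S, CNOT} and the use of walk depth as a difficulty proxy are assumptions for illustration, not the paper's exact setup.

```python
import random

def identity(n):
    """2n x 2n identity over GF(2); columns 0..n-1 are X parts, n..2n-1 are Z parts."""
    return [[int(i == j) for j in range(2 * n)] for i in range(2 * n)]

def apply_gate(m, gate, n):
    """Apply a gate as an in-place column operation (phases are ignored)."""
    name, a, b = gate
    for row in m:
        if name == "H":        # swap the X and Z columns of qubit a
            row[a], row[n + a] = row[n + a], row[a]
        elif name == "S":      # Z_a ^= X_a
            row[n + a] ^= row[a]
        elif name == "CNOT":   # X_b ^= X_a and Z_a ^= Z_b (control a, target b)
            row[b] ^= row[a]
            row[n + a] ^= row[n + b]

def random_gate(n, rng):
    """Pick a random gate; requires n >= 2 so CNOT has a distinct target."""
    name = rng.choice(["H", "S", "CNOT"])
    a = rng.randrange(n)
    b = rng.choice([q for q in range(n) if q != a]) if name == "CNOT" else a
    return (name, a, b)

def sample_instance(n, depth, rng):
    """Random walk of `depth` gates from the identity; deeper walks tend to be harder."""
    m, gates = identity(n), []
    for _ in range(depth):
        g = random_gate(n, rng)
        apply_gate(m, g, n)
        gates.append(g)
    return m, gates

rng = random.Random(0)
easy, easy_gates = sample_instance(3, 2, rng)
hard, hard_gates = sample_instance(3, 12, rng)
```

Because each of these gates is its own inverse modulo phases, replaying the walk in reverse always gives one valid (not necessarily shortest) synthesis, so every sampled instance is solvable.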

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards 🆕NEW
- 赛道归属: 强化学习 + Meta-RL（面向LLM研究代理的非可验证奖励/规则驱动学习）
- 核心创新点: 面向“深度研究代理”这类缺乏标准答案、轨迹长且工具调用复杂的场景，提出用“Rubric（评分量规/评价维度）”不仅做终局打分，而是作为共享接口来分解策略：将整体策略拆成与不同rubric维度对齐的子策略/技能，并在元学习框架下实现跨任务复用；通过rubric引导的经验抽取与重用，把历史尝试转化为可迁移的训练信号，突破仅依赖可验证奖励或单次偏好反馈的后训练范式。
- Track: Reinforcement Learning + Meta-RL (LLM research agents under non-verifiable rewards / rubric-driven learning)
- Core innovation: Targets deep research agents where rewards are not verifiable and trajectories are long and tool-augmented; elevates rubrics from mere final evaluators to a shared interface for policy decomposition—aligning sub-policies/skills with rubric dimensions and enabling meta-learning for cross-task reuse; converts past attempts into reusable experience via rubric-guided experience extraction, going beyond standard post-training that relies on verifiable rewards or one-off preference signals.

Policy Gradient Methods for Non-Markovian Reinforcement Learning 🆕NEW
- 赛道归属: 强化学习理论 + 非马尔可夫决策过程（Non-Markovian RL）+ 策略梯度
- 核心创新点: 针对观测与奖励依赖完整历史的NMDP，提出“以奖励为中心”的联合优化框架：智能体通过递归更新的内部状态对历史进行压缩表征，同时不将该状态动力学固定或仅用预测目标学习，而是与策略一起直接围绕回报目标端到端优化；由此推导适用于非马尔可夫环境的策略梯度方法，使“记忆/状态更新机制”成为可优化的决策组成部分，系统性处理长程依赖下的梯度估计与学习。
- Track: RL theory + Non-Markovian decision processes + Policy gradient
- Core innovation: Addresses NMDPs where observations/rewards depend on full interaction history via a reward-centric formulation that jointly optimizes the policy and a recursively updated internal state (memory) that summarizes history; unlike approaches with fixed state dynamics or purely predictive training, it makes the state-update mechanism an end-to-end, return-optimized component, yielding policy-gradient methods tailored to non-Markovian settings and long-range dependencies.

Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models 🆕NEW
- 赛道归属: 文生图（扩散/Flow-Matching）/ 强化学习后训练（可扩展对齐）
- 核心创新点: 提出Reinforce Adjoint Matching，将RL后训练改写为与扩散/flow-matching预训练同构的“回归式”目标：通过伴随(Adjoint)匹配把奖励信号注入到可解析/可回归的训练靶中，避免昂贵的SDE rollout、显式reward梯度或不稳定的替代损失；从而在保持原有可扩展训练结构的同时，实现大规模偏好对齐与可控提升。
- Track: Text-to-Image (diffusion/flow-matching) / scalable RL post-training
- Core innovation: Introduces Reinforce Adjoint Matching, reformulating RL post-training into a regression-like objective structurally aligned with diffusion/flow-matching pretraining; uses adjoint matching to inject reward into tractable regression targets, avoiding costly SDE rollouts, explicit reward gradients, or unstable surrogate losses—enabling scalable preference alignment while preserving the pretraining-friendly training structure.

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies 🆕NEW
- 赛道归属: 强化学习算法 + 样本效率优化（离线数据/示范/先验策略融合的Actor-Critic加速）
- 核心创新点: 提出XQCfD框架以加速“快速Actor-Critic”类算法在真实机器人等高成本探索场景下的学习：同时利用先验数据（如专家示范/历史回放）与先验策略（已有控制器/旧策略）进行初始化与训练信号增强；指出现有做法在“如何使用先验数据与先验策略”上存在设计缺陷，导致未能达到可实现的样本效率上限，并通过更紧耦合的利用方式（如更有效的行为约束/目标构造/更新机制）提升稀疏奖励与困难探索任务的收敛速度与稳定性。
- Track: RL algorithms + Sample-efficiency (accelerating actor-critic with offline data/demos and prior policies)
- Core innovation: Introduces XQCfD to accelerate fast actor-critic algorithms in real-world, exploration-expensive settings by leveraging both prior data (e.g., demonstrations, logged experience) and prior policies (existing controllers/previous policies); identifies a key design gap in how existing methods incorporate these priors, leaving sample-efficiency gains unrealized, and proposes a more tightly integrated use of prior data/policies to improve stability and speed on sparse-reward, hard-exploration tasks.

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework 🆕NEW
- 赛道归属: 强化学习理论 + 自然策略梯度（NPG）+ Bellman算子/策略迭代统一框架
- 核心创新点: 为自然策略梯度给出一个“精确等价”的Bellman算子视角：提出双重平滑策略迭代（DSPI），其中新策略通过对“过去Q函数的加权平均”做正则化贪婪（regularized greedy）得到；该框架把传统策略迭代、dual-averaged策略迭代与NPG统一为同一类算子迭代的特例，揭示NPG可解释为“平滑 + 平均”的策略改进过程，为算法设计（不同平滑/权重/正则选择）与收敛分析提供可组合的理论模板。
- Track: RL theory + Natural Policy Gradient + Bellman-operator / policy-iteration unification
- Core innovation: Provides an exact Bellman-operator formulation of natural policy gradient via Doubly Smoothed Policy Iteration (DSPI): each new policy is obtained by a regularized greedy step applied to a weighted average of past Q-functions; unifies classical policy iteration, dual-averaged policy iteration, and NPG as special cases, interpreting NPG as “smoothing + averaging” policy improvement and enabling modular algorithm design and convergence analysis through choices of smoothing, weights, and regularizers.
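The "smoothing + averaging" reading can be checked numerically in the simplest case: a single state, a tabular softmax policy, and a uniform start. There, repeated multiplicative NPG updates coincide with one softmax step on the accumulated past Q-values. The toy Q-values and function names below are assumptions; this is a sanity check of that well-known telescoping identity, not the paper's DSPI operator.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def npg_step(policy, q, eta):
    """Softmax NPG update: pi_new(a) is proportional to pi(a) * exp(eta * q(a))."""
    return normalize([p * math.exp(eta * qa) for p, qa in zip(policy, q)])

def greedy_on_sum(q_history, eta):
    """Regularized greedy (softmax) step on the accumulated past Q-values."""
    summed = [sum(col) for col in zip(*q_history)]
    return normalize([math.exp(eta * s) for s in summed])

eta = 0.5
q_history = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.4, 0.3, 0.8]]  # toy per-iteration Q

policy = [1 / 3] * 3  # uniform start
for q in q_history:
    policy = npg_step(policy, q, eta)

closed_form = greedy_on_sum(q_history, eta)
```

The sum of past Q-values equals their uniform average up to a factor that can be absorbed into `eta`, which is how the "weighted average" view arises in this toy case.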

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents 🆕NEW
- 赛道归属: 强化学习 + LLM智能体自进化/经验蒸馏（端到端优化的经验驱动适应）
- 核心创新点: 针对“经验驱动自进化智能体”常停留在系统工程层（经验如何存储/检索/管理），提出端到端优化范式：把从交互中提炼可复用经验、并在部署时适应新任务的能力，作为可训练目标直接反向优化基础模型/代理组件；强调提升模型在抽象、泛化与in-context学习上的内生能力，使经验蒸馏不只是外部记忆机制，而是形成可持续自我改进的学习闭环（交互→提炼→再利用→能力提升）。
- Track: RL + LLM agents self-evolution / experience distillation (end-to-end optimization for deployment-time adaptation)
- Core innovation: Moves beyond system-level choices (how to store/retrieve/manage experience) by proposing end-to-end optimization of experience-driven self-evolving capability: treats extracting reusable experience from interactions and adapting at deployment as a trainable objective that directly optimizes the foundation model/agent components; focuses on strengthening intrinsic abstraction, generalization, and in-context learning so experience distillation becomes a learning loop (interact → distill → reuse → improve), not merely an external memory add-on.

Controllability in preference-conditioned multi-objective reinforcement learning 🆕NEW
- 赛道归属: 多目标强化学习（Preference-conditioned MORL）+ 评测指标/可控性分析
- 核心创新点: 提出“可控性（controllability）”作为偏好条件多目标RL的关键性质与评测维度：指出现有MORL指标可能在性能上看似优秀，但策略对偏好输入不敏感，导致用户无法通过改变偏好可靠地改变行为；围绕“偏好变化是否引起预期的行为/回报权衡变化”构建可评估的度量与分析框架，用于诊断与比较不同偏好条件化方法，并推动训练目标/正则化朝提升偏好响应性与可操控性方向改进。
- Track: Multi-objective RL (preference-conditioned MORL) + evaluation metrics / controllability
- Core innovation: Introduces controllability as a first-class property for preference-conditioned MORL, highlighting that standard MORL metrics can be high even when the policy is insensitive to the preference input; develops an evaluation/analysis framework that measures whether changing preferences reliably induces the intended behavioral and trade-off shifts, enabling diagnosis and comparison of methods and motivating training objectives/regularization that improve preference responsiveness and user control.
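As an illustration of what a controllability check might measure, the sketch below sweeps the preference weight on one objective and asks how often the achieved return moves in the same direction. The `concordance` score and the toy returns are hypothetical; the paper's actual metrics are not given in this digest.

```python
from itertools import combinations

def concordance(weights, returns):
    """Fraction of preference pairs whose return ordering follows the weight ordering."""
    pairs = list(combinations(range(len(weights)), 2))
    agree = sum(
        1 for i, j in pairs
        if (weights[i] - weights[j]) * (returns[i] - returns[j]) > 0
    )
    return agree / len(pairs)

weights = [0.0, 0.25, 0.5, 0.75, 1.0]     # preference weight on objective 1
responsive = [1.0, 2.1, 3.0, 4.2, 5.0]    # objective-1 return tracks the weight
insensitive = [3.0, 3.0, 3.0, 3.0, 3.0]   # same behaviour for every preference

score_responsive = concordance(weights, responsive)
score_insensitive = concordance(weights, insensitive)
```

A policy can sit near the Pareto front on aggregate metrics and still score zero here, which is the failure mode the entry highlights.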

GitHub

[2026-05-12] huggingface/trl ⭐18353

Train transformer language models with reinforcement learning.

[2026-05-12] rllm-org/rllm ⭐5497

Democratizing Reinforcement Learning for LLMs

[2026-05-12] google-deepmind/open_spiel ⭐5211 🆕NEW

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

[2026-05-13] radixark/miles ⭐1322

Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.

[2026-05-12] pettingllms-ai/PettingLLMs ⭐167 🆕NEW

[ICLR'26] Stronger-MAS: A RL Framework for multi LLM agent system; [arxiv] MetaAgent-X: End-to-End Reinforcement Learning Automatic Multi-Agent Syste...

HuggingFace Datasets

[2026-05-03] ADSKAILab/Zero-To-CAD-1m

One million executable, interpretable CAD construction sequences synthesized entirely without real-world data.

...

[2026-05-12] TuringEnterprises/Open-MM-RL 🆕NEW

Open-MM-RL is a multimodal STEM reasoning dataset covering Physics, Mathematics, Biology, and Chemistry. It is designed for...

[2026-04-23] nvidia/Nemotron-Personas-Korea

우리나라 실제 분포에 기반한 합성 페르소나를 위한 복합 AI 시스템 / A compound AI approach to personas grounded in real-world dist...

世界动作模型 / World Action Model

arXiv

[2026-05-08] Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
- 赛道归属: 世界模型评测与可靠性诊断（World Action Model / 动态一致性）
- 核心创新点: 提出并系统化定义WAM可靠性的关键缺失维度——动作-状态一致性（action-state consistency），用于检验“模型生成的未来”是否与其声称的动作序列在动力学上相容，而不仅是视觉上合理；围绕该一致性构建诊断框架/评测思路，将WAM的失效从“看起来对”细化为“动力学不兼容”的可检测问题，从而为后续训练目标、校准与安全执行提供可操作的评价轴。
- Track: World-model evaluation & reliability diagnostics (World Action Model / dynamic consistency)
- Core innovation: Introduces and formalizes action–state consistency as a missing reliability axis for WAMs, testing whether imagined futures are dynamically compatible with the predicted action sequence rather than merely visually plausible; builds a diagnostic/evaluation perspective around this notion to make WAM failure modes measurable as dynamical incompatibility, enabling more actionable assessment for calibration, training objectives, and safe deployment.
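A toy version of an action-state consistency check is sketched below: roll the predicted actions through a reference dynamics function and compare the result with the imagined states. The 1-D dynamics, the mean-absolute-gap error, and the function names are assumptions for illustration only.

```python
def rollout(state, actions, dynamics):
    """States reached by applying `actions` through a reference dynamics function."""
    states = []
    for a in actions:
        state = dynamics(state, a)
        states.append(state)
    return states

def consistency_error(start, actions, imagined_states, dynamics):
    """Mean absolute gap between imagined states and action-implied states."""
    implied = rollout(start, actions, dynamics)
    return sum(abs(i - s) for i, s in zip(imagined_states, implied)) / len(implied)

def step(s, a):
    """Toy 1-D dynamics: position += velocity command."""
    return s + a

actions = [1.0, 1.0, -0.5]
compatible = [1.0, 2.0, 1.5]           # matches what the actions would produce
plausible_but_wrong = [1.0, 2.0, 3.0]  # smooth-looking, but ignores the last action

err_ok = consistency_error(0.0, actions, compatible, step)
err_bad = consistency_error(0.0, actions, plausible_but_wrong, step)
```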

[2026-05-07] When to Trust Imagination: Adaptive Action Execution for World Action Models
- 赛道归属: 世界模型驱动的机器人控制（自适应执行 / 想象-现实一致性验证）
- 核心创新点: 将WAM的执行策略从“每次推理固定执行N步”提升为自适应动作执行：把是否继续执行想象动作序列建模为未来-现实验证（future-reality verification）问题；核心方法论是在执行过程中持续对比模型想象的未来与真实滚动的偏差/一致性，并据此动态决定执行更长的开环段还是提前重规划，从机制上缓解因想象漂移导致的失控与累积误差，实现“何时信任想象”的可决策化。
- Track: World-model-based robotic control (adaptive execution / imagination–reality verification)
- Core innovation: Replaces the standard “execute a fixed N predicted actions per inference” paradigm with adaptive action execution, formulating it as a future–reality verification problem; methodologically, it continuously checks consistency between imagined rollouts and real-world evolution during execution and uses this signal to decide whether to keep executing longer open-loop segments or replan early, mitigating imagination drift and compounding errors via an explicit trust-and-replan mechanism.
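The trust-and-replan loop can be sketched as follows, assuming scalar states and a fixed deviation tolerance. `execute_adaptively`, `ToyEnv`, and the threshold rule are hypothetical stand-ins, not the paper's verification mechanism.

```python
def execute_adaptively(actions, imagined, observe, act, tolerance=0.1):
    """Run actions open-loop until reality drifts from the imagined rollout.

    Returns the number of actions executed before a replan is requested.
    """
    for k, (action, expected) in enumerate(zip(actions, imagined), start=1):
        act(action)
        if abs(observe() - expected) > tolerance:
            return k  # drift detected: stop and replan from the observed state
    return len(actions)

class ToyEnv:
    """1-D environment where one action can silently fail (a 'slip')."""
    def __init__(self, slip_at=None):
        self.state, self.t, self.slip_at = 0.0, 0, slip_at

    def act(self, a):
        self.t += 1
        self.state += 0.0 if self.t == self.slip_at else a

    def observe(self):
        return self.state

actions = [1.0, 1.0, 1.0, 1.0]
imagined = [1.0, 2.0, 3.0, 4.0]
clean, slippery = ToyEnv(), ToyEnv(slip_at=3)
steps_clean = execute_adaptively(actions, imagined, clean.observe, clean.act)
steps_slippery = execute_adaptively(actions, imagined, slippery.observe, slippery.act)
```

In the clean run the whole chunk is executed open-loop; in the slippery run execution stops at the step where the observation diverges from the imagined state.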

GitHub

[2026-05-12] DravenALG/awesome-vla-wam ⭐373

A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond

[2026-05-12] jiangranlv/DyWA ⭐81 🆕NEW

[ICCV 2025] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Generated automatically by Daily AI Digest Agent. 生成时间 / Generated at: 2026-05-13 01:02:49