Daily AI Digest - 2026-05-12
Image Generation/Editing
arXiv
- Flow-OPD: On-Policy Distillation for Flow Matching Models 🆕NEW
- Track: Text-to-image (Flow Matching) alignment / post-training (RL, distillation)
- Core innovation: Proposes Flow-OPD, a unified post-training framework that brings On-Policy Distillation from the LLM domain to Flow Matching text-to-image models. A closed loop of on-policy sampling → preference/reward-driven updates → distillation back into the student targets two pain points of multi-task alignment: sparse supervision from scalar rewards and gradient interference from jointly optimizing heterogeneous objectives. This reduces metric seesawing and reward hacking and makes multi-metric alignment more stable and controllable.
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
- Track: Text-to-image (complex-intent generation / structured planning and verification)
- Core innovation: Introduces “semantic commitments” and names the point where they break during the generation lifecycle the “Conceptual Rift,” so that complex requirements can be decomposed and tracked across text understanding, image generation, and result verification. Conditional skill orchestration schedules specialized sub-skills (attribute binding, relational constraints, counting/layout, etc.) per commitment unit with closed-loop checking, reducing failures where constraints are satisfied locally but violated globally and improving consistency on complex instructions.
- HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models 🆕NEW
- Track: Image editing / controllable generation (conditioning-space control in diffusion models)
- Core innovation: Introduces HEART, which models text-conditioning embeddings with hyperspherical geometry rather than the usual Euclidean assumption and performs controlled traversal/interpolation on the sphere via a Kent-distribution representation. The summary traces the common failure, where changing a subject or attribute also disturbs background and detail, to embedding operations that ignore the space's actual geometry. The distributional, directional spherical representation is meant to decouple semantic and attribute changes, making edits more local and consistent with fewer artifacts.
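To make the geometric point in the HEART entry concrete, here is a minimal sketch of plain spherical linear interpolation (slerp) between two embeddings. It is not the paper's Kent-distribution traversal, which the summary does not specify; the helper names and the toy 3-D vectors are assumptions for illustration only. It shows that a straight-line midpoint leaves the unit sphere while a great-circle midpoint stays on it.

```python
import math

def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def slerp(a, b, t):
    """Interpolate along the great circle between a and b (both projected to the unit sphere).

    Antipodal inputs are not handled in this sketch.
    """
    a, b = normalize(a), normalize(b)
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(cos_omega)
    if omega < 1e-8:  # directions are (nearly) identical
        return a
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

src = [1.0, 0.0, 0.0]  # hypothetical "source" embedding
dst = [0.0, 1.0, 0.0]  # hypothetical "target" embedding
mid = slerp(src, dst, 0.5)
linear_mid = [0.5 * x + 0.5 * y for x, y in zip(src, dst)]
print(norm(mid), norm(linear_mid))  # the slerp midpoint keeps unit length; the linear one shrinks
```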
- Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
- Track: Image editing (exemplar-based editing / few-shot transfer)
- Core innovation: Reduces exemplar-based editing from pair-of-pairs supervision to single-pair supervision. It explicitly extracts a transferable edit “delta” (difference) representation from one source–target exemplar pair and injects it through an adapter, decoupling edit semantics from image content. Generalizable edits can then be learned without a second pair sharing the same edit semantics, which lowers data-construction cost and scales across edit types.
- BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing
- Track: Image editing / coarse-mask local editing
- Key innovation: BRIDGE tackles mask-shape bias in coarse-mask editing by formalizing a Two-Zone Constraint: the background must stay stable, while the edit region follows the instruction without being pulled toward the rough mask boundary. Background Routing isolates background information from the generation pathway, and Isolated Discrete Gating adds discrete gates inside the edit region to control information flow and boundary leakage, giving stronger background preservation and freer object shapes under rough masks.
- [2026-05-08] SIMI: Self-information Mining Network for Low-light Image Enhancement 🆕NEW
- Track: Image enhancement (low-light enhancement, unsupervised)
- Core innovation: Proposes SIMI, an unsupervised low-light enhancement framework built on bit-plane decomposition: the low-light image is split into multiple information components, and “self-information mining” extracts the structure and detail cues already present in the input rather than relying on heavier external supervision or priors. This decompose-and-recompose learning paradigm improves detail recovery and the brightness–noise trade-off in low light.
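Bit-plane decomposition, the building block named in the SIMI entry, is a standard operation and can be sketched in a few lines. This covers only the decomposition and lossless recomposition of 8-bit values; how SIMI mines or reweights the planes is not described in the summary and is not modeled here. The function names and the sample pixel row are assumptions.

```python
def bit_planes(pixels):
    """Split 8-bit pixel values into 8 binary planes, most significant bit first."""
    return [[(p >> bit) & 1 for p in pixels] for bit in range(7, -1, -1)]

def recompose(planes):
    """Rebuild pixel values from planes ordered most significant bit first."""
    values = [0] * len(planes[0])
    for plane in planes:
        values = [(v << 1) | b for v, b in zip(values, plane)]
    return values

dark_row = [3, 12, 7, 20, 0, 15]  # hypothetical low-light intensities (0..255)
planes = bit_planes(dark_row)

# The decomposition is lossless: stacking the planes back gives the input.
assert recompose(planes) == dark_row

# In a dark image the high planes are empty, so the signal sits in the low planes.
empty_high_planes = sum(1 for plane in planes if not any(plane))
print(empty_high_planes)
```

For this row every value is below 32, so the three most significant planes carry no information, which is the kind of structure a low-light method can exploit.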
- [2026-05-08] ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
- Track: Image editing (evaluation / interpretable scoring and reward modeling)
- Core innovation: Brings reinforcement learning to the evaluation of text-guided image editing. It builds interpretable evaluation signals rather than scalar scores alone and trains an evaluator, usable as a reward model, that outputs reasoning traces and error attributions. RL optimization teaches the evaluator to give readable justifications when it flags artifacts, unintended edits, or aesthetic regressions, addressing the lack of explanation data and trainable reward models in existing evaluation.
- [2026-05-08] EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
- Track: Image editing (agentic refinement / human-aligned local correction)
- Core innovation: Proposes an agentic framework for refining editing results. Model behavior is aligned through a dataset built from human feedback (e.g., fine-grained defects with repair instructions or preferences), and a diagnose–localize–repair pipeline provides stronger spatial grounding than one-shot suggestions from a weakly grounded VLM or repeated resampling. This enables reliable local repairs (unnatural objects, inconsistent lighting, damaged local texture) while avoiding semantic drift.
- [2026-05-08] EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing 🆕NEW
- Track: Image editing (visual-prompt / example-pair guided edit transfer, diffusion/DiT)
- Core innovation: Presents EditTransfer++ for edit transfer driven by a before/after example pair plus a visual prompt. It addresses two structural mismatches in prior Diffusion Transformer methods: (1) backbone pretraining is biased toward text conditioning, which hurts faithfulness to the exemplar edit, and (2) sampling randomness makes edits unstable. Task-specific conditioning injection and training/inference design improve reproducibility, stability, and fidelity while remaining efficient.
- [2026-05-08] OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing
- Track: Text-to-image safety (tool-calling agents / jailbreaking and red-teaming)
- Core innovation: Introduces an orchestration-guided fuzzing framework for jailbreaking tool-calling T2I agents. Rather than optimizing a single-turn prompt, it systematically explores the composition space of multi-step toolchains and exploits “benign individually, harmful jointly” orchestration vulnerabilities to generate attack sequences. Guided mutation and coverage-driven search over the agent's planning and tool-invocation traces uncover safety failures that prompt-only jailbreaks rarely reach.
GitHub
- [2026-05-12] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐11932
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-05-11] vibheksoni/free-ai ⭐413
Free OpenAI-compatible AI API with 16,000+ models, image generation, tool calling, and Discord key signup.
- [2026-05-11] Siddhesh2377/ToolNeuron ⭐393 🆕NEW
On-device AI for Android — LLM chat (GGUF/llama.cpp), vision models (VLM), image generation (Stable Diffusion), tool calling, AI personas, RAG knowled...
- [2026-05-11] AceDataCloud/Nexior ⭐369
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-05-11] etkecc/baibot ⭐223 🆕NEW
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
HuggingFace Models
HuggingFace Datasets
- [2026-05-07] unh1nge/comfyui-character-composer
AIO Qwen Workflow
The repository now includes:
AIO Comfyui-Character-Composer Qwen Workflow.json
A unified all-in-one Qwen wor...
Video Generation/Editing
arXiv
- [2026-05-08] OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos 🆕NEW
- Track: Video editing (medical/surgical video, text-guided, training-free)
- Core innovation: Proposes a training-free, text-guided editing framework for ophthalmic surgical videos. Without retraining or fine-tuning the diffusion model, it converts text instructions into controllable constraints on the generation process to make targeted changes to surgical attributes (e.g., instrument–tissue interactions, procedural phases). The main difficulty addressed is the strict anatomical consistency and temporal continuity of surgical scenes: constraints and guidance are applied during diffusion sampling (e.g., at the attention or condition-injection level) so that local semantics change while instrument geometry, tissue structure, and cross-frame consistency are preserved, as high-fidelity, low-risk clinical video editing requires.
- [2026-05-08] Do Joint Audio-Video Generation Models Understand Physics?
- Track: Multimodal evaluation (joint audio-video generation / physics-consistency benchmark)
- Key innovation: Introduces AV-Phys Bench to test whether joint audio-video generators are consistent with physical commonsense rather than merely producing plausible-looking sound and picture. The benchmark covers three scenario types (Steady State, Event Transition, Environment Transition) and designs its tests around physical causality and cross-modal consistency, raising the evaluation target from audio-visual synchronization to physically explainable consistency.
- Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
- Track: Video generation / DiT inference acceleration (heterogeneous denoising step allocation)
- Key innovation: Proposes training-free Heterogeneous Step Allocation (HSA), which drops the fixed scheme of running every spatiotemporal token for the same number of denoising steps (e.g., 40). Steps are assigned per spatial location and time slice according to token importance and redundancy, motion redundancy in particular: low-contribution tokens stop early or update in fewer steps, while critical tokens keep their full iterations. This substantially reduces total compute without retraining while aiming to preserve temporal coherence and detail quality.
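The compute saving behind heterogeneous step allocation can be illustrated with a toy budget calculation. This is a sketch under stated assumptions, not the HSA algorithm: the importance scores, the linear mapping from score to step count, and the `min_steps` floor are all hypothetical, and only the 40-step ceiling comes from the paper title.

```python
def allocate_steps(importance, max_steps=40, min_steps=10):
    """Map each token's importance score in [0, 1] to a denoising step count.

    Linear mapping and min_steps are illustrative assumptions.
    """
    return [min_steps + round(s * (max_steps - min_steps)) for s in importance]

def active_tokens(step_budget, step):
    """Indices of tokens that are still being updated at a given 0-based step."""
    return [i for i, n in enumerate(step_budget) if step < n]

scores = [0.0, 0.2, 1.0, 0.5]          # hypothetical per-token importance
budget = allocate_steps(scores)         # per-token step counts
total_updates = sum(budget)             # token updates actually performed
uniform_updates = 40 * len(scores)      # token updates under a uniform 40-step schedule

print(budget, total_updates, uniform_updates)
print(active_tokens(budget, 20))        # only the important tokens remain active late on
```

In this toy case the low-importance tokens drop out early, so the number of token updates falls well below the uniform schedule while the most important token still receives all 40 steps.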
- [2026-05-07] FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
- Track: Long video generation (training-free / consistency enhancement)
- Key innovation: FreeSpec extends short-video diffusion models to long videos without training via Singular-Spectrum Reconstruction, mitigating content drift, temporal inconsistency, and over-smoothed motion. Unlike global/local branch methods that split appearance from motion with predefined rules, it uses spectral reconstruction to couple appearance consistency with action progression adaptively, reducing the distortion caused by mis-assignment.
- [2026-05-07] SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
- Track: Image-to-video (efficient high-resolution generation)
- Key innovation: SwiftI2V introduces conditional segment-wise generation for 2K image-to-video, splitting synthesis into conditioned segments/blocks to cut memory and latency while preserving the fine-grained structure of the input image. Compared with cascades of low-resolution generation plus generic video super-resolution, it injects the input-condition constraint during generation itself, reducing hallucinated detail and drift from the input's local structure.
- [2026-05-07] EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields 🆕NEW
- Track: Video generation (robotic world models / action-conditioned video diffusion / action–visual alignment)
- Core innovation: Introduces EA-WM, an event-aware generative world model that uses structured kinematic-to-visual action fields to map robot action and kinematic signals explicitly into spatiotemporal control for video diffusion. Rather than treating video as an auxiliary representation for policy learning, it targets the inverse problem of action-driven video synthesis, adding interpretable action structure and an event-awareness mechanism to generation. The aim is better controllability and fidelity in action-conditioned future-video generation while preserving precise robot geometry, motion consistency, and physical plausibility.
- [2026-05-07] RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
- Track: Video editing / video-to-video (real-time novel-view generation, interactive camera control)
- Key innovation: RealCam proposes a real-time, causal architecture for camera-controllable novel-view generation from monocular video, replacing the earlier paradigm of non-causal full-sequence processing and prefix concatenation. By avoiding the quadratic cost and latency of bidirectional attention, the model can follow camera-trajectory controls in a low-latency, streaming fashion with improved temporal coherence in online generation.
- [2026-05-06] FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
- Track: Text-to-video (facial identity preservation / controllable generation)
- Key innovation: Proposes FaithfulFaces, a pose-shared identity representation learning framework that strengthens identity consistency under large pose changes and occlusion. It disentangles identity features from pose and shares/aligns the identity representation across pose conditions, reducing pose-induced identity drift and facial-structure distortion in complex dynamic scenes.
- [2026-05-06] Stream-T1: Test-Time Scaling for Streaming Video Generation
- Track: Video generation (streaming generation / test-time scaling)
- Key innovation: Stream-T1 moves Test-Time Scaling (TTS) from multi-candidate exploration on conventional diffusion to streaming video generation, which suits it better: chunk-level synthesis and fewer denoising steps sharply cut the cost of candidate search. A temporal guidance mechanism across chunks improves cross-chunk coherence, giving more scalable quality gains at inference time without retraining.
- Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering
- Track: Video generation / DiT inference optimization (training-free sparse attention)
- Key innovation: Observes that attention sparsity patterns are input-stable and builds a training-free sparse attention scheme from offline sparsity profiling plus online QK co-clustering. Offline, finer-grained per-layer sparsity profiles handle heterogeneity across layers; online, joint query-key block partitioning/clustering models QK coupling explicitly and avoids the information loss of partitioning one side only. This improves the quality–speed trade-off of accelerating 3D attention without retraining.
GitHub
- [2026-05-11] Anil-matcha/Open-Generative-AI ⭐12838
Unrestricted, open-source alternative to AI video platforms — Free, unrestricted AI image & video generation studio with 200+ models (Flux, Midjourney...
- [2026-05-11] ZeroLu/awesome-seedance ⭐1710
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-05-11] YouMind-OpenLab/awesome-seedance-2-prompts ⭐986
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-05-11] Guoxu1233/DreamID-Omni ⭐254 🆕NEW
[ICML 2026] DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
- [2026-05-11] AMAP-ML/MACE-Dance ⭐84 🆕NEW
[SIGGRAPH 2026] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
HuggingFace Models
Audio Generation
arXiv
- [2026-05-07] Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
- Track: Audio generation evaluation / optimal-transport distance metrics
- Key innovation: Proposes OTAD to address two structural weaknesses of FAD (Fréchet Audio Distance). For the cost term, it learns a residual Riemannian ground-metric adapter instead of relying on a frozen embedding pullback, so that the invariances of the frozen embedding do not mask artifacts. For the coupling term, it replaces the Gaussian-fit approximation with discrete, entropy-regularized OT, improving sensitivity to artifacts and to rank-1/contaminated and fine-grained distortions and yielding a more trustworthy distance for generated audio.
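The coupling side of the OTAD entry, discrete entropy-regularized OT, has a textbook solver in the Sinkhorn iteration, sketched below in plain Python. This is not the OTAD implementation: the ground cost here is plain squared Euclidean distance standing in for the learned Riemannian metric, the two tiny point sets are made-up embeddings, and `reg` and `iters` are arbitrary choices.

```python
import math

def sinkhorn_cost(cost, reg=0.1, iters=200):
    """Entropic OT between two uniform discrete distributions.

    Returns (transport cost under the regularized plan, the plan itself).
    """
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    K = [[math.exp(-c / reg) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    value = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return value, plan

def sq_dist(x, y):
    """Squared Euclidean distance (a stand-in for a learned ground metric)."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

real = [[0.0, 0.0], [1.0, 0.0]]  # hypothetical reference embeddings
fake = [[0.0, 0.1], [1.0, 0.1]]  # hypothetical generated embeddings, slightly shifted
cost = [[sq_dist(x, y) for y in fake] for x in real]
value, plan = sinkhorn_cost(cost)
print(round(value, 4))
```

Because the plan is computed over individual samples rather than over fitted Gaussians, each sample's displacement contributes to the distance directly, which is the property the entry credits for better sensitivity to localized distortions.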
- [2026-05-06] Stage-adaptive audio diffusion modeling 🆕NEW
- Track: Audio generation (diffusion training optimization / adaptive training strategy)
- Key innovation: Introduces a stage-adaptive training framework for audio diffusion models. Because the difficulty and contribution of different parts of diffusion training (e.g., noise levels/timesteps, different conditioning signals) shift as training progresses, it replaces a fixed optimization recipe with dynamic adjustment of training-signal importance and of sampling/weighting strategies, so that early and late training focus on the objectives that matter most at each stage. This improves training efficiency and final generation/restoration quality without changing the generation paradigm.
GitHub
- [2026-05-11] huggingface/diffusers ⭐33595
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-05-11] Lightricks/LTX-2 ⭐6574 🆕NEW
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
- [2026-05-06] OpenMOSS/MOVA ⭐996
MOVA: Towards Scalable and Synchronized Video–Audio Generation
- [2026-05-11] apocas/restai ⭐504
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-05-07] Saganaki22/ComfyUI-Woosh ⭐98
Text-to-audio and video-to-audio using Sony AI's Woosh foundation model.
Large Language Models
arXiv
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
- Track: Inference optimization (test-time scaling) / automated strategy discovery (agentic, RL)
- Key innovation: Proposes the environment-driven AutoTTS, which shifts what researchers design from specific test-time scaling (TTS) heuristics to a searchable environment/interface in which LLM agents explore how to allocate compute and organize reasoning trajectories at test time. Automatic discovery and evaluation of strategies covers a larger TTS strategy space and reduces reliance on hand-written rules and intuition-driven tuning.
- VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection 🆕NEW
- Track: Inference optimization (inference-time self-consistency / confidence-weighted voting)
- Key innovation: Proposes VecCISC, which adds a two-stage “reasoning-trace clustering + candidate answer selection” mechanism to confidence-informed self-consistency (CISC). Sampled reasoning traces are first vectorized and clustered, which reduces the voting bias from near-identical samples and makes better use of genuine diversity; confidence is then aggregated at the cluster level to filter or reweight candidate answers. This keeps the benefit of weighted voting while mitigating wrong selections caused by noisy confidence estimates and duplicated traces.
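A toy comparison shows why cluster-level aggregation can change the outcome of a confidence-weighted vote. This is a sketch, not the VecCISC algorithm: the cluster ids are assumed to come from some prior clustering of embedded reasoning traces, and the rule "each (cluster, answer) group contributes its mean confidence once" is an illustrative assumption.

```python
from collections import defaultdict

def cisc_vote(samples):
    """Plain confidence-weighted vote: sum confidence per answer."""
    scores = defaultdict(float)
    for answer, conf, _cluster in samples:
        scores[answer] += conf
    return max(scores, key=scores.get)

def clustered_vote(samples):
    """Cluster-aware vote: each (cluster, answer) group counts once, at its mean confidence."""
    groups = defaultdict(list)
    for answer, conf, cluster in samples:
        groups[(cluster, answer)].append(conf)
    scores = defaultdict(float)
    for (_cluster, answer), confs in groups.items():
        scores[answer] += sum(confs) / len(confs)
    return max(scores, key=scores.get)

# (answer, confidence, cluster id); all values are made up for illustration.
samples = [
    ("42", 0.5, 0), ("42", 0.5, 0), ("42", 0.5, 0),  # three near-duplicate traces
    ("17", 0.7, 1), ("17", 0.6, 2),                  # two distinct lines of reasoning
]
print(cisc_vote(samples), clustered_vote(samples))
```

The plain vote is carried by three copies of the same trace, while the cluster-aware vote favors the answer reached through two different reasoning paths.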
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
- Track: LLM alignment and reasoning improvement (LLM-as-a-judge, structured reward modeling)
- Core innovation: Introduces Rubric-Grounded RL, which decomposes the reward into verifiable, multi-dimensional criteria (a rubric). A frozen LLM judge, conditioned on auxiliary grounding information, scores each criterion, and the scores are combined by weighted sum. Partial credit per criterion replaces a single holistic score or binary pass/fail signal, giving denser and more controllable learning signals, which improves the generalization of reasoning training and reduces reward hacking.
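The weighted aggregation described in the Rubric-Grounded RL entry can be sketched as follows. The criterion names, weights, and judge scores are hypothetical, and normalizing by the weight total is an assumption; the paper's actual rubric and aggregation are not given in the summary.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight

# Hypothetical rubric and judge output for one response.
weights = {"correct_answer": 0.5, "valid_steps": 0.3, "cites_evidence": 0.2}
judge_scores = {"correct_answer": 0.0, "valid_steps": 1.0, "cites_evidence": 0.5}

reward = rubric_reward(judge_scores, weights)
binary_reward = float(judge_scores["correct_answer"] == 1.0)
print(reward, binary_reward)
```

A response with sound intermediate steps but a wrong final answer gets partial credit under the rubric, whereas a binary outcome reward gives it nothing to learn from.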
- The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents 🆕NEW
- Track: Multi-agent alignment and social behavior (cooperation in social dilemmas / long-context effects)
- Key innovation: Introduces and empirically validates the “memory curse”: in multi-agent social dilemmas, expanding the accessible interaction history (longer context, stronger recall) lowers cooperation in many settings. Large-scale controlled multi-round experiments across models and games are combined with lexical/intent analysis of a large set of reasoning traces to locate the mechanism: forward-looking cooperative intent erodes as history grows, with agents more prone to dwell on blame, retaliation, or negative attribution. Long context thus becomes a measurable, explainable behavioral failure mode rather than a presumed capability gain.
- CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation 🆕NEW
- Track: Text-to-SQL inference optimization (inference-time search and compute budget allocation)
- Key innovation: Proposes CA-SQL, a complexity-aware inference-time strategy for hard Text-to-SQL cases. It first estimates the difficulty of the question/database/query, uses that estimate to drive broader solution-space exploration (diverse candidate SQL, branch expansion), and adaptively allocates limited inference compute to the exploration and refinement steps most likely to benefit (e.g., candidate selection, iterative correction). The core change is turning inference-time compute from a fixed recipe into a difficulty-scheduled search-and-refine pipeline, providing the exploration depth needed for the hardest subsets of benchmarks such as BIRD.
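Difficulty-conditioned budget allocation, the scheduling idea in the CA-SQL entry, can be sketched as a proportional split of a fixed candidate budget. Everything specific here is an assumption: the integer difficulty levels, the proportional rule, and the largest-remainder rounding are illustrative and not taken from the paper.

```python
def allocate_candidates(difficulties, total_budget, min_per_query=1):
    """Split a fixed number of candidate generations across queries by difficulty.

    Every query gets min_per_query; the rest is shared in proportion to difficulty,
    with rounding leftovers handed to the largest fractional shares.
    """
    n = len(difficulties)
    spare = total_budget - min_per_query * n
    if spare < 0:
        raise ValueError("budget is smaller than the per-query minimum")
    total = sum(difficulties)
    if total == 0:  # no difficulty signal: fall back to an even split
        difficulties, total = [1] * n, n
    raw = [spare * d / total for d in difficulties]
    alloc = [min_per_query + int(r) for r in raw]
    leftover = total_budget - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc

difficulties = [1, 3, 6]  # hypothetical difficulty levels for three questions
alloc = allocate_candidates(difficulties, total_budget=13)
print(alloc)
```

The hardest question receives most of the candidate SQL generations, while the total stays fixed, so the extra exploration is paid for by spending less on easy questions.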
- Fast Byte Latent Transformer 🆕NEW
- Track: Inference acceleration and efficient generation (byte-level LMs / non-autoregressive or semi-parallel generation)
- Key innovation: Addresses slow byte-by-byte autoregressive decoding in byte-level LMs by introducing BLT Diffusion (BLT-D) in the Byte Latent Transformer (BLT). An auxiliary block-wise diffusion objective is added alongside standard next-byte prediction, so the model can generate and reconstruct in blocks, more in parallel and faster, reducing generation latency while keeping the generalization advantage of byte-level modeling, which needs no subword vocabulary. Methodologically, byte sequences are mapped to latent block representations that are efficient to generate, and the diffusion-style objective improves block-level generation quality and speed.
- Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph 🆕NEW
- Track: Preference alignment (DPO extensions / graph-structured preference modeling)
- Key innovation: Observes that multi-rollout, multi-candidate preference data naturally forms a preference graph, and that collapsing it into independent pairs, as standard DPO does, discards transitivity, introduces redundant or conflicting supervision, and can destabilize training. The method lifts the alignment objective from pairwise comparison to graph-structured optimization that explicitly uses the global preference relations among candidates for the same prompt (e.g., transitive closure, ranking consistency, or consistency constraints on the graph), making fuller use of the data structure, reducing contradictory gradient signals, and improving training stability and sample efficiency.
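What independent pairs leave out can be seen by treating the observed comparisons as a directed graph and computing its transitive closure. This is a small illustration of the graph view only, not the paper's training objective; the candidate labels and the helper names are assumptions.

```python
from itertools import product

def transitive_closure(nodes, edges):
    """All (better, worse) pairs implied by the observed preference edges."""
    closure = set(edges)
    # product varies its first element slowest, so k is the outer (pivot) loop,
    # as in Floyd-Warshall.
    for k, i, j in product(nodes, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    return closure

def has_conflict(closure):
    """A node preferred over itself means the preferences contain a cycle."""
    return any(a == b for a, b in closure)

candidates = ["a", "b", "c", "d"]                     # four rollouts for one prompt
observed = {("a", "b"), ("b", "c"), ("c", "d")}       # pairs seen as (chosen, rejected)
implied = transitive_closure(candidates, observed)

cyclic = transitive_closure(["a", "b", "c"], {("a", "b"), ("b", "c"), ("c", "a")})
print(len(observed), len(implied), has_conflict(implied), has_conflict(cyclic))
```

Three observed pairs imply six ordered relations here, and the same closure exposes contradictory labels as cycles, which pairwise training treats as unrelated examples.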
GitHub
- [2026-05-12] sgl-project/sglang ⭐27664
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-05-11] NVIDIA-NeMo/NeMo ⭐17195 🆕NEW
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech ...
- [2026-05-12] NVIDIA/TensorRT-LLM ⭐13609 🆕NEW
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-05-11] flagos-ai/FlagGems ⭐995
FlagGems is an operator library for large language models implemented in the Triton Language.
- [2026-05-12] NVIDIA-NeMo/Skills ⭐949 🆕NEW
A project to improve skills of large language models
HuggingFace Datasets
- [2026-05-03] iletisim/dezenformasyon-bultenleri
Disinformation Bulletins of the Directorate of Communications (İletişim Başkanlığı)
Source API: llm.iletisim.gov.tr. Source bulletins: iletisim.gov.tr/turkce/dezenformasyon-bulten...
HuggingFace Spaces
Multimodal Models
arXiv
- Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
- Track: 3D representation learning / efficient 3D semantic alignment for VLMs
- Key innovation: Proposes Proxy3D, which builds efficient 3D proxy representations through semantic clustering and cross-modal alignment, replacing the pixel-aligned 2D visual-token pipeline of conventional VLMs. Visual sequences are compressed into semantically coherent 3D proxy units, balancing spatial consistency (mitigating the spatial instability of implicit-correspondence models) against compute (mitigating the long-sequence overhead of methods with explicit 3D geometric priors), which improves the 3D spatial reasoning and scalability of VLMs.
- Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
- Track: VLM unlearning / privacy compliance and safety (hallucination mitigation)
- Key innovation: Introduces HFRU (Object Hallucination-Free Reinforcement Unlearning). Existing approaches mainly fine-tune the language decoder, which yields surface-level forgetting, leaves the underlying visual representation intact, and tends to introduce object hallucinations; HFRU instead acts directly on the vision encoder for deep semantic removal. It uses two stages: reward-driven optimization toward the forgetting target, followed by stabilization/constraint mechanisms that suppress side effects, removing sensitive visual knowledge more thoroughly while lowering hallucination risk.
- Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
- 赛道归属: 流式视频理解 / 在线视觉记忆管理与压缩
- 核心创新点: 提出语义感知的自适应视觉记忆机制,将“语义信号”显式纳入流式视频token的保留/压缩决策,而非仅依赖视觉相似度启发式;并把检索与压缩进行协同设计(而不是压缩后再补检索),使记忆在不确定查询到来时仍能保留对潜在问题最有用的语义证据,从而提升长时在线理解的实时性与问答命中率。
- Track: Streaming video understanding / online memory management & compression
- Key innovation: Proposes semantic-aware adaptive visual memory that incorporates semantic signals into keep/compress decisions beyond visual-similarity heuristics; co-designs retrieval with compression (instead of post-hoc retrieval after irreversible compression), preserving query-relevant semantic evidence under unpredictable query timing and improving real-time long-horizon streaming QA performance.
- Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
- 赛道归属: 视频理解奖励模型(Reward Modeling)/ 偏好数据与评测基准
- 核心创新点: 提出统一框架覆盖基准设计、偏好数据构建与奖励模型训练,发布VURB(Video Understanding Reward Bench):包含2,100组偏好对,并配套长链路推理(CoT)痕迹以提升监督信号密度与可诊断性;在此基础上训练更高性能的视频奖励模型,为视频生成/视频LLM对齐提供可复现、可量化的评测与训练基础,弥补视频域奖励建模长期缺少高质量基准与数据的问题。
- Track: Video reward modeling / preference data & benchmarking
- Key innovation: Establishes an end-to-end framework for benchmark design, preference data construction, and reward-model training; introduces VURB with 2,100 preference pairs plus long chain-of-thought traces to densify supervision and improve diagnosability; trains stronger video reward models, providing reproducible evaluation/training infrastructure for aligning video generators and Video-LLMs where robust benchmarks/data were previously lacking.
- EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
- 赛道归属: 具身/车载多模态理解(驾驶员状态监测)/ 注视点增强视频理解
- 核心创新点: 提出EyeCue,将眼动/注视信息作为关键中间表征融入第一视角车载视频理解,用于识别“认知分心”这一难以从外显动作判断的状态;核心洞察是认知分心体现在“注视与驾驶场景交互模式”的变化而非简单视线偏移,通过建模注视线索与场景语义/事件的耦合,提高对隐性分心的可检测性与鲁棒性。
- Track: Egocentric multimodal understanding for driver monitoring / gaze-augmented video understanding
- Key innovation: Proposes EyeCue, integrating gaze as a pivotal intermediate signal into egocentric driving video understanding to detect cognitive distraction—hard to infer from overt motions; leverages the insight that distraction manifests as altered gaze–scene interaction patterns (not merely gaze deviation), modeling gaze cues jointly with scene semantics/events to improve detection sensitivity and robustness.
- GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning 🆕NEW
- 赛道归属: 多模态推理(主动视觉/注意力控制)
- 核心创新点: 提出在VLM内部实现“可控注意力”的主动视觉机制,用任务目标驱动的自上而下注意力路由替代被动的全局token堆叠;通过动态聚焦关键局部细节并保持全局上下文的外围感知,提升空间推理能力并降低语言幻觉,形成面向推理的“内生注视/扫视”式视觉信息采样与推理闭环。
- Track: Multimodal reasoning (active vision / attention control)
- Key innovation: Introduces an internal, controllable attention mechanism for VLMs to realize active vision: goal-directed top-down routing replaces passive accumulation of global visual tokens. By dynamically foveating task-relevant regions while preserving peripheral global context, it improves spatial reasoning and reduces linguistic hallucinations, forming a closed loop between attention allocation and reasoning.
- [2026-05-08] Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models 🆕NEW
- 赛道归属: 多模态感知(自动驾驶/机器人ODD安全约束的零样本感知)
- 核心创新点: 将Operational Design Domain(ODD)作为安全与合规的核心约束引入VLM零样本感知流程:在无需任务特定训练的前提下,用ODD条件(环境、道路类型、天气/光照、速度范围等)对VLM的感知输出进行约束、筛选与一致性校验,从而把“能看见什么/该相信什么”与运行域绑定,提升部署可用性与安全可解释性。
- Track: Multimodal perception (ODD-constrained zero-shot perception for autonomy)
- Key innovation: Incorporates the Operational Design Domain (ODD) as a first-class safety/compliance constraint into VLM-based zero-shot perception. Without task-specific training, ODD conditions (scene, road type, weather/illumination, speed regime, etc.) are used to constrain, filter, and consistency-check VLM outputs, explicitly tying what the model should perceive/trust to the operating domain to improve deployability and safety interpretability.
- [2026-05-08] Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
- 赛道归属: 流式视频理解 / 主动响应(Proactive)视频LLM与结构化证据对齐
- 核心创新点: 提出Response-G1,用显式场景图(scene graph)把“随时间累积的视频证据”与“查询所需的响应条件”进行结构化对齐,解决以往隐式、与查询无关的证据建模导致的“何时该回答”困难;采用无需微调的三阶段流程:在线的查询引导场景图构建、证据随时间的结构化更新、以及基于场景图的响应触发判定,从而提升流式场景下的及时性与可控性。
- Track: Proactive streaming video understanding / structured evidence alignment
- Key innovation: Introduces Response-G1, explicitly aligning accumulated streaming video evidence with query-specific response conditions via scene graphs, addressing the “when to respond” challenge caused by implicit, query-agnostic evidence modeling; uses a fine-tuning-free three-stage pipeline—online query-guided scene-graph construction, structured temporal evidence updates, and scene-graph-based response triggering—improving timeliness and controllability in proactive streaming settings.
- [2026-05-08] PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models 🆕NEW
- 赛道归属: 多模态理解(物理感知增强的VLM:偏振成像+语言推理)
- 核心创新点: 提出将偏振成像的物理参数(可消解反射、透明等RGB固有歧义)与开放式语言推理统一到同一VLM框架中:不再局限于固定格式的偏振输出,而是把偏振信息作为可被语言查询与推理的视觉证据源,建立从“物理量→语义→推理”的桥接机制,显著增强对光学歧义场景的可解释理解与问答能力。
- Track: Multimodal understanding (physics-augmented VLM with polarization imaging)
- Key innovation: Integrates polarization-derived physical cues (resolving RGB ambiguities like reflections and transparency) into an open-ended VLM reasoning framework. Instead of producing fixed-format polarimetric outputs, polarization signals become queryable evidence for language-driven reasoning, bridging physical parameters to semantics and improving interpretable understanding/Q&A under optically ambiguous conditions.
- [2026-05-08] Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs 🆕NEW
- 赛道归属: 多模态理解(遥感VLM尺度建模/条件化微调)
- 核心创新点: 针对遥感中跨数量级GSD导致的强尺度域偏移,提出连续尺度条件化替代“GSD作为离散token”的做法:通过参数高效微调框架在模型内部注入连续尺度变量,使同一模型参数能随GSD平滑调制表征与对齐策略,减少尺度混叠与泛化退化,提升跨分辨率的遥感图文理解与检索/问答一致性。
- Track: Multimodal understanding (remote sensing VLM scale modeling / conditioned PEFT)
- Key innovation: Addresses extreme scale shifts across ground sampling distances (GSD) in remote sensing by replacing “GSD-as-discrete-token” with continuous scale conditioning. A parameter-efficient fine-tuning framework injects continuous scale variables to smoothly modulate internal representations/alignment across resolutions, reducing scale entanglement and improving cross-GSD generalization for RS vision-language tasks (retrieval/Q&A/captioning).
GitHub
- [2026-05-11] Blaizzy/mlx-vlm ⭐4694
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-05-11] waybarrios/vllm-mlx ⭐1147
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-08] zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs-CLIP ⭐772
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
- [2026-05-10] dongyangli-del/EEG_Image_decode ⭐203
Using vision-language models to decode natural image perception from non-invasive brain recordings.
- [2026-05-08] ydyhello/Awesome-VLM-Streaming-Video ⭐154
📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.
强化学习 / Reinforcement Learning
arXiv
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
- 赛道归属: 大模型对齐与推理能力提升(LLM-as-a-judge、结构化奖励建模)
- 核心创新点: 提出Rubric-Grounded RL:将奖励分解为可验证的多维标准(rubric),由冻结的LLM裁判在辅助“grounding”信息条件下对各维度打分并加权汇总;用“部分得分/分项反馈”替代单一整体分或二元成败信号,提供更密集、更可控的优化梯度,从而提升推理训练的可泛化性与对奖励投机的抑制能力。
- Track: LLM alignment & reasoning improvement (LLM-as-a-judge, structured reward modeling)
- Core innovation: Introduces Rubric-Grounded RL, decomposing reward into weighted, verifiable criteria scored by a frozen LLM judge conditioned on auxiliary grounding; replaces binary/holistic rewards with multi-criterion partial credit to provide denser, more controllable learning signals, improving generalizable reasoning and reducing reward hacking.
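The weighted, multi-criterion aggregation described in the Rubric-Grounded RL entry can be pictured with a small sketch. This is an illustration only: the `Criterion` class, the criterion names, the weights, and the judge scores are all hypothetical, and the paper's actual rubric format and judge prompting are not specified in this digest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical rubric dimension
    weight: float  # relative importance; weights need not sum to 1


def rubric_reward(scores, rubric):
    """Weighted mean of per-criterion judge scores, each clipped to [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = sum(
        c.weight * min(1.0, max(0.0, scores.get(c.name, 0.0))) for c in rubric
    )
    return weighted / total_weight


# Hypothetical rubric and hypothetical scores from a frozen judge.
rubric = [
    Criterion("correct_final_answer", 2.0),
    Criterion("valid_steps", 1.0),
    Criterion("cites_grounding", 1.0),
]
judge_scores = {"correct_final_answer": 0.0, "valid_steps": 1.0, "cites_grounding": 0.5}

reward = rubric_reward(judge_scores, rubric)  # partial credit instead of a flat 0
```

A binary pass/fail reward would give this response 0 because the final answer is wrong; the rubric version still credits the valid intermediate steps, which is the denser signal the entry refers to.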
- Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
- 赛道归属: 风险敏感强化学习(指数效用、折扣MDP的价值学习理论)
- 核心创新点: 针对固定风险厌恶下的指数效用目标,推导两类Q值形式的Bellman扩展,并证明相应算子在$L_\infty$与sup-log/Thompson等度量下为压缩映射,从而给出收敛性与不动点刻画;补齐指数效用RL中“可证明收敛的值迭代/Q学习式算法”理论空白,为风险敏感控制提供可实现的价值型方法。
- Track: Risk-sensitive RL (exponential utility, value-based theory in discounted MDPs)
- Core innovation: Derives two Q-value-style Bellman extensions for fixed risk-aversion exponential-utility objectives and proves the induced operators are contractions under $L_\infty$ and sup-log/Thompson-type metrics, yielding fixed-point characterization and convergence guarantees—filling a gap for principled, value-based algorithms in exponential-utility RL.
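For orientation, the standard textbook objects behind the exponential-utility entry can be written out as follows. These are generic definitions, not the paper's two specific Bellman operators, which this digest does not state; the symbols $\beta$, $R$, $T$, and $\kappa$ are notation introduced here for illustration.

```latex
% Exponential-utility certainty equivalent of a return R, risk aversion \beta > 0
\mathrm{CE}_\beta(R) = -\frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta R}\right]

% Sup-log (Thompson-type) metric between strictly positive functions f and g
d_T(f, g) = \sup_{x}\,\bigl|\log f(x) - \log g(x)\bigr|

% Contraction property of an operator T under a metric d
d(Tf, Tg) \le \kappa\, d(f, g), \qquad 0 \le \kappa < 1
```

When an operator is a contraction on a complete metric space, the Banach fixed-point theorem gives a unique fixed point and geometric convergence of repeated application, which is the usual route to the kind of convergence guarantee the entry mentions.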
- Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph 🆕NEW
- 赛道归属: 偏好对齐(DPO 扩展 / 图结构偏好建模)
- 核心创新点: 指出多 rollout/多候选的偏好数据天然形成“偏好图(preference graph)”,而将其坍缩为独立成对样本的传统 DPO 会丢失传递性、引入冗余/冲突监督并导致训练不稳定。该工作的方法论突破在于把对齐目标从“成对比较”提升为“图结构优化”:显式利用同一 prompt 下多候选之间的全局偏好关系(如传递闭包、排序一致性或图上的一致性约束),从而更充分地利用数据结构、减少矛盾梯度信号,并提升对齐训练的稳定性与样本效率。
- Track: Preference alignment (DPO extensions / graph-structured preference modeling)
- Core innovation: Observes that multi-rollout preference data induces a rich preference graph, and collapsing it into independent pairs (standard DPO) discards transitivity, adds redundancy/conflicts, and can destabilize training. The methodological advance is upgrading alignment from pairwise comparisons to graph-structured optimization that explicitly leverages global relations among multiple candidates per prompt (e.g., transitive consistency / ranking constraints), improving data utilization, reducing contradictory gradient signals, and enhancing stability and sample efficiency.
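To make the "preference graph" idea in the Beyond Pairs entry concrete, the sketch below treats the candidates for one prompt as nodes and each observed preference as a directed edge, then computes the transitive closure. It only illustrates the graph structure the entry talks about; it is not the paper's training objective, and the function names and toy edges are hypothetical.

```python
from itertools import product


def transitive_closure(n, edges):
    """Return all (winner, loser) pairs implied by `edges` among n candidates."""
    reach = [[(i, j) in edges for j in range(n)] for i in range(n)]
    # Warshall's algorithm: the intermediate node k must be the outermost loop,
    # which product() guarantees because its first index varies slowest.
    for k, i, j in product(range(n), repeat=3):
        if reach[i][k] and reach[k][j]:
            reach[i][j] = True
    return {(i, j) for i in range(n) for j in range(n) if reach[i][j]}


def has_conflict(n, edges):
    """A cycle (a candidate transitively preferred over itself) signals contradictory labels."""
    return any(i == j for i, j in transitive_closure(n, edges))


# Toy data: four rollouts for one prompt, three observed pairwise preferences.
observed = {(0, 1), (1, 2), (2, 3)}
implied = transitive_closure(4, observed) - observed  # relations lost by independent pairs
```

Here `implied` holds the three comparisons that follow from transitivity but never appear as explicit pairs, and `has_conflict` flags cyclic label sets such as `{(0, 1), (1, 0)}`.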
- Learning CLI Agents with Structured Action Credit under Selective Observation 🆕NEW
- 赛道归属: 智能体强化学习(CLI/工具使用代理)、信用分配与部分可观测RL
- 核心创新点: 利用CLI动作天然的结构化属性(如命令-参数-对象等可分解成分)来做更细粒度的动作信用分配,并在“选择性观测/部分可观测”的交互设定下,将可验证的任务反馈与动作结构信号结合,缓解仅靠终局反馈导致的学习稀疏与不稳定问题,从而提升CLI代理在真实文件系统与在线执行反馈中的可学习性与泛化。
- Track: Agentic RL (CLI/tool-use agents), credit assignment under partial/selective observability
- Core innovation: Exploits the inherent structure of CLI actions (e.g., command/flags/arguments/targets) as an additional learning signal for finer-grained action credit assignment. Under selective/partial observation, it couples verifiable task feedback with structured action-level supervision to reduce reward sparsity and instability, improving learnability and generalization for real filesystem + online execution settings.
- Interpreting Reinforcement Learning Agents with Susceptibilities 🆕NEW
- 赛道归属: 强化学习可解释性(Interpretability)、训练动力学分析
- 核心创新点: 将“susceptibilities(易感性)”从监督学习中的损失扰动响应推广到深度强化学习的遗憾/后悔(regret)框架,定义并分析策略对目标扰动的响应,用于揭示训练过程中阶段性能力形成与内部表征变化;在具备非平凡发展阶段的gridworld中验证其能定位“模型在学什么/何时学到”的内部特征。
- Track: RL interpretability, training dynamics analysis
- Core innovation: Generalizes susceptibilities—responses of posterior expectations to loss perturbations—from supervised learning to deep RL by formulating them w.r.t. regret. This yields a diagnostic that exposes stage-wise development and internal feature formation during training, demonstrated in a gridworld with non-trivial learning phases.
- KL for a KL: On-Policy Distillation with Control Variate Baseline 🆕NEW
- 赛道归属: LLM后训练(On-Policy Distillation/OPD)、方差降低与稳定训练(Policy Gradient)
- 核心创新点: 将OPD严格重写为策略梯度RL问题,指出其单样本蒙特卡洛梯度方差导致不稳定;提出vOPD,通过引入控制变量(control variate)基线(典型为价值基线形式)来做方差降低,在不改变目标期望的前提下显著稳定训练,并给出与KL/蒸馏目标一致的推导框架。
- Track: LLM post-training (On-Policy Distillation), variance reduction & stabilization (policy gradient)
- Core innovation: Recasts OPD as a policy-gradient RL objective and attributes practical instability to high-variance single-sample Monte Carlo gradients. Proposes vOPD by adding a control-variate baseline (value-style baseline) to reduce variance without biasing the objective, yielding a principled and more stable on-policy distillation recipe aligned with KL/distillation formulations.
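The variance-reduction claim in the vOPD entry rests on a general property of score-function gradient estimators: subtracting a baseline leaves the expected gradient unchanged while changing the variance. The sketch below checks this exactly (by enumeration, with no sampling) for a three-action softmax policy. The teacher distribution, the per-action signal `log q - log pi`, and the choice of baseline are assumptions made for illustration, not the paper's exact construction.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_stats(logits, signal, baseline):
    """Exact mean and total variance of g(a) = (signal[a] - baseline) * grad log pi(a)."""
    pi = softmax(logits)
    n = len(pi)
    # For a softmax policy, the gradient of log pi(a) w.r.t. the logits is onehot(a) - pi.
    grads = [
        [(signal[a] - baseline) * ((1.0 if i == a else 0.0) - pi[i]) for i in range(n)]
        for a in range(n)
    ]
    mean = [sum(pi[a] * grads[a][i] for a in range(n)) for i in range(n)]
    var = sum(
        pi[a] * sum((grads[a][i] - mean[i]) ** 2 for i in range(n)) for a in range(n)
    )
    return mean, var


student_logits = [0.0, 0.0, 0.0]   # hypothetical student: uniform over 3 tokens
teacher_probs = [0.7, 0.2, 0.1]    # hypothetical teacher distribution
pi = softmax(student_logits)
signal = [math.log(q) - math.log(p) for q, p in zip(teacher_probs, pi)]
expected_signal = sum(p * s for p, s in zip(pi, signal))  # equals -KL(student || teacher)

mean_plain, var_plain = estimator_stats(student_logits, signal, baseline=0.0)
mean_base, var_base = estimator_stats(student_logits, signal, baseline=expected_signal)
```

Both estimators have the same mean gradient, while the baselined one has strictly lower total variance in this toy setting.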
- [2026-05-08] SOD: Step-wise On-policy Distillation for Small Language Model Agents 🆕NEW
- 赛道归属: 小模型工具使用/代理推理(Tool-Integrated Reasoning)、长时序蒸馏式RL后训练
- 核心创新点: 提出SOD(Step-wise On-policy Distillation),面向小语言模型在长时序工具交互中的不稳定与容量受限问题,将传统OPD的“整段轨迹token级监督”改为更细的逐步(step-wise)蒸馏与对齐:在每一步工具调用/中间状态上提供更密集、可控的教师信号与训练分解,从而缓解仅有结果级奖励(如GRPO)过稀疏、以及OPD在TIR场景易发散的问题。
- Track: Small-LM tool-use agents (tool-integrated reasoning), long-horizon distillation-style RL post-training
- Core innovation: Introduces Step-wise On-policy Distillation (SOD) to stabilize long-horizon tool interactions for small models. Instead of trajectory-level OPD, it decomposes supervision step-by-step around tool calls/intermediate states, providing denser and better-conditioned teacher guidance, mitigating sparse outcome rewards (e.g., GRPO) and OPD divergence in TIR settings.
- [2026-05-08] Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works 🆕NEW
- 赛道归属: RLVR/GRPO算法分析与改进(可验证奖励、二值奖励)、优化稳定性
- 核心创新点: 揭示二值奖励下GRPO的组均值中心化优势函数会产生“梯度饥饿”:当组内全对或全错时优势恒为0导致无学习信号;从理论上证明真实退化概率因Jensen不等式必然高于独立伯努利假设,并据此解释实践中的停滞;提出“最简单但有效”的修复(核心是避免纯组均值中心化在退化组上抹零信号,改用能在全对/全错时仍保留梯度的优势/基线处理)。
- Track: RLVR/GRPO algorithmic analysis & fixes (binary rewards), optimization stability
- Core innovation: Identifies and formalizes gradient starvation in GRPO under binary rewards: group-mean centering makes every advantage exactly zero when all samples in a group are correct or all are incorrect, eliminating the learning signal. Proves via Jensen’s inequality that the true degeneracy rate is higher than an i.i.d. Bernoulli estimate, which matches observed training stalls, and proposes a minimal fix that preserves non-zero gradients even in degenerate groups (i.e., avoids pure group-mean centering that zeroes out the signal).
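The degenerate-group effect in the GRPO entry is easy to reproduce in a few lines. The sketch below shows group-mean centering producing all-zero advantages for all-correct or all-wrong groups, and then illustrates one plausible reading of the Jensen argument: because `p**G + (1 - p)**G` is convex in `p`, averaging over prompts with different success rates gives a higher degeneracy probability than plugging in the average success rate. The success rates and group size are made-up values, and the heterogeneity reading is an assumption, since the digest does not spell out the proof.

```python
from statistics import mean


def group_centered_advantages(rewards):
    """Group-mean centering: advantage = reward - mean reward of the group."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def degenerate_prob(p, group_size):
    """P(all correct or all wrong) for i.i.d. Bernoulli(p) rollouts in one group."""
    return p ** group_size + (1 - p) ** group_size


all_correct = group_centered_advantages([1, 1, 1, 1])  # every advantage is zero
all_wrong = group_centered_advantages([0, 0, 0, 0])    # every advantage is zero
mixed = group_centered_advantages([1, 0, 0, 0])        # non-zero learning signal

# Hypothetical per-prompt success rates: one easy prompt, one hard prompt.
success_rates = [0.9, 0.1]
G = 4
heterogeneous = mean(degenerate_prob(p, G) for p in success_rates)
pooled = degenerate_prob(mean(success_rates), G)
```

With these toy numbers the heterogeneous estimate is far above the pooled one, so a single average accuracy understates how often a group carries no gradient.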
- [2026-05-08] Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning 🆕NEW
- 赛道归属: LLM推理RL后训练机理分析(token级信号)、可解释性/诊断指标
- 核心创新点: 用注意力熵(attention entropy)刻画不同token在RL推理训练中的“上下文支撑集中度”,揭示token级学习信号显著异质;提出并验证token级RL目标具有“稀疏可估计性”(随机抽取约20% token即可保留大部分训练信号),为降低计算/方差提供依据;进一步用注意力熵将token分型,解释哪些token更受RL更新驱动、哪些更噪声化,从而为更精细的token采样/加权策略奠基。
- Track: Mechanistic analysis of RL post-training for LLM reasoning (token-level signals), interpretability/diagnostics
- Core innovation: Uses attention entropy to quantify per-token contextual support concentration, revealing strong heterogeneity in token-level RL learning signals. Shows RL objectives are “sparsely estimable” (e.g., ~20% random token subsets retain much of the full-token signal), motivating compute/variance reductions. Attention-entropy-based token typing further explains which tokens drive updates vs. contribute noise, enabling principled token sampling/weighting schemes.
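Attention entropy, as used in the entry above, is the Shannon entropy of a token's attention distribution over its context: low entropy means the token leans on a few positions, high entropy means its support is spread out. A minimal sketch follows; the two attention rows are invented values, and how the paper aggregates over heads and layers is not stated in this digest.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one token's attention weights over context positions."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive mass")
    probs = [w / total for w in weights]  # renormalize in case of rounding
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical attention rows for two tokens over a four-position context.
focused = attention_entropy([0.97, 0.01, 0.01, 0.01])  # concentrated support
diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])  # uniform support, log(4) nats
```

Tokens could then be grouped by thresholding this value, though any particular threshold would be a tuning choice rather than something given in the source.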
- [2026-05-08] Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States 🆕NEW
- 赛道归属: RLVR策略优化(LLM推理)、低成本基线/价值估计(无需独立critic)
- 核心创新点: 提出从actor内部状态直接估计价值基线的策略优化方法(Policy Optimization with Internal State Value Estimation):复用策略模型前向计算中已产生的内部表征/信号来预测baseline,实现几乎零额外开销的方差降低;相较PPO避免训练同规模critic,相较GRPO减少对多rollout组均值稳定性的依赖,从而在可验证奖励场景下以更低成本获得稳定更新。
- Track: RLVR policy optimization for LLM reasoning, low-cost baselines/value estimation without a separate critic
- Core innovation: Proposes estimating the variance-reduction baseline directly from the actor’s internal states computed during the policy forward pass (Policy Optimization with Internal State Value Estimation). This yields near-zero extra-cost value/baseline prediction, avoiding PPO’s full-scale critic and reducing GRPO’s need for multiple rollouts to stabilize group means, enabling more stable and efficient RLVR updates.
GitHub
- [2026-05-11] Unity-Technologies/ml-agents ⭐19392 🆕NEW
The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for traini...
- [2026-05-11] huggingface/trl ⭐18344 🆕NEW
Train transformer language models with reinforcement learning.
- [2026-05-12] rllm-org/rllm ⭐5490
Democratizing Reinforcement Learning for LLMs
- [2026-05-12] natolambert/rlhf-book ⭐1904 🆕NEW
Textbook on reinforcement learning from human feedback
- [2026-05-12] radixark/miles ⭐1310 🆕NEW
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
HuggingFace Datasets
- [2026-05-03] ADSKAILab/Zero-To-CAD-1m
Zero-to-CAD 1M
One million executable, interpretable CAD construction sequences synthesized entirely without real-world data.
...
- [2026-04-23] nvidia/Nemotron-Personas-Korea
Nemotron-Personas-Korea: 우리나라 실제 분포에 기반한 합성 페르소나를 위한 복합 AI 시스템 / A compound AI approach to personas grounded in real-world dist...
世界动作模型 / World Action Model
arXiv
- [2026-05-08] Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
- 赛道归属: 世界模型评测与可靠性诊断(World Action Model / 动态一致性)
- 核心创新点: 提出并系统化定义WAM可靠性的关键缺失维度——动作-状态一致性(action-state consistency),用于检验“模型生成的未来”是否与其声称的动作序列在动力学上相容,而不仅是视觉上合理;围绕该一致性构建诊断框架/评测思路,将WAM的失效从“看起来对”细化为“动力学不兼容”的可检测问题,从而为后续训练目标、校准与安全执行提供可操作的评价轴。
- Track: World-model evaluation & reliability diagnostics (World Action Model / dynamic consistency)
- Core innovation: Introduces and formalizes action–state consistency as a missing reliability axis for WAMs, testing whether imagined futures are dynamically compatible with the predicted action sequence rather than merely visually plausible; builds a diagnostic/evaluation perspective around this notion to make WAM failure modes measurable as dynamical incompatibility, enabling more actionable assessment for calibration, training objectives, and safe deployment.
- [2026-05-07] When to Trust Imagination: Adaptive Action Execution for World Action Models
- 赛道归属: 世界模型驱动的机器人控制(自适应执行 / 想象-现实一致性验证)
- 核心创新点: 将WAM的执行策略从“每次推理固定执行N步”提升为自适应动作执行:把是否继续执行想象动作序列建模为未来-现实验证(future-reality verification)问题;核心方法论是在执行过程中持续对比模型想象的未来与真实滚动的偏差/一致性,并据此动态决定执行更长的开环段还是提前重规划,从机制上缓解因想象漂移导致的失控与累积误差,实现“何时信任想象”的可决策化。
- Track: World-model-based robotic control (adaptive execution / imagination–reality verification)
- Core innovation: Replaces the standard “execute a fixed N predicted actions per inference” paradigm with adaptive action execution, formulating it as a future–reality verification problem; methodologically, it continuously checks consistency between imagined rollouts and real-world evolution during execution and uses this signal to decide whether to keep executing longer open-loop segments or replan early, mitigating imagination drift and compounding errors via an explicit trust-and-replan mechanism.
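The adaptive-execution idea in the last entry can be sketched as a loop that keeps executing the predicted action chunk only while the observed state stays close to the imagined one. Everything here is a toy assumption: states are scalars, the mismatch measure is an absolute difference, and `make_env`, the gain of 0.9, and the tolerance values are hypothetical. The paper's actual verification signal is not described in this digest.

```python
def make_env(gain):
    """Toy 1-D environment whose true dynamics are state += gain * action."""
    state = 0.0

    def step(action):
        nonlocal state
        state += gain * action
        return state

    return step


def execute_adaptively(imagined_states, actions, step_env, tolerance):
    """Run actions open-loop until the observed state drifts from the imagined one.

    Returns the number of actions executed before a replan is needed.
    """
    executed = 0
    for action, imagined in zip(actions, imagined_states):
        observed = step_env(action)
        executed += 1
        if abs(observed - imagined) > tolerance:
            break  # stop trusting the imagined rollout; the caller should replan
    return executed


actions = [1.0] * 5
imagined = [1.0, 2.0, 3.0, 4.0, 5.0]  # world model assumes a gain of exactly 1.0

strict = execute_adaptively(imagined, actions, make_env(0.9), tolerance=0.25)
lenient = execute_adaptively(imagined, actions, make_env(0.9), tolerance=1.0)
```

With the strict tolerance the loop stops after three steps, once the accumulated drift exceeds 0.25; with the lenient tolerance it runs the whole five-step chunk. A fixed-N policy would correspond to ignoring the check altogether.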
GitHub
- [2026-05-11] DravenALG/awesome-vla-wam ⭐368
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent 生成时间: 2026-05-12 01:01:46