AI 每日进展速报 / Daily AI Digest - 2026-06-05

图像生成/编辑 / Image Generation/Editing

arXiv

Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
- 赛道归属: 文生图（偏好对齐/强化学习式对齐，组合生成）
- 核心创新点: 提出Region-aware的双模态直接偏好优化（BiDPO），将“偏好学习”从全图层面对齐推进到“区域级/关系级”的组合语义对齐：通过构建高质控的大规模偏好数据集BiComp，针对属性绑定、对象关系、计数等组合难点提供可学习的偏好信号；并在优化时显式利用区域感知与图文双模态信息，使模型在不改变基础生成范式的情况下，更稳定地满足复杂提示词的结构化约束与局部一致性。
- Track: Text-to-Image (preference alignment / RL-style alignment, compositional generation)
- Core innovation: Proposes BiDPO, a region-aware bimodal Direct Preference Optimization framework that upgrades preference learning from global image alignment to region-/relation-level compositional alignment. It builds a large-scale, strictly quality-controlled preference dataset (BiComp) targeting hard compositional skills (attribute binding, object relations, counting), and optimizes with explicit region awareness plus bimodal (text+image) signals to better satisfy structured constraints in complex prompts without changing the base generation paradigm.
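
For readers unfamiliar with the objective that BiDPO builds on, the generic DPO loss for one preference pair can be sketched as below. This is only the standard formulation, not the paper's region-aware bimodal variant, which the summary above does not specify. The function name, the `beta` value, and the example log-probabilities are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for a single (chosen, rejected) pair.

    Inputs are log-probabilities of each sample under the trainable policy
    and under a frozen reference model (hypothetical values in this sketch).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample over the rejected one, relative to the reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)), computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With no preference margin the loss equals log(2); it shrinks as the
# policy favors the chosen sample more strongly than the reference does.
neutral = dpo_pair_loss(-1.0, -1.0, -1.0, -1.0)
better = dpo_pair_loss(-1.0, -3.0, -2.0, -2.0)
worse = dpo_pair_loss(-3.0, -1.0, -2.0, -2.0)
```

A region-aware variant would presumably compute such margins over localized image or text regions rather than the whole sample, but how BiDPO does this is not described here.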

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
- 赛道归属: 文生图（Text-to-Image）评测基准 / 语义与世界知识对齐评估
- 核心创新点: 提出面向文生图的“世界知识驱动语义评测”基准WISE，将评估重点从传统的画质与浅层文本-图像匹配，提升到对复杂语义理解、隐含常识/事实知识、关系与组合推理等能力的系统化测量；通过构造需要外部世界知识才能判定对错的提示与判别维度，提供更能暴露模型“看似对齐但语义错误”的评测框架，从而推动T2I模型在知识一致性与深层语义对齐上的改进。
- Track: Text-to-Image evaluation benchmark / semantic & world-knowledge alignment assessment
- Key innovation: Introduces WISE, a world-knowledge-informed semantic evaluation benchmark for T2I that shifts emphasis from realism and shallow text-image matching to systematic measurement of complex semantic understanding—commonsense/factual knowledge, relations, and compositional reasoning. By designing prompts and evaluation dimensions that require external world knowledge to judge correctness, it better exposes “plausible-looking but semantically wrong” generations and drives progress on knowledge-consistent, deep semantic alignment.

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
- 赛道归属: 文生图安全对齐 / 推理时安全防护（Text-to-Image Safety Alignment at Inference）
- 核心创新点: 提出一种仅在推理阶段生效的安全防护机制，通过对输入提示词注入并优化“提示噪声”(prompt-noise) 来抑制不安全内容的生成；其关键突破在于把安全约束转化为可优化的推理时变量，无需重新训练/微调模型即可动态调整生成轨迹，从而提升对绕过式提示与对抗攻击的鲁棒性，并在尽量保持画质与文本一致性的前提下实现更稳定的安全过滤。
- Track: Text-to-Image safety alignment / Inference-time safety defense
- Core innovation: Introduces an inference-only safeguarding method that injects and optimizes prompt noise to steer diffusion sampling away from unsafe regions. The key methodological step is formulating safety control as an optimizable inference-time variable, avoiding retraining while improving robustness to jailbreak prompts and adversarial attacks, with minimal degradation to image quality and prompt fidelity.

MemoGen: Can Past Experience Improve Future Text-to-Image Generation?
- 赛道归属: 文生图（Text-to-Image）生成增强 / 记忆与检索增强生成（Memory-augmented Generation）
- 核心创新点: 提出MemoGen，将“单次请求的检索/代理式增强”扩展为“跨任务可积累的经验记忆”机制：把历史生成中的成功/失败案例、隐含约束满足策略、有效提示改写或参考证据进行结构化存储，并在新请求到来时进行检索与复用，以提升对隐式视觉约束、关系推理与外部知识需求场景的可靠性；核心突破在于把T2I生成从一次性优化转为可持续学习的闭环（记录—检索—迁移），减少重复犯错并提高长期一致性。
- Track: Text-to-Image generation enhancement / memory-augmented (experience-reuse) generation
- Key innovation: Proposes MemoGen, extending retrieval/agentic augmentation from per-request assistance to an accumulative experience memory. It stores structured signals from past generations (success/failure cases, constraint-satisfaction tactics, effective prompt rewrites, supporting references) and retrieves them to guide future requests, improving reliability on implicit constraints, relational reasoning, and external-knowledge prompts. The key methodological step is turning T2I generation into a continual closed loop (log–retrieve–transfer) that reduces repeated errors and improves long-horizon consistency.

KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation
- 赛道归属: 文生图（公平性/去偏见）、提示词优化（Prompt Refinement）
- 核心创新点: 提出以知识图谱（Knowledge Graph）为约束与检索支撑的提示词自动精炼框架，在不重训/不改动闭源T2I主干模型的前提下，通过对人口统计属性与职业/场景等语义关系的显式建模，系统性地补全或重写提示词中的敏感与相关属性表达，从而在生成阶段实现更均衡的人群呈现；方法重点在“结构化知识→可控prompt变换”的映射，降低仅靠启发式词替换带来的语义漂移，并兼顾公平性提升与文本意图保持。
- Track: Text-to-Image (fairness/de-biasing), Prompt Refinement
- Core innovation: Introduces a knowledge-graph-guided prompt refinement framework that improves demographic fairness without retraining or modifying (potentially closed-source) T2I backbones. By explicitly modeling relationships between demographic attributes and contextual semantics (e.g., occupations, settings), it automatically augments/rewrites prompts to enforce more balanced representation at inference time. The key methodological advance is mapping structured knowledge constraints into controllable prompt transformations, reducing semantic drift compared to heuristic word swaps while preserving the original intent.

RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
- 赛道归属: 文生图（可控生成）、训练免（Training-free）空间控制/条件注入
- 核心创新点: 提出一种同时具备“结构+外观”双重约束的训练免空间控制方案，通过改进特征注入/融合机制，在扩散采样过程中更稳定地对齐条件图像的几何结构并保留外观细节；针对训练免注入常见的结构错位、条件泄漏（把条件图像纹理/噪声直接拷入结果）与伪影问题，引入更精细的分层/分步控制与抑制策略，使结构遵循与外观一致性可以解耦调节，从而在无需LoRA/微调的情况下获得更可靠的空间可控生成。
- Track: Controllable Text-to-Image, Training-free spatial control / condition feature injection
- Core innovation: Proposes a training-free spatial control method that is rich in both structure and appearance constraints. It improves feature injection/fusion during diffusion sampling to better align geometry from conditional inputs while preserving appearance details. To address common training-free issues—structural misalignment, condition leakage (copying conditional textures/noise), and artifacts—it introduces finer-grained, stage-/layer-wise control and suppression mechanisms, enabling decoupled tuning of structural adherence vs. appearance fidelity without LoRA or finetuning.

Text-to-Image Models Need Less from Text Encoders Than You Think
- 赛道归属: 文生图（Text-to-Image）基础机制分析 / 文本编码器与条件表征消融（Representation/Conditioning Analysis）
- 核心创新点: 系统性检验文生图模型对文本编码器“丰富语义表征”（上下文、组合性、属性绑定等）的真实依赖程度，提出并验证：图像生成模型可能并未充分利用文本嵌入中的高阶语言信息，从而文本编码器并不需要想象中那么强；通过对文本表征不同成分的消融/替换与对生成质量、对齐能力的影响分析，给出更精确的“哪些文本信息是必要的”结论，为简化文本编码器、重分配模型容量、以及改进条件注入方式提供依据。
- Track: Text-to-Image mechanism analysis / text-encoder & conditioning representation ablation
- Key innovation: Systematically probes how much T2I models truly rely on “rich” text-encoder representations (context, compositionality, attribute binding). It argues and empirically tests that image generators may not fully exploit higher-order linguistic information in embeddings, implying text encoders can be simpler than commonly assumed. By ablating/replacing components of text representations and measuring impacts on generation quality and alignment, it pinpoints which textual signals are actually necessary, informing encoder simplification, capacity reallocation, and improved conditioning injection designs.

Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation
- 赛道归属: 文生图评测（基准/指标，面向创作能力评估）
- 核心创新点: 提出Qwen-Image-Bench，将评测目标从传统“文本-图像一致性/基础画质”扩展到更贴近真实创作工作流的“从生成到创作”能力刻画：强调对真实世界重建的可信度与创意表达等更高阶维度，设计能区分模型在专业创作场景中关键能力差异的评测集合与判别框架，从而缓解现有benchmark对艺术实践需求覆盖不足、区分度不够的问题。
- Track: Text-to-Image evaluation (benchmark/metrics, creativity-oriented assessment)
- Core innovation: Introduces Qwen-Image-Bench to move beyond classic text-image alignment and basic visual quality, toward capabilities that matter in real creative workflows—faithful real-world reconstruction and genuine creative expression. It provides an evaluation suite and judging protocol aimed at better discriminating models on higher-level, practice-relevant skills that existing benchmarks under-represent.

DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
- 赛道归属: 图像编辑（基于流模型/扩散式流程的训练免编辑，反演）
- 核心创新点: 提出DirectEdit，实现“步级准确”的反演以支持流式（flow-based）编辑：针对现有训练免编辑常见的反演-前向去噪流程中“时间步不匹配”导致的重建误差累积问题，DirectEdit在反演阶段对齐每一步的潜变量/时间步，使重建路径与编辑路径在对应step上严格一致，从而显著降低误差传播，提升重建保真度与编辑稳定性（尤其在多步编辑或强编辑强度下）。
- Track: Image editing (flow-based / diffusion-style pipeline, training-free editing, inversion)
- Core innovation: Proposes DirectEdit with step-level accurate inversion for flow-based editing. It addresses error accumulation caused by timestep-mismatched noisy latents in common inversion+forward denoising pipelines by aligning latents per step so reconstruction and editing trajectories are consistent at corresponding timesteps, reducing drift and improving reconstruction fidelity and editing robustness, especially for longer or stronger edits.

FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation
- 赛道归属: 文生图（多模态推理增强的图像生成 / Unified MLLM for T2I）
- 核心创新点: 提出细粒度多模态推理框架，将统一式MLLM的“理解-生成”闭环能力用于文生图的自反思与自改写：不再停留在简单的提示词扩写或整体图文一致性打分，而是引入更细粒度的推理与评估信号（如对属性、关系、局部区域/对象级要点的逐项核对），驱动生成过程进行针对性的迭代修正，从而提升复杂指令下的可控性与语义一致性。
- Track: Text-to-Image Generation (multimodal reasoning-enhanced image generation / unified MLLM for T2I)
- Key innovations: Proposes a fine-grained multimodal reasoning framework that leverages a unified MLLM’s closed-loop “understand–generate” capability for self-reflection and self-refinement in T2I. Instead of relying on prompt augmentation or holistic image-text alignment scoring, it introduces finer-grained reasoning/evaluation signals (e.g., attribute-, relation-, and region/object-level checks) to guide targeted iterative corrections during generation, improving controllability and semantic faithfulness for complex prompts.

GitHub

[2026-06-05] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12397

🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...

[2026-06-04] AceDataCloud/Nexior ⭐372

Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.

[2026-06-04] VigoZhao/AI-Visual-Prompt-Cookbook ⭐207

Curated collection of reusable JSON prompt templates & style references for AI image generation. Updated daily.

[2026-06-04] iconben/z-image-studio ⭐115 🆕NEW

A CLI, a web UI, and an MCP server for the Z-Image-Turbo text-to-image generation model (Tongyi-MAI/Z-Image-Turbo base model as well as quantized models...

[2026-06-04] veryyoldman/Genspark-AI ⭐107 🆕NEW

Genspark AI open-source, self-hosted Super Agent. Free alternative to Genspark.ai with multi-agent orchestration, deep research, Sparkpages, AI slides...

HuggingFace Models

ideogram-ai/ideogram-4-fp8 🆕NEW

ideogram-ai/ideogram-4-nf4 🆕NEW

HuggingFace Datasets

[2026-05-29] jasperai/monet
Dataset Card for MONET

MONET (Massive, Open, Non-redundant and Enriched Text-to-image dataset) is a large-scale, curated image-text dat...

视频生成/编辑 / Video Generation/Editing

arXiv

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation
- 赛道归属: 身份保持文本到视频生成（Reference-conditioned T2V / Video Generation）
- 核心创新点: 提出ST-DRC（Spatial-Temporal Decoupled Reference Conditioning）框架，将参考身份条件在空间与时间维度解耦注入视频扩散/生成过程：用空间侧的细粒度特征强化单帧身份细节（如脸部结构、纹理一致性），用时间侧的机制约束跨帧身份稳定与时序一致，从而在“文本语义可控性”和“低层身份保真度”之间实现更好的平衡；框架层面强调晚期/分阶段的条件融合以减少文本驱动对身份特征的干扰并提升长序列稳定性。
- Track: Identity-preserving text-to-video generation (reference-conditioned T2V / video generation)
- Key innovation: Proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that injects identity reference signals separately along spatial and temporal axes in the video generation (diffusion) process: spatial conditioning strengthens per-frame identity details (geometry/texture), while temporal conditioning enforces cross-frame identity stability and temporal coherence. The method emphasizes late/staged conditioning fusion to reduce interference from text semantics and improve long-range identity consistency.

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
- 赛道归属: 视频生成安全评测（Image-conditioned T2V Safety Benchmark / Evaluation）
- 核心创新点: 提出SafeGen-Bench，面向图像条件引导的文本到视频生成系统化评测其安全风险，补齐现有安全基准主要聚焦纯文本模式的缺口；通过覆盖非法/政治敏感/伦理风险等多类场景与触发方式，构建更贴近真实使用链路的测试集与评测协议，用于量化模型在“给定初始图像+文本”条件下的越界生成倾向与防护能力，从而推动安全对齐在I2V/T2V条件生成中的可比、可复现评估。
- Track: Safety benchmarking for image-conditioned text-to-video generation (evaluation/benchmark)
- Key innovation: Introduces SafeGen-Bench to systematically evaluate safety risks specifically in image-conditioned T2V settings, addressing the gap of prior benchmarks that mainly test text-only generation. It broadens risk coverage (illegal/political/ethical categories and triggers) and provides a more realistic evaluation protocol to quantify unsafe generation propensity and safety guard effectiveness under “input image + prompt” conditioning, enabling comparable and reproducible safety assessment.

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation 🆕NEW
- 赛道归属: 文生视频（Text-to-Video）/ 提示词工程与多智能体协同（Multi-agent Prompt Refinement）
- 核心创新点: 提出多智能体提示词精炼框架MAVEN，面向“多文化一致性/文化保真度”这一以往T2V较少系统覆盖的目标进行优化；方法上将文本提示分解为“人物(Person)-动作(Action)-地点(Location)”三维语义槽位，由具备专长的代理分别并行或串行地改写与约束，从而在单一文化与跨文化组合提示中减少文化符号混淆与刻板化偏差；同时构建支持系统评测的多文化/跨文化基准与流程，使文化保真度从主观描述转为可对比的评估闭环。
- Track: Text-to-Video / Prompt Engineering with Multi-Agent Collaboration (Multi-agent Prompt Refinement)
- Core innovations: Introduces MAVEN, a multi-agent prompt refinement framework targeting cultural fidelity, a dimension underexplored in prior T2V work; technically, it decomposes prompts into three semantic slots—Person, Action, and Location—and assigns specialized agents to refine/ground each slot in parallel or sequential modes, reducing cultural symbol confusion and stereotyping in mono-cultural and cross-cultural prompts; additionally, it establishes a systematic multicultural/cross-cultural evaluation setup to make cultural fidelity more measurable and comparable.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
- 赛道归属: 文生视频（Text-to-Video）/ 3D一致性对齐（强化学习）
- 核心创新点: 通过强化学习而非结构改造来注入3D约束：将“几何一致性/世界约束”显式构造成奖励信号，对视频生成模型进行对齐优化，从而在不显著增加推理开销、保持可扩展性的前提下缓解几何不一致问题；同时构建面向“世界模拟”的纯文本数据集，用于更系统地覆盖可被3D约束检验的描述分布，提升对齐训练的有效性与泛化。
- Track: Text-to-Video / 3D-consistency alignment (Reinforcement Learning)
- Core innovation: Injects 3D constraints via RL-based alignment instead of architectural modifications: formulates geometric/world-consistency as explicit rewards to optimize a video generator, improving geometric coherence without adding substantial inference cost and preserving scalability; additionally introduces a world-simulation-oriented text-only dataset to better cover descriptions that are verifiable under 3D constraints, strengthening alignment and generalization.

Knowledge-Intensive Video Generation
- 赛道归属: 知识密集型文本到视频生成评测（Factuality/Helpfulness Evaluation for T2V）
- 核心创新点: 定义“知识密集型视频生成（KIVI）”任务：针对解释、流程、演示类信息检索式短提示，要求生成视频不仅好看还要事实正确且有用；构建KIVI-Bench（1080条提示）并提出面向事实性（factuality）与帮助性（helpfulness）的自动评测指标，且通过人工评测验证指标相关性，从评测体系上把T2V从感知质量扩展到“知识/实用性”维度，为后续引入检索增强、工具使用或知识对齐的T2V方法提供可量化目标。
- Track: Knowledge-intensive text-to-video generation evaluation (factuality/helpfulness)
- Key innovation: Formulates Knowledge-Intensive Video Generation (KIVI), where prompts request explanations/procedures/demonstrations and outputs must be factually correct and practically helpful, not just visually appealing. Releases KIVI-Bench (1,080 prompts) and proposes automatic metrics for factuality and helpfulness, validated via human studies, extending T2V evaluation from perceptual quality to knowledge/utility and enabling measurable targets for retrieval/tool-augmented or knowledge-aligned T2V models.

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
- 赛道归属: 视频生成（文生视频/扩散Transformer加速与部署优化）
- 核心创新点: 提出面向DiT视频生成的系统级效率方案，将“稀疏注意力 + 序列并行 + 低比特量化 + 强化学习”联合设计以在质量不降的前提下降本增效：1) 采用混合全注意力-稀疏注意力架构，用固定模式的 Skiparse-2D 在时空token维度做token级与group级稀疏连接，缓解全注意力二次复杂度；2) 引入稀疏序列并行（Sparse Sequence Parallelism）以更好匹配稀疏计算图，提升多卡吞吐与可扩展性；3) 使用 HiF8（8-bit）量化降低显存与带宽开销，面向推理/训练的硬件友好实现；4) 通过强化学习对生成策略/偏好进行对齐，在引入稀疏与量化后维持或提升感知质量与文本一致性。
- Track: Video generation (text-to-video / Diffusion-Transformer acceleration & deployment optimization)
- Core innovations: A system-level efficiency recipe for DiT-based video generation that jointly combines “sparse attention + sequence parallelism + low-bit quantization + RL” to reduce cost without sacrificing quality: 1) a hybrid full–sparse attention design using fixed-pattern Skiparse-2D to apply token-wise and group-wise sparsity over spatiotemporal tokens, mitigating quadratic attention cost; 2) Sparse Sequence Parallelism to better align distributed execution with sparse computation graphs for higher multi-GPU throughput and scalability; 3) HiF8 (8-bit) quantization to cut memory/bandwidth with hardware-friendly training/inference; 4) reinforcement learning-based alignment to preserve/improve perceptual quality and prompt faithfulness under sparsity/quantization constraints.

Paris 2.0: A Decentralized Diffusion Model for Video Generation
- 赛道归属: 视频生成（去中心化训练 / 分布式扩散模型）
- 核心创新点: 提出首个通过去中心化计算预训练的视频扩散生成模型，将原本在图像上验证的去中心化扩散训练范式扩展到需要强时序一致性的文本生成视频任务；核心突破在于给出去中心化场景下实现时序连贯训练的配方与机制，使得无需单体GPU集群也能完成低分辨率T2V预训练，并在去中心化通信与优化约束下维持跨帧一致性与可训练性。
- Track: Video generation (decentralized training / distributed diffusion)
- Key innovation: Introduces the first video diffusion generator pre-trained via decentralized computation, extending decentralized diffusion training from images to temporally coherent text-to-video. The main methodological advance is a training recipe/mechanism that preserves temporal coherence under decentralized optimization and communication constraints, enabling low-res T2V pretraining without a monolithic GPU cluster.

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment
- 赛道归属: 图生视频生成（I2V）/ 强化学习式后训练（RLHF/RLAIF for generative models）
- 核心创新点: 提出TAGRPO用于I2V的稳健后训练，指出GRPO在I2V上“奖励不稳定/不持续提升”的关键症结在于视频生成的多步轨迹与奖励信号之间存在错位；方法上引入“直接轨迹对齐”(Direct Trajectory Alignment)的对比学习式目标，将高奖励样本的去噪/流匹配轨迹作为正样本对齐参照、低奖励轨迹作为负样本拉开，从而在不改变基础生成架构的情况下，更稳定地把奖励偏好注入到整段生成轨迹而非仅末端结果，提升可控性与一致性。
- Track: Image-to-Video generation (I2V) / RL-style post-training (RLHF/RLAIF for generative models)
- Core innovation: Proposes TAGRPO as a robust post-training framework for I2V, diagnosing that naïvely applying GRPO yields inconsistent reward gains due to misalignment between multi-step generation trajectories and reward signals. It introduces Direct Trajectory Alignment with a contrastive-learning-like objective: align denoising/flow-matching trajectories from high-reward samples as positives and push away low-reward trajectories as negatives, injecting preference into the whole trajectory (not just final frames) without changing the base architecture, improving stability and controllability.
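
As background for the GRPO baseline that TAGRPO modifies, the group-relative advantage at the heart of GRPO can be sketched as follows. This is the generic normalization step only; it does not implement TAGRPO's Direct Trajectory Alignment objective, whose details are not given above. The function name, the use of the population standard deviation, and `eps` are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three videos generated from the same image.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples with above-average reward receive positive advantages and are reinforced. When all rewards in a group are equal, every advantage is zero and the group contributes no gradient.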

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- 赛道归属: 文生图/文生视频/图生视频（基础大模型体系与工程化）
- 核心创新点: 给出Kandinsky 5.0成体系的图像与视频基础模型家族，通过“分层产品线”覆盖不同算力与质量需求：6B级高分辨率图像模型（Image Lite）、2B级轻量快速的T2V/I2V（Video Lite）、19B级高质量视频模型（Video Pro）。技术价值在于将图像与10秒视频生成统一到可扩展的基础模型栈中，并通过不同规模与配置实现质量-速度-成本的可部署权衡，为实际应用提供从轻量到旗舰的可迁移方案与训练/推理配方。
- Track: Text-to-Image / Text-to-Video / Image-to-Video (foundation model family & systemization)
- Core innovation: Presents Kandinsky 5.0 as a structured family of foundation models spanning high-res image and 10-second video synthesis, organized into tiered lineups to cover different compute/quality regimes: 6B Image Lite, 2B fast/light Video Lite for T2V/I2V, and 19B Video Pro for top quality. The key contribution is a scalable, unified model stack with practical quality–latency–cost trade-offs and deployable recipes across sizes, enabling transfer across product tiers.

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation? 🆕NEW
- 赛道归属: 视频生成 × 机器人（具身智能）/ 物理一致性评测与可执行性（Executable Manipulation from Generated Video）
- 核心创新点: 提出“从生成视频到可执行机器人操作”的评测范式Dream.exe，把视频生成模型是否学到物理规律的问题转化为可量化的具身任务：将模型生成的操作过程视频作为中间表示，进一步映射/提取为机器人可执行的动作序列并在真实或仿真环境中验证执行效果；该思路以“能否落地执行”作为强约束信号，绕开仅凭视觉逼真度评估的局限，从而系统检验生成模型对接触动力学、时序因果与可操作性（affordance）的隐式建模能力，并为后续将生成模型用于机器人策略生成/数据合成提供可复现的测试框架。
- Track: Video Generation × Robotics (Embodied AI) / Physical-Consistency Evaluation via Executable Manipulation
- Core innovations: Proposes Dream.exe, an evaluation paradigm that turns the question “do video generators internalize physics?” into a measurable embodied task: use a model-generated manipulation video as an intermediate representation, convert/parse it into robot-executable action sequences, and validate by execution in real or simulated environments; by enforcing executability as a hard constraint, it goes beyond visual realism metrics to systematically probe implicit modeling of contact dynamics, temporal causality, and affordances, and provides a reproducible framework toward using generative video models for robot policy generation or data synthesis.

GitHub

[2026-06-05] hao-ai-lab/FastVideo ⭐3678

A unified inference and post-training framework for accelerated video generation.

[2026-06-04] ModelTC/LightX2V ⭐2338

Light Image Video Generation Inference Framework

[2026-06-04] ZeroLu/awesome-seedance ⭐1875

The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...

[2026-06-04] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1298

🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...

[2026-06-04] thu-ml/Causal-Forcing ⭐755 🆕NEW

[ICML 2026] Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Gener...

HuggingFace Models

meituan-longcat/LongCat-Video-Avatar-1.5

ByteDance/Bernini-R 🆕NEW

音频生成 / Audio Generation

arXiv

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
- 赛道归属: 语音生成 / TTS 数据集与数据构建（低资源语言、多说话人）
- 核心创新点: 提出面向多说话人TTS训练的超大规模波斯语开源语音-文本语料库ParsVoice，并给出可扩展的数据构建流水线：从长篇有声书录音中自动切分与对齐高质量语音-文本对，核心在于结合面向波斯语的句级语义/完整性建模（如微调的ParsBERT用于句子补全/筛选）与质量控制策略，以在低资源语言场景下系统性提升对齐准确性、覆盖度与可用性，从而降低多说话人TTS与语音语言建模的数据门槛。
- Track: Audio Generation / TTS dataset & data pipeline (low-resource, multi-speaker)
- Core innovation: Introduces ParsVoice, the largest publicly available Persian speech–text corpus designed for multi-speaker TTS, together with a scalable pipeline to derive high-quality paired data from long-form audiobooks. The key methodological contribution is an automated segmentation/alignment and quality-control workflow that leverages Persian-specific sentence-level modeling (e.g., a fine-tuned ParsBERT for sentence completion/filtering) to improve alignment reliability, coverage, and usability in low-resource settings.

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
- 赛道归属: 文本到语音（TTS）/ 场景化语音生成（语音+环境声融合）
- 核心创新点: 提出环境感知TTS框架，通过多模态扩散Transformer显式建模语音与环境上下文（如场景/视觉/环境音提示）之间的跨模态交互，解决语音与环境声在声学形态与时间动态上的分布差异；并引入面向领域的表征对齐机制，将“语音生成表征”与“环境/场景表征”在统一空间中对齐，从而实现语音与环境声的自然共存与无缝融合（而非后期拼接）。
- Track: Text-to-Speech (TTS) / Scene-aware speech generation (speech + ambient sound integration)
- Core innovations: Proposes an environment-aware TTS framework that uses a multimodal Diffusion Transformer to explicitly model cross-modal interactions between speech and environmental context (e.g., scene/visual/ambient cues), addressing the distribution and temporal-dynamics mismatch between speech and environmental audio; introduces domain-specific representation alignment to map speech-generation features and environment/scene features into a shared space, enabling coherent in-scene speech generation rather than post-hoc mixing.

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
- 赛道归属: 统一音频生成与编辑（Text-to-Audio/TTS/音频编辑一体化，多任务扩散）
- 核心创新点: 用单一潜空间扩散模型统一覆盖文本到音频、文本到语音、零样本音色克隆、语音+音效混合生成、场景级音频编辑与时间编排等任务，实现“同权重多能力”；关键方法是层级式深度LLM融合（将LLM多层隐状态注入扩散网络以增强语义与结构控制）以及面向多任务的统一条件接口/训练范式，使生成与编辑在同一潜空间与同一推理管线内闭环完成，减少任务间割裂与模型堆叠。
- Track: Unified audio generation & editing (Text-to-Audio/TTS/audio editing; multi-task diffusion)
- Core innovations: Introduces a single latent diffusion model that unifies text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, and temporal composition under one set of weights; key is layer-wise deep LLM fusion—injecting multi-layer LLM hidden states into the diffusion network for stronger semantic/structural control—plus a unified conditioning/training scheme so generation and editing operate in the same latent space and inference pipeline, avoiding fragmented task-specific stacks.

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
- 赛道归属: 视频到音频生成（Video-to-Audio）、多模态统一音频生成（Unified Audio Generation）
- 核心创新点: 提出统一的多模态音频生成框架，将传统“单任务级”的语音/音效/音乐生成扩展为“整段视频完整配乐（soundtrack）”的一体化联合生成：在同一模型中对语音、拟音（foley）、环境声与音乐等多音频组件进行协同建模与联合采样，使各组件在时间轴上对齐、在语义与风格上保持一致，从而面向真实视频制作流程实现端到端的完整声轨生成（而非彼此独立的分段合成）。
- Track: Video-to-Audio generation, Unified multimodal audio generation
- Key innovation: Proposes a unified multimodal audio generation framework that moves beyond isolated task-level synthesis (speech/SFX/music) to end-to-end full video soundtrack generation. The model jointly models and co-generates multiple audio components (speech, foley, ambience, and music) within a single system, enforcing temporal alignment and semantic/style consistency across components to produce a coherent, production-ready soundtrack rather than separately generated audio segments.

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
- 赛道归属: 语音生成｜文本到语音（TTS）｜可解释情感控制（表示解析/可控生成）
- 核心创新点: 利用稀疏自编码器（SAE）对LLM-TTS的语义隐状态进行分解与稀疏表征学习，从模型内部表示中自动“挖掘/定位”与情感变化相关的稀疏特征（而非依赖外部情感条件或整体激活粗粒度操控）。该思路将情感控制从黑盒条件注入转为可解释的内部特征级干预：通过识别情感相关的稀疏方向/单元，实现更可诊断、可编辑的情感调节，并为理解情感在TTS隐空间中的编码方式提供机制化证据。
- Track: Speech Generation | Text-to-Speech (TTS) | Interpretable emotion control (representation analysis / controllable generation)
- Core innovation: Applies sparse autoencoders (SAEs) to decompose and sparsify semantic hidden states in LLM-based TTS, automatically isolating emotion-related sparse features from internal representations rather than relying on external emotion conditioning or coarse global activation steering. This reframes emotion control as interpretable, feature-level intervention: by identifying emotion-linked sparse directions/units, the method enables more diagnosable and editable emotion modulation and provides mechanistic insight into how emotion is encoded in the TTS latent space.
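
The kind of sparse autoencoder referred to above can be sketched in a few lines. This is a generic SAE (ReLU encoder, linear decoder, L1 sparsity penalty) with a simple feature-scaling intervention, not the paper's architecture or training setup. All weights, dimensions, and the `steer_feature` helper are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_encode(x, w_enc, b_enc, b_dec):
    """Map a hidden state to non-negative sparse feature activations."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [a + b for a, b in zip(matvec(w_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]  # ReLU


def sae_decode(features, w_dec, b_dec):
    """Reconstruct the hidden state from feature activations."""
    return [a + b for a, b in zip(matvec(w_dec, features), b_dec)]


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    return mse + l1_coeff * sum(abs(f) for f in features)


def steer_feature(features, index, scale):
    """Scale one feature (e.g. one found to track an emotion) before decoding."""
    return [f * scale if i == index else f for i, f in enumerate(features)]


# Toy 2-d hidden state and 3 features; weights are made up for illustration.
x = [1.0, -2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
zeros2, zeros3 = [0.0, 0.0], [0.0, 0.0, 0.0]

features = sae_encode(x, w_enc, zeros3, zeros2)
recon = sae_decode(features, w_dec, zeros2)
steered = sae_decode(steer_feature(features, 0, 2.0), w_dec, zeros2)
```

Editing a single feature and decoding back into the hidden state is what makes the intervention inspectable: the change is confined to one named direction rather than spread across the whole activation.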

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech
- 赛道归属: 语音生成｜文本到语音（TTS）｜扩散/Flow-Matching 可控生成｜即插即用情感控制
- 核心创新点: 发现预训练扩散与flow-matching TTS的冻结隐状态中，情感与说话人身份分别对应近似线性可解码且近乎正交的方向，从而提出DUET的“双空间”统一控制：在不重训主体模型的前提下，以plug-and-play方式在生成过程中沿情感方向进行可控操纵，同时尽量不扰动说话人方向以降低身份泄漏/纠缠。该方法论突破在于把“情感-身份解耦”具体化为可操作的几何结构（线性方向+近正交），并将其转化为跨扩散与flow-matching范式通用的推理期控制接口。
- Track: Speech Generation | Text-to-Speech (TTS) | Diffusion/Flow-Matching controllable generation | Plug-and-play emotion control
- Core innovation: Shows that in pretrained diffusion and flow-matching TTS, emotion and speaker identity correspond to (approximately) linearly decodable and nearly orthogonal directions in frozen hidden states. Based on this geometry, DUET introduces unified “dual-space” control: a plug-and-play inference-time manipulation that steers generation along the emotion direction while minimally perturbing the speaker direction to reduce identity–emotion entanglement, without retraining the backbone. The key methodological advance is operationalizing emotion–identity disentanglement as actionable latent geometry (linear directions + near-orthogonality) and turning it into a model-agnostic control interface across both diffusion and flow-matching TTS.
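
The geometric idea described above, moving along an emotion direction while leaving the speaker direction untouched, can be illustrated with a small projection sketch. It assumes a single hidden vector and two already-estimated directions. The function names, vectors, and `strength` parameter are hypothetical, and this is not DUET's actual procedure.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer_emotion(hidden, emotion_dir, speaker_dir, strength):
    """Shift `hidden` along the emotion direction without changing its
    component along the speaker direction."""
    # Remove from the emotion direction any part that lies along the speaker
    # direction; if the two are already orthogonal this is a no-op.
    coef = dot(emotion_dir, speaker_dir) / dot(speaker_dir, speaker_dir)
    emotion_perp = [e - coef * s for e, s in zip(emotion_dir, speaker_dir)]
    return [h + strength * e for h, e in zip(hidden, emotion_perp)]


# Toy 3-d example with made-up directions.
hidden = [1.0, 2.0, 3.0]
emotion_dir = [1.0, 1.0, 0.0]
speaker_dir = [0.0, 1.0, 0.0]
steered = steer_emotion(hidden, emotion_dir, speaker_dir, strength=2.0)
```

In this toy case the projection of the hidden state onto `speaker_dir` is the same before and after steering, which is the property the near-orthogonality finding makes cheap to enforce.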

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
- 赛道归属: 长文本零样本TTS / 对话式语音合成（多说话人、情感与一致性建模）
- 核心创新点: 面向长篇独白与多轮对话的零样本语音合成，针对“逐轮合成再拼接”导致的音色一致性、韵律连贯性与情绪连续性断裂问题，提出在单模型内联合建模跨轮次的对话上下文与表达状态（如情感/语气/节奏的持续变量），在生成时维持跨turn的声学一致与对话连贯；强调长程依赖与多说话人切换下的表达可控与稳定性，而非仅提升单句质量。
- Track: Long-form zero-shot TTS / Dialogue speech synthesis (multi-speaker, expressive consistency)
- Core innovations: Targets long-form monologue and multi-turn dialogue in zero-shot TTS, addressing the common “synthesize-per-turn then stitch” workaround that breaks timbre, prosody, and affect continuity; proposes single-model joint modeling of cross-turn dialogue context and persistent expressive states (e.g., emotion/intonation/rhythm as continuous trajectories), maintaining acoustic consistency and conversational coherence across turns while supporting multi-speaker switching and expressive control over long horizons.

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
- 赛道归属: 流式空间音频生成（视频/文本条件的Spatial Audio，低延迟生成）
- 核心创新点: 提出面向实时的流式空间音频生成统一框架，使用自回归扩散Transformer在“可流式输出”的约束下实现高保真生成，并强化与全景视频/文本提示的时序同步与空间一致性；核心突破在于把扩散生成改造为可在线推进的自回归/分段式推理范式，在降低推理延迟的同时保持空间线索（方位、距离、运动）建模精度，缓解“质量-延迟”权衡与多模态空间对齐困难。
- Track: Streaming spatial audio generation (video/text-conditioned spatial audio; low-latency)
- Core innovations: Proposes a unified streaming framework for real-time spatial audio generation conditioned on panoramic video and text, built on an autoregressive Diffusion Transformer to enable incremental (online) synthesis; key contribution is adapting diffusion-style generation to a streaming-compatible autoregressive/segmented inference scheme that preserves high fidelity while improving latency, and strengthening temporal synchronization and spatial consistency (direction/distance/motion cues) from multimodal inputs, mitigating the quality–latency tradeoff and multimodal spatial alignment challenges.

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
- 赛道归属: 流式零样本TTS / 推理加速（Block Diffusion并行解码）
- 核心创新点: 将预训练自回归TTS解码器微调为块扩散（block-diffusion）解码器，实现“块内并行、块间流式”的低延迟生成；针对离散语音token长尾分布导致的并行位置选择偏置（高频token主导、质量下降）问题，提出先验校准（prior-calibration）机制，在不大改架构的前提下校正并行采样的token先验/选择策略，从而兼顾并行带来的速度与接近自回归的自然度与稳定性。
- Track: Streaming zero-shot TTS / Inference acceleration (block-diffusion parallel decoding)
- Core innovations: Fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while keeping block-by-block streaming for low latency; identifies a discrete-speech-token long-tail issue where naive block diffusion biases parallel positions toward a few high-frequency tokens and degrades quality, and introduces prior calibration to correct the sampling prior/position-selection behavior without major architectural changes, preserving naturalness and stability while gaining speed.

Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models
- 赛道归属: 文本到语音生成（TTS）/ 语音风格可控生成（Prompt-based Style Control）
- 核心创新点: 在现有“基于提示词的TTS”框架上，针对两类关键瓶颈提出方法级增强：①实现跨语句（inter-utterance）的细粒度风格属性连续可控与插值，使风格强度/属性可在不同句子间平滑调节而非离散切换；②实现单句内部（within-utterance）的时变风格控制，通过引入随时间变化的风格条件/调度机制，让模型不再只能施加全局单一风格，而能在同一句话中完成风格过渡与局部风格片段控制，从而扩展到需要“句内风格转场”的实际应用场景。
- Track: Text-to-Speech (TTS) / Controllable Speech Style Generation (Prompt-based Style Control)
- Core innovations: Proposes method-level extensions to existing prompt-based TTS to overcome two limitations: (1) enables fine-grained, continuous control and interpolation of style attributes across utterances (inter-utterance), allowing smooth adjustment of style intensity/attributes rather than coarse, discrete changes; (2) enables time-varying, within-utterance style control by introducing temporally scheduled/dynamic style conditioning, replacing a single global style per utterance with intra-utterance style transitions and localized style segment control—supporting practical scenarios requiring style changes inside one sentence.

GitHub

[2026-06-05] huggingface/diffusers ⭐33772

🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.

[2026-06-02] SamurAIGPT/Generative-Media-Skills ⭐3391

Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....

[2026-06-03] BinWang28/audio-ai-hub ⭐929

The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.

[2026-06-04] apocas/restai ⭐509

RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...

[2026-06-04] dgrauet/ltx-2-mlx ⭐52

Pure MLX port of LTX-2 (Lightricks LTX-2.3) for Apple Silicon — video + audio generation

语言大模型 / Large Language Models

arXiv

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
- 赛道归属: 推理优化（可控推理/测试时推理控制）
- 核心创新点: 将“推理过程如何展开”的控制显式化为一个马尔可夫决策过程（MDP）：引入控制器智能体在推理时按状态自适应决策（如继续思考、切换策略、停止等），以最小化无效token消耗并在不显著牺牲准确率的前提下实现可控的推理长度与推理轨迹；相较仅做截断/早停/压缩的效率方法，ACTS把“思考策略”作为可学习/可调度的动作空间，从而提供更细粒度的推理时控制与效率-性能权衡。
- Track: Reasoning optimization (controllable inference / test-time reasoning control)
- Key innovation: Makes “how the model reasons” an explicit control problem by formulating chain-of-thought steering as an MDP: a controller agent adaptively selects actions at inference (e.g., continue, change strategy, stop) based on the current reasoning state, reducing wasted tokens while maintaining accuracy and enabling controllable reasoning length/trajectory. Unlike prior efficiency methods that mainly shorten/early-stop/compress traces, ACTS treats reasoning strategy as an explicit, schedulable action space for finer-grained control over the efficiency–accuracy trade-off.

An Asymptotic Theory of Chain-of-Thought in In-Context Learning
- 赛道归属: 理论分析（In-Context Learning / Chain-of-Thought 机理与尺度律）
- 核心创新点: 在一个可解析的理论模型中刻画CoT深度与泛化性能的尺度行为：将测试时CoT推理形式化为对线性回归中“权重参数估计”的迭代精炼过程（iterative refinement），从而推导随推理步数增加时误差/泛化的渐近规律与收益递减条件；该框架把“CoT=迭代算法”的观点落到可证明的渐近理论上，为理解何时加深CoT有效、何时无效提供了可计算的判据。
- Track: Theoretical analysis (in-context learning / chain-of-thought mechanism & scaling laws)
- Key innovation: Develops an analytically solvable model to characterize how generalization scales with CoT depth: models test-time CoT as iterative refinement of the weight-parameter estimate in linear regression (in-context weight prediction), enabling asymptotic derivations of error/generalization behavior as the number of reasoning steps grows and identifying regimes of diminishing returns. This provides provable, computable criteria for when deeper CoT helps versus when it does not.
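
The "CoT as iterative refinement" view can be illustrated with a toy sketch. It assumes, purely for illustration, that each reasoning step corresponds to one gradient-descent update of the weight estimate on the in-context least-squares loss; the paper's exact construction may differ, and `refine_weights`, `mse`, `XS`, and `YS` are hypothetical names.

```python
def mse(w, xs, ys):
    """Mean squared error of the linear predictor w on (xs, ys)."""
    n = len(xs)
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in zip(xs, ys)) / n


def refine_weights(xs, ys, steps, lr=0.5):
    """Start from w = 0 and apply `steps` gradient-descent updates.

    Assumption: one update stands in for one CoT step.
    Returns the final weights and the in-context error after each step.
    """
    n, dim = len(xs), len(xs[0])
    w = [0.0] * dim
    errors = []
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(xs, ys):
            resid = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += resid * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
        errors.append(mse(w, xs, ys))
    return w, errors


# Toy in-context examples generated by y = 2*x1 - 1*x2 (hypothetical data).
XS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
YS = [2.0, -1.0, 1.0, 3.0]
```

Calling `refine_weights(XS, YS, steps=20)` returns an error list that shrinks with depth; in this noiseless toy case each extra step removes a smaller absolute amount of error, which is the diminishing-returns pattern the summary refers to.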

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning
- 赛道归属: 多模态推理（MLLM Chain-of-Thought 对齐/微调优化）
- 核心创新点: 通过系统性实证分析指出多模态 CoT 在视觉推理中常“越想越错”，并归因于两类稳定失败模式：过早锁定答案（premature answer commitment）与对直接视觉证据利用不足（limited direct visual evidence usage）。在此基础上提出“注意力引导的微调”思路：利用/约束模型注意力分配，使推理步骤更聚焦于与当前推理相关的视觉区域与证据链，从训练层面纠正 CoT 生成时的证据对齐与决策时机问题，从而提升多模态逐步推理的可靠性与可解释性。
- Track: Multimodal reasoning (MLLM Chain-of-Thought alignment / fine-tuning optimization)
- Key innovation: Provides a systematic study showing that CoT prompting can hurt visual reasoning in MLLMs, and identifies two recurring failure modes: premature answer commitment and insufficient use of direct visual evidence. Building on these findings, it proposes an attention-guided fine-tuning strategy that steers/regularizes attention to align each reasoning step with the relevant visual regions and evidence, correcting evidence grounding and decision timing during CoT generation to improve step-wise multimodal reasoning robustness.

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
- 赛道归属: 公平性可控解码 / 推理阶段偏见抑制（LLM Decoding for Fairness in CoT）
- 核心创新点: 提出一种无需训练、仅在解码阶段生效的公平性控制方法 COFT，用于抑制链式思维（CoT）生成中的社会偏见放大。方法上以反事实提示构造 + 共形预测（Conformal）约束为核心：先将提示中的敏感片段替换为中性占位符形成“掩码反事实”输入，以获得相对去偏的参考分布；再在token 级别对原始解码分布施加公平性约束，并通过分布无关（distribution-free）的边际有效性保证（在 exchangeability 假设下）为公平控制提供可验证的统计保证，从而实现对任意冻结的因果语言模型在推理时的可控去偏解码。
- Track: Fairness-controlled decoding / Inference-time bias mitigation for CoT (LLM Decoding for Fairness in CoT)
- Key innovation: Introduces COFT, a training-free, decoding-time method to curb bias amplification in chain-of-thought generation. The technical core combines counterfactual prompt masking with conformal (distribution-free) constraints: it first replaces sensitive spans with neutral tokens to form a masked counterfactual prompt, yielding a debiased reference distribution; then it enforces token-level fairness control on the original decoding distribution, providing distribution-free marginal validity guarantees (under exchangeability) for any frozen causal LM—enabling verifiable, model-agnostic fairness control at inference time.
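
The distribution-free guarantee mentioned above rests on a conformal quantile. The sketch below shows the generic split-conformal threshold, not COFT's specific procedure: it assumes a calibration set of nonconformity scores, and using the total-variation gap between the original and masked-counterfactual next-token distributions as that score is a hypothetical choice made here for illustration.

```python
import math


def total_variation(p, q):
    """Total-variation distance between two distributions over the same vocabulary."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold at miscoverage level alpha.

    Under exchangeability, a fresh score exceeds the returned value
    with probability at most alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite threshold.
        return float("inf")
    return sorted(cal_scores)[k - 1]
```

A decoder could then treat token positions whose score exceeds the threshold as candidates for correction; how COFT actually applies the constraint is not specified in this summary.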

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models
- 赛道归属: 推理优化（隐式CoT/潜空间推理、推理token化）
- 核心创新点: 提出CIRF，将传统“链式思维”从自然语言解释转为可复用的离散功能token序列来执行隐式推理：把推理过程模块化为功能单元并在推理时动态编排，以适配不同样例复杂度；同时强调与显式CoT的对齐，使隐式推理在降低推理开销的同时尽量保持可解释推理轨迹的一致性与可控性。
- Track: Reasoning optimization (implicit CoT / latent reasoning, tokenized reasoning)
- Core innovations: CIRF converts natural-language chain-of-thought into a sequence of reusable discrete functional tokens for implicit reasoning. It dynamically composes these functional units at inference time to match instance complexity, aiming to reduce inference cost while improving alignment with explicit CoT so latent reasoning remains consistent and controllable.

MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning
- 赛道归属: 多模态理解（语音/音频大模型适配与低资源学习、In-Context Learning）
- 核心创新点: 提出一种面向听觉LLM的元学习式语音上下文学习框架（Meta Speech In-Context Learning），将“推理时用少量示例做ICL适配”作为核心适配机制，用元学习在训练阶段显式优化模型对示例集合的利用方式，从而在标注稀缺或训练-测试分布不匹配时，相比直接微调更稳健地实现快速域内适配与性能提升；强调训练免/轻训练的推理期自适应，降低低资源任务的适配成本并缓解微调脆弱性。
- Track: Multimodal Understanding (speech/audio LLM adaptation for low-resource settings, In-Context Learning)
- Core innovation: Proposes a meta-learning-based speech in-context learning framework (Meta Speech In-Context Learning) for auditory LLMs, treating inference-time adaptation via a few in-domain demonstrations as the primary adaptation mechanism. By meta-optimizing how the model leverages demonstration sets during training, it enables more robust and rapid in-domain adaptation under scarce labels or train–test distribution mismatch, mitigating the brittleness of direct fine-tuning while keeping adaptation largely training-free/lightweight at inference time.

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention 🆕NEW
- 赛道归属: 语音大模型推理诊断与对齐（Speech LLM Reasoning / 语音-文本推理鲁棒性）
- 核心创新点: 提出并验证“实体绑定失败（entity binding failure）”是语音LLM在复杂推理中相对文本LLM性能塌陷的关键、且高度局部化的原因：通过对多种任务分解评测，发现S2T在空间/句法/事实类任务不弱于T2T，但在需要持续实体跟踪的逻辑推理任务上准确率降至随机水平；进一步将退化机制归因于连续语音表征导致的实体-属性/关系绑定不稳，从而把“模态差距”从笼统能力不足细化为可诊断的绑定机制问题，并提出基于Chain-of-Thought的干预思路以强化实体跟踪与绑定过程。
- Track: Speech LLM reasoning diagnosis & alignment (speech-text reasoning robustness)
- Core innovation: Identifies and empirically validates a localized failure mode—entity binding failure—as the main driver of the reasoning gap between speech LLMs and text LLMs: via task-factorized evaluation, shows S2T matches/exceeds T2T on spatial/syntactic/factual tasks, but collapses to chance on logical tasks requiring persistent entity tracking; attributes the degradation to instability in binding entities to attributes/relations induced by continuous speech representations, reframing the “modality gap” into a concrete, diagnosable binding-mechanism issue and proposing Chain-of-Thought-based interventions to reinforce entity tracking/binding.

LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models
- 赛道归属: 跨语言主题建模（NLP 表征学习/主题模型 + LLM增强）
- 核心创新点: 提出LLM-XTM，将LLM用于“主题层面”的跨语言对齐与可解释性提升，同时通过自一致性不确定性估计抑制幻觉并降低对不可获得的token概率（白盒接口）的依赖：用LLM引导的主题精炼（topic refinement）替代昂贵且易漂移的文档级改写/标注式增强，并以不确定性驱动的自一致性机制筛选/聚合LLM建议，使跨语言主题更连贯、对齐更稳健，且在资源稀缺的双语条件下仍可工作。
- Track: Cross-lingual topic modeling (topic models + LLM augmentation)
- Key innovation: Proposes LLM-XTM, using LLMs for topic-level cross-lingual alignment and interpretability while mitigating hallucinations via self-consistency–based uncertainty estimation and avoiding reliance on inaccessible token-probability (white-box) APIs. It replaces costly, drift-prone document-level LLM refinements with LLM-guided topic refinement, and uses uncertainty-driven selection/aggregation of LLM suggestions to yield more coherent and better-aligned multilingual topics under sparse bilingual resources.
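
Self-consistency-based uncertainty needs only sampled outputs, not token probabilities. A minimal sketch, assuming the LLM is queried several times for the same topic refinement and the agreement rate is used as a confidence score; `self_consistent_choice` and the `min_agreement` cutoff are hypothetical, not taken from the paper.

```python
from collections import Counter


def self_consistent_choice(samples, min_agreement=0.6):
    """Return (majority answer or None, agreement rate) over sampled LLM outputs.

    The answer is kept only when the share of samples agreeing with the
    majority reaches `min_agreement`; otherwise it is rejected (None).
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    agreement = votes / len(samples)
    return (answer if agreement >= min_agreement else None), agreement
```

For example, three samples with two matching answers give an agreement of 2/3 and keep the suggestion, while three different answers give 1/3 and discard it.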

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
- 赛道归属: 推理优化（CoT 长度自适应控制 / 高效推理）
- 核心创新点: 提出 SmartThinker 的“渐进式 CoT 长度校准”框架，针对长推理模型在不同难度问题上普遍存在的冗余与过度思考，突破点在于将“长度控制”从静态奖励（对所有样本一刀切）升级为随题目难度动态调整的策略。方法上通过逐步（progressive）校准推理链长度，使模型在简单问题上自动收敛到更短、更经济的推理，在困难问题上保留必要的长推理，从而在尽量不损失准确率的前提下显著降低输出冗余与推理成本，并弥补现有 GRPO 静态长度奖励无法自适应难度的缺陷。
- Track: Reasoning optimization (adaptive CoT length control / efficient inference)
- Key innovation: Introduces SmartThinker, a progressive CoT length calibration framework to reduce redundancy and overthinking in long-reasoning models. The key methodological advance is replacing static, one-size-fits-all length rewards (common in GRPO-based approaches) with a difficulty-adaptive mechanism that progressively calibrates reasoning length: it encourages short, cost-efficient reasoning on easy problems while preserving longer chains when needed for hard ones, improving efficiency with minimal accuracy degradation and addressing the non-adaptivity of static length reward designs.
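
The contrast between a static and a difficulty-adaptive length reward can be sketched as follows. This is not SmartThinker's formula: it assumes difficulty is estimated from the pass rate within a group of rollouts for the same prompt, and that the length penalty on correct answers is scaled by that pass rate. All names and the `penalty_scale` default are hypothetical.

```python
def length_aware_rewards(correct, lengths, max_len, penalty_scale=0.5):
    """Rewards for one group of rollouts on the same prompt.

    correct: list of bools, one per rollout.
    lengths: token counts of the rollouts.
    Assumption: a high group pass rate means an easy prompt, so long
    correct answers are penalised more; hard prompts keep long chains cheap.
    """
    pass_rate = sum(correct) / len(correct)
    rewards = []
    for is_correct, n_tokens in zip(correct, lengths):
        if not is_correct:
            rewards.append(0.0)
            continue
        frac = min(n_tokens, max_len) / max_len
        rewards.append(1.0 - penalty_scale * pass_rate * frac)
    return rewards
```

With this shape, a 400-token correct answer scores 0.5 on a prompt every rollout solves but 0.875 on a prompt only one in four rollouts solves, so the pressure to shorten falls mainly on easy problems.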

Visual Instruction Tuning Aligns Modalities through Abstraction 🆕NEW
- 赛道归属: 多模态理解与视觉指令微调（Vision-Language Instruction Tuning / 跨模态对齐机制）
- 核心创新点: 从“层级抽象”视角系统揭示视觉指令微调如何实现跨模态对齐：通过跨多种视觉-语言架构的层间分析，发现指令微调的主要作用并非让视觉信息逐层经过LLM早期的单模态处理层，而是作为“桥接器”将视觉特征直接注入LLM的中间语义层，在抽象层面完成对齐并绕过早期层；该结论为设计更高效的视觉接入方式（如选择性注入层、减少无效早期融合）提供了机制性依据，而不仅是经验性配方。
- Track: Multimodal understanding & visual instruction tuning (vision-language alignment mechanisms)
- Core innovation: Provides a layer-wise abstraction account of how visual instruction tuning aligns modalities: across diverse VLM architectures, shows instruction tuning mainly acts as a bridge that embeds visual features directly into intermediate semantic layers of the LLM backbone, largely bypassing early unimodal layers; this mechanistic finding supports more principled designs for visual integration (e.g., selective layer injection and avoiding inefficient early fusion) beyond recipe-style tuning.

GitHub

[2026-06-05] sgl-project/sglang ⭐28875

SGLang is a high-performance serving framework for large language models and multimodal models.

[2026-06-05] NVIDIA/TensorRT-LLM ⭐13805

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...

[2026-06-05] google-ai-edge/LiteRT-LM ⭐5391

LiteRT-LM is Google's production-ready, high-performance, open-source inference framework for deploying Large Language Models on edge devices.

[2026-06-05] flagos-ai/FlagGems ⭐1014

FlagGems is an operator library for large language models implemented in the Triton Language.

[2026-06-04] s-kostyaev/ellama ⭐943

Ellama is a tool for interacting with large language models from Emacs.

HuggingFace Models

nvidia/Cosmos3-Nano

nvidia/Cosmos3-Super

HuggingFace Datasets

[2026-05-28] openbmb/UltraData-SFT-2605
```
UltraData-SFT-2605
```

    📚 Introduction

Ult...

[2026-05-28] openbmb/Ultra-FineWeb-L3
```
Ultra-FineWeb-L3
```

...

[2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
```
Background
```

Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...

[2026-06-03] OpenClaw/clawhub-security-signals 🆕NEW
```
ClawHub Security Signals
```

ClawHub Security Signals is a saniti...

多模态大模型 / Multimodal Models

arXiv

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models
- 赛道归属: 多模态理解（MLLM可解释性/表征分析与诊断）
- 核心创新点: 提出一套面向MLLM内部表征的系统化“显微镜”分析框架，沿Transformer层级同时刻画多模态token嵌入的线性度、内在维度与各向异性，并区分主干流与残差流进行对照诊断；在ScienceQA上对LLaVA-NeXT与OmniFusion做跨模型、跨模态的层间结构测量，揭示多模态token在不同流与不同层中呈现高度线性等隐藏结构特征，为后续的可解释性、压缩与对齐机制设计提供可量化的表征指标体系。
- Track: Multimodal understanding (MLLM interpretability / representation analysis & diagnostics)
- Core innovation: Introduces a “microscope”-style, layer-wise diagnostic framework to probe hidden representations in MLLMs by jointly measuring linearity, intrinsic dimension, and anisotropy of multimodal token embeddings, explicitly contrasting main vs. residual streams. Evaluated on ScienceQA with LLaVA-NeXT and OmniFusion, it provides cross-model, cross-modality structural measurements that uncover highly linear behaviors and other latent geometric properties, yielding actionable, quantitative representation metrics for interpretability, compression, and alignment design.

Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness
- 赛道归属: 多模态安全与可信（开放世界异常检测/拒识、VLM鲁棒性）
- 核心创新点: 提出“语义自负（Hubris of Semantics）”作为开放世界部署中的关键失效模式：VLM会将未知异常强行映射到已知语义并高置信输出。方法上以“生成式语义抗体（Generative Semantic Antibodies）”为核心机制，为模型显式注入“负知识/反语义”以形成可拒识的决策边界，从而在不破坏原有零样本语义对齐能力的前提下提升开放世界可信性与异常处理能力。
- Track: Multimodal safety & trustworthiness (open-world anomaly detection/rejection, VLM robustness)
- Key innovation: Identifies “Hubris of Semantics” as a core open-world failure where VLMs over-confidently force unknown anomalies into known semantic classes. Introduces “Generative Semantic Antibodies” to explicitly inject negative knowledge/counter-semantics, shaping rejectable decision boundaries while preserving zero-shot semantic alignment, improving open-world trustworthiness.

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
- 赛道归属: 多模态理解（音频-视频时序理解评测/Benchmark）
- 核心创新点: 提出SONIC-O1作为面向真实世界音频-视频理解的系统性评测基准：以长时序、多领域对话场景为核心覆盖（60小时、231段、13个真实会话域），并采用全人工核验的数据与标注流程，旨在弥补现有评测偏静态图像、缺少对“音视频联合+时序推理”能力刻画的空白，从而更可靠地区分MLLM在真实音视频理解中的能力边界与失效模式。
- Track: Multimodal Understanding (Audio-Video Temporal Understanding Benchmark)
- Key Innovations: Introduces SONIC-O1, a real-world benchmark for systematic evaluation of MLLMs on sequential audio-video understanding. It emphasizes long-form temporal, multi-domain conversational scenarios (60 hours, 231 clips, 13 domains) with fully human-verified data/annotations, addressing the gap of prior benchmarks that over-focus on static images and under-measure joint audio-video temporal reasoning, enabling clearer diagnosis of capability limits and failure modes.

Cross-modal linkage risk in clinical vision-language models
- 赛道归属: 多模态安全与隐私（视觉-语言模型的链接攻击/成员关联风险评估）
- 核心创新点: 将临床VLM的隐私问题形式化为跨模态重链接（image-to-report linkage）风险：即模型学习到的共享嵌入空间可能保留实例级对应关系，使攻击者仅凭余弦相似度检索即可把去标识化影像重新关联到原始放射学报告；提出相应的威胁模型与评测设定，用以量化在“影像与报告被刻意分离共享/访问控制”的真实流程下，嵌入对齐带来的可重识别性，从而把“表征对齐能力”转化为可度量的隐私攻击面。
- Track: Multimodal security & privacy (vision-language linkage attacks / instance re-identification risk)
- Core innovation: Formalizes a clinical VLM privacy threat as cross-modal re-linkage (image-to-report linkage) risk: the shared embedding space can preserve instance-level correspondence, enabling attackers to re-associate a de-identified radiograph with its original report via cosine-similarity retrieval alone. It defines a concrete threat model and evaluation protocol aligned with real-world workflows where images and reports are intentionally separated, turning representation alignment strength into a measurable privacy attack surface.
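
The retrieval step of such a linkage attack can be sketched in a few lines. The example assumes the attacker already holds image and report embeddings from the shared space as plain lists of floats, with index `i` in both lists being the true pair; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def linkage_top1_rate(image_embs, report_embs):
    """Fraction of images whose most similar report is their true pair."""
    hits = 0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, rep) for rep in report_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(image_embs)
```

A top-1 rate well above `1 / len(report_embs)` (chance level) would indicate that instance-level correspondence survives in the embedding space.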

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
- 赛道归属: 多模态安全与鲁棒性（VLM 对抗攻击）
- 核心创新点: 提出一种面向视觉-语言模型的跨模态协同对抗框架，将纹理约束的图像扰动与跨模态联合优化结合：在视觉侧通过受限于纹理/局部统计特性的扰动提升隐蔽性与可迁移性，在语言侧通过与视觉扰动协同的目标设计/优化放大误导效应，从而在无需不现实的强白盒假设下实现更强的多模态攻击，系统性揭示 LVLM 在“多模态联动”攻击面前的脆弱性。
- Track: Multimodal Security & Robustness (Adversarial Attacks on VLMs)
- Key innovation: Proposes a cross-modal synergistic adversarial framework that couples texture-constrained image perturbations with cross-modal joint optimization. The visual perturbation is constrained by texture/local statistics to remain stealthy while improving transferability, and the language-side objective is co-optimized to amplify misalignment, enabling stronger multimodal attacks without relying on impractical strong white-box access and exposing LVLM fragility under coordinated multimodal threats.

Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
- 赛道归属: 多模态理解（VLM幻觉抑制、跨模态融合/注意力机制改进）
- 核心创新点: 从“视觉注意力汇聚/沉没（attention sink）”角度解释幻觉：并非简单的“语言先验过强”，而是视觉注意力被任务无关区域吸走导致视觉证据未被有效融合。提出利用“注视转移（gaze shifts）”信号来指导跨模态融合增强：通过建模视线在关键区域间的动态转移，重分配视觉-文本对齐时的注意力与融合权重，避免仅按原始注意力分数做放大而加剧偏置，从机制上降低不可证实内容生成。
- Track: Multimodal understanding (VLM hallucination mitigation, cross-modal fusion/attention)
- Key innovation: Reframes hallucination via a “visual attention sink” mechanism—visual attention is diverted to irrelevant regions, preventing evidence from being fused. Uses “gaze shifts” as guidance signals to enhance cross-modal fusion by modeling dynamic transitions between salient regions, reweighting alignment/fusion beyond naive attention amplification, thereby reducing unsupported generations.

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement
- 赛道归属: 多模态模型压缩与端侧部署（知识蒸馏/对齐增强）
- 核心创新点: 提出 Align-KD，将“大模型的跨模态对齐能力”作为可蒸馏的核心知识而非仅蒸馏输出分布/特征；通过显式对齐约束与跨模态一致性信号，把教师VLM在图文对齐、语义绑定等能力迁移到轻量学生模型，从而在移动端/边缘设备的参数与算力受限条件下，尽量减少模型缩小带来的对齐与理解能力退化。
- Track: Multimodal model compression & on-device deployment (knowledge distillation / alignment enhancement)
- Key innovation: Proposes Align-KD, treating cross-modal alignment as the primary distillable knowledge rather than only logits/features; it introduces explicit alignment constraints and cross-modal consistency signals to transfer the teacher VLM’s image-text grounding/alignment capability to a compact student, mitigating the alignment and understanding degradation typically caused by aggressive downsizing for mobile/edge settings.

VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments
- 赛道归属: 具身智能与机器人定位（语义全局定位、VLM+概率滤波/Monte Carlo Localization）
- 核心创新点: 将VLM的开放词汇语义理解引入Monte Carlo Localization（MCL）框架，面向“几何与语义都高度混淆”的准静态室内环境（如货架平行通道、重复家具）提升全局定位鲁棒性。核心在于用VLM生成/评估与场景观测一致的语义证据，并将其作为观测模型或粒子权重更新信号，与传统几何/外观特征互补，从而在几何别名严重、语义长尾且遮挡杂乱的场景中实现更稳定的语义级全局定位。
- Track: Embodied AI & robot localization (semantic global localization, VLM + probabilistic filtering/MCL)
- Key innovation: Integrates open-vocabulary semantic understanding from VLMs into a Monte Carlo Localization pipeline to handle quasi-static indoor environments with strong geometric/semantic aliasing. Uses VLM-derived semantic evidence as an observation/weighting signal for particle updates, complementing geometric/appearance cues to improve robustness under severe aliasing, long-tail semantics, and clutter/occlusion.
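
How semantic evidence can enter the particle update is sketched below. It assumes the geometric and VLM-derived semantic observations are conditionally independent given the pose, so their likelihoods multiply; that factorisation and the function name are assumptions for illustration, not details confirmed by the summary.

```python
def update_particle_weights(weights, geom_lik, sem_lik):
    """One Monte Carlo Localization measurement update.

    weights:  prior particle weights.
    geom_lik: per-particle likelihood of the geometric observation.
    sem_lik:  per-particle likelihood of the semantic (VLM) observation.
    Returns normalised posterior weights; falls back to uniform if all vanish.
    """
    unnorm = [w * g * s for w, g, s in zip(weights, geom_lik, sem_lik)]
    total = sum(unnorm)
    if total == 0.0:
        return [1.0 / len(weights)] * len(weights)
    return [u / total for u in unnorm]
```

When two particles sit in geometrically identical aisles, their `geom_lik` values are equal and only `sem_lik` separates them, which is the aliasing case the entry describes.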

ES-Merging: Biological MLLM Merging via Embedding Space Signals
- 赛道归属: 多模态模型融合（模型合并/参数高效跨模态统一，生物科学MLLM）
- 核心创新点: 提出ES-Merging，用嵌入空间信号（embedding space signals）来指导生物领域MLLM的合并：不再依赖输入无关的参数空间启发式，而是利用各模型在嵌入空间中体现的模态专长与对齐特征来决定合并策略/权重，从而更忠实地保留不同单模态模型的能力并实现跨模态统一；该思路把“模态专门化”从难以观测的参数差异，转化为可直接度量与可优化的表征信号，提高合并后的跨模态任务适配性。
- Track: Multimodal model merging (parameter-efficient cross-modal unification for biological MLLMs)
- Core innovation: Proposes ES-Merging, a model-merging method for biological MLLMs guided by embedding-space signals rather than input-agnostic parameter-space heuristics. By leveraging representation-level cues that reflect modality specialization and alignment, it determines merging behavior/weights to better preserve complementary single-modality strengths while forming a unified cross-modal model. The key methodological shift is making “modality specialization” observable and optimizable through measurable embedding signals, improving post-merge cross-modal capability.

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM 🆕NEW
- 赛道归属: 多模态理解 / 视觉-语言建模加速（Mamba/SSM架构）
- 核心创新点: 提出一种基于Query的跨模态投影器（Query-based Cross-Modal Projector）来增强Mamba在视觉-语言任务中的效率与可用性：通过跨注意力以文本/任务Query为条件对视觉token进行自适应压缩与重采样，将高冗余的视觉序列映射为更短、更信息密集的表示，再交由Mamba的线性复杂度序列建模处理，从而在保持关键信息对齐的同时显著降低视觉侧token长度带来的计算与显存开销；其方法论关键在于用“Query驱动的token选择/聚合”替代固定下采样，实现输入相关的动态视觉压缩并更好匹配SSM类模型对长序列高效处理的优势。
- Track: Multimodal Understanding / Vision-Language Modeling Acceleration (Mamba/SSM-based)
- Key Innovation: Introduces a query-based cross-modal projector to make Mamba practical and efficient for vision-language modeling: it uses cross-attention conditioned on text/task queries to adaptively compress and resample visual tokens, producing a shorter, information-dense visual sequence that is then processed by Mamba with linear-time sequence modeling. The core methodological advance is replacing static visual downsampling with input-dependent, query-driven token selection/aggregation, preserving cross-modal alignment while substantially reducing compute/memory from long visual token sequences.
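
The compression idea can be sketched with plain cross-attention: a small set of query vectors attends over many visual tokens and yields one output per query. This sketch omits the learned key/value/output projections a real projector would have and makes no claim about the paper's architecture; all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_visual_tokens(queries, visual_tokens):
    """Map N visual tokens to len(queries) tokens via scaled dot-product attention."""
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, tok)) for tok in visual_tokens]
        attn = softmax(scores)
        # Each output is an attention-weighted average of the visual tokens.
        compressed.append([sum(w * tok[j] for w, tok in zip(attn, visual_tokens))
                           for j in range(dim)])
    return compressed
```

The output length depends only on the number of queries, so the sequence handed to the Mamba backbone stays short regardless of how many visual tokens the encoder produces.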

GitHub

[2026-06-04] Blaizzy/mlx-vlm ⭐4946

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

[2026-06-04] cpystan/SD-VLM ⭐503

[NeurIPS 2025] "SD-VLM: Spatial Measuring and Understanding with Depth-encoded Vision Language Models"

[2026-06-04] jamjamjon/usls ⭐410

A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models such as YOLO, FastVLM, and more.

[2026-06-03] NVIDIA-Omniverse/content-agents ⭐128

AI-powered agents for automating 3D content workflows using Vision-Language Models (VLMs). Content Agents analyze 3D assets and automate material assi...

[2026-06-03] ocy1/TRIO ⭐107

Official implementation for "TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models" https://arxiv.org/pdf/2602...

HuggingFace Datasets

[2026-06-01] ReasonCore/open-spatial-reasoning
```
Open Spatial Reasoning
```

A multiple-choice dataset of spatial reasoning questions and answers for evaluating 3D spatial reasoning from si...

强化学习 / Reinforcement Learning

arXiv

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
- 赛道归属: 多领域LLM强化学习对齐（跨域冲突缓解 / 奖励建模）
- 核心创新点: 提出CARE-RL，将“协议感知”的生成式奖励与“能力感知”的优化联合起来解决多领域RL中的两类关键瓶颈：一是非可验证任务奖励不可靠，二是跨领域能力相互干扰。方法上通过Protocol-Aware Generative Reward Model（PA-GRM）在提示/协议层面构造更稳健的奖励信号以覆盖不可验证场景，并在优化阶段引入能力维度的约束/加权机制，使更新更聚焦于目标能力、减少对其他领域能力的负迁移，从而系统性缓解cross-domain conflicts。
- Track: Multi-domain LLM RL alignment (cross-domain conflict mitigation / reward modeling)
- Key innovations: Proposes CARE-RL, combining protocol-aware generative reward construction with capability-aware optimization to tackle two core issues in multi-domain RL: unreliable rewards for non-verifiable tasks and capability interference across domains. It introduces a Protocol-Aware Generative Reward Model (PA-GRM) that builds more robust reward signals at the prompt/protocol level for non-verifiable settings, and a capability-aware optimization scheme that constrains/weights updates along capability dimensions to focus learning on target skills while reducing negative transfer to other domains.

Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
- 赛道归属: 自监督强化学习 / 目标条件长时序规划（Goal-conditioned RL）
- 核心创新点: 提出Survival Reinforcement Learning（SRL）作为对比式自监督RL（CRL）的替代范式，用在线分类式目标判别取代对比损失，规避对比学习在长时序规划中“uniformity–tolerance”两难导致的表征退化/目标区分不足问题；将“survival value learning”扩展为通过最大化到达目标后的驻留时间（dwell time）来学习可用于长视野目标条件控制的价值信号，从而在深网络可扩展性与长时序可规划性之间取得更稳健的折中。
- Track: Self-supervised RL / Goal-conditioned long-horizon planning
- Core innovation: Proposes Survival Reinforcement Learning (SRL) as an alternative to contrastive self-supervised RL by replacing contrastive objectives with an online classification-based signal, mitigating the contrastive “uniformity–tolerance” dilemma that hurts long-horizon goal discrimination and planning. It extends survival value learning by maximizing dwell time at target goals, yielding a planning-friendly value signal while retaining strong depth-scaling behavior.

A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models
- 赛道归属: 逆强化学习（IRL）理论 / 离线RL与结构计量经济学（DDC）统一视角
- 核心创新点: 以讲义形式系统梳理IRL的基础，并将熵正则IRL与结构计量中的动态离散选择模型（Dynamic Discrete Choice, DDC）在数学结构上进行对齐：从“由专家离线数据反推奖励/偏好”的角度，统一讨论可辨识性、似然/最大熵目标、价值函数与策略的对应关系，以及由此带来的估计与推断框架；其方法论价值在于提供跨社区的同构映射与推导路径，便于将DDC的统计推断工具与IRL的优化视角互相迁移。
- Track: Inverse Reinforcement Learning theory / Unifying Offline RL–IRL with Dynamic Discrete Choice (DDC)
- Core innovation: A foundations-focused note that aligns entropy-regularized IRL with dynamic discrete choice (DDC) models at the level of objectives and solution structure. It frames reward recovery from expert offline data through a unified lens (identifiability, likelihood/max-entropy criteria, value–policy correspondences), enabling methodological transfer between econometric inference in DDC and optimization-centric IRL formulations.

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network
- 赛道归属: 医学影像多模态生成（胸部影像报告生成）/ 强化学习用于文本生成
- 核心创新点: 提出RL-ACRGNet，将强化学习引入胸部放射学报告生成的训练框架，以缓解纯监督学习在“疾病识别准确性”和“报告表述质量/一致性”上的不足。方法层面通过将临床相关的序列级目标（如报告整体质量、关键病灶描述覆盖等）显式作为RL优化信号，直接优化生成报告的全局指标而非仅做token级似然拟合，从而提升对细粒度病灶信息的捕获与报告生成的临床可用性与一致性。
- Track: Medical multimodal generation (chest radiology report generation) / RL for text generation
- Key innovations: Introduces RL-ACRGNet, integrating reinforcement learning into chest radiology report generation to address limitations of purely supervised training in disease detection accuracy and report quality/consistency. Methodologically, it optimizes clinically meaningful sequence-level objectives (e.g., overall report quality and coverage of key findings) as RL signals, directly targeting global report metrics rather than token-level likelihood alone, improving fine-grained pathology capture and clinical usability/consistency of generated reports.

Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning
- 赛道归属: 多模态理解（复杂场景视觉推理）/ Agentic 强化学习
- 核心创新点: 提出一种以“放大镜”式信息获取为核心的智能体强化学习框架，让MLLM在复杂拥挤场景中通过主动、迭代的视觉聚焦与证据收集来提升推理可靠性；相较依赖标注框等显式视觉提示的方法，该思路用RL学习“看哪里、看多细、看几次”的策略，在避免额外标注的同时缓解低分辨率裁剪丢失细节的问题，从而增强细粒度识别与多步推理能力。
- Track: Multimodal understanding (complex-scene visual reasoning) / Agentic Reinforcement Learning
- Core innovation: Introduces an “agentic magnifying-glass” RL framework that trains an MLLM to actively and iteratively acquire visual evidence (where/what to zoom into and how to refine) for reliable reasoning in cluttered, high-density scenes. Unlike prior approaches that inject explicit cues (e.g., annotated boxes) and suffer from detail loss in low-res crops, it learns a sequential visual-attention/inspection policy via RL, improving fine-grained perception and multi-step reasoning without extra annotations.

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
- 赛道归属: LLM智能体强化学习（Agentic RL）/ 策略优化算法
- 核心创新点: 提出StepPO（Step-Aligned Policy Optimization），针对现有LLM-RL普遍采用token为基本优化粒度而与智能体“按步骤（observation-action循环）决策”的粒度不匹配问题，改为以“步骤”作为对齐与优化的核心单位。方法突破在于将信用分配与策略更新从token层提升到step层，使奖励/优势估计与环境交互的决策边界一致，从而更贴合agentic行为结构，减少由token级噪声与粒度错配带来的优化偏差，提升多步任务中的决策稳定性与学习效率。
- Track: LLM agent reinforcement learning (Agentic RL) / policy optimization
- Key innovations: Proposes StepPO (Step-Aligned Policy Optimization) to resolve the granularity mismatch where existing LLM RL optimizes at the token level while agents act via step-wise observation–action cycles. The key advance is elevating alignment, credit assignment, and policy updates to the step level so that reward/advantage estimation matches decision boundaries in environment interaction, reducing token-level noise and mismatch-induced bias, and improving stability and sample efficiency in multi-step agent tasks.
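
Step-level credit assignment can be sketched as follows. The example assumes one scalar reward per observation-action step and a constant baseline, computes a discounted return per step, and shares that single advantage across all tokens generated in the step. This is an illustration of the granularity idea, not StepPO's actual estimator; all names are hypothetical.

```python
def step_returns(step_rewards, gamma=1.0):
    """Discounted return-to-go for each agent step."""
    returns, running = [], 0.0
    for reward in reversed(step_rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def token_advantages(step_rewards, step_token_counts, baseline=0.0, gamma=1.0):
    """Give every token in a step the same step-level advantage."""
    advantages = []
    for ret, n_tokens in zip(step_returns(step_rewards, gamma), step_token_counts):
        advantages.extend([ret - baseline] * n_tokens)
    return advantages
```

Because the advantage changes only at step boundaries, tokens within a single tool call or action are never credited differently from one another.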

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning 🆕NEW
- 赛道归属: 基于LLM裁判的强化学习（Rubric-based RL）安全与对齐 / Reward Hacking检测
- 核心创新点: 提出CHERRL作为“可控的奖励黑客”实验环境，将真实rubric-based RL中隐蔽且多偏置纠缠的reward hacking现象进行可控生成与复现；通过显式参数化/组合裁判（LaaJ）的潜在偏置与策略可利用的漏洞，支持系统化分析“策略如何利用裁判偏差获得高分但低质量/不安全输出”；进一步面向检测提出可操作的评测与识别设置，使reward hacking从难以复盘的现象转化为可基准化、可诊断的研究对象。
- Track: LLM-as-a-Judge Reinforcement Learning (Rubric-based RL) Safety & Alignment / Reward Hacking Detection
- Core innovation: Introduces CHERRL, a controllable reward-hacking environment that makes subtle, bias-entangled reward hacking in real rubric-based RL reproducible and tunable; it explicitly parameterizes and composes judge (LaaJ) latent biases and exploitable loopholes to enable mechanistic analysis of how policies game the judge for high scores despite low-quality/unsafe outputs; additionally provides a concrete detection/evaluation setup that turns reward hacking into a benchmarkable, diagnosable target.

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees 🆕NEW
- 赛道归属: 安全强化学习（Risk-aware RL）/ 场景生成与形式化安全保证（Probably Approximately Safe）
- 核心创新点: 面向“策略对转移扰动敏感、易出现未知不安全行为”的问题，将安全验证与数据/场景生成耦合：通过采样策略轨迹构造概率型barrier certificate来刻画安全边界，并提出用于生成“更紧的安全界/更有效暴露风险”的场景采样机制；以Probably Approximately Safe（PAS）形式给出可证明的安全保证，使得生成的验证场景在统计意义上覆盖高风险区域，从而提升对策略安全性的可验证性与风险感知训练的有效性。
- Track: Safe Reinforcement Learning (Risk-aware RL) / Scenario Generation with Formal Safety Guarantees (Probably Approximately Safe)
- Core innovation: Couples safety verification with scenario generation to address policy fragility under transition perturbations: it builds probabilistic barrier certificates from sampled trajectories to delineate safe vs. unknown regions, and designs a scenario sampling/generation procedure aimed at tightening safety bounds and more effectively surfacing risky behaviors; provides Probably Approximately Safe (PAS) guarantees so the generated scenarios statistically target high-risk regions, improving verifiability and risk-aware training.

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning 🆕NEW
- 赛道归属: 深度强化学习机理分析 / 时间信用分配（Eligibility Traces）与偏置失效模式
- 核心创新点: 识别并形式化一种新的系统性失效模式TMPB（Trace-Mediated Peak Bias）：在中等eligibility trace深度下，结合非线性函数逼近会诱发策略对“高幅度瞬时奖励峰值”的非理性偏好，即便其累计回报更低；将该现象与认知科学中的Peak-End启发式建立机制对应，给出可解释的因果链条（trace深度→信用分配形状→价值/优势估计偏置→策略偏好扭曲），从而为调参/算法设计提供针对性的诊断维度。
- Track: Mechanistic Analysis of Deep RL / Temporal Credit Assignment (Eligibility Traces) Failure Modes
- Core innovation: Identifies and formalizes a new systematic failure mode, Trace-Mediated Peak Bias (TMPB): at intermediate eligibility-trace depths, the interaction with nonlinear function approximation makes agents irrationally favor trajectories with large reward “peaks” even when total return is lower; links this mechanism to the cognitive Peak-End heuristic and provides an interpretable causal chain (trace depth → credit assignment profile → biased value/advantage estimates → distorted policy preferences), yielding actionable diagnostics for algorithm design and tuning.
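
For readers unfamiliar with how trace depth shapes credit assignment, here is a tabular TD(λ) sketch with accumulating eligibility traces. It is the textbook algorithm rather than the paper's deep-RL setup with nonlinear function approximation, and the episode format `(state, reward, next_state)` is an assumption of this example.

```python
def td_lambda_episode(values, episode, alpha=0.1, gamma=0.99, lam=0.8):
    """Run tabular TD(lambda) over one episode and return updated values.

    values:  dict mapping state -> current value estimate (not mutated).
    episode: list of (state, reward, next_state); next_state is None at the end.
    lam:     trace decay; 0 credits only the current state, 1 credits all
             previously visited states (discounted by gamma).
    """
    values = dict(values)
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        v_next = 0.0 if next_state is None else values[next_state]
        delta = reward + gamma * v_next - values[state]
        traces[state] += 1.0
        for s in values:
            values[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam
    return values
```

With `lam=0` a late reward updates only the state that received it, while with `lam=1` the same reward is pushed back to every earlier state; intermediate values of `lam` are the regime in which the entry reports the peak bias.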

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments 🆕NEW
- 赛道归属: 强化学习理论 / 连续时间（Continuous-time）Actor-Critic动力学与收敛分析
- 核心创新点: 将连续环境中的深度RL建模为连续时间随机过程，引入基于随机控制的统一理论框架，把探索与随机转移显式纳入actor-critic的连续时间动力学；针对单隐层神经网络，提出“两时间尺度”刻画：环境状态演化与参数/学习动态分离，从而可用随机微分方程/平均场等工具分析训练过程；该框架把离散“ticks”的算法更新提升为“flows”的动力系统视角，为稳定性、收敛性与噪声影响提供更可解析的理论入口。
- Track: RL Theory / Continuous-time Actor-Critic Dynamics and Convergence in Continuous Environments
- Core innovation: Models deep RL in continuous environments as a continuous-time stochastic process under a stochastic-control-inspired framework, explicitly incorporating exploration and stochastic transitions into actor-critic dynamics; for single-hidden-layer networks, formulates learning as a two-timescale process separating environment evolution from parameter/learning dynamics, enabling analysis via SDE/averaging/mean-field tools; reframes discrete update “ticks” into dynamical “flows,” offering a more tractable route to study stability, convergence, and noise effects.
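The "ticks to flows" viewpoint can be seen in a much simpler setting than the paper's. The sketch below is a deterministic toy, assuming a one-dimensional quadratic objective and no noise or neural network, so it leaves out the stochastic and two-timescale parts of the framework. It only shows discrete gradient updates approaching the corresponding ODE solution as the step size shrinks.

```python
import math


def gradient_ticks(x0: float, step: float, t_end: float) -> float:
    """Discrete updates x <- x - step * f'(x) for f(x) = x**2 / 2,
    repeated until the accumulated time reaches t_end."""
    x = x0
    for _ in range(round(t_end / step)):
        x -= step * x  # f'(x) = x for this objective
    return x


def gradient_flow(x0: float, t_end: float) -> float:
    """Exact solution of the continuous-time flow dx/dt = -x at time t_end."""
    return x0 * math.exp(-t_end)


if __name__ == "__main__":
    target = gradient_flow(1.0, 1.0)
    for step in (0.1, 0.01, 0.001):
        gap = abs(gradient_ticks(1.0, step, 1.0) - target)
        print(f"step={step}: gap to flow = {gap:.6f}")
```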

GitHub

[2026-06-04] OpenPipe/ART ⭐9892

Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for...

[2026-06-05] rllm-org/rllm ⭐5593

Democratizing Reinforcement Learning for LLMs

[2026-06-04] pytorch/rl ⭐3454

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.

[2026-06-04] natolambert/rlhf-book ⭐1955

Textbook on reinforcement learning from human feedback

[2026-06-04] radixark/miles ⭐1503

Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.

HuggingFace Datasets

[2026-06-04] stanford-vision-lab/gpic

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchi...

[2026-06-01] VCLab-PolyU/GGT-100K

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Real-world LQ–HQ pairs from MFMs to expand IR generalizatio...

世界动作模型 / World Action Model

arXiv

WALL-WM: Carving World Action Modeling at the Event Joints
- 赛道归属: 世界动作模型（World Action Model）/ 视觉-语言-动作预训练（Vision-Language-Action Pretraining）/ 视频动作建模
- 核心创新点: 提出从“固定长度动作块（chunk）”转向“语义事件（event）”的世界动作建模范式，将语义连贯的动作事件作为最小学习单元，在事件连接点（event joints）处刻画动作的自然边界与状态转移，从而缓解 chunk 粒度与真实动作结构不匹配带来的学习偏差。方法上以事件为锚点进行视觉-语言-动作联合预训练，使模型学习到更符合人类语义分段的动作表征与跨事件的因果/时序衔接能力，相比直接对当前观测+指令做 chunk 级预测，更强调事件级结构化监督与可组合性。
- Track: World Action Model / Vision-Language-Action Pretraining / Video Action Modeling
- Core innovation: Introduces an event-grounded paradigm for World Action Models, replacing fixed-length action chunks with semantically coherent action events as the atomic learning unit. By modeling transitions at event joints (the natural boundaries between events), it addresses the granularity mismatch inherent in chunk-centric optimization and better captures state changes and temporal/causal continuity. The approach performs Vision-Language-Action pretraining anchored on events, encouraging structured, compositional action representations and improved cross-event linkage, rather than directly predicting chunk-level actions conditioned only on the current observation and instruction.
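The difference between chunk-level and event-level segmentation can be made concrete with a small sketch. It assumes the event boundaries are already given as indices; how WALL-WM actually detects or annotates event joints is not stated here, and both helper names are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_fixed_chunks(actions: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Cut an action sequence into fixed-length chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(actions[i:i + chunk_size]) for i in range(0, len(actions), chunk_size)]


def split_at_event_joints(actions: Sequence[T], joints: Sequence[int]) -> List[List[T]]:
    """Cut an action sequence at given event joints.
    `joints` are sorted indices where a new event starts (index 0 is implicit)."""
    starts = [0] + list(joints)
    ends = list(joints) + [len(actions)]
    return [list(actions[s:e]) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    actions = list(range(10))  # stand-in for 10 low-level action steps
    print(split_fixed_chunks(actions, 4))
    print(split_at_event_joints(actions, [3, 7]))  # assumed event boundaries
```

With fixed chunks, a boundary can fall in the middle of a semantic event; with event joints, every segment is variable-length but covers exactly one event.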

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation
- 赛道归属: 机器人世界模型 / 视频-动作联合生成（World Action Model, Video-Action Joint Modeling）
- 核心创新点: 从分布建模角度重构“视频先验→动作策略”的对齐方式：不再将视频基础模型的动态先验压缩为“给定观测的未来动作策略分布”，而是直接在交互视频与可执行手部轨迹的联合空间上进行建模与去噪生成；通过支持多种条件化机制/条件模式来保持更“宽”的联合分布，从而在同一框架内同时服务于灵巧动作生成与数据生成（视频与动作的协同合成），提升视频-动作一致性与可控性。
- Track: Robotics World Models / Video-Action Joint Generation (World Action Model, Video-Action Joint Modeling)
- Key innovation: Reframes video-to-action alignment as a distribution modeling problem: instead of collapsing the video foundation model’s dynamics prior into an observation-conditioned action policy over future actions, it models and denoises the joint distribution over interaction videos and executable hand trajectories. By enabling multiple conditioning regimes, it preserves a broader joint distribution, unifying dexterous action generation and data generation (co-synthesizing videos and actions) with improved video–action consistency and controllability.

GitHub

[2026-05-31] DravenALG/awesome-vla-wam ⭐691

A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond

Generated automatically by Daily AI Digest Agent 生成时间: 2026-06-05 01:01:24