AI 每日进展速报 / Daily AI Digest - 2026-06-10
图像生成/编辑 / Image Generation/Editing
arXiv
- BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation 🆕NEW
- 赛道归属: 文生图(GAN-based Text-to-Image,语义-空间一致性建模)
- 核心创新点:
- 中文:提出“双向语言建模”驱动的GAN式文生图框架,将文本从前向与后向两个方向进行语义建模,用于更完整地捕获描述中的全局语义与局部细节依赖,缓解传统单向编码导致的语义遗漏与细节不稳。进一步面向“语义-空间”一致性问题,引入对文本中对象/属性/关系的空间约束建模与对齐机制,使生成过程同时受语义正确性与空间布局合理性约束,从而提升复杂描述下的可控性与一致性。
- English: Proposes a GAN-based T2I framework powered by bidirectional language modeling, encoding text in both forward and backward directions to better capture global semantics and long-range dependencies, reducing the semantic omissions and unstable fine details common with unidirectional text encoders. To improve semantic–spatial consistency, it further incorporates explicit modeling and alignment of the spatial constraints implied by objects, attributes, and relations in the text, jointly regularizing semantic correctness and layout plausibility for more controllable generation under complex prompts.
- TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation
- 赛道归属: 姿态引导文生图(Pose-guided Text-to-Image)/ 多模态扩散Transformer(MM-DiT)
- 核心创新点: 提出原生“三流”(triple-stream) 的扩散Transformer结构,将文本、图像潜变量与姿态条件以更结构化的方式解耦建模,避免在MM-DiT中直接拼接条件信号导致的预训练潜空间分布被破坏;通过为姿态引导建立独立且可控的信息注入路径,增强长程空间依赖建模能力,显著缓解多人复杂姿态下的肢体扭曲与特征串扰问题,并在SD3.5M架构上实现更稳定的姿态对齐与细节一致性。
- Track: Pose-guided Text-to-Image / Multimodal Diffusion Transformer (MM-DiT)
- Key innovation: Introduces a native triple-stream diffusion Transformer that structurally separates text, latent image tokens, and pose conditioning, avoiding naive concatenation that disrupts the pre-trained latent distribution in MM-DiTs; by creating a dedicated, controllable pose information pathway, it improves long-range spatial dependency modeling and reduces limb distortions and feature crosstalk in complex multi-person scenes, yielding more stable pose adherence and visual consistency on top of the SD3.5M backbone.
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
- 赛道归属: 文生图(Text-to-Image)评测基准 / 语义与世界知识对齐评估
- 核心创新点: 提出面向文生图的“世界知识驱动语义评测”基准WISE,将评估重点从传统的画质与浅层文本-图像匹配,提升到对复杂语义理解、隐含常识/事实知识、关系与组合推理等能力的系统化测量;通过构造需要外部世界知识才能判定对错的提示与判别维度,提供更能暴露模型“看似对齐但语义错误”的评测框架,从而推动T2I模型在知识一致性与深层语义对齐上的改进。
- Track: Text-to-Image evaluation benchmark / semantic & world-knowledge alignment assessment
- Key innovation: Introduces WISE, a world-knowledge-informed semantic evaluation benchmark for T2I that shifts emphasis from realism and shallow text-image matching to systematic measurement of complex semantic understanding—commonsense/factual knowledge, relations, and compositional reasoning. By designing prompts and evaluation dimensions that require external world knowledge to judge correctness, it better exposes “plausible-looking but semantically wrong” generations and drives progress on knowledge-consistent, deep semantic alignment.
- Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
- 赛道归属: 文生图安全对齐 / 推理时安全防护(Text-to-Image Safety Alignment at Inference)
- 核心创新点: 提出一种仅在推理阶段生效的安全防护机制,通过对输入提示词注入并优化“提示噪声”(prompt-noise) 来抑制不安全内容的生成;其关键突破在于把安全约束转化为可优化的推理时变量,无需重新训练/微调模型即可动态调整生成轨迹,从而提升对绕过式提示与对抗攻击的鲁棒性,并在尽量保持画质与文本一致性的前提下实现更稳定的安全过滤。
- Track: Text-to-Image safety alignment / Inference-time safety defense
- Core innovation: Introduces an inference-only safeguarding method that injects and optimizes prompt noise to steer diffusion sampling away from unsafe regions. The key methodological step is formulating safety control as an optimizable inference-time variable, avoiding retraining while improving robustness to jailbreak prompts and adversarial attacks, with minimal degradation to image quality and prompt fidelity.
- Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation 🆕NEW
- 赛道归属: 文生图(Prompt Rewriting/Prompt Optimization,视觉锚点对齐)
- 核心创新点:
- 中文:提出FaithRewriter,将“提示词改写”从纯语言润色升级为“视觉锚点对齐”的改写范式:在改写过程中引入可视觉化的锚点(如关键对象、属性、场景要素等)作为约束信号,强制改写结果与可生成的视觉要素一致,减少传统改写因过度脑补细节而造成的intent-generation gap。方法上强调把用户意图拆解为可落地的视觉约束,并在改写目标中显式惩罚偏离锚点的扩写,从而在提升可生成性/细节明确性的同时保持对原始意图的忠实。
- English: Introduces FaithRewriter, reframing prompt rewriting from purely linguistic polishing to visual-anchor-aligned rewriting. It injects visually grounded anchors (e.g., key objects, attributes, scene elements) as constraints during rewriting, forcing the rewritten prompt to stay consistent with realizable visual content and reducing the intent–generation gap caused by over-inferred details. Methodologically, it decomposes user intent into actionable visual constraints and explicitly discourages expansions that drift from these anchors, improving generation readiness and specificity while preserving faithfulness to the original intent.
- Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation
- 赛道归属: 文生图多样性提升(Text-to-Image Diversity)/ 表征调制(Representation Modulation)
- 核心创新点: 从“同质化输出”的根因出发,分析Transformer中间表征(尤其是潜变量/中间特征的收缩与聚集现象)对采样多样性的限制,提出基于表征调制的多样性增强策略:在不引入昂贵的多次采样、额外优化或外部搜索的前提下,通过对中间特征分布/通道响应进行可控扰动或重标定,打破固定prompt下的表示锁定(lock-in),以较低推理开销提升样本多样性,同时尽量保持文本对齐与画质。
- Track: Text-to-Image Diversity / Representation Modulation
- Key innovation: Targets the root cause of homogeneity by diagnosing how intermediate Transformer representations collapse/cluster and restrict sample diversity; proposes a representation-modulation mechanism that perturbs or re-scales intermediate features in a controlled manner to break prompt-conditioned “lock-in,” improving diversity without expensive extra sampling loops or auxiliary optimization, while largely preserving text-image alignment and visual quality.
- MemoGen: Can Past Experience Improve Future Text-to-Image Generation?
- 赛道归属: 文生图(Text-to-Image)生成增强 / 记忆与检索增强生成(Memory-augmented Generation)
- 核心创新点: 提出MemoGen,将“单次请求的检索/代理式增强”扩展为“跨任务可积累的经验记忆”机制:把历史生成中的成功/失败案例、隐含约束满足策略、有效提示改写或参考证据进行结构化存储,并在新请求到来时进行检索与复用,以提升对隐式视觉约束、关系推理与外部知识需求场景的可靠性;核心突破在于把T2I生成从一次性优化转为可持续学习的闭环(记录—检索—迁移),减少重复犯错并提高长期一致性。
- Track: Text-to-Image generation enhancement / memory-augmented (experience-reuse) generation
- Key innovation: Proposes MemoGen, extending retrieval/agentic augmentation from per-request assistance to an accumulative experience memory. It stores structured signals from past generations (success/failure cases, constraint-satisfaction tactics, effective prompt rewrites, supporting references) and retrieves them to guide future requests, improving reliability on implicit constraints, relational reasoning, and external-knowledge prompts. The key methodological step is turning T2I generation into a continual closed loop (log–retrieve–transfer) that reduces repeated errors and improves long-horizon consistency.
- KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation
- 赛道归属: 文生图(公平性/去偏见)、提示词优化(Prompt Refinement)
- 核心创新点: 提出以知识图谱(Knowledge Graph)为约束与检索支撑的提示词自动精炼框架,在不重训/不改动闭源T2I主干模型的前提下,通过对人口统计属性与职业/场景等语义关系的显式建模,系统性地补全或重写提示词中的敏感与相关属性表达,从而在生成阶段实现更均衡的人群呈现;方法重点在“结构化知识→可控prompt变换”的映射,降低仅靠启发式词替换带来的语义漂移,并兼顾公平性提升与文本意图保持。
- Track: Text-to-Image (fairness/de-biasing), Prompt Refinement
- Core innovation: Introduces a knowledge-graph-guided prompt refinement framework that improves demographic fairness without retraining or modifying (potentially closed-source) T2I backbones. By explicitly modeling relationships between demographic attributes and contextual semantics (e.g., occupations, settings), it automatically augments/rewrites prompts to enforce more balanced representation at inference time. The key methodological advance is mapping structured knowledge constraints into controllable prompt transformations, reducing semantic drift compared to heuristic word swaps while preserving the original intent.
- RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
- 赛道归属: 文生图(可控生成)、训练免(Training-free)空间控制/条件注入
- 核心创新点: 提出一种同时具备“结构+外观”双重约束的训练免空间控制方案,通过改进特征注入/融合机制,在扩散采样过程中更稳定地对齐条件图像的几何结构并保留外观细节;针对训练免注入常见的结构错位、条件泄漏(把条件图像纹理/噪声直接拷入结果)与伪影问题,引入更精细的分层/分步控制与抑制策略,使结构遵循与外观一致性可以解耦调节,从而在无需LoRA/微调的情况下获得更可靠的空间可控生成。
- Track: Controllable Text-to-Image, Training-free spatial control / condition feature injection
- Core innovation: Proposes a training-free spatial control method that is rich in both structure and appearance constraints. It improves feature injection/fusion during diffusion sampling to better align geometry from conditional inputs while preserving appearance details. To address common training-free issues—structural misalignment, condition leakage (copying conditional textures/noise), and artifacts—it introduces finer-grained, stage-/layer-wise control and suppression mechanisms, enabling decoupled tuning of structural adherence vs. appearance fidelity without LoRA or finetuning.
- Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows
- 赛道归属: 安全文生图与安全图像编辑(Safe T2I & Safe I2I)/ DiT多模态注意力安全对齐
- 核心创新点: 面向带多模态注意力(MM-Attn)的扩散Transformer,提出统一的“限制不安全信息流”(restricting unsafe information flows) 安全框架,解决现有安全机制偏向T2I或U-Net、难以覆盖I2I编辑的问题;核心在于在DiT的跨模态/上下文注入链路中识别并抑制不安全语义从条件端(文本、参考图、上下文示例等)向生成端传播的关键通道,实现in-context生成与编辑场景下的一体化安全控制,在尽量不牺牲正常内容生成能力的同时降低有害内容泄露与绕过风险。
- Track: Safe Text-to-Image & Safe Image-to-Image Editing / Safety alignment for MM-Attn DiTs
- Key innovation: Proposes a unified safety mechanism for diffusion Transformers with multimodal attention by explicitly restricting unsafe information flows through cross-modal/context injection pathways, addressing the gap where prior safety methods are tailored to T2I or U-Net and fail to generalize to I2I editing; by identifying and suppressing critical channels that propagate unsafe semantics from conditioning sources (text, reference images, in-context examples) into generation, it enables consistent safety mitigation across in-context generation and editing while minimizing degradation on benign outputs.
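As a rough illustration of what "restricting unsafe information flows" in the last entry above could look like at the attention level, the sketch below removes the attention that one generated image token pays to conditioning tokens flagged as unsafe, then renormalizes over the remaining tokens. This is an assumption made for illustration only: the digest does not specify the paper's mechanism, and `masked_attention_weights`, the scores, and the flagged index are all hypothetical.

```python
import math


def masked_attention_weights(scores, blocked_keys):
    """Softmax over attention scores, with blocked key positions forced to weight 0."""
    allowed = [j for j in range(len(scores)) if j not in blocked_keys]
    if not allowed:
        raise ValueError("at least one key position must remain unblocked")
    top = max(scores[j] for j in allowed)  # subtract the max for numerical stability
    exps = {j: math.exp(scores[j] - top) for j in allowed}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(scores))]


# Hypothetical setup: one image-token query attends over three conditioning tokens.
# Position 1 is assumed to have been flagged as carrying unsafe semantics.
scores = [1.0, 3.0, 0.5]
weights = masked_attention_weights(scores, blocked_keys={1})
print(weights)
```

Without the mask, position 1 would dominate the softmax because it has the highest score; with the mask it contributes nothing, and the two remaining positions share the full attention mass.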
GitHub
- [2026-06-10] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12449
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-06-09] AceDataCloud/Nexior ⭐373
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-06-09] etkecc/baibot ⭐229
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
- [2026-06-09] techjarves/Local-AI-Image-Generator ⭐162
A fully self-contained, offline AI image generation studio for Windows. Runs Stable Diffusion (Safetensors/GGUF) locally with zero manual setup. Auto-...
- [2026-06-09] ferranpons/Llamatik ⭐145
True on-device AI for Kotlin Multiplatform (Android, iOS, Desktop, JVM, WASM). LLM, Speech-to-Text and Image Generation — powered by llama.cpp, whispe...
HuggingFace Models
视频生成/编辑 / Video Generation/Editing
arXiv
- Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation
- 赛道归属: 身份保持文本到视频生成(Reference-conditioned T2V / Video Generation)
- 核心创新点: 提出ST-DRC(Spatial-Temporal Decoupled Reference Conditioning)框架,将参考身份条件在空间与时间维度解耦注入视频扩散/生成过程:用空间侧的细粒度特征强化单帧身份细节(如脸部结构、纹理一致性),用时间侧的机制约束跨帧身份稳定与时序一致,从而在“文本语义可控性”和“低层身份保真度”之间实现更好的平衡;框架层面强调晚期/分阶段的条件融合以减少文本驱动对身份特征的干扰并提升长序列稳定性。
- Track: Identity-preserving text-to-video generation (reference-conditioned T2V / video generation)
- Key innovation: Proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that injects identity reference signals separately along spatial and temporal axes in the video generation (diffusion) process: spatial conditioning strengthens per-frame identity details (geometry/texture), while temporal conditioning enforces cross-frame identity stability and temporal coherence. The method emphasizes late/staged conditioning fusion to reduce interference from text semantics and improve long-range identity consistency.
- SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
- 赛道归属: 视频生成安全评测(Image-conditioned T2V Safety Benchmark / Evaluation)
- 核心创新点: 提出SafeGen-Bench,面向图像条件引导的文本到视频生成系统化评测其安全风险,补齐现有安全基准主要聚焦纯文本模式的缺口;通过覆盖非法/政治敏感/伦理风险等多类场景与触发方式,构建更贴近真实使用链路的测试集与评测协议,用于量化模型在“给定初始图像+文本”条件下的越界生成倾向与防护能力,从而推动安全对齐在I2V/T2V条件生成中的可比、可复现评估。
- Track: Safety benchmarking for image-conditioned text-to-video generation (evaluation/benchmark)
- Key innovation: Introduces SafeGen-Bench to systematically evaluate safety risks specifically in image-conditioned T2V settings, addressing a gap left by prior benchmarks, which mainly test text-only generation. It broadens risk coverage (illegal/political/ethical categories and triggers) and provides a more realistic evaluation protocol to quantify both a model's propensity for unsafe generation and the effectiveness of its safeguards under “input image + prompt” conditioning, enabling comparable and reproducible safety assessment.
- MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
- 赛道归属: 文生视频(Text-to-Video)/ 提示词工程与多智能体协同(Multi-agent Prompt Refinement)
- 核心创新点: 提出多智能体提示词精炼框架MAVEN,面向“多文化一致性/文化保真度”这一以往T2V较少系统覆盖的目标进行优化;方法上将文本提示分解为“人物(Person)-动作(Action)-地点(Location)”三维语义槽位,由具备专长的代理分别并行或串行地改写与约束,从而在单一文化与跨文化组合提示中减少文化符号混淆与刻板化偏差;同时构建支持系统评测的多文化/跨文化基准与流程,使文化保真度从主观描述转为可对比的评估闭环。
- Track: Text-to-Video / Prompt Engineering with Multi-Agent Collaboration (Multi-agent Prompt Refinement)
- Core innovations: Introduces MAVEN, a multi-agent prompt refinement framework targeting cultural fidelity, a dimension underexplored in prior T2V work; technically, it decomposes prompts into three semantic slots—Person, Action, and Location—and assigns specialized agents to refine/ground each slot in parallel or sequential modes, reducing cultural symbol confusion and stereotyping in mono-cultural and cross-cultural prompts; additionally, it establishes a systematic multicultural/cross-cultural evaluation setup to make cultural fidelity more measurable and comparable.
- Knowledge-Intensive Video Generation
- 赛道归属: 知识密集型文本到视频生成评测(Factuality/Helpfulness Evaluation for T2V)
- 核心创新点: 定义“知识密集型视频生成(KIVI)”任务:针对解释、流程、演示类信息检索式短提示,要求生成视频不仅好看还要事实正确且有用;构建KIVI-Bench(1080条提示)并提出面向事实性(factuality)与帮助性(helpfulness)的自动评测指标,且通过人工评测验证指标相关性,从评测体系上把T2V从感知质量扩展到“知识/实用性”维度,为后续引入检索增强、工具使用或知识对齐的T2V方法提供可量化目标。
- Track: Knowledge-intensive text-to-video generation evaluation (factuality/helpfulness)
- Key innovation: Formulates Knowledge-Intensive Video Generation (KIVI), where prompts request explanations/procedures/demonstrations and outputs must be factually correct and practically helpful, not just visually appealing. Releases KIVI-Bench (1,080 prompts) and proposes automatic metrics for factuality and helpfulness, validated via human studies, extending T2V evaluation from perceptual quality to knowledge/utility and enabling measurable targets for retrieval/tool-augmented or knowledge-aligned T2V models.
- Consistency-Preserving Diverse Video Generation
- 赛道归属: 视频生成(多样性采样/一致性保持,Flow-Matching)
- 核心创新点: 提出面向Flow-Matching视频生成器的联合采样(joint-sampling)框架,在“每个提示词只能生成少量样本”的低采样场景下,显式提升跨视频(batch内)多样性同时保持单个视频内部的时序一致性。相较将图像多样性技巧直接迁移到视频而导致时间一致性下降的方法,该框架避免或减少对视频解码器进行昂贵的反向传播优化,从采样层面实现“多样性-一致性”兼顾。
- Track: Video Generation (diverse sampling with consistency preservation, Flow-Matching)
- Key innovation: Proposes a joint-sampling framework for Flow-Matching video generators to maximize cross-video (in-batch) diversity in the low-sample regime while preserving within-video temporal consistency. Unlike image-diversity tricks that often harm temporal coherence when applied to video, the method achieves a better diversity–consistency trade-off primarily at the sampling level, avoiding or reducing costly backpropagation through the video decoder.
- CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation 🆕NEW
- 赛道归属: 视频生成 / 文本-音频-视频联合生成(T2AV)数据集
- 核心创新点: 提出面向“多镜头、长篇幅、电影化叙事”的开源大规模T2AV训练数据集 CineDance-1M,重点补齐开放社区在高质量电影级音画联合生成数据上的缺口;数据设计强调多镜头结构与长时序叙事组织,使模型可学习跨镜头的语义连贯性、镜头语言与音画同步关系,从数据层面提升长视频生成的结构多样性与可控性。
- Track: Video Generation / Text-to-Audio-Video (T2AV) joint generation dataset
- Key innovation: Introduces CineDance-1M, a large-scale open dataset tailored for multi-shot, long-form cinematic T2AV generation, addressing the key bottleneck of scarce high-quality open training data. The dataset is structured to emphasize multi-shot composition and long-horizon narrative organization, enabling models to learn cross-shot coherence, cinematic shot grammar, and audio-video alignment, which improves structural diversity and controllability at the data level.
- Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions 🆕NEW
- 赛道归属: 视频生成 / 实时流式高分辨率生成 / 推理加速
- 核心创新点: 提出级联式(cascaded)的流式生成框架 Ultra Flash,将“实时流式生成”扩展到高分辨率场景;通过分阶段/分尺度的生成与调度,在单GPU上实现约30FPS@1K、18FPS@2K的吞吐,核心突破在于把高分辨率生成拆解为可流式执行的级联管线,从而在延迟、显存与画质之间实现可扩展的工程化平衡。
- Track: Video Generation / Real-time streaming high-resolution generation / Inference optimization
- Key innovation: Proposes Ultra Flash, a cascaded streaming generation framework that scales real-time streaming video generation to high resolutions. By decomposing generation into staged/multi-scale cascades with streaming-friendly scheduling, it achieves ~30 FPS at 1K and ~18 FPS at 2K on a single GPU. The main methodological leap is turning high-res generation into an efficiently streamable cascade, balancing latency, memory, and quality in a scalable pipeline.
- BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension 🆕NEW
- 赛道归属: 视频生成 / 自回归生成 / 行为时长建模(生物行为)
- 核心创新点: 提出 BioVid:以“行为数据自身统计结构”驱动的视频自回归生成框架,针对生物行为中动作时长天然可变的问题,不再把序列长度当作外部固定超参(固定帧数/仅由文本指定),而是引入对行为语义与时长分布的联合建模,使生成的起止边界与真实行为的时序统计更一致,从而提升行为类视频的真实性与可泛化性。
- Track: Video Generation / Autoregressive generation / Duration modeling for biological behavior
- Key innovation: Presents BioVid, a data-driven autoregressive video generation framework that aligns generated temporal boundaries with the intrinsic statistics of real biological behavior. Instead of treating clip length as an externally fixed parameter (fixed frame counts or prompt-imposed duration), it jointly models behavior semantics and variable action durations, producing sequences whose start/end and temporal structure better match real behavioral distributions—improving realism and generalization for behavior-centric videos.
- VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation 🆕NEW
- 赛道归属: 视频生成 / Agentic工作流与评测基准 / 长视频生成
- 核心创新点: 提出 VideoWeaver:面向“长视频生成”这一长时程多模态任务的 agent harness 与benchmark,用于系统评估并“演化”代理的技能(如工具调用、流程构建、迭代修正);区别于手工固定流水线的视频代理,VideoWeaver强调让通用agent自行搭建与优化生成工作流,并通过基准化任务与反馈机制刻画其在长程规划、分解、质量控制与一致性维护上的能力边界。
- Track: Video Generation / Agentic workflows & benchmarking / Long video generation
- Key innovation: Introduces VideoWeaver, an agent harness and benchmark for long-horizon multimodal video generation that evaluates and evolves agent skills (tool use, workflow construction, iterative refinement). Unlike prior hand-crafted video-agent pipelines, it focuses on agents that autonomously build and improve their own workflows, with benchmarked tasks and feedback loops to characterize limits in long-range planning, decomposition, quality control, and consistency maintenance.
- ViMax: Agentic Video Generation 🆕NEW
- 赛道归属: 视频生成 / 多智能体协作(Agentic)/ 长篇叙事一致性
- 核心创新点: 提出 ViMax:以多智能体协作的方式解决长篇视频生成中的“叙事规划+跨场景一致性”难题;通过将叙事决策、视觉一致性(角色/环境/风格)等职责拆分给专门代理并进行协商与协调,把长视频创作建模为可迭代的规划-生成-校验闭环,从机制上弥补短片段生成方法缺乏全局结构与一致性约束的问题。
- Track: Video Generation / Multi-agent (agentic) collaboration / Long-form narrative consistency
- Key innovation: Proposes ViMax, a multi-agent framework targeting long-form video generation’s core challenges: narrative planning and cross-scene visual consistency. By delegating responsibilities (story decisions, character/environment/style consistency, etc.) to specialized agents that negotiate and coordinate, it models long video creation as an iterative plan–generate–verify loop, addressing the lack of global structure and consistency mechanisms in short-clip generation approaches.
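The plan–generate–verify loop mentioned in the ViMax entry above can be pictured as a small control-flow sketch. Everything here is a stand-in: `plan`, `produce`, the flaky generator, and the string-based consistency check are hypothetical and only show the shape of the loop, not ViMax's actual agents or interfaces.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    description: str
    clip: str = ""     # placeholder for a generated video clip
    attempts: int = 0  # how many generations this scene needed


def plan(story):
    """Planner stand-in: one scene per sentence of the story."""
    return [Scene(s.strip()) for s in story.split(".") if s.strip()]


def produce(story, character, generate, verify, max_attempts=3):
    """Plan scenes, then regenerate each one until it passes verification."""
    scenes = plan(story)
    for scene in scenes:
        for attempt in range(max_attempts):
            scene.clip = generate(scene.description, character, attempt)
            scene.attempts = attempt + 1
            if verify(scene.clip, character):
                break
    return scenes


# Hypothetical stubs: the generator "forgets" the character on its first try.
def flaky_generate(description, character, attempt):
    tag = character if attempt > 0 else "unknown"
    return f"[{tag}] {description}"


def character_is_consistent(clip, character):
    return clip.startswith(f"[{character}]")


scenes = produce("Mia enters the lab. Mia finds the map.", "Mia",
                 flaky_generate, character_is_consistent)
for scene in scenes:
    print(scene.attempts, scene.clip)
```

Passing `generate` and `verify` in as functions keeps the loop independent of any particular model; in a real system each would presumably be backed by a specialized agent.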
GitHub
- [2026-06-09] hao-ai-lab/FastVideo ⭐3699
A unified inference and post-training framework for accelerated video generation.
- [2026-06-09] ZeroLu/awesome-seedance ⭐1904
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-06-09] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1327
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-06-09] bytedance/Bernini ⭐644 🆕NEW
Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.
- [2026-06-09] pandayuanyu/NewtonGen ⭐131 🆕NEW
[ICLR 2026] NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
音频生成 / Audio Generation
arXiv
- HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis 🆕NEW
- 赛道归属: 视频配音 / 文本引导的视频到音频生成(Video Dubbing, Text-guided Audio Synthesis)
- 核心创新点: 提出“整体式”视频配音框架,将传统仅做对白的配音扩展为同时生成对白、环境声与音效的统一音轨合成;以文本为高层语义控制信号,对复杂声学场景进行分层/多元素的联合建模与同步生成,从而减少人工后期分轨与混音依赖,实现更完整的端到端配音工作流。
- Track: Video dubbing / Text-guided video-to-audio generation
- Core innovation: Introduces a holistic dubbing framework that goes beyond speech-only dubbing to jointly synthesize dialogue, ambient sounds, and sound effects into a unified soundtrack; uses text as high-level semantic control to model complex acoustic scenes with multi-element, synchronized generation, reducing manual stem-based post-mixing and enabling a more end-to-end dubbing pipeline.
- SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation 🆕NEW
- 赛道归属: 视频到音频生成 / 推理时对齐优化(Video-to-Audio, Inference-time Alignment)
- 核心创新点: 将V2A的视听对齐问题显式建模为“推理时搜索”而非仅依赖训练期目标;针对flow-matching式生成,在采样阶段引入Sequential Monte Carlo(粒子滤波)进行候选轨迹的并行探索与重采样,用对齐度/一致性指标引导生成路径选择,从而在不改动或少改动模型结构的前提下提升时序同步与语义贴合。
- Track: Video-to-audio generation / Inference-time alignment optimization
- Core innovation: Formulates audiovisual alignment in V2A as an explicit inference-time search problem (rather than purely a training objective); for flow-matching generation, applies Sequential Monte Carlo (particle filtering) during sampling to explore and resample candidate trajectories, using alignment/consistency scores to steer generation, improving temporal sync and semantic fit with minimal architectural changes.
- ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
- 赛道归属: 文本到语音(TTS)/ 场景化语音生成(语音+环境声融合)
- 核心创新点: 提出环境感知TTS框架,通过多模态扩散Transformer显式建模语音与环境上下文(如场景/视觉/环境音提示)之间的跨模态交互,解决语音与环境声在声学形态与时间动态上的分布差异;并引入面向领域的表征对齐机制,将“语音生成表征”与“环境/场景表征”在统一空间中对齐,从而实现语音与环境声的自然共存与无缝融合(而非后期拼接)。
- Track: Text-to-Speech (TTS) / Scene-aware speech generation (speech + ambient sound integration)
- Core innovations: Proposes an environment-aware TTS framework that uses a multimodal Diffusion Transformer to explicitly model cross-modal interactions between speech and environmental context (e.g., scene/visual/ambient cues), addressing the distribution and temporal-dynamics mismatch between speech and environmental audio; introduces domain-specific representation alignment to map speech-generation features and environment/scene features into a shared space, enabling coherent in-scene speech generation rather than post-hoc mixing.
- Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement
- 赛道归属: 音频生成(复杂音频场景生成 / 多智能体编排与后期精修)
- 核心创新点: 提出多智能体(multi-agent)框架,将“复杂音频场景描述→长音频成品”的生成过程拆解为可协作的子任务(如对白/音效/音乐/时间结构/后期处理等),通过代理间的规划、编排与迭代式精修实现长时序结构化生成与可控性提升;重点突破在于用系统级的分工与闭环优化机制,缓解单模型端到端生成在长音频一致性、元素协调与制作级后处理上的困难。
- Track: Audio Generation (Complex audio scene generation / multi-agent orchestration & post-production refinement)
- Core innovation: Introduces a multi-agent framework that decomposes complex scene-to-audio generation into coordinated subtasks (e.g., speech, SFX, music, temporal layout, post-processing) divided among cooperating agents, and uses planning/orchestration plus iterative refinement to improve long-form structure and controllability; the key methodological advance is a system-level division-of-labor and closed-loop refinement pipeline that mitigates drift and poor cross-element coordination in monolithic end-to-end models.
- UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
- 赛道归属: 统一音频生成与编辑(Text-to-Audio/TTS/音频编辑一体化,多任务扩散)
- 核心创新点: 用单一潜空间扩散模型统一覆盖文本到音频、文本到语音、零样本音色克隆、语音+音效混合生成、场景级音频编辑与时间编排等任务,实现“同权重多能力”;关键方法是层级式深度LLM融合(将LLM多层隐状态注入扩散网络以增强语义与结构控制)以及面向多任务的统一条件接口/训练范式,使生成与编辑在同一潜空间与同一推理管线内闭环完成,减少任务间割裂与模型堆叠。
- Track: Unified audio generation & editing (Text-to-Audio/TTS/audio editing; multi-task diffusion)
- Core innovations: Introduces a single latent diffusion model that unifies text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, and temporal composition under one set of weights; key is layer-wise deep LLM fusion—injecting multi-layer LLM hidden states into the diffusion network for stronger semantic/structural control—plus a unified conditioning/training scheme so generation and editing operate in the same latent space and inference pipeline, avoiding fragmented task-specific stacks.
- End-to-End Training for Discrete Token LLM based TTS System 🆕NEW
- 赛道归属: 文本到语音 / 端到端训练与对齐(TTS, End-to-End Optimization)
- 核心创新点: 打破“tokenizer—AR LLM—FM声码器”各自独立训练的级联范式,提出统一的端到端优化框架,将语音tokenizer、离散token LLM、flow-matching生成器与奖励模型纳入同一训练闭环;通过联合优化缓解离散化误差与模块间目标不一致问题,并用RM提供面向感知质量/可控性的训练信号,实现更一致的全链路TTS生成。
- Track: Text-to-speech / End-to-end training and alignment
- Core innovation: Replaces the independently trained tokenizer–AR LLM–flow-matching vocoder cascade with a unified end-to-end optimization framework that jointly trains the speech tokenizer, discrete-token LLM, flow-matching generator, and an added reward model (RM); mitigates quantization error and inter-module objective mismatch, while leveraging RM-based signals to improve perceptual quality/controllability in a fully consistent TTS pipeline.
- FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation 🆕NEW
- 赛道归属: 流式文本到语音 / 低延迟推理加速(Streaming TTS, Inference Acceleration)
- 核心创新点: 面向对话式场景的流式TTS,针对两大瓶颈(AR预测慢、FM多步采样慢)提出组合式加速:用MTP(多token并行/多步预测)降低自回归解码时延,并通过X-pred mean flow distillation将多步flow采样蒸馏为更少步数的快速生成;同时强调原生流式输入输出能力,降低端到端首包与持续延迟。
- Track: Streaming text-to-speech / Low-latency inference acceleration
- Core innovation: Targets streaming TTS latency by jointly addressing slow AR decoding and multi-step flow sampling: uses MTP-style multi-token/multi-step prediction to reduce autoregressive delay, and applies X-pred mean flow distillation to compress multi-step flow sampling into fewer steps; emphasizes native streaming I/O to reduce both time-to-first-audio and ongoing latency.
- BareWave: Waveform-Native Flow-Matching Text-to-Speech 🆕NEW
- 赛道归属: 文本到语音 / 波形端到端生成(Waveform-native TTS, Flow-Matching)
- 核心创新点: 提出完全“波形原生”的flow-matching TTS,绕开中间声学表征(如mel谱)与单独训练的解码/声码器阶段,实现直接从文本到raw waveform的生成;围绕raw波形建模带来的训练难点(高频细节、长序列、稳定对齐等)设计针对性的训练与建模策略,使端到端波形生成在质量与可训练性上可行。
- Track: Text-to-speech / Waveform-native generation (flow-matching)
- Core innovation: Presents a fully waveform-native flow-matching TTS that removes intermediate acoustic representations (e.g., mel) and separately trained decoding/vocoder stages, generating raw waveform directly from text; proposes training/modeling techniques to tackle raw-waveform challenges (high-frequency detail, long sequences, stable alignment), making end-to-end waveform generation practical in both quality and trainability.
- TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech 🆕NEW
- 赛道归属: 文本到语音 / Token压缩与高效自回归建模(Efficient AR TTS, Token Compression)
- 核心创新点: 针对codec离散token序列远长于文本导致的AR计算与KV cache膨胀瓶颈,提出TLDR式音频token压缩机制,在尽量保留语音信息的前提下减少序列长度,从结构上降低每步因果计算与缓存开销;从而在不牺牲(或尽量少牺牲)音质的情况下显著提升推理吞吐与降低显存占用,改善长音频生成效率。
- Track: Text-to-speech / Efficient autoregressive modeling via token compression
- Core innovation: Addresses the efficiency bottleneck of long codec-token sequences in AR TTS (heavy causal compute and growing KV cache) by introducing a TLDR-style audio token compression scheme that shortens sequences while preserving speech information; structurally reduces per-step computation and cache cost, improving throughput and memory for long-form generation with minimal quality loss.
- Audio Imitator: Controlling Timbre and Tempo in Video2Audio Synthesis with Audio Reference
- 赛道归属: 音频生成(Video-to-Audio / 参考音频驱动的风格可控合成)
- 核心创新点: 提出属性感知的Video2Audio框架,将参考音频中的“音色(timbre)”与“速度/节奏(tempo)”显式建模为可控属性,而非把参考音频当作整体条件直接注入;通过对风格属性的解耦表示与定向条件化,实现对生成音频风格维度的细粒度控制,同时保持与视频语义和时间对齐的一致性。
- Track: Audio Generation (Video-to-Audio / reference-audio-driven controllable synthesis)
- Core innovation: Proposes an attribute-aware Video2Audio method that explicitly models timbre and tempo from reference audio as disentangled, controllable attributes rather than using the reference as a single holistic condition; this enables fine-grained style control (timbre/tempo) while preserving semantic consistency and temporal alignment with the input video.
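The Sequential Monte Carlo idea in the SMC-ITA entry above (explore candidate trajectories, weight them by an alignment score, resample) can be illustrated with a toy particle filter. This is a minimal sketch under loud assumptions: the particles are scalars rather than audio latents, a Gaussian random walk stands in for one sampler step, and `alignment_score` is a made-up function that peaks at 2.0. None of these details come from the paper.

```python
import math
import random


def alignment_score(x):
    """Hypothetical alignment score: highest when the particle is at 2.0."""
    return -(x - 2.0) ** 2


def smc_guided_sampling(score_fn, num_particles=200, num_steps=20,
                        step_noise=0.3, seed=0):
    """Toy SMC: propagate particles, weight them by exp(score), resample each step."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    for _ in range(num_steps):
        # Propagate: a random walk stands in for one step of the generator's sampler.
        particles = [x + rng.gauss(0.0, step_noise) for x in particles]
        # Weight: exponentiate scores, shifted by the max for numerical stability.
        scores = [score_fn(x) for x in particles]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        # Resample: draw particles in proportion to their weights (multinomial).
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return particles


final = smc_guided_sampling(alignment_score)
print(sum(final) / len(final))
```

The particles start around 0 and are pulled toward the high-score region by resampling alone; the propagation step itself knows nothing about the score, which mirrors the "no or few changes to the model" framing in the entry.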
GitHub
- [2026-06-09] huggingface/diffusers ⭐33817
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-06-08] BinWang28/audio-ai-hub ⭐931
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
- [2026-06-05] apocas/restai ⭐510
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-06-08] xiaomi-research/controlfoley ⭐131
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
- [2026-06-08] dgrauet/ltx-2-mlx ⭐56
Pure MLX port of LTX-2 (Lightricks LTX-2.3) for Apple Silicon — video + audio generation
HuggingFace Models
语言大模型 / Large Language Models
arXiv
- Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
- 赛道归属: 推理优化(可控推理/测试时推理控制)
- 核心创新点: 将“推理过程如何展开”的控制显式化为一个马尔可夫决策过程(MDP):引入控制器智能体在推理时按状态自适应决策(如继续思考、切换策略、停止等),以最小化无效token消耗并在不显著牺牲准确率的前提下实现可控的推理长度与推理轨迹;相较仅做截断/早停/压缩的效率方法,ACTS把“思考策略”作为可学习/可调度的动作空间,从而提供更细粒度的推理时控制与效率-性能权衡。
- Track: Reasoning optimization (controllable inference / test-time reasoning control)
- Key innovation: Makes “how the model reasons” an explicit control problem by formulating chain-of-thought steering as an MDP: a controller agent adaptively selects actions at inference (e.g., continue, change strategy, stop) based on the current reasoning state, reducing wasted tokens while maintaining accuracy and enabling controllable reasoning length/trajectory. Unlike prior efficiency methods that mainly shorten/early-stop/compress traces, ACTS (Agentic Chain-of-Thought Steering) treats reasoning strategy as an explicit, schedulable action space for finer-grained control over the efficiency–accuracy trade-off.
- An Asymptotic Theory of Chain-of-Thought in In-Context Learning
- 赛道归属: 理论分析(In-Context Learning / Chain-of-Thought 机理与尺度律)
- 核心创新点: 在一个可解析的理论模型中刻画CoT深度与泛化性能的尺度行为:将测试时CoT推理形式化为对线性回归中“权重参数估计”的迭代精炼过程(iterative refinement),从而推导随推理步数增加时误差/泛化的渐近规律与收益递减条件;该框架把“CoT=迭代算法”的观点落到可证明的渐近理论上,为理解何时加深CoT有效、何时无效提供了可计算的判据。
- Track: Theoretical analysis (in-context learning / chain-of-thought mechanism & scaling laws)
- Key innovation: Develops an analytically solvable model to characterize how generalization scales with CoT depth: models test-time CoT as iterative refinement of the weight-parameter estimate in linear regression (in-context weight prediction), enabling asymptotic derivations of error/generalization behavior as the number of reasoning steps grows and identifying regimes of diminishing returns. This provides provable, computable criteria for when deeper CoT helps versus when it does not.
- Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning
- 赛道归属: 多模态推理(MLLM Chain-of-Thought 对齐/微调优化)
- 核心创新点: 通过系统性实证分析指出多模态 CoT 在视觉推理中常“越想越错”,并归因于两类稳定失败模式:过早锁定答案(premature answer commitment)与对直接视觉证据利用不足(limited direct visual evidence usage)。在此基础上提出“注意力引导的微调”思路:利用/约束模型注意力分配,使推理步骤更聚焦于与当前推理相关的视觉区域与证据链,从训练层面纠正 CoT 生成时的证据对齐与决策时机问题,从而提升多模态逐步推理的可靠性与可解释性。
- Track: Multimodal reasoning (MLLM Chain-of-Thought alignment / fine-tuning optimization)
- Key innovation: Provides a systematic study showing that CoT prompting can hurt visual reasoning in MLLMs, and identifies two recurring failure modes: premature answer commitment and insufficient use of direct visual evidence. Building on these findings, it proposes an attention-guided fine-tuning strategy that steers/regularizes attention to align each reasoning step with the relevant visual regions and evidence, correcting evidence grounding and decision timing during CoT generation to improve step-wise multimodal reasoning robustness.
- COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
- 赛道归属: 公平性可控解码 / 推理阶段偏见抑制(LLM Decoding for Fairness in CoT)
- 核心创新点: 提出一种无需训练、仅在解码阶段生效的公平性控制方法 COFT,用于抑制链式思维(CoT)生成中的社会偏见放大。方法上以反事实提示构造 + 共形预测(Conformal)约束为核心:先将提示中的敏感片段替换为中性占位符形成“掩码反事实”输入,以获得相对去偏的参考分布;再在token 级别对原始解码分布施加公平性约束,并通过分布无关(distribution-free)的边际有效性保证(在 exchangeability 假设下)为公平控制提供可验证的统计保证,从而实现对任意冻结的因果语言模型在推理时的可控去偏解码。
- Track: Fairness-controlled decoding / Inference-time bias mitigation for CoT (LLM Decoding for Fairness in CoT)
- Key innovation: Introduces COFT, a training-free, decoding-time method to curb bias amplification in chain-of-thought generation. The technical core combines counterfactual prompt masking with conformal prediction constraints: it first replaces sensitive spans with neutral placeholders to form a masked counterfactual prompt, yielding a relatively debiased reference distribution; it then enforces token-level fairness constraints on the original decoding distribution, with distribution-free marginal validity guarantees (under exchangeability). The method applies to any frozen causal LM, enabling verifiable, model-agnostic fairness control at inference time.
- Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
- 赛道归属: 语音大模型推理诊断与对齐(Speech LLM Reasoning / 语音-文本推理鲁棒性)
- 核心创新点: 提出并验证“实体绑定失败(entity binding failure)”是语音LLM在复杂推理中相对文本LLM性能塌陷的关键、且高度局部化的原因:通过对多种任务分解评测,发现S2T在空间/句法/事实类任务不弱于T2T,但在需要持续实体跟踪的逻辑推理任务上准确率降至随机水平;进一步将退化机制归因于连续语音表征导致的实体-属性/关系绑定不稳,从而把“模态差距”从笼统能力不足细化为可诊断的绑定机制问题,并提出基于Chain-of-Thought的干预思路以强化实体跟踪与绑定过程。
- Track: Speech LLM reasoning diagnosis & alignment (speech-text reasoning robustness)
- Core innovation: Identifies and empirically validates a localized failure mode—entity binding failure—as the main driver of the reasoning gap between speech LLMs and text LLMs: via task-factorized evaluation, shows S2T matches/exceeds T2T on spatial/syntactic/factual tasks, but collapses to chance on logical tasks requiring persistent entity tracking; attributes the degradation to instability in binding entities to attributes/relations induced by continuous speech representations, reframing the “modality gap” into a concrete, diagnosable binding-mechanism issue and proposing Chain-of-Thought-based interventions to reinforce entity tracking/binding.
- LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models
- 赛道归属: 跨语言主题建模(NLP 表征学习/主题模型 + LLM增强)
- 核心创新点: 提出LLM-XTM,将LLM用于“主题层面”的跨语言对齐与可解释性提升,同时通过自一致性不确定性估计抑制幻觉并降低对不可获得的token概率(白盒接口)的依赖:用LLM引导的主题精炼(topic refinement)替代昂贵且易漂移的文档级改写/标注式增强,并以不确定性驱动的自一致性机制筛选/聚合LLM建议,使跨语言主题更连贯、对齐更稳健,且在资源稀缺的双语条件下仍可工作。
- Track: Cross-lingual topic modeling (topic models + LLM augmentation)
- Key innovation: Proposes LLM-XTM, using LLMs for topic-level cross-lingual alignment and interpretability while mitigating hallucinations via self-consistency–based uncertainty estimation and avoiding reliance on inaccessible token-probability (white-box) APIs. It replaces costly, drift-prone document-level LLM refinements with LLM-guided topic refinement, and uses uncertainty-driven selection/aggregation of LLM suggestions to yield more coherent and better-aligned multilingual topics under sparse bilingual resources.
- SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
- 赛道归属: 推理优化(CoT 长度自适应控制 / 高效推理)
- 核心创新点: 提出 SmartThinker 的“渐进式 CoT 长度校准”框架,针对长推理模型在不同难度问题上普遍存在的冗余与过度思考,突破点在于将“长度控制”从静态奖励(对所有样本一刀切)升级为随题目难度动态调整的策略。方法上通过逐步(progressive)校准推理链长度,使模型在简单问题上自动收敛到更短、更经济的推理,在困难问题上保留必要的长推理,从而在尽量不损失准确率的前提下显著降低输出冗余与推理成本,并弥补现有 GRPO 静态长度奖励无法自适应难度的缺陷。
- Track: Reasoning optimization (adaptive CoT length control / efficient inference)
- Key innovation: Introduces SmartThinker, a progressive CoT length calibration framework to reduce redundancy and overthinking in long-reasoning models. The key methodological advance is replacing static, one-size-fits-all length rewards (common in GRPO-based approaches) with a difficulty-adaptive mechanism that progressively calibrates reasoning length: it encourages short, cost-efficient reasoning on easy problems while preserving longer chains when needed for hard ones, improving efficiency with minimal accuracy degradation and addressing the non-adaptivity of static length reward designs.
- Visual Instruction Tuning Aligns Modalities through Abstraction
- 赛道归属: 多模态理解与视觉指令微调(Vision-Language Instruction Tuning / 跨模态对齐机制)
- 核心创新点: 从“层级抽象”视角系统揭示视觉指令微调如何实现跨模态对齐:通过跨多种视觉-语言架构的层间分析,发现指令微调的主要作用并非让视觉信息逐层经过LLM早期的单模态处理层,而是作为“桥接器”将视觉特征直接注入LLM的中间语义层,在抽象层面完成对齐并绕过早期层;该结论为设计更高效的视觉接入方式(如选择性注入层、减少无效早期融合)提供了机制性依据,而不仅是经验性配方。
- Track: Multimodal understanding & visual instruction tuning (vision-language alignment mechanisms)
- Core innovation: Provides a layer-wise abstraction account of how visual instruction tuning aligns modalities: across diverse VLM architectures, shows instruction tuning mainly acts as a bridge that embeds visual features directly into intermediate semantic layers of the LLM backbone, largely bypassing early unimodal layers; this mechanistic finding supports more principled designs for visual integration (e.g., selective layer injection and avoiding inefficient early fusion) beyond recipe-style tuning.
- Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation
- 赛道归属: LLM增强推荐系统(LLM4Rec)/ 后训练对齐(SFT+RL)/ 多目标强化学习优化
- 核心创新点: 提出一种面向工业LLM推荐的“语义空间—ID空间”可控对齐框架,通过将语义质量收益与ID推荐效果建模为相互制约的多目标优化问题,学习帕累托最优的策略集合而非单一加权目标,从而显式刻画并可调节两类奖励的权衡;同时针对开放域推荐中CoT质量难以度量与提升的问题,引入可操作的训练信号/优化机制,使策略优化能够在语义推理质量与ID点击/排序指标之间稳定迁移,缓解传统SFT/RL在该场景下的对齐瓶颈。
- Track: LLM-enhanced Recommender Systems (LLM4Rec) / Post-training alignment (SFT+RL) / Multi-objective RL optimization
- Core innovation: Proposes an industrial LLM4Rec alignment framework that makes the “semantic space vs. ID space” alignment explicitly controllable by formulating semantic-quality gains and ID-based recommendation performance as competing objectives in a multi-objective optimization problem, and learning a Pareto-optimal set of policies instead of a single scalarized objective, thereby exposing and tuning the trade-off between semantic rewards and ID ranking/click metrics. It also addresses the difficulty of measuring/improving CoT quality in open-domain recommendation by introducing actionable training signals/optimization mechanisms so policy optimization can reliably balance semantic reasoning quality with ID-based recommendation KPIs, mitigating key bottlenecks of prior SFT/RL paradigms.
- Rethinking the Divergence Regularization in LLM RL 🆕NEW
- 赛道归属: 大语言模型后训练(RLHF/RLAIF)— 强化学习优化与信赖域/正则化(PPO/GRPO/DPPO改进)
- 核心创新点: 该工作针对LLM强化学习中常见的离策略(policy staleness、训练-推理分布不一致)导致的优化不稳定问题,重新审视“发散度正则/信赖域约束”的实现方式。指出PPO/GRPO依赖的importance ratio clipping在长尾词表下难以真实刻画分布漂移(ratio对分布差异的代理性不足),从而可能带来错误的信赖域控制。论文的关键突破在于:从“分布偏移度量”的角度重构正则项/约束设计,使其更直接、更可靠地约束新旧策略的行为分布差异(而非仅约束token级ratio),以提升离策略场景下的稳定性与可控性,并与DPPO等近期方法形成对比与统一分析框架。
- Track: LLM post-training (RLHF/RLAIF) — RL optimization with trust-region/divergence regularization (PPO/GRPO/DPPO improvements)
- Core innovation: This work targets instability in LLM RL caused by practical off-policy effects (policy staleness and train–inference mismatch) and revisits how “divergence regularization / trust-region control” should be implemented. It argues that the ratio-clipping mechanism used in PPO/GRPO relies on importance ratios that can be a poor surrogate for true distribution shift under long-tailed vocabularies, leading to mis-calibrated trust-region control. The main methodological contribution is to redesign the regularization/constraint from a distribution-shift measurement perspective—more directly constraining behavioral distribution changes between old and new policies rather than token-level ratio proxies—thereby improving stability and controllability in off-policy settings, and providing a comparative/unifying lens relative to recent approaches like DPPO.
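The claim in the last entry above, that a per-token importance ratio can be a poor proxy for distribution shift under a long-tailed vocabulary, can be illustrated with a toy two-token vocabulary. This is a minimal sketch, not the paper's method: the digest does not say which divergence the authors use, so total variation and KL are stand-ins chosen here, and all probabilities and the clip width `eps=0.2` are made-up values.

```python
import math

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) for categorical distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def is_clipped(ratio, eps=0.2):
    """True when a PPO-style ratio falls outside [1 - eps, 1 + eps]."""
    return ratio < 1.0 - eps or ratio > 1.0 + eps

# Case A: a rare (tail) token is sampled; its probability moves from 1e-6 to 1e-4.
old_a, new_a = [0.999999, 0.000001], [0.9999, 0.0001]
ratio_a = new_a[1] / old_a[1]          # about 100, far outside the clip range
tv_a = total_variation(old_a, new_a)   # about 1e-4, the distribution barely moved

# Case B: a common (head) token is sampled; its probability moves from 0.50 to 0.42.
old_b, new_b = [0.5, 0.5], [0.42, 0.58]
ratio_b = new_b[0] / old_b[0]          # 0.84, inside the clip range
tv_b = total_variation(old_b, new_b)   # 0.08, a much larger shift

for name, ratio, tv, kl in [
    ("tail token", ratio_a, tv_a, kl_divergence(old_a, new_a)),
    ("head token", ratio_b, tv_b, kl_divergence(old_b, new_b)),
]:
    print(f"{name}: ratio={ratio:.2f} clipped={is_clipped(ratio)} TV={tv:.6f} KL={kl:.6f}")
```

In this toy setting the ratio test flags the update that barely changes the distribution and passes the one that changes it substantially, which is the mismatch a distribution-level constraint is meant to avoid.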
GitHub
- [2026-06-10] sgl-project/sglang ⭐28896
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-06-10] NVIDIA/TensorRT-LLM ⭐13839
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-06-10] google-ai-edge/LiteRT-LM ⭐5515
LiteRT-LM is Google's production-ready, high-performance, open-source inference framework for deploying Large Language Models on edge devices.
- [2026-06-10] chrisliu298/awesome-llm-unlearning ⭐596 🆕NEW
A resource repository for machine unlearning in large language models
- [2026-06-10] mikexcohen/LLM_course ⭐283 🆕NEW
Code files for course "A deep understanding of AI large language model mechanisms"
HuggingFace Datasets
- [2026-05-28] openbmb/UltraData-SFT-2605
UltraData-SFT-2605
Ult...
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2026-06-04] nvidia/Nemotron-Pretraining-Code-v3
Nemotron-Pretraining-Code-v3 Dataset Description:
The Nemotron-Pretraining-Code-v3 dataset is part of the Nemotron Pretr...
- [2026-05-28] openbmb/Ultra-FineWeb-L3
Ultra-FineWeb-L3
...
- [2026-06-03] OpenClaw/clawhub-security-signals
ClawHub Security Signals
ClawHub Security Signals is a saniti...
多模态大模型 / Multimodal Models
arXiv
- MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models
- 赛道归属: 多模态理解(MLLM可解释性/表征分析与诊断)
- 核心创新点: 提出一套面向MLLM内部表征的系统化“显微镜”分析框架,沿Transformer层级同时刻画多模态token嵌入的线性度、内在维度与各向异性,并区分主干流与残差流进行对照诊断;在ScienceQA上对LLaVA-NeXT与OmniFusion做跨模型、跨模态的层间结构测量,揭示多模态token在不同流与不同层中呈现高度线性等隐藏结构特征,为后续的可解释性、压缩与对齐机制设计提供可量化的表征指标体系。
- Track: Multimodal understanding (MLLM interpretability / representation analysis & diagnostics)
- Core innovation: Introduces a “microscope”-style, layer-wise diagnostic framework to probe hidden representations in MLLMs by jointly measuring linearity, intrinsic dimension, and anisotropy of multimodal token embeddings, explicitly contrasting main vs. residual streams. Evaluated on ScienceQA with LLaVA-NeXT and OmniFusion, it provides cross-model, cross-modality structural measurements that uncover highly linear behaviors and other latent geometric properties, yielding actionable, quantitative representation metrics for interpretability, compression, and alignment design.
- Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness
- 赛道归属: 多模态安全与可信(开放世界异常检测/拒识、VLM鲁棒性)
- 核心创新点: 提出“语义自负(Hubris of Semantics)”作为开放世界部署中的关键失效模式:VLM会将未知异常强行映射到已知语义并高置信输出。方法上以“生成式语义抗体(Generative Semantic Antibodies)”为核心机制,为模型显式注入“负知识/反语义”以形成可拒识的决策边界,从而在不破坏原有零样本语义对齐能力的前提下提升开放世界可信性与异常处理能力。
- Track: Multimodal safety & trustworthiness (open-world anomaly detection/rejection, VLM robustness)
- Key innovation: Identifies “Hubris of Semantics” as a core open-world failure where VLMs over-confidently force unknown anomalies into known semantic classes. Introduces “Generative Semantic Antibodies” to explicitly inject negative knowledge/counter-semantics, shaping rejectable decision boundaries while preserving zero-shot semantic alignment, improving open-world trustworthiness.
- mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models 🆕NEW
- 赛道归属: 多模态可解释性(Text-Audio MLLM 归因解释 / Shapley Value)
- 核心创新点: 提出开源框架将 Shapley Value 归因从纯文本 LLM 扩展到文本-音频多模态场景,关键方法突破在于:1)模态感知的 coalition masking,针对离散文本 token 与稠密音频编码帧的交织处理,设计可控的遮蔽与组合策略,使 SV 估计在跨模态输入上可计算且语义一致;2)面向多轮对话的归因机制,将上下文轮次与当前输入共同纳入贡献分解,支持分析跨轮次、跨模态的因果贡献;3)以工程化平台形式提供统一接口与流程,降低多模态 SV 解释在不同 MLLM/音频前端上的复用门槛。
- Track: Multimodal Explainability (Text-Audio MLLM attribution / Shapley Value)
- Key innovations: Introduces an open-source framework that extends Shapley Value attribution from text-only LLMs to joint text-audio MLLMs. The main methodological advances are: 1) modality-aware coalition masking that handles the interleaved nature of discrete text tokens and dense audio encoder frames, enabling tractable and semantically consistent SV estimation across modalities; 2) multi-turn conversation attribution, decomposing contributions across dialogue history and current multimodal inputs to analyze cross-turn, cross-modal influence; 3) an engineering-oriented platform with unified APIs/workflows to make multimodal SV explainability reusable across different MLLMs and audio front-ends.
- Cross-modal linkage risk in clinical vision-language models
- 赛道归属: 多模态安全与隐私(视觉-语言模型的链接攻击/成员关联风险评估)
- 核心创新点: 将临床VLM的隐私问题形式化为跨模态重链接(image-to-report linkage)风险:即模型学习到的共享嵌入空间可能保留实例级对应关系,使攻击者仅凭余弦相似度检索即可把去标识化影像重新关联到原始放射学报告;提出相应的威胁模型与评测设定,用以量化在“影像与报告被刻意分离共享/访问控制”的真实流程下,嵌入对齐带来的可重识别性,从而把“表征对齐能力”转化为可度量的隐私攻击面。
- Track: Multimodal security & privacy (vision-language linkage attacks / instance re-identification risk)
- Core innovation: Formalizes a clinical VLM privacy threat as cross-modal re-linkage (image-to-report linkage) risk: the shared embedding space can preserve instance-level correspondence, enabling attackers to re-associate a de-identified radiograph with its original report via cosine-similarity retrieval alone. It defines a concrete threat model and evaluation protocol aligned with real-world workflows where images and reports are intentionally separated, turning representation alignment strength into a measurable privacy attack surface.
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
- 赛道归属: 多模态理解(VLM幻觉抑制、跨模态融合/注意力机制改进)
- 核心创新点: 从“视觉注意力汇聚/沉没(attention sink)”角度解释幻觉:并非简单的“语言先验过强”,而是视觉注意力被任务无关区域吸走导致视觉证据未被有效融合。提出利用“注视转移(gaze shifts)”信号来指导跨模态融合增强:通过建模视线在关键区域间的动态转移,重分配视觉-文本对齐时的注意力与融合权重,避免仅按原始注意力分数做放大而加剧偏置,从机制上降低不可证实内容生成。
- Track: Multimodal understanding (VLM hallucination mitigation, cross-modal fusion/attention)
- Key innovation: Reframes hallucination via a “visual attention sink” mechanism—visual attention is diverted to irrelevant regions, preventing evidence from being fused. Uses “gaze shifts” as guidance signals to enhance cross-modal fusion by modeling dynamic transitions between salient regions, reweighting alignment/fusion beyond naive attention amplification, thereby reducing unsupported generations.
- Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement
- 赛道归属: 多模态模型压缩与端侧部署(知识蒸馏/对齐增强)
- 核心创新点: 提出 Align-KD,将“大模型的跨模态对齐能力”作为可蒸馏的核心知识而非仅蒸馏输出分布/特征;通过显式对齐约束与跨模态一致性信号,把教师VLM在图文对齐、语义绑定等能力迁移到轻量学生模型,从而在移动端/边缘设备的参数与算力受限条件下,尽量减少模型缩小带来的对齐与理解能力退化。
- Track: Multimodal model compression & on-device deployment (knowledge distillation / alignment enhancement)
- Key innovation: Proposes Align-KD, treating cross-modal alignment as the primary distillable knowledge rather than only logits/features; it introduces explicit alignment constraints and cross-modal consistency signals to transfer the teacher VLM’s image-text grounding/alignment capability to a compact student, mitigating the alignment and understanding degradation typically caused by aggressive downsizing for mobile/edge settings.
- VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments
- 赛道归属: 具身智能与机器人定位(语义全局定位、VLM+概率滤波/Monte Carlo Localization)
- 核心创新点: 将VLM的开放词汇语义理解引入Monte Carlo Localization(MCL)框架,面向“几何与语义都高度混淆”的准静态室内环境(如货架平行通道、重复家具)提升全局定位鲁棒性。核心在于用VLM生成/评估与场景观测一致的语义证据,并将其作为观测模型或粒子权重更新信号,与传统几何/外观特征互补,从而在几何别名严重、语义长尾且遮挡杂乱的场景中实现更稳定的语义级全局定位。
- Track: Embodied AI & robot localization (semantic global localization, VLM + probabilistic filtering/MCL)
- Key innovation: Integrates open-vocabulary semantic understanding from VLMs into a Monte Carlo Localization pipeline to handle quasi-static indoor environments with strong geometric/semantic aliasing. Uses VLM-derived semantic evidence as an observation/weighting signal for particle updates, complementing geometric/appearance cues to improve robustness under severe aliasing, long-tail semantics, and clutter/occlusion.
- ES-Merging: Biological MLLM Merging via Embedding Space Signals
- 赛道归属: 多模态模型融合(模型合并/参数高效跨模态统一,生物科学MLLM)
- 核心创新点: 提出ES-Merging,用嵌入空间信号(embedding space signals)来指导生物领域MLLM的合并:不再依赖输入无关的参数空间启发式,而是利用各模型在嵌入空间中体现的模态专长与对齐特征来决定合并策略/权重,从而更忠实地保留不同单模态模型的能力并实现跨模态统一;该思路把“模态专门化”从难以观测的参数差异,转化为可直接度量与可优化的表征信号,提高合并后的跨模态任务适配性。
- Track: Multimodal model merging (parameter-efficient cross-modal unification for biological MLLMs)
- Core innovation: Proposes ES-Merging, a model-merging method for biological MLLMs guided by embedding-space signals rather than input-agnostic parameter-space heuristics. By leveraging representation-level cues that reflect modality specialization and alignment, it determines merging behavior/weights to better preserve complementary single-modality strengths while forming a unified cross-modal model. The key methodological shift is making “modality specialization” observable and optimizable through measurable embedding signals, improving post-merge cross-modal capability.
- SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM 🆕NEW
- 赛道归属: 多模态视频理解(视频时刻检索 / Temporal Moment Retrieval)
- 核心创新点: 提出 Shot-Aware 的视频时刻检索框架,核心在于将镜头级结构化时间建模与音频增强的 MLLM结合:1)用镜头边界/shot 作为中层时间单元,替代粗粒度时间切片,实现更精确的时序定位与跨镜头语义对齐;2)引入音频信息增强检索,在复杂视频中利用语音/环境声等线索补足纯视觉的歧义,提升对事件边界与语义触发点的辨识;3)以 MLLM 作为跨模态语义对齐与推理核心,将文本查询与视听证据融合以输出更可靠的时间段定位。
- Track: Multimodal Video Understanding (Temporal Moment Retrieval)
- Key innovations: Proposes a shot-aware moment retrieval framework that combines structured shot-level temporal modeling with an audio-enhanced MLLM: 1) uses shots as mid-level temporal units instead of coarse clips, improving precise localization and cross-shot semantic alignment; 2) leverages audio cues (speech/ambient sounds) to reduce visual ambiguity and better detect event boundaries and triggers; 3) employs an MLLM as the core for cross-modal alignment and reasoning, fusing text queries with audiovisual evidence for more reliable segment localization.
- OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics 🆕NEW
- 赛道归属: 多模态智能体评测(VLM 游戏智能体基准 / 交互式环境 Benchmark)
- 核心创新点: 构建统一的 UE5 实时游戏基准以评测 VLM 游戏智能体,并在评测方法上做出关键改进:1)提出统一协议,在同一套任务与度量下公平比较商业闭源 VLM、开源权重 VLM 与专用游戏策略;2)从“单次首尝试得分”扩展为改进动态(improvement dynamics)评估,显式衡量智能体在多次尝试/反馈下的学习与适应轨迹,而非静态能力点;3)覆盖多款新建 UE5 游戏并支持实时交互,强化对感知-决策-执行闭环、长时序任务与泛化能力的系统性压力测试。
- Track: Multimodal Agent Evaluation (VLM game-agent benchmark / interactive environments)
- Key innovations: Builds a unified real-time UE5 benchmark for evaluating VLM-based game agents with key methodological upgrades: 1) a unified evaluation protocol that enables fair comparison across commercial closed models, open-weight VLMs, and specialized game policies; 2) moves beyond single first-try scores to measure improvement dynamics, explicitly capturing learning/adaptation trajectories over repeated attempts/feedback; 3) includes multiple newly built UE5 games with real-time interaction, stressing the full perception–decision–action loop, long-horizon tasks, and generalization.
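The difference between a single first-try score and an improvement-dynamics view, as described in the last entry above, can be sketched with a few lines of bookkeeping. This is only an illustration under assumptions: the benchmark's actual metrics are not given in this digest, so the `improvement_dynamics` helper, the "best so far" curve, the gain definition, and the per-attempt scores below are all hypothetical.

```python
def improvement_dynamics(scores):
    """Summarize per-attempt scores: first-try score, best-so-far curve, total gain."""
    first_try = scores[0]
    best_so_far = []
    best = float("-inf")
    for score in scores:
        best = max(best, score)
        best_so_far.append(best)
    return {
        "first_try": first_try,
        "best_so_far": best_so_far,
        "gain": best_so_far[-1] - first_try,
    }

# Hypothetical per-attempt scores for two agents on the same task.
agent_a = improvement_dynamics([0.60, 0.61, 0.60, 0.62])  # strong start, little adaptation
agent_b = improvement_dynamics([0.40, 0.55, 0.70, 0.80])  # weak start, steady improvement

for name, summary in [("agent_a", agent_a), ("agent_b", agent_b)]:
    print(name, summary)
```

A first-try leaderboard would rank `agent_a` ahead, while the gain over repeated attempts favors `agent_b`; reporting both keeps static capability and adaptation separate.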
GitHub
- [2026-06-09] Blaizzy/mlx-vlm ⭐4995
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-06-08] NVlabs/Eagle ⭐2313
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
- [2026-06-09] waybarrios/vllm-mlx ⭐1313
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-06-08] Roots-Automation/GutenOCR ⭐189 🆕NEW
Open-source tools for training and evaluating Vision Language Models for OCR
- [2026-06-09] ZJU-REAL/SpatialLadder ⭐96 🆕NEW
[ICLR 2026] SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
强化学习 / Reinforcement Learning
arXiv
- ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning 🆕NEW
- 赛道归属: 大语言模型推理强化学习(RLVR / RL for Reasoning)
- 核心创新点: 在传统RLVR稀疏二值可验证奖励之外,引入基于模型log-prob的token级置信度信号,将“模型内部不确定性”显式纳入策略优化;在Group Relative Policy Optimization等相对优势类目标上进行置信度感知的加权/整形,使训练不仅追求通过验证的最终答案,还能在生成过程中更稳定地强化高置信推理路径、抑制低置信步骤,从而提升推理可控性与样本效率。
- Track: Reinforcement Learning for LLM Reasoning (RLVR)
- Core innovation: Augments sparse binary verifiable rewards with token-level confidence signals derived from model log-probabilities, explicitly incorporating internal uncertainty into policy optimization. Built on relative-advantage objectives (e.g., Group Relative Policy Optimization), it applies confidence-aware weighting/shaping so training reinforces high-confidence reasoning trajectories and suppresses low-confidence steps, improving controllability and sample efficiency beyond final-answer-only supervision.
- CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
- 赛道归属: 多领域LLM强化学习对齐(跨域冲突缓解 / 奖励建模)
- 核心创新点: 提出CARE-RL,将“协议感知”的生成式奖励与“能力感知”的优化联合起来解决多领域RL中的两类关键瓶颈:一是非可验证任务奖励不可靠,二是跨领域能力相互干扰。方法上通过Protocol-Aware Generative Reward Model(PA-GRM)在提示/协议层面构造更稳健的奖励信号以覆盖不可验证场景,并在优化阶段引入能力维度的约束/加权机制,使更新更聚焦于目标能力、减少对其他领域能力的负迁移,从而系统性缓解cross-domain conflicts。
- Track: Multi-domain LLM RL alignment (cross-domain conflict mitigation / reward modeling)
- Key innovations: Proposes CARE-RL, combining protocol-aware generative reward construction with capability-aware optimization to tackle two core issues in multi-domain RL: unreliable rewards for non-verifiable tasks and capability interference across domains. It introduces a Protocol-Aware Generative Reward Model (PA-GRM) that builds more robust reward signals at the prompt/protocol level for non-verifiable settings, and a capability-aware optimization scheme that constrains/weights updates along capability dimensions to focus learning on target skills while reducing negative transfer to other domains.
- Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
- 赛道归属: 自监督强化学习 / 目标条件长时序规划(Goal-conditioned RL)
- 核心创新点: 提出Survival Reinforcement Learning(SRL)作为对比式自监督RL(CRL)的替代范式,用在线分类式目标判别取代对比损失,规避对比学习在长时序规划中“uniformity–tolerance”两难导致的表征退化/目标区分不足问题;将“survival value learning”扩展为通过最大化到达目标后的驻留时间(dwell time)来学习可用于长视野目标条件控制的价值信号,从而在深网络可扩展性与长时序可规划性之间取得更稳健的折中。
- Track: Self-supervised RL / Goal-conditioned long-horizon planning
- Core innovation: Proposes Survival Reinforcement Learning (SRL) as an alternative to contrastive self-supervised RL by replacing contrastive objectives with an online classification-based signal, mitigating the contrastive “uniformity–tolerance” dilemma that hurts long-horizon goal discrimination and planning. It extends survival value learning by maximizing dwell time at target goals, yielding a planning-friendly value signal while retaining strong depth-scaling behavior.
- A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models
- 赛道归属: 逆强化学习(IRL)理论 / 离线RL与结构计量经济学(DDC)统一视角
- 核心创新点: 以讲义形式系统梳理IRL的基础,并将熵正则IRL与结构计量中的动态离散选择模型(Dynamic Discrete Choice, DDC)在数学结构上进行对齐:从“由专家离线数据反推奖励/偏好”的角度,统一讨论可辨识性、似然/最大熵目标、价值函数与策略的对应关系,以及由此带来的估计与推断框架;其方法论价值在于提供跨社区的同构映射与推导路径,便于将DDC的统计推断工具与IRL的优化视角互相迁移。
- Track: Inverse Reinforcement Learning theory / Unifying Offline RL–IRL with Dynamic Discrete Choice (DDC)
- Core innovation: A foundations-focused note that aligns entropy-regularized IRL with dynamic discrete choice (DDC) models at the level of objectives and solution structure. It frames reward recovery from expert offline data through a unified lens (identifiability, likelihood/max-entropy criteria, value–policy correspondences), enabling methodological transfer between econometric inference in DDC and optimization-centric IRL formulations.
- RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network
- 赛道归属: 医学影像多模态生成(胸部影像报告生成)/ 强化学习用于文本生成
- 核心创新点: 提出RL-ACRGNet,将强化学习引入胸部放射学报告生成的训练框架,以缓解纯监督学习在“疾病识别准确性”和“报告表述质量/一致性”上的不足。方法层面通过将临床相关的序列级目标(如报告整体质量、关键病灶描述覆盖等)显式作为RL优化信号,直接优化生成报告的全局指标而非仅做token级似然拟合,从而提升对细粒度病灶信息的捕获与报告生成的临床可用性与一致性。
- Track: Medical multimodal generation (chest radiology report generation) / RL for text generation
- Key innovations: Introduces RL-ACRGNet, integrating reinforcement learning into chest radiology report generation to address limitations of purely supervised training in disease detection accuracy and report quality/consistency. Methodologically, it optimizes clinically meaningful sequence-level objectives (e.g., overall report quality and coverage of key findings) as RL signals, directly targeting global report metrics rather than token-level likelihood alone, improving fine-grained pathology capture and clinical usability/consistency of generated reports.
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
- 赛道归属: LLM智能体强化学习(Agentic RL)/ 策略优化算法
- 核心创新点: 提出StepPO(Step-Aligned Policy Optimization),针对现有LLM-RL普遍采用token为基本优化粒度而与智能体“按步骤(observation-action循环)决策”的粒度不匹配问题,改为以“步骤”作为对齐与优化的核心单位。方法突破在于将信用分配与策略更新从token层提升到step层,使奖励/优势估计与环境交互的决策边界一致,从而更贴合agentic行为结构,减少由token级噪声与粒度错配带来的优化偏差,提升多步任务中的决策稳定性与学习效率。
- Track: LLM agent reinforcement learning (Agentic RL) / policy optimization
- Key innovations: Proposes StepPO (Step-Aligned Policy Optimization) to resolve the granularity mismatch where existing LLM RL optimizes at the token level while agents act via step-wise observation–action cycles. The key advance is elevating alignment, credit assignment, and policy updates to the step level so that reward/advantage estimation matches decision boundaries in environment interaction, reducing token-level noise and mismatch-induced bias, and improving stability and sample efficiency in multi-step agent tasks.
- Rethinking the Divergence Regularization in LLM RL 🆕NEW
- 赛道归属: 大语言模型后训练(RLHF/RLAIF)— 强化学习优化与信赖域/正则化(PPO/GRPO/DPPO改进)
- 核心创新点: 该工作针对LLM强化学习中常见的离策略(policy staleness、训练-推理分布不一致)导致的优化不稳定问题,重新审视“发散度正则/信赖域约束”的实现方式。指出PPO/GRPO依赖的importance ratio clipping在长尾词表下难以真实刻画分布漂移(ratio对分布差异的代理性不足),从而可能带来错误的信赖域控制。论文的关键突破在于:从“分布偏移度量”的角度重构正则项/约束设计,使其更直接、更可靠地约束新旧策略的行为分布差异(而非仅约束token级ratio),以提升离策略场景下的稳定性与可控性,并与DPPO等近期方法形成对比与统一分析框架。
- Track: LLM post-training (RLHF/RLAIF) — RL optimization with trust-region/divergence regularization (PPO/GRPO/DPPO improvements)
- Core innovation: This work targets instability in LLM RL caused by practical off-policy effects (policy staleness and train–inference mismatch) and revisits how “divergence regularization / trust-region control” should be implemented. It argues that the ratio-clipping mechanism used in PPO/GRPO relies on importance ratios that can be a poor surrogate for true distribution shift under long-tailed vocabularies, leading to mis-calibrated trust-region control. The main methodological contribution is to redesign the regularization/constraint from a distribution-shift measurement perspective—more directly constraining behavioral distribution changes between old and new policies rather than token-level ratio proxies—thereby improving stability and controllability in off-policy settings, and providing a comparative/unifying lens relative to recent approaches like DPPO.
- Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation 🆕NEW
- 赛道归属: 机器人/自动驾驶控制强化学习(两轮车竞速与平衡控制、仿真到控制策略学习)
- 核心创新点: 面向超级摩托车竞速这一高动态、强耦合控制任务(平衡/倾角/转向/油门联动),提出自步式课程强化学习框架:根据智能体当前能力自动调节任务难度与训练分布(如速度、赛道段、扰动/初始状态等),逐步从易到难覆盖极限工况;在物理精确的Unity仿真器中实现可扩展训练流程,使策略能同时学到稳定性(不摔车)与竞速性能(圈速/超车)之间的权衡。
- Track: RL for Robotics/Autonomous Driving Control (two-wheeler racing & balance in simulation)
- Core innovation: Proposes a self-paced curriculum RL framework tailored to superbike racing, a highly coupled control problem requiring simultaneous balance/lean/steering/throttle coordination. The curriculum adapts task difficulty and training distribution to the agent’s competence (e.g., speed regimes, track segments, perturbations/initial states), progressively expanding to edge-case dynamics. Implemented in a physics-accurate Unity simulator, it enables scalable training of policies that balance stability (avoid crashes) with racing performance (lap time/competitiveness).
- Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning 🆕NEW
- 赛道归属: LLM智能体强化学习工程化(Agentic RL数据中间件/数据闭环)
- 核心创新点: 将关注点从“优化算法本身”前移到“交互数据全生命周期”,提出面向Agentic RL的step级数据中间件:以环境交互的每一步为最小数据单元,统一采集、结构化、索引、回放与消费(训练)接口,支持在线/离线混合、可追溯的轨迹管理与数据质量控制;通过标准化step级schema与管道化处理,降低不同环境/框架间的数据耦合成本,提升大规模代理训练的可复现性与迭代效率。
- Track: Engineering for Agentic RL with LLMs (step-level data middleware & data lifecycle)
- Core innovation: Shifts emphasis from policy optimization to the full lifecycle of agent-environment interaction data by introducing a step-level data middleware for agentic RL. It treats each environment step as the atomic unit and provides unified collection, structuring, indexing, replay, and training-consumption APIs, enabling online/offline hybrids, traceable trajectory management, and data quality controls. A standardized step schema and pipeline reduce coupling across environments/frameworks and improve reproducibility and iteration speed at scale.
- A Unifying Lens on Reward Uncertainty in RLHF 🆕NEW
- 赛道归属: RLHF奖励建模与不确定性(分布式奖励模型/悲观优化/抗reward hacking)
- 核心创新点: 提出用“分布式奖励模型” $p(r \mid x, y)$ 作为统一刻画RLHF中奖励不确定性的正确对象,而非仅输出标量分数的RM;在贝叶斯或集成等实现下,将不确定性与“悲观化(pessimism)”策略优化联系起来:在RM高不确定区域对奖励进行下调/风险敏感处理,从理论上统一不同不确定性估计与惩罚形式,系统性缓解reward hacking(策略利用RM误差刷分)并提升对真实质量的对齐鲁棒性。
- Track: RLHF Reward Modeling & Uncertainty (distributional RM, pessimistic optimization, anti-reward-hacking)
- Core innovation: Argues that reward uncertainty in RLHF should be modeled via a distributional reward model $p(r \mid x, y)$ rather than a scalar reward model. Under Bayesian or ensemble realizations, it connects uncertainty estimation to principled pessimism in policy optimization: rewards in high-uncertainty regions are down-weighted or risk-adjusted. This provides a unifying view of uncertainty-aware penalties and a systematic mitigation of reward hacking, reducing exploitation of reward model errors and making alignment with true response quality more robust.
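The link between a distributional reward model and pessimistic optimization in the last entry above can be sketched with an ensemble standing in for the reward distribution. This is an assumption-laden illustration, not the paper's formulation: the `mean - beta * std` lower-confidence-bound form is one common way to express pessimism, and the `pessimistic_reward` helper, the response names, and the scores below are hypothetical.

```python
from statistics import mean, pstdev

def pessimistic_reward(samples, beta=1.0):
    """Lower-confidence-bound reward: ensemble mean minus beta times ensemble spread."""
    return mean(samples) - beta * pstdev(samples)

# Hypothetical scores from four reward-model ensemble members for two responses.
ensemble_scores = {
    "plain_answer": [0.70, 0.72, 0.68, 0.71],       # members agree
    "suspicious_answer": [0.95, 0.20, 0.99, 0.90],  # members disagree sharply
}

mean_scores = {name: mean(s) for name, s in ensemble_scores.items()}
pess_scores = {name: pessimistic_reward(s) for name, s in ensemble_scores.items()}

best_by_mean = max(mean_scores, key=mean_scores.get)
best_by_pessimism = max(pess_scores, key=pess_scores.get)
print("best by mean:", best_by_mean)
print("best by pessimistic reward:", best_by_pessimism)
```

A scalar reward (here, the ensemble mean) prefers the response the members disagree on, which is where reward hacking tends to concentrate; penalizing the spread moves the preference to the response scored consistently.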
GitHub
- [2026-06-09] PufferAI/PufferLib ⭐5880
Puffing up reinforcement learning
- [2026-06-10] rllm-org/rllm ⭐5604
Democratizing Reinforcement Learning for LLMs
- [2026-06-09] facebookresearch/ReAgent ⭐3702 🆕NEW
A platform for Reasoning systems (Reinforcement Learning, Contextual Bandits, etc.)
- [2026-06-09] facebookresearch/Pearl ⭐3008
A Production-ready Reinforcement Learning AI Agent Library brought by the Applied Reinforcement Learning team at Meta.
- [2026-06-10] radixark/miles ⭐1527
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
HuggingFace Models
HuggingFace Datasets
- [2026-06-04] stanford-vision-lab/gpic
GPIC: A Giant Permissive Image Corpus for Visual Generation. Keshigeyan Chandrasegaran, Kyle Sargent, Suchi...
- [2026-06-05] nvidia/Nemotron-Personas-El-Salvador
Nemotron-Personas-El-Salvador: A compound AI approach to personas in Salvadoran Spanish, grounded in distributions of...
- [2026-06-05] nvidia/Nemotron-Personas-Vietnam
Nemotron-Personas-Vietnam: A compound AI system for generating synthetic personas grounded in the real-world distributions of Vietnam ...
世界动作模型 / World Action Model
arXiv
- C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache 🆕NEW
- 赛道归属: 推理优化(面向扩散式世界动作模型/WAM 的加速与缓存)
- 核心创新点: 提出 Cross Inference Chunk Cache(C$^3$ache)用于多 chunk 推理的跨段缓存复用:针对 WAM 在完成任务时需要连续运行多个 inference chunk、且每段都要进行昂贵去噪的特点,将可复用的中间计算(如去噪过程中的特征/状态)以“跨 chunk”方式缓存并在后续段落对齐复用,从而减少重复去噪计算与端到端推理时延,同时尽量保持生成/控制质量不显著下降。
- Track: Inference optimization (acceleration & caching for diffusion-based World Action Models)
- Key innovation: Introduces Cross Inference Chunk Cache (C$^3$ache) to reuse intermediate computations across multiple inference chunks in WAM rollout. By caching and re-aligning reusable denoising intermediates (e.g., features/states) between consecutive chunks, it avoids redundant denoising work, reducing end-to-end latency while aiming to preserve generation/control fidelity.
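The general idea of reusing denoising intermediates across chunks can be shown with a toy. This is not the paper's algorithm: `ChunkCache`, `ToyDenoiser`, the arithmetic inside `features`, and the choice of which steps to reuse are all assumptions made for illustration.

```python
class ChunkCache:
    """Holds per-step intermediates from earlier inference chunks (hypothetical)."""

    def __init__(self):
        self._features = {}

    def get(self, step):
        return self._features.get(step)

    def put(self, step, features):
        self._features[step] = features


class ToyDenoiser:
    """Stand-in for an expensive denoising network; counts expensive calls."""

    def __init__(self):
        self.expensive_calls = 0

    def features(self, x, step):
        self.expensive_calls += 1
        return 0.5 * x + step  # placeholder for a costly forward pass

    def run_chunk(self, x, num_steps, cache, reuse_steps=()):
        for step in range(num_steps):
            # Reuse the cached intermediate only at steps marked as reusable.
            feats = cache.get(step) if step in reuse_steps else None
            if feats is None:
                feats = self.features(x, step)
                cache.put(step, feats)
            x = x - 0.1 * feats  # toy denoising update
        return x


model = ToyDenoiser()
cache = ChunkCache()
model.run_chunk(1.0, num_steps=4, cache=cache)                      # first chunk fills the cache
model.run_chunk(1.0, num_steps=4, cache=cache, reuse_steps={0, 1})  # second chunk reuses two steps
```

Here the second chunk skips two of its four expensive calls. Deciding which intermediates can be reused, and how to re-align them, without hurting output quality is the hard part, and the toy does not address it.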
- Light-WAM: Efficient World Action Models with State-Fusion Action Decoding 🆕NEW
- 赛道归属: 机器人策略学习/世界动作模型(高效架构与动作解码)
- 核心创新点: 提出 Light-WAM,通过“State-Fusion Action Decoding(状态融合动作解码)”将未来预测得到的时序状态表征与当前观测/隐状态进行融合,再进行动作解码;以更轻量的生成式/预测式骨干替代重型 WAM 架构,在保留 WAM 通过未来预测学习任务相关时序结构这一优势的同时,显著降低训练成本与闭环推理延迟,使其更适合实时机器人操作部署。
- Track: Robot policy learning / World Action Models (efficient architecture & action decoding)
- Key innovation: Proposes Light-WAM with State-Fusion Action Decoding: it fuses predicted future state representations with current observations/latent states before decoding actions. This design keeps the WAM benefit of learning task-relevant temporal structure via future prediction, while using a lightweight backbone to cut training cost and closed-loop inference latency for practical real-time manipulation.
- WALL-WM: Carving World Action Modeling at the Event Joints
- 赛道归属: 世界动作模型(World Action Model)/ 视觉-语言-动作预训练(Vision-Language-Action Pretraining)/ 视频动作建模
- 核心创新点:
- 中文:提出从“固定长度动作块(chunk)”转向“语义事件(event)”的世界动作建模范式,将语义连贯的动作事件作为最小学习单元,在事件连接点(event joints)处刻画动作的自然边界与状态转移,从而缓解 chunk 粒度与真实动作结构不匹配带来的学习偏差。方法上以事件为锚点进行视觉-语言-动作联合预训练,使模型学习到更符合人类语义分段的动作表征与跨事件的因果/时序衔接能力,相比直接对当前观测+指令做 chunk 级预测,更强调事件级结构化监督与可组合性。
- English: Introduces an event-grounded paradigm for World Action Models, replacing fixed-length action chunks with semantically coherent action events as the atomic learning unit. By modeling transitions at event joints (natural boundaries between events), it addresses the granularity mismatch inherent in chunk-centric optimization and better captures state changes and temporal/causal continuity. The approach performs Vision-Language-Action pretraining anchored on events, encouraging structured, compositional action representations and improved cross-event linkage, rather than directly predicting chunk-level actions conditioned only on the current observation and instruction.
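The granularity mismatch between fixed-length chunks and semantic events can be seen on a toy trajectory. This sketch assumes per-step event labels are already available, which the digest does not claim; `fixed_chunks` and `event_segments` are hypothetical helper names.

```python
from itertools import groupby


def fixed_chunks(steps, chunk_len):
    """Split a trajectory into fixed-length chunks (the last one may be shorter)."""
    return [steps[i:i + chunk_len] for i in range(0, len(steps), chunk_len)]


def event_segments(steps, label_of):
    """Split a trajectory wherever the event label changes (the event joints)."""
    return [list(group) for _, group in groupby(steps, key=label_of)]


# Each step is (event_label, timestep); the labels are assumed to be given.
trajectory = [
    ("reach", 0), ("reach", 1), ("reach", 2),
    ("grasp", 3),
    ("lift", 4), ("lift", 5),
]

chunks = fixed_chunks(trajectory, 4)
events = event_segments(trajectory, lambda step: step[0])
```

With a chunk length of 4, the first chunk mixes all of `reach` with `grasp`, while the event segmentation keeps each action in its own unit of length 3, 1, and 2.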
- Unified Video-Action Joint Denoising for Dexterous Action and Data Generation
- 赛道归属: 机器人世界模型 / 视频-动作联合生成(World Action Model, Video-Action Joint Modeling)
- 核心创新点: 从分布建模角度重构“视频先验→动作策略”的对齐方式:不再将视频基础模型的动态先验压缩为“给定观测的未来动作策略分布”,而是直接在交互视频与可执行手部轨迹的联合空间上进行建模与去噪生成;通过支持多种条件化机制/条件模式来保持更“宽”的联合分布,从而在同一框架内同时服务于灵巧动作生成与数据生成(视频与动作的协同合成),提升视频-动作一致性与可控性。
- Track: Robotics World Models / Video-Action Joint Generation (World Action Model, Video-Action Joint Modeling)
- Key innovation: Reframes video-to-action alignment as a distribution modeling problem: instead of collapsing the video foundation model’s dynamics prior into an observation-conditioned action policy over future actions, it models and denoises the joint distribution over interaction videos and executable hand trajectories. By enabling multiple conditioning regimes, it preserves a broader joint distribution, unifying dexterous action generation and data generation (co-synthesizing videos and actions) with improved video–action consistency and controllability.
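In symbols, the contrast above might be written as follows. The notation is assumed for illustration ($o_t$ for the current observation, $a_{t+1:t+H}$ for a future action sequence, $v$ for an interaction video, $a$ for a hand trajectory, $c$ for whatever conditioning is supplied); the digest does not give the paper's own formulation.

```latex
\begin{align*}
\text{policy view:} \quad & \pi_\theta\!\left(a_{t+1:t+H} \mid o_t\right) \\
\text{joint view:}  \quad & p_\theta\!\left(v,\, a \mid c\right)
\end{align*}
```

Under this reading, different choices of $c$ correspond to the different conditioning regimes, and sampling $v$ and $a$ together is what lets the same model serve data generation as well as action generation.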
GitHub
- [2026-06-08] DravenALG/awesome-vla-wam ⭐721
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent. Generated at: 2026-06-10 01:02:38