AI 每日进展速报 / Daily AI Digest - 2026-05-18
图像生成/编辑 / Image Generation/Editing
arXiv
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
- 赛道归属: 文生图(偏好对齐/后训练强化学习)
- 核心创新点: 提出 SOLACE 后训练框架,用“模型内生自信度”替代外部奖励模型/人工偏好监督:将模型自身生成结果重新加噪,并以其对注入噪声的恢复准确性作为自信度奖励信号,从而在不依赖额外标注或奖励网络的情况下进行偏好对齐式优化,提升生成的可靠性与审美一致性。
- Track: Text-to-Image (preference alignment / post-training RL)
- Core innovation: Introduces SOLACE, a post-training framework that replaces external reward supervision with an intrinsic self-confidence signal: it re-noises the model’s own outputs and uses denoising recovery accuracy as a reward, enabling alignment-style optimization without reward models or human labels and improving reliability/aesthetic consistency.
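The self-confidence signal described for SOLACE can be illustrated with a toy sketch: re-noise a sample, ask the model to predict the injected noise, and score it by how accurately the noise is recovered. Everything here is an assumption made for illustration (the function names, the single noise level `alpha_bar`, the negative-MSE form of the reward, and the toy denoiser); the summary does not specify the paper's actual formulation.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical interface: (noisy sample, noise level) -> predicted noise.
NoisePredictor = Callable[[Sequence[float], float], Vector]


def renoise(x0: Sequence[float], alpha_bar: float,
            rng: random.Random) -> Tuple[Vector, Vector]:
    """Forward-diffuse x0: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps


def self_confidence_reward(x0: Sequence[float], predict_noise: NoisePredictor,
                           alpha_bar: float = 0.5, n_probes: int = 4,
                           seed: int = 0) -> float:
    """Negative mean squared error between injected and predicted noise.

    A higher (less negative) value means the model recovers the noise it was
    given more accurately, which is read here as higher self-confidence.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        x_t, eps = renoise(x0, alpha_bar, rng)
        eps_hat = predict_noise(x_t, alpha_bar)
        total += sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)
    return -total / n_probes


def zero_prior_denoiser(x_t: Sequence[float], alpha_bar: float) -> Vector:
    """Toy stand-in for a diffusion model that believes clean samples are all zeros."""
    b = math.sqrt(1.0 - alpha_bar)
    return [x / b for x in x_t]


# A sample the toy model "expects" scores near 0; an unexpected one scores lower.
reward_in_dist = self_confidence_reward([0.0] * 4, zero_prior_denoiser)
reward_off_dist = self_confidence_reward([1.0] * 4, zero_prior_denoiser)
```

In a post-training loop, a reward of this kind would take the place of an external reward model's score; how SOLACE turns it into a policy update is not covered by the summary above.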
- Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
- 赛道归属: 文生图评测 / 人类标注与评估协议设计
- 核心创新点: 提出“技能对齐标注(skill-aligned annotation)”评测范式:不再用统一的Likert/BQA覆盖所有评测维度,而是先将T2I评测拆解为异质“技能”(如语义一致性、属性绑定、空间关系、文本可读性、审美/真实感等),再为不同技能匹配更合适的标注接口、问题形式与聚合方式,从而降低标注噪声与维度混淆;强调在模型差距缩小时,通过对齐技能本质来提升评测的可靠性、可复现性与区分度。
- Track: Text-to-image evaluation / human annotation & evaluation protocol design
- Core innovation: Proposes a “skill-aligned annotation” paradigm: instead of applying a uniform Likert/BQA scheme to all criteria, it decomposes T2I evaluation into heterogeneous skills (e.g., semantic faithfulness, attribute binding, spatial relations, text legibility, aesthetics/realism) and matches each skill with an appropriate annotation interface, question form, and aggregation rule. This reduces annotation noise and cross-skill confounding, improving reliability, reproducibility, and discriminability when model gaps are small.
- EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
- 赛道归属: 文生图(推理时可控生成/组合式生成)
- 核心创新点: 提出 EPIC 的训练无关(training-free)推理时控制:将复杂提示词一次性解析为包含对象变量与类型化谓词(数量、属性、关系等)的“视觉程序”,并把生成改写为谓词引导的搜索/精炼过程,在不改动模型参数的前提下,通过对违反谓词的部分进行定向修正来提升组合一致性与可控性,同时强调推理效率。
- Track: Text-to-Image (inference-time control / compositional generation)
- Core innovation: Proposes EPIC, a training-free inference-time control method that parses a prompt once into a fixed “visual program” (object variables + typed predicates for count/attributes/relations) and performs predicate-guided search/refinement to correct predicate violations without updating model weights, improving compositional faithfulness with efficient inference.
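To make the "visual program" idea in the EPIC summary concrete, the sketch below represents a prompt as typed predicates and checks candidate scenes against them. It is a minimal illustration under stated assumptions: the `Detection` record, the three predicate types, and the best-of-N `select_best` step are hypothetical stand-ins, not EPIC's actual parser, verifier, or refinement procedure.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output for one object in a generated image."""
    label: str
    color: str
    x: float  # horizontal centre in [0, 1]


@dataclass(frozen=True)
class Count:
    label: str
    n: int


@dataclass(frozen=True)
class Attribute:
    label: str
    color: str


@dataclass(frozen=True)
class LeftOf:
    left: str
    right: str


Predicate = Union[Count, Attribute, LeftOf]


def holds(pred: Predicate, scene: Sequence[Detection]) -> bool:
    """Check a single typed predicate against the detected objects."""
    if isinstance(pred, Count):
        return sum(d.label == pred.label for d in scene) == pred.n
    if isinstance(pred, Attribute):
        objs = [d for d in scene if d.label == pred.label]
        return bool(objs) and all(d.color == pred.color for d in objs)
    if isinstance(pred, LeftOf):
        lefts = [d.x for d in scene if d.label == pred.left]
        rights = [d.x for d in scene if d.label == pred.right]
        return bool(lefts) and bool(rights) and max(lefts) < min(rights)
    raise TypeError(f"unknown predicate type: {type(pred).__name__}")


def violations(program: Sequence[Predicate],
               scene: Sequence[Detection]) -> List[Predicate]:
    """Predicates the scene fails; these are the parts that need correction."""
    return [p for p in program if not holds(p, scene)]


def select_best(program: Sequence[Predicate],
                candidates: Sequence[Sequence[Detection]]) -> Sequence[Detection]:
    """Pick the candidate with the fewest violated predicates."""
    return min(candidates, key=lambda s: len(violations(program, s)))


# "Two black cats to the left of a dog", parsed once into a fixed program.
program = [Count("cat", 2), Attribute("cat", "black"), LeftOf("cat", "dog")]
```

The list returned by `violations` is what a predicate-guided refinement step would target, so only the failing parts of the image need to be revisited.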
- SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation
- 赛道归属: 文生图安全对齐 / 推理期安全控制(Inference-only safety)
- 核心创新点: 提出SPOT:在冻结扩散生成器的前提下,仅在推理期对文本提示进行“选择性投影(selective prompt projection)”以抑制不安全生成,同时尽量保持对良性提示的行为不变;用总变差距离(Total Variation, TV)将“相对原始提示条件分布的偏移”与“风险期望变化”建立可控上界关系,把安全约束转化为对提示投影强度/选择策略的可优化目标,实现无需再训练的安全-保真折中控制。
- Track: Text-to-image safety alignment / inference-time safety control
- Core innovation: Introduces SPOT, which keeps the diffusion generator frozen and performs inference-only “selective prompt projection” to suppress unsafe generations while preserving behavior on benign prompts. It leverages Total Variation (TV) to upper-bound how much expected risk can change relative to the original prompt-conditioned distribution, turning safety into a controllable constraint on projection strength/selection—achieving a practical safety–fidelity trade-off without retraining.
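The role of Total Variation in the SPOT summary rests on a standard inequality: for any risk function bounded in [0, 1], the expected risk under two distributions differs by at most their TV distance. The sketch below checks this numerically on a toy discrete outcome space. The outcome labels, probabilities, and risk values are made up for illustration, and the paper's own bound may take a different or tighter form.

```python
from typing import Dict

Distribution = Dict[str, float]


def total_variation(p: Distribution, q: Distribution) -> float:
    """TV(p, q) = 0.5 * sum over outcomes of |p(x) - q(x)|."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in outcomes)


def expected_risk(p: Distribution, risk: Dict[str, float]) -> float:
    """E_p[risk], where each risk value is assumed to lie in [0, 1]."""
    return sum(prob * risk[x] for x, prob in p.items())


# Hypothetical output distributions under the original and projected prompts.
original = {"safe": 0.70, "borderline": 0.20, "unsafe": 0.10}
projected = {"safe": 0.85, "borderline": 0.10, "unsafe": 0.05}
risk = {"safe": 0.0, "borderline": 0.5, "unsafe": 1.0}

tv = total_variation(original, projected)
risk_gap = abs(expected_risk(original, risk) - expected_risk(projected, risk))
# The inequality |E_p[r] - E_q[r]| <= TV(p, q) holds for any r bounded in [0, 1].
```

Read in this direction, capping how far the projected prompt moves the output distribution also caps how much behavior on benign prompts can change, which is the safety versus fidelity trade-off the summary describes.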
- Drag within Prior Distribution: Text-Conditioned Point-Based Image Editing within Distribution Constraints
- 赛道归属: 图像编辑 / 扩散模型点控编辑(Drag/point-based editing)
- 核心创新点: 提出“分布内约束”的文本条件点编辑框架:针对传统handle/target点对带来的轨迹歧义与远距离拖拽导致的非必要改动,引入“保持在先验分布内(within prior distribution)”的约束思想,在编辑优化过程中显式限制生成状态偏离模型先验流形;同时用文本条件辅助消歧与稳定语义,使点级位移更可控、对无关区域扰动更小,并提升大位移/复杂场景下的编辑一致性。
- Track: Image editing / diffusion-based point (drag) editing
- Core innovation: Proposes a text-conditioned point-editing framework with explicit “within-prior-distribution” constraints. It addresses trajectory ambiguity from handle/target pairs and unnecessary changes under long-distance drags by constraining the optimization to stay close to the model’s prior manifold, while using text conditioning to disambiguate intent and stabilize semantics—yielding more controllable motion with reduced collateral edits and better consistency in challenging large-move cases.
- Does Engram Do Memory Retrieval in Autoregressive Image Generation?
- 赛道归属: 自回归图像生成 / 记忆增强Transformer机理分析
- 核心创新点: 将Engram(哈希键控、O(1)联想记忆)模块适配到视觉AR生成(2D空间n-gram哈希),并围绕“是否真的在做记忆检索”这一解释进行机制层面的实证检验;通过对比与探针分析区分性能增益来源(内容寻址检索 vs. 其他正则化/表示效应),从而澄清记忆模块在AR图像生成中的作用边界与适用条件,为后续设计更有效的视觉记忆结构提供依据。
- Track: Autoregressive image generation / memory-augmented Transformer mechanism analysis
- Core innovation: Adapts the Engram module (hash-keyed O(1) associative memory) to autoregressive vision via 2D spatial n-gram hashing, and empirically tests whether gains truly come from memory retrieval. Through controlled comparisons and probing, it disentangles content-addressed retrieval effects from alternative mechanisms (e.g., regularization/representation changes), clarifying when and how such memory modules help AR image generation and informing better visual memory designs.
- Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation
- 赛道归属: 多模态理解 / 指代表达分割(Zero-shot RIS)与编辑模型语义指向能力挖掘
- 核心创新点: 发现并系统验证指令式图像编辑模型在编辑前期就形成语言条件的语义指向(grounding)信号;提出利用“早期语义落地(early semantic grounding)”将编辑模型转化为零样本RIS:从模型早期层/早期扩散阶段提取与指代表达对齐的空间响应,并映射为像素级掩码,避免依赖专门分割训练数据;核心突破在于把“编辑所需的隐式定位能力”显式化为可用的分割输出。
- Track: Multimodal understanding / zero-shot referring image segmentation via editing models
- Core innovation: Shows that instruction-based image editing models produce language-conditioned grounding signals early in the generation process. It leverages this “early semantic grounding” to perform zero-shot RIS by extracting expression-aligned spatial responses from early layers/early diffusion steps and converting them into pixel masks—turning the editor’s implicit localization capability into explicit segmentation without task-specific segmentation training.
- Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
- 赛道归属: 图像编辑评测基准 / 奖励模型评测(Reward Modeling Benchmark)
- 核心创新点: 提出统一基准Edit-Compass与EditReward-Compass:面向前沿编辑模型,设计更高难度、更贴近人类偏好的任务集合与细粒度评测协议,缓解现有基准“题目过易+打分粗糙”导致的饱和与失真;同时将“编辑质量评测”与“用于RL优化的奖励模型评测”打通,在同一任务空间中评估reward对人类判断的一致性与可泛化性,推动可用、可对比的编辑RL闭环。
- Track: Image editing benchmarks / reward model benchmarking for RL-based editing
- Core innovation: Introduces the unified benchmarks Edit-Compass and EditReward-Compass. They raise task difficulty and adopt more fine-grained, human-aligned evaluation protocols to avoid saturation and misranking on frontier editors. Crucially, they connect editing evaluation with reward-model evaluation in the same task space, enabling systematic measurement of reward–human alignment and generalization—supporting more realistic RL optimization loops for image editing.
- Inline Critic Steers Image Editing
- 赛道归属: 图像编辑 / 推理期引导与自我纠错(Inline critic / test-time steering)
- 核心创新点: 提出Inline Critic:在不等待整张图生成或完整去噪步结束的情况下,把“批评/纠错信号”注入到一次前向过程内部,实现更早、更局部的难点区域分配式修正;通过探测冻结编辑模型,发现早期表征已包含可用于判断编辑偏差的信号,据此训练/构建可内联工作的critic,在生成进行中对中间状态施加引导,从而提升复杂指令与局部区域编辑的稳定性与成功率。
- Track: Image editing / inference-time steering with inline critics
- Core innovation: Proposes an Inline Critic that delivers critique/correction signals within an ongoing forward pass, rather than after a full image or denoising step completes. By probing a frozen editor, it finds early representations already encode signals predictive of edit failures, enabling a critic that steers intermediate states during generation. This yields earlier, region-adaptive refinement and improves robustness/success on difficult, localized instruction edits.
- MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation
- 赛道归属: 文生图(可控生成/成像因素解耦)
- 核心创新点: 将控制维度从“内容”扩展到“成像链路”,提出对镜头(lens)、传感器(sensor)、视角(view)与场景域(domain)等成像因素的解耦建模与组合生成任务(Imaging Factor Disentanglement):通过显式分离并可组合这些因素,减少文本歧义带来的控制不确定性,实现更精细的风格/设备/视角级别可控的新图像生成。
- Track: Text-to-Image (controllable generation / imaging-factor disentanglement)
- Core innovation: Extends control beyond content by disentangling and compositing imaging factors—lens, sensor, viewpoint, and domain—formulating an Imaging Factor Disentanglement task that explicitly separates these factors to mitigate text ambiguity and enable fine-grained, composable control over device/style/view-level generation.
GitHub
- [2026-05-18] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12024
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-05-17] vibheksoni/free-ai ⭐456
Free OpenAI-compatible AI API with 16,000+ models, image generation, tool calling, and Discord key signup.
- [2026-05-17] jegly/Box ⭐440
Private on-device AI suite for Android. Fork of Google AI Edge Gallery with llama.cpp, whisper.cpp, stable-diffusion.cpp, GGUF import, voice chat, vis...
- [2026-05-17] Azornes/Comfyui-Resolution-Master ⭐263 🆕NEW
Custom node for total control over resolution and aspect ratio. It provides an intuitive interface with an interactive canvas, advanced scaling option...
- [2026-05-18] GautamVhavle/CatGPT-Gateway ⭐62 🆕NEW
Turn your ChatGPT or Claude account into a fully working OpenAI-compatible API. No API keys needed. Supports tool calling, vision, file attachments, a...
视频生成/编辑 / Video Generation/Editing
arXiv
- Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
- 赛道归属: 长视频生成(自回归视频生成 / KV Cache与注意力记忆策略优化)
- 核心创新点: 该工作针对自回归长视频生成中“误差累积导致长期退化”的关键瓶颈,提出Head-Aware 的金字塔式 KV Cache 保留策略(Pyramid Forcing):不再对所有注意力头采用统一的历史帧保留,而是基于对“历史帧注意力模式”的实证分析,将注意力头划分为Anchor(需要广域长程上下文)、Wave(周期性时序依赖)等不同类型,并据此为不同头分配差异化的历史信息保留/压缩方案(呈金字塔式的时间尺度覆盖)。核心突破在于把“长程记忆管理”从全局统一策略提升为按头建模的结构化策略,在不牺牲流式生成特性的前提下,更有效抑制长视频的质量漂移与语义/运动一致性退化。
- Track: Long video generation (autoregressive video generation / KV-cache & attention memory policy optimization)
- Key innovation: This work targets long-horizon degradation in autoregressive long video generation caused by error accumulation, and proposes Pyramid Forcing, a head-aware pyramid KV-cache retention policy. Instead of a uniform history retention across all attention heads, it empirically identifies distinct head behaviors (e.g., Anchor heads needing broad long-range context and Wave heads exhibiting periodic temporal dependencies) and assigns head-specific history retention/compression over multi-scale (pyramidal) temporal coverage. The methodological leap is upgrading memory management from a one-size-fits-all cache policy to a structured, per-head policy, improving long-video stability (semantic/motion consistency) while preserving streaming/open-ended generation.
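A per-head cache policy of the kind summarized for Pyramid Forcing might look like the sketch below, which picks which past frames each head keeps under a fixed budget. The concrete rules are assumptions for illustration only: the summary names Anchor and Wave heads but does not give their retention rules, and the "local" type, the `period` parameter, and the even-spacing rule are hypothetical.

```python
from typing import List


def retained_frames(head_type: str, num_frames: int, budget: int,
                    period: int = 2) -> List[int]:
    """Indices of past frames whose keys/values a head keeps in its cache.

    head_type:
      "anchor": spread the budget evenly over the whole history (broad context)
      "wave":   keep frames at regular lags back from the newest frame
      "local":  keep only the most recent frames
    """
    if budget < 2:
        raise ValueError("budget must be at least 2")
    if num_frames <= budget:
        return list(range(num_frames))  # everything fits, nothing to drop
    newest = num_frames - 1
    if head_type == "anchor":
        step = newest / (budget - 1)
        return sorted({round(i * step) for i in range(budget)})
    if head_type == "wave":
        lags = (newest - k * period for k in range(budget))
        return sorted(i for i in lags if i >= 0)
    if head_type == "local":
        return list(range(num_frames - budget, num_frames))
    raise ValueError(f"unknown head type: {head_type}")


# Same budget, different temporal coverage per head type.
policy = {h: retained_frames(h, num_frames=17, budget=5)
          for h in ("anchor", "wave", "local")}
```

Each head spends the same memory, but the three index sets cover different time scales, which is the intuition behind replacing a single uniform retention window with a per-head policy.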
- CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
- 赛道归属: 可控视频生成(相机运动/镜头控制条件的视频生成 / 位置编码与几何表征)
- 核心创新点: 该工作面向“相机条件视频生成”中位置编码在不同镜头模型下失效的问题,提出Curved Ray Expectation Positional Encoding(CRePE),以适配统一相机模型(Unified Camera Model)下的广角/鱼眼等非针孔成像。相较于仅注入光线方向(ray-only)或依赖针孔几何的编码方式,CRePE通过曲线光线(curved ray)期望来构造更稳健的注意力级位置编码,使模型在相机运动、镜头参数变化以及场景结构变化时仍能获得一致、可泛化的相机几何信号。核心突破在于将相机控制信号从“针孔假设下的显式几何”扩展为“统一相机模型下可微、可泛化的曲线光线统计表征”,从而实现更通用的相机可控视频生成。
- Track: Controllable video generation (camera-controlled video generation / positional encoding & geometric representation)
- Key innovation: This work addresses the brittleness of positional encodings for camera-conditioned video generation under varying camera motions and lens models, and introduces CRePE (Curved Ray Expectation Positional Encoding) compatible with the Unified Camera Model, covering wide-angle and fisheye lenses beyond pinhole assumptions. Unlike ray-only signals or pinhole-geometry-dependent encodings, CRePE builds attention-level positional encoding via the expectation over curved rays, providing a robust, consistent camera geometry signal under changes in motion, intrinsics, and scene structure. The key methodological advance is extending camera control from pinhole-specific explicit geometry to a differentiable, generalizable curved-ray statistical representation under a unified camera model, enabling more universal camera-controlled video generation.
- OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
- 赛道归属: 音视频联合生成(多模态生成)
- 核心创新点: 提出面向“联合音频-视频生成”的模态分治式强化学习框架(modality-wise omni diffusion reinforcement),将扩散生成中的优化目标拆解为“单模态保真度 + 跨模态对齐 + 细粒度时序同步”等多目标,并通过针对多模态/多目标RL训练不稳定性的机制化处理(如优势信号冲突与尺度不一致等问题的分析与改造)实现可训练、可控的联合优化,从而在不牺牲单模态质量的前提下提升音画一致性与同步精度。
- Track: Joint Audio-Video Generation (Multimodal Generation)
- Core innovation: Proposes a modality-wise omni diffusion reinforcement framework for joint audio-video generation, decomposing diffusion-time optimization into multi-objectives—per-modality fidelity, cross-modal alignment, and fine-grained temporal synchronization—and introducing training strategies to address key RL obstacles in multi-modal/multi-objective settings (e.g., conflicting advantage signals and scale mismatch). This enables stable, controllable joint optimization that improves A/V coherence and sync without degrading unimodal quality.
- OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
- 赛道归属: 视频生成(跨具身/跨形体动作迁移,Embodiment-aware Generation)
- 核心创新点: 提出流式(streaming)的跨具身视频生成框架,将“可迁移的运动动力学”与“具身特定的外观/形态”进行解耦建模,并通过无需成对数据(paired-free)的适配机制把模型快速迁移到新的人形载体(如人→机器人、机器人→机器人)。方法层面强调在生成过程中持续接收运动条件并稳定输出视频序列,同时用无配对适配降低对每个目标具身的标注/配对采集成本,提升可扩展性。
- Track: Video Generation (Cross-embodiment / embodiment-aware motion transfer)
- Core innovation: Introduces a streaming cross-embodiment video generation framework that explicitly disentangles transferable motion dynamics from embodiment-specific appearance/morphology, and adapts to new humanoid embodiments (e.g., human to robot, robot to robot) via paired-free adaptation, which needs no paired data per target. The method accepts motion conditions continuously during generation and outputs a stable video sequence, while paired-free adaptation lowers the annotation and paired-data collection cost for each target embodiment, improving scalability.
- SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
- 赛道归属: 视频生成 / 可控多人物交互(训练免控制)
- 核心创新点: 提出训练免(Training-Free)的多人物社交交互控制方法,将“谁在何时对谁做什么”的交互结构显式注入生成过程,解决多人物生成中常见的角色错配与动作归因错误;通过对交互关系与时序的可控编排,实现对对话、手势、协同行为等社会互动的细粒度导演式控制,而无需重新训练基础视频模型。
- Track: Video generation / controllable multi-person interactions (training-free control)
- Key innovation: Presents a training-free control method for multi-person social interactions, explicitly injecting interaction structure—who does what, when, and toward whom—into the generation process to reduce actor/action misbinding; enables fine-grained director-style control over conversations, gestures, and coordinated behaviors without retraining the base video model.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
- 赛道归属: 推理优化 / 流式长视频生成的自适应记忆管理
- 核心创新点: 提出SWIFT的“提示词自适应记忆”(Prompt-Adaptive Memory):针对交互式长视频中频繁语义切换,设计能随prompt更新而重组/选择性保留的记忆机制,避免在提示边界反复重建缓存或受限于固定记忆预算造成的冗余计算与适配迟滞;在保持视觉连续性的同时提升语义切换响应效率。
- Track: Inference optimization / adaptive memory for streaming long-video generation
- Key innovation: Introduces SWIFT with prompt-adaptive memory: for interactive long videos with frequent semantic switches, it reorganizes/selectively retains memory in response to prompt updates, avoiding cache rebuilds at prompt boundaries and inefficiencies of fixed memory budgets; improves responsiveness to semantic changes while maintaining visual continuity.
- EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation
- 赛道归属: 视频生成 / 多镜头脚本化生成(教育内容一致性)
- 核心创新点: 提出面向STEM教学的多镜头生成统一框架:引入“教学状态建模”跟踪跨镜头的持久知识与概念依赖,并用脚本引导的结构化控制组织叙事与镜头编排,解决长视频中知识一致性、讲解连贯性与多镜头衔接问题;将“内容正确性/教学一致性”作为生成过程的核心约束而非事后筛选。
- Track: Video generation / multi-shot script-driven generation (educational consistency)
- Key innovation: Proposes a unified framework for multi-shot STEM instructional video generation: models a pedagogical state to track persistent knowledge and concept dependencies across shots, and uses script-guided structured control to organize narrative and shot composition; addresses knowledge consistency and pedagogical coherence as first-class generation constraints rather than post-hoc filtering.
- CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
- 赛道归属: 多模态推理 / 结合视频生成的协同推理框架
- 核心创新点: 提出VLM+视频生成模型的协同推理(CollabVR):用VLM承担显式规划、校验与纠错,将VGM生成的短时“Chain-of-Frames”作为可视化推理草稿;通过迭代式的生成—评估—修正闭环,缓解长任务的时序漂移与中段模拟错误累积,把视频生成从单纯输出器提升为可被语言推理约束与修正的“可视化思维工具”。
- Track: Multimodal reasoning / collaborative reasoning with video generation
- Key innovation: Proposes CollabVR, a VLM+VGM collaborative reasoning framework: the VLM performs explicit planning, verification, and correction while the VGM produces short-horizon Chain-of-Frames as visual reasoning drafts; an iterative generate–evaluate–revise loop mitigates long-horizon drift and mid-clip simulation error accumulation, turning video generation into a language-guided, correctable visual thinking tool.
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
- 赛道归属: 人物中心音视频生成(Audio-Video Generation,多模态联合生成:动作-语音-音效)
- 核心创新点: 提出统一框架在生成阶段显式约束“动作-语音-环境音效”三模态的时序一致性与语义协同,针对三者异质时间尺度与对齐难题,通过跨模态协同建模/对齐机制减少常见的口型-语音、动作-音效错配,实现更连贯的人物中心音视频联合生成。
- Track: Human-centric audio-video generation (multimodal joint generation: motion–speech–sound)
- Key innovations: Introduces a unified generation framework that explicitly enforces temporal alignment and semantic coherence across motion, speech, and environmental sound effects. By addressing heterogeneous temporal dynamics with cross-modal coordination/alignment mechanisms, it reduces typical mismatches (e.g., lip–speech and action–sound desynchronization) and improves coherent human-centric audio-video generation.
- From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
- 赛道归属: 视频生成(动作条件/机器人手术视频生成,Controllable Video Generation)
- 核心创新点: 提出“运动学到视觉”的提升(kinematic-to-visual lifting)范式,把低维的关节/器械运动学控制量转换为五种统一的、图像对齐的控制模态(image-aligned control modalities),从而把难以直接驱动像素演化的控制信号变成可被生成模型有效利用的视觉条件;在此基础上设计分层路由(hierarchically routed)的视觉控制机制,按层/按区域选择性注入不同控制模态,实现对手术场景中复杂、局部且时序敏感的变化进行更精细的动作约束与可控生成。
- Track: Video Generation (Action-conditioned / surgical robotics controllable generation)
- Core innovation: Proposes a kinematic-to-visual lifting paradigm that converts low-dimensional articulated kinematics into five unified image-aligned control modalities, making control signals directly usable for pixel-space evolution. On top of this representation, a hierarchically routed visual control mechanism selectively injects different control modalities across hierarchy/regions, enabling fine-grained, temporally precise action control for complex surgical video generation.
GitHub
- [2026-05-17] hao-ai-lab/FastVideo ⭐3481
A unified inference and post-training framework for accelerated video generation.
- [2026-05-17] ZeroLu/awesome-seedance ⭐1751
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-05-17] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1091
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-05-17] thu-ml/Causal-Forcing ⭐670 🆕NEW
[ICML 2026] Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Gener...
- [2026-05-17] AceDataCloud/Nexior ⭐371
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
音频生成 / Audio Generation
arXiv
- Adapting a Text-to-Audio Model for Room Impulse Response Generation
- 赛道归属: 音频生成 / 声学建模(RIR 房间脉冲响应生成)
- 核心创新点: 将预训练的文生音频大模型作为“生成先验”迁移到RIR这一强物理约束、数据稀缺的声学对象上,通过适配策略把通用音频生成能力对齐到RIR的时域结构与混响特征分布,实现无需从零训练即可生成高质量RIR;关键突破在于证明大规模生成式音频先验可有效覆盖并可控地生成RIR这类非语音/非音乐的声学响应信号,从而缓解真实RIR采集成本高与训练数据不足的问题。
- Track: Audio generation / Acoustic modeling (RIR generation)
- Core innovation: Adapts a pretrained text-to-audio foundation model as a generative prior for Room Impulse Responses, a physically constrained and data-scarce acoustic signal. The method aligns the model’s generic audio generation capability to RIR-specific temporal structure and reverberation statistics, enabling high-quality RIR synthesis without training from scratch. The key methodological contribution is demonstrating that large-scale generative audio priors can be effectively transferred to controllable generation of non-speech/non-music acoustic responses, mitigating real RIR collection and data scarcity.
- TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
- 赛道归属: 音频-文本数据集构建 / 音频语言模型(LALM)指令微调
- 核心创新点: 提出面向区域方言与韵律(台湾语境)的高质量音频-文本指令数据集构建范式:通过 Verify-Generate-Critique(VGC)流程先“验证再扩增”,用双ASR交叉校验对52.2万原始音频进行一致性过滤以提升转写可靠性,再借助教师模型生成并批判式筛选,扩展为58万高保真指令对;该“验证引导的数据策展”显著降低方言场景的噪声标注与语音-文本错配,为后续Tai-LALM等区域化音频语言建模提供可复用的数据生产管线。
- Track: Audio-text dataset curation / Instruction tuning for Large Audio-Language Models (LALMs)
- Core innovation: Introduces a region- and dialect-focused (Taiwan) high-fidelity audio-text instruction dataset via a verification-first curation paradigm. A Verify-Generate-Critique (VGC) pipeline uses dual-ASR cross-validation to filter 522K raw clips for transcript consistency, then expands and refines them into 580K instruction pairs with a teacher model plus critique-based selection. This verification-guided curation reduces label noise and audio-text misalignment in dialectal prosody settings, providing a reusable data production pipeline for regionalized LALMs (e.g., Tai-LALM).
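The dual-ASR cross-validation step in the VGC pipeline can be sketched as an agreement filter: keep a clip only when two independent transcripts are close. The sketch below is an assumption-laden illustration, since the summary does not state the agreement metric or threshold. The record fields `asr_a` and `asr_b`, the length-normalized character edit distance, and the `max_disagreement` cutoff are all hypothetical.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def disagreement(t1: str, t2: str) -> float:
    """Edit distance normalized by the longer transcript, in [0, 1]."""
    longest = max(len(t1), len(t2))
    return 0.0 if longest == 0 else edit_distance(t1, t2) / longest


def cross_validate(clips: List[Dict[str, str]],
                   max_disagreement: float = 0.1) -> List[Dict[str, str]]:
    """Keep only clips whose two ASR transcripts agree closely enough."""
    return [c for c in clips
            if disagreement(c["asr_a"], c["asr_b"]) <= max_disagreement]


clips = [
    {"id": "clip-1", "asr_a": "turn left at the market",
     "asr_b": "turn left at the market"},
    {"id": "clip-2", "asr_a": "take the train to taipei",
     "asr_b": "take the drain to type a"},
]
kept = cross_validate(clips)
```

Filtering before generation means the teacher model only expands clips whose transcripts are already trusted, which matches the verify-first ordering the summary emphasizes.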
- The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
- 赛道归属: 多模态理解(音频-语言)/ 空间音频理解(Audio Scene Analysis)
- 核心创新点: 将“空间音频-语言理解”从零散问题提升为统一的任务接口,提出音频场景分析(ASA)的三层级形式化:从原子级感知(事件与方位/距离等空间属性)到对象级绑定(语义与空间属性的归属一致性、多对象分离与排列),再到场景级物理一致性判断(答案是否符合空间声学常识)。核心突破在于把“听到什么”扩展为“在哪里、谁对应谁、整体是否合理”的可评测框架,为大音频语言模型引入可系统训练/评估的空间推理目标。
- Track: Multimodal understanding (audio-language) / Spatial audio understanding (Audio Scene Analysis)
- Core innovation: Elevates spatial audio-language understanding into a unified task interface by formalizing Audio Scene Analysis (ASA) as a three-level problem: atomic perception of events with spatial attributes, object-level binding of semantics to spatial properties across multiple sources, and scene-level physical plausibility checking. The methodological leap is turning “what is in the audio” into a measurable framework for “where it is, which attributes belong to which object, and whether the global answer is physically consistent,” enabling systematic training/evaluation of spatial reasoning in large audio-language models.
- Aliasing-Free Neural Audio Synthesis
- 赛道归属: 神经音频合成 / 声码器与神经编解码器(抗混叠高保真生成)
- 核心创新点: 针对神经声码器/编解码器在音乐与歌声高频段常见的混叠(aliasing)失真,系统性指出其主要来源于非线性激活与上采样结构引入的频谱折叠,并提出“从架构层面消除混叠”的神经合成方案:通过在非线性与上采样路径中引入可控带宽/抗混叠约束(而非仅靠后处理滤波),在生成过程中抑制不可逆的折叠伪影,从而提升高频细节、瞬态与谐波结构的保真度,面向高保真音乐/歌声合成更稳健。
- Track: Neural audio synthesis / Vocoders and neural codecs (anti-aliasing for high-fidelity generation)
- Core innovation: Targets aliasing artifacts that limit high-fidelity music and singing synthesis in neural vocoders/codecs. It attributes severe artifacts to spectral folding introduced by nonlinear activations and upsampling operations, and proposes an aliasing-free synthesis approach that enforces anti-aliasing/bandwidth control within the architecture and generation path (instead of relying on post-hoc filtering). This suppresses irreversible folded components during waveform generation, improving high-frequency detail, transients, and harmonic fidelity for music/singing.
- Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
- 赛道归属: 语音生成(零样本文本转语音)/ 离散流匹配推理优化(MI-DFM 调度与校正)
- 核心创新点: 针对MI-DFM在离散生成中的两大瓶颈提出成体系的推理改进:其一,从概率路径的标量参数化出发推导“动力学最优(kinetic-optimal)”调度器,给出无需训练、免超参搜索的数值调度方案以替代经验式scheduler;其二,针对一阶CTMC求解带来的有限步路径跟踪误差,引入矩校正(moment correction)以在有限步数下更准确匹配目标路径分布。方法论突破在于把“怎么排步长/怎么减误差”从启发式工程变为可推导的最优调度与可控校正,从而提升零样本TTS的稳定性与质量/效率权衡。
- Track: Speech generation (zero-shot TTS) / Discrete flow-matching inference optimization (MI-DFM scheduling & correction)
- Core innovation: Addresses two practical blockers of MI-DFM for discrete generation with principled inference upgrades: (1) derives a kinetic-optimal scheduler for scalar-parameterized probability paths, yielding a training-free numerical schedule that removes heuristic tuning; (2) reduces finite-step path-tracking error of first-order CTMC solvers via moment correction, improving distributional matching under limited steps. The key methodological advance is replacing ad-hoc scheduling/error fixes with derivable optimal scheduling and controllable correction, improving stability and the quality–efficiency trade-off in zero-shot TTS.
- OLaPh: Optimal Language Phonemizer
- 赛道归属: 语音合成前端(TTS Front-end)/ 文本到音素(G2P/Phonemization)
- 核心创新点: 提出混合式音素化框架:融合大规模多语种词典(lexica)与现代NLP建模,并引入统计子词切分来处理OOV与跨语言形态变化;通过“词典强约束 + 神经/统计泛化”的组合,在覆盖率与泛化能力之间取得更优折中,提升多语种音素化鲁棒性。
- Track: TTS front-end / Phonemization (G2P)
- Key innovation: A hybrid phonemizer combining extensive multilingual lexica with advanced NLP modeling and statistical subword segmentation, achieving better OOV/generalization while retaining lexicon-backed correctness across languages.
- Text2Score: Generating Sheet Music From Textual Prompts
- 赛道归属: 文本到符号音乐生成 / 乐谱(Sheet Music)生成
- 核心创新点: 提出面向“文本→五线谱”而非仅MIDI的两阶段生成框架Text2Score:先在规划阶段从自然语言提示中生成结构化的音乐蓝图(如段落/节奏/动机等可执行约束),再在执行阶段将计划落地为可渲染的乐谱表示;同时通过“直接从乐谱/排版与符号结构中提取监督信号”的方式构建训练对齐,减少对不可靠自动字幕/描述管线的依赖,从数据稀缺场景下提升文本-乐谱对齐与可控生成能力。
- Track: Text-to-symbolic music generation / Sheet music generation
- Core innovation: Proposes Text2Score, a two-stage framework for generating sheet music (not just MIDI) from natural-language prompts. A planning stage produces a structured musical blueprint (e.g., form/rhythm/motifs as executable constraints), followed by an execution stage that realizes the plan into a renderable sheet-music representation. It further derives supervision signals directly from sheet-music symbolic/layout structure to build more reliable text–music alignment, reducing reliance on noisy automated captioning and improving controllability under scarce paired data.
- SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements
- 赛道归属: 空间音频重建 / 声场估计(稀疏测量引导的生成式建模)
- 核心创新点: 将Flow Matching引入3D声场重建这一典型病态逆问题,聚焦于声学传递函数(ATF)幅度的生成式估计,并通过“稀疏麦克风测量引导”把条件约束注入流匹配过程,实现从少量观测到完整声场幅度分布的重建。核心突破在于把原本用于语音/音乐生成的FM范式改造成可处理空间声学条件约束的生成式求解器,用生成先验补足欠定测量带来的信息缺失,从而提升声场/房间特性恢复的可行性与精度。
- Track: Spatial audio reconstruction / Sound field estimation (generative modeling guided by sparse measurements)
- Core innovation: Brings Flow Matching to 3D sound-field reconstruction, an ill-posed inverse problem, by modeling Acoustic Transfer Function (ATF) magnitude with a generative estimator and injecting constraints from sparse microphone measurements to guide the flow-matching process. The methodological contribution is adapting an FM-based generative paradigm—previously dominant in speech/music generation—into a conditional solver for spatial acoustics, using learned priors to compensate for underdetermined measurements and improving feasibility/accuracy of sound-field and room characterization.
- Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration
- 赛道归属: 音乐编辑 / 音频生成式编辑(零样本复音音乐音色迁移、stem级控制)
- 核心创新点: 面向复音混音中的“指定声部(stem)音色迁移”提出零样本编辑方法,关键在于对扩散模型的跨注意力进行“声学信息驱动的注意力校准(acoustic-informed attention calibration)”,纠正原生cross-attention在密集混合物中对目标声部定位不准、易串扰的问题,从而在改变目标声部音色的同时严格保持伴奏与其他声部不变。方法论突破是把stem级可控编辑转化为注意力层面的可校准机制,用声学线索约束注意力分配,实现对多声部绑定与隔离更可靠的零样本音色迁移。
- Track: Music editing / Generative audio editing (zero-shot polyphonic timbre transfer with stem-level control)
- Core innovation: Proposes a zero-shot method for stem-specific timbre transfer in polyphonic mixtures by introducing acoustic-informed attention calibration for diffusion models. It corrects vanilla cross-attention’s mis-localization and leakage in dense mixtures, enabling timbre changes on a target stem while strictly preserving accompaniment and other stems. The methodological leap is reframing stem-level controllable editing as an attention-calibration problem, using acoustic cues to constrain attention allocation for more reliable source binding and separation during zero-shot timbre transfer.
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
- 赛道归属: 人物中心音视频生成(Audio-Video Generation,多模态联合生成:动作-语音-音效)
- 核心创新点: 提出统一框架在生成阶段显式约束“动作-语音-环境音效”三模态的时序一致性与语义协同,针对三者异质时间尺度与对齐难题,通过跨模态协同建模/对齐机制减少常见的口型-语音、动作-音效错配,实现更连贯的人物中心音视频联合生成。
- Track: Human-centric audio-video generation (multimodal joint generation: motion–speech–sound)
- Key innovations: Introduces a unified generation framework that explicitly enforces temporal alignment and semantic coherence across motion, speech, and environmental sound effects. By addressing heterogeneous temporal dynamics with cross-modal coordination/alignment mechanisms, it reduces typical mismatches (e.g., lip–speech and action–sound desynchronization) and improves coherent human-centric audio-video generation.
GitHub
- [2026-05-17] huggingface/diffusers ⭐33636
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-05-17] AudioKit/AudioKit ⭐11354 🆕NEW
Audio synthesis, processing, & analysis platform for iOS, macOS and tvOS
- [2026-05-16] SamurAIGPT/Generative-Media-Skills ⭐3274
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....
- [2026-05-16] apocas/restai ⭐505
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-05-13] Blaizzy/mlx-video ⭐227
MLX-Video is the best package for inference and finetuning of Image-Video-Audio generation models on your Mac using MLX.
语言大模型 / Large Language Models
arXiv
- Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
- 赛道归属: LLM+强化学习/序列决策(离线RL、MDP/POMDP上的In-Context Learning与微调)
- 核心创新点: 通过对“离线、oracle标注的轨迹”进行监督微调(SFT),把LLM的少样本ICL能力显式迁移到序列决策任务中,使模型能够在MDP、POMDP及更具不确定性的APOMDP设定下,直接从上下文轨迹中进行few-shot决策;方法上将“轨迹作为上下文提示”的ICL与“用高质量轨迹进行SFT”的训练范式结合,系统化提升LLM在长期依赖与部分可观测场景中的决策稳健性。
- Track: LLM + Reinforcement Learning / Sequential Decision-Making (offline RL; ICL + fine-tuning on MDP/POMDP)
- Key innovation: Uses supervised fine-tuning (SFT) on offline, oracle-labeled trajectories to explicitly transfer LLM in-context learning into sequential decision-making, enabling few-shot action selection from trajectory context across MDPs, POMDPs, and ambiguous POMDPs; methodologically couples “trajectory-as-prompt” ICL with high-quality trajectory SFT to improve robustness under long-horizon dependencies and partial observability.
- RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
- 赛道归属: 对齐训练 / 偏好优化(DPO改进、逻辑一致性对齐)
- 核心创新点: 提出Hybrid-DPO的自动化偏好构建与优化框架,用“逻辑正确性信号 + 流畅性偏好信号”的混合偏好来替代单一偏好监督,针对DPO在知识密集生成中被“冗长/流畅偏置”误导的问题进行校正;通过引入基于NLI/逻辑判别器(如DeBERTa类模型)的可验证逻辑一致性评估,与LLM评审/人类偏好形成互补,从而在不牺牲可读性的前提下缩小“逻辑对齐缺口”,提升生成的可证正确性与事实/推理一致性。
- Track: Alignment training / Preference optimization (DPO improvements, logical-grounding alignment)
- Key innovation: Proposes Hybrid-DPO with an automated preference pipeline that blends “logical correctness” signals with “fluency” preference signals, explicitly correcting DPO’s systematic verbosity/fluency bias on knowledge-intensive generation; integrates an NLI/logic verifier (e.g., DeBERTa-style entailment scoring) to provide verifiable grounding that complements LLM-judge/human preferences, narrowing the logical alignment gap while maintaining readability/fluency.
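The hybrid preference idea can be sketched in two parts: rank candidate answers by a blend of a logic score and a fluency score, then feed the resulting chosen/rejected pair into the standard DPO loss. The blending weight `w`, the score fields, and the pair-building rule are assumptions for illustration, not the paper's pipeline. Only the DPO loss itself follows the standard published form.

```python
import math
from typing import Dict, List, Tuple

Candidate = Dict[str, object]


def hybrid_score(entailment: float, fluency: float, w: float = 0.7) -> float:
    """Blend a logic/NLI score with a fluency score; w is a hypothetical weight."""
    return w * entailment + (1.0 - w) * fluency


def build_pair(candidates: List[Candidate], w: float = 0.7) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected) as the best and worst candidates by hybrid score."""
    ranked = sorted(candidates,
                    key=lambda c: hybrid_score(c["entailment"], c["fluency"], w))
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (chosen margin - rejected margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


candidates = [
    {"text": "A", "entailment": 0.90, "fluency": 0.50},  # terse but grounded
    {"text": "B", "entailment": 0.20, "fluency": 0.95},  # fluent but unsupported
]
chosen, rejected = build_pair(candidates)
fluency_only_chosen, _ = build_pair(candidates, w=0.0)
```

With fluency alone (`w=0.0`) the unsupported answer B wins, while the blended score prefers A. That flip is the verbosity/fluency bias the summary says Hybrid-DPO is meant to correct.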
- Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
- 赛道归属: 提示学习 / In-Context Learning(长上下文Many-shot CoT推理)
- 核心创新点: 系统研究many-shot链式思维(CoT)ICL在推理任务上的缩放规律,指出将非推理任务的many-shot经验法则直接迁移会失效;围绕“示例数量、示例组织方式、CoT展示形式与任务泛化”的相互作用给出新的可操作策略,使长上下文下的ICL不仅是“检索相似示例”,而更像在提示内进行可学习的推理程序归纳,从而在无需参数更新的情况下稳定提升推理性能并逼近微调效果。
- Track: Prompting / In-Context Learning (long-context many-shot CoT for reasoning)
- Key innovation: Provides a dedicated scaling study of many-shot Chain-of-Thought ICL on reasoning tasks, showing that heuristics derived from non-reasoning settings do not transfer; proposes improved practices around the number/ordering/format of demonstrations and CoT exposure so that long-context ICL behaves more like learnable in-prompt induction of reasoning procedures (not mere example matching), yielding more stable reasoning gains without parameter updates.
- An Agentic AI Framework with Large Language Models and Chain-of-Thought for UAV-Assisted Logistics Scheduling with Mobile Edge Computing
- 赛道归属: LLM智能体 / 工业调度优化(UAV物流 + 边缘计算MEC的联合调度)
- 核心创新点: 构建面向UAV辅助云制造的Agentic AI框架,将LLM的Chain-of-Thought用于把“物理物流路径/取送决策”与“计算任务卸载/边缘执行/资源分配”耦合建模与分步求解;通过智能体式规划-执行流程,把复杂混合调度问题分解为可解释的决策链,并在动态环境约束(站点任务到达、UAV算力/电量、时延约束等)下实现联合优化,相比传统启发式/单域优化更易扩展到多约束、多目标的实际工业场景。
- Track: LLM agents / Industrial scheduling optimization (joint UAV logistics + MEC scheduling)
- Key innovation: Introduces an agentic framework that leverages LLM Chain-of-Thought to jointly model and solve coupled decisions across physical UAV logistics (routing/pickup-delivery) and computational scheduling (offloading/edge execution/resource allocation); uses a plan–execute decomposition to turn a hard hybrid scheduling problem into interpretable decision steps under dynamic constraints (task arrivals, UAV energy/compute limits, latency), improving extensibility to real multi-constraint, multi-objective industrial settings.
- U-STS-LLM: A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation
- 赛道归属: 时空序列建模与预测(通信/网络流量预测与缺失值插补的LLM化)
- 核心创新点: 提出统一的时空“预测+插补”框架,将原本分离的两类任务在同一模型内联合建模;通过“时空引导/steering”的LLM结构,把交通/流量数据的空间关联与时间动态以可控方式注入语言模型,实现对未来负载预测与缺失数据修复的共享表示与协同优化,从而减少任务割裂带来的误差传播并提升跨场景泛化。
- Track: Spatio-temporal modeling & forecasting (LLM-based traffic prediction and missing-value imputation)
- Key innovation: Introduces a unified framework that jointly models forecasting and imputation, traditionally treated separately, within a single spatio-temporally steered LLM; injects spatial dependencies and temporal dynamics into the LLM via controllable steering mechanisms to learn shared representations that support both future load prediction and missing-data recovery, improving consistency and generalization across settings.
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
- 赛道归属: 推理与规划分析(CoT可解释性/行为刻画)
- 核心创新点: 提出从LLM推理轨迹中“抽取搜索树”的方法学,用结构化量化替代仅看最终答案/文本:在四子棋环境中将CoT中的分支、回溯与前瞻显式拟合为搜索树并度量其深度、分支因子与局部性,从而揭示推理模型存在“短视规划(myopic planning)”等行为特征,并将性能与搜索结构属性建立可检验关联。
- Track: Reasoning & planning analysis (CoT interpretability/behavior characterization)
- Core innovation: Introduces a method to extract and quantify search trees from LLM reasoning traces, fitting deliberative CoT into explicit tree structures in a four-in-a-row game and measuring properties (depth/branching/locality) to reveal myopic planning and link performance to measurable search-structure attributes.
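Once a reasoning trace has been turned into a tree, the structural properties named above are simple to compute. The sketch below assumes a hypothetical nested-dict representation (`{"move": ..., "children": [...]}`) and covers only depth and mean branching factor. Extracting the tree from CoT text, which is the paper's actual contribution, is not shown.

```python
def tree_depth(node):
    """Number of edges on the longest root-to-leaf path."""
    children = node.get("children", [])
    if not children:
        return 0
    return 1 + max(tree_depth(child) for child in children)


def mean_branching_factor(node):
    """Average number of children over nodes that have at least one child."""
    internal_nodes = 0
    edges = 0
    stack = [node]
    while stack:
        current = stack.pop()
        children = current.get("children", [])
        if children:
            internal_nodes += 1
            edges += len(children)
            stack.extend(children)
    return edges / internal_nodes if internal_nodes else 0.0


# Hypothetical trace: three candidate moves considered at the root,
# only the first one explored further.
trace_tree = {
    "move": "root",
    "children": [
        {"move": "c4", "children": [
            {"move": "d4", "children": [{"move": "e4", "children": []}]},
            {"move": "d5", "children": []},
        ]},
        {"move": "c5", "children": []},
        {"move": "c6", "children": []},
    ],
}
```

A wide but shallow tree (high branching factor, low depth) is one pattern that would be consistent with the myopic planning the summary describes.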
- One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization
- 赛道归属: 音频生成/音频控制(文本到均衡器EQ参数的LLM交互式音频调音)
- 核心创新点: 将“自然语言提示→EQ设置”建模为LLM驱动的可对话控制问题,并显式建模听众差异(listener variability):利用受控听音实验数据,让模型在ICL与个性化建模的结合下,同一提示可输出符合不同听众偏好的多样化均衡曲线;方法突破在于把主观偏好分布作为学习目标的一部分,而非学习单一“平均”EQ映射。
- Track: Audio control / text-to-audio-parameter mapping (LLM-based equalization)
- Key innovation: Frames “natural-language prompt → EQ parameters” as a conversational LLM control task while explicitly modeling listener variability; leverages controlled listening-study data so the model can, via a combination of in-context learning and personalization, produce diverse EQ settings for the same prompt aligned with different user preferences, optimizing for preference distributions rather than a single averaged mapping.
- MARLIN: Multi-Agent Game-Theoretic Reinforcement Learning for Sustainable LLM Inference in Cloud Datacenters
- 赛道归属: 推理优化 / 绿色AI(数据中心LLM推理的能耗与碳水足迹优化)
- 核心创新点: 提出MARLIN的多智能体博弈论强化学习框架,把云数据中心的LLM推理服务建模为多主体决策问题,在满足SLA/吞吐/时延的同时联合优化能耗、碳排与用水等可持续性指标;通过博弈论机制刻画不同调度/资源管理主体(如集群、机架、作业/请求层)的策略交互,学习在时空变化的电网碳强度与负载波动下的自适应推理调度与资源分配策略,实现比静态规则或单智能体RL更稳健的可持续推理服务。
- Track: Inference optimization / Green AI (datacenter sustainability for LLM serving)
- Key innovation: Proposes MARLIN, a multi-agent game-theoretic RL framework that models LLM inference serving as interacting decision-makers and jointly optimizes sustainability metrics (energy, carbon emissions, water use) under SLA/latency/throughput constraints; uses game-theoretic structure to capture strategic interactions across scheduling/resource-control entities and learns adaptive policies under time-varying grid carbon intensity and workload dynamics, improving robustness over static heuristics or single-agent RL.
- The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
- 赛道归属: 代码生成质量评估 / 可读性与提示工程(LLM生成代码的非功能属性)
- 核心创新点: 提出“可读性谱系(Readability Spectrum)”的系统化分析框架,超越功能正确性,从模式、缺陷与提示影响三个维度刻画LLM生成代码的可读性分布;通过对比人类代码与模型代码,识别典型可读性问题(如命名、结构、注释、复杂度、风格一致性等)及其与提示设计的因果关联,进而给出可操作的提示干预方法,使提示不仅控制“能跑”,还能可控地提升“易读、易审、易维护”的代码质量。
- Track: Code generation evaluation / Readability & prompt engineering (non-functional quality)
- Key innovation: Establishes a “Readability Spectrum” framework to systematically study LLM-generated code beyond functional correctness, characterizing patterns, issues, and prompt effects on readability; contrasts human vs. model code to surface recurring readability defects (naming, structure, comments, complexity, style consistency) and links them to prompt choices, enabling actionable prompt interventions that steer code toward being not only correct but also reviewable and maintainable.
- Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents 🆕NEW
- 赛道归属: LLM智能体评测与用户模拟(Persona生成/对话仿真)
- 核心创新点: 提出超越“合作型”LLM用户模拟器的评测框架,通过生成更贴近真实世界分布的用户画像与交互行为(如不清晰表达、不耐烦、信息保留/抗拒等非合作特征),缓解传统模拟器因底座模型偏好而导致的同质化与过度配合问题;核心方法论在于将“用户多样性与非合作性”显式建模为可控的persona/行为生成机制,用以构造更具压力测试性质的交互数据,从而提升对LLM智能体鲁棒性与失效模式的评估可信度。
- Track: LLM agent evaluation & user simulation (persona generation / dialogue simulation)
- Key innovations: Introduces an evaluation approach that moves beyond overly cooperative LLM-based user simulators by generating user personas and interaction behaviors that better match real-world variability—e.g., unclear requests, impatience, reluctance to disclose information, and other non-cooperative traits. Methodologically, it explicitly models and controls “user diversity and non-cooperativeness” as a persona/behavior generation mechanism, enabling stress-test style interaction data that exposes failure modes and yields more reliable robustness evaluation for LLM agents.
GitHub
- [2026-05-18] sgl-project/sglang ⭐27928
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-05-17] NVIDIA/TensorRT-LLM ⭐13664
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-05-17] basellm/llm-metadata ⭐109
A lightweight interface for accessing and integrating LLM metadata, enabling applications to seamlessly discover, query, and integrate large language ...
- [2026-05-17] ahammadmejbah/Awesome-Datasets-Hub ⭐105 🆕NEW
A curated collection of datasets for Large Language Models (LLMs), covering medical AI, NLP, multimodal learning, instruction tuning, reasoning, code ...
- [2026-05-17] gpt-cmdr/ras-commander ⭐59 🆕NEW
The RAS-Commander library provides a python API for automating HEC-RAS 6.x and accessing HDF data using Python, built with and driven by large languag...
HuggingFace Models
HuggingFace Datasets
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2026-02-10] Modotte/CodeX-2M-Thinking
Modotte
Note: This dataset is part of the lineup CodeX by Modotte. You can get lots of datasets in this same lineup, with the main ...
多模态大模型 / Multimodal Models
arXiv
- GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
- 赛道归属: 多模态模型推理优化 / 视觉Token剪枝与模型压缩
- 核心创新点: 将VLM视觉token剪枝从常见的连续梯度松弛视角,转向更贴合本质的离散组合优化建模;提出“组相对重要性”(Group-Relative Importance)的剪枝框架,通过在组内/组间进行相对重要性比较来稳定地选择保留token,缓解连续近似在激进压缩下易陷入次优局部最小的问题,从而在显著降低视觉token计算开销的同时尽量保持多模态性能。
- Track: Multimodal inference optimization / Visual token pruning & model compression
- Key innovation: Reframes visual-token pruning in VLMs from continuous gradient-relaxation heuristics to a formulation closer to the underlying discrete combinatorial nature; introduces Group-Relative Importance pruning that selects tokens via relative importance comparisons within/across groups, reducing the tendency of continuous approximations to get stuck in suboptimal minima under aggressive pruning, thereby cutting visual-token compute while preserving VLM capability.
- PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models
- 赛道归属: 多模态安全与隐私 / 机器遗忘(VLM个性化部分遗忘评测)
- 核心创新点: 提出PPU-Bench,一个面向真实世界“个性化部分遗忘”(personalized partial unlearning)的VLM基准:不依赖合成知识注入、也不做整类/整主体删除,而是覆盖更贴近用户请求的细粒度跨模态事实删除需求;同时强调fine-tuning-free的评测设定,用统一任务与数据规模(约24K多模态样本)系统衡量模型对敏感记忆的可控删除能力与残留风险。
- Track: Multimodal safety & privacy / Machine unlearning (personalized partial unlearning benchmark for VLMs)
- Core innovation: Introduces PPU-Bench, a real-world benchmark for personalized partial unlearning in VLMs: it avoids synthetic knowledge injection and coarse subject-level deletion, instead targeting fine-grained cross-modal factual removal aligned with realistic user requests; it further adopts a fine-tuning-free evaluation setup and a sizable multimodal dataset (~24K) to systematically measure controllable deletion and residual memorization risk.
- DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
- 赛道归属: 多模态安全与隐私 / VLM成员推断攻击(黑盒审计)
- 核心创新点: 面向仅能观察文本输出的部署型VLM黑盒场景,提出基于“语义干扰”(Semantic Distraction)的成员推断方法:通过构造会诱导模型在语义层面产生分心/偏移的查询与对照,放大训练集成员样本与非成员样本在生成响应上的可分性;规避对logits/概率等不可得信号的依赖,相比依赖掩码预测等任务的既有方法,更适配真实API审计条件并提升攻击有效性。
- Track: Multimodal security & privacy / Black-box membership inference on VLMs
- Key innovation: Targets deployed black-box VLMs where only textual generations are observable; proposes Semantic Distraction–based membership inference by crafting distraction/contrast prompts that induce semantic-level deviations, amplifying separability between member vs. non-member samples in generated responses; avoids reliance on inaccessible probability/logit signals and, compared with prior attacks that depend on tasks such as masked prediction, is more practical and more effective in real API auditing settings.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
- 赛道归属: 多模态推理(视觉潜变量/latent reasoning 对齐与训练范式)
- 核心创新点: 提出GAP(Granular Alignment Paradigm)以稳定提升视觉潜变量推理:指出现有“output-as-input”视觉latent范式不稳定的关键原因在于特征空间不匹配——常见做法在pre-norm MLLM中直接复用decoder隐藏态作为下一步latent输入,导致预测latent与期望视觉特征分布错位;GAP通过更细粒度的对齐机制/训练约束来缓解该mismatch,从而让连续视觉证据token的生成与消费更一致、收益更稳定。
- Track: Multimodal reasoning (visual latent reasoning alignment/training paradigm)
- Core innovation: Proposes GAP (Granular Alignment Paradigm) to stabilize gains from visual latent reasoning: it diagnoses instability in the common “output-as-input” latent pipeline as a feature-space mismatch—pre-norm MLLMs often reuse decoder hidden states as predicted latent inputs, misaligning the latent distribution with the intended visual feature space; GAP introduces finer-grained alignment/constraints to better match produced and consumed continuous visual-evidence tokens, yielding more consistent improvements.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone
- 赛道归属: 视觉语言预训练 / 数据工程(仅数据策划提升VLM)
- 核心创新点: 系统验证“只靠数据策划即可显著提升VLM”的上限:在架构、训练配方与算力固定的前提下,仅改变训练数据,通过一套数据筛选/清洗/重配比的策划流水线作用于MAmmoTH-VL单图子集,在20个公开VLM基准(覆盖grounding、VQA等)上平均提升+11.7pp;方法论贡献在于把性能增益明确归因到数据分布与质量控制,并给出可复用的数据策划处方来移动质量-算力前沿。
- Track: Vision-language pretraining / Data curation (VLM improvement via data only)
- Core innovation: Demonstrates how far data curation alone can push VLMs: holding architecture, training recipe, and compute constant, it varies only the training data and applies a systematic filtering/cleaning/rebalancing pipeline to the MAmmoTH-VL single-image subset, achieving +11.7pp average gains across 20 public VLM benchmarks (grounding, VQA, etc.); the key methodological contribution is isolating performance gains to data quality/distribution control and providing a reusable curation “prescription.”
- SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
- 赛道归属: 视频多模态理解评测(体育视频理解 + 可解释/可落地grounding)
- 核心创新点: 提出SoccerLens评测框架,将足球视频理解从“只看分类准确率”推进到“是否基于真实视觉证据”的检验:针对视角变化大、镜头切换快、场景拥挤等足球视频特性,新增/强化视觉grounding与证据一致性评估,旨在识别VLM是否依赖伪相关与捷径学习;核心突破在于把“超越准确率”的可视化证据对齐与鲁棒性诊断纳入标准化基准。
- Track: Video multimodal understanding evaluation (sports video grounding & reliability)
- Core innovation: Introduces SoccerLens to move soccer video understanding evaluation beyond classification accuracy toward grounded evidence: tailored to soccer’s viewpoint shifts, rapid shot transitions, and clutter, it incorporates visual grounding/evidence-consistency assessments to detect spurious correlations and shortcut learning in VLMs; the methodological advance is standardizing “beyond-accuracy” grounding-centric diagnostics for video VLM evaluation.
- Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?
- 赛道归属: 具身智能 / 多模态代理评测(镜像自我识别与自他区分)
- 核心创新点: 构建受控3D基准测试VLM代理的“镜像自我识别”功能:第一人称具身代理需从镜中反射推断自身隐藏身体属性并匹配目标,同时避免把他者误认为自己(self-other misattribution);通过任务设计将“镜像线索推理能力”与“基于先验/捷径的猜测”区分开,为评估VLM代理的自我表征、视角几何理解与身份归因提供可控实验范式。
- Track: Embodied AI / Multimodal agent evaluation (mirror self-recognition)
- Core innovation: Builds a controlled 3D benchmark to test mirror self-recognition in first-person VLM agents: the agent must infer a hidden body attribute from its reflection and select the matching target while avoiding self–other misattribution; the key methodological contribution is disentangling mirror-grounded reasoning from shortcut/priors, enabling controlled evaluation of self-representation, viewpoint geometry understanding, and identity attribution.
- Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
- 赛道归属: 3D医学多模态理解评测(体数据语义-空间推理VQA)
- 核心创新点: 提出CT-SpatialVQA,用于系统评估3D医学VLM在CT体数据上的“语义-空间”理解:针对现有模型可能依赖语言相关性与数据先验、缺乏空间落地的问题,基准聚焦解剖语义与三维空间关系的联合推理(例如方位、相对位置、跨切片一致性等),从而更精确地区分“真正读懂体数据”与“靠先验答题”。
- Track: 3D medical multimodal understanding evaluation (semantic-spatial VQA on volumes)
- Core innovation: Introduces CT-SpatialVQA to systematically evaluate semantic-spatial reasoning of 3D medical VLMs on CT volumes: it targets joint anatomical semantics and 3D spatial relations (e.g., orientation, relative position, cross-slice consistency) to reveal whether models are truly grounded in volumetric evidence versus relying on priors and language correlations.
- Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits
- 赛道归属: 多模态可靠性与可解释性 / 机制可解释(VLM内部因果电路分析)
- 核心创新点: 提出统一的机制分析流水线VRP(VLM Reliability Probe),直接检验“注意力越尖锐越可靠”的Attention-Confidence假设:在LLaVA-1.5、PaliGemma、Qwen2-VL等开源VLM上,联合对比注意力结构、生成动态与隐藏态表征,并通过因果电路(causal circuits)视角定位可靠性来源;方法论突破在于把可靠性从表层可视化(attention map)提升到可干预、可归因的内部状态与因果路径层面。
- Track: Multimodal reliability & interpretability / Mechanistic interpretability (causal circuit analysis in VLMs)
- Core innovation: Proposes VRP (VLM Reliability Probe), a unified mechanistic pipeline to directly test the Attention–Confidence assumption (“sharper attention implies more reliable answers”): across open-weight VLMs (LLaVA-1.5, PaliGemma, Qwen2-VL), it jointly analyzes attention structure, generation dynamics, and hidden-state representations, and uses causal-circuit perspectives to localize where reliability arises; the key advance is shifting reliability assessment from surface attention visualizations to intervenable, attributable internal states and causal pathways.
- [2026-05-08] Fine-tuning a vision-language model for fracture-surface morphology recognition
- 赛道归属: 科学影像理解 / 领域VLM微调(材料断口形貌识别)
- 核心创新点: 基于开源VLM(Qwen3-VL-32B-Instruct)进行材料断口图像的领域适配微调,构建并利用13,168张文献挖掘的断口图像数据集;通过推理型大模型从“图像+文本”联合生成形貌标注,实现低人工成本的可扩展标注管线,从而把通用VLM的视觉表征对齐到材料学形貌判别所需的细粒度纹理/结构知识。
- Track: Scientific image understanding / domain VLM fine-tuning (fracture-surface morphology recognition)
- Key innovation: Domain-adapts an open VLM (Qwen3-VL-32B-Instruct) via fine-tuning on a curated 13,168-image literature-mined fracture dataset; uses a reasoning LLM to generate morphology annotations from joint image+text evidence, forming a scalable, low-manual-cost labeling pipeline that aligns generic VLM representations to fine-grained materials morphology cues.
GitHub
- [2026-05-16] Blaizzy/mlx-vlm ⭐4738
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-05-17] waybarrios/vllm-mlx ⭐1178
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-16] zli12321/Vision-Language-Models-Overview ⭐588
A most Frontend Collection and survey of vision-language model papers, and models GitHub repository. Continuous updates.
- [2026-05-13] Roots-Automation/GutenOCR ⭐187
Open-source tools for training and evaluating Vision Language Models for OCR
- [2026-05-15] ocy1/TRIO ⭐108
Official implementation for "TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models"
强化学习 / Reinforcement Learning
arXiv
- MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning
- 赛道归属: 自动驾驶强化学习(多模态感知-控制融合、可解释决策)
- 核心创新点: 提出以“3D可供性(affordances)”作为感知与控制之间的中间表征,用多模态Transformer从RGB等输入预测结构化、可解释的3D可供性,再由强化学习在该表征空间上进行策略学习;相较端到端直接回归动作,减少感知-控制脆弱接口带来的误差传播,同时提升在城市密集交互场景下的鲁棒性与可解释性。
- Track: Autonomous driving RL (multimodal perception-control fusion, interpretable decision-making)
- Core innovations: Bridges perception and control via explicit 3D affordance representations: a multimodal Transformer predicts structured, interpretable 3D affordances from RGB (and other modalities), and an RL policy is trained on this intermediate space. This avoids brittle end-to-end action regression, mitigates error propagation across modules, and improves robustness under dense urban interactions while enhancing interpretability.
- How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
- 赛道归属: 大模型对齐与推理优化(RL后训练、KV Cache压缩/显存优化)
- 核心创新点: 提出“Shadow Mask Distillation”用于RL在线rollout阶段的KV cache压缩:通过蒸馏得到可学习的掩码/稀疏策略,在尽量不破坏对齐与长上下文推理质量的前提下,显著降低轨迹生成时KV缓存的显存占用,从而缓解长上下文RL后训练的“memory wall”,提升可扩展性与吞吐。
- Track: LLM alignment & inference optimization (RL post-training, KV-cache compression/memory efficiency)
- Core innovation: Proposes Shadow Mask Distillation to compress KV cache during online RL rollouts by distilling a learnable masking/sparsification policy, reducing the KV-memory footprint while preserving alignment and long-context reasoning quality, thereby easing the rollout “memory wall” and improving scalability/throughput.
- A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment 🆕NEW
- 赛道归属: 大语言模型对齐(RLHF)/ 偏好强化学习优化(Preference-based RL Optimization)
- 核心创新点: 提出一个以 Pair-GRPO 为核心的统一偏好优化理论框架,将主流成对偏好学习中“隐式约束”的做法系统化为一族方法,并给出两种紧耦合变体:Soft-Pair-GRPO(以软方式施加偏好约束)与 Hard-Pair-GRPO(以显式/硬约束形式刻画偏好约束)。该框架旨在从方法层面同时缓解 RLHF 中常见的策略更新不稳定、梯度方向不明确、可解释性差与梯度方差高等问题:通过把偏好信号转化为可控、可解释的约束形式,统一了从隐式到显式约束的优化路径,从而提升训练稳定性与泛化对齐效果。
- Track: LLM Alignment (RLHF) / Preference-based RL Optimization
- Core innovations: Introduces a unified theoretical framework built around the Pair-GRPO family to systematize pairwise preference optimization, explicitly bridging implicit and explicit preference constraints. It defines two tightly coupled variants—Soft-Pair-GRPO (softly enforced preference constraints) and Hard-Pair-GRPO (explicit/hard constraints)—to address key RLHF pain points at the algorithmic level: unstable policy updates, ambiguous gradient directions, low interpretability, and high gradient variance. By converting preference signals into controllable, interpretable constraint formulations, the framework unifies optimization behaviors across constraint regimes and targets more stable and generalizable alignment.
- Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
- 赛道归属: 自监督强化学习(对比学习 + On-policy策略优化,离散/连续动作通用)
- 核心创新点: 将对比式强化学习(CRL)的表征学习目标与PPO式on-policy优化深度耦合,提出“Contrastive Proximal Policy Optimisation”框架:在不依赖手工奖励的前提下,通过对比目标学习目标条件价值/表征,同时用近端策略更新保证on-policy训练稳定性与样本效率;突破既有CRL几乎都依赖off-policy且偏连续动作的限制,使其更贴近主流on-policy算法并可扩展到离散环境。
- Track: Self-supervised RL (contrastive learning + on-policy policy optimization; discrete/continuous actions)
- Core innovation: Introduces Contrastive Proximal Policy Optimisation that tightly integrates CRL’s contrastive representation/value learning with PPO-style on-policy updates. It removes handcrafted rewards while retaining stable, efficient on-policy training via proximal updates, addressing the prior CRL reliance on off-policy optimization and improving applicability to discrete-action settings.
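The "proximal update" referred to above is, in standard PPO, the clipped surrogate objective. A minimal per-sample sketch of that well-known term follows. How the paper derives the advantage from its contrastive objective is not stated in this summary, so `advantage` is simply taken as an input here.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); eps is the clipping range.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any incentive to move the ratio
    # further outside [1 - eps, 1 + eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage the objective stops growing once the ratio exceeds `1 + eps`, and for a negative advantage it stops improving once the ratio drops below `1 - eps`, which is what keeps on-policy updates close to the data-collecting policy.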
- HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning
- 赛道归属: 强化学习用于代码生成/编译优化(面向高层综合HLS的QoR优化)
- 核心创新点: 提出“代理比较奖励(proxy comparative reward)”的RL范式:不依赖昂贵且噪声大的绝对综合QoR数值,而以候选pragma/代码变体之间的相对优劣比较作为奖励信号来驱动策略学习,从而显著降低综合评估成本并更直接对齐QoR(时延/资源)目标;将LLM生成与QoR感知的RL闭环结合,使训练目标从“功能正确”扩展到“可综合且高QoR”的可控搜索与生成。
- Track: RL for code generation / compiler optimization (QoR-aware HLS)
- Core innovation: Proposes proxy comparative reward RL: instead of requiring expensive absolute synthesis QoR, it learns from relative comparisons among candidate pragma/code variants, reducing evaluation cost and aligning optimization directly with latency/resource QoR. It closes the loop between LLM-based generation and QoR-aware RL to move beyond functional correctness toward controllable, high-QoR synthesis-oriented generation.
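One way to read "learning from relative comparisons" is a win/loss reward computed within a batch of candidates. The sketch below is an assumption-laden illustration: the `better` comparator, the `latency` and `area` fields, and the tie-breaking rule are all hypothetical, and the paper's proxy model is not reproduced.

```python
def comparative_rewards(candidates, better):
    """Reward each candidate by (wins - losses) / (n - 1) against the others."""
    n = len(candidates)
    rewards = []
    for i, cand in enumerate(candidates):
        wins = sum(1 for j, other in enumerate(candidates) if j != i and better(cand, other))
        losses = sum(1 for j, other in enumerate(candidates) if j != i and better(other, cand))
        rewards.append((wins - losses) / (n - 1) if n > 1 else 0.0)
    return rewards


def lower_latency_then_area(a, b):
    """Hypothetical proxy comparator: lower latency wins, area breaks ties."""
    return (a["latency"], a["area"]) < (b["latency"], b["area"])


# Three hypothetical pragma variants with proxy QoR estimates.
variants = [
    {"name": "no_pragma", "latency": 100, "area": 50},
    {"name": "unroll_4", "latency": 80, "area": 70},
    {"name": "pipeline", "latency": 80, "area": 60},
]
rewards = comparative_rewards(variants, lower_latency_then_area)
```

Only the ordering of candidates matters here, so a noisy or biased absolute QoR estimate still yields a usable reward as long as it ranks variants correctly.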
- D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
- 赛道归属: 分布式强化学习系统(面向具身智能VLA大模型的高并发异步训练)
- 核心创新点: 面向视觉-语言-动作(VLA)大模型在具身仿真中的RL训练瓶颈,提出高并发分布式异步框架:通过解耦“仿真执行(CPU/物理引擎)”与“模型训练(GPU/显存带宽)”的资源竞争,采用异步流水与调度机制提升端到端吞吐,缓解因仿真与训练互相阻塞导致的利用率低下;使大规模VLA的RL更可扩展、更接近工程可用的训练效率。
- Track: Distributed RL systems (high-concurrency async training for embodied VLA models)
- Core innovation: Presents a high-concurrency distributed asynchronous RL framework tailored to VLA models, mitigating systemic bottlenecks caused by contention between high-fidelity simulation and GPU-heavy training. By decoupling execution and learning with asynchronous pipelining/scheduling, it improves end-to-end throughput and scalability for large embodied VLA RL training.
- Active Sensing with Meta-Reinforcement Learning for Emitter Localization from RF Observations
- 赛道归属: 主动感知与定位(Meta-RL用于RF/GNSS干扰源定位)
- 核心创新点: 将干扰源定位建模为“序贯主动感知”问题:智能体基于RF观测在环境中主动选择下一步采样/移动策略,以最大化定位信息增益并最终推断发射源位置;引入元强化学习以在不同环境/多径条件/场景分布间快速适应,实现“少量交互即可学会有效探测策略”的跨场景泛化,相比被动测量或固定路径扫描更具样本效率与鲁棒性。
- Track: Active sensing & localization (Meta-RL for RF/GNSS interference emitter localization)
- Core innovation: Formulates emitter localization as sequential active sensing: an agent adaptively chooses sensing/motion actions from RF observations to maximize localization effectiveness. Uses meta-RL to rapidly adapt across environments and multipath conditions, enabling sample-efficient, robust policies that generalize better than passive or pre-planned scanning strategies.
- Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance
- 赛道归属: 安全强化学习 + 事件触发控制/通信效率(Run-time Assurance)
- 核心创新点: 将“何时行动/通信”作为与“采取何种控制”同等重要的决策变量,提出单一策略联合学习控制输入与触发时机;在点式Lyapunov安全盾(run-time assurance)约束下,用CARE-LQR备份与Lyapunov证书保证稳定与安全,同时通过学习到的触发机制减少不必要的动作更新/通信开销;把经典Lyapunov-STC的解析安全性与RL的自适应性结合,实现可证明安全前提下的通信高效RL。
- Track: Safe RL + event-triggered control / communication-efficient RL (run-time assurance)
- Core innovation: Treats “when to act/communicate” as a first-class decision alongside control, learning both with a single policy. Enforces safety/stability via a pointwise Lyapunov shield with CARE-based LQR backups and Lyapunov certificates, while the learned triggering reduces unnecessary updates/communication. It bridges analytical Lyapunov-STC guarantees with RL adaptability under run-time assurance.
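The interplay between a learned trigger and a run-time shield can be illustrated on a toy scalar system. The sketch below is a simplification under stated assumptions: the system `x_next = a * x + b * u`, the Lyapunov function `V(x) = x**2`, the fixed gain `k_backup` (standing in for the CARE-based LQR backup), and `sparse_policy` are all hypothetical choices, not the paper's setup.

```python
def shielded_step(x, u_held, propose, a=1.05, b=1.0, k_backup=0.5, decay=0.95):
    """One step of event-triggered control with a pointwise Lyapunov check."""
    u_new, trigger = propose(x)
    u = u_new if trigger else u_held      # hold the old input unless triggered
    communicated = bool(trigger)
    x_next = a * x + b * u
    # Shield: require V(x_next) <= decay * V(x) with V(x) = x**2.
    if x_next ** 2 > decay * x ** 2:
        u = -k_backup * x                 # backup controller overrides
        x_next = a * x + b * u
        communicated = True
    return x_next, u, communicated


def simulate(x0, propose, steps=10):
    """Roll out the shielded loop and count how often an input was sent."""
    x, u, sent = x0, 0.0, 0
    trajectory = [x0]
    for _ in range(steps):
        x, u, communicated = shielded_step(x, u, propose)
        sent += communicated
        trajectory.append(x)
    return trajectory, sent


def sparse_policy(x):
    """Hypothetical policy: propose u = -0.5 * x, but only trigger far from 0."""
    return -0.5 * x, abs(x) > 0.5


trajectory, sent = simulate(1.0, sparse_policy, steps=10)
```

The shield guarantees the Lyapunov decrease at every step regardless of what the policy proposes, so the policy is free to skip updates whenever holding the previous input is still safe.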
- Discrete Flow Matching for Offline-to-Online Reinforcement Learning
- 赛道归属: 离线到在线强化学习(离散动作生成式策略、Flow Matching)
- 核心创新点: 针对离散动作空间提出基于离散Flow Matching的生成式策略在线微调框架(DRIFT):将原本偏连续控制的扩散/flow匹配范式离散化以适配离散动作,并设计在线更新机制,使策略在与环境交互后能持续改进且不“遗忘”离线数据中学到的有效行为,从而缓解offline-to-online迁移中的分布偏移与性能退化。
- Track: Offline-to-online RL (discrete-action generative policies, Flow Matching)
- Core innovations: Introduces DRIFT, an online fine-tuning method for discrete action spaces using discrete flow matching. It adapts diffusion/flow-matching-style generative policies—typically built for continuous control—to discrete actions, and proposes an online update procedure that improves with new interaction while preserving useful behaviors from offline datasets, addressing distribution shift and degradation in offline-to-online RL.
- On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
- 赛道归属: POMDP长时序强化学习(记忆机制与泛化、RNN动力学)
- 核心创新点: 将长视野(长horizon)POMDP中的泛化困难归因到记忆网络的动力学性质,提出“多稳态(multistability)”对horizon泛化的重要性:通过分析/构造具有多个稳定吸引子状态的记忆表征,使RNN能在长时间间隔后仍保持关键信息并形成可分离的内部状态,从而提升长时依赖任务的样本效率与跨horizon泛化能力。
- Track: Long-horizon RL in POMDPs (memory mechanisms & generalization, RNN dynamics)
- Core innovations: Attributes poor horizon generalization in long-horizon POMDPs to the dynamical properties of memory networks, highlighting the role of multistability. By encouraging/leveraging multiple stable attractor states in recurrent memory representations, the agent can retain task-relevant information across long delays and maintain separable internal states, improving sample efficiency and generalization across horizons.
GitHub
- [2026-05-17] rllm-org/rllm ⭐5530
Democratizing Reinforcement Learning for LLMs
- [2026-05-17] google-deepmind/dm_control ⭐4581 🆕NEW
Google DeepMind's software stack for physics-based simulation and Reinforcement Learning environments, using MuJoCo.
- [2026-05-17] pytorch/rl ⭐3431
A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
- [2026-05-18] radixark/miles ⭐1341
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-05-17] LucasAlegre/morl-baselines ⭐520 🆕NEW
Multi-Objective Reinforcement Learning algorithms implementations.
HuggingFace Datasets
- [2026-05-13] TuringEnterprises/Open-MM-RL
Dataset Summary
Open-MM-RL is a multimodal STEM reasoning dataset covering Physics, Mathematics, Biology, and Chemistry. It is designed for...
- [2026-05-14] PsiBotAI/SynData
SynData
- [2026-05-14] AlienKevin/SWE-ZERO-12M-trajectories
SWE-ZERO 12M Trajectories
The largest agentic-coding trace dataset to date: 112 B tokens of execution-free agentic trajectories covering 12...
- [2026-05-03] ADSKAILab/Zero-To-CAD-1m
Zero-to-CAD 1M
One million executable, interpretable CAD construction sequences synthesized entirely without real-world data.
...
- [2026-05-08] Qwen/WebWorldData
WebWorldData 🌐 Overview
WebWorldData is a large-scale dataset of 1.06M web interaction trajectories collect...
世界动作模型 / World Action Model
arXiv
- World Action Models: The Next Frontier in Embodied AI
- 赛道归属: 具身智能 / 视觉-语言-动作(VLA)+ 世界模型(World Model)融合的策略学习(World Action Model范式)
- 核心创新点: 提出并系统化“世界动作模型(WAM)”这一新范式:将环境动力学的显式预测(世界模型)纳入动作生成/策略学习管线,突破传统VLA仅做“观测→动作”的反应式映射,转向“可干预的未来演化建模 + 基于预测的动作决策”的统一框架,从而为具身基础模型提供更强的可规划性、可推演性与对物理演化的建模能力。
- Track: Embodied AI / Policy learning that fuses vision-language-action (VLA) models with world models (the World Action Model paradigm)
- Core innovation: Introduces and formalizes the “World Action Model (WAM)” paradigm: explicitly integrates environment dynamics prediction (world models) into action generation/policy learning, moving beyond reactive VLA observation-to-action mappings toward a unified framework that models intervention-conditioned future evolution and uses it for decision making, improving planning capability, rollouts, and physical-world evolution modeling in embodied foundation models.
- [2026-05-08] Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
- 赛道归属: 世界模型评测与可靠性诊断(World Action Model / 动态一致性)
- 核心创新点: 提出并系统化定义WAM可靠性的关键缺失维度——动作-状态一致性(action-state consistency),用于检验“模型生成的未来”是否与其声称的动作序列在动力学上相容,而不仅是视觉上合理;围绕该一致性构建诊断框架/评测思路,将WAM的失效从“看起来对”细化为“动力学不兼容”的可检测问题,从而为后续训练目标、校准与安全执行提供可操作的评价轴。
- Track: World-model evaluation & reliability diagnostics (World Action Model / dynamic consistency)
- Core innovation: Introduces and formalizes action–state consistency as a missing reliability axis for WAMs, testing whether imagined futures are dynamically compatible with the predicted action sequence rather than merely visually plausible; builds a diagnostic/evaluation perspective around this notion to make WAM failure modes measurable as dynamical incompatibility, enabling more actionable assessment for calibration, training objectives, and safe deployment.
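A simple way to make "dynamically compatible" measurable is to replay the predicted actions through a reference dynamics function and compare against the imagined states. The sketch below assumes such a reference `dynamics(state, action)` is available (for example a simulator) and uses mean squared error. Both are illustrative assumptions, not the paper's diagnostic.

```python
def consistency_error(states, actions, dynamics):
    """Mean squared gap between imagined next states and dynamics(state, action)."""
    if len(states) != len(actions) + 1:
        raise ValueError("need exactly one more state than actions")
    if not actions:
        return 0.0
    total = 0.0
    for state, action, imagined_next in zip(states, actions, states[1:]):
        expected_next = dynamics(state, action)
        total += sum((e - s) ** 2 for e, s in zip(expected_next, imagined_next))
    return total / len(actions)


def additive_dynamics(state, action):
    """Toy reference dynamics: each action is a displacement added to the state."""
    return tuple(s + a for s, a in zip(state, action))


actions = [(1.0,), (2.0,)]
compatible = [(0.0,), (1.0,), (3.0,)]     # follows the actions exactly
incompatible = [(0.0,), (1.0,), (1.0,)]   # plausible states, but ignores action 2
```

The second trajectory could look perfectly reasonable frame by frame, yet its error is nonzero because the final state is not what the stated action would produce.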
- When to Trust Imagination: Adaptive Action Execution for World Action Models 🆕NEW
- 赛道归属: 机器人操控(World Action Model / 视觉-动作联合预测)与闭环控制策略(自适应动作执行)
- 核心创新点: 将WAM的“固定步长开环执行”问题重构为未来-现实一致性验证(future-reality verification):在执行过程中持续对比模型想象的未来观测与真实环境rollout的一致性,并据此自适应决定每次推理后应连续执行的动作步数(何时继续信任想象、何时提前中止并重新推理)。该方法在不改变WAM本体预测机制的前提下,引入面向执行阶段的验证与决策模块,实现从开环到更稳健的闭环执行,降低因预测漂移导致的失配与失败风险。
- Track: Robotic manipulation (World Action Models / joint vision-action prediction) & closed-loop control policy (adaptive action execution)
- Key innovation: Recasts the fixed-horizon open-loop execution of WAMs as a future–reality verification problem: during rollout, the agent continuously checks whether the imagined future observations remain consistent with real observations, and adaptively chooses how many predicted actions to execute per inference (when to keep trusting imagination vs. when to stop early and re-plan). This adds an execution-time verification/decision layer without modifying the core WAM predictor, improving robustness against prediction drift and model–world mismatch.
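The execution-time check described above can be sketched as a loop that keeps executing predicted actions only while imagined and real observations agree. The names `env_step`, `distance`, `threshold`, and the drifting toy environment are hypothetical. The paper's verification module is likely learned or more elaborate than a fixed threshold.

```python
def execute_until_mismatch(actions, imagined_obs, env_step, distance, threshold):
    """Execute predicted actions while imagination matches reality.

    Returns the number of actions actually executed before re-planning.
    """
    executed = 0
    for action, predicted in zip(actions, imagined_obs):
        real = env_step(action)
        executed += 1
        if distance(predicted, real) > threshold:
            break  # imagined future has drifted from reality: stop and re-plan
    return executed


def make_drifting_env(drift):
    """Toy 1-D environment whose true dynamics add an unmodeled drift per step."""
    state = {"x": 0.0}

    def step(action):
        state["x"] += action + drift
        return state["x"]

    return step


plan = [1.0, 1.0, 1.0, 1.0]
imagined = [1.0, 2.0, 3.0, 4.0]  # what a drift-free model would predict


def abs_gap(predicted, real):
    return abs(predicted - real)
```

With no drift the whole chunk is executed open-loop. With drift the loop stops as soon as the accumulated gap crosses the threshold, so the number of executed steps adapts to how well the model matches the world.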
- The DAWN of World-Action Interactive Models
- 赛道归属: 自动驾驶 / 世界模型驱动的交互式规划与动作生成(World-Action Interactive Models, WAIM)
- 核心创新点: 提出“世界-动作交互模型(WAIM)”以刻画世界预测与动作选择的互依关系,指出现有WAM常见的并行分支或“先预测再规划”的刚性流水线难以体现“动作影响场景演化、场景演化反过来约束动作”的闭环耦合;并在自动驾驶中实例化为DAWN,通过在生成过程中联合/交替地对动作与世界演化进行一致性建模(以去噪式生成视角实现动作与未来场景的协同推断),实现更符合交互逻辑的场景-动作联合生成与规划。
- Track: Autonomous driving / World-model-driven interactive planning and action generation (World-Action Interactive Models, WAIM)
- Core innovation: Proposes “World-Action Interactive Models (WAIMs)” to explicitly capture the reciprocity between world prediction and action selection, addressing limitations of prior WAMs that use parallel branches or rigid predict-then-plan pipelines. Instantiated as DAWN for autonomous driving, it performs coupled, consistency-driven co-inference of actions and future scene evolution during generation (via a denoising-style formulation), enabling more interaction-faithful joint scene–action generation and planning.
GitHub
- [2026-05-14] DravenALG/awesome-vla-wam ⭐407
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
- [2026-05-15] OpenMOSS/Awesome-WAM ⭐297
A curated, continuously updated reading list, paper blogs, and resources for World Action Models (WAMs) in embodied AI.
- [2026-05-12] jiangranlv/DyWA ⭐83
[ICCV 2025] DyWA:Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Generated automatically by Daily AI Digest Agent. 生成时间 / Generated at: 2026-05-18 01:00:45