AI 每日进展速报 / Daily AI Digest - 2026-05-14
图像生成/编辑 / Image Generation/Editing
arXiv
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards 🆕NEW
- 赛道归属: 文生图(偏好对齐/后训练强化学习)
- 核心创新点: 提出 SOLACE 后训练框架,用“模型内生自信度”替代外部奖励模型/人工偏好监督:将模型自身生成结果重新加噪,并以其对注入噪声的恢复准确性作为自信度奖励信号,从而在不依赖额外标注或奖励网络的情况下进行偏好对齐式优化,提升生成的可靠性与审美一致性。
- Track: Text-to-Image (preference alignment / post-training RL)
- Core innovation: Introduces SOLACE, a post-training framework that replaces external reward supervision with an intrinsic self-confidence signal: it re-noises the model’s own outputs and uses denoising recovery accuracy as a reward, enabling alignment-style optimization without reward models or human labels and improving reliability/aesthetic consistency.
- EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation 🆕NEW
- 赛道归属: 文生图(推理时可控生成/组合式生成)
- 核心创新点: 提出 EPIC 的训练无关(training-free)推理时控制:将复杂提示词一次性解析为包含对象变量与类型化谓词(数量、属性、关系等)的“视觉程序”,并把生成改写为谓词引导的搜索/精炼过程,在不改动模型参数的前提下,通过对违反谓词的部分进行定向修正来提升组合一致性与可控性,同时强调推理效率。
- Track: Text-to-Image (inference-time control / compositional generation)
- Core innovation: Proposes EPIC, a training-free inference-time control method that parses a prompt once into a fixed “visual program” (object variables + typed predicates for count/attributes/relations) and performs predicate-guided search/refinement to correct predicate violations without updating model weights, improving compositional faithfulness with efficient inference.
- MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation 🆕NEW
- 赛道归属: 文生图(可控生成/成像因素解耦)
- 核心创新点: 将控制维度从“内容”扩展到“成像链路”,提出对镜头(lens)、传感器(sensor)、视角(view)与场景域(domain)等成像因素的解耦建模与组合生成任务(Imaging Factor Disentanglement):通过显式分离并可组合这些因素,减少文本歧义带来的控制不确定性,实现更精细的风格/设备/视角级别可控的新图像生成。
- Track: Text-to-Image (controllable generation / imaging-factor disentanglement)
- Core innovation: Extends control beyond content by disentangling and compositing imaging factors—lens, sensor, viewpoint, and domain—formulating an Imaging Factor Disentanglement task that explicitly separates these factors to mitigate text ambiguity and enable fine-grained, composable control over device/style/view-level generation.
- UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation 🆕NEW
- 赛道归属: 多参考文生图(身份保持/统一条件注入)
- 核心创新点: 提出 UniCustom 的“统一视觉条件”机制,解决现有 VLM+扩散模型中语义特征(ViT/VLM)与外观特征(VAE)分路注入导致的指令-身份关联弱问题:将多参考图像的语义理解与外观表征在同一条件空间/同一注入路径中统一建模,使模型能更一致地对齐文本指令与多主体身份特征,从而提升多参考身份保持与可编辑性。
- Track: Multi-reference Text-to-Image (identity preservation / unified conditioning)
- Core innovation: Proposes UniCustom with unified visual conditioning to replace the common decoupled pipeline (VLM semantic features vs. VAE appearance features injected separately). By modeling instruction understanding and appearance/identity cues in a single conditioning pathway/space, it strengthens instruction–identity association and improves multi-reference identity fidelity and editability.
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
- 赛道归属: 文生图(扩散模型)/ 强化学习后训练(RLHF/GRPO)
- 核心创新点: 指出GRPO类后训练中“归一化”会导致优势/奖励失配,从而诱发reward hacking;提出“超线性优势塑形”(super-linear advantage shaping) 的后训练策略,通过对优势函数进行非线性重标定来放大高质量样本的学习信号、抑制利用奖励偏置的投机解,并避免直接移除prompt相关项带来的校准问题,从机制上提升对齐增益的真实性与稳定性。
- Track: Text-to-Image (diffusion) / RL post-training (RLHF/GRPO)
- Core innovation: Identifies that normalization in GRPO-style post-training can miscalibrate advantages/rewards and trigger reward hacking; introduces super-linear advantage shaping to nonlinearly rescale advantages (amplifying learning from genuinely good samples while suppressing exploitative reward-bias shortcuts) without bluntly dropping prompt-related terms, improving alignment stability and real quality gains.
- Masked Generative Transformer Is What You Need for Image Editing
- 赛道归属: 图像编辑(基于生成式Transformer的局部编辑)
- 核心创新点: 用Masked Generative Transformer替代扩散模型做编辑,利用“掩码token预测”的局部生成范式天然实现编辑区域的空间隔离,避免扩散全局去噪导致的改动外溢;提出EditMGT框架,将编辑建模为受mask约束的token重生成,并配套多阶段/多粒度的训练与推理策略以兼顾局部可控性与全局一致性,实现“只改该改的地方”。
- Track: Image editing (generative Transformer, localized editing)
- Core innovation: Replaces diffusion-based global denoising with a Masked Generative Transformer that performs masked token prediction, inherently confining changes to the intended region and preventing edit leakage; proposes EditMGT that formulates editing as mask-constrained token regeneration with multi-stage/multi-granularity training/inference to balance strict locality and global coherence.
- LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
- 赛道归属: 图像编辑(分层/Layered资产编辑与结构一致性)
- 核心创新点: 面向真实创作流程中的分层图像资产,提出LimeCross:在“上下文条件化”的分层表示上进行编辑,显式建模层间结构关系与接触/遮挡/光照一致性;通过跨层约束与结构一致性机制,避免传统“先压平再编辑再分解”导致的层间不一致与重组伪影,实现可重组、非破坏式的可控分层编辑。
- Track: Image editing (layered assets / structural consistency)
- Core innovation: Proposes LimeCross for context-conditioned layered image editing, explicitly modeling inter-layer structure and enforcing consistency (e.g., contact, occlusion, illumination) across layers; avoids the common flatten-edit-redecompose pipeline that breaks layer coherence, enabling controllable, non-destructive edits that remain recomposable with fewer cross-layer artifacts.
- FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation 🆕NEW
- 赛道归属: 图像生成(自回归生成加速/推理优化/后训练)
- 核心创新点: 提出 FlashAR 的后训练加速方案,面向自回归图像生成中昂贵的逐 token 光栅扫描解码:在不从零预训练新范式、且尽量不改变原始预测目标的前提下,通过后训练引入更高并行度/更少步数的生成策略,缩小训练-推理差距,实现高效推理加速。
- Track: Image generation (autoregressive acceleration / inference optimization / post-training)
- Core innovation: Introduces FlashAR, a post-training acceleration method for autoregressive image generators that reduces costly raster-scan next-token decoding. It boosts parallelism / reduces decoding steps without pretraining a new paradigm from scratch and aims to minimize training–inference mismatch while preserving the original prediction objective.
- Towards Robust Sequential Decomposition for Complex Image Editing 🆕NEW
- 赛道归属: 图像编辑(复杂指令编辑/多步规划与分解)
- 核心创新点: 针对包含组合操作与跨步依赖的复杂编辑指令,提出更鲁棒的“顺序分解”范式:在单轮编辑易误解、朴素多轮编辑易累积误差的矛盾下,强调对指令进行结构化拆解、显式建模步骤依赖与中间状态约束,从而提升复杂编辑的可执行性、稳定性与结果可控性。
- Track: Image editing (complex instruction editing / multi-step planning & decomposition)
- Core innovation: Targets complex edits with combinatorial operations and inter-step dependencies by advocating a more robust sequential decomposition paradigm: it structurally decomposes instructions, explicitly models step dependencies and intermediate-state constraints, mitigating single-turn misparsing and naive multi-turn error accumulation to improve controllability and stability.
- MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing 🆕NEW
- 赛道归属: 图像编辑评测(文本在图像中的编辑/跨语言基准)
- 核心创新点: 提出 MULTITEXTEDIT 跨语言受控基准,系统量化 text-in-image 编辑在多语言下的性能退化,并避免将“视觉合理性”与“语义正确性”混淆:通过 12 种语言、5 类视觉域、7 种编辑操作的 3600 个实例设计,使不同语言版本共享同一视觉底图,并提供人工编辑参考与区域 mask,以隔离语言因素、精确评估模型的跨语言鲁棒性与语义对齐能力。
- Track: Image editing evaluation (text-in-image editing / cross-lingual benchmark)
- Core innovation: Introduces MULTITEXTEDIT, a controlled cross-lingual benchmark that quantifies degradation in text-in-image editing across 12 languages, 5 visual domains, and 7 edit operations (3,600 instances). Language variants share the same visual base image and come with human-edited references and region masks, which isolates language effects and separates semantic correctness from mere visual plausibility.
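As a rough illustration of the self-confidence reward described in the SOLACE entry above (re-noise the model's own output, then score how accurately the injected noise is recovered), here is a minimal NumPy sketch. The names `self_confidence_reward`, `predict_noise`, and `alpha_bar`, the DDPM-style forward-noising formula, and the negative-MSE scoring are all assumptions made for illustration; the paper's actual reward definition and noise schedule may differ.

```python
import numpy as np


def self_confidence_reward(x0, predict_noise, alpha_bar, rng):
    """Score a generated sample by how well the model recovers injected noise.

    x0            -- the model's own generated sample (array)
    predict_noise -- hypothetical callable (x_t, alpha_bar) -> predicted noise
    alpha_bar     -- assumed cumulative signal level in (0, 1)
    rng           -- numpy random Generator used to draw the noise
    """
    eps = rng.standard_normal(x0.shape)
    # Re-noise the sample (assumed DDPM-style forward process).
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = predict_noise(x_t, alpha_bar)
    # Higher reward = more accurate recovery of the injected noise.
    return -float(np.mean((eps_hat - eps) ** 2))


# Toy check with two stand-in "models" (not real diffusion networks).
x0 = np.ones((4, 4))
ALPHA_BAR = 0.5


def oracle(x_t, ab):
    # Recovers the noise exactly because it knows x0.
    return (x_t - np.sqrt(ab) * x0) / np.sqrt(1.0 - ab)


def blind(x_t, ab):
    # Always predicts zero noise.
    return np.zeros_like(x_t)


r_oracle = self_confidence_reward(x0, oracle, ALPHA_BAR, np.random.default_rng(0))
r_blind = self_confidence_reward(x0, blind, ALPHA_BAR, np.random.default_rng(0))
print(r_oracle > r_blind)
```

In a post-training loop, a score of this kind would stand in for the output of an external reward model; how it is turned into a policy update is left out of this sketch.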
GitHub
- [2026-05-14] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐11971
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-05-13] vibheksoni/free-ai ⭐436
Free OpenAI-compatible AI API with 16,000+ models, image generation, tool calling, and Discord key signup.
- [2026-05-13] jegly/Box ⭐403
Private on-device AI suite for Android. Fork of Google AI Edge Gallery with llama.cpp, whisper.cpp, stable-diffusion.cpp, GGUF import, voice chat, vis...
- [2026-05-13] hackclub/ai ⭐114 🆕NEW
💭 Free, unlimited AI and image generation for teens
- [2026-05-13] CorentinGS/chess ⭐79
chess is a set of go packages which provide common chess utilities such as move generation, turn management, checkmate detection, PGN encoding, UCI in...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-05-06] FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
- 赛道归属: 文生视频(人脸身份保持/可控生成)
- 核心创新点: 提出 FaithfulFaces 的“姿态共享身份表示”学习框架,在大姿态变化与遮挡场景下强化身份一致性;通过将身份特征与姿态因素解耦并在跨姿态条件下共享/对齐身份表征,减少因姿态迁移导致的身份漂移与面部结构失真,从而提升复杂动态场景中的人脸身份保真度。
- Track: Text-to-Video (face identity preservation / controllable generation)
- Key innovation: Proposes FaithfulFaces with a pose-shared identity representation learning scheme to improve identity consistency under large pose changes and occlusions; it explicitly disentangles identity from pose and aligns/shares identity features across pose conditions, reducing pose-induced identity drift and facial distortion in dynamic scenes.
- OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation 🆕NEW
- 赛道归属: 音视频联合生成(多模态生成)
- 核心创新点: 提出面向“联合音频-视频生成”的模态分治式强化学习框架(modality-wise omni diffusion reinforcement),将扩散生成中的优化目标拆解为“单模态保真度 + 跨模态对齐 + 细粒度时序同步”等多目标,并通过针对多模态/多目标RL训练不稳定性的机制化处理(如优势信号冲突与尺度不一致等问题的分析与改造)实现可训练、可控的联合优化,从而在不牺牲单模态质量的前提下提升音画一致性与同步精度。
- Track: Joint Audio-Video Generation (Multimodal Generation)
- Core innovation: Proposes a modality-wise omni diffusion reinforcement framework for joint audio-video generation, decomposing diffusion-time optimization into multi-objectives—per-modality fidelity, cross-modal alignment, and fine-grained temporal synchronization—and introducing training strategies to address key RL obstacles in multi-modal/multi-objective settings (e.g., conflicting advantage signals and scale mismatch). This enables stable, controllable joint optimization that improves A/V coherence and sync without degrading unimodal quality.
- OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation 🆕NEW
- 赛道归属: 视频生成(跨具身/跨形体动作迁移,Embodiment-aware Generation)
- 核心创新点: 提出流式(streaming)的跨具身视频生成框架,将“可迁移的运动动力学”与“具身特定的外观/形态”进行解耦建模,并通过无需成对数据(paired-free)的适配机制把模型快速迁移到新的人形载体(如人→机器人、机器人→机器人)。方法层面强调在生成过程中持续接收运动条件并稳定输出视频序列,同时用无配对适配降低对每个目标具身的标注/配对采集成本,提升可扩展性。
- Track: Video Generation (Cross-embodiment / embodiment-aware motion transfer)
- Core innovation: Introduces a streaming cross-embodiment video generation framework that explicitly disentangles transferable motion dynamics from embodiment-specific appearance/morphology, and adapts to new humanoid embodiments (e.g., human-to-robot, robot-to-robot) via paired-free adaptation (no paired data per target). The method accepts motion conditions continuously during generation to produce stable video streams, and the paired-free adaptation lowers the per-embodiment annotation and paired-data collection cost, improving scalability to many embodiments.
- SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
- 赛道归属: 视频生成 / 可控多人物交互(训练免控制)
- 核心创新点: 提出训练免(Training-Free)的多人物社交交互控制方法,将“谁在何时对谁做什么”的交互结构显式注入生成过程,解决多人物生成中常见的角色错配与动作归因错误;通过对交互关系与时序的可控编排,实现对对话、手势、协同行为等社会互动的细粒度导演式控制,而无需重新训练基础视频模型。
- Track: Video generation / controllable multi-person interactions (training-free control)
- Key innovation: Presents a training-free control method for multi-person social interactions, explicitly injecting interaction structure—who does what, when, and toward whom—into the generation process to reduce actor/action misbinding; enables fine-grained director-style control over conversations, gestures, and coordinated behaviors without retraining the base video model.
- SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation
- 赛道归属: 推理优化 / 流式长视频生成的自适应记忆管理
- 核心创新点: 提出SWIFT的“提示词自适应记忆”(Prompt-Adaptive Memory):针对交互式长视频中频繁语义切换,设计能随prompt更新而重组/选择性保留的记忆机制,避免在提示边界反复重建缓存或受限于固定记忆预算造成的冗余计算与适配迟滞;在保持视觉连续性的同时提升语义切换响应效率。
- Track: Inference optimization / adaptive memory for streaming long-video generation
- Key innovation: Introduces SWIFT with prompt-adaptive memory: for interactive long videos with frequent semantic switches, it reorganizes/selectively retains memory in response to prompt updates, avoiding cache rebuilds at prompt boundaries and inefficiencies of fixed memory budgets; improves responsiveness to semantic changes while maintaining visual continuity.
- EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation
- 赛道归属: 视频生成 / 多镜头脚本化生成(教育内容一致性)
- 核心创新点: 提出面向STEM教学的多镜头生成统一框架:引入“教学状态建模”跟踪跨镜头的持久知识与概念依赖,并用脚本引导的结构化控制组织叙事与镜头编排,解决长视频中知识一致性、讲解连贯性与多镜头衔接问题;将“内容正确性/教学一致性”作为生成过程的核心约束而非事后筛选。
- Track: Video generation / multi-shot script-driven generation (educational consistency)
- Key innovation: Proposes a unified framework for multi-shot STEM instructional video generation: models a pedagogical state to track persistent knowledge and concept dependencies across shots, and uses script-guided structured control to organize narrative and shot composition; addresses knowledge consistency and pedagogical coherence as first-class generation constraints rather than post-hoc filtering.
- CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
- 赛道归属: 多模态推理 / 结合视频生成的协同推理框架
- 核心创新点: 提出VLM+视频生成模型的协同推理(CollabVR):用VLM承担显式规划、校验与纠错,将VGM生成的短时“Chain-of-Frames”作为可视化推理草稿;通过迭代式的生成—评估—修正闭环,缓解长任务的时序漂移与中段模拟错误累积,把视频生成从单纯输出器提升为可被语言推理约束与修正的“可视化思维工具”。
- Track: Multimodal reasoning / collaborative reasoning with video generation
- Key innovation: Proposes CollabVR, a VLM+VGM collaborative reasoning framework: the VLM performs explicit planning, verification, and correction while the VGM produces short-horizon Chain-of-Frames as visual reasoning drafts; an iterative generate–evaluate–revise loop mitigates long-horizon drift and mid-clip simulation error accumulation, turning video generation into a language-guided, correctable visual thinking tool.
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
- 赛道归属: 人物中心音视频生成(Audio-Video Generation,多模态联合生成:动作-语音-音效)
- 核心创新点: 提出统一框架在生成阶段显式约束“动作-语音-环境音效”三模态的时序一致性与语义协同,针对三者异质时间尺度与对齐难题,通过跨模态协同建模/对齐机制减少常见的口型-语音、动作-音效错配,实现更连贯的人物中心音视频联合生成。
- Track: Human-centric audio-video generation (multimodal joint generation: motion–speech–sound)
- Key innovations: Introduces a unified generation framework that explicitly enforces temporal alignment and semantic coherence across motion, speech, and environmental sound effects. By addressing heterogeneous temporal dynamics with cross-modal coordination/alignment mechanisms, it reduces typical mismatches (e.g., lip–speech and action–sound desynchronization) and improves coherent human-centric audio-video generation.
- From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation 🆕NEW
- 赛道归属: 视频生成(动作条件/机器人手术视频生成,Controllable Video Generation)
- 核心创新点: 提出“运动学到视觉”的提升(kinematic-to-visual lifting)范式,把低维的关节/器械运动学控制量转换为五种统一的、图像对齐的控制模态(image-aligned control modalities),从而把难以直接驱动像素演化的控制信号变成可被生成模型有效利用的视觉条件;在此基础上设计分层路由(hierarchically routed)的视觉控制机制,按层/按区域选择性注入不同控制模态,实现对手术场景中复杂、局部且时序敏感的变化进行更精细的动作约束与可控生成。
- Track: Video Generation (Action-conditioned / surgical robotics controllable generation)
- Core innovation: Proposes a kinematic-to-visual lifting paradigm that converts low-dimensional articulated kinematics into five unified image-aligned control modalities, making control signals directly usable for pixel-space evolution. On top of this representation, a hierarchically routed visual control mechanism selectively injects different control modalities across hierarchy/regions, enabling fine-grained, temporally precise action control for complex surgical video generation.
- Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models 🆕NEW
- 赛道归属: 视频生成(扩散模型蒸馏/一致性蒸馏,推理加速与质量提升)
- 核心创新点: 提出基于一致性蒸馏并引入分数正则(score regularization)的rCM蒸馏路线,用“正则化后的得分匹配”改变传统蒸馏在速度与质量间的权衡,使学生模型在更少采样步数下不仅保持质量,甚至可超越教师模型。其关键方法论在于:通过分数正则项形成更偏“择模态/择优输出”的优化压力(mode-seeking),并配合一致性训练稳定学生分布,从而在开源14B规模视频生成模型上实现高质量与高效率的统一。
- Track: Video Generation (Diffusion distillation / consistency distillation for faster inference and quality gain)
- Core innovation: Presents rCM—consistency distillation with score regularization—reframing diffusion distillation so the student can use fewer sampling steps while not only preserving quality but surpassing the teacher. Methodologically, the score-regularization term induces a mode-seeking pressure that concentrates probability mass on high-quality outputs, while consistency training stabilizes the student distribution, achieving a strong quality–efficiency Pareto frontier in an open-source 14B video model.
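The generate-evaluate-revise loop described in the CollabVR entry above can be sketched as follows. This is a minimal illustration under stated assumptions: `collaborative_rollout`, `generate_clip`, `verify`, and `max_revisions` are hypothetical names, clips are represented as plain strings instead of video tensors, and keeping the last draft when verification never passes is a design choice of this sketch, not something the summary specifies.

```python
from typing import Callable, List, Tuple


def collaborative_rollout(
    plan: List[str],
    generate_clip: Callable[[str, List[str], str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_revisions: int = 2,
) -> List[str]:
    """Run a generate-evaluate-revise loop over a planned list of steps."""
    accepted: List[str] = []
    for step in plan:
        feedback = ""
        clip = ""
        for _ in range(max_revisions + 1):
            # Generator drafts a short clip conditioned on history and feedback.
            clip = generate_clip(step, accepted, feedback)
            # Verifier checks the draft and returns (ok, feedback for revision).
            ok, feedback = verify(step, clip)
            if ok:
                break
        # Assumption: keep the last draft even if it never passed verification.
        accepted.append(clip)
    return accepted


# Stand-in components: the "generator" only succeeds once it receives feedback.
def fake_generate(step: str, history: List[str], feedback: str) -> str:
    return f"{step}:fixed" if feedback else f"{step}:draft"


def fake_verify(step: str, clip: str) -> Tuple[bool, str]:
    return (clip.endswith(":fixed"), "object drifted, redo")


clips = collaborative_rollout(["pick up cup", "pour water"], fake_generate, fake_verify)
print(clips)
```

Checking each short clip before extending the sequence is what limits error accumulation: a failed draft is revised locally instead of contaminating every later step.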
GitHub
- [2026-05-14] hao-ai-lab/FastVideo ⭐3475
A unified inference and post-training framework for accelerated video generation.
- [2026-05-13] ZeroLu/awesome-seedance ⭐1722
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-05-13] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1007
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-05-13] Video-Reason/Awesome-Video-Reasoning ⭐153 🆕NEW
This is a collection of recent papers on reasoning in video generation models.
- [2026-05-13] OpenLoaf/OpenLoaf ⭐56 🆕NEW
🍞Open-source, local-first AI workspace with Agents, multi-model chat (GPT/Claude/Gemini/DeepSeek), Notion-like docs, AI image & video generation, emai...
音频生成 / Audio Generation
arXiv
- Adapting a Text-to-Audio Model for Room Impulse Response Generation 🆕NEW
- 赛道归属: 音频生成 / 声学建模(RIR 房间脉冲响应生成)
- 核心创新点: 将预训练的文生音频大模型作为“生成先验”迁移到RIR这一强物理约束、数据稀缺的声学对象上,通过适配策略把通用音频生成能力对齐到RIR的时域结构与混响特征分布,实现无需从零训练即可生成高质量RIR;关键突破在于证明大规模生成式音频先验可有效覆盖并可控地生成RIR这类非语音/非音乐的声学响应信号,从而缓解真实RIR采集成本高与训练数据不足的问题。
- Track: Audio generation / Acoustic modeling (RIR generation)
- Core innovation: Adapts a pretrained text-to-audio foundation model as a generative prior for Room Impulse Responses, a physically constrained and data-scarce acoustic signal. The method aligns the model’s generic audio generation capability to RIR-specific temporal structure and reverberation statistics, enabling high-quality RIR synthesis without training from scratch. The key methodological contribution is demonstrating that large-scale generative audio priors can be effectively transferred to controllable generation of non-speech/non-music acoustic responses, mitigating real RIR collection and data scarcity.
- The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models 🆕NEW
- 赛道归属: 多模态理解(音频-语言)/ 空间音频理解(Audio Scene Analysis)
- 核心创新点: 将“空间音频-语言理解”从零散问题提升为统一的任务接口,提出音频场景分析(ASA)的三层级形式化:从原子级感知(事件与方位/距离等空间属性)到对象级绑定(语义与空间属性的归属一致性、多对象分离与排列),再到场景级物理一致性判断(答案是否符合空间声学常识)。核心突破在于把“听到什么”扩展为“在哪里、谁对应谁、整体是否合理”的可评测框架,为大音频语言模型引入可系统训练/评估的空间推理目标。
- Track: Multimodal understanding (audio-language) / Spatial audio understanding (Audio Scene Analysis)
- Core innovation: Elevates spatial audio-language understanding into a unified task interface by formalizing Audio Scene Analysis (ASA) as a three-level problem: atomic perception of events with spatial attributes, object-level binding of semantics to spatial properties across multiple sources, and scene-level physical plausibility checking. The methodological leap is turning “what is in the audio” into a measurable framework for “where it is, which attributes belong to which object, and whether the global answer is physically consistent,” enabling systematic training/evaluation of spatial reasoning in large audio-language models.
- Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech 🆕NEW
- 赛道归属: 语音生成(零样本文本转语音)/ 离散流匹配推理优化(MI-DFM 调度与校正)
- 核心创新点: 针对MI-DFM在离散生成中的两大瓶颈提出成体系的推理改进:其一,从概率路径的标量参数化出发推导“动力学最优(kinetic-optimal)”调度器,给出无需训练、免超参搜索的数值调度方案以替代经验式scheduler;其二,针对一阶CTMC求解带来的有限步路径跟踪误差,引入矩校正(moment correction)以在有限步数下更准确匹配目标路径分布。方法论突破在于把“怎么排步长/怎么减误差”从启发式工程变为可推导的最优调度与可控校正,从而提升零样本TTS的稳定性与质量/效率权衡。
- Track: Speech generation (zero-shot TTS) / Discrete flow-matching inference optimization (MI-DFM scheduling & correction)
- Core innovation: Addresses two practical blockers of MI-DFM for discrete generation with principled inference upgrades: (1) derives a kinetic-optimal scheduler for scalar-parameterized probability paths, yielding a training-free numerical schedule that removes heuristic tuning; (2) reduces finite-step path-tracking error of first-order CTMC solvers via moment correction, improving distributional matching under limited steps. The key methodological advance is replacing ad-hoc scheduling/error fixes with derivable optimal scheduling and controllable correction, improving stability and the quality–efficiency trade-off in zero-shot TTS.
- [2026-05-07] LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
- 赛道归属: 神经编解码(Neural Codec)/ 端侧实时压缩与机器感知友好编码
- 核心创新点: 提出轻量、通用且非对称(asymmetric)的神经编解码设计:以更强的解码端能力换取编码端低算力/低功耗,面向实时与带宽受限设备;同时强调对“机器感知任务与非传统模态”(如空间音频阵列等)的适配,而非仅优化人类感知指标,从体系结构上兼顾码率-质量-端侧可部署性。
- Track: Neural codecs / Real-time edge compression for machine perception
- Key innovation: A lightweight, versatile asymmetric codec architecture that shifts complexity to the decoder to enable low-power real-time encoding under bandwidth constraints, and is designed to serve machine-perception tasks and non-standard modalities (e.g., spatial audio arrays) beyond human-perceptual optimization.
- [2026-05-07] BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
- 赛道归属: 推理优化(Test-Time Scaling)/ 量化推理模型校准
- 核心创新点: 针对后训练量化导致的置信度/停止信号失真,提出“比特校准”的测试时缩放方法:在固定生成token预算下,校准量化模型的在线不确定性与早停/算力分配控制信号,缓解“看似合理但推理未收敛”的过早停止问题,使自适应推理深度在量化条件下更稳定可靠。
- Track: Inference optimization (Test-Time Scaling) / Quantized reasoning calibration
- Key innovation: Bit-calibrated test-time scaling that corrects confidence/halting signals distorted by post-training quantization under a fixed token budget, reducing harmful early stopping and stabilizing adaptive compute allocation for quantized reasoning models.
- OLaPh: Optimal Language Phonemizer
- 赛道归属: 语音合成前端(TTS Front-end)/ 文本到音素(G2P/Phonemization)
- 核心创新点: 提出混合式音素化框架:融合大规模多语种词典(lexica)与现代NLP建模,并引入统计子词切分来处理OOV与跨语言形态变化;通过“词典强约束 + 神经/统计泛化”的组合,在覆盖率与泛化能力之间取得更优折中,提升多语种音素化鲁棒性。
- Track: TTS front-end / Phonemization (G2P)
- Key innovation: A hybrid phonemizer combining extensive multilingual lexica with advanced NLP modeling and statistical subword segmentation, achieving better OOV/generalization while retaining lexicon-backed correctness across languages.
- SF-Flow: Sound field magnitude estimation via flow matching guided by sparse measurements 🆕NEW
- 赛道归属: 空间音频重建 / 声场估计(稀疏测量引导的生成式建模)
- 核心创新点: 将Flow Matching引入3D声场重建这一典型病态逆问题,聚焦于声学传递函数(ATF)幅度的生成式估计,并通过“稀疏麦克风测量引导”把条件约束注入流匹配过程,实现从少量观测到完整声场幅度分布的重建。核心突破在于把原本用于语音/音乐生成的FM范式改造成可处理空间声学条件约束的生成式求解器,用生成先验补足欠定测量带来的信息缺失,从而提升声场/房间特性恢复的可行性与精度。
- Track: Spatial audio reconstruction / Sound field estimation (generative modeling guided by sparse measurements)
- Core innovation: Brings Flow Matching to 3D sound-field reconstruction, an ill-posed inverse problem, by modeling Acoustic Transfer Function (ATF) magnitude with a generative estimator and injecting constraints from sparse microphone measurements to guide the flow-matching process. The methodological contribution is adapting an FM-based generative paradigm—previously dominant in speech/music generation—into a conditional solver for spatial acoustics, using learned priors to compensate for underdetermined measurements and improving feasibility/accuracy of sound-field and room characterization.
- Polyphonia: Zero-Shot Timbre Transfer in Polyphonic Music with Acoustic-Informed Attention Calibration 🆕NEW
- 赛道归属: 音乐编辑 / 音频生成式编辑(零样本复音音乐音色迁移、stem级控制)
- 核心创新点: 面向复音混音中的“指定声部(stem)音色迁移”提出零样本编辑方法,关键在于对扩散模型的跨注意力进行“声学信息驱动的注意力校准(acoustic-informed attention calibration)”,纠正原生cross-attention在密集混合物中对目标声部定位不准、易串扰的问题,从而在改变目标声部音色的同时严格保持伴奏与其他声部不变。方法论突破是把stem级可控编辑转化为注意力层面的可校准机制,用声学线索约束注意力分配,实现对多声部绑定与隔离更可靠的零样本音色迁移。
- Track: Music editing / Generative audio editing (zero-shot polyphonic timbre transfer with stem-level control)
- Core innovation: Proposes a zero-shot method for stem-specific timbre transfer in polyphonic mixtures by introducing acoustic-informed attention calibration for diffusion models. It corrects vanilla cross-attention’s mis-localization and leakage in dense mixtures, enabling timbre changes on a target stem while strictly preserving accompaniment and other stems. The methodological leap is reframing stem-level controllable editing as an attention-calibration problem, using acoustic cues to constrain attention allocation for more reliable source binding and separation during zero-shot timbre transfer.
- Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models 🆕NEW
- 赛道归属: 推理优化 / 测试时自适应(TTA)用于自回归生成模型
- 核心创新点: 针对自回归生成模型中“熵最小化TTA”缺乏统一理论、现有方法分裂为伪标签teacher forcing或策略梯度RL等启发式的问题,给出面向自回归建模的严格熵最小化推导与统一目标形式,从数学上打通不同实现路径之间的关系与差异,并为稳定可复现的TTA算法设计提供原则性依据。核心突破在于把生成式自回归场景下的EM-TTA从经验技巧提升为可分析、可推导的统一框架,便于后续构造更稳健的在线适配策略。
- Track: Inference optimization / Test-time adaptation (TTA) for autoregressive generative models
- Core innovation: Provides a rigorous entropy-minimization formulation tailored to autoregressive generative models, unifying previously fragmented heuristics such as pseudo-label teacher forcing and policy-gradient RL under a single mathematical objective. The key advance is elevating EM-based TTA for autoregressive generation from ad-hoc tricks to an analyzable, derivable framework, clarifying connections among methods and enabling principled design of more stable and reproducible online adaptation strategies.
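For context on the entropy-minimization entry above, the quantity such test-time adaptation methods push down is the entropy of the model's next-token distributions. The sketch below computes only that naive token-level entropy from raw logits; the function names are assumptions, and the paper's derived unified objective is not reproduced here.

```python
import numpy as np


def token_entropies(logits):
    """Shannon entropy (in nats) of each next-token distribution.

    logits -- array of shape (sequence_length, vocab_size)
    """
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    p = np.exp(log_p)
    return -(p * log_p).sum(axis=-1)


def mean_sequence_entropy(logits):
    """Average token entropy over a sequence: the naive adaptation loss."""
    return float(token_entropies(logits).mean())


# A uniform predictor is maximally uncertain; a peaked one is confident.
uniform = np.zeros((3, 4))
peaked = np.array([[10.0, 0.0, 0.0, 0.0]] * 3)
print(mean_sequence_entropy(uniform), mean_sequence_entropy(peaked))
```

Minimizing this average at test time sharpens the model's predictions; in the autoregressive case the sampled prefix also depends on the model, which is where the pseudo-label and policy-gradient variants mentioned in the entry diverge.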
GitHub
- [2026-05-13] huggingface/diffusers ⭐33609
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-05-13] SamurAIGPT/Generative-Media-Skills ⭐3225 🆕NEW
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....
- [2026-05-09] Ameobea/web-synth ⭐554
Browser-based DAW and audio synthesis platform with dozens of effects, synths, and modules
- [2026-05-14] apocas/restai ⭐504
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-05-13] Blaizzy/mlx-video ⭐223 🆕NEW
MLX-Video is the best package for inference and finetuning of Image-Video-Audio generation models on your Mac using MLX.
语言大模型 / Large Language Models
arXiv
- Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning 🆕NEW
- 赛道归属: LLM+强化学习/序列决策(离线RL、MDP/POMDP上的In-Context Learning与微调)
- 核心创新点: 通过对“离线、oracle标注的轨迹”进行监督微调(SFT),把LLM的少样本ICL能力显式迁移到序列决策任务中,使模型能够在MDP、POMDP及更具不确定性的APOMDP设定下,直接从上下文轨迹中进行few-shot决策;方法上将“轨迹作为上下文提示”的ICL与“用高质量轨迹进行SFT”的训练范式结合,系统化提升LLM在长期依赖与部分可观测场景中的决策稳健性。
- Track: LLM + Reinforcement Learning / Sequential Decision-Making (offline RL; ICL + fine-tuning on MDP/POMDP)
- Key innovation: Uses supervised fine-tuning (SFT) on offline, oracle-labeled trajectories to explicitly transfer LLM in-context learning into sequential decision-making, enabling few-shot action selection from trajectory context across MDPs, POMDPs, and ambiguous POMDPs; methodologically couples “trajectory-as-prompt” ICL with high-quality trajectory SFT to improve robustness under long-horizon dependencies and partial observability.
- [2026-05-07] LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution
- 赛道归属: 代码安全与恶意软件分析(LLM+静态分析/归因)
- 核心创新点: 构建面向“代码证据”的恶意软件归因基准与框架:提出代码中心的LCCD数据集(约34K PE样本)并配套证据落地的归因流程,强调从二进制/反汇编等静态线索中定位“恶意/脆弱代码片段”的可验证证据链,弥补以往LLM归因依赖不受支持指标、缺乏代码级grounding的问题,实现归因与多任务静态分析的统一评测与训练范式。
- Track: Code security & malware analysis (LLM + static analysis/attribution)
- Core innovation: Introduces an evidence-grounded, code-centric benchmark and framework: the LCCD dataset (~34K PE samples) plus an attribution pipeline that explicitly grounds decisions in verifiable code-level evidence (malicious/vulnerable segments) extracted from static artifacts, addressing prior LLM attribution’s unsupported indicators and weak code grounding, and enabling unified evaluation/training for attribution and multi-task static malware analysis.
- [2026-05-06] RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
- 赛道归属: 对齐与偏好优化(RLHF/DPO改进、逻辑一致性对齐)
- 核心创新点: 提出Hybrid-DPO的自动化偏好构造机制以纠正DPO“偏啰嗦/重流畅轻逻辑”的系统性偏差:将逻辑可靠性信号(如基于NLI/蕴含判别的DeBERTa等判据)与生成流畅度偏好融合,形成更平衡的偏好对,从训练目标层面缩小“逻辑对齐缺口”,在知识密集型生成中同时提升逻辑正确性与可读性。
- Track: Alignment & preference optimization (RLHF/DPO, logical grounding)
- Core innovation: Proposes Hybrid-DPO with an automated preference pipeline that counteracts DPO’s verbosity/fluency bias by fusing logical reliability signals (e.g., DeBERTa-based NLI/entailment judgments) with fluency preferences, producing better-balanced preference pairs and reducing the “logical alignment gap” in knowledge-intensive generation.
- U-STS-LLM: A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation 🆕NEW
- 赛道归属: 时空序列建模与预测(通信/网络流量预测与缺失值插补的LLM化)
- 核心创新点: 提出统一的时空“预测+插补”框架,将原本分离的两类任务在同一模型内联合建模;通过“时空引导/steering”的LLM结构,把交通/流量数据的空间关联与时间动态以可控方式注入语言模型,实现对未来负载预测与缺失数据修复的共享表示与协同优化,从而减少任务割裂带来的误差传播并提升跨场景泛化。
- Track: Spatio-temporal modeling & forecasting (LLM-based traffic prediction and missing-value imputation)
- Key innovation: Introduces a unified framework that jointly models forecasting and imputation (traditionally treated separately) within a single spatio-temporally steered LLM; injects spatial dependencies and temporal dynamics into the LLM via controllable steering mechanisms to learn shared representations that support both future load prediction and missing-data recovery, improving consistency and generalization across settings.
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
- 赛道归属: 推理与规划分析(CoT可解释性/行为刻画)
- 核心创新点: 提出从LLM推理轨迹中“抽取搜索树”的方法学,用结构化量化替代仅看最终答案/文本:在四子棋环境中将CoT中的分支、回溯与前瞻显式拟合为搜索树并度量其深度、分支因子与局部性,从而揭示推理模型存在“短视规划(myopic planning)”等行为特征,并将性能与搜索结构属性建立可检验关联。
- Track: Reasoning & planning analysis (CoT interpretability/behavior characterization)
- Core innovation: Introduces a method to extract and quantify search trees from LLM reasoning traces, fitting deliberative CoT into explicit tree structures in a four-in-a-row game and measuring properties (depth/branching/locality) to reveal myopic planning and link performance to measurable search-structure attributes.
- One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization 🆕NEW
- 赛道归属: 音频生成/音频控制(文本到均衡器EQ参数的LLM交互式音频调音)
- 核心创新点: 将“自然语言提示→EQ设置”建模为LLM驱动的可对话控制问题,并显式建模听众差异(listener variability):利用受控听音实验数据,让模型在ICL与个性化建模的结合下,同一提示可输出符合不同听众偏好的多样化均衡曲线;方法突破在于把主观偏好分布作为学习目标的一部分,而非学习单一“平均”EQ映射。
- Track: Audio control / text-to-audio-parameter mapping (LLM-based equalization)
- Key innovation: Frames “natural-language prompt → EQ parameters” as a conversational LLM control task while explicitly modeling listener variability; leverages controlled listening-study data so the model can, via a combination of in-context learning and personalization, produce diverse EQ settings for the same prompt aligned with different user preferences, optimizing for preference distributions rather than a single averaged mapping.
- Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals 🆕NEW
- 赛道归属: LLM评测与对齐(LLM-as-a-Judge可靠性/分歧预测,教育难度评估)
- 核心创新点: 针对LLM-as-a-Judge在材料难度评级中与人类评分者不一致的问题,提出“分歧风险预测”机制:在不依赖生成时概率信号(如token logprob、entropy等)的前提下,仅基于可观测的输入/输出与判分上下文特征,预测哪些自动评级更可能与人类产生分歧,从而实现选择性复审与人机协同标注;关键突破是将“置信度/不确定性估计”从模型内部概率解耦出来,适配黑盒或不可得logprob的评测场景。
- Track: LLM evaluation & alignment (LLM-as-a-Judge reliability; disagreement prediction for difficulty assessment)
- Core innovation: Proposes a disagreement-risk predictor to flag LLM-as-a-Judge difficulty ratings likely to diverge from human raters, enabling selective re-rating; crucially, it avoids any generation-time probability signals (e.g., token logprobs/entropy) and instead relies on observable input/output and contextual features, decoupling uncertainty estimation from internal probabilities and supporting black-box judge settings.
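The selective re-rating step can be sketched as a simple routing rule: given a disagreement-risk score per item, send the riskiest items to human raters. The scores, item ids, and the `select_for_review` helper below are hypothetical; how the paper actually computes the risk score is not described in this digest.

```python
from typing import Dict, List


def select_for_review(risk: Dict[str, float], budget: int) -> List[str]:
    """Return the ids of the `budget` items with the highest disagreement risk."""
    ranked = sorted(risk, key=lambda item_id: risk[item_id], reverse=True)
    return ranked[: max(budget, 0)]


# Hypothetical risk scores produced by some black-box-compatible predictor.
risk_scores = {"item-a": 0.12, "item-b": 0.81, "item-c": 0.47, "item-d": 0.66}
to_review = select_for_review(risk_scores, budget=2)
```

With a fixed human budget, only the flagged items are re-rated and the remaining automatic ratings are kept as they are.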
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space 🆕NEW
- 赛道归属: LLM机理解释/可解释性(In-Context Learning的几何/贝叶斯视角)
- 核心创新点: 提出“概念信念空间(conceptual belief space)”假设:LLM在上下文中的行为更新可视为在低维几何潜空间上对信念分布进行更新,ICL对应在该空间中的轨迹;以故事理解为实验载体,刻画随上下文证据累积而发生的信念迁移路径,从而把ICL从黑盒现象转化为可分析的状态空间动力学问题,为解释与诊断ICL失败模式提供结构化表征。
- Track: Mechanistic interpretability of LLMs (geometric/Bayesian view of in-context learning)
- Core innovation: Posits a low-dimensional “conceptual belief space” where an LLM maintains beliefs and updates them in context; in-context learning is modeled as a trajectory through this space as evidence accumulates, studied via story understanding. This turns ICL into analyzable state-space dynamics and enables structured explanations/diagnostics of ICL behavior and failure modes.
- A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning 🆕NEW
- 赛道归属: 图-语言模型(Graph Language Model)/多域多任务对齐与指令微调
- 核心创新点: 面向多域多任务的图对齐指令微调,提出统一的图语言模型:在“GNN编码图结构→与LLM对齐”的主流范式上,进一步解决跨域/跨任务的表示对齐问题,使GNN侧的图表征在不同数据域与任务指令下能与LLM语义空间保持一致;方法突破在于把“对齐”从单一数据集/单任务扩展为可泛化的指令化对齐过程,提升GLM在异构图、不同任务(如匹配/对齐/推断等)间的迁移能力。
- Track: Graph-Language Models (multi-domain/multi-task alignment; instruction tuning)
- Core innovation: Proposes a unified graph language model (GLM) with graph-alignment instruction tuning that explicitly addresses cross-domain and cross-task representation alignment between GNN-encoded graph embeddings and the LLM semantic space; extends alignment beyond single-dataset/single-task setups into a generalizable instruction-driven alignment process, improving transfer across heterogeneous graph domains and tasks.
- PrivacySIM: Evaluating LLM Simulation of User Privacy Behavior 🆕NEW
- 赛道归属: LLM社会模拟与评测(用户隐私行为模拟/Persona驱动仿真基准)
- 核心创新点: 提出PrivacySIM评测套件,用于检验“少量persona属性是否足以驱动LLM模拟个体级隐私决策”;以1,000名真实用户的隐私选择作为ground truth,对比LLM在给定persona条件下的行为一致性与个体差异复现能力;方法论价值在于把隐私行为仿真从主观案例评估提升为可量化基准,并聚焦“个体层面”而非群体平均,从而更严格地检验LLM作为人类行为模拟器的有效性与边界。
- Track: LLM-based social simulation & evaluation (persona-driven simulation of privacy behavior)
- Core innovation: Introduces PrivacySIM, a benchmark evaluating whether a compact set of persona attributes can drive LLMs to simulate individual-level privacy decisions; uses ground-truth responses from 1,000 users to quantify fidelity and the ability to reproduce inter-individual variability, elevating privacy-behavior simulation from anecdotal evaluation to a rigorous, measurable suite focused on individual (not just aggregate) behavior.
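Why individual-level evaluation is stricter than aggregate evaluation can be shown with a toy example. The binary encoding (1 = shares data, 0 = declines), the four-user data, and both function names are assumptions for illustration, not PrivacySIM's actual metrics.

```python
from typing import List


def aggregate_gap(human: List[int], simulated: List[int]) -> float:
    """Absolute difference between the population-level 'share' rates."""
    return abs(sum(human) / len(human) - sum(simulated) / len(simulated))


def individual_accuracy(human: List[int], simulated: List[int]) -> float:
    """Fraction of users whose simulated choice matches their real choice."""
    matches = sum(h == s for h, s in zip(human, simulated))
    return matches / len(human)


# Same 50% share rate in both lists, yet every single user is simulated wrongly.
human = [1, 1, 0, 0]
simulated = [0, 0, 1, 1]
```

Here the aggregate gap is zero while individual accuracy is also zero, which is exactly the failure an aggregate-only benchmark would miss.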
GitHub
- [2026-05-14] sgl-project/sglang ⭐27763
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-05-13] NVIDIA/TensorRT-LLM ⭐13629
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-05-14] stanford-crfm/helm ⭐2788
Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Sta...
- [2026-05-14] NVIDIA-NeMo/Skills ⭐950
A project to improve skills of large language models
- [2026-05-13] awslabs/llm-hosting-container ⭐92 🆕NEW
Large Language Model Hosting Container
HuggingFace Datasets
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2026-04-24] Jackrong/DeepSeek-V4-Distill-8000x
🐳 DeepSeek-V4-Distill-8100x Dataset Summary
DeepSeek-V4-Distill-8100x is a supervised fine-tuning dataset for re...
- [2026-04-19] Jackrong/GLM-5.1-Reasoning-1M-Cleaned
GLM-5.1-Reasoning-1M-Cleaned
GLM-5.1-Reasoning-1M-Cleaned is a cleaned and reformatted derivative of Kassadin88/GLM-5.1-1000000x. It prese...
多模态大模型 / Multimodal Models
arXiv
- PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models 🆕NEW
- 赛道归属: 多模态安全与隐私 / 机器遗忘(VLM个性化部分遗忘评测)
- 核心创新点: 提出PPU-Bench,一个面向真实世界“个性化部分遗忘”(personalized partial unlearning)的VLM基准:不依赖合成知识注入、也不做整类/整主体删除,而是覆盖更贴近用户请求的细粒度跨模态事实删除需求;同时强调fine-tuning-free的评测设定,用统一任务与数据规模(约24K多模态样本)系统衡量模型对敏感记忆的可控删除能力与残留风险。
- Track: Multimodal safety & privacy / Machine unlearning (personalized partial unlearning benchmark for VLMs)
- Core innovation: Introduces PPU-Bench, a real-world benchmark for personalized partial unlearning in VLMs: it avoids synthetic knowledge injection and coarse whole-class or whole-subject deletion, instead targeting fine-grained cross-modal factual removal aligned with realistic user requests; it further adopts a fine-tuning-free evaluation setup with unified tasks and a sizable dataset (~24K multimodal samples) to systematically measure controllable deletion of sensitive memories and residual memorization risk.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models 🆕NEW
- 赛道归属: 多模态推理(视觉潜变量/latent reasoning 对齐与训练范式)
- 核心创新点: 提出GAP(Granular Alignment Paradigm)以稳定提升视觉潜变量推理:指出现有“output-as-input”视觉latent范式不稳定的关键原因在于特征空间不匹配——常见做法在pre-norm MLLM中直接复用decoder隐藏态作为下一步latent输入,导致预测latent与期望视觉特征分布错位;GAP通过更细粒度的对齐机制/训练约束来缓解该mismatch,从而让连续视觉证据token的生成与消费更一致、收益更稳定。
- Track: Multimodal reasoning (visual latent reasoning alignment/training paradigm)
- Core innovation: Proposes GAP (Granular Alignment Paradigm) to stabilize gains from visual latent reasoning: it diagnoses instability in the common “output-as-input” latent pipeline as a feature-space mismatch—pre-norm MLLMs often reuse decoder hidden states as predicted latent inputs, misaligning the latent distribution with the intended visual feature space; GAP introduces finer-grained alignment/constraints to better match produced and consumed continuous visual-evidence tokens, yielding more consistent improvements.
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone 🆕NEW
- 赛道归属: 视觉语言预训练 / 数据工程(仅数据策划提升VLM)
- 核心创新点: 系统验证“只靠数据策划即可显著提升VLM”的上限:在架构、训练配方与算力固定的前提下,仅改变训练数据,通过一套数据筛选/清洗/重配比的策划流水线作用于MAmmoTH-VL单图子集,在20个公开VLM基准(覆盖grounding、VQA等)上平均提升+11.7pp;方法论贡献在于把性能增益明确归因到数据分布与质量控制,并给出可复用的数据策划处方来移动质量-算力前沿。
- Track: Vision-language pretraining / Data curation (VLM improvement via data only)
- Core innovation: Demonstrates how far data curation alone can push VLMs: holding architecture, training recipe, and compute constant, it varies only the training data and applies a systematic filtering/cleaning/rebalancing pipeline to the MAmmoTH-VL single-image subset, achieving +11.7pp average gains across 20 public VLM benchmarks (grounding, VQA, etc.); the key methodological contribution is isolating performance gains to data quality/distribution control and providing a reusable curation “prescription.”
- SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy 🆕NEW
- 赛道归属: 视频多模态理解评测(体育视频理解 + 可解释/可落地grounding)
- 核心创新点: 提出SoccerLens评测框架,将足球视频理解从“只看分类准确率”推进到“是否基于真实视觉证据”的检验:针对视角变化大、镜头切换快、场景拥挤等足球视频特性,新增/强化视觉grounding与证据一致性评估,旨在识别VLM是否依赖伪相关与捷径学习;核心突破在于把“超越准确率”的可视化证据对齐与鲁棒性诊断纳入标准化基准。
- Track: Video multimodal understanding evaluation (sports video grounding & reliability)
- Core innovation: Introduces SoccerLens to move soccer video understanding evaluation beyond classification accuracy toward grounded evidence: tailored to soccer’s viewpoint shifts, rapid shot transitions, and clutter, it incorporates visual grounding/evidence-consistency assessments to detect spurious correlations and shortcut learning in VLMs; the methodological advance is standardizing “beyond-accuracy” grounding-centric diagnostics for video VLM evaluation.
- Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All? 🆕NEW
- 赛道归属: 具身智能 / 多模态代理评测(镜像自我识别与自他区分)
- 核心创新点: 构建受控3D基准测试VLM代理的“镜像自我识别”功能:第一人称具身代理需从镜中反射推断自身隐藏身体属性并匹配目标,同时避免把他者误认为自己(self-other misattribution);通过任务设计将“镜像线索推理能力”与“基于先验/捷径的猜测”区分开,为评估VLM代理的自我表征、视角几何理解与身份归因提供可控实验范式。
- Track: Embodied AI / Multimodal agent evaluation (mirror self-recognition)
- Core innovation: Builds a controlled 3D benchmark to test mirror self-recognition in first-person VLM agents: the agent must infer a hidden body attribute from its reflection and select the matching target while avoiding self–other misattribution; the key methodological contribution is disentangling mirror-grounded reasoning from shortcut/priors, enabling controlled evaluation of self-representation, viewpoint geometry understanding, and identity attribution.
- Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models 🆕NEW
- 赛道归属: 3D医学多模态理解评测(体数据语义-空间推理VQA)
- 核心创新点: 提出CT-SpatialVQA,用于系统评估3D医学VLM在CT体数据上的“语义-空间”理解:针对现有模型可能依赖语言相关性与数据先验、缺乏空间落地的问题,基准聚焦解剖语义与三维空间关系的联合推理(例如方位、相对位置、跨切片一致性等),从而更精确地区分“真正读懂体数据”与“靠先验答题”。
- Track: 3D medical multimodal understanding evaluation (semantic-spatial VQA on volumes)
- Core innovation: Introduces CT-SpatialVQA to systematically evaluate semantic-spatial reasoning of 3D medical VLMs on CT volumes: it targets joint anatomical semantics and 3D spatial relations (e.g., orientation, relative position, cross-slice consistency) to reveal whether models are truly grounded in volumetric evidence versus relying on priors and language correlations.
- Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits 🆕NEW
- 赛道归属: 多模态可靠性与可解释性 / 机制可解释(VLM内部因果电路分析)
- 核心创新点: 提出统一的机制分析流水线VRP(VLM Reliability Probe),直接检验“注意力越尖锐越可靠”的Attention-Confidence假设:在LLaVA-1.5、PaliGemma、Qwen2-VL等开源VLM上,联合对比注意力结构、生成动态与隐藏态表征,并通过因果电路(causal circuits)视角定位可靠性来源;方法论突破在于把可靠性从表层可视化(attention map)提升到可干预、可归因的内部状态与因果路径层面。
- Track: Multimodal reliability & interpretability / Mechanistic interpretability (causal circuit analysis in VLMs)
- Core innovation: Proposes VRP (VLM Reliability Probe), a unified mechanistic pipeline to directly test the Attention–Confidence assumption (“sharper attention implies more reliable answers”): across open-weight VLMs (LLaVA-1.5, PaliGemma, Qwen2-VL), it jointly analyzes attention structure, generation dynamics, and hidden-state representations, and uses causal-circuit perspectives to localize where reliability arises; the key advance is shifting reliability assessment from surface attention visualizations to intervenable, attributable internal states and causal pathways.
- [2026-05-08] Fine-tuning a vision-language model for fracture-surface morphology recognition
- 赛道归属: 科学影像理解 / 领域VLM微调(材料断口形貌识别)
- 核心创新点: 基于开源VLM(Qwen3-VL-32B-Instruct)进行材料断口图像的领域适配微调,构建并利用13,168张文献挖掘的断口图像数据集;通过推理型大模型从“图像+文本”联合生成形貌标注,实现低人工成本的可扩展标注管线,从而把通用VLM的视觉表征对齐到材料学形貌判别所需的细粒度纹理/结构知识。
- Track: Scientific image understanding / domain VLM fine-tuning (fracture-surface morphology recognition)
- Key innovation: Domain-adapts an open VLM (Qwen3-VL-32B-Instruct) via fine-tuning on a curated 13,168-image literature-mined fracture dataset; uses a reasoning LLM to generate morphology annotations from joint image+text evidence, forming a scalable, low-manual-cost labeling pipeline that aligns generic VLM representations to fine-grained materials morphology cues.
- [2026-05-07] MedHorizon: Towards Long-context Medical Video Understanding in the Wild
- 赛道归属: 医疗长视频多模态理解 / 长上下文视频理解基准
- 核心创新点: 面向“全流程临床视频回顾”这一真实场景,聚焦医疗过程视频的关键难点(高冗余视角、关键证据稀疏且细微、强上下文依赖),提出在野外(long-context in the wild)的长视频理解任务设定与评测方向,突破以往依赖已定位片段/预分割视频的基准假设,使模型必须在长时序中自主发现与整合决定性证据。
- Track: Medical long-context video understanding / benchmark & task setting
- Key innovation: Targets full-procedure clinical video review with a “long-context in-the-wild” formulation, explicitly modeling medical-procedure properties (high redundancy, temporally sparse and subtle decisive evidence, strong context dependence) and moving beyond benchmarks that pre-localize evidence via clips/segments—forcing models to discover and aggregate key evidence over long timelines.
- [2026-05-07] Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning
- 赛道归属: 多模态模型遗忘/机器反学习(MLLM Unlearning)
- 核心创新点: 提出“零空间约束”的对比式视觉遗忘方法,在多模态耦合更强的MLLM中实现“定向移除目标视觉知识,同时最大化保留非目标视觉知识与全部文本知识”;通过在参数更新中引入零空间/正交约束,将遗忘梯度限制在不干扰保留子空间的方向上,并用对比学习目标强化“忘/记”边界,缓解遗忘-保留的权衡冲突。
- Track: Multimodal unlearning / machine unlearning for MLLMs
- Key innovation: Introduces null-space-constrained contrastive visual forgetting to remove target visual knowledge while preserving non-target visual and all textual knowledge; enforces orthogonality/null-space constraints on updates to avoid interfering with retained subspaces, and uses contrastive objectives to sharpen the forget-vs-retain boundary, improving the unlearning/retention trade-off in tightly coupled MLLMs.
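The general idea of a null-space (orthogonality) constraint can be sketched as projecting a forgetting update onto the orthogonal complement of the directions that should be preserved. The digest does not give the paper's construction, so the Gram-Schmidt projection, the vectors, and all names below are illustrative assumptions in plain Python.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: Vector, retained: List[Vector]) -> Vector:
    """Remove from `update` every component lying in span(retained)."""
    result = list(update)
    for b in orthonormal_basis(retained):
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result


# Hypothetical retained directions and a raw forgetting gradient.
retained_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
forget_grad = [0.5, -2.0, 3.0]
safe_update = project_out(forget_grad, retained_dirs)
```

After projection the update is orthogonal to every retained direction, so to first order it leaves the retained subspace untouched.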
GitHub
- [2026-05-14] Blaizzy/mlx-vlm ⭐4713
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-05-11] waybarrios/vllm-mlx ⭐1158
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-08] zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs-CLIP ⭐772
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
- [2026-05-10] dongyangli-del/EEG_Image_decode ⭐203
Using vision-language models to decode natural image perception from non-invasive brain recordings.
- [2026-05-12] ydyhello/Awesome-VLM-Streaming-Video ⭐155
📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.
强化学习 / Reinforcement Learning
arXiv
- MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning 🆕NEW
- 赛道归属: 自动驾驶强化学习(多模态感知-控制融合、可解释决策)
- 核心创新点: 提出以“3D可供性(affordances)”作为感知与控制之间的中间表征,用多模态Transformer从RGB等输入预测结构化、可解释的3D可供性,再由强化学习在该表征空间上进行策略学习;相较端到端直接回归动作,减少感知-控制脆弱接口带来的误差传播,同时提升在城市密集交互场景下的鲁棒性与可解释性。
- Track: Autonomous driving RL (multimodal perception-control fusion, interpretable decision-making)
- Core innovations: Bridges perception and control via explicit 3D affordance representations: a multimodal Transformer predicts structured, interpretable 3D affordances from RGB (and other modalities), and an RL policy is trained on this intermediate space. This avoids brittle end-to-end action regression, mitigates error propagation across modules, and improves robustness under dense urban interactions while enhancing interpretability.
- How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
- 赛道归属: 大模型对齐与推理优化(RL后训练、KV Cache压缩/显存优化)
- 核心创新点: 提出“Shadow Mask Distillation”用于RL在线rollout阶段的KV cache压缩:通过蒸馏得到可学习的掩码/稀疏策略,在尽量不破坏对齐与长上下文推理质量的前提下,显著降低轨迹生成时KV缓存的显存占用,从而缓解长上下文RL后训练的“memory wall”,提升可扩展性与吞吐。
- Track: LLM alignment & inference optimization (RL post-training, KV-cache compression/memory efficiency)
- Core innovation: Proposes Shadow Mask Distillation to compress the KV cache during online RL rollouts by distilling a learnable masking/sparsification policy, substantially reducing the KV-memory footprint of trajectory generation while largely preserving alignment and long-context reasoning quality, thereby easing the rollout “memory wall” of long-context RL post-training and improving scalability/throughput.
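To see why rollouts hit a memory wall, it helps to work through the usual KV-cache size arithmetic: two tensors (keys and values) per layer, each of shape kv_heads × head_dim per cached token. The model configuration below is hypothetical, and `keep_ratio` simply stands for the fraction of cached tokens some mask retains; it does not model the paper's method.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch: int = 1,
    bytes_per_elem: int = 2,  # fp16/bf16
    keep_ratio: float = 1.0,  # fraction of tokens kept after masking (assumed)
) -> int:
    """Bytes needed for keys and values across all layers."""
    kept_tokens = int(seq_len * keep_ratio)
    return 2 * layers * kv_heads * head_dim * kept_tokens * batch * bytes_per_elem


# Hypothetical config: 32 layers, 8 KV heads of dim 128, 32K-token rollout.
full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768)
sparse = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768,
                        keep_ratio=0.25)
```

Under these assumed numbers a single sequence needs 4 GiB of cache, and keeping a quarter of the tokens brings that to 1 GiB; the cost scales linearly with batch size and sequence length.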
- [2026-05-07] A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
- 赛道归属: 大模型对齐(偏好学习RL、GRPO/PPO类稳定训练)
- 核心创新点: 构建统一的Pair-GRPO理论框架,将“隐式偏好约束”到“显式偏好约束”纳入同一族方法,并提出Soft-Pair-GRPO与Hard-Pair-GRPO两种紧耦合变体;通过更清晰的约束形式与梯度方向刻画,降低梯度方差、提升更新稳定性与可解释性,并增强跨任务/奖励形态的泛化鲁棒性。
- Track: LLM alignment (preference-based RL, stable GRPO/PPO-style optimization)
- Core innovation: Establishes a unified Pair-GRPO framework spanning implicit-to-explicit preference constraints, introducing tightly coupled Soft- and Hard-Pair-GRPO variants; by making preference constraints and gradient directions more explicit, it reduces gradient variance, improves update stability/interpretability, and strengthens generalization across tasks and reward formulations.
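For context, the group-relative advantage used by standard GRPO normalizes each sampled response's reward by the mean and standard deviation of its group. The sketch below shows only that baseline; it is not the Soft- or Hard-Pair-GRPO variant, whose exact form is not given in this digest. Using the population standard deviation and the `eps` value are implementation choices assumed here.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for one prompt, with binary rewards.
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses above the group mean get positive advantages and the rest get negative ones, with no learned value function involved.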
- Discrete Flow Matching for Offline-to-Online Reinforcement Learning 🆕NEW
- 赛道归属: 离线到在线强化学习(离散动作生成式策略、Flow Matching)
- 核心创新点: 针对离散动作空间提出基于离散Flow Matching的生成式策略在线微调框架(DRIFT):将原本偏连续控制的扩散/flow匹配范式离散化以适配离散动作,并设计在线更新机制,使策略在与环境交互后能持续改进且不“遗忘”离线数据中学到的有效行为,从而缓解offline-to-online迁移中的分布偏移与性能退化。
- Track: Offline-to-online RL (discrete-action generative policies, Flow Matching)
- Core innovations: Introduces DRIFT, an online fine-tuning method for discrete action spaces using discrete flow matching. It adapts diffusion/flow-matching-style generative policies—typically built for continuous control—to discrete actions, and proposes an online update procedure that improves with new interaction while preserving useful behaviors from offline datasets, addressing distribution shift and degradation in offline-to-online RL.
- On the Importance of Multistability for Horizon Generalization in Reinforcement Learning 🆕NEW
- 赛道归属: POMDP长时序强化学习(记忆机制与泛化、RNN动力学)
- 核心创新点: 将长视野(长horizon)POMDP中的泛化困难归因到记忆网络的动力学性质,提出“多稳态(multistability)”对horizon泛化的重要性:通过分析/构造具有多个稳定吸引子状态的记忆表征,使RNN能在长时间间隔后仍保持关键信息并形成可分离的内部状态,从而提升长时依赖任务的样本效率与跨horizon泛化能力。
- Track: Long-horizon RL in POMDPs (memory mechanisms & generalization, RNN dynamics)
- Core innovations: Attributes poor horizon generalization in long-horizon POMDPs to the dynamical properties of memory networks, highlighting the role of multistability. By encouraging/leveraging multiple stable attractor states in recurrent memory representations, the agent can retain task-relevant information across long delays and maintain separable internal states, improving sample efficiency and generalization across horizons.
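Multistability can be seen in a textbook scalar recurrence, x ← tanh(w·x): for w > 1 it has two stable fixed points and so remembers the sign of its initial state indefinitely, while for w < 1 every state decays to zero. This one-unit map is an assumed toy for intuition, not the architecture studied in the paper.

```python
import math


def run(w: float, x0: float, steps: int = 100) -> float:
    """Iterate x <- tanh(w * x) with no further input and return the final state."""
    x = x0
    for _ in range(steps):
        x = math.tanh(w * x)
    return x


# Bistable unit (w = 2): the sign of a small initial "cue" is retained.
kept_pos = run(2.0, 0.1)
kept_neg = run(2.0, -0.1)

# Monostable unit (w = 0.5): the cue is forgotten regardless of its sign.
forgotten = run(0.5, 0.1)
```

Because the bistable unit holds its state for any number of steps, the stored bit does not depend on the horizon, which is the intuition behind linking multistability to horizon generalization.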
- When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy 🆕NEW
- 赛道归属: 文生图RLHF(Flow/Flow-Matching模型对齐、去多样性崩塌)
- 核心创新点: 揭示在flow-based文生图RLHF中“策略熵约束=保多样性”的常见假设失效:由于flow模型的固定生成噪声/变换结构,策略熵可保持不变但感知多样性仍会崩塌;据此提出“感知熵(Perceptual Entropy)”作为与人类感知一致的多样性度量与正则信号,用于在偏好优化过程中显式约束输出分布的感知覆盖度,从而缓解多样性塌缩。
- Track: Text-to-image RLHF (flow/flow-matching alignment, diversity collapse mitigation)
- Core innovations: Shows that entropy regularization can fail for flow-based RLHF: policy entropy may remain constant even as perceptual diversity collapses, due to the fixed noise/transform structure in flow models. Proposes Perceptual Entropy as a diversity metric and regularizer aligned with human perception, explicitly constraining perceptual coverage during preference optimization to prevent diversity collapse.
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models 🆕NEW
- 赛道归属: 推理增强与后训练强化学习(扩散式LLM、半自回归块生成/GRPO优化)
- 核心创新点: 系统性重审扩散式LLM在块级半自回归生成中的“块大小(block size)”作用,将其视为影响RL rollout轨迹与跨域学习动态的关键超参;提出面向多领域RL后训练的块大小选择/调度思路(Block-R1),以在并行解码粒度、信用分配长度与优化稳定性之间取得更优折中,从而提升多域推理强化学习的整体收益。
- Track: Reasoning post-training RL (diffusion LLMs, block-wise semi-autoregressive generation / GRPO)
- Core innovations: Reframes block size in block-wise semi-autoregressive diffusion LLMs as a central factor shaping RL rollout trajectories and multi-domain learning dynamics. Block-R1 studies and leverages block-size selection/scheduling to balance parallel decoding granularity, credit assignment horizon, and optimization stability, improving overall gains in multi-domain RL post-training for reasoning.
- Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates 🆕NEW
- 赛道归属: 逆强化学习(最大熵IRL、信赖域优化与稳定训练)
- 核心创新点: 提出Trust Region Inverse Reinforcement Learning:在经典最大熵IRL的对偶上升框架中,用“局部策略更新+信赖域约束”显式近似每轮所需的RL求解,从而在不完全解RL内环的情况下仍能进行稳定的对偶优化;相较对抗式IRL,兼顾计算效率与训练稳定性,并更接近经典方法的单调改进特性。
- Track: Inverse RL (maximum-entropy IRL, trust-region optimization & stability)
- Core innovations: Proposes Trust Region IRL: performs explicit dual ascent for maximum-entropy IRL while replacing expensive full RL solves with local policy updates constrained by a trust region. This yields stable dual optimization with improved efficiency, retaining properties closer to classical monotonic-improvement IRL compared to adversarial IRL approaches.
- Robust Probabilistic Shielding for Safe Offline Reinforcement Learning 🆕NEW
- 赛道归属: 安全离线强化学习(概率安全盾、SPI性能保证融合)
- 核心创新点: 将“安全盾(shield)”与离线RL的“安全策略改进(SPI)”统一到鲁棒概率框架:在仅有离线数据、模型不确定性存在时,构造带概率保证的shield对策略动作进行约束/修正,同时与SPI的性能下界保证协同,力图同时给出“相对基线不变差”的性能保证与“满足安全约束”的风险控制,实现更可落地的离线安全学习流程。
- Track: Safe offline RL (probabilistic shielding, integration with SPI guarantees)
- Core innovations: Unifies shielding and safe policy improvement in a robust probabilistic framework for offline RL. Under dataset-only learning and model uncertainty, it builds a probabilistic shield that constrains/corrects actions with safety guarantees, while coordinating with SPI-style performance improvement guarantees relative to a safe baseline—aiming to jointly ensure non-degradation in performance and controlled safety risk.
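One simple way to make a shield probabilistic with offline data is to allow an action only if an upper confidence bound on its unsafe probability stays below a risk threshold. The sketch below uses a one-sided Hoeffding bound; the counts, action names, and thresholds are hypothetical, and the paper's actual shield construction is not specified in this digest.

```python
import math
from typing import Dict, List, Tuple


def unsafe_upper_bound(unsafe: int, trials: int, alpha: float) -> float:
    """One-sided Hoeffding upper bound (confidence 1 - alpha) on P(unsafe)."""
    if trials == 0:
        return 1.0  # no data: assume the worst
    slack = math.sqrt(math.log(1.0 / alpha) / (2 * trials))
    return min(1.0, unsafe / trials + slack)


def shielded_actions(
    counts: Dict[str, Tuple[int, int]], delta: float, alpha: float = 0.05
) -> List[str]:
    """Actions whose unsafe-probability upper bound is at most `delta`."""
    return [
        action
        for action, (unsafe, trials) in counts.items()
        if unsafe_upper_bound(unsafe, trials, alpha) <= delta
    ]


# Hypothetical offline counts: action -> (unsafe outcomes, times tried).
counts = {"brake": (0, 400), "coast": (6, 400), "accelerate": (1, 5)}
allowed = shielded_actions(counts, delta=0.1)
```

Rarely tried actions are blocked even when their empirical failure rate looks moderate, because the bound widens as the sample count shrinks.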
- Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem 🆕NEW
- 赛道归属: 组合优化与调度强化学习(铁路交通重调度、半层级DRL/多智能体协同)
- 核心创新点: 面向铁路运营扰动下的车辆重调度(VRSP)提出半层级深度强化学习:将高层决策(如重排/分配策略)与低层执行(具体调度动作)解耦,以降低指数级组合空间的搜索难度,并更好适配实时性需求;通过层级化结构提升在复杂约束、多主体交互场景中的可扩展性与决策质量,推动从人工经验/OR向更自治的调度控制过渡。
- Track: RL for combinatorial optimization & scheduling (railway rescheduling, semi-hierarchical DRL / multi-agent coordination)
- Core innovations: Proposes a semi-hierarchical DRL approach for railway vehicle rescheduling under disruptions (VRSP). It decomposes high-level rescheduling/allocation decisions from low-level executable scheduling actions, reducing the effective combinatorial search complexity and improving real-time applicability. The hierarchical structure enhances scalability and decision quality under complex constraints and multi-entity interactions, moving toward more autonomous operations beyond manual/OR-based dispatching.
GitHub
- [2026-05-13] huggingface/trl ⭐18366
Train transformer language models with reinforcement learning.
- [2026-05-13] PufferAI/PufferLib ⭐5677 🆕NEW
Puffing up reinforcement learning
- [2026-05-13] rllm-org/rllm ⭐5500
Democratizing Reinforcement Learning for LLMs
- [2026-05-13] natolambert/rlhf-book ⭐1905
Textbook on reinforcement learning from human feedback
- [2026-05-14] pettingllms-ai/PettingLLMs ⭐167
[ICLR'26] Stronger-MAS: A RL Framework for multi LLM agent system; [arxiv] MetaAgent-X: End-to-End Reinforcement Learning Automatic Multi-Agent Syste...
HuggingFace Datasets
- [2026-05-03] ADSKAILab/Zero-To-CAD-1m
Zero-to-CAD 1M
One million executable, interpretable CAD construction sequences synthesized entirely without real-world data.
...
- [2026-05-13] TuringEnterprises/Open-MM-RL
Dataset Summary
Open-MM-RL is a multimodal STEM reasoning dataset covering Physics, Mathematics, Biology, and Chemistry. It is designed for...
- [2026-04-23] nvidia/Nemotron-Personas-Korea
Nemotron-Personas-Korea: 우리나라 실제 분포에 기반한 합성 페르소나를 위한 복합 AI 시스템 / A compound AI approach to personas grounded in real-world dist...
- [2026-05-08] Qwen/WebWorldData 🆕NEW
WebWorldData 🌐 Overview
WebWorldData is a large-scale dataset of 1.06M web interaction trajectories collect...
世界动作模型 / World Action Model
arXiv
- World Action Models: The Next Frontier in Embodied AI 🆕NEW
- 赛道归属: 具身智能 / 视觉-语言-动作(VLA)+ 世界模型(World Model)融合的策略学习(World Action Model范式)
- 核心创新点: 提出并系统化“世界动作模型(WAM)”这一新范式:将环境动力学的显式预测(世界模型)纳入动作生成/策略学习管线,突破传统VLA仅做“观测→动作”的反应式映射,转向“可干预的未来演化建模 + 基于预测的动作决策”的统一框架,从而为具身基础模型提供更强的可规划性、可推演性与对物理演化的建模能力。
- Track: Embodied AI / policy learning that fuses vision-language-action (VLA) models with world models (World Action Model paradigm)
- Core innovation: Introduces and formalizes the “World Action Model (WAM)” paradigm: explicitly integrates environment dynamics prediction (world models) into action generation/policy learning, moving beyond reactive VLA observation-to-action mappings toward a unified framework that models intervention-conditioned future evolution and uses it for decision making, giving embodied foundation models stronger planning, rollout, and physical-dynamics modeling capabilities.
- [2026-05-08] Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
- 赛道归属: 世界模型评测与可靠性诊断(World Action Model / 动态一致性)
- 核心创新点: 提出并系统化定义WAM可靠性的关键缺失维度——动作-状态一致性(action-state consistency),用于检验“模型生成的未来”是否与其声称的动作序列在动力学上相容,而不仅是视觉上合理;围绕该一致性构建诊断框架/评测思路,将WAM的失效从“看起来对”细化为“动力学不兼容”的可检测问题,从而为后续训练目标、校准与安全执行提供可操作的评价轴。
- Track: World-model evaluation & reliability diagnostics (World Action Model / dynamic consistency)
- Core innovation: Introduces and formalizes action–state consistency as a missing reliability axis for WAMs, testing whether imagined futures are dynamically compatible with the predicted action sequence rather than merely visually plausible; builds a diagnostic/evaluation perspective around this notion to make WAM failure modes measurable as dynamical incompatibility, enabling more actionable assessment for calibration, training objectives, and safe deployment.
- [2026-05-07] When to Trust Imagination: Adaptive Action Execution for World Action Models
- 赛道归属: 世界模型驱动的机器人控制(自适应执行 / 想象-现实一致性验证)
- 核心创新点: 将WAM的执行策略从“每次推理固定执行N步”提升为自适应动作执行:把是否继续执行想象动作序列建模为未来-现实验证(future-reality verification)问题;核心方法论是在执行过程中持续对比模型想象的未来与真实滚动的偏差/一致性,并据此动态决定执行更长的开环段还是提前重规划,从机制上缓解因想象漂移导致的失控与累积误差,实现“何时信任想象”的可决策化。
- Track: World-model-based robotic control (adaptive execution / imagination–reality verification)
- Core innovation: Replaces the standard “execute a fixed N predicted actions per inference” paradigm with adaptive action execution, formulating it as a future–reality verification problem; methodologically, it continuously checks consistency between imagined rollouts and real-world evolution during execution and uses this signal to decide whether to keep executing longer open-loop segments or replan early, mitigating imagination drift and compounding errors via an explicit trust-and-replan mechanism.
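The trust-and-replan loop can be sketched in a few lines: execute the predicted actions one by one, compare each observed state with the imagined one, and stop as soon as the gap exceeds a threshold. The 1D cart, the absolute-difference check, and the threshold are all illustrative assumptions; the paper's verification signal is not specified in this digest.

```python
from typing import Callable, Sequence


def execute_until_drift(
    actions: Sequence[float],
    imagined: Sequence[float],
    step: Callable[[float], float],
    threshold: float,
) -> int:
    """Run actions open-loop; stop after the first step whose observed state
    deviates from the imagined state by more than `threshold`.
    Returns the number of actions actually executed."""
    executed = 0
    for action, predicted in zip(actions, imagined):
        observed = step(action)  # the action is executed before it can be checked
        executed += 1
        if abs(observed - predicted) > threshold:
            break  # imagination no longer trusted: the caller should replan
    return executed


class SlipperyCart:
    """Toy 1D environment where only a fraction `slip` of each push takes effect."""

    def __init__(self, slip: float = 0.8) -> None:
        self.x = 0.0
        self.slip = slip

    def step(self, action: float) -> float:
        self.x += self.slip * action
        return self.x


plan = [1.0, 1.0, 1.0, 1.0]
imagined = [1.0, 2.0, 3.0, 4.0]  # the model imagines a cart that does not slip
n_done = execute_until_drift(plan, imagined, SlipperyCart().step, threshold=0.5)
```

In this toy run the accumulated slip crosses the threshold on the third action, so execution stops early, whereas a perfectly modeled cart would run the whole plan open-loop.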
- The DAWN of World-Action Interactive Models 🆕NEW
- 赛道归属: 自动驾驶 / 世界模型驱动的交互式规划与动作生成(World-Action Interactive Models, WAIM)
- 核心创新点: 提出“世界-动作交互模型(WAIM)”以刻画世界预测与动作选择的互依关系,指出现有WAM常见的并行分支或“先预测再规划”的刚性流水线难以体现“动作影响场景演化、场景演化反过来约束动作”的闭环耦合;并在自动驾驶中实例化为DAWN,通过在生成过程中联合/交替地对动作与世界演化进行一致性建模(以去噪式生成视角实现动作与未来场景的协同推断),实现更符合交互逻辑的场景-动作联合生成与规划。
- Track: Autonomous driving / world-model-driven interactive planning and action generation (World-Action Interactive Models, WAIM)
- Core innovation: Proposes “World-Action Interactive Models (WAIMs)” to explicitly capture the reciprocity between world prediction and action selection, addressing a limitation of prior WAMs: parallel branches or rigid predict-then-plan pipelines struggle to express the closed loop in which actions shape scene evolution and scene evolution in turn constrains actions. Instantiated as DAWN for autonomous driving, it performs coupled, consistency-driven co-inference of actions and future scene evolution during generation (via a denoising-style formulation), enabling more interaction-faithful joint scene–action generation and planning.
GitHub
- [2026-05-12] DravenALG/awesome-vla-wam ⭐385
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
- [2026-05-13] OpenMOSS/Awesome-WAM ⭐136 🆕NEW
A curated, continuously updated reading list, paper blogs, and resources for World Action Models (WAMs) in embodied AI.
- [2026-05-12] jiangranlv/DyWA ⭐82
[ICCV 2025] DyWA:Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Generated automatically by Daily AI Digest Agent. Generated at: 2026-05-14 01:03:07