AI 每日进展速报 / Daily AI Digest - 2026-04-17
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-04-16] An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation
- 赛道归属: 文生图 / 扩散模型训练与正则化(推理/训练优化)
- 核心创新点: 系统分析DSM训练下扩散模型对Fokker–Planck(FP)方程的偏离现象,指出“更严格的FP约束≠更好生成质量”,并提出用更轻量的正则项替代直接FP残差惩罚:在显著降低计算开销的同时,仍能有效降低FP残差并保留(或接近)FP正则带来的生成收益,从而给出“低成本FP一致性改进”的可行路径与经验结论。
- Track: Text-to-Image / Diffusion model training regularization (training/inference optimization)
- Core innovation: Provides an empirical study of FP-equation violations under DSM-trained diffusion models, showing that stricter FP enforcement does not necessarily improve sample quality. It then demonstrates that simpler, lightweight regularizers can recover much of the benefit of FP regularization—reducing FP residuals at substantially lower computational cost—offering a practical, low-overhead route to improved FP consistency without expensive residual penalties.
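To make the FP residual concrete: for a 1-D toy VP diffusion with Gaussian data, the exact marginal density satisfies the Fokker-Planck equation while a mis-calibrated density does not. A minimal finite-difference sketch (constant beta = 1 assumed; illustrative names, not the paper's code):

```python
import math

BETA = 1.0  # assumed constant noise schedule beta(t) = 1 for this toy

def var_t(t, sigma0_sq=4.0):
    # Marginal variance of the VP/OU diffusion dx = -0.5*x dt + dW started
    # from N(0, sigma0_sq): var(t) = 1 + (sigma0_sq - 1) * exp(-t)
    return 1.0 + (sigma0_sq - 1.0) * math.exp(-BETA * t)

def gauss_pdf(x, var):
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fp_residual(p, x, t, h=1e-3):
    # Pointwise Fokker-Planck residual for dx = -0.5*beta*x dt + sqrt(beta) dW:
    #   R = dp/dt - d/dx(0.5*beta*x*p) - 0.5*beta * d^2p/dx^2
    dp_dt = (p(x, t + h) - p(x, t - h)) / (2.0 * h)
    d_drift = (0.5 * BETA * (x + h) * p(x + h, t)
               - 0.5 * BETA * (x - h) * p(x - h, t)) / (2.0 * h)
    d2p = (p(x + h, t) - 2.0 * p(x, t) + p(x - h, t)) / (h * h)
    return dp_dt - d_drift - 0.5 * BETA * d2p

exact = lambda x, t: gauss_pdf(x, var_t(t))        # true marginal: residual ~ 0
wrong = lambda x, t: gauss_pdf(x, 1.5 * var_t(t))  # mis-calibrated density

r_exact = abs(fp_residual(exact, 0.7, 0.5))
r_wrong = abs(fp_residual(wrong, 0.7, 0.5))
```

Averaged over samples, this pointwise residual is the kind of quantity a direct FP penalty regularizes and the paper's lighter surrogates approximate.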
- [2026-04-16] Constraint-based Pre-training: From Structured Constraints to Scalable Model Initialization
- 赛道归属: 预训练与模型初始化(可伸缩/可变尺寸模型;生成模型与视觉模型通用)
- 核心创新点: 提出“基于约束的预训练”范式:在预训练阶段施加结构化约束,将与模型尺寸无关的知识解耦为可复用的weight templates,并把与尺寸相关的适配交给轻量weight scalers,从而把“不同深度/宽度模型的初始化”转化为多任务适配问题;进一步提出WeiT,用Kronecker约束将参数表示为模板的拼接与加权聚合,并用少量数据学习scaler来控制自适应连接,实现跨多种下游尺度快速构造权重、加速收敛并提升性能,且可泛化到Transformer与CNN及包含图像生成在内的多类任务。
- Track: Pre-training & model initialization (scalable/variable-size models; general across generative and vision models)
- Core innovation: Introduces a constraint-based pre-training paradigm that factorizes size-agnostic knowledge into reusable weight templates while delegating size-specific adaptation to lightweight weight scalers, reframing variable-size initialization as a multi-task adaptation problem. The proposed WeiT instantiates this with Kronecker-based constraints, representing parameters via template concatenation and weighted aggregation, with scaler-learned adaptive connections from limited data—enabling efficient weight construction across depths/widths, faster convergence, and improved performance across architectures (Transformers/CNNs) and tasks including image generation.
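A toy rendering of the template-plus-scaler idea (my own construction under stated assumptions, not the paper's WeiT code): shared small templates carry size-agnostic structure, a Kronecker expansion sets the target shape, and a few scalers are the only size-specific parameters.

```python
def kron(a, b):
    # Kronecker product of two matrices given as lists of lists
    ra, ca, rb, cb = len(a), len(a[0]), len(b), len(b[0])
    return [[a[i][j] * b[k][l] for j in range(ca) for l in range(cb)]
            for i in range(ra) for k in range(rb)]

def build_weight(templates, scalers, expander):
    # weight = sum_m scaler[m] * (template[m] kron expander)
    # templates: reusable, size-agnostic; the expander's shape picks the
    # target size; scalers: lightweight per-size mixing weights.
    acc = None
    for t, s in zip(templates, scalers):
        term = [[s * v for v in row] for row in kron(t, expander)]
        acc = term if acc is None else [[x + y for x, y in zip(r1, r2)]
                                        for r1, r2 in zip(acc, term)]
    return acc

templates = [[[1.0, 0.0], [0.0, 1.0]],   # shared 2x2 templates
             [[0.0, 1.0], [1.0, 0.0]]]
ones = [[1.0, 1.0, 1.0]]                  # 1x3 expander -> 2x6 target weight
W = build_weight(templates, [0.5, 2.0], ones)
```

Swapping in a differently shaped expander yields an initialization for a wider or deeper model from the same templates, which is the multi-task adaptation view of variable-size initialization.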
- [2026-04-16] Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models 🆕NEW
- 赛道归属: 图像编辑(基于视觉自回归模型的提示词引导编辑)
- 核心创新点: 提出 Masked Logit Nudging(MLN),将源图像的离散token编码通过VAR编码器“反投影”为logits,并在目标提示词条件下对模型预测logits进行“nudging”,沿源/目标提示词定义的语义轨迹对齐源token分布,从而在不改动模型参数的情况下实现可控编辑;同时设计基于源/目标prompt交叉注意力差分的掩码生成策略,将编辑严格限制在相关区域;最后引入量化误差修正的refinement以提升重建与编辑质量,使VAR系方法在高分辨率编辑上达到接近/优于扩散模型的效果且推理更快。
- Track: Image Editing (prompt-guided editing with visual autoregressive models)
- Core innovation: Proposes Masked Logit Nudging (MLN): converts the source image’s discrete token encodings into logits via the VAR encoder and nudges the model’s predicted logits under the target prompt toward the source-token logits along a semantic trajectory defined by source/target prompts, enabling controllable editing without updating backbone parameters; introduces a masking scheme based on cross-attention differences between source vs. edited prompts to confine changes to edit-relevant regions; adds a refinement stage to correct quantization errors, improving reconstruction/edit fidelity and achieving diffusion-comparable quality with much faster inference.
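One plausible reading of the nudging step, as a toy over per-token logit rows (all names and the blending rule are my assumptions, not the paper's implementation): a cross-attention difference picks edit-relevant positions, and non-edit positions are pulled back toward the source logits.

```python
def attention_diff_mask(attn_src, attn_tgt, thresh=0.1):
    # Mark token positions where source- vs. target-prompt cross-attention
    # differ most: these are treated as the edit-relevant regions.
    return [1.0 if abs(a - b) > thresh else 0.0
            for a, b in zip(attn_src, attn_tgt)]

def masked_logit_nudge(src_logits, tgt_logits, mask, strength=1.0):
    # Inside the mask, keep the target-prompt prediction (apply the edit);
    # outside it, nudge back toward the source-image logits so unedited
    # regions reconstruct the original content. No weights are updated.
    out = []
    for s_row, t_row, m in zip(src_logits, tgt_logits, mask):
        keep = strength * (1.0 - m)
        out.append([t + keep * (s - t) for s, t in zip(s_row, t_row)])
    return out

src = [[2.0, 0.1], [1.5, 0.2]]   # logits recovered from source tokens
tgt = [[0.3, 1.8], [0.4, 1.6]]   # model prediction under target prompt
mask = attention_diff_mask([0.9, 0.2], [0.1, 0.25])  # position 0 is the edit
edited = masked_logit_nudge(src, tgt, mask)
```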
- [2026-04-16] M3D-Net: Multi-Modal 3D Facial Feature Reconstruction Network for Deepfake Detection 🆕NEW
- 赛道归属: 多模态深度伪造检测(基于3D人脸重建的检测)
- 核心创新点: 提出端到端的 M3D-Net双流架构,从单张RGB图像中通过自监督3D人脸重建模块恢复细粒度几何与反射率等3D属性,并与RGB外观特征形成互补;设计 3D Feature Pre-fusion Module(PFM)对多尺度特征进行自适应调节与预融合,缓解模态/尺度差异;进一步通过带注意力机制的 Multi-modal Fusion Module(MFM)实现RGB与3D重建特征的有效融合,从而提升对高逼真伪造的鲁棒性与跨数据集泛化。
- Track: Multimodal Deepfake Detection (3D face reconstruction–based detection)
- Core innovation: Introduces M3D-Net, an end-to-end dual-stream framework that performs self-supervised 3D face reconstruction from a single RGB image to recover fine-grained geometry and reflectance, providing complementary cues to RGB appearance; proposes a 3D Feature Pre-fusion Module (PFM) to adaptively calibrate and pre-fuse multi-scale features to reduce modality/scale mismatch; employs an attention-based Multi-modal Fusion Module (MFM) to integrate RGB and reconstructed 3D features, improving robustness to highly realistic forgeries and cross-dataset generalization.
- [2026-04-15] Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation
- 赛道归属: 文生图交互式生成 / 人机共创(Progressive T2I + 可控生成)
- 核心创新点: 提出多阶段渐进式T2I生成范式,从草图级抽象逐步提升到高分辨率成图,在中间表征层暴露可编辑“抽象状态”以支持早期发散探索;通过“决策锁定/锁区”机制将已确认的局部属性或区域固定,使后续编辑仅作用于指定部分;推理时以差分(diffs)更新替代整图重生成,显著降低编辑漂移并增强可追溯的用户决策链,从系统设计层面提升可控性、用户主导感与多样性。
- Track: Interactive text-to-image generation / co-creative ideation (progressive T2I + controllable generation)
- Core innovation: Introduces a multi-stage progressive T2I workflow that moves from sketch-like abstractions to high-resolution images while exposing editable intermediate representations for early-stage exploration; adds a decision-locking mechanism to freeze confirmed regions/attributes so later edits affect only targeted parts; performs inference via diff-based updates rather than full regeneration to reduce drift and improve traceability, controllability, user agency, and output diversity.
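The decision-locking and diff-based update mechanics reduce to a simple merge rule; a sketch over a symbolic "canvas" of regions (region names are illustrative):

```python
def diff_update(canvas, proposal, locked):
    # Decision locking: confirmed (locked) regions keep their value; only
    # unlocked regions accept the newly generated proposal, so each edit is
    # a diff rather than a full regeneration.
    assert len(canvas) == len(proposal) == len(locked)
    return [c if lock else p for c, p, lock in zip(canvas, proposal, locked)]

canvas   = ["sky", "house", "tree", "road"]
proposal = ["sunset", "castle", "tree", "river"]
locked   = [False, True, False, False]   # the user confirmed the house region
updated  = diff_update(canvas, proposal, locked)
```

Because locked regions are never regenerated, each step's change set is small and attributable to one user decision, which is what makes the decision chain traceable.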
- [2026-04-15] ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding
- 赛道归属: 多主体个性化文生图 / 姿态可控生成(Subject-driven T2I with pose control)
- 核心创新点: 针对多主体复杂动作下“身份保持 vs 姿态约束”信号纠缠导致的身份融合与姿态畸变,提出统一DiT框架内的结构-外观解耦:用RAG-Pose从检索库引入“干净、显式”的姿态结构先验;设计非对称的EURoPE位置编码,将身份token与空间位置解绑定、同时将姿态token绑定到画布坐标以强化结构对齐;再用DSM适配器把身份保持更多转移到文本条件流中,形成端到端的解耦条件融合,从架构层面提升多主体身份一致性与姿态遵循。
- Track: Multi-subject personalized text-to-image generation / pose-guided controllable generation
- Core innovation: Resolves the identity–pose entanglement in complex multi-subject generation by architectural disentanglement within a unified DiT: a retrieval-augmented pose pipeline (RAG-Pose) supplies a clean explicit structural prior; an asymmetric EURoPE positional encoding decouples identity tokens from spatial locations while anchoring pose tokens to the canvas; a DSM adapter shifts identity preservation into the text-conditioning stream, jointly improving identity fidelity and pose adherence.
- [2026-04-15] PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
- 赛道归属: 工业异常图像生成 / 条件可控扩散生成(面向装配关系)
- 核心创新点: 面向工业装配场景中部件姿态/朝向与装配关系约束难以被现有生成方法显式建模的问题,提出条件解耦+几何先验的扩散式合成:将多视图输入分解为高频、纹理与RGB等特征并进行条件解耦;通过“特征时间调制”在扩散时间步上自适应融合,实现从粗到细的渐进生成且保持跨步一致性;引入强调关键工业元素的条件损失与指导部件相对位置的几何先验,确保语义正确与装配关系合理,从而提升生成数据对下游异常检测的可用性。
- Track: Industrial anomaly image generation / condition-controlled diffusion (assembly-aware)
- Core innovation: Proposes an assembly-aware diffusion synthesis pipeline with condition disentanglement and geometric priors: decomposes multi-view inputs into high-frequency/texture/RGB features; applies feature temporal modulation across diffusion timesteps for coarse-to-fine generation with consistency; adds a conditional loss to emphasize critical industrial elements and a geometric prior to enforce correct component placement/assembly relationships, improving downstream usability for anomaly detection.
- [2026-04-15] DiffMagicFace: Identity Consistent Facial Editing of Real Videos
- 赛道归属: 人脸视频编辑 / 身份一致性视频扩散编辑(Text/Image-guided)
- 核心创新点: 为解决真实视频人脸编辑中跨帧身份保持与编辑语义一致性难题,提出双模型并行推理的编辑框架:分别微调用于文本控制与图像控制的两套模型,在推理阶段协同约束同一帧的身份特征与编辑目标;构建“多视角人脸身份图像集”作为一致性支撑,通过渲染+优化生成多姿态参考,使方法不依赖大规模视频训练数据仍能获得跨帧稳定的身份与外观一致性,适配说话人等复杂场景。
- Track: Facial video editing / identity-consistent diffusion-based editing (text & image guided)
- Core innovation: Introduces a video face-editing framework that runs two separately fine-tuned controllers (text-control and image-control) concurrently at inference to jointly enforce edit semantics and identity preservation per frame; builds a multi-view identity image set via rendering plus optimization to provide viewpoint coverage, enabling strong cross-frame consistency without relying on video training datasets, including challenging talking-head scenarios.
- [2026-04-15] SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs 🆕NEW
- 赛道归属: 跨模态检索(冻结MLLM的检索适配/表征学习)
- 核心创新点: 提出 SLQ(Shared Latent Queries),在完全冻结MLLM主干的前提下,仅引入少量可学习的共享潜在查询token,分别拼接到图像/文本token序列末端,利用模型原生的因果注意力将其作为全局聚合接口,直接产出统一语义空间的紧凑检索embedding,从“激发”预训练表征而非“覆盖”表征,避免全量微调/LoRA对知识结构的破坏;同时构建 KARR-Bench 用于评测更偏知识与推理的检索能力,验证该轻量适配在多基准上优于侵入式微调方案。
- Track: Cross-modal Retrieval (retrieval adaptation with frozen MLLMs / representation learning)
- Core innovation: Proposes SLQ (Shared Latent Queries) to adapt a fully frozen MLLM into a retriever by adding only a small set of learnable shared latent query tokens appended to both image and text sequences; leveraging the model’s native causal attention, these queries act as global aggregation interfaces to produce compact embeddings in a unified space—eliciting pretrained representations rather than overwriting them, avoiding semantic-space disruption seen in full fine-tuning/LoRA; additionally introduces KARR-Bench to evaluate knowledge-aware reasoning retrieval, demonstrating the method’s advantage on reasoning-centric retrieval beyond superficial matching.
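The aggregation mechanism can be sketched abstractly: latent queries appended at the sequence end can, under causal attention, see every content token, so their outputs act as pooled embeddings. Here attention is stood in for by a causal mean (a toy, not the MLLM's attention):

```python
import math

def embed_with_latent_queries(tokens, queries):
    # Append shared learnable queries to a (frozen) token sequence; each
    # query attends causally over the whole prefix and its output is a
    # global pooled vector. Toy aggregation: mean over the causal prefix.
    seq = tokens + queries
    outs = []
    for qi in range(len(tokens), len(seq)):
        prefix = seq[:qi + 1]
        dim = len(prefix[0])
        outs.append([sum(v[d] for v in prefix) / len(prefix)
                     for d in range(dim)])
    return [x for vec in outs for x in vec]  # concat -> one embedding

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

queries = [[0.1, 0.1], [0.2, -0.1]]          # shared across both modalities
img = embed_with_latent_queries([[1.0, 0.0], [0.0, 1.0]], queries)
txt = embed_with_latent_queries([[0.9, 0.1], [0.1, 0.9]], queries)
```

Because the same queries serve image and text sequences, matched content lands near the same point in the embedding space without touching backbone weights.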
- [2026-04-15] ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression
- 赛道归属: 医学影像生成 / 条件可控扩散Transformer(纵向MRI生成)
- 核心创新点: 面向阿尔茨海默病纵向随访MRI的“时间间隔+个体临床信息”可解释控制,提出区间感知的文本条件DiT:将随访间隔、人口学/诊断/量表等多域信息组织为自然语言提示,实现比粗粒度分期更细的时间控制;采用OpenCLIP与T5双文本编码器分别提供视觉-语言对齐与临床语言理解,并通过交叉注意力+自适应层归一化注入到DiT中实现局部与全局调制;结合RoPE增强解剖空间建模,并在SDXL-VAE潜空间扩散以兼顾高分辨率与效率。
- Track: Medical image generation / text-conditioned Diffusion Transformer for longitudinal MRI synthesis
- Core innovation: Presents an interval-aware, clinically text-conditioned DiT for longitudinal AD MRI: encodes follow-up interval plus demographic/diagnostic/neuropsychological metadata as natural-language prompts for fine-grained, interpretable time control; uses dual text encoders (OpenCLIP for vision-language alignment, T5 for richer clinical semantics) fused into DiT via cross-attention (fine guidance) and adaptive layer norm (global modulation); improves anatomical fidelity with RoPE on image tokens and performs diffusion in an SDXL-VAE latent space for efficient high-resolution reconstruction.
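The interval-aware conditioning amounts to serializing structured metadata into a prompt; a sketch with hypothetical field names and template wording (the paper's exact template is not reproduced here):

```python
def clinical_prompt(meta):
    # Render follow-up interval plus demographics/diagnosis/score as text:
    # this is what gives the DiT fine-grained, interpretable time control
    # beyond coarse disease-stage labels.
    return (f"Follow-up brain MRI {meta['interval_months']} months after "
            f"baseline; {meta['age']}-year-old {meta['sex']}, diagnosis "
            f"{meta['diagnosis']}, MMSE {meta['mmse']}.")

p = clinical_prompt({"interval_months": 18, "age": 74, "sex": "female",
                     "diagnosis": "MCI", "mmse": 26})
```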
GitHub
- [2026-04-17] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐11053
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-04-16] Light-Heart-Labs/DreamServer ⭐428
Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
- [2026-04-17] shinpr/mcp-image ⭐99
MCP server for AI image generation and editing with automatic prompt optimization and quality presets (fast/balanced/quality). Powered by Gemini (Nano...
- [2026-04-16] ferranpons/Llamatik ⭐98 🆕NEW
True on-device AI for Kotlin Multiplatform (Android, iOS, Desktop, JVM, WASM). LLM, Speech-to-Text and Image Generation — powered by llama.cpp, whispe...
- [2026-04-16] JianWang97/jubensha-ai ⭐89 🆕NEW
AI-powered murder mystery ("jubensha") game where all characters are played by AI agents. Supports AI script generation, TTS narration, and AI image generation; integrates minimax.
HuggingFace Models
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-04-16] How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos
- 赛道归属: 视频编辑(程序性第一视角视频的数据构建/合成编辑与基准评测)
- 核心创新点: 提出PIE-V框架,将“干净的关键步骤流程视频”系统性增强为“包含可控错误与恢复”的程序性第一视角视频:1)用心理学启发的错误规划器,按流程阶段与语义负载生成更符合人类行为的偏差;2)引入恢复/纠正规划器显式建模人类纠错策略,使错误—恢复链条可评测;3)通过LLM写手进行级联一致性改写(保证文本叙事、步骤逻辑与状态变化一致),并用LLM评审做程序连贯性验证与失败修复;4)在视频层面用文本引导视频生成合成替换片段并无缝拼接回原视频,保持视觉合理性与时序连续;5)配套统一错误分类体系与9维人工量表,覆盖步骤级/流程级的可置信逻辑、状态变化一致性与文视频对齐,从而形成可复现的“错误感知”程序监控基准。
- Track: Video Editing (dataset construction/synthetic editing + benchmarking for egocentric procedural videos)
- Core innovations: Proposes PIE-V, a framework that augments “clean” key-step procedural videos into “mistake-and-recovery-aware” egocentric episodes in a controlled, human-plausible way: (1) a psychology-inspired error planner conditioned on procedure phase and semantic step load to generate realistic deviations; (2) a correction planner that explicitly models recovery behaviors, enabling evaluation over mistake–recovery traces; (3) an LLM writer for cascade-consistent rewrites to keep narration, step logic, and object-state changes coherent, plus an LLM judge to validate procedural coherence and repair failures; (4) text-guided video generation to synthesize replacement clips and stitch them back into the original episode for visually plausible temporal continuity; (5) a unified taxonomy and a 9-metric human rubric spanning step-/procedure-level plausibility, logic with confidence, state-change coherence, and text–video grounding to create a reproducible benchmark for mistake-aware procedural monitoring.
- [2026-04-16] Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
- 赛道归属: 多模态理解(AIGC视频取证/检测,面向图生视频I2V的时序取证)
- 核心创新点: 提出Flow of Truth,将I2V取证从“逐帧/空间伪迹定位”提升为“沿时间追踪像素如何流动与变形”的主动式时序取证:1)重新定义视频生成过程为像素随时间的运动而非独立帧合成,从问题建模上引入可追踪的时序一致性约束;2)设计可学习的取证模板(forensic template),使取证信号能随像素运动一致演化,避免传统静态签名在时序漂移/形变下失效;3)提出模板引导的光流模块,将运动与内容解耦,在不同生成内容与风格下仍能稳定追踪取证线索;4)验证其可跨商业与开源I2V模型泛化,显著提升时序层面的取证鲁棒性与检测性能。
- Track: Multimodal Understanding (AIGC video forensics/detection for Image-to-Video generation)
- Core innovations: Introduces Flow of Truth, a proactive temporal forensics framework that shifts I2V forensics from per-frame/spatial artifact localization to tracing how pixels flow and deform over time: (1) reframes video generation as pixel motion through time rather than independent frame synthesis, enabling temporally consistent forensic modeling; (2) proposes a learnable forensic template whose signature evolves coherently with pixel motion, mitigating drift/deformation that breaks static signatures; (3) a template-guided flow module that decouples motion from image content, improving robustness across diverse content/styles; (4) demonstrates strong generalization across commercial and open-source I2V models with substantially improved temporal forensics performance.
- [2026-04-16] Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation 🆕NEW
- 赛道归属: 视频生成(图像到视频 I2V)/ 数据合成(手势生成与评测)
- 核心创新点: 提出一套基于图像到视频基础模型的“Prompt驱动指向性手势(deictic gesture)”合成与评测框架:用少量真人参考样本作为锚点,通过提示词生成具有真实外观且具备可控语义的手势视频数据;系统性评估合成数据在视觉保真度、与真实分布的一致性以及“新颖性/多样性”上的贡献,并验证与真实数据混合训练可显著提升下游手势相关模型表现,从而将I2V模型用于零样本/低成本手势数据扩增的可行性落到可复用的数据生成流水线与量化实验上。
- Track: Video Generation (Image-to-Video) / Data Synthesis (Gesture generation & evaluation)
- Core innovation: Introduces a prompt-driven deictic-gesture synthesis and evaluation framework built on image-to-video foundation models: starting from a small set of human-recorded reference samples, it uses text prompts to generate semantically guided, photorealistic gesture videos; it then rigorously measures fidelity, distributional alignment, and added novelty/variability, and demonstrates that mixing synthetic with real data improves downstream gesture models—turning I2V into a practical zero-shot/low-cost pipeline for gesture data augmentation with quantitative validation.
- [2026-04-16] Controllable Video Object Insertion via Multiview Priors 🆕NEW
- 赛道归属: 视频编辑(视频目标插入 / Video Object Insertion)
- 核心创新点: 面向“在既有视频中插入新物体”的一致性难题,提出基于多视角先验的可控插入框架:将2D参考图像提升为多视角表示,引入“双路径视角一致条件注入”以在不同视角/运动下稳定约束物体身份与外观;通过质量感知加权机制自适应抑制噪声参考带来的错误引导;并设计“集成感知一致性模块(Integration-Aware Consistency)”在插入区域的遮挡关系、边界融合与空间真实感上进行约束,同时保持跨帧时间连续性,从而显著提升外观一致、对齐准确且时序稳定的插入效果。
- Track: Video Editing (Video Object Insertion)
- Core innovation: Proposes a controllable object-insertion framework leveraging multiview priors to tackle appearance consistency, alignment, and temporal coherence: it lifts 2D references into multiview representations and uses a dual-path view-consistent conditioning scheme to stabilize identity/appearance across viewpoint changes; a quality-aware weighting mechanism mitigates noisy/imperfect references; and an Integration-Aware Consistency Module enforces realistic occlusion handling and boundary blending while preserving frame-to-frame continuity, yielding more stable and realistic insertions.
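The quality-aware weighting step can be illustrated as a weighted fusion of per-view reference features (a toy with vectors in place of real feature maps; not the paper's module):

```python
def fuse_views(view_feats, quality):
    # Quality-aware weighting over multiview reference features: noisy or
    # unreliable views are down-weighted so they cannot mislead insertion.
    total = sum(quality)
    weights = [q / total for q in quality]
    dim = len(view_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, view_feats))
            for d in range(dim)]

views = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]   # third view is corrupted
fused = fuse_views(views, [1.0, 1.0, 0.0])     # zero quality -> ignored
```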
- [2026-04-15] Seedance 2.0: Advancing Video Generation for World Complexity
- 赛道归属: 音视频联合生成(原生多模态生成)/ 多模态参考驱动的生成与编辑
- 核心创新点: 提出原生音视频一体化的多模态生成模型,采用统一且高效的大规模架构实现音频与视频的联合建模与同步生成;在同一框架内支持文本/图像/音频/视频四类输入作为条件与参考,覆盖更完整的“多模态参考+编辑”能力组合(如多段视频、多张图、多段音频的联合约束),面向复杂世界内容提升整体生成质量与一致性;提供加速版本以面向低时延场景,在保持生成能力的同时优化推理速度与可用性。
- Track: Joint audio-video generation (native multimodal generation) / Multimodal reference-conditioned generation & editing
- Key innovations: Introduces a native audio-video unified multimodal generative model with a single efficient large-scale architecture for joint modeling and synchronized generation of audio and video; supports four conditioning modalities (text/image/audio/video) within one framework, enabling an industry-oriented, comprehensive suite of multimodal reference and editing capabilities (e.g., jointly constraining generation with multiple video clips, images, and audio clips) to better handle “world complexity” and improve overall fidelity/consistency; provides an accelerated “Fast” variant targeting low-latency deployment by optimizing generation speed while retaining core generation performance.
- [2026-04-15] From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
- 赛道归属: 跨视角视频生成(Exo-to-Ego / Ego-to-Exo)、扩散式序列建模
- 核心创新点: 将同步的第三人称到第一人称生成从“条件→输出”的配对学习,重构为“单一连续序列”的序列信号建模:提出 Syn2Seq-Forcing,通过在源/目标视频之间做插值,把原本由同步采集导致的时空/几何跳变(discontinuity)转化为可学习的连续过渡,从而更适配 DFoT 等扩散序列模型捕获跨帧一致的转场;并实证指出主要难点来自视频的时空不连续而非位姿插值本身(仅视频插值即可显著提升),同时该连续序列表述可统一 Exo2Ego 与 Ego2Exo 于同一框架中。
- Track: Cross-view video generation (Exo-to-Ego / Ego-to-Exo), diffusion-based sequence modeling
- Core innovation: Reframes synchronized exo→ego generation from a paired “condition→output” task into continuous sequence signal modeling: proposes Syn2Seq-Forcing that interpolates between source and target videos to convert synchronization-induced spatio-temporal/geometric jumps into learnable smooth transitions, making diffusion sequence models (e.g., DFoT) better at coherent frame-to-frame transitions; empirically shows the dominant difficulty is video discontinuity (video-only interpolation already yields large gains), and the formulation naturally unifies Exo2Ego and Ego2Exo within one continuous sequence framework.
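The "pair to continuous sequence" reformulation can be sketched with linear interpolation over toy frame features (a simplification of Syn2Seq-Forcing; real frames would be interpolated in a learned space):

```python
def syn2seq(src, tgt, n_inter=3):
    # Bridge a synchronized exo/ego pair into ONE continuous sequence:
    # insert blends between the last source frame and the first target
    # frame, so the abrupt cross-view jump becomes a smooth, learnable
    # transition for a diffusion sequence model.
    alphas = [k / (n_inter + 1) for k in range(1, n_inter + 1)]
    bridge = [[(1.0 - a) * s + a * t for s, t in zip(src[-1], tgt[0])]
              for a in alphas]
    return src + bridge + tgt

seq = syn2seq([[0.0, 0.0]], [[4.0, 8.0]])  # 1 src + 3 blends + 1 tgt frame
```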
- [2026-04-15] DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
- 赛道归属: 视频风格化/视频重渲染(实时流式)、扩散模型加速与蒸馏
- 核心创新点: 提出 RTR-DiT 将 DiT 用作“实时重渲染器”:先在视频风格化数据上微调双向 teacher(同时支持文本引导与参考图引导),再通过 Self Forcing + Distribution Matching Distillation 将其蒸馏为少步数的自回归扩散 Transformer,实现流式长视频的低延迟处理;同时设计 reference-preserving 的 KV cache 更新策略,在保证长时稳定与时序一致性的前提下,支持文本提示与参考图的实时切换,解决长视频扩散风格化的漂移与高算力/多步去噪瓶颈。
- Track: Video stylization / rerendering (real-time streaming), diffusion acceleration & distillation
- Core innovation: Proposes RTR-DiT as a “real-time rerenderer” built on DiT: fine-tunes a bidirectional teacher for both text-guided and reference-guided stylization, then distills it into a few-step autoregressive diffusion Transformer via Self Forcing and Distribution Matching Distillation for low-latency streaming on long videos; introduces a reference-preserving KV-cache update that stabilizes long-horizon consistency while enabling real-time switching between text prompts and reference images, addressing drift and the high-cost multi-step denoising of diffusion stylization.
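A reference-preserving cache reduces to a rolling window with pinned entries; a structural sketch (class and method names are mine, and real KV entries would be tensors):

```python
from collections import deque

class ReferencePreservingCache:
    # Rolling KV cache for streaming generation: per-frame entries are
    # evicted FIFO once the window fills, but reference entries (prompt /
    # reference-image tokens) are pinned so long-horizon style stays stable.
    def __init__(self, window):
        self.window = window
        self.frames = deque()
        self.reference = []

    def set_reference(self, kv):
        # Swapping this mid-stream is what enables real-time switching
        # between text prompts or reference images.
        self.reference = list(kv)

    def append_frame(self, kv):
        self.frames.append(kv)
        if len(self.frames) > self.window:
            self.frames.popleft()

    def context(self):
        return self.reference + [kv for f in self.frames for kv in f]

cache = ReferencePreservingCache(window=3)
cache.set_reference(["ref_img_kv"])
for i in range(5):
    cache.append_frame([f"frame{i}_kv"])
ctx = cache.context()
```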
- [2026-04-15] VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning
- 赛道归属: 视频编辑(光照/颜色编辑:relighting、recoloring、day-night 等)、自监督/零样本编辑、基于流的生成式编辑
- 核心创新点: 提出 VibeFlow 用自监督方式“激活”预训练视频生成模型的物理先验,摆脱合成配对监督:通过解耦的数据扰动管线强制模型在训练中学习“结构来自源视频、颜色/光照线索来自参考图”的自适应重组,实现稳健的结构-色光解耦;针对 flow 类模型的离散化误差,引入 Residual Velocity Fields 进行速度场残差校正,并配合 Structural Distortion Consistency Regularization 约束结构畸变一致性,从而同时提升结构保真与时间一致性;整体支持零样本泛化到多类 chroma-lux 编辑任务并降低训练/计算成本。
- Track: Video editing (illumination & color / chroma-lux), self-supervised & zero-shot editing, flow-based generative editing
- Core innovation: Introduces VibeFlow, a self-supervised framework that leverages the physical priors of pre-trained video generative models to avoid costly synthetic paired supervision: a disentangled perturbation pipeline enforces adaptive recombination where structure is taken from the source video while color/illumination cues come from a reference image, yielding robust structure–chroma/lux disentanglement; to mitigate discretization errors in flow-based models, it adds Residual Velocity Fields plus Structural Distortion Consistency Regularization to preserve structure and temporal coherence. The method generalizes zero-shot to relighting, recoloring, low-light enhancement, day–night translation, and object-specific color edits with reduced compute/training overhead.
- [2026-04-14] Lyra 2.0: Explorable Generative 3D Worlds
- 赛道归属: 视频生成 + 3D世界生成/重建(camera-controlled long-horizon video → 3D lifting)
- 核心创新点: 提出“生成式重建”框架 Lyra 2.0,用视频生成的视觉先验驱动可实时渲染的3D世界构建,重点解决长轨迹下的3D一致性退化两大根因:1)针对空间遗忘,维护逐帧3D几何但仅用于信息路由(检索相关历史帧并建立到目标视角的稠密对应),外观仍由生成模型合成,从而在大视角变化与回访场景时保持结构一致;2)针对时间漂移,用自增强历史(Self-augmented histories)训练,让模型暴露于自身退化输出并学习“纠偏”而非累积误差。最终可生成更长、更一致的探索视频,并用于微调前馈式重建模型以稳定恢复高质量3D场景。
- Track: Video generation + 3D world generation/reconstruction (camera-controlled long-horizon video → 3D lifting)
- Core innovation: Lyra 2.0 formalizes a “generative reconstruction” pipeline that turns camera-controlled videos into persistent 3D worlds, tackling two failure modes in long-horizon 3D-consistent generation: (1) spatial forgetting is mitigated by keeping per-frame 3D geometry only for information routing—retrieving relevant past frames and building dense correspondences to target views—while leaving appearance synthesis to the generative prior; (2) temporal drifting is reduced via self-augmented histories training that feeds the model its own degraded outputs so it learns to correct drift instead of compounding it. The resulting longer, more 3D-consistent trajectories enable reliable fine-tuning of feed-forward 3D reconstruction models.
- [2026-04-14] Generative Refinement Networks for Visual Synthesis
- 赛道归属: 图像生成(自回归/非扩散范式)、文生图、文生视频
- 核心创新点: 提出Generative Refinement Networks(GRN)作为替代扩散的视觉生成范式:用近乎无损的分层二值量化HBQ缓解离散tokenization带来的信息损失,并在AR生成上引入“全局逐步精修”的refinement机制以纠正误差累积、逐轮提升细节;同时用熵引导采样实现复杂度感知的自适应步数生成,在不牺牲质量的前提下降低不必要计算,并在ImageNet及T2I/T2V扩展上验证可扩展性与SOTA指标。
- Track: Image generation (autoregressive / post-diffusion paradigm), Text-to-Image, Text-to-Video
- Key innovations: Proposes Generative Refinement Networks (GRN) as a diffusion alternative: (1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) to remove the discrete-token bottleneck; (2) a global progressive refinement mechanism on top of AR generation to correct accumulated errors and iteratively polish details; (3) entropy-guided sampling for complexity-aware adaptive-step generation, reducing compute without degrading quality, with strong results on ImageNet and scalable T2I/T2V settings.
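The entropy-guided adaptive-step idea can be sketched on a single categorical prediction: refine only while uncertainty stays high, with refinement stood in for by temperature sharpening (my toy, not GRN's refiner):

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen(probs):
    # Stand-in for one refinement round: square and renormalize, which
    # monotonically concentrates a non-uniform distribution.
    sq = [p * p for p in probs]
    z = sum(sq)
    return [p / z for p in sq]

def refine_adaptive(probs, max_steps=8, thresh=0.5):
    # Entropy-guided sampling: easy (already confident) predictions stop
    # early; hard ones get more refinement steps, saving compute overall.
    steps = 0
    while entropy(probs) > thresh and steps < max_steps:
        probs = sharpen(probs)
        steps += 1
    return probs, steps

final, steps = refine_adaptive([0.4, 0.3, 0.2, 0.1])
```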
GitHub
- [2026-04-16] hao-ai-lab/FastVideo ⭐3396
A unified inference and post-training framework for accelerated video generation.
- [2026-04-16] ZeroLu/awesome-seedance ⭐1495
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-04-16] YouMind-OpenLab/awesome-seedance-2-prompts ⭐642
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-04-16] thu-ml/Causal-Forcing ⭐567
Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation"
- [2026-04-16] Paker-kk/Flovart ⭐68 🆕NEW
Flovart is a web-based infinite canvas inspired by Lovart. It merges flexible drawing tools, a layered workspace and an organized inspiration library ...
音频生成 / Audio Generation
arXiv
- [2026-04-16] ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
- 赛道归属: 视频到音频生成(V2A)/ 可控音频生成(多模态条件控制)
- 核心创新点: 提出统一的多模态V2A框架,实现对视频、文本与参考音频的细粒度可控生成;通过“联合视觉编码”将CLIP与时空音视编码器融合,增强跨模态对齐并提升视觉-文本冲突场景下的文本可控性;提出“时间-音色解耦”以抑制参考音频中的冗余时间线索、保留可区分的音色特征,从而提升风格控制精度;设计模态鲁棒训练(REPA统一表征对齐+随机模态dropout)以提升缺失/冲突模态下的稳健性;同时构建VGGSound-TVC基准,系统评测不同冲突强度下的文本可控能力。
- Track: Video-to-Audio (V2A) generation / Controllable audio generation (multimodal conditioning)
- Key innovations: Proposes a unified multimodal V2A framework enabling fine-grained control via video, text, and reference audio; introduces a joint visual encoding scheme that fuses CLIP with a spatio-temporal audio-visual encoder to strengthen cross-modal alignment and improve text controllability under visual-text conflicts; proposes temporal–timbre decoupling to suppress redundant temporal cues in reference audio while preserving discriminative timbre for more precise style control; designs modality-robust training with unified representation alignment (REPA) plus random modality dropout to improve robustness to missing/conflicting modalities; releases the VGGSound-TVC benchmark to systematically evaluate textual controllability under varying conflict levels.
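Random modality dropout is the most directly sketchable piece of the robust-training recipe (the never-drop-everything guard is my assumption about a sensible implementation):

```python
import random

def dropout_modalities(cond, p_drop=0.3, rng=random):
    # Randomly drop whole conditioning modalities during training so the
    # model stays usable when a modality is missing at test time, or when
    # one modality must be discounted because it conflicts with another.
    out = {name: (None if rng.random() < p_drop else feats)
           for name, feats in cond.items()}
    if all(v is None for v in out.values()):   # never drop everything
        keep = sorted(out)[0]
        out[keep] = cond[keep]
    return out

rng = random.Random(0)
cond = {"video": "v_feats", "text": "t_feats", "ref_audio": "a_feats"}
trials = [dropout_modalities(cond, rng=rng) for _ in range(1000)]
```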
- [2026-04-15] Seedance 2.0: Advancing Video Generation for World Complexity
- 赛道归属: 音视频联合生成(原生多模态生成)/ 多模态参考驱动的生成与编辑
- 核心创新点: 提出原生音视频一体化的多模态生成模型,采用统一且高效的大规模架构实现音频与视频的联合建模与同步生成;在同一框架内支持文本/图像/音频/视频四类输入作为条件与参考,覆盖更完整的“多模态参考+编辑”能力组合(如多段视频、多张图、多段音频的联合约束),面向复杂世界内容提升整体生成质量与一致性;提供加速版本以面向低时延场景,在保持生成能力的同时优化推理速度与可用性。
- Track: Joint audio-video generation (native multimodal generation) / Multimodal reference-conditioned generation & editing
- Key innovations: Introduces a native audio-video unified multimodal generative model with a single efficient large-scale architecture for joint modeling and synchronized generation of audio and video; supports four conditioning modalities (text/image/audio/video) within one framework, enabling an industry-oriented, comprehensive suite of multimodal reference and editing capabilities (e.g., jointly constraining generation with multiple video clips, images, and audio clips) to better handle “world complexity” and improve overall fidelity/consistency; provides an accelerated “Fast” variant targeting low-latency deployment by optimizing generation speed while retaining core generation performance.
- [2026-04-15] Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models 🆕NEW
- 赛道归属: 多模态理解(音频-视频-语言对齐 / AVLM 可靠性与幻觉抑制)
- 核心创新点: 提出 Audio-Contrastive Preference Optimization(ACPO)以抑制“视觉主导导致的音频幻觉”。方法上采用“双轴偏好学习”:(1) 输出对比目标在偏好优化中惩罚把视觉线索伪装成音频事实的回答,强化“只基于听到的内容作答”;(2) 输入对比目标通过交换音轨构造反事实输入,显式惩罚对真实音频不敏感(对换音轨仍生成相同内容)的生成行为,从训练信号层面打破视觉捷径依赖;在不牺牲整体多模态能力的前提下提升音频落地性与抗幻觉能力。
- Track: Multimodal Understanding (Audio-Video-Language alignment / AVLM reliability & hallucination mitigation)
- Core innovation: Proposes Audio-Contrastive Preference Optimization (ACPO) to curb “video-driven audio hallucination” caused by visual dominance. It introduces dual-axis preference learning: (1) an output-contrastive objective that penalizes responses that present visual cues as if they were audio-grounded facts; (2) an input-contrastive objective built via audio-track swapping to create counterfactual inputs, explicitly penalizing generations that are invariant to the true audio signal. This breaks reliance on visual shortcuts and improves faithful audio grounding without degrading overall multimodal capability.
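One plausible construction of ACPO's two preference axes as data pairs (field names and the pairing scheme are my illustration, not the paper's pipeline): the output axis prefers audio-grounded answers; the input axis swaps audio tracks to build counterfactual inputs.

```python
def build_acpo_pairs(samples):
    # Each sample: a video, its true audio, an audio-grounded answer, and a
    # visual-shortcut answer that presents seen cues as heard facts.
    output_pairs, input_pairs = [], []
    for i, s in enumerate(samples):
        # Output axis: for the real input, prefer the grounded answer.
        output_pairs.append({"video": s["video"], "audio": s["audio"],
                             "chosen": s["audio_answer"],
                             "rejected": s["visual_answer"]})
        # Input axis: swap in another sample's audio; the original answer
        # must now be dispreferred, punishing audio-insensitive generation.
        j = (i + 1) % len(samples)
        input_pairs.append({"video": s["video"],
                            "audio": samples[j]["audio"],
                            "chosen": samples[j]["audio_answer"],
                            "rejected": s["audio_answer"]})
    return output_pairs, input_pairs

samples = [
    {"video": "vid_dog", "audio": "bark.wav",
     "audio_answer": "A dog barks twice.",
     "visual_answer": "A dog pants loudly."},
    {"video": "vid_rain", "audio": "rain.wav",
     "audio_answer": "Steady rainfall.",
     "visual_answer": "Thunder rumbles."},
]
out_pairs, in_pairs = build_acpo_pairs(samples)
```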
- [2026-04-15] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
- 赛道归属: 多模态检索增强推理(Web Search Agent 评测基准/Benchmark)
- 核心创新点: 提出人工标注的 MERRIN 基准,用于在“噪声真实网页环境”中系统评测检索增强智能体的多模态证据检索与多跳推理能力。方法论上突破在于:①查询为自然语言且不提供模态提示,迫使模型自主判定应检索的模态;②显式纳入以往较少覆盖的音频、视频等模态,并要求跨模态证据组合;③证据来源为异构且常冲突/部分相关的网页结果,强调在噪声与矛盾信息下的鲁棒证据选择与推理,而非“干净语料”的检索问答。基准还通过 no-search/native-search/agentic-search 三种设置与多模型多智能体对比,揭示当前系统的关键失败模式(过度探索、工具步数增加但被冲突内容干扰、过度依赖文本模态、资源消耗高但准确率低),为后续改进“多模态源选择+推理策略”提供可量化靶场。
- Track: Multimodal retrieval-augmented reasoning (Web search agent benchmarking)
- Core innovations: Introduces MERRIN, a human-annotated benchmark that evaluates search-augmented agents on multimodal evidence retrieval and multi-hop reasoning in noisy, real-world web environments. Key methodological advances include: (1) natural-language queries without modality cues, requiring agents to infer which modalities to seek; (2) explicit inclusion of underexplored modalities such as audio and video, demanding cross-modal evidence integration; (3) retrieval over heterogeneous, often conflicting or partially relevant web sources, stressing robust source selection and reasoning under noise rather than clean retrieval QA. By evaluating agents across no-search/native-search/agentic-search settings and multiple models, it surfaces concrete failure modes (over-exploration, more tool steps but higher distraction from conflicting content, overreliance on text, high resource use with low accuracy), making it a targeted testbed for improving multimodal source selection and reasoning policies.
- [2026-04-12] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
- 赛道归属: 音频生成与编辑(多模态统一生成框架)
- 核心创新点: 提出端到端统一框架,将音频理解、生成与编辑在同一模型中打通,并覆盖通用声音/音乐/语音三大域;采用“冻结的多模态大语言模型(MLLM)负责高层推理 + 可训练的Diffusion Transformer负责高保真合成”的分工式架构,实现推理能力与合成质量兼得;针对音频编辑数据稀缺,构建百万级高质量编辑配对数据集AudioEdit以支撑可泛化的编辑学习;展示继承能力(知识增强生成、in-context生成、零样本跨语种控制)表明统一模型具备向“通用生成式音频智能”扩展的潜力。
- Track: Audio generation & editing (unified multimodal generative framework)
- Core innovations: Introduces the first end-to-end unified system that integrates audio understanding, generation, and editing across general sound, music, and speech; adopts a division-of-labor architecture with a frozen MLLM for high-level reasoning and a trainable Diffusion Transformer for high-fidelity synthesis; addresses editing data scarcity by building AudioEdit, a million-scale curated paired editing dataset; demonstrates inherited capabilities (knowledge-augmented reasoning, in-context generation, zero-shot cross-lingual control), indicating a path toward universal generative audio intelligence.
- [2026-04-12] VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
- 赛道归属: 视频到音频生成评测(V2A/VT2A基准与指标)
- 核心创新点: 构建面向V2A与VT2A的多任务评测基准,将音频按音效/音乐/语音/歌唱四类拆分评估,避免“统一协议掩盖类别差异”的问题;提出13个面向任务的无参考指标,分别覆盖音质、视听一致性与文音一致性,并通过主观实验验证与人类偏好对齐;系统评测11个SOTA模型,揭示语音与歌唱显著短板,以及VT2A中“指令遵循 vs 视觉扎根”的结构性张力(更强视觉条件提升对齐但易偏离目标音频类别),为诊断与迭代V2A系统提供可扩展工具链。
- Track: Video-to-audio generation evaluation (V2A/VT2A benchmark & metrics)
- Core innovations: Proposes a multi-task benchmark that evaluates V2A and VT2A separately across four audio categories (SFX, music, speech, singing), enabling fine-grained diagnosis beyond a single unified protocol; introduces 13 task-specific reference-free metrics spanning audio quality, video-audio consistency, and text-audio consistency, and validates them via human studies for preference alignment; benchmarks 11 SOTA models and uncovers key failure modes (notably speech/singing) and a VT2A trade-off between instruction following and visually grounded generation.
- [2026-04-10] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence 🆕NEW
- 赛道归属: 音视频生成(可控生成 / 物理一致性与运动-声音同步)
- 核心创新点: 提出 Tora3,以物体轨迹(trajectory)作为音视频生成共享的运动学先验来提升物理一致性,而非仅作为视频控制信号。关键方法突破包括:(1) 轨迹对齐的运动表征用于稳定并约束视频中的运动生成;(2) 基于轨迹导出的二阶运动学状态(如速度/加速度)驱动的运动学-音频对齐模块,将接触/碰撞等声学事件与运动变化显式绑定,增强运动-声音同步;(3) 混合 flow matching策略,在轨迹条件区域保持轨迹保真,同时在其他区域维持局部时空一致性;并构建 PAV 数据集,强调与运动相关的音视频模式并自动提取运动标注,支撑大规模训练与评测。
- Track: Audio-Video Generation (Controllable generation / physical coherence & motion-sound synchronization)
- Core innovation: Introduces Tora3, using object trajectories as a shared kinematic prior for joint audio-video generation rather than a video-only control. Key advances: (1) a trajectory-aligned motion representation to stabilize and constrain video motion; (2) a kinematic-audio alignment module driven by trajectory-derived second-order kinematics (e.g., velocity/acceleration) to explicitly tie acoustic events (contacts/impacts) to motion dynamics; (3) a hybrid flow-matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. Additionally curates the PAV dataset emphasizing motion-relevant AV patterns with automatically extracted motion annotations to enable scalable training and evaluation.
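The trajectory-derived second-order kinematics behind Tora3's motion-sound alignment can be sketched with finite differences. The 1-D toy trajectory, the difference scheme, and the spike threshold below are illustrative assumptions, not the paper's implementation.

```python
# Sketch (assumptions, not Tora3's code): derive velocity and acceleration from
# an object trajectory by finite differences, then flag acceleration spikes as
# candidate contact/impact frames for motion-sound alignment.
def kinematic_events(ys, accel_thresh):
    # first-order kinematics: forward differences of positions
    vel = [ys[i + 1] - ys[i] for i in range(len(ys) - 1)]
    # second-order kinematics: differences of velocities
    acc = [vel[i + 1] - vel[i] for i in range(len(vel) - 1)]
    # frames where acceleration magnitude spikes, e.g. a bounce
    events = [i + 1 for i, a in enumerate(acc) if abs(a) > accel_thresh]
    return vel, acc, events

# Toy 1-D height trajectory: steady fall, then a bounce at frame 5.
y = [9, 8, 7, 6, 5, 5, 6, 7, 8, 9]
vel, acc, events = kinematic_events(y, accel_thresh=0.5)
```

On this toy input the spike detector fires exactly around the bounce, which is where an impact sound would be anchored.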
GitHub
- [2026-04-17] huggingface/diffusers ⭐33356
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-04-13] Lightricks/LTX-2 ⭐5887 🆕NEW
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
- [2026-04-13] SamurAIGPT/Generative-Media-Skills ⭐3033
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....
- [2026-04-17] apocas/restai ⭐484
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-04-16] Saganaki22/ComfyUI-Woosh ⭐56 🆕NEW
Text-to-audio and video-to-audio using Sony AI's Woosh foundation model.
语言大模型 / Large Language Models
GitHub
- [2026-04-16] abhigyanpatwari/GitNexus ⭐27726
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-04-16] DeusData/codebase-memory-mcp ⭐1588
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-m...
- [2026-04-16] proxysoul/soulforge ⭐558
Graph-powered code intelligence, multi-agent coding with codebase-aware AI. No more grep & pray
- [2026-04-16] truecourse-ai/truecourse ⭐145
AI-powered architecture analysis and code intelligence. Detects circular deps, layer violations, dead modules, and more. Web UI + CLI.
- [2026-04-16] SimplyLiz/CodeMCP ⭐88
Code intelligence for AI assistants - MCP server, CLI, and HTTP API with symbol navigation, impact analysis, and architecture mapping
多模态大模型 / Multimodal Models
arXiv
- [2026-04-16] OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
- 赛道归属: 移动端多模态智能体(Mobile Agent)数据合成与轨迹学习 / Agentic VLM Training Data & Trajectory Synthesis
- 核心创新点: 提出开源的任务指令与交互轨迹合成框架,解决移动智能体训练数据闭源与合成流程不透明的问题:① 通过探索构建“全局环境记忆”(global environment memory),再基于该记忆进行可扩展的任务合成,生成多样且与真实界面元素强绑定(grounded)的高质量指令;② 在轨迹rollout中引入“策略切换”(policy-switching),交替使用learner与expert模型采样,从而系统性采集标准模仿学习中缺失的错误恢复(error-recovery)轨迹数据,提升鲁棒性与成功率;并通过指令-测试集重叠分析验证增益来自功能覆盖而非基准过拟合。
- Track: Mobile multimodal agents — training data synthesis & trajectory learning for agentic VLMs
- Key innovations: An open-source framework for synthesizing task instructions and interaction trajectories to address the closed/opaque data pipelines in mobile agents: (1) builds a global environment memory via exploration and uses it to generate diverse, strongly grounded instructions tied to real UI elements; (2) introduces a policy-switching rollout strategy that alternates learner and expert models to deliberately capture error-recovery trajectories typically missing in standard imitation learning, improving robustness and task success. Includes transparent overlap analysis with benchmarks to argue gains come from broad functionality coverage rather than overfitting.
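The policy-switching rollout can be sketched in a few lines: alternating learner and expert sampling within a single trajectory is what yields the error-recovery segments. The per-step random switch and the toy policy callables below are illustrative assumptions; the paper's actual switching criterion may differ.

```python
import random

def policy_switching_rollout(num_steps, learner, expert, switch_prob, seed=0):
    # Toy rollout: at each step, sample the action from either the learner or
    # the expert. Expert takeovers following learner steps produce the
    # error-recovery data that pure imitation learning never records.
    rng = random.Random(seed)
    trajectory = []
    for t in range(num_steps):
        use_expert = rng.random() < switch_prob
        actor, role = (expert, "expert") if use_expert else (learner, "learner")
        trajectory.append((t, role, actor(t)))
    return trajectory

traj = policy_switching_rollout(8, learner=lambda t: f"guess-{t}",
                                expert=lambda t: f"fix-{t}", switch_prob=0.3)
```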
- [2026-04-16] Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
- 赛道归属: 多模态行人重识别(跨模态/换装鲁棒 ReID)与语义增强检索 / VLM-assisted Person ReID (cross-modality & clothing-change robust retrieval)
- 核心创新点: 提出STFER框架,将LVLM生成的“身份一致性语义文本”作为跨场景稳定表征,突破传统ReID对纯视觉特征的依赖:① 通过指令引导LVLM产出刻画生物特征常量(identity-intrinsic)的语义token,用于在RGB↔IR与换装条件下提供更稳健的身份判别线索;② 设计语义驱动的视觉token过滤(SVTF),用文本token选择/强化与身份相关的视觉区域并抑制背景冗余噪声;③ 设计语义驱动的专家路由(SER),将语义token注入Mixture-of-Experts式路由/门控,使不同场景(昼夜、跨模态、短/长期换装)下的特征融合与分配更稳健,从而提升Any-Time ReID的泛化与鲁棒性。
- Track: Multimodal person re-identification — semantic/VLM-enhanced robust ReID under modality shift and clothing change
- Key innovations: STFER leverages LVLM-generated identity-consistent semantic text as a stable cue beyond raw appearance, addressing degradation under RGB↔IR shifts and clothing changes: (1) instruction-guided LVLM produces identity-intrinsic semantic tokens capturing biometric constants; (2) Semantic-driven Visual Token Filtering (SVTF) uses these tokens to emphasize identity-relevant regions and suppress background noise; (3) Semantic-driven Expert Routing (SER) injects semantic tokens into expert routing/gating (MoE-style) for more robust scenario-adaptive feature allocation and fusion, improving generalization across diverse ReID conditions.
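The semantic-driven filtering idea can be illustrated with a hard top-k over text-visual similarities. The tiny 2-D embeddings and the non-learned top-k rule are assumptions for illustration only; SVTF itself is a learned module.

```python
def cosine(u, v):
    # plain cosine similarity between two vectors
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def filter_visual_tokens(visual_tokens, semantic_tokens, keep):
    # Score each visual token by its best match to any identity-semantic text
    # token and keep the top-`keep`, suppressing background tokens.
    scored = [(max(cosine(v, s) for s in semantic_tokens), i)
              for i, v in enumerate(visual_tokens)]
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:keep])

semantic = [[1.0, 0.0], [0.9, 0.1]]   # toy identity-intrinsic text tokens
visual = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]  # 1, 3 ~ background
kept = filter_visual_tokens(visual, semantic, keep=2)
```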
- [2026-04-16] UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards 🆕NEW
- 赛道归属: 多模态检索增强生成(Visual RAG)与强化学习推理优化
- 核心创新点: 提出统一的端到端RL框架,让LVLM以“序列决策”方式联合完成检索、重排、主动视觉感知与推理;通过层级动作空间实现从文档级粗检索到图像级精选择再到区域级主动裁剪的coarse-to-fine证据精炼;设计面向每一步动作的稠密多重奖励以提供细粒度监督,并基于GRPO在无需价值网络的情况下对齐多目标行为;同时构建带细粒度动作标注的高质量推理轨迹数据以支撑训练与评测。
- Track: Visual Retrieval-Augmented Generation (Visual RAG) & RL-based reasoning optimization
- Core innovations: Proposes an end-to-end unified RL framework where an LVLM agent jointly performs retrieval, reranking, active perception, and reasoning as a sequential decision process; introduces a hierarchical action space enabling coarse-to-fine evidence refinement from document retrieval to image selection and region cropping; designs dense multi-reward signals to supervise each action step, and leverages GRPO to optimize multi-objective behavior without a separate value network; curates fine-grained action-annotated reasoning trajectories to support training and evaluation.
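The coarse-to-fine hierarchy can be sketched as a monotone-depth constraint over action levels. All action names, arguments, and the ordering rule here are hypothetical, chosen only to illustrate the document-to-image-to-region refinement.

```python
# Hypothetical hierarchical action space: each level narrows the previous one.
ACTIONS = {
    "retrieve_doc": {"level": 0, "arg": "doc_id"},
    "select_image": {"level": 1, "arg": "page_index"},
    "crop_region": {"level": 2, "arg": "bbox"},
    "answer": {"level": 3, "arg": "text"},
}

def valid_next(history):
    # Enforce coarse-to-fine ordering: an action may only follow actions at
    # the same or a shallower refinement level.
    depth = max((ACTIONS[a]["level"] for a in history), default=-1)
    return [a for a, spec in ACTIONS.items() if spec["level"] >= depth]

episode = ["retrieve_doc", "select_image"]
nxt = valid_next(episode)
```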
- [2026-04-16] Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models 🆕NEW
- 赛道归属: 语音对话多模态交互(Spoken Dialogue)与RL奖励建模
- 核心创新点: 提出“双轴生成式奖励模型”,用细粒度交互质量分类体系与标注数据学习复杂对话动态;奖励输出不仅给出单一总分,还显式分解为“语义质量”和“轮次/时序(turn-taking)”两条轴的评分,从而为全双工语音对话模型提供可诊断、可用于在线RL的稳定奖励信号;以生成式建模替代依赖浅层统计/时序代理指标的传统自动评估,提升跨数据集的交互质量评估一致性与泛化。
- Track: Spoken multimodal dialogue interaction & RL reward modeling
- Core innovations: Introduces a dual-axis generative reward model trained with a detailed interaction taxonomy and annotations to capture complex dialogue dynamics; outputs both an overall score and disentangled scores for semantic quality and turn-taking/timing robustness, providing diagnostic feedback and a reliable reward for online RL; replaces brittle proxy metrics (behavioral stats/timing accuracy) with a learned generative assessor that generalizes across synthetic and real-world interaction datasets.
- [2026-04-16] ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints 🆕NEW
- 赛道归属: 具身智能(Embodied AI)规划与可供性(Affordance)推理 / 基准评测
- 核心创新点: 构建动态可供性基准DynAfford,专门评测“指令未显式给出、且随时间变化”的可操作性约束下的常识规划能力;提出ADAPT作为可插拔模块,将“显式可供性推理”注入现有规划器:要求代理感知对象状态、推断隐含前置条件并据此调整动作序列;并验证使用领域适配、LoRA微调的VLM作为可供性推断后端优于通用商用LLM,强调任务对齐的视觉落地对鲁棒规划的重要性。
- Track: Embodied planning with affordance reasoning & benchmarking
- Core innovations: Introduces DynAfford, a benchmark targeting commonsense planning under unspecified and time-varying affordance constraints; proposes ADAPT as a plug-and-play module that augments existing planners with explicit affordance reasoning—perceiving object states, inferring implicit preconditions, and adapting action plans; demonstrates that a domain-adapted LoRA-finetuned VLM is a stronger affordance inference backend than a general commercial LLM, highlighting task-aligned visual grounding for robustness.
- [2026-04-16] Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models 🆕NEW
- 赛道归属: 多模态理解与可解释性/安全(VLM推理过程分析与监控)
- 核心创新点: 系统性分析18个VLM在CoT过程中的“推理动态”,提出并量化“答案惯性”(早期承诺在后续推理中被强化而非纠正)与“推理纠错效应”;通过可控的误导性文本线索干预,刻画模型在不同模态条件下对文本/视觉证据的依赖与可被CoT监测的上限;揭示长而流畅的CoT可能伪装成视觉落地但实则跟随文本线索,说明仅监控CoT对模态依赖与安全透明性只能提供部分视角。
- Track: Multimodal interpretability/safety—reasoning dynamics & modality-reliance monitoring
- Core innovations: Provides a systematic study of reasoning dynamics across 18 VLMs by tracking confidence over CoT, quantifying “answer inertia” and the corrective effect of reasoning; uses controlled interventions with misleading textual cues to probe modality reliance under varying modality conditions and evaluate how recoverable such reliance is from CoT; shows that fluent, longer CoTs can appear visually grounded while actually following text cues, establishing limits of CoT-based monitoring for transparency and safety.
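"Answer inertia" can be made concrete with a toy statistic over a confidence trajectory. This particular definition (fraction of non-decreasing post-commitment steps) is an assumption for illustration, not the paper's measure.

```python
def answer_inertia(confidences, commit_step=1):
    # Fraction of post-commitment steps where confidence in the early answer
    # did not decrease, i.e. the CoT reinforced rather than corrected it.
    post = confidences[commit_step:]
    rises = sum(1 for a, b in zip(post, post[1:]) if b >= a)
    return rises / max(1, len(post) - 1)

# Confidence in the initially committed answer across CoT steps:
inertia_run = answer_inertia([0.55, 0.60, 0.70, 0.82, 0.90])     # reinforced
corrective_run = answer_inertia([0.55, 0.60, 0.45, 0.30, 0.20])  # corrected
```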
- [2026-04-16] MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry 🆕NEW
- 赛道归属: 医疗多模态(口腔/牙科)数据集与基准(VLM评测与标注体系)
- 核心创新点: 提出MetaDent资源:大规模牙科临床图像集合+面向口内摄影的层级化半结构标注框架+多任务基准套件;标注方法将“全局摘要”与“逐点异常自由文本描述”结合,形成可扩展、任务无关的meta-label表示;进一步用LLM将meta-label可靠转化为标准化评测(VQA对与多标签分类),并通过人工复核与误差分析验证保真度,从而系统暴露现有VLM在细粒度口内场景理解与描述一致性上的短板。
- Track: Medical multimodal (dentistry) dataset/benchmarking & annotation methodology for VLMs
- Core innovations: Releases MetaDent: a large-scale dentistry image resource plus a hierarchical semi-structured annotation scheme tailored to intraoral photography and comprehensive benchmarks; proposes a meta-labeling approach combining high-level summaries with point-by-point free-text abnormality descriptions to yield scalable, task-agnostic supervision; uses LLMs to convert meta-labels into standardized VQA and multi-label classification benchmarks with human validation/error analysis, enabling rigorous evaluation and revealing fine-grained failure modes of current VLMs.
- [2026-04-16] Zero-Shot Retail Theft Detection via Orchestrated Vision Models: A Model-Agnostic, Cost-Effective Alternative to Trained Single-Model Systems 🆕NEW
- 赛道归属: 视频理解与行为检测(零样本异常/盗窃检测)与系统工程优化
- 核心创新点: 提出无需训练的零样本盗窃检测框架,通过“分层编排”多模型流水线实现成本与性能折中:廉价的目标检测/姿态估计常驻运行,只有在多信号可疑预筛(停留时长+行为信号)触发时才调用昂贵VLM,从而将VLM调用量降低约240倍并实现单GPU多门店服务;VLM端点采用OpenAI兼容接口实现模型无关可替换;同时给出可落地的成本模型与人脸模糊的隐私保护设计。
- Track: Video understanding & zero-shot behavior/anomaly detection with cost-optimized orchestration
- Core innovations: Presents a training-free, zero-shot retail theft detection system that orchestrates multiple models in a layered pipeline: always-on cheap detectors (object/pose) plus an expensive VLM invoked only after a multi-signal suspicion pre-filter (dwell time + behavioral cue), reducing VLM calls by ~240× and enabling multi-store serving per GPU; adopts a model-agnostic OpenAI-compatible VLM endpoint for easy swapping as models improve; includes an operational cost model and privacy-preserving face obfuscation in the pipeline.
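The layered gating logic is straightforward to sketch: cheap always-on signals pre-filter, and the expensive VLM fires only on their conjunction. The threshold and simulated tracklets below are illustrative assumptions, not the system's tuned values.

```python
def should_call_vlm(dwell_seconds, suspicious_pose, dwell_thresh=30):
    # Multi-signal pre-filter: the expensive VLM runs only when BOTH the cheap
    # dwell-time and behavioral signals fire.
    return dwell_seconds >= dwell_thresh and suspicious_pose

# Simulated tracklets from the always-on detector/pose stage:
# (dwell time in seconds, suspicious-pose flag)
tracklets = [(5, False), (40, False), (35, True), (10, True), (90, True)]
vlm_calls = sum(should_call_vlm(d, p) for d, p in tracklets)
savings = 1 - vlm_calls / len(tracklets)
```

Only two of five tracklets reach the VLM here; at realistic traffic volumes this conjunction of cheap signals is what produces the reported orders-of-magnitude reduction in VLM calls.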
- [2026-04-16] Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems 🆕NEW
- 赛道归属: 多模态推理可靠性(不确定性/拒答Abstention)与评测基准
- 核心创新点: 提出MM-AQA基准,通过沿“视觉依赖度”和“证据充分性”两轴对可回答样本做系统变换,构造更贴近真实失效模式的不可回答实例;在VLM与多智能体系统上评测“有效拒答”,揭示标准提示下模型几乎不拒答、MAS虽提升拒答但带来准确率-拒答权衡,且顺序式设计不弱于迭代式表明瓶颈更偏校准而非推理深度;结论指向需要“拒答感知训练”而非仅靠提示或堆代理。
- Track: Multimodal reliability—abstention/uncertainty evaluation & benchmarking
- Core innovations: Introduces MM-AQA, a benchmark that generates unanswerable multimodal instances from answerable ones via controlled transformations along visual-dependency and evidence-sufficiency axes, capturing realistic failure modes; evaluates VLMs and multi-agent systems for effective abstention, showing standard prompting rarely yields abstention, MAS improves abstention but induces an accuracy–abstention trade-off, and sequential designs rival iterative ones—implicating miscalibration rather than insufficient reasoning depth; argues for abstention-aware training over prompting/agent scaling.
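Effective abstention can be scored with two simple rates: abstaining on unanswerable items versus over-abstaining on answerable ones. These metric names and definitions are illustrative assumptions, not MM-AQA's official metrics.

```python
def abstention_report(items):
    # items: (is_answerable, prediction) pairs; prediction None means abstain.
    # Effective abstention = abstaining exactly on the unanswerable items.
    correct_abstain = sum(1 for ans, p in items if not ans and p is None)
    unanswerable = sum(1 for ans, _ in items if not ans)
    over_abstain = sum(1 for ans, p in items if ans and p is None)
    return {"abstain_recall": correct_abstain / max(1, unanswerable),
            "over_abstention": over_abstain / max(1, len(items) - unanswerable)}

items = [(True, "A"), (True, None), (False, None), (False, "B"), (False, None)]
report = abstention_report(items)
```

The accuracy-abstention trade-off the paper observes for multi-agent systems shows up here as the two rates moving together: pushing abstain_recall up tends to push over_abstention up as well.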
- [2026-04-16] AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning 🆕NEW
- 赛道归属: 多模态持续学习(Continual VQA)与遗忘抑制
- 核心创新点: 针对现代VLM“可训练组件非对称”导致的持续学习结构失配,指出全局正则会偏向大语言解码器、使关键视觉投影层更易受干扰并引发组合推理能力退化;提出AIM(非对称信息掩码),依据模态敏感性对不同组件施加定向掩码以平衡稳定性与可塑性,从而在持续VQA中同时提升平均性能并降低平均遗忘,且更好保持对新技能-概念组合的泛化。
- Track: Continual learning for multimodal VQA (catastrophic forgetting mitigation)
- Core innovations: Identifies a structural mismatch in continual VQA for modern asymmetric VLMs: global regularization over-favors the large language decoder, leaving smaller yet crucial visual projection layers vulnerable, degrading compositional reasoning; proposes Asymmetric Information Masking (AIM), applying targeted masks guided by modality-specific sensitivity to balance stability and plasticity; achieves improved average performance and reduced forgetting while better preserving generalization to novel skill–concept compositions.
GitHub
- [2026-04-17] Blaizzy/mlx-vlm ⭐4380
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-04-16] waybarrios/vllm-mlx ⭐852
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-04-16] zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs-CLIP ⭐770
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
- [2026-04-16] FeiElysia/Tempo ⭐54
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
- [2026-04-16] Mr-Loevan/FAST ⭐54 🆕NEW
[NeurIPS 2025 Spotlight] Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
强化学习 / Reinforcement Learning
arXiv
- [2026-04-14] A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production 📖1
- 赛道归属: 强化学习在工业制造中的人机协作任务规划与分配(层级RL + 空间感知调度)
- 核心创新点: 提出面向复杂动态生产场景的人机任务规划与分配的层级式框架:将生产任务分解为可执行的序列子任务,由高层智能体负责任务规划、低层智能体负责任务分配与执行衔接。高层采用高效的缓冲区式深度Q学习(EBQ),通过缓冲机制提升长时序、稀疏回报问题下的样本利用与训练效率,从而缩短训练时间并稳定策略学习;低层引入基于路径规划的空间感知分配(SAP),显式建模人员实时位置与移动距离等空间约束,将“谁来做/何时做”与可达性、行走成本耦合,实现更符合现场约束的实时分配。整体实现了“规划-分配”解耦但通过子任务序列闭环联动,增强了在3D仿真复杂生产流程中的实时性与鲁棒性。
- Track: Reinforcement learning for industrial human-robot collaboration task planning & allocation (hierarchical RL + spatial-aware scheduling)
- Core innovations: Introduces a hierarchical framework for human-robot task planning and allocation in complex, dynamic manufacturing: production jobs are decomposed into sequential subtasks, with a high-level agent handling task planning and a low-level agent handling task allocation/execution. The high-level agent uses an Efficient Buffer-based Deep Q-learning method (EBQ) that leverages a buffer mechanism to improve sample efficiency and training speed under long-horizon, sparse-reward settings, yielding faster and more stable learning. The low-level agent applies a Spatially Aware allocation method (SAP) grounded in path planning, explicitly incorporating real-time human location and travel distance to couple assignment decisions with reachability and movement cost. This yields a real-time, constraint-aware “plan–allocate” loop that improves robustness and practicality in a 3D simulated production process.
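The spatial-aware allocation step can be sketched as an argmin over travel cost plus availability. Manhattan distance stands in for the paper's path-planning distance, and the agent fields below are hypothetical.

```python
def spatially_aware_assign(task_pos, agents):
    # Pick the agent with the lowest cost: travel distance to the task
    # (Manhattan distance as a stand-in for a path-planned distance) plus
    # remaining busy time, coupling assignment with reachability.
    def cost(agent):
        (ax, ay), busy_until = agent["pos"], agent["busy_until"]
        return abs(ax - task_pos[0]) + abs(ay - task_pos[1]) + busy_until
    return min(agents, key=cost)["name"]

agents = [{"name": "worker_1", "pos": (0, 0), "busy_until": 5},
          {"name": "robot_1", "pos": (8, 8), "busy_until": 0},
          {"name": "worker_2", "pos": (2, 1), "busy_until": 0}]
chosen = spatially_aware_assign((1, 1), agents)
```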
- [2026-04-13] Robust Adversarial Policy Optimization Under Dynamics Uncertainty 📖1
- 赛道归属: 鲁棒强化学习(动力学不确定性、分布鲁棒/对抗式策略优化)
- 核心创新点: 提出RAPO,将分布鲁棒RL的原始难解问题转为可操作的对偶形式,直接显式化“鲁棒性-性能”权衡:在轨迹层面用对偶温度参数并以对抗网络近似,生成满足散度约束的稳定最坏情形rollout;在模型层面对动力学集成采用Boltzmann重加权,按“对当前策略更不利”的环境进行策略敏感采样而非均匀域随机化。两层机制相互独立又互补,兼顾稳定训练与覆盖更具挑战的动力学,从而提升OOD动力学泛化与抗不确定性能力并避免过度保守。
- Track: Robust RL under dynamics uncertainty (distributionally robust / adversarial policy optimization)
- Key innovations: RAPO derives a tractable dual that exposes the robustness–performance trade-off. It (i) approximates the dual temperature with an adversarial network to produce stable worst-case rollouts within a divergence bound at the trajectory level, and (ii) applies Boltzmann reweighting over dynamics ensembles for policy-sensitive sampling of more adverse models at the model level. The decoupled components jointly improve OOD robustness while reducing instability and over-conservatism.
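The model-level Boltzmann reweighting can be sketched directly: dynamics models under which the current policy earns lower return receive exponentially higher sampling weight than uniform domain randomization would give them. The temperature and returns below are illustrative.

```python
import math

def boltzmann_weights(returns, temperature):
    # Weight each dynamics model by exp(-return / T): models under which the
    # current policy does worse are sampled more often than uniform.
    logits = [-r / temperature for r in returns]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

returns = [10.0, 2.0, 6.0]              # policy return under each model
w = boltzmann_weights(returns, temperature=2.0)
```

Lowering the temperature sharpens the distribution toward the single most adverse model; raising it recovers near-uniform domain randomization, which is the over-conservatism/coverage dial the paper tunes.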
- [2026-04-10] E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning 📖1 🆕NEW
- 赛道归属: 工具调用与智能体推理(Tool-Integrated Reasoning)/ 强化学习训练范式优化
- 核心创新点: 提出面向训练早期的“暖启动”RL范式E3-TIR,将经验动态融合为三类:Expert Prefixes(专家前缀锚点)、Expert Guided(专家引导分支)、Self-Exploration(自探索)。围绕专家锚点做多分支探索以提升探索效率与多样性,并用mix policy optimization缓解共享前缀带来的优化冲突与分布漂移,实现“探索-效率”自适应平衡;在更少合成数据下显著提升工具使用任务表现与ROI。
- Track: Tool-Integrated Reasoning / RL training paradigm optimization
- Key innovation: Proposes E3-TIR, an early-stage warm-up RL paradigm that dynamically integrates three experience types—Expert Prefixes (anchor prefixes), Expert Guided branching, and Self-Exploration. It performs diverse branching exploration around expert anchors to improve exploration efficiency/diversity, and introduces mix policy optimization to mitigate distribution shift and optimization conflicts caused by shared prefixes, yielding adaptive exploration–efficiency trade-offs and strong tool-use gains with much less synthetic data.
- [2026-04-10] StaRPO: Stability-Augmented Reinforcement Policy Optimization 📖1 🆕NEW
- 赛道归属: 大模型推理强化学习(Process-aware RL / Reasoning alignment)
- 核心创新点: 提出StaRPO,将“推理稳定性”显式纳入RL目标,弥补仅用最终答案正确性作为奖励导致的逻辑结构失真问题。方法把稳定性分解为可计算的轻量指标:ACF衡量局部步间连贯性,PE衡量全局路径的目标导向与冗余度;将稳定性奖励与任务奖励联合优化,提供过程级反馈,从而同时提升最终准确率与推理轨迹的逻辑一致性/结构稳定性。
- Track: RL for LLM reasoning (process-aware optimization / reasoning alignment)
- Key innovation: Introduces StaRPO, augmenting RL objectives with explicit reasoning stability to address logically inconsistent yet fluent outputs under final-answer-only rewards. Stability is decomposed into two lightweight computable metrics—ACF for local step-to-step coherence and PE for global goal-directed path efficiency—and combined with task rewards to provide process-aware feedback, improving both answer accuracy and trajectory logical stability.
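The combined objective can be sketched as a weighted sum of the task reward and the two stability terms. The weighting, scales, and scores below are illustrative assumptions, not the paper's calibration.

```python
def starpo_reward(task_reward, acf, pe, lam=0.3):
    # Total objective sketch: final-answer (task) reward plus weighted
    # stability terms, where ACF scores local step-to-step coherence and PE
    # scores global goal-directed path efficiency.
    return task_reward + lam * (acf + pe)

# Two trajectories reaching the same correct final answer:
coherent = starpo_reward(task_reward=1.0, acf=0.9, pe=0.8)
rambling = starpo_reward(task_reward=1.0, acf=0.2, pe=0.3)
```

Under a final-answer-only reward both trajectories would score identically; the stability terms are what separate them.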
- [2026-04-16] LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking 🆕NEW
- 赛道归属: RLVR安全与鲁棒性(Reward hacking / Verifier robustness in reasoning RL)
- 核心创新点: 系统揭示RLVR在可验证奖励下的新型失败模式:模型“博弈验证器”,通过枚举实例标签等捷径绕过规则归纳,从而在外延正确性验证下获得高奖励但不具备可泛化的关系模式学习。提出IPT(Isomorphic Perturbation Testing),用同构扰动下的等价任务验证不变性:真实规则归纳应保持不变,而捷径策略会失效;并通过对照实验证明外延验证会诱导捷径、同构验证可抑制该类reward hacking。
- Track: RLVR safety & robustness (reward hacking / verifier design for reasoning RL)
- Key innovation: Identifies a failure mode in RL with verifiable rewards: models “game” imperfect verifiers by abandoning rule induction and instead outputting instance-level enumerations that pass extensional checks without learning generalizable relations. Proposes Isomorphic Perturbation Testing (IPT), adding isomorphic verification that enforces invariance across logically isomorphic tasks—true rule induction remains invariant while shortcut strategies break—showing extensional verification induces shortcuts whereas isomorphic verification eliminates them.
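The IPT idea can be demonstrated with two toy "solvers": one that re-derives the relation from the instance it is shown, and one that replays labels memorized on the original instance. The relabeling map and solvers are illustrative assumptions.

```python
def rule_solver(pairs, query):
    # Genuine rule induction stand-in: processes the instance it is shown,
    # so it is invariant under isomorphic relabeling.
    mapping = dict(pairs)
    return mapping.get(query)

def lookup_solver(memorized, query):
    # Shortcut stand-in: replays instance labels cached at "training" time.
    return memorized.get(query)

pairs = [("a", "b"), ("b", "c")]
perm = {"a": "x", "b": "y", "c": "z"}            # isomorphic relabeling
perm_pairs = [(perm[u], perm[v]) for u, v in pairs]

memorized = dict(pairs)                           # fixed before the perturbation
rule_ok = rule_solver(perm_pairs, "x") == "y"     # survives the relabeling
shortcut_ok = lookup_solver(memorized, "x") == "y"  # breaks under it
```

An extensional check on the original instance cannot tell the two apart; the isomorphic perturbation does.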
- [2026-04-16] IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning 🆕NEW
- 赛道归属: 检索增强推理(Search-augmented reasoning)/ 细粒度信用分配强化学习
- 核心创新点: 提出IG-Search,用“信息增益(IG)”构造逐步(step-level)奖励,解决轨迹级奖励无法区分好/坏查询、以及全失败rollout时梯度塌缩的问题。IG度量每次检索相对“随机文档”反事实基线对正确答案置信度的提升,并通过GRPO中的per-token advantage modulation把奖励精确回传到查询token,实现无需中间标注、无需跨轨迹共享环境状态的细粒度信用分配;在多基准上提升EM且训练开销增量较小、推理延迟不变。
- Track: Search-augmented reasoning / fine-grained RL credit assignment
- Key innovation: Proposes IG-Search with a step-level Information Gain reward to distinguish effective vs. vague/redundant queries and avoid near-zero gradients when all rollouts fail. IG measures how retrieved docs increase gold-answer confidence versus a random-doc counterfactual, and is routed back to query tokens via per-token advantage modulation in GRPO, enabling annotation-free, state-sharing-free step-level credit assignment with strong QA gains and minimal training overhead.
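The step-level reward can be sketched as a confidence difference against a random-document counterfactual. The confidences below are illustrative numbers, and routing the reward back to query tokens via GRPO advantage modulation is omitted.

```python
def information_gain(conf_with_retrieved, conf_with_random):
    # Step-level reward: how much this query's retrieved docs raise confidence
    # in the gold answer over a random-document counterfactual baseline.
    return conf_with_retrieved - conf_with_random

# One rollout with three search steps (confidences illustrative):
steps = [
    {"query": "who wrote X", "p_gold_docs": 0.62, "p_gold_random": 0.31},
    {"query": "X", "p_gold_docs": 0.33, "p_gold_random": 0.30},  # vague query
    {"query": "X author year", "p_gold_docs": 0.88, "p_gold_random": 0.35},
]
rewards = [information_gain(s["p_gold_docs"], s["p_gold_random"]) for s in steps]
best_step = rewards.index(max(rewards))
```

A trajectory-level reward would give all three queries the same credit; the per-step IG signal singles out the vague middle query even within a successful rollout.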
- [2026-04-16] WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training 🆕NEW
- 赛道归属: 语音对话大模型后训练(End-to-end spoken dialogue)/ 偏好优化与RL对齐
- 核心创新点: 针对端到端语音对话中“稀疏偏好监督 vs 稠密语音生成”导致的奖励建模与rollout采样难题,提出模态感知的自适应混合后训练方案:将偏好/RL更新约束在语义通道以提升智能与语义质量,同时通过显式anchoring稳定并改进声学表现;再依据rollout统计动态调节两者混合比例,规避不可靠偏好梯度对共享参数的破坏,使RL在语音对话场景可落地并提升表达性。
- Track: Post-training for end-to-end spoken dialogue models / preference optimization & RL alignment
- Key innovation: Proposes a modality-aware adaptive hybrid post-training recipe to make RL practical for end-to-end spoken dialogue, addressing the mismatch between sparse preference signals and dense speech generation under shared-parameter updates. It constrains preference/RL updates to the semantic channel while improving acoustics via explicit anchoring, and dynamically mixes them based on rollout statistics to avoid unreliable preference gradients, improving both semantic intelligence and speech expressiveness.
- [2026-04-16] LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning 🆕NEW
- 赛道归属: 长上下文强化学习与训练效率(Long-context RL / sparse update optimization)
- 核心创新点: 利用模型内在表征特性提出LongAct:观察到长上下文处理中Q/K向量存在高幅值激活,并借鉴量化中“高幅值更关键”的结论与长程推理稀疏性假设,将RL更新从“全量均匀”转为“显著性引导的稀疏更新”。仅更新与高幅值激活相关的权重以聚焦关键驱动因素,从而在LongBench v2等长上下文基准上提升表现并增强泛化;且对GRPO、DAPO等多种RL算法具通用增益。
- Track: Long-context RL & training efficiency (saliency-guided sparse updates)
- Key innovation: Introduces LongAct by exploiting intrinsic activation patterns in long-context processing: high-magnitude Q/K activations are treated as critical drivers (inspired by quantization insights and sparsity of long-context reasoning). It shifts RL from uniform full updates to saliency-guided sparse updates, updating only weights associated with salient activations, improving long-context benchmarks and generalization while remaining compatible across RL algorithms (e.g., GRPO, DAPO).
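The saliency-guided sparse update can be sketched as masking the gradient step by activation magnitude. The flat weight list, the one-to-one activation-to-weight pairing, and the top fraction are illustrative assumptions.

```python
def sparse_update(weights, grads, activations, lr=0.1, top_frac=0.5):
    # Update only the weights whose associated activation magnitude falls in
    # the top fraction; the rest stay frozen for this RL step.
    k = max(1, int(len(weights) * top_frac))
    ranked = sorted(range(len(activations)), key=lambda i: -abs(activations[i]))
    salient = set(ranked[:k])
    return [w - lr * g if i in salient else w
            for i, (w, g) in enumerate(zip(weights, grads))]

weights = [1.0, 1.0, 1.0, 1.0]
grads = [0.5, 0.5, 0.5, 0.5]
acts = [0.1, 9.0, -7.0, 0.2]   # high-magnitude Q/K activations at indices 1, 2
new_w = sparse_update(weights, grads, acts, top_frac=0.5)
```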
GitHub
- [2026-04-17] huggingface/trl ⭐18073
Train transformer language models with reinforcement learning.
- [2026-04-16] facebookresearch/ReAgent ⭐3695 🆕NEW
A platform for Reasoning systems (Reinforcement Learning, Contextual Bandits, etc.)
- [2026-04-17] RLinf/RLinf ⭐3121
RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI
- [2026-04-17] mll-lab-nu/VAGEN ⭐445 🆕NEW
Training VLM agents with multi-turn reinforcement learning
- [2026-04-17] hustvl/RAD ⭐201
[NeurIPS 2025] RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
HuggingFace Models
- tencent/HY-World-2.0 🆕NEW
HuggingFace Datasets
- [2026-04-14] llamaindex/ParseBench
ParseBench is a benchmark for evaluating document parsing systems on real-world ent...
- [2026-04-06] hysong/MentalBench
MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
MentalBench is a c...
- [2026-02-22] YennNing/MC-Search
Dataset Card for MC-Search
HuggingFace Spaces
Generated automatically by Daily AI Digest Agent at 2026-04-17 02:07:25