Daily AI Digest - 2026-04-16
Image Generation/Editing
arXiv
- [2026-04-15] Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation 🆕NEW
- Track: Interactive text-to-image generation / co-creative ideation (progressive T2I + controllable generation)
- Core innovation: Introduces a multi-stage progressive T2I workflow that moves from sketch-like abstractions to high-resolution images while exposing editable intermediate representations for early-stage exploration; adds a decision-locking mechanism to freeze confirmed regions/attributes so later edits affect only targeted parts; performs inference via diff-based updates rather than full regeneration to reduce drift and improve traceability, controllability, user agency, and output diversity.
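The locking-plus-diff mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the grid "image", mask representation, and merge rule are all hypothetical stand-ins for Creo's abstract states.

```python
# Minimal sketch of decision locking with diff-based updates
# (illustrative only; the grid/mask representation is hypothetical).

def apply_edit(committed, candidate, locked_mask):
    """Merge a newly generated candidate into the committed image,
    keeping locked pixels exactly as the user confirmed them."""
    return [
        [c if locked else n
         for c, n, locked in zip(crow, nrow, mrow)]
        for crow, nrow, mrow in zip(committed, candidate, locked_mask)
    ]

# 2x3 "image": the user locked the left column after confirming it.
committed = [[1, 2, 3],
             [4, 5, 6]]
candidate = [[9, 9, 9],
             [9, 9, 9]]
locked    = [[True, False, False],
             [True, False, False]]

# Locked pixels survive; only unlocked pixels take the new edit.
merged = apply_edit(committed, candidate, locked)
```

Because only the unlocked diff is regenerated, earlier confirmed decisions cannot drift, which is the traceability property the entry highlights.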
- [2026-04-15] ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding 🆕NEW
- Track: Multi-subject personalized text-to-image generation / pose-guided controllable generation
- Core innovation: Resolves the identity–pose entanglement in complex multi-subject generation by architectural disentanglement within a unified DiT: a retrieval-augmented pose pipeline (RAG-Pose) supplies a clean explicit structural prior; an asymmetric EURoPE positional encoding decouples identity tokens from spatial locations while anchoring pose tokens to the canvas; a DSM adapter shifts identity preservation into the text-conditioning stream, jointly improving identity fidelity and pose adherence.
- [2026-04-15] PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios 🆕NEW
- Track: Industrial anomaly image generation / condition-controlled diffusion (assembly-aware)
- Core innovation: Proposes an assembly-aware diffusion synthesis pipeline with condition disentanglement and geometric priors: decomposes multi-view inputs into high-frequency/texture/RGB features; applies feature temporal modulation across diffusion timesteps for coarse-to-fine generation with consistency; adds a conditional loss to emphasize critical industrial elements and a geometric prior to enforce correct component placement/assembly relationships, improving downstream usability for anomaly detection.
- [2026-04-15] DiffMagicFace: Identity Consistent Facial Editing of Real Videos 🆕NEW
- Track: Facial video editing / identity-consistent diffusion-based editing (text & image guided)
- Core innovation: Introduces a video face-editing framework that runs two separately fine-tuned controllers (text-control and image-control) concurrently at inference to jointly enforce edit semantics and identity preservation per frame; builds a multi-view identity image set via rendering plus optimization to provide viewpoint coverage, enabling strong cross-frame consistency without relying on video training datasets, including challenging talking-head scenarios.
- [2026-04-15] SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs 🆕NEW
- Track: Multimodal retrieval / adapting frozen MLLMs for embedding-based retrieval
- Core innovation: Proposes SLQ to turn a frozen MLLM into a retriever without backbone updates: appends a small set of Shared Latent Queries to both text and image token sequences and leverages native causal attention as a global aggregation interface to output compact unified embeddings; focuses on eliciting rather than overwriting pre-trained representations to preserve reasoning-relevant knowledge; introduces KARR-Bench to evaluate knowledge-aware reasoning retrieval beyond superficial matching.
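The shared-latent-query read-out can be approximated as follows. This is a toy sketch: plain dot-product attention stands in for the frozen MLLM's causal attention, there are no learned projections, and all shapes are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def slq_embed(token_states, latent_queries):
    """Append shared latent queries after the token sequence; under
    causal attention each query sees every preceding token, so its
    attention-weighted read of the sequence yields a compact embedding."""
    embedding = []
    for q in latent_queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) for t in token_states]
        weights = softmax(scores)
        pooled = [sum(w * t[d] for w, t in zip(weights, token_states))
                  for d in range(len(q))]
        embedding.append(pooled)
    return embedding  # one pooled vector per latent query

token_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy frozen-MLLM states
latent_queries = [[1.0, 0.0], [0.0, 1.0]]             # two shared queries
emb = slq_embed(token_states, latent_queries)         # 2 pooled vectors
```

The same queries are appended to both text and image sequences, which is what makes the resulting embeddings comparable across modalities.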
- [2026-04-15] ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer's Disease Progression 🆕NEW
- Track: Medical image generation / text-conditioned Diffusion Transformer for longitudinal MRI synthesis
- Core innovation: Presents an interval-aware, clinically text-conditioned DiT for longitudinal AD MRI: encodes follow-up interval plus demographic/diagnostic/neuropsychological metadata as natural-language prompts for fine-grained, interpretable time control; uses dual text encoders (OpenCLIP for vision-language alignment, T5 for richer clinical semantics) fused into DiT via cross-attention (fine guidance) and adaptive layer norm (global modulation); improves anatomical fidelity with RoPE on image tokens and performs diffusion in an SDXL-VAE latent space for efficient high-resolution reconstruction.
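The "metadata as natural-language prompt" conditioning step is easy to make concrete. The field names and phrasing below are hypothetical; the paper's actual prompt template may differ.

```python
def build_prompt(interval_months, age, sex, diagnosis, mmse):
    """Serialize the follow-up interval plus clinical metadata into a
    natural-language conditioning prompt (template is illustrative)."""
    return (
        f"Generate the follow-up brain MRI {interval_months} months later "
        f"for a {age}-year-old {sex} patient, diagnosis {diagnosis}, "
        f"MMSE score {mmse}."
    )

prompt = build_prompt(18, 74, "female", "MCI", 26)
```

Encoding the interval as free text (rather than a coarse stage label) is what gives the model fine-grained, interpretable time control.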
- [2026-04-15] Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning 🆕NEW
- Track: Text-to-image generation / test-time multimodal reasoning for self-refinement
- Core innovation: Proposes FiMR to replace holistic alignment judgments with a fine-grained reasoning loop: decomposes prompts into minimal semantic units (entities/attributes/relations), verifies each unit via decomposed VQA to produce explicit actionable feedback, and applies targeted/local refinements at test time; this enables fine-grained self-correction and improves prompt adherence and quality on compositional T2I benchmarks.
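The verify-then-refine loop can be sketched with stubs. Everything here is schematic: the string-split decomposition, the set-based "image", and the membership `verify` stand in for FiMR's LLM decomposer, generator, and VQA checker.

```python
def decompose(prompt):
    """Toy decomposition into minimal semantic units (FiMR extracts
    entities/attributes/relations; this split is illustrative)."""
    return [u.strip() for u in prompt.split(" and ")]

def refine_loop(prompt, generate, verify, max_rounds=3):
    """FiMR-style test-time loop: check every unit with decomposed VQA,
    then regenerate targeting only the failed units."""
    image = generate(prompt, failed_units=None)
    for _ in range(max_rounds):
        failed = [u for u in decompose(prompt) if not verify(image, u)]
        if not failed:
            return image, True
        image = generate(prompt, failed_units=failed)
    return image, False

# Stub "image": the set of units it actually depicts. The first pass
# misses the sphere; the targeted second pass adds it.
def generate(prompt, failed_units):
    if failed_units is None:
        return {"a red cube"}
    return {"a red cube", "a blue sphere"}

verify = lambda image, unit: unit in image
image, ok = refine_loop("a red cube and a blue sphere", generate, verify)
```

The key difference from holistic self-reflection is that `failed` names exactly which units to fix, so the regeneration can stay local.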
- [2026-04-14] Bias at the End of the Score 🆕NEW
- Track: Text-to-image alignment & safety / reward model auditing (fairness and robustness)
- Core innovation: Demonstrates “end-of-score” bias in reward models used throughout T2I pipelines (filtering, supervision, evaluation, post-filtering): large-scale audits show RMs encode demographic biases that get amplified by reward-guided optimization, leading to disproportionate sexualization of female subjects, reinforced gender/racial stereotypes, and reduced demographic diversity; reframes RMs as auditable value functions rather than neutral quality metrics, motivating improved data collection and training procedures for robust, fair scoring.
- [2026-04-14] Generative Refinement Networks for Visual Synthesis
- Track: Image generation (autoregressive / post-diffusion paradigm), Text-to-Image, Text-to-Video
- Key innovations: Proposes Generative Refinement Networks (GRN) as a diffusion alternative: (1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) to remove the discrete-token bottleneck; (2) a global progressive refinement mechanism on top of AR generation to correct accumulated errors and iteratively polish details; (3) entropy-guided sampling for complexity-aware adaptive-step generation, reducing compute without degrading quality, with strong results on ImageNet and scalable T2I/T2V settings.
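Entropy-guided adaptive-step sampling (point 3 above) can be sketched as follows; the linear schedule and thresholds are hypothetical, not GRN's actual policy.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def refinement_steps(token_dists, min_steps=1, max_steps=8):
    """Entropy-guided step allocation: spend more refinement rounds when
    the model is uncertain about its tokens, fewer on easy content
    (the linear mapping here is an illustrative choice)."""
    avg_h = sum(entropy(d) for d in token_dists) / len(token_dists)
    max_h = math.log(len(token_dists[0]))  # entropy of a uniform distribution
    frac = avg_h / max_h
    return min_steps + round(frac * (max_steps - min_steps))

easy = [[0.97, 0.01, 0.01, 0.01]] * 4   # confident token predictions
hard = [[0.25, 0.25, 0.25, 0.25]] * 4   # maximally uncertain

steps_easy = refinement_steps(easy)   # few rounds for confident content
steps_hard = refinement_steps(hard)   # full budget for uncertain content
```

This is the sense in which the sampler is "complexity-aware": compute scales with predictive entropy rather than being fixed per image.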
- [2026-04-14] Representation geometry shapes task performance in vision-language modeling for CT enterography
- Track: Multimodal understanding (medical vision-language), Retrieval-Augmented Generation (RAG), representation analysis
- Key innovations: A first systematic study of vision-language transfer for CT enterography that links representation geometry/aggregation to downstream performance: mean pooling favors disease classification while attention pooling favors cross-modal retrieval, indicating distinct representational emphases. Shows multi-window RGB encoding (HU windows → RGB channels) is more beneficial than increasing spatial coverage via multiplanar views (which can even hurt classification). For report generation, demonstrates limited ordinal learning from plain fine-tuning and quantifies consistent gains from RAG. Uses a three-teacher pseudo-labeling setup to enable comparisons without expert annotations and establishes baselines for this modality.
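The multi-window RGB encoding (HU windows mapped to RGB channels) can be written directly. The specific window center/width values below are illustrative defaults, not the paper's settings.

```python
def window(hu, center, width):
    """Clip a Hounsfield value to one CT window and scale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (min(max(hu, lo), hi) - lo) / (hi - lo)

def multi_window_rgb(hu, windows=((40, 400), (50, 150), (400, 1800))):
    """Encode one HU value into three channels, one per window
    (window settings here are illustrative, not the paper's)."""
    return tuple(window(hu, c, w) for c, w in windows)

rgb = multi_window_rgb(60)   # one voxel seen through three windows
```

Each channel preserves contrast in a different HU range, which is why this encoding can beat adding extra anatomical views: it widens the dynamic range the 3-channel backbone actually sees.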
GitHub
- [2026-04-16] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐10964
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-04-15] Anil-matcha/Open-Generative-AI ⭐4919
Open-source alternative to Higgsfield AI, Freepik, Krea, Openart AI — Free AI image generation & cinema studio with 20+ models (Flux, SDXL, Midjourney...
- [2026-04-15] MiniMax-AI/MiniMax-MCP ⭐1419 🆕NEW
Official MiniMax Model Context Protocol (MCP) server that enables interaction with powerful Text to Speech, image generation and video generation APIs...
- [2026-04-15] Light-Heart-Labs/DreamServer ⭐427 🆕NEW
Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
- [2026-04-15] etkecc/baibot ⭐215
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
HuggingFace Models
- baidu/ERNIE-Image 🆕NEW
Video Generation/Editing
arXiv
- [2026-04-15] From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation 🆕NEW
- Track: Cross-view video generation (Exo-to-Ego / Ego-to-Exo), diffusion-based sequence modeling
- Core innovation: Reframes synchronized exo→ego generation from a paired “condition→output” task into continuous sequence signal modeling: proposes Syn2Seq-Forcing that interpolates between source and target videos to convert synchronization-induced spatio-temporal/geometric jumps into learnable smooth transitions, making diffusion sequence models (e.g., DFoT) better at coherent frame-to-frame transitions; empirically shows the dominant difficulty is video discontinuity (video-only interpolation already yields large gains), and the formulation naturally unifies Exo2Ego and Ego2Exo within one continuous sequence framework.
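The interpolation idea behind Syn2Seq-Forcing can be sketched on a single synchronized frame pair. Linear per-pixel blending is an illustrative stand-in for whatever interpolation operator the paper actually uses; frames are flattened to vectors for brevity.

```python
def interpolate_pair(exo_frame, ego_frame, n_mid):
    """Insert n_mid blended frames between a synchronized exo/ego frame
    pair, turning the abrupt viewpoint jump into a gradual transition a
    diffusion sequence model can learn."""
    seq = [exo_frame]
    for k in range(1, n_mid + 1):
        a = k / (n_mid + 1)
        seq.append([(1 - a) * x + a * y for x, y in zip(exo_frame, ego_frame)])
    seq.append(ego_frame)
    return seq

# Two toy 2-pixel frames; three intermediate frames smooth the jump.
seq = interpolate_pair([0.0, 0.0], [1.0, 4.0], n_mid=3)
```

The resulting continuous sequence is what lets one model serve both directions: reversing the sequence gives the Ego2Exo task for free.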
- [2026-04-15] DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer 🆕NEW
- Track: Video stylization / rerendering (real-time streaming), diffusion acceleration & distillation
- Core innovation: Proposes RTR-DiT as a “real-time rerenderer” built on DiT: fine-tunes a bidirectional teacher for both text-guided and reference-guided stylization, then distills it into a few-step autoregressive diffusion Transformer via Self Forcing and Distribution Matching Distillation for low-latency streaming on long videos; introduces a reference-preserving KV-cache update that stabilizes long-horizon consistency while enabling real-time switching between text prompts and reference images, addressing drift and the high-cost multi-step denoising of diffusion stylization.
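A reference-preserving KV cache reduces to a pinning policy: the reference entries are exempt from eviction while frame entries slide. The class below is a schematic sketch; RTR-DiT's actual cache layout and update rule may differ.

```python
class ReferencePreservingCache:
    """Sliding KV cache that evicts old frame entries but never the
    reference-image entries (a sketch; the paper's policy may differ)."""

    def __init__(self, max_frames):
        self.max_frames = max_frames
        self.reference_kv = None     # pinned: survives all evictions
        self.frame_kv = []           # FIFO window of recent frames

    def set_reference(self, kv):
        self.reference_kv = kv

    def append_frame(self, kv):
        self.frame_kv.append(kv)
        if len(self.frame_kv) > self.max_frames:
            self.frame_kv.pop(0)     # evict the oldest frame only

    def context(self):
        ref = [self.reference_kv] if self.reference_kv is not None else []
        return ref + self.frame_kv

cache = ReferencePreservingCache(max_frames=2)
cache.set_reference("ref_style")
for f in ["f1", "f2", "f3"]:
    cache.append_frame(f)
# Reference stays pinned while the frame window slides.
```

Pinning the reference is what keeps the style anchor stable over arbitrarily long streams, and swapping `reference_kv` is the real-time style switch.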
- [2026-04-15] VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning 🆕NEW
- Track: Video editing (illumination & color / chroma-lux), self-supervised & zero-shot editing, flow-based generative editing
- Core innovation: Introduces VibeFlow, a self-supervised framework that leverages the physical priors of pre-trained video generative models to avoid costly synthetic paired supervision: a disentangled perturbation pipeline enforces adaptive recombination where structure is taken from the source video while color/illumination cues come from a reference image, yielding robust structure–chroma/lux disentanglement; to mitigate discretization errors in flow-based models, it adds Residual Velocity Fields plus Structural Distortion Consistency Regularization to preserve structure and temporal coherence. The method generalizes zero-shot to relighting, recoloring, low-light enhancement, day–night translation, and object-specific color edits with reduced compute/training overhead.
- [2026-04-14] Lyra 2.0: Explorable Generative 3D Worlds
- Track: Video generation + 3D world generation/reconstruction (camera-controlled long-horizon video → 3D lifting)
- Core innovation: Lyra 2.0 formalizes a “generative reconstruction” pipeline that turns camera-controlled videos into persistent 3D worlds, tackling two failure modes in long-horizon 3D-consistent generation: (1) spatial forgetting is mitigated by keeping per-frame 3D geometry only for information routing—retrieving relevant past frames and building dense correspondences to target views—while leaving appearance synthesis to the generative prior; (2) temporal drifting is reduced via self-augmented histories training that feeds the model its own degraded outputs so it learns to correct drift instead of compounding it. The resulting longer, more 3D-consistent trajectories enable reliable fine-tuning of feed-forward 3D reconstruction models.
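The self-augmented-histories recipe can be sketched as a data pipeline: history frames are stochastically replaced by degraded (model-like) versions while targets stay clean. The `degrade` function and replacement probability are hypothetical placeholders.

```python
import random

def self_augmented_histories(clean_frames, degrade, p=0.5):
    """Build training pairs whose history frames are sometimes replaced
    by degraded versions of themselves, so the model learns to correct
    drift instead of compounding it (schematic of the recipe)."""
    history, targets = [], []
    for t in range(1, len(clean_frames)):
        past = [degrade(f) if random.random() < p else f
                for f in clean_frames[:t]]
        history.append(past)
        targets.append(clean_frames[t])  # the target is always clean
    return history, targets

random.seed(0)
hist, tgt = self_augmented_histories([1.0, 2.0, 3.0],
                                     degrade=lambda f: f + 0.1)
```

Exposing the model to its own failure modes at training time is the standard remedy for exposure bias; here it is applied to long-horizon 3D drift.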
- [2026-04-14] Generative Refinement Networks for Visual Synthesis
- Track: Image generation (autoregressive / post-diffusion paradigm), Text-to-Image, Text-to-Video
- Key innovations: Proposes Generative Refinement Networks (GRN) as a diffusion alternative: (1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) to remove the discrete-token bottleneck; (2) a global progressive refinement mechanism on top of AR generation to correct accumulated errors and iteratively polish details; (3) entropy-guided sampling for complexity-aware adaptive-step generation, reducing compute without degrading quality, with strong results on ImageNet and scalable T2I/T2V settings.
- [2026-04-14] VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization
- Track: Video generation (tokenization/representation learning) + training/inference efficiency
- Core innovation: VideoFlexTok replaces fixed spatiotemporal 3D-grid tokens with a variable-length, coarse-to-fine token sequence where early tokens (emergently) capture semantics/motion and later tokens refine details. A generative flow decoder reconstructs realistic videos from any token count, enabling adaptive token budgeting for downstream models and longer-video encoding under the same budget. This reduces the burden of predicting low-level details uniformly, yielding comparable generation quality with much smaller generators and enabling long-video training with dramatically fewer tokens.
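Adaptive token budgeting under a coarse-to-fine ordering can be sketched as a proportional allocator: because early tokens carry semantics and later ones only refine detail, truncating a clip's token sequence to its budget degrades gracefully. The allocator below is a toy illustration, not VideoFlexTok's mechanism.

```python
def allocate_budget(clip_complexities, total_tokens):
    """Toy allocator: split a token budget across clips in proportion to
    complexity. With coarse-to-fine token order, keeping only a clip's
    first `budget` tokens drops fine detail, never core semantics."""
    total_c = sum(clip_complexities)
    budgets = [max(1, int(total_tokens * c / total_c))
               for c in clip_complexities]
    budgets[-1] += total_tokens - sum(budgets)   # hand remainder to last clip
    return budgets

# A simple clip and a complex one sharing one 16-token budget.
budgets = allocate_budget([1.0, 3.0], total_tokens=16)
token_seqs = [["sem", "layout", "tex", "hf"] * 4] * 2
truncated = [seq[:b] for seq, b in zip(token_seqs, budgets)]
```

The generative flow decoder's ability to reconstruct from any token count is what makes this prefix truncation safe at decode time.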
- [2026-04-14] Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
- Track: Image-to-video generation (orbital/object turntable) + 3D-prior conditioning
- Core innovation: Introduces 3D foundation-model shape priors as auxiliary constraints beyond pixel-wise attention, improving geometric realism and multi-view consistency for long-range extrapolation (e.g., back views). Conditioning uses two-scale 3D latents: (1) a denoised global latent for overall structure, and (2) view-dependent latent images projected from volumetric features for fine geometry—more complete than 2.5D cues (depth/normals) and more efficient by avoiding explicit mesh extraction. A multi-scale 3D adapter injects these tokens via cross-attention, enabling model-agnostic, lightweight fine-tuning while retaining general video priors.
- [2026-04-14] ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models
- Track: Video editing/restoration (3D Gaussian Splatting artifact repair) + dataset scaling
- Core innovation: ArtifactWorld reframes sparse-view 3DGS artifact repair as a unified video-diffusion restoration task and scales both data and architecture for robustness and multi-view consistency. (1) It builds a fine-grained phenomenological taxonomy of 3DGS artifacts and a 107.5K paired video dataset to cover diverse real-world degradations. (2) A homogeneous dual-model design adds an isomorphic predictor that outputs an artifact heatmap to localize structural defects, then an Artifact-Aware Triplet Fusion mechanism uses the heatmap to guide intensity-aware spatiotemporal repair directly inside native self-attention, reducing inconsistent views and geometric hallucinations.
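Heatmap-guided, intensity-aware repair can be illustrated as a per-pixel blend: repair strength follows the predicted artifact heatmap, so clean regions stay untouched. Note this is a deliberately simplified reading; the paper fuses the heatmap inside native self-attention rather than blending pixels.

```python
def heatmap_guided_repair(original, repaired, heatmap):
    """Blend toward the repaired frame only where the artifact heatmap
    fires, with blend weight equal to predicted artifact strength
    (a per-pixel simplification of heatmap-guided fusion)."""
    return [[(1 - h) * o + h * r
             for o, r, h in zip(orow, rrow, hrow)]
            for orow, rrow, hrow in zip(original, repaired, heatmap)]

original = [[1.0, 1.0], [1.0, 1.0]]
repaired = [[0.0, 0.0], [0.0, 0.0]]
heatmap  = [[1.0, 0.0], [0.5, 0.0]]   # artifact strength per pixel
out = heatmap_guided_repair(original, repaired, heatmap)
```

Localizing the repair this way is what suppresses the failure mode the entry names: hallucinated geometry in regions that were never defective.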
- [2026-04-14] Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation
- Track: Inference optimization (sparse attention acceleration for video diffusion Transformers)
- Core innovation: PASA is a training-free sparse-attention framework that accelerates video diffusion while mitigating flicker by redesigning compute allocation and routing stability: (1) a curvature/acceleration-aware dynamic budget allocates exact attention only at critical semantic transitions across timesteps; (2) hardware-aligned grouped approximations replace global homogenized estimates to preserve local variations with high throughput; (3) stochastic selection bias in routing softens rigid boundaries, reducing oscillations and local compute starvation that cause temporal flicker—achieving substantial speedups with smoother, more stable videos.
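The curvature-aware budget (point 1) can be sketched in one dimension: measure the discrete second difference of the generation trajectory and reserve exact attention for the sharpest transitions. A 1-D scalar trajectory and top-k selection stand in for PASA's actual latent-space allocator.

```python
def curvature(traj):
    """Discrete second difference (acceleration magnitude) per timestep."""
    return [abs(traj[i + 1] - 2 * traj[i] + traj[i - 1])
            for i in range(1, len(traj) - 1)]

def exact_attention_steps(traj, top_k=2):
    """Mark the top-k highest-curvature timesteps for full-precision
    attention; the rest use the sparse approximation (a toy 1-D
    stand-in for the precision-allocation idea)."""
    curv = curvature(traj)
    ranked = sorted(range(len(curv)), key=lambda i: -curv[i])
    return sorted(i + 1 for i in ranked[:top_k])  # +1: second-diff offset

# Smooth trajectory with one abrupt semantic transition around step 3.
traj = [0.0, 0.1, 0.2, 1.5, 1.6, 1.7, 1.8]
steps = exact_attention_steps(traj)
```

Spending precision only at the bend in the trajectory is exactly why the method keeps temporal smoothness while most steps run sparse.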
- [2026-04-13] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
- Track: Controllable video generation with multimodal conditions (human-object interaction) + multimodal alignment
- Core innovation: OmniShow unifies text, reference images, audio, and pose for practical HOI video generation while balancing controllability and quality: (1) Unified Channel-wise Conditioning efficiently injects image/pose signals with minimal disruption to the base generator; (2) Gated Local-Context Attention improves fine-grained audio-visual synchronization; (3) Decoupled-Then-Joint Training leverages heterogeneous sub-task datasets via staged training and model merging to overcome data scarcity and achieve full-condition coverage. It also introduces HOIVG-Bench to standardize evaluation.
GitHub
- [2026-04-16] hao-ai-lab/FastVideo ⭐3391
A unified inference and post-training framework for accelerated video generation.
- [2026-04-15] Tencent-Hunyuan/HunyuanWorld-Voyager ⭐1536 🆕NEW
Voyager is an interactive RGBD video generation model conditioned on camera input, and supports real-time 3D reconstruction.
- [2026-04-15] YouMind-OpenLab/awesome-seedance-2-prompts ⭐629
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-04-15] wendell0218/Awesome-RL-for-Video-Generation ⭐450 🆕NEW
A curated list of papers on reinforcement learning for video generation
- [2026-04-15] EvoLinkAI/awesome-seedance-2.0-prompts ⭐60 🆕NEW
100+ curated Seedance 2 prompts, examples, and guides for AI video generation.
Audio Generation
arXiv
- [2026-04-15] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments 🆕NEW
- Track: Multimodal retrieval-augmented reasoning (Web search agent benchmarking)
- Core innovations: Introduces MERRIN, a human-annotated benchmark that evaluates search-augmented agents on multimodal evidence retrieval and multi-hop reasoning in noisy, real-world web environments. Key methodological advances include: (1) natural-language queries without modality cues, requiring agents to infer which modalities to seek; (2) explicit inclusion of underexplored modalities such as audio and video, demanding cross-modal evidence integration; (3) retrieval over heterogeneous, often conflicting or partially relevant web sources, stressing robust source selection and reasoning under noise rather than clean retrieval QA. By evaluating agents across no-search/native-search/agentic-search settings and multiple models, it surfaces concrete failure modes (over-exploration, more tool steps but higher distraction from conflicting content, overreliance on text, high resource use with low accuracy), making it a targeted testbed for improving multimodal source selection and reasoning policies.
- [2026-04-12] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
- Track: Audio generation & editing (unified multimodal generative framework)
- Core innovations: Introduces the first end-to-end unified system that integrates audio understanding, generation, and editing across general sound, music, and speech; adopts a division-of-labor architecture with a frozen MLLM for high-level reasoning and a trainable Diffusion Transformer for high-fidelity synthesis; addresses editing data scarcity by building AudioEdit, a million-scale curated paired editing dataset; demonstrates inherited capabilities (knowledge-augmented reasoning, in-context generation, zero-shot cross-lingual control), indicating a path toward universal generative audio intelligence.
- [2026-04-12] VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
- Track: Video-to-audio generation evaluation (V2A/VT2A benchmark & metrics)
- Core innovations: Proposes a multi-task benchmark that evaluates V2A and VT2A separately across four audio categories (SFX, music, speech, singing), enabling fine-grained diagnosis beyond a single unified protocol; introduces 13 task-specific reference-free metrics spanning audio quality, video-audio consistency, and text-audio consistency, and validates them via human studies for preference alignment; benchmarks 11 SOTA models and uncovers key failure modes (notably speech/singing) and a VT2A trade-off between instruction following and visually grounded generation.
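The benchmark's core design point, scoring per audio category instead of pooling, is simple to make concrete; the sample data below is invented for illustration.

```python
def per_category_scores(samples):
    """Average a metric per audio category rather than pooling, so
    category-specific weaknesses (e.g. singing) stay visible."""
    sums, counts = {}, {}
    for category, score in samples:
        sums[category] = sums.get(category, 0.0) + score
        counts[category] = counts.get(category, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

# Toy (category, metric) pairs: a pooled mean would hide the singing gap.
samples = [("sfx", 0.8), ("sfx", 0.9), ("singing", 0.3), ("speech", 0.5)]
scores = per_category_scores(samples)
```

This is the mechanism by which the benchmark surfaces the speech/singing shortfalls that a single unified protocol averages away.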
- [2026-04-10] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
- Track: Audio-video joint generation (physical coherence & motion-sound alignment control)
- Core innovations: Uses object trajectories as a shared kinematic prior to jointly guide visual motion and acoustic events, targeting physically plausible motion-sound relations; introduces a trajectory-aligned motion representation for video and a kinematic-audio alignment module driven by trajectory-derived second-order kinematics to better synchronize sound events with motion/contact; proposes a hybrid flow-matching scheme that preserves trajectory fidelity in conditioned regions while maintaining local coherence elsewhere; curates a large-scale PAV dataset with automatic motion annotations emphasizing motion-relevant AV patterns.
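The trajectory-derived second-order kinematics that drive the motion-audio alignment module are just finite differences of the trajectory. A 1-D position track stands in for the paper's object trajectories.

```python
def kinematics(traj, dt=1.0):
    """Velocity and acceleration from a sampled 1-D trajectory via
    finite differences; spikes in acceleration mark contact events
    that sound should synchronize with."""
    vel = [(traj[i + 1] - traj[i]) / dt for i in range(len(traj) - 1)]
    acc = [(vel[i + 1] - vel[i]) / dt for i in range(len(vel) - 1)]
    return vel, acc

# A bounce: height falls then reverses, so acceleration spikes at impact.
traj = [4.0, 2.0, 0.0, 2.0, 4.0]
vel, acc = kinematics(traj)
impact_frame = max(range(len(acc)), key=lambda i: abs(acc[i]))
```

The velocity sign flip produces a single large acceleration value, and that is the cue the alignment module uses to place the impact sound.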
- [2026-04-09] AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
- Track: Text-to-audio-video generation evaluation (T2AV benchmark & multi-granular assessment)
- Core innovations: Introduces a task-driven T2AV benchmark with high-quality prompts across 11 real-world categories, addressing the limitation of evaluating audio/video separately or via coarse embedding similarity; proposes a multi-granular evaluation pipeline combining lightweight specialist models with MLLMs to assess everything from perceptual quality to fine-grained semantic controllability; reveals a consistent gap between strong AV aesthetics and weak semantic reliability, pinpointing recurring failures (text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control).
- [2026-04-09] Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
- Track: Audio-visual representation learning (cross-modal pretraining & retrieval)
- Core innovations: Identifies semantic noise/optimization interference when contrastive alignment and masked reconstruction share a single forward pass, forcing the contrastive branch to rely on reconstruction-oriented random visible patches; proposes TG-DP, a teacher-guided dual-path framework that decouples reconstruction and alignment into separate optimization paths and uses an alignment-suitable visibility pattern for the contrastive path; adds teacher guidance to structure visible tokens in the contrastive branch, reducing interference and stabilizing learning, yielding SOTA gains in zero-shot retrieval and strong linear-probe robustness.
GitHub
- [2026-04-15] huggingface/diffusers ⭐33341
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-04-13] Lightricks/LTX-2 ⭐5861
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
- [2026-04-13] SamurAIGPT/Generative-Media-Skills ⭐3027
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....
- [2026-04-15] apocas/restai ⭐484
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs via Ollama/vLLM/etc. Precise embeddings usage, t...
Large Language Models
GitHub
- [2026-04-15] abhigyanpatwari/GitNexus ⭐27555
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-04-15] DeusData/codebase-memory-mcp ⭐1562
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-m...
- [2026-04-15] justrach/codedb ⭐712
Zig code intelligence server and MCP toolset for AI agents. Fast tree, outline, symbol, search, read, edit, deps, snapshot, and remote GitHub repo que...
- [2026-04-15] proxysoul/soulforge ⭐542
Graph-powered code intelligence, multi-agent coding with codebase-aware AI. No more grep & pray
- [2026-04-15] SimplyLiz/CodeMCP ⭐85
Code intelligence for AI assistants - MCP server, CLI, and HTTP API with symbol navigation, impact analysis, and architecture mapping
Multimodal Models
arXiv
- [2026-04-15] Reward Design for Physical Reasoning in Vision-Language Models 🆕NEW
- 赛道归属: 多模态推理优化(VLM 物理推理的强化学习/奖励建模)
- 核心创新点: 系统性剖析 GRPO 训练 VLM 做物理推理时“奖励设计→能力形态”的因果影响,提出从低语义到高语义的四类奖励并做消融对比(格式、准确率、包含物理原理与单位一致性的 rubric、以及基于注意力权重的内部奖励)。关键方法突破在于引入无需空间标注的“注意力区域监督式”内部奖励,用模型生成时对图像区域的注意力作为训练信号,显著提升空间关系类物理推理,同时揭示不同奖励会诱导强烈的领域特化(空间增强但符号域退化等),为面向视觉物理推理的奖励可控训练提供可操作的设计准则。
- Track: Multimodal reasoning optimization (RL/reward modeling for VLM physical reasoning)
- Core innovation: A systematic study of how reward design causally shapes GRPO-trained VLM physical reasoning, comparing four rewards with increasing semantic richness (format, accuracy, rubric with principle+unit consistency, and an internal attention-derived reward). The key methodological advance is an annotation-free “attention-region supervision” internal reward that uses the model’s own attention over image regions during generation as training signal, boosting spatial physical reasoning while exposing domain-specific tradeoffs (e.g., spatial gains vs. symbolic degradation), yielding actionable guidance for controllable reward design in visually grounded physics reasoning.
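One plausible way to turn the model's own attention into an annotation-free reward, as the entry above describes, is to reward concentrated (low-entropy) attention over image regions during generation. This is a hedged sketch of the idea only; the function name and exact formulation are assumptions, not the paper's definition.

```python
import math

def attention_concentration_reward(region_attention):
    """Annotation-free internal reward proxy: reward rollouts whose
    generation-time attention over image regions is concentrated
    (low normalized entropy) rather than diffuse."""
    total = sum(region_attention)
    probs = [a / total for a in region_attention]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0
```

A perfectly uniform attention map scores 0, a sharply focused one approaches 1; such a scalar can be added to the GRPO reward mix without any spatial labels.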
- [2026-04-15] MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images 🆕NEW
- 赛道归属: 医学多模态对齐(影像-报告细粒度对齐/定位)
- 核心创新点: 提出多任务、多实例的报告-大幅医学图像对齐框架 MApLe,通过“解耦解剖部位与诊断发现”来降低文本语义纠缠,并以 patch 级方式将句子/短语与局部影像证据建立多实例对应。方法上结合:面向句子级的概念化文本嵌入(同时编码解剖与发现)、解剖结构条件化的 patch-wise 图像编码器、以及多实例对齐目标,使模型能在自由文本报告中对多个发现进行区域级关联,改善小病灶与短文本之间的弱监督对齐难题。
- Track: Medical vision-language alignment (fine-grained image–report grounding/localization)
- Core innovation: MApLe introduces a multi-task, multi-instance alignment framework that explicitly disentangles anatomical region vs. diagnostic finding to reduce semantic entanglement, and performs patch-wise multi-instance matching between sentences and local image evidence. It combines (i) sentence embeddings capturing both anatomy and findings, (ii) an anatomy-conditioned patch-wise image encoder, and (iii) a multi-instance alignment objective, enabling robust grounding of multiple findings in free-text reports and improving weakly supervised alignment for tiny, clinically critical regions.
- [2026-04-15] GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis 🆕NEW
- 赛道归属: 工具增强智能体评测(GIS/空间分析 Agent 基准与度量)
- 核心创新点: 构建面向 GIS 工具链的动态执行型基准 GeoAgentBench:提供集成 117 个原子 GIS 工具的可执行沙箱与 53 类典型空间分析任务,强调“运行时反馈+多模态输出”而非静态文本/代码匹配。提出参数执行准确率 PEA(Last-Attempt Alignment)专门度量隐式参数推断与配置是否到位,并引入 VLM 校验用于评估空间数据正确性与制图风格一致性;同时给出 Plan-and-React 代理架构,将全局规划与逐步反应式执行解耦,提升多步流程的鲁棒性与错误恢复能力。
- Track: Tool-augmented agent evaluation (GIS/spatial analysis benchmarks & metrics)
- Core innovation: GeoAgentBench provides a dynamic execution benchmark with a runnable sandbox integrating 117 atomic GIS tools across 53 task types, evaluating agents via runtime feedback and multimodal outputs rather than static matching. It introduces Parameter Execution Accuracy (PEA) with a “Last-Attempt Alignment” strategy to quantify implicit parameter inference fidelity, plus VLM-based verification for spatial correctness and cartographic style. It also proposes a Plan-and-React agent that decouples global orchestration from step-wise reactive execution to improve robustness and error recovery in multi-step geospatial workflows.
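The PEA metric with Last-Attempt Alignment can be illustrated in a few lines. This is a sketch under assumed names (`parameter_execution_accuracy`, the parameter keys): only the agent's final attempted tool call is scored against the reference configuration, so earlier failed attempts the agent recovered from are not penalized.

```python
def parameter_execution_accuracy(attempts, reference_params):
    """PEA with Last-Attempt Alignment (sketch): fraction of reference
    parameters that the final attempted tool call got right."""
    if not attempts:
        return 0.0
    last = attempts[-1]
    matched = sum(1 for k, v in reference_params.items() if last.get(k) == v)
    return matched / len(reference_params)
```

For example, an agent that first tries the wrong CRS but corrects it on its last attempt still earns full credit, which is exactly what runtime-feedback evaluation rewards over static matching.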
- [2026-04-15] Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation 🆕NEW
- 赛道归属: 多模态安全与鲁棒性(VLM 迎合/煤气灯操控防御与可解释性)
- 核心创新点: 将“神经科学脑对齐”引入 VLM 抗迎合操控的机制研究:用 fMRI(自然场景数据集)预测一致性衡量模型与人类视觉皮层各 ROI 的对齐程度,并用大规模两轮 gaslighting 提示集量化 sycophancy。方法论突破在于发现解剖学上特异的相关性——早期视觉皮层 V1–V3 的对齐度稳定负相关于迎合程度,而高阶类别选择区不显著,提出“忠实低级视觉编码”可作为抵抗语言对视觉事实覆写的锚点,为通过表征约束提升多模态安全提供可检验指标。
- Track: Multimodal safety & robustness (sycophancy/gaslighting resistance in VLMs)
- Core innovation: Links neuroscience-style brain alignment to VLM robustness against sycophantic manipulation by measuring ROI-specific alignment via fMRI predictivity (Natural Scenes Dataset) and quantifying sycophancy with a large-scale two-turn gaslighting prompt suite. The key methodological finding is anatomically specific: alignment in early visual cortex (V1–V3) reliably predicts lower sycophancy, while higher-level category-selective regions do not—suggesting faithful low-level visual encoding acts as an anchor against linguistic override, and providing a testable representation-level metric for multimodal safety.
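The core analysis above is a correlation between per-model ROI alignment scores and sycophancy rates. A minimal Pearson correlation, as one would compute for each ROI, looks like this (a generic statistical sketch, not the paper's code):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between per-model brain-alignment scores (xs)
    and sycophancy rates (ys); the paper reports this is reliably
    negative for V1-V3 alignment but not for higher-level regions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```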
- [2026-04-15] Failure Identification in Imitation Learning Via Statistical and Semantic Filtering 🆕NEW
- 赛道归属: 机器人多模态异常/失败检测(模仿学习部署鲁棒性)
- 核心创新点: 提出与策略无关的失败识别模块 FIDeL:先用视觉异常检测方法学习“正常示范”的紧凑表征,再通过最优传输匹配对齐在线观测,输出异常分数与热力图;在判别层面引入扩展的 conformal prediction 形成时空阈值以控制误报,并用 VLM 做语义过滤,将“良性偏离”与“真实失败”区分开来。配套发布 BotFails 多模态真实机器人失败数据集,形成从像素异常到语义失败的闭环检测范式。
- Track: Multimodal failure/anomaly detection for robotics (robust deployment of imitation learning)
- Core innovation: FIDeL is a policy-agnostic failure identification module that (i) builds compact representations of nominal demonstrations using modern anomaly detection, (ii) aligns incoming observations via optimal-transport matching to produce anomaly scores and heatmaps, (iii) derives spatio-temporal thresholds with an extended conformal prediction scheme, and (iv) applies VLM-based semantic filtering to separate benign anomalies from true failures. Together with the BotFails multimodal real-world dataset, it establishes an end-to-end pipeline from pixel-level anomaly cues to semantically grounded failure detection.
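Step (iii) above rests on conformal calibration. A standard split-conformal threshold over anomaly scores from nominal demonstrations can be sketched as follows (this is the textbook construction; FIDeL's extended spatio-temporal variant is richer):

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)*(1-alpha))-th smallest
    calibration score, bounding the nominal false-alarm rate near alpha."""
    n = len(calibration_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(calibration_scores)[k - 1]
```

At deployment, an online anomaly score above this threshold flags a candidate failure, which the VLM filter then checks semantically.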
- [2026-04-15] SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance 🆕NEW
- 赛道归属: 3D人体重建(单目视频多人交互/遮挡场景)
- 核心创新点: 提出面向近距离交互强遮挡的扩散式 3D 重建框架 SocialMirror,将 VLM 生成的高层交互语义作为先验,驱动“语义引导的运动补全器”对被遮挡身体进行合理幻觉与局部姿态消歧;再用序列级时间精炼器约束全局时序平滑,且在扩散采样中注入几何约束以保证接触与相对空间关系的可行性。核心突破在于把语言语义先验与几何约束以“生成-采样”方式耦合,显著缓解互遮挡导致的时空不连续与关系错误。
- Track: 3D human reconstruction (monocular close-interaction with occlusions)
- Core innovation: SocialMirror is a diffusion-based framework for reconstructing interacting humans from monocular video under severe mutual occlusion. It uses VLM-generated high-level interaction descriptions to guide a semantic motion infiller that hallucinates occluded bodies and resolves local pose ambiguity, then applies a sequence-level temporal refiner for smooth, jitter-free motion while injecting geometric constraints during diffusion sampling to enforce plausible contact and spatial relations. The key advance is coupling language-derived semantic priors with geometry-constrained generative sampling to address occlusion-driven ambiguity and relational errors.
- [2026-04-15] UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing 🆕NEW
- 赛道归属: 多模态推理加速/上下文压缩(超高分辨率遥感 VLM)
- 核心创新点: 提出 UHR-BAT 的“预算感知”token 压缩:针对 UHR 遥感中“超大范围上下文+极小目标证据”导致的 token 二次方爆炸,采用文本引导的多尺度重要性估计进行查询相关的 token 选择,并通过分区保留与合并(region-wise preserve & merge)在保持关键区域细节的同时消除冗余,确保在严格上下文预算下计算量可控且细粒度目标不被下采样抹除。
- Track: Multimodal efficiency (token compression for ultra-high-resolution remote sensing VLMs)
- Core innovation: UHR-BAT introduces budget-aware, query-guided token compression for ultra-high-resolution remote sensing where kilometer-scale context coexists with pixel-scale evidence. It performs text-guided multi-scale importance estimation to select query-critical visual tokens, and applies region-wise preserve-and-merge strategies to reduce redundancy while retaining fine details. The methodological breakthrough is delivering predictable compute under strict context budgets without sacrificing small-object evidence that naive downsampling or global pruning would lose.
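The preserve-and-merge idea can be sketched at its simplest: keep the top-budget tokens by text-guided importance and merge the remainder into one mean token, so redundancy is removed without silently dropping low-scoring evidence. All names here are illustrative; the paper's region-wise variant is considerably richer.

```python
def preserve_and_merge(tokens, scores, budget):
    """Keep the `budget` highest-importance visual tokens; collapse the
    rest into a single mean token instead of discarding them."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx, merge_idx = sorted(order[:budget]), order[budget:]
    kept = [tokens[i] for i in keep_idx]
    if merge_idx:
        dim = len(tokens[0])
        kept.append([sum(tokens[i][d] for i in merge_idx) / len(merge_idx)
                     for d in range(dim)])
    return kept
```

The output length is at most `budget + 1`, which is what makes compute predictable under a strict context budget.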
- [2026-04-15] CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling 🆕NEW
- 赛道归属: 医学多模态对比学习(3D CT-文本对齐/零样本诊断训练策略)
- 核心创新点: 在复现 3D CT-报告对比学习(Merlin/InfoNCE 双编码器)的基础上,系统研究训练批次组成与数据规模对零样本诊断表征的影响,提出并验证一个反直觉结论:显式的正常/异常比例平衡(多种配比、不同粒度采样)会稳定损害性能。其技术价值在于用严格控制变量的消融揭示:小 batch 的 3D 体数据训练中,随机采样带来的“随机多样性”与解剖分段交替 batching 形成更有效的正则化,给出可复用的训练配方与数据扩展规律(次线性 scaling、发现级敏感性差异)。
- Track: Medical multimodal contrastive learning (3D CT–text alignment & zero-shot training strategy)
- Core innovation: Beyond reproducing a 3D CT–report dual-encoder with symmetric InfoNCE, this work isolates how batch composition and data scaling affect zero-shot diagnostic representations. The key methodological insight is that explicit normal/abnormal class balancing within batches (across ratios and granularities) consistently hurts performance, suggesting that stochastic diversity from random sampling—combined with Merlin’s alternating anatomical-subsection batching—acts as better regularization under the small batch sizes imposed by 3D volumes. It also characterizes sub-linear data scaling and finding-specific data sensitivity, yielding actionable training recipes.
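The symmetric InfoNCE objective underlying this dual-encoder setup can be written out directly; batch composition matters precisely because it determines which negatives appear in each row and column of the in-batch similarity matrix. A minimal pure-Python sketch:

```python
import math

def symmetric_infonce(sim, temperature=0.07):
    """Symmetric InfoNCE over an in-batch similarity matrix sim[i][j]
    (CT volume i vs. report j), with positives on the diagonal."""
    n = len(sim)

    def ce(row, target):
        logits = [s / temperature for s in row]
        m = max(logits)
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        return lse - logits[target]

    i2t = sum(ce(sim[i], i) for i in range(n)) / n
    t2i = sum(ce([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (i2t + t2i)
```

Forcing a fixed normal/abnormal ratio into each batch changes the negative distribution in every row, which is the lever the paper's ablations show can hurt.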
- [2026-04-15] Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization 🆕NEW
- 赛道归属: 具身智能体(机器人操作的提示学习/自反思优化)
- 核心创新点: 提出可进化具身智能体 EEAgent,用 VLM 做环境理解与策略规划,并设计长短期反思优化 LSTRO:将近期轨迹反馈(短期)与跨任务沉淀的经验教训(长期)结合,动态改写/优化提示词以持续自我进化,而非仅对单次失败做表层反思。方法突破在于把“反思”做成可累积、可迁移的提示优化机制,从而在无需大规模再训练的前提下提升跨场景操作成功率。
- Track: Embodied agents for robotic manipulation (prompt-based self-reflection optimization)
- Core innovation: EEAgent enables self-evolving robotic manipulation by leveraging VLMs for environment interpretation and planning, and introducing Long Short-Term Reflective Optimization (LSTRO). LSTRO jointly uses short-term episode feedback and long-term distilled lessons across experiences to dynamically refine prompts, making reflection cumulative and transferable rather than a one-off postmortem. The methodological advance is a training-light, prompt-optimization loop that improves cross-task robustness and success rates without extensive retraining.
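The cumulative-reflection loop can be sketched as prompt assembly: each episode's working prompt is rebuilt from the latest trajectory feedback (short-term) plus recently distilled cross-task lessons (long-term). Function and field names below are assumptions for illustration, not EEAgent's interface.

```python
def build_prompt(task, short_term_feedback, long_term_lessons, max_lessons=3):
    """LSTRO-style prompt assembly (sketch): reflection becomes
    cumulative because past lessons persist across tasks, rather than
    a one-off postmortem on the last failure."""
    parts = ["Task: " + task]
    lessons = long_term_lessons[-max_lessons:]
    if lessons:
        parts.append("Lessons from past tasks: " + "; ".join(lessons))
    if short_term_feedback:
        parts.append("Last attempt feedback: " + short_term_feedback)
    return "\n".join(parts)
```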
- [2026-04-15] From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines 🆕NEW
- 赛道归属: 生成式检索/搜索(GenIR 的可信度/权威性建模)
- 核心创新点: 提出首个将“权威性”显式纳入生成式检索的框架 AuthGR:用 VLM 从文本与视觉线索进行多模态权威评分(不仅看语义相关性),并通过三阶段训练逐步把权威偏好注入生成式检索器;部署侧采用混合集成管线提升线上稳定性。方法论突破在于把检索目标从 relevance 扩展为 relevance+authority 的可训练信号,并在真实商业搜索 A/B 与人工评测中验证对可靠性与用户参与度的提升,同时实现小模型对大模型的效果逼近。
- Track: Generative retrieval for web search (authority/trust-aware GenIR)
- Core innovation: AuthGR is the first GenIR framework to explicitly optimize for document authority/trustworthiness in addition to relevance. It introduces multimodal authority scoring using a VLM to extract authority cues from both text and visuals, a three-stage training pipeline to progressively instill authority awareness into the generative retriever, and a hybrid ensemble deployment pipeline for robustness. The key advance is turning “authority” into a trainable retrieval objective with demonstrated real-world gains via large-scale online A/B tests and human evaluation, while enabling a smaller model to match a much larger baseline.
GitHub
- [2026-04-16] Blaizzy/mlx-vlm ⭐4368
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-04-15] waybarrios/vllm-mlx ⭐847
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-04-15] opendatalab/mineru-vl-utils ⭐112
A Python package for interacting with the MinerU Vision-Language Model.
- [2026-04-14] Ice-wave/AttentionLens-LVLM ⭐87
A lightweight and extensible toolkit for visualizing attention flow in Large Vision-Language Models (LVLMs). It renders token-to-token attention maps,...
- [2026-04-15] FeiElysia/Tempo ⭐52 🆕NEW
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
强化学习 / Reinforcement Learning
arXiv
- [2026-04-09] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning 📖3
- 赛道归属: 医学多模态理解与视觉推理(Medical VLM Reasoning / RL对齐)
- 核心创新点: 提出无需中间步骤标注的强化学习框架 MedVR,让医学VLM在推理时更强地“以图为证”。方法上用两项机制协同:①熵引导视觉再落地(EVR)用不确定性驱动探索,把注意力/检索导向更可能提供证据的视觉区域以减少幻觉;②基于一致性的信用分配(CCA)从多次rollout的一致性中提炼伪监督信号,实现无人工标注的过程级学习与稳定优化,从而在多医学VQA基准上显著提升推理与鲁棒性。
- Track: Medical multimodal understanding & visual reasoning (Medical VLM reasoning / RL alignment)
- Key innovations: Proposes MedVR, an annotation-free RL framework that pushes medical VLMs to ground their reasoning in visual evidence. It combines (1) Entropy-guided Visual Regrounding (EVR), which uses model uncertainty to steer exploration toward evidence-bearing visual cues, and (2) Consensus-based Credit Assignment (CCA), which distills pseudo-supervision from agreement across rollouts to enable process-level learning without human intermediate annotations, improving performance and reducing hallucinations on medical VQA benchmarks.
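The consensus mechanism can be illustrated with a minimal agreement-based pseudo-reward (names assumed; CCA's actual credit assignment operates at the process level):

```python
from collections import Counter

def consensus_rewards(rollout_answers):
    """Consensus-based credit assignment (sketch): each rollout's
    pseudo-reward is the fraction of rollouts agreeing with its final
    answer, yielding an annotation-free training signal."""
    counts = Counter(rollout_answers)
    n = len(rollout_answers)
    return [counts[a] / n for a in rollout_answers]
```

Rollouts in the majority cluster are reinforced, minority outliers are suppressed, with no human labels on intermediate steps.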
- [2026-04-14] A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production 📖1 🆕NEW
- 赛道归属: 人机协作制造强化学习(层级任务规划与分配、空间感知调度)
- 核心创新点: 提出面向复杂动态产线的人-机器人任务规划与分配的层级RL框架:高层用高效缓冲区深度Q学习(EBQ)缓解长时序/稀疏回报导致的训练低效,通过更有效的经验利用降低训练时间并提升策略质量;低层引入基于路径规划的空间感知分配(SAP),将人员实时位置与移动距离等空间约束显式纳入资源分配决策,实现“规划-分配”解耦下的实时可执行性与序列子任务落地。
- Track: Human-robot collaborative manufacturing RL (hierarchical task planning & allocation, spatial-aware scheduling)
- Key innovations: A hierarchical RL system for dynamic production task planning and allocation (TPA): a high-level planner using Efficient Buffer-based Deep Q-learning (EBQ) improves learning under long-horizon sparse rewards via better experience utilization, while a low-level Spatially Aware allocation module (SAP) grounded in path planning explicitly accounts for workers' real-time positions and travel distances, enabling real-time feasibility through plan–allocate decoupling.
- [2026-04-13] Robust Adversarial Policy Optimization Under Dynamics Uncertainty 📖1 🆕NEW
- 赛道归属: 鲁棒强化学习(动力学不确定性、分布鲁棒/对抗式策略优化)
- 核心创新点: 提出RAPO,将分布鲁棒RL的原始难解问题转为可操作的对偶形式,直接显式化“鲁棒性-性能”权衡:在轨迹层面用对偶温度参数并以对抗网络近似,生成满足散度约束的稳定最坏情形rollout;在模型层面对动力学集成采用Boltzmann重加权,按“对当前策略更不利”的环境进行策略敏感采样而非均匀域随机化。两层机制相互独立又互补,兼顾稳定训练与覆盖更具挑战的动力学,从而提升OOD动力学泛化与抗不确定性能力并避免过度保守。
- Track: Robust RL under dynamics uncertainty (distributionally robust / adversarial policy optimization)
- Key innovations: RAPO derives a tractable dual that exposes the robustness–performance trade-off. It (i) approximates the dual temperature with an adversarial network to produce stable worst-case rollouts within a divergence bound at the trajectory level, and (ii) applies Boltzmann reweighting over dynamics ensembles for policy-sensitive sampling of more adverse models at the model level. The decoupled components jointly improve OOD robustness while reducing instability and over-conservatism.
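The model-level Boltzmann reweighting in (ii) can be sketched directly: models on which the current policy earns a lower return receive a higher sampling weight, replacing uniform domain randomization with policy-sensitive sampling. Function and argument names are illustrative.

```python
import math

def boltzmann_model_weights(policy_returns, temperature=1.0):
    """Boltzmann weights over a dynamics ensemble: lower policy return
    on a model => higher probability of sampling that model next."""
    logits = [-r / temperature for r in policy_returns]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The temperature controls how aggressively training focuses on adverse dynamics, which is where the over-conservatism trade-off shows up.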
- [2026-04-09] SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility 📖1
- 赛道归属: LLM对齐强化学习 / 多目标奖励自适应与课程学习
- 核心创新点: 提出自步进课程框架SPARD,同时建模“奖励动态”(learning progress导致的非平稳性)与“数据效用”(不同数据对不同能力维度的贡献差异),在线动态调整多目标奖励权重与样本重要性;通过把训练意图与数据效用同步,缓解固定权重在多目标对齐中的失配与数据异质性问题,实现跨域能力的整体提升。
- Track: RL alignment for LLMs / Multi-objective reward weighting & curriculum learning
- Core innovation: SPARD builds an automated self-paced curriculum that jointly accounts for non-stationary reward/learning dynamics and heterogeneous data utility. It dynamically adjusts multi-objective reward weights and data importance based on perceived learning progress, aligning optimization intent with which data is currently most useful, overcoming the limitations of fixed reward weights in complex multi-objective alignment.
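One plausible (assumed, not SPARD's actual) instantiation of progress-driven weight adjustment: shift multi-objective reward weight toward the objectives showing the least recent learning progress, then renormalize to a simplex.

```python
def update_reward_weights(weights, progress, lr=0.1):
    """Self-paced weight update sketch: objectives with less recent
    progress gain weight, so optimization effort tracks where the
    model is currently learning slowest."""
    raw = [w + lr * (max(progress) - p) for w, p in zip(weights, progress)]
    z = sum(raw)
    return [r / z for r in raw]
```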
- [2026-04-15] Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation 🆕NEW
- 赛道归属: 安全强化学习用于电力系统(层级控制、运行时安全屏蔽/Shielding)
- 核心创新点: 提出“层级RL + 运行时安全盾”的电网控制架构,将长时域决策与实时可行性强约束显式解耦:高层RL仅负责提出抽象控制动作(如拓扑/调度意图),低层通过快速前向仿真的确定性安全盾在运行时过滤不安全动作,把安全作为与训练分布/策略质量无关的运行时不变量来保证。该设计在压力测试与零样本迁移到未见大电网拓扑时,相比平坦RL的脆弱与纯安全规则的保守,实现更长生存时间、更低峰值线路负载与更强泛化,强调“架构性安全”优于奖励工程堆叠。
- Track: Safe RL for power grid operation (hierarchical control, runtime safety shielding)
- Key innovations: A safety-constrained hierarchical framework that decouples long-horizon RL decision-making from real-time feasibility: a high-level RL policy proposes abstract actions, while a deterministic runtime safety shield uses fast forward simulation to filter unsafe actions, enforcing safety as a runtime invariant independent of training distribution. This yields robustness under rare disturbances and zero-shot generalization to unseen grid topologies.
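The shield's contract is simple to state in code: the policy proposes, a deterministic check disposes. Below, a hypothetical line-overload predicate stands in for the fast forward simulation; all names are assumptions for illustration.

```python
def shielded_action(proposed, fallback, is_safe):
    """Runtime safety shield (sketch): accept the RL policy's proposed
    action only if the safety check passes; otherwise substitute a
    known-safe fallback. Safety thus holds as a runtime invariant,
    independent of how (or on what distribution) the policy trained."""
    return proposed if is_safe(proposed) else fallback

# Hypothetical predicate: reject actions predicted to overload a line.
def no_overload(action):
    return action.get("predicted_line_load", 0.0) < 1.0
```

This is what "architectural safety" means here: the guarantee lives in the filter, not in the reward function.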
- [2026-04-15] Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation 🆕NEW
- 赛道归属: 离线到在线强化学习(Offline-to-Online迁移、价值函数自适应、理论样本复杂度)
- 核心创新点: 系统刻画“从不完美离线预训练Q函数出发、用少量在线交互快速适配”的难度边界:给出极小极大下界,指出即便预训练Q接近最优,在某些困难实例上在线适配也不可能显著优于纯在线RL;进一步提出新的结构性条件来刻画“可被高效适配”的预训练价值函数族,并在此条件下给出O2O-LSVI,获得问题依赖的样本复杂度改进与可证明优于纯在线RL的保证,同时用神经网络实验验证可行性。
- Track: Offline-to-online RL (value adaptation, sample-complexity theory with general function approximation)
- Key innovations: Establishes a minimax lower bound showing offline-pretrained Q-functions—even near-optimal—may not enable faster online adaptation on hard instances. Introduces a novel structural condition under which adaptation is provably easier, and proposes O2O-LSVI with problem-dependent sample complexity that can outperform pure online RL, supported by neural-network experiments.
- [2026-04-15] DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off 🆕NEW
- 赛道归属: LLM强化学习对齐(RLVR、探索-利用权衡、策略优化)
- 核心创新点: 提出DiPO,通过“困惑度空间解耦”实现细粒度探索-利用控制:用困惑度将样本划分为高困惑度探索子空间与低困惑度利用子空间,专门挖掘极难/极易样本导致的训练失衡;再设计双向奖励分配机制,在尽量不扰动可验证奖励(verification rewards)的前提下,对不同子空间施加差异化激励,实现困惑度引导的稳定策略优化,从而在数学推理与函数调用等RLVR任务上提升性能与训练稳定性。
- Track: RL for LLM alignment (RL with verifiable rewards, exploration–exploitation control)
- Key innovations: DiPO disentangles the sample space by perplexity into exploration (high perplexity) and exploitation (low perplexity) subspaces to address imbalance from extremely hard/easy samples. It then introduces a bidirectional reward allocation scheme that minimally perturbs verification rewards while enabling perplexity-guided exploration/exploitation, leading to more stable policy optimization and improved performance on reasoning and function-calling tasks.
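The perplexity-space split is the simplest piece to sketch: samples above a perplexity threshold form the exploration subspace, the rest the exploitation subspace, so each can receive differentiated reward shaping without disturbing the verifiable rewards themselves. Names below are illustrative.

```python
def partition_by_perplexity(samples, perplexities, threshold):
    """DiPO-style disentanglement (sketch): route high-perplexity
    samples to exploration and low-perplexity ones to exploitation."""
    explore = [s for s, p in zip(samples, perplexities) if p > threshold]
    exploit = [s for s, p in zip(samples, perplexities) if p <= threshold]
    return explore, exploit
```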
- [2026-04-15] Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning 🆕NEW
- 赛道归属: 自动驾驶决策控制(多智能体交互、MPC+深度强化学习融合、跨场景泛化)
- 核心创新点: 提出耦合MPC与深度RL的集成框架,用MPC提供显式约束处理与可迁移的结构先验,缓解端到端RL的安全与泛化问题,同时用RL学习替代/修正手工规则以降低MPC的过度保守:在无信号交叉口多车博弈中,相比纯MPC与纯RL同时提升安全与通行效率;并在不重训条件下零样本迁移到高速汇入场景,显示MPC骨架带来的跨场景鲁棒性与更快训练收敛(降低学习负担)。
- Track: Autonomous driving control (multi-agent interaction, MPC–RL hybrid, generalization)
- Key innovations: An integrated MPC-RL framework where MPC provides constraint-aware structure and transferability, while RL learns adaptive behaviors to avoid hand-crafted overly conservative rules. The coupling improves safety/efficiency in unsignalized multi-agent intersections and enables stronger zero-shot transfer (e.g., highway merging) than end-to-end RL, with faster training stabilization due to reduced learning burden.
- [2026-04-15] Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety 🆕NEW
- 赛道归属: 智能驾驶安全控制(生理信号融合、深度强化学习制动决策)
- 核心创新点: 构建“困倦感知”的自适应自动制动RL系统:先用RNN从ECG中识别驾驶员困倦状态(通过不同窗口/重叠配置的系统性基准选择最佳方案),再将困倦状态并入Double-Dueling DQN的可观测状态,并用“动作延迟”建模困倦导致的反应迟缓,从而学习在不同生理状态下自适应的制动策略;在CARLA高保真仿真中验证在困倦/非困倦条件下均能高成功率避碰,体现生理闭环的人因建模对安全控制策略的增益。
- Track: Safety-critical driving control (physiology-aware RL, adaptive braking)
- Key innovations: A physiology-aware autonomous braking system: an RNN infers drowsiness from ECG (with systematic window/overlap benchmarking), and the drowsiness signal is injected into a Double-Dueling DQN state, while impairment is modeled as action delay. This enables learning braking policies that adapt to driver state, achieving near-perfect collision avoidance in high-fidelity CARLA simulations across drowsy and non-drowsy conditions.
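Modeling impairment as action delay amounts to a small environment wrapper: the chosen braking action takes effect a fixed number of steps later, with a no-op filling the pipeline. This is a generic sketch of the mechanism, not the paper's implementation.

```python
from collections import deque

class ActionDelayWrapper:
    """Drowsiness-induced reaction lag (sketch): each chosen action is
    executed `delay` steps after it is selected; 0 denotes a no-op."""

    def __init__(self, delay):
        self.queue = deque([0] * delay)

    def step(self, action):
        self.queue.append(action)      # latest intended action
        return self.queue.popleft()    # action actually executed now
```

Training the Double-Dueling DQN through such a wrapper forces the policy to brake earlier when the drowsiness signal is active.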
- [2026-04-15] MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment 🆕NEW
- 赛道归属: 对话式智能体训练与评测(用户模拟、多轮强化学习对齐、中文多领域)
- 核心创新点: 提出MUSE中文多领域用户模拟框架,面向“可控且长程一致”的用户行为建模:用迭代式画像自进化(IPSE)通过对比模拟轨迹与真实对话行为的差异并进行推理修正,逐步优化用户画像以提升跨轮一致性;用角色互换监督微调提升局部表达与拟人性;再训练基于rubric的专用奖励模型,并进行rubric引导的多轮RL在会话级优化,使行为约束从句子级提升到长时程策略级,从而增强长期persona一致性与多域可迁移性。
- Track: Interactive dialogue agents (user simulation, multi-turn RL alignment, Chinese multi-domain)
- Key innovations: MUSE builds a controllable, long-horizon consistent Chinese user simulator via (i) Iterative Profile Self-Evolution (IPSE) that refines user profiles by reasoning over discrepancies between simulated trajectories and real dialogues, (ii) role-reversal SFT to improve local realism, and (iii) a rubric-based reward model used in rubric-guided multi-turn RL to optimize dialogue-level behavior, improving long-term persona consistency and multi-domain fidelity.
GitHub
- [2026-04-15] verl-project/verl ⭐20711
verl: Volcano Engine Reinforcement Learning for LLMs
- [2026-04-15] huggingface/trl ⭐18058
Train transformer language models with reinforcement learning.
- [2026-04-16] pytorch/rl ⭐3394
A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
- [2026-04-15] natolambert/rlhf-book ⭐1818
Textbook on reinforcement learning from human feedback
- [2026-04-16] radixark/miles ⭐1085
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
HuggingFace Datasets
- [2026-04-14] llamaindex/ParseBench
ParseBench is a benchmark for evaluating document parsing systems on real-world ent...
- [2026-04-06] hysong/MentalBench
MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
MentalBench is a c...
- [2026-02-22] YennNing/MC-Search
Dataset Card for MC-Search
HuggingFace Spaces
Generated automatically by Daily AI Digest Agent · Generated at: 2026-04-16 01:49:45