AI 每日进展速报 / Daily AI Digest - 2026-04-15
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-04-14] Generative Refinement Networks for Visual Synthesis 🆕NEW
- 赛道归属: 图像生成(自回归/非扩散范式)、文生图、文生视频
- 核心创新点: 提出Generative Refinement Networks(GRN)作为替代扩散的视觉生成范式:用近乎无损的分层二值量化HBQ缓解离散tokenization带来的信息损失,并在AR生成上引入“全局逐步精修”的refinement机制以纠正误差累积、逐轮提升细节;同时用熵引导采样实现复杂度感知的自适应步数生成,在不牺牲质量的前提下降低不必要计算,并在ImageNet及T2I/T2V扩展上验证可扩展性与SOTA指标。
- Track: Image generation (autoregressive / post-diffusion paradigm), Text-to-Image, Text-to-Video
- Key innovations: Proposes Generative Refinement Networks (GRN) as a diffusion alternative: (1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) to remove the discrete-token bottleneck; (2) a global progressive refinement mechanism on top of AR generation to correct accumulated errors and iteratively polish details; (3) entropy-guided sampling for complexity-aware adaptive-step generation, reducing compute without degrading quality, with strong results on ImageNet and scalable T2I/T2V settings.
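The entropy-guided adaptive-step idea can be illustrated with a toy stopping rule (a minimal sketch; `adaptive_refine`, the threshold values, and the entropy traces below are hypothetical illustrations, not taken from the paper):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_refine(step_entropies, threshold=0.5, max_steps=8):
    """Run refinement rounds until mean token entropy drops below the
    threshold (the model is 'confident' about the content), or until
    max_steps is reached. Returns the number of rounds actually spent."""
    for step, h in enumerate(step_entropies, start=1):
        if h < threshold or step >= max_steps:
            return step
    return len(step_entropies)

# A simple image: entropy decays quickly, so few refinement steps are spent.
easy = [1.2, 0.6, 0.3, 0.2]
# A complex image: entropy stays high, so refinement keeps going.
hard = [1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.6, 0.4]
```

The point is complexity-aware compute: simple inputs terminate early while hard inputs consume the full refinement budget.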
- [2026-04-14] Representation geometry shapes task performance in vision-language modeling for CT enterography 🆕NEW
- 赛道归属: 多模态理解(医学视觉-语言)、检索增强生成(RAG)、表征学习分析
- 核心创新点: 系统研究CT肠道造影的视觉-语言迁移学习中“表征几何/聚合方式”对任务的因果影响:发现mean pooling更利于疾病分类而attention pooling更利于跨模态检索,揭示聚合器在表征属性上的分工;提出并验证多窗位RGB编码(不同HU窗映射到RGB)比增加多平面覆盖更关键,甚至额外视角会伤害分类;在报告生成上证明仅微调难学到序关系,引入RAG显著提升序级评估;用三教师伪标注框架在无专家标注下完成可比实验与基线建立。
- Track: Multimodal understanding (medical vision-language), Retrieval-Augmented Generation (RAG), representation analysis
- Key innovations: A first systematic study of vision-language transfer for CT enterography that links representation geometry/aggregation to downstream performance: mean pooling favors disease classification while attention pooling favors cross-modal retrieval, indicating distinct representational emphases. Shows multi-window RGB encoding (HU windows → RGB channels) is more beneficial than increasing spatial coverage via multiplanar views (which can even hurt classification). For report generation, demonstrates limited ordinal learning from plain fine-tuning and quantifies consistent gains from RAG. Uses a three-teacher pseudo-labeling setup to enable comparisons without expert annotations and establishes baselines for this modality.
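The mean-vs-attention pooling contrast above can be sketched in a few lines (a hypothetical illustration of the two aggregators, not the paper's implementation; the toy 2-D tokens stand in for patch embeddings):

```python
import math

def mean_pool(tokens):
    """Average token features: every patch contributes equally,
    giving a global summary (reported to favor classification)."""
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]

def attention_pool(tokens, query):
    """Softmax-weighted average against a query vector, letting a few
    salient patches dominate (reported to favor cross-modal retrieval)."""
    scores = [sum(q * x for q, x in zip(query, t)) for t in tokens]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    d = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]
```

With two orthogonal tokens, mean pooling blends them evenly while attention pooling can collapse onto whichever token matches the query, which is exactly the representational difference the study attributes to the two aggregators.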
- [2026-04-14] Transformer Based Machine Fault Detection From Audio Input 🆕NEW
- 赛道归属: 音频理解(工业故障检测)、Transformer在声学建模
- 核心创新点: 将ViT式Transformer用于基于声谱图的机器故障检测,并与传统CNN在同任务上做对比,重点分析两类模型产生的embedding差异与其对故障判别的影响;以“更低归纳偏置”的Transformer替代CNN的局部性假设,验证在足够数据条件下对声谱图模式建模的有效性与潜在优势。
- Track: Audio understanding (industrial fault detection), transformer-based acoustic modeling
- Key innovations: Applies ViT-style transformers to spectrogram-based machine fault detection and directly compares against CNN baselines, emphasizing how the learned embeddings differ and how that impacts fault classification. Motivates transformers as lower-inductive-bias alternatives to CNN locality/weight-sharing assumptions for spectrogram patterns, and empirically validates effectiveness given sufficient data.
- [2026-04-14] OFA-Diffusion Compression: Compressing Diffusion Model in One-Shot Manner 🆕NEW
- 赛道归属: 推理优化/模型压缩(扩散模型)、Once-for-All子网训练
- 核心创新点: 提出面向扩散模型的一次训练多规格部署(OFA)压缩框架:通过“限制候选子网参数规模集合”缩小搜索空间以加速优化;在给定目标规模下按通道重要性逐步分配/保留通道来构建子网;并用重加权策略平衡不同子网在一阶段训练中的优化冲突,从而以显著更低的重复训练成本产出多种算力/参数预算下的可用扩散子模型。
- Track: Inference optimization / model compression (diffusion), once-for-all subnet training
- Key innovations: Introduces a one-shot OFA compression framework for diffusion models to avoid retraining for each device budget. Key design choices: restrict the candidate subnet space to a predefined set of parameter sizes to speed optimization; construct each target-size subnetwork via progressive channel allocation based on importance; and apply a reweighting strategy to balance optimization across subnetworks during shared training, yielding multiple deployable compressed DPMs with much lower training overhead.
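The progressive, importance-based channel allocation can be sketched as nested top-k selection (a simplified hypothetical illustration; `allocate_channels` and the scores below are not from the paper):

```python
def allocate_channels(importance, budgets):
    """Given per-channel importance scores, build nested subnets by
    keeping the top-k channels for each target budget, so a larger
    subnet always contains the channels of every smaller one."""
    ranked = sorted(range(len(importance)), key=lambda i: -importance[i])
    return {k: sorted(ranked[:k]) for k in sorted(budgets)}
```

Restricting the candidate space to a small, nested set of budgets is what lets a single shared training run cover every deployment size.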
- [2026-04-14] PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning 🆕NEW
- 赛道归属: 文生图对齐强化学习(RLHF/RLAIF)、奖励建模(免标注)
- 核心创新点: 提出PromptEcho:无需人工偏好标注、无需训练奖励模型的T2I强化学习奖励构造方法。核心做法是用冻结VLM对“以原prompt为标签”的token级交叉熵损失作为奖励信号,直接提取VLM预训练中蕴含的细粒度图文对齐知识,相比CLIP分数更密、更可扩展(VLM越强奖励越强);同时构建DenseAlignBench以概念密集caption评测prompt-following,并在多模型上验证显著提升且优于同VLM的推理式打分。
- Track: Text-to-Image alignment RL (RLHF/RLAIF), reward construction without annotations
- Key innovations: Proposes PromptEcho, an annotation-free and training-free reward for T2I RL. It computes token-level cross-entropy of a frozen VLM using the original prompt as labels, extracting fine-grained image-text alignment knowledge from VLM pretraining—denser than CLIP score and automatically improving with stronger VLMs. Introduces DenseAlignBench (concept-rich dense captions) to stress-test prompt following, and shows large gains across models and benchmarks, outperforming inference-based scoring using the same VLM.
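The reward construction reduces to scoring the original prompt under the frozen VLM's per-token distributions. A minimal sketch (hypothetical function name and toy 2-token vocabulary; the real method conditions the VLM on the generated image):

```python
import math

def prompt_echo_reward(step_logprobs, prompt_ids):
    """Reward = negative mean token-level cross-entropy of the original
    prompt under a frozen VLM's next-token distributions (conditioned on
    the generated image): the better the image 'echoes' the prompt, the
    higher the reward. step_logprobs[t][v] = log p(token v at step t)."""
    ce = -sum(step_logprobs[t][tok] for t, tok in enumerate(prompt_ids))
    return -ce / len(prompt_ids)
```

Because the signal is per-token rather than a single similarity scalar, it is denser than a CLIP score, and it improves for free whenever a stronger VLM is swapped in.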
- [2026-04-14] StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation 🆕NEW
- 赛道归属: 图像生成(单图生成/内部学习)、可控生成、图像编辑/外延(outpainting)
- 核心创新点: 提出StructDiff用于单图生成中的结构保持与空间可控:用自适应感受野同时建模全局布局与局部纹理分布,缓解大刚体/强空间约束下结构漂移;引入3D位置编码作为空间先验,实现对生成内容的位置、尺度与局部细节的显式操控(在单图生成中首次系统探索PE操控);并提出基于LLM的单图生成评测准则以替代不适配的客观指标/高成本用户研究,且方法可迁移到文本引导生成、编辑、外延与paint-to-image等任务。
- Track: Image generation (single-image/internal learning), controllable generation, image editing/outpainting
- Key innovations: StructDiff targets structure-preserving and spatially controllable single-image generation. It introduces an adaptive receptive field module to jointly capture global layout and local texture statistics, improving structural consistency under rigid objects and strict spatial constraints. Adds 3D positional encoding as a spatial prior to explicitly control object position/scale/details—position-encoding-based manipulation is presented as a first exploration in this setting. Also proposes an LLM-based evaluation criterion tailored to single-image generation, and demonstrates applicability to text-guided generation, editing, outpainting, and paint-to-image.
- [2026-04-14] T2I-BiasBench: A Multi-Metric Framework for Auditing Demographic and Cultural Bias in Text-to-Image Models 🆕NEW
- 赛道归属: 文生图评测/安全与公平(偏见审计)、基准与指标体系
- 核心创新点: 提出T2I-BiasBench统一评测框架,用13个互补指标同时覆盖三类问题:人口统计偏见、元素遗漏、文化坍缩(首次三维一体化);在既有指标基础上新增/改造多项度量(如Composite Bias Score、Grounded/Implicit Missing Rate、Cultural Accuracy Ratio等),并用结构化prompt集与跨模型对比(含RLHF对齐模型)揭示:偏见可被特定上下文约束削弱但文化覆盖仍普遍坍缩,从而为细粒度、可复现的偏见诊断提供标准化工具。
- Track: Text-to-Image evaluation / safety & fairness (bias auditing), benchmarks and metrics
- Key innovations: Introduces T2I-BiasBench, a unified auditing framework with 13 complementary metrics that jointly measure demographic bias, element omission, and cultural collapse—addressing all three dimensions in one benchmark. Adds new/adapted measures (e.g., Composite Bias Score, Grounded/Implicit Missing Rate, Cultural Accuracy Ratio) and evaluates multiple open models against an RLHF-aligned reference, showing context can attenuate some demographic biases while cultural representation collapse persists even with alignment. Provides a standardized, fine-grained, reproducible bias evaluation toolkit.
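An omission metric of the kind the benchmark proposes can be reduced to simple set arithmetic (a rough hypothetical simplification; the paper's Grounded/Implicit Missing Rate definitions involve a detector and are more nuanced):

```python
def missing_rate(prompt_concepts, detected_concepts):
    """Fraction of prompt concepts with no match among the concepts an
    open-vocabulary detector found in the generated image. 0.0 means
    full coverage; 1.0 means every requested element was dropped."""
    missed = [c for c in prompt_concepts if c not in detected_concepts]
    return len(missed) / len(prompt_concepts)
```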
- [2026-04-14] Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling 🆕NEW
- 赛道归属: 文生图安全(后门检测)、扩散模型鲁棒性/主动探测
- 核心创新点: 提出基于“交叉注意力缩放主动探测”的输入级后门检测:发现CSRD现象——对cross-attention施加可控缩放扰动时,良性与后门输入在去噪步间的响应演化呈系统性分歧;据此构建SET框架,在多尺度扰动下提取响应偏移特征,并用少量干净样本学习紧凑的良性响应空间,通过偏离度实现攻击无关、无需训练数据/训练过程访问的检测,尤其对语义保持、隐式触发器场景更稳健。
- Track: Text-to-Image security (backdoor detection), diffusion robustness / active probing
- Key innovations: Proposes an input-level backdoor detector via active probing with cross-attention scaling. Discovers Cross-Attention Scaling Response Divergence (CSRD): benign vs. backdoored inputs exhibit systematically different response evolution across denoising steps under controlled cross-attention scaling perturbations. Builds SET by extracting response-offset features across multiple scales, learning a compact benign response space from a small clean set, and flagging deviations—attack-agnostic, requiring no training-time access, and particularly effective for stealthy semantics-preserving/implicit triggers.
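The final flagging step, scoring deviation from a compact benign response space, can be sketched as an aggregate z-score over per-scale response offsets (a hypothetical simplification; SET's actual feature extraction and benign-space model are described in the paper):

```python
def deviation_score(offsets, benign_mean, benign_std):
    """Aggregate z-score of an input's response offsets (one per
    cross-attention scale) against statistics estimated from a small
    clean set; large scores flag likely backdoor triggers."""
    return sum(abs(o - m) / s
               for o, m, s in zip(offsets, benign_mean, benign_std)) / len(offsets)
```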
- [2026-04-14] Bridging the Micro–Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization 🆕NEW
- 赛道归属: 图像篡改定位/生成内容检测(forensics)、频域-语义融合
- 核心创新点: 提出FASA以弥合“微观取证痕迹 vs 宏观语义一致性”的鸿沟:用自适应双频段DCT提取对篡改敏感的频域线索;在冻结CLIP表征上做patch级对比对齐学习篡改语义先验;通过语义-频率侧向适配器将语义先验注入分层频域通路实现多尺度交互,并用原型引导、频率门控的mask解码器同时强化边界与语义一致性,从而对传统篡改与扩散编辑均实现更强泛化与抗退化鲁棒性。
- Track: Image manipulation localization / generative forensics, frequency-semantic fusion
- Key innovations: Proposes FASA to bridge the micro–macro gap between low-level forensic artifacts and high-level semantics. It extracts manipulation-sensitive frequency cues via an adaptive dual-band DCT module, learns manipulation-aware semantic priors through patch-level contrastive alignment on frozen CLIP features, injects these priors into a hierarchical frequency pathway using a semantic-frequency side adapter for multi-scale interaction, and employs a prototype-guided, frequency-gated mask decoder to combine semantic consistency with boundary-aware localization—achieving strong generalization across generators/datasets and robustness to degradations.
- [2026-04-14] Self-Adversarial One Step Generation via Condition Shifting 🆕NEW
- 赛道归属: 文生图加速(一步采样/一致性蒸馏替代)、训练方法(无判别器对抗校正)
- 核心创新点: 提出APEX用于一步生成的质量-速度-训练效率折中突破:通过“条件平移(condition shifting)”在同一flow模型内构造shifted condition分支,其速度场作为当前生成分布的独立估计器,产生可证明与GAN目标对齐的对抗校正梯度,从而在无外部判别器的情况下获得锐化信号,避免判别器导致的不稳定与梯度消失;该设计保持架构不变、可即插即用并兼容LoRA调参,在NFE=1下实现显著质量提升与大幅推理加速。
- Track: Text-to-Image acceleration (one-step sampling), training methodology (discriminator-free adversarial correction)
- Key innovations: APEX improves one-step generation by extracting adversarial correction signals endogenously from a flow model via condition shifting. A shifted-condition branch provides an independent estimate of the current generation distribution through its velocity field, yielding gradients provably aligned with GAN objectives while avoiding external discriminators (and their instability/vanishing gradients). The method is architecture-preserving and plug-and-play, compatible with full fine-tuning and LoRA, delivering strong NFE=1 quality and large inference speedups.
GitHub
- [2026-04-15] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐10939
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-04-14] Anil-matcha/Open-Generative-AI ⭐4784
Open-source alternative to Higgsfield AI, Freepik, Krea, Openart AI — Free AI image generation & cinema studio with 20+ models (Flux, SDXL, Midjourney...
- [2026-04-15] AceDataCloud/Nexior ⭐354
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-04-14] ferranpons/Llamatik ⭐98
True on-device AI for Kotlin Multiplatform (Android, iOS, Desktop, JVM, WASM). LLM, Speech-to-Text and Image Generation — powered by llama.cpp, whispe...
- [2026-04-14] baidu/ERNIE-Image ⭐78
ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-04-14] Lyra 2.0: Explorable Generative 3D Worlds 🆕NEW
- 赛道归属: 视频生成 + 3D世界生成/重建(camera-controlled long-horizon video → 3D lifting)
- 核心创新点: 提出“生成式重建”框架 Lyra 2.0,用视频生成的视觉先验驱动可实时渲染的3D世界构建,重点解决长轨迹下的3D一致性退化两大根因:1)针对空间遗忘,维护逐帧3D几何但仅用于信息路由(检索相关历史帧并建立到目标视角的稠密对应),外观仍由生成模型合成,从而在大视角变化与回访场景时保持结构一致;2)针对时间漂移,用自增强历史(Self-augmented histories)训练,让模型暴露于自身退化输出并学习“纠偏”而非累积误差。最终可生成更长、更一致的探索视频,并用于微调前馈式重建模型以稳定恢复高质量3D场景。
- Track: Video generation + 3D world generation/reconstruction (camera-controlled long-horizon video → 3D lifting)
- Core innovation: Lyra 2.0 formalizes a “generative reconstruction” pipeline that turns camera-controlled videos into persistent 3D worlds, tackling two failure modes in long-horizon 3D-consistent generation: (1) spatial forgetting is mitigated by keeping per-frame 3D geometry only for information routing—retrieving relevant past frames and building dense correspondences to target views—while leaving appearance synthesis to the generative prior; (2) temporal drifting is reduced via self-augmented histories training that feeds the model its own degraded outputs so it learns to correct drift instead of compounding it. The resulting longer, more 3D-consistent trajectories enable reliable fine-tuning of feed-forward 3D reconstruction models.
- [2026-04-14] Generative Refinement Networks for Visual Synthesis 🆕NEW
- 赛道归属: 图像生成(自回归/非扩散范式)、文生图、文生视频
- 核心创新点: 提出Generative Refinement Networks(GRN)作为替代扩散的视觉生成范式:用近乎无损的分层二值量化HBQ缓解离散tokenization带来的信息损失,并在AR生成上引入“全局逐步精修”的refinement机制以纠正误差累积、逐轮提升细节;同时用熵引导采样实现复杂度感知的自适应步数生成,在不牺牲质量的前提下降低不必要计算,并在ImageNet及T2I/T2V扩展上验证可扩展性与SOTA指标。
- Track: Image generation (autoregressive / post-diffusion paradigm), Text-to-Image, Text-to-Video
- Key innovations: Proposes Generative Refinement Networks (GRN) as a diffusion alternative: (1) a theoretically near-lossless Hierarchical Binary Quantization (HBQ) to remove the discrete-token bottleneck; (2) a global progressive refinement mechanism on top of AR generation to correct accumulated errors and iteratively polish details; (3) entropy-guided sampling for complexity-aware adaptive-step generation, reducing compute without degrading quality, with strong results on ImageNet and scalable T2I/T2V settings.
- [2026-04-14] VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization 🆕NEW
- 赛道归属: 视频生成(tokenizer/表示学习)+ 推理/训练效率优化
- 核心创新点: 提出 VideoFlexTok,将传统固定大小的时空3D网格token改为可变长度、由粗到细的token序列:前段token自发承载语义与运动等抽象信息,后段token逐步补充细节;配套生成式流(Flow)解码器支持“任意token数量”重建,使下游模型可按任务/算力动态分配token预算,并在同等预算下编码更长视频。该表示显著降低下游生成模型学习低层细节的负担,实现更小模型规模下的可比质量,并支持长视频训练(以更少token覆盖更多帧)。
- Track: Video generation (tokenization/representation learning) + training/inference efficiency
- Core innovation: VideoFlexTok replaces fixed spatiotemporal 3D-grid tokens with a variable-length, coarse-to-fine token sequence where early tokens (emergently) capture semantics/motion and later tokens refine details. A generative flow decoder reconstructs realistic videos from any token count, enabling adaptive token budgeting for downstream models and longer-video encoding under the same budget. This reduces the burden of predicting low-level details uniformly, yielding comparable generation quality with much smaller generators and enabling long-video training with dramatically fewer tokens.
- [2026-04-14] Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors 🆕NEW
- 赛道归属: 单图驱动视频生成(物体环绕/轨道视角)+ 3D先验条件控制
- 核心创新点: 用3D基础生成模型的形状先验替代仅靠像素注意力的跨帧一致性约束,提升长程外推(如背面视角)时的几何可信度与多视一致性。方法以两级3D潜特征进行条件:1)去噪的全局latent向量提供整体结构约束;2)从体素特征投影得到的视角相关latent图像提供细粒度几何细节;相较深度/法线等2.5D条件,这些特征能表达完整形状且避免显式网格提取以提升效率。并提出多尺度3D Adapter通过cross-attention向通用视频模型注入特征token,实现模型无关、轻量微调且保留原有生成能力。
- Track: Image-to-video generation (orbital/object turntable) + 3D-prior conditioning
- Core innovation: Introduces 3D foundation-model shape priors as auxiliary constraints beyond pixel-wise attention, improving geometric realism and multi-view consistency for long-range extrapolation (e.g., back views). Conditioning uses two-scale 3D latents: (1) a denoised global latent for overall structure, and (2) view-dependent latent images projected from volumetric features for fine geometry—more complete than 2.5D cues (depth/normals) and more efficient by avoiding explicit mesh extraction. A multi-scale 3D adapter injects these tokens via cross-attention, enabling model-agnostic, lightweight fine-tuning while retaining general video priors.
- [2026-04-14] ArtifactWorld: Scaling 3D Gaussian Splatting Artifact Restoration via Video Generation Models 🆕NEW
- 赛道归属: 视频编辑/修复(面向3D Gaussian Splatting的多视一致修复)+ 数据集构建
- 核心创新点: 提出 ArtifactWorld,将3DGS在稀疏视角下的几何/光度伪影修复统一为视频扩散式修复问题,并通过“数据+结构”两端扩展提升泛化与一致性:1)建立3DGS伪影的现象学细粒度分类并构建107.5K成对视频训练集,覆盖多样真实伪影分布以缓解数据瓶颈;2)采用同构双模型范式:在视频扩散骨干内引入同构预测器输出伪影热力图定位结构缺陷,再用Artifact-Aware Triplet Fusion将热力图作为强引导,在原生自注意力中实现强度可控的时空联合修复,从而减少多视不一致与错误几何幻觉。
- Track: Video editing/restoration (3D Gaussian Splatting artifact repair) + dataset scaling
- Core innovation: ArtifactWorld reframes sparse-view 3DGS artifact repair as a unified video-diffusion restoration task and scales both data and architecture for robustness and multi-view consistency. (1) It builds a fine-grained phenomenological taxonomy of 3DGS artifacts and a 107.5K paired video dataset to cover diverse real-world degradations. (2) A homogeneous dual-model design adds an isomorphic predictor that outputs an artifact heatmap to localize structural defects, then an Artifact-Aware Triplet Fusion mechanism uses the heatmap to guide intensity-aware spatiotemporal repair directly inside native self-attention, reducing inconsistent views and geometric hallucinations.
- [2026-04-14] Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation 🆕NEW
- 赛道归属: 推理优化(视频扩散Transformer稀疏注意力加速)
- 核心创新点: 提出训练无关的 PASA(Precision-Allocated Sparse Attention),针对稀疏注意力常见的闪烁问题,从“预算分配+路由稳定性”两方面改造:1)用曲率/加速度感知的动态预算,根据生成轨迹在时间步上的语义变化强度弹性分配精算力,仅在关键转折处保精度;2)用硬件对齐的分组近似替代全局同质估计,兼顾局部差异表达与吞吐;3)在路由中引入随机选择偏置软化硬边界,抑制选择振荡与局部算力饥饿,从机制上降低时间闪烁,在显著加速下保持时序平滑与结构稳定。
- Track: Inference optimization (sparse attention acceleration for video diffusion Transformers)
- Core innovation: PASA is a training-free sparse-attention framework that accelerates video diffusion while mitigating flicker by redesigning compute allocation and routing stability: (1) a curvature/acceleration-aware dynamic budget allocates exact attention only at critical semantic transitions across timesteps; (2) hardware-aligned grouped approximations replace global homogenized estimates to preserve local variations with high throughput; (3) stochastic selection bias in routing softens rigid boundaries, reducing oscillations and local compute starvation that cause temporal flicker—achieving substantial speedups with smoother, more stable videos.
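The curvature-aware budget can be sketched as picking the timesteps with the largest discrete second difference of a per-step semantic-change signal (a toy illustration under assumed names; PASA's actual budget rule operates on the denoising trajectory):

```python
def exact_attention_steps(signal, budget):
    """Pick the timesteps with the largest |second difference|
    (sharpest semantic transitions) to receive exact attention;
    all other steps use the cheap sparse approximation."""
    curvature = {t: abs(signal[t - 1] - 2 * signal[t] + signal[t + 1])
                 for t in range(1, len(signal) - 1)}
    return sorted(sorted(curvature, key=curvature.get, reverse=True)[:budget])
```

A flat trajectory spends no exact-attention budget, while an abrupt transition attracts it, which is the "precision only at critical turns" allocation the paper describes.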
- [2026-04-13] OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation 🆕NEW
- 赛道归属: 多条件可控视频生成(人-物交互 HOI)+ 多模态对齐
- 核心创新点: 提出 OmniShow,面向工业可用的 HOIVG,将文本/参考图/音频/姿态等条件在同一框架内统一建模并兼顾质量与可控性:1)用 Unified Channel-wise Conditioning高效注入图像与姿态条件,降低对主干生成能力的干扰;2)用 Gated Local-Context Attention强化音画局部对齐,实现更精确的口型/节奏同步;3)提出 Decoupled-Then-Joint Training,先分解子任务利用异构数据集分阶段训练,再通过模型合并进行联合能力整合,以数据稀缺下获得全面条件覆盖;并构建 HOIVG-Bench补齐评测体系。
- Track: Controllable video generation with multimodal conditions (human-object interaction) + multimodal alignment
- Core innovation: OmniShow unifies text, reference images, audio, and pose for practical HOI video generation while balancing controllability and quality: (1) Unified Channel-wise Conditioning efficiently injects image/pose signals with minimal disruption to the base generator; (2) Gated Local-Context Attention improves fine-grained audio-visual synchronization; (3) Decoupled-Then-Joint Training leverages heterogeneous sub-task datasets via staged training and model merging to overcome data scarcity and achieve full-condition coverage. It also introduces HOIVG-Bench to standardize evaluation.
- [2026-04-13] LottieGPT: Tokenizing Vector Animation for Autoregressive Generation 🆕NEW
- 赛道归属: 向量动画生成(结构化序列生成/自回归)+ 多模态生成
- 核心创新点: 首次系统化解决“原生向量动画”生成:以Lottie(JSON)为载体设计Lottie Tokenizer,将分层几何图元、变换与关键帧运动编码为紧凑且语义对齐的token序列,显著缩短序列长度同时保持结构保真,使自回归学习动态矢量内容可行;并构建大规模 LottieAnimation-660K 数据集支撑训练。在此基础上微调 Qwen-VL 得到 LottieGPT,实现从文本或视觉提示直接生成可编辑的矢量动画,且在SVG(单帧特例)上优于既有方法。
- Track: Vector animation generation (structured autoregressive sequence generation) + multimodal generation
- Core innovation: Establishes the first end-to-end framework for native vector animation generation by adopting Lottie (JSON) and designing a Lottie Tokenizer that encodes layered primitives, transforms, and keyframe motion into a compact, semantically aligned token sequence, enabling effective autoregressive learning while preserving structural fidelity. It further contributes LottieAnimation-660K, a large-scale dataset for training. Built on these, LottieGPT (fine-tuned from Qwen-VL) generates coherent, editable vector animations from text or visual prompts and improves over prior SVG-generation baselines.
- [2026-04-13] HDR Video Generation via Latent Alignment with Logarithmic Encoding 🆕NEW
- 赛道归属: 视频生成(HDR/高动态范围)+ 表示对齐/轻量微调
- 核心创新点: 提出通过对数编码(Log encoding)实现HDR与预训练生成模型潜空间的分布对齐:无需重训新编码器/表示,即可将HDR映射到更贴近模型已学先验的域内,从而用轻量微调完成HDR视频生成/适配;同时引入相机退化模拟(Camera-mimicking degradations)训练策略,迫使模型从受限观测中“补全”不可见的高动态范围细节,提升高光/暗部细节恢复与跨场景鲁棒性。核心突破在于用表示选择与对齐替代架构重设计,降低HDR生成门槛。
- Track: Video generation (HDR) + representation alignment / lightweight adaptation
- Core innovation: Achieves HDR video generation by aligning HDR data with pretrained generative priors via logarithmic encoding, which maps HDR into a distribution naturally compatible with the model’s latent space—avoiding training a new encoder or redesigning representations and enabling lightweight fine-tuning. A camera-mimicking degradation training strategy encourages the model to infer missing HDR details from priors, improving highlight/shadow detail recovery and robustness across challenging lighting, demonstrating that representation alignment can replace heavy architectural changes.
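A generic logarithmic encoding of this kind is easy to write down (a minimal sketch; the constants and the exact transfer function here are illustrative assumptions, not the paper's formula):

```python
import math

def log_encode(x, x_max=100.0, eps=1e-4):
    """Map linear HDR radiance into [0, 1] logarithmically, compressing
    the wide dynamic range into a distribution closer to the SDR data
    the pretrained latent space already models well."""
    return math.log(1 + x / eps) / math.log(1 + x_max / eps)

def log_decode(y, x_max=100.0, eps=1e-4):
    """Inverse mapping from the encoded value back to linear radiance."""
    return eps * ((1 + x_max / eps) ** y - 1)
```

Because the transform is invertible and monotone, the generative model works entirely in the aligned domain and HDR output is recovered by `log_decode` afterwards.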
- [2026-04-13] Empowering Video Translation using Multimodal Large Language Models 🆕NEW
- 赛道归属: 多模态理解与生成综述(MLLM赋能视频翻译/配音/唇形同步)
- 核心创新点: 该工作为综述型贡献,提出面向“MLLM驱动视频翻译”的三角色分类框架以系统化拆解端到端能力如何替代传统级联:1)Semantic Reasoner(视频理解、时序推理、多模态融合);2)Expressive Performer(可控、富表现力语音生成与说话人/情绪一致性);3)Visual Synthesizer(唇形同步与视觉对齐的视频生成器)。其方法论价值在于给出统一问题分解、能力边界与开放挑战(时序建模、多模态对齐等),为后续系统设计与评测提供结构化路线图。
- Track: Survey on multimodal understanding & generation (MLLM-enabled video translation/dubbing/lip-sync)
- Core innovation: As a survey, it contributes a task-specific three-role taxonomy that systematizes how MLLMs reshape video translation beyond cascaded ASR→MT→TTS→lip-sync pipelines: (1) Semantic Reasoner for video understanding, temporal reasoning, and multimodal fusion; (2) Expressive Performer for controllable, expressive speech with speaker/emotion consistency; (3) Visual Synthesizer for high-fidelity lip-sync and visual alignment via video generators. The key value is a unified decomposition of capabilities, limitations, and open challenges (temporal modeling, multimodal alignment), serving as a design and evaluation roadmap.
GitHub
- [2026-04-14] hao-ai-lab/FastVideo ⭐3380
A unified inference and post-training framework for accelerated video generation.
- [2026-04-14] ModelTC/LightX2V ⭐2173
Light Image Video Generation Inference Framework
- [2026-04-14] YouMind-OpenLab/awesome-seedance-2-prompts ⭐615
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-04-15] vargHQ/sdk ⭐277
AI video generation SDK — JSX for videos. One API for Kling, Flux, ElevenLabs, Sora. Built on Vercel AI SDK.
- [2026-04-14] Correr-Zhou/OmniShow ⭐98
ByteDance's All-in-One Video Generation Model for Human-Object Interaction Video Generation
音频生成 / Audio Generation
arXiv
- [2026-04-12] Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing 🆕NEW
- 赛道归属: 音频生成与编辑(多模态统一生成框架)
- 核心创新点: 提出端到端统一框架,将音频理解、生成与编辑在同一模型中打通,并覆盖通用声音/音乐/语音三大域;采用“冻结的多模态大语言模型(MLLM)负责高层推理 + 可训练的Diffusion Transformer负责高保真合成”的分工式架构,实现推理能力与合成质量兼得;针对音频编辑数据稀缺,构建百万级高质量编辑配对数据集AudioEdit以支撑可泛化的编辑学习;展示继承能力(知识增强生成、in-context生成、零样本跨语种控制)表明统一模型具备向“通用生成式音频智能”扩展的潜力。
- Track: Audio generation & editing (unified multimodal generative framework)
- Core innovations: Introduces the first end-to-end unified system that integrates audio understanding, generation, and editing across general sound, music, and speech; adopts a division-of-labor architecture with a frozen MLLM for high-level reasoning and a trainable Diffusion Transformer for high-fidelity synthesis; addresses editing data scarcity by building AudioEdit, a million-scale curated paired editing dataset; demonstrates inherited capabilities (knowledge-augmented reasoning, in-context generation, zero-shot cross-lingual control), indicating a path toward universal generative audio intelligence.
- [2026-04-12] VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories 🆕NEW
- 赛道归属: 视频到音频生成评测(V2A/VT2A基准与指标)
- 核心创新点: 构建面向V2A与VT2A的多任务评测基准,将音频按音效/音乐/语音/歌唱四类拆分评估,避免“统一协议掩盖类别差异”的问题;提出13个面向任务的无参考指标,分别覆盖音质、视听一致性与文音一致性,并通过主观实验验证与人类偏好对齐;系统评测11个SOTA模型,揭示语音与歌唱显著短板,以及VT2A中“指令遵循 vs 视觉扎根”的结构性张力(更强视觉条件提升对齐但易偏离目标音频类别),为诊断与迭代V2A系统提供可扩展工具链。
- Track: Video-to-audio generation evaluation (V2A/VT2A benchmark & metrics)
- Core innovations: Proposes a multi-task benchmark that evaluates V2A and VT2A separately across four audio categories (SFX, music, speech, singing), enabling fine-grained diagnosis beyond a single unified protocol; introduces 13 task-specific reference-free metrics spanning audio quality, video-audio consistency, and text-audio consistency, and validates them via human studies for preference alignment; benchmarks 11 SOTA models and uncovers key failure modes (notably speech/singing) and a VT2A trade-off between instruction following and visually grounded generation.
- [2026-04-10] Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence 🆕NEW
- 赛道归属: 音频-视频联合生成(物理一致性/运动-声音对齐控制)
- 核心创新点: 以“物体轨迹”作为音视频生成共享的运动学先验,解决现有方法中运动不稳与声画接触事件对齐松散的问题;设计轨迹对齐的视频运动表征,并利用由轨迹导出的二阶运动学状态(如速度/加速度变化)驱动运动-音频对齐模块,使声事件与运动/碰撞更同步;提出混合式flow matching,在轨迹条件区域保持轨迹保真、在其他区域维持局部一致性,从而兼顾可控性与自然度;配套构建强调运动相关模式、带自动运动标注的大规模PAV数据集以支撑训练与评测。
- Track: Audio-video joint generation (physical coherence & motion-sound alignment control)
- Core innovations: Uses object trajectories as a shared kinematic prior to jointly guide visual motion and acoustic events, targeting physically plausible motion-sound relations; introduces a trajectory-aligned motion representation for video and a kinematic-audio alignment module driven by trajectory-derived second-order kinematics to better synchronize sound events with motion/contact; proposes a hybrid flow-matching scheme that preserves trajectory fidelity in conditioned regions while maintaining local coherence elsewhere; curates a large-scale PAV dataset with automatic motion annotations emphasizing motion-relevant AV patterns.
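The trajectory-derived second-order kinematic states are just finite differences of the trajectory (a 1-D sketch under assumed names; the paper applies this to object trajectories in the video):

```python
def kinematics(traj, dt=1.0):
    """First/second finite differences of a 1-D trajectory: velocity
    and acceleration. Acceleration spikes mark contact/impact events,
    the cue used to synchronize sound with motion."""
    vel = [(traj[i + 1] - traj[i]) / dt for i in range(len(traj) - 1)]
    acc = [(vel[i + 1] - vel[i]) / dt for i in range(len(vel) - 1)]
    return vel, acc
```

An object that moves at constant speed and then stops abruptly shows a single acceleration spike at the stopping frame, exactly where a collision sound should land.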
- [2026-04-09] AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation 🆕NEW
- 赛道归属: 文本到音频-视频生成评测(T2AV基准与多粒度评估)
- 核心创新点: 提出面向真实提示词的任务驱动T2AV评测基准,覆盖11类应用场景,弥补现有评测“音频/视频割裂或仅靠粗粒度embedding相似度”的不足;构建多粒度评估框架,将轻量专用模型与多模态大语言模型结合,从感知质量到细粒度语义可控性进行分层评估;通过系统评测揭示当前模型“审美强但语义可靠性弱”的关键鸿沟,并定位共性失败(文字渲染、语音连贯性、物理推理、以及普遍的音乐音高控制崩溃),为后续模型训练目标与指标设计提供明确方向。
- Track: Text-to-audio-video generation evaluation (T2AV benchmark & multi-granular assessment)
- Core innovations: Introduces a task-driven T2AV benchmark with high-quality prompts across 11 real-world categories, addressing the limitation of evaluating audio/video separately or via coarse embedding similarity; proposes a multi-granular evaluation pipeline combining lightweight specialist models with MLLMs to assess everything from perceptual quality to fine-grained semantic controllability; reveals a consistent gap between strong AV aesthetics and weak semantic reliability, pinpointing recurring failures (text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control).
- [2026-04-09] Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning 🆕NEW
- 赛道归属: 音频-视觉表征学习(跨模态预训练/检索)
- 核心创新点: 指出对比对齐与掩码重建在同一次前向中联合优化会因“重建分支的随机可见patch”污染对比分支的跨模态对齐,产生语义噪声与优化干扰;提出Teacher-Guided Dual-Path(TG-DP)双路径框架,将重建与对齐解耦为两条优化路径,并为对比分支使用更适配对齐的可见性模式;引入教师模型对对比分支可见token的组织结构进行辅助约束,降低干扰、稳定训练,从而显著提升零样本跨模态检索与线性探测表现。
- Track: Audio-visual representation learning (cross-modal pretraining & retrieval)
- Core innovations: Identifies semantic noise/optimization interference when contrastive alignment and masked reconstruction share a single forward pass, forcing the contrastive branch to rely on reconstruction-oriented random visible patches; proposes TG-DP, a teacher-guided dual-path framework that decouples reconstruction and alignment into separate optimization paths and uses an alignment-suitable visibility pattern for the contrastive path; adds teacher guidance to structure visible tokens in the contrastive branch, reducing interference and stabilizing learning, yielding SOTA gains in zero-shot retrieval and strong linear-probe robustness.
GitHub
- [2026-04-15] huggingface/diffusers ⭐33329
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-04-13] Lightricks/LTX-2 ⭐5822
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
- [2026-04-13] SamurAIGPT/Generative-Media-Skills ⭐3025
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....
- [2026-04-15] apocas/restai ⭐484
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
语言大模型 / Large Language Models
GitHub
- [2026-04-14] abhigyanpatwari/GitNexus ⭐27372
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-04-14] DeusData/codebase-memory-mcp ⭐1524
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-m...
- [2026-04-14] justrach/codedb ⭐702
Zig code intelligence server and MCP toolset for AI agents. Fast tree, outline, symbol, search, read, edit, deps, snapshot, and remote GitHub repo que...
- [2026-04-14] truecourse-ai/truecourse ⭐120
AI-powered architecture analysis and code intelligence. Detects circular deps, layer violations, dead modules, and more. Web UI + CLI.
- [2026-04-14] SimplyLiz/CodeMCP ⭐85
Code intelligence for AI assistants - MCP server, CLI, and HTTP API with symbol navigation, impact analysis, and architecture mapping
多模态大模型 / Multimodal Models
arXiv
- [2026-04-09] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning 📖3 🆕NEW
- 赛道归属: 医学多模态理解与视觉推理(Medical VLM Reasoning / RL对齐)
- 核心创新点: 提出无需中间步骤标注的强化学习框架 MedVR,让医学VLM在推理时更强地“以图为证”。方法上用两项机制协同:①熵引导视觉再落地(EVR)用不确定性驱动探索,把注意力/检索导向更可能提供证据的视觉区域以减少幻觉;②基于一致性的信用分配(CCA)从多次rollout的一致性中提炼伪监督信号,实现无人工标注的过程级学习与稳定优化,从而在多医学VQA基准上显著提升推理与鲁棒性。
- Track: Medical multimodal understanding & visual reasoning (Medical VLM reasoning / RL alignment)
- Key innovations: Proposes MedVR, an annotation-free RL framework that forces medical VLMs to reason grounded in visual evidence. It combines (1) Entropy-guided Visual Regrounding (EVR), using model uncertainty to steer exploration toward evidence-bearing visual cues, and (2) Consensus-based Credit Assignment (CCA), distilling pseudo-supervision from agreement across rollouts to enable process-level learning without human intermediate annotations, improving performance and reducing hallucinations on medical VQA benchmarks.
- [2026-04-14] SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis 🆕NEW
- 赛道归属: 3D室内场景生成评测与可解释判别(3D Scene Synthesis Evaluation / Critic)
- 核心创新点: 提出符号化的布局级评估器 SceneCritic,替代对渲染视角/提示词敏感的VLM/LLM“裁判”。其关键在于构建空间本体 SceneOnto(融合3D-FRONT、ScanNet、Visual Genome先验),以可执行约束联合校验语义关系、朝向一致性与几何可行性,并输出对象/关系级的违规定位与成功放置解释;同时提供迭代式refinement测试床,对比规则critic、文本LLM critic、图像VLM critic在修正不同错误类型上的作用,从而把“评测稳定性+可诊断反馈”引入3D生成闭环。
- Track: Evaluation & explainable critics for 3D indoor scene synthesis
- Key innovations: Introduces SceneCritic, a symbolic floor-plan/layout-level evaluator that avoids the viewpoint/prompt sensitivity of render-based VLM/LLM judges. It builds a spatial ontology (SceneOnto) by aggregating priors from 3D-FRONT, ScanNet, and Visual Genome, then enforces executable constraints to jointly verify semantic relations, orientation coherence, and geometric feasibility, producing object-/relation-level diagnostics. An iterative refinement testbed compares rule-based, text-LLM, and image-VLM critics, enabling stable evaluation and actionable feedback for closing the 3D generation loop.
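One of the executable geometric-feasibility constraints can be sketched as an axis-aligned overlap check over object footprints (a hypothetical toy example; SceneOnto's actual constraints also cover semantic relations and orientation):

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test on (xmin, ymin, xmax, ymax)
    footprints: flags interpenetrating furniture. Touching
    edges do not count as overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def check_layout(objects):
    """Return violating pairs, i.e. object-level diagnostics
    rather than a single opaque scene score."""
    names = list(objects)
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if boxes_overlap(objects[names[i]], objects[names[j]])]
```

Returning named violating pairs, rather than a scalar score, is what makes the critic's feedback actionable for a refinement loop.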
- [2026-04-14] Representation geometry shapes task performance in vision-language modeling for CT enterography 🆕NEW
- Track: Multimodal understanding (medical vision-language), retrieval-augmented generation (RAG), representation analysis
- Key innovations: A first systematic study of vision-language transfer for CT enterography linking representation geometry and aggregation strategy to downstream performance: mean pooling favors disease classification while attention pooling favors cross-modal retrieval, indicating distinct representational emphases. Shows that multi-window RGB encoding (mapping different HU windows to RGB channels) matters more than adding multiplanar spatial coverage, which can even hurt classification. For report generation, demonstrates that plain fine-tuning learns ordinal relations poorly and quantifies consistent gains from RAG on ordinal metrics. A three-teacher pseudo-labeling setup enables controlled comparisons without expert annotations and establishes baselines for this modality.
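The pooling contrast above can be illustrated with a minimal numpy sketch (the token embeddings and query are toy values; the paper's encoders and pooling heads are not specified here):

```python
import numpy as np

def mean_pool(tokens):
    """Uniform aggregation: every token region contributes equally."""
    return tokens.mean(axis=0)

def attention_pool(tokens, query):
    """Query-steered aggregation: softmax over token-query similarity,
    so the pooled embedding emphasizes query-relevant regions."""
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ tokens

tokens = np.array([[1.0, 0.0], [0.0, 1.0], [0.2, 0.8]])
g_mean = mean_pool(tokens)                              # class-style summary
g_attn = attention_pool(tokens, np.array([0.0, 1.0]))   # retrieval-style summary
```

The same tokens yield different global embeddings depending on the aggregator, which is the representational distinction the study measures.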
- [2026-04-14] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts 🆕NEW
- Track: Multilingual OCR benchmarking & generalization analysis
- Key innovations: Releases GlotOCR Bench for evaluating OCR generalization across 100+ Unicode scripts, with a reproducible rendering and degradation pipeline. It renders real multilingual text with HarfBuzz glyph shaping and FreeType rasterization, covers LTR and RTL scripts, and includes manual spot-checks of rendering correctness, under both clean and degraded image conditions. The benchmark shows that performance closely tracks script coverage in pretraining: on unseen scripts, models emit noise or hallucinate characters from visually similar familiar scripts, turning script-coverage blind spots into measurable, comparable metrics.
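The paper's exact metrics aren't listed here; a standard OCR score such as character error rate (CER) is one way per-script transcripts become comparable numbers. A minimal self-contained sketch:

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein edit distance between the
    reference and hypothesis transcripts, normalized by reference
    length. Computed with a single rolling row for O(len(hyp)) memory."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            # prev holds the diagonal (i-1, j-1) cell before overwrite
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (ref[i - 1] != hyp[j - 1]))
    return d[n] / max(m, 1)
```

Averaging CER per script, clean vs. degraded, gives exactly the kind of script-coverage comparison the benchmark is built for.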
- [2026-04-14] Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks 🆕NEW
- Track: Multimodal security: physically realizable adversarial attacks & VLM robustness
- Key innovations: Proposes MSLA (Multimodal Semantic Lighting Attacks), the first physically deployable adversarial framework against VLMs built on controllable adversarial lighting. Rather than digital pixel-level perturbations targeting a single task, it attacks multimodal semantic alignment in real scenes by manipulating the illumination distribution, degrading CLIP-style zero-shot classification and inducing cross-task semantic hallucinations (captioning, VQA) in generative VLMs such as LLaVA and BLIP. Experiments demonstrate effectiveness and transferability in both digital and physical domains, filling a gap in physical-world VLM robustness evaluation.
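The optimization MSLA actually uses is not detailed above; as an assumption-heavy sketch of the general idea, one could black-box-search physically realizable lighting parameters that minimize an image-text alignment score (a toy cosine score stands in for CLIP here; all names and the parameter grid are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_lighting(img, gain, tint):
    """Physically plausible global relighting: brightness gain plus an
    additive tint, clipped to the valid [0, 1] range."""
    return np.clip(img * gain + tint, 0.0, 1.0)

def alignment(img, text_dir):
    """Toy stand-in for a CLIP-style image-text score: cosine between
    the flattened image and a fixed 'text' direction."""
    v = img.ravel()
    return float(v @ text_dir / (np.linalg.norm(v) * np.linalg.norm(text_dir) + 1e-8))

img = rng.random((8, 8, 3))
text_dir = rng.random(8 * 8 * 3)
base = alignment(img, text_dir)

# black-box grid search over the lighting parameters (the attack surface)
best = min(((alignment(apply_lighting(img, g, t), text_dir), g, t)
            for g in (0.3, 0.6, 1.0, 1.5) for t in (-0.3, 0.0, 0.3)),
           key=lambda s: s[0])
```

The point of the sketch: the search variables are lighting conditions one could actually set up with a projector or lamp, not per-pixel perturbations.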
- [2026-04-14] Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting 🆕NEW
- Track: Multimodal financial chart understanding & forecasting evaluation
- Key innovations: Builds a multi-scale candlestick-chart dataset and a standardized evaluation suite to test whether VLMs truly read candlestick patterns and can fuse long- and short-horizon signals. The design mirrors human multi-scale analysis via multi-horizon chart inputs, combines confusion-matrix diagnostics with finance-specific metrics such as information-coefficient (IC) time series, and includes an XGBoost baseline on engineered temporal features to disentangle genuine visual understanding from statistical curve-fitting. Results quantify key limitations: most VLMs perform well mainly in one-sided trends, carry systematic prediction biases, and are insensitive to the forecast horizon stated in the prompt.
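The rank information coefficient (IC) mentioned above is, per period, the Spearman correlation between predicted and realized cross-sectional returns; a self-contained sketch (assuming no ties among values):

```python
import numpy as np

def rank(x):
    """Integer ranks via double argsort (assumes no ties)."""
    order = np.argsort(x)
    r = np.empty_like(order)
    r[order] = np.arange(len(x))
    return r

def information_coefficient(pred, realized):
    """Rank IC for one period: Pearson correlation of the ranks of
    predicted vs. realized returns across assets."""
    rp = rank(np.asarray(pred)).astype(float)
    rr = rank(np.asarray(realized)).astype(float)
    rp -= rp.mean()
    rr -= rr.mean()
    return float(rp @ rr / np.sqrt((rp @ rp) * (rr @ rr)))

ic = information_coefficient([0.02, -0.01, 0.03, 0.00],
                             [0.01, -0.02, 0.05, -0.01])
# perfectly rank-aligned predictions give IC = 1.0
```

Computing this per evaluation period yields the IC time series the benchmark reports.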
- [2026-04-14] PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning 🆕NEW
- Track: Text-to-image alignment RL (RLHF/RLAIF), annotation-free reward construction
- Key innovations: Proposes PromptEcho, an annotation-free, training-free reward for T2I reinforcement learning: it scores a generated image by the token-level cross-entropy of a frozen VLM predicting the original prompt, directly extracting the fine-grained image-text alignment knowledge embedded in VLM pretraining. The signal is denser than a CLIP score and scales automatically with stronger VLMs. Also introduces DenseAlignBench, built from concept-dense captions, to stress-test prompt following, and shows large gains across models and benchmarks, outperforming inference-style scoring with the same VLM.
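The reward can be sketched directly from the description: negative token-level cross-entropy of a frozen VLM teacher-forced on the original prompt tokens. The toy logits below stand in for a real VLM conditioned on the generated image:

```python
import numpy as np

def prompt_echo_reward(logits, prompt_ids):
    """Reward = negative mean token-level cross-entropy of a frozen VLM
    predicting the original prompt (teacher-forced).
    logits: (T, V) pre-softmax scores; prompt_ids: (T,) target ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(prompt_ids)), prompt_ids].mean()
    return -float(nll)

# a VLM that puts high probability on the prompt tokens earns a higher reward
good = np.full((3, 5), -4.0)
good[np.arange(3), [1, 2, 0]] = 4.0    # confident in the prompt tokens
bad = np.zeros((3, 5))                 # uniform over the vocabulary
```

The per-token structure is what makes the signal denser than a single scalar CLIP similarity.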
- [2026-04-14] Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs 🆕NEW
- Track: Multimodal security: semantic jailbreaks & automated red-teaming for VLMs
- Key innovations: Introduces MemJack, a memory-augmented multi-agent jailbreak framework that exploits the semantic structure of natural images rather than pixel-level perturbations. Cooperating agents map visual entities to malicious intents and craft adversarial prompts through multi-angle visual-semantic camouflage; Iterative Nullspace Projection (INLP) acts as a geometric filter to bypass premature latent-space refusals; and a persistent multimodal experience memory transfers successful strategies across images, raising attack success rate on new images and keeping multi-turn dialogues coherent. Also releases MemJack-Bench, a dataset of 113k+ interaction trajectories, as a reproducible empirical basis for defense and alignment research.
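How MemJack instantiates INLP is not given here; the classical nullspace-projection step it builds on removes a chosen direction (e.g. an assumed "refusal" axis) from latent activations:

```python
import numpy as np

def nullspace_projector(directions):
    """Projection matrix onto the orthogonal complement (nullspace) of
    the row-stacked directions: P = I - D^T (D D^T)^{-1} D removes all
    components along D's rows."""
    d = np.atleast_2d(np.asarray(directions, dtype=float))
    return np.eye(d.shape[1]) - d.T @ np.linalg.inv(d @ d.T) @ d

refusal = np.array([1.0, 0.0, 0.0])    # assumed refusal direction
P = nullspace_projector(refusal)
h = np.array([0.7, 0.2, -0.1])         # a latent activation
h_proj = P @ h                         # refusal component removed
```

In the attack setting, projecting activations off such a direction is what "geometrically filtering" early refusals amounts to; iterating with re-estimated directions gives the iterative variant.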
- [2026-04-14] Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models 🆕NEW
- Track: Video-LLM hallucination mitigation via inference-time decoding optimization
- Key innovations: Identifies anchor-frame dominance, a structural decoder-side bias in Video-LLMs where attention mass concentrates on a few frames during generation, skewing temporal evidence aggregation and inducing hallucinations. Proposes Decoder-side Temporal Rebalancing (DTR), a training-free method that selectively recalibrates visual attention in mid-to-late decoder layers, suppressing anchor over-dominance and amplifying under-attended frames. DTR improves hallucination robustness across Video-LLM families at low overhead, without modifying visual encoders or adding auxiliary models, while preserving general video-understanding performance.
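A one-pass version of the rebalancing idea (cap the attention mass of dominant frames, renormalize so neglected frames gain share) can be sketched as follows; the cap value and grouping are assumptions, and after renormalization a dominant frame may still sit above the nominal cap:

```python
import numpy as np

def rebalance_frame_attention(attn, frame_ids, cap=0.4):
    """attn: per-visual-token attention weights (sum to 1); frame_ids:
    which frame each token belongs to. Scale down any frame whose total
    mass exceeds `cap`, then renormalize, shifting share toward
    under-attended frames."""
    attn = np.asarray(attn, dtype=float).copy()
    ids = np.asarray(frame_ids)
    for f in set(frame_ids):
        m = ids == f
        mass = attn[m].sum()
        if mass > cap:
            attn[m] *= cap / mass
    return attn / attn.sum()

# frame 0 holds 0.8 of the mass (an "anchor"); frames 1 and 2 are starved
w = rebalance_frame_attention([0.5, 0.3, 0.1, 0.1], [0, 0, 1, 2], cap=0.4)
```

DTR applies this kind of recalibration only in selected mid-to-late decoder layers, which is why no retraining is needed.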
- [2026-04-14] Cross-Attentive Multiview Fusion of Vision-Language Embeddings 🆕NEW
- Track: 3D open-vocabulary semantic/instance understanding via multiview fusion
- Key innovations: Proposes CAMFusion, a multiview transformer that cross-attends over vision-language embeddings from multiple viewpoints and fuses them into a unified per-instance 3D representation, replacing back-project-and-average or single-view selection heuristics that underuse complementary multiview information. Adds multiview consistency as a self-supervision signal alongside supervised class losses, yielding clear gains in 3D semantic and instance classification, including zero-shot and out-of-domain settings, through more robust multiview-aligned representations.
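The fusion step can be sketched as single-head scaled dot-product cross-attention from one instance-level query over per-view embeddings (toy vectors; CAMFusion's actual architecture is multi-layer and learned):

```python
import numpy as np

def fuse_views(view_embs, query):
    """Single-head cross-attention: an instance-level query attends over
    per-view vision-language embeddings; returns (weights, fused vector),
    the attention-weighted fusion across viewpoints."""
    V = np.asarray(view_embs, dtype=float)        # (n_views, d)
    scores = V @ query / np.sqrt(V.shape[1])      # scaled dot-product
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w, w @ V

views = np.array([[1.0, 0.0],     # view well aligned with the query
                  [0.0, 1.0],     # occluded / uninformative view
                  [0.9, 0.1]])
w, fused = fuse_views(views, query=np.array([1.0, 0.0]))
# informative views dominate the fused per-instance representation
```

Unlike plain averaging, the weights let the model discount occluded or uninformative viewpoints per instance.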
GitHub
- [2026-04-14] Blaizzy/mlx-vlm ⭐4341
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-04-14] waybarrios/vllm-mlx ⭐834
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-04-13] zli12321/Vision-Language-Models-Overview ⭐561
A frontier collection and survey of vision-language model papers and models. Continuously updated.
- [2026-04-14] Ice-wave/AttentionLens-LVLM ⭐87
A lightweight and extensible toolkit for visualizing attention flow in Large Vision-Language Models (LVLMs). It renders token-to-token attention maps,...
- [2026-04-13] uni-medical/GMAI-VL ⭐86
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI.
强化学习 / Reinforcement Learning
arXiv
- [2026-04-09] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning 📖3 🆕NEW
- Track: Medical multimodal understanding & visual reasoning (medical VLM reasoning / RL alignment)
- Key innovations: Proposes MedVR, an annotation-free RL framework that trains medical VLMs to ground their reasoning in visual evidence, with no intermediate-step labels required. It combines (1) Entropy-guided Visual Regrounding (EVR), which uses model uncertainty to steer exploration toward evidence-bearing visual regions and curb hallucination, and (2) Consensus-based Credit Assignment (CCA), which distills pseudo-supervision from agreement across multiple rollouts, enabling stable, process-level learning without human annotation and improving reasoning accuracy and robustness across medical VQA benchmarks.
- [2026-04-08] Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions 📖2 🆕NEW
- Track: Online RL for Android UI agents / agentic RL training efficiency
- Core innovation: Shifts online training from Single-State-Single-Action (SSSA) to Single-State-Multiple-Actions (SSMA): at each expensive emulator state, multiple candidate actions are sampled and evaluated by a learned Q-value critic without extra environment interaction. A process reward model makes the critic a more reliable "coach", and a group-wise advantage estimator baselined on the critic's mean output stabilizes policy updates, yielding markedly better sample and wall-clock efficiency along with higher success rates.
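The group-wise advantage estimator described above reduces, in its simplest form, to baselining each candidate action's critic value against the group mean (a sketch; the paper's estimator may differ in normalization details):

```python
import numpy as np

def group_advantages(q_values):
    """SSMA-style group baseline: for the candidate actions sampled at
    one emulator state, advantage = Q(s, a_i) minus the group's mean
    critic value, so better-than-average candidates are reinforced."""
    q = np.asarray(q_values, dtype=float)
    return q - q.mean()

# three candidate actions scored by the critic at the same state
adv = group_advantages([0.9, 0.4, 0.2])
```

The key efficiency point: all three candidates share one expensive emulator state, with the critic (not the emulator) ranking them.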
- [2026-04-13] Robust Adversarial Policy Optimization Under Dynamics Uncertainty 📖1 🆕NEW
- Track: Robust RL / adversarial policy optimization under dynamics uncertainty
- Core innovation: Casts distributionally robust RL in a dual formulation that makes the robustness-performance trade-off explicit, avoiding the blind spots and over-conservatism of purely surrogate adversaries. At the trajectory level, a dual temperature parameter approximated by an adversarial network yields stable worst-case rollouts within a divergence constraint; at the model level, Boltzmann reweighting over a dynamics ensemble biases sampling toward policy-adverse environments in a policy-sensitive way rather than drawing uniformly. The two decoupled mechanisms complement each other, balancing tractability, stability, and generalization to out-of-distribution dynamics.
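The model-level Boltzmann reweighting can be sketched as follows, assuming each ensemble member k is summarized by the current policy's return J_k under that member (the temperature and min-shift are implementation choices, not from the paper):

```python
import numpy as np

def boltzmann_model_weights(returns, temperature=2.0):
    """Weight each dynamics model in the ensemble by exp(-J_k / tau),
    so models under which the current policy performs worst are sampled
    more often than under uniform draws."""
    j = np.asarray(returns, dtype=float)
    w = np.exp(-(j - j.min()) / temperature)   # shift by min for stability
    return w / w.sum()

# policy returns under three ensemble dynamics models
w = boltzmann_model_weights([10.0, 6.0, 9.0])
# the model with the lowest return (6.0) gets the largest sampling weight
```

As the temperature grows the weighting approaches uniform sampling; as it shrinks it approaches pure worst-case model selection, which is the explicit robustness-performance dial.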
- [2026-04-09] SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility 📖1 🆕NEW
- Track: RL alignment for LLMs / multi-objective reward weighting & curriculum learning
- Core innovation: SPARD is a self-paced curriculum framework that jointly models reward dynamics (non-stationarity driven by learning progress) and data utility (how unevenly different data contribute to different capability dimensions). It adjusts multi-objective reward weights and per-sample importance online, keeping the optimization intent synchronized with whichever data is currently most useful, thereby overcoming the mismatch of fixed reward weights and the data-heterogeneity problem in multi-objective alignment and lifting capability across domains.
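SPARD's exact weighting rule isn't specified here; one naive instantiation of "up-weight stalled objectives" might invert recent per-objective progress (the floor value and the inverse rule are assumptions for illustration only):

```python
import numpy as np

def self_paced_weights(recent_gains, floor=0.05):
    """Up-weight reward objectives whose recent learning progress has
    stalled, so training attends to the currently hardest capabilities.
    `recent_gains` are per-objective reward deltas over a recent window."""
    g = np.maximum(np.asarray(recent_gains, dtype=float), floor)
    w = 1.0 / g
    return w / w.sum()

# objective 1 has stalled (0.02, clipped to the 0.05 floor)
w = self_paced_weights([0.30, 0.02, 0.10])
```

Recomputing such weights online is what makes the curriculum "self-paced": it tracks the non-stationary reward dynamics instead of freezing a weight vector up front.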
- [2026-04-14] Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training 🆕NEW
- Track: RL for search agents / reward design for retrieval trajectories without gold supervision
- Core innovation: Cycle-Consistent Search (CCS) replaces gold answers with a proxy reward based on question reconstructability: a high-quality search trajectory should losslessly encode the question's intent, so reconstructing the original question from the trajectory yields a scalable reward signal without ground-truth supervision. To block the trivial lexical leakage that cycle consistency invites, it imposes information bottlenecks (dropping the final response and NER-masking queries) that force reconstruction to rely on retrieved observations and trajectory structure, so the reward reflects informational adequacy rather than surface cues.
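The bottleneck and reward can be sketched with simple string operations (regex entity masking and token-overlap F1 are stand-ins for the paper's NER masker and reconstruction scorer):

```python
import re

def mask_entities(text, entities):
    """Information bottleneck: hide named entities whose surface forms
    would let the reconstructor copy the question back verbatim."""
    for e in entities:
        text = re.sub(re.escape(e), "[ENT]", text)
    return text

def reconstruction_reward(original_q, reconstructed_q):
    """Token-overlap F1 between original and reconstructed question,
    a crude proxy for 'the trajectory preserved the question's intent'."""
    a = set(original_q.lower().split())
    b = set(reconstructed_q.lower().split())
    common = len(a & b)
    if common == 0:
        return 0.0
    p, r = common / len(b), common / len(a)
    return 2 * p * r / (p + r)

masked = mask_entities("when did Marie Curie win the Nobel Prize",
                       ["Marie Curie", "Nobel Prize"])
reward = reconstruction_reward("when did Marie Curie win the Nobel Prize",
                               "when did Marie Curie receive the Nobel Prize")
```

With the entities masked out of the visible queries, a high reconstruction score can only come from information the trajectory actually retrieved.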
- [2026-04-14] Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs 🆕NEW
- Track: Multimodal understanding & tool-augmented reasoning / representing visual tool outputs
- Core innovation: Argues that the main bottleneck is representational misalignment: the dense pixel-level outputs of visual tools (depth, optical flow, correspondences) do not match LLMs' language-native reasoning interface. Proposes Perception Programs (P²), a training-free, model-agnostic layer that rewrites tool outputs into compact, structured, language-native cue summaries that models can parse and reason over symbolically, substituting representation rewriting for more tool calls, bigger models, or extra training, with sizable gains on perception-reasoning tasks that also transfer to smaller models.
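As a toy example of rewriting a dense tool output into a language-native cue, a depth map plus object masks can be compressed to a single ordering statement (the names and output format are invented for illustration):

```python
import numpy as np

def depth_to_cues(depth, masks):
    """Rewrite a dense depth map into a compact, language-native cue:
    mean depth per object mask, ordered nearest-first. `depth` is an
    (H, W) array; `masks` maps object name -> boolean mask."""
    ranked = sorted((float(depth[m].mean()), name) for name, m in masks.items())
    return "nearest to farthest: " + ", ".join(name for _, name in ranked)

depth = np.array([[1.0, 1.2],
                  [4.0, 4.5]])
masks = {"mug": depth < 2.0, "shelf": depth >= 2.0}
summary = depth_to_cues(depth, masks)
```

A short string like this is something an LLM can parse and reason over symbolically, whereas the raw depth array is not.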
- [2026-04-14] FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators 🆕NEW
- Track: Robot RL / whole-body control for fast dexterous grasping with mobile manipulators
- Core innovation: FastGrasp is a learning-based whole-body control framework for high-speed grasping that decouples grasp proposal from execution in a two-stage RL design: a conditional VAE generates diverse grasp candidates from object point clouds, then RL selects the optimal candidate and executes it with coordinated base-arm-hand control. Real-time tactile feedback enables online adjustment under collisions and impacts, improving stability at high speeds and generalization across objects, with sim-to-real experiments validating transferability.
- [2026-04-14] QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence 🆕NEW
- Track: Domain-specific long-horizon deep-search agents / SFT+RL post-training with tool use
- Core innovation: Builds a full data-training-evaluation pipeline for long-horizon deep search in the Chinese medical domain: long-horizon multi-hop search data are synthesized by combining a medical knowledge graph with live online exploration, easing domain data scarcity; a staged SFT+RL recipe progressively strengthens planning, tool invocation, and reflection while preserving search efficiency; and an expert-verified domain benchmark closes the loop with reliable measurement, steadily pushing the domain performance ceiling.
- [2026-04-14] RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair 🆕NEW
- Track: Machine unlearning & model editing / interactive, prompt-driven unlearning at inference time
- Core innovation: Introduces Interactive Machine Unlearning (IMU), letting end users trigger targeted forgetting with natural-language instructions at inference time. RePAIR realizes IMU with a three-role architecture: a watchdog detects unlearning intent, a surgeon generates the repair procedure, and the patient model updates itself. The core STAMP method performs training-free, single-sample unlearning via closed-form pseudoinverse updates that steer MLP activations toward a refusal subspace; a low-rank approximation cuts complexity substantially, enabling efficient on-device unlearning while preserving retained capabilities.
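The closed-form pseudoinverse update can be sketched in a few lines: solve for the minimal weight change that maps a trigger activation x to a target (e.g. refusal-subspace) output while leaving orthogonal directions untouched. This is a generic rank-1 edit of the form delta = (target - W x) x⁺, not necessarily STAMP's exact formulation:

```python
import numpy as np

def stamp_edit(W, x, target):
    """Closed-form weight repair: update W so that W @ x becomes
    `target`, using the Moore-Penrose pseudoinverse of x. Directions
    orthogonal to x are unaffected, limiting collateral damage."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    err = np.asarray(target, dtype=float).reshape(-1, 1) - W @ x
    return W + err @ np.linalg.pinv(x)

W = np.eye(2)                                   # toy MLP weight
W2 = stamp_edit(W, x=[1.0, 0.0], target=[0.0, 1.0])
# the trigger activation is now steered to the target;
# the orthogonal direction [0, 1] is mapped as before
```

A single outer product per edit is what makes this cheap enough for on-device use; the paper's low-rank variant reduces the cost further.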
- [2026-04-14] Teaching LLMs Human-Like Editing of Inappropriate Argumentation via Reinforcement Learning 🆕NEW
- Track: RL for text-editing alignment / human-like, compositional sentence-level edits
- Core innovation: Addresses LLM editing that scatters changes and drifts from the original meaning by training models with RL to produce human-like edits: self-contained sentence-level units that can be accepted or rejected independently. Uses Group Relative Policy Optimization with a multi-component reward that jointly optimizes edit-level semantic similarity, fluency, and edit-pattern conformity together with argument-level appropriateness, so multi-round editing improves appropriateness with less semantic drift, approaching full-rewrite quality while staying controllable and composable.
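The multi-component reward plus a GRPO-style group-relative advantage can be sketched as below (the component weights and toy scores are assumptions; the paper's reward includes an edit-pattern-conformity term as well):

```python
import numpy as np

def edit_reward(similarity, fluency, appropriateness,
                weights=(0.4, 0.2, 0.4)):
    """Multi-component reward: weighted sum of edit-level semantic
    similarity, fluency, and argument-level appropriateness."""
    return float(np.dot(weights, (similarity, fluency, appropriateness)))

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: standardize each sampled
    edit's reward against its group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# three candidate edits sampled for the same sentence
rewards = [edit_reward(0.9, 0.8, 0.70),
           edit_reward(0.5, 0.9, 0.40),
           edit_reward(0.8, 0.7, 0.95)]
adv = grpo_advantages(rewards)
```

Because advantages are computed within the group of edits for one sentence, no learned value critic is needed, which is the usual appeal of GRPO here.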
GitHub
- [2026-04-14] huggingface/trl ⭐18047
Train transformer language models with reinforcement learning.
- [2026-04-14] utiasDSL/gym-pybullet-drones ⭐1962
PyBullet Gymnasium environments for single and multi-agent reinforcement learning of quadcopter control
- [2026-04-14] radixark/miles ⭐1085
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-04-14] erwinmsmith/SOMAS ⭐503
A Trusted Human-Multi-Agent Reinforcement Learning Interaction Framework
- [2026-04-14] nvidia-cosmos/cosmos-rl ⭐396
Cosmos-RL is a flexible and scalable Reinforcement Learning framework specialized for Physical AI applications.
HuggingFace Datasets
- [2026-04-06] hysong/MentalBench
MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models 🌟 Overview
MentalBench is a c...
- [2026-04-14] llamaindex/ParseBench
ParseBench
Quick links: [🌐 Website] [📜 Paper] [💻 Code] ParseBench is a benchmark for evaluating document parsing systems on real-world ent...
- [2026-02-22] YennNing/MC-Search 🆕NEW
Dataset Card for MC-Search
Paper Information | Dataset Description | Dataset Usage | Data Format | Knowledge Base | Citation
Paper ...
HuggingFace Spaces
Generated automatically by Daily AI Digest Agent at 2026-04-15 02:25:55