AI 每日进展速报 / Daily AI Digest - 2026-03-31
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-03-24] PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference 📖2 🆕NEW
- 赛道归属: 个性化文生图模型服务 / 推理优化(检索路由 + 量化压缩)
- 核心创新点: 提出统一框架将“个性化checkpoint选择”和“后训练量化”用同一信号——触发词(trigger token)——贯通:1) Check-in 通过意图感知的混合检索 + 基于LLM的上下文重排,在意图仍歧义时才发起最小化澄清问答,并将提示词重写为插入所选checkpoint的规范触发词以降低误路由;2) Trigger-Aware Quantization (TAQ) 在跨注意力中做触发词感知的混合精度,显式保护触发条件下的K/V行及其注意力权重,同时对其余通路激进量化,从而在不破坏脆弱个性化表征的前提下获得更优压缩-画质折中,支撑大规模个性化模型仓库的高效部署。
- Track: Personalized text-to-image serving / Inference optimization (routing + quantization)
- Core innovations: Unifies personalized checkpoint selection and post-training quantization via a shared signal—the checkpoint trigger token: (1) Check-in performs intent-aligned routing with intent-aware hybrid retrieval plus LLM reranking over checkpoint context, asking a brief clarification only when ambiguity remains, then rewrites the prompt by inserting the selected canonical trigger; (2) Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned K/V rows (and attention weights) while aggressively quantizing other paths, achieving a better compression–quality trade-off without degrading fragile personalized concept representations.
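The trigger-protected mixed-precision idea can be sketched in a few lines. This is a toy NumPy illustration under assumed shapes, not the paper's code: rows tied to the trigger token keep full precision while all other rows take an aggressive quantize-dequantize round trip.

```python
import numpy as np

def taq_quantize(W, protected_rows, bits=4):
    """Sketch of the TAQ idea (illustrative): K/V rows tied to the
    personalization trigger token stay in full precision, while all
    other rows get aggressive per-row symmetric integer quantization."""
    qmax = 2 ** (bits - 1) - 1
    Wq = np.empty_like(W)
    for i, row in enumerate(W):
        if i in protected_rows:                  # trigger-conditioned row
            Wq[i] = row
        else:                                    # quantize-dequantize round trip
            scale = max(np.abs(row).max() / qmax, 1e-8)
            Wq[i] = np.clip(np.round(row / scale), -qmax, qmax) * scale
    return Wq

W = np.random.default_rng(0).normal(size=(8, 16)).astype(np.float32)
Wq = taq_quantize(W, protected_rows={2, 5})
```

Protected rows survive bit-exact; the rest trade a bounded rounding error for a smaller footprint, which is the compression-quality lever the paper tunes.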
- [2026-03-30] TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark 🆕NEW
- 赛道归属: 图像编辑取证 / 伪造定位与检测基准(数据集与评测)
- 核心创新点: 构建面向“文本引导修复(inpainting)伪造”的更新数据集与基准TGIF2,以覆盖新一代生成式修复模型与更贴近真实攻击面:1) 引入FLUX.1生成的修复编辑样本,检验取证方法对新分布的泛化;2) 加入随机非语义mask以揭示模型/方法对“物体区域”的偏置(在FR全再生场景下尤其明显);3) 系统评测IFL(定位)与SID(检测)并纳入“生成式超分辨率”作为后处理攻击,证明常见增强操作会显著抹除取证痕迹;从而把“FR图像可定位性、跨模型泛化、后处理鲁棒性”变成可量化对比的基准问题。
- Track: Image editing forensics / Benchmarking for forgery localization & detection
- Core innovations: Introduces TGIF2, an updated benchmark for text-guided inpainting forgeries that targets modern threat models: (1) adds edits generated by FLUX.1 to stress-test cross-generator generalization; (2) includes random non-semantic masks to expose object-centric bias in localization, especially for fully-regenerated (FR) images; (3) evaluates both IFL (localization) and SID (detection) and incorporates generative super-resolution as a post-processing attack, showing common enhancement can erase forensic traces—turning FR localization, generalization, and post-processing robustness into measurable benchmark axes.
- [2026-03-30] GEditBench v2: A Human-Aligned Benchmark for General Image Editing 🆕NEW
- 赛道归属: 图像编辑评测 / 人类对齐的基准与自动评审模型
- 核心创新点: 提出更贴近真实用户需求的通用图像编辑评测体系:1) GEditBench v2以1200条真实用户指令覆盖23类任务,并加入开放集(open-set)类别以评估超出预定义任务的泛化编辑能力;2) 提出PVC-Judge作为“视觉一致性(身份/结构/语义连贯)”的开源成对偏好评审模型,核心在于两条“区域解耦”的偏好数据合成管线,使评审更聚焦编辑前后应保持的区域与属性;3) 构建VCReward-Bench专家标注偏好对用于校准评审与人类一致性,从而让编辑评测从传统指标转向可复现、可扩展的人类一致性判别。
- Track: Image editing evaluation / Human-aligned benchmarks and automatic judges
- Core innovations: Builds a more user-realistic evaluation stack for general image editing: (1) GEditBench v2 contains 1,200 real user queries across 23 tasks plus an open-set category to test out-of-distribution instructions; (2) proposes PVC-Judge, an open-source pairwise visual-consistency assessor (identity/structure/semantic coherence) trained with two region-decoupled preference data synthesis pipelines to better isolate “what must stay consistent”; (3) introduces VCReward-Bench with expert-labeled preference pairs to validate human alignment, shifting evaluation from weak standard metrics to scalable, reproducible human-consistency judgments.
- [2026-03-30] EdgeDiT: Hardware-Aware Diffusion Transformers for Efficient On-Device Image Generation 🆕NEW
- 赛道归属: 文生图 / 端侧生成与硬件感知模型加速(DiT优化)
- 核心创新点: 面向移动NPU的硬件感知DiT结构优化:1) 通过硬件感知优化框架定位在移动数据流上代价高的结构冗余,并进行结构级裁剪/精简,而非仅做通用剪枝或算子替换;2) 在保持DiT可扩展性与表达能力的前提下,实现参数、FLOPs与端侧时延的系统性下降,并在FID-时延维度取得更优Pareto前沿;3) 明确以Qualcomm Hexagon/Apple ANE等NPU为目标,使“可离线、低时延”的端侧高保真生成成为可落地的模型族设计范式。
- Track: Text-to-image / On-device generation & hardware-aware acceleration (DiT)
- Core innovations: Hardware-aware DiT redesign for mobile NPUs: (1) uses a hardware-aware optimization framework to identify and remove structural redundancies that are particularly expensive under mobile dataflows, beyond generic pruning/operator tweaks; (2) reduces parameters/FLOPs/latency while preserving DiT scaling behavior and expressivity, achieving a better FID–latency Pareto frontier; (3) targets real NPUs (Qualcomm Hexagon, Apple ANE), providing a practical blueprint for responsive offline on-device high-fidelity generation.
- [2026-03-30] Rethinking Structure Preservation in Text-Guided Image Editing with Visual Autoregressive Models 🆕NEW
- 赛道归属: 图像编辑 / 视觉自回归(VAR)编辑与结构保持
- 核心创新点: 从VAR中间表征分布出发重构“可编辑区域定位 + 结构保持”的编辑机制:1) 提出由粗到细的token定位策略,逐步收敛可编辑token集合,在编辑强度与背景保真之间实现可控权衡;2) 识别VAR中间层的结构相关特征,并设计特征注入(feature injection)以在生成过程中显式约束结构一致性,而非仅依赖损失或后处理;3) 引入强化学习自适应注入策略,学习不同尺度/层的注入比例,实现对“编辑一致性 vs 编辑遵循度”的自动调参优化,提升局部与全局编辑的结构稳定性。
- Track: Image editing / Visual autoregressive (VAR) editing with structure preservation
- Core innovations: Reframes VAR-based editing around intermediate feature distributions to improve editable-token localization and structural consistency: (1) a coarse-to-fine token localization strategy that refines editable regions to balance edit fidelity and background preservation; (2) identifies structure-related intermediate features in VAR and introduces a feature-injection mechanism to explicitly enforce source–edit structural consistency during generation; (3) an RL-based adaptive injection scheme learns layer-/scale-specific injection ratios to jointly optimize instruction fidelity and structure preservation across local and global edits.
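The scale-wise feature injection can be pictured as a per-scale blend. A minimal sketch: in the paper the per-scale ratios come from the RL policy, here they are plain inputs, and the "features" are toy arrays.

```python
import numpy as np

def inject_structure(edit_feats, src_feats, ratios):
    """Illustrative scale-wise feature injection for structure preservation:
    each VAR scale blends the source image's structural features back into
    the edited generation path, with a per-scale injection ratio."""
    return [(1 - r) * e + r * s
            for e, s, r in zip(edit_feats, src_feats, ratios)]

edit = [np.ones((4, 4)), np.ones((8, 8))]      # toy features at two scales
src = [np.zeros((4, 4)), np.zeros((8, 8))]
out = inject_structure(edit, src, ratios=[0.0, 0.75])
```

A ratio of 0 leaves a scale fully editable, while a ratio near 1 pins that scale to the source structure, which is exactly the consistency-vs-adherence dial the RL scheme automates.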
- [2026-03-30] Integrating Multimodal Large Language Model Knowledge into Amodal Completion 🆕NEW
- 赛道归属: 图像补全 / 遮挡补全(Amodal Completion) + MLLM知识注入
- 核心创新点: 提出AmodalCG,将多模态大模型的常识/物理知识显式引入遮挡补全生成闭环:1) 先评估遮挡程度,仅在重遮挡时触发MLLM推理,降低不必要的语言先验干扰与成本;2) 让MLLM同时推断“缺失范围(extent)”与“缺失内容(content)”并作为条件指导生成,而不是只在分割阶段使用知识;3) 通过视觉生成模型的迭代细化机制对MLLM可能不准的指导进行纠偏与收敛,提高真实场景下对强遮挡目标的补全可靠性。
- Track: Image completion / Amodal completion with MLLM knowledge injection
- Core innovations: Proposes AmodalCG to explicitly inject MLLM commonsense/physical knowledge into the amodal completion generation loop: (1) estimates occlusion severity and invokes MLLM guidance only for heavy occlusions to reduce unnecessary priors and cost; (2) uses MLLM to reason about both missing-region extent and content as explicit generation conditions rather than only aiding segmentation; (3) employs an iterative refinement process in a visual generative model to correct imperfect completions caused by inaccurate MLLM guidance, improving robustness on real-world heavily occluded cases.
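The severity gate is the simplest part of the pipeline to sketch. The threshold and step names below are illustrative assumptions, not AmodalCG's API:

```python
def amodal_route(visible_frac, heavy_thresh=0.5):
    """Sketch of AmodalCG's occlusion-severity gate: lightly occluded
    objects go straight to the generator, while heavily occluded ones
    first query the MLLM for missing extent and content, then refine."""
    occlusion = 1.0 - visible_frac
    if occlusion >= heavy_thresh:
        return ["mllm_extent", "mllm_content", "generate", "refine"]
    return ["generate"]

light = amodal_route(0.9)   # 10% occluded: no MLLM call needed
heavy = amodal_route(0.2)   # 80% occluded: full knowledge-guided loop
```

Gating keeps MLLM cost and unwanted language priors out of the easy cases, which is the stated motivation for step 1.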
- [2026-03-30] ObjectMorpher: 3D-Aware Image Editing via Deformable 3DGS Models 🆕NEW
- 赛道归属: 3D感知图像编辑 / 3DGS可变形编辑与交互式操控
- 核心创新点: 用“实例级3D提升 + 可变形3DGS”把2D编辑歧义转化为几何可控操作:1) 将目标实例通过image-to-3D生成提升为可编辑的3D Gaussian Splatting表示,实现快速、身份保持的3D操控;2) 以用户拖拽控制点为交互接口,采用图结构的非刚性形变并结合ARAP约束,保证形状/姿态变化的物理合理性与局部刚性;3) 通过组合式扩散模块做光照/颜色/边界一致性融合,解决3D编辑后回贴到原图的域差与接缝问题,从而在效率与可控性上优于纯2D拖拽与重优化式3D方法。
- Track: 3D-aware image editing / Deformable 3D Gaussian Splatting (3DGS) editing
- Core innovations: Converts ambiguous 2D edits into geometry-grounded operations via instance-level 3D lifting and deformable 3DGS: (1) lifts target instances with an image-to-3D generator into editable 3D Gaussian Splatting for fast, identity-preserving manipulation; (2) provides drag-based control points and applies graph-based non-rigid deformation with ARAP constraints for physically plausible pose/shape changes; (3) uses a composite diffusion module to harmonize lighting/color/boundaries for seamless reintegration, outperforming 2D drag and optimization-heavy 3D-aware baselines in controllability and efficiency.
- [2026-03-30] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization 🆕NEW
- 赛道归属: 多图故事可视化 / 图像序列生成(逻辑一致性建模)
- 核心创新点: 将“视觉逻辑(角色-动作-场景的感知与因果连贯)”从隐式期望变为显式建模目标:1) 设计多智能体系统分别负责角色设定落地、因果链抽取、以及跨图一致性验证,把结构化故事规划与图像生成解耦并闭环约束;2) 通过一致性校验机制减少动作断裂、叙事碎片化等典型失败模式,提升多图序列的可读性与因果连贯;3) 构建LogicTale基准,提供强调因果推理与可解释标注的评测资源,并配套自动+人工协议以同时衡量逻辑与感知质量。
- Track: Multi-image story visualization / Image-sequence generation with logical consistency
- Core innovations: Makes “visual logic” (perceptual + causal coherence across characters/actions/scenes over time) an explicit objective rather than an emergent property: (1) a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, bridging structured planning and image generation; (2) consistency verification reduces disjoint actions and fragmented narratives, improving readability and causal flow in image sequences; (3) introduces LogicTale, a benchmark with rich causal/interpretability annotations plus automatic and human protocols to evaluate both visual logic and perceptual quality.
- [2026-03-30] AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation 🆕NEW
- 赛道归属: 文生图评测 / 学术插图生成的视觉-逻辑一致性基准
- 核心创新点: 用VQA驱动的分层问答评测替代“直接用VLM整体打分”的不稳定方案:1) 构建AIBench,将论文方法部分抽象为逻辑图,并据此设计四个层级的问题体系,从局部组件到全局流程逐级核验插图是否与文本方法一致;2) 以VQA形式把评测拆解为可验证子命题,降低对评审VLM“长文本+复杂图理解”的oracle能力依赖,提高评测细粒度与可诊断性;3) 同时用VLM评估美学质量,揭示逻辑正确性与美学往往难以同时优化,并指出测试时扩展(test-time scaling)可同时提升两类能力。
- Track: Text-to-image evaluation / Visual-logical consistency for academic illustration generation
- Core innovations: Proposes a VQA-driven, hierarchical evaluation to avoid unreliable holistic VLM judging: (1) AIBench derives a logic diagram from paper method sections and designs four levels of questions to verify alignment from components to full pipelines; (2) decomposes evaluation into checkable sub-claims via VQA, reducing dependence on an “oracle” judge’s long-text/complex-figure understanding while improving granularity and diagnosability; (3) pairs logic evaluation with VLM-based aesthetics assessment, exposing the tension between optimizing correctness and aesthetics and showing test-time scaling can boost both.
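One way to aggregate the four question levels into a score is a gated sum. The level names and the "lower levels must pass first" rule below are assumptions for illustration, not the paper's exact protocol:

```python
def aibench_score(answers):
    """Illustrative hierarchical aggregation for AIBench-style VQA judging:
    each level holds binary sub-claim checks, and a level only earns
    credit if every lower level already passed."""
    levels = ["component", "relation", "subflow", "pipeline"]
    score, lower_ok = 0.0, True
    for lvl in levels:
        ok = all(answers.get(lvl, []))
        if lower_ok and ok:
            score += 1 / len(levels)
        lower_ok = lower_ok and ok
    return score

full = aibench_score({lvl: [True] for lvl in
                      ["component", "relation", "subflow", "pipeline"]})
partial = aibench_score({"component": [True], "relation": [True, False],
                         "subflow": [True], "pipeline": [True]})
```

Decomposing into checkable sub-claims like this is what makes failures diagnosable: a low score points at the first level where alignment broke.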
- [2026-03-30] MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation 🆕NEW
- 赛道归属: 文生图评测 / 数学图形与结构化符号的可验证生成
- 核心创新点: 提出MathGen基准以“可执行验证器”客观评估数学视觉生成能力:1) 覆盖7类数学视觉表达(图形、几何构造、符号布局等)共900题,聚焦“视觉构成正确性”而非文本解题;2) 采用Script-as-a-Judge协议,为每题配套可执行判定脚本,实现确定性、可复现的自动评分,避免主观或VLM评审漂移;3) 实证揭示当前T2I在结构化与精确布局任务上存在系统性短板(高端闭源也仅中等准确率),把“数学视觉保真”明确为独立且困难的能力维度。
- Track: Text-to-image evaluation / Verifiable mathematical diagram & structured layout generation
- Core innovations: Introduces MathGen to objectively measure mathematical visual generation with executable verification: (1) 900 problems across seven domains targeting diagrams, plots, geometric constructions, and structured symbolic layouts where correctness depends on precise composition; (2) Script-as-a-Judge provides per-problem executable verifiers for deterministic, reproducible scoring, avoiding subjective or judge-VLM drift; (3) empirically exposes a major gap in current T2I models on structured/precision tasks, establishing “mathematical visual fidelity” as a distinct and challenging capability axis.
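A Script-as-a-Judge verifier is just an executable check on properties extracted from the image. A hypothetical example in that spirit (not one of MathGen's actual scripts):

```python
import math

def judge_equilateral(points, tol=1e-6):
    """Hypothetical per-problem verifier in the Script-as-a-Judge spirit:
    deterministically checks whether three vertex coordinates extracted
    from a generated figure form an equilateral triangle, instead of
    asking a judge VLM to eyeball the image."""
    a, b, c = points
    sides = sorted([math.dist(a, b), math.dist(b, c), math.dist(c, a)])
    return sides[-1] - sides[0] < tol          # all side lengths agree

good = judge_equilateral([(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)])
bad = judge_equilateral([(0.0, 0.0), (1.0, 0.0), (0.3, 0.9)])
```

Because the check is a deterministic script, two runs always agree, which is the reproducibility property the protocol is built around.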
GitHub
- [2026-03-31] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐10288 🆕NEW
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-03-31] apocas/restai ⭐481 🆕NEW
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage an...
- [2026-03-30] Light-Heart-Labs/DreamServer ⭐437 🆕NEW
Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
- [2026-03-30] eleiton/ollama-intel-arc ⭐290 🆕NEW
Make use of Intel Arc Series GPU to Run Ollama, StableDiffusion, Whisper and Open WebUI, for image generation, speech recognition and interaction with...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-03-30] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning 🆕NEW
- 赛道归属: 视频生成(自动驾驶场景生成 / 多视角可控生成)
- 核心创新点: 将多视角视觉-语言推理注入长时驾驶视频生成器,在中间表征层融合VLM特征以实现对象级细粒度控制(可用3D对象/图像/文本指定实体)且保持长序列时空一致性;提出多视角视觉语言评估器(MV-VLM)对生成结果进行自动一致性评测,并构建“生成-评估-再生成”的闭环自纠错机制;在闭环中加入对象级精修模块,针对MV-VLM判定不满足的实体进行局部修复并回灌再生成,从而提升长尾目标可控性与长视频一致性。
- Track: Video generation (autonomous driving scenario generation / multi-view controllable generation)
- Core innovations: Injects multi-view vision-language reasoning into long-horizon driving video generators by fusing VLM features in intermediate representations to enable fine-grained, object-level control (via 3D assets/images/text) while preserving spatiotemporal coherence; introduces a Multi-View Vision-Language Evaluator (MV-VLM) to automatically assess consistency and forms a generate–evaluate–regenerate closed loop; adds an object-level refinement module to locally fix MV-VLM-failed entities and feed them back for regeneration, improving long-tail controllability and long-video consistency.
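The generate, evaluate, regenerate closed loop reduces to a small control flow. The stand-ins below are toys (a "video" is a set of entity ids), not VistaGEN's API:

```python
def closed_loop(generate, evaluate, refine, max_rounds=3):
    """Schematic generate-evaluate-regenerate loop: an evaluator flags
    inconsistent entities and a refiner locally repairs them before the
    result is fed back for another consistency check."""
    video = generate()
    for _ in range(max_rounds):
        failed = evaluate(video)          # e.g. MV-VLM consistency check
        if not failed:
            return video, True
        video = refine(video, failed)     # object-level local repair
    return video, False

# toy stand-ins: entity "bad3" is initially inconsistent and gets repaired
gen = lambda: {1, 2, "bad3"}
ev = lambda v: [e for e in v if isinstance(e, str)]
fx = lambda v, failed: (v - set(failed)) | {3}
video, ok = closed_loop(gen, ev, fx)
```

The round cap matters in practice: it bounds regeneration cost when the evaluator keeps rejecting an entity.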
- [2026-03-30] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation 🆕NEW
- 赛道归属: 视频生成(手语视频生成 / 高效推理)
- 核心创新点: 提出无姿态(pose-free)手语生成框架,直接从自然语言到视频的扩散建模,避免“文本→姿态→渲染”的中间表示依赖,从而提升灵活性并减少误差传播;设计可训练滑窗分块注意力(T-STA),利用时空局部性引入“训练期+推理期一致”的可学习稀疏注意力,解决以往training-free稀疏带来的train-test gap,在保持质量的同时显著加速推理(报告3.07×)。
- Track: Video generation (sign language synthesis / efficient inference)
- Core innovations: Proposes a pose-free diffusion framework that maps text directly to sign-language videos, removing pose intermediates and reducing error accumulation; introduces Trainable Sliding Tile Attention (T-STA) that exploits spatiotemporal locality with trainable sparsity used consistently in training and inference, eliminating the train–test gap of training-free sparsification while achieving substantial speedups (3.07×) without quality loss.
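The locality assumption behind sliding-tile sparsity is easy to visualize as an attention mask. This sketch shows only the local support pattern over a 1-D tile sequence; T-STA's trainable gating and 3-D tiling are omitted:

```python
import numpy as np

def sliding_tile_mask(n_tiles, window):
    """Illustrative local-attention support in the spirit of sliding tile
    attention: each tile attends only to tiles within a fixed window,
    turning dense O(n^2) attention into a narrow band."""
    idx = np.arange(n_tiles)
    return np.abs(idx[:, None] - idx[None, :]) <= window

m = sliding_tile_mask(6, 1)
```

Using the same mask at training and inference time is the point of the "trainable" part: there is no train-test gap to absorb.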
- [2026-03-29] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning 🆕NEW
- 赛道归属: 视频推理强化 / 生成模型对齐(RL微调与奖励设计)
- 核心创新点: 将GRPO式强化学习适配到流式(flow-based)视频模型以提升迷宫/导航等需要多步规划的“视频推理”能力,并系统揭示多模态奖励模型在该设定下易崩溃;核心突破在于提出可验证(verifiable)奖励:对结构化环境构造多分量轨迹奖励,对机器人导航提出嵌入空间可验证奖励,用客观任务度量替代主观VLM打分,从而显著提升训练稳定性与泛化,并给出奖励设计的系统性实证结论。
- Track: Video reasoning RL / generative model alignment (RL fine-tuning & reward design)
- Core innovations: Adapts GRPO-style RL to flow-based video models for multi-step planning tasks (mazes/navigation) and shows multimodal reward models can fail catastrophically in this regime; introduces verifiable rewards grounded in objective task metrics—multi-component trajectory rewards for structured games and an embedding-level verifiable reward for robot navigation—yielding more stable RL training and improved generalization, supported by a systematic reward-design study.
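A multi-component verifiable reward replaces a learned judge with objective checks on the decoded trajectory. The components and weights below are illustrative, not Wan-R1's:

```python
def trajectory_reward(path, goal, walls):
    """Toy multi-component verifiable reward in the spirit of Wan-R1:
    objective checks on a decoded maze trajectory (goal completion,
    wall validity, proximity shaping) replace a multimodal reward model."""
    valid = all(p not in walls for p in path)         # no wall collisions
    reached = path[-1] == goal                        # goal completion
    dist = abs(path[-1][0] - goal[0]) + abs(path[-1][1] - goal[1])
    progress = 1.0 / (1 + dist)                       # proximity shaping
    return 0.5 * reached + 0.3 * valid + 0.2 * progress

good = trajectory_reward([(0, 0), (0, 1), (1, 1)], (1, 1), {(1, 0)})
bad = trajectory_reward([(0, 0), (1, 0), (1, 1)], (1, 1), {(1, 0)})
```

Because every component is computed from the environment state, the reward cannot be gamed the way a VLM scorer can, which is the stability argument of the paper.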
- [2026-03-29] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets 🆕NEW
- 赛道归属: 视频编辑(文生视频的连续属性控制 / 可控生成)
- 核心创新点: 提出TokenDial,在预训练T2V模型的时空patch-token中间空间学习“语义控制方向”,通过对token施加可加性偏移(offset)实现类似滑条的连续强度控制(外观与运动均可),并通过调节偏移幅度获得可预测、连贯的变化且尽量不漂移身份/背景/时序一致性;学习offset时不重训主干,而是利用预训练理解信号:外观用语义方向匹配,运动用运动幅度缩放约束,实现轻量、可迁移的控制接口。
- Track: Video editing (continuous attribute control for text-to-video / controllable generation)
- Core innovations: TokenDial learns semantic control directions in the pretrained T2V model’s spatiotemporal patch-token space and performs slider-like continuous attribute control via additive token offsets, enabling predictable, coherent changes in appearance and motion while reducing identity/background/temporal drift; learns attribute-specific offsets without retraining the backbone, leveraging pretrained understanding signals (semantic direction matching for appearance; motion-magnitude scaling for dynamics) for a lightweight, transferable control mechanism.
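The slider mechanism itself is a one-liner once a control direction exists. A minimal sketch, assuming a learned direction is given (learning it is the paper's contribution and is not shown here):

```python
import numpy as np

def dial_attribute(tokens, direction, strength):
    """Sketch of TokenDial-style slider control: a semantic direction in
    patch-token space is added to the intermediate tokens, and the offset
    magnitude acts as a continuous strength dial."""
    direction = direction / np.linalg.norm(direction)
    return tokens + strength * direction   # broadcasts over all tokens

tokens = np.zeros((4, 8))                  # 4 patch tokens, dim 8
d = np.eye(8)[0]                           # toy attribute direction
weak, strong = dial_attribute(tokens, d, 0.5), dial_attribute(tokens, d, 2.0)
```

Because the edit is additive and confined to one direction, dimensions orthogonal to it are untouched, which is why identity and background drift stay low.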
- [2026-03-29] KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study 🆕NEW
- 赛道归属: 推理优化(长视频生成系统 / KV Cache压缩与量化)
- 核心创新点: 面向self-forcing长时滚动生成的关键瓶颈(KV cache随长度线性膨胀),给出覆盖33种KV量化与缓存策略的系统性实证评测框架,联合度量显存峰值、时延、压缩率、画质与漂移;提出并验证更可部署的“FlowCache式soft-prune + INT4自适配”工作区间,在较小质量代价下实现约5.4×压缩并显著降低峰值显存;同时揭示“名义压缩率≠真实显存收益”的工程根因(注意力/refresh阶段仍保留或重建BF16大buffer),为后续内存集成优化指明方向。
- Track: Inference optimization (long-horizon video generation systems / KV-cache compression & quantization)
- Core innovations: Provides a large-scale empirical study (33 methods) targeting the core bottleneck of self-forcing long rollouts—KV cache growth—evaluating VRAM peak, latency, compression, quality, and drift; identifies a deployment-friendly regime with FlowCache-inspired soft-prune INT4 adaptation achieving ~5.4× compression and large VRAM reduction with modest overhead; crucially shows nominal compression can fail to reduce peak VRAM due to BF16 buffer reconstruction/retention during attention/refresh, pinpointing integration issues and practical research directions.
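The quantization primitive being benchmarked is a per-head symmetric INT4 round trip. A minimal sketch (codes are stored as int8 here; a real kernel would pack two 4-bit codes per byte, and FlowCache's soft-prune logic is not shown):

```python
import numpy as np

def quantize_kv_int4(kv):
    """Per-head symmetric INT4 quantize-dequantize for a KV cache slice
    of shape (heads, seq, dim). Returns int codes, per-row scales, and
    the dequantized tensor for downstream attention."""
    qmax = 7                                          # int4 range [-7, 7]
    scale = np.abs(kv).max(axis=-1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-8)                   # avoid divide-by-zero
    codes = np.clip(np.round(kv / scale), -qmax, qmax).astype(np.int8)
    return codes, scale, codes * scale

kv = np.random.default_rng(1).normal(size=(2, 16, 64)).astype(np.float32)
codes, scale, kv_hat = quantize_kv_int4(kv)
```

The paper's key systems caveat applies directly to a sketch like this: if attention or refresh steps dequantize back into a full BF16 buffer, peak VRAM barely moves despite the nominal 4x code compression.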
- [2026-03-28] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model 🆕NEW
- 赛道归属: 视频生成(动作条件生成 / 人-物交互与操控)
- 核心创新点: 提出面向第一人称场景的人-物操控世界模型LOME,支持以输入图像+文本+逐帧人体动作(身体姿态+手势)为条件生成交互视频;方法上通过训练时联合估计空间动作与环境上下文,把强动作约束“注入”到视频生成中,提升动作跟随与接触丰富交互的物理一致性;通过在多样化egocentric交互数据上微调预训练视频生成模型,实现对未见物体/场景的泛化,并能生成如倒水等具有合理后果的交互动态。
- Track: Video generation (action-conditioned generation / human–object interaction & manipulation)
- Core innovations: LOME is an egocentric world model that generates human–object manipulation videos conditioned on an input image, text, and per-frame human actions (body pose + hand gestures); it injects strong action guidance by jointly estimating spatial actions and environment context during training, improving action adherence and contact-rich physical plausibility; fine-tunes a pretrained video generator on diverse egocentric interactions to generalize to unseen scenarios and produce realistic consequences (e.g., pouring dynamics) without explicit 3D/4D simulation.
- [2026-03-28] TrackMAE: Video Representation Learning via Track Mask and Predict 🆕NEW
- 赛道归属: 视频表征学习(自监督 / Masked Video Modeling)
- 核心创新点: 提出TrackMAE,将MVM从“隐式学运动”推进到以显式运动轨迹为重建目标:用现成点跟踪器生成稀疏轨迹作为运动监督信号;同时用轨迹引导设计运动感知mask策略,改进随机tube masking以更聚焦动态区域;在像素与特征语义重建之外引入轨迹目标作为互补监督,从而学习到更强的运动敏感表征,在多数据集下游任务上稳定超越现有自监督视频预训练方法。
- Track: Video representation learning (self-supervised / masked video modeling)
- Core innovations: TrackMAE augments MVM with explicit motion trajectories as reconstruction targets by extracting sparse tracks via an off-the-shelf point tracker; uses trajectories to build a motion-aware masking strategy that improves over random tube masking by focusing learning on dynamic regions; adds trajectory targets as complementary supervision alongside pixel/feature reconstruction, yielding more motion-aware and transferable video representations across diverse downstream benchmarks.
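The motion-aware masking strategy can be approximated by ranking tokens by track density. A toy sketch (the real method works on spatiotemporal tubes guided by point tracks, not a flat token list):

```python
import numpy as np

def motion_aware_mask(track_density, mask_ratio=0.5):
    """Illustrative TrackMAE-style masking: tokens in regions with higher
    point-track density (more motion) are masked preferentially, so the
    model must predict dynamics rather than copy static background."""
    n = track_density.size
    k = int(round(n * mask_ratio))
    order = np.argsort(-track_density)       # most dynamic tokens first
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask

density = np.array([0.0, 0.1, 0.9, 0.8, 0.05, 0.7])
mask = motion_aware_mask(density, 0.5)
```

Compared with random tube masking, the masked set here concentrates exactly where the supervision signal (the trajectories) lives.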
- [2026-03-28] EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow 🆕NEW
- 赛道归属: 视频生成(从零训练 / 少步扩散与高效注意力)
- 核心创新点: 提出EFlow的few-step从零训练框架,基于solution-flow目标学习从噪声状态t直接映射到s以减少采样步数;为在视频尺度可行,提出Gated Local-Global Attention作为可丢token的混合注意力块,在激进随机token-dropping下仍稳定,降低单步注意力计算;训练上用Path-Drop Guided以廉价弱路径替代昂贵guidance目标,并引入Mean-Velocity Additivity正则保证极低步数下的保真度,实现训练吞吐提升与推理时延大幅下降。
- Track: Video generation (training from scratch / few-step diffusion & efficient attention)
- Core innovations: EFlow trains few-step video generators from scratch using a solution-flow objective that maps a noised state at time t directly to s, reducing sampling steps; introduces Gated Local-Global Attention, a token-droppable hybrid block that remains stable under aggressive random token dropping to cut per-step attention cost; proposes Path-Drop Guided training to replace expensive guidance targets with cheap weak paths plus a Mean-Velocity Additivity regularizer to maintain fidelity at extremely low step counts, improving training throughput and drastically reducing inference latency.
- [2026-03-27] Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling 🆕NEW
- 赛道归属: 视频生成(跨模态应用 / 地图域轨迹生成与强化优化)
- 核心创新点: 将Sig2GPS从坐标回归/多阶段工程重构为图像到视频生成:把蜂窝信令轨迹渲染到地图图像上,让视频生成模型“绘制”连续GPS路径,实现对地图约束的端到端建模;构建配对的“信令→轨迹视频”数据集以微调开源视频模型;引入轨迹感知的强化学习优化,用奖励直接约束路径连续性与贴合度,提升生成保真与可迁移性(含跨城迁移与下一位置预测扩展)。
- Track: Video generation (map-domain trajectory generation / RL-based refinement)
- Core innovations: Reframes Sig2GPS as image-to-video generation in the map-visual domain: cellular signaling traces are rendered on maps and a video generator is trained to “draw” continuous GPS paths, enabling end-to-end modeling under map constraints; builds paired signaling-to-trajectory video data to fine-tune an open-source video model; introduces trajectory-aware RL optimization with reward-driven fidelity improvements (continuity/adherence), demonstrating scalability, cross-city transfer, and extensions to next-location prediction.
GitHub
- [2026-03-31] hao-ai-lab/FastVideo ⭐3332 🆕NEW
A unified inference and post-training framework for accelerated video generation.
- [2026-03-30] ModelTC/LightX2V ⭐2121 🆕NEW
Light Image Video Generation Inference Framework
- [2026-03-30] YouMind-OpenLab/awesome-seedance-2-prompts ⭐451 🆕NEW
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-03-31] vargHQ/sdk ⭐247 🆕NEW
AI video generation SDK — JSX for videos. One API for Kling, Flux, ElevenLabs, Sora. Built on Vercel AI SDK.
- [2026-03-31] OpenDriveLab/SparseVideoNav ⭐65 🆕NEW
Sparse Video Generation Model for Embodied Navigation conditioned on loose language guidance, 100% real world verification
HuggingFace Models
- Lightricks/LTX-2.3 🆕NEW
音频生成 / Audio Generation
arXiv
- [2026-03-30] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation 🆕NEW
- 赛道归属: 视频编辑(广告短视频自动化剪辑)/ 多模态生成与编辑(视频-音频-文本统一表征)
- 核心创新点: 提出端到端广告视频编辑框架,将视频与音频分别编码后通过残差向量量化(RVQ)离散化为统一token,并与文本对齐,构建共享的“视频-音频-文本”token空间;在此基础上通过多模态对齐+监督微调训练面向编辑的多模态大模型,在同一框架内联合完成素材选择与排序、脚本生成、BGM选择等决策,并将预测token序列映射回可部署的长视频输出,从而提升跨模态一致性与可控性、降低制作与迭代成本。
Track: Video editing (automated ad video production) / Multimodal generation & editing (unified video-audio-text representation)
Core innovations: Proposes an end-to-end ad video editing system that encodes video and audio, then discretizes them via residual vector quantization (RVQ) into unified tokens aligned with text, forming a shared video-audio-text token space. On top of a foundation model, it trains an editing-oriented multimodal LLM via multimodal alignment plus supervised fine-tuning to jointly handle clip selection/ordering, script generation, and background music selection within one pipeline, and finally renders predicted token sequences into deployable long-form videos—improving cross-modal consistency and controllability while reducing production/iteration cost.
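The RVQ discretization step at the heart of the token space can be sketched with toy codebooks. A minimal NumPy illustration (codebooks here are random rather than learned; each keeps a zero code so a stage can leave the residual unchanged):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Toy residual vector quantization (RVQ): each stage quantizes the
    residual left by the previous one, so the reconstruction refines
    stage by stage and the codes form the unified token sequence."""
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:                               # one codebook per stage
        residual = x - recon
        dists = np.linalg.norm(residual[:, None, :] - cb[None], axis=-1)
        idx = dists.argmin(axis=1)                     # nearest code per vector
        codes.append(idx)
        recon = recon + cb[idx]
    return codes, recon

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
cbs = [np.vstack([np.zeros((1, 4)), rng.normal(size=(15, 4))])
       for _ in range(3)]
codes, recon = rvq_encode(x, cbs)
err1 = np.linalg.norm(x - cbs[0][codes[0]])            # after stage 1 only
err3 = np.linalg.norm(x - recon)                       # after all 3 stages
```

Stacking stages is what lets a modest per-stage codebook represent video and audio features finely enough to share one token vocabulary with text.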
- [2026-03-25] AVControl: Efficient Framework for Training Audio-Visual Controls 🆕NEW
- 赛道归属: 音视频联合生成的可控生成(模块化控制/LoRA适配)/ 视频生成与编辑控制
- 核心创新点: 提出可扩展的音视频控制训练框架AVControl:基于联合音视频基础模型LTX-2,将每一种控制模态(深度、姿态、边缘、相机轨迹与内参、稀疏运动、修复/扩展、音频变换等)独立训练为单独的LoRA模块;通过“并行画布(parallel canvas)”把参考信号以额外token注入注意力层,实现无需改动主干架构即可添加新控制模态,并解决将图像in-context控制直接扩展到视频时在结构控制上失效的问题;训练上具备数据与算力高效性(小数据、少步数收敛),同时实现对多控制模态的可插拔组合,并给出面向联合音视频生成模型的模块化控制(含音视频控制)的系统化落地。
Track: Controllable audio-visual generation (modular controls via LoRA adapters) / Video generation & editing controls
Core innovations: Introduces AVControl, an extensible control-training framework on the joint audio-visual foundation model LTX-2. Each control modality (depth, pose, edges, camera trajectory with intrinsics, sparse motion, inpainting/outpainting, audio-related transforms, etc.) is trained as an independent LoRA module. A “parallel canvas” injects the reference signal as extra tokens into attention, enabling new modalities without backbone architectural changes and fixing the failure of naively extending image in-context control to video for structural guidance. The approach is compute/data efficient (small datasets, few hundred–thousand steps), supports plug-and-play composition of independently trained controls, and provides a modular control recipe for joint audio-visual generation models (including audio-visual controls).
GitHub
- [2026-03-30] huggingface/diffusers ⭐33214 🆕NEW
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-03-25] FunAudioLLM/ThinkSound ⭐1274 🆕NEW
[NeurIPS 2025] PyTorch implementation of [ThinkSound], a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) re...
语言大模型 / Large Language Models
arXiv
- [2026-03-25] Composer 2 Technical Report 🆕NEW
- 赛道归属: 代码大模型 / Agentic 软件工程(工具使用与长程任务强化学习)
- 核心创新点: 采用“两阶段训练范式”打造面向真实软件工程的专用智能体模型:先通过持续预训练增强领域知识与潜在编码能力,再进行大规模强化学习以提升端到端编码表现,重点强化长程推理、多步执行准确性与长上下文一致性;同时构建与线上部署一致的 Cursor harness 训练基础设施,使训练时的工具链、交互结构与真实使用环境对齐,并使用高度贴近真实问题的环境进行学习;提出源自大型真实代码库的软件工程基准(CursorBench)用于分级评估更高难度的长程工程任务能力,从而形成“真实环境对齐训练 + 强化学习优化执行”的可复用专用模型训练流程。
- Track: Code LLMs / Agentic Software Engineering (tool-use and long-horizon RL)
- Core innovations: Introduces a two-stage training recipe for a domain-specialized software-engineering agent: (1) continued pretraining to strengthen domain knowledge and latent coding skills, followed by (2) large-scale reinforcement learning to improve end-to-end coding via stronger reasoning, accurate multi-step execution, and long-horizon coherence; builds training infrastructure that mirrors the deployed Cursor harness so tools, interaction structure, and runtime constraints are aligned between training and real usage, and trains in environments closely matching real-world problems; proposes a benchmark derived from real large-codebase engineering tasks (CursorBench) to evaluate progressively harder long-horizon workflows, yielding a reusable pipeline of “realistic environment alignment + RL for execution quality” for frontier coding agents.
GitHub
- [2026-03-30] abhigyanpatwari/GitNexus ⭐20842 🆕NEW
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-03-31] clice-io/clice ⭐1188 🆕NEW
A next-generation C++ language server for modern C++, focused on high performance and deep code intelligence
- [2026-03-29] DeusData/codebase-memory-mcp ⭐1083 🆕NEW
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-m...
- [2026-03-28] Anandb71/arbor ⭐102 🆕NEW
Graph-native code intelligence that replaces embedding-based RAG with deterministic program understanding.
- [2026-03-30] Cre4T3Tiv3/gitvoyant ⭐72 🆕NEW
Temporal Code Intelligence platform. Time-series complexity analysis across Python, JavaScript, Java, and Go. Linear regression trend detection, cyclo...
多模态大模型 / Multimodal Models
arXiv
- [2026-03-30] AMIGO: Agentic Multi-Image Grounding Oracle Benchmark 🆕NEW
- 赛道归属: 多模态理解 / Agentic评测基准(多轮交互式视觉定位)
- 核心创新点: 提出长时程、多轮“问答式检索目标图”的评测范式:oracle私选目标图,模型需在严格协议下通过一系列属性型Yes/No/Unsure问题逐步缩小候选并最终命中;用Skip惩罚无效动作以约束真实代理行为。基准重点刻画不确定性下的提问策略、跨轮约束一致性维护、细粒度相似图判别,并引入可控oracle噪声以系统评估鲁棒性与证据核验能力,配套提供轨迹级诊断与交互质量指标体系。
- Track: Multimodal Understanding / Agentic Evaluation Benchmark (multi-turn visual grounding)
- Core innovations: Introduces a long-horizon, protocol-constrained “QA-to-retrieve-the-hidden-target” benchmark: an oracle privately selects a target image and the model must identify it by asking sequential attribute-focused Yes/No/Unsure questions, with invalid actions penalized via Skip to enforce realistic agent behavior. The benchmark stresses question selection under uncertainty, consistent constraint tracking across turns, and fine-grained discrimination among near-duplicate images, and adds controlled oracle imperfections to probe robustness and evidence-verification behavior, with trajectory-level diagnostics and interaction-quality metrics.
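The interaction protocol reduces to oracle answers pruning a candidate pool. A toy version (attribute sets stand in for images; the Unsure/Skip machinery and noisy-oracle settings are omitted):

```python
def play_round(candidates, target, questions):
    """Toy AMIGO-style protocol: the oracle answers attribute yes/no
    questions about a hidden target image, and the agent prunes the
    candidate pool to whatever is consistent with every answer."""
    pool = list(candidates)
    for attr in questions:
        ans = attr in target["attrs"]                   # oracle answer
        pool = [c for c in pool if (attr in c["attrs"]) == ans]
        if len(pool) == 1:                              # target identified
            break
    return pool

imgs = [{"id": 0, "attrs": {"dog", "indoor"}},
        {"id": 1, "attrs": {"dog", "outdoor"}},
        {"id": 2, "attrs": {"cat", "indoor"}}]
result = play_round(imgs, imgs[1], ["dog", "indoor"])
```

What the benchmark actually measures is which questions the agent chooses: a good question halves the pool, while a redundant one wastes a turn, and the Skip penalty prices invalid moves.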
- [2026-03-30] Unsafe2Safe: Controllable Image Anonymization for Downstream Utility 🆕NEW
- 赛道归属: 图像编辑 / 隐私保护数据构建(可控匿名化)
- 核心创新点: 提出端到端自动化“检测—改写”匿名化流水线:先用VLM识别隐私风险并生成“私有/公开”双字幕,再由LLM基于公开字幕产出结构化、去身份化的编辑指令;随后用指令驱动扩散编辑仅重写敏感区域,在保持全局结构与任务语义的同时中和身份信息。方法论上将多模态风险审查、双提示(private/public)约束与扩散局部编辑耦合,并构建统一评测维度(质量/作弊/隐私/效用);同时利用自动生成三元组微调编辑器以进一步提升隐私-保真权衡。
- Track: Image Editing / Privacy-preserving dataset construction (controllable anonymization)
- Core innovations: Proposes a fully automated detect-and-rewrite anonymization pipeline: a VLM flags privacy risks and produces paired private/public captions, then an LLM generates structured identity-neutral edit instructions conditioned on the public caption; an instruction-driven diffusion editor rewrites only sensitive regions, preserving global structure and task-relevant semantics while neutralizing identity cues. Methodologically, it couples multimodal risk inspection, dual-text constraints (private/public prompts), and localized diffusion editing, introduces a unified evaluation suite (Quality/Cheating/Privacy/Utility), and improves the privacy–fidelity trade-off by fine-tuning editors on automatically generated (private caption, public caption, instruction) triplets.
- [2026-03-30] Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering 🆕NEW
- 赛道归属: 多模态理解 / 图表问答鲁棒推理(反欺骗Agent框架)
- 核心创新点: 提出双路径agentic框架以对抗“误导性图表”:将感知与核验解耦——诊断视觉路径通过策略性ROI裁剪捕捉结构异常(如坐标轴反转),OCR数据路径提供数值级落地;再用Agentic Summarizer做跨模态冲突消解与一致性汇总。训练上采用“Oracle-Informed SFT + Deception-Aware GRPO”两阶段对齐:先蒸馏正确怀疑式推理,再用对抗式强化优化惩罚视觉陷阱、强化逻辑一致性,从而让较小开源骨干获得显著鲁棒性提升。
- Track: Multimodal Understanding / Robust chart QA & anti-deception reasoning (agentic framework)
- Core innovations: Introduces a dual-path agentic framework for misleading charts by decoupling perception from verification: a Diagnostic Vision Path uses strategic ROI cropping to detect structural anomalies (e.g., inverted axes), while an OCR-driven Data Path enforces numerical grounding; an Agentic Summarizer resolves cross-modal conflicts and consolidates consistent answers. Training uses a two-stage alignment—Oracle-Informed SFT for reasoning distillation followed by Deception-Aware GRPO to adversarially penalize visual traps and enforce logical consistency—yielding large robustness gains for a smaller open-source backbone.
- [2026-03-30] XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs 🆕NEW
- 赛道归属: 多模态安全 / 对抗攻击(VLM可迁移稀疏扰动)
- 核心创新点: 提出几何先验强约束的稀疏攻击XSPA:将扰动限制在两条对角线交叉形成的“X形”像素集合(极低预算、固定形状),在该稀疏支撑上联合优化分类目标、跨任务语义引导(促使caption/VQA语义漂移)以及幅值与线内平滑正则,实现对共享视觉-文本嵌入空间的可迁移破坏。该方法以更苛刻的攻击约束揭示VLM跨任务共享表征的系统性脆弱性。
- Track: Multimodal Security / Adversarial attacks (transferable sparse perturbations on VLMs)
- Core innovations: Proposes XSPA, a highly constrained sparse attack with a fixed geometric prior: perturbations are restricted to an “X-shaped” support (two intersecting diagonals), creating a stringent low-budget setting. Within this sparse region, it jointly optimizes a classification loss, cross-task semantic guidance to induce caption/VQA semantic drift, and regularizers on magnitude and along-line smoothness, enabling transferable failures through the shared vision–language embedding space and exposing a robustness gap under imperceptible, structured perturbations.
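The geometric support constraint is easy to make concrete. This sketch builds only the X-shaped pixel mask; the joint attack objective optimized on that support is not shown:

```python
import numpy as np

def x_shaped_support(h, w):
    """Builds the X-shaped sparse support described for XSPA: only pixels
    on the two image diagonals may be perturbed, giving a fixed-shape,
    extremely low-budget attack surface."""
    mask = np.zeros((h, w), dtype=bool)
    rows = np.arange(h)
    cols = np.round(rows * (w - 1) / max(h - 1, 1)).astype(int)
    mask[rows, cols] = True               # main diagonal
    mask[rows, (w - 1) - cols] = True     # anti-diagonal
    return mask

m = x_shaped_support(5, 5)
budget = int(m.sum())                     # sparsity budget of the attack
```

On an H×W image the support is roughly 2H pixels out of H·W, which is what makes successful transfer under this constraint a strong statement about shared-embedding fragility.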
- [2026-03-30] Domain-Invariant Prompt Learning for Vision-Language Models 🆕NEW
- 赛道归属: 推理优化 / Prompt学习(CLIP域泛化)
- 核心创新点: 提出DiCoOp,将CoOp软提示学习扩展到域泛化:通过对抗训练引入“域不变性”约束,使学习到的上下文向量在保持类别判别性的同时,抑制对训练域统计特征的依赖,从而提升跨未见域的零样本/少样本识别稳健性。方法关键在于把“prompt参数”作为域对齐对象,用对抗目标显式消除域可分信息。
- Track: Inference/Adaptation Optimization / Prompt learning (domain generalization for CLIP)
- Core innovations: Proposes DiCoOp, extending CoOp soft prompt learning to domain generalization via adversarial training: it enforces domain-invariant context vectors while preserving class discriminability, reducing reliance on source-domain statistics and improving robustness on unseen domains. The key methodological shift is treating prompt parameters as the alignment target and explicitly removing domain-separable information with an adversarial objective.
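The shape of the adversarial objective can be sketched over precomputed logits: the prompt parameters descend the class loss while ascending a domain classifier's loss (gradient-reversal style), so the learned context carries no domain-separable information. This is a sketch of the loss form only, assuming a simple weighted difference; `lam` and all names are illustrative, not DiCoOp's actual formulation or hyperparameters.

```python
import numpy as np

def dicoop_objective(class_logits, labels, domain_logits, domains, lam=0.5):
    """Prompt-update objective (sketch): minimize class cross-entropy while
    *maximizing* domain cross-entropy, i.e. descend (cls_loss - lam * dom_loss)."""
    def cross_entropy(logits, y):
        z = logits - logits.max(axis=1, keepdims=True)   # numeric stability
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(y)), y].mean()

    cls_loss = cross_entropy(class_logits, labels)
    dom_loss = cross_entropy(domain_logits, domains)
    return cls_loss - lam * dom_loss
```

In practice the domain classifier is trained in the opposite direction (to minimize `dom_loss`), giving the min-max game that removes domain-separable structure from the context vectors.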
- [2026-03-30] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model 🆕NEW
- 赛道归属: 多模态检索与生成统一 / 文档理解系统(单模型双头)
- 核心创新点: 提出单一VLM内统一“检索+生成”的Hydra双头机制:用可切换的单个LoRA适配器实现模式切换——开启时输出ColBERT式late-interaction多向量嵌入用于检索,关闭时恢复原模型自回归生成且保证输出字节级一致。并系统总结三项关键工程约束(注意力模式恢复、lm_head保持、KV-cache感知解码)以避免“权重恢复正确但生成悄然退化”的隐性失败;在显著降低峰值显存的同时维持检索性能,并展示向音频/视频检索嵌入的可扩展性。
- Track: Unified Multimodal Retrieval & Generation / Document VLM systems (single-model dual-head)
- Core innovations: Proposes Hydra, unifying retrieval and generation inside one VLM via a dual-head mechanism with a single toggleable LoRA adapter: enabling the adapter yields ColBERT-style late-interaction multi-vector embeddings for retrieval; disabling it restores the base autoregressive generator with byte-identical outputs. It identifies three critical engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) to prevent silent generation degradation despite correct weight recovery, achieving large peak-memory savings while maintaining retrieval quality and demonstrating extensibility to audio/video embedding and speech generation.
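The toggle mechanism can be sketched in a few lines: because the base weight is never mutated, disabling the adapter restores bit-identical base outputs. A hypothetical NumPy sketch of one linear layer (the real system operates on a full VLM and additionally needs the attention-mode, lm_head, and KV-cache handling the entry describes):

```python
import numpy as np

class ToggleableLoRA:
    """Single low-rank adapter toggled between retrieval (on) and
    generation (off). The base weight W stays frozen and untouched, so
    turning the adapter off recovers exactly the original forward pass."""

    def __init__(self, W, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                    # frozen base weight
        self.A = rng.normal(0.0, 0.02, (rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))         # zero-init: no-op at start
        self.enabled = False

    def forward(self, x):
        y = self.W @ x
        if self.enabled:                              # retrieval-embedding path
            y = y + self.B @ (self.A @ x)
        return y
```

The key property is that the adapter is additive and gated, never merged into `W`; merging and un-merging in floating point would risk exactly the silent drift the three engineering constraints guard against.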
- [2026-03-30] The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation 🆕NEW
- 赛道归属: 多模态评测与可靠性 / 临床VLM评估偏差分析(提示脚手架效应)
- 核心创新点: 揭示并量化“脚手架效应(scaffold effect)”:在影像本身不含个体级诊断信号的设定下,仅在提示中提及“有MRI可用”就能显著抬升小模型指标,且该提升与是否提供影像无关,属于领域特定的模态塌缩/提示诱导伪增益。方法上通过对比置信度分析与专家审阅,系统证明模型会编造影像依据;并指出偏好对齐虽能抑制提及MRI,但会将性能拉回随机水平,强调临床多模态评测需避免表面化prompt framing伪提升。
- Track: Multimodal Evaluation & Reliability / Clinical VLM evaluation bias (prompt scaffold effect)
- Core innovations: Identifies and quantifies the “scaffold effect”: in settings where MRI carries no reliable individual-level diagnostic signal, merely mentioning MRI availability in the prompt yields large apparent performance gains for smaller VLMs, independent of whether imaging is actually provided—an instance of modality collapse / prompt-induced spurious gains. Using contrastive confidence analysis and expert review, it shows models fabricate imaging-grounded rationales; preference alignment can suppress MRI-referencing behavior but collapses performance toward chance, highlighting that superficial prompt framing can invalidate clinical multimodal evaluations.

- [2026-03-30] SEA: Evaluating Sketch Abstraction Efficiency via Element-level Commonsense Visual Question Answering 🆕NEW
- 赛道归属: 多模态评测 / 草图理解与抽象度量(无参考指标与数据集)
- 核心创新点: 提出SEA无参考草图抽象效率指标:以“类别常识定义的关键元素集合”为中间语义单位,调用VQA模型判断草图是否表达这些元素,并以“语义保留/笔画经济性”量化抽象效率,避免依赖参考图或低层特征。同步构建CommonSketch:首个带元素级语义标注的大规模草图数据集(含caption与元素标注),使得对VLM的元素级草图理解与抽象评估可系统化、可对齐人类判断。
- Track: Multimodal Evaluation / Sketch understanding & abstraction metrics (reference-free)
- Core innovations: Proposes SEA, a reference-free metric for sketch abstraction efficiency: it defines class-specific key elements from commonsense knowledge, uses a VQA model to verify the presence of each element in a sketch, and scores abstraction as semantic retention under visual economy—avoiding reliance on reference images or low-level features. It also introduces CommonSketch, the first large-scale sketch dataset with element-level semantic annotations (plus captions), enabling systematic evaluation of element-level sketch understanding and abstraction aligned with human judgments.
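The metric's structure — semantic retention divided by visual cost — can be sketched with a stubbed VQA check. Here `element_present` stands in for the actual VQA-model call, and the ratio form is an illustrative assumption; the paper's exact scoring function may weight the two terms differently.

```python
def sea_score(sketch_strokes, key_elements, element_present):
    """Reference-free abstraction efficiency (sketch):
    fraction of commonsense key elements the VQA model confirms,
    divided by stroke count (visual economy).
    `element_present(strokes, element) -> bool` stubs the VQA call."""
    if not key_elements or not sketch_strokes:
        return 0.0
    retained = sum(element_present(sketch_strokes, e) for e in key_elements)
    retention = retained / len(key_elements)
    return retention / len(sketch_strokes)
```

Under this form, redundant strokes that add no confirmed elements strictly lower the score, which is the intended notion of abstraction efficiency.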
- [2026-03-30] Explaining CLIP Zero-shot Predictions Through Concepts 🆕NEW
- 赛道归属: 多模态可解释性 / 零样本识别解释(概念空间投影)
- 核心创新点: 提出EZPC,在不引入额外概念监督的前提下,用“语言学习的概念空间”解释CLIP零样本预测:将CLIP图文联合嵌入投影到可读概念维度,并通过对齐+重构联合目标约束投影既保持CLIP原有语义结构(忠实性),又产生可解释的概念激活(可读性)。核心突破在于把开放词表预测与人类概念对齐起来,同时尽量不牺牲零样本精度。
- Track: Multimodal Interpretability / Explaining zero-shot CLIP via concepts
- Core innovations: Proposes EZPC to explain CLIP zero-shot predictions through a human-interpretable concept space learned from language, without additional concept supervision. It projects CLIP’s joint image–text embeddings into concept dimensions and trains the projection with combined alignment and reconstruction objectives so concept activations remain faithful to CLIP’s semantic structure while becoming interpretable, grounding open-vocabulary predictions in explicit concepts with minimal loss of zero-shot accuracy.
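The two objectives can be illustrated with plain linear algebra: project embeddings onto a bank of concept text embeddings for readable activations, and check faithfulness by how well the concept bank reconstructs the embedding. This is a NumPy sketch under the assumption that concepts and embeddings share one space; EZPC's trained projection and joint loss are not reproduced here.

```python
import numpy as np

def concept_activations(embeddings, concept_matrix):
    """Project CLIP-style embeddings onto concept text embeddings.
    Rows are L2-normalized, so each activation is a cosine similarity."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    C = concept_matrix / np.linalg.norm(concept_matrix, axis=1, keepdims=True)
    return E @ C.T                      # (n_images, n_concepts)

def reconstruction_error(embeddings, concept_matrix):
    """Faithfulness proxy: least-squares reconstruction of the embeddings
    from the concept bank; low error means concepts span the semantics."""
    coef, *_ = np.linalg.lstsq(concept_matrix.T, embeddings.T, rcond=None)
    recon = (concept_matrix.T @ coef).T
    return float(np.linalg.norm(embeddings - recon))
```

The interpretability/faithfulness tension is visible even here: a small readable concept set lowers activation clutter but raises reconstruction error, which is what the combined alignment + reconstruction objective balances.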
- [2026-03-30] $AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning 🆕NEW
- 赛道归属: 多模态推理与决策 / 自动驾驶端到端规划(P3链式思维+强化微调)
- 核心创新点: 提出统一的Perception-Prediction-Planning链式思维框架:通过结构化CoT把感知输出作为预测与规划的条件输入,并让预测与感知共同约束最终规划,解决“直接出规划”或“三模块割裂”导致的协同不足。构建P^3-CoT数据集,并提出分层强化微调算法P^3-GRPO,对三阶段提供渐进式监督与奖励分解;同时引入“细致/快速”双思考模式在推理成本与性能间可控切换,实现更安全、可解释的驾驶决策。
- Track: Multimodal Reasoning & Decision-making / Autonomous driving planning (P3 chain-of-thought + RL fine-tuning)
- Core innovations: Proposes a unified Perception–Prediction–Planning chain-of-thought framework where structured reasoning explicitly feeds perception into prediction and planning, and jointly uses perception + prediction to constrain the final plan, addressing both “direct-to-plan” gaps and fragmented multi-module pipelines. It introduces the P^3-CoT dataset and a hierarchical RL fine-tuning algorithm (P^3-GRPO) that provides progressive supervision and reward decomposition across the three stages, plus dual thinking modes (detailed vs fast) to trade off inference cost and performance for safer, more interpretable driving decisions.
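One way to read "progressive supervision with reward decomposition" is a gated staged reward: later stages only earn reward once earlier stages clear a quality threshold, so the policy cannot collect planning reward on top of broken perception. This is purely an illustrative sketch of that idea — the weights, gate, and gating rule are assumptions, not P^3-GRPO's actual reward design.

```python
def staged_reward(r_perception, r_prediction, r_planning,
                  weights=(0.2, 0.3, 0.5), gate=0.5):
    """Gated stage-wise reward (sketch, not the paper's exact rule):
    prediction reward counts only if perception clears the gate,
    planning reward only if prediction also clears it."""
    w_pc, w_pd, w_pl = weights
    total = w_pc * r_perception
    if r_perception >= gate:
        total += w_pd * r_prediction
        if r_prediction >= gate:
            total += w_pl * r_planning
    return total
```

The gating makes the credit assignment hierarchical: a perfect plan conditioned on failed perception scores near zero, matching the framework's insistence that planning be constrained by upstream perception and prediction.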
GitHub
- [2026-03-30] Blaizzy/mlx-vlm ⭐2564 🆕NEW
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-03-28] TIGER-AI-Lab/VLM2Vec ⭐612 🆕NEW
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
- [2026-03-27] zli12321/Vision-Language-Models-Overview ⭐550 🆕NEW
A curated collection and survey of frontier vision-language model papers and their GitHub repositories. Continuously updated.
- [2026-03-30] shuyansy/Earth-Observation-VLMs ⭐116 🆕NEW
🔥🔥A Family of Multi-Sensor, Multi-Granularity Vision-Language Models for Earth Observation Understanding
- [2026-03-29] xytian1008/VAPO ⭐99 🆕NEW
Official repo for "More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models" (ICLR 2026)
Generated automatically by Daily AI Digest Agent · Generated at: 2026-03-31 02:25:33