AI 每日进展速报 / Daily AI Digest - 2026-04-01
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-03-29] ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks 📖1 🆕NEW
- 赛道归属: 图像生成/编辑评测基准(Benchmark + 可解释人类评测)
- 核心创新点: 构建覆盖“生成/编辑 + 单/多参考”六类核心任务与六种真实域(含截图、信息图、文本图形等)的3.6K条件集,并配套2万条细粒度人工标注;提出可解释评测框架,将失败模式以对象级/分割级局部错误标签显式归因,用于诊断模型在局部编辑、符号/文本密集场景的系统性短板,同时与VLM指标形成互补(VLM可较好排序但难以做细粒度错误归因)。
- Track: Image generation/editing evaluation benchmark (Benchmark + explainable human evaluation)
- Core innovation: Builds a 3.6K-condition benchmark spanning six task types (generation/editing with single/multi references) and six real-world domains (including screenshots, infographics, text-heavy graphics). It pairs this with 20K fine-grained human annotations and an explainable evaluation schema that explicitly attributes failures via localized object-/segment-level error tags, enabling diagnostic analysis (e.g., local edit failures, symbolic/text-heavy weaknesses) beyond what VLM-based metrics can provide (good ranking but limited error attribution).
- [2026-03-31] FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing 🆕NEW
- 赛道归属: 人脸表情图像编辑评测(Benchmark + 指标体系)
- 核心创新点: 提出面向“身份/背景严格保持 + 表情精确操控”的专用基准FED-Bench(747三元组:原图-指令-GT),解决通用编辑基准缺少高质量人脸与可精确对齐GT的问题;提出FED-Score跨粒度解耦评测,将能力拆分为Alignment(指令遵循)、Fidelity(画质与身份保持)、Relative Expression Gain(表情变化幅度),用“变化幅度”维度抑制偏向“偷懒编辑/过拟合编辑”的指标偏置;并提供可扩展数据引擎产出20k+野外训练集验证可提升模型。
- Track: Facial expression image editing evaluation (Benchmark + metrics)
- Core innovation: Introduces FED-Bench, a task-specific benchmark for expression editing with strict identity/background preservation, built as 747 (source, instruction, GT) triplets enabling precise quantitative evaluation. Proposes FED-Score, a cross-granular disentangled protocol separating Alignment (instruction following), Fidelity (quality + identity preservation), and Relative Expression Gain (magnitude of expression change) to counter metric biases that reward “lazy” or overfit edits. Also provides a scalable data engine yielding 20k+ in-the-wild training data to demonstrate measurable gains via fine-tuning.
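The disentangled scoring idea can be sketched as a weighted aggregate of the three sub-scores. The weights and the formula below are illustrative assumptions, not the published FED-Score definition; the point is only that an explicit expression-gain term keeps "lazy" edits from scoring well:

```python
# Hypothetical sketch of a FED-Score-style disentangled aggregate.
# Weights and the linear form are made-up assumptions for illustration.

def fed_score_sketch(alignment, fidelity, expression_gain,
                     weights=(0.4, 0.4, 0.2)):
    """Combine three sub-scores in [0, 1] into one scalar.

    The expression-gain term penalizes "lazy" edits: an edit that
    preserves identity perfectly but changes nothing scores low.
    """
    w_a, w_f, w_g = weights
    return w_a * alignment + w_f * fidelity + w_g * expression_gain

# A lazy edit: perfect fidelity, but no actual expression change.
lazy = fed_score_sketch(alignment=0.2, fidelity=1.0, expression_gain=0.0)
# A faithful edit that actually changes the expression.
good = fed_score_sketch(alignment=0.9, fidelity=0.9, expression_gain=0.8)
assert good > lazy
```

A fidelity-only metric would rank the lazy edit first; the gain term flips that ordering.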
- [2026-03-31] MacTok: Robust Continuous Tokenization for Image Generation 🆕NEW
- 赛道归属: 图像生成基础组件(连续tokenizer/潜变量表征学习)
- 核心创新点: 针对VAE式连续tokenizer在低token数下易“后验坍塌”的关键瓶颈,提出MacTok:通过随机mask与DINO语义引导mask迫使编码器在缺失视觉证据下仍编码语义;再结合全局+局部表征对齐约束,稳定学习高压缩1D连续潜表示(仅64/128 tokens)而不丢失判别信息,从而在大幅降token(最高64×)下仍保持高保真生成性能。
- Track: Image generation building blocks (continuous tokenization / latent representation learning)
- Core innovation: Addresses posterior collapse in VAE-style continuous tokenizers under extreme token compression. MacTok enforces semantic encoding from incomplete evidence via random masking plus DINO-guided semantic masking, and stabilizes learning with global+local representation alignment. This yields compact 1D continuous latents with only 64/128 tokens while retaining discriminative information, enabling large token reductions (up to 64×) without sacrificing high-fidelity generation.
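The two masking signals can be sketched in a few lines: a uniform random mask plus a semantic mask that preferentially drops the most salient tokens (a stand-in for the DINO guidance). Saliency values and ratios here are invented for the example:

```python
import random

# Illustrative sketch of combining random masking with saliency-guided
# masking; the saliency scores stand in for DINO-derived signals.

def build_mask(saliency, random_ratio=0.3, semantic_ratio=0.2, seed=0):
    """Return the set of token indices to mask out."""
    rng = random.Random(seed)
    n = len(saliency)
    # Random component: drop a fixed fraction of tokens uniformly.
    random_part = set(rng.sample(range(n), int(n * random_ratio)))
    # Semantic component: drop the highest-saliency tokens, forcing the
    # encoder to infer semantics from the remaining evidence.
    by_saliency = sorted(range(n), key=lambda i: -saliency[i])
    semantic_part = set(by_saliency[: int(n * semantic_ratio)])
    return random_part | semantic_part

saliency = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7, 0.3, 0.15, 0.6, 0.4]
masked = build_mask(saliency)
# The two most salient tokens (indices 0 and 2) are always masked.
assert {0, 2} <= masked
```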
- [2026-03-31] Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis 🆕NEW
- 赛道归属: 文生图(世界知识增强/Agentic图像生成)
- 核心创新点: 将开放世界、长尾知识驱动的图像生成重构为“多模态Agent流水线”:提示理解→多模态证据检索→基于证据的grounded重描述→最终合成,突破仅依赖冻结参数知识导致的事实/长尾概念生成不可靠;构建143K高质量agent轨迹对全流程进行监督训练,并提出FactIP基准专测文化/长尾事实概念的外部知识落地能力,实现推理-检索-生成的紧耦合以提升可验证的世界一致性。
- Track: Text-to-image (world-grounded / agentic image synthesis)
- Core innovation: Reframes open-world, knowledge-intensive image synthesis as an agentic pipeline—prompt understanding, multimodal evidence search, grounded recaptioning, and final synthesis—mitigating failures from relying solely on frozen parametric knowledge. Trains with 143K curated agent trajectories supervising the full process, and introduces FactIP to explicitly test grounding on culturally significant and long-tail factual concepts, tightly coupling reasoning/search with generation for more reliable world-consistent outputs.
- [2026-03-31] FlowID: Enhancing Forensic Identification with Latent Flow-Matching Models 🆕NEW
- 赛道归属: 人脸修复/重建(身份保持的生成式编辑 + 法医应用)
- 核心创新点: 面向严重面部损伤场景提出FlowID:结合单图微调使生成模型适配分布外“受损人脸”,并用基于注意力的mask将编辑局限于受损区域以最大化身份关键特征保留;同时发布InjuredFaces基准,标准化评测“极端损伤下的身份保持重建”,强调在低显存、本地部署与隐私约束下的可用性。
- Track: Face reconstruction/restoration (identity-preserving generative editing for forensics)
- Core innovation: Proposes FlowID for identity-preserving reconstruction under severe facial damage by combining single-image fine-tuning (to adapt to out-of-distribution injured faces) with attention-based masking that localizes edits to damaged regions while preserving identity-critical cues. Introduces the InjuredFaces benchmark to standardize evaluation in extreme conditions, with an emphasis on low-memory, privacy-preserving local deployment.
- [2026-03-31] Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge 🆕NEW
- 赛道归属: 推理优化/端侧部署(生成式视觉模型量化 + 多LoRA一体化运行时)
- 核心创新点: 提出“LoRA权重作为运行时输入”的系统设计,将多任务LoRA从编译图中解耦,实现单一共享基础模型上动态切换任务,避免“每个LoRA一份模型副本”的冗余;进一步提出QUAD(Unified Adaptive Distillation量化感知训练),在共享量化配置下对多LoRA进行统一蒸馏对齐,降低量化带来的跨任务退化,使移动NPU上实现显著内存/时延下降且保持多任务视觉质量。
- Track: Inference optimization / edge deployment (quantization + multi-LoRA unified runtime)
- Core innovation: Treats LoRA weights as runtime inputs rather than embedding them into compiled graphs, enabling one shared foundation model with dynamic multi-task switching and eliminating per-LoRA model duplication. Introduces QUAD, a quantization-aware training strategy with unified adaptive distillation that aligns multiple LoRAs under a shared quantization profile, reducing cross-task quality degradation and delivering substantial memory/latency gains on mobile NPUs.
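The "LoRA weights as runtime inputs" idea reduces to computing the effective weight on the fly from one shared base plus the selected task's low-rank factors. A minimal sketch with toy 2x2 shapes and hypothetical task names:

```python
# Minimal sketch of runtime LoRA switching: one shared base weight,
# per-task low-rank deltas applied at call time instead of being baked
# into separate compiled models. Shapes and task names are toy examples.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

BASE = [[1.0, 0.0], [0.0, 1.0]]  # shared 2x2 base weight

# Each task ships only its low-rank factors (A: 2x1, B: 1x2).
ADAPTERS = {
    "style": ([[1.0], [0.0]], [[0.0, 0.5]]),
    "edit":  ([[0.0], [1.0]], [[0.25, 0.0]]),
}

def effective_weight(task):
    A, B = ADAPTERS[task]           # selected at runtime, no model copy
    return add(BASE, matmul(A, B))  # W_eff = W + A @ B

assert effective_weight("style") == [[1.0, 0.5], [0.0, 1.0]]
assert effective_weight("edit") == [[1.0, 0.0], [0.25, 1.0]]
```

Only the small A/B factors are swapped per task, which is why no per-LoRA model duplication is needed.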
- [2026-03-31] Adversarial Prompt Injection Attack on Multimodal Large Language Models 🆕NEW
- 赛道归属: 多模态安全(MLLM视觉提示注入/对抗攻击)
- 核心创新点: 提出不可感知的视觉Prompt Injection:用受限文本叠加作为“语义锚点”嵌入恶意指令,同时优化不可见扰动,使被攻击图像在粗粒度与细粒度特征空间同时对齐“恶意视觉目标+恶意文本目标”;并将视觉目标实例化为文本渲染图像、在迭代中逐步精炼以提升语义一致性与跨模型迁移,从而对闭源强模型实现更有效的视觉侧注入攻击。
- Track: Multimodal security (visual prompt injection / adversarial attacks on MLLMs)
- Core innovation: Develops an imperceptible visual prompt-injection attack that embeds malicious instructions via a bounded text overlay as semantic guidance while iteratively optimizing invisible perturbations to align the attacked image with both malicious visual and textual targets at coarse and fine feature levels. The visual target is instantiated as a text-rendered image and progressively refined during optimization to improve semantic fidelity and transferability, enabling stronger attacks against closed-source MLLMs.
- [2026-03-31] PromptForge-350k: A Large-Scale Dataset and Contrastive Framework for Prompt-Based AI Image Forgery Localization 🆕NEW
- 赛道归属: 图像取证(提示词驱动AI编辑的篡改定位:数据集 + 方法)
- 核心创新点: 提出自动化mask标注框架,利用关键点对齐与语义空间相似度生成高精度编辑区域GT,构建PromptForge-350k大规模数据集覆盖多种SOTA提示词编辑模型,解决该新型篡改定位的数据稀缺;提出ICL-Net三流骨干+图内对比学习,通过同图对比挖掘更稳健、可迁移的取证特征,在退化扰动下保持鲁棒并对未见编辑模型具备更强泛化。
- Track: Image forensics (prompt-based AI editing forgery localization: dataset + method)
- Core innovation: Introduces an automated mask-annotation pipeline using keypoint alignment and semantic-space similarity to produce accurate ground-truth edited-region masks, enabling PromptForge-350k—a large-scale dataset spanning multiple SOTA prompt-based editing models. Proposes ICL-Net with a triple-stream backbone and intra-image contrastive learning to learn robust, transferable forensic cues, improving localization accuracy, robustness to degradations, and generalization to unseen editing models.
- [2026-03-31] CIPHER: Counterfeit Image Pattern High-level Examination via Representation 🆕NEW
- 赛道归属: 深度伪造检测(跨生成模型泛化)
- 核心创新点: 提出“复用生成器判别器”的检测范式:系统性抽取并微调原本用于生成训练的判别器表征(如ProGAN判别器的尺度自适应特征),并融合扩散模型的时序一致性相关特征,捕获更“生成无关”的伪造痕迹;通过跨9类生成模型验证显著提升跨模型检测F1,核心突破在于用生成训练中学到的伪迹先验替代仅依赖通用ViT特征的脆弱检测器。
- Track: Deepfake detection (cross-generator generalization)
- Core innovation: Proposes a detector built on systematic reuse and fine-tuning of discriminators originally trained for image generation—e.g., scale-adaptive features from ProGAN discriminators—augmented with diffusion-related temporal-consistency features to capture more generation-agnostic artifacts. This leverages forgery priors learned during generative training, improving cross-model robustness beyond conventional ViT-based detectors across diverse generators.
- [2026-03-31] GazeCLIP: Gaze-Guided CLIP with Adaptive-Enhanced Fine-Grained Language Prompt for Deepfake Attribution and Detection 🆕NEW
- 赛道归属: 深度伪造检测与溯源(CLIP式多模态归因/检测)
- 核心创新点: 利用“真实/伪造在注视(gaze)向量分布存在显著差异、且不同生成范式对注视保持程度不同”的观察,提出Gaze-guided CLIP:通过GIE将注视提示(由gaze encoder提取)与图像伪造嵌入融合,学习更稳定的跨攻击归因特征空间;再用LRE自适应词选择生成细粒度增强语言嵌入,提升视觉-语言匹配精度,实现检测与归因的协同优化,并配套更细粒度评测基准覆盖扩散/flow等新型生成器。
- Track: Deepfake detection & attribution (CLIP-style multimodal attribution/detection)
- Core innovation: Builds on the observation that gaze-vector distributions differ between real and forged faces and that different generators preserve gaze differently. Proposes GazeCLIP: a gaze-aware image encoder (GIE) that fuses gaze prompts (from a gaze encoder) with forged-image embeddings to form a more stable, shared feature space for both detection and attribution, plus a language refinement encoder (LRE) with adaptive word selection to create fine-grained enhanced prompts for better vision-language matching. Accompanied by a fine-grained benchmark targeting modern generators (diffusion/flow) to evaluate generalization.
GitHub
- [2026-04-01] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐10398
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-04-01] apocas/restai ⭐481
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage an...
- [2026-03-31] Light-Heart-Labs/DreamServer ⭐431
Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
- [2026-03-31] hackclub/ai ⭐111 🆕NEW
💭 Free, unlimited AI and image generation for teens
- [2026-03-31] ferranpons/Llamatik ⭐85 🆕NEW
True on-device AI for Kotlin Multiplatform (Android, iOS, Desktop, JVM, WASM). LLM, Speech-to-Text and Image Generation — powered by llama.cpp, whispe...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-03-31] SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation 🆕NEW
- 赛道归属: 视频生成评测 / 文生长视频生成(Text-to-Long Video)评估基准
- 核心创新点: 提出SLVMEval“合成长视频元评测”基准,用于评估“评测系统”本身在长视频(最长约10,486秒/约3小时)场景下的可靠性;采用成对比较的元评测范式,基于密集视频描述数据集对源视频进行可控的合成退化,构造覆盖10类质量维度的“高质量vs低质量”对,并通过众包筛选仅保留人类可稳定感知差异的样本,从而形成对评测器更“可判别”的测试床;用该测试床直接度量现有自动评测系统对长视频质量排序的正确率,系统性暴露其在多数维度上显著落后于人类判断的短板。
- Track: Video generation evaluation / Text-to-Long Video evaluation benchmark
- Core innovation: Introduces SLVMEval, a synthetic long-video meta-evaluation benchmark that evaluates the evaluators themselves under very long durations (up to ~10,486s / ~3h); adopts a pairwise-comparison meta-eval protocol and creates controlled “high-vs-low quality” pairs by synthetically degrading videos along 10 distinct aspects from dense video-captioning sources, then uses crowdsourcing to retain only pairs with clearly perceivable degradations—yielding a highly discriminative testbed; measures existing automatic evaluators by their ranking accuracy on these pairs, revealing systematic gaps versus human judgment across most aspects.
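The pairwise meta-evaluation protocol boils down to scoring an evaluator by how often it ranks the undegraded video above its degraded counterpart. A sketch with invented scores:

```python
# Sketch of the pairwise meta-eval protocol: an automatic evaluator is
# scored by how often it ranks source videos above degraded ones.
# The video ids and quality scores below are invented.

def ranking_accuracy(pairs, evaluator):
    """pairs: list of (high_quality_id, low_quality_id)."""
    correct = sum(evaluator(hi) > evaluator(lo) for hi, lo in pairs)
    return correct / len(pairs)

# Toy evaluator: a lookup of automatic quality scores per video id.
scores = {"a_src": 0.9, "a_deg": 0.4,
          "b_src": 0.7, "b_deg": 0.8,   # evaluator fooled here
          "c_src": 0.6, "c_deg": 0.2}
pairs = [("a_src", "a_deg"), ("b_src", "b_deg"), ("c_src", "c_deg")]

acc = ranking_accuracy(pairs, scores.get)
assert abs(acc - 2 / 3) < 1e-9
```

A perfect evaluator scores 1.0; the gaps SLVMEval reports are exactly the distance from that ceiling.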
- [2026-03-31] TrajectoryMover: Generative Movement of Object Trajectories in Videos 🆕NEW
- 赛道归属: 视频编辑 / 目标运动编辑(3D轨迹迁移与生成式移动)
- 核心创新点: 聚焦“移动物体的3D运动轨迹且保持其相对3D运动规律”的缺失能力,提出TrajectoryMover以实现对视频中目标轨迹的生成式迁移(不仅是指定轨迹或改外观);关键突破在于提出TrajectoryAtlas大规模合成配对数据生成管线,绕开真实配对数据难获取、以及“从一个视频构造另一个视频”在该任务上不可行的问题;利用TrajectoryAtlas生成的成对数据对视频生成模型进行定向微调,使模型学会在保持身份一致性与时空合理性的同时,对目标的3D运动轨迹进行可控移动。
- Track: Video editing / object motion editing (3D trajectory transfer & generative relocation)
- Core innovation: Targets the missing capability of moving an object’s 3D motion trajectory while preserving its relative 3D dynamics, beyond prior work that mainly prescribes 2D/3D paths or edits appearance; introduces TrajectoryAtlas, a scalable synthetic paired-data generation pipeline that avoids the need for hard-to-obtain real paired videos and the failure modes of “construct one video from another” pairing tricks; fine-tunes a video generator (TrajectoryMover) on these paired samples to enable controllable trajectory relocation while maintaining identity and overall video plausibility.
- [2026-03-30] Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas 🆕NEW
- 赛道归属: 3D生成 / 沉浸式场景生成(文本到可探索3D世界,多视角全景扩展)
- 核心创新点: 提出Stepper,通过“分步式全景场景扩展”统一解决沉浸式文本生成中“高保真 vs 可探索性”的矛盾:避免自回归扩展的上下文漂移,同时突破全景视频生成分辨率受限的问题;核心方法是新的多视角360°扩散模型,用于一致且高分辨率的全景扩展,并结合几何重建管线对结构施加几何一致性约束,从生成阶段到重建阶段共同保证可探索3D的结构稳定;配套构建大规模多视角全景数据集进行训练,从而在视觉质量与结构一致性上达到SOTA。
- Track: 3D generation / immersive scene generation (text-to-explorable worlds, multiview panorama expansion)
- Core innovation: Proposes Stepper, a unified framework that resolves the fidelity–explorability trade-off in text-driven immersive scene synthesis via stepwise panoramic expansion: it mitigates context drift seen in autoregressive expansion while overcoming the low-resolution limitation of panoramic video generation; introduces a novel multiview 360° diffusion model for consistent high-resolution panorama expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence to stabilize explorable 3D structure; trained on a new large-scale multiview panorama dataset, achieving state-of-the-art visual fidelity and structural consistency.
- [2026-03-30] VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
- 赛道归属: 视频生成(自动驾驶场景生成 / 多视角可控生成)
- 核心创新点: 将多视角视觉-语言推理注入长时驾驶视频生成器,在中间表征层融合VLM特征以实现对象级细粒度控制(可用3D对象/图像/文本指定实体)且保持长序列时空一致性;提出多视角视觉语言评估器(MV-VLM)对生成结果进行自动一致性评测,并构建“生成-评估-再生成”的闭环自纠错机制;在闭环中加入对象级精修模块,针对MV-VLM判定不满足的实体进行局部修复并回灌再生成,从而提升长尾目标可控性与长视频一致性。
- Track: Video generation (autonomous driving scenario generation / multi-view controllable generation)
- Key innovations: Injects multi-view vision-language reasoning into long-horizon driving video generators by fusing VLM features in intermediate representations to enable fine-grained, object-level control (via 3D assets/images/text) while preserving spatiotemporal coherence; introduces a Multi-View Vision-Language Evaluator (MV-VLM) to automatically assess consistency and forms a generate–evaluate–regenerate closed loop; adds an object-level refinement module to locally fix MV-VLM-failed entities and feed them back for regeneration, improving long-tail controllability and long-video consistency.
- [2026-03-30] LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization
- 赛道归属: 多图故事可视化 / 图像序列生成(逻辑一致性建模)
- 核心创新点: 将“视觉逻辑(角色-动作-场景的感知与因果连贯)”从隐式期望变为显式建模目标:1) 设计多智能体系统分别负责角色设定落地、因果链抽取、以及跨图一致性验证,把结构化故事规划与图像生成解耦并闭环约束;2) 通过一致性校验机制减少动作断裂、叙事碎片化等典型失败模式,提升多图序列的可读性与因果连贯;3) 构建LogicTale基准,提供强调因果推理与可解释标注的评测资源,并配套自动+人工协议以同时衡量逻辑与感知质量。
- Track: Multi-image story visualization / Image-sequence generation with logical consistency
- Core innovations: Makes “visual logic” (perceptual + causal coherence across characters/actions/scenes over time) an explicit objective rather than an emergent property: (1) a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, bridging structured planning and image generation; (2) consistency verification reduces disjoint actions and fragmented narratives, improving readability and causal flow in image sequences; (3) introduces LogicTale, a benchmark with rich causal/interpretability annotations plus automatic and human protocols to evaluate both visual logic and perceptual quality.
- [2026-03-30] FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation
- 赛道归属: 视频生成(手语视频生成 / 高效推理)
- 核心创新点: 提出无姿态(pose-free)手语生成框架,直接从自然语言到视频的扩散建模,避免“文本→姿态→渲染”的中间表示依赖,从而提升灵活性并减少误差传播;设计可训练滑窗分块注意力(T-STA),利用时空局部性引入“训练期+推理期一致”的可学习稀疏注意力,解决以往training-free稀疏带来的train-test gap,在保持质量的同时显著加速推理(报告3.07×)。
- Track: Video generation (sign language synthesis / efficient inference)
- Key innovations: Proposes a pose-free diffusion framework that maps text directly to sign-language videos, removing pose intermediates and reducing error accumulation; introduces Trainable Sliding Tile Attention (T-STA) that exploits spatiotemporal locality with trainable sparsity used consistently in training and inference, eliminating the train–test gap of training-free sparsification while achieving substantial speedups (3.07×) without quality loss.
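The locality pattern that T-STA exploits can be illustrated with a plain sliding-window attention mask; the tile layout and the trainable sparsity of the actual method are beyond this toy:

```python
# Sketch of the locality behind sliding-window attention: each query
# position may attend only to keys within a fixed window. T-STA's tile
# structure and learned sparsity are not modeled here.

def sliding_window_mask(seq_len, window):
    """Boolean mask[i][j] == True iff position i may attend to j."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=1)
assert mask[2] == [False, True, True, True, False, False]
# Sparsity is the source of the speedup: 16 of 36 entries remain.
allowed = sum(sum(row) for row in mask)
assert allowed == 16
```

Because the same mask is used in training and inference, there is no train-test mismatch, which is the gap the paper attributes to training-free sparsification.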
- [2026-03-29] Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning
- 赛道归属: 视频推理强化 / 生成模型对齐(RL微调与奖励设计)
- 核心创新点: 将GRPO式强化学习适配到流式(flow-based)视频模型以提升迷宫/导航等需要多步规划的“视频推理”能力,并系统揭示多模态奖励模型在该设定下易崩溃;核心突破在于提出可验证(verifiable)奖励:对结构化环境构造多分量轨迹奖励,对机器人导航提出嵌入空间可验证奖励,用客观任务度量替代主观VLM打分,从而显著提升训练稳定性与泛化,并给出奖励设计的系统性实证结论。
- Track: Video reasoning RL / generative model alignment (RL fine-tuning & reward design)
- Key innovations: Adapts GRPO-style RL to flow-based video models for multi-step planning tasks (mazes/navigation) and shows multimodal reward models can fail catastrophically in this regime; introduces verifiable rewards grounded in objective task metrics—multi-component trajectory rewards for structured games and an embedding-level verifiable reward for robot navigation—yielding more stable RL training and improved generalization, supported by a systematic reward-design study.
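A multi-component verifiable reward for a maze rollout might look like the sketch below: every component is an objective check on the predicted trajectory rather than a VLM score. Components and weights are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical multi-component verifiable reward for a maze trajectory.
# Each term is objectively checkable; weights are invented.

def maze_reward(path, walls, goal, weights=(0.5, 0.3, 0.2)):
    w_goal, w_valid, w_short = weights
    # 1) Did the trajectory reach the goal?
    reached = 1.0 if path and path[-1] == goal else 0.0
    # 2) Fraction of steps that are legal (adjacent cell, not a wall).
    legal = sum(
        abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 and b not in walls
        for a, b in zip(path, path[1:])
    )
    valid = legal / max(len(path) - 1, 1)
    # 3) Shortness bonus vs the Manhattan-distance lower bound.
    lower = abs(path[0][0] - goal[0]) + abs(path[0][1] - goal[1])
    short = lower / max(len(path) - 1, 1) if reached else 0.0
    return w_goal * reached + w_valid * valid + w_short * short

walls = {(1, 1)}
good = maze_reward([(0, 0), (0, 1), (0, 2), (1, 2)], walls, goal=(1, 2))
bad = maze_reward([(0, 0), (1, 1)], walls, goal=(1, 2))
assert good > bad
```

Unlike a reward-model score, every term here can be recomputed and audited, which is what makes the signal hard to hack during RL.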
- [2026-03-29] TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets
- 赛道归属: 视频编辑(文生视频的连续属性控制 / 可控生成)
- 核心创新点: 提出TokenDial,在预训练T2V模型的时空patch-token中间空间学习“语义控制方向”,通过对token施加可加性偏移(offset)实现类似滑条的连续强度控制(外观与运动均可),并通过调节偏移幅度获得可预测、连贯的变化且尽量不漂移身份/背景/时序一致性;学习offset时不重训主干,而是利用预训练理解信号:外观用语义方向匹配,运动用运动幅度缩放约束,实现轻量、可迁移的控制接口。
- Track: Video editing (continuous attribute control for text-to-video / controllable generation)
- Key innovations: TokenDial learns semantic control directions in the pretrained T2V model’s spatiotemporal patch-token space and performs slider-like continuous attribute control via additive token offsets, enabling predictable, coherent changes in appearance and motion while reducing identity/background/temporal drift; learns attribute-specific offsets without retraining the backbone, leveraging pretrained understanding signals (semantic direction matching for appearance; motion-magnitude scaling for dynamics) for a lightweight, transferable control mechanism.
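The slider mechanic reduces to adding a scaled direction vector to every token; the direction below is a toy stand-in for a learned one, and no backbone weights are touched:

```python
# Sketch of slider-like control via additive token offsets. The
# direction vector stands in for a learned semantic direction.

def apply_offset(tokens, direction, strength):
    """tokens: list of token vectors; direction: one offset vector."""
    return [[t + strength * d for t, d in zip(tok, direction)]
            for tok in tokens]

tokens = [[0.2, 0.5], [0.1, 0.4]]
direction = [1.0, -0.5]          # e.g. "brighter / less motion"

mild = apply_offset(tokens, direction, strength=0.1)
strong = apply_offset(tokens, direction, strength=0.5)

# Zero strength is the identity, so the base generation is recoverable.
assert apply_offset(tokens, direction, 0.0) == tokens
# Larger strength moves tokens monotonically further along the direction.
assert strong[0][0] > mild[0][0] > tokens[0][0]
```

The monotonic relation between strength and displacement is what makes the control behave like a predictable dial rather than a re-prompt.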
- [2026-03-29] KV Cache Quantization for Self-Forcing Video Generation: A 33-Method Empirical Study
- 赛道归属: 推理优化(长视频生成系统 / KV Cache压缩与量化)
- 核心创新点: 面向self-forcing长时滚动生成的关键瓶颈(KV cache随长度线性膨胀),给出覆盖33种KV量化与缓存策略的系统性实证评测框架,联合度量显存峰值、时延、压缩率、画质与漂移;提出并验证更可部署的“FlowCache式soft-prune + INT4自适配”工作区间,在较小质量代价下实现约5.4×压缩并显著降低峰值显存;同时揭示“名义压缩率≠真实显存收益”的工程根因(注意力/refresh阶段仍保留或重建BF16大buffer),为后续内存集成优化指明方向。
- Track: Inference optimization (long-horizon video generation systems / KV-cache compression & quantization)
- Key innovations: Provides a large-scale empirical study (33 methods) targeting the core bottleneck of self-forcing long rollouts—KV cache growth—evaluating VRAM peak, latency, compression, quality, and drift; identifies a deployment-friendly regime with FlowCache-inspired soft-prune INT4 adaptation achieving ~5.4× compression and large VRAM reduction at a modest quality cost; crucially shows nominal compression can fail to reduce peak VRAM due to BF16 buffer reconstruction/retention during attention/refresh, pinpointing integration issues and practical research directions.
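The INT4 end of the design space can be sketched with symmetric per-vector quantization: values map to integers in [-8, 7] under one shared scale and are dequantized on read. This ignores the soft-prune component and, as the study stresses, the runtime BF16 buffers that can erase the nominal savings:

```python
# Sketch of symmetric per-vector INT4 quantization of a KV-cache entry.

def quantize_int4(vec):
    scale = max(abs(v) for v in vec) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in vec]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

kv = [0.12, -0.7, 0.33, 0.05]
q, scale = quantize_int4(kv)
assert all(-8 <= x <= 7 for x in q)  # 4-bit codes instead of 16-bit floats

recovered = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(kv, recovered))
assert err <= scale / 2 + 1e-9   # rounding error bounded by half a step
```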
- [2026-03-28] LOME: Learning Human-Object Manipulation with Action-Conditioned Egocentric World Model
- 赛道归属: 视频生成(动作条件生成 / 人-物交互与操控)
- 核心创新点: 提出面向第一人称场景的人-物操控世界模型LOME,支持以输入图像+文本+逐帧人体动作(身体姿态+手势)为条件生成交互视频;方法上通过训练时联合估计空间动作与环境上下文,把强动作约束“注入”到视频生成中,提升动作跟随与接触丰富交互的物理一致性;通过在多样化egocentric交互数据上微调预训练视频生成模型,实现对未见物体/场景的泛化,并能生成如倒水等具有合理后果的交互动态。
- Track: Video generation (action-conditioned generation / human–object interaction & manipulation)
- Key innovations: LOME is an egocentric world model that generates human–object manipulation videos conditioned on an input image, text, and per-frame human actions (body pose + hand gestures); it injects strong action guidance by jointly estimating spatial actions and environment context during training, improving action adherence and contact-rich physical plausibility; fine-tunes a pretrained video generator on diverse egocentric interactions to generalize to unseen scenarios and produce realistic consequences (e.g., pouring dynamics) without explicit 3D/4D simulation.
GitHub
- [2026-03-31] showlab/Awesome-Video-Diffusion ⭐5552 🆕NEW
A curated list of recent diffusion models for video generation, editing, and various other applications.
- [2026-03-31] hao-ai-lab/FastVideo ⭐3335
A unified inference and post-training framework for accelerated video generation.
- [2026-03-31] ModelTC/LightX2V ⭐2126
Light Image Video Generation Inference Framework
- [2026-03-31] YouMind-OpenLab/awesome-seedance-2-prompts ⭐464
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-03-31] vargHQ/sdk ⭐249
AI video generation SDK — JSX for videos. One API for Kling, Flux, ElevenLabs, Sora. Built on Vercel AI SDK.
音频生成 / Audio Generation
arXiv
- [2026-03-30] AutoCut: End-to-end advertisement video editing based on multimodal discretization and controllable generation
- 赛道归属: 视频编辑(广告短视频自动化剪辑)/ 多模态生成与编辑(视频-音频-文本统一表征)
- 核心创新点: 提出端到端广告视频编辑框架,将视频与音频分别编码后通过残差向量量化(RVQ)离散化为统一token,并与文本对齐,构建共享的“视频-音频-文本”token空间;在此基础上通过多模态对齐+监督微调训练面向编辑的多模态大模型,在同一框架内联合完成素材选择与排序、脚本生成、BGM选择等决策,并将预测token序列映射回可部署的长视频输出,从而提升跨模态一致性与可控性、降低制作与迭代成本。
- Track: Video editing (automated ad video production) / Multimodal generation & editing (unified video-audio-text representation)
- Core innovations: Proposes an end-to-end ad video editing system that encodes video and audio, then discretizes them via residual vector quantization (RVQ) into unified tokens aligned with text, forming a shared video-audio-text token space. On top of a foundation model, it trains an editing-oriented multimodal LLM via multimodal alignment plus supervised fine-tuning to jointly handle clip selection/ordering, script generation, and background music selection within one pipeline, and finally renders predicted token sequences into deployable long-form videos—improving cross-modal consistency and controllability while reducing production/iteration cost.
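The RVQ step at the heart of the token space works by having each stage quantize the residual left over from the previous one, so a vector becomes one code index per stage. A toy sketch with tiny hand-picked codebooks (real ones are learned):

```python
# Toy residual vector quantization (RVQ): stage k quantizes the residual
# left by stage k-1 against its own codebook. Codebooks here are
# hand-picked examples, not learned ones.

def nearest(codebook, vec):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]

def rvq_encode(vec, codebooks):
    codes, residual = [], list(vec)
    for book in codebooks:
        idx, code = nearest(book, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, code)]
    return codes, residual

codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # refinement stage
]
codes, residual = rvq_encode([1.2, 0.1], codebooks)
assert codes == [1, 1]   # picks (1,0), then refines with (0.25, 0)
assert max(abs(r) for r in residual) < 0.3
```

Stacking stages shrinks the residual while the code (a few small indices) stays compact, which is what lets video and audio share one discrete token space.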
- [2026-03-25] AVControl: Efficient Framework for Training Audio-Visual Controls
- 赛道归属: 音视频联合生成的可控生成(模块化控制/LoRA适配)/ 视频生成与编辑控制
- 核心创新点: 提出可扩展的音视频控制训练框架AVControl:基于联合音视频基础模型LTX-2,将每一种控制模态(深度、姿态、边缘、相机轨迹与内参、稀疏运动、修复/扩展、音频变换等)独立训练为单独的LoRA模块;通过“并行画布(parallel canvas)”把参考信号以额外token注入注意力层,实现无需改动主干架构即可添加新控制模态,并解决将图像in-context控制直接扩展到视频时在结构控制上失效的问题;训练上具备数据与算力高效性(小数据、少步数收敛),同时实现对多控制模态的可插拔组合,并给出面向联合音视频生成模型的模块化控制(含音视频控制)的系统化落地。
- Track: Controllable audio-visual generation (modular controls via LoRA adapters) / Video generation & editing controls
- Core innovations: Introduces AVControl, an extensible control-training framework on the joint audio-visual foundation model LTX-2. Each control modality (depth, pose, edges, camera trajectory with intrinsics, sparse motion, inpainting/outpainting, audio-related transforms, etc.) is trained as an independent LoRA module. A “parallel canvas” injects the reference signal as extra tokens into attention, enabling new modalities without backbone architectural changes and fixing the failure of naively extending image in-context control to video for structural guidance. The approach is compute/data efficient (small datasets, converging in a few hundred to a few thousand steps), supports plug-and-play composition of independently trained controls, and provides a modular control recipe for joint audio-visual generation models (including audio-visual controls).
GitHub
- [2026-03-31] huggingface/diffusers ⭐33224
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-03-30] Lightricks/LTX-2 ⭐5452 🆕NEW
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
语言大模型 / Large Language Models
arXiv
- [2026-03-25] Composer 2 Technical Report
- 赛道归属: 代码大模型 / Agentic 软件工程(工具使用与长程任务强化学习)
- 核心创新点: 采用“两阶段训练范式”打造面向真实软件工程的专用智能体模型:先通过持续预训练增强领域知识与潜在编码能力,再进行大规模强化学习以提升端到端编码表现,重点强化长程推理、多步执行准确性与长上下文一致性;同时构建与线上部署一致的 Cursor harness 训练基础设施,使训练时的工具链、交互结构与真实使用环境对齐,并使用高度贴近真实问题的环境进行学习;提出源自大型真实代码库的软件工程基准(CursorBench)用于分级评估更高难度的长程工程任务能力,从而形成“真实环境对齐训练 + 强化学习优化执行”的可复用专用模型训练流程。
- Track: Code LLMs / Agentic Software Engineering (tool-use and long-horizon RL)
- Core innovations: Introduces a two-stage training recipe for a domain-specialized software-engineering agent: (1) continued pretraining to strengthen domain knowledge and latent coding skills, followed by (2) large-scale reinforcement learning to improve end-to-end coding via stronger reasoning, accurate multi-step execution, and long-horizon coherence; builds training infrastructure that mirrors the deployed Cursor harness so tools, interaction structure, and runtime constraints are aligned between training and real usage, and trains in environments closely matching real-world problems; proposes a benchmark derived from real large-codebase engineering tasks (CursorBench) to evaluate progressively harder long-horizon workflows, yielding a reusable pipeline of “realistic environment alignment + RL for execution quality” for frontier coding agents.
GitHub
- [2026-03-31] abhigyanpatwari/GitNexus ⭐21005
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-03-31] clice-io/clice ⭐1188
A next-generation C++ language server for modern C++, focused on high performance and deep code intelligence
- [2026-03-31] DeusData/codebase-memory-mcp ⭐1105
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-m...
- [2026-03-28] Anandb71/arbor ⭐102
Graph-native code intelligence that replaces embedding-based RAG with deterministic program understanding.
- [2026-03-30] Cre4T3Tiv3/gitvoyant ⭐73
Temporal Code Intelligence platform. Time-series complexity analysis across Python, JavaScript, Java, and Go. Linear regression trend detection, cyclo...
多模态大模型 / Multimodal Models
arXiv
- [2026-03-30] AutoDrive-P^3: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning 📖1
- 赛道归属: 多模态推理与决策 / 自动驾驶端到端规划(P3链式思维+强化微调)
- 核心创新点: 提出统一的Perception-Prediction-Planning链式思维框架:通过结构化CoT把感知输出作为预测与规划的条件输入,并让预测与感知共同约束最终规划,解决“直接出规划”或“三模块割裂”导致的协同不足。构建P^3-CoT数据集,并提出分层强化微调算法P^3-GRPO,对三阶段提供渐进式监督与奖励分解;同时引入“细致/快速”双思考模式在推理成本与性能间可控切换,实现更安全、可解释的驾驶决策。
- Track: Multimodal Reasoning & Decision-making / Autonomous driving planning (P3 chain-of-thought + RL fine-tuning)
- Core innovations: Proposes a unified Perception–Prediction–Planning chain-of-thought framework where structured reasoning explicitly feeds perception into prediction and planning, and jointly uses perception + prediction to constrain the final plan, addressing both “direct-to-plan” gaps and fragmented multi-module pipelines. It introduces the P^3-CoT dataset and a hierarchical RL fine-tuning algorithm (P^3-GRPO) that provides progressive supervision and reward decomposition across the three stages, plus dual thinking modes (detailed vs fast) to trade off inference cost and performance for safer, more interpretable driving decisions.
- [2026-03-30] Efficient Inference of Large Vision Language Models 📖1 🆕NEW
- 赛道归属: 多模态推理优化(LVLM高效推理/部署加速综述)
- 核心创新点: 提出面向LVLM推理加速的系统化综述与分类学,将现有方法按四个关键维度统一组织:视觉token压缩、内存管理与服务化、效率导向架构设计、以及高级解码策略;在同一框架下对各类技术的适用场景与瓶颈进行批判性梳理,并明确高分辨率视觉token导致注意力二次复杂度这一核心痛点,进一步归纳开放问题以指导可扩展多模态系统的后续研究。
- Track: Multimodal inference optimization (efficient LVLM inference & deployment survey)
- Key innovations: Provides a structured survey and taxonomy for LVLM inference acceleration, organizing methods into four dimensions—visual token compression, memory/serving, efficient architecture design, and advanced decoding—while critically analyzing limitations under the key bottleneck of quadratic attention with high-resolution visual tokens, and outlining open problems to guide scalable multimodal system research.
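One representative technique from the visual-token-compression family in this taxonomy is importance-based pruning: score each visual token (e.g. by attention received from the text query) and keep only the top-k. A sketch with invented scores:

```python
# Sketch of top-k visual token pruning; importance scores are invented
# stand-ins for, e.g., text-to-vision attention weights.

def prune_tokens(tokens, importance, keep):
    """Keep the `keep` highest-importance tokens, preserving order."""
    kept = sorted(range(len(tokens)), key=lambda i: -importance[i])[:keep]
    kept = sorted(kept)  # restore original spatial order
    return [tokens[i] for i in kept]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
importance = [0.9, 0.05, 0.6, 0.1, 0.8, 0.2]

assert prune_tokens(tokens, importance, keep=3) == ["t0", "t2", "t4"]
```

Because attention cost is quadratic in sequence length (the bottleneck the survey highlights), halving the visual tokens cuts that cost roughly fourfold.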
- [2026-03-28] VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation 📖1 🆕NEW
- 赛道归属: 视频理解与分割(指代视频目标分割/RVOS,时空分割+推理)
- 核心创新点: 提出端到端统一框架VIRST,将“全局视频级语义推理”和“像素级掩码预测”融合到单模型中,避免传统“关键帧VLM+传播模块”在大运动与多步推理场景下的割裂;通过Spatio-Temporal Fusion(STF)把分割感知的视频特征注入视觉-语言骨干以对齐语义与分割表征,并引入Temporal Dynamic Anchor Updater动态维护时间相邻锚帧,在遮挡/再出现/大位移下提供稳定时序线索,从而提升复杂时空动态与推理型查询的鲁棒性与泛化。
- Track: Video understanding & segmentation (Referring Video Object Segmentation with spatiotemporal reasoning)
- Key innovations: Proposes VIRST, an end-to-end unified model that couples global video-level reasoning with pixel-level mask prediction, addressing the fragmentation of keyframe-VLM + propagation pipelines under large motion and multi-step reasoning. It introduces Spatio-Temporal Fusion (STF) to inject segmentation-aware video features into the VLM backbone for semantic–segmentation alignment, and a Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames for stable cues under occlusion, reappearance, and large motion.
- [2026-03-28] Towards Intrinsic-Aware Monocular 3D Object Detection 📖1 🆕NEW
- 赛道归属: 3D感知(单目3D目标检测/跨相机泛化)
- 核心创新点: 提出MonoIA,将相机内参变化从“数值条件”转化为“语义/感知变换”建模:利用LLM/VLM生成内参嵌入,编码焦距/主点等参数对尺度、透视与空间几何外观的影响;通过层级式Intrinsic Adaptation Module把该嵌入注入检测网络以调制特征表征,实现跨不同内参配置的一致3D检测与更强跨数据集泛化,缓解Mono3D对相机内参敏感的问题。
- Track: 3D perception (monocular 3D object detection, cross-intrinsics generalization)
- Key innovations: Introduces MonoIA, reframing camera intrinsics variation from numeric conditioning to a semantic/perceptual transformation. It uses LLM/VLMs to produce intrinsic embeddings capturing how intrinsics affect apparent scale, perspective, and geometry, and hierarchically injects them via an Intrinsic Adaptation Module to modulate features for consistent 3D detection across cameras, improving robustness and multi-dataset generalization.
- [2026-03-27] GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection 📖1 🆕NEW
- 赛道归属: 多模态检测(细粒度开放词汇目标检测/属性-主体解耦)
- 核心创新点: 提出GUIDED分解式框架,针对VLM嵌入中“主体-属性语义纠缠”导致的属性过度主导、定位漂移与语义漂移问题,将定位与细粒度识别解耦为模块化流水线:先用语言模型从细粒度类别名中抽取粗粒度主体与属性;仅用主体嵌入驱动检测以稳定定位;再用注意力式属性嵌入融合模块选择性注入有效属性以保留判别力;最后通过区域级属性判别模块(带投影头的精炼VLM对齐)对候选框进行细粒度文本对比判别,实现更稳的定位与更准的细粒度开放词汇识别。
- Track: Multimodal detection (fine-grained open-vocabulary object detection; subject–attribute disentanglement)
- Key innovations: Proposes GUIDED, a decomposition framework to mitigate subject–attribute entanglement in pretrained VLM embeddings that causes attribute over-dominance and mislocalization. It separates localization and fine-grained recognition: an LM extracts a coarse subject plus attributes; detection is guided only by the subject embedding for stable localization; an attention-based attribute fusion selectively injects helpful attributes; and a region-level attribute discrimination module with a refined VLM + projection head performs fine-grained alignment against full class names.
- [2026-03-31] From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety 🆕NEW
- 赛道归属: 边缘端多模态视频分析(公共安全动作检测/系统工程)
- 核心创新点: 提出并落地一种混合式边缘动作检测系统架构:以骨架(skeleton)运动分析作为低算力、隐私友好的常驻检测前端,并在需要时调用视觉-语言模型进行语义场景解释与零样本推理,从系统层面实现“快速运动学筛查 + 高层语义补充”的分层协同;贡献重点在真实边缘约束下对两类范式的延迟、资源占用与运行权衡进行对比评估,并给出可部署的选择性增强策略,而非提出新的识别模型。
- Track: Edge multimodal video analytics (public-safety action detection; system design/deployment)
- Key innovations: Designs and deploys a hybrid edge action-detection system that pairs always-on skeleton-based motion analysis (low-latency, privacy-preserving, lightweight) with on-demand vision-language semantic interpretation and zero-shot reasoning. The main contribution is a system-level architecture and empirical trade-off analysis (latency/resource/operational constraints) on a GPU edge device, demonstrating a selective augmentation strategy rather than proposing new recognition models.
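The hierarchical "fast kinematic screening + on-demand semantic interpretation" gate can be sketched as follows (a minimal sketch: the confidence threshold and model interfaces are my assumptions, not the paper's API):

```python
def detect_action(frame, skeleton_model, vlm, conf_threshold=0.8):
    """Selective-augmentation gate: always run the cheap, privacy-friendly
    skeleton detector; escalate only low-confidence frames to the
    expensive vision-language model for semantic interpretation."""
    label, conf = skeleton_model(frame)   # fast kinematic path
    if conf >= conf_threshold:
        return label
    return vlm(frame)                     # on-demand semantic path
```

Under real edge constraints, the threshold trades VLM invocation frequency (latency, power) against semantic coverage.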
- [2026-03-31] TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios 🆕NEW
- 赛道归属: 多模态评测基准(室内安全隐患评估/可信VLM Benchmark)
- 核心创新点: 提出TSHA可信安全隐患评估基准,针对现有评测“仿真合成域差、任务过简、缺少严格协议”三大缺陷:构建81,809条多源训练数据(现有室内数据集+互联网图像+AIGC图像+新采集图像),并设计包含视频与全景图、且多隐患共存的高难测试集(1707)以检验复杂场景鲁棒性;通过对23个VLM的系统实验揭示现有模型能力缺口,并验证用TSHA训练可显著提升本基准与跨基准泛化(最高+18.3),从数据与协议层面推动“可信安全评估”更贴近真实部署。
- Track: Multimodal evaluation/benchmarking (trustworthy indoor safety hazard assessment)
- Key innovations: Introduces TSHA to address key benchmark gaps—simulation-heavy domain shift, oversimplified tasks, and weak evaluation protocols—by curating 81,809 real/mixed-source training samples and a challenging 1,707-sample test set including videos and panoramas with multiple hazards to stress robustness. Large-scale evaluation on 23 VLMs exposes capability deficits, and training on TSHA yields substantial gains (up to +18.3) and improved cross-benchmark generalization, advancing realistic trustworthy safety assessment.
- [2026-03-31] A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models 🆕NEW
- 赛道归属: 多模态可解释性与分析(LVLM信息分解/融合归因)
- 核心创新点: 提出基于部分信息分解(PID)的模型无关分析框架,量化LVLM决策信息的“冗余-独有-协同”谱系,从而回答性能来源是多模态协同融合还是单模态先验;通过可扩展估计器适配现代LVLM输出,对26个LVLM在多数据集上进行跨模型/跨任务(广度)、层级动态(深度)与训练过程(时间)三维剖析,发现“协同驱动 vs 知识驱动”任务分区、“融合中心 vs 语言中心”家族策略,以及层内三阶段处理模式,并指出视觉指令微调是学习融合的关键阶段,为后续LVLM设计与评测提供超越准确率的定量诊断工具。
- Track: Multimodal interpretability & analysis (LVLM information decomposition / fusion attribution)
- Key innovations: Proposes a model-agnostic PID-based framework to decompose decision-relevant information into redundant, unique, and synergistic components, directly attributing whether gains come from true multimodal fusion or unimodal priors. With a scalable estimator adapted to LVLM outputs, it profiles 26 LVLMs across datasets along breadth (models/tasks), depth (layer-wise dynamics), and time (training dynamics), revealing synergy- vs knowledge-driven task regimes, fusion-centric vs language-centric family strategies, and a consistent three-phase layer pattern, and identifies visual instruction tuning as the stage where fusion is learned.
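For readers unfamiliar with PID, the two-source decomposition such frameworks build on (standard Williams-Beer form; the labels $V$, $T$, $Y$ for visual input, text input, and model decision are my notation, not necessarily the paper's) is:

```latex
I(V, T; Y) = \underbrace{\mathrm{Red}(V, T; Y)}_{\text{redundant}}
           + \underbrace{\mathrm{Unq}(V; Y)}_{\text{vision-unique}}
           + \underbrace{\mathrm{Unq}(T; Y)}_{\text{text-unique}}
           + \underbrace{\mathrm{Syn}(V, T; Y)}_{\text{synergistic}}
```

A large synergy term indicates genuine multimodal fusion; a large text-unique term indicates reliance on language priors.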
- [2026-03-31] Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras 🆕NEW
- 赛道归属: 边缘端跨模态检索(流式检索/新颖性过滤/索引构建)
- 核心创新点: 提出面向常开边缘摄像头的流式跨模态检索架构,用设备端epsilon-net新颖性过滤在单遍扫描中仅保留“语义新颖”帧,构建去冗余嵌入索引,解决冗余帧挤占top-k导致召回下降的问题;同时引入跨模态适配器与云端重排序以补偿小型端侧编码器对齐能力弱的缺陷;在多种VLM规模与两套第一视角数据集上验证流式过滤优于多种离线采样/聚类基线,并展示在极低功耗端侧模型下仍可获得显著检索效果提升,强调“存得更少、找得更多”的系统性收益。
- Track: Edge cross-modal retrieval (streaming indexing; novelty filtering)
- Key innovations: Presents a streaming retrieval architecture for always-on edge cameras: an on-device epsilon-net novelty filter keeps only semantically novel frames to build a deduplicated embedding index in a single pass, preventing redundant frames from crowding top-k results. A cross-modal adapter plus cloud re-ranking compensates for the weak alignment of compact on-device encoders. Experiments across VLM sizes and egocentric datasets show consistent gains over offline alternatives (k-means, farthest-point, uniform/random), enabling strong retrieval with very low-power edge deployment.
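A single-pass epsilon-net filter of the kind described above can be sketched in a few lines (a sketch under assumptions: unit-normalized embeddings, Euclidean distance, and a linear scan over kept centers; a deployed system would likely use an approximate index):

```python
import numpy as np

def epsilon_net_filter(embeddings, eps):
    """Single-pass epsilon-net: keep a frame only if its (unit-norm)
    embedding lies farther than eps from every frame kept so far."""
    kept = []      # indices of retained frames
    centers = []   # their embeddings (the net)
    for i, e in enumerate(embeddings):
        e = e / np.linalg.norm(e)
        if all(np.linalg.norm(e - c) > eps for c in centers):
            kept.append(i)
            centers.append(e)
    return kept
```

Near-duplicate frames collapse onto existing centers, so redundant content never enters the index in the first place.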
- [2026-03-31] Video-Oasis: Rethinking Evaluation of Video Understanding 🆕NEW
- 赛道归属: 多模态评测与诊断(视频理解评估/基准质量审计)
- 核心创新点: 提出Video-Oasis诊断套件,从“评估应测什么”出发系统审计现有视频理解基准的有效性与时空挑战覆盖,而非再造新任务;通过可持续的诊断设计揭示大量样本可在无视觉或无时间信息下被解决(暴露语言先验/静态捷径),并在真正需要时空信息的子集上发现SOTA接近随机;进一步分析哪些算法设计选择更能带来稳健视频理解,输出面向未来基准构建与架构研发的可操作指导原则,实现从“分数驱动”到“能力归因驱动”的评测范式转变。
- Track: Multimodal evaluation & diagnostics (video understanding evaluation auditing)
- Key innovations: Introduces Video-Oasis, a sustainable diagnostic suite that rethinks what video understanding evaluations should measure, systematically auditing existing benchmarks for spatiotemporal validity rather than creating another benchmark. It shows many samples are solvable without visual or temporal input (shortcut/priors), while on truly spatiotemporal samples SOTA is near random, and it studies which algorithmic design choices yield robust video understanding—providing actionable guidelines for benchmark construction and architecture evaluation beyond score chasing.
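The "solvable without visual input" audit can be illustrated with a simple diagnostic loop (interfaces are assumed, not the paper's actual harness):

```python
def shortcut_audit(model, samples, blank_video):
    """Flag samples a model answers correctly even with the video replaced
    by blank input, i.e. answers recoverable from language priors alone."""
    return [s["id"] for s in samples
            if model(blank_video, s["question"]) == s["answer"]]
```

Samples that survive this audit (and an analogous shuffled-frame audit for temporal information) form the genuinely spatiotemporal subset on which the paper reports near-random SOTA performance.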
GitHub
- [2026-03-31] Blaizzy/mlx-vlm ⭐2583
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-03-31] waybarrios/vllm-mlx ⭐718 🆕NEW
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-03-30] shuyansy/Earth-Observation-VLMs ⭐116
🔥🔥A Family of Multi-Sensor, Multi-Granularity Vision-Language Models for Earth Observation Understanding
- [2026-03-29] xytian1008/VAPO ⭐102
Official repo for "More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models" (ICLR 2026)
- [2026-03-30] ydyhello/Awesome-VLM-Streaming-Video ⭐58 🆕NEW
📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.
强化学习 / Reinforcement Learning
arXiv
- [2026-03-25] CoordLight: Learning Decentralized Coordination for Network-Wide Traffic Signal Control 📖11 🆕NEW
- 赛道归属: 多智能体强化学习(MARL)/去中心化交通信号控制(ATSC)
- 核心创新点: 提出面向大规模路网的去中心化协同框架CoordLight:用基于排队论的Queue Dynamic State Encoding(QDSE)构造更可预测的路口状态表征,增强局部交通动态建模;提出Neighbor-aware Policy Optimization(NAPO),通过注意力显式建模相邻路口的状态-动作依赖,并结合更稳健的优势估计来聚焦“关键邻居”交互,从而在部分可观测下提升跨路口协同与可扩展性。
- Track: Multi-Agent Reinforcement Learning / Decentralized Adaptive Traffic Signal Control
- Core innovations: Introduces CoordLight for scalable network-wide decentralized coordination: (1) Queue Dynamic State Encoding (QDSE) grounded in queuing models to better capture/predict local traffic dynamics; (2) Neighbor-aware Policy Optimization (NAPO) that uses attention to model inter-agent state/action dependencies and improves policy updates via robust advantage estimation, enabling agents to prioritize influential neighbors for targeted coordination under partial observability.
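The attention-based neighbor modeling in NAPO can be sketched as scaled dot-product attention over neighboring intersections (a minimal sketch: the projection matrices stand in for hypothetical learned parameters):

```python
import numpy as np

def neighbor_attention(own_state, neighbor_states, Wq, Wk, Wv):
    """Attend over neighboring intersections so the agent weights
    influential neighbors more heavily in its local context."""
    q = own_state @ Wq                      # query from the agent's own state
    K = neighbor_states @ Wk                # keys/values from neighbors
    V = neighbor_states @ Wv
    scores = K @ q / np.sqrt(q.shape[-1])   # scaled dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                            # softmax over neighbors
    return w @ V                            # influence-weighted context
```

Neighbors whose state-action dynamics are most relevant to the agent receive the largest weights, which is what lets agents focus on "critical neighbors" under partial observability.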
- [2026-03-28] Dynamic resource matching in manufacturing using deep reinforcement learning 📖3 🆕NEW
- 赛道归属: 工业制造运筹优化/资源匹配的深度强化学习(DRL)
- 核心创新点: 将多期、多对多的制造需求-产能动态匹配建模为序贯决策并采用无模型DRL以规避高维状态动作与未知需求分布建模;针对Q-learning在约束动作空间下的不可行动作与max算子导致的偏置/收敛慢问题,引入两类惩罚项:基于先验策略的领域知识惩罚与满足供需约束的不可行惩罚;进一步将该思想注入DDPG形成DKDDPG,在连续/大规模设置中以“约束感知+先验引导”提升样本效率与稳定性。
- Track: Industrial optimization / Deep RL for dynamic resource matching
- Core innovations: Formulates multi-period many-to-many demand–capacity matching as an MDP and applies model-free DRL to avoid explicit transition/joint-demand modeling. Proposes domain-knowledge-informed Q-learning with two penalties—(i) a prior-policy (domain knowledge) penalty and (ii) an infeasibility penalty enforcing demand–supply constraints—to mitigate invalid actions and max-operator bias/slow convergence. Extends the idea to DDPG (DKDDPG) for large-scale settings, improving stability and efficiency via constraint-aware, prior-guided learning.
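A tabular sketch of the two penalty terms (the exact penalty forms and coefficients are my assumptions; the paper also extends the idea to DDPG):

```python
import numpy as np

def penalized_q_update(Q, s, a, r, s_next, feasible, prior_action,
                       alpha=0.1, gamma=0.95, c_inf=10.0, c_prior=1.0):
    """One Q-learning step with (i) an infeasibility penalty enforcing
    demand-supply constraints and (ii) a prior-policy (domain knowledge)
    penalty; the max target is taken only over feasible next actions to
    reduce max-operator bias."""
    # Penalize actions that violate supply-demand constraints
    shaped_r = r - (0.0 if feasible[s][a] else c_inf)
    # Penalize deviation from the domain-knowledge prior policy
    shaped_r -= c_prior * (a != prior_action[s])
    # Max only over feasible next actions
    feas_next = [a2 for a2 in range(Q.shape[1]) if feasible[s_next][a2]]
    target = shaped_r + gamma * max(Q[s_next][a2] for a2 in feas_next)
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Restricting the max to feasible actions is what mitigates the bias and slow convergence the paper attributes to constrained action spaces.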
- [2026-03-30] $AutoDrive\text{-}P^3$: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning 📖1
- 赛道归属: 多模态推理与决策 / 自动驾驶端到端规划(P3链式思维+强化微调)
- 核心创新点: 提出统一的Perception-Prediction-Planning链式思维框架:通过结构化CoT把感知输出作为预测与规划的条件输入,并让预测与感知共同约束最终规划,解决“直接出规划”或“三模块割裂”导致的协同不足。构建P^3-CoT数据集,并提出分层强化微调算法P^3-GRPO,对三阶段提供渐进式监督与奖励分解;同时引入“细致/快速”双思考模式在推理成本与性能间可控切换,实现更安全、可解释的驾驶决策。
- Track: Multimodal Reasoning & Decision-making / Autonomous driving planning (P3 chain-of-thought + RL fine-tuning)
- Core innovations: Proposes a unified Perception–Prediction–Planning chain-of-thought framework where structured reasoning explicitly feeds perception into prediction and planning, and jointly uses perception + prediction to constrain the final plan, addressing both “direct-to-plan” gaps and fragmented multi-module pipelines. It introduces the P^3-CoT dataset and a hierarchical RL fine-tuning algorithm (P^3-GRPO) that provides progressive supervision and reward decomposition across the three stages, plus dual thinking modes (detailed vs fast) to trade off inference cost and performance for safer, more interpretable driving decisions.
- [2026-03-30] Reducing Oracle Feedback with Vision-Language Embeddings for Preference-Based RL 📖1 🆕NEW
- 赛道归属: 偏好强化学习(Preference-based RL)/人类反馈与标注成本优化(VLM辅助)
- 核心创新点: 提出ROVED混合监督框架,用轻量视觉-语言嵌入(VLE)生成片段级偏好作为“廉价但噪声”的监督来源,并通过不确定性过滤机制仅对高不确定样本请求oracle比较,从而显著减少人工反馈;同时引入参数高效微调,将获取到的oracle反馈反哺适配VLE,使嵌入模型随训练迭代变得更可靠并实现跨任务泛化,形成“嵌入可扩展性+oracle精度”的协同闭环。
- Track: Preference-based RL / Feedback-efficient RL with vision-language models
- Core innovations: Proposes ROVED, a hybrid supervision scheme that uses lightweight vision-language embeddings (VLE) to produce segment-level preferences and queries an oracle only for high-uncertainty samples via a filtering mechanism, cutting expensive comparisons. Adds parameter-efficient fine-tuning to adapt the VLE using collected oracle feedback, improving the embedding reward signal over time and enabling cross-task generalization—combining scalability of embeddings with accuracy of targeted oracle supervision.
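The uncertainty-gated oracle query can be sketched as a margin test on embedding similarities (a sketch under assumptions: cosine similarity to a task-goal embedding as the cheap VLE preference signal, and a fixed margin threshold):

```python
import numpy as np

def vle_preference(seg_a, seg_b, goal_emb, tau=0.1):
    """Cheap VLE preference: prefer the segment whose embedding is
    closer to the goal; flag the pair as uncertain if the margin is small."""
    cos = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
    sa, sb = cos(seg_a, goal_emb), cos(seg_b, goal_emb)
    pref = 0 if sa > sb else 1
    uncertain = abs(sa - sb) < tau
    return pref, uncertain

def label_pair(seg_a, seg_b, goal_emb, oracle, tau=0.1):
    """Use the VLE label when confident; otherwise query the oracle."""
    pref, uncertain = vle_preference(seg_a, seg_b, goal_emb, tau)
    return oracle(seg_a, seg_b) if uncertain else pref
```

Only ambiguous pairs cost an oracle comparison, which is where the feedback savings come from; the collected oracle labels then feed the parameter-efficient VLE fine-tuning loop.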
- [2026-03-26] Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs 📖1 🆕NEW
- 赛道归属: 多模态大模型强化学习(RLVR)/训练稳定性与信用分配(token级优化)
- 核心创新点: 针对MLLM在RLVR中“感知token(视觉指代)与推理token(链式推理)交织且耦合”的优化难题,提出可插拔Token-Reweighting(ToR):先识别两类关键token,再在RL更新中动态重加权以显式建模二者依赖关系,避免只优化单一token子集导致的性能退化;可直接叠加到GRPO、DAPO等方法上,提升视觉落地与推理一致性。
- Track: RLVR for multimodal LLMs / Token-level credit assignment & optimization
- Core innovations: Addresses the coupled nature of perception-grounding tokens and reasoning-chain tokens in MLLM RLVR. Introduces plug-and-play Token Reweighting (ToR): identify critical tokens of both types and dynamically reweight them during RL updates to model their interdependence, avoiding the degradation seen when optimizing only one token subset. Works as an add-on to GRPO/DAPO-style methods, improving both visual grounding and coherent reasoning.
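A static version of the reweighting can be sketched on a per-token surrogate loss (the weights and mask construction are assumptions; the paper's reweighting rule is dynamic):

```python
import numpy as np

def reweighted_pg_loss(logps, advantages, perception_mask, reasoning_mask,
                       w_perc=1.5, w_reas=1.2):
    """Token-level policy-gradient surrogate with upweighted perception
    (visual-grounding) and reasoning (chain-of-thought) tokens, so that
    neither token subset is optimized at the other's expense."""
    w = np.ones_like(logps)
    w = np.where(perception_mask, w_perc, w)
    w = np.where(reasoning_mask, w_reas, w)
    # REINFORCE-style surrogate per token, then weight-normalized mean
    return -float(np.sum(w * logps * advantages) / np.sum(w))
```

Because it only rescales per-token contributions, a scheme like this layers directly on top of GRPO/DAPO-style objectives, consistent with the plug-and-play claim.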
- [2026-03-25] Towards Effective Experiential Learning: Dual Guidance for Utilization and Internalization 📖1 🆕NEW
- 赛道归属: 大语言模型强化学习(RLVR)/经验利用与探索策略优化
- 核心创新点: 提出Dual Guidance Optimization(DGO)以更接近人类“外部经验利用+内部知识内化”的学习机制:构建由历史轨迹组成的experience bank作为外部经验记忆;在探索阶段同时受experience bank与模型内部知识联合引导,提升有效探索概率;再用新轨迹反向更新经验库并优化参数,形成“利用—内化”的闭环,从而在可验证奖励的推理任务上提升训练效率与效果。
- Track: RLVR for LLM reasoning / Experience reuse and guided exploration
- Core innovations: Proposes Dual Guidance Optimization (DGO) to improve how LLMs utilize and internalize experience in RLVR. Builds an external experience bank from past trajectories, guides exploration jointly with (i) the experience bank and (ii) the model’s internal knowledge, then uses newly collected trajectories to both refine the bank and update model parameters—creating a closed loop of experience utilization and internalization that boosts reasoning performance.
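The external experience memory can be sketched as similarity-based trajectory retrieval (a minimal sketch; DGO's bank, guidance mixing, and update rules are richer than this):

```python
import numpy as np

class ExperienceBank:
    """Store (problem embedding, trajectory) pairs and retrieve the most
    similar past trajectories to guide exploration on a new problem."""
    def __init__(self):
        self.embs, self.trajs = [], []

    def add(self, emb, traj):
        self.embs.append(np.asarray(emb, dtype=float))
        self.trajs.append(traj)

    def retrieve(self, query, k=1):
        if not self.embs:
            return []
        q = np.asarray(query, dtype=float)
        sims = [float(q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9))
                for e in self.embs]
        top = np.argsort(sims)[::-1][:k]
        return [self.trajs[i] for i in top]
```

Retrieved trajectories condition exploration alongside the model's internal knowledge; newly collected trajectories are added back, closing the utilization-internalization loop.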
- [2026-03-31] Reinforced Reasoning for End-to-End Retrosynthetic Planning 🆕NEW
- 赛道归属: 科学智能(化学)/逆合成规划的端到端强化学习推理(RLVR)
- 核心创新点: 提出ReTriP将逆合成从“单步预测+外部搜索启发式”的割裂范式,重构为端到端生成的Chain-of-Thought规划任务;设计路径一致(path-coherent)的分子表示以维持全局路线语义连贯;采用渐进式课程训练,从推理蒸馏过渡到可验证奖励的强化学习,使逐步生成与路线效用(全局目标)对齐,从而提升长时域规划鲁棒性。
- Track: Scientific AI (chemistry) / End-to-end RLVR for retrosynthetic planning
- Core innovations: Introduces ReTriP, reframing retrosynthesis as direct end-to-end Chain-of-Thought generation instead of hybrid single-step + external heuristic search. Uses a path-coherent molecular representation to preserve global route consistency, and a progressive curriculum from reasoning distillation to RL with verifiable rewards to align stepwise generation with route utility, improving robustness in long-horizon planning.
- [2026-03-31] 6GAgentGym: Tool Use, Data Synthesis, and Agentic Learning for Network Management 🆕NEW
- 赛道归属: 智能体工具使用与闭环强化学习/网络管理(6G)仿真环境与数据合成
- 核心创新点: 提出6GAgentGym提供可交互闭环环境:定义42个带类型的工具并区分只读观测与可改变状态的配置操作;用基于NS-3数据校准的Experiment Model近似环境反馈以支持可扩展训练;提出6G-Forge从NS-3种子出发,通过可执行验证的Self-Instruct迭代合成闭环轨迹;在此数据上SFT后再进行在线闭环RL,使开源8B模型在长时域任务上达到/接近顶级闭源模型表现,关键在于“可执行工具+状态变更反馈+可验证数据合成”的一体化管线。
- Track: Tool-using agents & closed-loop RL / 6G network management simulation and data synthesis
- Core innovations: Presents 6GAgentGym for true closed-loop interaction: 42 typed tools with explicit read-only vs state-mutating effects, backed by an Experiment Model calibrated on NS-3 simulations to provide scalable feedback. 6G-Forge bootstraps training trajectories from NS-3 seeds via iterative Self-Instruct with execution verification against the Experiment Model. SFT on the synthesized corpus followed by online closed-loop RL enables an open 8B model to reach competitive success rates and stronger long-horizon performance—via an integrated pipeline of executable tools, state-change feedback, and verifiable data synthesis.
- [2026-03-31] ASI-Evolve: AI Accelerates AI 🆕NEW
- 赛道归属: AI for AI/自动化研究与算法-架构-数据的闭环优化(进化式智能体+实验反馈)
- 核心创新点: 提出ASI-Evolve将“学习-设计-实验-分析”研究闭环系统化:用cognition base注入累积的人类先验以约束/引导探索空间,降低长周期弱监督研发的盲搜成本;用专用analyzer把复杂实验结果蒸馏为可复用洞见,作为下一轮进化的可操作信号;在数据配方、网络架构(线性注意力)与RL算法三类核心研发对象上统一验证,体现“实验结果→结构化知识→再设计”的可迁移闭环机制。
- Track: AI-for-AI / Closed-loop automated research (evolutionary agents over data, architectures, and algorithms)
- Core innovations: Proposes ASI-Evolve, a unified learn–design–experiment–analyze loop for long-horizon, weakly supervised AI R&D. Adds (i) a cognition base to inject accumulated human priors each iteration to guide exploration, and (ii) a dedicated analyzer that distills complex experimental outcomes into reusable insights that become actionable signals for subsequent evolution. Demonstrates the same closed-loop mechanism across data curation, architecture discovery (linear attention), and RL algorithm design—turning experimental feedback into structured knowledge for iterative improvement.
- [2026-03-31] Learning Diagnostic Reasoning for Decision Support in Toxicology 🆕NEW
- 赛道归属: 医疗多模态决策支持/LLM强化学习对齐(GRPO,多标签诊断推理)
- 核心创新点: 提出DeToxR将RL引入急诊毒理多物质中毒决策:构建能融合非结构化现场叙述与结构化生命体征的鲁棒数据融合引擎,并用GRPO对LLM进行策略优化;将临床目标直接写入奖励函数——以多标签一致性度量作为reward,显式惩罚漏检共摄入物质与“幻觉”不存在毒物,实现面向高风险场景的可控推理对齐;相较监督与未对齐LLM显著提升多标签识别能力。
- Track: Medical decision support / RL-aligned LLM diagnostic reasoning (multi-label, GRPO)
- Core innovations: Introduces DeToxR, adapting RL to emergency toxicology decision support. Builds a robust fusion pipeline combining unstructured pre-clinical narratives with structured vitals, and fine-tunes an LLM with GRPO. Encodes clinical objectives directly as a reward via a multi-label agreement metric, explicitly penalizing missed co-ingestions and hallucinated substances—yielding controllable, clinically aligned reasoning and improved multi-label diagnosis over supervised and base-LLM baselines.
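One plausible instantiation of such a reward (a sketch; the paper's exact agreement metric and penalty weights are not specified here) is set-based F1 with explicit miss and hallucination penalties:

```python
def toxicology_reward(pred, gold, miss_pen=1.0, hall_pen=1.0):
    """Multi-label agreement reward: F1 over predicted substances, minus
    penalties for missed co-ingestions and hallucinated toxins."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    misses = len(gold - pred)          # co-ingestions the model failed to flag
    hallucinations = len(pred - gold)  # substances that are not present
    return (f1 - miss_pen * misses / max(len(gold), 1)
               - hall_pen * hallucinations / max(len(pred), 1))
```

Separating the two penalty terms lets the reward encode the clinical asymmetry between overlooking a co-ingested substance and inventing one.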
GitHub
- [2026-03-31] rllm-org/rllm ⭐5332 🆕NEW
Democratizing Reinforcement Learning for LLMs
- [2026-03-31] natolambert/rlhf-book ⭐1745 🆕NEW
Textbook on reinforcement learning from human feedback
- [2026-04-01] radixark/miles ⭐1032 🆕NEW
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-04-01] X-GenGroup/Flow-Factory ⭐305 🆕NEW
A unified framework for easy reinforcement learning in Flow-Matching models
- [2026-03-31] AmineAndam04/cleanmarl ⭐66 🆕NEW
Single file implementations of Deep Multi-agent Reinforcement Learning
HuggingFace Models
HuggingFace Datasets
- [2026-03-27] OpenMOSS-Team/OmniAction
RoboOmni: Proactive Robot Manipulation in Omni-modal Context (accepted to ICLR 2026)
- [2026-03-24] ServiceNow-AI/eva 🆕NEW
A New Framework for Evaluating Voice Agents (EVA)
Most voice agent benchmarks evaluate either what the agent does or how it sounds. EVA ev...
Generated automatically by Daily AI Digest Agent. Generated at: 2026-04-01 01:57:42