AI 每日进展速报 / Daily AI Digest - 2026-04-07
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-03-31] Adversarial Prompt Injection Attack on Multimodal Large Language Models 📖1
- 赛道归属: 多模态安全(MLLM视觉提示注入/对抗攻击)
- 核心创新点: 提出不可感知的视觉Prompt Injection:用受限文本叠加作为“语义锚点”嵌入恶意指令,同时优化不可见扰动,使被攻击图像在粗粒度与细粒度特征空间同时对齐“恶意视觉目标+恶意文本目标”;并将视觉目标实例化为文本渲染图像、在迭代中逐步精炼以提升语义一致性与跨模型迁移,从而对闭源强模型实现更有效的视觉侧注入攻击。
- Track: Multimodal security (visual prompt injection / adversarial attacks on MLLMs)
- Core innovation: Develops an imperceptible visual prompt-injection attack that embeds malicious instructions via a bounded text overlay as semantic guidance while iteratively optimizing invisible perturbations to align the attacked image with both malicious visual and textual targets at coarse and fine feature levels. The visual target is instantiated as a text-rendered image and progressively refined during optimization to improve semantic fidelity and transferability, enabling stronger attacks against closed-source MLLMs.
- [2026-04-06] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning 🆕NEW
- 赛道归属: 文生图(多步生成/可解释生成过程)
- 核心创新点: 提出“过程驱动”的多轮生成范式,将一次性合成拆解为“文本规划→视觉草稿→文本反思→视觉细化”的交错推理轨迹;通过“视觉中间态约束下一轮文本推理、文本推理显式指导视觉演化”的闭环实现可解释、可监督的生成过程。针对中间态歧义,设计密集逐步监督:视觉侧强调空间与语义一致性,文本侧在保留既有视觉信息的同时定位并纠正违背提示的元素,从而稳定多步迭代并提升可控性与可诊断性。
- Track: Text-to-Image (multi-step / interpretable generation)
- Key innovations: Proposes process-driven generation that decomposes one-shot synthesis into an interleaved reasoning loop of “text planning → visual drafting → text reflection → visual refinement,” where intermediate images ground subsequent textual reasoning and textual reasoning explicitly guides visual evolution. Addresses ambiguity of intermediate states via dense step-wise supervision: enforcing spatial/semantic consistency for visual intermediates and preserving prior visual knowledge while detecting/correcting prompt violations for textual intermediates, making the trajectory explicit, interpretable, and directly supervisable.
- [2026-04-06] Training-Free Refinement of Flow Matching with Divergence-based Sampling 🆕NEW
- 赛道归属: 推理优化(Flow Matching/采样加速与质量提升)
- 核心创新点: 提出无需训练的 Flow Divergence Sampler(FDS),在每个数值求解步之前对中间状态进行“推理期修正”,缓解边际速度场因样本速度冲突而将轨迹推向低密度区域的问题;关键在于用推理时可计算的“速度场散度”量化误导严重程度,并据此将状态引导至更少歧义的区域。方法即插即用,兼容标准ODE/SDE求解器与现成flow骨干,在文生图与逆问题等任务上提升保真度。
- Track: Inference optimization (Flow Matching sampling refinement)
- Key innovations: Introduces a training-free Flow Divergence Sampler (FDS) that refines intermediate states before each solver step to counteract marginal-velocity misguidance caused by conflicting sample-wise velocities. Uses the divergence of the marginal velocity field—computable at inference with a well-optimized model—as a signal to quantify ambiguity and steer states toward less ambiguous regions. Plug-and-play with standard solvers and off-the-shelf flow backbones, improving fidelity across tasks including text-to-image and inverse problems.
- [2026-04-06] Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models 🆕NEW
- 赛道归属: 文生图安全/模型遗忘(Unlearning评测)
- 核心创新点: 从“组合式生成能力”角度系统评估扩散模型的概念遗忘(以SD1.4去除裸露为例),揭示仅以“擦除成功率”衡量会掩盖对通用生成能力的损伤;通过T2I-CompBench++与GenEval等基准,实证发现“强擦除↔组合完整性(属性绑定/空间推理/计数)退化”的稳定权衡,并指出需要将语义保持纳入遗忘目标与评测体系,而非只做定点抑制。
- Track: Text-to-Image safety / unlearning evaluation
- Key innovations: Provides a systematic study of diffusion concept unlearning through the lens of compositional T2I generation (nudity removal on SD1.4), showing that erasure-only evaluation misses broader capability degradation. Using T2I-CompBench++ and GenEval, it empirically uncovers a consistent trade-off: methods with strong erasure often substantially harm attribute binding, spatial reasoning, and counting, while composition-preserving methods tend to fail at robust erasure. Motivates unlearning objectives and benchmarks that explicitly account for semantic preservation beyond targeted suppression.
- [2026-04-06] Training-Free Image Editing with Visual Context Integration and Concept Alignment 🆕NEW
- 赛道归属: 图像编辑(免训练/上下文条件编辑)
- 核心创新点: 提出 VicoEdit:免训练且无需扩散反演的视觉上下文注入编辑框架,直接利用“上下文图像”驱动源图到目标图的变换,避免反演带来的轨迹偏移与一致性问题;并设计由“概念对齐”引导的后验采样,在不训练的前提下强化编辑一致性与可控性。整体作为即插即用方案,在编辑质量上可超过部分训练式方法。
- Track: Image editing (training-free, visual-context-conditioned editing)
- Key innovations: Proposes VicoEdit, a training-free and inversion-free method to inject a visual context image into pretrained text-prompted editors, directly transforming source to target without diffusion inversion to avoid trajectory deviation. Introduces concept-alignment-guided posterior sampling to improve edit consistency and controllability under a plug-and-play inference procedure, achieving performance competitive with or better than training-based context-aware editors.
- [2026-04-05] Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks 🆕NEW
- 赛道归属: 多模态评测(图形设计理解与生成基准)
- 核心创新点: 构建首个面向“专业图形设计全流程”的综合基准 GraphicDesignBench:覆盖布局、字体排印、信息图、模板/设计语义、动画五大轴,并同时评测理解与生成;以真实分层模板(LICA)为基础,定义包含空间准确性、文本保真、结构有效性(如矢量代码合法性)等指标体系,从而把“结构化布局推理、矢量生成、精细排版感知、动画时序分解”等现有模型短板显式量化与可复现追踪。
- Track: Multimodal benchmarking (graphic design understanding & generation)
- Key innovations: Introduces GraphicDesignBench (GDB), a comprehensive benchmark targeting professional graphic design tasks beyond natural-image T2I, spanning five axes (layout, typography, infographics, templates/design semantics, animation) under both understanding and generation settings. Grounded in real layered templates (LICA) and paired with a standardized metric taxonomy (spatial accuracy, text fidelity, semantic alignment, structural validity incl. vector-code correctness), enabling reproducible measurement of key unsolved challenges like structured layout reasoning, faithful vector generation, fine-grained typographic perception, and temporal decomposition for animation.
- [2026-04-05] GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models 🆕NEW
- 赛道归属: 多模态评测(科学图示生成/视觉摘要)
- 核心创新点: 提出 GENFIG1 基准,将“论文 Figure 1 视觉摘要生成”定义为需要科学理解与视觉设计耦合的生成任务:模型需从标题/摘要/引言/图注中抽取关键概念并组织成连贯、忠实且具传播效果的图形表达。数据来自顶会论文并经严格质控,同时给出与专家评审相关性较高的自动评测指标,用于系统性暴露现有VLM/生成模型在“概念提炼+图形化表达”上的能力缺口。
- Track: Multimodal benchmarking (scientific figure generation / visual summarization)
- Key innovations: Presents GENFIG1, a benchmark framing “Figure 1” creation as a coupled scientific-understanding + visual-synthesis problem: models must comprehend paper content (title/abstract/intro/caption), select salient concepts, and design a coherent, faithful, and effective visual summary. Curated from top DL conferences with stringent QC, and accompanied by an automatic metric that correlates well with expert judgments, enabling systematic evaluation of current VLMs’ gaps in concept distillation and communicative graphic synthesis.
- [2026-04-05] 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation 🆕NEW
- 赛道归属: 推理优化(扩散模型蒸馏/少步生成)
- 核心创新点: 提出 1.x-Distill,将分布匹配蒸馏从“整数步”推进到“分数步(1.x步)”可用区间,缓解≤2步时常见的多样性坍塌与保真下降;方法上分析并修正 teacher CFG 在DMD中的关键作用以抑制模式坍塌,并提出两阶段 Stagewise Focused Distillation:先用保多样的分布匹配学粗结构,再用与推理一致的对抗蒸馏补细节;同时引入轻量补偿模块实现 Distill-Cache 协同训练,把块级缓存自然纳入蒸馏管线,在极低NFE下兼顾质量/多样性/效率。
- Track: Inference optimization (diffusion distillation / few-step generation)
- Key innovations: Proposes 1.x-Distill, enabling practical fractional-step (1.x-step) generation for distribution matching distillation, addressing diversity collapse and fidelity loss at ≤2 steps. Key techniques include analyzing the overlooked role of teacher CFG in DMD and modifying it to suppress mode collapse; a two-stage Stagewise Focused Distillation that learns coarse structure via diversity-preserving distribution matching then refines details with inference-consistent adversarial distillation; and a lightweight compensation module for Distill–Cache co-training that integrates block-level caching into the distillation pipeline for extreme-NFE efficiency.
- [2026-04-05] SafeCtrl: Region-Aware Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress 🆕NEW
- 赛道归属: 文生图安全控制(区域级内容安全/对抗鲁棒)
- 核心创新点: 提出 SafeCtrl 的“先检测后抑制”区域感知安全框架:通过注意力引导的 Detect 模块精确定位风险区域,再用局部 Suppress 模块仅在该区域内中和有害语义、将不安全对象替换为安全替代物,从而显著降低“全局擦除”带来的上下文损伤;并用图像级 DPO 优化抑制策略以提升安全-保真权衡与对抗提示绕过的鲁棒性。
- Track: Text-to-Image safety control (region-aware, adversarially robust)
- Key innovations: Introduces SafeCtrl, a region-aware Detect-Then-Suppress safety framework: an attention-guided detector localizes risky regions, followed by a localized suppressor that neutralizes harmful semantics only within detected areas (transforming unsafe objects into safe alternatives) to preserve surrounding context. The suppressor is optimized with image-level Direct Preference Optimization (DPO), improving the safety–fidelity trade-off and robustness against adversarial prompt attacks compared to global filtering/erasure approaches.
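The interleaved “plan → draft → reflect → refine” paradigm of Think in Strokes reduces, at the control-flow level, to a loop in which each visual intermediate grounds the next round of textual reasoning. A minimal sketch, assuming hypothetical stub callables (`plan`, `draft`, `reflect`, `refine`) in place of the paper's actual models:

```python
# Control-flow sketch of interleaved reasoning for image generation.
# plan/draft/reflect/refine are hypothetical callables standing in for the
# paper's text-planning, visual-drafting, reflection, and refinement models.
def interleaved_generate(prompt, plan, draft, reflect, refine, max_rounds=3):
    text_state = plan(prompt)              # text planning
    image = draft(text_state)              # visual drafting
    for _ in range(max_rounds - 1):
        critique = reflect(prompt, image)  # text reflection grounded in the draft
        if not critique:                   # nothing violates the prompt: stop
            break
        image = refine(image, critique)    # text guides visual refinement
    return image
```

With toy string-based stubs the loop terminates as soon as the reflection step finds nothing left to correct, which is the supervisable stopping behavior the entry describes.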
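The Flow Divergence Sampler entry hinges on one computable quantity: the divergence of the velocity field, used to steer intermediate states away from ambiguous regions before each solver step. A toy numpy sketch under strong assumptions — an analytic 2D field, finite-difference divergence, and plain gradient descent on |div v| — none of which is the paper's actual estimator or schedule:

```python
import numpy as np

def velocity(x, t):
    # Hypothetical marginal velocity field: a contraction toward the origin
    # whose strength varies with position, so its divergence is non-constant.
    return -x * (1.0 + 0.5 * np.sin(x.sum()))

def divergence(v_fn, x, t, eps=1e-4):
    # Finite-difference estimate of div v = sum_i dv_i/dx_i.
    div = 0.0
    for i in range(x.shape[0]):
        e = np.zeros_like(x)
        e[i] = eps
        div += (v_fn(x + e, t)[i] - v_fn(x - e, t)[i]) / (2.0 * eps)
    return div

def fds_refine(x, t, step=1e-2, n_iters=5, delta=1e-3):
    # Nudge the state down the gradient of |div v| before the solver step,
    # steering it toward a less ambiguous region of the velocity field.
    for _ in range(n_iters):
        grad = np.zeros_like(x)
        for i in range(x.shape[0]):
            e = np.zeros_like(x)
            e[i] = delta
            grad[i] = (abs(divergence(velocity, x + e, t)) -
                       abs(divergence(velocity, x - e, t))) / (2.0 * delta)
        x = x - step * grad
    return x

x0 = np.array([1.0, -0.5])
x1 = fds_refine(x0, t=0.5)
```

After a few refinement iterations the absolute divergence at the state is reduced, which is the plug-and-play "inference-time correction" the entry describes, here in caricature.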
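SafeCtrl's detect-then-suppress idea — editing only inside a localized risky region rather than filtering globally — reduces to a masked replacement. The threshold and the toy 2×2 arrays below are illustrative stand-ins, not the paper's trained Detect/Suppress modules:

```python
import numpy as np

def detect_region(attn, thresh=0.6):
    # Attention-guided detection reduced to thresholding a saliency map.
    return attn > thresh

def suppress(image, safe_alternative, mask):
    # Neutralize content only inside the detected region; context is untouched.
    out = image.copy()
    out[mask] = safe_alternative[mask]
    return out

attn = np.array([[0.1, 0.9], [0.2, 0.3]])
image = np.ones((2, 2))
safe = np.zeros((2, 2))
edited = suppress(image, safe, detect_region(attn))
```

Only the high-attention cell is replaced; the rest of the image is untouched, which is the context-preservation advantage over global erasure.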
GitHub
- [2026-04-07] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐10659
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-04-06] Light-Heart-Labs/DreamServer ⭐451
Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
- [2026-04-07] jd-opensource/JoyAI-Image ⭐356 🆕NEW
JoyAI-Image is the unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing.
- [2026-04-06] AceDataCloud/Nexior ⭐351 🆕NEW
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-04-05] akandr/bc250 ⭐57 🆕NEW
AMD BC-250 (PS5 APU) setup guide — Ollama + Vulkan inference, poor man's AI assistant via Signal, stable-diffusion.cpp image generation
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-04-06] Preserving Forgery Artifacts: AI-Generated Video Detection at Native Scale 🆕NEW
- 赛道归属: AI生成视频检测 / 视频取证(Video Deepfake Detection & Forensics)
- 核心创新点: 提出“原生尺度(native-scale)”检测范式:基于Qwen2.5-VL的ViT在可变分辨率与可变时长上直接建模,避免固定resize/crop导致的高频伪造痕迹丢失与空间畸变,从而更好捕捉细粒度伪影与时空不一致;同时构建覆盖15种SOTA生成器、14万+视频的大规模数据集与面向超逼真内容的Magic Videos基准,推动检测训练/评测从“过时分布”迁移到“现代生成模型分布”。
- Track: AI-generated video detection / video forensics
- Core innovation: Introduces a native-scale detection paradigm: a Qwen2.5-VL–based ViT operates directly on variable spatial resolutions and temporal lengths, avoiding fixed resizing/cropping that destroys high-frequency forgery traces and induces spatial distortion, thus better capturing subtle artifacts and spatiotemporal inconsistencies. Also releases a 140K+ video dataset spanning 15 SOTA generators plus the Magic Videos benchmark targeting ultra-realistic synthetic content, updating training/evaluation to modern generator distributions.
- [2026-04-06] Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse 🆕NEW
- 赛道归属: 推理优化 / 视频扩散模型服务加速(Serving Acceleration for Video Diffusion)
- 核心创新点: 提出Chorus跨请求(inter-request)缓存复用机制,突破以往仅在单请求扩散步内做冗余跳步(intra-request)的局限;设计三阶段缓存策略:早期对相似请求进行latent特征全量复用,中期对特定latent区域进行局部复用,并通过Token-Guided Attention Amplification增强条件语义对齐,使“全量复用”可延伸到更后期去噪步;在4-step蒸馏工业模型上实现最高45%加速,覆盖以往缓存方法失效的场景。
- Track: Inference optimization / video diffusion model serving acceleration
- Core innovation: Proposes Chorus, an inter-request caching reuse method that exploits similarity across different user requests—beyond prior intra-request diffusion-step skipping. It uses a three-stage caching pipeline: full latent reuse for similar requests early, region-wise latent reuse in intermediate steps, and Token-Guided Attention Amplification to maintain prompt semantic alignment and extend full reuse deeper into denoising. Achieves up to 45% speedup on industrial 4-step distilled models where prior caching is ineffective.
- [2026-04-06] UENR-600K: A Large-Scale Physically Grounded Dataset for Nighttime Video Deraining 🆕NEW
- 赛道归属: 视频复原 / 视频去雨(Nighttime Video Deraining)
- 核心创新点: 构建UENR-600K:60万对1080p配对帧的“物理一致”夜间去雨数据集,使用Unreal Engine将雨建模为3D粒子并与人工光照交互,显式覆盖夜雨的颜色折射、局部照明、遮挡与雨幕等物理现象,弥补2D叠加合成数据的域差;方法上将去雨重构为video-to-video生成任务,改造Wan 2.2视频生成模型作为强生成先验的去雨基线,显著缩小sim-to-real泛化差距并建立新SOTA基线。
- Track: Video restoration / nighttime video deraining
- Core innovation: Releases UENR-600K, a physically grounded nighttime deraining dataset with 600K paired 1080p frames. Rain is simulated as 3D particles in Unreal Engine with realistic interactions with artificial lighting, capturing refraction color shifts, local illumination, occlusions, and rain curtains—addressing the domain gap of 2D overlay synthesis. Recasts deraining as video-to-video generation by adapting the Wan 2.2 video generator as a strong generative-prior baseline, substantially narrowing sim-to-real generalization and setting a new baseline.
- [2026-04-05] DriveVA: Video Action Models are Zero-Shot Drivers 🆕NEW
- 赛道归属: 自动驾驶世界模型 / 视觉规划(World Model for Driving with Video Generation)
- 核心创新点: 提出DriveVA联合生成式解码:在共享latent生成过程中同时解码未来视频与动作序列(轨迹),用DiT解码器实现“视觉想象—轨迹规划”强耦合,缓解以往松耦合规划带来的视频-轨迹不一致;继承大规模视频生成模型的运动与物理先验以提升跨域泛化与零样本能力,并引入视频续写(continuation)策略增强长时滚动一致性,在闭环NAVSIM上取得高PDM并显著降低跨数据集误差与碰撞率。
- Track: Autonomous driving world models / vision-based planning with video generation
- Core innovation: DriveVA jointly decodes future visual rollouts and action/trajectory sequences within a shared latent generative process. A DiT-based decoder tightly couples “visual imagination” with planning, improving video–trajectory consistency compared to loosely coupled planners. It leverages priors from large pretrained video generators for motion/physics plausibility and introduces a video continuation strategy for long-horizon rollout consistency, yielding strong closed-loop performance and notable zero-shot cross-domain generalization.
- [2026-04-05] OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models 🆕NEW
- 赛道归属: 生成模型后训练 / 强化学习对齐(RL Post-training for Flow-Matching Image/Video Generation)
- 核心创新点: 提出首个面向Flow-Matching模型的Off-Policy GRPO(OP-GRPO),用可复用的高质量轨迹回放缓冲区提升样本效率;针对off-policy分布偏移,提出序列级重要性采样校正以保持GRPO裁剪(clipping)机制的稳定性;进一步发现后期去噪步的off-policy ratio病态,提出截断晚期轨迹以稳定训练,在图像与视频生成上以约34.2%训练步数达到/超过on-policy Flow-GRPO效果。
- Track: Post-training for generative models / RL alignment for flow-matching (image & video)
- Core innovation: OP-GRPO is the first off-policy GRPO framework for flow-matching models. It improves sample efficiency via a replay buffer with active selection and reuse of high-quality trajectories. To handle off-policy distribution shift, it introduces sequence-level importance sampling correction that preserves GRPO’s clipping behavior for stable updates. It also identifies ill-conditioned off-policy ratios in late denoising steps and stabilizes training by truncating late-step trajectories, matching or surpassing Flow-GRPO with ~34.2% of training steps on average across image/video benchmarks.
- [2026-04-05] ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity 🆕NEW
- 赛道归属: 视频取证/检测(AIGC视频检测)
- 核心创新点: 提出 AIGV 的新型指纹“异常时间自相似性”(ATSS):指出生成视频受提示锚点驱动呈更确定的全局演化轨迹,导致跨时间的视觉/语义相关性异常重复;据此构建三重相似性表示(视觉、文本、跨模态相似矩阵),用独立Transformer编码并通过双向交叉注意力融合建模时序动态,从而捕获全局生成逻辑而非局部伪影,在多基准上实现更强泛化的检测性能。
- Track: Video forensics/detection (AI-generated video detection)
- Key innovations: Identifies anomalous temporal self-similarity (ATSS) as a fingerprint of AI-generated videos: prompt/anchor-driven deterministic trajectories induce unnaturally repetitive correlations over time across visual and semantic domains. Builds a triple-similarity representation (visual, textual, cross-modal similarity matrices), encodes them with dedicated Transformers, and fuses dynamics via bidirectional cross-attentive fusion to capture global temporal generative logic beyond local artifacts, yielding strong cross-model generalization on multiple benchmarks.
- [2026-04-04] ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos 🆕NEW
- 赛道归属: 视频取证 / 篡改活动定位(Temporal Forgery Localization for Activity Manipulation)
- 核心创新点: 提出ActivityForensics首个大规模“活动级”篡改时序定位基准:将6K+伪造活动片段无缝融入上下文,强调事件语义被改写但视觉一致性极高的难例,补齐以往偏外观篡改(换脸/抹除)的评测空白;并提出TADiff基线,通过扩散式特征正则将隐蔽伪影“扩散显化”,提升对活动篡改的可分性;同时给出intra/cross-domain与open-world协议,系统化评估现有定位器的泛化能力。
- Track: Video forensics / temporal localization of activity manipulation
- Core innovation: Introduces ActivityForensics, the first large-scale benchmark for temporal localization of activity-level manipulations—6K+ forged segments seamlessly blended into context, targeting semantic event distortion with high visual consistency (beyond appearance-only forgeries). Proposes TADiff, a baseline that uses a diffusion-based feature regularizer to expose subtle artifact cues. Provides comprehensive intra-domain, cross-domain, and open-world evaluation protocols to stress-test generalization of forgery localizers.
- [2026-04-04] Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation 🆕NEW
- 赛道归属: 视频生成 / 多参考多镜头一致性控制(Multi-Reference, Multi-Shot Video Generation)
- 核心创新点: 将位置编码重新定义为“上下文控制器”PoCo:在多参考且外观高度相似时,利用token侧信息(position embedding)进行精确的token级匹配与检索,缓解语义相近token导致的reference confusion;在不破坏隐式语义一致性建模的前提下,提升跨镜头角色一致性与参考保真度,从机制上补足仅依赖语义检索/注意力的控制瓶颈。
- Track: Video generation / multi-reference & multi-shot consistency control
- Core innovation: PoCo reframes positional embeddings as an explicit context controller. When multiple references look highly similar, it uses token side information (position encoding) for precise token-level matching/retrieval, mitigating reference confusion caused by semantically similar tokens. This improves cross-shot character consistency and reference fidelity without sacrificing implicit semantic consistency modeling, addressing limitations of purely semantic retrieval/attention-based control.
- [2026-04-04] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation 🆕NEW
- 赛道归属: 可控视频生成 / 相机运动与物体运动联合控制(Unified Motion Control)
- 核心创新点: 提出SymphoMotion统一运动控制框架,在单模型内解耦并联合控制相机轨迹与物体动力学:相机侧引入显式轨迹+几何感知线索以稳定视角转换与结构一致性;物体侧融合2D引导与3D轨迹embedding实现具深度一致性的空间运动控制,避免2D线索将视差与真实物体运动纠缠;同时构建RealCOD-25K真实数据集,提供相机位姿与物体级3D轨迹配对,补齐统一控制所需数据缺口。
- Track: Controllable video generation / joint camera-motion and object-dynamics control
- Core innovation: SymphoMotion unifies and disentangles camera trajectory control and object dynamics control within a single model. For camera motion, it integrates explicit camera paths with geometry-aware cues to ensure stable viewpoint transitions and structural consistency. For object motion, it combines 2D guidance with 3D trajectory embeddings for depth-aware, spatially coherent manipulation, avoiding entanglement between parallax and true object motion. It also releases RealCOD-25K with paired camera poses and object-level 3D trajectories to enable large-scale training/evaluation.
- [2026-04-04] CRAFT: Video Diffusion for Bimanual Robot Data Generation 🆕NEW
- 赛道归属: 视频生成用于机器人数据合成 / Sim2Real数据增强(Robot Demonstration Generation with Video Diffusion)
- 核心创新点: 提出CRAFT:用视频扩散Transformer将仿真轨迹“渲染”为时序一致、照片级的双臂操作视频,并保留/生成与之对齐的动作标签;以Canny边缘等结构线索作为条件,将仿真提供的几何/运动约束注入扩散生成,从而在不重放真实机器人的情况下实现可控的数据扩增(物体位姿、视角、光照背景、跨机体迁移、多视角合成等),显著提升双臂策略在视角与外观变化下的泛化与成功率。
- Track: Video generation for robotics data synthesis / Sim2Real augmentation
- Core innovation: CRAFT uses Video Diffusion Transformers to convert simulator trajectories into temporally coherent, photorealistic bimanual manipulation videos while keeping aligned action labels. It conditions diffusion on Canny/edge-based structural cues to inject geometric and motion constraints from simulation, enabling controllable augmentation (object pose, camera viewpoint, lighting/background, cross-embodiment transfer, multi-view synthesis) without replaying demonstrations on real robots. This substantially increases demonstration diversity and improves generalization and success rates in bimanual tasks.
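Chorus's inter-request reuse idea — serve cached early-step latents to a sufficiently similar later request — can be sketched as a similarity-keyed cache. The cosine threshold and the "early steps only" policy are illustrative choices, not the paper's three-stage pipeline:

```python
import numpy as np

class LatentCache:
    # Inter-request cache keyed by prompt embedding similarity.
    def __init__(self, sim_thresh=0.95):
        self.entries = []  # list of (prompt_embedding, early_step_latents)
        self.sim_thresh = sim_thresh

    def lookup(self, emb):
        for cached_emb, latents in self.entries:
            sim = float(emb @ cached_emb /
                        (np.linalg.norm(emb) * np.linalg.norm(cached_emb)))
            if sim >= self.sim_thresh:
                return latents  # full reuse of the cached early steps
        return None            # cache miss: run the solver from scratch

    def store(self, emb, latents):
        self.entries.append((emb, latents))
```

A near-duplicate prompt hits the cache and skips the early denoising steps; a dissimilar one falls through to normal serving.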
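OP-GRPO's sequence-level importance correction with late-step truncation can be condensed to a few lines. The clipped surrogate below follows the standard PPO/GRPO form; the per-step log-probabilities and the truncation index are toy stand-ins, not values from the paper:

```python
import numpy as np

def clipped_seq_objective(logp_new, logp_old, advantage, eps=0.2, truncate_after=None):
    # Drop ill-conditioned late denoising steps before forming the ratio.
    if truncate_after is not None:
        logp_new = logp_new[:truncate_after]
        logp_old = logp_old[:truncate_after]
    # Sequence-level importance ratio (product of per-step ratios).
    ratio = float(np.exp(np.sum(logp_new) - np.sum(logp_old)))
    clipped = float(np.clip(ratio, 1.0 - eps, 1.0 + eps))
    # Standard clipped surrogate, keeping GRPO's clipping behavior intact.
    return min(ratio * advantage, clipped * advantage)

# Toy trajectory whose last step is ill-conditioned under the old policy.
lp_new = np.log(np.array([0.9, 0.8, 0.1]))
lp_old = np.log(np.array([0.5, 0.4, 0.001]))
```

Truncating the last step tames the otherwise exploding ratio, which is the stabilization effect the entry attributes to dropping late-step trajectories.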
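The anomalous temporal self-similarity signal behind ATSS can be reproduced on synthetic features: frames drawn around a fixed prompt anchor show abnormally high off-diagonal similarity compared with freely varying ones. Feature dimensions and noise scale are arbitrary toy values, not anything from the paper's pipeline:

```python
import numpy as np

def self_similarity(frame_feats):
    # T x T cosine similarity matrix across frames.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return f @ f.T

def mean_offdiag(S):
    # Average cross-time similarity, excluding each frame with itself.
    T = S.shape[0]
    return (S.sum() - np.trace(S)) / (T * (T - 1))

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)
# "Generated" video: every frame orbits one prompt-driven anchor.
gen = np.stack([anchor + 0.1 * rng.normal(size=16) for _ in range(8)])
# "Real" video: frame features vary freely over time.
real = rng.normal(size=(8, 16))
```

The generated-style clip's off-diagonal similarity sits near 1 while the free clip's hovers near 0 — the statistical fingerprint the detector builds its triple-similarity representation on.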
GitHub
- [2026-04-06] hao-ai-lab/FastVideo ⭐3346
A unified inference and post-training framework for accelerated video generation.
- [2026-04-07] leofan90/Awesome-World-Models ⭐1446 🆕NEW
A comprehensive list of papers for the definition of World Models and using World Models for General Video Generation, Embodied AI, and Autonomous Dri...
- [2026-04-07] ZeroLu/awesome-seedance ⭐1369 🆕NEW
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover how to use Seedance 2.0 for cinematic film, anime, U...
- [2026-04-07] YouMind-OpenLab/awesome-seedance-2-prompts ⭐536
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-04-07] vargHQ/sdk ⭐257
AI video generation SDK — JSX for videos. One API for Kling, Flux, ElevenLabs, Sora. Built on Vercel AI SDK.
音频生成 / Audio Generation
arXiv
- [2026-04-06] OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text 🆕NEW
- 赛道归属: 音频生成(视频/文本条件的扩散式音频生成;全景声场/多源音频合成)
- 核心创新点: 提出“Universal Holistic Audio Generation (UniHAGen)”任务,强调同时生成屏幕内环境声、屏幕外环境声与人声的完整声场;提出基于flow-matching的扩散框架OmniSonic,在DiT中设计TriAttn三路跨注意力分别建模三类条件,并引入MoE门控自适应分配各条件对生成的贡献,从结构上解决“多源条件互相干扰/权重难平衡”的问题;同时构建覆盖典型“屏幕内/外+人声-环境声”组合的新基准UniHAGen-Bench以系统评测该能力。
- Track: Audio Generation (video/text-conditioned diffusion; holistic soundscape synthesis)
- Key innovations: Formulates UniHAGen to synthesize holistic auditory scenes containing on-screen sounds, off-screen sounds, and speech (beyond prior non-speech holistic setups); proposes OmniSonic, a flow-matching diffusion framework with a TriAttn-DiT that uses three dedicated cross-attention branches for on-screen ambience, off-screen ambience, and speech, plus an MoE gating mechanism to dynamically balance their influence—addressing multi-condition interference and weighting; introduces UniHAGen-Bench to evaluate representative on/off-screen speech–environment scenarios.
- [2026-04-02] CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection 🆕NEW
- 赛道归属: 多模态理解(音视频暴力检测;高效跨模态融合/状态空间模型)
- 核心创新点: 提出CoLoRSMamba,用“方向性Video→Audio”的条件化LoRA在不使用token级跨注意力的情况下实现跨模态调制:由VideoMamba的CLS token在每层生成通道级调制向量与稳定门控,直接作用于AudioMamba中选择性状态空间参数(Δ、B、C及步长通路)的投影,从而让音频动态对场景语义自适应、并提升噪声/弱相关音频下的鲁棒性;训练上结合二分类与对称AV-InfoNCE对齐clip级嵌入,强化跨模态一致性;并通过从NTU-CCTV与DVD构建“有音频可用”的过滤子集,提升多模态评测的可比性与公平性。
- Track: Multimodal Understanding (audio-visual violence detection; efficient cross-modal fusion with SSM/Mamba)
- Key innovations: Introduces CoLoRSMamba, a directional Video→Audio fusion scheme that replaces token-level cross-attention with CLS-guided conditional LoRA: the VideoMamba CLS token produces per-layer channel-wise modulation and a stabilization gate to adapt AudioMamba projections for selective state-space parameters (Δ, B, C) including the step-size pathway, yielding scene-aware audio dynamics under noisy/weakly-related audio; trains with classification plus symmetric AV-InfoNCE to align clip-level embeddings; curates audio-available subsets of NTU-CCTV and DVD for fair multimodal evaluation.
- [2026-04-02] Woosh: A Sound Effects Foundation Model 🆕NEW
- 赛道归属: 音频生成(音效基础模型;文本到音频/视频到音频;编解码与对齐)
- 核心创新点: 发布面向“音效”优化的开源基础模型Woosh,将音频生成系统拆解为可复用的模块化栈:高质量音频encoder/decoder、文本-音频对齐模型、以及文本到音频与视频到音频生成模型,形成从表征学习到条件生成的完整开源基座;同时提供蒸馏版T2A/V2A以在低资源与快速推理场景下保持可用性能,强调“可部署性/效率”作为基础模型能力的一部分;并在公私数据上对关键模块与现有开源方案进行对比评测,给出可复现实验与权重代码,降低社区复用门槛。
- Track: Audio Generation (sound effects foundation model; text-to-audio & video-to-audio; codec and alignment)
- Key innovations: Releases Woosh as an open sound-effects-focused foundation stack, modularizing the pipeline into a high-quality audio encoder/decoder, a text–audio alignment model, and generative T2A and V2A models—providing an end-to-end reusable base from representation to conditional generation; includes distilled T2A/V2A variants to enable low-resource, fast inference, treating deployability/efficiency as a first-class capability; benchmarks each module against open alternatives and ships reproducible code/weights to lower adoption friction.
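OmniSonic's MoE gate adaptively weights the three condition branches (on-screen ambience, off-screen ambience, speech). A plain softmax gate over per-branch logits is a minimal stand-in for that router; the vectors and logits here are illustrative:

```python
import numpy as np

def gated_mix(branch_outputs, gate_logits):
    # Softmax over per-branch logits yields the adaptive contribution weights.
    z = np.exp(gate_logits - gate_logits.max())
    w = z / z.sum()
    mixed = sum(wi * b for wi, b in zip(w, branch_outputs))
    return mixed, w

on_screen = np.array([1.0, 0.0])
off_screen = np.array([0.0, 1.0])
speech = np.array([1.0, 1.0])
mixed, w = gated_mix([on_screen, off_screen, speech], np.array([2.0, 0.0, 0.0]))
```

Raising one branch's logit shifts the mix toward it without zeroing the others, which is the "dynamic balancing of multi-condition influence" the entry describes.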
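CoLoRSMamba's directional fusion — a video CLS token producing a channel-wise modulation plus a stabilization gate applied to an audio-side projection — can be sketched as a conditional low-rank (LoRA-style) update. Shapes, the tanh/sigmoid choices, and the random weights are all illustrative assumptions, not the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))                 # frozen audio projection
A = rng.normal(size=(r, d))                 # LoRA down-projection
B = rng.normal(size=(d, r))                 # LoRA up-projection
G = rng.normal(size=(d, d))                 # maps CLS token -> channel modulation

def conditioned_projection(x_audio, cls_video):
    m = np.tanh(G @ cls_video)              # channel-wise modulation from video
    gate = 1.0 / (1.0 + np.exp(-m.mean()))  # scalar stabilization gate in (0, 1)
    # Frozen path plus video-conditioned low-rank update — no cross-attention.
    return W @ x_audio + gate * m * (B @ (A @ x_audio))
```

The same audio input is projected differently under different video conditions, which is the cross-modal modulation effect without token-level attention.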
GitHub
- [2026-04-07] huggingface/diffusers ⭐33270
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-04-02] Lightricks/LTX-2 ⭐5586
Official Python inference and LoRA trainer package for the LTX-2 audio–video generative model.
- [2026-04-03] FunAudioLLM/ThinkSound ⭐1298
[NeurIPS 2025] PyTorch implementation of ThinkSound, a unified framework for generating audio from any modality, guided by Chain-of-Thought (CoT) re...
- [2026-04-01] OpenMOSS/MOVA ⭐882 🆕NEW
MOVA: Towards Scalable and Synchronized Video–Audio Generation
- [2026-04-06] apocas/restai ⭐483
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
语言大模型 / Large Language Models
GitHub
- [2026-04-06] abhigyanpatwari/GitNexus ⭐23657
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-04-06] DeusData/codebase-memory-mcp ⭐1271
High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-m...
- [2026-04-07] clice-io/clice ⭐1200
A next-generation C++ language server for modern C++, focused on high performance and deep code intelligence
- [2026-04-07] justrach/codedb ⭐589 🆕NEW
Zig code intelligence server and MCP toolset for AI agents. Fast tree, outline, symbol, search, read, edit, deps, snapshot, and remote GitHub repo que...
- [2026-04-07] proxysoul/soulforge ⭐186 🆕NEW
Graph-powered code intelligence, multi-agent coding with codebase-aware AI. No more grep & pray
多模态大模型 / Multimodal Models
arXiv
- [2026-04-01] JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation 📖1 🆕NEW
- 赛道归属: 多模态评测 / VLM评测基准(日语VQA)
- 核心创新点: 通过对7个既有日语VQA基准进行两轮人工系统化精炼,集中修复“问题歧义、标注错误、无需视觉即可作答”等导致评测失真的数据缺陷;构建JAMMEval以提升评测可靠性,并实证其带来更低的重复运行方差、更强的模型区分度与更贴近真实能力的得分表现,同时开源数据与代码以促进日语VLM可复现实证评估。
- Track: Multimodal evaluation / VLM benchmarking (Japanese VQA)
- Core innovation: Systematically refines seven existing Japanese VQA benchmarks via two rounds of human annotation to fix ambiguity, wrong labels, and non-visual-solvable items that undermine evaluation validity; releases JAMMEval and shows it yields more capability-faithful scores, lower run-to-run variance, and better separation between models, with dataset and code open-sourced.
- [2026-04-06] Rethinking Model Efficiency: Multi-Agent Inference with Large Models 🆕NEW
- 赛道归属: 推理优化 / 高效推理(多智能体推理、token效率)
- 核心创新点: 从端到端延迟分解出“输出token数”是VLM推理瓶颈之一,并提出反直觉结论:大模型若能用更短输出可比小模型长输出更高效;据此提出多智能体推理框架——默认使用“大模型短回答”以降低生成开销,在需要更强推理时复用/迁移“小模型产生的关键推理token”,以接近大模型自带长推理的效果并兼顾效率。
- Track: Inference optimization / Efficient inference (multi-agent, token efficiency)
- Core innovation: Identifies output token length as a major latency bottleneck and empirically shows large models with shorter outputs can be more efficient than smaller models with long generations; proposes a multi-agent inference scheme that keeps a large model’s short responses by default while reusing/transferring key reasoning tokens from a smaller model when needed, approaching large-model reasoning quality with reduced generation cost.
- [2026-04-06] Vero: An Open RL Recipe for General Visual Reasoning 🆕NEW
- 赛道归属: 多模态推理强化学习 / VLM后训练(RL配方与数据)
- 核心创新点: 提供可复现、全开源的视觉推理RL训练“配方”:汇聚59个数据源构建600K规模的Vero-600K,并用“任务路由奖励”统一处理跨任务、异构答案格式;在30项挑战基准上系统验证RL数据覆盖面与奖励设计对泛化视觉推理的关键作用,证明无需专有“thinking数据”也能显著超越同底座开源模型,并通过消融揭示不同任务类别诱发的推理模式迁移性有限,强调广覆盖RL数据是主要驱动力。
- Track: Multimodal RL post-training / Visual reasoning VLMs
- Core innovation: Delivers a fully open, reproducible RL recipe for general visual reasoning by scaling to Vero-600K (600K samples aggregated from 59 datasets) and introducing task-routed rewards to handle heterogeneous answer formats; demonstrates strong gains across 30 hard benchmarks without proprietary “thinking” data and shows via ablations that broad task coverage (not isolated categories) is the main driver of RL scaling due to limited cross-category transfer of reasoning patterns.
- [2026-04-06] ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality 🆕NEW
- 赛道归属: 端侧多模态应用 / XR人机交互(on-device VLM)
- 核心创新点: 将“控制器点击选物”的显式指代机制与端侧VLM推理结合,形成以对象为中心的XR多模态问答流程,降低仅靠凝视/语音带来的指代歧义;通过ONNX本地推理实现隐私与低延迟优势,并在Magic Leap上落地与云端大模型方案进行用户研究对比,从系统层面验证“端侧+可点击对象选择”在可信交互中的可行性。
- Track: On-device multimodal systems / XR interaction (on-device VLM)
- Core innovation: Combines controller-based click selection (explicit object reference) with on-device VLM inference to enable object-centric XR Q&A, reducing ambiguity inherent to gaze/voice-only interfaces; implements local ONNX-based inference in Magic Leap and validates practicality via user studies against cloud LMM baselines, emphasizing privacy-preserving, latency-aware trustworthy interaction.
- [2026-04-06] Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations 🆕NEW
- 赛道归属: 多模态幻觉检测 / 可解释性与可靠性(token级grounding)
- 核心创新点: 指出全局相关性度量会被“弱但分散”的伪相关掩盖,导致幻觉token逃逸检测;提出基于patch级细粒度token grounding的检测框架,跨层分析token-区域交互,归纳幻觉的两类结构性特征(注意力弥散不聚焦、与任一区域缺乏语义对齐),并用轻量统计特征结合隐层表示实现可解释的token级幻觉判别,在检测精度上显著提升。
- Track: Multimodal hallucination detection / Interpretability & reliability (token grounding)
- Core innovation: Shows global image-level relevance can be fooled by weak-but-diffuse correlations, letting hallucinated tokens pass; introduces patch-level, fine-grained token grounding analysis across layers and identifies two signatures of hallucination—diffuse non-localized attention and lack of semantic alignment to any region—then builds a lightweight, interpretable detector using patch-level statistics plus hidden representations, achieving strong token-level detection accuracy.
- [2026-04-06] The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models 🆕NEW
- 赛道归属: 自动驾驶多模态 / 适配与持续学习(灾难性遗忘缓解)
- 核心创新点: 首次系统化量化“驾驶场景微调导致VLM世界知识遗忘”的问题:构建18万场景的大规模基准用于评测遗忘;提出Drive Expert Adapter(DEA)将适配从权重空间转移到prompt/专家路由空间,通过场景线索动态选择不同知识专家进行推理,在提升驾驶任务表现的同时避免破坏底座参数,从机制上缓解“任务适配 vs 通用知识保留”的矛盾。
- Track: Autonomous driving multimodal / Adaptation & continual learning (catastrophic forgetting)
- Core innovation: Establishes the first large-scale benchmark (180K scenes) to quantify catastrophic forgetting when fine-tuning VLMs for driving; proposes Drive Expert Adapter (DEA) that shifts adaptation from weight updates to prompt-space expert routing, dynamically selecting knowledge experts from scene cues to improve driving performance while preserving pretrained world knowledge and mitigating forgetting.
- [2026-04-06] Less Detail, Better Answers: Degradation-Driven Prompting for VQA 🆕NEW
- 赛道归属: 多模态理解 / VQA提示工程与鲁棒推理(输入退化策略)
- 核心创新点: 提出“以退为进”的Degradation-Driven Prompting(DDP):通过有策略地降低图像保真度(如80p降采样)并叠加结构化视觉提示(白底mask、正交线、模糊/对比增强等),迫使模型聚焦几何结构与关键线索、抑制高频纹理噪声引发的幻觉;并引入任务分类+工具化提示的组合流程,分别针对物理属性与多类感知错觉/异常场景提升VQA推理准确率。
- Track: Multimodal understanding / VQA prompting & robust reasoning (input degradation)
- Core innovation: Proposes Degradation-Driven Prompting (DDP), intentionally reducing image fidelity (e.g., 80p downsampling) and adding structural visual cues (white-background masks, orthogonal lines, blur/contrast tools) to steer VLMs toward essential geometry and away from texture noise that triggers hallucinations; uses a task-classification plus tool-specialized prompting pipeline to improve reasoning on physical-attribute and perceptual-illusion/anomaly VQA settings.
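The degrade-then-prompt idea can be sketched in a few lines. The nearest-neighbour downsampling, the factor, and the prompt template are illustrative assumptions; the paper's pipeline uses its own 80p setting and tool-specialized prompts.

```python
def downsample(image, factor):
    """Nearest-neighbour downsampling: keep every `factor`-th pixel in each
    dimension, discarding the high-frequency texture that DDP identifies as
    a hallucination trigger. `factor` is illustrative."""
    return [row[::factor] for row in image[::factor]]

def build_prompt(question, degraded_shape):
    """Pair the question with a note that the image is intentionally
    low-fidelity; a hypothetical stand-in for DDP's structured prompts."""
    h, w = degraded_shape
    return (f"The attached image has been reduced to {h}x{w} to remove "
            f"distracting texture. Focus on overall shape and layout. {question}")

# A 16x16 toy "image" of pixel intensities, reduced 4x in each dimension.
image = [[(r * 16 + c) for c in range(16)] for r in range(16)]
small = downsample(image, 4)
prompt = build_prompt("Which object is larger?", (len(small), len(small[0])))
print(len(small), len(small[0]))
```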
- [2026-04-06] Discovering Failure Modes in Vision-Language Models using RL 🆕NEW
- 赛道归属: 多模态评测与红队 / 失败模式挖掘(RL自动对抗提问)
- 核心创新点: 用强化学习将“发现VLM盲点”自动化:训练一个提问者智能体根据候选VLM的回答自适应生成问题,以最大化诱发错误;通过逐步提升问题复杂度、组合细粒度视觉细节与技能要素,替代人工枚举弱点的低效流程,最终挖掘出36种新的失败模式,并展示对不同模型组合与分布的可迁移性。
- Track: Multimodal evaluation & red-teaming / Failure mode discovery (RL question generation)
- Core innovation: Automates VLM blind-spot discovery via RL by training a questioner agent that adaptively generates queries conditioned on a target VLM’s responses to elicit mistakes; increases complexity over training by focusing on fine-grained details and skill compositions, uncovering 36 novel failure modes and demonstrating generalization across different model pairings and data distributions.
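The questioner-agent loop can be caricatured as a bandit that is rewarded for eliciting mistakes. The toy target model, its planted blind spot ("count" + "occluded"), and the reinforcement rule are all assumptions standing in for the paper's RL-trained questioner.

```python
import random

def target_vlm(question):
    """Stand-in for the candidate VLM: a toy oracle that fails only on
    questions combining two specific skills (a hypothetical blind spot)."""
    return "wrong" if ("count" in question and "occluded" in question) else "right"

def reward(question):
    """The questioner is rewarded exactly when it elicits a mistake."""
    return 1.0 if target_vlm(question) == "wrong" else 0.0

skills = ["count", "color", "occluded", "spatial"]

def sample_question(weights, rng):
    """Compose a question from two skills, sampled by current preference weights."""
    a, b = rng.choices(skills, weights=weights, k=2)
    return f"{a} {b} question"

# Minimal bandit-style questioner: reinforce the skills that induce failures.
rng = random.Random(0)
weights = [1.0] * len(skills)
for _ in range(500):
    q = sample_question(weights, rng)
    r = reward(q)
    for i, s in enumerate(skills):
        if s in q:
            weights[i] += 0.5 * r  # shift mass toward failure-inducing skills

print(weights)
```

After training, the weights for the blind-spot skills dominate, i.e., the questioner has localized the failure mode without human enumeration.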
- [2026-04-06] ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration 🆕NEW
- 赛道归属: 具身智能 / 多机器人协作框架(VLM控制、sim-to-real闭环)
- 核心创新点: 提出面向异构多机器人协作的分层语义-物理统一框架:以VLM作为统一控制器贯通语义推理与执行,并利用e-URDF将机器人物理约束显式注入,构建仿真到真实的拓扑映射以实时访问多体状态;同时把真实执行中的多模态观测、状态与轨迹纳入数据闭环,支持迭代式策略优化与跨平台迁移,并在部署时动态分配任务控制权以提升长时序、多策略执行鲁棒性。
- Track: Embodied AI / Multi-robot collaboration framework (VLM control, sim-to-real loop)
- Core innovation: Introduces a hierarchical semantic-physical framework where a unified VLM controller bridges reasoning and execution for heterogeneous robots; injects physical constraints via e-URDF to build a sim-to-real topological mapping with real-time access to simulated/real agent states, and closes the loop by logging multimodal observations, states, and trajectories during real execution for iterative policy optimization and cross-platform transfer, with dynamic task-to-agent control assignment for robust long-horizon execution.
- [2026-04-06] InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation 🆕NEW
- 赛道归属: 视觉异常检测 / 通用少样本异常检测与分割(VLM语义先验)
- 核心创新点: 在InCTRL“in-context residual”少样本范式上扩展为双分支通用异常检测与分割:主分支DASL引入正常+异常数据学习语义引导的“正常/异常”判别空间,辅分支OASL仅用正常数据学习更泛化的“正常性”语义表征;两分支共同利用大规模视觉-文本模型提供的语义先验,从“区分视角+偏离视角”双重刻画异常,提高跨域泛化与少样本鲁棒性,并在多数据集上取得SOTA。
- Track: Visual anomaly detection / Generalist few-shot anomaly detection & segmentation (VLM priors)
- Core innovation: Extends InCTRL’s few-shot in-context residual paradigm into a dual-branch generalist AD/segmentation framework: DASL (main branch) learns a semantic-guided normal/abnormal discriminative space using both normal and abnormal data, while OASL (aux branch) learns generalized normality semantics using only normal data; both are guided by rich vision-language priors, providing complementary “discrimination” and “deviation-from-normality” perspectives that improve cross-domain generalization and few-shot robustness, achieving strong results across many datasets.
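The complementary "discrimination + deviation-from-normality" scoring can be sketched as below. The distance-based residual, the hand-set linear head, and the fusion weight `alpha` are illustrative assumptions; the paper's branches operate on learned VLM features.

```python
import math

def residual_score(query, normal_refs):
    """OASL-style deviation score: distance from the query feature to its
    nearest few-shot normal prototype (an in-context residual)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(dist(query, r) for r in normal_refs)

def discriminative_score(query, w, b):
    """DASL-style score from a linear head trained on normal vs abnormal
    data; the weights here are hand-picked, not learned."""
    z = sum(wi * qi for wi, qi in zip(w, query)) + b
    return 1.0 / (1.0 + math.exp(-z))

def anomaly_score(query, normal_refs, w, b, alpha=0.5):
    """Fuse the two branches, mirroring the dual-branch design."""
    return (alpha * discriminative_score(query, w, b)
            + (1 - alpha) * residual_score(query, normal_refs))

normals = [[0.0, 0.0], [0.1, 0.0]]
w, b = [1.0, 1.0], -1.0
normal_q = anomaly_score([0.05, 0.0], normals, w, b)
abnormal_q = anomaly_score([2.0, 2.0], normals, w, b)
print(normal_q < abnormal_q)
```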
GitHub
- [2026-04-06] Blaizzy/mlx-vlm ⭐4120
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-04-06] liudaizong/Awesome-LVLM-Attack ⭐526 🆕NEW
😎 up-to-date & curated list of awesome Attacks on Large-Vision-Language-Models papers, methods & resources.
- [2026-04-06] Roots-Automation/GutenOCR ⭐180 🆕NEW
Open-source tools for training and evaluating Vision Language Models for OCR
- [2026-04-06] OpenGVLab/MMT-Bench ⭐118 🆕NEW
[ICML 2024] | MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
- [2026-04-06] yunncheng/MMRL ⭐102 🆕NEW
[CVPR 2025 & IJCV2026] Official PyTorch Code for "MMRL: Multi-Modal Representation Learning for Vision-Language Models" and its extension "MMRL++: Par...
强化学习 / Reinforcement Learning
arXiv
- [2026-04-02] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment 📖2 🆕NEW
- 赛道归属: 大语言模型对齐(RLHF高效微调 / 数据选择)
- 核心创新点: 提出DEFT对齐框架,用“差分分布奖励”同时刻画(1)模型输出分布与(2)偏好数据差异分布之间的偏离程度,据此从原始偏好数据中筛出小而高质量的子集,并将该分布引导信号注入现有对齐方法以约束输出分布迁移;在显著降低训练时间的同时,提升对齐效果并缓解对齐导致的泛化能力下降。
- Track: LLM alignment (efficient RLHF / data selection)
- Key innovation: Proposes DEFT, using a differential distribution reward to measure mismatch between the model’s output distribution and the discrepancy distribution in preference data, enabling high-quality subset filtering and distribution-guided alignment when plugged into existing methods; improves alignment and preserves generalization with much lower training cost.
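The data-selection side of the idea can be illustrated with a crude proxy: rank preference pairs by how far the model's log-probability margin deviates from the data's mean margin, and keep the most informative fraction. This deviation score is a simplification invented for illustration, not DEFT's differential distribution reward.

```python
def margin(logp_chosen, logp_rejected):
    """Log-probability margin the model assigns to the preferred response."""
    return logp_chosen - logp_rejected

def select_subset(pairs, keep_frac=0.5):
    """Rank preference pairs (logp_chosen, logp_rejected) by deviation of the
    model margin from the dataset mean margin (a crude stand-in for DEFT's
    distribution-level mismatch signal) and keep the top fraction."""
    margins = [margin(c, r) for c, r in pairs]
    mean = sum(margins) / len(margins)
    ranked = sorted(range(len(pairs)), key=lambda i: -abs(margins[i] - mean))
    k = max(1, int(keep_frac * len(pairs)))
    return sorted(ranked[:k])

pairs = [(-1.0, -1.1), (-0.5, -3.0), (-2.0, -0.2), (-1.2, -1.3)]
subset = select_subset(pairs, keep_frac=0.5)
print(subset)
```

Pairs 1 and 2 survive here because the model's margin on them disagrees most with the aggregate distribution, which is the flavor of signal DEFT uses to shrink the training set.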
- [2026-03-31] Empirical Validation of the Classification-Verification Dichotomy for AI Safety Gates 📖1 🆕NEW
- 赛道归属: 强化学习安全(安全门控/验证与自改进)
- 核心创新点: 系统性实证证明“分类器式安全门”(safety gate)在迭代自改进场景下存在结构性失效:即便训练精度100%、NP最优检验、以及多种安全RL基线(CPO/Lyapunov/Shielding)也无法同时满足低误放行与可持续改进;进一步提出并验证“验证器式门控”替代路径——基于Lipschitz球的解析可证安全界给出零误放行,并通过ball chaining实现跨越多个安全球的无界参数空间遍历;还给出按组组合验证以显著放大可验证半径。
- Track: RL safety (safety gating / verification for self-improvement)
- Key innovation: Empirically establishes a structural impossibility for classifier-based safety gates under iterative self-improvement (even NP-optimal tests fail), and demonstrates verifier-based oversight via analytically bounded Lipschitz balls with zero false accepts; introduces ball chaining for unbounded safe traversal and compositional per-group verification to enlarge certified radii.
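The Lipschitz-ball argument is easy to make concrete: if a safety score `f` has Lipschitz constant `L` and exceeds the safe threshold by margin `m` at `theta0`, then every point within `m / L` of `theta0` is provably safe, and chaining overlapping balls allows unbounded traversal. The toy 1-D score below (with `L = 1`) is an illustrative assumption, not the paper's setting.

```python
import math

def safety_margin(theta):
    """Toy analytic safety score minus its threshold; Lipschitz with L = 1.
    Any theta with safety_margin(theta) > 0 is safe. Purely illustrative."""
    return 5.0 + math.sin(theta)

L = 1.0  # known Lipschitz constant of safety_margin

def certified_radius(theta):
    """Inside |t - theta| < margin / L the score cannot cross zero, so a
    verifier-style gate accepts with zero false accepts."""
    m = safety_margin(theta)
    return m / L if m > 0 else 0.0

def chain(theta0, target, step_frac=0.9):
    """Ball chaining: hop toward `target` by at most step_frac of the current
    certified radius, re-certifying analytically at each new center."""
    theta, hops = theta0, 0
    while abs(target - theta) > 1e-6 and hops < 1000:
        r = certified_radius(theta)
        step = min(step_frac * r, abs(target - theta))
        theta += math.copysign(step, target - theta)
        hops += 1
    return theta, hops

final, hops = chain(0.0, 100.0)
print(round(final, 6), hops)
```

Because the margin never drops below 4, each hop covers at least 3.6 units, so the chain reaches an arbitrarily distant target in finitely many certified steps — the contrast with a classifier gate, whose accept region carries no such analytic guarantee.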
- [2026-04-06] Stratifying Reinforcement Learning with Signal Temporal Logic 🆕NEW
- 赛道归属: 强化学习理论(STL时序逻辑奖励/表示几何分析)
- 核心创新点: 将信号时序逻辑(STL)的原子谓词重新解释为“分层空间(stratified space)成员测试”,提出基于分层理论的STL语义并建立对应原理:多数STL公式可视为诱导时空分层结构;据此把DRL学到的潜在嵌入空间结构与决策空间几何联系起来,并提出可计算、效率较高的分层“签名”用于从嵌入中挖掘分层结构,在以STL鲁棒性为奖励的Minigrid示例中进行数值验证。
- Track: RL theory (temporal-logic rewards / representation geometry)
- Key innovation: Recasts STL semantics through stratification theory by interpreting predicates as membership in stratified spaces, yielding a correspondence where STL formulas induce space-time stratifications; connects DRL latent embeddings to ambient decision-space geometry and proposes efficient computable signatures to detect stratification structure, validated on Minigrid with STL-robustness rewards.
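The "predicate as membership test" reading builds on standard STL quantitative semantics: an atomic predicate `mu(x) > 0` has robustness `mu(x)` (its sign is the membership test for the region `mu` bounds), `G` (always) takes the minimum over time, and `F` (eventually) the maximum. A minimal sketch of that reward signal (the signal and predicate are toy examples, and the paper's stratification signatures are not shown):

```python
def rho_pred(signal, mu):
    """Robustness of an atomic predicate mu(x) > 0 at each time step; the sign
    of mu is a membership test for the region mu's zero level set bounds."""
    return [mu(x) for x in signal]

def rho_always(rho):
    """G (always): worst-case satisfaction over the window."""
    return min(rho)

def rho_eventually(rho):
    """F (eventually): best-case satisfaction over the window."""
    return max(rho)

# Toy trajectory: distance to goal; predicate "within 1.0 of the goal".
signal = [3.0, 2.0, 0.5, 0.2]
mu = lambda d: 1.0 - d
rho = rho_pred(signal, mu)
print(rho_eventually(rho), rho_always(rho))
```

`F` is positive (the agent eventually gets within range) while `G` is negative (it is not always within range) — exactly the graded signal used as the RL reward in the paper's Minigrid experiments.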
- [2026-04-06] Vero: An Open RL Recipe for General Visual Reasoning 🆕NEW
- 赛道归属: 多模态推理强化学习 / VLM后训练(RL配方与数据)
- 核心创新点: 提供可复现、全开源的视觉推理RL训练“配方”:汇聚59个数据源构建600K规模的Vero-600K,并用“任务路由奖励”统一处理跨任务、异构答案格式;在30项挑战基准上系统验证RL数据覆盖面与奖励设计对泛化视觉推理的关键作用,证明无需专有“thinking数据”也能显著超越同底座开源模型,并通过消融揭示不同任务类别诱发的推理模式迁移性有限,强调广覆盖RL数据是主要驱动力。
- Track: Multimodal RL post-training / Visual reasoning VLMs
- Core innovation: Delivers a fully open, reproducible RL recipe for general visual reasoning by scaling to Vero-600K (600K samples aggregated from 59 datasets) and introducing task-routed rewards to handle heterogeneous answer formats; demonstrates strong gains across 30 hard benchmarks without proprietary “thinking” data and shows via ablations that broad task coverage (not isolated categories) is the main driver of RL scaling due to limited cross-category transfer of reasoning patterns.
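Task-routed rewards amount to dispatching each sample to a verifier matching its answer format, so one RL run can mix heterogeneous datasets. The task names and the two verifiers below are hypothetical examples of the pattern, not Vero's actual routing table.

```python
def reward_mcq(pred, gold):
    """Exact-match reward for multiple-choice answers, tolerant of case/space."""
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

def reward_numeric(pred, gold, tol=1e-3):
    """Tolerance-based reward for numeric answers; unparseable output gets 0."""
    try:
        return 1.0 if abs(float(pred) - float(gold)) <= tol else 0.0
    except ValueError:
        return 0.0

ROUTES = {"mcq": reward_mcq, "numeric": reward_numeric}

def task_routed_reward(task_type, pred, gold):
    """Dispatch each rollout to the verifier for its task's answer format."""
    return ROUTES[task_type](pred, gold)

print(task_routed_reward("mcq", " b ", "B"),
      task_routed_reward("numeric", "3.1416", "3.14159"))
```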
- [2026-04-06] Analyzing Symbolic Properties for DRL Agents in Systems and Networking 🆕NEW
- 赛道归属: 强化学习可验证性(系统与网络DRL的符号性质分析)
- 核心创新点: 从“点性质”验证推进到“符号性质”分析:提出面向系统/网络DRL策略的通用符号性质形式化(如单调性、鲁棒性),并将其编码为同一策略在“相关输入范围”上的成对执行比较;通过分解为可求解的子性质,使现有DNN验证引擎可直接用于大范围输入区间的性质检查;在三类真实控制系统上用diffRL系统性评估,展示符号性质能覆盖更广状态空间、发现更具操作意义的反例,并量化不同求解器/模型规模的可验证性权衡。
- Track: RL verification (symbolic property checking for systems & networking)
- Key innovation: Introduces a generic formulation of symbolic properties (e.g., monotonicity, robustness) for DRL policies and reduces their analysis to comparisons between related executions of the same policy, decomposed into tractable sub-properties solvable by existing DNN verifiers; demonstrates broader coverage than point-wise checks and practical solver/model-size trade-offs via diffRL across multiple real control domains.
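The paired-execution encoding is the core trick: a symbolic property like monotonicity becomes a comparison between two runs of the same policy on related inputs. The brute-force loop below stands in for the DNN verifier, and the toy policy (with a planted violation) is an illustrative assumption.

```python
def policy(load):
    """Toy congestion-control policy mapping network load to sending rate;
    stands in for a DRL policy, with a planted non-monotonicity at load 6."""
    return 10 - load if load != 6 else 9

def check_monotonicity(lo, hi, step=1):
    """Symbolic property encoded as paired executions of the same policy:
    for every related pair (x, x + step) in [lo, hi], the rate must not
    increase with load. A real analysis hands the input interval to a DNN
    verifier instead of enumerating points."""
    for x in range(lo, hi):
        if policy(x + step) > policy(x):
            return (x, x + step)  # counterexample pair
    return None

print(check_monotonicity(0, 10))
```

The returned pair is the kind of operationally meaningful counterexample that point-wise checks on sampled states would likely miss.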
- [2026-04-06] QED-Nano: Teaching a Tiny Model to Prove Hard Theorems 🆕NEW
- 赛道归属: 数学推理与证明生成(小模型RL后训练 / 可验证奖励)
- 核心创新点: 构建4B小模型QED-Nano的可复现训练流水线,以“三阶段后训练”实现奥赛级证明能力:先从强数学模型蒸馏做SFT获得证明文风与基础能力,再用基于评分细则(rubric)的RL进行可验证偏好优化,最后引入“推理缓存(reasoning cache)”把长证明拆成迭代的总结-改写循环以增强训练与测试时推理;在显著低推理成本下超过更大开源模型并逼近部分闭源系统,同时开源数据与代码以促进可复现实验。
- Track: Mathematical reasoning & proof generation (small-model RL post-training / verifiable rewards)
- Key innovation: Delivers a reproducible 3-stage recipe for a 4B prover: distillation-based SFT for proof style, rubric-reward RL for verifiable optimization, and a reasoning cache that decomposes long proofs into iterative summarize-and-refine cycles to strengthen both training and test-time reasoning; achieves strong Olympiad-level proof performance at low inference cost with full pipeline release.
- [2026-04-06] Rethinking Exploration in RLVR: From Entropy Regularization to Refinement via Bidirectional Entropy Modulation 🆕NEW
- 赛道归属: LLM推理强化学习(RLVR探索机制 / 熵调控)
- 核心创新点: 针对RLVR中“探索受限”与传统熵正则不稳定的问题,从群组相对优势估计(GRPO类)推导熵动态,提出将策略熵分解为“信息熵”(保留有效多解路径)与“伪熵”(破坏推理模式的噪声);提出“熵精炼(entropy refinement)”观点:对正样本维持信息熵、对负样本抑制伪熵;据此提出AsymGRPO,显式解耦正/负rollout的熵调制,实现更可控的探索-收敛权衡,并在多任务上优于强基线且可与熵正则协同。
- Track: RL for LLM reasoning (RLVR exploration / entropy control)
- Key innovation: Reinterprets exploration in RLVR by decomposing policy entropy into informative vs spurious components via analysis of group-relative advantage dynamics; proposes “entropy refinement” (preserve entropy on positive rollouts, suppress on negative) and implements it as AsymGRPO with decoupled modulation for positive/negative trajectories, yielding stronger and more stable gains than standard entropy regularization.
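The asymmetric modulation can be sketched on top of GRPO-style group-relative advantages: an entropy bonus on positive-advantage rollouts, an entropy penalty on negative ones. The coefficients and the scalar surrogate below are illustrative assumptions, not AsymGRPO's exact objective.

```python
def group_advantages(rewards):
    """GRPO-style group-relative advantages: reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def asym_entropy_coef(adv, c_pos=0.01, c_neg=0.02):
    """Decoupled entropy modulation in the spirit of AsymGRPO: preserve
    entropy (informative diversity) on positive rollouts, suppress it
    (spurious noise) on negative ones. Coefficients are illustrative."""
    return c_pos if adv > 0 else -c_neg

def modulated_objective(logps, entropies, rewards):
    """Per-group surrogate: advantage-weighted log-prob plus the asymmetric
    entropy term, summed over rollouts."""
    advs = group_advantages(rewards)
    return sum(a * lp + asym_entropy_coef(a) * h
               for a, lp, h in zip(advs, logps, entropies))

rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_advantages(rewards)
print(advs, [asym_entropy_coef(a) for a in advs])
```

Uniform entropy regularization would apply one coefficient to all four rollouts; here the sign flips with the advantage, which is the paper's proposed refinement.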
- [2026-04-06] Data Attribution in Adaptive Learning 🆕NEW
- 赛道归属: 强化学习理论(自适应学习的数据归因/因果归因)
- 核心创新点: 面向在线bandit/RL/LLM后训练等“自适应数据生成”场景,提出发生级(occurrence-level)数据归因的形式化目标:用条件干预(conditional interventional)定义单条观测对最终性能的因果贡献,显式纳入“该观测会改变未来采样分布”的反馈效应;证明仅靠回放(replay)侧信息一般无法识别该归因目标,并刻画一类具有结构约束的情形可从日志数据中实现可识别,为自适应训练中的可解释与责任追踪提供理论基础。
- Track: RL theory (data attribution / causal attribution in adaptive learning)
- Key innovation: Formalizes occurrence-level attribution in finite-horizon adaptive learning using a conditional interventional causal target that accounts for feedback (each sample shifts future data collection); proves non-identifiability from replay-side information in general and identifies a structural class where the target becomes identifiable from logged data.
- [2026-04-06] Synthetic Sandbox for Training Machine Learning Engineering Agents 🆕NEW
- 赛道归属: 智能体强化学习(ML工程智能体 / 合成环境与可验证训练)
- 核心创新点: 针对MLE智能体在真实环境中“验证成本极高、导致轨迹级on-policy RL不可用”的瓶颈,提出SandMLE:从少量种子任务自动生成多样、可验证的合成MLE沙盒环境,并将数据集压缩到微规模(每任务50–200样本)以保留结构复杂度同时大幅降低训练/评测开销;通过多智能体生成提升任务覆盖与多样性,使大规模轨迹级on-policy RL首次在MLE域可行,并验证其相对SFT在多模型规模上显著提升且能跨脚手架泛化。
- Track: Agentic RL (ML engineering agents / synthetic verifiable environments)
- Key innovation: Introduces SandMLE, generating diverse verifiable synthetic MLE environments from few seeds while constraining datasets to micro-scale (50–200 samples) to slash rollout verification cost; enables large-scale trajectory-wise on-policy RL for MLE agents (previously prohibitively slow) and shows substantial gains over SFT with strong scaffold generalization.
- [2026-04-06] Selecting Decision-Relevant Concepts in Reinforcement Learning 🆕NEW
- 赛道归属: 可解释强化学习(概念瓶颈/概念选择与状态抽象)
- 核心创新点: 将“概念选择”提升为可证明的序贯决策问题:提出把概念是否决策相关(decision-relevant)视为一种状态抽象——移除某概念会否导致把需不同行为的状态混淆;据此提出DRS算法,在候选概念集中自动选择子集,并给出所选概念与最终策略性能之间的界;实验表明可自动恢复人工概念集且性能不降甚至更优,并提升测试时概念干预的有效性,降低对领域专家的依赖。
- Track: Interpretable RL (concept-based policies / state abstraction)
- Key innovation: Frames automatic concept selection as state abstraction: a concept is decision-relevant if removing it merges states requiring different optimal actions; proposes DRS to select a concept subset with performance bounds linking selected concepts to policy quality, empirically matching or outperforming manually curated concept sets and improving test-time concept interventions.
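The state-abstraction test at the heart of the method is simple to demonstrate: drop a concept and check whether two states that require different optimal actions collapse into the same abstract state. The enumerated toy MDP below is an illustrative assumption, and the brute-force check stands in for DRS itself.

```python
def abstract(state, keep):
    """Project a state (tuple of concept values) onto the kept concept indices."""
    return tuple(state[i] for i in keep)

def decision_relevant(concept_idx, keep, optimal_action):
    """A concept is decision-relevant iff removing it merges two states whose
    optimal actions differ (the paper's state-abstraction criterion)."""
    reduced = [i for i in keep if i != concept_idx]
    groups = {}
    for state, action in optimal_action.items():
        key = abstract(state, reduced)
        if key in groups and groups[key] != action:
            return True  # merged states disagree on the optimal action
        groups[key] = action
    return False

# Concepts: (light_color, pedestrian_nearby). Action depends only on the light.
optimal_action = {
    ("red", True): "stop", ("red", False): "stop",
    ("green", True): "go", ("green", False): "go",
}
keep = [0, 1]
print(decision_relevant(0, keep, optimal_action),
      decision_relevant(1, keep, optimal_action))
```

Dropping the light color merges a "stop" state with a "go" state (relevant), while dropping the pedestrian flag merges only same-action states (not relevant), so the selected concept set shrinks to `{light_color}` without losing policy quality.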
GitHub
- [2026-04-07] verl-project/verl ⭐20478 🆕NEW
verl: Volcano Engine Reinforcement Learning for LLMs
- [2026-04-07] alibaba/ROLL ⭐3051 🆕NEW
An Efficient and User-Friendly Scaling Library for Reinforcement Learning with Large Language Models
- [2026-04-07] radixark/miles ⭐1049
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-04-07] Denghaoyuan123/Awesome-RL-VLA ⭐617 🆕NEW
A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation
- [2026-04-07] InternLM/Spatial-SSRL ⭐123 🆕NEW
[CVPR 2026] Official release of "Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning"
HuggingFace Datasets
- [2026-03-27] OpenMOSS-Team/OmniAction
  RoboOmni: Proactive Robot Manipulation in Omni-modal Context
  📖 arXiv paper (accepted to ICLR 2026) | 🌐 Website | 🤗 Model...
Generated automatically by Daily AI Digest Agent. Generated at: 2026-04-07 03:58:16