AI 每日进展速报 / Daily AI Digest - 2026-04-22
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-04-18] mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval 📖4 🆕NEW
- 赛道归属: 多模态检索(文本/图像/SVG向量图检索与嵌入)
- 核心创新点: 提出训练无关、指令驱动的多模态嵌入框架:用MLLM在不同模态上通过“模态专属指令”对齐到同一向量空间,避免投影头与对比学习;用mEOL将任意输入压缩为“单词级token”,直接取其隐藏态作为紧凑语义向量,实现可控的嵌入方向;并通过“语义SVG重写”模块基于渲染图进行视觉推理,给SVG元素赋予语义化标识并扁平化嵌套结构,显式暴露几何/关系线索,从而提升结构感知的文本到SVG检索效果。
- Track: Multimodal retrieval (text/image/SVG vector-graphics embedding)
- Key innovation: A training-free, instruction-guided embedding framework using an MLLM to align text, raster images, and SVG code without projection heads or contrastive training. mEOL forces the model to summarize any input into a single token and uses its hidden state as a controllable compact embedding. A semantic SVG rewriting module performs visual reasoning on the rendered SVG to rename elements and simplify nesting, surfacing geometric/relational cues for structure-aware retrieval.
- [2026-04-21] Counting Worlds Branching Time Semantics for post-hoc Bias Mitigation in generative AI 🆕NEW
- 赛道归属: 生成式AI安全与公平(推理期偏见缓解/形式化验证)
- 核心创新点: 提出CTLF分支时间逻辑,将“生成序列”形式化为分支世界(每个world对应一步可能输出),并引入“计数世界语义”把公平性约束表达为对受保护属性分布的可验证性质;支持三类关键推理:验证当前序列是否满足目标分布、预测后续生成继续满足约束的概率/可达性、以及计算最少需要剔除多少输出才能恢复公平,从而为后处理偏见缓解提供可解释且带形式化语义的工具。
- Track: GenAI safety & fairness (inference-time bias mitigation / formal verification)
- Key innovation: CTLF, a branching-time logic with counting-worlds semantics that models a generation process as a tree of possible outputs. It enables formally stating and checking fairness constraints over protected-attribute distributions, forecasting whether future outputs will remain within bounds, and computing the minimal number of outputs to remove to restore fairness—turning post-hoc mitigation into a logic-based, interpretable procedure.
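The third reasoning task above — finding the fewest outputs to drop so the protected-attribute distribution returns to its target bounds — can be sketched as a small feasibility search, assuming per-group share bounds; the paper's logic-based CTLF formulation is richer than this minimal version:

```python
import math

def min_removals(counts, bounds):
    """Minimum number of generated outputs to delete so every protected
    group's share of the remainder lies within its bounds.

    counts: {group: number of outputs}; bounds: {group: (lo, hi)} shares.
    Tries final totals n from largest to smallest; n is feasible iff each
    group can keep some k in [ceil(lo*n), min(count, floor(hi*n))] and
    those k's can sum to n.
    """
    total = sum(counts.values())
    for n in range(total, 0, -1):
        lo_sum, hi_sum, feasible = 0, 0, True
        for g, c in counts.items():
            lo, hi = bounds[g]
            k_min = math.ceil(lo * n)
            k_max = min(c, math.floor(hi * n))
            if k_min > k_max:
                feasible = False
                break
            lo_sum += k_min
            hi_sum += k_max
        if feasible and lo_sum <= n <= hi_sum:
            return total - n
    return total  # no nonempty subset satisfies the bounds
```

For example, with counts `{'a': 8, 'b': 2}` and both groups bounded to a 40–60% share, five removals are needed (keep 3 of `a` and 2 of `b`).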
- [2026-04-21] HP-Edit: A Human-Preference Post-Training Framework for Image Editing 🆕NEW
- 赛道归属: 图像编辑(扩散模型后训练 / 人类偏好对齐)
- 核心创新点: 提出面向扩散式图像编辑的RLHF式后训练框架HP-Edit:用少量人工偏好打分数据+预训练VLM训练出自动偏好评估器HP-Scorer,并将其同时用于(1)低成本扩展构建大规模偏好数据集(RealPref-50K,覆盖多任务且做对象编辑均衡),(2)作为奖励函数对编辑模型进行后训练;同时给出RealPref-Bench用于真实编辑评测,实现“以评促训”的闭环,把人类偏好显式注入编辑行为。
- Track: Image editing (diffusion post-training / human preference alignment)
- Key innovation: HP-Edit brings RLHF-style post-training to diffusion-based editing via an automatic preference evaluator (HP-Scorer) trained from a small amount of human ratings plus a pretrained VLM. HP-Scorer is used both to scale preference data collection (RealPref-50K) and as the reward for post-training editing models, with RealPref-Bench enabling realistic evaluation—forming a scalable “evaluator-as-reward” loop for preference-aligned editing.
- [2026-04-21] Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation 🆕NEW
- 赛道归属: 文生图/图像生成(扩散/流模型推理优化与自适应采样)
- 核心创新点: 从“空间异质性”出发提出patch级噪声尺度与自适应计算分配:指出直接对patch使用不同timestep会造成训练-推理状态不一致(信息泄露),因此设计受控的timestep采样器来限制训练时patch可获得的最大信息量;进一步加入轻量级“每patch难度预测头”,在采样时动态把更多步数/函数评估分配给困难区域;提出Patch Forcing(PF)让容易区域更早收敛以为困难区域提供上下文,并联合“空间×时间”的噪声变化策略,在不依赖对齐/引导技巧的情况下提升生成质量与效率。
- Track: Text-to-image / image generation (diffusion/flow inference optimization & adaptive sampling)
- Key innovation: Introduces patch-level noise schedules and difficulty-aware adaptive compute allocation. To avoid train–test mismatch from naive per-patch timesteps, it uses a controlled timestep sampler that caps patch information during training. A lightweight per-patch difficulty head then drives adaptive sampling to spend more steps where needed. Patch Forcing advances easy regions earlier to provide context for hard ones, combining spatial-and-temporal noise variation for better quality/efficiency, orthogonal to guidance/alignment methods.
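The compute-allocation idea — give hard patches more denoising steps under a fixed budget — can be sketched as a proportional split. The difficulty predictor and controlled sampler are the paper's; the largest-remainder apportionment rule here is an assumption:

```python
def allocate_steps(difficulty, total_steps, min_steps=1):
    """Split a denoising-step budget across patches in proportion to
    predicted difficulty, guaranteeing each patch a minimum."""
    n = len(difficulty)
    if total_steps < n * min_steps:
        raise ValueError("budget too small for the per-patch minimum")
    spare = total_steps - n * min_steps
    s = sum(difficulty)
    quotas = [d / s * spare for d in difficulty]
    floors = [int(q) for q in quotas]
    alloc = [min_steps + f for f in floors]
    # hand the rounding leftovers to the largest fractional remainders
    order = sorted(range(n), key=lambda i: quotas[i] - floors[i], reverse=True)
    for i in order[: spare - sum(floors)]:
        alloc[i] += 1
    return alloc
```

The budget is conserved exactly, so easy patches converge with few function evaluations while hard regions get the rest.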
- [2026-04-21] Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval 🆕NEW
- 赛道归属: 3D检索(草图到3D形状检索 / 扩散特征表征)
- 核心创新点: 首次系统性探索用文生图扩散模型做零样本草图-3D检索:冻结Stable Diffusion,从U-Net中间层提取并聚合表征以获得开放词汇与形状偏置带来的泛化;针对草图稀疏与域差距,引入“多模态特征增强”而非重训:注入CLIP全局/局部视觉特征以强化轮廓关注与语义上下文,并结合BLIP生成硬文本描述+可学习soft prompt形成更强文本引导;再用Circle-T损失在负样本分离后动态增强正对吸引,提升噪声草图下的跨模态对齐与检索鲁棒性。
- Track: 3D retrieval (zero-shot sketch-to-3D / diffusion-based representations)
- Key innovation: Uses a frozen Stable Diffusion backbone as an open-vocabulary, shape-biased feature extractor by aggregating intermediate U-Net representations for sketches and rendered 3D views. To handle sketch sparsity and domain gap without retraining, it injects CLIP global/local visual features and strengthens textual conditioning via BLIP-generated hard descriptions plus learnable soft prompts. Circle-T loss adaptively increases positive attraction after negatives separate, improving sketch–3D alignment and zero-shot retrieval.
- [2026-04-20] Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale 🆕NEW
- 赛道归属: 多模态表征分析(跨模态表示对齐评测/可解释性)
- 核心创新点: 对“跨模态表征趋同(Platonic)”提出规模化反证与评测范式纠偏:指出以小样本(~1K)互为最近邻的对齐指标在扩展到百万级样本时显著退化,剩余对齐更多是粗粒度语义重叠而非细粒度结构一致;揭示一对一图文配对评测会高估对齐,在真实多对多语义关系下对齐更弱;并发现“更强语言模型更趋同视觉”的趋势在新模型上不稳定,从而强调结论对评测设定高度敏感。
- Track: Multimodal representation analysis (cross-modal alignment evaluation)
- Key innovation: Stress-tests the Platonic Representation Hypothesis and shows prior evidence is evaluation-fragile. Mutual-nearest-neighbor alignment measured on ~1K samples collapses when scaled to millions, with remaining alignment reflecting coarse semantics rather than fine-grained structural agreement. It also shows one-to-one image–caption evaluation inflates alignment versus realistic many-to-many settings, and that “stronger LMs align more with vision” does not consistently hold for newer models.
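The mutual-k-nearest-neighbor alignment metric being stress-tested can be reproduced in a few lines: for each paired item, compare its neighbor set in one modality's embedding space with its neighbor set in the other. Euclidean distance and the overlap score follow the common formulation; the paper's exact variant may differ:

```python
def mutual_knn_alignment(A, B, k):
    """A, B: embeddings of the same n items in two modalities (lists of
    equal-length float lists). Returns the mean fraction of item i's
    k nearest neighbors that the two spaces share."""
    def knn(X, i):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(len(X)) if j != i
        )
        return {j for _, j in dists[:k]}
    n = len(A)
    return sum(len(knn(A, i) & knn(B, i)) / k for i in range(n)) / n
```

Identical geometries score 1.0; the paper's point is that scores measured this way on ~1K samples can collapse once n grows to millions.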
- [2026-04-20] A multimodal and temporal foundation model for virtual patient representations at healthcare system scale 🆕NEW
- 赛道归属: 医疗多模态时序基础模型(患者表征/检索与预测)
- 核心创新点: 提出面向医疗系统规模的多模态时序基础模型Apollo:在跨三十年、数十亿级事件的纵向EHR上统一建模28种医疗模态(结构化事件、临床文本、医学图像等),学习覆盖10万+医疗事件的共享表示空间(“医疗概念图谱/atlas”),将整段就医旅程压缩为“虚拟患者表征”;通过大规模多任务(预后预测与相似检索)验证其通用迁移能力,并结合特征归因给出可临床解释的多模态生物标志物证据,支持文本/图像查询的医疗语义搜索。
- Track: Multimodal temporal foundation model (healthcare patient representation, retrieval & forecasting)
- Key innovation: Apollo is a healthcare-system-scale multimodal temporal foundation model trained on decades of longitudinal EHR spanning structured events, clinical text, and images across 28 modalities. It learns a unified “atlas” embedding space over 100k+ medical events and compresses full care trajectories into virtual patient representations. It is validated on hundreds of prognosis and retrieval tasks, with attribution providing clinically interpretable multimodal biomarkers and enabling text/image-driven medical semantic search.
- [2026-04-20] MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation 🆕NEW
- 赛道归属: 生成式AI安全(反深伪/反DreamBooth个体保护,对抗扰动鲁棒性)
- 核心创新点: 针对现有“人脸保护扰动”在社交媒体JPEG压缩下失效的问题,提出JPEG鲁棒的MetaCloak-JPEG:在优化中显式反传JPEG管线,引入基于STE的可微JPEG层(前向真实压缩、反向用恒等近似round梯度),避免扰动能量集中到会被量化丢弃的高频DCT;结合JPEG-aware的EOT增强分布与质量因子课程(QF 95→50)嵌入双层元学习,使扰动在多压缩强度下保持有效,从而更稳定地破坏DreamBooth式微调深伪。
- Track: GenAI security (anti-deepfake / adversarial protection against DreamBooth, robustness)
- Key innovation: MetaCloak-JPEG makes protective adversarial perturbations robust to ubiquitous JPEG compression by backpropagating through a differentiable JPEG layer using a straight-through estimator (real JPEG forward, identity gradient for round()). This prevents perturbation energy from being wasted in high-frequency DCT bands that JPEG discards. Combined with JPEG-aware EOT augmentations and a curriculum over JPEG quality factors inside a bilevel meta-learning loop, it yields perturbations that survive compression and better disrupt DreamBooth fine-tuning.
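The straight-through estimator at the heart of the differentiable JPEG layer is easy to state: quantize for real on the forward pass, pretend the rounding was the identity on the backward pass. A minimal scalar sketch (real JPEG quantizes DCT coefficients per band; a single quantization step stands in for that here):

```python
class STEQuantize:
    """Straight-through quantizer: forward applies true, non-differentiable
    rounding (as with a JPEG quantization-table entry q); backward passes
    gradients through unchanged, i.e. d round(x)/dx is taken to be 1."""
    def __init__(self, q):
        self.q = q

    def forward(self, x):
        return self.q * round(x / self.q)  # what compression actually does

    def backward(self, grad_out):
        return grad_out  # identity surrogate gradient
```

Because the forward pass is the true quantizer, the optimizer sees which bands survive compression, while the surrogate gradient keeps the loss differentiable instead of identically zero.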
- [2026-04-20] Document-as-Image Representations Fall Short for Scientific Retrieval 🆕NEW
- 赛道归属: 文档检索(科学文献多模态表示与基准)
- 核心创新点: 质疑“文档当图片(page rendering)”的科学检索范式并给出结构化替代:构建基于LaTeX源的ArXivDoc,使查询可被精确锚定到证据类型(段落/表格/图/公式等),从而公平评估不同表示;系统对比文本、文档图像、以及交错式图文多模态的单向量/多向量检索,发现文档图像表示随长度增长显著劣化,而文本表示即便在图相关查询上也可借助caption与上下文占优;提出“交错图文表示”无需专门训练即可超过纯文档图像路线。
- Track: Document retrieval (scientific multimodal representations & benchmarking)
- Key innovation: Introduces ArXivDoc, a benchmark built from LaTeX sources to expose structured evidence types (sections, tables, figures, equations) and avoid biases of page-rendered “document-as-image” benchmarks. Through controlled queries and systematic comparisons across single-/multi-vector retrievers, it shows document-as-image embeddings are consistently suboptimal (worse with longer docs), text embeddings often win even for figure queries via captions/context, and interleaved text+image representations outperform document-as-image without specialized training.
- [2026-04-20] Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation 🆕NEW
- 赛道归属: 跨模态检索(文本到行人图像检索 / 训练free鲁棒对齐)
- 核心创新点: 针对文本表述多样导致的Expression Drift,提出LLM驱动的多视角语义补偿MVR框架:用双分支提示生成“语义等价但分布多样”的改写文本——一支用特征相似度引导抽取视觉关键属性,另一支做多样性改写以覆盖表达空间;在不训练的情况下,通过多视角文本特征均值池化+残差的潜空间补偿抑制噪声并捕获“语义回声”;同时用VLM生成多视角图像描述并共享同一改写机制,补齐视觉语义缺口,从而在不改动主模型参数的前提下提升图文对齐鲁棒性与检索性能。
- Track: Cross-modal retrieval (text-to-person image retrieval / training-free robustness)
- Key innovation: Addresses expression drift via an LLM-driven Multi-View Reformulation (MVR) semantic compensation pipeline. A dual-branch prompting strategy generates semantically equivalent yet distributionally diverse text variants—one guided by visually critical attributes (via feature similarity), the other diversity-aware rewriting. A training-free latent compensation (multi-view mean pooling + residual) suppresses noise and captures “semantic echoes.” It also generates multi-perspective image captions with a VLM and applies shared reformulation to close visual semantic gaps, improving alignment and retrieval without updating the base model.
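The training-free latent compensation step — mean-pool the multi-view reformulation features, then blend back toward the original — can be sketched as follows; `alpha` and the plain linear blend are assumptions, and the paper's residual scheme may be more involved:

```python
def compensate(original, view_feats, alpha=0.5):
    """original: feature vector of the raw query text; view_feats: feature
    vectors of its LLM-generated reformulations. Returns the original
    feature nudged toward the multi-view mean by a residual step alpha."""
    dim = len(original)
    pooled = [sum(v[d] for v in view_feats) / len(view_feats) for d in range(dim)]
    return [o + alpha * (p - o) for o, p in zip(original, pooled)]
```

Averaging over semantically equivalent rewrites cancels phrasing noise, while the residual keeps the result anchored to the original query.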
GitHub
- [2026-04-22] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐11378
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-04-21] Light-Heart-Labs/DreamServer ⭐445
Local AI anywhere, for everyone — LLM inference, chat UI, voice, agents, workflows, RAG, and image generation. No cloud, no subscriptions.
- [2026-04-20] etkecc/baibot ⭐217
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
- [2026-04-21] PKU-YuanGroup/WISE ⭐193
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
- [2026-04-21] xdit-project/DistVAE ⭐90 🆕NEW
A parallelism VAE avoids OOM for high resolution image generation
HuggingFace Models
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-04-21] TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation 🆕NEW
- 赛道归属: 视频生成(多事件/长文本对齐,训练无关推理增强)
- 核心创新点: 提出训练无关的Temporal-wise Separable Attention(TS-Attn),通过在注意力层面对“时间维对齐”和“文本-运动对象注意力解耦”进行动态重分配,缓解多段动作描述下的时序错位与条件冲突;可即插即用集成到多种预训练T2V/I2V模型,在几乎不增加推理开销的前提下同时提升多事件动作遵循与全局时序一致性。
- Track: Video generation (multi-event / long prompt alignment, training-free inference enhancement)
- Key innovation: Proposes Temporal-wise Separable Attention (TS-Attn), a training-free attention mechanism that dynamically redistributes attention to (1) enforce temporal awareness/alignment and (2) resolve conflicting couplings between motion-related visual entities and their text conditions; plug-and-play for pretrained T2V/I2V models, improving multi-event prompt following while preserving global temporal coherence with minimal inference overhead.
- [2026-04-21] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation 🆕NEW
- 赛道归属: 生成模型后训练(扩散模型RL/偏好优化,视频/图像通用)
- 核心创新点: 提出OTCA(Objective-aware Trajectory Credit Assignment),针对GRPO在扩散轨迹上“统一标量奖励、全步同权回传”导致的粗粒度信用分配问题:一方面进行去噪步级别的轨迹信用分解,估计不同denoising阶段的重要性;另一方面对多目标奖励(画质/运动一致/文本对齐等)做随时间步自适应分配与加权融合,从而把静态奖励转为“时间步感知+目标感知”的结构化训练信号,更贴合扩散迭代过程并稳定提升图像与视频生成质量。
- Track: Post-training for generative models (diffusion RL / preference optimization for image & video)
- Key innovation: Introduces OTCA to fix coarse credit assignment in GRPO for diffusion: (1) trajectory-level credit decomposition to assign importance across denoising timesteps, and (2) multi-objective credit allocation to adaptively weight heterogeneous reward models over the trajectory. This converts a single static reward into a timestep- and objective-aware structured supervision signal aligned with diffusion’s iterative nature, improving both image and video generation.
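The core move — turning one static scalar per objective into a timestep-resolved signal — can be illustrated with a toy weighting table. The schedules below are invented for illustration; OTCA estimates its allocation from the trajectory:

```python
def stepwise_credit(rewards, schedules):
    """rewards: {objective: scalar reward for the final sample};
    schedules: {objective: [weight at each denoising step t]}.
    Returns r_t = sum over objectives of weight[t] * reward — a
    timestep- and objective-aware training signal instead of one
    uniform scalar broadcast to every step."""
    steps = len(next(iter(schedules.values())))
    return [
        sum(rewards[obj] * w[t] for obj, w in schedules.items())
        for t in range(steps)
    ]
```

Text alignment might be credited to early, structure-setting steps and image quality to late refinement steps — the kind of allocation a uniform GRPO reward cannot express.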
- [2026-04-21] How Far Are Video Models from True Multimodal Reasoning? 🆕NEW
- 赛道归属: 视频生成评测(多模态推理/上下文学习基准与自动评估)
- 核心创新点: 提出CLVG-Bench,用“视频生成中的上下文学习(Context Learning in Video Generation)”来系统探测视频模型零样本多模态推理能力,覆盖物理模拟、逻辑推理、交互式上下文等复杂类别,并提供细粒度元数据标注;同时提出AVE自适应评估器,在少量标注下对齐专家感知,输出可解释的文本反馈,实现可扩展、可诊断的推理能力评估,而非碎片化指标堆叠。
- Track: Evaluation for video generation (multimodal reasoning / context learning benchmarks)
- Key innovation: Presents CLVG-Bench to rigorously probe zero-shot multimodal reasoning via context learning in video generation, with rich manual metadata spanning physical simulation, logical reasoning, and interactive contexts. Proposes AVE, an adaptive evaluator that matches expert perception with minimal annotations and provides interpretable textual feedback, enabling scalable, diagnostic evaluation beyond fragmented metrics.
- [2026-04-21] RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation 🆕NEW
- 赛道归属: 视频世界模型评测(机器人操控/具身可执行性验证)
- 核心创新点: 提出RoboWM-Bench,将视频世界模型生成的操控行为从“视觉上合理”提升到“具身可执行”的评测:把人手/机器人操控视频中的预测行为转换为可执行动作序列,并通过真实/仿真机器人执行来验证任务完成度,形成统一可复现协议;从而系统暴露空间推理、接触稳定性、非物理形变等导致“看起来对但做不到”的关键失败模式,为面向机器人学习的物理一致视频生成提供明确诊断路径。
- Track: Video world model evaluation (robot manipulation / embodiment-grounded executability)
- Key innovation: Introduces RoboWM-Bench to evaluate world models by whether generated manipulation behaviors are physically executable: converts predicted behaviors from human-hand/robot videos into embodied action sequences and validates them via robotic execution under a unified, reproducible protocol. This shifts evaluation from visual plausibility to task-executable physical plausibility and surfaces concrete failure modes (spatial reasoning, contact instability, non-physical deformation).
- [2026-04-21] AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos 🆕NEW
- 赛道归属: 视频编辑/视频生成(自动驾驶场景可控恶劣天气生成)
- 核心创新点: 提出AutoAWG面向车载视频的多控制自适应融合框架:用语义引导的多控制融合在强天气风格化与安全关键目标保真之间做动态权衡;用“消失点锚定”的时序合成策略从静态图像构造训练序列,降低对合成视频数据依赖;并通过masked训练增强长时生成稳定性,从而在保持标注可复用(结构/语义不被破坏)的同时显著提升风格一致与时序一致。
- Track: Video editing / controllable video generation (adverse weather synthesis for autonomous driving)
- Key innovation: Proposes AutoAWG with adaptive multi-control fusion guided by semantics to balance strong weather stylization and faithful preservation of safety-critical targets; introduces vanishing-point-anchored temporal synthesis to build training sequences from still images (reducing reliance on synthetic videos); and uses masked training to improve long-horizon stability, enabling annotation-reusable, temporally consistent adverse-weather video generation.
- [2026-04-20] MultiWorld: Scalable Multi-Agent Multi-View Video World Models 🆕NEW
- 赛道归属: 视频世界模型(多智能体+多视角动作条件生成)
- 核心创新点: 提出MultiWorld统一框架,解决传统动作条件视频世界模型难以扩展到多智能体交互与多视角一致的问题:通过Multi-Agent Condition Module实现多智能体动作的可控注入与解耦控制;通过Global State Encoder在不同视角间共享全局状态以维持观测一致;并支持智能体数与视角数的可扩展配置与并行多视角合成,提高效率同时提升动作跟随与跨视角一致性。
- Track: Video world models (multi-agent + multi-view action-conditioned generation)
- Key innovation: Introduces MultiWorld for scalable multi-agent, multi-view world modeling: a Multi-Agent Condition Module enables precise, decoupled control of multiple agents, while a Global State Encoder enforces coherent observations across views. The framework scales agent/view counts and synthesizes views in parallel, improving fidelity, action following, and multi-view consistency.
- [2026-04-20] AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation 🆕NEW
- 赛道归属: 推理优化(视频扩散Transformer稀疏注意力加速,训练无关)
- 核心创新点: 提出AdaCluster训练无关的自适应Q/K聚类稀疏注意力:对Query采用角相似度保持的聚类以获得更高压缩率;对Key设计欧氏相似度保持的聚类,并包含层内自适应簇数分配、阈值驱动聚类与关键簇选择机制,以适配不同层/不同token分布的异质性;在不显著损伤生成质量的情况下,将DiT视频生成推理加速到多倍。
- Track: Inference optimization (training-free sparse attention for video Diffusion Transformers)
- Key innovation: Proposes AdaCluster, a training-free adaptive Q/K clustering scheme for sparse attention: angle-similarity-preserving clustering for queries to maximize compression, and Euclidean-similarity-preserving clustering for keys with adaptive cluster-number assignment, threshold-driven clustering, and efficient critical-cluster selection. It adapts to heterogeneous token distributions across layers, achieving multi-fold speedups with negligible quality loss.
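The clustering step can be approximated with a greedy threshold rule: a token joins the first cluster whose representative is similar enough, otherwise it founds a new one — so the cluster count adapts to the token distribution rather than being fixed. This greedy rule is a stand-in; the paper's similarity-preserving clustering and critical-cluster selection are more elaborate:

```python
import math

def threshold_cluster(vectors, tau):
    """Assign each vector to the first cluster representative with
    cosine similarity >= tau, else open a new cluster. Returns one
    cluster label per vector; attention can then be computed once
    per representative instead of once per token."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    reps, labels = [], []
    for v in vectors:
        for ci, r in enumerate(reps):
            if cosine(v, r) >= tau:
                labels.append(ci)
                break
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels
```

Lowering `tau` merges more tokens (more compression, less fidelity), which is the knob an adaptive per-layer scheme would tune.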
- [2026-04-20] OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation 🆕NEW
- 赛道归属: 数据集与基准(人类中心视频生成/音视频联合生成评测)
- 核心创新点: 提出OmniHuman大规模人类中心视频数据集与OHBench分层评测体系,针对现有数据在“场景与机位多样性、人物-人物/人物-物交互、个体属性对齐”三方面结构性缺陷:构建视频级场景、帧级交互、个体级属性的层次化标注,并提供自动化采集与多模态标注流水线;OHBench进一步以三层诊断+更贴近人类感知的指标,系统评估全局场景、关系交互与个体属性一致性,提升对人类视频生成瓶颈的可解释定位能力。
- Track: Dataset & benchmark (human-centric video generation / audio-video synthesis evaluation)
- Key innovation: Releases OmniHuman and OHBench to address dataset structural gaps in (1) scene/camera diversity, (2) interaction modeling (person-person & person-object), and (3) individual attribute alignment. Provides hierarchical annotations (video-level scenes, frame-level interactions, individual attributes) via an automated collection and multimodal annotation pipeline, and a three-level benchmark with perception-aligned metrics for diagnostic evaluation across global scenes, relational interactions, and attribute consistency.
- [2026-04-20] WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models 🆕NEW
- 赛道归属: 多模态评测(代码大模型的Web生成/编辑/修复,含视频输入)
- 核心创新点: 提出WebCompass覆盖真实Web工程生命周期的多模态基准:将任务建模为生成-编辑-修复的迭代闭环,支持文本/图像/视频三种输入与多类操作/缺陷类型;评测上除LLM-as-a-Judge清单式判分外,提出Agent-as-a-Judge:在真实浏览器中自动运行生成网站,通过MCP探索交互并迭代生成针对性测试用例,逼近人类验收测试,从而把“视觉保真+交互质量+代码库级推理”纳入统一可执行评估。
- Track: Multimodal evaluation (web coding for code LMs: generation/edit/repair with video input)
- Key innovation: Introduces WebCompass, a lifecycle benchmark modeling real web development as an iterative loop of generation, editing, and repair across text/image/video inputs. Beyond checklist-guided LLM-as-a-Judge, it proposes Agent-as-a-Judge: executes generated sites in a real browser, probes interactivity via MCP, and iteratively synthesizes targeted test cases—bringing visual fidelity, interaction quality, and codebase-level reasoning into a unified, executable evaluation.
- [2026-04-20] Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation 🆕NEW
- 赛道归属: 长视频生成(相机轨迹条件下的空间一致性/记忆机制)
- 核心创新点: 提出“记忆-生成解耦”的长时空一致视频生成框架:用轻量独立记忆分支学习历史观测中的空间一致线索,采用混合记忆表征融合时序与空间信息;通过逐帧cross-attention只检索与当前视角最相关的历史内容,避免全局记忆干扰;并引入相机感知门控,在缺乏有效历史参照时抑制记忆注入,从而在场景回访时提升一致性、在探索新区域时保持生成能力,同时显著降低训练成本与数据需求。
- Track: Long-horizon video generation (spatial consistency along camera trajectories / memory mechanisms)
- Key innovation: Proposes a decoupled memory-control framework separating memory conditioning from generation: a lightweight memory branch learns spatial-consistency cues with a hybrid spatiotemporal memory representation; per-frame cross-attention retrieves only the most spatially relevant history to avoid interference; and a camera-aware gating mechanism injects memory only when meaningful references exist, improving revisit consistency while preserving novelty generation with lower training/data cost.
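The retrieve-then-gate behavior can be sketched in a few lines: score stored observations against the current view, and inject nothing when no stored frame is relevant enough. Dot-product relevance and a fixed threshold are illustrative assumptions; the paper derives its gate from camera information:

```python
def gated_memory_lookup(view_query, memory_bank, gate_threshold):
    """Return the most relevant stored feature for the current view,
    or None when the gate decides there is no useful reference —
    the 'memorize when needed' behavior in miniature."""
    def relevance(u, v):
        return sum(a * b for a, b in zip(u, v))
    if not memory_bank:
        return None
    best = max(memory_bank, key=lambda m: relevance(view_query, m))
    return best if relevance(view_query, best) >= gate_threshold else None
```

Returning `None` when nothing passes the gate is what lets the generator explore new regions freely instead of being dragged toward irrelevant history.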
GitHub
- [2026-04-22] Anil-matcha/Open-Generative-AI ⭐5908
Uncensored, open-source alternative to Higgsfield AI, Freepik, Krea, Openart AI — Free, unrestricted AI image & video generation studio with 200+ mode...
- [2026-04-22] hao-ai-lab/FastVideo ⭐3408
A unified inference and post-training framework for accelerated video generation.
- [2026-04-21] ZeroLu/awesome-seedance ⭐1545
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover how to use Seedance 2.0 for cinematic film, anime, U...
- [2026-04-21] YouMind-OpenLab/awesome-seedance-2-prompts ⭐698
🎬 500+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-04-21] Winn1y/Awesome-Human-Motion-Video-Generation ⭐323
【Accepted by TPAMI】Human Motion Video Generation: A Survey (https://ieeexplore.ieee.org/document/11106267)
音频生成 / Audio Generation
GitHub
- [2026-04-22] huggingface/diffusers ⭐33406
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-04-21] apocas/restai ⭐485
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-04-18] Saganaki22/ComfyUI-Woosh ⭐74
Text-to-audio and video-to-audio using Sony AI's Woosh foundation model.
语言大模型 / Large Language Models
GitHub
- [2026-04-21] abhigyanpatwari/GitNexus ⭐28350
GitNexus: The Zero-Server Code Intelligence Engine - GitNexus is a client-side knowledge graph creator that runs entirely in your browser. Drop ...
- [2026-04-21] justrach/codedb ⭐743
Zig code intelligence server and MCP toolset for AI agents. Fast tree, outline, symbol, search, read, edit, deps, snapshot, and remote GitHub repo que...
- [2026-04-20] proxysoul/soulforge ⭐594
Graph-powered code intelligence, multi-agent coding with codebase-aware AI. No more grep & pray
- [2026-04-21] truecourse-ai/truecourse ⭐194
AI-powered architecture analysis and code intelligence. Detects circular deps, layer violations, dead modules, and more. Web UI + CLI.
- [2026-04-21] Anandb71/arbor ⭐108 🆕NEW
Graph-native code intelligence that replaces embedding-based RAG with deterministic program understanding.
多模态大模型 / Multimodal Models
arXiv
- [2026-04-16] Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models 📖1
- 赛道归属: 语音对话多模态交互(Spoken Dialogue)与RL奖励建模
- 核心创新点: 提出“双轴生成式奖励模型”,用细粒度交互质量分类体系与标注数据学习复杂对话动态;奖励输出不仅给出单一总分,还显式分解为“语义质量”和“轮次/时序(turn-taking)”两条轴的评分,从而为全双工语音对话模型提供可诊断、可用于在线RL的稳定奖励信号;以生成式建模替代依赖浅层统计/时序代理指标的传统自动评估,提升跨数据集的交互质量评估一致性与泛化。
- Track: Spoken multimodal dialogue interaction & RL reward modeling
- Core innovations: Introduces a dual-axis generative reward model trained with a detailed interaction taxonomy and annotations to capture complex dialogue dynamics; outputs both an overall score and disentangled scores for semantic quality and turn-taking/timing robustness, providing diagnostic feedback and a reliable reward for online RL; replaces brittle proxy metrics (behavioral stats/timing accuracy) with a learned generative assessor that generalizes across synthetic and real-world interaction datasets.
- [2026-04-16] ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints 📖1 🆕NEW
- 赛道归属: 具身智能规划与多模态推理(affordance/可供性推理、动态环境鲁棒规划)
- 核心创新点: 提出DynAfford基准,专门评测“指令未显式给出、且随时间变化”的对象可供性约束下的常识规划能力;提出ADAPT即插即用模块,将“显式可供性推理”注入现有规划器:通过感知对象状态、推断隐式前置条件并据此调整动作序列,提升在已见/未见环境的鲁棒性与成功率;实证表明用任务对齐的VLM(LoRA领域自适应)作为可供性推断后端优于通用商用LLM,强调可供性需要视觉落地与领域对齐。
- Track: Embodied planning & multimodal reasoning (affordance reasoning, robustness in dynamic environments)
- Core innovations: Introduces DynAfford, a benchmark targeting commonsense planning under unspecified and time-varying affordance constraints; proposes ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning by perceiving object states, inferring implicit preconditions, and adapting action plans accordingly, improving robustness in seen/unseen environments; shows a domain-adapted LoRA-tuned VLM backend for affordance inference can outperform a general commercial LLM, highlighting the need for grounded, task-aligned affordance modeling.
- [2026-04-21] VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing 🆕NEW
- 赛道归属: 多模态理解与对齐(LVLM幻觉抑制/模型编辑/推理可靠性)
- 核心创新点: 提出VCE零成本(无需微调、无需标注)的事后干预方法,通过对比视觉扰动分析模型响应,定位由语言先验导致的“物体幻觉”倾向;利用SVD分解激活模式以隔离“幻觉子空间”,并进行定向参数编辑以抑制该子空间影响,在保持原有计算效率的同时显著降低多基准OH。
- Track: Multimodal understanding & alignment (LVLM hallucination mitigation, model editing, reliability)
- Core innovations: Proposes VCE, a zero-cost post-hoc intervention (no fine-tuning, no labels) that probes contrastive visual perturbations to identify object-hallucination tendencies driven by language priors; applies SVD on activation patterns to isolate a “hallucination subspace” and performs targeted parameter edits to attenuate it, reducing hallucinations across benchmarks while preserving the model’s original compute profile.
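The subspace-ablation idea — find the dominant direction in hallucination-correlated activations, then project it out — can be sketched with pure-Python power iteration. This is a stand-in for the paper's SVD; deciding which activations to collect, and editing parameters rather than activations, are the paper's contribution:

```python
def dominant_direction(rows, iters=200):
    """Leading right-singular vector of a data matrix (list of rows),
    via power iteration on X^T X."""
    dim = len(rows[0])
    v = [1.0] * dim
    for _ in range(iters):
        xv = [sum(r[d] * v[d] for d in range(dim)) for r in rows]
        w = [sum(rows[i][d] * xv[i] for i in range(len(rows))) for d in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def ablate(x, v):
    """Remove the component of activation x along direction v."""
    c = sum(a * b for a, b in zip(x, v))
    return [a - c * b for a, b in zip(x, v)]
```

Projecting activations off the dominant "hallucination" direction suppresses that behavior while leaving orthogonal components — and hence most other capabilities — untouched.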
- [2026-04-21] Lost in Translation: Do LVLM Judges Generalize Across Languages? 🆕NEW
- 赛道归属: 多模态评测与对齐(Reward Model/LVLM Judge,多语言鲁棒性评测)
- 核心创新点: 构建MM-JudgeBench首个大规模多语言多模态裁判模型评测基准(60K+成对偏好、25种语言),同时覆盖通用VL偏好与图表推理两类场景,实现对“跨语言泛化”系统性诊断;提供与评测集解耦的多语言训练集支持领域自适应;大规模评测22个开源/闭源LVLM judge,揭示跨语言性能方差显著且模型规模/架构并不能预测多语言鲁棒性,指出当前reward modeling在多语言场景的根本局限。
- Track: Multimodal evaluation & alignment (reward models / LVLM judges, multilingual robustness)
- Core innovations: Releases MM-JudgeBench, the first large-scale multilingual multimodal benchmark for judge/reward model evaluation (60K+ pairwise preferences across 25 languages), spanning both general VL preference and chart-centric visual-text reasoning for systematic cross-lingual analysis; provides a disjoint multilingual training set to enable adaptation; evaluates 22 open/proprietary LVLM judges and finds large cross-lingual variance where size/architecture poorly predict robustness, exposing fundamental limitations of current reward modeling beyond English.
- [2026-04-21] PLaMo 2.1-VL Technical Report 🆕NEW
- 赛道归属: 多模态理解(轻量化VLM/边缘端部署/日语视觉问答与指代定位)
- 核心创新点: 提出面向本地/边缘设备的轻量VLM(2B/8B),以日语为核心工作语言并强化VQA与视觉指代定位;构建大规模合成数据生成流水线与系统化日语训练/评测资源,提升低资源语言与落地场景的可用性;在工厂工具识别任务分析与基础设施异常检测两类真实应用中验证零样本与微调增益,体现“轻量模型+数据管线+场景评测”一体化工程路线。
- Track: Multimodal understanding (lightweight VLMs, edge deployment, Japanese VQA & grounding)
- Core innovations: Introduces a lightweight VLM family (2B/8B) designed for on-device/edge use with Japanese-first operation, emphasizing VQA and visual grounding; builds a large-scale synthetic data generation pipeline plus comprehensive Japanese training/evaluation resources to improve low-resource language readiness; validates in real applications (factory task analysis via tool recognition, infrastructure anomaly detection) demonstrating practical zero-shot and fine-tuning gains via an integrated “small model + data pipeline + scenario evaluation” approach.
- [2026-04-21] Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing 🆕NEW
- 赛道归属: 多模态/视觉基础模型在安全生物识别中的应用评测(人脸活体检测FAS、跨域泛化、效率优化)
- 核心创新点: 系统基准测试15种视觉基础模型(监督CNN/ViT、自监督ViT)在严苛跨域协议(MICO、LSD)下的FAS泛化能力,重新论证“视觉-only”在效率与鲁棒性上的上限;发现自监督ViT(DINOv2+Registers)能抑制注意力伪影并捕获细粒度欺骗线索;结合FAS-Aug、PDA与APL形成高效训练配方,建立可复现的SOTA视觉-only强基线,为后续多模态FAS提供更强视觉骨干参照。
- Track: Security biometrics with foundation models (face anti-spoofing, domain generalization, efficiency)
- Core innovations: Benchmarks 15 vision foundation models under harsh cross-domain FAS protocols (MICO, LSD), re-establishing a strong and efficient vision-only baseline versus heavier multimodal supervision; identifies self-supervised ViTs—especially DINOv2 with Registers—as better at suppressing attention artifacts and capturing fine-grained spoof cues; combines FAS-Aug, patch-wise augmentation (PDA), and attention-weighted patch loss (APL) into an effective recipe achieving SOTA while remaining compute-efficient, providing a definitive vision-only reference backbone for future multimodal FAS.
- [2026-04-21] ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving 🆕NEW
- 赛道归属: 推理优化(VLM视频/多视角输入加速、训练无关token剪枝、自动驾驶多模态)
- 核心创新点: 提出ST-Prune训练无关、即插即用的时空token剪枝框架,针对自动驾驶多帧多视角输入的冗余而非逐帧独立处理;MTP将运动波动与时间新近性作为软约束融入多样性选择目标,优先保留动态轨迹与当前帧关键信息;RSP利用环视相机几何对跨视角相似性施加惩罚,去除重复投影与背景残留;在90% token削减下实现近无损甚至部分指标超越全量输入的效果,刷新训练无关剪枝SOTA。
- Track: Inference optimization (training-free token pruning for multi-frame/multi-view VLMs in autonomous driving)
- Core innovations: Proposes ST-Prune, a training-free plug-and-play spatio-temporal token pruning framework tailored to multi-view, multi-frame driving inputs (beyond per-frame pruning); Motion-aware Temporal Pruning (MTP) injects motion volatility and temporal recency as soft constraints into a diversity selection objective to prioritize dynamic trajectories and current content; Ring-view Spatial Pruning (RSP) exploits surround-view geometry to penalize cross-view similarity, removing duplicate projections and residual background; achieves near-lossless performance even at 90% token reduction, setting new SOTA for training-free pruning.
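The selection objective can be mimicked with a greedy loop: each kept token maximizes its motion-plus-recency utility minus a penalty for similarity to tokens already kept, which is how duplicate cross-view projections lose out. The scores and penalty weight are invented for illustration; MTP/RSP derive these terms from motion statistics and surround-view camera geometry:

```python
def select_tokens(tokens, keep, lam=2.0):
    """tokens: list of {'feat': vector, 'motion': float, 'recency': float}.
    Greedily keep `keep` indices maximizing utility minus lam times the
    maximum similarity to any already-selected token."""
    def sim(u, v):
        return sum(a * b for a, b in zip(u, v))
    chosen = []
    for _ in range(keep):
        best_i, best_score = None, float("-inf")
        for i, t in enumerate(tokens):
            if i in chosen:
                continue
            redundancy = max(
                (sim(t["feat"], tokens[j]["feat"]) for j in chosen),
                default=0.0,
            )
            score = t["motion"] + t["recency"] - lam * redundancy
            if score > best_score:
                best_i, best_score = i, score
        chosen.append(best_i)
    return sorted(chosen)
```

With a near-duplicate of an already-kept token in the pool, the redundancy penalty makes the selector skip it in favor of a less similar, lower-utility token.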
- [2026-04-21] EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation 🆕NEW
- 赛道归属: 具身智能生成(3D人体动作生成、第一视角多模态条件、扩散模型)
- 核心创新点: 针对第一视角视觉+语言条件的3D动作生成,提出“推理-生成纠缠”导致梯度冲突的关键问题;提出EgoMotion分层两阶段框架,仿生地解耦认知推理与运动控制:先由VLM将多模态输入映射到离散动作原语的结构化空间以强化目标一致的语义表征,再以该表征作为条件驱动扩散式动作生成器在连续潜空间迭代去噪,提升物理可行性与时序一致性,并实现SOTA语义落地与运动质量。
- Track: Embodied generation (3D human motion generation, egocentric VL conditioning, diffusion)
- Core innovations: Identifies “reasoning–generation entanglement” in egocentric vision-language conditioned 3D motion synthesis, where jointly optimizing semantic reasoning and kinematics causes gradient conflicts; proposes EgoMotion, a hierarchical two-stage framework that decouples cognition and motor control: a VLM first maps multimodal inputs into a structured discrete motion-primitive space for goal-consistent representations, then a diffusion-based generator uses these representations as conditioning to iteratively denoise in a continuous latent space, improving physical plausibility, temporal coherence, and grounded semantics to reach SOTA.
- [2026-04-21] Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents 🆕NEW
- 赛道归属: 具身智能导航与探索(在线语义地图/空间记忆构建、VLM驱动主动探索、RGB-only)
- 核心创新点: 提出ABot-Explorer,将探索与结构化空间记忆构建从“先探索后离线重建”的两阶段范式,推进为在线、端到端的RGB-only过程;利用大VLM提炼语义导航可供性SNA作为人类认知地图式锚点(如门洞、楼梯等关键通行节点),并动态写入分层SG-Memo以引导下一步探索策略,避免几何中心化方法忽略语义关键地标;同时扩展InteriorGS并提供SNA与SG-Memo标注数据集,验证在探索效率、覆盖率及下游任务可迁移性上的显著提升。
- Track: Embodied navigation & exploration (online semantic spatial memory, VLM-guided active exploration, RGB-only)
- Core innovations: Proposes ABot-Explorer, shifting from the common two-stage “explore then offline reconstruct memory” pipeline to an online RGB-only unified process; uses large VLMs to distill Semantic Navigational Affordances (SNAs) as cognitively aligned anchors (e.g., doorways, staircases) and incrementally integrates them into a hierarchical SG-Memo to guide exploration, preventing geometry-centric mapping from missing semantically critical landmarks; releases an InteriorGS extension with SNA/SG-Memo annotations and demonstrates improved exploration efficiency, coverage, and downstream utility of the constructed memory.
- [2026-04-21] Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images 🆕NEW
- 赛道归属: 行业多模态/视觉基础模型落地(显微图像实例分割、材料表征、标准化计量)
- 核心创新点: 提出面向金相显微图的自动化晶粒度估计流水线,将Cellpose-SAM适配到微观组织的密集实例分割,并结合拓扑感知的梯度追踪保持晶粒分离;进一步把分割结果与ASTM E112 Jeffries平面计数法模块化对接,实现从图像到标准化晶粒度G值的端到端计算;通过与U-Net、MatSAM、Qwen2.5-VL对比,揭示VLM在局部密集计数/空间推理上的不足与自适应提示模型的过分割问题;在极少样本(2张)下仍达低MAPE,体现“基础模型+领域算法/标准”融合的少样本可扩展性与工程可用性。
- Track: Industrial vision foundation model integration (microscopy instance segmentation, materials characterization, standards-based measurement)
- Core innovations: Presents an automated grain-size estimation pipeline for metallurgical microscopy by adapting Cellpose-SAM for dense instance segmentation and adding topology-aware gradient tracking to preserve grain separation; bridges segmentation outputs with an ASTM E112 Jeffries planimetric module to compute standardized grain size number (G) end-to-end; benchmarks against U-Net, MatSAM, and Qwen2.5-VL, showing VLMs struggle with localized dense counting and MatSAM tends to over-segment, while the proposed integration maintains topological correctness; demonstrates strong few-shot scalability (as low as 2 training samples with very low MAPE), highlighting practical value from combining foundation models with domain standards and algorithms.
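The ASTM E112 bridge is a closed-form computation once segmentation yields grain counts. As commonly stated (verify against the standard itself before relying on it), the Jeffries planimetric procedure counts grains fully inside the test area plus half of those cut by its boundary, converts to grains per mm² at 1×, and maps that density to the grain size number G:

```python
import math

def jeffries_grain_size(n_inside, n_intercepted, magnification, area_mm2):
    """ASTM E112 Jeffries planimetric method (as commonly stated):
    N_A = f * (n_inside + n_intercepted / 2), with Jeffries multiplier
    f = M^2 / A for a test area A (mm^2) observed at magnification M;
    grain size number G = 3.321928 * log10(N_A) - 2.954."""
    f = magnification ** 2 / area_mm2
    n_a = f * (n_inside + n_intercepted / 2)  # grains per mm^2 at 1x
    return 3.321928 * math.log10(n_a) - 2.954
```

The constant 3.321928 is 1/log10(2), so G grows by one each time the grain density doubles — e.g. N_A = 32/mm² gives G ≈ 2.05.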
GitHub
- [2026-04-20] m87-labs/moondream ⭐9599 🆕NEW
tiny vision language model
- [2026-04-21] Blaizzy/mlx-vlm ⭐4453
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-04-21] waybarrios/vllm-mlx ⭐917
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-04-20] hustvl/InfiniteVL ⭐104 🆕NEW
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
- [2026-04-21] FeiElysia/Tempo ⭐59
Tempo: Small Vision-Language Models are Smart Compressors for Long Video Understanding
强化学习 / Reinforcement Learning
arXiv
- [2026-04-16] WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training 📖1
- 赛道归属: 语音对话大模型后训练(End-to-end spoken dialogue)/ 偏好优化与RL对齐
- 核心创新点: 针对端到端语音对话中“稀疏偏好监督 vs 稠密语音生成”导致的奖励建模与rollout采样难题,提出模态感知的自适应混合后训练方案:将偏好/RL更新约束在语义通道以提升智能与语义质量,同时通过显式anchoring稳定并改进声学表现;再依据rollout统计动态调节两者混合比例,规避不可靠偏好梯度对共享参数的破坏,使RL在语音对话场景可落地并提升表达性。
- Track: Post-training for end-to-end spoken dialogue models / preference optimization & RL alignment
- Key innovation: Proposes a modality-aware adaptive hybrid post-training recipe to make RL practical for end-to-end spoken dialogue, addressing the mismatch between sparse preference signals and dense speech generation under shared-parameter updates. It constrains preference/RL updates to the semantic channel while improving acoustics via explicit anchoring, and dynamically mixes them based on rollout statistics to avoid unreliable preference gradients, improving both semantic intelligence and speech expressiveness.
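The "mix ratio driven by rollout statistics" idea can be sketched with a toy heuristic; the clamp range and the spread-as-reliability proxy below are illustrative assumptions, not the paper's schedule:

```python
from statistics import pstdev

def adaptive_mix_weight(rollout_rewards, base=0.5, floor=0.1):
    """Toy heuristic for an adaptive RL/anchoring mix: when rollout rewards
    barely spread, the preference gradient carries little reliable signal,
    so weight shifts toward the (acoustic) anchoring loss instead."""
    spread = pstdev(rollout_rewards)
    w_rl = max(floor, min(base, spread))  # clamp the RL weight to [floor, base]
    return w_rl, 1.0 - w_rl               # (RL weight, anchoring weight)
```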
- [2026-04-16] Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models 📖1
- 赛道归属: 语音对话多模态交互(Spoken Dialogue)与RL奖励建模
- 核心创新点: 提出“双轴生成式奖励模型”,用细粒度交互质量分类体系与标注数据学习复杂对话动态;奖励输出不仅给出单一总分,还显式分解为“语义质量”和“轮次/时序(turn-taking)”两条轴的评分,从而为全双工语音对话模型提供可诊断、可用于在线RL的稳定奖励信号;以生成式建模替代依赖浅层统计/时序代理指标的传统自动评估,提升跨数据集的交互质量评估一致性与泛化。
- Track: Spoken multimodal dialogue interaction & RL reward modeling
- Core innovations: Introduces a dual-axis generative reward model trained with a detailed interaction taxonomy and annotations to capture complex dialogue dynamics; outputs both an overall score and disentangled scores for semantic quality and turn-taking/timing robustness, providing diagnostic feedback and a reliable reward for online RL; replaces brittle proxy metrics (behavioral stats/timing accuracy) with a learned generative assessor that generalizes across synthetic and real-world interaction datasets.
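The disentangled-output idea can be sketched in a few lines; the linear fusion and the 0.5 weight are illustrative stand-ins, since the paper's scorer is a generative model, not a weighted sum:

```python
def dual_axis_reward(semantic_score: float, turn_taking_score: float,
                     w_semantic: float = 0.5) -> dict:
    """Fuse the two diagnostic axes into one scalar usable as an online RL
    reward, while keeping both components exposed for diagnosis."""
    overall = w_semantic * semantic_score + (1 - w_semantic) * turn_taking_score
    return {"overall": overall,
            "semantic": semantic_score,
            "turn_taking": turn_taking_score}
```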
- [2026-04-21] EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training 🆕NEW
- 赛道归属: LLM后训练强化学习 / 策略优化与方差控制(PPO/GRPO改进)
- 核心创新点: 将“是否使用critic作为baseline”的选择形式化为卡尔曼滤波中的增益选择问题,统一解释PPO(强依赖critic)与GRPO(无critic)为同一框架下的两个极端;提出用单batch可计算的Explained Variance(EV)作为判别阈值:EV>0时critic确实降方差,EV≤0时critic反而注入噪声并抬升优势方差;据此提出EVPO,在训练过程中按步监控batch级EV并在critic-baseline与batch-mean优势估计之间自适应切换,理论上保证每一步的方差不劣于两者中更优者,从而在稀疏奖励等critic不成熟阶段避免“负增益”critic带来的训练不稳定。
- Track: RL for LLM post-training / policy optimization & variance control (PPO/GRPO improvement)
- Core innovation: Formulates the baseline choice (critic vs. no critic) as a Kalman-filter gain selection problem, unifying PPO (critic-heavy) and GRPO (critic-free) as two extremes of one framework; introduces single-batch computable Explained Variance (EV) as the exact decision boundary—EV>0 means the critic reduces variance, EV≤0 means it injects noise and increases advantage variance; proposes EVPO, which monitors batch-level EV each step and adaptively switches between critic-based baselines and batch-mean advantage estimation, with a per-step guarantee of variance no worse than the better of the two, stabilizing training especially in sparse-reward regimes where critics are initially unreliable.
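The EV gate itself is simple enough to sketch. Assuming per-sample scalar rewards and critic values for one batch (names are illustrative, and the real method operates inside a PPO-style advantage computation):

```python
from statistics import pvariance, fmean

def explained_variance(rewards, values):
    """EV = 1 - Var(r - V) / Var(r); positive iff the critic baseline
    actually reduces variance relative to the raw rewards."""
    var_r = pvariance(rewards)
    if var_r == 0.0:
        return 0.0
    residuals = [r - v for r, v in zip(rewards, values)]
    return 1.0 - pvariance(residuals) / var_r

def evpo_advantages(rewards, values):
    """Per-batch switch between a critic baseline (PPO-style) and a
    batch-mean baseline (GRPO-style), keyed on the sign of EV."""
    if explained_variance(rewards, values) > 0.0:
        return [r - v for r, v in zip(rewards, values)]  # critic helps
    mu = fmean(rewards)
    return [r - mu for r in rewards]  # critic hurts: fall back to batch mean
```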
- [2026-04-21] HP-Edit: A Human-Preference Post-Training Framework for Image Editing 🆕NEW
- 赛道归属: 图像编辑(扩散模型后训练 / 人类偏好对齐)
- 核心创新点: 提出面向扩散式图像编辑的RLHF式后训练框架HP-Edit:用少量人工偏好打分数据+预训练VLM训练出自动偏好评估器HP-Scorer,并将其同时用于(1)低成本扩展构建大规模偏好数据集(RealPref-50K,覆盖多任务且做对象编辑均衡),(2)作为奖励函数对编辑模型进行后训练;同时给出RealPref-Bench用于真实编辑评测,实现“以评促训”的闭环,把人类偏好显式注入编辑行为。
- Track: Image editing (diffusion post-training / human preference alignment)
- Key innovation: HP-Edit brings RLHF-style post-training to diffusion-based editing via an automatic preference evaluator (HP-Scorer) trained from a small set of human ratings plus a pretrained VLM. HP-Scorer is used both to scale preference data collection (RealPref-50K) and as the reward for post-training editing models, with RealPref-Bench enabling realistic evaluation—forming a scalable “evaluator-as-reward” loop for preference-aligned editing.

- [2026-04-21] LASER: Learning Active Sensing for Continuum Field Reconstruction 🆕NEW
- 赛道归属: 主动感知强化学习 / 科学计算中的连续场重建(POMDP + 世界模型)
- 核心创新点: 将连续物理场的自适应采样/传感器移动建模为POMDP闭环主动感知问题,引入“连续场潜变量世界模型”来刻画物理动力学并在潜空间提供内在奖励信号;通过在latent imagination中进行“what-if”传感模拟,RL策略可基于对未来潜状态的预测来规划传感器动作,主动驶向高信息增益区域而非依赖固定/离线优化布局,从而在稀疏测量约束下提升重建保真度与泛化。
- Track: Active sensing RL / continuum field reconstruction in scientific computing (POMDP + world model)
- Core innovation: Casts adaptive sampling and sensor motion for continuum fields as a closed-loop POMDP; introduces a latent continuum-field world model capturing physical dynamics and providing intrinsic reward in latent space; enables RL to run “what-if” sensing rollouts via latent imagination and condition actions on predicted latent states, steering sensors toward high-information regions beyond current observations, outperforming static and offline-optimized layouts under sparse sensing.
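The "what-if" planning loop can be sketched with a toy stand-in for the latent world model; `imagine` and the toy dynamics below are illustrative assumptions, not the paper's interface:

```python
import random

def plan_sensing_action(uncertainty, candidate_actions, imagine, n_rollouts=8):
    """Choose the sensor action whose imagined rollouts promise the largest
    expected information gain, measured here as the predicted drop in latent
    uncertainty. `imagine(uncertainty, action)` stands in for one latent
    world-model rollout returning the predicted next uncertainty."""
    def expected_gain(action):
        preds = [imagine(uncertainty, action) for _ in range(n_rollouts)]
        return uncertainty - sum(preds) / len(preds)
    return max(candidate_actions, key=expected_gain)

# Toy world model: moving "north" toward the unobserved region halves
# uncertainty; "stay" leaves it unchanged (plus a little model noise).
def toy_imagine(uncertainty, action):
    factor = 0.5 if action == "north" else 1.0
    return uncertainty * factor + random.uniform(0.0, 0.01)

best = plan_sensing_action(1.0, ["stay", "north"], toy_imagine)
```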
- [2026-04-21] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation 🆕NEW
- 赛道归属: 生成模型后训练(扩散模型RL/偏好优化,视频/图像通用)
- 核心创新点: 提出OTCA(Objective-aware Trajectory Credit Assignment),针对GRPO在扩散轨迹上“统一标量奖励、全步同权回传”导致的粗粒度信用分配问题:一方面进行去噪步级别的轨迹信用分解,估计不同denoising阶段的重要性;另一方面对多目标奖励(画质/运动一致/文本对齐等)做随时间步自适应分配与加权融合,从而把静态奖励转为“时间步感知+目标感知”的结构化训练信号,更贴合扩散迭代过程并稳定提升图像与视频生成质量。
- Track: Post-training for generative models (diffusion RL / preference optimization for image & video)
- Key innovation: Introduces OTCA to fix coarse credit assignment in GRPO for diffusion: (1) trajectory-level credit decomposition to assign importance across denoising timesteps, and (2) multi-objective credit allocation to adaptively weight heterogeneous reward models over the trajectory. This converts a single static reward into a timestep- and objective-aware structured supervision signal aligned with diffusion’s iterative nature, improving both image and video generation.
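The "timestep-aware + objective-aware" fusion can be sketched as a weighted redistribution of static rewards; all names and shapes below are illustrative, and the real method learns/estimates the weights rather than taking them as inputs:

```python
def otca_step_rewards(objective_scores, step_weights, objective_weights):
    """Turn per-objective trajectory-level rewards into a per-denoising-step
    credit signal instead of one flat scalar applied to every step.

    objective_scores:  {objective: scalar reward for the whole trajectory}
    step_weights:      {objective: [w_t per step, summing to 1]}
    objective_weights: {objective: relative weight in the fused signal}
    """
    n_steps = len(next(iter(step_weights.values())))
    per_step = [0.0] * n_steps
    for obj, score in objective_scores.items():
        for t in range(n_steps):
            per_step[t] += objective_weights[obj] * step_weights[obj][t] * score
    return per_step
```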
- [2026-04-21] Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification 🆕NEW
- 赛道归属: 行人重识别(ReID)+ 强化学习 / 推理驱动表征学习(CoT引入视觉识别)
- 核心创新点: 提出从“感知拟合”转向“身份因果线索推理”的ReID-R范式,将Chain-of-Thought显式引入ReID流程以获得可解释的身份理解;两阶段训练:先进行无CoT标注的“判别推理warm-up”以学习身份相关的特征理解,再通过强化学习结合非平凡采样构造更具跨场景泛化的数据,并用高质量奖励信号驱动模型聚焦ID相关线索,从而在显著减少训练数据规模的情况下提升跨场景鲁棒性并输出可解释推理。
- Track: Person Re-Identification + RL / reasoning-driven representation learning (CoT for vision recognition)
- Core innovation: Proposes ReID-R, shifting ReID from perception-driven fitting to reasoning over identity-causal cues by injecting Chain-of-Thought into the pipeline for explicit, interpretable identity understanding; uses a two-stage recipe: (i) label-free CoT-style discriminative reasoning warm-up to build identity-aware feature understanding, and (ii) efficient RL with non-trivial sampling to construct scene-generalizable training data, leveraging high-quality rewards to focus learning on ID-relevant cues—improving robustness with substantially less data while providing interpretable rationales.
- [2026-04-21] Reasoning-Aware AIGC Detection via Alignment and Reinforcement 🆕NEW
- 赛道归属: AIGC文本检测 / 可解释推理链检测 + 强化学习对齐
- 核心创新点: 构建覆盖多领域、多LLM来源与多作者场景的AIGC-text-bank数据集以提升检测评测的真实性与覆盖度;提出REVEAL框架,在分类前先生成可解释的推理链(reasoning chain)以增强透明性与可审计性;采用“两阶段训练”:先监督微调对齐推理能力,再用强化学习进一步优化检测准确率与推理逻辑一致性并抑制幻觉,使检测器在模型迭代背景下更稳健、同时输出可解释依据。
- Track: AIGC text detection / interpretable reasoning-chain detection + RL alignment
- Core innovation: Releases AIGC-text-bank, a multi-domain dataset spanning diverse LLM sources and authorship scenarios to better stress-test detectors; proposes REVEAL, which generates an interpretable reasoning chain before classification to improve transparency; trains in two stages—supervised fine-tuning to establish reasoning behavior, then reinforcement learning to boost accuracy, improve logical consistency, and reduce hallucinations—yielding a more robust, explainable detector as generators evolve.
- [2026-04-21] RL-ABC: Reinforcement Learning for Accelerator Beamline Control 🆕NEW
- 赛道归属: 科学与工程控制强化学习 / 粒子加速器束线优化(仿真到RL环境工程化)
- 核心创新点: 提供开源框架将Elegant束线配置自动转换为RL环境,核心在于把束线调参系统化地形式化为MDP并显著降低RL接入成本;自动在可调元件前插入诊断watch points,构建包含束流统计量、协方差与孔径约束的57维状态表示,并提供可配置奖励以优化传输等目标;通过Stable-Baselines3兼容与分阶段学习(stage learning)把高维、长链路调参拆解为子问题提升训练效率,实现RL在真实工程仿真中的可复用落地。
- Track: RL for scientific/engineering control / particle accelerator beamline optimization (simulation-to-RL environment tooling)
- Core innovation: Delivers an open-source framework that automatically turns standard Elegant beamline configurations into RL environments, operationalizing beamline tuning as an MDP while minimizing RL-specific engineering; auto-inserts diagnostic watch points, builds a 57-D state from beam statistics/covariances/aperture constraints, and offers configurable rewards for objectives like transmission; supports modern RL via Stable-Baselines3 and introduces stage learning to decompose complex high-dimensional tuning into manageable subproblems, improving training efficiency and reproducibility in accelerator control.
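The env-wrapping idea can be sketched as a minimal MDP class; the names, the 4-knob action space, and the toy transmission dynamics are illustrative (the real framework generates such an environment from an Elegant configuration and fills the 57-D state from watch-point diagnostics):

```python
class BeamlineEnv:
    """Minimal sketch of a beamline simulation wrapped as an MDP.
    State: fixed-length vector of beam statistics read at watch points.
    Action: deltas applied to tunable magnet settings.
    Reward: fraction of the beam transmitted to the end of the line."""

    STATE_DIM = 57  # beam moments, covariances, aperture margins, ...

    def __init__(self, n_knobs=4):
        self.n_knobs = n_knobs
        self.knobs = [0.0] * n_knobs

    def reset(self):
        self.knobs = [0.0] * self.n_knobs
        return self._observe()

    def step(self, action):
        self.knobs = [k + a for k, a in zip(self.knobs, action)]
        # Toy stand-in for the simulator: transmission peaks when every
        # knob sits at its (hidden) optimum of 1.0.
        miss = sum((k - 1.0) ** 2 for k in self.knobs)
        reward = 1.0 / (1.0 + miss)
        return self._observe(), reward, False, {}

    def _observe(self):
        # Pad the knob readings out to the fixed diagnostic state size.
        return self.knobs + [0.0] * (self.STATE_DIM - self.n_knobs)
```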
- [2026-04-21] ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation 🆕NEW
- 赛道归属: 机器翻译(MT)+ 强化学习 / 推理内化与推理成本压缩(test-time compute优化)
- 核心创新点: 提出“先翻译后反思”的ReflectMT以替代主流“先思考再翻译”的显式推理轨迹,目标是在保持/提升质量的同时显著降低推理token与延迟;两阶段反思内化:第一阶段用强化学习训练“翻译-反思-改写”能力,强化高质量反思与修订以注入语义理解与任务知识;第二阶段再训练模型将反思中获得的知识内化到一次性直译策略中,使推理时无需显式CoT即可输出高质量首译,实现质量提升与推理成本大幅下降。
- Track: Machine translation + RL / reflection internalization & test-time compute reduction
- Core innovation: Introduces ReflectMT’s “translate-first, think-later” paradigm to replace explicit “think-then-translate” reasoning trajectories, targeting high quality with much lower inference cost; uses a two-stage reflection-internalization scheme: stage 1 applies RL to learn a “translate–reflect–refine” behavior, reinforcing high-quality reflection and edits to inject semantic/task knowledge; stage 2 trains the model to internalize what reflection teaches into a direct one-pass translation policy, eliminating explicit CoT at inference while improving first-pass quality and drastically reducing token usage/latency.
GitHub
- [2026-04-22] huggingface/trl ⭐18132
Train transformer language models with reinforcement learning.
- [2026-04-22] OpenPipe/ART ⭐9214 🆕NEW
Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for...
- [2026-04-21] pytorch/rl ⭐3404
A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
- [2026-04-22] radixark/miles ⭐1104 🆕NEW
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-04-21] Farama-Foundation/stable-retro ⭐367 🆕NEW
Retro games for Reinforcement Learning Research
HuggingFace Models
HuggingFace Datasets
- [2026-04-19] llamaindex/ParseBench
ParseBench is a benchmark for evaluating document parsing systems on real-world ent...
Generated automatically by Daily AI Digest Agent · 生成时间: 2026-04-22 01:44:09