AI 每日进展速报 / Daily AI Digest - 2026-05-11
图像生成/编辑 / Image Generation/Editing
arXiv
- SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
- 赛道归属: 文生图(复杂意图生成/结构化规划与验证)
- 核心创新点: 提出“语义承诺(semantic commitments)”及其在生成生命周期中断裂的“概念裂隙(Conceptual Rift)”框架,将复杂生成需求从文本理解、图像生成到结果核验进行结构化分解与可追踪表示;通过条件化“技能编排(skill orchestration)”把不同子能力(如属性绑定、关系约束、计数/布局等)按承诺单元进行调度与闭环校验,减少局部满足但全局失真的失败模式,提升复杂指令的可控一致性。
- Track: Text-to-Image (complex intent generation / structured planning & verification)
- Core innovation: Introduces “semantic commitments” and formalizes their lifecycle discontinuity as the “Conceptual Rift,” making complex requirements explicitly decomposable and trackable across understanding, generation, and verification; uses conditional skill orchestration to route specialized sub-skills (attribute binding, relational constraints, counting/layout, etc.) around commitment units with closed-loop checking, reducing cases where constraints are satisfied locally but violated globally.
- Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
- 赛道归属: 图像编辑(示例驱动编辑/少样本迁移)
- 核心创新点: 将示例编辑从传统“pair-of-pairs”监督降为“单对样本(single-pair)”监督:从源-目标示例对中显式抽取可迁移的“编辑增量/差分(delta)”表征,并以适配器(adapter)形式注入到生成/编辑模型中,实现编辑语义与内容解耦;从而在无需同语义第二对样本的情况下学习可泛化的编辑操作,显著降低数据构建成本并提升跨编辑类型的可扩展性。
- Track: Image Editing (exemplar-based editing / low-shot transfer)
- Core innovation: Replaces the pair-of-pairs supervision with single-pair supervision by explicitly extracting a transferable edit “delta” (difference) representation from one source–target exemplar pair and injecting it via an adapter module, decoupling edit semantics from image content; enables scalable learning of generalizable edits without needing a second pair sharing the same edit semantics.
- [2026-05-08] ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
- 赛道归属: 图像编辑(评测/可解释性评价与奖励建模)
- 核心创新点: 面向文本引导图像编辑评测引入强化学习范式:构建可解释的评价信号(不仅是标量分数),并训练能输出“原因链/错误归因”的评估器作为奖励模型;通过RL优化使评估器在识别伪影、非预期改动、审美退化等问题时同时给出可读的解释依据,弥补现有评测缺少解释数据与可训练奖励模型的短板。
- Track: Image Editing (evaluation / interpretable scoring & reward modeling)
- Core innovation: Brings reinforcement learning into text-guided image editing evaluation by training an evaluator/reward model that produces interpretable rationales (error attribution/reasoning traces) rather than only scalar scores; improves diagnosis of artifacts, unintended edits, and aesthetic regressions by coupling scoring with explanation generation.
- [2026-05-08] EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
- 赛道归属: 图像编辑(Agentic 精修/人类对齐的局部纠错)
- 核心创新点: 提出面向编辑结果“精修(refinement)”的代理式框架:通过构建人类反馈驱动的数据集(如细粒度缺陷与修复指令/偏好)对齐模型行为;结合具备更强空间落点能力的诊断-定位-修复流程(而非弱 grounding 的一次性VLM建议或反复重采样),在避免语义漂移的同时实现可靠的局部修补(物体不自然、光照不一致、局部纹理破坏等)。
- Track: Image Editing (agentic refinement / human-aligned local correction)
- Core innovation: Proposes an agentic refinement framework aligned with human feedback via a dedicated dataset of fine-grained defects and fixes/preferences; uses a diagnose–localize–repair pipeline with stronger spatial grounding than generic VLM-based refinement or costly iterative regeneration, enabling reliable local corrections while reducing semantic drift.
- [2026-05-08] OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing
- 赛道归属: 文生图安全(工具调用代理/越狱与红队测试)
- 核心创新点: 针对“工具调用型”文生图代理提出编排引导的模糊测试越狱框架:不再只优化单轮提示词,而是系统探索多步工具链的组合空间,利用“单步无害、组合有害”的编排漏洞生成攻击序列;通过对代理的规划/调用轨迹进行引导式变异与覆盖驱动搜索,发现传统prompt-only jailbreak难以触达的安全失效模式。
- Track: T2I Safety (tool-calling agents / jailbreak & red-teaming)
- Core innovation: Introduces orchestration-guided fuzzing for jailbreaking tool-calling T2I agents by exploring the multi-step toolchain composition space, targeting “benign individually, harmful jointly” orchestration vulnerabilities; uses guided mutation/coverage over planning and tool-invocation traces to uncover failure modes beyond prompt-only jailbreaks.
- [2026-05-07] FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
- 赛道归属: 文生图(像素空间生成/扩散-流匹配/频域建模)
- 核心创新点: 提出频率异质的流匹配框架,将像素空间生成显式分解为低频与高频的不同学习/生成动力学:对结构与全局布局(低频)与细节纹理(高频)采用差异化的建模与训练调度(如分支/条件化流、不同噪声或时间策略),缓解像素扩散“频率同质”假设带来的训练低效与细节/结构冲突,在不依赖VAE潜空间瓶颈的前提下提升质量与收敛效率。
- Track: Text-to-Image (pixel-space generation / diffusion-flow matching / frequency modeling)
- Core innovation: Proposes frequency-heterogeneous flow matching for pixel-space generation by explicitly decomposing generation into low- vs high-frequency components with different dynamics and training schedules (e.g., branched/conditioned flows, distinct time/noise strategies), addressing inefficiencies of frequency-homogeneous modeling and improving structure–detail trade-offs without VAE latent bottlenecks.
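To make the low- vs high-frequency split in the FREPix entry concrete, here is a minimal sketch on a 1-D signal. It assumes a plain box-blur low-pass filter; the digest does not say how FREPix actually decomposes frequencies or parameterizes its flows, and `box_blur` and `split_frequencies` are hypothetical helper names.

```python
def box_blur(signal, radius=1):
    """Low-pass filter: replace each sample by the mean of its neighbourhood.

    Windows are truncated at the edges rather than padded.
    """
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def split_frequencies(signal, radius=1):
    """Hypothetical helper: return (low, high) bands with low + high == signal."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


# Toy 1-D "image row": a coarse step (structure) plus fine alternation (texture).
signal = [0.0, 1.0, 0.0, 1.0, 4.0, 5.0, 4.0, 5.0]
low, high = split_frequencies(signal)
reconstructed = [l + h for l, h in zip(low, high)]
```

Because the two bands sum back to the input, a model could in principle treat each band with its own schedule and still recover the full signal. Whether FREPix does this in pixel space, Fourier space, or another basis is not stated here.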
- [2026-05-07] DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
- 赛道归属: 文生图评测(动态基准/抗污染评估)
- 核心创新点: 提出全自动动态评测框架以对抗固定prompt集的过拟合与基准污染:从长描述构建结构化视觉语义空间,并将提示分解为可控维度(主体、逻辑约束、环境、构图等),按维度组合与采样动态生成评测集;从而实现可持续更新、可诊断维度能力缺陷的评测,并降低模型“背题”风险。
- Track: Text-to-Image Evaluation (dynamic benchmarks / contamination-resistant evaluation)
- Core innovation: Builds a fully automated dynamic evaluation framework that constructs a structured visual semantic space from long-form descriptions and decomposes prompts into controllable dimensions (subject, logical constraints, environment, composition, etc.), enabling dynamic prompt generation and continual refresh; improves diagnostic evaluation and reduces overfitting/benchmark contamination from fixed public prompt sets.
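The dimension-wise prompt composition described for DynT2I-Eval can be sketched as follows. This is an illustration only: the dimension pools, the prompt template, and the function names `sample_prompt` and `build_eval_set` are assumptions, not taken from the paper, which derives its semantic space from long-form descriptions.

```python
import random

# Hypothetical dimension pools, hand-written here for illustration.
DIMENSIONS = {
    "subject": ["a red fox", "two ceramic mugs", "an elderly violinist"],
    "logic": ["to the left of a lamp", "next to exactly three apples", "under a striped umbrella"],
    "environment": ["in a snowy forest", "on a rooftop at dusk"],
    "composition": ["close-up shot", "wide-angle view from above"],
}


def sample_prompt(rng):
    """Pick one value per dimension and render it into a prompt string."""
    choice = {dim: rng.choice(values) for dim, values in DIMENSIONS.items()}
    text = "{subject} {logic}, {environment}, {composition}".format(**choice)
    return choice, text


def build_eval_set(size, seed):
    """Draw `size` distinct prompts; a new seed yields a fresh evaluation set."""
    total = 1
    for values in DIMENSIONS.values():
        total *= len(values)
    if size > total:
        raise ValueError("requested more prompts than distinct combinations")
    rng = random.Random(seed)
    seen, items = set(), []
    while len(items) < size:
        choice, text = sample_prompt(rng)
        if text not in seen:
            seen.add(text)
            items.append({"dimensions": choice, "prompt": text})
    return items


eval_set = build_eval_set(size=10, seed=0)
```

Keeping the per-dimension choices alongside each prompt is what allows scores to be aggregated by dimension later, which matches the diagnostic goal the entry describes.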
- [2026-05-07] T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval
- 赛道归属: 跨模态检索(文本到图像车辆检索/细粒度Re-ID)
- 核心创新点: 面向“文本描述找车”的细粒度检索,提出部件级跨模态对齐:构建车辆局部部件与文本短语/属性的局部配对与交互建模,在全局匹配之外引入可解释的部件级判别特征(如车灯、格栅、车身贴纸等);提升跨摄像头变化下的细粒度区分能力,并缓解仅靠全局嵌入导致的语义混淆。
- Track: Cross-modal Retrieval (text-to-image vehicle Re-ID / fine-grained perception)
- Core innovation: Proposes part-level fine-grained cross-modal alignment for text-to-image vehicle retrieval by locally pairing vehicle parts with textual phrases/attributes and modeling their interactions beyond global embeddings; improves discriminability under cross-camera variations and reduces semantic confusion with interpretable part-based cues.
- [2026-05-01] ScribbleEdit: Synthetic Data for Image Editing with Scribbles and Text
- 赛道归属: 图像编辑(多条件控制/涂鸦+文本/数据合成)
- 核心创新点: 提出用于“涂鸦+文本”联合控制编辑的合成数据方案:用可规模化的方式生成同时包含空间约束(scribble提供布局/边界)与语义约束(文本提供材质/类别/颜色等)的训练样本与标注,缓解真实数据难以同时具备两类监督的问题;从数据层面提升模型对精确局部编辑与语义一致性的协同控制能力。
- Track: Image Editing (multi-condition control / scribble+text / synthetic data)
- Core innovation: Introduces a scalable synthetic data pipeline for joint scribble-and-text guided editing, producing training pairs that simultaneously encode spatial constraints (scribbles for layout/boundaries) and semantic constraints (text for category/material/color); addresses the scarcity of real datasets with both supervisions and improves precise localized control with semantic consistency.
- [2026-05-01] Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
- 赛道归属: 文生图安全(NSFW鲁棒防护/生成时约束)
- 核心创新点: 提出“纪律化扩散”以替代简单的外部二值过滤:在扩散生成过程中引入对NSFW风险的内生约束/引导(而非仅事后拦截),使模型在检测到不安全倾向时倾向生成安全替代内容或朝安全分布偏移;该策略减少“允许/阻断”反馈带来的对抗可利用信号,并提升对提示词对抗与绕过攻击的鲁棒性。
- Track: T2I Safety (NSFW mitigation / in-generation constraints)
- Core innovation: Proposes “Disciplined Diffusion” as an in-generation safety mechanism rather than a post-hoc binary filter, injecting NSFW-aware constraints/guidance during the diffusion process to steer samples toward safe alternatives; reduces adversarial leverage from explicit allow/block signals and improves robustness to prompt attacks and bypass attempts.
GitHub
- [2026-05-11] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐11924
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-05-11] vibheksoni/free-ai ⭐404
Free OpenAI-compatible AI API with 16,000+ models, image generation, tool calling, and Discord key signup.
- [2026-05-11] AceDataCloud/Nexior ⭐369
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-05-11] CorentinGS/chess ⭐78
chess is a set of go packages which provide common chess utilities such as move generation, turn management, checkmate detection, PGN encoding, UCI in...
- [2026-05-11] mattleong/pi-better-openai ⭐51
A pi extension for OpenAI subscription workflows: fast mode, usage visibility, footer polish, and image generation through openai-codex auth.
HuggingFace Models
HuggingFace Datasets
- [2026-05-07] unh1nge/comfyui-character-composer
AIO Qwen Workflow
The repository now includes:
AIO Comfyui-Character-Composer Qwen Workflow.json
A unified all-in-one Qwen wor...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-05-06] FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
- 赛道归属: 文生视频(人脸身份保持/可控生成)
- 核心创新点: 提出 FaithfulFaces 的“姿态共享身份表示”学习框架,在大姿态变化与遮挡场景下强化身份一致性;通过将身份特征与姿态因素解耦并在跨姿态条件下共享/对齐身份表征,减少因姿态迁移导致的身份漂移与面部结构失真,从而提升复杂动态场景中的人脸身份保真度。
- Track: Text-to-Video (face identity preservation / controllable generation)
- Key innovation: Proposes FaithfulFaces with a pose-shared identity representation learning scheme to improve identity consistency under large pose changes and occlusions; it explicitly disentangles identity from pose and aligns/shares identity features across pose conditions, reducing pose-induced identity drift and facial distortion in dynamic scenes.
- [2026-05-08] Do Joint Audio-Video Generation Models Understand Physics?
- 赛道归属: 多模态评测(音视频联合生成/物理一致性基准)
- 核心创新点: 提出 AV-Phys Bench,用于系统评估音视频联合生成模型是否具备“物理常识一致性”而非仅生成表面合理的声画;基准覆盖稳态、事件转变、环境转变三类场景,并围绕物理因果与跨模态一致性设计测试,从评测维度上把“声画同步”提升到“物理可解释的一致”。
- Track: Multimodal evaluation (joint audio-video generation / physics consistency benchmark)
- Key innovation: Introduces AV-Phys Bench to test whether joint audio-video generators exhibit physics-grounded commonsense rather than merely plausible A/V outputs; it structures evaluation into Steady State, Event Transition, and Environment Transition, emphasizing causal, cross-modal physical consistency beyond simple synchronization.
- Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation
- 赛道归属: 视频生成 / DiT 推理加速(异构去噪步数分配)
- 核心创新点: 提出免训练的异构步数分配(HSA),打破“所有时空 token 统一 40 步”等固定采样范式:根据 token 的重要性/冗余度(尤其是运动冗余)为不同空间位置与时间片分配不同去噪步数,对低贡献 token 早停或少步更新、对关键 token 保持充分迭代,从而在不改训练的情况下显著降低总体计算量,同时尽量维持视频的时序一致性与细节质量。
- Track: Video generation / DiT inference acceleration (heterogeneous denoising step allocation)
- Key innovation: Proposes training-free Heterogeneous Step Allocation (HSA) to replace uniform denoising steps across all spatiotemporal tokens. It allocates fewer steps (or early stopping) to redundant/low-importance tokens, particularly those with motion redundancy, while preserving sufficient iterations for critical tokens, reducing compute substantially without retraining and with minimal loss in temporal coherence and visual fidelity.
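The per-token step budgeting in the HSA entry can be illustrated with a toy loop. This sketch assumes a linear map from an importance score in [0, 1] to a step budget and uses early stopping only; the importance estimator, the 10-step floor, and the function names are hypothetical, and the paper's actual allocation rule is not given in this digest.

```python
def allocate_steps(importance, max_steps=40, min_steps=10):
    """Map each importance score in [0, 1] linearly onto [min_steps, max_steps]."""
    budgets = []
    for score in importance:
        clipped = min(1.0, max(0.0, score))
        budgets.append(min_steps + round(clipped * (max_steps - min_steps)))
    return budgets


def denoise(tokens, importance, update, max_steps=40, min_steps=10):
    """Run the shared schedule, but stop updating a token once its budget is spent."""
    budgets = allocate_steps(importance, max_steps, min_steps)
    state = list(tokens)
    updates_done = 0
    for step in range(max_steps):
        for i, budget in enumerate(budgets):
            if step < budget:  # token i is still active at this step
                state[i] = update(state[i], step)
                updates_done += 1
    return state, updates_done


# Toy run: three tokens with low, medium, and high importance.
importance = [0.0, 0.5, 1.0]
budgets = allocate_steps(importance)
state, updates_done = denoise([1.0, 1.0, 1.0], importance, update=lambda x, step: 0.9 * x)
```

In this toy run the three tokens receive 10, 25, and 40 updates, so 75 token updates are performed instead of the 120 a uniform 40-step schedule would need.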
- [2026-05-06] Stream-T1: Test-Time Scaling for Streaming Video Generation
- 赛道归属: 视频生成(流式生成/推理时扩展 Test-Time Scaling)
- 核心创新点: 提出 Stream-T1,将 Test-Time Scaling 从传统扩散式“多候选探索”转向更适配的流式视频生成范式;利用分块(chunk-level)合成与更少去噪步数的结构优势显著降低 TTS 的候选搜索成本,并引入面向时间维度的指导机制以增强跨块时序一致性,实现更可扩展的推理时质量提升。
- Track: Video generation (streaming generation / test-time scaling)
- Key innovation: Stream-T1 reframes Test-Time Scaling around streaming video generation, exploiting chunk-wise synthesis and fewer denoising steps to drastically cut candidate exploration cost; it further adds temporal guidance across chunks to improve long-range coherence, enabling scalable test-time quality gains without retraining.
- [2026-05-01] UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
- 赛道归属: 视频生成(统一多模态条件生成/扩散先验复用)
- 核心创新点: 提出 UniVidX,将视频扩散模型作为通用先验,在共享的多模态空间中把像素对齐类任务统一表述为条件生成,从而避免“每个任务训练一个模型”的固定映射;通过统一框架建模多模态间相关性,实现更通用的输入输出组合与任务迁移能力(同一套扩散先验适配多种视频生成设定)。
- Track: Video generation (unified multimodal conditional generation via diffusion priors)
- Key innovation: UniVidX repurposes a video diffusion model as a shared multimodal prior, casting pixel-aligned tasks into conditional generation within a unified multimodal space; this removes per-task specialized models, better captures cross-modal correlations, and enables flexible I/O configurations and stronger task transfer with one framework.
- Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering
- 赛道归属: 视频生成 / DiT 推理优化(免训练稀疏注意力)
- 核心创新点: 发现注意力稀疏模式对输入具有稳定性(input-stable),据此提出“离线稀疏画像 + 在线 QK 协同聚类”的免训练稀疏注意力方案:离线为不同层建立更细粒度的稀疏度剖面以处理层间异质性;在线通过 Query-Key 联合分块/聚类显式建模 QK 耦合关系,避免仅按单侧划分带来的信息断裂,从而在不训练的前提下提升 3D 注意力加速的质量-速度曲线。
- Track: Video generation / DiT inference optimization (training-free sparse attention)
- Key innovation: Observes that attention sparsity patterns are input-stable, and proposes a training-free sparse attention pipeline combining offline sparsity profiling and online QK co-clustering. Offline profiling captures layer-wise heterogeneity; online joint query-key block partitioning explicitly models QK coupling, reducing information loss from one-sided partitioning and improving the quality–speed trade-off for accelerating dense 3D attention without retraining.
- [2026-05-07] FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
- 赛道归属: 长视频生成(免训练/一致性增强)
- 核心创新点: 提出 FreeSpec,通过“奇异谱重建(Singular-Spectrum Reconstruction)”在免训练设置下延展短视频扩散模型到长视频,缓解内容漂移、时序不一致与动态过平滑;相较于依赖预定义规则拆分外观/运动的全局-局部分支方法,FreeSpec用谱结构重建来更自适应地耦合外观一致性与动作演进,降低错误分配带来的失真。
- Track: Long video generation (training-free / temporal consistency)
- Key innovation: FreeSpec extends short-video diffusion models to long videos without training via Singular-Spectrum Reconstruction, addressing drift, temporal inconsistency, and over-smoothed motion; unlike global/local branch methods that heuristically separate appearance vs dynamics, it leverages spectral reconstruction to adaptively couple identity/appearance consistency with action progression, reducing mis-assignment artifacts.
- [2026-05-07] SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
- 赛道归属: 图生视频(高分辨率/高效生成)
- 核心创新点: 提出 SwiftI2V 的“条件分段生成(conditional segment-wise generation)”策略,将 2K 级高分辨率 I2V 的生成过程拆分为可控的分段/分块推理,在显著降低显存与时延的同时保持输入图像的细粒度结构;相较于“低清生成+通用超分”的级联方案,该方法在生成阶段就注入输入条件约束,减少细节幻觉与对输入局部结构的漂移。
- Track: Image-to-Video (efficient high-resolution generation)
- Key innovation: SwiftI2V introduces conditional segment-wise generation for 2K I2V, partitioning synthesis into conditioned segments to cut memory/latency while preserving fine-grained appearance from the source image; compared to low-res generation plus generic video SR, it enforces input-conditioned structure during generation, reducing hallucinated details and local-structure drift.
- [2026-05-07] RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
- 赛道归属: 视频编辑/视频到视频(实时新视角生成/交互式相机控制)
- 核心创新点: 提出 RealCam,用面向实时的因果式架构实现单目视频的交互式相机控制新视角生成,突破以往依赖全序列非因果处理与前缀拼接的范式;通过避免双向注意力带来的二次复杂度与高延迟,使模型能够低时延、可流式地响应相机轨迹控制,同时提升在线生成的时序连贯性与可用性。
- Track: Video editing / Video-to-Video (real-time novel-view generation with interactive camera control)
- Key innovation: RealCam proposes a real-time, causal architecture for camera-controllable novel-view V2V from monocular footage, replacing non-causal full-sequence processing and rigid prefix concatenation; by eliminating bidirectional attention’s quadratic cost and latency, it enables low-latency streaming generation that can interactively follow camera controls with improved temporal coherence.
- Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation
- 赛道归属: 视频生成 / 长时长说话人视频生成(多模态扩散 Transformer)
- 核心创新点: 提出带 Memory Bank 的多模态扩散 Transformer 框架用于可扩展的长时长 talking video:通过显式记忆机制跨段存储并检索人物身份与关键时序状态,缓解长序列生成中的人像漂移、误差累积与时序伪影;同时融合多模态条件(如语音/文本等)进行稳定驱动,在视频长度增长时仍保持一致性与计算可控性。
- Track: Video generation / Long-duration talking video generation (multimodal diffusion transformer)
- Key innovation: Presents a multimodal Diffusion Transformer with a Memory Bank for scalable long-duration talking video synthesis. The explicit memory stores and retrieves identity and key temporal states across segments, reducing portrait drift, error accumulation, and temporal artifacts in long sequences, while leveraging multimodal conditioning (e.g., audio/text) for stable driving and improved scalability as duration increases.
GitHub
- [2026-05-11] Anil-matcha/Open-Generative-AI ⭐12735
Unrestricted, open-source alternative to AI video platforms — Free, unrestricted AI image & video generation studio with 200+ models (Flux, Midjourney...
- [2026-05-11] hao-ai-lab/FastVideo ⭐3463
A unified inference and post-training framework for accelerated video generation.
- [2026-05-11] ModelTC/LightX2V ⭐2253 🆕NEW
Light Image Video Generation Inference Framework
- [2026-05-11] YouMind-OpenLab/awesome-seedance-2-prompts ⭐977
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-05-11] Winn1y/Awesome-Human-Motion-Video-Generation ⭐324
【Accepted by TPAMI】Human Motion Video Generation: A Survey (https://ieeexplore.ieee.org/document/11106267)
HuggingFace Models
音频生成 / Audio Generation
arXiv
- [2026-05-01] Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
- 赛道归属: 文本到音频生成(Text-to-Audio)/ 推理加速与采样优化
- 核心创新点: 提出单步(one-step)采样的文生音频框架,用能量距离(energy-distance)训练目标约束“噪声→音频latent”的直接映射,并引入辅助上下文表征蒸馏(representation distillation)把多步扩散/递归解码中的条件信息压缩到单步模型中,从而在尽量保持音质与文本一致性的同时显著降低采样延迟。
- Track: Text-to-Audio generation / Inference acceleration & sampling
- Key innovation: A one-step sampling framework that directly maps Gaussian noise to audio latents via an energy-distance objective, plus auxiliary contextual representation distillation to compress multi-step diffusion/recursive decoding conditioning into a single pass, reducing latency while preserving fidelity and text-audio alignment.
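For readers unfamiliar with the energy-distance objective named in the entry above, the sketch below computes the standard sample-based (squared) energy distance, 2·E|X−Y| − E|X−X′| − E|Y−Y′|, between two small point sets. How the paper turns this statistic into a training loss for audio latents is not described in this digest; the toy 2-D points and the function names are assumptions.

```python
import math


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all (x, y) pairs, self-pairs included."""
    return sum(math.dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Sample estimate of 2*E|X-Y| - E|X-X'| - E|Y-Y'| (V-statistic form)."""
    return (
        2.0 * mean_pairwise_distance(xs, ys)
        - mean_pairwise_distance(xs, xs)
        - mean_pairwise_distance(ys, ys)
    )


# Toy stand-ins for "generated" and "reference" latents.
generated = [(0.0, 0.0), (1.0, 0.0)]
reference = [(3.0, 0.0), (4.0, 0.0)]
gap = energy_distance(generated, reference)
self_gap = energy_distance(generated, generated)
```

The statistic is zero when the two sample sets coincide and grows as they separate, which is what makes it usable as a distribution-matching signal without a discriminator.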
- [2026-05-01] MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video
- 赛道归属: 视频到音频生成(Silent Video→Audio)/ 音频事件检测与标注(Sound Event Labeling)
- 核心创新点: 将“生成音频后再做事件检测”的后处理流水线改为生成式标注:利用静音视频驱动音频生成的同时,联合产出可解释的事件标签(类型+时间定位),以端到端方式共享生成过程中的跨模态对齐信号,减少级联误差累积,并提升事件时间边界与语义一致性。
- Track: Silent-video-to-audio generation / Sound event labeling (SED)
- Key innovation: Replaces post-hoc SED-on-generated-audio with a generative labeling paradigm that jointly produces audio and explicit event labels (class + timestamps) from silent video, leveraging shared cross-modal alignment during generation to reduce cascade error and improve temporal/semantic consistency.
- [2026-05-07] LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation
- 赛道归属: 神经编解码(Neural Codec)/ 端侧实时压缩与机器感知友好编码
- 核心创新点: 提出轻量、通用且非对称(asymmetric)的神经编解码设计:以更强的解码端能力换取编码端低算力/低功耗,面向实时与带宽受限设备;同时强调对“机器感知任务与非传统模态”(如空间音频阵列等)的适配,而非仅优化人类感知指标,从体系结构上兼顾码率-质量-端侧可部署性。
- Track: Neural codecs / Real-time edge compression for machine perception
- Key innovation: A lightweight, versatile asymmetric codec architecture that shifts complexity to the decoder to enable low-power real-time encoding under bandwidth constraints, and is designed to serve machine-perception tasks and non-standard modalities (e.g., spatial audio arrays) beyond human-perceptual optimization.
- [2026-05-07] BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
- 赛道归属: 推理优化(Test-Time Scaling)/ 量化推理模型校准
- 核心创新点: 针对后训练量化导致的置信度/停止信号失真,提出“比特校准”的测试时缩放方法:在固定生成token预算下,校准量化模型的在线不确定性与早停/算力分配控制信号,缓解“看似合理但推理未收敛”的过早停止问题,使自适应推理深度在量化条件下更稳定可靠。
- Track: Inference optimization (Test-Time Scaling) / Quantized reasoning calibration
- Key innovation: Bit-calibrated test-time scaling that corrects confidence/halting signals distorted by post-training quantization under a fixed token budget, reducing harmful early stopping and stabilizing adaptive compute allocation for quantized reasoning models.
- OLaPh: Optimal Language Phonemizer
- 赛道归属: 语音合成前端(TTS Front-end)/ 文本到音素(G2P/Phonemization)
- 核心创新点: 提出混合式音素化框架:融合大规模多语种词典(lexica)与现代NLP建模,并引入统计子词切分来处理OOV与跨语言形态变化;通过“词典强约束 + 神经/统计泛化”的组合,在覆盖率与泛化能力之间取得更优折中,提升多语种音素化鲁棒性。
- Track: TTS front-end / Phonemization (G2P)
- Key innovation: A hybrid phonemizer combining extensive multilingual lexica with advanced NLP modeling and statistical subword segmentation, achieving better OOV/generalization while retaining lexicon-backed correctness across languages.
- LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
- 赛道归属: 推理优化(Test-Time Scaling)/ 自动化策略发现(Agentic/RL)
- 核心创新点: 提出环境驱动的AutoTTS:将研究者的设计对象从“具体TTS启发式”提升为“可搜索的环境/接口”,让LLM以代理式探索在测试时如何分配计算与组织推理轨迹;通过自动发现与评估策略,系统性覆盖更大的TTS策略空间,减少手工规则与直觉调参依赖。
- Track: Inference optimization (Test-Time Scaling) / Agentic strategy discovery
- Key innovation: AutoTTS reframes TTS from hand-crafted heuristics to an environment-driven search problem, enabling LLM agents to discover and evaluate test-time compute allocation/reasoning strategies automatically, expanding the explored policy space and reducing manual tuning.
- STEPS: A Temporal Smooth Error Propagation Solver on the Manifolds for Test-Time Adaptation in Time Series Forecasting
- 赛道归属: 时间序列预测 / 测试时自适应(Test-Time Adaptation, TTA)
- 核心创新点: 提出STEPS:在流式、无源域(source-free)的在线TTA场景下,针对前缀短、相关性强且噪声污染导致的不可辨识与误差累积问题,引入“时间平滑的误差传播求解器”,并在流形(manifold)上进行稳定更新,以抑制长预测跨度下的漂移与不稳定修正。
- Track: Time-series forecasting / Test-time adaptation (online, source-free)
- Key innovation: STEPS introduces a temporally smooth error-propagation solver with manifold-aware updates for online source-free TTA, improving identifiability and robustness under short/noisy prefixes and reducing error accumulation over long horizons.
- [2026-05-07] Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
- 赛道归属: 音频生成评测(Audio Generation Evaluation)/ 最优传输距离度量
- 核心创新点: 提出OTAD以替代/修正FAD的两大结构性缺陷:在“代价项”上学习残差黎曼地面度量适配器(Riemannian ground-metric adapter)以避免冻结嵌入的不变性掩盖伪影;在“耦合项”上用离散OT(带熵正则)替代高斯拟合近似,提升对局部污染与细粒度失真的敏感性,从而得到更可信的生成音频距离度量。
- Track: Audio generation evaluation / Optimal transport metrics
- Key innovation: OTAD fixes FAD by (1) learning a residual Riemannian ground-metric adapter for the OT cost instead of relying on a frozen embedding pullback, and (2) replacing Gaussian coupling with discrete entropic OT, improving sensitivity to artifacts and rank-1/contaminated distortions.
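The "discrete entropic OT" ingredient mentioned for OTAD can be illustrated with a plain Sinkhorn iteration between two small sample sets. This is a generic textbook sketch, not the paper's metric: it assumes uniform sample weights and a squared-difference cost on toy 1-D points, whereas OTAD is described as learning its ground metric. The `epsilon` and `iterations` values are arbitrary choices.

```python
import math


def sinkhorn(cost, epsilon=1.0, iterations=500):
    """Entropy-regularized OT between two uniformly weighted sample sets.

    Returns the transport plan and the transport cost it induces.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform weights on the first set
    b = [1.0 / m] * m  # uniform weights on the second set
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total


# Toy 1-D "embeddings" for reference and generated audio.
xs = [0.0, 1.0, 2.0]
ys = [0.1, 1.1, 2.1]
cost = [[(x - y) ** 2 for y in ys] for x in xs]
plan, total = sinkhorn(cost)
```

Unlike FAD, which fits a Gaussian to each embedding set and compares the two fits, this couples individual samples directly, which is the property the entry credits for better sensitivity to localized contamination.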
- [2026-05-06] Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation
- 赛道归属: 符号音乐生成(Symbolic Music Generation)/ 和弦进行生成与跨风格适配
- 核心创新点: 将“和弦生成”作为独立任务系统研究跨风格微调的数据配比问题:通过实证分析Pop与Jazz混合比例对适配效果与遗忘的影响,给出在迁移到新风格时保留旧域数据的量化规律/操作建议,为风格自适应训练提供可复用的配方而非仅依赖经验。
- Track: Symbolic music generation / Chord progression generation & genre adaptation
- Key innovation: Treats chord generation as a standalone task and empirically studies how Pop/Jazz data mix ratios during fine-tuning trade off new-genre acquisition vs. forgetting, yielding actionable, quantitative guidance for genre-adaptive training recipes.
- [2026-05-06] Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition
- 赛道归属: 可穿戴人体活动识别(WHAR)/ 测试时自适应(TTA)
- 核心创新点: 重新审视WHAR流式数据的“跨窗口时间结构”,将其作为特征条件的推理信号用于更高效的测试时自适应,而非照搬视觉TTA的独立样本假设;通过利用序列一致性/时序约束来驱动在线无标注更新,提升跨用户分布偏移下的适配稳定性与效率。
- Track: Wearable human activity recognition / Test-time adaptation
- Key innovation: Exploits inter-window temporal structure in WHAR streams as a feature-conditioned inference/adaptation signal (instead of i.i.d. assumptions from vision TTA), enabling more efficient and stable online unlabeled adaptation under cross-user distribution shifts.
GitHub
- [2026-05-11] huggingface/diffusers ⭐33585
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-05-06] OpenMOSS/MOVA ⭐994
MOVA: Towards Scalable and Synchronized Video–Audio Generation
- [2026-05-09] Ameobea/web-synth ⭐552
Browser-based DAW and audio synthesis platform with dozens of effects, synths, and modules
- [2026-05-10] apocas/restai ⭐504
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-05-07] Saganaki22/ComfyUI-Woosh ⭐97
Text-to-audio and video-to-audio using Sony AI's Woosh foundation model.
HuggingFace Models
语言大模型 / Large Language Models
arXiv
- [2026-05-07] LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution
- 赛道归属: 代码安全与恶意软件分析(LLM+静态分析/归因)
- 核心创新点: 构建面向“代码证据”的恶意软件归因基准与框架:提出代码中心的LCCD数据集(约34K PE样本)并配套证据落地的归因流程,强调从二进制/反汇编等静态线索中定位“恶意/脆弱代码片段”的可验证证据链,弥补以往LLM归因依赖不受支持指标、缺乏代码级grounding的问题,实现归因与多任务静态分析的统一评测与训练范式。
- Track: Code security & malware analysis (LLM + static analysis/attribution)
- Core innovation: Introduces an evidence-grounded, code-centric benchmark and framework: the LCCD dataset (~34K PE samples) plus an attribution pipeline that explicitly grounds decisions in verifiable code-level evidence (malicious/vulnerable segments) extracted from static artifacts, addressing prior LLM attribution’s unsupported indicators and weak code grounding, and enabling unified evaluation/training for attribution and multi-task static malware analysis.
- [2026-05-06] RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization
- 赛道归属: 对齐与偏好优化(RLHF/DPO改进、逻辑一致性对齐)
- 核心创新点: 提出Hybrid-DPO的自动化偏好构造机制以纠正DPO“偏啰嗦/重流畅轻逻辑”的系统性偏差:将逻辑可靠性信号(如基于NLI/蕴含判别的DeBERTa等判据)与生成流畅度偏好融合,形成更平衡的偏好对,从训练目标层面缩小“逻辑对齐缺口”,在知识密集型生成中同时提升逻辑正确性与可读性。
- Track: Alignment & preference optimization (RLHF/DPO, logical grounding)
- Core innovation: Proposes Hybrid-DPO with an automated preference pipeline that counteracts DPO’s verbosity/fluency bias by fusing logical reliability signals (e.g., DeBERTa-based NLI/entailment judgments) with fluency preferences, producing better-balanced preference pairs and reducing the “logical alignment gap” in knowledge-intensive generation.
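One way to picture the preference construction in the Hybrid-DPO entry is a scoring rule that blends a logic score with a fluency score before picking chosen/rejected responses. The sketch below is an assumption-laden illustration: the linear blend, the `logic_weight` and `min_margin` values, and the hard-coded scores (standing in for an NLI model and a fluency scorer) are all hypothetical, and the paper's actual fusion rule is not given in this digest.

```python
def hybrid_score(logic, fluency, logic_weight=0.7):
    """Blend a logic score and a fluency score, both assumed to lie in [0, 1]."""
    return logic_weight * logic + (1.0 - logic_weight) * fluency


def build_preference_pair(candidates, logic_weight=0.7, min_margin=0.1):
    """Return a chosen/rejected pair, or None when the score gap is too small."""
    ranked = sorted(
        candidates,
        key=lambda c: hybrid_score(c["logic"], c["fluency"], logic_weight),
        reverse=True,
    )
    best, worst = ranked[0], ranked[-1]
    margin = (
        hybrid_score(best["logic"], best["fluency"], logic_weight)
        - hybrid_score(worst["logic"], worst["fluency"], logic_weight)
    )
    if margin < min_margin:
        return None
    return {"chosen": best["text"], "rejected": worst["text"], "margin": margin}


# Hard-coded scores stand in for an entailment model and a fluency scorer.
candidates = [
    {"text": "Short answer entailed by the source.", "logic": 0.9, "fluency": 0.6},
    {"text": "Long, polished, but unsupported answer.", "logic": 0.2, "fluency": 0.95},
    {"text": "Partly supported, average answer.", "logic": 0.5, "fluency": 0.5},
]
pair = build_preference_pair(candidates)
fluency_only_pair = build_preference_pair(candidates, logic_weight=0.0)
```

With the blended score the entailed answer is preferred over the polished but unsupported one; setting `logic_weight=0.0` flips the choice, which mirrors the fluency bias the entry says the method is meant to counteract.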
- Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
- 赛道归属: 推理与规划分析(CoT可解释性/行为刻画)
- 核心创新点: 提出从LLM推理轨迹中“抽取搜索树”的方法学,用结构化量化替代仅看最终答案/文本:在四子棋环境中将CoT中的分支、回溯与前瞻显式拟合为搜索树并度量其深度、分支因子与局部性,从而揭示推理模型存在“短视规划(myopic planning)”等行为特征,并将性能与搜索结构属性建立可检验关联。
- Track: Reasoning & planning analysis (CoT interpretability/behavior characterization)
- Core innovation: Introduces a method to extract and quantify search trees from LLM reasoning traces, fitting deliberative CoT into explicit tree structures in a four-in-a-row game and measuring properties (depth/branching/locality) to reveal myopic planning and link performance to measurable search-structure attributes.
- RelAgent: LLM Agents as Data Scientists for Relational Learning
- 赛道归属: LLM智能体(数据科学自动化/关系学习AutoML)
- 核心创新点: 将LLM智能体定位为“关系学习数据科学家”,提出两阶段自治流程:搜索阶段利用数据库操作、验证集反馈与候选建模策略进行方案探索;随后在精炼阶段对最佳方案进行迭代改进与稳健化。核心突破在于把关系数据(多表/实体关系)上的特征构造、模型选择与评估闭环显式代理化,形成面向关系学习的端到端自动建模工作流。
- Track: LLM agents (data-science automation / relational learning AutoML)
- Core innovation: Frames an LLM agent as an autonomous data scientist for relational learning with a two-phase workflow: a search phase that explores modeling pipelines using database operations and validation feedback, followed by a refinement phase that iteratively improves the best pipeline—explicitly agentizing the closed loop of feature engineering, model selection, and evaluation for relational (multi-table) data.
- [2026-05-08] LLM hallucinations in the wild: Large-scale evidence from non-existent citations
- 赛道归属: 可信生成与幻觉评测(引用幻觉/科学写作审计)
- 核心创新点: 利用“可唯一核验”的科学引用作为幻觉探针,提出大规模现实世界审计范式:跨arXiv、bioRxiv、SSRN、PMC等语料,核查2.5M论文中的1.11亿条参考文献,量化LLM普及后“不存在引用”激增的趋势,并以保守估计给出影响规模。方法论价值在于把幻觉从小样本基准迁移到可复核的真实生产数据上,提供可追责的度量框架。
- Track: Trustworthy generation & hallucination evaluation (citation hallucinations / scientific writing audit)
- Core innovation: Establishes a large-scale, real-world auditing methodology using verifiable scientific citations as hallucination probes: checks 111M references across 2.5M papers from major repositories, quantifies the post-LLM-adoption rise of non-existent citations, and provides conservative impact estimates—moving hallucination measurement from small benchmarks to accountable, reproducible real data.
- [2026-05-08] Reliable Chain-of-Thought via Prefix Consistency
- 赛道归属: 推理可靠性与测试时集成(自一致性改进/置信度估计)
- 核心创新点: 提出Prefix Consistency作为CoT可靠性信号,改造传统self-consistency的“等权多数投票”:对每条CoT截断前缀后再生成后续,利用“正确轨迹更易复现原答案、错误轨迹更不稳定”的差异,为候选答案赋权并聚合,从而在不改模型参数的测试时提升推理准确率与鲁棒性。
- Track: Reasoning reliability & test-time ensembling (self-consistency, confidence estimation)
- Core innovation: Introduces Prefix Consistency as a reliability signal for CoT: truncate a trace and regenerate the suffix; correct traces tend to reproduce the same answer more often than incorrect ones. This stability is used to weight candidates instead of uniform majority voting, improving test-time aggregation without model retraining.
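The truncate-and-regenerate weighting described for Prefix Consistency can be sketched as below. The model call is replaced by a deterministic lookup table (`toy_regenerate`), and the truncation point, resample count, and function names are assumptions made for illustration; the paper's exact scoring and aggregation may differ.

```python
from collections import Counter, defaultdict


def prefix_consistency(trace, answer, regenerate, keep_fraction=0.5, resamples=4):
    """Fraction of regenerated suffixes that reproduce the original answer."""
    cut = max(1, int(len(trace) * keep_fraction))
    prefix = trace[:cut]
    matches = sum(1 for k in range(resamples) if regenerate(prefix, k) == answer)
    return matches / resamples


def weighted_vote(samples, regenerate, keep_fraction=0.5, resamples=4):
    """Aggregate answers, weighting each trace by its prefix consistency."""
    weights = defaultdict(float)
    for trace, answer in samples:
        weights[answer] += prefix_consistency(
            trace, answer, regenerate, keep_fraction, resamples
        )
    return max(weights, key=weights.get), dict(weights)


def toy_regenerate(prefix, k):
    """Stand-in for an LLM call: the k-th resampled answer for a given prefix."""
    table = {
        ("step a",): ["15", "15", "15", "15"],  # stable continuation
        ("step b",): ["12", "9", "15", "7"],    # unstable continuation
        ("step c",): ["12", "15", "3", "15"],   # unstable continuation
    }
    return table[tuple(prefix)][k]


samples = [
    (["step a", "so the answer is 15"], "15"),
    (["step b", "so the answer is 12"], "12"),
    (["step c", "so the answer is 12"], "12"),
]
majority = Counter(answer for _, answer in samples).most_common(1)[0][0]
winner, weights = weighted_vote(samples, toy_regenerate)
```

In this toy setup uniform majority voting picks "12" because two traces agree, while the consistency-weighted vote picks "15" because only that trace reliably reproduces its own answer.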
- [2026-05-08] Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation
- 赛道归属: 机器翻译分析与诊断(低资源翻译/Token动态)
- 核心创新点: 从“token动态”视角系统剖析LLM低资源翻译失败机制:对15个模型(含推理型LLM)在22个不同资源水平语对上进行对比,揭示非英语中心语对更易失败的规律,并将错误与分词/生成过程中的token层现象关联(如分布偏置、复制/退化等),为后续针对性数据与解码策略改进提供可操作的诊断依据。
- Track: Machine translation analysis & diagnostics (low-resource MT, token dynamics)
- Core innovation: Provides a token-dynamics-based diagnosis of why LLMs fail in low-resource MT: evaluates 15 models over 22 language pairs with varying resource levels, shows consistent degradation on non-English-centric pairs, and connects failures to token-level generation/segmentation behaviors (biases/degeneration), yielding actionable diagnostic signals for targeted improvements.
- [2026-05-08] GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
- 赛道归属: In-Context Learning优化(示例选择/低数据提示工程自动化)
- 核心创新点: 提出GRaSp三阶段自动示例优化框架,解决低数据场景“好示例稀缺且选择敏感”的痛点:先生成大规模合成候选示例池,再通过聚类与降维进行结构化覆盖,最后用任务反馈/效用评估选择最优demonstrations,实现从“生成—组织—筛选”的闭环自动化,提升ICL在领域与小样本任务上的稳定性与上限。
- Track: In-context learning optimization (example selection, low-data prompt automation)
- Core innovation: Proposes GRaSp, a three-stage automatic demonstration optimization pipeline: generate a large synthetic candidate pool, structure it via clustering and dimensionality reduction for coverage/diversity, then select examples using task-utility feedback—closing the loop from generation to organization to selection to improve ICL robustness in low-data, domain-specific tasks.
- [2026-05-08] DCGL: Dual-Channel Graph Learning with Large Language Models for Knowledge-Aware Recommendation
- 赛道归属: 推荐系统(知识图谱推荐 + LLM语义融合/图学习)
- 核心创新点: 提出双通道图学习DCGL以同时建模显式KG关系与隐式语义关联,并改进ID特征与LLM表征的融合方式:通过双通道(结构关系通道+语义通道)分别捕获可观测链接与超越KG边的语义相似,再进行更细粒度的跨通道交互/融合,缓解单通道融合导致的信息塌缩与语义关系建模不足,从而提升知识感知推荐效果。
- Track: Recommender systems (KG-based recommendation + LLM semantic fusion / graph learning)
- Core innovation: Introduces DCGL, a dual-channel graph learning framework that separately models explicit KG links and implicit semantic relations (beyond observed edges), and improves fusion of ID-based signals with LLM embeddings via finer-grained cross-channel interaction—mitigating single-channel fusion bottlenecks and enhancing knowledge-aware recommendation.
- [2026-05-08] SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model
- 赛道归属: 多智能体与博弈(对手建模/因果建模驱动的LLM Agent)
- 核心创新点: 提出基于结构因果模型(SCM)的Structured Opponent Modeling(SOM),将“对手模型构建”与“行为预测/决策”解耦为两阶段:先从交互数据中学习对手的结构化因果机制(偏好、策略、状态依赖等),再在该显式模型上进行预测与适应,从而提升在动态对抗环境中的可迁移性与可控性,避免仅靠上下文隐式推理导致的纠缠与脆弱。
- Track: Multi-agent & game settings (opponent modeling, causal-model-driven LLM agents)
- Core innovation: Proposes SOM using Structural Causal Models to decouple opponent model construction from prediction/decision-making: first learns an explicit structured causal mechanism of opponent behavior from interactions, then performs prediction and adaptation on top of it—improving transferability and controllability in dynamic multi-agent environments versus implicit context-only reasoning.
GitHub
- [2026-05-11] sgl-project/sglang ⭐27625
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-05-11] PaddlePaddle/PaddleFormers ⭐12986
PaddleFormers is an easy-to-use library of pre-trained large language model zoo based on PaddlePaddle.
- [2026-05-11] trpc-group/trpc-agent-go ⭐1153
trpc-agent-go is a powerful Go framework for building intelligent agent systems using large language models (LLMs) and tools.
- [2026-05-11] flagos-ai/FlagGems ⭐995
FlagGems is an operator library for large language models implemented in the Triton Language.
- [2026-05-11] Kwwwww74/Awesome-Trustworthy-AudioLLMs ⭐131
A reading list for trustworthy audio large language models.
HuggingFace Models
HuggingFace Datasets
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2026-04-19] Jackrong/GLM-5.1-Reasoning-1M-Cleaned
GLM-5.1-Reasoning-1M-Cleaned
GLM-5.1-Reasoning-1M-Cleaned is a cleaned and reformatted derivative of Kassadin88/GLM-5.1-1000000x. It prese...
- [2026-04-24] Jackrong/DeepSeek-V4-Distill-8000x
🐳 DeepSeek-V4-Distill-8100x Dataset Summary
DeepSeek-V4-Distill-8100x is a supervised fine-tuning dataset for re...
- [2026-05-03] iletisim/dezenformasyon-bultenleri
İletişim Başkanlığı Dezenformasyon Bültenleri
Kaynak API: llm.iletisim.gov.trKaynak Bültenler: iletisim.gov.tr/turkce/dezenformasyon-bulten...
- [2026-04-05] Roman1111111/claude-opus-4.6-10000x
This is a high-fidelity reasoning dataset synthesized using Claude Opus 4.6. The dataset is designed to capture the model's internal "Chain of Thought...
HuggingFace Spaces
多模态大模型 / Multimodal Models
arXiv
- [2026-05-08] Fine-tuning a vision-language model for fracture-surface morphology recognition
- 赛道归属: 科学影像理解 / 领域VLM微调(材料断口形貌识别)
- 核心创新点: 基于开源VLM(Qwen3-VL-32B-Instruct)进行材料断口图像的领域适配微调,构建并利用13,168张文献挖掘的断口图像数据集;通过推理型大模型从“图像+文本”联合生成形貌标注,实现低人工成本的可扩展标注管线,从而把通用VLM的视觉表征对齐到材料学形貌判别所需的细粒度纹理/结构知识。
- Track: Scientific image understanding / domain VLM fine-tuning (fracture-surface morphology recognition)
- Key innovation: Domain-adapts an open VLM (Qwen3-VL-32B-Instruct) via fine-tuning on a curated 13,168-image literature-mined fracture dataset; uses a reasoning LLM to generate morphology annotations from joint image+text evidence, forming a scalable, low-manual-cost labeling pipeline that aligns generic VLM representations to fine-grained materials morphology cues.
- [2026-05-07] MedHorizon: Towards Long-context Medical Video Understanding in the Wild
- 赛道归属: 医疗长视频多模态理解 / 长上下文视频理解基准
- 核心创新点: 面向“全流程临床视频回顾”这一真实场景,聚焦医疗过程视频的关键难点(高冗余视角、关键证据稀疏且细微、强上下文依赖),提出在野外(long-context in the wild)的长视频理解任务设定与评测方向,突破以往依赖已定位片段/预分割视频的基准假设,使模型必须在长时序中自主发现与整合决定性证据。
- Track: Medical long-context video understanding / benchmark & task setting
- Key innovation: Targets full-procedure clinical video review with a “long-context in-the-wild” formulation, explicitly modeling medical-procedure properties (high redundancy, temporally sparse and subtle decisive evidence, strong context dependence) and moving beyond benchmarks that pre-localize evidence via clips/segments—forcing models to discover and aggregate key evidence over long timelines.
- [2026-05-07] Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning
- 赛道归属: 多模态模型遗忘/机器反学习(MLLM Unlearning)
- 核心创新点: 提出“零空间约束”的对比式视觉遗忘方法,在多模态耦合更强的MLLM中实现“定向移除目标视觉知识,同时最大化保留非目标视觉知识与全部文本知识”;通过在参数更新中引入零空间/正交约束,将遗忘梯度限制在不干扰保留子空间的方向上,并用对比学习目标强化“忘/记”边界,缓解遗忘-保留的权衡冲突。
- Track: Multimodal unlearning / machine unlearning for MLLMs
- Key innovation: Introduces null-space-constrained contrastive visual forgetting to remove target visual knowledge while preserving non-target visual and all textual knowledge; enforces orthogonality/null-space constraints on updates to avoid interfering with retained subspaces, and uses contrastive objectives to sharpen the forget-vs-retain boundary, improving the unlearning/retention trade-off in tightly coupled MLLMs.
- [2026-05-01] EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
- 赛道归属: 多模态情感识别(MER)/ MLLM鲁棒性评测与可控引导
- 核心创新点: 构建EmoMM基准,系统覆盖模态一致、模态冲突、模态缺失三类设置,用于剖析MLLM在冲突与缺失条件下的决策机制;通过大规模评测揭示模型对视频/音频/文本等模态贡献的偏置规律(如“视频贡献偏置”等现象),并进一步提出针对性“steering/引导”策略以在冲突与缺失时校正模态依赖、提升鲁棒性与可解释性。
- Track: Multimodal emotion recognition / robustness benchmarking & steering for MLLMs
- Key innovation: Presents EmoMM, a benchmark spanning modality-aligned, conflict, and missingness settings to probe MLLM decision behavior; uncovers systematic modality-contribution biases (e.g., video contribution bias) via extensive evaluation, and proposes targeted steering methods to recalibrate modality reliance under conflict/missing inputs, improving robustness and interpretability.
- Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
- 赛道归属: 3D表征学习 / 面向VLM的高效3D语义对齐
- 核心创新点: 提出Proxy3D,用“语义聚类+跨模态对齐”构建高效3D代理表示,替代传统VLM以像素对齐为主的2D视觉token管线;通过把视觉序列压缩为语义一致的3D代理单元,兼顾空间一致性(缓解隐式对应模型的空间不稳定)与计算效率(缓解显式3D几何先验方法在长序列上的开销),从而提升VLM的3D空间推理能力与可扩展性。
- Track: 3D representation learning / efficient 3D semantics for VLMs
- Key innovation: Proposes Proxy3D, building efficient 3D proxy representations via semantic clustering and multimodal alignment to replace pixel-aligned 2D token pipelines; compresses visual sequences into semantically coherent 3D proxy units, improving spatial consistency (vs. correspondence-based implicit 3D) while maintaining efficiency (vs. heavy 3D-prior representations), enabling scalable 3D reasoning in VLMs.
- Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
- 赛道归属: 视觉-语言模型遗忘/隐私合规(VLM Unlearning)
- 核心创新点: 提出HFRU(Object Hallucination-Free Reinforcement Unlearning),针对现有主要微调语言解码器导致“表层遗忘、底层视觉表征未清除”且易引入物体幻觉的问题,改为直接作用于视觉编码器进行深层语义移除;采用两阶段框架:先进行对遗忘目标的强化式优化(以奖励信号驱动“忘得更干净”),再通过稳定化/约束机制抑制遗忘副作用,从而在更彻底移除敏感视觉知识的同时降低幻觉风险。
- Track: VLM unlearning / privacy & safety (hallucination mitigation)
- Key innovation: Introduces HFRU, a reinforcement unlearning framework that targets the vision encoder (not just the language decoder) to achieve deep semantic removal and avoid object hallucinations; uses a two-stage procedure combining reward-driven unlearning for thorough forgetting with stabilization/constraints to reduce side effects, improving both unlearning efficacy and hallucination robustness.
- Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
- 赛道归属: 流式视频理解 / 在线视觉记忆管理与压缩
- 核心创新点: 提出语义感知的自适应视觉记忆机制,将“语义信号”显式纳入流式视频token的保留/压缩决策,而非仅依赖视觉相似度启发式;并把检索与压缩进行协同设计(而不是压缩后再补检索),使记忆在不确定查询到来时仍能保留对潜在问题最有用的语义证据,从而提升长时在线理解的实时性与问答命中率。
- Track: Streaming video understanding / online memory management & compression
- Key innovation: Proposes semantic-aware adaptive visual memory that incorporates semantic signals into keep/compress decisions beyond visual-similarity heuristics; co-designs retrieval with compression (instead of post-hoc retrieval after irreversible compression), preserving query-relevant semantic evidence under unpredictable query timing and improving real-time long-horizon streaming QA performance.
- Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
- 赛道归属: 视频理解奖励模型(Reward Modeling)/ 偏好数据与评测基准
- 核心创新点: 提出统一框架覆盖基准设计、偏好数据构建与奖励模型训练,发布VURB(Video Understanding Reward Bench):包含2,100组偏好对,并配套长链路推理(CoT)痕迹以提升监督信号密度与可诊断性;在此基础上训练更高性能的视频奖励模型,为视频生成/视频LLM对齐提供可复现、可量化的评测与训练基础,弥补视频域奖励建模长期缺少高质量基准与数据的问题。
- Track: Video reward modeling / preference data & benchmarking
- Key innovation: Establishes an end-to-end framework for benchmark design, preference data construction, and reward-model training; introduces VURB with 2,100 preference pairs plus long chain-of-thought traces to densify supervision and improve diagnosability; trains stronger video reward models, providing reproducible evaluation/training infrastructure for aligning video generators and Video-LLMs where robust benchmarks/data were previously lacking.
- EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
- 赛道归属: 具身/车载多模态理解(驾驶员状态监测)/ 注视点增强视频理解
- 核心创新点: 提出EyeCue,将眼动/注视信息作为关键中间表征融入第一视角车载视频理解,用于识别“认知分心”这一难以从外显动作判断的状态;核心洞察是认知分心体现在“注视与驾驶场景交互模式”的变化而非简单视线偏移,通过建模注视线索与场景语义/事件的耦合,提高对隐性分心的可检测性与鲁棒性。
- Track: Egocentric multimodal understanding for driver monitoring / gaze-augmented video understanding
- Key innovation: Proposes EyeCue, integrating gaze as a pivotal intermediate signal into egocentric driving video understanding to detect cognitive distraction—hard to infer from overt motions; leverages the insight that distraction manifests as altered gaze–scene interaction patterns (not merely gaze deviation), modeling gaze cues jointly with scene semantics/events to improve detection sensitivity and robustness.
- [2026-05-08] Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
- 赛道归属: 流式视频理解 / 主动响应(Proactive)视频LLM与结构化证据对齐
- 核心创新点: 提出Response-G1,用显式场景图(scene graph)把“随时间累积的视频证据”与“查询所需的响应条件”进行结构化对齐,解决以往隐式、与查询无关的证据建模导致的“何时该回答”困难;采用无需微调的三阶段流程:在线的查询引导场景图构建、证据随时间的结构化更新、以及基于场景图的响应触发判定,从而提升流式场景下的及时性与可控性。
- Track: Proactive streaming video understanding / structured evidence alignment
- Key innovation: Introduces Response-G1, explicitly aligning accumulated streaming video evidence with query-specific response conditions via scene graphs, addressing the “when to respond” challenge caused by implicit, query-agnostic evidence modeling; uses a fine-tuning-free three-stage pipeline—online query-guided scene-graph construction, structured temporal evidence updates, and scene-graph-based response triggering—improving timeliness and controllability in proactive streaming settings.
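The three-stage flow in the Response-G1 entry (query-guided scene-graph construction, structured temporal updates, scene-graph-based response triggering) can be sketched roughly as follows. Everything here is an illustrative assumption: scene graphs are plain sets of `(subject, relation, object)` triples, per-frame triples are taken as given rather than produced by a detector, and the function names are hypothetical.

```python
def query_entities(needed):
    """Collect every subject and object that the query condition mentions."""
    return {entity for subj, _, obj in needed for entity in (subj, obj)}


def build_frame_graph(frame_triples, entities):
    """Stage 1: keep only triples that touch an entity named in the query."""
    return {
        (s, r, o) for (s, r, o) in frame_triples if s in entities or o in entities
    }


def first_response_time(stream, needed):
    """Return (frame index at which to respond, accumulated graph).

    The index is None if the response condition is never met.
    """
    needed = set(needed)
    entities = query_entities(needed)
    graph = set()
    for t, frame_triples in enumerate(stream):
        # Stage 2: merge the filtered evidence from the newest frame.
        graph |= build_frame_graph(frame_triples, entities)
        # Stage 3: trigger once every required triple has been observed.
        if needed <= graph:
            return t, graph
    return None, graph


# Hypothetical per-frame triples for a short kitchen clip.
stream = [
    {("dog", "on", "sofa"), ("person", "holds", "cup")},
    {("person", "near", "sink")},
    {("cup", "in", "sink")},
]
query = [("person", "near", "sink"), ("cup", "in", "sink")]
t, graph = first_response_time(stream, query)
print(t, sorted(graph))
```

The sketch responds at the first frame where the accumulated graph covers the query condition, which is one simple way to make the "when to respond" decision explicit.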
GitHub
- [2026-05-11] Blaizzy/mlx-vlm ⭐4689
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-05-11] waybarrios/vllm-mlx ⭐1143
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-08] zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs-CLIP ⭐772
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
- [2026-05-10] dongyangli-del/EEG_Image_decode ⭐203
Using vision-language models to decode natural image perception from non-invasive brain recordings.
- [2026-05-08] ydyhello/Awesome-VLM-Streaming-Video ⭐154
📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.
强化学习 / Reinforcement Learning
arXiv
- How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
- 赛道归属: 大模型对齐与推理优化(RL后训练、KV Cache压缩/显存优化)
- 核心创新点: 提出“Shadow Mask Distillation”用于RL在线rollout阶段的KV cache压缩:通过蒸馏得到可学习的掩码/稀疏策略,在尽量不破坏对齐与长上下文推理质量的前提下,显著降低轨迹生成时KV缓存的显存占用,从而缓解长上下文RL后训练的“memory wall”,提升可扩展性与吞吐。
- Track: LLM alignment & inference optimization (RL post-training, KV-cache compression/memory efficiency)
- Core innovation: Proposes Shadow Mask Distillation to compress the KV cache during online RL rollouts by distilling a learnable masking/sparsification policy. This reduces the KV-memory footprint of trajectory generation while aiming to preserve alignment and long-context reasoning quality, easing the rollout “memory wall” and improving scalability and throughput.
- [2026-05-07] A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment
- 赛道归属: 大模型对齐(偏好学习RL、GRPO/PPO类稳定训练)
- 核心创新点: 构建统一的Pair-GRPO理论框架,将“隐式偏好约束”到“显式偏好约束”纳入同一族方法,并提出Soft-Pair-GRPO与Hard-Pair-GRPO两种紧耦合变体;通过更清晰的约束形式与梯度方向刻画,降低梯度方差、提升更新稳定性与可解释性,并增强跨任务/奖励形态的泛化鲁棒性。
- Track: LLM alignment (preference-based RL, stable GRPO/PPO-style optimization)
- Core innovation: Establishes a unified Pair-GRPO framework spanning implicit-to-explicit preference constraints, introducing tightly coupled Soft- and Hard-Pair-GRPO variants; by making preference constraints and gradient directions more explicit, it reduces gradient variance, improves update stability/interpretability, and strengthens generalization across tasks and reward formulations.
- Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
- 赛道归属: 大模型对齐与推理能力提升(LLM-as-a-judge、结构化奖励建模)
- 核心创新点: 提出Rubric-Grounded RL:将奖励分解为可验证的多维标准(rubric),由冻结的LLM裁判在辅助“grounding”信息条件下对各维度打分并加权汇总;用“部分得分/分项反馈”替代单一整体分或二元成败信号,提供更密集、更可控的优化梯度,从而提升推理训练的可泛化性与对奖励投机的抑制能力。
- Track: LLM alignment & reasoning improvement (LLM-as-a-judge, structured reward modeling)
- Core innovation: Introduces Rubric-Grounded RL, decomposing reward into weighted, verifiable criteria scored by a frozen LLM judge conditioned on auxiliary grounding; replaces binary/holistic rewards with multi-criterion partial credit to provide denser, more controllable learning signals, improving generalizable reasoning and reducing reward hacking.
- Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
- 赛道归属: 风险敏感强化学习(指数效用、折扣MDP的价值学习理论)
- 核心创新点: 针对固定风险厌恶下的指数效用目标,推导两类Q值形式的Bellman扩展,并证明相应算子在$L_\infty$与sup-log/Thompson等度量下为压缩映射,从而给出收敛性与不动点刻画;补齐指数效用RL中“可证明收敛的值迭代/Q学习式算法”理论空白,为风险敏感控制提供可实现的价值型方法。
- Track: Risk-sensitive RL (exponential utility, value-based theory in discounted MDPs)
- Core innovation: Derives two Q-value-style Bellman extensions for fixed risk-aversion exponential-utility objectives and proves the induced operators are contractions under $L_\infty$ and sup-log/Thompson-type metrics, yielding fixed-point characterization and convergence guarantees—filling a gap for principled, value-based algorithms in exponential-utility RL.
- [2026-05-08] ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression
- 赛道归属: 推理优化(CoT压缩、RL控制token成本/时延)
- 核心创新点: 提出ExpThink,用“经验引导”的自适应RL来压缩CoT:一方面根据训练过程中模型能力变化动态调整长度/成本权重(避免静态统一惩罚导致的欠/过压缩),另一方面引入面向题目难度的自适应机制,使不同样本获得不同的推理预算分配;在保持准确率的同时显著降低token消耗与推理延迟。
- Track: Inference optimization (CoT compression, RL for token/latency control)
- Core innovation: Proposes ExpThink, an experience-guided RL framework for adaptive CoT compression: it dynamically adjusts length/cost weighting as model capability evolves and allocates reasoning budget conditioned on problem difficulty, overcoming static uniform penalties to reduce tokens/latency while maintaining accuracy.
- [2026-05-08] BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
- 赛道归属: 多模态生成(图像字幕生成、MLLM的RL对齐与评测优化)
- 核心创新点: 提出BalCapRL,面向MLLM图像字幕的“平衡式”RL框架:针对现有RL与指标只优化单一质量维度导致的维度间权衡(如细节性、忠实性、流畅性等),通过更均衡的奖励设计/优化策略在多维目标间进行协调,提升字幕整体质量与稳健性,减少“只对某个指标过拟合”的问题。
- Track: Multimodal generation (image captioning, RL alignment & evaluation-aware optimization for MLLMs)
- Core innovation: Introduces BalCapRL, a balanced RL framework for MLLM image captioning that mitigates single-metric/single-dimension optimization by coordinating multiple caption quality dimensions (e.g., detail, faithfulness, fluency) via more balanced reward/optimization design, improving overall caption quality and robustness against metric overfitting.
- [2026-05-08] Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
- 赛道归属: 大模型对齐系统(多模型协同RL后训练、经验共享/分布式训练)
- 核心创新点: 提出Mutual Reinforcement Learning:在参数、目标与tokenizer各异的异构LLM之间进行并行RL后训练,并通过“类型化经验共享”提升样本效率;核心组件包括SEE(共享经验交换)、MWRA(多worker资源分配)与THL(tokenizer异构层,用重分词与token级轨迹对齐解决词表不兼容),把“跨模型共享rollout/优势等轨迹信息”变成可落地的训练基建。
- Track: LLM alignment systems (multi-model collaborative RL post-training, experience sharing/distributed training)
- Core innovation: Proposes Mutual Reinforcement Learning for concurrent RL post-training across heterogeneous LLMs with separate parameters/objectives/tokenizers, enabled by typed experience sharing; introduces SEE (Shared Experience Exchange), MWRA (multi-worker resource allocation), and THL (Tokenizer Heterogeneity Layer) to retokenize and align token-level traces across vocabularies, making cross-model sharing of rollouts/advantages practically feasible and sample-efficient.
- [2026-05-08] Improved Model-based Reinforcement Learning with Smooth Kernels
- 赛道归属: 模型式强化学习(连续空间、核平滑动力学建模与样本效率理论)
- 核心创新点: 提出基于“平滑核”的模型式RL新方法,用非参数核平滑估计转移动力学来替代低秩MDP等强结构假设;通过新的核平滑估计与规划/不确定性控制设计,在连续状态-动作场景下获得更一般的样本效率改进与理论保证,使“利用环境平滑性”的model-based范式更可证明、更可用。
- Track: Model-based RL (continuous spaces, kernel smoothing dynamics & sample-efficiency theory)
- Core innovation: Develops an improved model-based RL approach using smooth kernels and nonparametric kernel-smoothed transition estimation as an alternative to restrictive low-rank MDP assumptions; with new estimation and planning/uncertainty-handling design, it yields stronger, more general sample-efficiency improvements and theoretical guarantees in continuous state-action settings.
- [2026-05-08] Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
- 赛道归属: 大模型对齐(RL提升知识回忆、非CoT场景的能力增益分析)
- 核心创新点: 系统验证RL不仅提升推理,也能提升LLM“参数化知识”的直接回忆:在零样本、单跳、闭卷QA且禁止CoT的受控设定下,仅用二元正确性奖励训练,并做事实级去重以排除记忆/泄漏;由此将增益归因于更好的知识检索/调用而非推理链或数据重复,提出“RL可解锁知识回忆能力”的实证证据与评测范式。
- Track: LLM alignment (RL for parametric knowledge recall, capability analysis without CoT)
- Core innovation: Demonstrates in a tightly controlled setup that RL can improve direct parametric knowledge recall—not just reasoning—by training on binary correctness rewards in zero-shot, one-hop, closed-book QA without CoT and using fact-level deduplication to rule out memorization/leakage; provides evidence and an evaluation protocol attributing gains to improved recall/retrieval behavior.
- [2026-05-08] Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought
- 赛道归属: 强化学习理论与机制解释(In-Context RL、CoT增强的理论分析)
- 核心创新点: 首次从理论上刻画CoT如何促进In-Context Reinforcement Learning:在基于线性Transformer的policy evaluation框架下,分析“通过上下文实现无参数更新的适应”与“显式CoT轨迹”之间的耦合机制,并给出收敛/涌现条件,解释何时CoT能放大ICRL能力、何时不能,为ICRL+CoT的可预测设计提供理论依据。
- Track: RL theory & mechanistic understanding (in-context RL, theoretical analysis of CoT amplification)
- Core innovation: Provides the first theoretical account of how Chain-of-Thought interacts with In-Context RL: under a linear-Transformer policy-evaluation setting, it analyzes the coupling between context-based adaptation (no parameter updates) and explicit CoT trajectories, deriving conditions for convergence/emergence that explain when CoT amplifies ICRL and when it does not, enabling more predictable ICRL+CoT design.
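For the Rubric-Grounded RL entry in the list above, the idea of replacing a binary reward with weighted partial credit can be shown in a few lines. This is a minimal sketch under stated assumptions: `stub_judge` is a hypothetical string-matching stand-in for the frozen LLM judge, and the three criteria and their weights are invented for illustration.

```python
def rubric_reward(scores, weights):
    """Weighted average of per-criterion scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"judge did not score: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def stub_judge(answer, reference):
    """Hypothetical stand-in for a frozen LLM judge conditioned on a reference."""
    return {
        "final_answer_correct": 1.0 if reference in answer else 0.0,
        "shows_work": 1.0 if "because" in answer else 0.0,
        "concise": 1.0 if len(answer.split()) <= 30 else 0.0,
    }


# Illustrative weights; integer values keep the arithmetic exact.
WEIGHTS = {"final_answer_correct": 6, "shows_work": 3, "concise": 1}

good = "The result is 42 because 6 * 7 = 42."
bad = "The result is 41."

for answer in (good, bad):
    scores = stub_judge(answer, reference="42")
    print(answer, "->", rubric_reward(scores, WEIGHTS))
```

A binary reward would score the second answer as 0; the rubric still gives it a small amount of credit for conciseness, which is the denser signal the entry refers to.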
GitHub
- [2026-05-11] rllm-org/rllm ⭐5489
Democratizing Reinforcement Learning for LLMs
- [2026-05-11] agi-brain/xuance ⭐1066
XuanCe: A Comprehensive and Unified Deep Reinforcement Learning Library
- [2026-05-11] nvidia-cosmos/cosmos-rl ⭐417
Cosmos-RL is a flexible and scalable Reinforcement Learning framework specialized for Physical AI applications.
- [2026-05-11] javifalces/HFTFramework ⭐292
HFTFramework utilized for research on "A reinforcement learning approach to improve the performance of the Avellaneda-Stoikov market-making algorith...
- [2026-05-11] ZJU-REAL/SkillZero ⭐242
Official code for "SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization"
HuggingFace Datasets
- [2026-05-03] ADSKAILab/Zero-To-CAD-1m
Zero-to-CAD 1M
One million executable, interpretable CAD construction sequences synthesized entirely without real-world data.
...
- [2026-04-23] nvidia/Nemotron-Personas-Korea
Nemotron-Personas-Korea: 우리나라 실제 분포에 기반한 합성 페르소나를 위한 복합 AI 시스템 / A compound AI approach to personas grounded in real-world dist...
世界动作模型 / World Action Model
arXiv
- [2026-05-08] Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
- 赛道归属: 世界模型评测与可靠性诊断(World Action Model / 动态一致性)
- 核心创新点: 提出并系统化定义WAM可靠性的关键缺失维度——动作-状态一致性(action-state consistency),用于检验“模型生成的未来”是否与其声称的动作序列在动力学上相容,而不仅是视觉上合理;围绕该一致性构建诊断框架/评测思路,将WAM的失效从“看起来对”细化为“动力学不兼容”的可检测问题,从而为后续训练目标、校准与安全执行提供可操作的评价轴。
- Track: World-model evaluation & reliability diagnostics (World Action Model / dynamic consistency)
- Core innovation: Introduces and formalizes action–state consistency as a missing reliability axis for WAMs, testing whether imagined futures are dynamically compatible with the predicted action sequence rather than merely visually plausible; builds a diagnostic/evaluation perspective around this notion to make WAM failure modes measurable as dynamical incompatibility, enabling more actionable assessment for calibration, training objectives, and safe deployment.
- [2026-05-07] When to Trust Imagination: Adaptive Action Execution for World Action Models
- 赛道归属: 世界模型驱动的机器人控制(自适应执行 / 想象-现实一致性验证)
- 核心创新点: 将WAM的执行策略从“每次推理固定执行N步”提升为自适应动作执行:把是否继续执行想象动作序列建模为未来-现实验证(future-reality verification)问题;核心方法论是在执行过程中持续对比模型想象的未来与真实滚动的偏差/一致性,并据此动态决定执行更长的开环段还是提前重规划,从机制上缓解因想象漂移导致的失控与累积误差,实现“何时信任想象”的可决策化。
- Track: World-model-based robotic control (adaptive execution / imagination–reality verification)
- Core innovation: Replaces the standard “execute a fixed N predicted actions per inference” paradigm with adaptive action execution, formulating it as a future–reality verification problem; methodologically, it continuously checks consistency between imagined rollouts and real-world evolution during execution and uses this signal to decide whether to keep executing longer open-loop segments or replan early, mitigating imagination drift and compounding errors via an explicit trust-and-replan mechanism.
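The future-reality verification loop described in the entry above can be sketched on a toy 1-D system. The assumptions are mine, not the paper's: states are scalars, the consistency check is an absolute difference against a fixed `tolerance`, and `execute_adaptively` and `slippery_env` are hypothetical names.

```python
def execute_adaptively(actions, imagined_states, step_env, state, tolerance):
    """Execute imagined actions open-loop until reality drifts too far.

    Returns (steps executed, final real state, needs_replan).
    """
    executed = 0
    for action, imagined in zip(actions, imagined_states):
        state = step_env(state, action)
        executed += 1
        # Verification: compare the real state with the imagined future.
        if abs(state - imagined) > tolerance:
            return executed, state, True
    return executed, state, False


def slippery_env(state, action):
    """Toy real dynamics: only 80% of each commanded move is realized."""
    return state + 0.8 * action


def ideal_env(state, action):
    """Toy dynamics that match the world model exactly."""
    return state + action


# The world model imagines state + action at every step, starting from 0.
actions = [1.0, 1.0, 1.0, 1.0]
imagined = [1.0, 2.0, 3.0, 4.0]

print(execute_adaptively(actions, imagined, slippery_env, 0.0, tolerance=0.5))
print(execute_adaptively(actions, imagined, ideal_env, 0.0, tolerance=0.5))
```

With the slippery dynamics the drift accumulates and the loop stops early to request a replan; with matching dynamics the full open-loop segment is executed. A fixed-N policy would execute the same number of steps in both cases.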
GitHub
- [2026-05-11] DravenALG/awesome-vla-wam ⭐360
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent. 生成时间 / Generated at: 2026-05-11 08:41:18