AI 每日进展速报 / Daily AI Digest - 2026-06-03
图像生成/编辑 / Image Generation/Editing
arXiv
- Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
- 赛道归属: 文生图(偏好对齐/强化学习式对齐,组合生成)
- 核心创新点: 提出Region-aware的双模态直接偏好优化(BiDPO),将“偏好学习”从全图层面对齐推进到“区域级/关系级”的组合语义对齐:通过构建高质控的大规模偏好数据集BiComp,针对属性绑定、对象关系、计数等组合难点提供可学习的偏好信号;并在优化时显式利用区域感知与图文双模态信息,使模型在不改变基础生成范式的情况下,更稳定地满足复杂提示词的结构化约束与局部一致性。
- Track: Text-to-Image (preference alignment / RL-style alignment, compositional generation)
- Core innovation: Proposes BiDPO, a region-aware bimodal Direct Preference Optimization framework that upgrades preference learning from global image alignment to region-/relation-level compositional alignment. It builds a large-scale, strictly quality-controlled preference dataset (BiComp) targeting hard compositional skills (attribute binding, object relations, counting), and optimizes with explicit region awareness plus bimodal (text+image) signals to better satisfy structured constraints in complex prompts without changing the base generation paradigm.
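As background for the BiDPO entry above, the generic DPO objective that such methods build on can be sketched as follows. This is a minimal sketch of standard DPO on scalar log-likelihoods, not BiDPO's region-aware bimodal loss, whose exact form the summary does not give. The function name, `beta`, and the example values are illustrative assumptions; in diffusion models the log-likelihood terms are usually approximated rather than computed exactly.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-likelihoods of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    beta: strength of the implicit KL regularization (illustrative default).
    """
    # Implicit reward margin between preferred and rejected samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written so that exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy raises the preferred sample and lowers the rejected one: the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

A region-level variant would presumably compute such margins over localized evidence rather than whole images, but that design is an assumption here and is not confirmed by the summary.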
- PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation
- 赛道归属: 多条件文生图(扩散模型可控生成 / ControlNet增强)
- 核心创新点: 提出一种“动态Patch自适应”的多条件融合机制,在扩散去噪过程中按空间区域(patch)动态分配与调整不同控制信号的影响权重/注入方式,替代传统ControlNet为每种条件建立独立分支的静态融合范式;通过缓解多源异构条件之间的指导冲突,实现更强的组合式条件遵循(结构与语义同时对齐)并减少结构扭曲,在保持高画质的同时提升多条件一致性与可控性。
- Track: Multi-conditional text-to-image (diffusion controllable generation / ControlNet enhancement)
- Core innovation: Introduces a dynamic patch-wise adaptation scheme that modulates how multiple heterogeneous control signals are injected during diffusion denoising on a per-region (patch) basis, replacing the static multi-branch ControlNet-style fusion. By reducing inter-condition guidance conflicts, it improves compositional conditioning fidelity (better joint structural/semantic alignment) while mitigating distortions and preserving high visual quality.
- Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
- 赛道归属: 文生图安全对齐 / 推理时安全防护(Text-to-Image Safety Alignment at Inference)
- 核心创新点: 提出一种仅在推理阶段生效的安全防护机制,通过对输入提示词注入并优化“提示噪声”(prompt-noise) 来抑制不安全内容的生成;其关键突破在于把安全约束转化为可优化的推理时变量,无需重新训练/微调模型即可动态调整生成轨迹,从而提升对绕过式提示与对抗攻击的鲁棒性,并在尽量保持画质与文本一致性的前提下实现更稳定的安全过滤。
- Track: Text-to-Image safety alignment / Inference-time safety defense
- Core innovation: Introduces an inference-only safeguarding method that injects and optimizes prompt noise to steer diffusion sampling away from unsafe regions. The key methodological step is formulating safety control as an optimizable inference-time variable, avoiding retraining while improving robustness to jailbreak prompts and adversarial attacks, with minimal degradation to image quality and prompt fidelity.
- KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation 🆕NEW
- 赛道归属: 文生图(公平性/去偏见)、提示词优化(Prompt Refinement)
- 核心创新点: 提出以知识图谱(Knowledge Graph)为约束与检索支撑的提示词自动精炼框架,在不重训/不改动闭源T2I主干模型的前提下,通过对人口统计属性与职业/场景等语义关系的显式建模,系统性地补全或重写提示词中的敏感与相关属性表达,从而在生成阶段实现更均衡的人群呈现;方法重点在“结构化知识→可控prompt变换”的映射,降低仅靠启发式词替换带来的语义漂移,并兼顾公平性提升与文本意图保持。
- Track: Text-to-Image (fairness/de-biasing), Prompt Refinement
- Core innovation: Introduces a knowledge-graph-guided prompt refinement framework that improves demographic fairness without retraining or modifying (potentially closed-source) T2I backbones. By explicitly modeling relationships between demographic attributes and contextual semantics (e.g., occupations, settings), it automatically augments/rewrites prompts to enforce more balanced representation at inference time. The key methodological advance is mapping structured knowledge constraints into controllable prompt transformations, reducing semantic drift compared to heuristic word swaps while preserving the original intent.
- DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation
- 赛道归属: 文生图(Text-to-Image)/ 偏好对齐与奖励建模(Reward Modeling, RLHF/RLAIF)
- 核心创新点: 提出动态、准则感知(Dynamic Criterion-Aware)的奖励建模框架 DyCoRM,使奖励模型不再依赖固定的通用评分维度,而是能根据用户当前关注的评价准则(如美学、文本一致性、细节、风格等)动态调整评估与打分机制;通过将“评价准则”显式纳入奖励学习与推断过程,实现对多样化、个性化偏好的更精细建模,从而为文生图生成提供更可控、更贴合需求的优化信号,提升对齐效果与泛化到不同偏好场景的能力。
- Track: Text-to-Image Generation / Preference Alignment & Reward Modeling (Reward Modeling, RLHF/RLAIF)
- Key innovation: Proposes DyCoRM, a Dynamic Criterion-Aware reward modeling framework that moves beyond static, one-size-fits-all scoring dimensions by conditioning the reward model on the user’s active evaluation criteria (e.g., aesthetics, prompt faithfulness, detail, style) and dynamically adapting how images are assessed; by explicitly incorporating “criteria” into reward learning and inference, it enables finer-grained modeling of diverse and personalized preferences, providing more controllable and better-aligned optimization signals for T2I generation and improving generalization across preference scenarios.
- RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation 🆕NEW
- 赛道归属: 文生图(可控生成)、训练免(Training-free)空间控制/条件注入
- 核心创新点: 提出一种同时具备“结构+外观”双重约束的训练免空间控制方案,通过改进特征注入/融合机制,在扩散采样过程中更稳定地对齐条件图像的几何结构并保留外观细节;针对训练免注入常见的结构错位、条件泄漏(把条件图像纹理/噪声直接拷入结果)与伪影问题,引入更精细的分层/分步控制与抑制策略,使结构遵循与外观一致性可以解耦调节,从而在无需LoRA/微调的情况下获得更可靠的空间可控生成。
- Track: Controllable Text-to-Image, Training-free spatial control / condition feature injection
- Core innovation: Proposes a training-free spatial control method that is rich in both structure and appearance constraints. It improves feature injection/fusion during diffusion sampling to better align geometry from conditional inputs while preserving appearance details. To address common training-free issues—structural misalignment, condition leakage (copying conditional textures/noise), and artifacts—it introduces finer-grained, stage-/layer-wise control and suppression mechanisms, enabling decoupled tuning of structural adherence vs. appearance fidelity without LoRA or finetuning.
- Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation
- 赛道归属: 文生图评测(基准/指标,面向创作能力评估)
- 核心创新点: 提出Qwen-Image-Bench,将评测目标从传统“文本-图像一致性/基础画质”扩展到更贴近真实创作工作流的“从生成到创作”能力刻画:强调对真实世界重建的可信度与创意表达等更高阶维度,设计能区分模型在专业创作场景中关键能力差异的评测集合与判别框架,从而缓解现有benchmark对艺术实践需求覆盖不足、区分度不够的问题。
- Track: Text-to-Image evaluation (benchmark/metrics, creativity-oriented assessment)
- Core innovation: Introduces Qwen-Image-Bench to move beyond classic text-image alignment and basic visual quality, toward capabilities that matter in real creative workflows—faithful real-world reconstruction and genuine creative expression. It provides an evaluation suite and judging protocol aimed at better discriminating models on higher-level, practice-relevant skills that existing benchmarks under-represent.
- DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
- 赛道归属: 图像编辑(基于流模型/扩散式流程的训练免编辑,反演)
- 核心创新点: 提出DirectEdit,实现“步级准确”的反演以支持流式(flow-based)编辑:针对现有训练免编辑常见的反演-前向去噪流程中“时间步不匹配”导致的重建误差累积问题,DirectEdit在反演阶段对齐每一步的潜变量/时间步,使重建路径与编辑路径在对应step上严格一致,从而显著降低误差传播,提升重建保真度与编辑稳定性(尤其在多步编辑或强编辑强度下)。
- Track: Image editing (flow-based / diffusion-style pipeline, training-free editing, inversion)
- Core innovation: Proposes DirectEdit with step-level accurate inversion for flow-based editing. It addresses error accumulation caused by timestep-mismatched noisy latents in common inversion+forward denoising pipelines by aligning latents per step so reconstruction and editing trajectories are consistent at corresponding timesteps, reducing drift and improving reconstruction fidelity and editing robustness, especially for longer or stronger edits.
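The drift that the DirectEdit entry refers to can be illustrated with a toy example. The sketch below is not DirectEdit itself: it only shows why a naive Euler inversion followed by Euler denoising does not return to the starting point, because each direction evaluates the velocity at a different state and timestep. The velocity field `velocity` is a hand-picked stand-in for a learned flow model, and all names are hypothetical.

```python
def velocity(x, t):
    # Toy velocity field (an assumption for illustration only);
    # a real flow model would be a neural network v(x, t).
    return -x

def invert(x, steps):
    """Naive Euler inversion: step from t=0 to t=1 using v at the current state."""
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

def reconstruct(x, steps):
    """Euler denoising: step back from t=1 to t=0 using v at the later state."""
    dt = 1.0 / steps
    for k in range(steps, 0, -1):
        x = x - dt * velocity(x, k * dt)
    return x

def reconstruction_error(x0, steps):
    # Round trip: the two passes evaluate v at mismatched states,
    # so the result does not land exactly on x0.
    return abs(reconstruct(invert(x0, steps), steps) - x0)

coarse = reconstruction_error(1.0, steps=10)
fine = reconstruction_error(1.0, steps=100)
```

In this toy setting the error shrinks as the number of steps grows but never reaches zero, which is the kind of accumulated mismatch a step-level accurate inversion aims to remove.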
- FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation
- 赛道归属: 文生图(多模态推理增强的图像生成 / Unified MLLM for T2I)
- 核心创新点: 提出细粒度多模态推理框架,将统一式MLLM的“理解-生成”闭环能力用于文生图的自反思与自改写:不再停留在简单的提示词扩写或整体图文一致性打分,而是引入更细粒度的推理与评估信号(如对属性、关系、局部区域/对象级要点的逐项核对),驱动生成过程进行针对性的迭代修正,从而提升复杂指令下的可控性与语义一致性。
- Track: Text-to-Image Generation (multimodal reasoning-enhanced image generation / unified MLLM for T2I)
- Key innovations: Proposes a fine-grained multimodal reasoning framework that leverages a unified MLLM’s closed-loop “understand–generate” capability for self-reflection and self-refinement in T2I. Instead of relying on prompt augmentation or holistic image-text alignment scoring, it introduces finer-grained reasoning/evaluation signals (e.g., attribute-, relation-, and region/object-level checks) to guide targeted iterative corrections during generation, improving controllability and semantic faithfulness for complex prompts.
- Pinterest Canvas: Large-Scale Image Generation at Pinterest 🆕NEW
- 赛道归属: 工业级图像生成系统(大规模部署)、图像编辑/增强(生成式编辑)
- 核心创新点: 面向Pinterest产品级强约束场景,提出端到端的大规模图像生成与编辑系统化方案:通过在多模态大规模数据上进行针对“编辑/增强”任务的训练与系统工程化设计,弥补通用生成模型“可用但难控”的落地缺口;核心突破在于将模型能力、数据构建、训练目标与线上控制/质量保障机制协同设计,使生成结果在风格一致性、可控性、安全与稳定性等产品指标上可达可运营水平,而不仅依赖提示词或轻量推理技巧。
- Track: Production-scale image generation systems, Generative image editing/enhancement
- Core innovation: Presents a product-oriented, large-scale image generation and editing system for Pinterest, targeting use cases with strict controllability requirements where generic models are flexible but hard to steer. The key contribution is the co-design of model training (on diverse large-scale multimodal data with editing/enhancement objectives) and system-level controls/quality mechanisms for online deployment, achieving operational-grade controllability, consistency, safety, and stability beyond prompt-only or minor inference-time adaptations.
GitHub
- [2026-06-03] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12381
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-06-02] Light-Heart-Labs/DreamServer ⭐1880
Turn your PC, Mac, or Linux box into an AI server. LLM inference, chat UI, voice, agents, workflows, RAG, and image generation.
- [2026-06-02] etkecc/baibot ⭐229
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
- [2026-06-02] CorentinGS/chess ⭐85
chess is a set of go packages which provide common chess utilities such as move generation, turn management, checkmate detection, PGN encoding, UCI in...
- [2026-06-02] AEmotionStudio/ComfyUI-ShaderNoiseKSampler ⭐65 🆕NEW
Transform AI image generation from random exploration into deliberate artistic navigation. This advanced KSampler replacement blends traditional noise...
HuggingFace Datasets
- [2026-05-29] jasperai/monet
MONET (Massive, Open, Non-redundant and Enriched Text-to-image dataset) is a large-scale, curated image-text dat...
视频生成/编辑 / Video Generation/Editing
arXiv
- Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation 🆕NEW
- 赛道归属: 身份保持文本到视频生成(Reference-conditioned T2V / Video Generation)
- 核心创新点: 提出ST-DRC(Spatial-Temporal Decoupled Reference Conditioning)框架,将参考身份条件在空间与时间维度解耦注入视频扩散/生成过程:用空间侧的细粒度特征强化单帧身份细节(如脸部结构、纹理一致性),用时间侧的机制约束跨帧身份稳定与时序一致,从而在“文本语义可控性”和“低层身份保真度”之间实现更好的平衡;框架层面强调晚期/分阶段的条件融合以减少文本驱动对身份特征的干扰并提升长序列稳定性。
- Track: Identity-preserving text-to-video generation (reference-conditioned T2V / video generation)
- Key innovation: Proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that injects identity reference signals separately along spatial and temporal axes in the video generation (diffusion) process: spatial conditioning strengthens per-frame identity details (geometry/texture), while temporal conditioning enforces cross-frame identity stability and temporal coherence. The method emphasizes late/staged conditioning fusion to reduce interference from text semantics and improve long-range identity consistency.
- SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation 🆕NEW
- 赛道归属: 视频生成安全评测(Image-conditioned T2V Safety Benchmark / Evaluation)
- 核心创新点: 提出SafeGen-Bench,面向图像条件引导的文本到视频生成系统化评测其安全风险,补齐现有安全基准主要聚焦纯文本模式的缺口;通过覆盖非法/政治敏感/伦理风险等多类场景与触发方式,构建更贴近真实使用链路的测试集与评测协议,用于量化模型在“给定初始图像+文本”条件下的越界生成倾向与防护能力,从而推动安全对齐在I2V/T2V条件生成中的可比、可复现评估。
- Track: Safety benchmarking for image-conditioned text-to-video generation (evaluation/benchmark)
- Key innovation: Introduces SafeGen-Bench to systematically evaluate safety risks specifically in image-conditioned T2V settings, addressing the gap of prior benchmarks that mainly test text-only generation. It broadens risk coverage (illegal/political/ethical categories and triggers) and provides a more realistic evaluation protocol to quantify unsafe generation propensity and safety guard effectiveness under “input image + prompt” conditioning, enabling comparable and reproducible safety assessment.
- MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
- 赛道归属: 视频生成(文生视频 / 多智能体提示工程)
- 核心创新点: 提出多智能体提示精炼框架 MAVEN,面向“多文化/跨文化”文生视频的文化保真度问题,将文本提示分解为人物(person)、动作(action)、地点(location)等可控维度,并由专门代理并行/串行协作改写与补全文化关键信息;通过结构化分解降低单一提示对文化细节的丢失与歧义,提升同文化与跨文化场景下生成内容的文化一致性与可评测性。
- Track: Video Generation (Text-to-Video / Multi-agent Prompting)
- Key innovation: Introduces MAVEN, a multi-agent prompt-refinement framework targeting cultural fidelity in mono- and cross-cultural T2V. It decomposes prompts into controllable dimensions (person/action/location) handled by specialized agents in parallel or sequential workflows, explicitly enriching under-specified cultural attributes and reducing ambiguity that typical single-prompt pipelines cannot recover.
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
- 赛道归属: 文生视频(Text-to-Video)/ 3D一致性对齐(强化学习)
- 核心创新点: 通过强化学习而非结构改造来注入3D约束:将“几何一致性/世界约束”显式构造成奖励信号,对视频生成模型进行对齐优化,从而在不显著增加推理开销、保持可扩展性的前提下缓解几何不一致问题;同时构建面向“世界模拟”的纯文本数据集,用于更系统地覆盖可被3D约束检验的描述分布,提升对齐训练的有效性与泛化。
- Track: Text-to-Video / 3D-consistency alignment (Reinforcement Learning)
- Core innovation: Injects 3D constraints via RL-based alignment instead of architectural modifications: formulates geometric/world-consistency as explicit rewards to optimize a video generator, improving geometric coherence without adding substantial inference cost and preserving scalability; additionally introduces a world-simulation-oriented text-only dataset to better cover descriptions that are verifiable under 3D constraints, strengthening alignment and generalization.
- Knowledge-Intensive Video Generation 🆕NEW
- 赛道归属: 知识密集型文本到视频生成评测(Factuality/Helpfulness Evaluation for T2V)
- 核心创新点: 定义“知识密集型视频生成(KIVI)”任务:针对解释、流程、演示类信息检索式短提示,要求生成视频不仅好看还要事实正确且有用;构建KIVI-Bench(1080条提示)并提出面向事实性(factuality)与帮助性(helpfulness)的自动评测指标,且通过人工评测验证指标相关性,从评测体系上把T2V从感知质量扩展到“知识/实用性”维度,为后续引入检索增强、工具使用或知识对齐的T2V方法提供可量化目标。
- Track: Knowledge-intensive text-to-video generation evaluation (factuality/helpfulness)
- Key innovation: Formulates Knowledge-Intensive Video Generation (KIVI), where prompts request explanations/procedures/demonstrations and outputs must be factually correct and practically helpful, not just visually appealing. Releases KIVI-Bench (1,080 prompts) and proposes automatic metrics for factuality and helpfulness, validated via human studies, extending T2V evaluation from perceptual quality to knowledge/utility and enabling measurable targets for retrieval/tool-augmented or knowledge-aligned T2V models.
- OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
- 赛道归属: 视频生成(文生视频/扩散Transformer加速与部署优化)
- 核心创新点: 提出面向DiT视频生成的系统级效率方案,将“稀疏注意力 + 序列并行 + 低比特量化 + 强化学习”联合设计以在质量不降的前提下降本增效:1) 采用混合全注意力-稀疏注意力架构,用固定模式的 Skiparse-2D 在时空token维度做token级与group级稀疏连接,缓解全注意力二次复杂度;2) 引入稀疏序列并行(Sparse Sequence Parallelism)以更好匹配稀疏计算图,提升多卡吞吐与可扩展性;3) 使用 HiF8(8-bit)量化降低显存与带宽开销,面向推理/训练的硬件友好实现;4) 通过强化学习对生成策略/偏好进行对齐,在引入稀疏与量化后维持或提升感知质量与文本一致性。
- Track: Video generation (text-to-video / Diffusion-Transformer acceleration & deployment optimization)
- Core innovations: A system-level efficiency recipe for DiT-based video generation that jointly combines “sparse attention + sequence parallelism + low-bit quantization + RL” to reduce cost without sacrificing quality: 1) a hybrid full–sparse attention design using fixed-pattern Skiparse-2D to apply token-wise and group-wise sparsity over spatiotemporal tokens, mitigating quadratic attention cost; 2) Sparse Sequence Parallelism to better align distributed execution with sparse computation graphs for higher multi-GPU throughput and scalability; 3) HiF8 (8-bit) quantization to cut memory/bandwidth with hardware-friendly training/inference; 4) reinforcement learning-based alignment to preserve/improve perceptual quality and prompt faithfulness under sparsity/quantization constraints.
- Paris 2.0: A Decentralized Diffusion Model for Video Generation
- 赛道归属: 视频生成(去中心化训练 / 分布式扩散模型)
- 核心创新点: 提出首个通过去中心化计算预训练的视频扩散生成模型,将原本在图像上验证的去中心化扩散训练范式扩展到需要强时序一致性的文本生成视频任务;核心突破在于给出去中心化场景下实现时序连贯训练的配方与机制,使得无需单体GPU集群也能完成低分辨率T2V预训练,并在去中心化通信与优化约束下维持跨帧一致性与可训练性。
- Track: Video generation (decentralized training / distributed diffusion)
- Key innovation: Introduces the first video diffusion generator pre-trained via decentralized computation, extending decentralized diffusion training from images to temporally coherent text-to-video. The main methodological advance is a training recipe/mechanism that preserves temporal coherence under decentralized optimization and communication constraints, enabling low-res T2V pretraining without a monolithic GPU cluster.
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment
- 赛道归属: 图生视频生成(I2V)/ 强化学习式后训练(RLHF/RLAIF for generative models)
- 核心创新点: 提出TAGRPO用于I2V的稳健后训练,指出GRPO在I2V上“奖励不稳定/不持续提升”的关键症结在于视频生成的多步轨迹与奖励信号之间存在错位;方法上引入“直接轨迹对齐”(Direct Trajectory Alignment)的对比学习式目标,将高奖励样本的去噪/流匹配轨迹作为正样本对齐参照、低奖励轨迹作为负样本拉开,从而在不改变基础生成架构的情况下,更稳定地把奖励偏好注入到整段生成轨迹而非仅末端结果,提升可控性与一致性。
- Track: Image-to-Video generation (I2V) / RL-style post-training (RLHF/RLAIF for generative models)
- Core innovation: Proposes TAGRPO as a robust post-training framework for I2V, diagnosing that naïvely applying GRPO yields inconsistent reward gains due to misalignment between multi-step generation trajectories and reward signals. It introduces Direct Trajectory Alignment with a contrastive-learning-like objective: align denoising/flow-matching trajectories from high-reward samples as positives and push away low-reward trajectories as negatives, injecting preference into the whole trajectory (not just final frames) without changing the base architecture, improving stability and controllability.
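For readers unfamiliar with GRPO, the group-relative advantage that it computes per prompt can be sketched as below. This shows only the standard normalization step that TAGRPO builds on; the Direct Trajectory Alignment objective is not reproduced here. The function name and `eps` are assumptions, and implementations differ on whether they use the population or the sample standard deviation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: scalar rewards for several generations from the same prompt/image.
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some codebases use the sample std
    return [(r - mean) / (std + eps) for r in rewards]

# Three videos sampled from the same conditioning image, scored by a reward model.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples with above-average reward receive positive advantages and the rest negative ones, so the signal depends on the ranking inside the group rather than on the absolute reward scale.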
- Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- 赛道归属: 文生图/文生视频/图生视频(基础大模型体系与工程化)
- 核心创新点: 给出Kandinsky 5.0成体系的图像与视频基础模型家族,通过“分层产品线”覆盖不同算力与质量需求:6B级高分辨率图像模型(Image Lite)、2B级轻量快速的T2V/I2V(Video Lite)、19B级高质量视频模型(Video Pro)。技术价值在于将图像与10秒视频生成统一到可扩展的基础模型栈中,并通过不同规模与配置实现质量-速度-成本的可部署权衡,为实际应用提供从轻量到旗舰的可迁移方案与训练/推理配方。
- Track: Text-to-Image / Text-to-Video / Image-to-Video (foundation model family & systemization)
- Core innovation: Presents Kandinsky 5.0 as a structured family of foundation models spanning high-res image and 10-second video synthesis, organized into tiered lineups to cover different compute/quality regimes: 6B Image Lite, 2B fast/light Video Lite for T2V/I2V, and 19B Video Pro for top quality. The key contribution is a scalable, unified model stack with practical quality–latency–cost trade-offs and deployable recipes across sizes, enabling transfer across product tiers.
- Retrieve What's Missing: Coverage-Maximizing Retrieval for Consistent Long Video Generation 🆕NEW
- 赛道归属: 长视频生成一致性(Memory/Retrieval-augmented Autoregressive Video Generation)
- 核心创新点: 针对长时域自回归视频生成的几何一致性难题,提出“覆盖最大化(coverage-maximizing)”的检索策略:不再依赖相机位姿或视场重叠等粗粒度启发式,而是设计更能表达历史观测3D几何证据的表示,并在此基础上选择一组记忆帧,使其对当前生成所需的几何信息覆盖最充分;通过“检索什么(几何证据表征)+检索哪些(覆盖最大化选帧)”两点联合改进,提升长视频的结构稳定性与跨段一致性。
- Track: Consistent long video generation (memory/retrieval-augmented autoregressive video generation)
- Key innovation: Addresses long-horizon geometric consistency in autoregressive video generation with coverage-maximizing retrieval. Instead of coarse heuristics like camera poses or FoV overlap, it designs representations that better capture historical 3D geometric evidence, then selects memory frames to maximize coverage of the geometry needed for current generation. By jointly improving “what to retrieve” (geometric evidence) and “which frames to retrieve” (coverage-maximizing selection), it strengthens structural stability and long-range consistency in long videos.
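The "which frames to retrieve" step in the last entry above can be pictured as a maximum-coverage selection. The sketch below uses the classic greedy heuristic over sets of hypothetical geometry cell IDs; the paper's actual geometric representation and selection rule are not specified in the summary, so the data layout and function name are assumptions.

```python
def select_memory_frames(frame_coverage, needed, k):
    """Greedily pick up to k frames that cover as much needed geometry as possible.

    frame_coverage: dict mapping frame id -> set of geometry cell ids it observed
                    (a hypothetical stand-in for a 3D evidence representation).
    needed: set of cell ids relevant to the frame being generated.
    Returns (chosen frame ids, cells still uncovered).
    """
    remaining = set(needed)
    candidates = dict(frame_coverage)
    chosen = []
    for _ in range(k):
        # Pick the frame that adds the most not-yet-covered cells.
        best = max(candidates, key=lambda f: len(candidates[f] & remaining), default=None)
        if best is None or not (candidates[best] & remaining):
            break  # nothing left to choose, or no frame adds new coverage
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen, remaining

history = {"f0": {1, 2, 3}, "f1": {3, 4}, "f2": {1, 2}, "f3": {5}}
chosen, uncovered = select_memory_frames(history, needed={1, 2, 3, 4, 5}, k=2)
```

Note that `f2` is never picked even though it overlaps heavily with the needed cells, because `f0` already covers everything it would contribute; an overlap-only heuristic would not make that distinction.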
GitHub
- [2026-06-02] Anil-matcha/Open-Generative-AI ⭐17932
Open-source alternative to AI video platforms — Free AI image & video generation studio with 200+ models (Flux, Midjourney, Kling, Sora, Veo). No cont...
- [2026-06-02] hao-ai-lab/FastVideo ⭐3670
A unified inference and post-training framework for accelerated video generation.
- [2026-06-02] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1283
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-06-02] Anil-matcha/veo3.1-comfyui ⭐64 🆕NEW
ComfyUI custom nodes for Veo 3.1 video generation — text-to-video, image-to-video, reference-to-video, extend, and 4K upscale via MuAPI
- [2026-06-02] heygen-com/heygen-cli ⭐63
Create AI videos from the terminal. Official CLI for the HeyGen video generation API.
音频生成 / Audio Generation
arXiv
- FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations
- 赛道归属: 音频生成|零样本文本转语音(Zero-shot TTS)|可控生成(风格/音色解耦控制)
- 核心创新点: 通过解耦语音表征将语音分解为可解释属性(如内容、韵律/风格、音色等),并在零样本TTS中实现来自不同参考音频的分离式条件控制:用一段参考提供说话人音色、另一段参考提供说话风格/韵律,从而突破以往“单一参考同时绑定音色与风格”的耦合限制;方法上强调在表示学习与条件注入机制上实现属性独立性,使模型在保持高保真克隆的同时获得可组合、可编辑的控制能力。
- Track: Audio Generation | Zero-shot Text-to-Speech (TTS) | Controllable generation (disentangled style/timbre control)
- Core innovation: Introduces disentangled speech representations that factor speech into interpretable attributes (e.g., content, prosody/style, timbre) and enables separate-reference conditioning in zero-shot TTS: one reference supplies speaker timbre and another supplies speaking style/prosody. This removes the common entanglement in which a single reference prompt jointly determines both. Methodologically, it emphasizes attribute independence in both representation learning and the conditioning/injection mechanism, preserving cloning fidelity while enabling compositional, editable control.
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
- 赛道归属: 语音生成 / TTS 数据集与数据构建(低资源语言、多说话人)
- 核心创新点: 提出面向多说话人TTS训练的超大规模波斯语开源语音-文本语料库ParsVoice,并给出可扩展的数据构建流水线:从长篇有声书录音中自动切分与对齐高质量语音-文本对,核心在于结合面向波斯语的句级语义/完整性建模(如微调的ParsBERT用于句子补全/筛选)与质量控制策略,以在低资源语言场景下系统性提升对齐准确性、覆盖度与可用性,从而降低多说话人TTS与语音语言建模的数据门槛。
- Track: Audio Generation / TTS dataset & data pipeline (low-resource, multi-speaker)
- Core innovation: Introduces ParsVoice, the largest publicly available Persian speech–text corpus designed for multi-speaker TTS, together with a scalable pipeline to derive high-quality paired data from long-form audiobooks. The key methodological contribution is an automated segmentation/alignment and quality-control workflow that leverages Persian-specific sentence-level modeling (e.g., a fine-tuned ParsBERT for sentence completion/filtering) to improve alignment reliability, coverage, and usability in low-resource settings.
- ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
- 赛道归属: 文本到语音(TTS)/ 场景化语音生成(语音+环境声融合)
- 核心创新点: 提出环境感知TTS框架,通过多模态扩散Transformer显式建模语音与环境上下文(如场景/视觉/环境音提示)之间的跨模态交互,解决语音与环境声在声学形态与时间动态上的分布差异;并引入面向领域的表征对齐机制,将“语音生成表征”与“环境/场景表征”在统一空间中对齐,从而实现语音与环境声的自然共存与无缝融合(而非后期拼接)。
- Track: Text-to-Speech (TTS) / Scene-aware speech generation (speech + ambient sound integration)
- Core innovations: Proposes an environment-aware TTS framework that uses a multimodal Diffusion Transformer to explicitly model cross-modal interactions between speech and environmental context (e.g., scene/visual/ambient cues), addressing the distribution and temporal-dynamics mismatch between speech and environmental audio; introduces domain-specific representation alignment to map speech-generation features and environment/scene features into a shared space, enabling coherent in-scene speech generation rather than post-hoc mixing.
- UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
- 赛道归属: 统一音频生成与编辑(Text-to-Audio/TTS/音频编辑一体化,多任务扩散)
- 核心创新点: 用单一潜空间扩散模型统一覆盖文本到音频、文本到语音、零样本音色克隆、语音+音效混合生成、场景级音频编辑与时间编排等任务,实现“同权重多能力”;关键方法是层级式深度LLM融合(将LLM多层隐状态注入扩散网络以增强语义与结构控制)以及面向多任务的统一条件接口/训练范式,使生成与编辑在同一潜空间与同一推理管线内闭环完成,减少任务间割裂与模型堆叠。
- Track: Unified audio generation & editing (Text-to-Audio/TTS/audio editing; multi-task diffusion)
- Core innovations: Introduces a single latent diffusion model that unifies text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, and temporal composition under one set of weights; key is layer-wise deep LLM fusion—injecting multi-layer LLM hidden states into the diffusion network for stronger semantic/structural control—plus a unified conditioning/training scheme so generation and editing operate in the same latent space and inference pipeline, avoiding fragmented task-specific stacks.
- Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech 🆕NEW
- 赛道归属: 语音生成|文本到语音(TTS)|可解释情感控制(表示解析/可控生成)
- 核心创新点: 利用稀疏自编码器(SAE)对LLM-TTS的语义隐状态进行分解与稀疏表征学习,从模型内部表示中自动“挖掘/定位”与情感变化相关的稀疏特征(而非依赖外部情感条件或整体激活粗粒度操控)。该思路将情感控制从黑盒条件注入转为可解释的内部特征级干预:通过识别情感相关的稀疏方向/单元,实现更可诊断、可编辑的情感调节,并为理解情感在TTS隐空间中的编码方式提供机制化证据。
- Track: Speech Generation | Text-to-Speech (TTS) | Interpretable emotion control (representation analysis / controllable generation)
- Core innovation: Applies sparse autoencoders (SAEs) to decompose and sparsify semantic hidden states in LLM-based TTS, automatically isolating emotion-related sparse features from internal representations rather than relying on external emotion conditioning or coarse global activation steering. This reframes emotion control as interpretable, feature-level intervention: by identifying emotion-linked sparse directions/units, the method enables more diagnosable and editable emotion modulation and provides mechanistic insight into how emotion is encoded in the TTS latent space.
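The feature-level intervention described in this entry can be sketched with a tiny sparse autoencoder. This is a minimal illustration under stated assumptions: the weights are hand-set rather than trained, the feature index standing in for "emotion" is arbitrary, and replacing the hidden state by its edited reconstruction is only one of several possible intervention schemes. A real SAE would be trained with a reconstruction loss plus a sparsity penalty on the features.

```python
def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def sae_encode(x, W_enc, b_enc, b_dec):
    """Sparse features z = ReLU(W_enc (x - b_dec) + b_enc)."""
    centered = [a - b for a, b in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(z, W_dec, b_dec):
    """Reconstruction x_hat = W_dec z + b_dec."""
    return [r + b for r, b in zip(matvec(W_dec, z), b_dec)]

def steer(x, feature_idx, scale, W_enc, b_enc, W_dec, b_dec):
    """Rescale one sparse feature and decode back to the hidden-state space."""
    z = sae_encode(x, W_enc, b_enc, b_dec)
    z[feature_idx] *= scale
    return sae_decode(z, W_dec, b_dec)

# Toy 2-d hidden state with an identity SAE (hypothetical weights).
I2 = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
hidden = [1.0, 2.0]
amplified = steer(hidden, feature_idx=1, scale=2.0, W_enc=I2, b_enc=zeros, W_dec=I2, b_dec=zeros)
ablated = steer(hidden, feature_idx=1, scale=0.0, W_enc=I2, b_enc=zeros, W_dec=I2, b_dec=zeros)
```

Because each feature is meant to be sparse and individually interpretable, amplifying or ablating one of them is easier to diagnose than shifting the whole activation vector at once.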
- DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech 🆕NEW
- 赛道归属: 语音生成|文本到语音(TTS)|扩散/Flow-Matching 可控生成|即插即用情感控制
- 核心创新点: 发现预训练扩散与flow-matching TTS的冻结隐状态中,情感与说话人身份分别对应近似线性可解码且近乎正交的方向,从而提出DUET的“双空间”统一控制:在不重训主体模型的前提下,以plug-and-play方式在生成过程中沿情感方向进行可控操纵,同时尽量不扰动说话人方向以降低身份泄漏/纠缠。该方法论突破在于把“情感-身份解耦”具体化为可操作的几何结构(线性方向+近正交),并将其转化为跨扩散与flow-matching范式通用的推理期控制接口。
- Track: Speech Generation | Text-to-Speech (TTS) | Diffusion/Flow-Matching controllable generation | Plug-and-play emotion control
- Core innovation: Shows that in pretrained diffusion and flow-matching TTS, emotion and speaker identity correspond to (approximately) linearly decodable and nearly orthogonal directions in frozen hidden states. Based on this geometry, DUET introduces unified “dual-space” control: a plug-and-play inference-time manipulation that steers generation along the emotion direction while minimally perturbing the speaker direction to reduce identity–emotion entanglement, without retraining the backbone. The key methodological advance is operationalizing emotion–identity disentanglement as actionable latent geometry (linear directions + near-orthogonality) and turning it into a model-agnostic control interface across both diffusion and flow-matching TTS.
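The geometric idea in the DUET entry, moving along an emotion direction while leaving the speaker direction untouched, can be sketched with a simple projection. This is an illustrative assumption about how such steering could be done, not DUET's published procedure: the directions would in practice be estimated from frozen hidden states (for example with linear probes), and the vectors below are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Assumes v is non-zero; an emotion direction parallel to the
    # speaker direction would leave nothing to steer along.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def steer_emotion(hidden, emotion_dir, speaker_dir, alpha):
    """Shift a hidden state along the emotion direction, orthogonal to the speaker direction."""
    s = normalize(speaker_dir)
    # Remove any component of the emotion direction that lies along the speaker direction.
    overlap = dot(emotion_dir, s)
    e_perp = normalize([e - overlap * si for e, si in zip(emotion_dir, s)])
    return [h + alpha * e for h, e in zip(hidden, e_perp)]

hidden = [1.0, 2.0, 3.0]
speaker = [0.0, 1.0, 0.0]
emotion = [1.0, 1.0, 0.0]   # deliberately not orthogonal to the speaker direction
steered = steer_emotion(hidden, emotion, speaker, alpha=2.0)
```

The projection of the steered state onto the speaker direction is unchanged, which is the property that limits identity leakage when the two directions are only approximately orthogonal.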
- SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
- 赛道归属: 长文本零样本TTS / 对话式语音合成(多说话人、情感与一致性建模)
- 核心创新点: 面向长篇独白与多轮对话的零样本语音合成,针对“逐轮合成再拼接”导致的音色一致性、韵律连贯性与情绪连续性断裂问题,提出在单模型内联合建模跨轮次的对话上下文与表达状态(如情感/语气/节奏的持续变量),在生成时维持跨turn的声学一致与对话连贯;强调长程依赖与多说话人切换下的表达可控与稳定性,而非仅提升单句质量。
- Track: Long-form zero-shot TTS / Dialogue speech synthesis (multi-speaker, expressive consistency)
- Core innovations: Targets long-form monologue and multi-turn dialogue in zero-shot TTS, addressing the common “synthesize-per-turn then stitch” workaround that breaks timbre, prosody, and affect continuity; proposes single-model joint modeling of cross-turn dialogue context and persistent expressive states (e.g., emotion/intonation/rhythm as continuous trajectories), maintaining acoustic consistency and conversational coherence across turns while supporting multi-speaker switching and expressive control over long horizons.
- Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
- 赛道归属: 流式空间音频生成(视频/文本条件的Spatial Audio,低延迟生成)
- 核心创新点: 提出面向实时的流式空间音频生成统一框架,使用自回归扩散Transformer在“可流式输出”的约束下实现高保真生成,并强化与全景视频/文本提示的时序同步与空间一致性;核心突破在于把扩散生成改造为可在线推进的自回归/分段式推理范式,在降低推理延迟的同时保持空间线索(方位、距离、运动)建模精度,缓解“质量-延迟”权衡与多模态空间对齐困难。
- Track: Streaming spatial audio generation (video/text-conditioned spatial audio; low-latency)
- Core innovations: Proposes a unified streaming framework for real-time spatial audio generation conditioned on panoramic video and text, built on an autoregressive Diffusion Transformer to enable incremental (online) synthesis; key contribution is adapting diffusion-style generation to a streaming-compatible autoregressive/segmented inference scheme that preserves high fidelity while improving latency, and strengthening temporal synchronization and spatial consistency (direction/distance/motion cues) from multimodal inputs, mitigating the quality–latency tradeoff and multimodal spatial alignment challenges.
- Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
- 赛道归属: 流式零样本TTS / 推理加速(Block Diffusion并行解码)
- 核心创新点: 将预训练自回归TTS解码器微调为块扩散(block-diffusion)解码器,实现“块内并行、块间流式”的低延迟生成;针对离散语音token长尾分布导致的并行位置选择偏置(高频token主导、质量下降)问题,提出先验校准(prior-calibration)机制,在不大改架构的前提下校正并行采样的token先验/选择策略,从而兼顾并行带来的速度与接近自回归的自然度与稳定性。
- Track: Streaming zero-shot TTS / Inference acceleration (block-diffusion parallel decoding)
- Core innovations: Fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while keeping block-by-block streaming for low latency; identifies a discrete-speech-token long-tail issue where naive block diffusion biases parallel positions toward a few high-frequency tokens and degrades quality, and introduces prior calibration to correct the sampling prior/position-selection behavior without major architectural changes, preserving naturalness and stability while gaining speed.
- Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models
- 赛道归属: 文本到语音生成(TTS)/ 语音风格可控生成(Prompt-based Style Control)
- 核心创新点: 在现有“基于提示词的TTS”框架上,针对两类关键瓶颈提出方法级增强:①实现跨语句(inter-utterance)的细粒度风格属性连续可控与插值,使风格强度/属性可在不同句子间平滑调节而非离散切换;②实现单句内部(within-utterance)的时变风格控制,通过引入随时间变化的风格条件/调度机制,让模型不再只能施加全局单一风格,而能在同一句话中完成风格过渡与局部风格片段控制,从而扩展到需要“句内风格转场”的实际应用场景。
- Track: Text-to-Speech (TTS) / Controllable Speech Style Generation (Prompt-based Style Control)
- Core innovations: Proposes method-level extensions to existing prompt-based TTS to overcome two limitations: (1) enables fine-grained, continuous control and interpolation of style attributes across utterances (inter-utterance), allowing smooth adjustment of style intensity/attributes rather than coarse, discrete changes; (2) enables time-varying, within-utterance style control by introducing temporally scheduled/dynamic style conditioning, replacing a single global style per utterance with intra-utterance style transitions and localized style segment control—supporting practical scenarios requiring style changes inside one sentence.
GitHub
- [2026-06-02] huggingface/diffusers ⭐33762
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-06-02] SamurAIGPT/Generative-Media-Skills ⭐3367 🆕NEW
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi....
- [2026-06-01] BinWang28/audio-ai-hub ⭐925
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
- [2026-06-01] Ameobea/web-synth ⭐556
Browser-based DAW and audio synthesis platform with dozens of effects, synths, and modules
- [2026-06-02] apocas/restai ⭐509
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLM supported by Ollama/vLLM/etc. Precise embeddings usage, t...
语言大模型 / Large Language Models
arXiv
- Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning 🆕NEW
- 赛道归属: 多模态推理(MLLM Chain-of-Thought 对齐/微调优化)
- 核心创新点: 通过系统性实证分析指出多模态 CoT 在视觉推理中常“越想越错”,并归因于两类稳定失败模式:过早锁定答案(premature answer commitment)与对直接视觉证据利用不足(limited direct visual evidence usage)。在此基础上提出“注意力引导的微调”思路:利用/约束模型注意力分配,使推理步骤更聚焦于与当前推理相关的视觉区域与证据链,从训练层面纠正 CoT 生成时的证据对齐与决策时机问题,从而提升多模态逐步推理的可靠性与可解释性。
- Track: Multimodal reasoning (MLLM Chain-of-Thought alignment / fine-tuning optimization)
- Key innovation: Provides a systematic study showing that CoT prompting can hurt visual reasoning in MLLMs, and identifies two recurring failure modes: premature answer commitment and insufficient use of direct visual evidence. Building on these findings, it proposes an attention-guided fine-tuning strategy that steers/regularizes attention to align each reasoning step with the relevant visual regions and evidence, correcting evidence grounding and decision timing during CoT generation to improve step-wise multimodal reasoning robustness.
- COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
- 赛道归属: 公平性可控解码 / 推理阶段偏见抑制(LLM Decoding for Fairness in CoT)
- 核心创新点: 提出一种无需训练、仅在解码阶段生效的公平性控制方法 COFT,用于抑制链式思维(CoT)生成中的社会偏见放大。方法上以反事实提示构造 + 共形预测(Conformal)约束为核心:先将提示中的敏感片段替换为中性占位符形成“掩码反事实”输入,以获得相对去偏的参考分布;再在token 级别对原始解码分布施加公平性约束,并通过分布无关(distribution-free)的边际有效性保证(在 exchangeability 假设下)为公平控制提供可验证的统计保证,从而实现对任意冻结的因果语言模型在推理时的可控去偏解码。
- Track: Fairness-controlled decoding / Inference-time bias mitigation for CoT (LLM Decoding for Fairness in CoT)
- Key innovation: Introduces COFT, a training-free, decoding-time method to curb bias amplification in chain-of-thought generation. The technical core combines counterfactual prompt masking with conformal (distribution-free) constraints: it first replaces sensitive spans with neutral tokens to form a masked counterfactual prompt, yielding a debiased reference distribution; then it enforces token-level fairness control on the original decoding distribution, providing distribution-free marginal validity guarantees (under exchangeability) for any frozen causal LM—enabling verifiable, model-agnostic fairness control at inference time.
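The distribution-free marginal guarantee mentioned above is the kind that split conformal prediction gives under exchangeability. A minimal sketch of the threshold computation follows; it is not COFT's actual procedure. The function name is hypothetical, and the idea that each calibration score measures a gap between the original and masked-counterfactual token distributions is an assumption made here for illustration.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold at miscoverage level alpha.

    calibration_scores: nonconformity scores from held-out examples
    (ASSUMPTION: e.g. a divergence between original and counterfactual
    next-token distributions; the digest does not specify the score).
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold.
        return math.inf
    return sorted(calibration_scores)[rank - 1]
```

Under exchangeability, a fresh score falls at or below this threshold with probability at least 1 - alpha, whatever the underlying distribution is.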
- CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models
- 赛道归属: 推理优化(隐式CoT/潜空间推理、推理token化)
- 核心创新点: 提出CIRF,将传统“链式思维”从自然语言解释转为可复用的离散功能token序列来执行隐式推理:把推理过程模块化为功能单元并在推理时动态编排,以适配不同样例复杂度;同时强调与显式CoT的对齐,使隐式推理在降低推理开销的同时尽量保持可解释推理轨迹的一致性与可控性。
- Track: Reasoning optimization (implicit CoT / latent reasoning, tokenized reasoning)
- Core innovations: CIRF converts natural-language chain-of-thought into a sequence of reusable discrete functional tokens for implicit reasoning. It dynamically composes these functional units at inference time to match instance complexity, aiming to reduce inference cost while improving alignment with explicit CoT so latent reasoning remains consistent and controllable.
- Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
- 赛道归属: LLM辅助编译优化 / 张量程序优化数据集(程序优化 + 推理链监督)
- 核心创新点: 提出Step-TP,一个“可落地(grounded)到具体变换”的逐步级(step-level)数据集,用于将张量程序优化建模为可组合的序列决策过程;相较仅提供端到端优化前后程序对的既有数据,Step-TP提供可验证的中间变换步骤与对应的Chain-of-Thought推理监督,使每一步优化决策具备可解释性与可检查性,并避免token低效的表示方式,从而更适配LLM在迭代优化中的训练与评测(如逐步决策正确性、可组合性与可回放验证)。
- Track: LLM-guided compiler optimization / tensor program optimization dataset (program optimization + CoT supervision)
- Core innovation: Introduces Step-TP, a grounded step-level dataset that maps tensor program optimization to a composable sequential decision process. Unlike prior datasets that only provide end-to-end before/after optimized program pairs with token-inefficient representations, Step-TP supplies verifiable intermediate transformation steps together with Chain-of-Thought supervision, enabling interpretable and checkable optimization decisions at each step and better supporting LLM training/evaluation for iterative optimization (e.g., step correctness, composability, and replayable verification).
- MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning
- 赛道归属: 多模态理解(语音/音频大模型适配与低资源学习、In-Context Learning)
- 核心创新点: 提出一种面向听觉LLM的元学习式语音上下文学习框架(Meta Speech In-Context Learning),将“推理时用少量示例做ICL适配”作为核心适配机制,用元学习在训练阶段显式优化模型对示例集合的利用方式,从而在标注稀缺或训练-测试分布不匹配时,相比直接微调更稳健地实现快速域内适配与性能提升;强调训练免/轻训练的推理期自适应,降低低资源任务的适配成本并缓解微调脆弱性。
- Track: Multimodal Understanding (speech/audio LLM adaptation for low-resource settings, In-Context Learning)
- Core innovation: Proposes a meta-learning-based speech in-context learning framework (Meta Speech In-Context Learning) for auditory LLMs, treating inference-time adaptation via a few in-domain demonstrations as the primary adaptation mechanism. By meta-optimizing how the model leverages demonstration sets during training, it enables more robust and rapid in-domain adaptation under scarce labels or train–test distribution mismatch, mitigating the brittleness of direct fine-tuning while keeping adaptation largely training-free/lightweight at inference time.
- Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
- 赛道归属: 多模态大模型训练与OCR增强(多语言文本理解/视觉文本推理)
- 核心创新点: 提出面向真实场景视觉文本的多语言OCR增强训练框架:结合(1)大规模合成“OCR→翻译/理解”数据生成以覆盖复杂版式与噪声,(2)基于LoRA的OCR-aware监督微调以低成本注入视觉文本能力,(3)结构化的视觉提示与提示引导CoT推理以提升跨语言读图与文本推理的可控性与鲁棒性,系统性缓解MLLM在小字、遮挡、模糊与复杂字体上的失效。
- Track: Multimodal LLM training with OCR enhancement (multilingual visual-text understanding & reasoning)
- Core innovation: Presents a multilingual OCR-aware training pipeline combining (i) large-scale synthetic OCR-to-translation/understanding data generation for noisy real-world layouts, (ii) OCR-aware SFT with LoRA for efficient capability injection, and (iii) structured visual prompting plus prompt-guided CoT to improve controllability and robustness of multilingual visual-text reading and reasoning under clutter, blur, occlusion, and complex typography.
- River-LLM: Large Language Model Seamless Exit Based on KV Share
- 赛道归属: LLM推理加速 / 早退推理(Early Exit)与KV Cache机制优化
- 核心创新点: 提出River-LLM,通过“KV Share(跨层KV共享)”实现decoder-only大模型的无缝早退(seamless exit),针对早退在decoder架构中被“KV Cache缺失(跳过层无法产出后续token所需历史状态)”卡住的关键瓶颈;其方法核心是在允许跳层的同时,仍为后续解码提供一致、可用的KV缓存供给,从而把早退从“理论可跳层”推进到“工程可落地的端到端加速”,在不破坏自回归解码依赖的前提下降低推理时延。
- Track: LLM inference acceleration / Early-exit decoding with KV-cache mechanism optimization
- Core innovation: Proposes River-LLM, enabling seamless early exit in decoder-only LLMs via KV Share (cross-layer KV sharing). It targets the main bottleneck of early exit in decoder architectures—the KV Cache Absence problem, where skipped layers fail to produce the historical states required for subsequent tokens. By maintaining a consistent, usable KV supply even when layers are bypassed, it turns early-exit from a conceptual layer-skipping idea into an end-to-end deployable speedup without breaking autoregressive decoding dependencies, reducing inference latency.
- SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning 🆕NEW
- 赛道归属: 推理优化(CoT 长度自适应控制 / 高效推理)
- 核心创新点: 提出 SmartThinker 的“渐进式 CoT 长度校准”框架,针对长推理模型在不同难度问题上普遍存在的冗余与过度思考,突破点在于将“长度控制”从静态奖励(对所有样本一刀切)升级为随题目难度动态调整的策略。方法上通过逐步(progressive)校准推理链长度,使模型在简单问题上自动收敛到更短、更经济的推理,在困难问题上保留必要的长推理,从而在尽量不损失准确率的前提下显著降低输出冗余与推理成本,并弥补现有 GRPO 静态长度奖励无法自适应难度的缺陷。
- Track: Reasoning optimization (adaptive CoT length control / efficient inference)
- Key innovation: Introduces SmartThinker, a progressive CoT length calibration framework to reduce redundancy and overthinking in long-reasoning models. The key methodological advance is replacing static, one-size-fits-all length rewards (common in GRPO-based approaches) with a difficulty-adaptive mechanism that progressively calibrates reasoning length: it encourages short, cost-efficient reasoning on easy problems while preserving longer chains when needed for hard ones, improving efficiency with minimal accuracy degradation and addressing the non-adaptivity of static length reward designs.
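To make the contrast between a static and a difficulty-adaptive length reward concrete, here is a minimal sketch. It is not SmartThinker's algorithm: the function names, the linear penalty, the budget range, and the use of a per-prompt pass rate as the difficulty estimate are all assumptions chosen for illustration, and the progressive calibration schedule is not modeled.

```python
def length_penalty(num_tokens, budget, weight=0.1):
    """Linear penalty for tokens beyond the budget, relative to the budget."""
    overflow = max(0, num_tokens - budget)
    return weight * overflow / budget

def adaptive_budget(pass_rate, min_budget=256, max_budget=4096):
    """Map an estimated pass rate in [0, 1] to a token budget.

    ASSUMPTION: difficulty is approximated as 1 - pass_rate, so easy
    prompts get a short budget and hard prompts a long one.
    """
    difficulty = 1.0 - pass_rate
    return round(min_budget + difficulty * (max_budget - min_budget))

# A static scheme would call length_penalty with one fixed budget for all
# prompts; the adaptive variant picks the budget per prompt instead.
```

With these defaults, a 512-token answer is penalized on a prompt the model always solves but not on one it never solves.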
- GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning
- 赛道归属: 图基础模型(Graph Foundation Models)/ 图领域 In-Context Learning(ICL)/ 跨图泛化
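- 核心创新点: 提出一种不依赖LLM、无需微调(LLM-Free & Tuning-Free)的图基础模型框架,用于在极端异构图场景下实现类ICL的快速适配与跨图泛化。其方法论突破在于:针对不同图之间特征空间、标签集合与拓扑结构不一致带来的“任务/空间不对齐”问题,通过构建与具体图域无关的统一表示与对齐机制,使模型能够在不进行参数更新的前提下,仅依靠上下文示例完成对新图/新任务的推断与迁移,从而绕开现有GFM依赖文本化/LLM中介或需要额外调参的限制。
- Track: Graph foundation models (GFMs) / In-context learning (ICL) on graphs / Cross-graph generalization
- Core innovation: Proposes an LLM-free, tuning-free graph foundation model framework that achieves ICL-style fast adaptation and cross-graph generalization under extreme graph heterogeneity. It targets the task/space misalignment that arises when graphs differ in feature spaces, label sets, and topology: by building a unified representation and alignment mechanism that is independent of any specific graph domain, the model performs inference and transfer on new graphs and tasks from in-context examples alone, with no parameter updates. This sidesteps the reliance of existing GFMs on text serialization/LLM intermediaries or on additional tuning.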
- BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data
- 赛道归属: 后训练数据工程(CoT数据合成/标注流程设计)
- 核心创新点: 提出BC Protocol,用结构化的双专家对话来生成高质量CoT后训练数据:通过“专家-对抗/校验专家”式的分工与对话约束,系统性暴露并补全单专家写作中常见的“专家盲区”(跳步、默认常识),从流程层面提升推理链的完整性、可读性与可用于训练的稳定格式,相比偏好信号或众包标注更能产出深推理轨迹。
- Track: Post-training data engineering (CoT data synthesis / annotation protocol)
- Core innovations: BC Protocol introduces a structured dual-expert dialogue pipeline to elicit high-quality CoT data. By pairing an expert with a second expert focused on challenge/verification under explicit dialogue constraints, it mitigates the “expert blind spot” (skipped steps, implicit assumptions), producing more complete, consistent, training-ready reasoning traces than crowdsourcing or preference-only RLHF signals.
GitHub
- [2026-06-03] sgl-project/sglang ⭐28900
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-06-03] NVIDIA-NeMo/NeMo ⭐17292
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech ...
- [2026-06-03] NVIDIA/TensorRT-LLM ⭐13789
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-06-03] google-ai-edge/LiteRT-LM ⭐5304
LiteRT-LM is Google's production-ready, high-performance, open-source inference framework for deploying Large Language Models on edge devices.
- [2026-06-03] jonfairbanks/local-rag ⭐743 🆕NEW
Ingest files for retrieval augmented generation (RAG) with open-source Large Language Models (LLMs), all without 3rd parties or sensitive data leaving...
HuggingFace Models
- nvidia/Cosmos3-Nano 🆕NEW
- nvidia/Cosmos3-Super 🆕NEW
HuggingFace Datasets
- [2026-05-28] openbmb/UltraData-SFT-2605
UltraData-SFT-2605
📚 Introduction
Ult...
- [2026-05-28] openbmb/Ultra-FineWeb-L3
Ultra-FineWeb-L3
...
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
多模态大模型 / Multimodal Models
arXiv
- MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models 🆕NEW
- 赛道归属: 多模态理解(MLLM可解释性/表征分析与诊断)
- 核心创新点: 提出一套面向MLLM内部表征的系统化“显微镜”分析框架,沿Transformer层级同时刻画多模态token嵌入的线性度、内在维度与各向异性,并区分主干流与残差流进行对照诊断;在ScienceQA上对LLaVA-NeXT与OmniFusion做跨模型、跨模态的层间结构测量,揭示多模态token在不同流与不同层中呈现高度线性等隐藏结构特征,为后续的可解释性、压缩与对齐机制设计提供可量化的表征指标体系。
- Track: Multimodal understanding (MLLM interpretability / representation analysis & diagnostics)
- Core innovation: Introduces a “microscope”-style, layer-wise diagnostic framework to probe hidden representations in MLLMs by jointly measuring linearity, intrinsic dimension, and anisotropy of multimodal token embeddings, explicitly contrasting main vs. residual streams. Evaluated on ScienceQA with LLaVA-NeXT and OmniFusion, it provides cross-model, cross-modality structural measurements that uncover highly linear behaviors and other latent geometric properties, yielding actionable, quantitative representation metrics for interpretability, compression, and alignment design.
- Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness
- 赛道归属: 多模态安全与可信(开放世界异常检测/拒识、VLM鲁棒性)
- 核心创新点: 提出“语义自负(Hubris of Semantics)”作为开放世界部署中的关键失效模式:VLM会将未知异常强行映射到已知语义并高置信输出。方法上以“生成式语义抗体(Generative Semantic Antibodies)”为核心机制,为模型显式注入“负知识/反语义”以形成可拒识的决策边界,从而在不破坏原有零样本语义对齐能力的前提下提升开放世界可信性与异常处理能力。
- Track: Multimodal safety & trustworthiness (open-world anomaly detection/rejection, VLM robustness)
- Key innovation: Identifies “Hubris of Semantics” as a core open-world failure where VLMs over-confidently force unknown anomalies into known semantic classes. Introduces “Generative Semantic Antibodies” to explicitly inject negative knowledge/counter-semantics, shaping rejectable decision boundaries while preserving zero-shot semantic alignment, improving open-world trustworthiness.
- SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
- 赛道归属: 多模态理解(音频-视频时序理解评测/Benchmark)
- 核心创新点: 提出SONIC-O1作为面向真实世界音频-视频理解的系统性评测基准:以长时序、多领域对话场景为核心覆盖(60小时、231段、13个真实会话域),并采用全人工核验的数据与标注流程,旨在弥补现有评测偏静态图像、缺少对“音视频联合+时序推理”能力刻画的空白,从而更可靠地区分MLLM在真实音视频理解中的能力边界与失效模式。
- Track: Multimodal Understanding (Audio-Video Temporal Understanding Benchmark)
- Key Innovations: Introduces SONIC-O1, a real-world benchmark for systematic evaluation of MLLMs on sequential audio-video understanding. It emphasizes long-form temporal, multi-domain conversational scenarios (60 hours, 231 clips, 13 domains) with fully human-verified data/annotations, addressing the gap of prior benchmarks that over-focus on static images and under-measure joint audio-video temporal reasoning, enabling clearer diagnosis of capability limits and failure modes.
- Cross-modal linkage risk in clinical vision-language models 🆕NEW
- 赛道归属: 多模态安全与隐私(视觉-语言模型的链接攻击/成员关联风险评估)
- 核心创新点: 将临床VLM的隐私问题形式化为跨模态重链接(image-to-report linkage)风险:即模型学习到的共享嵌入空间可能保留实例级对应关系,使攻击者仅凭余弦相似度检索即可把去标识化影像重新关联到原始放射学报告;提出相应的威胁模型与评测设定,用以量化在“影像与报告被刻意分离共享/访问控制”的真实流程下,嵌入对齐带来的可重识别性,从而把“表征对齐能力”转化为可度量的隐私攻击面。
- Track: Multimodal security & privacy (vision-language linkage attacks / instance re-identification risk)
- Core innovation: Formalizes a clinical VLM privacy threat as cross-modal re-linkage (image-to-report linkage) risk: the shared embedding space can preserve instance-level correspondence, enabling attackers to re-associate a de-identified radiograph with its original report via cosine-similarity retrieval alone. It defines a concrete threat model and evaluation protocol aligned with real-world workflows where images and reports are intentionally separated, turning representation alignment strength into a measurable privacy attack surface.
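The cosine-similarity retrieval attack described above can be sketched in a few lines. This is an illustrative sketch, not the paper's evaluation code: the function names are hypothetical, and it assumes that image `i` and report `i` belong to the same study and that embeddings are non-zero plain lists of floats.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_linkage_rate(image_embs, report_embs):
    """Fraction of images whose nearest report (by cosine) is their own.

    ASSUMPTION: image_embs[i] and report_embs[i] come from the same study.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, rep) for rep in report_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(image_embs)
```

A rate well above the chance level of `1 / len(report_embs)` would indicate that the shared embedding space preserves instance-level correspondence.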
- Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
- 赛道归属: 多模态安全与鲁棒性(VLM 对抗攻击)
- 核心创新点: 提出一种面向视觉-语言模型的跨模态协同对抗框架,将纹理约束的图像扰动与跨模态联合优化结合:在视觉侧通过受限于纹理/局部统计特性的扰动提升隐蔽性与可迁移性,在语言侧通过与视觉扰动协同的目标设计/优化放大误导效应,从而在无需不现实的强白盒假设下实现更强的多模态攻击,系统性揭示 LVLM 在“多模态联动”攻击面前的脆弱性。
- Track: Multimodal Security & Robustness (Adversarial Attacks on VLMs)
- Key innovation: Proposes a cross-modal synergistic adversarial framework that couples texture-constrained image perturbations with cross-modal joint optimization. The visual perturbation is constrained by texture/local statistics to remain stealthy while improving transferability, and the language-side objective is co-optimized to amplify misalignment, enabling stronger multimodal attacks without relying on impractical strong white-box access and exposing LVLM fragility under coordinated multimodal threats.
- Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
- 赛道归属: 多模态理解(人类注视/社会注视预测评测基准)
- 核心创新点: 构建并系统评测VLM在“注视跟随(gaze following)”与“社会注视预测(social gaze prediction)”上的能力边界,强调该任务需要同时理解几何/物理场景与交互语境;通过基准化任务设定与指标,揭示现有VLM在注视相关推理中的可靠性缺口与典型失败模式,为后续面向注意力与行为理解的训练/对齐提供可复现的评测框架。
- Track: Multimodal Understanding (Human Gaze & Social Attention Benchmarking)
- Core innovations: Establishes a benchmark and systematic evaluation protocol for VLMs on gaze following and social gaze prediction, tasks requiring joint reasoning over physical scene geometry and social/interaction context. The work standardizes settings and metrics, surfaces reliability gaps and common failure modes in current VLMs, and provides a reproducible evaluation framework to guide future training/alignment for attention and behavior understanding.
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
- 赛道归属: 多模态理解(VLM幻觉抑制、跨模态融合/注意力机制改进)
- 核心创新点: 从“视觉注意力汇聚/沉没(attention sink)”角度解释幻觉:并非简单的“语言先验过强”,而是视觉注意力被任务无关区域吸走导致视觉证据未被有效融合。提出利用“注视转移(gaze shifts)”信号来指导跨模态融合增强:通过建模视线在关键区域间的动态转移,重分配视觉-文本对齐时的注意力与融合权重,避免仅按原始注意力分数做放大而加剧偏置,从机制上降低不可证实内容生成。
- Track: Multimodal understanding (VLM hallucination mitigation, cross-modal fusion/attention)
- Key innovation: Reframes hallucination via a “visual attention sink” mechanism—visual attention is diverted to irrelevant regions, preventing evidence from being fused. Uses “gaze shifts” as guidance signals to enhance cross-modal fusion by modeling dynamic transitions between salient regions, reweighting alignment/fusion beyond naive attention amplification, thereby reducing unsupported generations.
- VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments
- 赛道归属: 具身智能与机器人定位(语义全局定位、VLM+概率滤波/Monte Carlo Localization)
- 核心创新点: 将VLM的开放词汇语义理解引入Monte Carlo Localization(MCL)框架,面向“几何与语义都高度混淆”的准静态室内环境(如货架平行通道、重复家具)提升全局定位鲁棒性。核心在于用VLM生成/评估与场景观测一致的语义证据,并将其作为观测模型或粒子权重更新信号,与传统几何/外观特征互补,从而在几何别名严重、语义长尾且遮挡杂乱的场景中实现更稳定的语义级全局定位。
- Track: Embodied AI & robot localization (semantic global localization, VLM + probabilistic filtering/MCL)
- Key innovation: Integrates open-vocabulary semantic understanding from VLMs into a Monte Carlo Localization pipeline to handle quasi-static indoor environments with strong geometric/semantic aliasing. Uses VLM-derived semantic evidence as an observation/weighting signal for particle updates, complementing geometric/appearance cues to improve robustness under severe aliasing, long-tail semantics, and clutter/occlusion.
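The role of the semantic evidence as a weighting signal can be illustrated with one measurement update of a particle filter. This is a sketch under stated assumptions, not VLM-GLoc's implementation: the function name is hypothetical, and the two likelihood lists are assumed to be computed upstream (the geometric one from a range sensor model, the semantic one from a VLM-based scorer) and treated as conditionally independent.

```python
def update_particle_weights(weights, geometric_lik, semantic_lik):
    """One MCL measurement update using two likelihood sources.

    ASSUMPTION: the likelihoods are conditionally independent given the
    pose, so they multiply. Returns normalized weights.
    """
    unnormalized = [
        w * g * s for w, g, s in zip(weights, geometric_lik, semantic_lik)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        # No particle explains the observation; fall back to uniform.
        return [1.0 / len(weights)] * len(weights)
    return [u / total for u in unnormalized]

# Two particles in geometrically identical aisles: geometry cannot separate
# them, but the semantic likelihood can.
new_weights = update_particle_weights([0.5, 0.5], [1.0, 1.0], [0.9, 0.1])
```

In the aliased-aisle example the geometric term is uninformative, so the posterior weights follow the semantic term alone.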
- ES-Merging: Biological MLLM Merging via Embedding Space Signals 🆕NEW
- 赛道归属: 多模态模型融合(模型合并/参数高效跨模态统一,生物科学MLLM)
- 核心创新点: 提出ES-Merging,用嵌入空间信号(embedding space signals)来指导生物领域MLLM的合并:不再依赖输入无关的参数空间启发式,而是利用各模型在嵌入空间中体现的模态专长与对齐特征来决定合并策略/权重,从而更忠实地保留不同单模态模型的能力并实现跨模态统一;该思路把“模态专门化”从难以观测的参数差异,转化为可直接度量与可优化的表征信号,提高合并后的跨模态任务适配性。
- Track: Multimodal model merging (parameter-efficient cross-modal unification for biological MLLMs)
- Core innovation: Proposes ES-Merging, a model-merging method for biological MLLMs guided by embedding-space signals rather than input-agnostic parameter-space heuristics. By leveraging representation-level cues that reflect modality specialization and alignment, it determines merging behavior/weights to better preserve complementary single-modality strengths while forming a unified cross-modal model. The key methodological shift is making “modality specialization” observable and optimizable through measurable embedding signals, improving post-merge cross-modal capability.
- EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models 🆕NEW
- 赛道归属: 推理优化(LVLM视觉token压缩/高效推理,图像与视频理解)
- 核心创新点: 提出EvoCut,一种多层演化感知(evolution-aware)的视觉token压缩方法:不同于仅在单层用注意力分数或表征属性估计重要性的做法,EvoCut显式建模token在视觉编码器多层中的演化轨迹/稳定性,据此进行更可靠的token重要性评估与裁剪;通过跨层信息整合缓解“层特定指标不完整”导致的误删关键token问题,在保持理解性能的同时显著降低视觉token数量与推理开销,适用于大规模图像/视频LVLM的高效部署。
- Track: Inference optimization (visual token compression for efficient LVLMs in image/video understanding)
- Core innovation: Introduces EvoCut, a multi-layer evolution-aware visual token compression approach. Instead of estimating token importance from attention or representation statistics at a single layer, it models how tokens evolve across multiple encoder layers (e.g., trajectory/stability) to derive more reliable importance scores for pruning. By aggregating cross-layer evidence, it reduces erroneous removal caused by layer-specific criteria, cutting visual token counts and inference cost while better preserving understanding performance for large-scale image/video LVLM deployment.
GitHub
- [2026-06-01] Blaizzy/mlx-vlm ⭐4856
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-06-02] NVlabs/Eagle ⭐1872
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
- [2026-05-31] waybarrios/vllm-mlx ⭐1290
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-31] ydyhello/Awesome-VLM-Streaming-Video ⭐167
📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.
- [2026-06-01] facebookresearch/VLM3 ⭐139
Official implementation of paper "VLM³: Vision Language Models Are Native 3D Learners".
HuggingFace Models
HuggingFace Datasets
- [2026-06-01] ReasonCore/open-spatial-reasoning
Open Spatial Reasoning
A multiple-choice dataset of spatial reasoning questions and answers for evaluating 3D spatial reasoning from si...
强化学习 / Reinforcement Learning
arXiv
- CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts 🆕NEW
- 赛道归属: 多领域LLM强化学习对齐(跨域冲突缓解 / 奖励建模)
- 核心创新点: 提出CARE-RL,将“协议感知”的生成式奖励与“能力感知”的优化联合起来解决多领域RL中的两类关键瓶颈:一是非可验证任务奖励不可靠,二是跨领域能力相互干扰。方法上通过Protocol-Aware Generative Reward Model(PA-GRM)在提示/协议层面构造更稳健的奖励信号以覆盖不可验证场景,并在优化阶段引入能力维度的约束/加权机制,使更新更聚焦于目标能力、减少对其他领域能力的负迁移,从而系统性缓解cross-domain conflicts。
- Track: Multi-domain LLM RL alignment (cross-domain conflict mitigation / reward modeling)
- Key innovations: Proposes CARE-RL, combining protocol-aware generative reward construction with capability-aware optimization to tackle two core issues in multi-domain RL: unreliable rewards for non-verifiable tasks and capability interference across domains. It introduces a Protocol-Aware Generative Reward Model (PA-GRM) that builds more robust reward signals at the prompt/protocol level for non-verifiable settings, and a capability-aware optimization scheme that constrains/weights updates along capability dimensions to focus learning on target skills while reducing negative transfer to other domains.
- Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
- 赛道归属: 自监督强化学习 / 目标条件长时序规划(Goal-conditioned RL)
- 核心创新点: 提出Survival Reinforcement Learning(SRL)作为对比式自监督RL(CRL)的替代范式,用在线分类式目标判别取代对比损失,规避对比学习在长时序规划中“uniformity–tolerance”两难导致的表征退化/目标区分不足问题;将“survival value learning”扩展为通过最大化到达目标后的驻留时间(dwell time)来学习可用于长视野目标条件控制的价值信号,从而在深网络可扩展性与长时序可规划性之间取得更稳健的折中。
- Track: Self-supervised RL / Goal-conditioned long-horizon planning
- Core innovation: Proposes Survival Reinforcement Learning (SRL) as an alternative to contrastive self-supervised RL by replacing contrastive objectives with an online classification-based signal, mitigating the contrastive “uniformity–tolerance” dilemma that hurts long-horizon goal discrimination and planning. It extends survival value learning by maximizing dwell time at target goals, yielding a planning-friendly value signal while retaining strong depth-scaling behavior.
- A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models
- 赛道归属: 逆强化学习(IRL)理论 / 离线RL与结构计量经济学(DDC)统一视角
- 核心创新点: 以讲义形式系统梳理IRL的基础,并将熵正则IRL与结构计量中的动态离散选择模型(Dynamic Discrete Choice, DDC)在数学结构上进行对齐:从“由专家离线数据反推奖励/偏好”的角度,统一讨论可辨识性、似然/最大熵目标、价值函数与策略的对应关系,以及由此带来的估计与推断框架;其方法论价值在于提供跨社区的同构映射与推导路径,便于将DDC的统计推断工具与IRL的优化视角互相迁移。
- Track: Inverse Reinforcement Learning theory / Unifying Offline RL–IRL with Dynamic Discrete Choice (DDC)
- Core innovation: A foundations-focused note that aligns entropy-regularized IRL with dynamic discrete choice (DDC) models at the level of objectives and solution structure. It frames reward recovery from expert offline data through a unified lens (identifiability, likelihood/max-entropy criteria, value–policy correspondences), enabling methodological transfer between econometric inference in DDC and optimization-centric IRL formulations.
- Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
- 赛道归属: 鲁棒强化学习 / 非可实现环境下的安全RL(对抗/策略依赖环境建模)
- 核心创新点: 提出并实证验证“Infra-Bayesian(下层贝叶斯)”RL智能体,用一种比经典贝叶斯/频率派RL更保守的信念更新与决策准则来应对模型失配(misspecification)与环境对策略的反应(policy-dependent / 预判型对手)。方法上关键在于:不再假设存在真实环境落在模型类中,而是以更弱的可实现性前提构造可学习的决策规则,使策略在最坏情形下具有更强鲁棒性(worst-case robustness),从而在涉及人类/预测器/其他智能体的安全场景中优于经典RL的脆弱性表现。
- Track: Robust RL / Safety RL under non-realizable, policy-dependent environments (adversarial/strategic settings)
- Core innovation: Introduces and empirically validates Infra-Bayesian RL agents that replace classical Bayesian/frequentist assumptions with a more conservative belief-update and decision criterion tailored to misspecification and policy-dependent (anticipatory) environments. The key methodological shift is to drop the realizability assumption (true environment in the model class) and design a learnable decision rule with stronger worst-case robustness, yielding improved performance in safety-relevant settings involving humans, predictors, other agents, or institutions.
- ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
- 赛道归属: 多模态理解(图像长文本描述对齐 / 细粒度奖励建模的强化学习)
- 核心创新点: 提出以“视觉主张(visual claims)”为单位的细粒度强化学习框架:不再用整段caption的单一标量奖励,而是将描述拆解为可对齐到图像证据的原子主张,并通过“主张级视觉对比/验证”来产生更密集、更可归因的训练信号;从而显式区分并优化“事实性(减少幻觉)”与“信息覆盖(不遗漏细节)”之间的权衡,缓解长文本caption中序列级奖励过度压缩导致的信用分配与训练不稳定问题。
- Track: Multimodal Understanding (long-form image caption alignment / fine-grained reward modeling in RL)
- Core innovation: Introduces a visual-claim–level RL framework: instead of a single sequence-level scalar reward for an entire caption, it decomposes captions into atomic, image-groundable visual claims and generates denser, attributable learning signals via claim-level visual comparison/verification. This makes the trade-off between factuality (reducing hallucinations) and coverage (capturing salient details) explicitly optimizable, mitigating reward granularity and credit-assignment issues in long-form caption RL.
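One simple way to expose the factuality/coverage trade-off at the claim level is an F-beta score over claims. The sketch below is an assumption-laden illustration, not ClaimDiff-RL's reward: the function name is hypothetical, and claim extraction and verification are assumed to happen upstream, producing one boolean per caption claim and a count of reference claims the caption covers.

```python
def claim_level_reward(claim_supported, covered_reference, total_reference,
                       beta=1.0):
    """F-beta over atomic claims.

    claim_supported: one bool per claim in the generated caption
        (True = supported by the image); precision stands in for factuality.
    covered_reference / total_reference: how many reference claims the
        caption covers; recall stands in for coverage.
    beta > 1 favors coverage, beta < 1 favors factuality.
    """
    if not claim_supported or total_reference == 0:
        return 0.0
    precision = sum(claim_supported) / len(claim_supported)
    recall = covered_reference / total_reference
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because each claim contributes separately, a single hallucinated detail lowers the reward without erasing credit for the correct ones, which is the point of moving away from one scalar per caption.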
- RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network 🆕NEW
- 赛道归属: 医学影像多模态生成(胸部影像报告生成)/ 强化学习用于文本生成
- 核心创新点: 提出RL-ACRGNet,将强化学习引入胸部放射学报告生成的训练框架,以缓解纯监督学习在“疾病识别准确性”和“报告表述质量/一致性”上的不足。方法层面通过将临床相关的序列级目标(如报告整体质量、关键病灶描述覆盖等)显式作为RL优化信号,直接优化生成报告的全局指标而非仅做token级似然拟合,从而提升对细粒度病灶信息的捕获与报告生成的临床可用性与一致性。
- Track: Medical multimodal generation (chest radiology report generation) / RL for text generation
- Key innovations: Introduces RL-ACRGNet, integrating reinforcement learning into chest radiology report generation to address limitations of purely supervised training in disease detection accuracy and report quality/consistency. Methodologically, it optimizes clinically meaningful sequence-level objectives (e.g., overall report quality and coverage of key findings) as RL signals, directly targeting global report metrics rather than token-level likelihood alone, improving fine-grained pathology capture and clinical usability/consistency of generated reports.
- Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning
- 赛道归属: 多模态理解(复杂场景视觉推理)/ Agentic 强化学习
- 核心创新点: 提出一种以“放大镜”式信息获取为核心的智能体强化学习框架,让MLLM在复杂拥挤场景中通过主动、迭代的视觉聚焦与证据收集来提升推理可靠性;相较依赖标注框等显式视觉提示的方法,该思路用RL学习“看哪里、看多细、看几次”的策略,在避免额外标注的同时缓解低分辨率裁剪丢失细节的问题,从而增强细粒度识别与多步推理能力。
- Track: Multimodal understanding (complex-scene visual reasoning) / Agentic Reinforcement Learning
- Core innovation: Introduces an “agentic magnifying-glass” RL framework that trains an MLLM to actively and iteratively acquire visual evidence (where/what to zoom into and how to refine) for reliable reasoning in cluttered, high-density scenes. Unlike prior approaches that inject explicit cues (e.g., annotated boxes) and suffer from detail loss in low-res crops, it learns a sequential visual-attention/inspection policy via RL, improving fine-grained perception and multi-step reasoning without extra annotations.
- GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation
- 赛道归属: 文生矢量图(Text-to-SVG)/ 结构化图表生成 / 强化学习约束生成
- 核心创新点: 提出几何感知的强化学习框架,将SVG图表生成中的“可用性”问题显式建模为布局与几何约束优化:通过对连接线端点对齐、文本与边界/元素的非重叠、画布边界约束等几何规则进行可微或可评估的约束度量,构造面向结构有效性的奖励信号;在生成过程中用RL对策略进行优化以减少结构脆弱错误(如漂移、错连、越界),从而提升可编辑、可落地的专业级SVG图表输出稳定性。
- Track: Text-to-Vector Graphics (Text-to-SVG) / Structured diagram generation / RL for constrained generation
- Key innovations: Introduces a geometry-aware RL framework that explicitly optimizes “usability” of generated SVG diagrams under layout/geometry constraints. It formulates alignment of connector endpoints, text–shape/border non-overlap, and canvas-boundary constraints as measurable (and potentially differentiable) geometric criteria to build structure-validity rewards, then applies RL-based policy optimization during generation to reduce fragile structural failures (misconnections, drift, out-of-bounds), improving robustness and editability of professional-grade SVG outputs.
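The non-overlap and canvas-boundary criteria above are easy to state as checkable rules. The following is a minimal sketch of a structure-validity score, not GeoSVG-RL's reward: the function names, the `(x, y, width, height)` box representation, and the equal weighting of all checks are assumptions, and connector-endpoint alignment is left out.

```python
def boxes_overlap(a, b):
    """Axis-aligned boxes as (x, y, width, height); shared edges don't count."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def inside_canvas(box, canvas_w, canvas_h):
    """True if the box lies fully within a canvas anchored at (0, 0)."""
    x, y, w, h = box
    return x >= 0 and y >= 0 and x + w <= canvas_w and y + h <= canvas_h

def layout_reward(boxes, canvas_w, canvas_h):
    """1.0 minus the fraction of violated checks.

    Checks: one canvas-bounds check per box, one overlap check per pair.
    ASSUMPTION: all checks are weighted equally.
    """
    checks, violations = 0, 0
    for i, box in enumerate(boxes):
        checks += 1
        violations += not inside_canvas(box, canvas_w, canvas_h)
        for other in boxes[i + 1:]:
            checks += 1
            violations += boxes_overlap(box, other)
    return 1.0 - violations / checks if checks else 1.0
```

In practice the boxes would come from parsing or rendering the generated SVG; here they are passed in directly to keep the sketch self-contained.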
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning 🆕NEW
- 赛道归属: LLM智能体强化学习(Agentic RL)/ 策略优化算法
- 核心创新点: 提出StepPO(Step-Aligned Policy Optimization),针对现有LLM-RL普遍采用token为基本优化粒度而与智能体“按步骤(observation-action循环)决策”的粒度不匹配问题,改为以“步骤”作为对齐与优化的核心单位。方法突破在于将信用分配与策略更新从token层提升到step层,使奖励/优势估计与环境交互的决策边界一致,从而更贴合agentic行为结构,减少由token级噪声与粒度错配带来的优化偏差,提升多步任务中的决策稳定性与学习效率。
- Track: LLM agent reinforcement learning (Agentic RL) / policy optimization
- Key innovations: Proposes StepPO (Step-Aligned Policy Optimization) to resolve the granularity mismatch where existing LLM RL optimizes at the token level while agents act via step-wise observation–action cycles. The key advance is elevating alignment, credit assignment, and policy updates to the step level so that reward/advantage estimation matches decision boundaries in environment interaction, reducing token-level noise and mismatch-induced bias, and improving stability and sample efficiency in multi-step agent tasks.
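The difference between token-level and step-level credit assignment can be shown with a small sketch: compute one value per agent step, then give every token inside that step the same value. This illustrates the general idea only and is not StepPO's estimator; the function names are hypothetical, and discounted return-to-go is used here as a stand-in for whatever advantage the paper defines.

```python
def step_returns(step_rewards, gamma=1.0):
    """Discounted return-to-go for each step of one trajectory."""
    returns, running = [], 0.0
    for reward in reversed(step_rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

def broadcast_to_tokens(step_values, tokens_per_step):
    """Assign each token the value of the step it belongs to."""
    token_values = []
    for value, n_tokens in zip(step_values, tokens_per_step):
        token_values.extend([value] * n_tokens)
    return token_values

# Three steps (2, 1, and 3 tokens), reward only at the final step.
per_step = step_returns([0.0, 0.0, 1.0], gamma=0.5)
per_token = broadcast_to_tokens(per_step, [2, 1, 3])
```

All tokens within one action share a single learning signal, so the unit of credit matches the unit at which the environment responds.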
- A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL 🆕NEW
- 赛道归属: 多领域LLM强化学习理论(跨域干扰机理 / 可恢复性分析)
- 核心创新点: 提出用于解释多领域RL中跨域性能干扰的新理论框架“局部扰动理论(Local Perturbation Theory)”。相较于“灾难性遗忘”或“全局梯度冲突”的传统解释,该工作指出:即使不同任务的全模型梯度近乎正交,单领域RL仍可能通过对少量参数/子结构产生稀疏但幅度大的局部更新,引发跨域能力的显著退化。方法论贡献在于把干扰与恢复刻画为局部参数扰动及其传播效应,从而更精确地解释“正交梯度仍干扰”的现象,并为设计更细粒度的干扰抑制与恢复策略提供理论依据。
- Track: Theory for multi-domain LLM RL (cross-domain interference mechanism / recovery analysis)
- Key innovations: Proposes a Local Perturbation Theory to explain cross-domain interference in multi-domain RL. Unlike catastrophic forgetting or global gradient-conflict accounts, it shows that even when full-model gradients are nearly orthogonal, single-domain RL can induce sparse yet high-magnitude local updates to a small set of parameters/substructures, causing substantial degradation in other domains. The methodological contribution is modeling interference and recovery as effects of localized parameter perturbations and their propagation, explaining “interference despite orthogonal gradients” and providing a theoretical basis for finer-grained mitigation and recovery strategies.
GitHub
- [2026-06-03] OpenPipe/ART ⭐9883
Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for...
- [2026-06-02] rllm-org/rllm ⭐5590
Democratizing Reinforcement Learning for LLMs
- [2026-06-03] google-deepmind/dm_control ⭐4606 🆕NEW
Google DeepMind's software stack for physics-based simulation and Reinforcement Learning environments, using MuJoCo.
- [2026-06-02] radixark/miles ⭐1476
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-06-03] Red-Hat-AI-Innovation-Team/training_hub ⭐81 🆕NEW
An algorithm-focused interface for common llm training, continual learning, and reinforcement learning techniques
HuggingFace Datasets
- [2026-05-29] stanford-vision-lab/gpic
GPIC: A Giant Permissive Image Corpus for Visual Generation. Keshigeyan Chandrasegaran, Kyle Sargent, Suchi...
- [2026-06-01] VCLab-PolyU/GGT-100K 🆕NEW
GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration
Real-world LQ–HQ pairs from MFMs to expand IR generalizatio...
世界动作模型 / World Action Model
arXiv
- WALL-WM: Carving World Action Modeling at the Event Joints 🆕NEW
- 赛道归属: 世界动作模型(World Action Model)/ 视觉-语言-动作预训练(Vision-Language-Action Pretraining)/ 视频动作建模
- 核心创新点:
- 中文:提出从“固定长度动作块(chunk)”转向“语义事件(event)”的世界动作建模范式,将语义连贯的动作事件作为最小学习单元,在事件连接点(event joints)处刻画动作的自然边界与状态转移,从而缓解 chunk 粒度与真实动作结构不匹配带来的学习偏差。方法上以事件为锚点进行视觉-语言-动作联合预训练,使模型学习到更符合人类语义分段的动作表征与跨事件的因果/时序衔接能力,相比直接对当前观测+指令做 chunk 级预测,更强调事件级结构化监督与可组合性。
- English: Introduces an event-grounded paradigm for World Action Models, replacing fixed-length action chunks with semantically coherent action events as the atomic learning unit. By modeling transitions at event joints (natural boundaries between events), it addresses the granularity mismatch inherent in chunk-centric optimization and better captures state changes and temporal/causal continuity. The approach performs Vision-Language-Action pretraining anchored on events, encouraging structured, compositional action representations and improved cross-event linkage, rather than directly predicting chunk-level actions conditioned only on the current observation and instruction.
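The difference between fixed-length chunks and event-level units in the WALL-WM entry can be sketched on a toy action sequence. The function names, the string action labels, and the assumption that event joints are supplied as start indices are all hypothetical; how WALL-WM actually detects event boundaries is not stated in this digest.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def fixed_chunks(actions: Sequence[T], size: int) -> List[List[T]]:
    """Split a trajectory into consecutive chunks of at most `size` actions."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(actions[i:i + size]) for i in range(0, len(actions), size)]


def event_segments(actions: Sequence[T], joints: Sequence[int]) -> List[List[T]]:
    """Split a non-empty trajectory at event joints.

    `joints` holds the index where each new event starts; it must be
    strictly increasing and lie strictly inside the trajectory.
    """
    bounds = [0] + list(joints) + [len(actions)]
    if any(start >= end for start, end in zip(bounds, bounds[1:])):
        raise ValueError("joints must be strictly increasing and in range")
    return [list(actions[start:end]) for start, end in zip(bounds, bounds[1:])]


# Toy trajectory (assumed labels): reach for three ticks, grasp once, lift for four.
trajectory = ["reach"] * 3 + ["grasp"] + ["lift"] * 4

chunks = fixed_chunks(trajectory, 4)          # first chunk mixes reach and grasp
events = event_segments(trajectory, [3, 4])   # one segment per semantic event
```

With a chunk size of 4 the first chunk straddles the reach/grasp boundary, while the event split keeps each semantic action in its own variable-length unit.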
- OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
- 赛道归属: 机器人操作(Vision-Language-Action / World Action Model)、动作空间建模与对齐、SE(3)轨迹预测
- 核心创新点: 提出OASIS,通过在中间表征中显式引入并对齐动作空间的刚体几何结构,缓解以往WAM/VLA主要停留在观测空间表征、导致动作解码器需“隐式恢复”SE(3)几何的问题;核心做法是将中间表示与SE(3)轨迹预测绑定,使策略在表示层面具备与动作同构的刚体运动先验,从而实现观测-动作空间对齐,降低动作解码难度并提升机器人操作的可学习性与泛化。
- Track: Robotic manipulation (Vision-Language-Action / World Action Model), action-space modeling & alignment, SE(3) trajectory prediction
- Core innovation: Proposes OASIS to explicitly align intermediate representations with the rigid-body geometry of the action space, addressing a common limitation of prior WAM/VLA approaches whose representations largely remain in observation space and force the action decoder to implicitly reconstruct SE(3) structure. The key idea is to couple the latent representation with SE(3) trajectory prediction, injecting action-isomorphic rigid-motion priors at the representation level, which simplifies action decoding and improves learnability and generalization for robotic manipulation.
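For readers unfamiliar with the prediction target named in the OASIS entry, the following sketch shows what an SE(3) trajectory is: a sequence of rigid-body poses obtained by composing relative motions. It is not OASIS's model; the `Pose` representation, the helper names, and the square-path example are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Pose = Tuple[Matrix, List[float]]  # (3x3 rotation matrix, 3-vector translation)


def rot_z(theta: float) -> Matrix:
    """Rotation by `theta` radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def compose(a: Pose, b: Pose) -> Pose:
    """Rigid-body composition: apply motion `b` expressed in the frame of `a`."""
    rot_a, trans_a = a
    rot_b, trans_b = b
    rot = [[sum(rot_a[i][k] * rot_b[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
    trans = [sum(rot_a[i][k] * trans_b[k] for k in range(3)) + trans_a[i]
             for i in range(3)]
    return rot, trans


def rollout(start: Pose, deltas: List[Pose]) -> List[Pose]:
    """Accumulate relative motions into an absolute SE(3) trajectory."""
    poses = [start]
    for delta in deltas:
        poses.append(compose(poses[-1], delta))
    return poses


identity: Pose = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                  [0.0, 0.0, 0.0])
# Each delta moves one unit forward, then turns 90 degrees about z.
quarter_turn: Pose = (rot_z(math.pi / 2), [1.0, 0.0, 0.0])
trajectory = rollout(identity, [quarter_turn] * 4)
```

Four such deltas trace a unit square and bring the pose back to the start, which illustrates why composition in SE(3), rather than independent per-coordinate regression, preserves rigid-motion structure.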
GitHub
- [2026-05-31] DravenALG/awesome-vla-wam ⭐678
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent. 生成时间 / Generated at: 2026-06-03 01:02:21