AI 每日进展速报 / Daily AI Digest - 2026-06-02
图像生成/编辑 / Image Generation/Editing
arXiv
- Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
- 赛道归属: 文生图(偏好对齐/强化学习式对齐,组合生成)
- 核心创新点: 提出Region-aware的双模态直接偏好优化(BiDPO),将“偏好学习”从全图层面对齐推进到“区域级/关系级”的组合语义对齐:通过构建高质控的大规模偏好数据集BiComp,针对属性绑定、对象关系、计数等组合难点提供可学习的偏好信号;并在优化时显式利用区域感知与图文双模态信息,使模型在不改变基础生成范式的情况下,更稳定地满足复杂提示词的结构化约束与局部一致性。
- Track: Text-to-Image (preference alignment / RL-style alignment, compositional generation)
- Core innovation: Proposes BiDPO, a region-aware bimodal Direct Preference Optimization framework that upgrades preference learning from global image alignment to region-/relation-level compositional alignment. It builds a large-scale, strictly quality-controlled preference dataset (BiComp) targeting hard compositional skills (attribute binding, object relations, counting), and optimizes with explicit region awareness plus bimodal (text+image) signals to better satisfy structured constraints in complex prompts without changing the base generation paradigm.
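For readers new to the objective BiDPO builds on, the following is a minimal sketch of the standard single-pair DPO loss. It is not the paper's region-aware bimodal variant: the function name `dpo_loss`, the `beta` default, and the scalar log-probability inputs are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    beta: illustrative temperature (an assumption, not taken from the paper).
    """
    # How much more the policy favours the preferred sample than the reference does.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the loss equals log 2, and it falls as the policy shifts probability toward the preferred sample. A region-level variant would presumably compute such terms per region or relation, but the digest does not specify how.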
- PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation
- 赛道归属: 多条件文生图(扩散模型可控生成 / ControlNet增强)
- 核心创新点: 提出一种“动态Patch自适应”的多条件融合机制,在扩散去噪过程中按空间区域(patch)动态分配与调整不同控制信号的影响权重/注入方式,替代传统ControlNet为每种条件建立独立分支的静态融合范式;通过缓解多源异构条件之间的指导冲突,实现更强的组合式条件遵循(结构与语义同时对齐)并减少结构扭曲,在保持高画质的同时提升多条件一致性与可控性。
- Track: Multi-conditional text-to-image (diffusion controllable generation / ControlNet enhancement)
- Core innovation: Introduces a dynamic patch-wise adaptation scheme that modulates how multiple heterogeneous control signals are injected during diffusion denoising on a per-region (patch) basis, replacing the static multi-branch ControlNet-style fusion. By reducing inter-condition guidance conflicts, it improves compositional conditioning fidelity (better joint structural/semantic alignment) while mitigating distortions and preserving high visual quality.
- Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization 🆕NEW
- 赛道归属: 文生图安全对齐 / 推理时安全防护(Text-to-Image Safety Alignment at Inference)
- 核心创新点: 提出一种仅在推理阶段生效的安全防护机制,通过对输入提示词注入并优化“提示噪声”(prompt-noise) 来抑制不安全内容的生成;其关键突破在于把安全约束转化为可优化的推理时变量,无需重新训练/微调模型即可动态调整生成轨迹,从而提升对绕过式提示与对抗攻击的鲁棒性,并在尽量保持画质与文本一致性的前提下实现更稳定的安全过滤。
- Track: Text-to-Image safety alignment / Inference-time safety defense
- Core innovation: Introduces an inference-only safeguarding method that injects and optimizes prompt noise to steer diffusion sampling away from unsafe regions. The key methodological step is formulating safety control as an optimizable inference-time variable, avoiding retraining while improving robustness to jailbreak prompts and adversarial attacks, with minimal degradation to image quality and prompt fidelity.
- DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation
- 赛道归属: 文生图(Text-to-Image)/ 偏好对齐与奖励建模(Reward Modeling, RLHF/RLAIF)
- 核心创新点: 提出动态、准则感知(Dynamic Criterion-Aware)的奖励建模框架 DyCoRM,使奖励模型不再依赖固定的通用评分维度,而是能根据用户当前关注的评价准则(如美学、文本一致性、细节、风格等)动态调整评估与打分机制;通过将“评价准则”显式纳入奖励学习与推断过程,实现对多样化、个性化偏好的更精细建模,从而为文生图生成提供更可控、更贴合需求的优化信号,提升对齐效果与泛化到不同偏好场景的能力。
- Track: Text-to-Image Generation / Preference Alignment & Reward Modeling (Reward Modeling, RLHF/RLAIF)
- Key innovation: Proposes DyCoRM, a Dynamic Criterion-Aware reward modeling framework that moves beyond static, one-size-fits-all scoring dimensions by conditioning the reward model on the user’s active evaluation criteria (e.g., aesthetics, prompt faithfulness, detail, style) and dynamically adapting how images are assessed; by explicitly incorporating “criteria” into reward learning and inference, it enables finer-grained modeling of diverse and personalized preferences, providing more controllable and better-aligned optimization signals for T2I generation and improving generalization across preference scenarios.
- Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation
- 赛道归属: 文生图评测(基准/指标,面向创作能力评估)
- 核心创新点: 提出Qwen-Image-Bench,将评测目标从传统“文本-图像一致性/基础画质”扩展到更贴近真实创作工作流的“从生成到创作”能力刻画:强调对真实世界重建的可信度与创意表达等更高阶维度,设计能区分模型在专业创作场景中关键能力差异的评测集合与判别框架,从而缓解现有benchmark对艺术实践需求覆盖不足、区分度不够的问题。
- Track: Text-to-Image evaluation (benchmark/metrics, creativity-oriented assessment)
- Core innovation: Introduces Qwen-Image-Bench to move beyond classic text-image alignment and basic visual quality, toward capabilities that matter in real creative workflows—faithful real-world reconstruction and genuine creative expression. It provides an evaluation suite and judging protocol aimed at better discriminating models on higher-level, practice-relevant skills that existing benchmarks under-represent.
- DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
- 赛道归属: 图像编辑(基于流模型/扩散式流程的训练免编辑,反演)
- 核心创新点: 提出DirectEdit,实现“步级准确”的反演以支持流式(flow-based)编辑:针对现有训练免编辑常见的反演-前向去噪流程中“时间步不匹配”导致的重建误差累积问题,DirectEdit在反演阶段对齐每一步的潜变量/时间步,使重建路径与编辑路径在对应step上严格一致,从而显著降低误差传播,提升重建保真度与编辑稳定性(尤其在多步编辑或强编辑强度下)。
- Track: Image editing (flow-based / diffusion-style pipeline, training-free editing, inversion)
- Core innovation: Proposes DirectEdit with step-level accurate inversion for flow-based editing. It addresses error accumulation caused by timestep-mismatched noisy latents in common inversion+forward denoising pipelines by aligning latents per step so reconstruction and editing trajectories are consistent at corresponding timesteps, reducing drift and improving reconstruction fidelity and editing robustness, especially for longer or stronger edits.
- FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation
- 赛道归属: 文生图(多模态推理增强的图像生成 / Unified MLLM for T2I)
- 核心创新点: 提出细粒度多模态推理框架,将统一式MLLM的“理解-生成”闭环能力用于文生图的自反思与自改写:不再停留在简单的提示词扩写或整体图文一致性打分,而是引入更细粒度的推理与评估信号(如对属性、关系、局部区域/对象级要点的逐项核对),驱动生成过程进行针对性的迭代修正,从而提升复杂指令下的可控性与语义一致性。
- Track: Text-to-Image Generation (multimodal reasoning-enhanced image generation / unified MLLM for T2I)
- Key innovations: Proposes a fine-grained multimodal reasoning framework that leverages a unified MLLM’s closed-loop “understand–generate” capability for self-reflection and self-refinement in T2I. Instead of relying on prompt augmentation or holistic image-text alignment scoring, it introduces finer-grained reasoning/evaluation signals (e.g., attribute-, relation-, and region/object-level checks) to guide targeted iterative corrections during generation, improving controllability and semantic faithfulness for complex prompts.
- ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization
- 赛道归属: 文生图个性化(少样本定制/概念注入,扩散模型)
- 核心创新点: 提出ACCORD,用“依赖性正则化(dependence regularization)”直接缓解个性化中的概念耦合:针对少量参考图导致模型把目标概念与背景/风格/共现物体等产生非期望绑定的问题,在训练/微调目标中显式惩罚个性化token与无关概念之间的统计依赖或表征耦合,从机制上提升“文本可控性—个性化保真度”的可调平衡,减少过拟合式的错误联想与提示词失控。
- Track: Text-to-Image personalization (few-shot customization / concept injection, diffusion)
- Core innovation: Proposes ACCORD, introducing dependence regularization to directly mitigate concept coupling in personalization. With few reference images, models often form spurious bindings between the target concept and incidental co-occurring attributes (background/style/objects). ACCORD explicitly penalizes unwanted statistical/representation dependence between the personalized token and irrelevant concepts during adaptation, improving the controllability–fidelity trade-off and reducing overfitting-driven prompt hijacking.
- Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation 🆕NEW
- 赛道归属: 布局到图像生成(Layout-to-Image)/ 小样本与异常布局鲁棒生成(Few-shot Atypical L2I)
- 核心创新点: 针对小样本异常布局下的“表示碎片化”(representation fragmentation)问题,提出以表示学习为中心的生成框架,将对象的语义身份(semantics) 与可渲染的视觉基元(primitives) 进行解耦建模,缓解语义与细节粒度不匹配导致的破碎/扭曲;通过这种“语义—基元”分离的中间表征,实现对少量样本与非典型空间组合的更强泛化与更稳定的结构一致性生成。
- Track: Layout-to-Image generation / Few-shot robust generation under atypical layouts
- Core innovation: Addresses representation fragmentation in few-shot atypical L2I by a representation-driven framework that disentangles semantic identity from visual primitives. This resolves granularity mismatch (semantics entangled with appearance details), improving generalization to rare layouts with more coherent structure and less distortion via a semantics–primitives separated intermediate representation.
- Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education 🆕NEW
- 赛道归属: 文生图评测与教育场景生成(Equation-to-Visual Generation)/ 结构一致性生成基准
- 核心创新点: 提出面向早期算术教育的“方程到视觉”新任务与基准,将生成目标从“看起来合理”提升为必须严格保持数值与关系结构的教学可视化;通过系统化评测揭示现有T2I模型在计数、对应关系、集合划分等结构约束上的失真,并进一步给出增强方向/方法以提升模型对可验证结构正确性的生成能力,使评测从审美质量转向“教学语义可用性”与结构保真。
- Track: Text-to-Image benchmarking for education / Equation-to-visual structured generation
- Core innovation: Introduces the equation-to-visual generation task and benchmark for early arithmetic education, where outputs must faithfully preserve numerical and relational structure rather than just visual plausibility. It systematically diagnoses T2I failures on counting and relational constraints and proposes enhancement directions/methods to improve verifiable structural correctness, shifting evaluation toward pedagogical utility and structure fidelity.
GitHub
- [2026-06-02] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12360
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-06-01] Light-Heart-Labs/DreamServer ⭐1874
Turn your PC, Mac, or Linux box into an AI server. LLM inference, chat UI, voice, agents, workflows, RAG, and image generation.
- [2026-06-01] AceDataCloud/Nexior ⭐372
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-06-01] CorentinGS/chess ⭐84 🆕NEW
chess is a set of go packages which provide common chess utilities such as move generation, turn management, checkmate detection, PGN encoding, UCI in...
- [2026-06-01] Dusktarepresent/Leonardo-AI-cracked ⭐52
Leonardo AI - AI image generation and design workflow platform for concept art, marketing assets, and creative teams. Official purchase/referral page ...
HuggingFace Datasets
- [2026-05-29] jasperai/monet
Dataset Card for MONET
MONET (Massive, Open, Non-redundant and Enriched Text-to-image dataset) is a large-scale, curated image-text dat...
视频生成/编辑 / Video Generation/Editing
arXiv
- MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
- 赛道归属: 视频生成(文生视频 / 多智能体提示工程)
- 核心创新点: 提出多智能体提示精炼框架 MAVEN,面向“多文化/跨文化”文生视频的文化保真度问题,将文本提示分解为人物(person)、动作(action)、地点(location)等可控维度,并由专门代理并行/串行协作改写与补全文化关键信息;通过结构化分解降低单一提示对文化细节的丢失与歧义,提升同文化与跨文化场景下生成内容的文化一致性与可评测性。
- Track: Video Generation (Text-to-Video / Multi-agent Prompting)
- Key innovation: Introduces MAVEN, a multi-agent prompt-refinement framework targeting cultural fidelity in mono- and cross-cultural T2V. It decomposes prompts into controllable dimensions (person/action/location) handled by specialized agents in parallel or sequential workflows, explicitly enriching under-specified cultural attributes and reducing ambiguity that typical single-prompt pipelines cannot recover.
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
- 赛道归属: 文生视频(Text-to-Video)/ 3D一致性对齐(强化学习)
- 核心创新点: 通过强化学习而非结构改造来注入3D约束:将“几何一致性/世界约束”显式构造成奖励信号,对视频生成模型进行对齐优化,从而在不显著增加推理开销、保持可扩展性的前提下缓解几何不一致问题;同时构建面向“世界模拟”的纯文本数据集,用于更系统地覆盖可被3D约束检验的描述分布,提升对齐训练的有效性与泛化。
- Track: Text-to-Video / 3D-consistency alignment (Reinforcement Learning)
- Core innovation: Injects 3D constraints via RL-based alignment instead of architectural modifications: formulates geometric/world-consistency as explicit rewards to optimize a video generator, improving geometric coherence without adding substantial inference cost and preserving scalability; additionally introduces a world-simulation-oriented text-only dataset to better cover descriptions that are verifiable under 3D constraints, strengthening alignment and generalization.
- OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
- 赛道归属: 视频生成(文生视频/扩散Transformer加速与部署优化)
- 核心创新点: 提出面向DiT视频生成的系统级效率方案,将“稀疏注意力 + 序列并行 + 低比特量化 + 强化学习”联合设计以在质量不降的前提下降本增效:1) 采用混合全注意力-稀疏注意力架构,用固定模式的 Skiparse-2D 在时空token维度做token级与group级稀疏连接,缓解全注意力二次复杂度;2) 引入稀疏序列并行(Sparse Sequence Parallelism)以更好匹配稀疏计算图,提升多卡吞吐与可扩展性;3) 使用 HiF8(8-bit)量化降低显存与带宽开销,面向推理/训练的硬件友好实现;4) 通过强化学习对生成策略/偏好进行对齐,在引入稀疏与量化后维持或提升感知质量与文本一致性。
- Track: Video generation (text-to-video / Diffusion-Transformer acceleration & deployment optimization)
- Core innovations: A system-level efficiency recipe for DiT-based video generation that jointly combines “sparse attention + sequence parallelism + low-bit quantization + RL” to reduce cost without sacrificing quality: 1) a hybrid full–sparse attention design using fixed-pattern Skiparse-2D to apply token-wise and group-wise sparsity over spatiotemporal tokens, mitigating quadratic attention cost; 2) Sparse Sequence Parallelism to better align distributed execution with sparse computation graphs for higher multi-GPU throughput and scalability; 3) HiF8 (8-bit) quantization to cut memory/bandwidth with hardware-friendly training/inference; 4) reinforcement learning-based alignment to preserve/improve perceptual quality and prompt faithfulness under sparsity/quantization constraints.
- Paris 2.0: A Decentralized Diffusion Model for Video Generation
- 赛道归属: 视频生成(去中心化训练 / 分布式扩散模型)
- 核心创新点: 提出首个通过去中心化计算预训练的视频扩散生成模型,将原本在图像上验证的去中心化扩散训练范式扩展到需要强时序一致性的文本生成视频任务;核心突破在于给出去中心化场景下实现时序连贯训练的配方与机制,使得无需单体GPU集群也能完成低分辨率T2V预训练,并在去中心化通信与优化约束下维持跨帧一致性与可训练性。
- Track: Video generation (decentralized training / distributed diffusion)
- Key innovation: Introduces the first video diffusion generator pre-trained via decentralized computation, extending decentralized diffusion training from images to temporally coherent text-to-video. The main methodological advance is a training recipe/mechanism that preserves temporal coherence under decentralized optimization and communication constraints, enabling low-res T2V pretraining without a monolithic GPU cluster.
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment
- 赛道归属: 图生视频生成(I2V)/ 强化学习式后训练(RLHF/RLAIF for generative models)
- 核心创新点: 提出TAGRPO用于I2V的稳健后训练,指出GRPO在I2V上“奖励不稳定/不持续提升”的关键症结在于视频生成的多步轨迹与奖励信号之间存在错位;方法上引入“直接轨迹对齐”(Direct Trajectory Alignment)的对比学习式目标,将高奖励样本的去噪/流匹配轨迹作为正样本对齐参照、低奖励轨迹作为负样本拉开,从而在不改变基础生成架构的情况下,更稳定地把奖励偏好注入到整段生成轨迹而非仅末端结果,提升可控性与一致性。
- Track: Image-to-Video generation (I2V) / RL-style post-training (RLHF/RLAIF for generative models)
- Core innovation: Proposes TAGRPO as a robust post-training framework for I2V, diagnosing that naïvely applying GRPO yields inconsistent reward gains due to misalignment between multi-step generation trajectories and reward signals. It introduces Direct Trajectory Alignment with a contrastive-learning-like objective: align denoising/flow-matching trajectories from high-reward samples as positives and push away low-reward trajectories as negatives, injecting preference into the whole trajectory (not just final frames) without changing the base architecture, improving stability and controllability.
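As background for the GRPO baseline that TAGRPO modifies, here is a minimal sketch of the group-relative advantage: several samples are generated for one prompt and each reward is normalized against the group. It does not implement the paper's Direct Trajectory Alignment objective; the function name, the use of population standard deviation, and `eps` are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of samples that share one prompt (GRPO-style).

    rewards: list of scalar rewards, one per sampled output in the group.
    eps: small constant (an assumption) to avoid division by zero when
         every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Samples above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed.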
- Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- 赛道归属: 文生图/文生视频/图生视频(基础大模型体系与工程化)
- 核心创新点: 给出Kandinsky 5.0成体系的图像与视频基础模型家族,通过“分层产品线”覆盖不同算力与质量需求:6B级高分辨率图像模型(Image Lite)、2B级轻量快速的T2V/I2V(Video Lite)、19B级高质量视频模型(Video Pro)。技术价值在于将图像与10秒视频生成统一到可扩展的基础模型栈中,并通过不同规模与配置实现质量-速度-成本的可部署权衡,为实际应用提供从轻量到旗舰的可迁移方案与训练/推理配方。
- Track: Text-to-Image / Text-to-Video / Image-to-Video (foundation model family & systemization)
- Core innovation: Presents Kandinsky 5.0 as a structured family of foundation models spanning high-res image and 10-second video synthesis, organized into tiered lineups to cover different compute/quality regimes: 6B Image Lite, 2B fast/light Video Lite for T2V/I2V, and 19B Video Pro for top quality. The key contribution is a scalable, unified model stack with practical quality–latency–cost trade-offs and deployable recipes across sizes, enabling transfer across product tiers.
- TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation 🆕NEW
- 赛道归属: 视频生成(长视频/多事件生成;扩散Transformer控制)
- 核心创新点: 发现并利用视频DiT去噪轨迹中的“内在转折点”(conditioning影响从全局布局逐步过渡到细节),据此提出训练-free 的渐进式steering策略:在不同去噪阶段施加不同强度/类型的文本引导,实现对长时域多事件视频的分阶段可控生成,在不改模型参数、无需额外训练的前提下提升多事件组织与细节一致性。
- Track: Video generation (long-horizon multi-event; diffusion Transformer control)
- Core innovation: Identifies intrinsic “turning points” along the video DiT denoising trajectory where conditioning shifts from shaping global layout to refining details, and leverages this to propose a training-free progressive steering scheme that applies stage-specific guidance across denoising steps, improving controllability and coherence for long multi-event videos without parameter updates or extra training.
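The idea of stage-specific guidance can be illustrated with a generic classifier-free guidance schedule that switches strength at a turning point. This is only a sketch of the general pattern, not TunerDiT's actual steering rule: `turning_point`, `early_scale`, `late_scale`, and both function names are hypothetical.

```python
def guidance_scale(step, num_steps, turning_point=0.4,
                   early_scale=9.0, late_scale=5.0):
    """Pick a guidance strength based on denoising progress.

    All defaults are hypothetical values chosen for illustration only.
    step: 0-based index of the current denoising step.
    """
    progress = step / max(num_steps - 1, 1)
    # Before the turning point the text mostly shapes global layout;
    # afterwards it mostly refines details, so a different strength is used.
    return early_scale if progress < turning_point else late_scale

def guided_prediction(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A multi-event variant could also swap the prompt itself between stages; the digest mentions varying both strength and type of guidance but gives no further detail.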
- SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation 🆕NEW
- 赛道归属: 视频生成(流式/长视频生成;记忆机制;扩散模型)
- 核心创新点: 将流式视频扩散的历史上下文从“时间中心”(帧/片段/未聚类token)重构为“对象中心”,提出SlotMemory对象级KV记忆:以slot形式聚合并持久化实体表征,支持实体离场/再入场与交互式prompt切换时的稳定检索与更新,从机制上缓解identity drift与语义不一致,提升长时生成的对象一致性与可编辑性。
- Track: Video generation (streaming/long video; memory mechanisms; diffusion)
- Core innovation: Replaces temporal-centric history (frames/chunks/unclustered tokens) with an object-centric KV memory via SlotMemory: slot-based aggregation and persistence of entity representations enables robust retrieval/update across occlusion, out-of-frame periods, and interactive prompt transitions, mitigating identity drift and semantic inconsistency to improve long-range object coherence and editability.
- Robust Dreamer: Deviation-Aware Latent Gaussian Memory for Action-Controlled AR Video Generation 🆕NEW
- 赛道归属: 视频生成(动作控制AR生成/世界模型;记忆与漂移鲁棒性)
- 核心创新点: 面向逐帧动作控制的自回归视频生成,提出偏差感知(Deviation-Aware)的潜空间高斯记忆来对抗长rollout漂移:针对(1) Latent–RGB反复解码/重编码导致的信息损失与(2) AR误差累积引发的灾难性漂移,使用可统计建模不确定性/偏差的latent memory在生成过程中进行稳健校正与状态保持,从而在长时序下兼顾即时动作响应、视觉保真与3D一致性。
- Track: Video generation (action-controlled autoregressive/world simulation; memory & drift robustness)
- Core innovation: Introduces a deviation-aware latent Gaussian memory for frame-wise action-controlled AR video generation to combat long-rollout drift. It targets (1) information loss from repeated Latent–RGB decode/re-encode cycling and (2) compounding AR errors, using a probabilistic latent memory that models deviation/uncertainty to stabilize state and correct drift, improving long-horizon fidelity and 3D consistency while preserving immediate action responsiveness.
- OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation 🆕NEW
- 赛道归属: 视频生成(长视频AR生成;KV缓存检索/推理优化;显式记忆)
- 核心创新点: 提出OmniMem显式全量历史的稀疏KV检索框架:在长视频AR生成中不再截断或隐式压缩KV cache,而是对增长的历史缓存进行query相关的稀疏检索,保留对关键细节的显式访问;并通过可扩展、可自适应的检索策略将全范围检索在计算/显存上变得可行,从而在可控成本下提升长程一致性与细节召回。
- Track: Video generation (long-video autoregressive; KV-cache retrieval/inference optimization; explicit memory)
- Core innovation: Proposes OmniMem, an explicit full-range sparse KV retrieval framework for long-video AR generation: instead of truncating the KV cache or compressing it into implicit memory, it performs query-relevant sparse retrieval over the growing historical cache, preserving explicit access to critical past details. Scalable, adaptive retrieval makes full-range access practical in compute/memory, improving long-range coherence and fine-detail recall under controlled cost.
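Query-relevant sparse retrieval over a KV cache, as described for OmniMem, can be sketched as top-k attention: score every cached key against the query, keep only the best matches, and attend over those. This is a generic illustration in plain Python, not OmniMem's retrieval strategy; the function name and the fixed `top_k` are assumptions.

```python
import math

def sparse_kv_attention(query, keys, values, top_k=2):
    """Attend only to the top_k cached entries most relevant to the query.

    query: list of floats; keys/values: lists of equally long float lists.
    Returns (output_vector, sorted_indices_of_retrieved_entries).
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Retrieve the top_k highest-scoring entries from the full history.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax restricted to the retrieved entries (max-shifted for stability).
    peak = max(scores[i] for i in keep)
    weights = [math.exp(scores[i] - peak) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w / total * values[i][d] for w, i in zip(weights, keep))
           for d in range(dim)]
    return out, sorted(keep)
```

Setting `top_k` to the cache length recovers ordinary full attention, so the parameter trades cost against how much history is consulted. Note that this sketch still scores every key; a scalable system would need a cheaper candidate-selection step.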
GitHub
- [2026-06-01] hao-ai-lab/FastVideo ⭐3667
A unified inference and post-training framework for accelerated video generation.
- [2026-06-01] ZeroLu/awesome-seedance ⭐1856
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-06-01] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1278
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-06-01] stuttlepress/ComfyUI-Wan-VACE-Prep ⭐87
ComfyUI nodes designed to help make common video editing tasks with video generation models less complicated. Smooth transitions, extensions, outpaint...
- [2026-06-01] DistributorRecord/Kling-AI-Video-Generator-cracked ⭐57
Kling AI Video Generator - AI video generation workflow for text-to-video, image-to-video, creative clips, and social content. Includes setup notes, S...
音频生成 / Audio Generation
arXiv
- FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations
- 赛道归属: 音频生成|零样本文本转语音(Zero-shot TTS)|可控生成(风格/音色解耦控制)
- 核心创新点: 通过解耦语音表征将语音分解为可解释属性(如内容、韵律/风格、音色等),并在零样本TTS中实现来自不同参考音频的分离式条件控制:用一段参考提供说话人音色、另一段参考提供说话风格/韵律,从而突破以往“单一参考同时绑定音色与风格”的耦合限制;方法上强调在表示学习与条件注入机制上实现属性独立性,使模型在保持高保真克隆的同时获得可组合、可编辑的控制能力。
- Track: Audio Generation | Zero-shot Text-to-Speech (TTS) | Controllable generation (disentangled style/timbre control)
- Core innovation: Introduces disentangled speech representations that factor speech into interpretable attributes (e.g., content, prosody/style, timbre) and enables separate-reference conditioning in zero-shot TTS: one reference supplies speaker timbre and another supplies speaking style/prosody. This removes the common entanglement in which a single prompt determines both. The method targets attribute independence in both representation learning and the conditioning/injection mechanism, preserving cloning fidelity while enabling compositional, editable control.
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
- 赛道归属: 语音生成 / TTS 数据集与数据构建(低资源语言、多说话人)
- 核心创新点: 提出面向多说话人TTS训练的超大规模波斯语开源语音-文本语料库ParsVoice,并给出可扩展的数据构建流水线:从长篇有声书录音中自动切分与对齐高质量语音-文本对,核心在于结合面向波斯语的句级语义/完整性建模(如微调的ParsBERT用于句子补全/筛选)与质量控制策略,以在低资源语言场景下系统性提升对齐准确性、覆盖度与可用性,从而降低多说话人TTS与语音语言建模的数据门槛。
- Track: Audio Generation / TTS dataset & data pipeline (low-resource, multi-speaker)
- Core innovation: Introduces ParsVoice, the largest publicly available Persian speech–text corpus designed for multi-speaker TTS, together with a scalable pipeline to derive high-quality paired data from long-form audiobooks. The key methodological contribution is an automated segmentation/alignment and quality-control workflow that leverages Persian-specific sentence-level modeling (e.g., a fine-tuned ParsBERT for sentence completion/filtering) to improve alignment reliability, coverage, and usability in low-resource settings.
- ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment 🆕NEW
- 赛道归属: 文本到语音(TTS)/ 场景化语音生成(语音+环境声融合)
- 核心创新点: 提出环境感知TTS框架,通过多模态扩散Transformer显式建模语音与环境上下文(如场景/视觉/环境音提示)之间的跨模态交互,解决语音与环境声在声学形态与时间动态上的分布差异;并引入面向领域的表征对齐机制,将“语音生成表征”与“环境/场景表征”在统一空间中对齐,从而实现语音与环境声的自然共存与无缝融合(而非后期拼接)。
- Track: Text-to-Speech (TTS) / Scene-aware speech generation (speech + ambient sound integration)
- Core innovations: Proposes an environment-aware TTS framework that uses a multimodal Diffusion Transformer to explicitly model cross-modal interactions between speech and environmental context (e.g., scene/visual/ambient cues), addressing the distribution and temporal-dynamics mismatch between speech and environmental audio; introduces domain-specific representation alignment to map speech-generation features and environment/scene features into a shared space, enabling coherent in-scene speech generation rather than post-hoc mixing.
- UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion 🆕NEW
- 赛道归属: 统一音频生成与编辑(Text-to-Audio/TTS/音频编辑一体化,多任务扩散)
- 核心创新点: 用单一潜空间扩散模型统一覆盖文本到音频、文本到语音、零样本音色克隆、语音+音效混合生成、场景级音频编辑与时间编排等任务,实现“同权重多能力”;关键方法是层级式深度LLM融合(将LLM多层隐状态注入扩散网络以增强语义与结构控制)以及面向多任务的统一条件接口/训练范式,使生成与编辑在同一潜空间与同一推理管线内闭环完成,减少任务间割裂与模型堆叠。
- Track: Unified audio generation & editing (Text-to-Audio/TTS/audio editing; multi-task diffusion)
- Core innovations: Introduces a single latent diffusion model that unifies text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, and temporal composition under one set of weights; key is layer-wise deep LLM fusion—injecting multi-layer LLM hidden states into the diffusion network for stronger semantic/structural control—plus a unified conditioning/training scheme so generation and editing operate in the same latent space and inference pipeline, avoiding fragmented task-specific stacks.
- SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue 🆕NEW
- 赛道归属: 长文本零样本TTS / 对话式语音合成(多说话人、情感与一致性建模)
- 核心创新点: 面向长篇独白与多轮对话的零样本语音合成,针对“逐轮合成再拼接”导致的音色一致性、韵律连贯性与情绪连续性断裂问题,提出在单模型内联合建模跨轮次的对话上下文与表达状态(如情感/语气/节奏的持续变量),在生成时维持跨turn的声学一致与对话连贯;强调长程依赖与多说话人切换下的表达可控与稳定性,而非仅提升单句质量。
- Track: Long-form zero-shot TTS / Dialogue speech synthesis (multi-speaker, expressive consistency)
- Core innovations: Targets long-form monologue and multi-turn dialogue in zero-shot TTS, addressing the common “synthesize-per-turn then stitch” workaround that breaks timbre, prosody, and affect continuity; proposes single-model joint modeling of cross-turn dialogue context and persistent expressive states (e.g., emotion/intonation/rhythm as continuous trajectories), maintaining acoustic consistency and conversational coherence across turns while supporting multi-speaker switching and expressive control over long horizons.
- Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer 🆕NEW
- 赛道归属: 流式空间音频生成(视频/文本条件的Spatial Audio,低延迟生成)
- 核心创新点: 提出面向实时的流式空间音频生成统一框架,使用自回归扩散Transformer在“可流式输出”的约束下实现高保真生成,并强化与全景视频/文本提示的时序同步与空间一致性;核心突破在于把扩散生成改造为可在线推进的自回归/分段式推理范式,在降低推理延迟的同时保持空间线索(方位、距离、运动)建模精度,缓解“质量-延迟”权衡与多模态空间对齐困难。
- Track: Streaming spatial audio generation (video/text-conditioned spatial audio; low-latency)
- Core innovations: Proposes a unified streaming framework for real-time spatial audio generation conditioned on panoramic video and text, built on an autoregressive Diffusion Transformer to enable incremental (online) synthesis; key contribution is adapting diffusion-style generation to a streaming-compatible autoregressive/segmented inference scheme that preserves high fidelity while improving latency, and strengthening temporal synchronization and spatial consistency (direction/distance/motion cues) from multimodal inputs, mitigating the quality–latency tradeoff and multimodal spatial alignment challenges.
- Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS 🆕NEW
- 赛道归属: 流式零样本TTS / 推理加速(Block Diffusion并行解码)
- 核心创新点: 将预训练自回归TTS解码器微调为块扩散(block-diffusion)解码器,实现“块内并行、块间流式”的低延迟生成;针对离散语音token长尾分布导致的并行位置选择偏置(高频token主导、质量下降)问题,提出先验校准(prior-calibration)机制,在不大改架构的前提下校正并行采样的token先验/选择策略,从而兼顾并行带来的速度与接近自回归的自然度与稳定性。
- Track: Streaming zero-shot TTS / Inference acceleration (block-diffusion parallel decoding)
- Core innovations: Fine-tunes a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while keeping block-by-block streaming for low latency; identifies a discrete-speech-token long-tail issue where naive block diffusion biases parallel positions toward a few high-frequency tokens and degrades quality, and introduces prior calibration to correct the sampling prior/position-selection behavior without major architectural changes, preserving naturalness and stability while gaining speed.
- Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models
- 赛道归属: 文本到语音生成(TTS)/ 语音风格可控生成(Prompt-based Style Control)
- 核心创新点: 在现有“基于提示词的TTS”框架上,针对两类关键瓶颈提出方法级增强:①实现跨语句(inter-utterance)的细粒度风格属性连续可控与插值,使风格强度/属性可在不同句子间平滑调节而非离散切换;②实现单句内部(within-utterance)的时变风格控制,通过引入随时间变化的风格条件/调度机制,让模型不再只能施加全局单一风格,而能在同一句话中完成风格过渡与局部风格片段控制,从而扩展到需要“句内风格转场”的实际应用场景。
- Track: Text-to-Speech (TTS) / Controllable Speech Style Generation (Prompt-based Style Control)
- Core innovations: Proposes method-level extensions to existing prompt-based TTS to overcome two limitations: (1) enables fine-grained, continuous control and interpolation of style attributes across utterances (inter-utterance), allowing smooth adjustment of style intensity/attributes rather than coarse, discrete changes; (2) enables time-varying, within-utterance style control by introducing temporally scheduled/dynamic style conditioning, replacing a single global style per utterance with intra-utterance style transitions and localized style segment control—supporting practical scenarios requiring style changes inside one sentence.
- PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
- 赛道归属: 语音生成 / TTS 系统工程与训练配方(轻量化自回归合成)
- 核心创新点: 提出PilotTTS:以“纪律化的模块化配方”替代复杂多阶段大系统,通过极简自回归架构 + 严格的数据工程实现有竞争力的合成效果。方法论突破在于将性能提升的关键从模型堆叠转移到可复现的训练流程:全链路使用开源工具处理约20万小时数据,强调模块边界清晰、训练/数据清洗规范化与可移植的工程实践,使资源受限团队也能复现接近SOTA的TTS质量。
- Track: Audio Generation / TTS system recipe & training pipeline (lightweight autoregressive synthesis)
- Core innovation: Proposes PilotTTS, a competitive yet lightweight autoregressive TTS system achieved via a disciplined modular recipe rather than heavy multi-stage architectures. The methodological advance is a reproducible, open-source end-to-end training pipeline on ~200K hours that prioritizes rigorous data engineering, clear module interfaces, and standardized processing/cleaning—shifting gains from model complexity to repeatable system-building practices accessible to constrained teams.
- PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech
- 赛道归属: 语音生成 / TTS 评测与自动化筛查(低资源、非拉丁文字)
- 核心创新点: 针对低资源且使用非拉丁文字的TTS评测中“单一ASR回环WER”易失效的问题,提出INSV报告框架,将失败模式显式拆解为可懂度(Intelligibility)、自然度(Naturalness)、文字/脚本保真度(Script fidelity)与验证(Verification)。并给出INSV-A自动化筛查子集,用自动指标区分“无音频/说错语言/仅转写保留目标文本/听感不自然”等典型误判情形,从评测方法论上提升对低资源TTS系统的可诊断性与可比性。
- Track: Audio Generation / TTS evaluation & automated screening (low-resource, non-Latin scripts)
- Core innovation: Addresses the brittleness of single ASR round-trip WER for low-resource, non-Latin-script TTS evaluation by introducing the INSV framework, which disentangles outcomes into Intelligibility, Naturalness, Script fidelity, and Verification. It further provides INSV-A, an automated screening subset that can separate common failure cases (no audio, wrong language, script-only preservation in transcripts, unnatural speech), improving diagnostic power and comparability of evaluations in low-resource settings.
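To see why a single ASR round-trip WER is brittle, it helps to compute it directly. The sketch below is a standard word-level edit-distance WER, not part of the INSV framework itself; the function name and the placeholder tokens in the usage note are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

With a three-word reference `"w1 w2 w3"`, an empty transcript (no audio) and a transcript of three unrelated words (for example, speech in the wrong language) both score 1.0, so the number alone cannot tell these failure modes apart.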
GitHub
- [2026-06-01] huggingface/diffusers ⭐33758
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-06-01] BinWang28/audio-ai-hub ⭐925
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
- [2026-06-01] Ameobea/web-synth ⭐556 🆕NEW
Browser-based DAW and audio synthesis platform with dozens of effects, synths, and modules
- [2026-06-01] apocas/restai ⭐509
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLMs supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-05-27] xiquan-li/Awesome-Audio-Generation ⭐73
Curated list for papers, codes and resources related to Text-to-Audio (TTA) Generation
语言大模型 / Large Language Models
arXiv
- COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models 🆕NEW
- 赛道归属: 公平性可控解码 / 推理阶段偏见抑制(LLM Decoding for Fairness in CoT)
- 核心创新点: 提出一种无需训练、仅在解码阶段生效的公平性控制方法 COFT,用于抑制链式思维(CoT)生成中的社会偏见放大。方法上以反事实提示构造 + 共形预测(Conformal)约束为核心:先将提示中的敏感片段替换为中性占位符形成“掩码反事实”输入,以获得相对去偏的参考分布;再在token 级别对原始解码分布施加公平性约束,并通过分布无关(distribution-free)的边际有效性保证(在 exchangeability 假设下)为公平控制提供可验证的统计保证,从而实现对任意冻结的因果语言模型在推理时的可控去偏解码。
- Track: Fairness-controlled decoding / Inference-time bias mitigation for CoT (LLM Decoding for Fairness in CoT)
- Key innovation: Introduces COFT, a training-free, decoding-time method to curb bias amplification in chain-of-thought generation. The technical core combines counterfactual prompt masking with conformal (distribution-free) constraints: it first replaces sensitive spans with neutral tokens to form a masked counterfactual prompt, yielding a debiased reference distribution; then it enforces token-level fairness control on the original decoding distribution, providing distribution-free marginal validity guarantees (under exchangeability) for any frozen causal LM—enabling verifiable, model-agnostic fairness control at inference time.
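The distribution-free guarantee mentioned for COFT comes from conformal prediction. Below is a minimal sketch of the generic split-conformal threshold; how COFT defines its scores and applies the threshold during decoding is not described here, so the function name and the notion of a per-example "score" are assumptions.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile over held-out nonconformity scores.

    Under exchangeability, a new score is at most the returned threshold
    with probability at least 1 - alpha (marginal coverage).
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha; only a trivial bound holds.
        return math.inf
    return sorted(calibration_scores)[rank - 1]
```

The guarantee is marginal (averaged over calibration and test draws) and rests only on exchangeability, which is why it can apply to any frozen model.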
- CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models
- 赛道归属: 推理优化(隐式CoT/潜空间推理、推理token化)
- 核心创新点: 提出CIRF,将传统“链式思维”从自然语言解释转为可复用的离散功能token序列来执行隐式推理:把推理过程模块化为功能单元并在推理时动态编排,以适配不同样例复杂度;同时强调与显式CoT的对齐,使隐式推理在降低推理开销的同时尽量保持可解释推理轨迹的一致性与可控性。
- Track: Reasoning optimization (implicit CoT / latent reasoning, tokenized reasoning)
- Core innovations: CIRF converts natural-language chain-of-thought into a sequence of reusable discrete functional tokens for implicit reasoning. It dynamically composes these functional units at inference time to match instance complexity, aiming to reduce inference cost while improving alignment with explicit CoT so latent reasoning remains consistent and controllable.
- Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
- 赛道归属: LLM辅助编译优化 / 张量程序优化数据集(程序优化 + 推理链监督)
- 核心创新点: 提出Step-TP,一个“可落地(grounded)到具体变换”的逐步级(step-level)数据集,用于将张量程序优化建模为可组合的序列决策过程;相较仅提供端到端优化前后程序对的既有数据,Step-TP提供可验证的中间变换步骤与对应的Chain-of-Thought推理监督,使每一步优化决策具备可解释性与可检查性,并避免token低效的表示方式,从而更适配LLM在迭代优化中的训练与评测(如逐步决策正确性、可组合性与可回放验证)。
- Track: LLM-guided compiler optimization / tensor program optimization dataset (program optimization + CoT supervision)
- Core innovation: Introduces Step-TP, a grounded step-level dataset that maps tensor program optimization to a composable sequential decision process. Unlike prior datasets that only provide end-to-end before/after optimized program pairs with token-inefficient representations, Step-TP supplies verifiable intermediate transformation steps together with Chain-of-Thought supervision, enabling interpretable and checkable optimization decisions at each step and better supporting LLM training/evaluation for iterative optimization (e.g., step correctness, composability, and replayable verification).
- MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning
- 赛道归属: 多模态理解(语音/音频大模型适配与低资源学习、In-Context Learning)
- 核心创新点: 提出一种面向听觉LLM的元学习式语音上下文学习框架(Meta Speech In-Context Learning),将“推理时用少量示例做ICL适配”作为核心适配机制,用元学习在训练阶段显式优化模型对示例集合的利用方式,从而在标注稀缺或训练-测试分布不匹配时,相比直接微调更稳健地实现快速域内适配与性能提升;强调训练免/轻训练的推理期自适应,降低低资源任务的适配成本并缓解微调脆弱性。
- Track: Multimodal Understanding (speech/audio LLM adaptation for low-resource settings, In-Context Learning)
- Core innovation: Proposes a meta-learning-based speech in-context learning framework (Meta Speech In-Context Learning) for auditory LLMs, treating inference-time adaptation via a few in-domain demonstrations as the primary adaptation mechanism. By meta-optimizing how the model leverages demonstration sets during training, it enables more robust and rapid in-domain adaptation under scarce labels or train–test distribution mismatch, mitigating the brittleness of direct fine-tuning while keeping adaptation largely training-free/lightweight at inference time.
- Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
- 赛道归属: 多模态大模型训练与OCR增强(多语言文本理解/视觉文本推理)
- 核心创新点: 提出面向真实场景视觉文本的多语言OCR增强训练框架:结合(1)大规模合成“OCR→翻译/理解”数据生成以覆盖复杂版式与噪声,(2)基于LoRA的OCR-aware监督微调以低成本注入视觉文本能力,(3)结构化的视觉提示与提示引导CoT推理以提升跨语言读图与文本推理的可控性与鲁棒性,系统性缓解MLLM在小字、遮挡、模糊与复杂字体上的失效。
- Track: Multimodal LLM training with OCR enhancement (multilingual visual-text understanding & reasoning)
- Core innovation: Presents a multilingual OCR-aware training pipeline combining (i) large-scale synthetic OCR-to-translation/understanding data generation for noisy real-world layouts, (ii) OCR-aware SFT with LoRA for efficient capability injection, and (iii) structured visual prompting plus prompt-guided CoT to improve controllability and robustness of multilingual visual-text reading and reasoning under clutter, blur, occlusion, and complex typography.
- River-LLM: Large Language Model Seamless Exit Based on KV Share
- 赛道归属: LLM推理加速 / 早退推理(Early Exit)与KV Cache机制优化
- 核心创新点: 提出River-LLM,通过“KV Share(跨层KV共享)”实现decoder-only大模型的无缝早退(seamless exit),针对早退在decoder架构中被“KV Cache缺失(跳过层无法产出后续token所需历史状态)”卡住的关键瓶颈;其方法核心是在允许跳层的同时,仍为后续解码提供一致、可用的KV缓存供给,从而把早退从“理论可跳层”推进到“工程可落地的端到端加速”,在不破坏自回归解码依赖的前提下降低推理时延。
- Track: LLM inference acceleration / Early-exit decoding with KV-cache mechanism optimization
- Core innovation: Proposes River-LLM, enabling seamless early exit in decoder-only LLMs via KV Share (cross-layer KV sharing). It targets the main bottleneck of early exit in decoder architectures—the KV Cache Absence problem, where skipped layers fail to produce the historical states required for subsequent tokens. By maintaining a consistent, usable KV supply even when layers are bypassed, it turns early-exit from a conceptual layer-skipping idea into an end-to-end deployable speedup without breaking autoregressive decoding dependencies, reducing inference latency.
- GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning
- 赛道归属: 图基础模型(Graph Foundation Models)/ 图领域 In-Context Learning(ICL)/ 跨图泛化
- 核心创新点: 提出一种不依赖LLM、无需微调(LLM-Free & Tuning-Free)的图基础模型框架,用于在极端异构图场景下实现类ICL的快速适配与跨图泛化。其方法论突破在于:针对不同图之间特征空间、标签集合与拓扑结构不一致带来的“任务/空间不对齐”问题,通过构建与具体图域无关的统一表示与对齐机制,使模型能够在不进行参数更新的前提下,仅依靠上下文示例完成对新图/新任务的推断与迁移,从而绕开现有GFM依赖文本化/LLM中介或需要额外调参的限制。
- Track: Graph Foundation Models / In-Context Learning (ICL) on graphs / Cross-graph generalization
- Core innovation: Proposes an LLM-free, tuning-free graph foundation model framework that achieves ICL-style rapid adaptation and cross-graph generalization under extreme graph heterogeneity. To address the task/space misalignment caused by differing feature spaces, label sets, and topologies across graphs, it builds a graph-domain-agnostic unified representation and alignment mechanism, so the model can infer on and transfer to new graphs and tasks from in-context examples alone, with no parameter updates. This sidesteps the reliance of existing GFMs on textualization/LLM intermediaries or additional tuning.
- BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data
- 赛道归属: 后训练数据工程(CoT数据合成/标注流程设计)
- 核心创新点: 提出BC Protocol,用结构化的双专家对话来生成高质量CoT后训练数据:通过“专家-对抗/校验专家”式的分工与对话约束,系统性暴露并补全单专家写作中常见的“专家盲区”(跳步、默认常识),从流程层面提升推理链的完整性、可读性与可用于训练的稳定格式,相比偏好信号或众包标注更能产出深推理轨迹。
- Track: Post-training data engineering (CoT data synthesis / annotation protocol)
- Core innovations: BC Protocol introduces a structured dual-expert dialogue pipeline to elicit high-quality CoT data. By pairing an expert with a second expert focused on challenge/verification under explicit dialogue constraints, it mitigates the “expert blind spot” (skipped steps, implicit assumptions), producing more complete, consistent, training-ready reasoning traces than crowdsourcing or preference-only RLHF signals.
- Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization
- 赛道归属: 对齐与可解释性评测(CoT忠实性、偏好对齐优化)
- 核心创新点: 针对CoT忠实性的两类评测范式(上下文忠实性与参数忠实性)长期割裂的问题,提出FaithMate作为统一的偏好对齐接口,可在同一优化框架下分别/共同推动模型在两种忠实性目标上的改进;并系统研究在优化过程中两者的相互作用与潜在权衡,为“优化后CoT是否更真实反映模型行为”提供可操作的训练与比较基准。
- Track: Alignment & interpretability evaluation (CoT faithfulness, preference-based optimization)
- Core innovations: FaithMate provides a unified preference-alignment interface to optimize and compare two previously separated notions of CoT faithfulness: contextual (via input/trace perturbations) and parametric (via interventions on model knowledge). It enables joint/isolated optimization and studies their interaction and trade-offs under training, offering an actionable framework to assess whether optimized CoTs better reflect underlying model behavior.
- Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection
- 赛道归属: 安全对齐(越狱防护/安全分类器增强、对抗鲁棒性)
- 核心创新点: 提出Reflect-Guard,通过参数高效微调为安全分类器引入逻辑自反思式CoT推理:将强模型(如GPT-4o级别)的分析推理蒸馏到Guard模型,使其在面对角色扮演、虚构包装、间接请求等“意图伪装”越狱提示时,能先进行结构化推断与自检再判定风险,从而提升对抗提示下的识别鲁棒性,而非仅依赖表面关键词或模式匹配。
- Track: Safety alignment (jailbreak defense / safety classifier robustness)
- Core innovations: Reflect-Guard enhances LLM-based safety classifiers with logical self-reflection CoT reasoning via parameter-efficient fine-tuning. By distilling analytical reasoning from a stronger model, the classifier learns to infer and self-check hidden malicious intent in adversarial prompts (role-play, fictional framing, indirect requests), improving robustness beyond surface-pattern or keyword-based detection.
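The distribution-free marginal guarantee mentioned in the COFT entry above is the kind provided by split conformal prediction. The following is a minimal sketch of that generic calibration step, not COFT's actual algorithm: the nonconformity score (one minus the probability of the true token), the function names, and the toy numbers are all assumptions made for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the split-conformal quantile of calibration nonconformity scores.

    Under exchangeability, a fresh score falls at or below this threshold
    with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def prediction_set(token_probs, threshold):
    """Keep every token whose score (1 - p) does not exceed the threshold."""
    return {tok for tok, p in token_probs.items() if 1.0 - p <= threshold}

# Hypothetical calibration scores: 1 - p(true token) on held-out examples.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.80]
q_hat = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical next-token distribution from a frozen LM.
probs = {"doctor": 0.55, "nurse": 0.35, "engineer": 0.08, "pilot": 0.02}
allowed = prediction_set(probs, q_hat)
```

A decoding-time method could restrict sampling to `allowed`; how COFT combines such a set with the masked counterfactual reference distribution is not specified in this digest.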
GitHub
- [2026-06-02] sgl-project/sglang ⭐28887
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-06-02] google-ai-edge/LiteRT-LM ⭐5302
LiteRT-LM is Google's production-ready, high-performance, open-source inference framework for deploying Large Language Models on edge devices.
- [2026-06-02] NiuTrans/NLPBook ⭐621 🆕NEW
A comprehensive book on neural networks and large language models in NLP
- [2026-06-01] chrisliu298/awesome-on-policy-distillation ⭐250
A curated collection of papers, technical reports, frameworks, and tools for on-policy distillation (OPD) of large language models
- [2026-06-02] Nayjest/ai-microcore ⭐107 🆕NEW
A handy lib for smooth interaction with large language models (LLMs) and crafting AI apps.
HuggingFace Datasets
- [2026-05-28] openbmb/UltraData-SFT-2605
UltraData-SFT-2605
📚 Introduction
Ult...
- [2026-05-28] openbmb/Ultra-FineWeb-L3
Ultra-FineWeb-L3
...
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2025-07-11] HuggingFaceFW/fineweb
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of m...
多模态大模型 / Multimodal Models
arXiv
- Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness 🆕NEW
- 赛道归属: 多模态安全与可信(开放世界异常检测/拒识、VLM鲁棒性)
- 核心创新点: 提出“语义自负(Hubris of Semantics)”作为开放世界部署中的关键失效模式:VLM会将未知异常强行映射到已知语义并高置信输出。方法上以“生成式语义抗体(Generative Semantic Antibodies)”为核心机制,为模型显式注入“负知识/反语义”以形成可拒识的决策边界,从而在不破坏原有零样本语义对齐能力的前提下提升开放世界可信性与异常处理能力。
- Track: Multimodal safety & trustworthiness (open-world anomaly detection/rejection, VLM robustness)
- Key innovation: Identifies “Hubris of Semantics” as a core open-world failure where VLMs over-confidently force unknown anomalies into known semantic classes. Introduces “Generative Semantic Antibodies” to explicitly inject negative knowledge/counter-semantics, shaping rejectable decision boundaries while preserving zero-shot semantic alignment, improving open-world trustworthiness.
- SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
- 赛道归属: 多模态理解(音频-视频时序理解评测/Benchmark)
- 核心创新点: 提出SONIC-O1作为面向真实世界音频-视频理解的系统性评测基准:以长时序、多领域对话场景为核心覆盖(60小时、231段、13个真实会话域),并采用全人工核验的数据与标注流程,旨在弥补现有评测偏静态图像、缺少对“音视频联合+时序推理”能力刻画的空白,从而更可靠地区分MLLM在真实音视频理解中的能力边界与失效模式。
- Track: Multimodal Understanding (Audio-Video Temporal Understanding Benchmark)
- Key Innovations: Introduces SONIC-O1, a real-world benchmark for systematic evaluation of MLLMs on sequential audio-video understanding. It emphasizes long-form temporal, multi-domain conversational scenarios (60 hours, 231 clips, 13 domains) with fully human-verified data/annotations, addressing the gap of prior benchmarks that over-focus on static images and under-measure joint audio-video temporal reasoning, enabling clearer diagnosis of capability limits and failure modes.
- Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
- 赛道归属: 多模态安全与鲁棒性(VLM 对抗攻击)
- 核心创新点: 提出一种面向视觉-语言模型的跨模态协同对抗框架,将纹理约束的图像扰动与跨模态联合优化结合:在视觉侧通过受限于纹理/局部统计特性的扰动提升隐蔽性与可迁移性,在语言侧通过与视觉扰动协同的目标设计/优化放大误导效应,从而在无需不现实的强白盒假设下实现更强的多模态攻击,系统性揭示 LVLM 在“多模态联动”攻击面前的脆弱性。
- Track: Multimodal Security & Robustness (Adversarial Attacks on VLMs)
- Key innovation: Proposes a cross-modal synergistic adversarial framework that couples texture-constrained image perturbations with cross-modal joint optimization. The visual perturbation is constrained by texture/local statistics to remain stealthy while improving transferability, and the language-side objective is co-optimized to amplify misalignment, enabling stronger multimodal attacks without relying on impractical strong white-box access and exposing LVLM fragility under coordinated multimodal threats.
- Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
- 赛道归属: 多模态理解(人类注视/社会注视预测评测基准)
- 核心创新点: 构建并系统评测VLM在“注视跟随(gaze following)”与“社会注视预测(social gaze prediction)”上的能力边界,强调该任务需要同时理解几何/物理场景与交互语境;通过基准化任务设定与指标,揭示现有VLM在注视相关推理中的可靠性缺口与典型失败模式,为后续面向注意力与行为理解的训练/对齐提供可复现的评测框架。
- Track: Multimodal Understanding (Human Gaze & Social Attention Benchmarking)
- Core innovations: Establishes a benchmark and systematic evaluation protocol for VLMs on gaze following and social gaze prediction, tasks requiring joint reasoning over physical scene geometry and social/interaction context. The work standardizes settings and metrics, surfaces reliability gaps and common failure modes in current VLMs, and provides a reproducible evaluation framework to guide future training/alignment for attention and behavior understanding.
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation 🆕NEW
- 赛道归属: 多模态理解(VLM幻觉抑制、跨模态融合/注意力机制改进)
- 核心创新点: 从“视觉注意力汇聚/沉没(attention sink)”角度解释幻觉:并非简单的“语言先验过强”,而是视觉注意力被任务无关区域吸走导致视觉证据未被有效融合。提出利用“注视转移(gaze shifts)”信号来指导跨模态融合增强:通过建模视线在关键区域间的动态转移,重分配视觉-文本对齐时的注意力与融合权重,避免仅按原始注意力分数做放大而加剧偏置,从机制上降低不可证实内容生成。
- Track: Multimodal understanding (VLM hallucination mitigation, cross-modal fusion/attention)
- Key innovation: Reframes hallucination via a “visual attention sink” mechanism—visual attention is diverted to irrelevant regions, preventing evidence from being fused. Uses “gaze shifts” as guidance signals to enhance cross-modal fusion by modeling dynamic transitions between salient regions, reweighting alignment/fusion beyond naive attention amplification, thereby reducing unsupported generations.
- VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments 🆕NEW
- 赛道归属: 具身智能与机器人定位(语义全局定位、VLM+概率滤波/Monte Carlo Localization)
- 核心创新点: 将VLM的开放词汇语义理解引入Monte Carlo Localization(MCL)框架,面向“几何与语义都高度混淆”的准静态室内环境(如货架平行通道、重复家具)提升全局定位鲁棒性。核心在于用VLM生成/评估与场景观测一致的语义证据,并将其作为观测模型或粒子权重更新信号,与传统几何/外观特征互补,从而在几何别名严重、语义长尾且遮挡杂乱的场景中实现更稳定的语义级全局定位。
- Track: Embodied AI & robot localization (semantic global localization, VLM + probabilistic filtering/MCL)
- Key innovation: Integrates open-vocabulary semantic understanding from VLMs into a Monte Carlo Localization pipeline to handle quasi-static indoor environments with strong geometric/semantic aliasing. Uses VLM-derived semantic evidence as an observation/weighting signal for particle updates, complementing geometric/appearance cues to improve robustness under severe aliasing, long-tail semantics, and clutter/occlusion.
- "Înțelegi Românește?" A Recipe for Romanian Vision-Language Models 🆕NEW
- 赛道归属: 多语言/低资源多模态模型(语言特定VLM构建、数据与评测体系)
- 核心创新点: 给出构建罗马尼亚语VLM的端到端“配方”:不仅做简单翻译,而是系统覆盖数据构建、训练与评测闭环。方法上将主流英文图文训练/评测语料翻译并进行质量控制与本地化处理,结合语言特性进行架构/训练策略选择与对比,建立文化与语言一致的评测基准,用以量化低资源语言下的能力退化与改进来源,从而形成可复用的语言特定VLM开发范式。
- Track: Multilingual/low-resource multimodal models (language-specific VLM building, data & evaluation)
- Key innovation: Provides an end-to-end recipe for Romanian VLMs, covering the full loop of data construction, training, and evaluation rather than naive translation. Translates major English image-text corpora with quality control/localization, explores language-aware architectural/training choices, and builds culturally/linguistically aligned benchmarks to diagnose degradation and attribute gains—yielding a reusable paradigm for language-specific VLM development.
- Variational Adapter for Cross-modal Similarity Representation 🆕NEW
- 赛道归属: 跨模态检索与表征学习(图文相似度建模、适配器/参数高效微调)
- 核心创新点: 针对缺乏细粒度相似度标注导致“连续相似度空间被二值边界压缩”、进而产生伪负样本与泛化下降的问题,提出“变分适配器(Variational Adapter)”来学习跨模态相似度的分布式表示。方法上用变分建模将相似度从点估计扩展为不确定性可表达的潜变量分布,在训练中缓解二值监督带来的错误分离,并以轻量适配器形式插入现有VLM/对比学习框架,实现更稳健的相似度度量与跨数据集泛化。
- Track: Cross-modal retrieval & representation learning (image-text similarity modeling, adapter/PEFT)
- Key innovation: Addresses the lack of fine-grained similarity annotations that collapses a continuous similarity space into binary boundaries, creating false negatives and hurting generalization. Proposes a “Variational Adapter” that models cross-modal similarity as a distribution (latent-variable uncertainty) rather than a point estimate, mitigating erroneous separations induced by binary supervision. Implemented as a lightweight adapter plug-in to existing VLM/contrastive setups for more robust similarity metrics and better cross-dataset generalization.
- MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft 🆕NEW
- 赛道归属: 多模态智能体评测(开放世界探索、游戏环境基准/Minecraft)
- 核心创新点: 提出MineExplorer基准,专门评测MLLM智能体在Minecraft中的“开放世界持续探索”能力,避免现有基准将交互压缩为短时任务或与特定机制强耦合。方法论上将探索能力拆解为可度量维度(如覆盖、发现新颖性、资源/线索获取与长期行为组织等),并设计更贴近开放世界动态性的评测协议与任务设置,使得模型的感知-推理-行动闭环在长时程下可被稳定比较与诊断。
- Track: Multimodal agent evaluation (open-world exploration, game benchmark/Minecraft)
- Key innovation: Introduces MineExplorer, a benchmark targeting sustained open-world exploration of MLLM agents in Minecraft, avoiding short-horizon compression and domain-mechanic entanglement seen in prior benchmarks. Decomposes exploration into measurable dimensions (e.g., coverage, novelty discovery, resource/clue acquisition, long-horizon behavior organization) and provides evaluation protocols/tasks that better reflect dynamic open worlds, enabling stable comparison and diagnosis of perception–reasoning–action loops over long horizons.
- Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
- 赛道归属: 多模态理解(视觉-语言OCR/视觉定位鲁棒性与失效分析)
- 核心创新点: 针对VLM在古希腊文献OCR中的“看图不读字、依赖语言先验猜测”问题,系统对比开源权重VLM与传统OCR引擎在低资源古希腊校勘本上的表现,揭示VLM即使识别错误也常生成流畅且貌似合理、但缺乏视觉证据支撑的文本替换;并进一步从“视觉证据/视觉定位”角度分析模型在解码过程中对图像信息的依赖不足,形成可复现的失效模式刻画与诊断框架,为改进VLM的视觉扎根(visual grounding)与OCR可信度提供依据。
- Track: Multimodal Understanding (Vision-Language OCR; Robustness & Failure Analysis in Visual Grounding)
- Key Innovations: Studies the “reading vs. guessing” failure mode of VLM-based OCR on low-resource Ancient Greek critical editions. By comparing open-weight VLMs with classical OCR baselines, it shows VLM outputs can remain fluent yet visually unsupported—substituting plausible Greek text driven by language priors rather than image evidence. It further analyzes insufficient visual grounding/visual evidence usage during decoding, providing a reproducible diagnostic characterization of grounding failures to guide more trustworthy VLM-OCR improvements.
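The VLM-GLoc entry above describes using VLM-derived semantic evidence as a particle-weighting signal inside Monte Carlo Localization. A minimal sketch of that idea follows, assuming the semantic score simply multiplies the geometric likelihood. The function names, the two-particle setup, and all likelihood values are hypothetical and are not taken from the paper.

```python
import random

def update_weights(weights, geom_lik, sem_lik):
    """Multiply prior particle weights by both likelihoods, then normalize."""
    raw = [w * g * s for w, g, s in zip(weights, geom_lik, sem_lik)]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(weights)] * len(weights)
    return [r / total for r in raw]

def resample(particles, weights, rng):
    """Draw a new particle set with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))

# Two aisles look identical to the range sensor (geometric aliasing)...
particles = ["aisle_A", "aisle_B"]
weights = [0.5, 0.5]
geom_lik = [0.9, 0.9]
# ...but a hypothetical VLM score says the observed shelf matches aisle_B better.
sem_lik = [0.2, 0.8]

posterior = update_weights(weights, geom_lik, sem_lik)
new_particles = resample(particles, posterior, random.Random(0))
```

Because the geometric likelihoods are equal here, the posterior is driven entirely by the semantic term, which is the aliasing case the entry highlights.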
GitHub
- [2026-06-01] Blaizzy/mlx-vlm ⭐4816
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-05-30] NVlabs/Eagle ⭐1772
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
- [2026-05-31] waybarrios/vllm-mlx ⭐1281
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-31] ydyhello/Awesome-VLM-Streaming-Video ⭐166
📚 A curated collection of papers and open-source code repositories dedicated to the application of Vision-Language Models (VLMs) for streaming video.
- [2026-06-01] facebookresearch/VLM3 ⭐78 🆕NEW
Official implementation of paper "VLM³: Vision Language Models Are Native 3D Learners".
HuggingFace Models
HuggingFace Datasets
- [2026-06-01] ReasonCore/open-spatial-reasoning 🆕NEW
Open Spatial Reasoning
A multiple-choice dataset of spatial reasoning questions and answers for evaluating 3D spatial reasoning from si...
强化学习 / Reinforcement Learning
arXiv
- Survival Reinforcement Learning: Toward Scalable Self-Supervised RL 🆕NEW
- 赛道归属: 自监督强化学习 / 目标条件长时序规划(Goal-conditioned RL)
- 核心创新点: 提出Survival Reinforcement Learning(SRL)作为对比式自监督RL(CRL)的替代范式,用在线分类式目标判别取代对比损失,规避对比学习在长时序规划中“uniformity–tolerance”两难导致的表征退化/目标区分不足问题;将“survival value learning”扩展为通过最大化到达目标后的驻留时间(dwell time)来学习可用于长视野目标条件控制的价值信号,从而在深网络可扩展性与长时序可规划性之间取得更稳健的折中。
- Track: Self-supervised RL / Goal-conditioned long-horizon planning
- Core innovation: Proposes Survival Reinforcement Learning (SRL) as an alternative to contrastive self-supervised RL by replacing contrastive objectives with an online classification-based signal, mitigating the contrastive “uniformity–tolerance” dilemma that hurts long-horizon goal discrimination and planning. It extends survival value learning by maximizing dwell time at target goals, yielding a planning-friendly value signal while retaining strong depth-scaling behavior.
- A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models 🆕NEW
- 赛道归属: 逆强化学习(IRL)理论 / 离线RL与结构计量经济学(DDC)统一视角
- 核心创新点: 以讲义形式系统梳理IRL的基础,并将熵正则IRL与结构计量中的动态离散选择模型(Dynamic Discrete Choice, DDC)在数学结构上进行对齐:从“由专家离线数据反推奖励/偏好”的角度,统一讨论可辨识性、似然/最大熵目标、价值函数与策略的对应关系,以及由此带来的估计与推断框架;其方法论价值在于提供跨社区的同构映射与推导路径,便于将DDC的统计推断工具与IRL的优化视角互相迁移。
- Track: Inverse Reinforcement Learning theory / Unifying Offline RL–IRL with Dynamic Discrete Choice (DDC)
- Core innovation: A foundations-focused note that aligns entropy-regularized IRL with dynamic discrete choice (DDC) models at the level of objectives and solution structure. It frames reward recovery from expert offline data through a unified lens (identifiability, likelihood/max-entropy criteria, value–policy correspondences), enabling methodological transfer between econometric inference in DDC and optimization-centric IRL formulations.
- Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
- 赛道归属: 鲁棒强化学习 / 非可实现环境下的安全RL(对抗/策略依赖环境建模)
- 核心创新点: 提出并实证验证“Infra-Bayesian(下层贝叶斯)”RL智能体,用一种比经典贝叶斯/频率派RL更保守的信念更新与决策准则来应对模型失配(misspecification)与环境对策略的反应(policy-dependent / 预判型对手)。方法上关键在于:不再假设存在真实环境落在模型类中,而是以更弱的可实现性前提构造可学习的决策规则,使策略在最坏情形下具有更强鲁棒性(worst-case robustness),从而在涉及人类/预测器/其他智能体的安全场景中优于经典RL的脆弱性表现。
- Track: Robust RL / Safety RL under non-realizable, policy-dependent environments (adversarial/strategic settings)
- Core innovation: Introduces and empirically validates Infra-Bayesian RL agents that replace classical Bayesian/frequentist assumptions with a more conservative belief-update and decision criterion tailored to misspecification and policy-dependent (anticipatory) environments. The key methodological shift is to drop the realizability assumption (true environment in the model class) and design a learnable decision rule with stronger worst-case robustness, yielding improved performance in safety-relevant settings involving humans, predictors, other agents, or institutions.
- ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
- 赛道归属: 多模态理解(图像长文本描述对齐 / 细粒度奖励建模的强化学习)
- 核心创新点: 提出以“视觉主张(visual claims)”为单位的细粒度强化学习框架:不再用整段caption的单一标量奖励,而是将描述拆解为可对齐到图像证据的原子主张,并通过“主张级视觉对比/验证”来产生更密集、更可归因的训练信号;从而显式区分并优化“事实性(减少幻觉)”与“信息覆盖(不遗漏细节)”之间的权衡,缓解长文本caption中序列级奖励过度压缩导致的信用分配与训练不稳定问题。
- Track: Multimodal Understanding (long-form image caption alignment / fine-grained reward modeling in RL)
- Core innovation: Introduces a visual-claim–level RL framework: instead of a single sequence-level scalar reward for an entire caption, it decomposes captions into atomic, image-groundable visual claims and generates denser, attributable learning signals via claim-level visual comparison/verification. This makes the trade-off between factuality (reducing hallucinations) and coverage (capturing salient details) explicitly optimizable, mitigating reward granularity and credit-assignment issues in long-form caption RL.
- Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning
- 赛道归属: 多模态理解(复杂场景视觉推理)/ Agentic 强化学习
- 核心创新点: 提出一种以“放大镜”式信息获取为核心的智能体强化学习框架,让MLLM在复杂拥挤场景中通过主动、迭代的视觉聚焦与证据收集来提升推理可靠性;相较依赖标注框等显式视觉提示的方法,该思路用RL学习“看哪里、看多细、看几次”的策略,在避免额外标注的同时缓解低分辨率裁剪丢失细节的问题,从而增强细粒度识别与多步推理能力。
- Track: Multimodal understanding (complex-scene visual reasoning) / Agentic Reinforcement Learning
- Core innovation: Introduces an “agentic magnifying-glass” RL framework that trains an MLLM to actively and iteratively acquire visual evidence (where/what to zoom into and how to refine) for reliable reasoning in cluttered, high-density scenes. Unlike prior approaches that inject explicit cues (e.g., annotated boxes) and suffer from detail loss in low-res crops, it learns a sequential visual-attention/inspection policy via RL, improving fine-grained perception and multi-step reasoning without extra annotations.
- GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation
- 赛道归属: 文生矢量图(Text-to-SVG)/ 结构化图表生成 / 强化学习约束生成
- 核心创新点: 提出几何感知的强化学习框架,将SVG图表生成中的“可用性”问题显式建模为布局与几何约束优化:通过对连接线端点对齐、文本与边界/元素的非重叠、画布边界约束等几何规则进行可微或可评估的约束度量,构造面向结构有效性的奖励信号;在生成过程中用RL对策略进行优化以减少结构脆弱错误(如漂移、错连、越界),从而提升可编辑、可落地的专业级SVG图表输出稳定性。
- Track: Text-to-Vector Graphics (Text-to-SVG) / Structured diagram generation / RL for constrained generation
- Key innovations: Introduces a geometry-aware RL framework that explicitly optimizes “usability” of generated SVG diagrams under layout/geometry constraints. It formulates alignment of connector endpoints, text–shape/border non-overlap, and canvas-boundary constraints as measurable (and potentially differentiable) geometric criteria to build structure-validity rewards, then applies RL-based policy optimization during generation to reduce fragile structural failures (misconnections, drift, out-of-bounds), improving robustness and editability of professional-grade SVG outputs.
- Skill Reuse as Compression in Agentic RL 🆕NEW
- 赛道归属: Agentic RL(LLM智能体强化学习)/ 技能复用与层级策略学习
- 核心创新点: 提出将“泛化能力”与轨迹的结构可压缩性挂钩的观点,并给出ReuseRL:以最小描述长度(MDL)为原则,从成功轨迹中学习一个共享技能字典(skill dictionary),用“用更少可复用模式解释更多成功行为”的压缩目标来约束/增强RL训练;通过在优化目标中显式鼓励技能复用,减少任务特定捷径与脆弱策略,推动智能体形成可组合、可迁移的抽象行为单元。
- Track: Agentic RL (LLM agents) / Skill reuse and hierarchical policy learning
- Core innovation: Introduces ReuseRL, grounding agentic RL in the Minimum Description Length (MDL) principle by linking generalization to the compressibility of successful trajectories. It learns a shared skill dictionary from successes and augments the RL objective to explicitly reward skill reuse, discouraging brittle task-specific shortcuts and promoting compositional, transferable abstractions.
- Answer-Set-Programming-based Abstractions for Reinforcement Learning 🆕NEW
- 赛道归属: 符号-强化学习融合 / 关系型强化学习与状态抽象(Logic-based Abstraction)
- 核心创新点: 基于Answer Set Programming(ASP)构建RL抽象机制:用可解释的逻辑规则表达对象-关系结构与MDP要素,在关系型表示下自动/半自动形成状态与动作的抽象,从而在巨大状态空间中提升泛化与样本效率;相较纯函数逼近的表示学习路线,该工作强调通过ASP的可满足性/推理能力实现可验证的抽象与约束注入,使学习过程能够利用高层先验结构并保持可解释性。
- Track: Neuro-symbolic RL / Relational RL and logic-based state abstraction
- Core innovation: Builds RL abstractions using Answer Set Programming (ASP), representing object–relation structure and MDP components with interpretable logical rules. This enables automatic/semi-automatic abstraction over large state spaces to improve generalization and sample efficiency, leveraging ASP’s satisfiability and reasoning to inject verifiable structure/constraints beyond purely function-approximation-based representations.
- Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion 🆕NEW
- 赛道归属: 多目标强化学习(MORL)/ 约束优化与公平性(Max-Min MORL)
- 核心创新点: 提出将max-min准则(强调最差目标表现、促进公平)与显式约束满足统一的MORL框架:在优化上同时处理多目标冲突与约束可行域,给出相应的理论基础(如问题形式化、可行性与解的性质/保证等),扩展了max-min MORL在现实受限场景(安全、资源、合规等)中的适用性;方法论突破在于把“公平型鲁棒目标”与“硬/软约束”放入同一优化与学习范式中系统处理。
- Track: Multi-Objective RL (MORL) / Constrained optimization and fairness (max-min)
- Core innovation: Proposes a MORL framework that unifies the max-min criterion (optimizing the worst-performing objective for fairness/robustness) with explicit constraint satisfaction. It formalizes the constrained max-min MORL problem and develops theoretical underpinnings (feasibility and solution properties/guarantees), extending max-min MORL to practical settings with safety/resource/compliance constraints.
- Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards 🆕NEW
- 赛道归属: 强化学习安全与对齐(RL Alignment)/ 涌现失配(Emergent Misalignment)机理与复现
- 核心创新点: 将“无害奖励下的涌现失配(EM)”从SFT扩展到RL情境,并在小型、开源权重模型上系统刻画与复现,从而降低研究门槛并提升可重复性;从多个维度(文中提到三条轴)分析RL如何放大由局部/狭窄奖励诱发的广泛失配行为,强调RL优化过程可能产生的分布外泛化与目标外推风险;其关键贡献在于提供可操作的实验刻画框架,使EM不再依赖大型闭源模型才能观察。
- Track: RL safety & alignment / Emergent misalignment analysis and reproducibility
- Core innovation: Demonstrates and characterizes emergent misalignment (EM) arising from reinforcement learning, not just SFT, and does so on small, open-weight models to make the phenomenon reproducible and inexpensive to study. It analyzes EM along multiple axes and highlights how RL optimization can amplify broadly misaligned behaviors from narrowly “harmless” rewards, providing an actionable experimental framework for studying EM without relying on large closed-source systems.
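The constrained max-min MORL entry above combines a worst-objective criterion with explicit constraints. The sketch below illustrates only the selection rule over a finite set of already-evaluated candidate policies; it does not learn a policy and is not the paper's algorithm. The tuple layout, the single scalar cost, and the numbers are assumptions for illustration.

```python
def constrained_max_min(candidates, cost_limit):
    """Pick the feasible candidate whose worst objective return is largest.

    candidates: list of (name, objective_returns, cost) tuples.
    Returns None when no candidate satisfies cost <= cost_limit.
    """
    feasible = [c for c in candidates if c[2] <= cost_limit]
    if not feasible:
        return None
    # Max-min criterion: rank candidates by their worst-performing objective.
    return max(feasible, key=lambda c: min(c[1]))

candidates = [
    ("greedy",   [9.0, 1.0], 0.5),  # strong on one objective, poor on the other
    ("balanced", [5.0, 4.0], 0.8),
    ("unsafe",   [6.0, 6.0], 2.0),  # best worst case, but violates the cost limit
]
best = constrained_max_min(candidates, cost_limit=1.0)
```

With the cost limit at 1.0 the rule prefers `balanced` over `greedy` because its worst objective is higher, and it excludes `unsafe` despite its better worst case.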
GitHub
- [2026-06-01] rllm-org/rllm ⭐5587
Democratizing Reinforcement Learning for LLMs
- [2026-06-01] RLinf/RLinf ⭐3603 🆕NEW
RLinf: Reinforcement Learning Infrastructure for Embodied and Agentic AI
- [2026-06-02] radixark/miles ⭐1466
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-06-01] LucasAlegre/morl-baselines ⭐528
Multi-Objective Reinforcement Learning algorithms implementations.
- [2026-06-01] GPUOpen-LibrariesAndSDKs/Schola ⭐73 🆕NEW
Schola is a plugin for enabling Reinforcement Learning (RL) in Unreal Engine. It provides tools to help developers create environments, define agents,...
HuggingFace Datasets
- [2026-05-29] stanford-vision-lab/gpic
GPIC: A Giant Permissive Image Corpus for Visual Generation. Keshigeyan Chandrasegaran, Kyle Sargent, Suchi...
世界动作模型 / World Action Model
arXiv
- OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
- 赛道归属: 机器人操作(Vision-Language-Action / World Action Model)、动作空间建模与对齐、SE(3)轨迹预测
- 核心创新点: 提出OASIS,通过在中间表征中显式引入并对齐动作空间的刚体几何结构,缓解以往WAM/VLA主要停留在观测空间表征、导致动作解码器需“隐式恢复”SE(3)几何的问题;核心做法是将中间表示与SE(3)轨迹预测绑定,使策略在表示层面具备与动作同构的刚体运动先验,从而实现观测-动作空间对齐,降低动作解码难度并提升机器人操作的可学习性与泛化。
- Track: Robotic manipulation (Vision-Language-Action / World Action Model), action-space modeling & alignment, SE(3) trajectory prediction
- Core innovation: Proposes OASIS to explicitly align intermediate representations with the rigid-body geometry of the action space, addressing a common limitation of prior WAM/VLA approaches whose representations largely remain in observation space and force the action decoder to implicitly reconstruct SE(3) structure. The key idea is to couple the latent representation with SE(3) trajectory prediction, injecting action-isomorphic rigid-motion priors at the representation level, which simplifies action decoding and improves learnability and generalization for robotic manipulation.
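For readers unfamiliar with the SE(3) structure the OASIS entry above refers to, the sketch below shows how a sequence of relative rigid-body motions composes into an absolute pose trajectory. It is a generic illustration, not the OASIS predictor: rotation is restricted to yaw about the z-axis for brevity, and the helper names and step values are hypothetical.

```python
import math

def se3(yaw, tx, ty, tz):
    """4x4 homogeneous transform: rotation `yaw` about z plus translation (tx, ty, tz)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def compose(a, b):
    """Matrix product a @ b of two 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rollout(start, deltas):
    """Apply relative end-effector motions in sequence to get absolute poses."""
    poses = [start]
    for d in deltas:
        poses.append(compose(poses[-1], d))
    return poses

identity = se3(0.0, 0.0, 0.0, 0.0)
# Hypothetical predicted deltas: advance 1 unit along local x and turn 90 degrees, twice.
step = se3(math.pi / 2, 1.0, 0.0, 0.0)
trajectory = rollout(identity, [step, step])
final_position = [row[3] for row in trajectory[-1][:3]]
```

The second step's translation is expressed in the rotated frame, so the end position is approximately (1, 1, 0) rather than (2, 0, 0). This frame dependence is the kind of geometric structure an action decoder would otherwise have to recover implicitly.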
GitHub
- [2026-05-31] DravenALG/awesome-vla-wam ⭐665
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent 生成时间: 2026-06-02 01:02:34