AI 每日进展速报 / Daily AI Digest - 2026-05-31
图像生成/编辑 / Image Generation/Editing
arXiv
- Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization
- 赛道归属: 文生图(偏好对齐/强化学习式对齐,组合生成)
- 核心创新点: 提出Region-aware的双模态直接偏好优化(BiDPO),将“偏好学习”从全图层面对齐推进到“区域级/关系级”的组合语义对齐:通过构建高质控的大规模偏好数据集BiComp,针对属性绑定、对象关系、计数等组合难点提供可学习的偏好信号;并在优化时显式利用区域感知与图文双模态信息,使模型在不改变基础生成范式的情况下,更稳定地满足复杂提示词的结构化约束与局部一致性。
- Track: Text-to-Image (preference alignment / RL-style alignment, compositional generation)
- Core innovation: Proposes BiDPO, a region-aware bimodal Direct Preference Optimization framework that upgrades preference learning from global image alignment to region-/relation-level compositional alignment. It builds a large-scale, strictly quality-controlled preference dataset (BiComp) targeting hard compositional skills (attribute binding, object relations, counting), and optimizes with explicit region awareness plus bimodal (text+image) signals to better satisfy structured constraints in complex prompts without changing the base generation paradigm.
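For readers unfamiliar with the objective that BiDPO builds on, the following is a minimal sketch of the standard (global, non-region-aware) DPO loss for a single preference pair. It is not the paper's region-aware bimodal variant, which the summary above does not specify. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: log-likelihood of the preferred / rejected sample
    under the policy being trained (hypothetical inputs).
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference model.
    """
    # Implicit reward margin: how much more the policy favours the preferred
    # sample than the reference model does, scaled by beta.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy and the reference agree, the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# When the policy favours the preferred sample more than the reference, it drops.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

A region-level variant would presumably compute such margins over localized signals rather than a whole image, but that detail is an assumption and not something the summary states.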
- PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation
- 赛道归属: 多条件文生图(扩散模型可控生成 / ControlNet增强)
- 核心创新点: 提出一种“动态Patch自适应”的多条件融合机制,在扩散去噪过程中按空间区域(patch)动态分配与调整不同控制信号的影响权重/注入方式,替代传统ControlNet为每种条件建立独立分支的静态融合范式;通过缓解多源异构条件之间的指导冲突,实现更强的组合式条件遵循(结构与语义同时对齐)并减少结构扭曲,在保持高画质的同时提升多条件一致性与可控性。
- Track: Multi-conditional text-to-image (diffusion controllable generation / ControlNet enhancement)
- Core innovation: Introduces a dynamic patch-wise adaptation scheme that modulates how multiple heterogeneous control signals are injected during diffusion denoising on a per-region (patch) basis, replacing the static multi-branch ControlNet-style fusion. By reducing inter-condition guidance conflicts, it improves compositional conditioning fidelity (better joint structural/semantic alignment) while mitigating distortions and preserving high visual quality.
- DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation
- 赛道归属: 文生图(Text-to-Image)/ 偏好对齐与奖励建模(Reward Modeling, RLHF/RLAIF)
- 核心创新点: 提出动态、准则感知(Dynamic Criterion-Aware)的奖励建模框架 DyCoRM,使奖励模型不再依赖固定的通用评分维度,而是能根据用户当前关注的评价准则(如美学、文本一致性、细节、风格等)动态调整评估与打分机制;通过将“评价准则”显式纳入奖励学习与推断过程,实现对多样化、个性化偏好的更精细建模,从而为文生图生成提供更可控、更贴合需求的优化信号,提升对齐效果与泛化到不同偏好场景的能力。
- Track: Text-to-Image Generation / Preference Alignment & Reward Modeling (Reward Modeling, RLHF/RLAIF)
- Key innovation: Proposes DyCoRM, a Dynamic Criterion-Aware reward modeling framework that moves beyond static, one-size-fits-all scoring dimensions by conditioning the reward model on the user’s active evaluation criteria (e.g., aesthetics, prompt faithfulness, detail, style) and dynamically adapting how images are assessed; by explicitly incorporating “criteria” into reward learning and inference, it enables finer-grained modeling of diverse and personalized preferences, providing more controllable and better-aligned optimization signals for T2I generation and improving generalization across preference scenarios.
- TextBoost: Boosting Text Encoder for Personalized Text-to-Image Generation
- 赛道归属: 个性化文生图(Text-to-Image Personalization / Diffusion 微调)
- 核心创新点: 提出 one-shot 个性化方案,仅选择性微调文本编码器而非扩散模型主体/UNet,从而显著降低存储开销与训练成本并加快收敛;为避免个性化导致的语义漂移,引入“因果性保持”(causality-preserving) 机制/约束,在注入新概念表征的同时尽量维持原有文本语义结构与可组合性,实现更稳健的个性化生成与更好的泛化。
- Track: Personalized Text-to-Image generation (Diffusion personalization / fine-tuning)
- Core innovation: Proposes an efficient one-shot personalization method that fine-tunes only the text encoder instead of large diffusion components (e.g., UNet), greatly reducing training/storage cost and improving convergence; introduces a causality-preserving mechanism/constraint to prevent semantic drift, maintaining the original text semantic structure and compositionality while injecting new concept representations for more robust personalized generation.
- Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation
- 赛道归属: 文生图评测(基准/指标,面向创作能力评估)
- 核心创新点: 提出Qwen-Image-Bench,将评测目标从传统“文本-图像一致性/基础画质”扩展到更贴近真实创作工作流的“从生成到创作”能力刻画:强调对真实世界重建的可信度与创意表达等更高阶维度,设计能区分模型在专业创作场景中关键能力差异的评测集合与判别框架,从而缓解现有benchmark对艺术实践需求覆盖不足、区分度不够的问题。
- Track: Text-to-Image evaluation (benchmark/metrics, creativity-oriented assessment)
- Core innovation: Introduces Qwen-Image-Bench to move beyond classic text-image alignment and basic visual quality, toward capabilities that matter in real creative workflows—faithful real-world reconstruction and genuine creative expression. It provides an evaluation suite and judging protocol aimed at better discriminating models on higher-level, practice-relevant skills that existing benchmarks under-represent.
- PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
- 赛道归属: 超高分辨率文生图(Ultra-High-Resolution T2I / 原生 UHR 生成)
- 核心创新点: 面向原生超高分辨率(最高 100MP)生成的关键瓶颈(数据稀缺与内容复杂),构建并开源大规模高质量 UHR 数据集 PixVerve-95K,通过系统化的数据筛选/清洗与质量控制提升训练信号密度;以数据为核心驱动推进“原生 UHR”生成能力(而非仅依赖后处理放大),为训练与评测超高像素级 T2I 模型提供可复用的数据基础与基准。
- Track: Ultra-High-Resolution Text-to-Image generation (native UHR generation)
- Core innovation: Targets the main bottleneck of native UHR (up to 100MP) generation—scarce and complex high-res data—by curating and open-sourcing PixVerve-95K, a large-scale high-quality UHR dataset with systematic filtering/cleaning and quality control to increase effective training signal; advances native UHR capability (beyond post-hoc upscaling) by providing a reusable data foundation and benchmark for training/evaluating extremely high-pixel T2I models.
- DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
- 赛道归属: 图像编辑(基于流模型/扩散式流程的免训练编辑,反演)
- 核心创新点: 提出DirectEdit,实现“步级准确”的反演以支持基于流模型(flow-based)的编辑:针对现有免训练编辑方法中常见的反演-前向去噪流程因“时间步不匹配”导致的重建误差累积问题,DirectEdit在反演阶段对齐每一步的潜变量/时间步,使重建路径与编辑路径在对应step上严格一致,从而显著降低误差传播,提升重建保真度与编辑稳定性(尤其在多步编辑或强编辑强度下)。
- Track: Image editing (flow-based / diffusion-style pipeline, training-free editing, inversion)
- Core innovation: Proposes DirectEdit with step-level accurate inversion for flow-based editing. It addresses error accumulation caused by timestep-mismatched noisy latents in common inversion+forward denoising pipelines by aligning latents per step so reconstruction and editing trajectories are consistent at corresponding timesteps, reducing drift and improving reconstruction fidelity and editing robustness, especially for longer or stronger edits.
- FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation
- 赛道归属: 文生图(多模态推理增强的图像生成 / Unified MLLM for T2I)
- 核心创新点: 提出细粒度多模态推理框架,将统一式MLLM的“理解-生成”闭环能力用于文生图的自反思与自改写:不再停留在简单的提示词扩写或整体图文一致性打分,而是引入更细粒度的推理与评估信号(如对属性、关系、局部区域/对象级要点的逐项核对),驱动生成过程进行针对性的迭代修正,从而提升复杂指令下的可控性与语义一致性。
- Track: Text-to-Image Generation (multimodal reasoning-enhanced image generation / unified MLLM for T2I)
- Key innovations: Proposes a fine-grained multimodal reasoning framework that leverages a unified MLLM’s closed-loop “understand–generate” capability for self-reflection and self-refinement in T2I. Instead of relying on prompt augmentation or holistic image-text alignment scoring, it introduces finer-grained reasoning/evaluation signals (e.g., attribute-, relation-, and region/object-level checks) to guide targeted iterative corrections during generation, improving controllability and semantic faithfulness for complex prompts.
- ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization
- 赛道归属: 文生图个性化(少样本定制/概念注入,扩散模型)
- 核心创新点: 提出ACCORD,用“依赖性正则化(dependence regularization)”直接缓解个性化中的概念耦合:针对少量参考图导致模型把目标概念与背景/风格/共现物体等产生非期望绑定的问题,在训练/微调目标中显式惩罚个性化token与无关概念之间的统计依赖或表征耦合,从机制上提升“文本可控性—个性化保真度”的可调平衡,减少过拟合式的错误联想与提示词失控。
- Track: Text-to-Image personalization (few-shot customization / concept injection, diffusion)
- Core innovation: Proposes ACCORD, introducing dependence regularization to directly mitigate concept coupling in personalization. With few reference images, models often form spurious bindings between the target concept and incidental co-occurring attributes (background/style/objects). ACCORD explicitly penalizes unwanted statistical/representation dependence between the personalized token and irrelevant concepts during adaptation, improving the controllability–fidelity trade-off and reducing overfitting-driven prompt hijacking.
- No Safe Dose: How Training Data Drives Unsafe Image Generation
- 赛道归属: 文生图安全(数据治理/安全性机理分析)
- 核心创新点: 通过严格控制变量的训练实验,系统量化“训练数据中不安全样本占比”对不安全图像生成的因果驱动:在模型结构与训练流程一致的前提下,仅改变数据集中不安全图像比例(0%–9.6%)并覆盖不同数据规模(100K–8M),从而分离并验证数据成分对输出安全性的直接影响,揭示即便低剂量不安全数据也可能显著影响生成风险,为数据过滤阈值、配比策略与安全训练/审计提供可操作的实证依据。
- Track: Text-to-Image safety (data governance / mechanistic analysis of safety)
- Core innovation: Uses controlled training experiments to causally quantify how the fraction of unsafe training images drives unsafe generation. Keeping architecture and training fixed, it varies only the unsafe-content ratio (0%–9.6%) across multiple dataset scales (100K–8M), isolating data composition as the key variable. The results support the “no safe dose” premise—small unsafe fractions can materially affect risk—providing actionable evidence for filtering thresholds, dataset mixing policies, and safety auditing/training practices.
GitHub
- [2026-05-31] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12316
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-05-30] AceDataCloud/Nexior ⭐372
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-05-30] VigoZhao/AI-Visual-Prompt-Cookbook ⭐168
Curated collection of reusable JSON prompt templates & style references for AI image generation. Updated daily.
- [2026-05-30] Z1rconium/gpt-image-linux ⭐75
Self-hosted web panel for GPT-compatible image generation APIs — generate, edit, and manage your images in one place.
- [2026-05-30] Dusktarepresent/Leonardo-AI-cracked ⭐58 🆕NEW
Leonardo AI - AI image generation and design workflow platform for concept art, marketing assets, and creative teams. Official purchase/referral page ...
HuggingFace Datasets
- [2026-05-29] jasperai/monet
Dataset Card for MONET
MONET (Massive, Open, Non-redundant and Enriched Text-to-image dataset) is a large-scale, curated image-text dat...
视频生成/编辑 / Video Generation/Editing
arXiv
- MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
- 赛道归属: 视频生成(文生视频 / 多智能体提示工程)
- 核心创新点: 提出多智能体提示精炼框架 MAVEN,面向“多文化/跨文化”文生视频的文化保真度问题,将文本提示分解为人物(person)、动作(action)、地点(location)等可控维度,并由专门代理并行/串行协作改写与补全文化关键信息;通过结构化分解降低单一提示对文化细节的丢失与歧义,提升同文化与跨文化场景下生成内容的文化一致性与可评测性。
- Track: Video Generation (Text-to-Video / Multi-agent Prompting)
- Key innovation: Introduces MAVEN, a multi-agent prompt-refinement framework targeting cultural fidelity in mono- and cross-cultural T2V. It decomposes prompts into controllable dimensions (person/action/location) handled by specialized agents in parallel or sequential workflows, explicitly enriching under-specified cultural attributes and reducing ambiguity that typical single-prompt pipelines cannot recover.
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
- 赛道归属: 文生视频(Text-to-Video)/ 3D一致性对齐(强化学习)
- 核心创新点: 通过强化学习而非结构改造来注入3D约束:将“几何一致性/世界约束”显式构造成奖励信号,对视频生成模型进行对齐优化,从而在不显著增加推理开销、保持可扩展性的前提下缓解几何不一致问题;同时构建面向“世界模拟”的纯文本数据集,用于更系统地覆盖可被3D约束检验的描述分布,提升对齐训练的有效性与泛化。
- Track: Text-to-Video / 3D-consistency alignment (Reinforcement Learning)
- Core innovation: Injects 3D constraints via RL-based alignment instead of architectural modifications: formulates geometric/world-consistency as explicit rewards to optimize a video generator, improving geometric coherence without adding substantial inference cost and preserving scalability; additionally introduces a world-simulation-oriented text-only dataset to better cover descriptions that are verifiable under 3D constraints, strengthening alignment and generalization.
- OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
- 赛道归属: 视频生成(文生视频/扩散Transformer加速与部署优化)
- 核心创新点: 提出面向DiT视频生成的系统级效率方案,将“稀疏注意力 + 序列并行 + 低比特量化 + 强化学习”联合设计以在质量不降的前提下降本增效:1) 采用混合全注意力-稀疏注意力架构,用固定模式的 Skiparse-2D 在时空token维度做token级与group级稀疏连接,缓解全注意力二次复杂度;2) 引入稀疏序列并行(Sparse Sequence Parallelism)以更好匹配稀疏计算图,提升多卡吞吐与可扩展性;3) 使用 HiF8(8-bit)量化降低显存与带宽开销,面向推理/训练的硬件友好实现;4) 通过强化学习对生成策略/偏好进行对齐,在引入稀疏与量化后维持或提升感知质量与文本一致性。
- Track: Video generation (text-to-video / Diffusion-Transformer acceleration & deployment optimization)
- Core innovations: A system-level efficiency recipe for DiT-based video generation that jointly combines “sparse attention + sequence parallelism + low-bit quantization + RL” to reduce cost without sacrificing quality: 1) a hybrid full–sparse attention design using fixed-pattern Skiparse-2D to apply token-wise and group-wise sparsity over spatiotemporal tokens, mitigating quadratic attention cost; 2) Sparse Sequence Parallelism to better align distributed execution with sparse computation graphs for higher multi-GPU throughput and scalability; 3) HiF8 (8-bit) quantization to cut memory/bandwidth with hardware-friendly training/inference; 4) reinforcement learning-based alignment to preserve/improve perceptual quality and prompt faithfulness under sparsity/quantization constraints.
- Paris 2.0: A Decentralized Diffusion Model for Video Generation
- 赛道归属: 视频生成(去中心化训练 / 分布式扩散模型)
- 核心创新点: 提出首个通过去中心化计算预训练的视频扩散生成模型,将原本在图像上验证的去中心化扩散训练范式扩展到需要强时序一致性的文本生成视频任务;核心突破在于给出去中心化场景下实现时序连贯训练的配方与机制,使得无需单体GPU集群也能完成低分辨率T2V预训练,并在去中心化通信与优化约束下维持跨帧一致性与可训练性。
- Track: Video generation (decentralized training / distributed diffusion)
- Key innovation: Introduces the first video diffusion generator pre-trained via decentralized computation, extending decentralized diffusion training from images to temporally coherent text-to-video. The main methodological advance is a training recipe/mechanism that preserves temporal coherence under decentralized optimization and communication constraints, enabling low-res T2V pretraining without a monolithic GPU cluster.
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment
- 赛道归属: 图生视频生成(I2V)/ 强化学习式后训练(RLHF/RLAIF for generative models)
- 核心创新点: 提出TAGRPO用于I2V的稳健后训练,指出GRPO在I2V上“奖励不稳定/不持续提升”的关键症结在于视频生成的多步轨迹与奖励信号之间存在错位;方法上引入“直接轨迹对齐”(Direct Trajectory Alignment)的对比学习式目标,将高奖励样本的去噪/流匹配轨迹作为正样本对齐参照、低奖励轨迹作为负样本拉开,从而在不改变基础生成架构的情况下,更稳定地把奖励偏好注入到整段生成轨迹而非仅末端结果,提升可控性与一致性。
- Track: Image-to-Video generation (I2V) / RL-style post-training (RLHF/RLAIF for generative models)
- Core innovation: Proposes TAGRPO as a robust post-training framework for I2V, diagnosing that naïvely applying GRPO yields inconsistent reward gains due to misalignment between multi-step generation trajectories and reward signals. It introduces Direct Trajectory Alignment with a contrastive-learning-like objective: align denoising/flow-matching trajectories from high-reward samples as positives and push away low-reward trajectories as negatives, injecting preference into the whole trajectory (not just final frames) without changing the base architecture, improving stability and controllability.
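As background for the GRPO baseline that TAGRPO modifies, here is a minimal sketch of the group-relative advantage commonly used in GRPO-style training: each sample's reward is standardized against the other samples generated for the same input. This is only the generic baseline step, not the Direct Trajectory Alignment objective. The function name and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three videos generated from one input image.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)  # below-average sample is negative, above-average is positive
```

Note that when all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.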
- Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
- 赛道归属: 文生图/文生视频/图生视频(基础大模型体系与工程化)
- 核心创新点: 给出Kandinsky 5.0成体系的图像与视频基础模型家族,通过“分层产品线”覆盖不同算力与质量需求:6B级高分辨率图像模型(Image Lite)、2B级轻量快速的T2V/I2V(Video Lite)、19B级高质量视频模型(Video Pro)。技术价值在于将图像与10秒视频生成统一到可扩展的基础模型栈中,并通过不同规模与配置实现质量-速度-成本的可部署权衡,为实际应用提供从轻量到旗舰的可迁移方案与训练/推理配方。
- Track: Text-to-Image / Text-to-Video / Image-to-Video (foundation model family & systemization)
- Core innovation: Presents Kandinsky 5.0 as a structured family of foundation models spanning high-res image and 10-second video synthesis, organized into tiered lineups to cover different compute/quality regimes: 6B Image Lite, 2B fast/light Video Lite for T2V/I2V, and 19B Video Pro for top quality. The key contribution is a scalable, unified model stack with practical quality–latency–cost trade-offs and deployable recipes across sizes, enabling transfer across product tiers.
- Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
- 赛道归属: 多模态表征学习(空间智能评测/预训练范式对比)
- 核心创新点: 系统性对比两类基础预训练范式对“空间智能(语义+几何/物理结构)”的贡献,建立可复现实证框架来回答“VLM 还是 VGM 更利于空间表征”:1) 统一评测协议与任务集合,将空间能力拆解为对象语义理解与几何结构/空间关系建模等维度;2) 在相近规模/设置下对视觉-语言对齐监督(VLM)与时序世界建模监督(视频生成模型VGM)进行对照实验,隔离数据、目标函数与架构差异带来的混淆因素;3) 通过细粒度诊断揭示不同预训练信号在空间推理、几何一致性、跨视角/时序泛化等方面的优势与短板,为后续“混合式预训练目标”或“以生成式时序学习补足VLM几何能力”等方法设计提供依据。
- Track: Multimodal representation learning (spatial intelligence evaluation / pretraining paradigm comparison)
- Core innovations: An empirical, reproducible framework to compare how two major pretraining paradigms contribute to “spatial intelligence” (semantics + geometry/physical structure), directly addressing whether VLMs or VGMs yield better spatial representations: 1) a unified evaluation protocol and task suite that decomposes spatial capability into object semantics and geometric/spatial-relationship modeling; 2) controlled comparisons between language-aligned supervision (VLM) and temporally evolving world modeling via video generation (VGM), reducing confounds from data/objectives/architectures; 3) fine-grained diagnostics that expose strengths/weaknesses in spatial reasoning, geometric consistency, and cross-view/temporal generalization, informing future hybrid objectives (e.g., adding generative temporal learning to complement VLM geometry).
- MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
- 赛道归属: 视频生成评测(多说话人音视频生成/电影化表现力基准)
- 核心创新点: 面向多角色场景的“电影化表现力”提出新一代诊断基准,超越传统lip-sync与音画对齐指标:1) 构建 MTAVG-Bench 2.0,将评测从口型同步扩展到场景级叙事与表演一致性,覆盖多角色互动、镜头语言、情绪/表演连贯性、角色身份保持等更高层次维度;2) 以失败模式诊断为核心设计评测维度与标注/协议,能够定位模型在多说话人切换、遮挡与交互、镜头切换下的典型崩溃点;3) 提供更贴近真实制作需求的综合评价体系,促进模型从“对齐正确”走向“表达有戏”,并为后续训练(如偏好学习/奖励建模)提供可量化目标。
- Track: Video generation evaluation (multi-talker audio-video generation / cinematic expressiveness benchmark)
- Core innovations: A next-generation diagnostic benchmark targeting “cinematic expressiveness” in multi-character MTAVG, going beyond lip-sync and basic A/V alignment: 1) MTAVG-Bench 2.0 elevates evaluation to scene-level narrative/performance coherence, covering multi-character interaction, cinematic language, emotion/performance continuity, and identity consistency; 2) failure-mode–oriented design with evaluation dimensions and protocols that pinpoint typical breakdowns under speaker turns, occlusions/interactions, and shot transitions; 3) a production-relevant holistic metric suite that pushes models from “correct alignment” to “expressive acting,” and can serve as measurable targets for preference learning/reward modeling.
- SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control
- 赛道归属: 视频生成(关键帧条件控制/叙事节奏与镜头调度)
- 核心创新点: 提出以多关键帧为核心的可控叙事视频生成框架,实现对“故事结构与节奏(pacing)”的显式控制:1) 用多关键帧替代仅文本或首尾帧的稀疏条件,将叙事节点(事件/镜头意图)离散化并注入生成过程,提升长时序一致性与可控性;2) 引入叙事节奏控制机制,在关键帧之间对时长分配、过渡速度与镜头推进进行调度,使生成视频能按目标节奏展开而非平均/漂移;3) 将关键帧条件与生成模型的时序建模耦合,增强跨段落的角色/场景连续性与镜头连贯性,面向“导演式”生成提供更细粒度控制接口。
- Track: Video generation (keyframe-conditioned control / narrative pacing & shot-level directing)
- Core innovations: A controllable narrative video generation framework centered on multiple keyframes, enabling explicit control over story structure and temporal pacing: 1) replaces sparse conditioning (text or first/last frame) with multi-keyframe conditioning that encodes narrative beats/events and injects them into generation for stronger long-horizon coherence and controllability; 2) introduces a pacing-control mechanism to schedule duration allocation and transition speed between keyframes, preventing uniform timing or drift and enabling target narrative rhythm; 3) couples keyframe constraints with temporal modeling to improve cross-segment character/scene continuity and shot coherence, providing a more “director-like” control interface for cinematic generation.
- PARE: Pruning and Adaptive Routing for Efficient Video Generation
- 赛道归属: 视频生成推理优化(高效Video DiT / 结构化压缩与动态路由)
- 核心创新点: 提出PARE,将“结构化剪枝(Pruning)”与“自适应路由(Adaptive Routing)”联合用于视频DiT的高效生成,突破以往只做固定宽度/深度/步数压缩、无法随输入与去噪阶段动态调整的限制。方法上进行结构感知的宽度与深度联合裁剪,并引入按输入难度与去噪时刻选择性激活子网络/层的路由机制,使模型在不同样本与不同扩散阶段分配不同计算预算,在尽量保持画质的同时显著降低平均计算量与延迟。
- Track: Efficient video generation inference (Video DiT optimization / structured compression + dynamic routing)
- Core innovation: Proposes PARE, combining structure-aware pruning with adaptive routing to accelerate Video Diffusion Transformers. Unlike prior fixed compression of width/depth/steps, it jointly prunes width and depth and dynamically routes computation by selectively activating subnetworks/layers conditioned on input difficulty and denoising stage, allocating compute budget per-sample and per-timestep to reduce average FLOPs/latency while preserving quality.
GitHub
- [2026-05-31] hao-ai-lab/FastVideo ⭐3655
A unified inference and post-training framework for accelerated video generation.
- [2026-05-30] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1256
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-05-30] marcelo-earth/generative-manim ⭐841
🎨 GPT for video generation ⚡️
- [2026-05-30] DistributorRecord/Kling-AI-Video-Generator-cracked ⭐60 🆕NEW
Kling AI Video Generator - AI video generation workflow for text-to-video, image-to-video, creative clips, and social content. Includes setup notes, S...
- [2026-05-30] Opticcluruminate/Runway-AI-Video-cracked ⭐52 🆕NEW
Runway AI Video - AI video generation and creative editing platform for creators, marketers, and production teams. Official purchase/referral page wit...
HuggingFace Models
音频生成 / Audio Generation
arXiv
- FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations
- 赛道归属: 音频生成|零样本文本转语音(Zero-shot TTS)|可控生成(风格/音色解耦控制)
- 核心创新点: 通过解耦语音表征将语音分解为可解释属性(如内容、韵律/风格、音色等),并在零样本TTS中实现来自不同参考音频的分离式条件控制:用一段参考提供说话人音色、另一段参考提供说话风格/韵律,从而突破以往“单一参考同时绑定音色与风格”的耦合限制;方法上强调在表示学习与条件注入机制上实现属性独立性,使模型在保持高保真克隆的同时获得可组合、可编辑的控制能力。
- Track: Audio Generation | Zero-shot Text-to-Speech (TTS) | Controllable generation (disentangled style/timbre control)
- Core innovation: Introduces disentangled speech representations that factor speech into interpretable attributes (e.g., content, prosody/style, timbre) and enables separate-reference conditioning in zero-shot TTS: one reference supplies speaker timbre and another supplies speaking style/prosody. This addresses the common entanglement where a single prompt jointly determines both. The method emphasizes attribute independence in both representation learning and the conditioning/injection mechanism, preserving cloning fidelity while enabling compositional, editable control.
- ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
- 赛道归属: 语音生成 / TTS 数据集与数据构建(低资源语言、多说话人)
- 核心创新点: 提出面向多说话人TTS训练的超大规模波斯语开源语音-文本语料库ParsVoice,并给出可扩展的数据构建流水线:从长篇有声书录音中自动切分与对齐高质量语音-文本对,核心在于结合面向波斯语的句级语义/完整性建模(如微调的ParsBERT用于句子补全/筛选)与质量控制策略,以在低资源语言场景下系统性提升对齐准确性、覆盖度与可用性,从而降低多说话人TTS与语音语言建模的数据门槛。
- Track: Audio Generation / TTS dataset & data pipeline (low-resource, multi-speaker)
- Core innovation: Introduces ParsVoice, the largest publicly available Persian speech–text corpus designed for multi-speaker TTS, together with a scalable pipeline to derive high-quality paired data from long-form audiobooks. The key methodological contribution is an automated segmentation/alignment and quality-control workflow that leverages Persian-specific sentence-level modeling (e.g., a fine-tuned ParsBERT for sentence completion/filtering) to improve alignment reliability, coverage, and usability in low-resource settings.
- Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models
- 赛道归属: 文本到语音生成(TTS)/ 语音风格可控生成(Prompt-based Style Control)
- 核心创新点: 在现有“基于提示词的TTS”框架上,针对两类关键瓶颈提出方法级增强:①实现跨语句(inter-utterance)的细粒度风格属性连续可控与插值,使风格强度/属性可在不同句子间平滑调节而非离散切换;②实现单句内部(within-utterance)的时变风格控制,通过引入随时间变化的风格条件/调度机制,让模型不再只能施加全局单一风格,而能在同一句话中完成风格过渡与局部风格片段控制,从而扩展到需要“句内风格转场”的实际应用场景。
- Track: Text-to-Speech (TTS) / Controllable Speech Style Generation (Prompt-based Style Control)
- Core innovations: Proposes method-level extensions to existing prompt-based TTS to overcome two limitations: (1) enables fine-grained, continuous control and interpolation of style attributes across utterances (inter-utterance), allowing smooth adjustment of style intensity/attributes rather than coarse, discrete changes; (2) enables time-varying, within-utterance style control by introducing temporally scheduled/dynamic style conditioning, replacing a single global style per utterance with intra-utterance style transitions and localized style segment control—supporting practical scenarios requiring style changes inside one sentence.
- PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
- 赛道归属: 语音生成 / TTS 系统工程与训练配方(轻量化自回归合成)
- 核心创新点: 提出PilotTTS:以“纪律化的模块化配方”替代复杂多阶段大系统,通过极简自回归架构 + 严格的数据工程实现有竞争力的合成效果。方法论突破在于将性能提升的关键从模型堆叠转移到可复现的训练流程:全链路使用开源工具处理约20万小时数据,强调模块边界清晰、训练/数据清洗规范化与可移植的工程实践,使资源受限团队也能复现接近SOTA的TTS质量。
- Track: Audio Generation / TTS system recipe & training pipeline (lightweight autoregressive synthesis)
- Core innovation: Proposes PilotTTS, a competitive yet lightweight autoregressive TTS system achieved via a disciplined modular recipe rather than heavy multi-stage architectures. The methodological advance is a reproducible, open-source end-to-end training pipeline on ~200K hours that prioritizes rigorous data engineering, clear module interfaces, and standardized processing/cleaning—shifting gains from model complexity to repeatable system-building practices accessible to constrained teams.
- PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech
- 赛道归属: 语音生成 / TTS 评测与自动化筛查(低资源、非拉丁文字)
- 核心创新点: 针对低资源且使用非拉丁文字的TTS评测中“单一ASR回环WER”易失效的问题,提出INSV报告框架,将失败模式显式拆解为可懂度(Intelligibility)、自然度(Naturalness)、文字/脚本保真度(Script fidelity)与验证(Verification)。并给出INSV-A自动化筛查子集,用自动指标区分“无音频/说错语言/仅转写保留目标文本/听感不自然”等典型误判情形,从评测方法论上提升对低资源TTS系统的可诊断性与可比性。
- Track: Audio Generation / TTS evaluation & automated screening (low-resource, non-Latin scripts)
- Core innovation: Addresses the brittleness of single ASR round-trip WER for low-resource, non-Latin-script TTS evaluation by introducing the INSV framework, which disentangles outcomes into Intelligibility, Naturalness, Script fidelity, and Verification. It further provides INSV-A, an automated screening subset that can separate common failure cases (no audio, wrong language, script-only preservation in transcripts, unnatural speech), improving diagnostic power and comparability of evaluations in low-resource settings.
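To illustrate why a single round-trip WER number is hard to interpret, here is a minimal sketch of word error rate computed with word-level edit distance. It is a generic WER implementation, not the benchmark's own tooling, and the example strings are placeholders.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# An empty transcript (e.g. silent audio) and a fully wrong transcript
# receive the same score, so WER alone cannot tell these failures apart.
print(word_error_rate("a b c", ""))
print(word_error_rate("a b c", "x y z"))
```

Both calls produce the same value, which is the kind of ambiguity that a multi-axis report is meant to resolve.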
- CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS
- 赛道归属: 音频生成|语音编辑(Speech Editing)|零样本TTS|强化学习对齐(RL for editing consistency)
- 核心创新点: 面向“语音编辑”对局部一致性要求更高的特点,提出以语音编辑为目标的强化学习优化框架,用比SFT更细粒度、更贴近编辑任务的奖励信号,缓解配对编辑数据不完美带来的上限;核心突破在于将零样本TTS的生成能力通过RL进行“编辑对齐”,重点优化被编辑片段与上下文未编辑语音的声学连续性/一致性(如音色、韵律、能量、边界过渡),从而在零样本场景下获得更可靠的编辑质量与鲁棒性。
- Track: Audio Generation | Speech Editing | Zero-shot TTS | Reinforcement-learning alignment
- Core innovation: Proposes a speech-editing-oriented RL optimization that goes beyond supervised fine-tuning by leveraging task-aligned, fine-grained reward signals to overcome imperfect paired editing data. The key methodological advance is aligning a prompt-conditioned TTS model toward editing objectives—explicitly optimizing local acoustic consistency and seamless transitions between edited spans and surrounding untouched audio (timbre/prosody/energy/boundary continuity), yielding stronger zero-shot editing performance and robustness.
- Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control
- 赛道归属: 音频生成|情感TTS|可控生成(细粒度非言语表达控制)
- 核心创新点: 将情感表达从“仅控制言语韵律”扩展到更真实的人类情绪关键成分——非言语发声(Non-verbal vocalizations, NVs)(如笑、叹气、抽泣、哼声等),并提出细粒度NV控制的合成方法;通过构建/整理带有更高质量、细粒度标注的NV相关数据与控制标签,使模型能够在文本与情感条件之外,进一步在时间位置、类型、强度等维度对NV进行可控生成,从方法论上提升情感TTS的自然度与可表达性上限。
- Track: Audio Generation | Emotional TTS | Controllable generation (fine-grained non-verbal expression control)
- Core innovation: Extends emotional TTS beyond verbal prosody control by explicitly modeling non-verbal vocalizations (NVs) (e.g., laughter, sighs, sobs, hums) and introducing a fine-grained NV control synthesis approach. By curating/leveraging higher-quality, fine-grained NV annotations and control labels, the method enables controllable NV generation along dimensions such as placement, type, and intensity, raising the ceiling of naturalness and expressiveness in emotional speech synthesis.
- Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track
- 赛道归属: 文本到语音(TTS)/ 语音反欺骗鲁棒生成(in-the-wild TTS)
- 核心创新点: 在F5-TTS架构上提出F5-TTS-DPS,将EMA(指数滑动平均)引入监督微调以稳定训练轨迹、降低野外数据分布噪声带来的过拟合,从而提升跨场景泛化与鲁棒性;同时提出双评分提示词/参考音频选择(Dual-Scoring Prompt Selection),利用LLM与音频大模型(LALM)对候选参考/提示进行双路质量评估与过滤,在不改动主干生成机制的前提下提高合成保真度与自然度,并面向“更自然且更难检测”的对抗性目标优化生成质量。
- Track: Text-to-Speech (TTS) / Robust anti-spoof-oriented in-the-wild speech generation
- Core innovations: Proposes F5-TTS-DPS built on F5-TTS, introducing Exponential Moving Average (EMA) into supervised fine-tuning to stabilize optimization and improve generalization under noisy, in-the-wild data distributions; further adds Dual-Scoring Prompt/Reference Selection, where an LLM and an audio language model (LALM) jointly score and filter candidate prompts/references to boost synthesis fidelity and naturalness without changing the core generator, explicitly targeting more natural yet harder-to-detect outputs.
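The EMA component mentioned above can be sketched generically as follows. This is a plain exponential moving average over named scalar parameters, assuming the usual formulation; how F5-TTS-DPS applies it during fine-tuning (decay value, update schedule) is not stated in the summary, so the `decay` default and the dictionary layout are illustrative assumptions.

```python
def ema_update(ema_params, params, decay=0.999):
    """Return new EMA weights: decay * old_ema + (1 - decay) * current."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * value
        for name, value in params.items()
    }


# Hypothetical single-parameter model: the EMA copy moves slowly toward
# the current weights, smoothing out noisy training steps.
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0}, decay=0.9)
print(ema)
```

In practice the averaged copy, rather than the raw weights, is typically the one used at inference time.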
- Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
- 赛道归属: 文本到语音(TTS)/开源数据与可复现训练
- 核心创新点: 提出“模型+数据”一体化的开源TTS方案:一方面发布可与闭源数据SOTA竞争的Raon-OpenTTS模型;另一方面构建超大规模可复现训练数据池Raon-OpenTTS-Pool(由公开英文语音聚合而成,规模达数十万小时级),系统性验证“大规模开放数据”在提升TTS鲁棒性与质量中的关键作用,并降低TTS研究对私有数据的依赖、提升可复现实验基线的可获得性。
- Track: Text-to-Speech (TTS) / Open data & reproducible training
- Core innovation: Delivers an integrated open “model + data” TTS stack: (1) Raon-OpenTTS, an open TTS model competitive with closed-data SOTA systems; (2) Raon-OpenTTS-Pool, a very large-scale open dataset aggregated from public English speech, enabling reproducible TTS training at scale. The work isolates and substantiates the impact of large open data on robustness and quality, reducing reliance on proprietary corpora and improving reproducibility of TTS benchmarks.
- RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models
- 赛道归属: 音频生成|零样本TTS/声音克隆|评测基准(鲁棒性Benchmark)
- 核心创新点: 提出面向现代声音克隆系统的鲁棒性评测基准,系统覆盖真实应用中的关键扰动源:参考音频噪声与失真、文本提示不完备/错误、多语种与长音频生成、后处理链路影响、对抗扰动等;方法论贡献在于将“相似度/自然度”之外的维度标准化为可复现的测试协议与指标集合,用于对比不同范式模型(如基于codec token的语言模型等)在复杂条件下的稳定性与退化模式,从而推动可部署级语音克隆的可靠性研究。
- Track: Audio Generation | Zero-shot TTS / Voice Cloning | Benchmarking (robustness evaluation)
- Core innovation: Introduces a robustness-focused benchmark for modern voice cloning systems, covering practical stressors such as noisy/degraded reference audio, imperfect text prompts, multilingual and long-form generation, post-processing effects, and adversarial perturbations. Methodologically, it standardizes reproducible protocols and metrics beyond similarity/naturalness to characterize stability and failure modes across model families (e.g., codec-token LMs), enabling deployment-oriented robustness comparisons and guiding more reliable voice cloning research.
GitHub
- [2026-05-30] huggingface/diffusers ⭐33728
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-05-26] Stability-AI/stable-audio-tools ⭐3757
Generative models for conditional audio generation
- [2026-05-29] BinWang28/audio-ai-hub ⭐924
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
- [2026-05-30] apocas/restai ⭐508
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLM supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-05-27] xiquan-li/Awesome-Audio-Generation ⭐73
Curated list for papers, codes and resources related to Text-to-Audio (TTA) Generation
HuggingFace Models
语言大模型 / Large Language Models
arXiv
- EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
- 赛道归属: 对齐评测(道德/价值观对齐、LLM-as-Judge评估)
- 核心创新点:
- 提出一个可解释的链式思维(CoT)评测框架EvalMORAAL,将道德对齐评估从“黑箱打分”推进到“可追溯推理+量化评分”的透明流程。
- 设计双评分通道:基于模型输出的log-probabilities与直接主观评分(direct ratings)并行,降低单一指标偏置并增强稳健性。
- 引入LLM-as-Judge 的同侪评审(peer review)机制,用于对模型推理与结论进行二次审阅,提升评测一致性与可审计性。
- 将评测基准落到跨文化大规模社会调查数据(WVS/PEW,多国家多议题),把“道德对齐”具体化为与真实人群价值分布的相关性度量(如Pearson相关),形成更具外部效度的对齐评估范式。
- Track: Alignment Evaluation (Moral/Value Alignment, LLM-as-Judge Evaluation)
- Core Innovations:
- Proposes EvalMORAAL, an interpretable Chain-of-Thought (CoT) evaluation framework that makes moral-alignment assessment transparent and traceable rather than purely black-box scoring.
- Introduces two complementary scoring channels—log-probability-based scoring and direct rating—to reduce reliance on a single metric and improve robustness.
- Adds an LLM-as-Judge peer-review layer to re-evaluate and audit the model’s reasoning and final judgments, improving consistency and inspectability.
- Grounds evaluation in large-scale cross-cultural survey datasets (WVS/PEW across countries and topics), operationalizing moral alignment as correlation with real population value distributions, increasing external validity.
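The last point, scoring alignment as a correlation with survey values, can be illustrated with a minimal sketch. The topic scores below are made-up placeholders, not WVS/PEW data, and `pearson` is a helper written for this illustration rather than code from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-topic acceptability scores in [-1, 1] for one country.
survey_scores = [0.8, -0.4, 0.1, -0.9, 0.5]  # placeholder survey means
model_scores = [0.7, -0.2, 0.0, -0.8, 0.6]   # placeholder model ratings

r = pearson(survey_scores, model_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 would mean the model ranks topics much like the surveyed population does; it says nothing about whether the absolute scores match.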
- CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models
- 赛道归属: 推理优化(隐式CoT/潜空间推理、推理token化)
- 核心创新点: 提出CIRF,将传统“链式思维”从自然语言解释转为可复用的离散功能token序列来执行隐式推理:把推理过程模块化为功能单元并在推理时动态编排,以适配不同样例复杂度;同时强调与显式CoT的对齐,使隐式推理在降低推理开销的同时尽量保持可解释推理轨迹的一致性与可控性。
- Track: Reasoning optimization (implicit CoT / latent reasoning, tokenized reasoning)
- Core innovations: CIRF converts natural-language chain-of-thought into a sequence of reusable discrete functional tokens for implicit reasoning. It dynamically composes these functional units at inference time to match instance complexity, aiming to reduce inference cost while improving alignment with explicit CoT so latent reasoning remains consistent and controllable.
- Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
- 赛道归属: LLM辅助编译优化 / 张量程序优化数据集(程序优化 + 推理链监督)
- 核心创新点: 提出Step-TP,一个“可落地(grounded)到具体变换”的逐步级(step-level)数据集,用于将张量程序优化建模为可组合的序列决策过程;相较仅提供端到端优化前后程序对的既有数据,Step-TP提供可验证的中间变换步骤与对应的Chain-of-Thought推理监督,使每一步优化决策具备可解释性与可检查性,并避免token低效的表示方式,从而更适配LLM在迭代优化中的训练与评测(如逐步决策正确性、可组合性与可回放验证)。
- Track: LLM-guided compiler optimization / tensor program optimization dataset (program optimization + CoT supervision)
- Core innovation: Introduces Step-TP, a grounded step-level dataset that maps tensor program optimization to a composable sequential decision process. Unlike prior datasets that only provide end-to-end before/after optimized program pairs with token-inefficient representations, Step-TP supplies verifiable intermediate transformation steps together with Chain-of-Thought supervision, enabling interpretable and checkable optimization decisions at each step and better supporting LLM training/evaluation for iterative optimization (e.g., step correctness, composability, and replayable verification).
- MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning
- 赛道归属: 多模态理解(语音/音频大模型适配与低资源学习、In-Context Learning)
- 核心创新点: 提出一种面向听觉LLM的元学习式语音上下文学习框架(Meta Speech In-Context Learning),将“推理时用少量示例做ICL适配”作为核心适配机制,用元学习在训练阶段显式优化模型对示例集合的利用方式,从而在标注稀缺或训练-测试分布不匹配时,相比直接微调更稳健地实现快速域内适配与性能提升;强调训练免/轻训练的推理期自适应,降低低资源任务的适配成本并缓解微调脆弱性。
- Track: Multimodal Understanding (speech/audio LLM adaptation for low-resource settings, In-Context Learning)
- Core innovation: Proposes a meta-learning-based speech in-context learning framework (Meta Speech In-Context Learning) for auditory LLMs, treating inference-time adaptation via a few in-domain demonstrations as the primary adaptation mechanism. By meta-optimizing how the model leverages demonstration sets during training, it enables more robust and rapid in-domain adaptation under scarce labels or train–test distribution mismatch, mitigating the brittleness of direct fine-tuning while keeping adaptation largely training-free/lightweight at inference time.
- Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models
- 赛道归属: 多模态大模型训练与OCR增强(多语言文本理解/视觉文本推理)
- 核心创新点: 提出面向真实场景视觉文本的多语言OCR增强训练框架:结合(1)大规模合成“OCR→翻译/理解”数据生成以覆盖复杂版式与噪声,(2)基于LoRA的OCR-aware监督微调以低成本注入视觉文本能力,(3)结构化的视觉提示与提示引导CoT推理以提升跨语言读图与文本推理的可控性与鲁棒性,系统性缓解MLLM在小字、遮挡、模糊与复杂字体上的失效。
- Track: Multimodal LLM training with OCR enhancement (multilingual visual-text understanding & reasoning)
- Core innovation: Presents a multilingual OCR-aware training pipeline combining (i) large-scale synthetic OCR-to-translation/understanding data generation for noisy real-world layouts, (ii) OCR-aware SFT with LoRA for efficient capability injection, and (iii) structured visual prompting plus prompt-guided CoT to improve controllability and robustness of multilingual visual-text reading and reasoning under clutter, blur, occlusion, and complex typography.
- River-LLM: Large Language Model Seamless Exit Based on KV Share
- 赛道归属: LLM推理加速 / 早退推理(Early Exit)与KV Cache机制优化
- 核心创新点: 提出River-LLM,通过“KV Share(跨层KV共享)”实现decoder-only大模型的无缝早退(seamless exit),针对早退在decoder架构中被“KV Cache缺失(跳过层无法产出后续token所需历史状态)”卡住的关键瓶颈;其方法核心是在允许跳层的同时,仍为后续解码提供一致、可用的KV缓存供给,从而把早退从“理论可跳层”推进到“工程可落地的端到端加速”,在不破坏自回归解码依赖的前提下降低推理时延。
- Track: LLM inference acceleration / Early-exit decoding with KV-cache mechanism optimization
- Core innovation: Proposes River-LLM, enabling seamless early exit in decoder-only LLMs via KV Share (cross-layer KV sharing). It targets the main bottleneck of early exit in decoder architectures—the KV Cache Absence problem, where skipped layers fail to produce the historical states required for subsequent tokens. By maintaining a consistent, usable KV supply even when layers are bypassed, it turns early-exit from a conceptual layer-skipping idea into an end-to-end deployable speedup without breaking autoregressive decoding dependencies, reducing inference latency.
- GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning
- 赛道归属: 图基础模型(Graph Foundation Models)/ 图领域 In-Context Learning(ICL)/ 跨图泛化
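- 核心创新点: 提出一种不依赖LLM、无需微调(LLM-Free & Tuning-Free)的图基础模型框架,用于在极端异构图场景下实现类ICL的快速适配与跨图泛化。其方法论突破在于:针对不同图之间特征空间、标签集合与拓扑结构不一致带来的“任务/空间不对齐”问题,通过构建与具体图域无关的统一表示与对齐机制,使模型能够在不进行参数更新的前提下,仅依靠上下文示例完成对新图/新任务的推断与迁移,从而绕开现有GFM依赖文本化/LLM中介或需要额外调参的限制。
- Track: Graph Foundation Models (GFMs) / In-Context Learning (ICL) on graphs / cross-graph generalization
- Core innovation: Proposes an LLM-free, tuning-free graph foundation model framework that achieves ICL-style rapid adaptation and cross-graph generalization on highly heterogeneous graphs. To handle the task/space misalignment caused by differing feature spaces, label sets, and topologies across graphs, it builds domain-agnostic unified representations and an alignment mechanism, so the model can infer on new graphs and tasks from in-context examples alone, without parameter updates. This avoids the reliance of existing GFMs on text conversion or an LLM intermediary, and the need for extra tuning.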
- BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data
- 赛道归属: 后训练数据工程(CoT数据合成/标注流程设计)
- 核心创新点: 提出BC Protocol,用结构化的双专家对话来生成高质量CoT后训练数据:通过“专家-对抗/校验专家”式的分工与对话约束,系统性暴露并补全单专家写作中常见的“专家盲区”(跳步、默认常识),从流程层面提升推理链的完整性、可读性与可用于训练的稳定格式,相比偏好信号或众包标注更能产出深推理轨迹。
- Track: Post-training data engineering (CoT data synthesis / annotation protocol)
- Core innovations: BC Protocol introduces a structured dual-expert dialogue pipeline to elicit high-quality CoT data. By pairing an expert with a second expert focused on challenge/verification under explicit dialogue constraints, it mitigates the “expert blind spot” (skipped steps, implicit assumptions), producing more complete, consistent, training-ready reasoning traces than crowdsourcing or preference-only RLHF signals.
- Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization
- 赛道归属: 对齐与可解释性评测(CoT忠实性、偏好对齐优化)
- 核心创新点: 针对CoT忠实性的两类评测范式(上下文忠实性与参数忠实性)长期割裂的问题,提出FaithMate作为统一的偏好对齐接口,可在同一优化框架下分别/共同推动模型在两种忠实性目标上的改进;并系统研究在优化过程中两者的相互作用与潜在权衡,为“优化后CoT是否更真实反映模型行为”提供可操作的训练与比较基准。
- Track: Alignment & interpretability evaluation (CoT faithfulness, preference-based optimization)
- Core innovations: FaithMate provides a unified preference-alignment interface to optimize and compare two previously separated notions of CoT faithfulness: contextual (via input/trace perturbations) and parametric (via interventions on model knowledge). It enables joint/isolated optimization and studies their interaction and trade-offs under training, offering an actionable framework to assess whether optimized CoTs better reflect underlying model behavior.
- Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection
- 赛道归属: 安全对齐(越狱防护/安全分类器增强、对抗鲁棒性)
- 核心创新点: 提出Reflect-Guard,通过参数高效微调为安全分类器引入逻辑自反思式CoT推理:将强模型(如GPT-4o级别)的分析推理蒸馏到Guard模型,使其在面对角色扮演、虚构包装、间接请求等“意图伪装”越狱提示时,能先进行结构化推断与自检再判定风险,从而提升对抗提示下的识别鲁棒性,而非仅依赖表面关键词或模式匹配。
- Track: Safety alignment (jailbreak defense / safety classifier robustness)
- Core innovations: Reflect-Guard enhances LLM-based safety classifiers with logical self-reflection CoT reasoning via parameter-efficient fine-tuning. By distilling analytical reasoning from a stronger model, the classifier learns to infer and self-check hidden malicious intent in adversarial prompts (role-play, fictional framing, indirect requests), improving robustness beyond surface-pattern or keyword-based detection.
GitHub
- [2026-05-31] sgl-project/sglang ⭐28478
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-05-30] NVIDIA/TensorRT-LLM ⭐13769
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-05-31] lone-cloud/gerbil ⭐462
A desktop app for running Large Language Models locally.
- [2026-05-30] basellm/llm-metadata ⭐115
A lightweight interface for accessing and integrating LLM metadata, enabling applications to seamlessly discover, query, and integrate large language ...
- [2026-05-31] gpt-cmdr/ras-commander ⭐62
The RAS-Commander library provides a python API for automating HEC-RAS 6.x and accessing HDF data using Python, built with and driven by large languag...
HuggingFace Models
HuggingFace Datasets
- [2026-05-28] openbmb/UltraData-SFT-2605
UltraData-SFT-2605
📦 UltraData Collection | 🌐 UltraData | 🤗 MiniCPM5 Series
English | 中文
📚 Introduction
Ult...
- [2026-05-28] openbmb/Ultra-FineWeb-L3
Ultra-FineWeb-L3
📜 Ultra-FineWeb Technical Report | 📦 UltraData Collection | 🌐 UltraData | 🤗 MiniCPM5 Series
English | 中文
...
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2026-05-19] Jackrong/Claude-opus-4.6-TraceInversion-9000x
🌀 Claude-opus-4.6-TraceInversion-9000x v1.0 Release
A High-Fidelity Reconstructed CoT Dataset via Trace Inversion 📊 ...
- [2025-07-11] HuggingFaceFW/fineweb
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of m...
多模态大模型 / Multimodal Models
arXiv
- RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
- 赛道归属: 多模态理解(事件相机+视觉语言模型/鲁棒场景理解)
- 核心创新点: 提出首个事件流-图像双流VLM(RE-VLM),将事件相机的高时间分辨率与高动态范围信息作为与RGB互补的输入,通过跨模态对齐与融合机制把事件的运动/边缘变化线索注入语言推理,从而显著提升低照度、HDR、快速运动等恶劣条件下的场景理解与描述鲁棒性。
- Track: Multimodal Understanding (Event-camera + VLM / Robust Scene Understanding)
- Core innovations: Introduces the first dual-stream event-frame VLM (RE-VLM) that leverages event streams as a complementary modality to RGB. Via cross-modal alignment and fusion, it injects motion- and change-centric cues from events into language reasoning, improving robustness for scene understanding/captioning under low light, HDR, and fast-motion conditions.
- SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
- 赛道归属: 多模态理解(音频-视频时序理解评测/Benchmark)
- 核心创新点: 提出SONIC-O1作为面向真实世界音频-视频理解的系统性评测基准:以长时序、多领域对话场景为核心覆盖(60小时、231段、13个真实会话域),并采用全人工核验的数据与标注流程,旨在弥补现有评测偏静态图像、缺少对“音视频联合+时序推理”能力刻画的空白,从而更可靠地区分MLLM在真实音视频理解中的能力边界与失效模式。
- Track: Multimodal Understanding (Audio-Video Temporal Understanding Benchmark)
- Key Innovations: Introduces SONIC-O1, a real-world benchmark for systematic evaluation of MLLMs on sequential audio-video understanding. It emphasizes long-form temporal, multi-domain conversational scenarios (60 hours, 231 clips, 13 domains) with fully human-verified data/annotations, addressing the gap of prior benchmarks that over-focus on static images and under-measure joint audio-video temporal reasoning, enabling clearer diagnosis of capability limits and failure modes.
- Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
- 赛道归属: 多模态安全与鲁棒性(VLM 对抗攻击)
- 核心创新点: 提出一种面向视觉-语言模型的跨模态协同对抗框架,将纹理约束的图像扰动与跨模态联合优化结合:在视觉侧通过受限于纹理/局部统计特性的扰动提升隐蔽性与可迁移性,在语言侧通过与视觉扰动协同的目标设计/优化放大误导效应,从而在无需不现实的强白盒假设下实现更强的多模态攻击,系统性揭示 LVLM 在“多模态联动”攻击面前的脆弱性。
- Track: Multimodal Security & Robustness (Adversarial Attacks on VLMs)
- Key innovation: Proposes a cross-modal synergistic adversarial framework that couples texture-constrained image perturbations with cross-modal joint optimization. The visual perturbation is constrained by texture/local statistics to remain stealthy while improving transferability, and the language-side objective is co-optimized to amplify misalignment, enabling stronger multimodal attacks without relying on impractical strong white-box access and exposing LVLM fragility under coordinated multimodal threats.
- Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
- 赛道归属: 多模态理解(人类注视/社会注视预测评测基准)
- 核心创新点: 构建并系统评测VLM在“注视跟随(gaze following)”与“社会注视预测(social gaze prediction)”上的能力边界,强调该任务需要同时理解几何/物理场景与交互语境;通过基准化任务设定与指标,揭示现有VLM在注视相关推理中的可靠性缺口与典型失败模式,为后续面向注意力与行为理解的训练/对齐提供可复现的评测框架。
- Track: Multimodal Understanding (Human Gaze & Social Attention Benchmarking)
- Core innovations: Establishes a benchmark and systematic evaluation protocol for VLMs on gaze following and social gaze prediction, tasks requiring joint reasoning over physical scene geometry and social/interaction context. The work standardizes settings and metrics, surfaces reliability gaps and common failure modes in current VLMs, and provides a reproducible evaluation framework to guide future training/alignment for attention and behavior understanding.
- Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
- 赛道归属: 多模态理解(视觉-语言OCR/视觉定位鲁棒性与失效分析)
- 核心创新点: 针对VLM在古希腊文献OCR中的“看图不读字、依赖语言先验猜测”问题,系统对比开源权重VLM与传统OCR引擎在低资源古希腊校勘本上的表现,揭示VLM即使识别错误也常生成流畅且貌似合理、但缺乏视觉证据支撑的文本替换;并进一步从“视觉证据/视觉定位”角度分析模型在解码过程中对图像信息的依赖不足,形成可复现的失效模式刻画与诊断框架,为改进VLM的视觉扎根(visual grounding)与OCR可信度提供依据。
- Track: Multimodal Understanding (Vision-Language OCR; Robustness & Failure Analysis in Visual Grounding)
- Key Innovations: Studies the “reading vs. guessing” failure mode of VLM-based OCR on low-resource Ancient Greek critical editions. By comparing open-weight VLMs with classical OCR baselines, it shows VLM outputs can remain fluent yet visually unsupported—substituting plausible Greek text driven by language priors rather than image evidence. It further analyzes insufficient visual grounding/visual evidence usage during decoding, providing a reproducible diagnostic characterization of grounding failures to guide more trustworthy VLM-OCR improvements.
- ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
- 赛道归属: 文生图(封面图生成/个性化生成与偏好对齐)
- 核心创新点: 提出ICG框架,将MLLM用于“高质量提示词生成/重写”与扩散模型的图像生成解耦协同:先从标题、内容等输入中抽取语义要素并由MLLM生成更具可控性与表达力的生成提示,再引入个性化偏好对齐机制(面向用户/平台风格与点击偏好)对生成结果进行定制化优化,从而在“语义相关性+审美/偏好一致性”上同时提升封面图质量,补足现有AIGC封面生成缺少个性化与偏好建模的不足。
- Track: Text-to-Image (Cover Image Generation; Personalization & Preference Alignment)
- Key Innovations: Proposes ICG, a framework that couples MLLM-based prompting with diffusion-based generation for personalized cover images. It uses an MLLM to extract semantic attributes from inputs (e.g., titles/content) and produce more controllable, expressive prompts, then applies a personalized preference-alignment mechanism (user/platform style and engagement preferences) to steer outputs. This decoupled “prompt intelligence + preference alignment” design improves both contextual relevance and aesthetic/preference consistency, addressing the underexplored personalization gap in cover generation.
- EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
- 赛道归属: 多模态文档理解(流程图解析/结构化信息抽取)
- 核心创新点: 提出 EdgeFlow:在 VLM 输入端引入确定性提取的 Canny 边缘图作为结构先验,与原图共同输入以强化拓扑与连通关系建模,针对流程图中“节点-连线-箭头”这类对拓扑敏感的细节显著降低漏检/错连;核心突破在于用轻量、可解释的视觉结构增强替代复杂的端到端重训,使通用 VLM 更可靠地完成从静态流程图到机器可读模型的转换。
- Track: Multimodal Document Understanding (Flowchart Parsing / Structured Extraction)
- Key innovation: Introduces EdgeFlow, which augments VLM inputs with a deterministically extracted Canny edge map as a structural prior. Feeding edge maps alongside the original image strengthens modeling of topology/connectivity (nodes, links, arrows), reducing topology-critical errors. The key methodological step is a lightweight, interpretable structural augmentation that boosts generic VLM flowchart-to-model conversion without heavy end-to-end retraining.
- Self-Ensembling Vision-Language Models for Chart Data Extraction
- 赛道归属: 多模态文档理解(图表到表格数据抽取)
- 核心创新点: 提出面向图表数据抽取的 VLM 自集成(self-ensembling)策略:对同一图表进行多次采样生成多个候选表格输出,并通过一致性/聚合机制得到更稳健的最终表格,缓解单次生成在“数据点密集、样式多变”场景下的随机性与局部错误;方法论突破在于不依赖额外标注或专用模型结构,通过推理阶段的多样化采样与集成提升准确率与稳定性。
- Track: Multimodal Document Understanding (Chart-to-Table Data Extraction)
- Key innovation: Proposes VLM self-ensembling for chart data extraction: repeatedly samples multiple candidate tables from the same chart and aggregates them via consistency/merging to produce a more reliable final table. This mitigates stochastic and local generation errors on dense datapoints and diverse styles. The methodological contribution is improving accuracy and stability via inference-time sampling-and-ensemble, without extra annotations or specialized architectures.
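The sample-then-aggregate idea can be sketched in a few lines. The digest does not say which aggregation rule the paper uses, so the per-cell median and the `min_support` threshold below are assumptions chosen for illustration, and the candidate tables are hypothetical stand-ins for repeated VLM outputs.

```python
from statistics import median

def aggregate_tables(candidates, min_support=0.5):
    """Merge candidate tables by taking the per-cell median.

    candidates: list of dicts mapping (row_label, column_label) -> float.
    A cell is kept only if at least `min_support` of the candidates contain it.
    """
    if not candidates:
        raise ValueError("need at least one candidate table")
    cells = {}
    for table in candidates:
        for key, value in table.items():
            cells.setdefault(key, []).append(value)
    needed = min_support * len(candidates)
    return {key: median(vals) for key, vals in cells.items() if len(vals) >= needed}

# Three hypothetical samples extracted from the same chart.
samples = [
    {("2021", "sales"): 10.0, ("2022", "sales"): 12.5},
    {("2021", "sales"): 10.0, ("2022", "sales"): 21.5},  # digit-swap outlier
    {("2021", "sales"): 10.2, ("2022", "sales"): 12.4, ("2023", "sales"): 3.0},
]
merged = aggregate_tables(samples)
print(merged)
```

The median suppresses the single outlier, and the support threshold drops the cell that only one sample produced.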
- On the Robustness of Machine Unlearning for Vision-Language Models
- 赛道归属: 多模态安全与隐私(VLM 机器遗忘/反遗忘鲁棒性评测)
- 核心创新点: 首次对 VLM 机器遗忘进行系统化综述与鲁棒性分析:构建方法分类体系与统一评测协议(含多种提示设置),并提出三类攻击范式检验“已遗忘知识”是否可被跨提示/跨模态重新激活,从而把 unlearning 从“遗忘效果”扩展到“对再唤醒攻击的鲁棒性”维度;核心价值在于明确 VLM unlearning 的威胁模型与评测基准,揭示现有方法在可恢复性上的薄弱环节。
- Track: Multimodal Security & Privacy (Machine Unlearning Robustness for VLMs)
- Key innovation: Provides the first systematic survey and robustness analysis of VLM unlearning, including a taxonomy and unified evaluation protocol across multiple prompting settings. It further introduces three attack paradigms to test whether “forgotten” multimodal knowledge can be reactivated via cross-prompt/cross-modal cues, extending evaluation from forgetting efficacy to robustness against recovery attacks. The key value is clarifying threat models/benchmarks and exposing recoverability weaknesses in existing unlearning methods.
- Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
- 赛道归属: 具身智能与多模态代理(长期交互个性化/记忆增强)
- 核心创新点: 提出 POLAR:面向具身 MLLM 代理的长期个性化记忆增强框架,使代理能从跨时间的用户交互中积累偏好与隐式指代线索,并在后续任务中进行检索与利用,从而解决真实场景中“目标未显式说明、需依赖历史上下文”的个性化指令理解问题;方法突破在于将多模态交互经验结构化为可检索的长期记忆,并与具身决策/执行闭环融合,实现持续适应而非一次性定制。
- Track: Embodied AI & Multimodal Agents (Long-term Personalization / Memory-Augmented MLLMs)
- Key innovation: Proposes POLAR, a long-term personalization framework for embodied MLLM agents with multimodal memory augmentation. The agent accumulates user preferences and implicit reference cues over extended interactions, retrieves them when needed, and uses them to resolve underspecified goals in future tasks. The methodological advance is structuring multimodal interaction history into retrievable long-term memory and integrating it into the perception–decision–action loop for continual adaptation rather than one-shot personalization.
GitHub
- [2026-05-31] Blaizzy/mlx-vlm ⭐4791
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-05-30] NVlabs/Eagle ⭐1564
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
- [2026-05-30] waybarrios/vllm-mlx ⭐1270
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP to...
- [2026-05-29] zhengli97/Awesome-Prompt-Adapter-Learning-for-VLMs-CLIP ⭐778
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
- [2026-05-30] jamjamjon/usls ⭐407 🆕NEW
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models such as YOLO, FastVLM, and more.
HuggingFace Models
强化学习 / Reinforcement Learning
arXiv
- ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
- 赛道归属: 强化学习基准与仿真环境(MuJoCo连续控制/游戏风格NPC运动控制)
- 核心创新点: 提出ARC-RL作为面向“游戏NPC风格约束”的连续控制基准套件:在MuJoCo中构建4个受游戏生物图鉴启发的非真实机器人形态环境,刻意脱离传统仿真到真实(sim-to-real)常见的商业硬件形态假设,从而把研究焦点从“真实可落地的腿式机器人”扩展到“无现实对应形态但有强风格/表现约束的角色运动”。其方法论价值在于用系统化的形态多样性与任务设定,推动奖励设计、控制策略与泛化能力在“非工程硬件先验”下的评测与对比。
- Track: RL benchmarks & simulation environments (MuJoCo continuous control / game-style NPC locomotion)
- Core innovation: Introduces ARC-RL, a MuJoCo suite of four continuous-control environments with creature-like morphologies inspired by game bestiaries, explicitly moving beyond real-robot hardware-derived bodies. Methodologically, it reframes locomotion RL evaluation toward stylistic/character constraints typical for game NPCs, enabling more meaningful study of reward design, control, and generalization under morphology diversity without sim-to-real hardware priors.
- Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
- 赛道归属: 鲁棒强化学习 / 非可实现环境下的安全RL(对抗/策略依赖环境建模)
- 核心创新点: 提出并实证验证“Infra-Bayesian(下层贝叶斯)”RL智能体,用一种比经典贝叶斯/频率派RL更保守的信念更新与决策准则来应对模型失配(misspecification)与环境对策略的反应(policy-dependent / 预判型对手)。方法上关键在于:不再假设存在真实环境落在模型类中,而是以更弱的可实现性前提构造可学习的决策规则,使策略在最坏情形下具有更强鲁棒性(worst-case robustness),从而在涉及人类/预测器/其他智能体的安全场景中优于经典RL的脆弱性表现。
- Track: Robust RL / Safety RL under non-realizable, policy-dependent environments (adversarial/strategic settings)
- Core innovation: Introduces and empirically validates Infra-Bayesian RL agents that replace classical Bayesian/frequentist assumptions with a more conservative belief-update and decision criterion tailored to misspecification and policy-dependent (anticipatory) environments. The key methodological shift is to drop the realizability assumption (true environment in the model class) and design a learnable decision rule with stronger worst-case robustness, yielding improved performance in safety-relevant settings involving humans, predictors, other agents, or institutions.
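The contrast between averaging over a belief and guarding against the worst case can be shown with a one-step toy decision. This is only a loose illustration of a worst-case (maximin) criterion, not the paper's agent: the candidate models, the prior, and both helper functions are hypothetical.

```python
def bayes_action(models, weights):
    """Pick the action with the highest prior-weighted expected reward."""
    actions = models[0].keys()
    return max(actions, key=lambda a: sum(w * m[a] for m, w in zip(models, weights)))

def maximin_action(models):
    """Pick the action whose worst-case expected reward over all models is highest."""
    actions = models[0].keys()
    return max(actions, key=lambda a: min(m[a] for m in models))

# Each candidate model maps an action to its expected reward (made-up numbers).
candidate_models = [
    {"risky": 10.0, "safe": 4.0},
    {"risky": 9.0, "safe": 4.0},
    {"risky": -20.0, "safe": 3.0},  # an unlikely but very bad environment
]
prior = [0.45, 0.45, 0.10]

print("Bayesian choice:", bayes_action(candidate_models, prior))
print("Maximin choice:", maximin_action(candidate_models))
```

With these numbers the prior-weighted average favors the risky action, while the maximin rule picks the safe one because it never does worse than 3.0 under any candidate model.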
- ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
- 赛道归属: 多模态理解(图像长文本描述对齐 / 细粒度奖励建模的强化学习)
- 核心创新点: 提出以“视觉主张(visual claims)”为单位的细粒度强化学习框架:不再用整段caption的单一标量奖励,而是将描述拆解为可对齐到图像证据的原子主张,并通过“主张级视觉对比/验证”来产生更密集、更可归因的训练信号;从而显式区分并优化“事实性(减少幻觉)”与“信息覆盖(不遗漏细节)”之间的权衡,缓解长文本caption中序列级奖励过度压缩导致的信用分配与训练不稳定问题。
- Track: Multimodal Understanding (long-form image caption alignment / fine-grained reward modeling in RL)
- Core innovation: Introduces a visual-claim–level RL framework: instead of a single sequence-level scalar reward for an entire caption, it decomposes captions into atomic, image-groundable visual claims and generates denser, attributable learning signals via claim-level visual comparison/verification. This makes the trade-off between factuality (reducing hallucinations) and coverage (capturing salient details) explicitly optimizable, mitigating reward granularity and credit-assignment issues in long-form caption RL.
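A minimal sketch of a claim-level reward, under stated assumptions: claim extraction and visual verification would be done by models in practice, but here they are passed in as plain inputs. The function name, the `alpha` weighting, and the +1/-1 per-claim signal are illustrative choices, not details from the paper.

```python
def claim_level_reward(caption_claims, verified, reference_claims, alpha=0.5):
    """Combine factuality and coverage into one scalar plus per-claim signals.

    caption_claims: list of atomic claims extracted from the generated caption.
    verified: set of caption claims judged to be supported by the image.
    reference_claims: set of claims a complete caption should contain.
    alpha: weight on factuality; (1 - alpha) goes to coverage.
    """
    if not caption_claims:
        return 0.0, []
    # Dense, attributable signal: one value per claim instead of one per caption.
    per_claim = [1.0 if c in verified else -1.0 for c in caption_claims]
    factuality = sum(1 for c in caption_claims if c in verified) / len(caption_claims)
    covered = reference_claims & set(caption_claims) & verified
    coverage = len(covered) / len(reference_claims) if reference_claims else 0.0
    return alpha * factuality + (1 - alpha) * coverage, per_claim

claims = ["a dog", "red ball", "on grass", "a cat"]   # "a cat" is hallucinated
supported = {"a dog", "red ball", "on grass"}
reference = {"a dog", "red ball", "on grass", "blue sky"}  # "blue sky" is missed

reward, signals = claim_level_reward(claims, supported, reference)
print(reward, signals)
```

Here the hallucinated claim lowers factuality and the missed detail lowers coverage, so the two failure types stay separately visible instead of being folded into one caption-level score.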
- FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning
- 赛道归属: 推理对齐(大模型RLHF/RLVR训练算法优化:在线采样-更新闭环)
- 核心创新点: 提出“反馈驱动的双目标协同强化学习”训练范式:针对GRPO类方法中“采样阶段产生的高质量rollout决定了更新方向但梯度目标并不显式”的痛点,引入由反馈信号驱动的双目标优化机制,将(1)提升rollout质量/可用性与(2)稳定有效的策略更新方向联合建模,并通过协同约束/耦合更新减少采样噪声与梯度方向漂移;本质上是在rollout生成与参数更新之间建立更强的闭环,使训练更稳、更高效地对齐推理能力。
- Track: Reasoning Alignment (LLM RLHF/RLVR training algorithm optimization: online rollout–update loop)
- Core innovation: Proposes Feedback-Driven Bi-Objective Synergistic RL: addressing the GRPO-style issue where the update direction is not explicitly grounded and heavily depends on the quality of sampled rollouts, it formulates a coupled bi-objective optimization that jointly (1) improves rollout quality/usability and (2) stabilizes the policy update direction. By enforcing synergy/constraints between sampling and updating, it reduces sampling noise and gradient-direction drift, strengthening the closed loop between rollout generation and parameter updates for more stable and efficient reasoning alignment.
- Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning
- 赛道归属: 多模态理解(复杂场景视觉推理)/ Agentic 强化学习
- 核心创新点: 提出一种以“放大镜”式信息获取为核心的智能体强化学习框架,让MLLM在复杂拥挤场景中通过主动、迭代的视觉聚焦与证据收集来提升推理可靠性;相较依赖标注框等显式视觉提示的方法,该思路用RL学习“看哪里、看多细、看几次”的策略,在避免额外标注的同时缓解低分辨率裁剪丢失细节的问题,从而增强细粒度识别与多步推理能力。
- Track: Multimodal understanding (complex-scene visual reasoning) / Agentic Reinforcement Learning
- Core innovation: Introduces an “agentic magnifying-glass” RL framework that trains an MLLM to actively and iteratively acquire visual evidence (where/what to zoom into and how to refine) for reliable reasoning in cluttered, high-density scenes. Unlike prior approaches that inject explicit cues (e.g., annotated boxes) and suffer from detail loss in low-res crops, it learns a sequential visual-attention/inspection policy via RL, improving fine-grained perception and multi-step reasoning without extra annotations.
- GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation
- 赛道归属: 文生矢量图(Text-to-SVG)/ 结构化图表生成 / 强化学习约束生成
- 核心创新点: 提出几何感知的强化学习框架,将SVG图表生成中的“可用性”问题显式建模为布局与几何约束优化:通过对连接线端点对齐、文本与边界/元素的非重叠、画布边界约束等几何规则进行可微或可评估的约束度量,构造面向结构有效性的奖励信号;在生成过程中用RL对策略进行优化以减少结构脆弱错误(如漂移、错连、越界),从而提升可编辑、可落地的专业级SVG图表输出稳定性。
- Track: Text-to-Vector Graphics (Text-to-SVG) / Structured diagram generation / RL for constrained generation
- Key innovations: Introduces a geometry-aware RL framework that explicitly optimizes “usability” of generated SVG diagrams under layout/geometry constraints. It formulates alignment of connector endpoints, text–shape/border non-overlap, and canvas-boundary constraints as measurable (and potentially differentiable) geometric criteria to build structure-validity rewards, then applies RL-based policy optimization during generation to reduce fragile structural failures (misconnections, drift, out-of-bounds), improving robustness and editability of professional-grade SVG outputs.
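The three kinds of geometric rules named above (canvas bounds, non-overlap, connector endpoint alignment) can be turned into a simple scalar score. This is a sketch under assumptions: shapes are reduced to axis-aligned boxes, and `Box`, `geometry_reward`, and the equal weighting of checks are hypothetical, not the paper's reward definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float

    def contains_point(self, px, py, tol=0.0):
        return (self.x - tol <= px <= self.x + self.w + tol
                and self.y - tol <= py <= self.y + self.h + tol)

def overlaps(a, b):
    """True if two axis-aligned boxes share interior area."""
    return a.x < b.x + b.w and b.x < a.x + a.w and a.y < b.y + b.h and b.y < a.y + a.h

def inside_canvas(box, width, height):
    return box.x >= 0 and box.y >= 0 and box.x + box.w <= width and box.y + box.h <= height

def geometry_reward(boxes, connectors, width, height, tol=1.0):
    """Fraction of geometric checks that pass (1.0 means no violations)."""
    checks = [inside_canvas(b, width, height) for b in boxes]
    checks += [not overlaps(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:]]
    for (x1, y1), (x2, y2) in connectors:
        # Each endpoint should lie on or inside some box, within the tolerance.
        checks.append(any(b.contains_point(x1, y1, tol) for b in boxes))
        checks.append(any(b.contains_point(x2, y2, tol) for b in boxes))
    return sum(checks) / len(checks) if checks else 1.0

boxes = [Box(10, 10, 40, 20), Box(100, 10, 40, 20)]
good = geometry_reward(boxes, [((50, 20), (100, 20))], 200, 100)
bad = geometry_reward(boxes, [((50, 20), (80, 20))], 200, 100)  # dangling endpoint
print(good, bad)
```

A score like this is only evaluable, not differentiable, which is one reason an RL-style objective is a natural fit for optimizing it.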
- Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning
- 赛道归属: LLM智能体强化学习(技能学习与OOD泛化)
- 核心创新点: 提出“Skill0.5”范式,将技能表示在“完全外置(占上下文)”与“完全内化(易过拟合)”之间做可学习的折中:联合优化技能的内化(写入模型参数形成可迁移能力)与使用(在推理时按需调用/组合技能以执行任务),从而在不显著增加上下文开销的前提下提升分布外任务的泛化与执行稳健性,并缓解技能库僵化选择带来的性能瓶颈。
- Track: LLM agent reinforcement learning (skill learning & OOD generalization)
- Core innovation: Proposes the “Skill0.5” paradigm that learns a middle ground between fully externalized skills (high context overhead) and fully internalized skills (overfitting risk). It jointly optimizes skill internalization (parameterized, transferable competence) and skill utilization (on-demand invocation/composition at inference), improving out-of-distribution generalization and robustness without large prompt/context costs, and avoiding rigid skill-representation choices.
- Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
- 赛道归属: 代码生成LLM后训练(离线强化学习 / RLHF替代路径)
- 核心创新点: 将离线强化学习系统性引入代码生成模型的后训练:直接利用既有代码数据集构建离线RL学习信号,绕开在线RL所需的高成本“生成-执行/验证-反馈”闭环;核心突破在于把代码任务的优化从在线交互转为离线策略改进,使训练更高效、更可扩展,并在实验中验证离线RL可有效提升代码生成性能。
- Track: Post-training for code-generation LLMs (offline reinforcement learning)
- Core innovation: Systematically applies offline RL to post-train code-generation LLMs by leveraging existing code datasets as offline experience, avoiding the expensive online loop of sampling, executing/verifying, and rewarding code. The key advance is reframing code optimization as offline policy improvement for efficiency and scalability, with empirical gains demonstrating offline RL as an effective post-training strategy.
- Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning
- 赛道归属: 机器人模仿学习(RL驱动IL / 表征对齐)
- 核心创新点: 针对“教师用特权状态、学生仅有观测”导致的不可消除模仿鸿沟,提出教师-学生表征对齐机制:不再单纯提升学生拟合,而是通过对齐两者的中间表示/特征空间,让教师策略的决策依据在学生可观测信息上变得可表达、可迁移;从方法论上把问题从“动作匹配”提升为“可模仿的表征学习”,以减少由信息不对称带来的系统性误差。
- Track: Robotics imitation learning (RL-driven IL / representation alignment)
- Core innovation: Addresses the irreducible imitation gap caused by privileged-state teachers and observation-only students via teacher–student representational alignment. Instead of only improving student action regression, it aligns intermediate feature spaces so the teacher’s decision basis becomes expressible from the student’s observations, reframing imitation from action matching to learnable, transferable representations and reducing systematic errors from information asymmetry.
- ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
- 赛道归属: 推荐系统强化学习(主动推荐 / 序列决策优化)
- 核心创新点: 提出ProRL,通过“校正的策略梯度估计(Rectified Policy Gradient)”解决主动推荐中路径级奖励带来的梯度估计偏差与不稳定:识别并刻画朴素policy gradient在PRS场景下的两类缺陷(与路径奖励/信用分配相关),并用校正机制改进梯度信号质量,使模型能更有效地学习“中间推荐路径”以同时优化短期接受率与长期偏好引导效果。
- Track: Reinforcement learning for recommender systems (proactive recommendation / sequential decision making)
- Core innovation: Proposes ProRL with a Rectified Policy Gradient estimator to fix biased/deficient gradient estimation in proactive recommender systems where path-level rewards drive learning. By identifying two key failure modes of naive policy gradients (tied to path rewards and credit assignment) and rectifying the gradient signal, it enables more stable and effective learning of intermediate recommendation paths that balance short-term acceptance and long-term preference steering.
GitHub
- [2026-05-30] Farama-Foundation/Gymnasium ⭐11970 🆕NEW
An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
- [2026-05-30] Farama-Foundation/ViZDoom ⭐2023 🆕NEW
Reinforcement Learning environments based on the 1993 game Doom :godmode:
- [2026-05-30] natolambert/rlhf-book ⭐1925 🆕NEW
Textbook on reinforcement learning from human feedback
- [2026-05-30] radixark/miles ⭐1461
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-05-30] ventr1c/Awesome-RL-based-Agentic-Search-Papers ⭐245 🆕NEW
The official repository of "A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and...
世界动作模型 / World Action Model
arXiv
- OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
- 赛道归属: 机器人操作(Vision-Language-Action / World Action Model)、动作空间建模与对齐、SE(3)轨迹预测
- 核心创新点: 提出OASIS,通过在中间表征中显式引入并对齐动作空间的刚体几何结构,缓解以往WAM/VLA主要停留在观测空间表征、导致动作解码器需“隐式恢复”SE(3)几何的问题;核心做法是将中间表示与SE(3)轨迹预测绑定,使策略在表示层面具备与动作同构的刚体运动先验,从而实现观测-动作空间对齐,降低动作解码难度并提升机器人操作的可学习性与泛化。
- Track: Robotic manipulation (Vision-Language-Action / World Action Model), action-space modeling & alignment, SE(3) trajectory prediction
- Core innovation: Proposes OASIS to explicitly align intermediate representations with the rigid-body geometry of the action space, addressing a common limitation of prior WAM/VLA approaches whose representations largely remain in observation space and force the action decoder to implicitly reconstruct SE(3) structure. The key idea is to couple the latent representation with SE(3) trajectory prediction, injecting action-isomorphic rigid-motion priors at the representation level, which simplifies action decoding and improves learnability and generalization for robotic manipulation.
GitHub
- [2026-05-29] DravenALG/awesome-vla-wam ⭐649
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent 生成时间: 2026-05-31 01:00:24