AI 每日进展速报 / Daily AI Digest - 2026-06-09
图像生成/编辑 / Image Generation/Editing
arXiv
- TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation 🆕NEW
- 赛道归属: 姿态引导文生图(Pose-guided Text-to-Image)/ 多模态扩散Transformer(MM-DiT)
- 核心创新点: 提出原生“三流”(triple-stream) 的扩散Transformer结构,将文本、图像潜变量与姿态条件以更结构化的方式解耦建模,避免在MM-DiT中直接拼接条件信号导致的预训练潜空间分布被破坏;通过为姿态引导建立独立且可控的信息注入路径,增强长程空间依赖建模能力,显著缓解多人复杂姿态下的肢体扭曲与特征串扰问题,并在SD3.5M架构上实现更稳定的姿态对齐与细节一致性。
- Track: Pose-guided Text-to-Image / Multimodal Diffusion Transformer (MM-DiT)
- Key innovation: Introduces a native triple-stream diffusion Transformer that structurally separates text, latent image tokens, and pose conditioning, avoiding naive concatenation that disrupts the pre-trained latent distribution in MM-DiTs; by creating a dedicated, controllable pose information pathway, it improves long-range spatial dependency modeling and reduces limb distortions and feature crosstalk in complex multi-person scenes, yielding more stable pose adherence and visual consistency on top of the SD3.5M backbone.
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
- 赛道归属: 文生图(Text-to-Image)评测基准 / 语义与世界知识对齐评估
- 核心创新点: 提出面向文生图的“世界知识驱动语义评测”基准WISE,将评估重点从传统的画质与浅层文本-图像匹配,提升到对复杂语义理解、隐含常识/事实知识、关系与组合推理等能力的系统化测量;通过构造需要外部世界知识才能判定对错的提示与判别维度,提供更能暴露模型“看似对齐但语义错误”的评测框架,从而推动T2I模型在知识一致性与深层语义对齐上的改进。
- Track: Text-to-Image evaluation benchmark / semantic & world-knowledge alignment assessment
- Key innovation: Introduces WISE, a world-knowledge-informed semantic evaluation benchmark for T2I that shifts emphasis from realism and shallow text-image matching to systematic measurement of complex semantic understanding—commonsense/factual knowledge, relations, and compositional reasoning. By designing prompts and evaluation dimensions that require external world knowledge to judge correctness, it better exposes “plausible-looking but semantically wrong” generations and drives progress on knowledge-consistent, deep semantic alignment.
- Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization
- 赛道归属: 文生图安全对齐 / 推理时安全防护(Text-to-Image Safety Alignment at Inference)
- 核心创新点: 提出一种仅在推理阶段生效的安全防护机制,通过对输入提示词注入并优化“提示噪声”(prompt-noise) 来抑制不安全内容的生成;其关键突破在于把安全约束转化为可优化的推理时变量,无需重新训练/微调模型即可动态调整生成轨迹,从而提升对绕过式提示与对抗攻击的鲁棒性,并在尽量保持画质与文本一致性的前提下实现更稳定的安全过滤。
- Track: Text-to-Image safety alignment / Inference-time safety defense
- Core innovation: Introduces an inference-only safeguarding method that injects and optimizes prompt noise to steer diffusion sampling away from unsafe regions. The key methodological step is formulating safety control as an optimizable inference-time variable, avoiding retraining while improving robustness to jailbreak prompts and adversarial attacks, with minimal degradation to image quality and prompt fidelity.
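To make the idea of safety as an optimizable inference-time variable concrete, here is a minimal sketch. It assumes a toy setting that is not taken from the paper: the prompt embedding is a short vector, the safety scorer is a dot product with a hypothetical `unsafe_dir`, and the objective is the squared unsafe score plus a penalty on the noise norm. The model weights (here, the embedding) are never changed; only the added noise is optimized.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def optimize_noise(embedding, unsafe_dir, steps=50, lr=0.1, penalty=0.1):
    """Gradient descent on noise added to a frozen prompt embedding.

    Objective (an assumption for illustration only):
        (unsafe score of embedding + noise)^2 + penalty * ||noise||^2
    """
    noise = [0.0] * len(embedding)
    for _ in range(steps):
        shifted = [e + n for e, n in zip(embedding, noise)]
        score = dot(shifted, unsafe_dir)
        # Gradient of the objective with respect to each noise component.
        grad = [2 * score * u + 2 * penalty * n for u, n in zip(unsafe_dir, noise)]
        noise = [n - lr * g for n, g in zip(noise, grad)]
    return noise


embedding = [1.0, 0.5]     # hypothetical prompt embedding
unsafe_dir = [1.0, 0.0]    # hypothetical direction a safety scorer penalizes
noise = optimize_noise(embedding, unsafe_dir)

before = dot(embedding, unsafe_dir)
after = dot([e + n for e, n in zip(embedding, noise)], unsafe_dir)
print(before, after)
```

The norm penalty is what keeps the noise small, which mirrors the stated goal of limiting damage to prompt fidelity; components of the embedding that the scorer does not penalize are left untouched.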
- Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation 🆕NEW
- 赛道归属: 文生图多样性提升(Text-to-Image Diversity)/ 表征调制(Representation Modulation)
- 核心创新点: 从“同质化输出”的根因出发,分析Transformer中间表征(尤其是潜变量/中间特征的收缩与聚集现象)对采样多样性的限制,提出基于表征调制的多样性增强策略:在不引入昂贵的多次采样、额外优化或外部搜索的前提下,通过对中间特征分布/通道响应进行可控扰动或重标定,打破固定prompt下的表示锁定(lock-in),以较低推理开销提升样本多样性,同时尽量保持文本对齐与画质。
- Track: Text-to-Image Diversity / Representation Modulation
- Key innovation: Targets the root cause of homogeneity by diagnosing how intermediate Transformer representations collapse/cluster and restrict sample diversity; proposes a representation-modulation mechanism that perturbs or re-scales intermediate features in a controlled manner to break prompt-conditioned “lock-in,” improving diversity without expensive extra sampling loops or auxiliary optimization, while largely preserving text-image alignment and visual quality.
- MemoGen: Can Past Experience Improve Future Text-to-Image Generation?
- 赛道归属: 文生图(Text-to-Image)生成增强 / 记忆与检索增强生成(Memory-augmented Generation)
- 核心创新点: 提出MemoGen,将“单次请求的检索/代理式增强”扩展为“跨任务可积累的经验记忆”机制:把历史生成中的成功/失败案例、隐含约束满足策略、有效提示改写或参考证据进行结构化存储,并在新请求到来时进行检索与复用,以提升对隐式视觉约束、关系推理与外部知识需求场景的可靠性;核心突破在于把T2I生成从一次性优化转为可持续学习的闭环(记录—检索—迁移),减少重复犯错并提高长期一致性。
- Track: Text-to-Image generation enhancement / memory-augmented (experience-reuse) generation
- Key innovation: Proposes MemoGen, extending retrieval/agentic augmentation from per-request assistance to an accumulative experience memory. It stores structured signals from past generations (success/failure cases, constraint-satisfaction tactics, effective prompt rewrites, supporting references) and retrieves them to guide future requests, improving reliability on implicit constraints, relational reasoning, and external-knowledge prompts. The key methodological step is turning T2I generation into a continual closed loop (log–retrieve–transfer) that reduces repeated errors and improves long-horizon consistency.
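The log, retrieve, transfer loop described for MemoGen can be pictured with a toy experience store. This is a minimal sketch under stated assumptions: `Experience` and `ExperienceMemory` are hypothetical names, and word overlap (Jaccard similarity) stands in for whatever retrieval the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    prompt: str     # the original request
    rewrite: str    # the prompt rewrite that was tried
    success: bool   # whether the result satisfied the request


@dataclass
class ExperienceMemory:
    """Hypothetical store; word overlap stands in for a real retriever."""
    items: list = field(default_factory=list)

    def log(self, exp: Experience) -> None:
        self.items.append(exp)

    def retrieve(self, prompt: str, k: int = 1) -> list:
        query = set(prompt.lower().split())

        def overlap(exp: Experience) -> float:
            words = set(exp.prompt.lower().split())
            return len(query & words) / (len(query | words) or 1)

        # Only successful experiences are reused as guidance.
        good = [e for e in self.items if e.success]
        return sorted(good, key=overlap, reverse=True)[:k]


memory = ExperienceMemory()
memory.log(Experience("a clock showing noon",
                      "a clock with both hands pointing at 12", True))
memory.log(Experience("a cat on a sofa", "a cat, sofa", False))

# A new request retrieves the most similar past success as a hint.
hints = memory.retrieve("a wall clock showing midnight")
print(hints[0].rewrite)
```

Failed attempts are kept in the store but filtered out at retrieval time in this sketch; a fuller system could also surface them as negative examples.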
- KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation
- 赛道归属: 文生图(公平性/去偏见)、提示词优化(Prompt Refinement)
- 核心创新点: 提出以知识图谱(Knowledge Graph)为约束与检索支撑的提示词自动精炼框架,在不重训/不改动闭源T2I主干模型的前提下,通过对人口统计属性与职业/场景等语义关系的显式建模,系统性地补全或重写提示词中的敏感与相关属性表达,从而在生成阶段实现更均衡的人群呈现;方法重点在“结构化知识→可控prompt变换”的映射,降低仅靠启发式词替换带来的语义漂移,并兼顾公平性提升与文本意图保持。
- Track: Text-to-Image (fairness/de-biasing), Prompt Refinement
- Core innovation: Introduces a knowledge-graph-guided prompt refinement framework that improves demographic fairness without retraining or modifying (potentially closed-source) T2I backbones. By explicitly modeling relationships between demographic attributes and contextual semantics (e.g., occupations, settings), it automatically augments/rewrites prompts to enforce more balanced representation at inference time. The key methodological advance is mapping structured knowledge constraints into controllable prompt transformations, reducing semantic drift compared to heuristic word swaps while preserving the original intent.
- RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
- 赛道归属: 文生图(可控生成)、训练免(Training-free)空间控制/条件注入
- 核心创新点: 提出一种同时具备“结构+外观”双重约束的训练免空间控制方案,通过改进特征注入/融合机制,在扩散采样过程中更稳定地对齐条件图像的几何结构并保留外观细节;针对训练免注入常见的结构错位、条件泄漏(把条件图像纹理/噪声直接拷入结果)与伪影问题,引入更精细的分层/分步控制与抑制策略,使结构遵循与外观一致性可以解耦调节,从而在无需LoRA/微调的情况下获得更可靠的空间可控生成。
- Track: Controllable Text-to-Image, Training-free spatial control / condition feature injection
- Core innovation: Proposes a training-free spatial control method that is rich in both structure and appearance constraints. It improves feature injection/fusion during diffusion sampling to better align geometry from conditional inputs while preserving appearance details. To address common training-free issues—structural misalignment, condition leakage (copying conditional textures/noise), and artifacts—it introduces finer-grained, stage-/layer-wise control and suppression mechanisms, enabling decoupled tuning of structural adherence vs. appearance fidelity without LoRA or finetuning.
- Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows 🆕NEW
- 赛道归属: 安全文生图与安全图像编辑(Safe T2I & Safe I2I)/ DiT多模态注意力安全对齐
- 核心创新点: 面向带多模态注意力(MM-Attn)的扩散Transformer,提出统一的“限制不安全信息流”(restricting unsafe information flows) 安全框架,解决现有安全机制偏向T2I或U-Net、难以覆盖I2I编辑的问题;核心在于在DiT的跨模态/上下文注入链路中识别并抑制不安全语义从条件端(文本、参考图、上下文示例等)向生成端传播的关键通道,实现in-context生成与编辑场景下的一体化安全控制,在尽量不牺牲正常内容生成能力的同时降低有害内容泄露与绕过风险。
- Track: Safe Text-to-Image & Safe Image-to-Image Editing / Safety alignment for MM-Attn DiTs
- Key innovation: Proposes a unified safety mechanism for diffusion Transformers with multimodal attention by explicitly restricting unsafe information flows through cross-modal/context injection pathways, addressing the gap where prior safety methods are tailored to T2I or U-Net and fail to generalize to I2I editing; by identifying and suppressing critical channels that propagate unsafe semantics from conditioning sources (text, reference images, in-context examples) into generation, it enables consistent safety mitigation across in-context generation and editing while minimizing degradation on benign outputs.
- Text-to-Image Models Need Less from Text Encoders Than You Think
- 赛道归属: 文生图(Text-to-Image)基础机制分析 / 文本编码器与条件表征消融(Representation/Conditioning Analysis)
- 核心创新点: 系统性检验文生图模型对文本编码器“丰富语义表征”(上下文、组合性、属性绑定等)的真实依赖程度,提出并验证:图像生成模型可能并未充分利用文本嵌入中的高阶语言信息,从而文本编码器并不需要想象中那么强;通过对文本表征不同成分的消融/替换与对生成质量、对齐能力的影响分析,给出更精确的“哪些文本信息是必要的”结论,为简化文本编码器、重分配模型容量、以及改进条件注入方式提供依据。
- Track: Text-to-Image mechanism analysis / text-encoder & conditioning representation ablation
- Key innovation: Systematically probes how much T2I models truly rely on “rich” text-encoder representations (context, compositionality, attribute binding). It argues and empirically tests that image generators may not fully exploit higher-order linguistic information in embeddings, implying text encoders can be simpler than commonly assumed. By ablating/replacing components of text representations and measuring impacts on generation quality and alignment, it pinpoints which textual signals are actually necessary, informing encoder simplification, capacity reallocation, and improved conditioning injection designs.
- Pinterest Canvas: Large-Scale Image Generation at Pinterest
- 赛道归属: 工业级图像生成系统(大规模部署)、图像编辑/增强(生成式编辑)
- 核心创新点: 面向Pinterest产品级强约束场景,提出端到端的大规模图像生成与编辑系统化方案:通过在多模态大规模数据上进行针对“编辑/增强”任务的训练与系统工程化设计,弥补通用生成模型“可用但难控”的落地缺口;核心突破在于将模型能力、数据构建、训练目标与线上控制/质量保障机制协同设计,使生成结果在风格一致性、可控性、安全与稳定性等产品指标上可达可运营水平,而不仅依赖提示词或轻量推理技巧。
- Track: Production-scale image generation systems, Generative image editing/enhancement
- Core innovation: Presents a product-oriented, large-scale image generation and editing system for Pinterest, targeting use cases with strict controllability requirements where generic models are flexible but hard to steer. The key contribution is the co-design of model training (on diverse large-scale multimodal data with editing/enhancement objectives) and system-level controls/quality mechanisms for online deployment, achieving operational-grade controllability, consistency, safety, and stability beyond prompt-only or minor inference-time adaptations.
GitHub
- [2026-06-09] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐12438
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-06-08] Light-Heart-Labs/DreamServer ⭐1926
Turn your PC, Mac, or Linux box into an AI server. LLM inference, chat UI, voice, agents, workflows, RAG, and image generation.
- [2026-06-08] AceDataCloud/Nexior ⭐373
Consumer AI app for chat, image generation, video generation, and music creation powered by Ace Data Cloud APIs.
- [2026-06-08] ferranpons/Llamatik ⭐144 🆕NEW
True on-device AI for Kotlin Multiplatform (Android, iOS, Desktop, JVM, WASM). LLM, Speech-to-Text and Image Generation — powered by llama.cpp, whispe...
- [2026-06-08] Z1rconium/gpt-image-linux ⭐78
Self-hosted web panel for GPT-compatible image generation APIs — generate, edit, and manage your images in one place.
HuggingFace Models
HuggingFace Datasets
- [2026-05-29] jasperai/monet
Dataset Card for MONET
MONET (Massive, Open, Non-redundant and Enriched Text-to-image dataset) is a large-scale, curated image-text dat...
视频生成/编辑 / Video Generation/Editing
arXiv
- Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation
- 赛道归属: 身份保持文本到视频生成(Reference-conditioned T2V / Video Generation)
- 核心创新点: 提出ST-DRC(Spatial-Temporal Decoupled Reference Conditioning)框架,将参考身份条件在空间与时间维度解耦注入视频扩散/生成过程:用空间侧的细粒度特征强化单帧身份细节(如脸部结构、纹理一致性),用时间侧的机制约束跨帧身份稳定与时序一致,从而在“文本语义可控性”和“低层身份保真度”之间实现更好的平衡;框架层面强调晚期/分阶段的条件融合以减少文本驱动对身份特征的干扰并提升长序列稳定性。
- Track: Identity-preserving text-to-video generation (reference-conditioned T2V / video generation)
- Key innovation: Proposes ST-DRC, a Spatial-Temporal Decoupled Reference Conditioning framework that injects identity reference signals separately along spatial and temporal axes in the video generation (diffusion) process: spatial conditioning strengthens per-frame identity details (geometry/texture), while temporal conditioning enforces cross-frame identity stability and temporal coherence. The method emphasizes late/staged conditioning fusion to reduce interference from text semantics and improve long-range identity consistency.
- SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
- 赛道归属: 视频生成安全评测(Image-conditioned T2V Safety Benchmark / Evaluation)
- 核心创新点: 提出SafeGen-Bench,面向图像条件引导的文本到视频生成系统化评测其安全风险,补齐现有安全基准主要聚焦纯文本模式的缺口;通过覆盖非法/政治敏感/伦理风险等多类场景与触发方式,构建更贴近真实使用链路的测试集与评测协议,用于量化模型在“给定初始图像+文本”条件下的越界生成倾向与防护能力,从而推动安全对齐在I2V/T2V条件生成中的可比、可复现评估。
- Track: Safety benchmarking for image-conditioned text-to-video generation (evaluation/benchmark)
- Key innovation: Introduces SafeGen-Bench to systematically evaluate safety risks specifically in image-conditioned T2V settings, addressing the gap of prior benchmarks that mainly test text-only generation. It broadens risk coverage (illegal/political/ethical categories and triggers) and provides a more realistic evaluation protocol to quantify unsafe generation propensity and safety guard effectiveness under “input image + prompt” conditioning, enabling comparable and reproducible safety assessment.
- MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
- 赛道归属: 文生视频(Text-to-Video)/ 提示词工程与多智能体协同(Multi-agent Prompt Refinement)
- 核心创新点: 提出多智能体提示词精炼框架MAVEN,面向“多文化一致性/文化保真度”这一以往T2V较少系统覆盖的目标进行优化;方法上将文本提示分解为“人物(Person)-动作(Action)-地点(Location)”三维语义槽位,由具备专长的代理分别并行或串行地改写与约束,从而在单一文化与跨文化组合提示中减少文化符号混淆与刻板化偏差;同时构建支持系统评测的多文化/跨文化基准与流程,使文化保真度从主观描述转为可对比的评估闭环。
- Track: Text-to-Video / Prompt Engineering with Multi-Agent Collaboration (Multi-agent Prompt Refinement)
- Core innovations: Introduces MAVEN, a multi-agent prompt refinement framework targeting cultural fidelity, a dimension underexplored in prior T2V work; technically, it decomposes prompts into three semantic slots—Person, Action, and Location—and assigns specialized agents to refine/ground each slot in parallel or sequential modes, reducing cultural symbol confusion and stereotyping in mono-cultural and cross-cultural prompts; additionally, it establishes a systematic multicultural/cross-cultural evaluation setup to make cultural fidelity more measurable and comparable.
- Knowledge-Intensive Video Generation
- 赛道归属: 知识密集型文本到视频生成评测(Factuality/Helpfulness Evaluation for T2V)
- 核心创新点: 定义“知识密集型视频生成(KIVI)”任务:针对解释、流程、演示类信息检索式短提示,要求生成视频不仅好看还要事实正确且有用;构建KIVI-Bench(1080条提示)并提出面向事实性(factuality)与帮助性(helpfulness)的自动评测指标,且通过人工评测验证指标相关性,从评测体系上把T2V从感知质量扩展到“知识/实用性”维度,为后续引入检索增强、工具使用或知识对齐的T2V方法提供可量化目标。
- Track: Knowledge-intensive text-to-video generation evaluation (factuality/helpfulness)
- Key innovation: Formulates Knowledge-Intensive Video Generation (KIVI), where prompts request explanations/procedures/demonstrations and outputs must be factually correct and practically helpful, not just visually appealing. Releases KIVI-Bench (1,080 prompts) and proposes automatic metrics for factuality and helpfulness, validated via human studies, extending T2V evaluation from perceptual quality to knowledge/utility and enabling measurable targets for retrieval/tool-augmented or knowledge-aligned T2V models.
- Consistency-Preserving Diverse Video Generation 🆕NEW
- 赛道归属: 视频生成(多样性采样/一致性保持,Flow-Matching)
- 核心创新点: 提出面向Flow-Matching视频生成器的联合采样(joint-sampling)框架,在“每个提示词只能生成少量样本”的低采样场景下,显式提升跨视频(batch内)多样性同时保持单个视频内部的时序一致性。相较将图像多样性技巧直接迁移到视频而导致时间一致性下降的方法,该框架避免或减少对视频解码器进行昂贵的反向传播优化,从采样层面实现“多样性-一致性”兼顾。
- Track: Video Generation (diverse sampling with consistency preservation, Flow-Matching)
- Key innovation: Proposes a joint-sampling framework for Flow-Matching video generators to maximize cross-video (in-batch) diversity in the low-sample regime while preserving within-video temporal consistency. Unlike image-diversity tricks that often harm temporal coherence when applied to video, the method achieves a better diversity–consistency trade-off primarily at the sampling level, avoiding or reducing costly backpropagation through the video decoder.
- Streaming Video Generation with Streaming Force Control 🆕NEW
- 赛道归属: 视频生成(流式/因果生成 + 物理控制)
- 核心创新点: 提出StreamForce,一个因果(causal)的流式视频生成框架,可通过连续、时变的力(force)输入实现物理可解释的控制,并能对局部/全局力信号即时响应且保持时序连贯。方法上通过设计统一的力表示作为控制条件,避免为不同力类型训练多个模型或假设力恒定;并引入蒸馏/训练策略使模型在流式生成设置下稳定对齐控制信号,实现“边生成边受控”的统一范式。
- Track: Video Generation (streaming/causal generation + physics-based control)
- Key innovation: Introduces StreamForce, a causal streaming video generation framework enabling physically grounded control via continuous, time-varying force inputs. It unifies diverse force types through a single force representation (instead of separate models or fixed-force assumptions) and employs a distillation/training scheme to ensure instant, coherent responses to both local and global forces under streaming generation.
- CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models 🆕NEW
- 赛道归属: 视频生成评测(文化一致性/文化忠实度评估)
- 核心创新点: 提出CultureScore,用于评估视频生成模型的文化忠实度(cultural faithfulness),弥补现有指标(如仅衡量画质的VideoScore)无法区分文化语义正确与否的问题。其方法论突破在于采用组合式(compositional)评测框架,将“文化表达是否正确”分解为可验证的子维度/要素(例如特定手势、服饰、仪式、语境等),从而能对“视觉质量相近但文化符号错误”的生成结果给出显式惩罚与可诊断反馈。
- Track: Video Generation Evaluation (cultural faithfulness)
- Key innovation: Proposes CultureScore, a metric/framework targeting cultural faithfulness in video generation, addressing the gap where quality-focused metrics (e.g., VideoScore) cannot detect culturally incorrect substitutions such as wrong gestures. The key methodological advance is a compositional evaluation that decomposes cultural correctness into checkable components (for example specific gestures, clothing, rituals, and context), enabling diagnostic scoring that separates visual fidelity from culturally grounded semantic accuracy.
- LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing 🆕NEW
- 赛道归属: 视频生成与视频编辑(统一多模态条件输入/高效条件融合)
- 核心创新点: 提出LoomVideo,面向“生成+编辑”统一模型,能够理解交错的多模态输入并用于视频生成与编辑。针对现有统一框架常用“把源视频条件token直接拼接到序列里”导致序列长度翻倍、注意力计算量近似四倍且依赖超大参数规模的问题,LoomVideo在条件融合上做结构性改造:以更高效的方式将源视频/多模态条件注入模型,避免简单拼接带来的注意力复杂度爆炸,从而在更可控的计算预算下实现统一的生成与编辑能力。
- Track: Video Generation & Video Editing (unified multimodal conditioning / efficient condition fusion)
- Key innovation: Presents LoomVideo, a unified model for video generation and editing that can interpret interleaved multimodal inputs. It tackles the common bottleneck where editing conditions are injected by token concatenation, which doubles sequence length and roughly quadruples self-attention cost, often forcing very large models. LoomVideo introduces a more compute-efficient conditioning/fusion mechanism to incorporate source-video and multimodal conditions without naive concatenation, enabling unified generation/editing under tighter compute budgets.
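The "doubles sequence length, roughly quadruples attention cost" claim follows from full self-attention scoring every query against every key, so the work grows with the square of the token count. A small worked example, assuming hypothetical token counts where the source-video condition is as long as the target video:

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


target_tokens = 4096   # hypothetical length of the video being generated
source_tokens = 4096   # hypothetical length of the source-video condition

baseline = attention_pairs(target_tokens)
concatenated = attention_pairs(target_tokens + source_tokens)

ratio = concatenated / baseline
print(ratio)  # 4.0
```

The ratio is exactly 4 only when both sequences have the same length; a shorter condition sequence gives a smaller but still quadratic increase.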
- Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
- 赛道归属: 视频生成 × 机器人(具身智能)/ 物理一致性评测与可执行性(Executable Manipulation from Generated Video)
- 核心创新点: 提出“从生成视频到可执行机器人操作”的评测范式Dream.exe,把视频生成模型是否学到物理规律的问题转化为可量化的具身任务:将模型生成的操作过程视频作为中间表示,进一步映射/提取为机器人可执行的动作序列并在真实或仿真环境中验证执行效果;该思路以“能否落地执行”作为强约束信号,绕开仅凭视觉逼真度评估的局限,从而系统检验生成模型对接触动力学、时序因果与可操作性(affordance)的隐式建模能力,并为后续将生成模型用于机器人策略生成/数据合成提供可复现的测试框架。
- Track: Video Generation × Robotics (Embodied AI) / Physical-Consistency Evaluation via Executable Manipulation
- Core innovations: Proposes Dream.exe, an evaluation paradigm that turns the question “do video generators internalize physics?” into a measurable embodied task: use a model-generated manipulation video as an intermediate representation, convert/parse it into robot-executable action sequences, and validate by execution in real or simulated environments; by enforcing executability as a hard constraint, it goes beyond visual realism metrics to systematically probe implicit modeling of contact dynamics, temporal causality, and affordances, and provides a reproducible framework toward using generative video models for robot policy generation or data synthesis.
- Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control
- 赛道归属: 视频生成(扩散模型可控生成 / 安全对齐 / 激活操控)
- 核心创新点: 提出一种将视频扩散模型的“激活干预”形式化为降阶线性最优控制的问题(Reduced-Order Linear Optimal Control),在潜空间/中间激活上学习线性、可预见(anticipative)的控制律,而非以往逐帧/局部、非前瞻的粗粒度 steering。通过对高维激活动力学做降阶建模与线性二次型(LQ)式控制(文中称 Latent Activation Linear-Q…),实现对不良内容的定向抑制同时降低过度干预导致的画质与语义退化,并以机制化干预替代再训练与提示词过滤的高成本/不稳定性。
- Track: Video generation (controllable diffusion / safety alignment / activation steering)
- Core innovation: Formulates activation steering for text-to-video diffusion models as a reduced-order linear optimal control problem. Instead of coarse, non-anticipative, per-step interventions, it learns a linear, anticipative control law over latent/intermediate activations by building a low-dimensional surrogate of high-dimensional activation dynamics and applying an LQ-style controller (Latent Activation Linear-Q…). This enables targeted suppression of undesired content while mitigating oversteering-induced quality/semantic degradation, providing a mechanistic alternative to finetuning or prompt filtering.
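For readers unfamiliar with LQ-style control, the sketch below shows the textbook finite-horizon linear-quadratic regulator in the scalar case. It is a generic illustration, not the paper's controller: the state `x` stands in for a hypothetical one-dimensional "undesired concept" coordinate of the reduced-order activation, and `a`, `b`, `q`, `r` are assumed values. The gains are computed backward from the final step, which is what makes the control law anticipative.

```python
def lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for x[t+1] = a*x[t] + b*u[t] (scalar case).

    Cost: sum of q*x^2 + r*u^2 over the horizon, plus terminal cost q*x^2.
    """
    p = q                      # terminal cost weight
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()            # gains[t] is the gain to apply at step t
    return gains


def rollout(x0, a, b, gains):
    """Apply the feedback law u[t] = -gains[t] * x[t] and record the states."""
    x = x0
    trajectory = [x]
    for k in gains:
        u = -k * x
        x = a * x + b * u
        trajectory.append(x)
    return trajectory


# Assumed toy dynamics: the concept coordinate drifts upward (a > 1) if left alone.
gains = lqr_gains(a=1.05, b=1.0, q=1.0, r=1.0, horizon=5)
trajectory = rollout(x0=1.0, a=1.05, b=1.0, gains=gains)
print(trajectory)
```

Raising `r` relative to `q` makes interventions more expensive, so the controller steers more gently; this is the same knob that trades suppression strength against oversteering in the description above.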
GitHub
- [2026-06-08] HKUDS/ViMax ⭐9159 🆕NEW
"ViMax: Agentic Video Generation (Director, Screenwriter, Producer, and Video Generator All-in-One)"
- [2026-06-09] hao-ai-lab/FastVideo ⭐3695
A unified inference and post-training framework for accelerated video generation.
- [2026-06-08] ZeroLu/awesome-seedance ⭐1899
The ultimate collection of high-fidelity Seedance 2.0 prompts and Seedance AI resources. Discover Seedance 2.0 how to use for cinematic film, anime, U...
- [2026-06-08] YouMind-OpenLab/awesome-seedance-2-prompts ⭐1318
🎬 2000+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency t...
- [2026-06-08] guaardvark/guaardvark ⭐55 🆕NEW
The self-hosted AI workstation. Autonomous screen agents, 3-tier neural routing, parallel agent swarms, video generation, 4K/8K upscaling, RAG, voice ...
HuggingFace Models
音频生成 / Audio Generation
arXiv
- ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
- 赛道归属: 文本到语音(TTS)/ 场景化语音生成(语音+环境声融合)
- 核心创新点: 提出环境感知TTS框架,通过多模态扩散Transformer显式建模语音与环境上下文(如场景/视觉/环境音提示)之间的跨模态交互,解决语音与环境声在声学形态与时间动态上的分布差异;并引入面向领域的表征对齐机制,将“语音生成表征”与“环境/场景表征”在统一空间中对齐,从而实现语音与环境声的自然共存与无缝融合(而非后期拼接)。
- Track: Text-to-Speech (TTS) / Scene-aware speech generation (speech + ambient sound integration)
- Core innovations: Proposes an environment-aware TTS framework that uses a multimodal Diffusion Transformer to explicitly model cross-modal interactions between speech and environmental context (e.g., scene/visual/ambient cues), addressing the distribution and temporal-dynamics mismatch between speech and environmental audio; introduces domain-specific representation alignment to map speech-generation features and environment/scene features into a shared space, enabling coherent in-scene speech generation rather than post-hoc mixing.
- Audio-Oscar: A Multi-Agent System for Complex Audio Scene Generation, Orchestration, and Refinement 🆕NEW
- 赛道归属: 音频生成(复杂音频场景生成 / 多智能体编排与后期精修)
- 核心创新点: 提出多智能体(multi-agent)框架,将“复杂音频场景描述→长音频成品”的生成过程拆解为可协作的子任务(如对白/音效/音乐/时间结构/后期处理等),通过代理间的规划、编排与迭代式精修实现长时序结构化生成与可控性提升;重点突破在于用系统级的分工与闭环优化机制,缓解单模型端到端生成在长音频一致性、元素协调与制作级后处理上的困难。
- Track: Audio Generation (Complex audio scene generation / multi-agent orchestration & post-production refinement)
- Core innovation: Introduces a multi-agent framework that decomposes complex scene-to-audio generation into coordinated sub-agents (e.g., speech, SFX, music, temporal layout, post-processing) and uses planning/orchestration plus iterative refinement to improve long-form structure and controllability; the key methodological advance is a system-level division-of-labor and closed-loop refinement pipeline that mitigates drift and poor cross-element coordination in monolithic end-to-end models.
- UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion
- 赛道归属: 统一音频生成与编辑(Text-to-Audio/TTS/音频编辑一体化,多任务扩散)
- 核心创新点: 用单一潜空间扩散模型统一覆盖文本到音频、文本到语音、零样本音色克隆、语音+音效混合生成、场景级音频编辑与时间编排等任务,实现“同权重多能力”;关键方法是层级式深度LLM融合(将LLM多层隐状态注入扩散网络以增强语义与结构控制)以及面向多任务的统一条件接口/训练范式,使生成与编辑在同一潜空间与同一推理管线内闭环完成,减少任务间割裂与模型堆叠。
- Track: Unified audio generation & editing (Text-to-Audio/TTS/audio editing; multi-task diffusion)
- Core innovations: Introduces a single latent diffusion model that unifies text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, and temporal composition under one set of weights; key is layer-wise deep LLM fusion—injecting multi-layer LLM hidden states into the diffusion network for stronger semantic/structural control—plus a unified conditioning/training scheme so generation and editing operate in the same latent space and inference pipeline, avoiding fragmented task-specific stacks.
- Audio Imitator: Controlling Timbre and Tempo in Video2Audio Synthesis with Audio Reference 🆕NEW
- 赛道归属: 音频生成(Video-to-Audio / 参考音频驱动的风格可控合成)
- 核心创新点: 提出属性感知的Video2Audio框架,将参考音频中的“音色(timbre)”与“速度/节奏(tempo)”显式建模为可控属性,而非把参考音频当作整体条件直接注入;通过对风格属性的解耦表示与定向条件化,实现对生成音频风格维度的细粒度控制,同时保持与视频语义和时间对齐的一致性。
- Track: Audio Generation (Video-to-Audio / reference-audio-driven controllable synthesis)
- Core innovation: Proposes an attribute-aware Video2Audio method that explicitly models timbre and tempo from reference audio as disentangled, controllable attributes rather than using the reference as a single holistic condition; this enables fine-grained style control (timbre/tempo) while preserving semantic consistency and temporal alignment with the input video.
- dots.tts Technical Report 🆕NEW
- 赛道归属: 语音生成(Text-to-Speech / 连续潜空间自回归基础模型)
- 核心创新点: 提出2B参数的连续自回归TTS基础模型,在连续潜空间中建模语音;方法上三点关键突破:1)训练带多目标的AudioVAE,构建语义结构更强、对预测更友好的连续语音潜空间;2)在flow-matching生成头中采用“全历史(full-history)条件化”,增强长程一致性并降低长音频漂移;3)引入(报告中描述的)训练/推理层面的稳定化与效率设计,使连续自回归在质量与稳定性上更可用。
- Track: Speech Generation (Text-to-Speech / continuous-latent autoregressive foundation model)
- Core innovation: Presents a 2B-parameter continuous autoregressive TTS foundation model operating in a continuous latent speech space; key methodological advances include (1) a multi-objective AudioVAE to build a semantically structured, prediction-friendly latent space, (2) full-history conditioning in the flow-matching head to improve long-range consistency and reduce drift, and (3) additional stability/efficiency-oriented training/inference designs described in the report to make continuous AR generation more robust.
- Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
- 赛道归属: 视频到音频生成(Video-to-Audio)、多模态统一音频生成(Unified Audio Generation)
- 核心创新点: 提出统一的多模态音频生成框架,将传统“单任务级”的语音/音效/音乐生成扩展为“整段视频完整配乐(soundtrack)”的一体化联合生成:在同一模型中对语音、拟音(foley)、环境声与音乐等多音频组件进行协同建模与联合采样,使各组件在时间轴上对齐、在语义与风格上保持一致,从而面向真实视频制作流程实现端到端的完整声轨生成(而非彼此独立的分段合成)。
- Track: Video-to-Audio generation, Unified multimodal audio generation
- Key innovation: Proposes a unified multimodal audio generation framework that moves beyond isolated task-level synthesis (speech/SFX/music) to end-to-end full video soundtrack generation. The model jointly models and co-generates multiple audio components (speech, foley, ambience, and music) within a single system, enforcing temporal alignment and semantic/style consistency across components to produce a coherent, production-ready soundtrack rather than separately generated audio segments.
- Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
- 赛道归属: 语音生成|文本到语音(TTS)|可解释情感控制(表示解析/可控生成)
- 核心创新点: 利用稀疏自编码器(SAE)对LLM-TTS的语义隐状态进行分解与稀疏表征学习,从模型内部表示中自动“挖掘/定位”与情感变化相关的稀疏特征(而非依赖外部情感条件或整体激活粗粒度操控)。该思路将情感控制从黑盒条件注入转为可解释的内部特征级干预:通过识别情感相关的稀疏方向/单元,实现更可诊断、可编辑的情感调节,并为理解情感在TTS隐空间中的编码方式提供机制化证据。
- Track: Speech Generation | Text-to-Speech (TTS) | Interpretable emotion control (representation analysis / controllable generation)
- Core innovation: Applies sparse autoencoders (SAEs) to decompose and sparsify semantic hidden states in LLM-based TTS, automatically isolating emotion-related sparse features from internal representations rather than relying on external emotion conditioning or coarse global activation steering. This reframes emotion control as interpretable, feature-level intervention: by identifying emotion-linked sparse directions/units, the method enables more diagnosable and editable emotion modulation and provides mechanistic insight into how emotion is encoded in the TTS latent space.
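Feature-level intervention with a sparse autoencoder can be sketched as encode, edit one feature, decode. The weights below are hand-picked for a 2-D hidden state and are purely hypothetical; in practice the SAE is trained on the model's hidden states, and which feature tracks emotion has to be discovered rather than assumed.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Hypothetical weights: 2-D hidden state, 3 sparse features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_DEC = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]


def encode(h):
    """Hidden state -> non-negative, mostly-zero feature activations."""
    return relu(matvec(W_ENC, h))


def decode(f):
    """Feature activations -> reconstructed hidden state."""
    return matvec(W_DEC, f)


def steer(h, feature_index, scale):
    """Rescale a single feature and map back to the hidden space."""
    f = encode(h)
    f[feature_index] *= scale
    return decode(f)


h = [2.0, 1.0]
features = encode(h)
# Treat feature 1 as the (hypothetical) emotion-linked feature and amplify it.
steered = steer(h, feature_index=1, scale=3.0)
print(features, steered)
```

Because only one feature is edited, the change to the hidden state is confined to that feature's decoder direction, which is what makes the intervention inspectable.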
- DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech
- 赛道归属: 语音生成|文本到语音(TTS)|扩散/Flow-Matching 可控生成|即插即用情感控制
- 核心创新点: 发现预训练扩散与flow-matching TTS的冻结隐状态中,情感与说话人身份分别对应近似线性可解码且近乎正交的方向,从而提出DUET的“双空间”统一控制:在不重训主体模型的前提下,以plug-and-play方式在生成过程中沿情感方向进行可控操纵,同时尽量不扰动说话人方向以降低身份泄漏/纠缠。该方法论突破在于把“情感-身份解耦”具体化为可操作的几何结构(线性方向+近正交),并将其转化为跨扩散与flow-matching范式通用的推理期控制接口。
- Track: Speech Generation | Text-to-Speech (TTS) | Diffusion/Flow-Matching controllable generation | Plug-and-play emotion control
- Core innovation: Shows that in pretrained diffusion and flow-matching TTS, emotion and speaker identity correspond to (approximately) linearly decodable and nearly orthogonal directions in frozen hidden states. Based on this geometry, DUET introduces unified “dual-space” control: a plug-and-play inference-time manipulation that steers generation along the emotion direction while minimally perturbing the speaker direction to reduce identity–emotion entanglement, without retraining the backbone. The key methodological advance is operationalizing emotion–identity disentanglement as actionable latent geometry (linear directions + near-orthogonality) and turning it into a model-agnostic control interface across both diffusion and flow-matching TTS.
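The geometric idea (steer along the emotion direction without moving along the speaker direction) can be sketched with a single orthogonal projection. This is an illustration under assumptions, not DUET's actual procedure: the vectors are tiny hypothetical examples, and `remove_component` and `steer` are names introduced here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_component(direction, reference):
    """Project `direction` onto the subspace orthogonal to `reference`."""
    scale = dot(direction, reference) / dot(reference, reference)
    return [d - scale * r for d, r in zip(direction, reference)]


def steer(hidden, emotion_dir, speaker_dir, strength):
    """Shift the hidden state along the emotion direction only."""
    safe_dir = remove_component(emotion_dir, speaker_dir)
    return [h + strength * s for h, s in zip(hidden, safe_dir)]


speaker_dir = [1.0, 0.0, 0.0]
emotion_dir = [0.1, 1.0, 0.0]   # nearly, but not exactly, orthogonal to speaker_dir
hidden = [0.5, 0.2, -0.3]

steered = steer(hidden, emotion_dir, speaker_dir, strength=2.0)

# The speaker coordinate is unchanged; only the emotion coordinate moves.
print(dot(hidden, speaker_dir), dot(steered, speaker_dir))
```

When the two directions are already close to orthogonal, as the entry reports, the projection changes the emotion direction very little, so the correction is cheap insurance against identity leakage.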
- SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
- 赛道归属: 长文本零样本TTS / 对话式语音合成(多说话人、情感与一致性建模)
- 核心创新点: 面向长篇独白与多轮对话的零样本语音合成,针对“逐轮合成再拼接”导致的音色一致性、韵律连贯性与情绪连续性断裂问题,提出在单模型内联合建模跨轮次的对话上下文与表达状态(如情感/语气/节奏的持续变量),在生成时维持跨turn的声学一致与对话连贯;强调长程依赖与多说话人切换下的表达可控与稳定性,而非仅提升单句质量。
- Track: Long-form zero-shot TTS / Dialogue speech synthesis (multi-speaker, expressive consistency)
- Core innovations: Targets long-form monologue and multi-turn dialogue in zero-shot TTS, addressing the common “synthesize-per-turn then stitch” workaround that breaks timbre, prosody, and affect continuity; proposes single-model joint modeling of cross-turn dialogue context and persistent expressive states (e.g., emotion/intonation/rhythm as continuous trajectories), maintaining acoustic consistency and conversational coherence across turns while supporting multi-speaker switching and expressive control over long horizons.
- Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer
- 赛道归属: 流式空间音频生成(视频/文本条件的Spatial Audio,低延迟生成)
- 核心创新点: 提出面向实时的流式空间音频生成统一框架,使用自回归扩散Transformer在“可流式输出”的约束下实现高保真生成,并强化与全景视频/文本提示的时序同步与空间一致性;核心突破在于把扩散生成改造为可在线推进的自回归/分段式推理范式,在降低推理延迟的同时保持空间线索(方位、距离、运动)建模精度,缓解“质量-延迟”权衡与多模态空间对齐困难。
- Track: Streaming spatial audio generation (video/text-conditioned spatial audio; low-latency)
- Core innovations: Proposes a unified streaming framework for real-time spatial audio generation conditioned on panoramic video and text, built on an autoregressive Diffusion Transformer to enable incremental (online) synthesis; key contribution is adapting diffusion-style generation to a streaming-compatible autoregressive/segmented inference scheme that preserves high fidelity while improving latency, and strengthening temporal synchronization and spatial consistency (direction/distance/motion cues) from multimodal inputs, mitigating the quality–latency tradeoff and multimodal spatial alignment challenges.
GitHub
- [2026-06-08] huggingface/diffusers ⭐33804
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
- [2026-06-08] BinWang28/audio-ai-hub ⭐930
The hub for audio AI research: papers, open models, benchmarks & datasets across audio LLMs, speech recognition, TTS, music & audio generation.
- [2026-06-05] apocas/restai ⭐510
RESTai is an AIaaS (AI as a Service) open-source platform. Supports many public and local LLM supported by Ollama/vLLM/etc. Precise embeddings usage, t...
- [2026-06-08] xiaomi-research/controlfoley ⭐131 🆕NEW
ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
- [2026-06-08] dgrauet/ltx-2-mlx ⭐55
Pure MLX port of LTX-2 (Lightricks LTX-2.3) for Apple Silicon — video + audio generation
HuggingFace Models
语言大模型 / Large Language Models
arXiv
- Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
- 赛道归属: 推理优化(可控推理/测试时推理控制)
- 核心创新点: 将“推理过程如何展开”的控制显式化为一个马尔可夫决策过程(MDP):引入控制器智能体在推理时按状态自适应决策(如继续思考、切换策略、停止等),以最小化无效token消耗并在不显著牺牲准确率的前提下实现可控的推理长度与推理轨迹;相较仅做截断/早停/压缩的效率方法,ACTS把“思考策略”作为可学习/可调度的动作空间,从而提供更细粒度的推理时控制与效率-性能权衡。
- Track: Reasoning optimization (controllable inference / test-time reasoning control)
- Key innovation: Makes “how the model reasons” an explicit control problem by formulating chain-of-thought steering as an MDP: a controller agent adaptively selects actions at inference (e.g., continue, change strategy, stop) based on the current reasoning state, reducing wasted tokens while maintaining accuracy and enabling controllable reasoning length/trajectory. Unlike prior efficiency methods that mainly shorten/early-stop/compress traces, ACTS treats reasoning strategy as an explicit, schedulable action space for finer-grained control over the efficiency–accuracy trade-off.
- An Asymptotic Theory of Chain-of-Thought in In-Context Learning
- 赛道归属: 理论分析(In-Context Learning / Chain-of-Thought 机理与尺度律)
- 核心创新点: 在一个可解析的理论模型中刻画CoT深度与泛化性能的尺度行为:将测试时CoT推理形式化为对线性回归中“权重参数估计”的迭代精炼过程(iterative refinement),从而推导随推理步数增加时误差/泛化的渐近规律与收益递减条件;该框架把“CoT=迭代算法”的观点落到可证明的渐近理论上,为理解何时加深CoT有效、何时无效提供了可计算的判据。
- Track: Theoretical analysis (in-context learning / chain-of-thought mechanism & scaling laws)
- Key innovation: Develops an analytically solvable model to characterize how generalization scales with CoT depth: models test-time CoT as iterative refinement of the weight-parameter estimate in linear regression (in-context weight prediction), enabling asymptotic derivations of error/generalization behavior as the number of reasoning steps grows and identifying regimes of diminishing returns. This provides provable, computable criteria for when deeper CoT helps versus when it does not.
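A minimal sketch of the "CoT as iterative refinement" view, assuming (for illustration only) that each reasoning step corresponds to one gradient-descent update on a one-dimensional least-squares loss. The function name `refine_weights`, the step size, and the toy data are hypothetical and are not taken from the paper.

```python
def refine_weights(xs, ys, steps, lr=0.1):
    """Run `steps` gradient-descent updates on mean squared error for y ≈ w * x.

    Returns the whole trajectory of weight estimates, starting from w = 0.
    """
    w = 0.0
    history = [w]
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w * x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
        history.append(w)
    return history


# Toy in-context examples generated by the true weight w = 2 (hypothetical data).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
trajectory = refine_weights(xs, ys, steps=8)
errors = [abs(w - 2.0) for w in trajectory]
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
```

In this toy setting the error shrinks by a constant factor per step, so the absolute gain from each extra step (`gains`) keeps getting smaller, which is one simple way to picture diminishing returns from deeper refinement.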
- Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning
- 赛道归属: 多模态推理(MLLM Chain-of-Thought 对齐/微调优化)
- 核心创新点: 通过系统性实证分析指出多模态 CoT 在视觉推理中常“越想越错”,并归因于两类稳定失败模式:过早锁定答案(premature answer commitment)与对直接视觉证据利用不足(limited direct visual evidence usage)。在此基础上提出“注意力引导的微调”思路:利用/约束模型注意力分配,使推理步骤更聚焦于与当前推理相关的视觉区域与证据链,从训练层面纠正 CoT 生成时的证据对齐与决策时机问题,从而提升多模态逐步推理的可靠性与可解释性。
- Track: Multimodal reasoning (MLLM Chain-of-Thought alignment / fine-tuning optimization)
- Key innovation: Provides a systematic study showing that CoT prompting can hurt visual reasoning in MLLMs, and identifies two recurring failure modes: premature answer commitment and insufficient use of direct visual evidence. Building on these findings, it proposes an attention-guided fine-tuning strategy that steers/regularizes attention to align each reasoning step with the relevant visual regions and evidence, correcting evidence grounding and decision timing during CoT generation to improve step-wise multimodal reasoning robustness.
- COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
- 赛道归属: 公平性可控解码 / 推理阶段偏见抑制(LLM Decoding for Fairness in CoT)
- 核心创新点: 提出一种无需训练、仅在解码阶段生效的公平性控制方法 COFT,用于抑制链式思维(CoT)生成中的社会偏见放大。方法上以反事实提示构造 + 共形预测(Conformal)约束为核心:先将提示中的敏感片段替换为中性占位符形成“掩码反事实”输入,以获得相对去偏的参考分布;再在token 级别对原始解码分布施加公平性约束,并通过分布无关(distribution-free)的边际有效性保证(在 exchangeability 假设下)为公平控制提供可验证的统计保证,从而实现对任意冻结的因果语言模型在推理时的可控去偏解码。
- Track: Fairness-controlled decoding / Inference-time bias mitigation for CoT (LLM Decoding for Fairness in CoT)
- Key innovation: Introduces COFT, a training-free, decoding-time method to curb bias amplification in chain-of-thought generation. The technical core combines counterfactual prompt masking with conformal (distribution-free) constraints: it first replaces sensitive spans with neutral tokens to form a masked counterfactual prompt, yielding a debiased reference distribution; then it enforces token-level fairness control on the original decoding distribution, providing distribution-free marginal validity guarantees (under exchangeability) for any frozen causal LM—enabling verifiable, model-agnostic fairness control at inference time.
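The distribution-free guarantee mentioned above is the standard property of split conformal prediction. The sketch below shows only that generic building block, not COFT's fairness constraint: the function names, the nonconformity scores, and the token dictionary are all hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal quantile of nonconformity scores.

    With n exchangeable calibration scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score gives marginal coverage of at least 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial set is valid.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(token_scores, threshold):
    """Keep every candidate token whose nonconformity score is within the threshold."""
    return {tok for tok, score in token_scores.items() if score <= threshold}


# Hypothetical calibration scores (lower means the candidate conforms better).
calibration = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
threshold = conformal_threshold(calibration, alpha=0.5)
allowed = prediction_set({"nurse": 0.2, "doctor": 0.7}, threshold)
```

The guarantee is marginal (averaged over calibration and test draws) and relies on exchangeability, which matches the caveat stated in the summary.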
- Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
- 赛道归属: 语音大模型推理诊断与对齐(Speech LLM Reasoning / 语音-文本推理鲁棒性)
- 核心创新点: 提出并验证“实体绑定失败(entity binding failure)”是语音LLM在复杂推理中相对文本LLM性能塌陷的关键、且高度局部化的原因:通过对多种任务分解评测,发现S2T在空间/句法/事实类任务不弱于T2T,但在需要持续实体跟踪的逻辑推理任务上准确率降至随机水平;进一步将退化机制归因于连续语音表征导致的实体-属性/关系绑定不稳,从而把“模态差距”从笼统能力不足细化为可诊断的绑定机制问题,并提出基于Chain-of-Thought的干预思路以强化实体跟踪与绑定过程。
- Track: Speech LLM reasoning diagnosis & alignment (speech-text reasoning robustness)
- Core innovation: Identifies and empirically validates a localized failure mode—entity binding failure—as the main driver of the reasoning gap between speech LLMs and text LLMs: via task-factorized evaluation, shows S2T matches/exceeds T2T on spatial/syntactic/factual tasks, but collapses to chance on logical tasks requiring persistent entity tracking; attributes the degradation to instability in binding entities to attributes/relations induced by continuous speech representations, reframing the “modality gap” into a concrete, diagnosable binding-mechanism issue and proposing Chain-of-Thought-based interventions to reinforce entity tracking/binding.
- LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models
- 赛道归属: 跨语言主题建模(NLP 表征学习/主题模型 + LLM增强)
- 核心创新点: 提出LLM-XTM,将LLM用于“主题层面”的跨语言对齐与可解释性提升,同时通过自一致性不确定性估计抑制幻觉并降低对不可获得的token概率(白盒接口)的依赖:用LLM引导的主题精炼(topic refinement)替代昂贵且易漂移的文档级改写/标注式增强,并以不确定性驱动的自一致性机制筛选/聚合LLM建议,使跨语言主题更连贯、对齐更稳健,且在资源稀缺的双语条件下仍可工作。
- Track: Cross-lingual topic modeling (topic models + LLM augmentation)
- Key innovation: Proposes LLM-XTM, using LLMs for topic-level cross-lingual alignment and interpretability while mitigating hallucinations via self-consistency–based uncertainty estimation and avoiding reliance on inaccessible token-probability (white-box) APIs. It replaces costly, drift-prone document-level LLM refinements with LLM-guided topic refinement, and uses uncertainty-driven selection/aggregation of LLM suggestions to yield more coherent and better-aligned multilingual topics under sparse bilingual resources.
- SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
- 赛道归属: 推理优化(CoT 长度自适应控制 / 高效推理)
- 核心创新点: 提出 SmartThinker 的“渐进式 CoT 长度校准”框架,针对长推理模型在不同难度问题上普遍存在的冗余与过度思考,突破点在于将“长度控制”从静态奖励(对所有样本一刀切)升级为随题目难度动态调整的策略。方法上通过逐步(progressive)校准推理链长度,使模型在简单问题上自动收敛到更短、更经济的推理,在困难问题上保留必要的长推理,从而在尽量不损失准确率的前提下显著降低输出冗余与推理成本,并弥补现有 GRPO 静态长度奖励无法自适应难度的缺陷。
- Track: Reasoning optimization (adaptive CoT length control / efficient inference)
- Key innovation: Introduces SmartThinker, a progressive CoT length calibration framework to reduce redundancy and overthinking in long-reasoning models. The key methodological advance is replacing static, one-size-fits-all length rewards (common in GRPO-based approaches) with a difficulty-adaptive mechanism that progressively calibrates reasoning length: it encourages short, cost-efficient reasoning on easy problems while preserving longer chains when needed for hard ones, improving efficiency with minimal accuracy degradation and addressing the non-adaptivity of static length reward designs.
- Visual Instruction Tuning Aligns Modalities through Abstraction
- 赛道归属: 多模态理解与视觉指令微调(Vision-Language Instruction Tuning / 跨模态对齐机制)
- 核心创新点: 从“层级抽象”视角系统揭示视觉指令微调如何实现跨模态对齐:通过跨多种视觉-语言架构的层间分析,发现指令微调的主要作用并非让视觉信息逐层经过LLM早期的单模态处理层,而是作为“桥接器”将视觉特征直接注入LLM的中间语义层,在抽象层面完成对齐并绕过早期层;该结论为设计更高效的视觉接入方式(如选择性注入层、减少无效早期融合)提供了机制性依据,而不仅是经验性配方。
- Track: Multimodal understanding & visual instruction tuning (vision-language alignment mechanisms)
- Core innovation: Provides a layer-wise abstraction account of how visual instruction tuning aligns modalities: across diverse VLM architectures, shows instruction tuning mainly acts as a bridge that embeds visual features directly into intermediate semantic layers of the LLM backbone, largely bypassing early unimodal layers; this mechanistic finding supports more principled designs for visual integration (e.g., selective layer injection and avoiding inefficient early fusion) beyond recipe-style tuning.
- Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation
- 赛道归属: LLM增强推荐系统(LLM4Rec)/ 后训练对齐(SFT+RL)/ 多目标强化学习优化
- 核心创新点: 提出一种面向工业LLM推荐的“语义空间—ID空间”可控对齐框架,通过将语义质量收益与ID推荐效果建模为相互制约的多目标优化问题,学习帕累托最优的策略集合而非单一加权目标,从而显式刻画并可调节两类奖励的权衡;同时针对开放域推荐中CoT质量难以度量与提升的问题,引入可操作的训练信号/优化机制,使策略优化能够在语义推理质量与ID点击/排序指标之间稳定迁移,缓解传统SFT/RL在该场景下的对齐瓶颈。
- Track: LLM-enhanced Recommender Systems (LLM4Rec) / Post-training alignment (SFT+RL) / Multi-objective RL optimization
- Core innovation: Proposes an industrial LLM4Rec alignment framework that makes the “semantic space vs. ID space” alignment explicitly controllable by formulating semantic-quality gains and ID-based recommendation performance as a constrained multi-objective problem, and learning a Pareto-optimal set of policies instead of a single scalarized objective—thereby exposing and tuning the trade-off between semantic rewards and ID ranking/click metrics. It also addresses the difficulty of measuring/improving CoT quality in open-domain recommendation by introducing actionable training signals/optimization mechanisms so policy optimization can reliably balance semantic reasoning quality with ID-based recommendation KPIs, mitigating key bottlenecks of prior SFT/RL paradigms.
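To make "a Pareto-optimal set of policies instead of a single scalarized objective" concrete, here is a minimal sketch of a two-objective Pareto filter. The candidate scores are made up, and the pairing of (semantic reward, ID metric) is an illustrative assumption rather than the paper's actual reward definitions.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point q dominates p when q is at least as good on both objectives
    and strictly better on at least one (higher is better).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (semantic_reward, id_metric) scores for five candidate policies.
candidates = [(0.9, 0.2), (0.7, 0.6), (0.5, 0.5), (0.3, 0.9), (0.2, 0.1)]
front = pareto_front(candidates)
```

Every policy on the front represents a different trade-off; a single weighted sum would pick only one of them for a fixed weight.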
- Online Pandora's Box for Contextual LLM Cascading 🆕NEW
- 赛道归属: LLM推理优化(级联路由/在线决策与成本-质量权衡)
- 核心创新点: 将LLM级联调用形式化为“带上下文的在线Pandora’s Box”两阶段决策问题:先在查询阶段按序探测多个LLM API(每次探测会揭示输出并产生与输出相关的成本),再在选择阶段从已生成候选中选定最终输出;核心方法突破在于把“是否继续查询下一个模型/何时停止”与“最终选哪个输出”统一到在线、上下文驱动的最优停止与选择框架中,从而实现对不同请求上下文下的自适应路由与动态成本-效果权衡,并支持在交互过程中学习/更新查询策略以提升长期收益。
- Track: LLM Inference Optimization (cascaded routing / online decision-making under cost–quality trade-offs)
- Core innovation: Formulates LLM cascading as a contextual online Pandora’s Box problem with a two-phase decision structure: (1) a query phase that sequentially probes multiple LLM APIs, where each probe reveals an output and incurs an output-dependent cost, and (2) a selection phase that chooses the final response among observed candidates. The key methodological advance is unifying “whether/when to query the next model” (optimal stopping) and “which output to return” into a single contextual online framework, enabling adaptive routing and dynamic cost–utility trade-offs per request context, with policies that can be learned/updated online to improve long-run performance.
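For intuition, the classic (offline, non-contextual) Pandora's Box problem is solved by Weitzman's index rule: compute a reservation value for each box, open boxes in decreasing order of that value, and stop once the best observed reward reaches the next reservation value. The sketch below implements only this classic rule with a fixed probing cost; the paper's setting is online, contextual, and has output-dependent costs. All names and numbers are hypothetical.

```python
def reservation_value(values, probs, cost, tol=1e-9):
    """Solve E[max(V - sigma, 0)] = cost for sigma by bisection (requires cost > 0)."""
    def expected_gain(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    # expected_gain is non-increasing in sigma; it is >= cost at lo and 0 at hi.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pandora_policy(boxes, observe):
    """boxes: list of (name, sigma); observe(name) returns the realized value."""
    best_name, best_value = None, float("-inf")
    for name, sigma in sorted(boxes, key=lambda b: b[1], reverse=True):
        if best_value >= sigma:
            break  # no remaining box is worth its probing cost
        value = observe(name)
        if value > best_value:
            best_name, best_value = name, value
    return best_name, best_value


# A model whose answer quality is 0 or 1 with equal probability, probing cost 0.1.
sigma = reservation_value([0.0, 1.0], [0.5, 0.5], cost=0.1)

# Hypothetical indices and realized qualities for two models.
realized = {"small": 0.5, "large": 0.7}
choice = pandora_policy([("small", 0.6), ("large", 0.8)], realized.get)
```

In the usage example the policy probes `large` first, observes 0.7, and stops because 0.7 already exceeds the reservation value of `small`.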
GitHub
- [2026-06-09] sgl-project/sglang ⭐28896
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-06-09] google-ai-edge/LiteRT-LM ⭐5498
LiteRT-LM is Google's production-ready, high-performance, open-source inference framework for deploying Large Language Models on edge devices.
- [2026-06-09] ModelTC/LightLLM ⭐4092 🆕NEW
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-...
- [2026-06-09] llnl/OGhidra ⭐176 🆕NEW
OGhidra bridges Large Language Models (LLMs) via Ollama with the Ghidra reverse engineering platform, enabling AI-driven binary analysis through natur...
- [2026-06-09] gpt-cmdr/ras-commander ⭐67
The RAS-Commander library provides a python API for automating HEC-RAS 6.x and accessing HDF data using Python, built with and driven by large languag...
HuggingFace Datasets
- [2026-05-28] openbmb/UltraData-SFT-2605
UltraData-SFT-2605
📦 UltraData Collection | 🌐 UltraData | 🤗 MiniCPM5 Series
English | 中文
📚 Introduction
Ult...
- [2026-05-01] angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
Background
Ended up with some tokens to burn on a Claude Max plan. Assembly began during 4.6 and moved to 4.7. Model is tagged. The develop...
- [2026-05-28] openbmb/Ultra-FineWeb-L3
Ultra-FineWeb-L3
📜 Ultra-FineWeb Technical Report | 📦 UltraData Collection | 🌐 UltraData | 🤗 MiniCPM5 Series
English | 中文
...
- [2026-06-04] nvidia/Nemotron-Pretraining-Code-v3 🆕NEW
Nemotron-Pretraining-Code-v3 Dataset Description:
The Nemotron-Pretraining-Code-v3 dataset is part of the Nemotron Pretr...
- [2026-06-03] OpenClaw/clawhub-security-signals
ClawHub Security Signals
🦀 ClawHub | 📝 OpenClaw Blog | 🤗 Hugging Face Blog | 📄 Paper | 📄 Pre-Print ClawHub Security Signals is a saniti...
多模态大模型 / Multimodal Models
arXiv
- MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models
- 赛道归属: 多模态理解(MLLM可解释性/表征分析与诊断)
- 核心创新点: 提出一套面向MLLM内部表征的系统化“显微镜”分析框架,沿Transformer层级同时刻画多模态token嵌入的线性度、内在维度与各向异性,并区分主干流与残差流进行对照诊断;在ScienceQA上对LLaVA-NeXT与OmniFusion做跨模型、跨模态的层间结构测量,揭示多模态token在不同流与不同层中呈现高度线性等隐藏结构特征,为后续的可解释性、压缩与对齐机制设计提供可量化的表征指标体系。
- Track: Multimodal understanding (MLLM interpretability / representation analysis & diagnostics)
- Core innovation: Introduces a “microscope”-style, layer-wise diagnostic framework to probe hidden representations in MLLMs by jointly measuring linearity, intrinsic dimension, and anisotropy of multimodal token embeddings, explicitly contrasting main vs. residual streams. Evaluated on ScienceQA with LLaVA-NeXT and OmniFusion, it provides cross-model, cross-modality structural measurements that uncover highly linear behaviors and other latent geometric properties, yielding actionable, quantitative representation metrics for interpretability, compression, and alignment design.
- Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness
- 赛道归属: 多模态安全与可信(开放世界异常检测/拒识、VLM鲁棒性)
- 核心创新点: 提出“语义自负(Hubris of Semantics)”作为开放世界部署中的关键失效模式:VLM会将未知异常强行映射到已知语义并高置信输出。方法上以“生成式语义抗体(Generative Semantic Antibodies)”为核心机制,为模型显式注入“负知识/反语义”以形成可拒识的决策边界,从而在不破坏原有零样本语义对齐能力的前提下提升开放世界可信性与异常处理能力。
- Track: Multimodal safety & trustworthiness (open-world anomaly detection/rejection, VLM robustness)
- Key innovation: Identifies “Hubris of Semantics” as a core open-world failure where VLMs over-confidently force unknown anomalies into known semantic classes. Introduces “Generative Semantic Antibodies” to explicitly inject negative knowledge/counter-semantics, shaping rejectable decision boundaries while preserving zero-shot semantic alignment, improving open-world trustworthiness.
- Cross-modal linkage risk in clinical vision-language models
- 赛道归属: 多模态安全与隐私(视觉-语言模型的链接攻击/成员关联风险评估)
- 核心创新点: 将临床VLM的隐私问题形式化为跨模态重链接(image-to-report linkage)风险:即模型学习到的共享嵌入空间可能保留实例级对应关系,使攻击者仅凭余弦相似度检索即可把去标识化影像重新关联到原始放射学报告;提出相应的威胁模型与评测设定,用以量化在“影像与报告被刻意分离共享/访问控制”的真实流程下,嵌入对齐带来的可重识别性,从而把“表征对齐能力”转化为可度量的隐私攻击面。
- Track: Multimodal security & privacy (vision-language linkage attacks / instance re-identification risk)
- Core innovation: Formalizes a clinical VLM privacy threat as cross-modal re-linkage (image-to-report linkage) risk: the shared embedding space can preserve instance-level correspondence, enabling attackers to re-associate a de-identified radiograph with its original report via cosine-similarity retrieval alone. It defines a concrete threat model and evaluation protocol aligned with real-world workflows where images and reports are intentionally separated, turning representation alignment strength into a measurable privacy attack surface.
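The attack described above needs nothing more than nearest-neighbour search in the shared embedding space. A minimal sketch, assuming toy two-dimensional embeddings and a hypothetical top-1 linkage metric (the paper's exact evaluation protocol is not reproduced here):

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def link_image_to_report(image_emb, report_embs):
    """Return the index of the report embedding most similar to the image embedding."""
    return max(range(len(report_embs)), key=lambda i: cosine(image_emb, report_embs[i]))


def top1_linkage_rate(image_embs, report_embs):
    """Fraction of images whose nearest report is their true pair (same index)."""
    hits = sum(
        link_image_to_report(img, report_embs) == i
        for i, img in enumerate(image_embs)
    )
    return hits / len(image_embs)


# Toy embeddings: image i and report i come from the same (hypothetical) study.
images = [[1.0, 0.1], [0.1, 1.0]]
reports = [[0.9, 0.2], [0.2, 0.8]]
rate = top1_linkage_rate(images, reports)
```

A rate well above chance (1 / number of reports) would indicate that instance-level correspondence survives in the embedding space.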
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
- 赛道归属: 多模态理解(VLM幻觉抑制、跨模态融合/注意力机制改进)
- 核心创新点: 从“视觉注意力汇聚/沉没(attention sink)”角度解释幻觉:并非简单的“语言先验过强”,而是视觉注意力被任务无关区域吸走导致视觉证据未被有效融合。提出利用“注视转移(gaze shifts)”信号来指导跨模态融合增强:通过建模视线在关键区域间的动态转移,重分配视觉-文本对齐时的注意力与融合权重,避免仅按原始注意力分数做放大而加剧偏置,从机制上降低不可证实内容生成。
- Track: Multimodal understanding (VLM hallucination mitigation, cross-modal fusion/attention)
- Key innovation: Reframes hallucination via a “visual attention sink” mechanism—visual attention is diverted to irrelevant regions, preventing evidence from being fused. Uses “gaze shifts” as guidance signals to enhance cross-modal fusion by modeling dynamic transitions between salient regions, reweighting alignment/fusion beyond naive attention amplification, thereby reducing unsupported generations.
- Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement
- 赛道归属: 多模态模型压缩与端侧部署(知识蒸馏/对齐增强)
- 核心创新点: 提出 Align-KD,将“大模型的跨模态对齐能力”作为可蒸馏的核心知识而非仅蒸馏输出分布/特征;通过显式对齐约束与跨模态一致性信号,把教师VLM在图文对齐、语义绑定等能力迁移到轻量学生模型,从而在移动端/边缘设备的参数与算力受限条件下,尽量减少模型缩小带来的对齐与理解能力退化。
- Track: Multimodal model compression & on-device deployment (knowledge distillation / alignment enhancement)
- Key innovation: Proposes Align-KD, treating cross-modal alignment as the primary distillable knowledge rather than only logits/features; it introduces explicit alignment constraints and cross-modal consistency signals to transfer the teacher VLM’s image-text grounding/alignment capability to a compact student, mitigating the alignment and understanding degradation typically caused by aggressive downsizing for mobile/edge settings.
- VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments
- 赛道归属: 具身智能与机器人定位(语义全局定位、VLM+概率滤波/Monte Carlo Localization)
- 核心创新点: 将VLM的开放词汇语义理解引入Monte Carlo Localization(MCL)框架,面向“几何与语义都高度混淆”的准静态室内环境(如货架平行通道、重复家具)提升全局定位鲁棒性。核心在于用VLM生成/评估与场景观测一致的语义证据,并将其作为观测模型或粒子权重更新信号,与传统几何/外观特征互补,从而在几何别名严重、语义长尾且遮挡杂乱的场景中实现更稳定的语义级全局定位。
- Track: Embodied AI & robot localization (semantic global localization, VLM + probabilistic filtering/MCL)
- Key innovation: Integrates open-vocabulary semantic understanding from VLMs into a Monte Carlo Localization pipeline to handle quasi-static indoor environments with strong geometric/semantic aliasing. Uses VLM-derived semantic evidence as an observation/weighting signal for particle updates, complementing geometric/appearance cues to improve robustness under severe aliasing, long-tail semantics, and clutter/occlusion.
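A rough sketch of how semantic evidence could enter the particle weight update in MCL. It assumes the geometric and semantic likelihoods are conditionally independent given the pose, so they simply multiply; that assumption, the function name, and the numbers are illustrative and not taken from the paper.

```python
def update_weights(weights, geometric_lik, semantic_lik):
    """Multiply prior weights by both likelihoods, then normalize to sum to 1."""
    unnorm = [w * g * s for w, g, s in zip(weights, geometric_lik, semantic_lik)]
    total = sum(unnorm)
    if total == 0:
        # All particles were ruled out: fall back to a uniform distribution.
        return [1.0 / len(weights)] * len(weights)
    return [u / total for u in unnorm]


# Two particles in parallel aisles: the geometry cannot tell them apart,
# but a (hypothetical) VLM-derived semantic score favours the first one.
prior = [0.5, 0.5]
geometric = [0.5, 0.5]
semantic = [0.9, 0.1]
posterior = update_weights(prior, geometric, semantic)
```

With identical geometric likelihoods the posterior is driven entirely by the semantic term, which is the aliasing case the summary highlights.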
- ES-Merging: Biological MLLM Merging via Embedding Space Signals
- 赛道归属: 多模态模型融合(模型合并/参数高效跨模态统一,生物科学MLLM)
- 核心创新点: 提出ES-Merging,用嵌入空间信号(embedding space signals)来指导生物领域MLLM的合并:不再依赖输入无关的参数空间启发式,而是利用各模型在嵌入空间中体现的模态专长与对齐特征来决定合并策略/权重,从而更忠实地保留不同单模态模型的能力并实现跨模态统一;该思路把“模态专门化”从难以观测的参数差异,转化为可直接度量与可优化的表征信号,提高合并后的跨模态任务适配性。
- Track: Multimodal model merging (parameter-efficient cross-modal unification for biological MLLMs)
- Core innovation: Proposes ES-Merging, a model-merging method for biological MLLMs guided by embedding-space signals rather than input-agnostic parameter-space heuristics. By leveraging representation-level cues that reflect modality specialization and alignment, it determines merging behavior/weights to better preserve complementary single-modality strengths while forming a unified cross-modal model. The key methodological shift is making “modality specialization” observable and optimizable through measurable embedding signals, improving post-merge cross-modal capability.
- MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism 🆕NEW
- 赛道归属: 长视频多模态理解 / 记忆增强推理(Agentic Retrieval)
- 核心创新点: 提出一种“感知-推理解耦”的长视频理解框架,将对整段视频的端到端密集编码改为“代理式探索+按需检索”的流程,从根源上缓解长视频的token爆炸与注意力稀释;通过增量式流式读取视频构建分层图记忆(Hierarchical Graph Memory),以自顶向下三层语义抽象把片段、事件与高层概念组织为可检索结构;在推理阶段引入agentic检索机制,根据问题动态选择需要回看的记忆节点/证据片段,实现“只看必要内容”的可扩展推理;整体作为plug-and-play模块,可在不重训或少量适配的前提下增强现有VLM/MLLM的小时级视频问答与理解能力。
- Track: Long-video multimodal understanding / memory-augmented reasoning (agentic retrieval)
- Core innovations: Proposes a perception–reasoning decoupled framework that replaces end-to-end dense encoding of full videos with an agentic exploration + on-demand retrieval pipeline, directly mitigating token explosion and attention dilution in hours-long videos; incrementally streams video to build a Hierarchical Graph Memory with a top-down three-level semantic abstraction that structures clips, events, and concepts into a retrievable graph; introduces an agentic retrieval mechanism that conditionally selects memory nodes/evidence segments based on the query, enabling scalable “read only what’s needed” reasoning; designed as a plug-and-play component to boost existing VLM/MLLM performance on long-horizon video QA/understanding with minimal retraining.
- Watch, Remember, Reason: Human-View Video Understanding with MLLMs 🆕NEW
- 赛道归属: 长视频多模态理解(能力框架/评测与系统化综述)
- 核心创新点: 从“人类观看视频”的视角提出面向MLLM的视频理解能力分解框架,将长视频理解系统化为Watching(获取证据)-Remembering(跨时间记忆组织)-Reasoning(基于稀疏证据推断)三类核心功能,并围绕稀疏证据、长程依赖、多模态对齐与算力预算受限等关键约束梳理方法谱系;强调在有限计算下的可靠推理,推动从“长上下文堆token”转向“选择性观察、结构化记忆、可控检索与验证式推断”的设计范式;为后续模型/系统提供可操作的模块化设计坐标系与问题清单(如记忆表示、检索策略、证据聚合与不确定性控制),以指导长视频场景的研究与评测。
- Track: Long-video multimodal understanding (capability framework / evaluation & systematic survey)
- Core innovations: Introduces a human-view capability decomposition for MLLM-based video understanding, structuring long-video comprehension into three functional abilities—Watching (evidence acquisition), Remembering (temporal memory organization), and Reasoning (inference from sparse evidence)—and organizes existing methods around constraints such as sparse cues, long-range dependencies, multimodal alignment, and limited compute budgets; shifts the design paradigm from “stuffing longer context” to selective observation, structured memory, controllable retrieval, and verification-oriented inference under compute constraints; provides a modular design map and actionable research checklist (memory representation, retrieval policies, evidence aggregation, uncertainty control) to guide future systems and evaluations in long-video settings.
- Textual Supervision Enhances Geospatial Representations in Vision-Language Models 🆕NEW
- 赛道归属: 多模态表征学习(地理空间理解/图像地理定位与空间推理)
- 核心创新点: 系统比较视觉模型、视觉-语言模型与多模态基础模型在地理空间表征上的差异,提出并验证“文本监督能显著增强地理空间表征”这一机制性结论:语言对地点、地标、文化与语义线索的编码可作为弱结构先验,帮助模型在视觉相似但地理不同的样本间形成更可分的嵌入;通过跨类别图像簇(人物、地标、日常物体等)的评估,揭示地理信息并非只来自显式地标,而可由语义共现与文本对齐间接注入;为构建更强的地理定位/空间推理模型提供训练信号选择与数据构建方向(强调高质量文本配对与语义覆盖),并为后续在VLM/MLLM中显式建模地理先验提供实证依据。
- Track: Multimodal representation learning (geospatial understanding / image geolocation & spatial reasoning)
- Core innovations: Provides a systematic comparison of geospatial representations across vision-only models, vision–language models, and multimodal foundation models, and empirically supports a mechanistic takeaway: textual supervision substantially strengthens geospatial representations. Language encodes weak structured priors about places, landmarks, culture, and semantic cues, improving embedding separability even when visuals are similar across different locations; evaluations across diverse image clusters (people, landmarks, everyday objects) show geospatial signals can be injected indirectly via semantic co-occurrence and text alignment, not only via explicit landmarks; offers practical guidance on training-signal and dataset design (high-quality text pairing and semantic coverage) and motivates explicit geospatial priors in future VLM/MLLM modeling.
GitHub
- [2026-06-08] Blaizzy/mlx-vlm ⭐4987
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-06-08] NVlabs/Eagle ⭐2278
Eagle: Frontier Vision-Language Models with Data-Centric Strategies
- [2026-06-08] liudaizong/Awesome-LVLM-Attack ⭐553
😎 up-to-date & curated list of awesome Attacks on Large-Vision-Language-Models papers, methods & resources.
- [2026-06-07] jamjamjon/usls ⭐413
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models such as YOLO, FastVLM, and more.
- [2026-06-06] CityMind-Lab/ICML25-TimeVLM ⭐117
[ICML 2025] Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
强化学习 / Reinforcement Learning
arXiv
- CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
- 赛道归属: 多领域LLM强化学习对齐(跨域冲突缓解 / 奖励建模)
- 核心创新点: 提出CARE-RL,将“协议感知”的生成式奖励与“能力感知”的优化联合起来解决多领域RL中的两类关键瓶颈:一是非可验证任务奖励不可靠,二是跨领域能力相互干扰。方法上通过Protocol-Aware Generative Reward Model(PA-GRM)在提示/协议层面构造更稳健的奖励信号以覆盖不可验证场景,并在优化阶段引入能力维度的约束/加权机制,使更新更聚焦于目标能力、减少对其他领域能力的负迁移,从而系统性缓解cross-domain conflicts。
- Track: Multi-domain LLM RL alignment (cross-domain conflict mitigation / reward modeling)
- Key innovations: Proposes CARE-RL, combining protocol-aware generative reward construction with capability-aware optimization to tackle two core issues in multi-domain RL: unreliable rewards for non-verifiable tasks and capability interference across domains. It introduces a Protocol-Aware Generative Reward Model (PA-GRM) that builds more robust reward signals at the prompt/protocol level for non-verifiable settings, and a capability-aware optimization scheme that constrains/weights updates along capability dimensions to focus learning on target skills while reducing negative transfer to other domains.
- Survival Reinforcement Learning: Toward Scalable Self-Supervised RL
- 赛道归属: 自监督强化学习 / 目标条件长时序规划(Goal-conditioned RL)
- 核心创新点: 提出Survival Reinforcement Learning(SRL)作为对比式自监督RL(CRL)的替代范式,用在线分类式目标判别取代对比损失,规避对比学习在长时序规划中“uniformity–tolerance”两难导致的表征退化/目标区分不足问题;将“survival value learning”扩展为通过最大化到达目标后的驻留时间(dwell time)来学习可用于长视野目标条件控制的价值信号,从而在深网络可扩展性与长时序可规划性之间取得更稳健的折中。
- Track: Self-supervised RL / Goal-conditioned long-horizon planning
- Core innovation: Proposes Survival Reinforcement Learning (SRL) as an alternative to contrastive self-supervised RL by replacing contrastive objectives with an online classification-based signal, mitigating the contrastive “uniformity–tolerance” dilemma that hurts long-horizon goal discrimination and planning. It extends survival value learning by maximizing dwell time at target goals, yielding a planning-friendly value signal while retaining strong depth-scaling behavior.
- A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models
- 赛道归属: 逆强化学习(IRL)理论 / 离线RL与结构计量经济学(DDC)统一视角
- 核心创新点: 以讲义形式系统梳理IRL的基础,并将熵正则IRL与结构计量中的动态离散选择模型(Dynamic Discrete Choice, DDC)在数学结构上进行对齐:从“由专家离线数据反推奖励/偏好”的角度,统一讨论可辨识性、似然/最大熵目标、价值函数与策略的对应关系,以及由此带来的估计与推断框架;其方法论价值在于提供跨社区的同构映射与推导路径,便于将DDC的统计推断工具与IRL的优化视角互相迁移。
- Track: Inverse Reinforcement Learning theory / Unifying Offline RL–IRL with Dynamic Discrete Choice (DDC)
- Core innovation: A foundations-focused note that aligns entropy-regularized IRL with dynamic discrete choice (DDC) models at the level of objectives and solution structure. It frames reward recovery from expert offline data through a unified lens (identifiability, likelihood/max-entropy criteria, value–policy correspondences), enabling methodological transfer between econometric inference in DDC and optimization-centric IRL formulations.
- RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network
- 赛道归属: 医学影像多模态生成(胸部影像报告生成)/ 强化学习用于文本生成
- 核心创新点: 提出RL-ACRGNet,将强化学习引入胸部放射学报告生成的训练框架,以缓解纯监督学习在“疾病识别准确性”和“报告表述质量/一致性”上的不足。方法层面通过将临床相关的序列级目标(如报告整体质量、关键病灶描述覆盖等)显式作为RL优化信号,直接优化生成报告的全局指标而非仅做token级似然拟合,从而提升对细粒度病灶信息的捕获与报告生成的临床可用性与一致性。
- Track: Medical multimodal generation (chest radiology report generation) / RL for text generation
- Key innovations: Introduces RL-ACRGNet, integrating reinforcement learning into chest radiology report generation to address limitations of purely supervised training in disease detection accuracy and report quality/consistency. Methodologically, it optimizes clinically meaningful sequence-level objectives (e.g., overall report quality and coverage of key findings) as RL signals, directly targeting global report metrics rather than token-level likelihood alone, improving fine-grained pathology capture and clinical usability/consistency of generated reports.
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
- 赛道归属: LLM智能体强化学习(Agentic RL)/ 策略优化算法
- 核心创新点: 提出StepPO(Step-Aligned Policy Optimization),针对现有LLM-RL普遍采用token为基本优化粒度而与智能体“按步骤(observation-action循环)决策”的粒度不匹配问题,改为以“步骤”作为对齐与优化的核心单位。方法突破在于将信用分配与策略更新从token层提升到step层,使奖励/优势估计与环境交互的决策边界一致,从而更贴合agentic行为结构,减少由token级噪声与粒度错配带来的优化偏差,提升多步任务中的决策稳定性与学习效率。
- Track: LLM agent reinforcement learning (Agentic RL) / policy optimization
- Key innovations: Proposes StepPO (Step-Aligned Policy Optimization) to resolve the granularity mismatch where existing LLM RL optimizes at the token level while agents act via step-wise observation–action cycles. The key advance is elevating alignment, credit assignment, and policy updates to the step level so that reward/advantage estimation matches decision boundaries in environment interaction, reducing token-level noise and mismatch-induced bias, and improving stability and sample efficiency in multi-step agent tasks.
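One way to picture step-level credit assignment is to compute a single advantage per observation-action step and share it across every token emitted in that step. The sketch below is a generic illustration under that assumption; the mean-return baseline, the discount, and the function name are hypothetical choices, not StepPO's actual estimator.

```python
def step_level_advantages(step_rewards, step_token_counts, gamma=1.0):
    """Compute one return-to-go per step, center it, and broadcast it to that step's tokens."""
    returns, running = [], 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()

    # Crude baseline (an assumption): the mean return-to-go over the trajectory's steps.
    baseline = sum(returns) / len(returns)

    per_token = []
    for ret, n_tokens in zip(returns, step_token_counts):
        per_token.extend([ret - baseline] * n_tokens)
    return per_token


# Three agent steps with 2, 1, and 3 generated tokens; reward arrives only at the end.
adv = step_level_advantages([0.0, 0.0, 1.0], [2, 1, 3], gamma=0.5)
```

All tokens inside one step receive the same advantage, so the update boundary coincides with the decision boundary rather than with individual tokens.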
- Exploring Reinforcement Learning for Fluid Transitions Between Clinical Mental Healthcare and Everyday Wellness Support 🆕NEW
- 赛道归属: 强化学习在数字医疗/健康干预(Contextual Bandit 个性化推荐与干预编排)
- 核心创新点: 以“临床心理健康—日常健康支持”的连续照护为目标,将情境多臂老虎机用于动态选择干预内容(如临床与非临床的日志/反思提示),把传统割裂的两类干预统一到同一决策框架中;重点在于将“照护过渡/阶段变化”建模为可在线自适应的序列化决策问题,从而实现跨场景、跨强度干预的主动编排与个体化触发机制。
- Track: RL for digital health interventions (contextual bandits for personalized recommendation & intervention orchestration)
- Core innovation: Targets continuity of care across “clinical mental healthcare ↔ everyday wellness support” by using a contextual bandit to dynamically select intervention content (e.g., journaling prompts spanning clinical and wellness domains), unifying previously siloed interventions under one decision-making loop; key methodological contribution is framing care transitions/phase shifts as an online adaptive sequential decision problem to proactively orchestrate personalized, cross-context interventions.
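A minimal sketch of a contextual bandit choosing between intervention types. It uses a tabular epsilon-greedy learner with discrete contexts, which is the simplest possible variant and an assumption on our part; the context labels, arm names, and rewards are hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyContextualBandit:
    """Tabular contextual bandit: one running mean reward per (context, arm) pair."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best-known arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.means[(context, arm)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # Incremental mean update.
        self.means[key] += (reward - self.means[key]) / self.counts[key]


bandit = EpsilonGreedyContextualBandit(["clinical_prompt", "wellness_prompt"], epsilon=0.0)
bandit.update("high_stress", "clinical_prompt", 1.0)
bandit.update("high_stress", "wellness_prompt", 0.2)
bandit.update("low_stress", "wellness_prompt", 0.9)
```

After these updates, `bandit.select("high_stress")` returns the clinical prompt and `bandit.select("low_stress")` returns the wellness prompt, so the same loop serves both ends of the care continuum.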
- Performance Variation in Deep Reinforcement Learning 🆕NEW
- 赛道归属: 强化学习评测与可靠性(Deep RL 复现实验、方差/鲁棒性度量与报告规范)
- 核心创新点: 聚焦深度强化学习“同配置不同随机种子”导致的显著性能波动问题,指出仅报告均值及其不确定性(如均值置信区间)不足以刻画真实的运行间变异;方法论上强调将“性能变异”作为独立评测对象,系统梳理传统不确定性估计与变异度量的局限,并提出/倡导更贴近实践鲁棒性的评估视角与统计表述方式,用于更可靠地比较算法与复现实验结论。
- Track: RL evaluation & reliability (deep RL reproducibility; variation/robustness metrics and reporting)
- Core innovation: Centers on large run-to-run performance variation in deep RL under identical configurations, arguing that reporting only mean performance uncertainty (e.g., CIs on the mean) fails to capture true inter-run variability; the methodological advance is treating performance variation as a first-class evaluation target, analyzing limitations of conventional uncertainty/variation estimates and promoting robustness-oriented statistical reporting to enable more reliable algorithm comparison and reproducible conclusions.
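The distinction between uncertainty about the mean and run-to-run variation can be seen in a few lines. The sketch below uses synthetic returns and a normal-approximation 95% interval; both are illustrative assumptions, and the function name is hypothetical.

```python
import math
import statistics


def summarize_runs(returns):
    """Report both the uncertainty of the mean and the spread across runs."""
    n = len(returns)
    std = statistics.stdev(returns)
    return {
        "mean": statistics.mean(returns),
        "std": std,  # run-to-run variation
        # Normal-approximation 95% CI half-width on the mean (crude for small n).
        "ci_half_width": 1.96 * std / math.sqrt(n),
        "min": min(returns),
        "max": max(returns),
    }


# Synthetic example: half of the seeds reach a return of 300, half get stuck at 100.
few_seeds = summarize_runs([100.0, 300.0] * 5)
many_seeds = summarize_runs([100.0, 300.0] * 50)
```

Adding seeds shrinks the confidence interval on the mean, but the standard deviation and the min/max range stay roughly the same: a single new run is just as likely to land at 100 as before.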
- Uncertainty-Aware LLM-Guided Policy Shaping for Sparse-Reward Reinforcement Learning 🆕NEW
- 赛道归属: LLM+强化学习(稀疏奖励探索、策略塑形/引导、基于不确定性的教师信号融合)
- 核心创新点: 提出 ULPS,将“校准后的 LLM”作为训练环路中的策略塑形模块:用 A* 生成的最优符号轨迹作为结构化先验/教师信号,再由 LLM 输出可执行的行为指导,并通过不确定性估计对指导强度进行调制(高不确定时弱引导/促探索,低不确定时强引导/促收敛),以缓解稀疏奖励与任务序列异质性带来的探索低效与泛化差;关键突破在于把 LLM 的建议从“硬编码提示”升级为“可校准、可控强度、与规划 oracle 对齐”的训练时策略塑形信号。
- Track: LLM-enhanced RL (sparse-reward exploration; policy shaping with uncertainty-aware teacher-signal integration)
- Core innovation: Proposes ULPS, integrating a calibrated LLM into the RL training loop as a policy-shaping module: an A*-based oracle synthesizes optimal symbolic trajectories as structured priors/teacher signals, the LLM converts them into actionable behavioral guidance, and uncertainty estimates modulate guidance strength (weaker guidance to encourage exploration when uncertain; stronger guidance to accelerate convergence when confident), improving exploration efficiency and generalization under sparse rewards and heterogeneous task sequences; the key advance is turning LLM advice into a calibrated, controllable, planning-aligned training-time shaping signal rather than static prompting.
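A minimal sketch of uncertainty-modulated guidance, assuming the LLM's uncertainty is measured by the normalized entropy of its suggested action distribution and that guidance is applied as a convex mixture with the agent's policy. Both choices, and all names, are illustrative assumptions rather than the paper's formulation.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by its maximum, so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def shape_policy(agent_probs, llm_probs):
    """Blend the agent's policy with LLM guidance, trusting the LLM less when it is uncertain."""
    trust = 1.0 - normalized_entropy(llm_probs)
    trust = min(1.0, max(0.0, trust))  # guard against floating-point drift
    mixed = [(1 - trust) * a + trust * l for a, l in zip(agent_probs, llm_probs)]
    total = sum(mixed)
    return [m / total for m in mixed]


agent = [0.4, 0.3, 0.2, 0.1]
unsure = shape_policy(agent, [0.25, 0.25, 0.25, 0.25])  # LLM is maximally uncertain
confident = shape_policy(agent, [1.0, 0.0, 0.0, 0.0])   # LLM is fully confident
```

When the guidance is uniform the agent's own policy is left essentially untouched (exploration is preserved); when the guidance is one-hot the shaped policy follows it completely.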
- Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
- 赛道归属: 基于LLM裁判的强化学习(Rubric-based RL)安全与对齐 / Reward Hacking检测
- 核心创新点: 提出CHERRL作为“可控的奖励黑客”实验环境,将真实rubric-based RL中隐蔽且多偏置纠缠的reward hacking现象进行可控生成与复现;通过显式参数化/组合裁判(LaaJ)的潜在偏置与策略可利用的漏洞,支持系统化分析“策略如何利用裁判偏差获得高分但低质量/不安全输出”;进一步面向检测提出可操作的评测与识别设置,使reward hacking从难以复盘的现象转化为可基准化、可诊断的研究对象。
- Track: LLM-as-a-Judge Reinforcement Learning (Rubric-based RL) Safety & Alignment / Reward Hacking Detection
- Core innovation: Introduces CHERRL, a controllable reward-hacking environment that makes subtle, bias-entangled reward hacking in real rubric-based RL reproducible and tunable; it explicitly parameterizes and composes judge (LaaJ) latent biases and exploitable loopholes to enable mechanistic analysis of how policies game the judge for high scores despite low-quality/unsafe outputs; additionally provides a concrete detection/evaluation setup that turns reward hacking into a benchmarkable, diagnosable target.
- Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
- 赛道归属: 安全强化学习(Risk-aware RL)/ 场景生成与形式化安全保证(Probably Approximately Safe)
- 核心创新点: 面向“策略对转移扰动敏感、易出现未知不安全行为”的问题,将安全验证与数据/场景生成耦合:通过采样策略轨迹构造概率型barrier certificate来刻画安全边界,并提出用于生成“更紧的安全界/更有效暴露风险”的场景采样机制;以Probably Approximately Safe(PAS)形式给出可证明的安全保证,使得生成的验证场景在统计意义上覆盖高风险区域,从而提升对策略安全性的可验证性与风险感知训练的有效性。
- Track: Safe Reinforcement Learning (Risk-aware RL) / Scenario Generation with Formal Safety Guarantees (Probably Approximately Safe)
- Core innovation: Couples safety verification with scenario generation to address policy fragility under transition perturbations: it builds probabilistic barrier certificates from sampled trajectories to delineate safe vs. unknown regions, and designs a scenario sampling/generation procedure aimed at tightening safety bounds and more effectively surfacing risky behaviors; provides Probably Approximately Safe (PAS) guarantees so the generated scenarios statistically target high-risk regions, improving verifiability and risk-aware training.
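To give a feel for what a "probably approximately" guarantee asks of the sample size, the sketch below uses a generic textbook argument rather than the paper's bound. If N independently sampled scenarios all pass a fixed safety check, a violation probability above ε is ruled out with confidence at least 1 − δ once (1 − ε)^N ≤ δ. A certificate fitted from the same samples generally needs a stronger bound, which this sketch does not cover. `scenarios_needed` is a hypothetical helper.

```python
import math

def scenarios_needed(epsilon, delta):
    """Smallest N with (1 - epsilon)**N <= delta.

    epsilon: tolerated violation probability.
    delta: tolerated probability that the guarantee itself is wrong.
    Assumes i.i.d. scenarios and a safety check fixed before sampling.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(delta) / math.log(1.0 - epsilon))

for eps, dlt in ((0.05, 0.05), (0.01, 0.05), (0.01, 0.001)):
    n = scenarios_needed(eps, dlt)
    print(f"epsilon={eps}, delta={dlt}: at least {n} violation-free scenarios")
```

The required count grows roughly like ln(1/δ)/ε, which is why sampling that concentrates on high-risk regions matters when tight bounds are wanted.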
GitHub
- [2026-06-08] OpenPipe/ART ⭐9958
Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for...
- [2026-06-08] PufferAI/PufferLib ⭐5843
Puffing up reinforcement learning
- [2026-06-08] rllm-org/rllm ⭐5602
Democratizing Reinforcement Learning for LLMs
- [2026-06-09] radixark/miles ⭐1521
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
- [2026-06-08] VAIL-UCLA/S2E ⭐66 🆕NEW
[ICLR 2026] From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning
HuggingFace Models
HuggingFace Datasets
- [2026-06-04] stanford-vision-lab/gpic
GPIC: A Giant Permissive Image Corpus for Visual Generation. Keshigeyan Chandrasegaran, Kyle Sargent, Suchi...
- [2026-06-05] nvidia/Nemotron-Personas-El-Salvador
Nemotron-Personas-El-Salvador: A compound AI approach to personas in Salvadoran Spanish grounded in distributions of the...
- [2026-06-05] nvidia/Nemotron-Personas-Vietnam 🆕NEW
Nemotron-Personas-Vietnam: A compound AI system for generating synthetic personas grounded in Vietnam's real-world distributions...
世界动作模型 / World Action Model
arXiv
- WALL-WM: Carving World Action Modeling at the Event Joints
- 赛道归属: 世界动作模型(World Action Model)/ 视觉-语言-动作预训练(Vision-Language-Action Pretraining)/ 视频动作建模
- 核心创新点: 提出从“固定长度动作块(chunk)”转向“语义事件(event)”的世界动作建模范式,将语义连贯的动作事件作为最小学习单元,在事件连接点(event joints)处刻画动作的自然边界与状态转移,从而缓解 chunk 粒度与真实动作结构不匹配带来的学习偏差。方法上以事件为锚点进行视觉-语言-动作联合预训练,使模型学习到更符合人类语义分段的动作表征与跨事件的因果/时序衔接能力,相比直接对当前观测+指令做 chunk 级预测,更强调事件级结构化监督与可组合性。
- Track: World Action Model / Vision-Language-Action Pretraining / Video Action Modeling
- Key innovation: Introduces an event-grounded paradigm for World Action Models, replacing fixed-length action chunks with semantically coherent action events as the atomic learning unit. By modeling transitions at event joints (natural boundaries between events), it addresses the granularity mismatch inherent in chunk-centric optimization and better captures state changes and temporal/causal continuity. The approach performs Vision-Language-Action pretraining anchored on events, encouraging structured, compositional action representations and improved cross-event linkage, rather than directly predicting chunk-level actions conditioned only on the current observation and instruction.
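The granularity mismatch between fixed-length chunks and semantic events can be shown on a toy trajectory. This is only a sketch: the digest does not describe how WALL-WM detects event boundaries, so per-step event labels are assumed to be given, and the trajectory, `fixed_chunks`, and `event_segments` are hypothetical.

```python
from itertools import groupby

def fixed_chunks(steps, size):
    """Split a trajectory into consecutive chunks of a fixed length."""
    return [steps[i:i + size] for i in range(0, len(steps), size)]

def event_segments(steps):
    """Group consecutive steps that share the same event label."""
    return [list(group) for _, group in groupby(steps, key=lambda s: s[0])]

# Hypothetical trajectory: (event_label, action_id) per timestep.
trajectory = [
    ("reach", 0), ("reach", 1), ("reach", 2),
    ("grasp", 3),
    ("lift", 4), ("lift", 5), ("lift", 6), ("lift", 7),
]

for chunk in fixed_chunks(trajectory, 4):
    print("chunk:", sorted({label for label, _ in chunk}))
for segment in event_segments(trajectory):
    print("event:", segment[0][0], "length", len(segment))
```

Here the first fixed-length chunk straddles the boundary between reaching and grasping, whereas the event segments vary in length but each contains a single semantic unit.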
- Unified Video-Action Joint Denoising for Dexterous Action and Data Generation
- 赛道归属: 机器人世界模型 / 视频-动作联合生成(World Action Model, Video-Action Joint Modeling)
- 核心创新点: 从分布建模角度重构“视频先验→动作策略”的对齐方式:不再将视频基础模型的动态先验压缩为“给定观测的未来动作策略分布”,而是直接在交互视频与可执行手部轨迹的联合空间上进行建模与去噪生成;通过支持多种条件化机制/条件模式来保持更“宽”的联合分布,从而在同一框架内同时服务于灵巧动作生成与数据生成(视频与动作的协同合成),提升视频-动作一致性与可控性。
- Track: Robotics World Models / Video-Action Joint Generation (World Action Model, Video-Action Joint Modeling)
- Key innovation: Reframes video-to-action alignment as a distribution modeling problem: instead of collapsing the video foundation model’s dynamics prior into an observation-conditioned action policy over future actions, it models and denoises the joint distribution over interaction videos and executable hand trajectories. By enabling multiple conditioning regimes, it preserves a broader joint distribution, unifying dexterous action generation and data generation (co-synthesizing videos and actions) with improved video–action consistency and controllability.
GitHub
- [2026-06-08] DravenALG/awesome-vla-wam ⭐715
A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Generated automatically by Daily AI Digest Agent. 生成时间 / Generated at: 2026-06-09 01:01:55