AI 每日进展速报 / Daily AI Digest - 2026-03-26
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-03-21] CTCal: Rethinking Text-to-Image Diffusion Models via Cross-Timestep Self-Calibration 📖1 🆕NEW
- 赛道归属: 文生图(扩散模型对齐/训练策略)
- 核心创新点: 提出跨时间步自校准(Cross-Timestep Self-Calibration, CTCal),利用扩散过程不同时间步之间的互补监督信号,显式强化细粒度文本-图像对应关系,而非仅依赖传统扩散损失的隐式对齐。通过跨步一致性/校准机制,让模型在去噪轨迹中持续纠偏,从而提升提示词遵循与语义绑定精度。
- 一句话总结: CTCal用“跨时间步自我纠偏”的训练范式补足扩散损失对细粒度对齐监督不足的问题,显著增强文生图的文本一致性与可控性。
- Track: Text-to-Image (Diffusion alignment / training strategy)
- Core innovation: Proposes Cross-Timestep Self-Calibration (CTCal), which leverages complementary supervision across diffusion timesteps to explicitly strengthen fine-grained text–image correspondence instead of relying on the implicit alignment of standard diffusion losses. A cross-timestep consistency/calibration mechanism continuously corrects the denoising trajectory, improving prompt adherence and semantic binding.
- One-sentence summary: CTCal introduces a cross-timestep self-correction training paradigm that directly targets fine-grained text–image alignment, improving controllability and prompt faithfulness in text-to-image diffusion.
- [2026-03-25] Polynomial Speedup in Diffusion Models with the Multilevel Euler-Maruyama Method 🆕NEW
- 赛道归属: 推理优化(扩散模型采样/数值解法加速)
- 核心创新点: 将多层次Euler–Maruyama(Multilevel EM)引入扩散/流匹配相关的SDE/ODE求解,通过“多精度漂移近似器”分摊计算:大量调用低成本粗近似,少量调用高精度昂贵近似。并在HTMC(Harder-than-Monte-Carlo)计算复杂度假设下给出多项式级加速的理论保证。
- 一句话总结: 该工作用多层次数值方法在理论上系统性降低扩散采样/求解成本,为高精度生成带来可证明的计算加速路径。
- Track: Inference optimization (diffusion sampling / numerical solvers)
- Core innovation: Introduces Multilevel Euler–Maruyama (ML-EM) for solving diffusion-related SDEs/ODEs by distributing computation across a hierarchy of drift approximators with increasing accuracy/cost—many cheap coarse evaluations and only a few expensive accurate ones. Provides polynomial speedup guarantees under the HTMC (Harder-than-Monte-Carlo) regime assumptions.
- One-sentence summary: ML-EM offers a theoretically grounded route to substantially cheaper diffusion-time computation while maintaining target accuracy.
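The multilevel idea can be illustrated with the classical multilevel Monte Carlo form of Euler–Maruyama, which the paper's multi-fidelity drift approximators generalize: many cheap coarse-step paths plus a few fine-step correction terms, coupled through shared Brownian increments. The Ornstein–Uhlenbeck process and all parameters below are illustrative choices, not taken from the paper.

```python
import math, random

def em_path(theta, sigma, T, n_steps, dW, x0=1.0):
    # Euler–Maruyama for the OU SDE dX = -theta*X dt + sigma dW
    h = T / n_steps
    x = x0
    for k in range(n_steps):
        x += -theta * x * h + sigma * dW[k]
    return x

def multilevel_estimate(theta=1.0, sigma=0.3, T=1.0, L=4, n_samples=2000, seed=0):
    # Telescoping MLMC estimator: E[P_L] = E[P_0] + sum_l E[P_l - P_{l-1}],
    # with fine and coarse paths coupled through shared Brownian increments.
    rng = random.Random(seed)
    est = 0.0
    for lvl in range(L + 1):
        nf = 2 ** lvl              # fine steps at this level
        hf = T / nf
        acc = 0.0
        for _ in range(n_samples):
            dW = [rng.gauss(0.0, math.sqrt(hf)) for _ in range(nf)]
            fine = em_path(theta, sigma, T, nf, dW)
            if lvl == 0:
                acc += fine
            else:
                # coarse path reuses the fine increments, summed pairwise
                dWc = [dW[2 * k] + dW[2 * k + 1] for k in range(nf // 2)]
                acc += fine - em_path(theta, sigma, T, nf // 2, dWc)
        est += acc / n_samples
    return est

# for these parameters E[X_T] = exp(-1) ≈ 0.368; the estimate lands nearby,
# up to discretization bias and Monte Carlo noise
```

The coupling is what makes the correction terms cheap: the variance of `fine - coarse` shrinks with the step size, so most samples can be spent at the coarsest level.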
- [2026-03-25] Anti-I2V: Safeguarding your photos from malicious image-to-video generation 🆕NEW
- 赛道归属: 视频生成安全(图像到视频防护/对抗扰动)
- 核心创新点: 面向图像到视频(I2V)扩散模型的滥用风险,提出专门的图像防护方案,通过构造对抗性微扰使输入照片在I2V生成链路中失效或显著降质,而非仅针对文生图/图生图。方法强调对视频扩散时序建模与跨帧一致性带来的新攻击面进行适配,从而提升对I2V模型的防护有效性与迁移性。
- 一句话总结: Anti-I2V把“对抗防护”从图像扩散扩展到I2V视频扩散场景,为防止照片被恶意驱动生成伪造视频提供了更贴合实际威胁模型的技术手段。
- Track: Video generation safety (image-to-video protection / adversarial perturbations)
- Core innovation: Targets misuse of image-to-video (I2V) diffusion models by crafting adversarial perturbations on photos that specifically disrupt the I2V generation pipeline, rather than focusing on text-to-image or image generation alone. Adapts the attack/protection design to temporal modeling and cross-frame consistency, improving effectiveness and transferability against video diffusion models.
- One-sentence summary: Anti-I2V extends adversarial photo protection to the I2V diffusion setting, offering a practical defense against malicious photo-driven deepfake video generation.
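Anti-I2V's actual objective targets the I2V diffusion pipeline; as background, the generic shape of such photo immunization is a bounded adversarial perturbation. A minimal FGSM-style sketch against a stand-in linear feature extractor (all names and numbers below are illustrative, not the paper's method):

```python
def feature(img, w):
    # stand-in "feature extractor": a fixed linear map over pixels
    return sum(p * wi for p, wi in zip(img, w))

def fgsm_perturb(img, w, eps=0.03):
    # One FGSM step: move each pixel by eps along the sign of the gradient of
    # the feature w.r.t. that pixel (for a linear feature, the gradient is w),
    # then clip back into the valid [0, 1] pixel range.
    sign = lambda v: 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
    return [min(1.0, max(0.0, p + eps * sign(wi))) for p, wi in zip(img, w)]

img = [0.2, 0.5, 0.8, 0.4]
w = [1.0, -2.0, 0.5, 0.0]
adv = fgsm_perturb(img, w)
# each pixel moves by at most eps, keeping the perturbation imperceptible
```

In the I2V setting the loss being climbed would involve the video model's temporal layers, which is precisely the adaptation the paper emphasizes.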
- [2026-03-25] Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification 🆕NEW
- 赛道归属: 多模态理解(训练自由 few-shot 分类/CLIP增强)
- 核心创新点: 提出跨模态原型对齐与混合:在训练自由(training-free)设定下,将文本原型与少量样本的图像原型进行直接混合,并从偏差-方差视角解释其收益与适用条件。通过对齐与混合策略在不微调模型的情况下更好利用视觉样本信息,提升CLIP式few-shot分类的稳健性与精度。
- 一句话总结: 该方法用“图文原型混合”的简单机制在无需训练的前提下显著增强VLM的few-shot识别能力,并给出可解释的理论视角。
- Track: Multimodal understanding (training-free few-shot classification / CLIP enhancement)
- Core innovation: Proposes cross-modal prototype alignment and direct mixing of text prototypes with few-shot image prototypes in a training-free manner, and analyzes the gains through a bias–variance lens. The alignment/mixing strategy better exploits visual evidence without fine-tuning, improving robustness and accuracy of CLIP-style few-shot classification.
- One-sentence summary: By mixing aligned image and text prototypes without training, this work delivers a simple yet effective boost to VLM few-shot classification with an interpretable bias–variance rationale.
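A convex mix of text and image prototypes can be sketched in a few lines; the `alpha` weighting and the toy 2-D embeddings below are hypothetical stand-ins for CLIP features, not the paper's exact rule:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mixed_prototype(text_proto, image_protos, alpha=0.5):
    # Convex mix of the text prototype and the mean few-shot image prototype,
    # all kept on the unit sphere (cosine geometry).
    mean_img = [sum(col) / len(image_protos) for col in zip(*image_protos)]
    t, m = normalize(text_proto), normalize(mean_img)
    return normalize([alpha * a + (1 - alpha) * b for a, b in zip(t, m)])

def classify(query, prototypes):
    # nearest prototype by cosine similarity
    q = normalize(query)
    scores = {c: sum(a * b for a, b in zip(q, p)) for c, p in prototypes.items()}
    return max(scores, key=scores.get)

protos = {
    "cat": mixed_prototype([1.0, 0.1], [[0.9, 0.2], [1.1, 0.0]]),
    "dog": mixed_prototype([0.1, 1.0], [[0.2, 0.8], [0.0, 1.2]]),
}
pred = classify([1.0, 0.2], protos)  # "cat"
```

The bias–variance view in the paper maps naturally onto `alpha`: the text prototype is low-variance but possibly biased, while the few-shot image mean is unbiased but noisy.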
- [2026-03-25] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning 🆕NEW
- 赛道归属: 视频生成(统一框架/组合生成与推理)
- 核心创新点: 提出统一的“全能型”视频生成框架,将多任务视频生成能力(如多元素组合、自由形式编辑/重组、基于语义的推理驱动生成等)在单一模型内打通,而非任务割裂的拼装式方案。核心在于引入更强的多模态组合表示与“推理引导”的生成机制,使模型能在复杂指令下进行结构化编排与一致性生成。
- 一句话总结: OmniWeaving试图把开放源视频生成从“单点能力”推进到“统一可组合可推理”的通用生成系统形态。
- Track: Video generation (unified framework / compositional generation with reasoning)
- Core innovation: Proposes an omni-capable unified video generation model that integrates diverse tasks (free-form composition, editing/recombination, and semantics-driven generation) within a single framework rather than fragmented task-specific pipelines. Key is stronger multimodal compositional representations and reasoning-informed generation to support structured planning and consistent synthesis under complex prompts.
- One-sentence summary: OmniWeaving moves open video generation toward a unified, compositional, reasoning-aware model that can handle diverse generation tasks coherently.
- [2026-03-25] ViHOI: Human-Object Interaction Synthesis with Visual Priors 🆕NEW
- 赛道归属: 3D生成(人体-物体交互/动作生成)
- 核心创新点: 提出从易获取的2D图像中提取人-物交互视觉先验,并将其注入扩散式生成模型,以弥补“仅靠文本难以描述物理约束”的瓶颈。通过视觉先验提供接触关系、相对位姿与交互可行性等信息,提升3D HOI合成的物理合理性与真实感。
- 一句话总结: ViHOI用2D视觉先验替代难以完备语言描述的物理约束,让3D人-物交互生成更可信、更可控。
- Track: 3D generation (human–object interaction / motion synthesis)
- Core innovation: Extracts rich HOI visual priors from readily available 2D images and injects them into diffusion-based generative models, addressing the limitation that text alone poorly specifies physical constraints. The priors encode interaction feasibility cues (e.g., contact, relative pose), improving physical plausibility and realism in 3D HOI synthesis.
- One-sentence summary: ViHOI leverages 2D visual interaction priors to make 3D human–object interaction generation more physically plausible and controllable than text-only conditioning.
- [2026-03-25] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors 🆕NEW
- 赛道归属: 文生图(超高分辨率/极端长宽比生成)
- 核心创新点: 利用视频扩散模型中更强的时空一致性先验来补足静态文生图在极端长宽比(EAR)与超高分辨率下缺失的全局空间先验,提出面向32K级超长图生成的方案。通过引入“视频式”生成先验/机制来抑制对象重复、空间碎裂等结构性崩坏,实现可扩展的超大画布合成。
- 一句话总结: ScrollScape把视频扩散的优势迁移到超长图生成,解决EAR场景下结构崩坏难题,打开32K级图像合成的可行路径。
- Track: Text-to-Image (ultra-high-resolution / extreme aspect-ratio generation)
- Core innovation: Transfers stronger spatiotemporal consistency priors from video diffusion models to compensate for the weak global spatial priors of standard text-to-image models under extreme aspect ratios and ultra-high resolutions. By leveraging “video-like” diffusion priors/mechanisms, it mitigates structural failures such as object repetition and spatial fragmentation, enabling scalable ~32K image synthesis.
- One-sentence summary: ScrollScape repurposes video diffusion priors to stabilize extreme-canvas image generation, making 32K ultra-wide synthesis substantially more reliable.
- [2026-03-25] LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation 🆕NEW
- 赛道归属: 文生图(光照可控/训练自由编辑)
- 核心创新点: 提出通过操控扩散初始噪声实现光照引导的训练自由方法(无需微调与额外数据),将“光照条件”作为可注入的生成因素直接影响去噪轨迹。相较两阶段后处理重光照流程,该方法在生成阶段即实现光照控制,降低计算与工程复杂度。
- 一句话总结: LGTM用“初始噪声操控”把光照控制前置到扩散生成过程中,实现无需训练的高效光照可控文生图。
- Track: Text-to-Image (lighting control / training-free editing)
- Core innovation: Introduces a training-free lighting-guided approach by manipulating the diffusion initial noise so that lighting conditions become an explicit controllable factor influencing the denoising trajectory. This avoids data-heavy fine-tuning and replaces inefficient two-stage relighting pipelines with in-generation lighting control.
- One-sentence summary: LGTM enables efficient, training-free lighting control in text-to-image diffusion by steering generation through initial-noise manipulation.
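As a rough intuition for initial-noise manipulation (the paper's actual procedure may differ), one can bias the starting Gaussian noise toward a lighting layout and then re-standardize, so the sampler still sees approximately unit-Gaussian statistics while the spatial bias survives. Everything below, including `strength`, is a hypothetical sketch:

```python
import math, random

def lighting_seeded_noise(light_map, strength=0.3, seed=0):
    # Blend a lighting layout into the initial diffusion noise, then
    # re-standardize so the marginals stay roughly N(0, 1).
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in light_map]
    mixed = [(1 - strength) * n + strength * l for n, l in zip(noise, light_map)]
    mu = sum(mixed) / len(mixed)
    sd = math.sqrt(sum((x - mu) ** 2 for x in mixed) / len(mixed))
    return [(x - mu) / sd for x in mixed]

# bright on the left half, dark on the right half
light = [1.0] * 512 + [-1.0] * 512
z0 = lighting_seeded_noise(light)
left = sum(z0[:512]) / 512
right = sum(z0[512:]) / 512
# the standardized noise keeps a left-right brightness bias
```

Because low-frequency structure in the initial noise strongly influences the final layout of diffusion samples, a bias like this can steer illumination without any fine-tuning.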
- [2026-03-25] When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm 🆕NEW
- 赛道归属: 生成安全(MLLM图像生成/真实性与安全评估)
- 核心创新点: 系统比较以多模态大模型(MLLM)为核心的图像生成范式与扩散模型在安全性上的差异,指出更强语义理解能力可能带来新的真实性与滥用风险(例如更好地理解复杂意图与上下文,从而更易生成高风险内容)。通过风险维度化分析与对照基线,建立面向新范式的安全评估框架与问题清单。
- 一句话总结: 该工作提醒“更会理解”的生成模型可能更危险,并为MLLM驱动的图像生成提供了系统化的安全风险分析坐标系。
- Track: Generative safety (MLLM-based image generation / authenticity & risk assessment)
- Core innovation: Systematically analyzes safety and authenticity risks of the emerging MLLM-centric image generation paradigm versus diffusion models, arguing that stronger semantic understanding can enable new or amplified misuse vectors. Provides a structured, comparative risk analysis framework and evaluation considerations tailored to this new generation stack.
- One-sentence summary: This work frames how improved semantic understanding in MLLM-based image generation can increase safety risks, and offers a systematic lens to evaluate and mitigate them.
- [2026-03-25] HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models 🆕NEW
- 赛道归属: 图像编辑(扩散模型风格迁移/训练自由)
- 核心创新点: 提出异构注意力调制(Heterogeneous Attention Modulation, HAM)的训练自由风格迁移,通过在扩散模型注意力层中对内容与风格信息进行差异化调制,增强复杂风格参考的表达能力并更好保持内容身份一致性。相较外接控制模块或重度微调方案,HAM以更轻量的内部注意力干预实现风格注入与身份保真之间的平衡。
- 一句话总结: HAM用“异构注意力调制”在无需训练的条件下提升扩散风格迁移的风格还原与身份保持能力。
- Track: Image editing (diffusion style transfer / training-free)
- Core innovation: Proposes Heterogeneous Attention Modulation (HAM), a training-free style transfer method that differentially modulates attention to separate and control content vs. style signals inside diffusion models. This improves complex style reference capture while better preserving the identity/structure of the content image, avoiding heavy fine-tuning or bulky external control branches.
- One-sentence summary: HAM achieves stronger style fidelity and identity preservation for diffusion-based style transfer via lightweight, training-free attention modulation.
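The idea of modulating attention differently for style and content tokens can be sketched as a per-token logit scale applied before the softmax; `lam_style`, `lam_content`, and the toy vectors are illustrative assumptions, not HAM's actual formulation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def modulated_attention(q, keys, values, is_style, lam_style=1.5, lam_content=1.0):
    # Scale attention logits differently for style-reference tokens vs content
    # tokens before the softmax (a toy stand-in for heterogeneous modulation).
    logits = []
    for k, style in zip(keys, is_style):
        dot = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
        logits.append(dot * (lam_style if style else lam_content))
    w = softmax(logits)
    out = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
    return out, w

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
_, w_hi = modulated_attention(q, keys, values, [True, False], lam_style=1.5)
_, w_lo = modulated_attention(q, keys, values, [True, False], lam_style=1.0)
# boosting lam_style shifts attention mass toward the style token
```

Tuning such scales at inference time is what makes this family of methods training-free: no weights change, only how existing attention distributes between style and content evidence.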
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-03-25] DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving 🆕NEW
- 赛道归属: 自动驾驶世界模型 + 强化学习加速(latent world model / diffusion压缩)
- 核心创新点: 提出DreamerAD,将像素级扩散世界模型压缩到潜空间并把扩散采样从多步(约100步)极限压到1步,在保持可视化可解释性的同时显著降低想象生成的推理延迟。基于该高效世界模型进行离线/安全的“想象式”策略学习,使自动驾驶RL训练在真实数据上更可行。
- 一句话总结: 用“1步扩散”的潜空间世界模型把自动驾驶想象训练的成本与时延大幅打下来,为可部署的世界模型RL提供关键工程路径。
- Track: Autonomous-driving video world models + RL acceleration (latent world model / diffusion compression)
- Core innovation: DreamerAD introduces a latent diffusion world-model framework that compresses diffusion sampling from ~100 steps to 1 step, drastically reducing imagination-time latency while retaining visual interpretability. This enables more practical and safer imagination-based RL policy training from real driving data.
- One-sentence takeaway: A 1-step latent diffusion world model makes world-model RL for autonomous driving far more deployable by cutting generation latency and cost without sacrificing interpretability.
- [2026-03-25] Anti-I2V: Safeguarding your photos from malicious image-to-video generation 🆕NEW
- 赛道归属: 视频生成安全(图像到视频防护/对抗扰动)
- 核心创新点: 面向图像到视频(I2V)扩散模型的滥用风险,提出专门的图像防护方案,通过构造对抗性微扰使输入照片在I2V生成链路中失效或显著降质,而非仅针对文生图/图生图。方法强调对视频扩散时序建模与跨帧一致性带来的新攻击面进行适配,从而提升对I2V模型的防护有效性与迁移性。
- 一句话总结: Anti-I2V把“对抗防护”从图像扩散扩展到I2V视频扩散场景,为防止照片被恶意驱动生成伪造视频提供了更贴合实际威胁模型的技术手段。
- Track: Video generation safety (image-to-video protection / adversarial perturbations)
- Core innovation: Targets misuse of image-to-video (I2V) diffusion models by crafting adversarial perturbations on photos that specifically disrupt the I2V generation pipeline, rather than focusing on text-to-image or image generation alone. Adapts the attack/protection design to temporal modeling and cross-frame consistency, improving effectiveness and transferability against video diffusion models.
- One-sentence summary: Anti-I2V extends adversarial photo protection to the I2V diffusion setting, offering a practical defense against malicious photo-driven deepfake video generation.
- [2026-03-25] Toward Physically Consistent Driving Video World Models under Challenging Trajectories 🆕NEW
- 赛道归属: 自动驾驶视频世界模型(物理一致性/反事实轨迹条件生成)
- 核心创新点: 针对“挑战/反事实轨迹”条件下视频世界模型易出现物理不一致与伪影的问题,提出面向困难轨迹的训练/约束机制,使模型在非自然、带噪或不完美轨迹条件下仍能生成更符合车辆运动与场景几何的时空一致视频。核心在于把物理一致性作为关键目标显式纳入建模与优化,而非仅依赖自然数据分布拟合。
- 一句话总结: 让驾驶视频世界模型在规划/仿真产生的“刁钻轨迹”下也能保持物理可信度,提升其作为闭环仿真与评测引擎的可靠性。
- Track: Autonomous-driving video world models (physical consistency under counterfactual/challenging trajectories)
- Core innovation: Addresses severe physical inconsistencies when conditioning on challenging or counterfactual trajectories by introducing training/constraint strategies tailored to imperfect, out-of-distribution trajectories, explicitly optimizing for spatiotemporal physical plausibility rather than only fitting natural driving data.
- One-sentence takeaway: Improves the reliability of driving video world models for planning/simulation by keeping generations physically consistent under hard, non-natural trajectories.
- [2026-03-25] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning 🆕NEW
- 赛道归属: 视频生成(统一框架/组合生成与推理)
- 核心创新点: 提出统一的“全能型”视频生成框架,将多任务视频生成能力(如多元素组合、自由形式编辑/重组、基于语义的推理驱动生成等)在单一模型内打通,而非任务割裂的拼装式方案。核心在于引入更强的多模态组合表示与“推理引导”的生成机制,使模型能在复杂指令下进行结构化编排与一致性生成。
- 一句话总结: OmniWeaving试图把开放源视频生成从“单点能力”推进到“统一可组合可推理”的通用生成系统形态。
- Track: Video generation (unified framework / compositional generation with reasoning)
- Core innovation: Proposes an omni-capable unified video generation model that integrates diverse tasks (free-form composition, editing/recombination, and semantics-driven generation) within a single framework rather than fragmented task-specific pipelines. Key is stronger multimodal compositional representations and reasoning-informed generation to support structured planning and consistent synthesis under complex prompts.
- One-sentence summary: OmniWeaving moves open video generation toward a unified, compositional, reasoning-aware model that can handle diverse generation tasks coherently.
- [2026-03-25] RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation 🆕NEW
- 赛道归属: 视频语义分割(State Space Model / 时空记忆增强)
- 核心创新点: 提出RS-SSM,针对SSM线性压缩带来的“固定容量状态空间遗忘细节”问题,引入对被遗忘的像素级特定信息进行显式精炼/回填的机制,在保持SSM高效性的同时增强细粒度时空建模与跨帧一致性。方法重点在于把“通用语义保留”与“特定细节恢复”解耦并协同优化。
- 一句话总结: 在不牺牲SSM效率的前提下补回视频分割所需的细节记忆,提升复杂场景下的时序一致分割质量。
- Track: Video semantic segmentation (State Space Models / spatiotemporal memory refinement)
- Core innovation: RS-SSM mitigates the “forgotten specifics” issue caused by fixed-size SSM compression by explicitly refining/recovering pixel-level specific information, improving fine-grained spatiotemporal modeling and temporal consistency while preserving SSM efficiency. It effectively decouples common semantic retention from specific-detail restoration.
- One-sentence takeaway: Restores the fine details SSMs tend to forget, yielding more temporally consistent and accurate video semantic segmentation without losing linear-time efficiency.
- [2026-03-25] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors 🆕NEW
- 赛道归属: 文生图(超高分辨率/极端长宽比生成)
- 核心创新点: 利用视频扩散模型中更强的时空一致性先验来补足静态文生图在极端长宽比(EAR)与超高分辨率下缺失的全局空间先验,提出面向32K级超长图生成的方案。通过引入“视频式”生成先验/机制来抑制对象重复、空间碎裂等结构性崩坏,实现可扩展的超大画布合成。
- 一句话总结: ScrollScape把视频扩散的优势迁移到超长图生成,解决EAR场景下结构崩坏难题,打开32K级图像合成的可行路径。
- Track: Text-to-Image (ultra-high-resolution / extreme aspect-ratio generation)
- Core innovation: Transfers stronger spatiotemporal consistency priors from video diffusion models to compensate for the weak global spatial priors of standard text-to-image models under extreme aspect ratios and ultra-high resolutions. By leveraging “video-like” diffusion priors/mechanisms, it mitigates structural failures such as object repetition and spatial fragmentation, enabling scalable ~32K image synthesis.
- One-sentence summary: ScrollScape repurposes video diffusion priors to stabilize extreme-canvas image generation, making 32K ultra-wide synthesis substantially more reliable.
- [2026-03-25] Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep 🆕NEW
- 赛道归属: 扩散式视频编辑加速(DiT推理优化 / 缓存复用)
- 核心创新点: 提出异构缓存(heterogeneous caching)加速框架,不再局限于“按采样timestep复用特征”的单一粒度,而是跨不同计算单元/层级对可复用中间结果进行更精细的缓存与调度,从而减少冗余计算并降低DiT迭代去噪的总体开销。突破点在于把缓存从“时间步复用”扩展为“多粒度、异构的计算复用”。
- 一句话总结: 通过更聪明的多粒度缓存复用,把高质量扩散视频编辑的推理成本显著压缩,推动其实用化部署。
- Track: Diffusion-based video editing acceleration (DiT inference optimization / caching & reuse)
- Core innovation: Proposes a heterogeneous caching framework that goes beyond timestep-level reuse by caching and scheduling reusable intermediates across heterogeneous components/granularities, reducing redundant computation throughout iterative denoising and lowering end-to-end DiT editing cost.
- One-sentence takeaway: Makes diffusion video editing substantially more practical by cutting inference compute via multi-granularity, heterogeneous caching reuse.
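A toy version of multi-granularity caching: each block refreshes its cached output on its own period instead of recomputing at every denoising step. The block names and refresh periods below are invented for illustration, not the paper's schedule:

```python
def run_denoising(num_steps=12, refresh=None):
    # Each "block" recomputes only every refresh[name] steps and otherwise
    # reuses its cached output (heterogeneous, per-block reuse granularity).
    if refresh is None:
        refresh = {"attn": 3, "mlp": 2}
    cache, recomputes, trace = {}, {name: 0 for name in refresh}, []
    for t in range(num_steps):
        outs = {}
        for name, period in refresh.items():
            if t % period == 0:
                cache[name] = f"{name}@{t}"  # stand-in for an expensive forward pass
                recomputes[name] += 1
            outs[name] = cache[name]
        trace.append(outs)
    return recomputes, trace

recomputes, trace = run_denoising()
# attn runs 4 times and mlp 6 times instead of 12 each
```

In a real DiT the periods would be chosen per component from how quickly each feature drifts across timesteps, which is where the "heterogeneous" scheduling earns its savings.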
- [2026-03-25] Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic 🆕NEW
- 赛道归属: 医学影像生成/重建(EEG→fMRI 时空建模)
- 核心创新点: 提出“时空神经帧(spatiotemporal neural frames)”建模范式,将EEG的高时间分辨率与fMRI的高空间分辨率进行融合,用可学习的时空表示来重建高质量动态脑活动。关键在于以帧级神经表示统一刻画跨时间的脑动态与跨空间的皮层模式,从而提升重建的细粒度与一致性。
- 一句话总结: 用统一的时空神经表示把EEG的时间信息有效转化为高分辨率fMRI动态重建能力,降低高成本采集依赖。
- Track: Medical imaging generation/reconstruction (EEG-to-fMRI spatiotemporal modeling)
- Core innovation: Introduces spatiotemporal neural frames to fuse EEG’s millisecond temporal cues with fMRI’s high spatial detail, learning a unified spatiotemporal representation for reconstructing high-quality dynamic brain activity with improved fine-grained consistency.
- One-sentence takeaway: Turns inexpensive EEG temporal information into high-resolution fMRI-like dynamic reconstructions via a unified spatiotemporal neural representation.
- [2026-03-25] Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection 🆕NEW
- 赛道归属: 多模态理解(开放词表时序动作检测 / OV-TAD)
- 核心创新点: 提出PDA(Phase-wise Decomposition and Alignment)并结合CoT式提示增强,将动作检测的对齐从“全局语义-视觉对齐”细化为分阶段的模式分解与对齐,显式学习可迁移的时序动作结构知识,从已见类向未见类更稳健迁移。方法突破在于用分解式对齐强化时间一致的先验传递,而非仅靠标签文本嵌入的粗粒度匹配。
- 一句话总结: 通过“分阶段分解+CoT对齐”把已见动作的时序模式迁移给未见类别,提升开放词表动作定位的泛化能力。
- Track: Multimodal video understanding (Open-vocabulary temporal action detection, OV-TAD)
- Core innovation: Proposes Phase-wise Decomposition and Alignment (PDA) with CoT-prompting enhanced alignment, refining global text-vision alignment into phase-wise pattern decomposition and alignment to learn transferable, temporally consistent action knowledge from seen to unseen classes.
- One-sentence takeaway: Improves OV-TAD generalization by transferring structured temporal action patterns—via phase-wise decomposition and CoT-enhanced alignment—rather than relying on coarse global semantic matching.
- [2026-03-25] Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection 🆕NEW
- 赛道归属: 多模态安全(音视频深度伪造检测 / 泛化鲁棒检测)
- 核心创新点: 提出“整体音视频内在一致性(holistic audio-visual intrinsic coherence)”视角,将音频与视频的内在耦合关系作为检测核心信号,联合利用单模态线索与跨模态一致性来对抗仅依赖伪影或生成器特征导致的泛化退化。方法重点在于从“生成器特定伪影检测”转向“内容层面的内在一致性建模”,提升对未知伪造的鲁棒性。
- 一句话总结: 以音视频内在一致性为中心信号,提升深伪检测在未知生成器与高逼真伪造下的泛化与可靠性。
- Track: Multimodal security (audio-visual deepfake detection / generalizable detection)
- Core innovation: Advocates holistic audio-visual intrinsic coherence as the key detection cue, jointly leveraging unimodal evidence and cross-modal coherence to avoid overfitting to generator-specific artifacts and improve robustness against unseen forgeries.
- One-sentence takeaway: Shifts deepfake detection from artifact chasing to intrinsic audio-visual coherence modeling, boosting generalization to novel, highly realistic forgeries.
HuggingFace Models
语言大模型 / Large Language Models
arXiv
- [2026-03-24] SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling 📖7 🆕NEW
- 赛道归属: 强化学习训练加速(LLM RL/rollout 调度优化)
- 核心创新点: 提出在线“长度感知”的调度策略,在RL训练中按生成轨迹长度动态编排rollout与更新的执行顺序/并行方式,减少长序列自回归生成带来的等待与同步开销。通过把计算资源优先分配给更“拖慢流水线”的长trajectory,显著提升整体吞吐与训练效率。
- 一句话总结: 用在线长度调度把RL训练的主要瓶颈(长rollout)从系统层面“排队优化”,以更低成本扩展长链推理RL训练。
- Track: RL training acceleration (LLM RL / rollout scheduling optimization)
- Core innovation: Proposes an online length-aware scheduling strategy that dynamically orchestrates rollouts and policy updates based on trajectory length, reducing idle time and synchronization overhead caused by long autoregressive generations. By prioritizing resource allocation to long trajectories that stall the pipeline, it improves end-to-end throughput and training efficiency.
- One-sentence summary: System-level length-aware scheduling removes the dominant long-rollout bottleneck, making long-CoT RL training cheaper to scale.
- [2026-03-23] Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs 📖2 🆕NEW
- 赛道归属: RLVR机理分析(token级分布漂移与推理提升解释)
- 核心创新点: 从token粒度系统刻画RLVR前后分布变化,分析哪些token的概率质量发生“稀疏但关键”的漂移,并将这种漂移与序列级推理表现建立因果/关联验证(如通过控制变量或干预式实验)。进一步评估这些token级变化对推理正确性与泛化的贡献边界。
- 一句话总结: 该工作把RLVR“为何有效”落到可检验的token级机制上,为更可控、更高效的推理微调提供诊断工具。
- Track: RLVR mechanism analysis (token-level distribution shift)
- Core innovation: Conducts a token-level characterization of distributional shifts induced by RLVR, identifying sparse-but-critical probability mass changes and linking them to sequence-level reasoning gains via controlled/ablation-style analyses. It quantifies how specific token-level shifts contribute to correctness and generalization.
- One-sentence summary: It turns “why RLVR works” into testable token-level mechanisms, enabling better diagnostics and more controllable reasoning fine-tuning.
- [2026-03-23] On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation 📖1 🆕NEW
- 赛道归属: RLVR更新机理与可控优化(更新方向建模)
- 核心创新点: 不仅关注RLVR更新“幅度稀疏”,而是提出用带符号的token级log-prob差分来刻画更新“方向”,并识别对推理最关键的方向性更新模式。基于对方向的识别结果,进一步提出利用/放大有益方向、抑制有害方向的训练或后处理策略,以更稳定地获得推理增益。
- 一句话总结: 通过把RLVR的关键从“改了多少”转向“往哪改”,为推理能力提升提供更直接的可操作优化信号。
- Track: RLVR update mechanics & controllable optimization (update direction modeling)
- Core innovation: Argues that update direction matters more than magnitude, modeling it via signed token-level log-probability differences to identify directionally critical update patterns for reasoning. It then exploits these patterns to amplify beneficial directions and suppress harmful ones for more stable gains.
- One-sentence summary: By focusing on “where RLVR pushes the model,” it provides a more actionable handle for improving reasoning than sparsity/magnitude alone.
- [2026-03-20] The production of meaning in the processing of natural language 📖1 🆕NEW
- 赛道归属: 语义建模理论(量子逻辑/上下文性与LLM语义机制)
- 核心创新点: 将自然语言意义生成视为具有强上下文性的过程,借鉴量子逻辑框架解释语义组合与语境依赖现象,并对比经典布尔语义的不足。讨论并连接认知科学实验与LLM中观察到的类似上下文性特征,为“意义如何产生”提供统一的形式化视角。
- 一句话总结: 该工作用量子式上下文性为人类与LLM的语义处理提供共同解释框架,服务于更可预测、更安全的人机语言交互设计。
- Track: Semantic modeling theory (quantum logic contextuality for language meaning)
- Core innovation: Frames meaning production in natural language as a strongly contextual process better captured by quantum-logical mechanisms than classical Boolean semantics, connecting evidence from cognitive science with similar contextuality observed in LLMs. It offers a unified formal lens for semantic composition under context dependence.
- One-sentence summary: A quantum-contextual account of meaning links human and LLM semantics, informing safer and more predictable human–agent language interactions.
- [2026-03-25] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination 🆕NEW
- 赛道归属: 幻觉检测与RAG可靠性(多智能体自检 + 强化学习)
- 核心创新点: 提出多智能体“强化自检”框架,通过引入多个相互制衡的验证代理来对抗LLM-as-a-judge的确认偏差,并用强化学习机制优化自检策略(何时查证、如何对齐证据、如何判定冲突)。在RAG场景下将生成与证据核验解耦并形成对抗/协作式审查流程,提升幻觉识别的鲁棒性。
- 一句话总结: 用多代理+RL把“自我复读式评审”变成“交叉审计式核验”,显著增强RAG输出的可信度。
- Track: Hallucination detection & RAG reliability (multi-agent self-check + RL)
- Core innovation: Introduces a multi-agent reinforced self-check framework that mitigates LLM-as-a-judge confirmation bias by using multiple cross-checking verifier agents and reinforcement learning to optimize verification policies (when/how to verify, evidence alignment, conflict resolution). It decouples generation from evidence auditing to improve robustness in RAG.
- One-sentence summary: Multi-agent RL turns self-verification into cross-auditing, substantially improving trustworthiness of RAG outputs.
- [2026-03-25] LensWalk: Agentic Video Understanding by Planning How You See in Videos 🆕NEW
- 赛道归属: 视频理解(Agentic感知-推理协同/主动取证)
- 核心创新点: 提出LensWalk框架,将“推理”与“看视频取证”闭环耦合:由LLM制定观察计划(何时看、看哪里、看多长),并按需从原始视频中主动采样证据以迭代更新推理状态。相较依赖静态预处理特征的范式,该方法把感知作为可控工具,缓解长视频的时序密集与证据缺失问题。
- 一句话总结: 该工作用“会规划的观看”把视频理解从被动读特征升级为主动取证式推理,提高长时序任务的可靠性。
- Track: Video Understanding (Agentic Perception–Reasoning / Active Evidence Seeking)
- Core innovation: Introduces LensWalk, a closed-loop agentic framework where an LLM plans how to observe a video (when/where/how long) and actively queries raw video evidence to iteratively refine reasoning. By making perception a controllable tool rather than a fixed pre-processing step, it addresses dense temporal complexity and missing-evidence issues in long videos.
- One-sentence summary: It upgrades video understanding from passive feature consumption to planned, evidence-seeking reasoning for more reliable long-horizon analysis.
- [2026-03-25] Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents 🆕NEW
- 赛道归属: RAG工程评测(文档切分/Chunking策略优化)
- 核心创新点: 面向企业长文档检索生成,系统对比固定滑窗、递归切分、语义断点、结构感知等chunking策略,并在真实行业语料上量化其对检索命中、答案质量与稳定性的影响。强调chunking作为RAG关键超参的可测量性,并给出不同策略在结构化/半结构化文档中的适配规律。
- 一句话总结: 该工作用实证结果把“怎么切文档”从经验活变成可量化决策,直接提升企业RAG落地效果。
- Track: RAG engineering evaluation (document chunking strategy optimization)
- Core innovation: Empirically compares fixed sliding windows, recursive splitting, semantic breakpoints, and structure-aware chunking on enterprise oil-and-gas documents, quantifying impacts on retrieval and answer quality. It elevates chunking to a measurable, domain-dependent design choice with practical guidance.
- One-sentence summary: It makes “how to chunk” a data-driven decision, directly improving enterprise RAG performance.
- 赛道归属: 教育对话分析(表征学习 + 教学支架动态建模)
- 核心创新点: 提出基于embedding对齐的度量方法,将辅导对话轮次与题目、标准解的语义相似度作为“支架(scaffolding)动态”的可计算指标,用以刻画导师引导与学生理解之间的时序耦合。通过表示学习把原本难以量化的教学策略变化转化为可追踪的时间序列特征。
- 一句话总结: 用语义表征把真实辅导对话中的“支架强弱与时机”量化出来,为自适应教学系统评估与优化提供基础工具。
- Track: Educational dialogue analytics (representation learning for scaffolding dynamics)
- Core innovation: Proposes an embedding-alignment method that operationalizes tutoring scaffolding dynamics via cosine similarity among dialogue turns, problem statements, and correct solutions, yielding time-series signals of guidance/understanding alignment. Representation learning turns qualitative tutoring strategies into quantifiable temporal features.
- One-sentence summary: It quantifies real tutoring scaffolding in-the-wild, enabling evaluation and optimization of adaptive tutoring systems.
- [2026-03-25] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience 🆕NEW
- 赛道归属: 多模态智能体(移动端GUI Agent/自我进化学习)
- 核心创新点: 提出两阶段自进化GUI智能体UI-Voyager:第一阶段用Rejection Fine-Tuning基于失败轨迹进行持续共进化式数据/策略改进;第二阶段针对长时序稀疏奖励下的信用分配与失败复盘,提升从“失败经验”中提炼可迁移技能的效率。相较常规模仿/强化流程,强调失败驱动的自举式学习闭环。
- 一句话总结: 该工作让GUI Agent把失败当作高价值训练信号,实现更高效的长任务学习与自我迭代。
- Track: Multimodal Agents (Mobile GUI Agents / Self-evolving Learning)
- Core innovation: Proposes UI-Voyager, a two-stage self-evolving GUI agent: (1) Rejection Fine-Tuning to continuously improve using failed trajectories via co-evolving data and policy; (2) mechanisms targeting sparse-reward, long-horizon credit assignment to better distill transferable skills from failures. It emphasizes a failure-driven bootstrapping loop beyond standard imitation/RL pipelines.
- One-sentence summary: It turns failures into a primary training signal, enabling more efficient self-improvement for long-horizon mobile GUI tasks.
- [2026-03-25] Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models 🆕NEW
- 赛道归属: 视频理解(ToM/心智推理评测与增强)
- 核心创新点: 聚焦“仅视频输入”的Theory of Mind能力,构建/评估多模态大模型在无文本提示下对他人信念、意图与知识状态的推断,并探讨可解释的改进路径而非黑盒打分。通过把ToM从文本迁移到纯视觉情境,暴露模型在社会因果与隐含状态推理上的关键短板。
- 一句话总结: 该工作补齐了ToM评测的纯视觉空白,推动多模态模型面向真实交互的心智推理能力提升。
- Track: Video Understanding (Theory-of-Mind Reasoning: Evaluation & Enhancement)
- Core innovation: Targets Theory of Mind under video-only inputs, evaluating and improving MLLMs’ ability to infer beliefs, intentions, and knowledge states without textual scaffolding, with attention to interpretable improvement rather than purely black-box scoring. By moving ToM from text to pure visual scenarios, it surfaces core weaknesses in social-causal and latent-state reasoning.
- One-sentence summary: It fills the gap of video-only ToM evaluation and advances multimodal models toward human-like mental-state inference in real interactions.
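For the chunking study listed above, the two most common baselines can be sketched directly; the window sizes, separators, and sample document below are illustrative parameters, not the paper's configuration:

```python
def fixed_window_chunks(text, size=20, overlap=5):
    # Fixed sliding window over characters (word or token windows work the
    # same way); consecutive chunks share `overlap` characters.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def recursive_chunks(text, max_len=40, seps=("\n\n", "\n", ". ")):
    # Recursively split on the coarsest separator until every piece fits;
    # a piece with no remaining separators is returned as-is even if long.
    if len(text) <= max_len or not seps:
        return [text]
    out = []
    for part in text.split(seps[0]):
        out.extend(recursive_chunks(part, max_len, seps[1:]))
    return [p for p in out if p.strip()]

doc = "Alpha section.\n\nBeta has more text. It continues here.\n\nGamma."
windows = fixed_window_chunks(doc)
pieces = recursive_chunks(doc, max_len=30)
```

The trade-off the paper measures falls out of these shapes: fixed windows guarantee size but cut across sentence boundaries, while recursive splitting respects document structure at the cost of variable chunk lengths.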
HuggingFace Datasets
- [2026-03-25] OpenMOSS-Team/OmniAction 🆕NEW
  RoboOmni: Proactive Robot Manipulation in Omni-modal Context (arXiv paper accepted to ICLR 2026)
- [2026-03-24] OpenMOSS-Team/OmniAction-LIBERO 🆕NEW
  RoboOmni: Proactive Robot Manipulation in Omni-modal Context (arXiv paper accepted to ICLR 2026)
多模态大模型 / Multimodal Models
arXiv
- [2026-03-25] Vision-Language Models vs Human: Perceptual Image Quality Assessment 🆕NEW
- 赛道归属: 多模态理解(感知质量评估/IQA基准)
- 核心创新点: 系统性评测多种开源/闭源视觉语言模型在对比度、色彩丰富度与整体偏好三类主观质量维度上对齐人类心理物理实验结果的能力,并以统一协议量化“VLM≈人类感知判断”的边界与偏差模式。通过跨尺度对比揭示VLM在主观偏好与低层视觉属性判断上的一致性差异,为后续IQA自动化提供可复现基线。
- 一句话总结: 该工作用严格的人类心理物理数据对VLM做IQA对齐测试,明确了VLM替代/辅助主观画质评估的可行性与局限。
- Track: Multimodal Understanding (Perceptual IQA Benchmarking)
- Core innovation: Provides a systematic benchmark of multiple open and proprietary VLMs against psychophysical human judgments on three perceptual quality axes—contrast, colorfulness, and overall preference—under a unified evaluation protocol. It characterizes where VLMs align with or deviate from human perception across low-level attributes vs subjective preference, establishing a reproducible baseline for automated IQA.
- One-sentence summary: It rigorously measures how close VLMs are to human perceptual image-quality judgments, clarifying both promise and limitations for scalable IQA.
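How closely a VLM tracks human quality judgments is conventionally measured with rank correlation against mean opinion scores (MOS). The sketch below shows this standard alignment metric, Spearman rank correlation, implemented from scratch; the score lists are made-up illustrative values, not data from the paper.

```python
# Spearman rank correlation between hypothetical VLM quality ratings and
# human MOS. Scores are illustrative, not taken from the benchmark.

def ranks(xs):
    """Return 1-based ranks, assigning average ranks to ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over a run of tied values
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation computed on ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human_mos = [4.1, 2.3, 3.7, 1.9, 4.8]   # illustrative human ratings
vlm_score = [4.0, 2.8, 3.5, 2.1, 4.6]   # illustrative VLM ratings
print(round(spearman(human_mos, vlm_score), 3))  # → 1.0 (identical ranking)
```

A perfect rank agreement yields 1.0 even when absolute scores differ, which is exactly why rank correlation (rather than raw error) is the usual yardstick for "VLM ≈ human perception".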
- [2026-03-25] VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models 🆕NEW
- 赛道归属: 图像到矢量(SVG生成/图形矢量化)
- 核心创新点: 利用视觉语言模型将复杂栅格图(如技术插图)结构化“理解”为可编辑的SVG表示,重点在于把图形分解为语义/几何可组合的矢量原语与层级结构,而非仅做像素级描摹。通过VLM的语义对齐能力提升对复杂图形元素(标注、箭头、曲线、分组关系等)的恢复与可编辑性。
- 一句话总结: 该工作把VLM的语义理解引入矢量化流程,使“从PNG恢复可编辑SVG”更接近设计级重建而非简单描边。
- Track: Image-to-Vector (SVG Generation / Figure Vectorization)
- Core innovation: Uses VLMs to convert complex raster figures into editable SVG by structuring the output into compositional vector primitives and hierarchical/semantic groupings, rather than pixel-level tracing. It leverages vision-language semantics to better recover complex elements (labels, arrows, curves, grouping relations) for downstream editing.
- One-sentence summary: It brings VLM-level semantic understanding into vectorization, enabling design-grade reconstruction of editable SVGs from flat images.
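The difference between pixel tracing and structured vectorization can be made concrete: the target output is an SVG organized into named semantic groups of primitives, each independently editable. This is a generic illustration of that output format, not VFIG's actual pipeline; the element choices and group names are assumptions.

```python
# Illustrative sketch (not the paper's method): emitting an SVG as semantic
# groups of vector primitives rather than one flat traced path.

def primitive(tag, **attrs):
    """Render one SVG element; underscores in kwargs map to hyphens."""
    a = " ".join(f'{k.replace("_", "-")}="{v}"' for k, v in attrs.items())
    return f"<{tag} {a} />"

def group(gid, elems):
    """Wrap primitives in a named <g> so they stay editable as a unit."""
    inner = "\n  ".join(elems)
    return f'<g id="{gid}">\n  {inner}\n</g>'

# A toy figure decomposed into labeled groups (axes vs. an annotation).
axes = group("axes", [
    primitive("line", x1=10, y1=90, x2=90, y2=90, stroke="black"),
    primitive("line", x1=10, y1=90, x2=10, y2=10, stroke="black"),
])
annotation = group("arrow-annotation", [
    primitive("line", x1=40, y1=40, x2=60, y2=60, stroke="red"),
    primitive("circle", cx=60, cy=60, r=2, fill="red"),
])
svg = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">\n'
       f"{axes}\n{annotation}\n</svg>")
print(svg)
```

Because the arrow lives in its own `<g id="arrow-annotation">`, a downstream editor can move or restyle it without disturbing the axes, which is the editability a flat pixel trace cannot offer.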
- [2026-03-25] LensWalk: Agentic Video Understanding by Planning How You See in Videos 🆕NEW
- 赛道归属: 视频理解(Agentic感知-推理协同/主动取证)
- 核心创新点: 提出LensWalk框架,将“推理”与“看视频取证”闭环耦合:由LLM制定观察计划(何时看、看哪里、看多长),并按需从原始视频中主动采样证据以迭代更新推理状态。相较依赖静态预处理特征的范式,该方法把感知作为可控工具,缓解长视频的时序密集与证据缺失问题。
- 一句话总结: 该工作用“会规划的观看”把视频理解从被动读特征升级为主动取证式推理,提高长时序任务的可靠性。
- Track: Video Understanding (Agentic Perception–Reasoning / Active Evidence Seeking)
- Core innovation: Introduces LensWalk, a closed-loop agentic framework where an LLM plans how to observe a video (when/where/how long) and actively queries raw video evidence to iteratively refine reasoning. By making perception a controllable tool rather than a fixed pre-processing step, it addresses dense temporal complexity and missing-evidence issues in long videos.
- One-sentence summary: It upgrades video understanding from passive feature consumption to planned, evidence-seeking reasoning for more reliable long-horizon analysis.
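The closed loop of planning what to watch and updating the reasoning state can be sketched as a small control loop. Everything below is a toy stand-in under assumed interfaces (a planner over time windows, a perceiver returning events, a scalar confidence), not LensWalk's actual components.

```python
# Toy plan-then-observe loop: a planner picks the next video segment to
# inspect, a perceiver extracts evidence, and the loop stops once confident.

def toy_planner(state):
    """Decide the next (start_s, end_s) window to watch; None = stop."""
    if state["confidence"] >= 0.9 or not state["unseen"]:
        return None
    return state["unseen"].pop(0)

def toy_perceiver(video, segment):
    """Stand-in for sampling frames: return events inside the window."""
    start, end = segment
    return [event for t, event in video if start <= t < end]

def answer(video, question):
    """Iteratively gather evidence until the (toy) confidence is high."""
    state = {"confidence": 0.0,
             "unseen": [(0, 10), (10, 20), (20, 30)],
             "evidence": []}
    while (segment := toy_planner(state)) is not None:
        state["evidence"] += toy_perceiver(video, segment)
        state["confidence"] = min(1.0, 0.5 * len(state["evidence"]))  # toy update
    return state

video = [(5, "person enters"), (15, "picks up cup"), (25, "leaves")]
final = answer(video, "what does the person do?")
print(final["confidence"], final["evidence"])
```

Note the loop stops after two windows here: perception halts as soon as the evidence suffices, which is the efficiency argument for active observation over exhaustively pre-processing a long video.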
- [2026-03-25] UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience 🆕NEW
- 赛道归属: 多模态智能体(移动端GUI Agent/自我进化学习)
- 核心创新点: 提出两阶段自进化GUI智能体UI-Voyager:第一阶段用Rejection Fine-Tuning基于失败轨迹进行持续共进化式数据/策略改进;第二阶段针对长时序稀疏奖励下的信用分配与失败复盘,提升从“失败经验”中提炼可迁移技能的效率。相较常规模仿/强化流程,强调失败驱动的自举式学习闭环。
- 一句话总结: 该工作让GUI Agent把失败当作高价值训练信号,实现更高效的长任务学习与自我迭代。
- Track: Multimodal Agents (Mobile GUI Agents / Self-evolving Learning)
- Core innovation: Proposes UI-Voyager, a two-stage self-evolving GUI agent: (1) Rejection Fine-Tuning to continuously improve using failed trajectories via co-evolving data and policy; (2) mechanisms targeting sparse-reward, long-horizon credit assignment to better distill transferable skills from failures. It emphasizes a failure-driven bootstrapping loop beyond standard imitation/RL pipelines.
- One-sentence summary: It turns failures into a primary training signal, enabling more efficient self-improvement for long-horizon mobile GUI tasks.
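The core data-side move in rejection-style fine-tuning is to partition rollouts by outcome and keep the failures as a learning signal instead of discarding them. This is a minimal sketch of that filtering step under an assumed trajectory format; the "lesson mining" function is a hypothetical placeholder, not the paper's mechanism.

```python
# Toy failure-filtering step: split GUI-agent rollouts into successes (kept
# for imitation) and failures (mined for corrective signal).

def partition_rollouts(rollouts):
    """Binary outcome split; real systems may use graded rewards."""
    successes = [r for r in rollouts if r["reward"] > 0]
    failures = [r for r in rollouts if r["reward"] <= 0]
    return successes, failures

def mine_failure_lessons(failures):
    """Hypothetical stand-in: distill what to avoid from each failure."""
    return [{"task": f["task"], "avoid": f["actions"][-1]} for f in failures]

rollouts = [
    {"task": "open settings", "actions": ["tap menu", "tap settings"], "reward": 1},
    {"task": "open settings", "actions": ["tap search", "tap back"], "reward": 0},
]
succ, fail = partition_rollouts(rollouts)
lessons = mine_failure_lessons(fail)
print(len(succ), len(fail), lessons[0]["avoid"])
```

Under sparse long-horizon rewards the failure pool is usually far larger than the success pool, which is why extracting transferable signal from it matters.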
- [2026-03-25] Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification 🆕NEW
- 赛道归属: 多模态理解(训练自由 few-shot 分类/CLIP增强)
- 核心创新点: 提出跨模态原型对齐与混合:在训练自由(training-free)设定下,将文本原型与少量样本的图像原型进行直接混合,并从偏差-方差视角解释其收益与适用条件。通过对齐与混合策略在不微调模型的情况下更好利用视觉样本信息,提升CLIP式few-shot分类的稳健性与精度。
- 一句话总结: 该方法用“图文原型混合”的简单机制在无需训练的前提下显著增强VLM的few-shot识别能力,并给出可解释的理论视角。
- Track: Multimodal understanding (training-free few-shot classification / CLIP enhancement)
- Core innovation: Proposes cross-modal prototype alignment and direct mixing of text prototypes with few-shot image prototypes in a training-free manner, and analyzes the gains through a bias–variance lens. The alignment/mixing strategy better exploits visual evidence without fine-tuning, improving robustness and accuracy of CLIP-style few-shot classification.
- One-sentence summary: By mixing aligned image and text prototypes without training, this work delivers a simple yet effective boost to VLM few-shot classification with an interpretable bias–variance rationale.
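The mixing idea admits a very small training-free sketch: blend each class's text prototype with the mean of its few-shot image embeddings, then classify a query by cosine similarity. The toy 3-D embeddings and the fixed weight `alpha` are illustrative assumptions; the paper's embedding space, weighting, and alignment details may differ.

```python
# Training-free sketch of cross-modal prototype mixing for few-shot
# classification. Embeddings are toy 3-D vectors, not real CLIP features.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mix_prototype(text_proto, image_shots, alpha=0.5):
    """Blend a text prototype with the mean few-shot image embedding."""
    mean_img = [sum(xs) / len(xs) for xs in zip(*image_shots)]
    blended = [alpha * t + (1 - alpha) * m
               for t, m in zip(normalize(text_proto), normalize(mean_img))]
    return normalize(blended)

def classify(query, prototypes):
    """Nearest prototype by cosine similarity (vectors are unit-norm)."""
    q = normalize(query)
    return max(prototypes,
               key=lambda c: sum(a * b for a, b in zip(q, prototypes[c])))

protos = {
    "cat": mix_prototype([1.0, 0.1, 0.0], [[0.9, 0.2, 0.1], [1.1, 0.0, 0.2]]),
    "dog": mix_prototype([0.1, 1.0, 0.0], [[0.2, 0.8, 0.1], [0.0, 1.2, 0.1]]),
}
print(classify([0.95, 0.15, 0.05], protos))  # → cat
```

The bias-variance reading follows directly: the text prototype is a low-variance but possibly biased estimate of the class center, the few-shot image mean is unbiased but high-variance, and `alpha` trades the two off.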
- [2026-03-25] Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models 🆕NEW
- 赛道归属: 视频理解(ToM/心智推理评测与增强)
- 核心创新点: 聚焦“仅视频输入”的Theory of Mind能力,构建/评估多模态大模型在无文本提示下对他人信念、意图与知识状态的推断,并探讨可解释的改进路径而非黑盒打分。通过把ToM从文本迁移到纯视觉情境,暴露模型在社会因果与隐含状态推理上的关键短板。
- 一句话总结: 该工作补齐了ToM评测的纯视觉空白,推动多模态模型面向真实交互的心智推理能力提升。
- Track: Video Understanding (Theory-of-Mind Reasoning: Evaluation & Enhancement)
- Core innovation: Targets Theory of Mind under video-only inputs, evaluating and improving MLLMs’ ability to infer beliefs, intentions, and knowledge states without textual scaffolding, with attention to interpretable improvement rather than purely black-box scoring. By moving ToM from text to pure visual scenarios, it surfaces core weaknesses in social-causal and latent-state reasoning.
- One-sentence summary: It fills the gap of video-only ToM evaluation and advances multimodal models toward human-like mental-state inference in real interactions.
- [2026-03-25] Unleashing Vision-Language Semantics for Deepfake Video Detection 🆕NEW
- 赛道归属: 视频理解(深度伪造检测/多模态语义对齐)
- 核心创新点: 提出VLAForge,突破以往仅用CLIP等VLM视觉特征做深伪检测的做法,显式挖掘其潜空间中的视觉-语言语义作为判别信号,以提升跨身份、跨场景的泛化与鲁棒性。通过跨模态语义约束/对齐,将“伪造痕迹”从低层伪影提升到更可迁移的语义异常表征。
- 一句话总结: 该工作把VLM的跨模态语义真正用到深伪检测中,增强了对分布外伪造的泛化能力。
- Track: Video Understanding (Deepfake Detection / Vision-Language Semantics)
- Core innovation: Proposes VLAForge to go beyond using only VLM visual embeddings for deepfake detection by explicitly leveraging vision-language semantics in the latent space as discriminative signals, improving robustness and cross-identity generalization. Cross-modal semantic constraints elevate detection from low-level artifacts to more transferable semantic anomaly representations.
- One-sentence summary: It operationalizes VLM cross-modal semantics for deepfake detection, boosting generalization to out-of-distribution forgeries.
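One generic way to turn vision-language semantics into discriminative signal is to represent an image by its similarities to semantic text anchors instead of raw visual features. The sketch below illustrates only that general idea; the anchor phrases and toy embeddings are assumptions, not VLAForge's architecture.

```python
# Hypothetical sketch: semantic-anchor features for forgery detection.
# Toy 3-D "latent" vectors stand in for VLM encoder outputs.
import math

def cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a))
           * math.sqrt(sum(y * y for y in b)))
    return num / den

# Assumed text anchors embedded in the same latent space as images.
anchors = {
    "a pristine human face": [1.0, 0.0, 0.2],
    "a face with blended identity": [0.1, 1.0, 0.1],
    "a face with inconsistent lighting": [0.2, 0.2, 1.0],
}

def semantic_features(image_emb):
    """Map an image embedding to its similarities against the anchors;
    a downstream classifier would consume this feature vector."""
    return [cos(image_emb, a) for a in anchors.values()]

real_emb = [0.9, 0.1, 0.2]  # toy embedding of an authentic face
fake_emb = [0.1, 0.9, 0.3]  # toy embedding of a manipulated face
print(semantic_features(real_emb))
print(semantic_features(fake_emb))
```

The intuition matching the summary above: features defined in semantic space describe *what kind* of anomaly is present, so they transfer across identities and forgery methods better than low-level pixel artifacts do.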
- [2026-03-25] ViHOI: Human-Object Interaction Synthesis with Visual Priors 🆕NEW
- 赛道归属: 3D生成(人体-物体交互/动作生成)
- 核心创新点: 提出从易获取的2D图像中提取人-物交互视觉先验,并将其注入扩散式生成模型,以弥补“仅靠文本难以描述物理约束”的瓶颈。通过视觉先验提供接触关系、相对位姿与交互可行性等信息,提升3D HOI合成的物理合理性与真实感。
- 一句话总结: ViHOI用2D视觉先验替代难以完备语言描述的物理约束,让3D人-物交互生成更可信、更可控。
- Track: 3D generation (human–object interaction / motion synthesis)
- Core innovation: Extracts rich HOI visual priors from readily available 2D images and injects them into diffusion-based generative models, addressing the limitation that text alone poorly specifies physical constraints. The priors encode interaction feasibility cues (e.g., contact, relative pose), improving physical plausibility and realism in 3D HOI synthesis.
- One-sentence summary: ViHOI leverages 2D visual interaction priors to make 3D human–object interaction generation more physically plausible and controllable than text-only conditioning.
- [2026-03-25] GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization 🆕NEW
- 赛道归属: 多模态理解(图像地理定位/检索-生成融合路由)
- 核心创新点: 提出GeoRouter动态路由框架,在检索式与生成式(LVLM直接回归坐标)两种地理定位范式间按样本自适应选择/融合,利用二者互补的误差画像:检索擅长细粒度匹配、生成更具覆盖与先验。通过“范式级路由”而非单一模型堆参,实现全球尺度下精度与鲁棒性的折中最优。
- 一句话总结: 该工作用动态路由把检索与生成的优势组合起来,显著提升全球图像定位的稳定性与精度上限。
- Track: Multimodal Understanding (Image Geolocalization / Retrieval–Generation Routing)
- Core innovation: Introduces GeoRouter, a dynamic routing framework that adaptively selects or fuses retrieval-based and generation-based (LVLM coordinate prediction) paradigms per sample, exploiting their complementary error profiles—fine-grained matching vs broader coverage/priors. Paradigm-level routing, rather than scaling a single approach, yields a better accuracy–robustness trade-off worldwide.
- One-sentence summary: It combines retrieval and generation via adaptive routing to improve both robustness and peak accuracy for global image geolocalization.
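Paradigm-level routing can be sketched as a per-sample dispatcher over the two geolocalizers. The confidence-threshold heuristic and the function interfaces below are illustrative assumptions; GeoRouter's actual router is learned, not a fixed threshold.

```python
# Toy per-sample routing between a retrieval geolocalizer (precise when the
# database has a close match) and a generative one (broad coverage/prior).

def retrieval_geolocate(image):
    """Stand-in: (lat, lon) from a database match, plus match confidence."""
    return image.get("retrieval_guess"), image.get("retrieval_conf", 0.0)

def generative_geolocate(image):
    """Stand-in: an LVLM's direct coordinate prediction."""
    return image.get("lvlm_guess")

def route(image, threshold=0.7):
    """Trust fine-grained retrieval when its match is strong; otherwise
    fall back to the generative prior (the complementary error profile)."""
    guess, conf = retrieval_geolocate(image)
    return guess if conf >= threshold else generative_geolocate(image)

landmark = {"retrieval_guess": (48.858, 2.294), "retrieval_conf": 0.95,
            "lvlm_guess": (48.85, 2.35)}       # well-covered landmark
remote = {"retrieval_guess": (0.0, 0.0), "retrieval_conf": 0.1,
          "lvlm_guess": (-13.16, -72.54)}      # sparse database coverage
print(route(landmark))  # strong match → retrieval answer
print(route(remote))    # weak match → generative answer
```

A learned router would replace the threshold with a model conditioned on the image and both candidates, but the dispatch structure is the same.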
- [2026-03-25] PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks 🆕NEW
- 赛道归属: 文档理解(OCR/轻量化识别与检测)
- 核心创新点: 提出仅5M参数的PP-OCRv5,通过系统级结构与训练/后处理优化在OCR任务上逼近甚至对标十亿级VLM,同时针对复杂版面提供更精确的文本定位并降低大模型常见的文本幻觉。核心突破在于以专用小模型的工程化与任务分解,重新证明“高精度不必依赖超大统一模型”。
- 一句话总结: 该工作以极小参数量实现接近大VLM的OCR效果,为低成本、可控的工业级文字识别提供强替代方案。
- Track: Document Understanding (OCR / Lightweight Detection & Recognition)
- Core innovation: Presents PP-OCRv5, a highly optimized 5M-parameter OCR system that rivals billion-parameter VLMs on OCR, while improving precise text localization in complex layouts and reducing text hallucinations common in unified large models. The key advance is task-specialized, system-level optimization showing accuracy need not rely on massive unified architectures.
- One-sentence summary: It delivers near-VLM OCR performance with a tiny model, enabling low-cost, controllable, production-grade text recognition.
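The task decomposition behind such specialized systems is the classic detect-then-recognize pipeline: a small detector finds text-line boxes, a small recognizer transcribes each. The skeleton below illustrates that composition with placeholder stages; it is not PP-OCRv5's implementation, and the reading-order heuristic is an assumption.

```python
# Schematic two-stage OCR pipeline: small specialized detector + recognizer
# composed explicitly, instead of one large unified vision-language model.

def detect_text_regions(page):
    """Stand-in detector: return text-line boxes in top-to-bottom,
    left-to-right reading order (a simple assumed heuristic)."""
    return sorted(page["boxes"], key=lambda b: (b["y"], b["x"]))

def recognize(page, box):
    """Stand-in recognizer: transcribe one cropped box (a compact
    CRNN-style model in real lightweight OCR systems)."""
    return page["text_at"][(box["x"], box["y"])]

def ocr(page):
    """Compose the stages; explicit boxes give precise localization and
    constrain the recognizer, reducing free-form hallucination."""
    return [recognize(page, b) for b in detect_text_regions(page)]

page = {
    "boxes": [{"x": 5, "y": 40, "w": 80, "h": 10},
              {"x": 5, "y": 10, "w": 80, "h": 10}],
    "text_at": {(5, 10): "Invoice #1234", (5, 40): "Total: $56.00"},
}
print(ocr(page))  # top line first, per the reading-order sort
```

The design point this mirrors: because each stage is small and single-purpose, the system stays auditable (every output string has a source box), which is part of the hallucination-reduction argument in the summary above.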
Generated automatically by Daily AI Digest Agent · Generated at: 2026-03-26 02:30:01