AI 每日进展速报 / Daily AI Digest - 2026-02-27
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-02-23] Closing the gap in multimodal medical representation alignment 📖2 🆕NEW
- 赛道归属: 多模态医学表征学习(图文对齐/表示对齐)
- 核心创新点: 针对CLIP式对比学习在医学多模态对齐中引发的“模态鸿沟”(潜空间稀疏、语义碎片化)问题,系统分析其非预期优化行为,并提出更贴近真实语义对齐目标的对齐策略以缩小模态间分布差异。方法重点在于纠正对比损失带来的错误几何结构,从而提升跨模态语义一致性。
- 一句话总结: 通过诊断并修复对比学习导致的模态鸿沟,该工作为医学场景的可靠图文共享表征提供了更稳健的对齐路径。
- Track: Multimodal medical representation learning (image-text alignment/representation alignment)
- Core innovation: It analyzes unintended behaviors of CLIP-style contrastive objectives that create a “modality gap” (sparse/fragmented latent geometry) in medical multimodal alignment, and proposes an alignment strategy better matched to true semantic correspondence. The key is correcting the latent-space geometry induced by contrastive loss to improve cross-modal semantic consistency.
- One-sentence summary: By diagnosing and mitigating contrastive-learning-induced modality gaps, it strengthens trustworthy shared representations for medical image-text modeling.
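The “modality gap” this entry targets has a simple, standard diagnostic from the contrastive-learning literature (not this paper's specific method): the distance between the centroids of L2-normalized image and text embeddings. A minimal sketch:

```python
import numpy as np

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized image
    and text embeddings -- a common proxy for the CLIP-style modality gap."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy check: two embedding clouds offset along one axis show a clear gap.
rng = np.random.default_rng(0)
img = rng.normal(size=(64, 8)) + np.array([3.0] + [0.0] * 7)
txt = rng.normal(size=(64, 8)) - np.array([3.0] + [0.0] * 7)
gap = modality_gap(img, txt)
```

On the unit sphere the gap is bounded by 2; well-aligned encoders drive it toward 0, which is the geometric behavior the paper's alignment strategy aims to restore.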
- [2026-02-25] GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models 📖1 🆕NEW
- 赛道归属: 文生图评测与公平性(地理多样性/偏见评估)
- 核心创新点: 提出GeoDiv评测框架,利用大语言模型与视觉-语言模型对T2I生成结果进行“地理语义”层面的可解释评估,避免仅依赖人工标注数据集或表层视觉相似度指标。框架将地理多样性、刻板印象与区域表征偏差转化为可量化、可诊断的评测信号。
- 一句话总结: GeoDiv为文生图模型提供了可解释、可扩展的地理多样性与偏见评估工具,帮助系统性发现“世界表征”失真问题。
- Track: Text-to-image evaluation & fairness (geographical diversity/bias assessment)
- Core innovation: GeoDiv introduces an interpretable evaluation framework that leverages LLMs and vision-language models to assess geographic semantics in T2I outputs, moving beyond curated datasets and shallow visual-similarity metrics. It turns geographic diversity, stereotyping, and regional misrepresentation into quantifiable, diagnosable signals.
- One-sentence summary: GeoDiv enables scalable and interpretable auditing of how T2I models portray regions, exposing geographic bias and diversity failures.
- [2026-02-24] When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance 📖1 🆕NEW
- 赛道归属: 文生图安全对齐(扩散模型安全引导/多类有害冲突消解)
- 核心创新点: 指出现有安全引导将多类有害内容“平均化”成单一避让方向,无法建模不同伤害类别间的冲突与耦合;提出自适应安全引导,在生成过程中按类别动态调节引导强度与方向以解决多类别冲突。方法层面强调“按需分解+自适应融合”的安全梯度/引导策略,而非静态关键词区间。
- 一句话总结: 该工作让扩散模型在面对多类安全约束时能更精细地权衡与避险,提升安全性同时减少对正常生成质量的误伤。
- Track: Text-to-image safety alignment (diffusion safety guidance / multi-harm conflict resolution)
- Core innovation: It shows that prior safety guidance collapses multiple harm categories into an averaged avoidance direction, missing inter-category conflicts, and proposes adaptive safety guidance that dynamically adjusts per-category guidance directions/strengths during sampling. The methodological leap is conflict-aware, adaptive fusion rather than static keyword-based zones.
- One-sentence summary: It improves diffusion-model safety under multi-category constraints while better preserving benign generation quality.
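As a rough illustration of the conflict-aware idea (a hypothetical form; the paper's actual guidance rule is not reproduced here), per-category avoidance directions can be weighted by how strongly each harm category is triggered, instead of being averaged into one direction:

```python
import numpy as np

def adaptive_safety_guidance(eps_cond, eps_per_harm, harm_scores, base_scale=2.0):
    """Hypothetical sketch: weight each harm category's avoidance direction
    by how strongly the current sample activates it (harm_scores in [0, 1]),
    rather than collapsing all categories into a single averaged direction."""
    guided = eps_cond.copy()
    for eps_h, score in zip(eps_per_harm, harm_scores):
        # Push the prediction away from each harmful direction, per category.
        guided += base_scale * score * (eps_cond - eps_h)
    return guided

eps_cond = np.zeros(4)
eps_harms = [np.array([1.0, 0, 0, 0]), np.array([0, -1.0, 0, 0])]
scores = [0.8, 0.1]  # category 1 strongly triggered, category 2 barely
out = adaptive_safety_guidance(eps_cond, eps_harms, scores)
```

A benign prompt with near-zero harm scores leaves the conditional prediction essentially untouched, which is the "less collateral damage to normal generation" property the paper emphasizes.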
- [2026-02-26] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling 🆕NEW
- 赛道归属: 隐私保护图像生成(差分隐私训练/频域建模)
- 核心创新点: 提出基于小波的coarse-to-fine频域差分隐私框架,将生成建模分解到不同频段/尺度,在DP噪声注入时对高频纹理等敏感质量维度进行更精细的结构化处理,缓解DP-SGD“全参数均匀加噪”导致的纹理崩坏。核心突破在于用频谱分解重构DP训练的噪声分配与建模顺序。
- 一句话总结: 通过频域分解实现更“懂画质”的DP训练,该工作在隐私保证与图像质量之间取得更优折中。
- Track: Privacy-preserving image generation (differential privacy training / spectral modeling)
- Core innovation: It proposes a wavelet-based coarse-to-fine spectral DP framework that decomposes generation across frequency bands/scales, enabling structured noise allocation that better preserves high-frequency textures than uniform DP-SGD noise. The key advance is redesigning DP training via spectral decomposition and staged modeling.
- One-sentence summary: It delivers stronger privacy–quality trade-offs by making DP noise injection frequency-aware and generation coarse-to-fine.
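A minimal sketch of the coarse-to-fine idea, with assumed noise scales and a hand-rolled one-level Haar transform (the paper's actual wavelet setup and DP accounting are not reproduced here): decompose a signal into frequency bands, then allocate different noise magnitudes per band instead of one uniform scale as in plain DP-SGD:

```python
import numpy as np

def haar_1level(x):
    """One-level 1D Haar transform: coarse averages and fine details."""
    coarse = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return coarse, detail

def dp_noise_per_band(coarse, detail, sigma_coarse, sigma_detail, rng):
    """Illustrative band-wise Gaussian noise allocation (assumed scales);
    structured allocation is the paper's key departure from uniform noise."""
    return (coarse + rng.normal(0, sigma_coarse, coarse.shape),
            detail + rng.normal(0, sigma_detail, detail.shape))

rng = np.random.default_rng(0)
x = np.arange(8, dtype=float)
c, d = haar_1level(x)
nc, nd = dp_noise_per_band(c, d, sigma_coarse=0.01, sigma_detail=0.1, rng=rng)
```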
- [2026-02-26] PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering 🆕NEW
- 赛道归属: 多模态时间序列问答(时序模式对齐/推理训练)
- 核心创新点: 提出PATRA,通过“模式感知对齐”显式建模趋势、季节性等时序结构,避免将时间序列粗暴当作文本/图像输入;并通过“平衡推理”训练机制抑制简单任务目标的主导效应,促使模型学习更深层的逻辑推理能力。方法突破在于把时序模式表征与训练目标配比共同纳入可控优化。
- 一句话总结: PATRA让LLM在时间序列QA中既看得懂模式又推得动逻辑,提升复杂问题的可靠性。
- Track: Multimodal time-series QA (pattern-aware alignment / reasoning-oriented training)
- Core innovation: PATRA introduces pattern-aware alignment to explicitly encode trends/seasonality rather than treating time series as plain text/images, and a balanced-reasoning training scheme to prevent easy objectives from dominating and suppressing deep reasoning. The advance is jointly controlling time-series structure modeling and objective balance.
- One-sentence summary: It improves time-series QA by making models both pattern-literate and reasoning-robust on harder queries.
- [2026-02-26] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval 🆕NEW
- 赛道归属: 组合式图像检索(零样本CIR/免训练适配)
- 核心创新点: 提出免训练的WISER ZS-CIR框架,不再将“参考图+修改文本”强行折叠为单一模态,而是进行更宽的候选搜索与更深的语义推断,并对T2I式与I2I式检索信号进行自适应融合。方法论突破在于以“多路径检索+自适应融合”同时保留细粒度视觉细节与文本修改意图。
- 一句话总结: WISER在无需三元组训练数据的前提下显著增强组合检索的鲁棒性与可用性。
- Track: Composed image retrieval (zero-shot CIR / training-free adaptation)
- Core innovation: WISER avoids collapsing (reference image + edit text) into a single modality by performing wider candidate search, deeper semantic inference, and adaptive fusion of T2I-style and I2I-style retrieval signals. The key is multi-route retrieval with adaptive fusion to preserve both fine visual details and edit intent.
- One-sentence summary: It makes zero-shot composed image retrieval more robust and practical without triplet training.
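The adaptive-fusion idea can be illustrated with a toy re-ranking sketch (hypothetical weighting; WISER's actual fusion rule is richer): keep the text-route and image-route similarities separate, then blend them per query:

```python
import numpy as np

def fuse_retrieval_scores(sim_text, sim_image, confidence):
    """Hypothetical adaptive fusion: weight the text-side (edit intent) and
    image-side (visual detail) similarities per query, rather than collapsing
    both modalities into a single embedding before retrieval."""
    alpha = confidence  # e.g. how specific the modification text is, in [0, 1]
    return alpha * sim_text + (1.0 - alpha) * sim_image

# Candidate gallery scored by both routes; fusion re-ranks the candidates.
sim_t = np.array([0.9, 0.2, 0.5])   # how well each candidate matches the edit text
sim_i = np.array([0.1, 0.8, 0.5])   # how well each matches the reference image
ranked = np.argsort(-fuse_retrieval_scores(sim_t, sim_i, confidence=0.7))
```

With a text-heavy weighting the first candidate wins despite weak visual similarity; lowering `confidence` would let the image route dominate instead.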
- [2026-02-26] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis 🆕NEW
- 赛道归属: 图像对齐与配准(扩散模型视角合成/对齐增强)
- 核心创新点: 提出DMAligner,用扩散模型进行“面向对齐的视图合成”来替代/补充传统光流扭曲,在遮挡与光照变化下生成更一致的对齐结果。核心突破是将对齐问题转化为条件生成的视图重建,通过生成式先验提升对齐质量与下游稳定性。
- 一句话总结: DMAligner用生成式视图合成绕开光流在复杂场景的脆弱性,提升图像对齐的视觉质量与可靠性。
- Track: Image alignment/registration (diffusion-based view synthesis for alignment)
- Core innovation: DMAligner reframes alignment as alignment-oriented view synthesis with diffusion models, mitigating optical-flow warping failures under occlusion and illumination changes. The advance is leveraging generative priors via conditional view reconstruction to improve alignment fidelity and downstream robustness.
- One-sentence summary: It boosts alignment quality in challenging conditions by replacing brittle warping with diffusion-based synthesized aligned views.
- [2026-02-26] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models 🆕NEW
- 赛道归属: 多模态理解与可解释性(VLM OCR信息路由/因果分析)
- 核心创新点: 通过因果干预定位VLM中OCR信息进入语言流的关键瓶颈:对比原图与“文本抹除/修补”图像的激活差异,系统刻画不同架构(Qwen3-VL、Phi-4、InternVL3.5)中OCR路由的主导层/模块位置。方法突破在于用可操作的反事实输入与激活差分,给出架构相关的可解释“路由瓶颈”诊断。
- 一句话总结: 该工作把VLM“读字能力”从黑箱变为可定位的系统瓶颈,为OCR能力增强与失效排查提供了直接抓手。
- Track: Multimodal interpretability (VLM OCR routing / causal analysis)
- Core innovation: Using causal interventions, it locates where OCR information is routed into the language stream by comparing activation differences between original images and text-inpainted counterfactuals, across multiple VLM families. The key advance is an actionable, architecture-specific diagnosis of OCR bottlenecks via counterfactual activation analysis.
- One-sentence summary: It turns VLM OCR from a black box into identifiable routing bottlenecks, enabling targeted improvements and debugging.
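The counterfactual activation-differencing idea can be sketched in toy form (hypothetical activations; the paper operates on real VLM layers): the layer whose activations change most when the image's text is inpainted away is a candidate routing bottleneck:

```python
import numpy as np

def locate_routing_layer(acts_original, acts_inpainted):
    """Counterfactual activation differencing: compare per-layer activations
    for the original image vs a text-erased counterfactual, and flag the
    layer with the largest change as a candidate OCR routing bottleneck."""
    diffs = [float(np.linalg.norm(a - b))
             for a, b in zip(acts_original, acts_inpainted)]
    return int(np.argmax(diffs)), diffs

# Toy activations for 4 layers; layer 2 reacts strongly to text removal.
orig = [np.ones(8) * s for s in (1.0, 1.0, 5.0, 1.2)]
inp  = [np.ones(8) * s for s in (1.0, 0.9, 1.0, 1.1)]
layer, diffs = locate_routing_layer(orig, inp)
```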
- [2026-02-26] From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models 🆕NEW
- 赛道归属: 多模态大模型训练(诊断驱动迭代训练/RL数据闭环)
- 核心创新点: 提出DPE(Diagnostic-driven Progressive Evolution),用诊断评测暴露能力盲点,并据此动态生成/筛选训练信号与强化反馈,替代静态数据与固定训练配方。方法突破在于把“测试驱动的错误暴露—针对性纠错—迭代进化”做成闭环训练流程,实现更定向的能力补齐。
- 一句话总结: DPE让多模态大模型训练从“堆数据”走向“按盲点进化”,更高效地提升复杂推理与决策能力。
- Track: Large multimodal model training (diagnostic-driven iterative training / RL data loop)
- Core innovation: DPE (Diagnostic-driven Progressive Evolution) uses diagnostics to surface capability blind spots and then dynamically curates training signals and reinforcement feedback, replacing static datasets and fixed recipes. The methodological leap is a closed-loop “test-driven error exposure → targeted correction → iterative evolution” pipeline.
- One-sentence summary: It makes LMM training more efficient and targeted by continuously evolving data and feedback around diagnosed weaknesses.
- [2026-02-26] PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning 🆕NEW
- 赛道归属: 图像编辑(智能体式编辑/自动审美规划)
- 核心创新点: 提出PhotoAgent,将指令式编辑升级为“可探索的审美规划+执行”的agent流程:自动分解编辑目标、规划步骤与顺序,并在视觉审美维度上进行探索式方案选择,减少用户对高质量prompt与流程编排的依赖。方法突破在于显式引入审美计划层,把多步编辑从单轮指令变为可迭代的策略搜索与工具调用。
- 一句话总结: PhotoAgent把修图从“会用指令”变成“会做规划”,显著降低高质量图像编辑的门槛并提升一致性。
- Track: Image editing (agentic editing / aesthetic planning)
- Core innovation: PhotoAgent turns instruction-based editing into an agent workflow with explicit, exploratory aesthetic planning: it autonomously decomposes goals, sequences multi-step edits, and explores aesthetic alternatives, reducing reliance on carefully crafted user prompts. The key advance is adding a planning layer that enables iterative strategy search and tool/model execution.
- One-sentence summary: It lowers the barrier to high-quality photo editing by making the system plan and execute edits autonomously rather than relying on user-crafted instructions.
GitHub
- [2026-02-27] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐8209
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-02-27] Dreamy-rain/gemini-business2api ⭐879
OpenAI-compatible API for Gemini Business with multi-account load balancing and image generation | 将 Gemini Business 转为 OpenAI 兼容接口,支持多账户负载均衡与图像生成、视频生...
- [2026-02-27] etkecc/baibot ⭐191
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
- [2026-02-27] ramanujammv1988/edge-veda ⭐67
On-device AI SDK for Flutter — LLM inference, vision, STT, TTS, image generation, embeddings, RAG, and function calling. Metal GPU on iOS/macOS.
- [2026-02-26] erroralex/Latent-Library ⭐55 🆕NEW
A local-first, high-performance desktop asset manager for AI image generations. Features universal metadata parsing (ComfyUI/A1111), instant SQLite se...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-02-24] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models 📖1
- 赛道归属: 视频生成安全(Image-to-Video 越狱/对抗攻击)
- 核心创新点: 揭示I2V模型存在“视觉指令跟随”带来的新型攻击面:攻击者可在参考图像中嵌入隐式视觉指令,从而在不依赖文本提示的情况下诱导视频生成产生恶意/违规意图。提出并系统化“Visual Instruction Injection”威胁范式,用于评估与触发此类跨模态注入式越狱风险。
- 一句话总结: 该工作把I2V模型的安全问题从“文本提示注入”扩展到“图像指令注入”,为视频生成模型的红队评测与防护提供了新的关键基准方向。
- Track: Image-to-video generation safety (jailbreak / visual instruction injection)
- Core innovation: It exposes a new attack surface created by the visual instruction-following behavior of I2V models: implicit visual instructions embedded in the reference image can induce malicious or policy-violating video generation without any adversarial text prompt. The work formalizes this “Visual Instruction Injection” threat paradigm for evaluating and triggering such cross-modal injection jailbreaks.
- One-sentence summary: This work extends I2V safety concerns from text-prompt injection to image-level instruction injection, establishing a key new direction for red-teaming and defending video generation models.
- [2026-02-26] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation 🆕NEW
- 赛道归属: 医学视频生成(扩散模型/可控生成)
- 核心创新点: 提出基于扩散模型的ColoDiff,将“动态一致性”与“内容感知”的生成目标显式结合,以在复杂肠道形变、病灶多样性与多成像模态下同时保证时序稳定与临床属性可控。通过面向结肠镜场景的结构/内容约束设计,提升生成视频的可诊断信息密度与可控性。
- 一句话总结: 在数据稀缺的临床场景中,ColoDiff以更强的时序一致性与属性控制能力生成高质量结肠镜视频,提升医学数据合成的实用价值。
- Track: Medical video generation (diffusion / controllable generation)
- Core innovation: ColoDiff is a diffusion-based framework that explicitly integrates dynamic (temporal) consistency with content awareness, targeting colonoscopy-specific challenges such as irregular anatomy, diverse lesions, and multi-modality imaging while enabling controllable clinical attributes. It introduces scene-tailored structure/content constraints to improve both stability over time and clinically meaningful control.
- One-sentence summary: ColoDiff makes synthetic colonoscopy videos more temporally consistent and clinically controllable, improving practical medical data augmentation under scarcity.
- [2026-02-26] Uni-Animator: Towards Unified Visual Colorization 🆕NEW
- 赛道归属: 图像/视频上色(草图到图像/视频的生成与编辑,DiT)
- 核心创新点: 提出基于Diffusion Transformer的统一框架,同时覆盖图像与视频草图上色,并针对单/多参考的颜色迁移不准与细节丢失,引入“视觉参考增强”以提升参考信息的可用性与高频细节保真。面向视频进一步强化时序一致性,降低大运动场景中的闪烁与运动伪影。
- 一句话总结: Uni-Animator用一个统一的DiT框架把图像与视频草图上色打通,在参考上色精度、细节与时序稳定性上更均衡。
- Track: Image & video colorization (sketch-to-image/video, DiT-based generation/editing)
- Core innovation: Uni-Animator unifies sketch colorization for both images and videos with a Diffusion Transformer backbone, and improves inaccurate color transfer (single/multi-reference) and high-frequency detail preservation via visual reference enhancement. For videos, it explicitly strengthens temporal coherence to reduce flicker and motion artifacts in large-motion scenes.
- One-sentence summary: Uni-Animator provides a single DiT framework that jointly handles image/video sketch colorization with better reference transfer, detail fidelity, and temporal stability.
- [2026-02-26] The Trinity of Consistency as a Defining Principle for General World Models 🆕NEW
- 赛道归属: 世界模型(视频生成驱动的物理一致性与推理框架)
- 核心创新点: 提出“Consistency Trinity(三位一致性)”作为通用世界模型的定义性原则,用以统一刻画数据驱动视频生成在物理规律学习、可模拟性与可推理性上的关键约束。该工作从原则层面连接视频生成扩展规律与统一多模态模型(UMM)的架构趋势,为评测与设计世界模型提供可操作的理论坐标系。
- 一句话总结: 该工作用“三位一致性”把世界模型的目标从“能生成”提升到“物理一致、可复现、可推理”的可检验标准。
- Track: World models (video-generation-driven physical consistency & reasoning principles)
- Core innovation: It proposes the “Trinity of Consistency” as a defining principle for general world models, offering a unified lens to formalize the key constraints needed for learning, simulating, and reasoning about physical laws from data-driven video generation. The principle-level framing connects scaling in video generators with the emerging Unified Multimodal Model paradigm, guiding both design and evaluation.
- One-sentence summary: The paper elevates world-model goals into a testable consistency-based standard beyond mere video realism, targeting physically grounded simulation and reasoning.
- [2026-02-26] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video 🆕NEW
- 赛道归属: 4D体积视频重建与表示(Gaussian Splatting/视频编码友好)
- 核心创新点: 提出PackUV,用“Packed Gaussian UV Maps”将4D高斯表示映射到可打包的UV空间,在长序列、大运动与遮挡/显隐变化下提升时序一致性与稳定性。关键突破在于让高斯体积视频输出与传统视频编码/传输管线兼容,从表示层面解决存储与流式分发的落地障碍。
- 一句话总结: PackUV把4D高斯体积视频变得更稳定且更“可编码”,推动体积视频从研究走向可规模化存储与传输。
- Track: 4D volumetric video representation & reconstruction (Gaussian splatting, codec-friendly)
- Core innovation: PackUV introduces Packed Gaussian UV Maps, mapping 4D Gaussian representations into a packed UV space to improve robustness on long sequences, large motions, and disocclusions while enhancing temporal consistency. A key contribution is making outputs compatible with conventional video coding pipelines, addressing practical storage/streaming constraints at the representation level.
- One-sentence summary: PackUV turns 4D Gaussian volumetric video into a more temporally stable and codec-compatible format, enabling scalable storage and streaming.
- [2026-02-26] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models 🆕NEW
- 赛道归属: 世界模型/视频生成(相机控制 + 长时记忆一致性)
- 核心创新点: 提出UCM,通过“时间感知的位置编码扭曲(Time-aware Positional Encoding Warping)”在生成过程中统一相机控制与记忆机制,使模型在场景被重复访问时保持长期内容一致,并能根据用户输入实现更精确的相机运动控制。相较显式3D重建或直接复用历史帧的方法,该方案在开放场景与细粒度结构上兼顾灵活性与一致性。
- 一句话总结: UCM用时间感知的位置编码变换把“可控相机运动”和“可记忆的世界一致性”统一到同一生成机制中。
- Track: World models / video generation (camera control + long-term memory consistency)
- Core innovation: UCM unifies camera control and memory via time-aware positional encoding warping, enabling precise user-driven camera motion while maintaining long-term content consistency when revisiting scenes. Compared with explicit 3D reconstruction or frame-reuse strategies, it aims to preserve flexibility in unbounded settings and fine-grained structures without sacrificing consistency.
- One-sentence summary: UCM provides a unified mechanism to achieve both controllable camera trajectories and persistent scene memory in generative world models.
- [2026-02-26] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation 🆕NEW
- 赛道归属: 文生视频对齐优化(动态空间关系/偏好优化)
- 核心创新点: 提出SPATIALALIGN作为自我改进框架,面向文本提示中的“动态空间关系(DSR)”对T2V模型进行对齐增强。方法上采用零阶正则化的DPO进行微调,并设计基于几何的DSR-SCORE作为可优化的反馈信号,从而在不依赖昂贵标注的情况下提升空间关系随时间变化的正确性。
- 一句话总结: SPATIALALIGN用几何评分+偏好优化,让文生视频更可靠地遵守“物体之间如何运动与相对位置如何变化”的文本约束。
- Track: Text-to-video alignment (dynamic spatial relationships / preference optimization)
- Core innovation: SPATIALALIGN is a self-improvement framework that aligns T2V models to Dynamic Spatial Relationships (DSR) expressed in prompts. It fine-tunes models with a zeroth-order regularized Direct Preference Optimization objective and introduces DSR-SCORE, a geometry-based metric that provides optimization feedback to improve temporally evolving spatial correctness without heavy annotation.
- One-sentence summary: SPATIALALIGN improves T2V faithfulness to dynamic spatial constraints using geometry-driven scoring and preference-based fine-tuning.
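A toy, hypothetical stand-in for a geometry-based DSR-style score (the paper's DSR-SCORE is more elaborate): the fraction of frames in which object centroids satisfy the stated spatial relation:

```python
def dsr_score(traj_a, traj_b, relation="left_of"):
    """Toy geometry-based score in the spirit of a DSR-style metric
    (hypothetical form): fraction of frames where object A's centroid
    satisfies the stated spatial relation w.r.t. object B's centroid."""
    ok = 0
    for (ax, ay), (bx, by) in zip(traj_a, traj_b):
        if relation == "left_of":
            ok += ax < bx
        elif relation == "above":
            ok += ay < by  # image coordinates: smaller y is higher
    return ok / len(traj_a)

# A starts right of B but ends left of it: half the frames satisfy "left_of".
a = [(5, 0), (3, 0), (1, 0), (0, 0)]
b = [(2, 0), (2, 0), (2, 0), (2, 0)]
score = dsr_score(a, b)
```

A differentiable or sampled version of such a score is the kind of annotation-free feedback signal a preference-optimization loop can consume.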
- [2026-02-26] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache 🆕NEW
- 赛道归属: 推理优化(扩散模型采样加速/缓存)
- 核心创新点: 提出DPCache,将扩散去噪过程视为“路径规划”,利用对整条去噪轨迹全局结构的建模来决定缓存复用/预测策略,而非仅用固定或局部自适应步长。该方法训练无关(training-free),在保持生成质量的同时减少多步采样的计算开销,适用于图像与视频扩散推理加速。
- 一句话总结: DPCache以“全局轨迹视角”改造缓存加速策略,在无需训练的前提下更高效地加速扩散采样。
- Track: Inference optimization (diffusion sampling acceleration / caching)
- Core innovation: DPCache reframes diffusion denoising as path planning, leveraging the global structure of the denoising trajectory to guide cache reuse/prediction rather than relying on fixed or purely local adaptive schedules. It is training-free and reduces multi-step sampling cost while preserving generation quality for both image and video diffusion models.
- One-sentence summary: DPCache accelerates diffusion inference more effectively by making cache decisions with a global trajectory-aware strategy, without any additional training.
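A toy version of the global, trajectory-aware cache decision (assumed inputs; not DPCache's actual policy): given change magnitudes along the whole denoising trajectory, spend the recompute budget where change is largest instead of on a fixed or locally adaptive schedule:

```python
import numpy as np

def plan_cache_steps(step_deltas, budget):
    """Hypothetical global planning: recompute the model at the `budget`
    steps with the largest feature change over the full trajectory and
    reuse the cache elsewhere -- a global decision over the whole path,
    unlike fixed-interval or purely local caching rules."""
    recompute = set(np.argsort(-np.asarray(step_deltas))[:budget])
    return [("recompute" if i in recompute else "reuse")
            for i in range(len(step_deltas))]

# Early denoising steps change a lot; late steps barely change.
deltas = [0.9, 0.7, 0.4, 0.1, 0.05]
plan = plan_cache_steps(deltas, budget=2)
```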
- [2026-02-26] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model 🆕NEW
- 赛道归属: 3D场景生成/新视角合成(稀疏输入 + 扩散先验)
- 核心创新点: 提出BetterScene,利用在海量视频上预训练的Stable Video Diffusion作为强先验,在推理阶段对极稀疏、非受控照片输入进行表示对齐的生成式补全,以提升NVS的视角一致细节恢复并抑制伪影。其方法论重点在于将“视频扩散的时空先验”与3D/NVS表示进行对齐,从而在不增加采集成本的情况下增强真实场景泛化。
- 一句话总结: BetterScene把大规模视频扩散模型的先验迁移到稀疏照片的新视角合成中,显著改善真实场景的细节与一致性。
- Track: 3D scene synthesis / novel view synthesis (sparse inputs + diffusion prior)
- Core innovation: BetterScene leverages the production-scale Stable Video Diffusion prior and performs representation-aligned generative refinement at inference time for extremely sparse, unconstrained photos, improving view-consistent detail recovery and reducing artifacts in NVS. The key methodological idea is aligning spatiotemporal diffusion priors with 3D/NVS representations to boost real-scene generalization without extra capture.
- One-sentence summary: BetterScene transfers large-scale video diffusion priors to sparse-photo NVS, yielding more consistent details and fewer artifacts in real-world scenes.
- [2026-02-25] Flow Matching is Adaptive to Manifold Structures 🆕NEW
- 赛道归属: 生成建模理论(Flow Matching/连续归一化流)
- 核心创新点: 从理论与机制层面论证Flow Matching对数据“流形结构”具有自适应性:在高维但低维流形集中的数据分布下,学习到的速度场/ODE采样会自然贴合流形几何,从而解释其训练稳定性与经验性能优势。该工作为在图像/视频等流形数据上的流式生成提供更坚实的理论依据与方法选择指导。
- 一句话总结: 该工作解释了为何Flow Matching在流形数据上更“顺着数据几何走”,为替代扩散的生成路线提供理论支撑。
- Track: Generative modeling theory (flow matching / continuous-time generative models)
- Core innovation: The work argues that flow matching is inherently adaptive to manifold structure: when high-dimensional data concentrate near low-dimensional manifolds, the learned velocity field and ODE sampling tend to align with manifold geometry, helping explain improved stability and empirical performance. This provides theoretical grounding for applying flow-based generative modeling to manifold-structured data such as images and videos.
- One-sentence summary: It offers a principled explanation for flow matching’s effectiveness on manifold data, strengthening the case for flow-based alternatives to diffusion.
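For reference, the standard conditional flow matching objective with linear interpolation paths (from the flow-matching literature, not introduced by this paper) whose learned velocity field the analysis concerns:

```latex
% Linear interpolation path between noise and data samples
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim p_{\text{noise}},\; x_1 \sim p_{\text{data}}

% Conditional flow matching regression objective
\mathcal{L}_{\text{CFM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x_0,\, x_1}
  \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2

% Sampling integrates the learned ODE from noise to data
\frac{\mathrm{d}x}{\mathrm{d}t} = v_\theta(x, t), \qquad t: 0 \to 1
```

The paper's claim is that when $p_{\text{data}}$ concentrates near a low-dimensional manifold, $v_\theta$ and the resulting ODE trajectories adapt to that manifold geometry.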
GitHub
- [2026-02-26] hao-ai-lab/FastVideo ⭐3105 🆕NEW
A unified inference and post-training framework for accelerated video generation.
- [2026-02-26] NVlabs/LongLive ⭐1069 🆕NEW
[ICLR 2026] LongLive: Real-time Interactive Long Video Generation
- [2026-02-26] marcelo-earth/generative-manim ⭐800 🆕NEW
🎨 GPT for video generation ⚡️
- [2026-02-27] YouMind-OpenLab/awesome-seedance-2-prompts ⭐164
🎬 400+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-02-26] princepainter/ComfyUI-PainterNodes ⭐68 🆕NEW
A comprehensive ComfyUI toolkit for video generation, image editing, and audio-driven lip‑sync, featuring Flux, LTXV, Wan2.2 and advanced batch workfl...
语言大模型 / Large Language Models
arXiv
- [2026-02-22] Evaluating the Reliability of Digital Forensic Evidence Discovered by Large Language Model: A Case Study 📖2 🆕NEW
- 赛道归属: 法证AI与LLM可靠性评估(数字取证/证据验证)
- 核心创新点: 提出一套端到端的结构化框架,将“自动化取证工件提取—LLM驱动的语义分析与精炼—基于数字取证知识图谱(DFKG)的交叉验证”串联起来,用可验证的知识约束来评估并提升LLM发现证据的可信度。通过在大规模真实取证镜像数据上实验,系统化暴露LLM在证据发现中的错误模式与可校验边界。
- 一句话总结: 用知识图谱验证把LLM取证从“可用”推进到“可审计、可追责”的证据级可靠性评估范式。
- Track: Forensic AI & LLM reliability evaluation (digital forensics/evidence validation)
- Core innovation: Proposes an end-to-end structured framework that links automated artifact extraction, LLM-based refinement, and cross-validation via a Digital Forensic Knowledge Graph (DFKG), using verifiable knowledge constraints to assess and improve the trustworthiness of LLM-discovered evidence. Experiments on large real forensic images characterize failure modes and the practical checkability limits of LLM outputs.
- One-sentence summary: Moves LLM-assisted forensics toward auditable, accountable evidence discovery by grounding validation in a forensic knowledge graph.
- [2026-02-24] An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems 📖1 🆕NEW
- 赛道归属: LLM评测与可靠性(学术问答/错误分析框架)
- 核心创新点: 构建并验证一套面向“学术问答系统”的专家级错误分类与标注schema,强调领域专家在真实科研语境下对错误的判定维度(而非仅依赖自动指标)。该schema将错误类型与可操作的诊断信号对齐,支持更细粒度地定位幻觉、证据使用不当、推断越界等问题。
- 一句话总结: 为学术场景的LLM问答提供了可复用的“专家视角”错误本体,提升可靠性评估的可解释性与可行动性。
- Track: LLM evaluation & reliability (scholarly QA / error taxonomy)
- Core innovation: Develops and validates an expert-driven error evaluation schema for scholarly QA that captures how domain experts judge failures in real scientific contexts, beyond automated metrics. The schema aligns error categories with actionable diagnostic cues to pinpoint hallucinations, misuse of evidence, and overreaching inference.
- One-sentence summary: Provides a reusable expert-grounded error ontology that makes scholarly LLM QA evaluation more interpretable and actionable.
- [2026-02-26] Mitigating Legibility Tax with Decoupled Prover-Verifier Games 🆕NEW
- 赛道归属: 推理可验证性与对齐训练(Prover-Verifier/可检查输出)
- 核心创新点: 针对prover-verifier训练带来的“可读性税(legibility tax)”——可检查性提升但准确率下降——提出“解题器(solver)与翻译器(translator)解耦”的博弈训练:先固定追求正确性的solver,再训练translator将其解转写为更易被弱验证器检查的形式,从而把“正确性优化”与“可检查性约束”分离。
- 一句话总结: 通过解耦式翻译器把答案变得更可检而不牺牲正确性,为可验证推理提供了更稳健的训练路径。
- Track: Verifiable reasoning & alignment training (prover–verifier / checkable outputs)
- Core innovation: Addresses the “legibility tax” in prover–verifier games—improved checkability but reduced accuracy—by decoupling correctness from checkability: keep a correctness-optimized solver fixed, and train a translator that rewrites solutions into forms easier for a weaker verifier to check, separating objectives cleanly.
- One-sentence summary: A decoupled translator mitigates the accuracy–checkability tradeoff, enabling more reliable verifiable reasoning training.
- [2026-02-26] Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive 🆕NEW
- 赛道归属: AI对齐与AI治理理论(规范响应性/代理性边界)
- 核心创新点: 从形式化角度论证:以优化为核心的系统(特别是RLHF训练的LLM)在架构层面无法满足“对规范(norms)作出响应”的必要条件,并给出“真正代理性”所需的两条充要架构条件,从而将“能否被规范治理”从经验问题提升为结构性限制问题。
- 一句话总结: 提供了对RLHF类优化系统可治理性的结构性否定论证,为高风险部署中的规范与责任边界划定理论红线。
- Track: AI alignment & governance theory (norm-responsiveness / agency limits)
- Core innovation: Formally argues that optimization-based systems—especially RLHF-trained LLMs—cannot be norm-responsive due to architectural constraints, and specifies two jointly sufficient and necessary architectural conditions for genuine agency, reframing governability as a structural limitation rather than an empirical tuning issue.
- One-sentence summary: Establishes a theoretical red line on the governability of RLHF-style systems, clarifying responsibility and norm-compliance limits in high-stakes deployment.
- [2026-02-26] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models 🆕NEW
- 赛道归属: 推理优化与系统加速(KV Cache量化/硬件感知)
- 核心创新点: 提出InnerQ:一种无需调参/无需微调的KV cache量化方案,显式做硬件感知设计,以降低长序列解码时KV缓存的内存与带宽压力,并以“降低解码延迟”为直接优化目标(而非仅追求信息保真)。方法在不显著损伤生成质量的前提下提升端侧/推理服务的吞吐与时延表现。
- 一句话总结: 用硬件感知、免调参的KV缓存量化把长上下文推理的主要瓶颈从内存侧显著缓解。
- Track: Inference optimization & systems acceleration (KV-cache quantization / hardware-aware)
- Core innovation: Introduces InnerQ, a tuning-free KV-cache quantization method designed with hardware characteristics in mind, targeting decode latency directly by reducing KV memory/bandwidth pressure in long-context decoding rather than only preserving information fidelity. It improves throughput/latency with minimal quality degradation.
- One-sentence summary: A hardware-aware, tuning-free KV-cache quantization approach that meaningfully alleviates long-context decoding bottlenecks.
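The general idea of tuning-free KV-cache quantization can be sketched as data-derived per-channel asymmetric quantization with no calibration pass (an illustrative scheme, not InnerQ's exact design):

```python
import numpy as np

def quantize_kv(kv: np.ndarray, bits: int = 8):
    """Tuning-free per-channel asymmetric quantization for a KV tensor of
    shape (tokens, channels): scale and zero-point come straight from the
    data, with no calibration or fine-tuning pass."""
    qmax = 2 ** bits - 1
    lo = kv.min(axis=0, keepdims=True)
    hi = kv.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / qmax, 1.0)
    q = np.clip(np.round((kv - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(1)
kv = rng.normal(size=(16, 4)).astype(np.float32)
q, scale, lo = quantize_kv(kv)
err = np.abs(dequantize_kv(q, scale, lo) - kv).max()
```

Storing `q` as uint8 instead of float16/float32 is what cuts KV memory and bandwidth, which is the decode-latency lever the paper targets.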
- [2026-02-26] SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation 🆕NEW
- 赛道归属: 科学大模型评测(单细胞生物推理/知识增强评估)
- 核心创新点: 提出SC-Arena:面向单细胞生物学推理的自然语言基准,避免碎片化任务与选择题式评测,转向更贴近科研使用的对话/开放式回答;并引入知识增强的评估机制,使评分与生物学事实、可解释依据对齐,从而更能区分“会说”与“会推理/会用证据”。
- 一句话总结: 用更真实的自然语言任务形态与知识增强评分,补齐单细胞领域LLM评测的可解释性与科学有效性。
- Track: Scientific LLM benchmarking (single-cell reasoning / knowledge-augmented evaluation)
- Core innovation: Presents SC-Arena, a natural-language benchmark for single-cell biology reasoning that moves beyond fragmented, multiple-choice evaluations toward realistic open-ended interactions, coupled with knowledge-augmented evaluation to ground scoring in biological facts and interpretable evidence use.
- One-sentence summary: Improves scientific validity and interpretability of single-cell LLM evaluation via realistic NL tasks and knowledge-grounded scoring.
- [2026-02-26] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models 🆕NEW
- 赛道归属: 理论分析与训练机制(微调-上下文学习冲突/线性注意力)
- 核心创新点: 在“线性注意力模型”设定下对“微调会损害in-context learning(上下文学习)”进行理论刻画,分析微调提升零样本能力的同时为何会削弱对未见任务的示例学习能力,并给出可检验的条件/机制解释,为设计兼顾零样本与ICL的训练策略提供理论依据。
- 一句话总结: 从理论上解释并界定“微调遗忘ICL”的根因,为更稳健的下游适配方法提供可推导的指导。
- Track: Theory & training dynamics (fine-tuning vs in-context learning / linear attention)
- Core innovation: Provides a theoretical characterization (under linear attention models) of why fine-tuning that boosts zero-shot performance can degrade in-context learning on unseen tasks, offering mechanistic, testable conditions that explain the tradeoff and inform training designs that preserve both capabilities.
- One-sentence summary: Clarifies the root causes of “ICL forgetting” under fine-tuning, giving principled guidance for more robust adaptation.
- [2026-02-26] ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering 🆕NEW
- 赛道归属: LLM智能体工程(软件工程Agent架构/状态管理)
- 核心创新点: 提出ESAA事件溯源(Event Sourcing)架构,将智能体的“状态”从LLM上下文中剥离为可持久化、可回放的事件日志,并以确定性执行层承接工具调用与环境交互,缓解长程任务中的上下文退化与不可复现问题。该设计把概率生成与确定性工作流解耦,提升可调试性、可审计性与可靠执行。
- 一句话总结: 用事件溯源把LLM软件工程智能体变成“可回放、可审计、可工程化”的长期运行系统。
- Track: LLM agent engineering (software engineering agents / state management)
- Core innovation: Proposes ESAA, an event-sourcing architecture that externalizes agent state into persistent, replayable event logs and routes tool use/environment interaction through a deterministic execution layer, mitigating long-horizon context degradation and irreproducibility. It decouples probabilistic generation from deterministic workflows for better debugging and auditability.
- One-sentence summary: Turns LLM-based SE agents into replayable, auditable, engineering-grade long-running systems via event sourcing.
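The event-sourcing pattern the paper builds on can be sketched in a few lines (hypothetical event names): state lives in an append-only log rather than the LLM context, and a deterministic reducer reconstructs it by replay:

```python
import json

class EventLog:
    """Minimal event-sourcing sketch: agent state is an append-only event
    log, and any state snapshot is rebuilt by a deterministic fold over
    the log -- making runs persistable, replayable, and auditable."""
    def __init__(self):
        self.events = []

    def append(self, kind, payload):
        self.events.append({"kind": kind, "payload": payload})

    def replay(self):
        state = {"files_edited": [], "tests_passed": 0}
        for ev in self.events:  # deterministic reducer over the log
            if ev["kind"] == "file_edited":
                state["files_edited"].append(ev["payload"])
            elif ev["kind"] == "tests_passed":
                state["tests_passed"] += ev["payload"]
        return state

log = EventLog()
log.append("file_edited", "app.py")
log.append("tests_passed", 3)
serialized = json.dumps(log.events)  # persistable across sessions
state = log.replay()
```

Because the probabilistic LLM only emits events while the reducer is deterministic, any past state can be reproduced exactly from the persisted log.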
- [2026-02-26] MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations 🆕NEW
- 赛道归属: RAG评测与对话检索(多轮RAG/不可回答与指代问题)
- 核心创新点: 发布MTRAG-UN多轮RAG基准,系统覆盖多轮对话中三类开放难题:不可回答(UNanswerable)、信息不足(UNderspecified)、问题非独立(NONstandalone)及回应不清晰等,并提供配套语料以分离检索与生成误差来源。通过实验揭示现有检索与生成模块在对话语境保持、澄清与拒答策略上的薄弱点。
- 一句话总结: 用面向真实对话失败模式的基准推动多轮RAG从“能答”走向“该拒答/该追问也能做对”。
- Track: RAG evaluation & conversational retrieval (multi-turn RAG / unanswerable & underspecified queries)
- Core innovation: Releases MTRAG-UN, a multi-turn RAG benchmark targeting open challenges such as unanswerable, underspecified, and non-standalone questions and unclear responses, with paired corpora to disentangle retrieval vs generation failures. Experiments expose weaknesses in context tracking, clarification, and refusal behaviors.
- One-sentence summary: Drives multi-turn RAG toward robust real-world behavior by benchmarking when systems should clarify or abstain, not just answer.
- [2026-02-26] A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring 🆕NEW
- 赛道归属: LLM安全与监控(隐写/对抗性通信检测的形式化)
- 核心创新点: 提出面向隐写的决策论形式化,绕开传统隐写检测对“已知非隐写参考分布”的依赖,适配LLM场景中参考分布不可得的问题;并将该框架用于LLM监控,提供可量化的检测与风险度量思路,用于识别模型可能的隐蔽信道与规避监督行为。
- 一句话总结: 用决策论重建LLM隐写监控的理论地基,使“无参考分布”的隐蔽通信检测成为可定义、可度量的问题。
- Track: LLM security & monitoring (steganography / formal detection without reference distribution)
- Core innovation: Introduces a decision-theoretic formalization of steganography that avoids requiring a known non-steganographic reference distribution—an unrealistic assumption for LLM settings—and applies it to LLM monitoring to quantify and detect covert-channel behaviors that could evade oversight.
- One-sentence summary: Provides a principled foundation for measuring and detecting LLM steganography when no clean reference distribution exists.
GitHub
- [2026-02-27] sgl-project/sglang ⭐23786
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-02-27] NVIDIA/TensorRT-LLM ⭐12956
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-02-27] trpc-group/trpc-agent-go ⭐923
trpc-agent-go is a powerful Go framework for building intelligent agent systems using large language models (LLMs) and tools.
- [2026-02-27] stardomains3/oxproxion ⭐216 🆕NEW
oxproxion is a versatile and user-centric Android chat application designed to interact with various Large Language Models (LLMs). It provides a seaml...
- [2026-02-27] testtimescaling/testtimescaling.github.io ⭐88 🆕NEW
"what, how, where, and how well? a survey on test-time scaling in large language models" repository
HuggingFace Datasets
- [2026-02-26] FINAL-Bench/Metacognitive
FINAL Bench: Functional Metacognitive Reasoning Benchmark
"Not how much AI knows — but whether it knows what it doesn't know, and can fix ...
多模态大模型 / Multimodal Models
arXiv
- [2026-02-25] GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models 📖1 🆕NEW
- 赛道归属: 文生图评测与公平性(地理多样性/偏见评估)
- 核心创新点: 提出GeoDiv评测框架,利用大语言模型与视觉-语言模型对T2I生成结果进行“地理语义”层面的可解释评估,避免仅依赖人工标注数据集或表层视觉相似度指标。框架将地理多样性、刻板印象与区域表征偏差转化为可量化、可诊断的评测信号。
- 一句话总结: GeoDiv为文生图模型提供了可解释、可扩展的地理多样性与偏见评估工具,帮助系统性发现“世界表征”失真问题。
- Track: Text-to-image evaluation & fairness (geographical diversity/bias assessment)
- Core innovation: GeoDiv introduces an interpretable evaluation framework that leverages LLMs and vision-language models to assess geographic semantics in T2I outputs, moving beyond curated datasets and shallow visual-similarity metrics. It turns geographic diversity, stereotyping, and regional misrepresentation into quantifiable, diagnosable signals.
- One-sentence summary: GeoDiv enables scalable and interpretable auditing of how T2I models portray regions, exposing geographic bias and diversity failures.
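GeoDiv's actual metrics are not reproduced here; as a minimal sketch of one quantifiable diversity signal, a batch of generations can be scored by the normalized entropy of region labels assigned by a classifier or VLM (the labels below are hypothetical):

```python
import math
from collections import Counter

def geo_diversity(region_labels):
    """Normalized Shannon entropy of region labels: 0 = one region, 1 = uniform."""
    counts = Counter(region_labels)
    if len(counts) <= 1:
        return 0.0
    n = len(region_labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

# Labels a VLM might assign to images generated for a prompt like "a wedding".
print(geo_diversity(["Europe"] * 8 + ["Africa", "Asia"]))       # low diversity
print(geo_diversity(["Europe", "Africa", "Asia", "Americas"]))  # maximal diversity
```

A score near 0 indicates one region dominates the generations; near 1 indicates regions are represented about evenly.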
- [2026-02-24] Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination 📖1 🆕NEW
- Track: Multimodal agents / Imitation learning & controllable behavior generation
- Core innovation: Introduces explicit “inner speech” as a behavior-guiding signal in imitation learning to capture diverse, non-Markovian human behaviors. Enables inference-time steering by conditioning on inner speech to select different behavior modes/policies for the same task.
- One-sentence summary: Inner-speech conditioning makes imitation-learned agents more steerable and adaptable for human–AI coordination.
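As a toy sketch of inference-time steering (the behavior library, cue strings, and word-overlap matching rule are all invented, not the paper's method), an agent can select among demonstrated behavior modes by conditioning on a supplied inner-speech string:

```python
# Hypothetical behavior library: each mode pairs an inner-speech cue
# with the action sequence demonstrated under that "self-talk".
BEHAVIOR_MODES = {
    "I should clear the table first": ["pick_plate", "pick_cup", "wipe"],
    "I should help my partner carry": ["approach_partner", "hold_item"],
}

def steer(inner_speech):
    """Pick the demonstrated mode whose cue shares the most words with the input."""
    def overlap(cue):
        return len(set(cue.lower().split()) & set(inner_speech.lower().split()))
    return BEHAVIOR_MODES[max(BEHAVIOR_MODES, key=overlap)]

print(steer("help partner carry the box"))  # → ['approach_partner', 'hold_item']
```

The same task thus admits multiple policies, and the inner-speech string decides which one runs.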
- [2026-02-26] Large Multimodal Models as General In-Context Classifiers 🆕NEW
- Track: Multimodal understanding / In-context classification & benchmarking
- Core innovation: Establishes and benchmarks LMMs as general in-context classifiers for closed-world classification, challenging the prevailing “use CLIP for classification” view. Provides multi-dataset comparisons that isolate when and why LMM in-context learning works for classification.
- One-sentence summary: Shows LMMs can be strong, general-purpose in-context classifiers, expanding their practical use beyond complex reasoning tasks.
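A minimal sketch of the exemplar-driven paradigm: assembling a few-shot classification prompt in which image placeholders stand in for the visual tokens a real LMM API would receive (all names and the prompt format are illustrative, not from the paper):

```python
def build_icl_prompt(exemplars, query_ref, labels):
    """Assemble a few-shot classification prompt for a multimodal model.

    `exemplars` are (image_ref, label) pairs; the refs stand in for the
    actual image inputs an LMM would consume.
    """
    lines = [f"Classify each image as one of: {', '.join(labels)}."]
    for image_ref, label in exemplars:
        lines.append(f"Image: {image_ref}\nLabel: {label}")
    lines.append(f"Image: {query_ref}\nLabel:")
    return "\n\n".join(lines)

prompt = build_icl_prompt(
    exemplars=[("<img_cat_1>", "cat"), ("<img_dog_1>", "dog")],
    query_ref="<img_query>",
    labels=["cat", "dog"],
)
print(prompt)
```

The model's completion after the final "Label:" is taken as the closed-set prediction, with no contrastive head or fine-tuning involved.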
- [2026-02-26] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction 🆕NEW
- Track: Video understanding & generative summarization / Long-form movie synopsis generation
- Core innovation: Proposes a tool-augmented synopsis pipeline with ID-consistent, progressively abstracted generation to turn long videos into coherent narratives. External tools/modules support retrieval/alignment and consistency tracking, reducing character/entity drift and long-context failures in generic VLMs.
- One-sentence summary: Delivers more coherent long-form movie synopses by combining tool augmentation with ID-consistent progressive narrative abstraction.
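The paper's tool stack is not reproduced here; as a toy sketch of the two ideas, ID consistency can be enforced by rewriting entity mentions to canonical character IDs before clip captions are merged level by level into a synopsis (the cast registry and captions are invented):

```python
# Hypothetical cast registry mapping surface mentions to canonical character IDs.
CAST = {"the detective": "ANNA", "anna": "ANNA", "her partner": "BEN", "ben": "BEN"}

def canonicalize(caption):
    """Rewrite entity mentions to canonical IDs so clips agree on who is who."""
    out = caption.lower()
    for mention, cid in CAST.items():
        out = out.replace(mention, cid)
    return out

def progressive_abstract(clip_captions, per_level=2):
    """Merge captions pairwise, level by level, until one synopsis line remains."""
    level = [canonicalize(c) for c in clip_captions]
    while len(level) > 1:
        level = ["; ".join(level[i:i + per_level])
                 for i in range(0, len(level), per_level)]
    return level[0]

clips = ["The detective finds a clue", "Her partner calls Anna",
         "Anna chases the suspect", "Ben blocks the exit"]
print(progressive_abstract(clips))
```

In the real system each merge step would be an LLM summarization call rather than string concatenation; the sketch only shows how canonical IDs keep "the detective" and "Anna" from drifting apart across levels.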
- [2026-02-26] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy 🆕NEW
- Track: Medical multimodal / Weakly supervised microscopy vision-language alignment
- Core innovation: Develops weakly supervised vision-language modeling for human brain microscopy where paired image–text data are scarce, aligning cytoarchitectural concepts expressed in words to microscopic visual representations. Enables natural-language interfaces for interactive analysis, retrieval, and interpretation under limited supervision.
- One-sentence summary: Makes brain microscopy more “language-accessible” by learning useful vision–language coupling from weak supervision.
- [2026-02-26] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling 🆕NEW
- Track: Anomaly detection / Few-shot industrial inspection (training-free)
- Core innovation: Introduces SubspaceAD, a training-free few-shot anomaly detector that models normal data as a subspace in foundation-model feature space and scores anomalies via subspace distance/reconstruction error. Eliminates memory banks, auxiliary datasets, and multimodal tuning while leveraging strong pretrained representations.
- One-sentence summary: A minimalist, training-free subspace approach turns foundation features into an effective few-shot anomaly detection baseline.
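A minimal sketch of the subspace idea, assuming foundation-model features are already extracted: fit a low-dimensional subspace to the few normal samples via SVD, then score test features by their residual off that subspace (all dimensions and data below are synthetic):

```python
import numpy as np

def fit_subspace(normal_feats, k=2):
    """Fit a k-dim subspace (mean + top right-singular vectors) to normal features."""
    mean = normal_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:k]

def anomaly_score(feat, mean, basis):
    """Residual norm after projecting onto the normal subspace."""
    centered = feat - mean
    residual = centered - basis.T @ (basis @ centered)
    return float(np.linalg.norm(residual))

rng = np.random.default_rng(0)
# Hypothetical foundation-model features: normal samples lie near a 2-D plane.
normal = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 16))
mean, basis = fit_subspace(normal, k=2)
print(anomaly_score(normal[0], mean, basis))  # near 0: in-subspace
print(anomaly_score(np.ones(16) * 5, mean, basis))  # large: off-subspace
```

No training loop, memory bank, or auxiliary data appears anywhere: the only fitted objects are a mean vector and a few singular directions.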
- [2026-02-26] FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning 🆕NEW
- Track: Multimodal agents / Video fact-checking & misinformation detection (reinforcement learning)
- Core innovation: Frames video misinformation detection as iterative verification and uses reinforcement learning to learn adaptive reasoning depth and tool/evidence usage, addressing fixed-depth inference and overreliance on internal assumptions. Builds an interactive gather–verify–revise loop to improve robustness when evidence is sparse or fragmented.
- One-sentence summary: RL-trained agentic verification makes video misinformation detection more evidence-grounded, adaptive, and traceable.
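FactGuard learns this policy with RL; the skeleton below only illustrates the gather-verify-revise control flow, with a stubbed search tool and a naive confidence rule (every name and the evidence schema are hypothetical):

```python
def verify_claim(claim, search_tool, max_steps=4, threshold=0.8):
    """Iterative verification: keep gathering evidence until confident or out of budget."""
    evidence, confidence = [], 0.0
    for step in range(max_steps):
        evidence.append(search_tool(claim, step))
        # Toy verifier: confidence is the mean support of evidence so far.
        confidence = sum(e["support"] for e in evidence) / len(evidence)
        if confidence >= threshold or confidence <= 1 - threshold:
            break  # confident verdict either way: stop early
    return {"verdict": confidence >= 0.5, "steps": step + 1, "confidence": confidence}

# Stub standing in for a web / video-metadata search tool.
def fake_search(claim, step):
    return {"snippet": f"source {step}", "support": 0.9}

result = verify_claim("the flood video is from 2021", fake_search)
print(result)
```

The RL component in the actual system would replace the fixed stopping rule and tool choice with a learned policy over when to search deeper versus commit to a verdict.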
- [2026-02-26] Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study 🆕NEW
- Track: Medical multimodal understanding / Zero-shot diagnosis & agent benchmarking
- Core innovation: Benchmarks agents/MLLMs on clinically relevant, visually hard-to-separate diseases in a zero-shot, imaging-only setting (e.g., melanoma vs atypical nevus; pulmonary edema vs pneumonia). Uses agent-style reasoning/tooling comparisons to expose capability limits and failure modes in fine-grained medical discrimination.
- One-sentence summary: Provides a stress-test benchmark that clarifies where zero-shot multimodal agents fall short for high-stakes, fine-grained medical imaging decisions.
- [2026-02-26] MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding 🆕NEW
- Track: Long-form video understanding / Joint sampler–reasoner optimization (efficiency)
- Core innovation: Proposes MSJoE to jointly evolve an MLLM and a lightweight keyframe sampler under the premise that only a small frame subset is question-relevant. The model generates query cues to guide sampling, and improved sampled evidence in turn boosts reasoning—closing the loop between “what to watch” and “how to answer.”
- One-sentence summary: Jointly optimizing sampling and reasoning enables efficient long-video QA by selecting fewer but more informative frames.
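A minimal sketch of query-guided keyframe selection, assuming per-frame and question embeddings are available (the embeddings below are random stand-ins): score every frame by cosine similarity to the question and keep only the top-k, in temporal order:

```python
import numpy as np

def select_keyframes(frame_feats, query_feat, k=3):
    """Rank frames by cosine similarity to the question embedding; keep top-k."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q
    return np.sort(np.argsort(scores)[-k:])  # top-k indices, in temporal order

rng = np.random.default_rng(1)
frames = rng.normal(size=(100, 64))  # stand-in per-frame embeddings
query = frames[40]                   # a question "about" frame 40
picked = select_keyframes(frames, query, k=3)
print(picked)  # the selected subset includes frame 40
```

In MSJoE both sides are trained jointly, so the MLLM's query cues and the sampler's scores co-adapt; the sketch only shows the one-shot selection step.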
- [2026-02-26] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models 🆕NEW
- Track: Multimodal interpretability (VLM OCR routing / causal analysis)
- Core innovation: Using causal interventions, it locates where OCR information is routed into the language stream by comparing activation differences between original images and text-inpainted counterfactuals across several VLM families (Qwen3-VL, Phi-4, InternVL3.5). The key advance is an actionable, architecture-specific diagnosis of OCR bottlenecks via counterfactual activation analysis.
- One-sentence summary: It turns VLM OCR from a black box into identifiable routing bottlenecks, enabling targeted improvements and debugging.
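The analysis itself requires model internals; the toy sketch below only illustrates the activation-differencing step: run the model on the original and the text-inpainted image, then rank layers by how much their activations diverge (the activations here are synthetic):

```python
import numpy as np

def locate_bottleneck(acts_orig, acts_inpaint):
    """Per-layer L2 difference between runs on the original and text-erased image;
    the layer with the largest jump is a candidate OCR routing bottleneck."""
    diffs = [float(np.linalg.norm(a - b)) for a, b in zip(acts_orig, acts_inpaint)]
    return int(np.argmax(diffs)), diffs

# Toy activations for 4 layers; layer 2 reacts strongly to the erased text.
rng = np.random.default_rng(0)
orig = [rng.normal(size=16) for _ in range(4)]
inpaint = [a + (5.0 if i == 2 else 0.05) * rng.normal(size=16)
           for i, a in enumerate(orig)]
layer, diffs = locate_bottleneck(orig, inpaint)
print(layer)  # → 2
```

With real models the `acts_*` lists would come from forward hooks on each transformer layer, and the counterfactual image from a text-inpainting tool.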
GitHub
- [2026-02-23] BradyFU/Awesome-Multimodal-Large-Language-Models ⭐17365
:sparkles::sparkles:Latest Advances on Multimodal Large Language Models
- [2026-02-25] xjywhu/Awesome-Multimodal-LLM-for-Code ⭐207 🆕NEW
Multimodal Large Language Models for Code Generation under Multimodal Scenarios
- [2026-02-26] Roots-Automation/GutenOCR ⭐143 🆕NEW
Open-source tools for training and evaluating Vision Language Models for OCR
- [2026-02-22] Wang-ML-Lab/multimodal-needle-in-a-haystack ⭐54
[NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking Long-Context Capability of Multimodal Large Language Models
- [2026-02-23] Yu-xm/ReVision ⭐51
Modality Gap–Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Generated automatically by Daily AI Digest Agent · Generated at: 2026-02-27 02:14:30