Daily AI Digest - 2026-02-25
Image Generation/Editing
arXiv
- [2026-02-24] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
- Track: Text-to-image retrieval / quality-controllable retrieval (query understanding & retrieval control)
- Core innovation: Proposes a quality-controllable retrieval paradigm where an LLM enriches underspecified short queries by expanding and constraining their semantics, reducing ambiguity and collisions across multiple visual interpretations. The key methodological shift is to make retrieval quality (e.g., relevance/specificity) explicitly controllable at the query level, turning quality control into a programmable capability rather than a post-hoc fix.
- One-sentence summary: Improves real-world text-to-image retrieval by using LLMs to transform vague short queries into controllable, constraint-rich intents that yield more reliable results.
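A hedged sketch of the query-side control idea: the paper's prompts and constraint schema are not given in this summary, so `expand_query` and its fields are hypothetical stand-ins for the LLM step, and the keyword scorer stands in for a real embedding retriever. The point it illustrates is that a short query becomes explicit, checkable retrieval conditions before ranking.

```python
def expand_query(query: str) -> dict:
    # Hypothetical stand-in for the LLM call: it returns structured,
    # checkable constraints instead of free text.
    return {
        "expanded": f"a photo of a {query}, single subject, centered",
        "must_include": [query],
    }

def retrieve(constraints: dict, captions: list[str], k: int = 2) -> list[str]:
    # Toy ranking: keyword overlap with the expanded query, gated by the
    # hard constraints -- the "programmable" control happens before ranking.
    terms = set(constraints["expanded"].replace(",", " ").lower().split())
    eligible = [c for c in captions
                if all(m in c.lower() for m in constraints["must_include"])]
    scored = sorted(eligible,
                    key=lambda c: -len(terms & set(c.lower().split())))
    return scored[:k]

captions = ["a red apple on a table", "apple logo on a laptop", "a bowl of oranges"]
print(retrieve(expand_query("apple"), captions))
```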
- [2026-02-24] SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception
- Track: Synthetic data generation / Sim2Real & Real2Sim transfer (industrial object perception)
- Core innovation: Releases SynthRender, an open-source synthetic rendering framework with Guided Domain Randomization to efficiently explore reality-relevant variations; introduces the IRIS dataset to enable bidirectional sim↔real transfer benchmarking and training loops.
- One-sentence summary: Turns costly industrial perception data collection into a scalable pipeline via an open framework plus a bidirectional transfer dataset.
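The "guided" idea can be sketched as biasing parameter sampling toward the real domain rather than sampling uniformly. Everything below (a two-statistic parameter space, the distance-based score) is an illustrative assumption, not SynthRender's actual parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target: statistics (e.g., mean brightness, contrast) of a small real set.
real_stats = np.array([0.4, 0.7])

def guidance_score(params):
    # Higher when randomized scene statistics land near the real domain.
    return -np.linalg.norm(params - real_stats, axis=-1)

candidates = rng.uniform(0.0, 1.0, size=(500, 2))   # uniform (unguided) proposals
scores = guidance_score(candidates)
keep = candidates[np.argsort(scores)[-100:]]        # guided subset

unguided_err = np.linalg.norm(candidates - real_stats, axis=1).mean()
guided_err = np.linalg.norm(keep - real_stats, axis=1).mean()
print(guided_err < unguided_err)  # guided sampling concentrates near the real domain
```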
- [2026-02-24] TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
- Track: T2I evaluation & RL optimization (visual text rendering)
- Core innovation: Introduces a reward signal that explicitly quantifies structural anomalies in rendered text (distortion/blur/misalignment), addressing the inability of MLLMs/OCR to perceive such defects and enabling a tighter evaluation-to-RL optimization loop.
- One-sentence summary: Makes text rendering quality in T2I measurable and optimizable by rewarding structural correctness rather than relying on imperfect recognition.
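A toy illustration of rewarding structural correctness instead of recognition output (TextPecker's actual anomaly metric is not reproduced here): comparing a generated glyph mask against a reference mask with IoU penalizes distortion and misalignment even when a character might remain OCR-readable.

```python
import numpy as np

def structural_reward(generated: np.ndarray, reference: np.ndarray) -> float:
    # IoU between binary glyph masks: 1.0 for a structurally faithful glyph,
    # lower for warped, blurred, or misplaced strokes.
    inter = np.logical_and(generated, reference).sum()
    union = np.logical_or(generated, reference).sum()
    return inter / union if union else 1.0

ref = np.zeros((8, 8), dtype=bool)
ref[1:7, 4] = True                      # reference stroke
clean = ref.copy()                      # faithful rendering
warped = np.roll(ref, 2, axis=1)        # misaligned stroke (structural anomaly)
print(structural_reward(clean, ref), structural_reward(warped, ref))
```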
- [2026-02-24] When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
- Track: T2I safety alignment / safety guidance (diffusion)
- Core innovation: Identifies that averaging safety regions across harm categories misses inter-category conflicts; proposes adaptive safety guidance that dynamically adjusts guidance direction/strength during sampling to resolve multi-category harmful conflicts while preserving quality.
- One-sentence summary: Improves diffusion safety control under simultaneous multi-harm constraints by replacing coarse averaged guidance with conflict-aware adaptive guidance.
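The conflict-aware weighting idea can be sketched numerically (this is not the paper's scheme, just the contrast it draws with uniform averaging): given one guidance direction per harm category, pairs that point in conflicting directions are downweighted instead of being averaged into a single washed-out direction.

```python
import numpy as np

def adaptive_guidance(directions: np.ndarray) -> np.ndarray:
    # One row per harm category. A category that conflicts with the others
    # (negative cosine similarity) receives less weight than under a
    # uniform average.
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sim = unit @ unit.T                        # pairwise cosine similarity
    weights = np.clip(sim, 0.0, None).mean(axis=1)
    weights /= weights.sum()
    return (weights[:, None] * directions).sum(axis=0)

# Two aligned harm directions and one opposing them:
d = np.array([[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]])
print(adaptive_guidance(d), d.mean(axis=0))  # adaptive keeps more of the shared direction
```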
- [2026-02-24] Training-Free Multi-Concept Image Editing
- Track: Image editing (training-free/zero-shot, multi-concept, diffusion)
- Core innovation: Proposes a training-free concept-based multi-concept editing framework that unifies optimization-based text edits with visual concept constraints, enabling preservation/control of hard-to-verbalize attributes like identity, texture, and geometry.
- One-sentence summary: Enables robust zero-shot multi-concept edits by combining text-driven changes with explicit visual concept preservation—without any training.
- [2026-02-24] Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation
- Track: Controllable generation / bridging PBR and diffusion (physics-grounded control)
- Core innovation: Uses an SDE formulation to align physically based rendering’s image formation with diffusion denoising dynamics, enabling diffusion generation with explicit, physically grounded control over lighting/material/shading via a unified stochastic process view.
- One-sentence summary: Provides a principled SDE bridge that combines diffusion flexibility with PBR-level physical controllability.
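For reference, the standard score-based diffusion SDE pair that such a bridge is presumably formulated against (the paper's specific drift and diffusion choices are not given in this summary): a forward noising process and its time-reversal, where the score term is what a trained diffusion model estimates.

```latex
% Forward (noising) SDE:
\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
% Reverse (generative) SDE, driven by the learned score \nabla_{\mathbf{x}} \log p_t:
\mathrm{d}\mathbf{x} = \bigl[f(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
```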
- [2026-02-24] CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization
- Track: T2I stylization / reference-based conditioning (diffusion, plug-and-play)
- Core innovation: Introduces CleanStyle, a plug-and-play purification module that removes content-related components from style embeddings to prevent content leakage in encoder-based stylization, improving prompt fidelity and style consistency without retraining.
- One-sentence summary: Delivers cleaner style control by purifying style conditions to suppress semantic leakage from the reference image.
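A sketch of purification-by-projection (CleanStyle's actual purification module is learned; this only shows the underlying idea): remove the component of a style embedding that lies along an estimated content direction, keeping the residual as the "decontaminated" style condition.

```python
import numpy as np

def purify(style_emb: np.ndarray, content_dir: np.ndarray) -> np.ndarray:
    # Project out the content component so the residual carries only
    # content-orthogonal (style) information.
    c = content_dir / np.linalg.norm(content_dir)
    return style_emb - (style_emb @ c) * c

style = np.array([2.0, 1.0, 0.0])       # style embedding contaminated by content
content = np.array([1.0, 0.0, 0.0])     # estimated content direction
purified = purify(style, content)
print(purified, float(purified @ content))  # residual has no content component
```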
- [2026-02-24] BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models
- Track: Controllable T2I generation (numeric layout + color control)
- Core innovation: Presents a large-scale T2I model conditioned directly on numeric bounding boxes and RGB triplets, closing the gap between language-only control and professional parametric needs for precise position/size/color constraints.
- One-sentence summary: Brings design-grade parametric control to T2I by enabling direct numeric conditioning for layout and color.
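How numeric specs might reach a T2I backbone (BBQ-to-Image's actual encoder is not described in this summary): normalized box coordinates and an RGB triplet are lifted with sinusoidal features, a common way to feed continuous scalars to transformer-based generators. The frequency count and layout below are illustrative assumptions.

```python
import numpy as np

def fourier_features(x: np.ndarray, n_freqs: int = 4) -> np.ndarray:
    # Lift each scalar to sin/cos features at geometrically spaced frequencies.
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    angles = x[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

box = np.array([0.1, 0.2, 0.5, 0.6])     # (x0, y0, x1, y1) normalized to [0, 1]
rgb = np.array([255, 128, 0]) / 255.0    # exact color as an RGB triplet
cond = fourier_features(np.concatenate([box, rgb]))
print(cond.shape)  # 7 scalars x 4 freqs x (sin, cos) -> (56,)
```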
- [2026-02-24] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
- Track: Inference optimization / diffusion acceleration (DiT, caching & prediction)
- Core innovation: Proposes LESA, learnable stage-aware predictors that model stage-dependent diffusion dynamics to better forecast/reuse features than naive caching or training-free extrapolation, reducing quality drop under acceleration.
- One-sentence summary: Improves DiT acceleration by replacing static reuse with stage-adaptive learned prediction for a better speed–quality trade-off.
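A toy version of the contrast LESA draws (its learned predictors are not reproduced here): naive caching reuses the last computed feature across skipped steps, while a stage-aware predictor extrapolates using the local rate of change, with a gain that would differ between early and late diffusion stages.

```python
def naive_reuse(f_prev, f_prev2):
    # Static caching: just repeat the last feature.
    return f_prev

def stage_aware_predict(f_prev, f_prev2, stage_gain):
    # stage_gain would be learned per stage in LESA; here it is given.
    return f_prev + stage_gain * (f_prev - f_prev2)

# A feature evolving quickly (early-stage behavior): the true next value
# continues the trend, so extrapolation beats reuse.
f2, f1, f_true = 1.0, 2.0, 3.0
print(abs(naive_reuse(f1, f2) - f_true), abs(stage_aware_predict(f1, f2, 1.0) - f_true))
```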
- [2026-02-23] gQIR: Generative Quanta Image Reconstruction
- Track: Computational imaging / generative reconstruction (photon-limited SPAD imaging)
- Core innovation: Introduces gQIR, a generative framework to reconstruct images from sparse binary SPAD quanta-frame bursts by jointly handling alignment, denoising, and (color) demosaicing under photon-counting noise statistics rather than standard Gaussian assumptions.
- One-sentence summary: Enables practical high-quality reconstruction in extreme low-photon regimes by unifying the quanta-frame imaging pipeline into a generative solution.
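The photon-counting statistics gQIR adapts to can be checked numerically in their simplest form: a SPAD pixel fires in a frame with probability 1 - exp(-flux), so binary quanta frames are Bernoulli rather than Gaussian, and flux is recovered from the firing rate via the classical inverse map flux = -log(1 - rate). This is the standard quanta-imaging estimator, not gQIR's generative reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
true_flux = 0.5                                          # mean photons per frame
fire_prob = 1 - np.exp(-true_flux)                       # Bernoulli firing probability
frames = rng.random(20000) < fire_prob                   # binary quanta frames
rate = frames.mean()
flux_hat = -np.log(1 - rate)                             # inverse-rate flux estimate
print(float(flux_hat))
```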
GitHub
- [2026-02-25] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐8134
🍌 The world's largest Nano Banana Pro prompt library — 9,985+ curated prompts with images, 16 languages. Google Gemini AI image generation. Free & ope...
- [2026-02-25] Dreamy-rain/gemini-business2api ⭐851
OpenAI-compatible API for Gemini Business with multi-account load balancing and image generation
- [2026-02-24] etkecc/baibot ⭐191
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
- [2026-02-24] iconben/z-image-studio ⭐94
A Cli, a webUI, and a MCP server for the Z-Image-Turbo text-to-image generation model (Tongyi-MAI/Z-Image-Turbo base model as well as quantized models...
- [2026-02-24] ramanujammv1988/edge-veda ⭐59
On-device AI SDK for Flutter — LLM inference, vision, STT, TTS, image generation, embeddings, RAG, and function calling. Metal GPU on iOS/macOS.
Video Generation/Editing
arXiv
- [2026-02-24] Multi-Vector Index Compression in Any Modality
- Track: Multimodal Retrieval / Vector Index Compression (Late Interaction Multi-Vector Retrieval)
- Core innovation: Proposes a query-agnostic compression framework for multi-vector document representations under a constant vector budget, reducing the linear-in-length storage and compute overhead of late-interaction retrieval for long, rich modalities like images, videos, and audio. The key methodological step is to preserve the most retrieval-critical vectors without conditioning on specific queries, balancing effectiveness and efficiency.
- One-sentence summary: It makes late-interaction multi-vector retrieval substantially more scalable for video-heavy corpora by compressing documents to a fixed vector budget while retaining retrieval quality.
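A sketch of query-agnostic compression to a fixed vector budget (the paper's selection/aggregation rule is not reproduced here): keep a diverse subset of a document's token vectors via farthest-point sampling, then score queries with the usual late-interaction MaxSim over the compressed set. The budget and dimensions are illustrative.

```python
import numpy as np

def compress(doc_vecs: np.ndarray, budget: int) -> np.ndarray:
    # Farthest-point sampling: greedily keep the vector farthest from
    # anything already kept, until the fixed budget is reached.
    keep = [0]
    for _ in range(budget - 1):
        dist = np.min(
            np.linalg.norm(doc_vecs[:, None] - doc_vecs[keep][None], axis=-1),
            axis=1)
        keep.append(int(np.argmax(dist)))
    return doc_vecs[keep]

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # Late interaction: each query vector matches its best document vector.
    return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

rng = np.random.default_rng(0)
doc = rng.normal(size=(300, 16))          # e.g., one vector per video patch
small = compress(doc, budget=32)          # constant storage per document
q = rng.normal(size=(4, 16))
print(small.shape, maxsim(q, small) <= maxsim(q, doc))
```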
- [2026-02-24] Human Video Generation from a Single Image with 3D Pose and View Control
- Track: Video Generation (Image-to-Video / Human Video Synthesis) with 3D pose & view control
- Core innovation: Introduces HVG (Human Video Generation in 4D), a latent video diffusion model that explicitly incorporates 3D pose and camera/view control to generate high-quality human videos from a single image. The methodological advance is leveraging 3D geometric/pose constraints to improve cross-view consistency and better capture motion-induced fine-grained dynamics such as clothing wrinkles, addressing the ill-posed nature of single-image conditioning.
- One-sentence summary: By structurally injecting 3D pose and view control into diffusion-based video synthesis, it improves controllability and spatiotemporal consistency for single-image human video generation.
- [2026-02-24] VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
- Track: Video generation safety (image-to-video jailbreaking / adversarial attacks)
- Core innovation: Reveals a new attack surface created by the visual instruction following of I2V models: an attacker can embed implicit visual instructions in the reference image to induce malicious or policy-violating video generation without relying on the text prompt. Proposes and systematizes this Visual Instruction Injection threat paradigm for evaluating and triggering such cross-modal injection jailbreaks.
- One-sentence summary: Extends I2V safety from text-prompt injection to image-borne instruction injection, establishing a key new direction for red-teaming and defending video generation models.
- [2026-02-24] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
- Track: Video-to-Audio Generation (length generalization / multimodal alignment)
- Core innovation: Identifies and targets the length-generalization gap in video-to-audio models—whether training on short clips can generalize to long sequences at test time—under data scarcity and text–frame mismatch. Proposes MMHNet, a multimodal hierarchical architecture that improves scalable alignment and temporal extension to longer horizons.
- One-sentence summary: This work advances video-to-audio generation by explicitly engineering for long-duration generalization rather than only short-clip fidelity.
- [2026-02-24] RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation
- Track: Driving world modeling / video generation (geometry-free autoregressive spatio-temporal modeling)
- Core innovation: Introduces a geometry-free world model with a dual-causal autoregressive process that follows both scale-wise and temporal topological orders, enabling unified 4D spatio-temporal reasoning via global attention rather than separate spatial/temporal modules.
- One-sentence summary: RAYNOVA offers a scalable alternative to geometry-heavy simulators by unifying spatio-temporal reasoning in an autoregressive framework.
- [2026-02-24] GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation
- Track: Free-viewpoint driving scene generation (geometry-appearance decoupling + diffusion)
- Core innovation: Decouples geometry from appearance to enable controllable novel-view synthesis along user-defined trajectories, and leverages diffusion-based generation to produce high-fidelity, editable driving views conditioned on captured images and scene geometry.
- One-sentence summary: GA-Drive turns recorded driving data into an editable free-viewpoint generator suitable for scalable autonomous-driving simulation.
- [2026-02-24] PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- Track: Video editing (propagation-based editing / training with diffusion supervision)
- Core innovation: Replaces costly paired (source, edited) video supervision by using pre-trained video diffusion models to provide on-the-fly training signals, enabling robust propagation of a single-frame edit across time while preserving motion and structure.
- One-sentence summary: PropFly makes propagation-based video editing trainable at scale without curated paired datasets by distilling supervision from foundation video diffusion models.
- [2026-02-24] LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
- Track: Inference optimization / diffusion acceleration (DiT, caching & prediction)
- Core innovation: Proposes LESA, learnable stage-aware predictors that model stage-dependent diffusion dynamics to better forecast/reuse features than naive caching or training-free extrapolation, reducing quality drop under acceleration.
- One-sentence summary: Improves DiT acceleration by replacing static reuse with stage-adaptive learned prediction for a better speed–quality trade-off.
- [2026-02-23] 3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism
- Track: Video realism evaluation (automatic metrics with 3D/semantic consistency)
- Core innovation: Proposes a 3D semantic point autoencoder to evaluate realism by jointly modeling semantic plausibility and coherent 3D structure over time, avoiding reliance on human labels or narrowly scoped evaluation datasets.
- One-sentence summary: 3DSPA provides a more general, structure-aware automatic realism metric for rapidly evolving video generators.
- [2026-02-23] NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
- Track: Long-horizon robot manipulation (closed-loop video-language planning)
- Core innovation: Introduces a hierarchical closed-loop system that combines VLM-based task decomposition with video-based outcome imagination, then grounds actions through geometry-aware robot execution and feedback-driven replanning for zero-shot long-horizon tasks.
- One-sentence summary: NovaPlan bridges generative video/VLM planning with physically grounded execution, enabling more reliable zero-shot long-horizon manipulation.
GitHub
- [2026-02-25] test-time-training/ttt-video-dit ⭐2370
Official PyTorch implementation of One-Minute Video Generation with Test-Time Training
- [2026-02-25] Winn1y/Awesome-Human-Motion-Video-Generation ⭐297
【Accepted by TPAMI】Human Motion Video Generation: A Survey (https://ieeexplore.ieee.org/document/11106267)
- [2026-02-25] q1uki/MoneyPrinterAICreate ⭐255
Based on MoneyPrinterTurbo: AI generates storyboard outlines and videos (dynamic, not slideshow-style narration), integrating Tongyi Wanxiang wan2.1 text-to-video and image-to-video for flexible control over video generation.
- [2026-02-25] YouMind-OpenLab/awesome-seedance-2-prompts ⭐143
🎬 400+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
- [2026-02-25] Phantom-video/Phantom-Data ⭐104
Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
Large Language Models
arXiv
- [2026-02-19] El Agente Gráfico: Structured Execution Graphs for Scientific Agents 📖1
- Track: LLM agent frameworks for scientific workflow orchestration (auditable execution / tool use)
- Core innovation: Introduces a single-agent framework centered on structured execution graphs that explicitly encode LLM decisions, tool invocations, state, and intermediate artifacts as a traceable graph rather than unstructured text context. By constraining and logging dependencies in graph form, it improves reproducibility, auditability, and provenance tracking when coordinating heterogeneous computational tools.
- One-sentence summary: It turns an LLM scientific agent’s end-to-end “reason–act–observe” process into a structured, inspectable artifact, making complex scientific workflows more reliable and auditable.
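A minimal sketch of a structured execution graph in the spirit described above (the paper's actual schema is not reproduced; node names and actions are illustrative): each node records a decision or tool call plus its dependencies, so a run yields a provenance log and can be replayed deterministically.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    action: callable
    deps: list = field(default_factory=list)   # names of prerequisite nodes

class ExecutionGraph:
    def __init__(self):
        self.nodes, self.results, self.log = {}, {}, []

    def add(self, node: Node):
        self.nodes[node.name] = node

    def run(self, name: str):
        if name in self.results:               # memoized: each node runs once
            return self.results[name]
        node = self.nodes[name]
        inputs = [self.run(d) for d in node.deps]
        self.results[name] = node.action(*inputs)
        self.log.append(name)                  # provenance: execution order
        return self.results[name]

g = ExecutionGraph()
g.add(Node("load", lambda: [1.0, 2.0, 3.0]))
g.add(Node("mean", lambda xs: sum(xs) / len(xs), deps=["load"]))
g.add(Node("report", lambda m: f"mean={m}", deps=["mean"]))
print(g.run("report"), g.log)
```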
- [2026-02-19] Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning 📖1
- Track: Data selection for LLM fine-tuning & training efficiency (influence/gradient-based approximations)
- Core innovation: Addresses the prohibitive cost of gradient-influence methods (e.g., TracIn, Influence Functions) on large LLMs by proposing influence-preserving proxies that approximate the target model’s influence-based sample ranking at much lower compute. Unlike naïve small-model proxies, it focuses on preserving the fidelity of the underlying influence/learning-dynamics signal, yielding more reliable selection of SFT data that benefits downstream performance.
- One-sentence summary: By making influence-based data selection computationally practical while retaining ranking fidelity, it enables more effective SFT dataset curation for multi-billion-parameter LLMs.
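TracIn-style influence in its simplest form (the paper's proxy construction is not reproduced; the tiny linear model is an illustrative assumption): the influence of training example i on a validation example is the dot product of their loss gradients. For a linear model with squared loss the gradient is (w·x - y)x, cheap enough to show the ranking such proxies must preserve.

```python
import numpy as np

def grad(w, x, y):
    # Gradient of 0.5 * (w.x - y)^2 with respect to w.
    return (w @ x - y) * x

w = np.array([1.0, -0.5])
train = [(np.array([1.0, 0.0]), 0.0),
         (np.array([0.0, 1.0]), -0.5),
         (np.array([1.0, 1.0]), 2.0)]
val_x, val_y = np.array([1.0, 0.0]), 0.0

g_val = grad(w, val_x, val_y)
# TracIn-style score: aligned gradients -> positive (helpful) influence.
influence = [float(grad(w, x, y) @ g_val) for x, y in train]
print(influence)
```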
- [2026-02-24] On Data Engineering for Scaling LLM Terminal Capabilities
- Track: Data engineering & synthetic data generation for LLM agents (terminal/tool use)
- Core innovation: Introduces Terminal-Task-Gen, a lightweight synthetic task generation pipeline enabling both seed-based and skill-based task construction to controllably broaden training coverage for terminal agents; additionally provides a systematic analysis of how data and training choices affect scaling terminal capabilities, yielding a reusable recipe-level framework.
- One-sentence summary: It turns terminal-agent scaling from an opaque recipe into a reproducible data-engineering methodology with a practical synthetic task pipeline and evidence-backed training insights.
- [2026-02-24] Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
- Track: LLM post-training & inference-aware optimization (Pass@k optimization for verifiable tasks)
- Core innovation: Identifies a mechanistic cause for pass@k optimization degrading pass@1—prompt interference during post-training shifts the model’s single-sample solution distribution away from the best answer; by analyzing the conflict between multi-sample objectives and single-sample quality, it explains performance reversals across sampling budgets.
- One-sentence summary: It provides an actionable explanation for the pass@k vs. pass@1 trade-off, highlighting that post-training objectives must explicitly balance single-shot quality with multi-sample coverage.
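The trade-off above is defined over the standard unbiased pass@k estimator (Chen et al., HumanEval): with n samples of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). The estimator only sees correctness counts, which is why improving coverage at k=8 can coexist with a lower single-sample success rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples (drawn from n without
    # replacement) is correct, given c of the n are correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 50% per-sample accuracy vs. 30%: pass@1 favors the former,
# but pass@8 can still be high for the latter.
print(pass_at_k(100, 50, 1), pass_at_k(100, 30, 1), pass_at_k(100, 30, 8))
```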
- [2026-02-24] SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
- Track: LLM agent RL / reward modeling (uncertainty-aware)
- Core innovation: Explicitly injects intrinsic LLM uncertainty into reward design and credit assignment via uncertainty-aware rewards, providing learnable signals from both successful and failed trajectories to enable self-evolving improvement. It turns “confidence/hesitation” into actionable cues for exploration and correction, improving learning efficiency and robustness in multi-step decision making.
- One-sentence summary: Uses uncertainty as a missing reward signal so LLM agents can learn effectively even from failures and continuously self-improve.
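A toy shaping rule in the spirit of uncertainty-aware rewards (SELAUR's exact formulation is not reproduced; the entropy-based penalty and `beta` are illustrative assumptions): confident failures are penalized more than hesitant ones, so failed trajectories still carry a graded learning signal.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p)
    return float(-(p * np.log(p + 1e-12)).sum())

def shaped_reward(success: bool, action_probs, beta: float = 0.5) -> float:
    # On failure, penalize the confidence gap: low entropy (overconfidence)
    # relative to the maximum possible entropy is penalized most.
    if success:
        return 1.0
    return -beta * (np.log(len(action_probs)) - entropy(action_probs))

confident = [0.97, 0.01, 0.01, 0.01]
hesitant = [0.4, 0.3, 0.2, 0.1]
print(shaped_reward(False, confident), shaped_reward(False, hesitant))
```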
- [2026-02-24] Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
- Track: Inference optimization / multi-GPU parallelism (tensor parallelism for SSMs)
- Core innovation: Designs tensor-parallel partitioning and communication schemes tailored to selective SSM blocks, addressing their parameter coupling and operator characteristics so inference can scale beyond single-GPU memory/bandwidth/latency limits. Adapts the widely used Transformer TP paradigm to the SSM inference path to improve throughput and scalability for long-context deployment.
- One-sentence summary: Enables efficient multi-GPU tensor parallelism for SSM backbones, providing a practical scaling path for long-context inference.
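Why SSMs shard naturally for tensor parallelism can be checked directly: in a diagonal state-space recurrence h_t = a*h_{t-1} + b*x_t, every state channel is independent, so channels can be split across devices and the per-shard scans concatenated with no communication inside the scan. (The paper's partitioning of full selective SSM blocks, with input-dependent parameters and mixing layers, is more involved; this is the base case.)

```python
import numpy as np

def ssm_scan(a, b, x):
    # a, b: (channels,), x: (time, channels) -> per-channel linear recurrence.
    h = np.zeros_like(a)
    out = []
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out.append(h.copy())
    return np.stack(out)

rng = np.random.default_rng(0)
T, C = 16, 8
a, b, x = rng.uniform(0.1, 0.9, C), rng.normal(size=C), rng.normal(size=(T, C))

full = ssm_scan(a, b, x)
# "Two GPUs": each shard scans half the channels independently.
shard = np.concatenate(
    [ssm_scan(a[:C // 2], b[:C // 2], x[:, :C // 2]),
     ssm_scan(a[C // 2:], b[C // 2:], x[:, C // 2:])], axis=1)
print(np.allclose(full, shard))
```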
- [2026-02-24] A Benchmark for Deep Information Synthesis
- Track: Benchmarking / LLM agent deep information synthesis (multi-source reasoning)
- Core innovation: Introduces the DEEPSYNTH benchmark to evaluate agents on realistic, time-consuming tasks that require synthesizing and cross-validating information from multiple sources and inferring insights beyond simple retrieval. Its task design emphasizes tool-mediated deep synthesis and insight generation, filling a key gap in existing evaluations.
- One-sentence summary: Makes deep multi-source synthesis and inference a measurable capability for LLM agents with a more realistic benchmark.
- [2026-02-24] SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
- Track: LLM application systems / conversational agents (adaptive semi-structured interviewing)
- Core innovation: Proposes an adaptive semi-structured interviewing system for qualitative insight discovery, with a principled mechanism to balance systematic coverage of predefined topics and adaptive exploration driven by respondent answers. It supports follow-ups, deep dives, and emergent topic exploration to preserve interview quality and depth at scale.
- One-sentence summary: Productizes expert-like semi-structured interviewing—coverage plus targeted probing—into a scalable LLM system for qualitative research.
- [2026-02-24] "Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
- Track: AI safety / human factors in LLM agents (agent-mediated deception)
- Core innovation: Conducts the first large-scale empirical study on how compromised LLM agents can deceive their human users, framing Agent-Mediated Deception as a human-centered attack surface beyond model-centric vulnerabilities. User studies quantify trust and susceptibility patterns, grounding the design of anti-deception interactions, warnings, and verification mechanisms.
- One-sentence summary: Shifts agent security from model-only risk to human susceptibility, providing evidence needed for safer agent UX in high-stakes settings.
- [2026-02-24] Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification
- Track: Text mining / aspect-based sentiment analysis (scalable LLM + classifier framework)
- Core innovation: Proposes a scalable hybrid pipeline where LLMs handle the semantically hard step of aspect identification, while lightweight text classifiers perform sentiment labeling at scale to balance quality and compute cost. By reserving LLM usage for the bottleneck and offloading high-throughput labeling to classifiers, it enables deployable million-review analysis.
- One-sentence summary: Operationalizes ABSA at scale by splitting labor—LLMs for understanding, classifiers for throughput—making large-scale review analytics practical.
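The divide-and-conquer pipeline can be sketched end to end, with a stub in place of the LLM call (`llm_extract_aspects` is hypothetical) and a trivial keyword lexicon standing in for the lightweight sentiment classifier; the expensive step runs once per review segment while the cheap step scales out.

```python
import re

def llm_extract_aspects(segment: str) -> list[str]:
    # Stub for the LLM step: in the real pipeline an LLM identifies aspects.
    known = {"battery", "screen", "price"}
    return [w for w in re.findall(r"[a-z]+", segment.lower()) if w in known]

# Trivial lexicon classifier standing in for the lightweight model.
POS, NEG = {"great", "excellent", "cheap"}, {"poor", "terrible", "expensive"}

def classify_sentiment(segment: str) -> str:
    words = set(re.findall(r"[a-z]+", segment.lower()))
    return "positive" if len(words & POS) >= len(words & NEG) else "negative"

def absa(review: str) -> dict:
    # Segment on clause boundaries so each aspect is scored in context.
    results = {}
    for segment in re.split(r",|\bbut\b", review):
        for aspect in llm_extract_aspects(segment):
            results[aspect] = classify_sentiment(segment)
    return results

print(absa("Great battery, but the screen is terrible."))
```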
GitHub
- [2026-02-25] sgl-project/sglang ⭐23754
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-02-25] NVIDIA/TensorRT-LLM ⭐12940
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-02-25] Azure-Samples/chat-with-your-data-solution-accelerator ⭐1154
A Solution Accelerator for the RAG pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatG...
- [2026-02-25] trpc-group/trpc-agent-go ⭐894
trpc-agent-go is a powerful Go framework for building intelligent agent systems using large language models (LLMs) and tools.
- [2026-02-25] mensfeld/llm-docs-builder ⭐78
Transform and optimize your markdown documentation for Large Language Models (LLMs) and RAG systems. Generate llms.txt automatically.
HuggingFace Datasets
- [2026-02-24] FINAL-Bench/Metacognitive
FINAL Bench: Functional Metacognitive Reasoning Benchmark
"Not how much AI knows — but whether it knows what it doesn't know, and can fix ...
- [2026-02-12] OpenResearcher/OpenResearcher-Dataset
OpenResearcher is a fully open agentic lar...
Multimodal Models
arXiv
- [2026-02-24] Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
- Track: 3D visual reasoning / multimodal spatial understanding
- Core innovation: Introduces a predictive spatial-field modeling framework that turns multi-view/2D observations into a queryable continuous 3D representation, enabling explicit reasoning over geometry and spatial relations rather than forcing the LLM to implicitly solve ill-posed 3D reconstruction. Compared to approaches relying on explicit 3D modalities or partial view-conditioned priors, it improves scalability by shifting the core 3D reasoning burden to a structured spatial representation.
- One-sentence summary: Makes 3D reasoning in VLMs more robust and scalable by explicitly representing scene structure as a queryable continuous spatial field.
- [2026-02-24] Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
- Track: Text-to-image retrieval / quality-controllable retrieval (query understanding & retrieval control)
- Core innovation: Proposes a quality-controllable retrieval paradigm where an LLM enriches underspecified short queries by expanding and constraining their semantics, reducing ambiguity and collisions across multiple visual interpretations. The key methodological shift is to make retrieval quality (e.g., relevance/specificity) explicitly controllable at the query level, turning quality control into a programmable capability rather than a post-hoc fix.
- One-sentence summary: Improves real-world text-to-image retrieval by using LLMs to transform vague short queries into controllable, constraint-rich intents that yield more reliable results.
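As a rough illustration of the query-side control described above, an LLM-expanded intent can carry both a richer description and hard constraints that the retriever enforces before ranking. The `expand_query`/`controllable_retrieve` names and the intent schema are hypothetical, not from the paper; the LLM is stubbed:

```python
def expand_query(short_query, llm):
    """Turn a short, underspecified query into an explicit intent:
    a richer description plus hard constraints. `llm` is any callable
    returning {"description": str, "constraints": [predicate, ...]}."""
    return llm(short_query)

def controllable_retrieve(intent, candidates, sim, top_k=3):
    """Rank candidates by similarity to the expanded description,
    dropping any candidate that violates a hard constraint."""
    pool = [c for c in candidates
            if all(con(c) for con in intent["constraints"])]
    pool.sort(key=lambda c: sim(intent["description"], c), reverse=True)
    return pool[:top_k]
```

Constraints act as programmable quality knobs: tightening or relaxing them changes what "relevant" means without retraining the retriever.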
- [2026-02-24] LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
- Track: Medical multimodal understanding (longitudinal radiology diagnosis/prognosis)
- Core innovation: Introduces a multimodal modeling approach tailored to longitudinal (multi-timepoint) radiology, explicitly capturing temporal changes and aligning them with clinical semantics to support both diagnosis and prognosis via a VQA-style interface. It targets the key clinical evidence of progression/regression rather than single-scan reasoning.
- One-sentence summary: Elevates radiology VLMs from static interpretation to clinically realistic longitudinal reasoning for diagnosis and prognosis support.
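One simple way to make "change over time" a first-class input, sketched below under the assumption that each study is already embedded as a vector (the `longitudinal_features` function is illustrative and not LUMEN's actual architecture):

```python
import numpy as np

def longitudinal_features(scan_embs):
    """Build a longitudinal representation from per-timepoint scan
    embeddings: keep the latest study plus explicit change vectors
    between consecutive studies, so progression/regression is an
    explicit feature rather than something inferred implicitly."""
    scan_embs = np.asarray(scan_embs, dtype=float)
    deltas = scan_embs[1:] - scan_embs[:-1]  # study-to-study change
    return np.concatenate([scan_embs[-1], deltas.ravel()])
```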
- [2026-02-24] VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
- Track: Multimodal understanding (hallucination detection / uncertainty quantification / self-evaluation)
- Core innovation: Proposes a vision-aware uncertainty quantification framework for LVLM self-evaluation, reducing reliance on language priors by incorporating vision-conditioned signals to better estimate answer reliability and detect hallucinations on visual tasks. Emphasizes vision-grounded confidence modeling and calibration.
- One-sentence summary: Shifts LVLM self-evaluation from language-driven confidence to vision-evidence-driven reliability for safer deployment.
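The shift from language-only to vision-aware confidence can be caricatured as blending two signals, assuming some independent visual-grounding score is available (the function names, the linear blend, and the threshold are illustrative assumptions, not VAUQ's method):

```python
def vision_aware_confidence(lang_conf, grounding_score, alpha=0.5):
    """Blend the model's own answer probability (lang_conf) with an
    independent visual-evidence signal (grounding_score), both in [0, 1].
    High language confidence alone is not enough: answers the image
    does not support are pushed down."""
    return alpha * lang_conf + (1 - alpha) * grounding_score

def flag_hallucination(lang_conf, grounding_score, threshold=0.5):
    """Flag answers whose combined reliability falls below threshold."""
    return vision_aware_confidence(lang_conf, grounding_score) < threshold
```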
- [2026-02-24] OCR-Agent: Agentic OCR with Capability and Memory Reflection
- Track: Document understanding & OCR (agentic OCR / self-correction with memory reflection)
- Core innovation: Introduces an agentic OCR pipeline with capability reflection and memory reflection, enabling VLMs to diagnose failure modes, avoid repetitive ineffective revisions, and accumulate successful strategies to improve multi-turn correction stability. Key leap: turning OCR from one-shot prediction into a self-correcting closed loop.
- One-sentence summary: Makes OCR a reflective, memory-augmented agent that improves reliably across iterations instead of looping on unproductive retries.
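The closed-loop idea, a memory that stops the agent from retrying failed revisions, can be sketched as follows. The `recognize`/`revise`/`verify` interface is a hypothetical simplification of the paper's agent, not its actual design:

```python
def agentic_ocr(recognize, revise, verify, max_rounds=5):
    """Multi-round OCR with a memory of attempted revisions: the agent
    re-reads its output, and a revision already tried and found
    unhelpful is never proposed again."""
    text = recognize()
    tried = set()  # memory of revisions that did not help
    for _ in range(max_rounds):
        if verify(text):
            return text
        candidate = revise(text, tried)
        if candidate is None or candidate in tried:
            break  # no new strategy left; stop instead of looping
        tried.add(candidate)
        text = candidate
    return text
```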
- [2026-02-24] Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
- Track: Multimodal retrieval & alignment (CLIP semantics / negation understanding / no fine-tuning adaptation)
- Core innovation: Proposes CLIPGlasses, a plug-and-play dual-stage framework that disentangles negation semantics without fine-tuning CLIP: a Lens module separates negated meaning, followed by a stage that reconstructs vision-consistent matching representations, mitigating embedding collapse like “no dog” matching dog images. Replaces end-to-end fine-tuning with structured external transformations to reduce overfitting.
- One-sentence summary: Fixes CLIP’s negation blind spot without parameter updates, making text–image matching more logically faithful.
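The effect of splitting out negated semantics can be illustrated with a toy scoring rule: once a query like "a street with no dog" is decomposed into a positive part and a negated part, evidence for the negated concept counts against the match. This linear penalty and the `negation_aware_score` name are illustrative assumptions, not CLIPGlasses' actual modules:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def negation_aware_score(query_pos, query_neg, image_emb, embed, margin=1.0):
    """Score an image against a decomposed query: similarity to the
    positive part, minus a penalty for evidence of the negated part."""
    pos = cosine(embed(query_pos), image_emb)
    neg = cosine(embed(query_neg), image_emb) if query_neg else 0.0
    return pos - margin * neg
```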
- [2026-02-24] From Perception to Action: An Interactive Benchmark for Vision Reasoning
- Track: Multimodal reasoning evaluation (interactive vision reasoning / embodied & physical-structure benchmarks)
- Core innovation: Introduces an interactive benchmark to evaluate perception-to-action vision reasoning, emphasizing physical structure—geometry, contact/support relations, and action feasibility constraints—addressing limitations of single-turn, structure-agnostic VQA. Uses interactive protocols and causal-hierarchy-style task design to approximate real embodied decision pipelines.
- One-sentence summary: Reorients VLM evaluation from static understanding to action-grounded, physics-aware reasoning.
- [2026-02-24] CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
- Track: Multimodal reasoning optimization (latent CoT in MLLMs / emergence of visual latents)
- Core innovation: Proposes CrystaL to address the loss of critical visual information in latent Chain-of-Thought by moving beyond heuristic intermediate supervision; it encourages MLLMs to spontaneously form more usable visual latent representations during reasoning, enabling tighter vision–language integration and faster inference.
- One-sentence summary: Improves latent reasoning by making visual latents emerge more faithfully, boosting multimodal reasoning quality without sacrificing efficiency.
- [2026-02-24] Are Multimodal Large Language Models Good Annotators for Image Tagging?
- Track: Data generation & annotation (image tagging / MLLM auto-annotation / weak supervision)
- Core innovation: Systematically studies the gap between MLLM-generated tags and human annotations, and proposes an effective recipe to make MLLM-based annotation viable as a human substitute (e.g., guideline alignment, error filtering/correction, confidence- or consistency-based calibration), turning free-form outputs into trainable, high-quality multi-label supervision. Focuses on building a controllable, evaluable annotation pipeline.
- One-sentence summary: Moves MLLMs from “annotation assistants” to scalable producers of training-grade image tags.
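Consistency-based calibration, one of the techniques mentioned above, can be sketched as simple voting across repeated MLLM tagging runs; the `calibrate_tags` helper and the vote threshold are illustrative, not the paper's exact recipe:

```python
from collections import Counter

def calibrate_tags(samples, min_votes=2):
    """Consistency-based tag calibration: tag the same image several
    times with the MLLM and keep only tags proposed in at least
    min_votes runs, filtering out one-off hallucinated labels."""
    votes = Counter(tag for sample in samples for tag in set(sample))
    return sorted(t for t, v in votes.items() if v >= min_votes)
```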
- [2026-02-24] LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
- Track: Long-video understanding (low-cost video reasoning / active navigation / agent)
- Core innovation: Proposes LongVideo-R1, a reasoning-equipped MLLM agent that actively navigates long videos by using high-level visual cues to select the most informative clip before deeper processing, avoiding exhaustive segment scanning. Key leap: reframing long-video understanding as reasoning-driven clip selection with on-demand reading to cut compute/context costs.
- One-sentence summary: Replaces brute-force long-video processing with smart, agentic navigation for scalable understanding under tight budgets.
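The "select before you read" control flow can be sketched as a two-stage budgeted pipeline: a cheap cue scores every segment, and only the top few receive the expensive MLLM pass. The function names and the fixed budget are illustrative assumptions; LongVideo-R1's navigation is reasoning-driven rather than a fixed heuristic:

```python
def navigate_long_video(segments, coarse_score, deep_process, budget=2):
    """Score every segment with a cheap cue (e.g. caption or thumbnail
    relevance), then run the expensive pass only on the top `budget`
    segments instead of scanning the whole video."""
    ranked = sorted(segments, key=coarse_score, reverse=True)
    return [deep_process(seg) for seg in ranked[:budget]]
```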
GitHub
- [2026-02-23] BradyFU/Awesome-Multimodal-Large-Language-Models ⭐17360
Latest Advances on Multimodal Large Language Models
- [2026-02-25] Blaizzy/mlx-vlm ⭐2172
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
- [2026-02-19] yfzhang114/Awesome-Multimodal-Large-Language-Models ⭐956
Reading notes about Multimodal Large Language Models, Large Language Models, and Diffusion Models
- [2026-02-22] Wang-ML-Lab/multimodal-needle-in-a-haystack ⭐54
[NAACL 2025 Oral] Multimodal Needle in a Haystack (MMNeedle): Benchmarking Long-Context Capability of Multimodal Large Language Models
- [2026-02-23] Yu-xm/ReVision ⭐50
Modality Gap–Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Generated automatically by Daily AI Digest Agent at 2026-02-25 11:40:52