AI 每日进展速报 / Daily AI Digest - 2026-03-02
图像生成/编辑 / Image Generation/Editing
arXiv
- [2026-02-26] Instruction-based Image Editing with Planning, Reasoning, and Generation 📖2 🆕NEW
- 赛道归属: 指令驱动图像编辑(Instruction-based Image Editing)
- 核心创新点: 提出将“规划-推理-生成”一体化的多模态模型,用统一框架打通编辑任务中的场景理解与图像生成,避免以往依赖LLM+分割+编辑模型的串联式流水线带来的模态割裂与误差累积。通过在编辑前显式进行步骤规划与视觉语义推理,提升对复杂指令、对象关系与全局一致性的编辑质量。
- 一句话总结: 用端到端多模态的规划推理机制连接理解与生成,使指令图像编辑更可靠、更一致、可控性更强。
- Track: Instruction-based image editing
- Core innovation: Proposes a multimodal model that unifies planning, reasoning, and generation in one framework, connecting scene understanding with image synthesis and avoiding the modality fragmentation and error accumulation of cascaded LLM + segmentation + editor pipelines. Explicit step planning and visual-semantic reasoning before editing improve quality on complex instructions, object relations, and global consistency.
- One-sentence summary: An end-to-end multimodal planning-and-reasoning mechanism bridges understanding and generation, making instruction-based image editing more reliable, consistent, and controllable.
- [2026-02-23] Closing the gap in multimodal medical representation alignment 📖2
- 赛道归属: 多模态医学表征学习(图文对齐/表示对齐)
- 核心创新点: 针对CLIP式对比学习在医学多模态对齐中引发的“模态鸿沟”(潜空间稀疏、语义碎片化)问题,系统分析其非预期优化行为,并提出更贴近真实语义对齐目标的对齐策略以缩小模态间分布差异。方法重点在于纠正对比损失带来的错误几何结构,从而提升跨模态语义一致性。
- 一句话总结: 通过诊断并修复对比学习导致的模态鸿沟,该工作为医学场景的可靠图文共享表征提供了更稳健的对齐路径。
- Track: Multimodal medical representation learning (image-text alignment/representation alignment)
- Core innovation: It analyzes unintended behaviors of CLIP-style contrastive objectives that create a “modality gap” (sparse/fragmented latent geometry) in medical multimodal alignment, and proposes an alignment strategy better matched to true semantic correspondence. The key is correcting the latent-space geometry induced by contrastive loss to improve cross-modal semantic consistency.
- One-sentence summary: By diagnosing and mitigating contrastive-learning-induced modality gaps, it strengthens trustworthy shared representations for medical image-text modeling.
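The modality gap described above can be made concrete in a few lines. The sketch below (illustrative, not the paper's method) measures the gap as the distance between the centroids of L2-normalized image and text embeddings, a common diagnostic for CLIP-style latent geometry; the toy arrays stand in for real encoder outputs.

```python
import numpy as np

def modality_gap(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized image
    and text embeddings -- a common proxy for the CLIP-style 'modality gap'."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy check: two clusters offset along one axis show a clear gap.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 8)) + np.array([3.0] + [0.0] * 7)
txt = rng.normal(size=(100, 8)) - np.array([3.0] + [0.0] * 7)
gap = modality_gap(img, txt)
```

A perfectly aligned pair of modalities would give a gap near zero; contrastively trained encoders in practice leave a sizeable residual gap, which is the geometry the paper sets out to correct.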
- [2026-02-25] GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models 📖1
- 赛道归属: 文生图评测与公平性(地理多样性/偏见评估)
- 核心创新点: 提出GeoDiv评测框架,利用大语言模型与视觉-语言模型对T2I生成结果进行“地理语义”层面的可解释评估,避免仅依赖人工标注数据集或表层视觉相似度指标。框架将地理多样性、刻板印象与区域表征偏差转化为可量化、可诊断的评测信号。
- 一句话总结: GeoDiv为文生图模型提供了可解释、可扩展的地理多样性与偏见评估工具,帮助系统性发现“世界表征”失真问题。
- Track: Text-to-image evaluation & fairness (geographical diversity/bias assessment)
- Core innovation: GeoDiv introduces an interpretable evaluation framework that leverages LLMs and vision-language models to assess geographic semantics in T2I outputs, moving beyond curated datasets and shallow visual-similarity metrics. It turns geographic diversity, stereotyping, and regional misrepresentation into quantifiable, diagnosable signals.
- One-sentence summary: GeoDiv enables scalable and interpretable auditing of how T2I models portray regions, exposing geographic bias and diversity failures.
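As a toy illustration of turning geographic coverage into a quantifiable signal, one simple diversity score is the normalized entropy of region labels assigned to generated images. The function and labels below are hypothetical; GeoDiv's actual metrics are richer than this sketch.

```python
import math
from collections import Counter

def geo_diversity(region_labels: list[str]) -> float:
    """Normalized Shannon entropy over region labels, in [0, 1]:
    1.0 = uniform coverage of observed regions, 0.0 = a single region.
    A simplified stand-in for a quantifiable geographic-diversity signal."""
    counts = Counter(region_labels)
    n = len(region_labels)
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))

skewed = ["Europe"] * 8 + ["Africa", "Asia"]
balanced = ["Europe", "Africa", "Asia", "Oceania"] * 3
```

A prompt set whose generations concentrate on one region scores near zero, flagging the kind of "world representation" distortion the framework is designed to surface.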
- [2026-02-24] When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance 📖1
- 赛道归属: 文生图安全对齐(扩散模型安全引导/多类有害冲突消解)
- 核心创新点: 指出现有安全引导将多类有害内容“平均化”成单一避让方向,无法建模不同伤害类别间的冲突与耦合;提出自适应安全引导,在生成过程中按类别动态调节引导强度与方向以解决多类别冲突。方法层面强调“按需分解+自适应融合”的安全梯度/引导策略,而非静态关键词区间。
- 一句话总结: 该工作让扩散模型在面对多类安全约束时能更精细地权衡与避险,提升安全性同时减少对正常生成质量的误伤。
- Track: Text-to-image safety alignment (diffusion safety guidance / multi-harm conflict resolution)
- Core innovation: It shows that prior safety guidance collapses multiple harm categories into an averaged avoidance direction, missing inter-category conflicts, and proposes adaptive safety guidance that dynamically adjusts per-category guidance directions/strengths during sampling. The methodological leap is conflict-aware, adaptive fusion rather than static keyword-based zones.
- One-sentence summary: It improves diffusion-model safety under multi-category constraints while better preserving benign generation quality.
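The contrast with an averaged avoidance direction can be sketched as a classifier-free-guidance-style update with per-category weights. Everything below (function name, scores, scale) is illustrative, not the paper's exact rule.

```python
import numpy as np

def adaptive_safety_guidance(eps_uncond, harm_eps, harm_scores, base_scale=5.0):
    """Combine per-category avoidance directions with weights that adapt to
    each category's detected harm score, instead of collapsing all categories
    into one static averaged direction. Schematic sketch only."""
    guided = eps_uncond.copy()
    for eps_c, score in zip(harm_eps, harm_scores):
        # push away from each harmful direction, scaled per category
        guided -= base_scale * score * (eps_c - eps_uncond)
    return guided

rng = np.random.default_rng(1)
eps = rng.normal(size=16)
harms = [rng.normal(size=16), rng.normal(size=16)]
out = adaptive_safety_guidance(eps, harms, harm_scores=[0.9, 0.1])
```

When all harm scores are zero the update reduces to the unconditional prediction, which is how such schemes avoid degrading benign generations.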
- [2026-02-26] SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation 🆕NEW
- 赛道归属: 3D布局可控文生图(3D Layout-conditioned Text-to-Image)/ 遮挡感知生成
- 核心创新点: 针对3D布局条件生成中长期被忽视的“遮挡推理”问题,提出显式建模遮挡关系的SeeThrough3D,通过遮挡感知的3D场景表示与生成机制,约束物体间前后关系与深度一致性,从而生成尺度与几何更合理的部分遮挡物体。相比仅遵循布局但忽略精确遮挡的现有方法,该方案强化了跨物体交互的结构正确性。
- 一句话总结: 通过显式遮挡建模补齐3D可控生成的关键短板,让布局条件下的多物体场景在深度与遮挡关系上更真实可信。
- [2026-02-26] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
- 赛道归属: 隐私保护图像生成(差分隐私训练/频域建模)
- 核心创新点: 提出基于小波的coarse-to-fine频域差分隐私框架,将生成建模分解到不同频段/尺度,在DP噪声注入时对高频纹理等敏感质量维度进行更精细的结构化处理,缓解DP-SGD“全参数均匀加噪”导致的纹理崩坏。核心突破在于用频谱分解重构DP训练的噪声分配与建模顺序。
- 一句话总结: 通过频域分解实现更“懂画质”的DP训练,该工作在隐私保证与图像质量之间取得更优折中。
- Track: Privacy-preserving image generation (differential privacy training / spectral modeling)
- Core innovation: It proposes a wavelet-based coarse-to-fine spectral DP framework that decomposes generation across frequency bands/scales, enabling structured noise allocation that better preserves high-frequency textures than uniform DP-SGD noise. The key advance is redesigning DP training via spectral decomposition and staged modeling.
- One-sentence summary: It delivers stronger privacy–quality trade-offs by making DP noise injection frequency-aware and generation coarse-to-fine.
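Band-wise noise allocation is easy to prototype with a one-level Haar transform. The sketch below adds stronger noise to the coarse band and lighter noise to the detail bands purely to illustrate structured allocation; the paper applies DP noise during training rather than to images directly, and the sigmas here are arbitrary.

```python
import numpy as np

def haar2d(x):
    """One-level 2D Haar transform: returns (LL, (LH, HL, HH))."""
    a = (x[::2, :] + x[1::2, :]) / 2.0
    d = (x[::2, :] - x[1::2, :]) / 2.0
    ll = (a[:, ::2] + a[:, 1::2]) / 2.0
    lh = (a[:, ::2] - a[:, 1::2]) / 2.0
    hl = (d[:, ::2] + d[:, 1::2]) / 2.0
    hh = (d[:, ::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

def bandwise_dp_noise(x, sigma_low=0.5, sigma_high=0.1, rng=None):
    """Add stronger DP-style Gaussian noise to the coarse band and lighter
    noise to the detail bands -- illustrating structured allocation, not the
    paper's actual mechanism."""
    rng = rng or np.random.default_rng(0)
    ll, (lh, hl, hh) = haar2d(x)
    ll = ll + rng.normal(0, sigma_low, ll.shape)
    details = [b + rng.normal(0, sigma_high, b.shape) for b in (lh, hl, hh)]
    return ll, details

img = np.arange(64, dtype=float).reshape(8, 8)
ll, details = bandwise_dp_noise(img)
```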
- [2026-02-26] PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering
- 赛道归属: 多模态时间序列问答(时序模式对齐/推理训练)
- 核心创新点: 提出PATRA,通过“模式感知对齐”显式建模趋势、季节性等时序结构,避免将时间序列粗暴当作文本/图像输入;并通过“平衡推理”训练机制抑制简单任务目标的主导效应,促使模型学习更深层的逻辑推理能力。方法突破在于把时序模式表征与训练目标配比共同纳入可控优化。
- 一句话总结: PATRA让LLM在时间序列QA中既看得懂模式又推得动逻辑,提升复杂问题的可靠性。
- Track: Multimodal time-series QA (pattern-aware alignment / reasoning-oriented training)
- Core innovation: PATRA introduces pattern-aware alignment to explicitly encode trends/seasonality rather than treating time series as plain text/images, and a balanced-reasoning training scheme to prevent easy objectives from dominating and suppressing deep reasoning. The advance is jointly controlling time-series structure modeling and objective balance.
- One-sentence summary: It improves time-series QA by making models both pattern-literate and reasoning-robust on harder queries.
- [2026-02-26] WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
- 赛道归属: 组合式图像检索(零样本CIR/免训练适配)
- 核心创新点: 提出免训练的WISER ZS-CIR框架,不再将“参考图+修改文本”强行折叠为单一模态,而是进行更宽的候选搜索与更深的语义推断,并对T2I式与I2I式检索信号进行自适应融合。方法论突破在于以“多路径检索+自适应融合”同时保留细粒度视觉细节与文本修改意图。
- 一句话总结: WISER在无需三元组训练数据的前提下显著增强组合检索的鲁棒性与可用性。
- Track: Composed image retrieval (zero-shot CIR / training-free adaptation)
- Core innovation: WISER avoids collapsing (reference image + edit text) into a single modality by performing wider candidate search, deeper semantic inference, and adaptive fusion of T2I-style and I2I-style retrieval signals. The key is multi-route retrieval with adaptive fusion to preserve both fine visual details and edit intent.
- One-sentence summary: It makes zero-shot composed image retrieval more robust and practical without triplet training.
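A minimal picture of adaptive fusion: weight the text-driven (T2I-style) and image-driven (I2I-style) score lists by per-query confidence, here approximated by the top-2 margin. This heuristic is illustrative, not WISER's actual fusion rule.

```python
import numpy as np

def adaptive_fuse(t2i_scores, i2i_scores):
    """Fuse text-driven and image-driven retrieval scores per query,
    weighting whichever channel is more confident (larger margin between
    its top-2 candidates). Illustrative heuristic only."""
    def margin(s):
        top2 = np.sort(s)[-2:]
        return float(top2[1] - top2[0])
    mt, mi = margin(t2i_scores), margin(i2i_scores)
    w = mt / (mt + mi + 1e-9)  # confidence-based weight for the text route
    return w * np.asarray(t2i_scores) + (1 - w) * np.asarray(i2i_scores)

t2i = np.array([0.9, 0.2, 0.1])   # text route is confident here
i2i = np.array([0.5, 0.45, 0.4])  # image route is ambiguous
fused = adaptive_fuse(t2i, i2i)
best = int(np.argmax(fused))
```

Because the text route has the larger margin in this toy query, its ranking dominates the fused scores.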
- [2026-02-26] DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
- 赛道归属: 图像对齐与配准(扩散模型视角合成/对齐增强)
- 核心创新点: 提出DMAligner,用扩散模型进行“面向对齐的视图合成”来替代/补充传统光流扭曲,在遮挡与光照变化下生成更一致的对齐结果。核心突破是将对齐问题转化为条件生成的视图重建,通过生成式先验提升对齐质量与下游稳定性。
- 一句话总结: DMAligner用生成式视图合成绕开光流在复杂场景的脆弱性,提升图像对齐的视觉质量与可靠性。
- Track: Image alignment/registration (diffusion-based view synthesis for alignment)
- Core innovation: DMAligner reframes alignment as alignment-oriented view synthesis with diffusion models, mitigating optical-flow warping failures under occlusion and illumination changes. The advance is leveraging generative priors via conditional view reconstruction to improve alignment fidelity and downstream robustness.
- One-sentence summary: It boosts alignment quality in challenging conditions by replacing brittle warping with diffusion-based synthesized aligned views.
- [2026-02-26] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
- 赛道归属: 多模态理解与可解释性(VLM OCR信息路由/因果分析)
- 核心创新点: 通过因果干预定位VLM中OCR信息进入语言流的关键瓶颈:对比原图与“文本抹除/修补”图像的激活差异,系统刻画不同架构(Qwen3-VL、Phi-4、InternVL3.5)中OCR路由的主导层/模块位置。方法突破在于用可操作的反事实输入与激活差分,给出架构相关的可解释“路由瓶颈”诊断。
- 一句话总结: 该工作把VLM“读字能力”从黑箱变为可定位的系统瓶颈,为OCR能力增强与失效排查提供了直接抓手。
- Track: Multimodal interpretability (VLM OCR routing / causal analysis)
- Core innovation: Using causal interventions, it locates where OCR information is routed into the language stream by comparing activation differences between original images and text-inpainted counterfactuals, across multiple VLM families. The key advance is an actionable, architecture-specific diagnosis of OCR bottlenecks via counterfactual activation analysis.
- One-sentence summary: It turns VLM OCR from a black box into identifiable routing bottlenecks, enabling targeted improvements and debugging.
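The counterfactual activation-differencing recipe can be sketched in a few lines: compare per-layer activations for an image and its text-erased counterpart, then rank layers by the difference. The toy arrays below stand in for real VLM activations.

```python
import numpy as np

def locate_routing_bottleneck(acts_orig, acts_inpainted):
    """Given per-layer activations for an original image and its
    text-inpainted counterfactual, rank layers by mean absolute activation
    difference -- the counterfactual-differencing diagnosis sketched above."""
    diffs = [float(np.abs(a - b).mean()) for a, b in zip(acts_orig, acts_inpainted)]
    return int(np.argmax(diffs)), diffs

rng = np.random.default_rng(2)
base = [rng.normal(size=32) for _ in range(6)]
# counterfactual: identical except layer 3 shifts when text is erased
cf = [a.copy() for a in base]
cf[3] = cf[3] + 2.0
layer, diffs = locate_routing_bottleneck(base, cf)
```

The layer whose activations react most to erasing the text is the candidate "routing bottleneck" where OCR information enters the language stream.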
GitHub
- [2026-03-02] YouMind-OpenLab/awesome-nano-banana-pro-prompts ⭐8396
🍌 World's largest Nano Banana Pro prompt library — 10,000+ curated prompts with preview images, 16 languages. Google Gemini AI image generation. Free ...
- [2026-03-01] Dreamy-rain/gemini-business2api ⭐921
OpenAI-compatible API for Gemini Business with multi-account load balancing and image generation | 将 Gemini Business 转为 OpenAI 兼容接口,支持多账户负载均衡与图像生成、视频生...
- [2026-03-01] etkecc/baibot ⭐192
🤖 A Matrix bot for using different capabilities (text-generation, text-to-speech, speech-to-text, image-generation, etc.) of AI / Large Language Model...
- [2026-03-02] WP-Autoplugin/wp-banana ⭐83 🆕NEW
AI image generation and editing via Gemini, OpenAI and Replicate, right in your WordPress media library. Native-like integration in Elementor, WooComm...
- [2026-03-01] erroralex/Latent-Library ⭐63
A local-first, high-performance desktop asset manager for AI image generations. Features universal metadata parsing (ComfyUI/A1111), instant SQLite se...
视频生成/编辑 / Video Generation/Editing
arXiv
- [2026-02-26] ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation
- 赛道归属: 医学视频生成(扩散模型/可控生成)
- 核心创新点: 提出基于扩散模型的ColoDiff,将“动态一致性”与“内容感知”的生成目标显式结合,以在复杂肠道形变、病灶多样性与多成像模态下同时保证时序稳定与临床属性可控。通过面向结肠镜场景的结构/内容约束设计,提升生成视频的可诊断信息密度与可控性。
- 一句话总结: 在数据稀缺的临床场景中,ColoDiff以更强的时序一致性与属性控制能力生成高质量结肠镜视频,提升医学数据合成的实用价值。
- Track: Medical video generation (diffusion / controllable generation)
- Core innovation: ColoDiff is a diffusion-based framework that explicitly integrates dynamic (temporal) consistency with content awareness, targeting colonoscopy-specific challenges such as irregular anatomy, diverse lesions, and multi-modality imaging while enabling controllable clinical attributes. It introduces scene-tailored structure/content constraints to improve both stability over time and clinically meaningful control.
- One-sentence summary: ColoDiff makes synthetic colonoscopy videos more temporally consistent and clinically controllable, improving practical medical data augmentation under scarcity.
- [2026-02-26] Uni-Animator: Towards Unified Visual Colorization
- 赛道归属: 图像/视频上色(草图到图像/视频的生成与编辑,DiT)
- 核心创新点: 提出基于Diffusion Transformer的统一框架,同时覆盖图像与视频草图上色,并针对单/多参考的颜色迁移不准与细节丢失,引入“视觉参考增强”以提升参考信息的可用性与高频细节保真。面向视频进一步强化时序一致性,降低大运动场景中的闪烁与运动伪影。
- 一句话总结: Uni-Animator用一个统一的DiT框架把图像与视频草图上色打通,在参考上色精度、细节与时序稳定性上更均衡。
- Track: Image & video colorization (sketch-to-image/video, DiT-based generation/editing)
- Core innovation: Uni-Animator unifies sketch colorization for both images and videos with a Diffusion Transformer backbone, and improves inaccurate color transfer (single/multi-reference) and high-frequency detail preservation via visual reference enhancement. For videos, it explicitly strengthens temporal coherence to reduce flicker and motion artifacts in large-motion scenes.
- One-sentence summary: Uni-Animator provides a single DiT framework that jointly handles image/video sketch colorization with better reference transfer, detail fidelity, and temporal stability.
- [2026-02-26] The Trinity of Consistency as a Defining Principle for General World Models
- 赛道归属: 世界模型(视频生成驱动的物理一致性与推理框架)
- 核心创新点: 提出“Trinity of Consistency(三位一致性)”作为通用世界模型的定义性原则,用以统一刻画数据驱动视频生成在物理规律学习、可模拟性与可推理性上的关键约束。该工作从原则层面连接视频生成扩展规律与统一多模态模型(UMM)的架构趋势,为评测与设计世界模型提供可操作的理论坐标系。
- 一句话总结: 该工作用“三位一致性”把世界模型的目标从“能生成”提升到“物理一致、可复现、可推理”的可检验标准。
- Track: World models (video-generation-driven physical consistency & reasoning principles)
- Core innovation: It proposes the “Trinity of Consistency” as a defining principle for general world models, offering a unified lens to formalize the key constraints needed for learning, simulating, and reasoning about physical laws from data-driven video generation. The principle-level framing connects scaling in video generators with the emerging Unified Multimodal Model paradigm, guiding both design and evaluation.
- One-sentence summary: The paper elevates world-model goals into a testable consistency-based standard beyond mere video realism, targeting physically grounded simulation and reasoning.
- [2026-02-26] PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
- 赛道归属: 4D体积视频重建与表示(Gaussian Splatting/视频编码友好)
- 核心创新点: 提出PackUV,用“Packed Gaussian UV Maps”将4D高斯表示映射到可打包的UV空间,在长序列、大运动与遮挡/显隐变化下提升时序一致性与稳定性。关键突破在于让高斯体积视频输出与传统视频编码/传输管线兼容,从表示层面解决存储与流式分发的落地障碍。
- 一句话总结: PackUV把4D高斯体积视频变得更稳定且更“可编码”,推动体积视频从研究走向可规模化存储与传输。
- Track: 4D volumetric video representation & reconstruction (Gaussian splatting, codec-friendly)
- Core innovation: PackUV introduces Packed Gaussian UV Maps, mapping 4D Gaussian representations into a packed UV space to improve robustness on long sequences, large motions, and disocclusions while enhancing temporal consistency. A key contribution is making outputs compatible with conventional video coding pipelines, addressing practical storage/streaming constraints at the representation level.
- One-sentence summary: PackUV turns 4D Gaussian volumetric video into a more temporally stable and codec-compatible format, enabling scalable storage and streaming.
- [2026-02-26] UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models
- 赛道归属: 世界模型/视频生成(相机控制 + 长时记忆一致性)
- 核心创新点: 提出UCM,通过“时间感知的位置编码扭曲(Time-aware Positional Encoding Warping)”在生成过程中统一相机控制与记忆机制,使模型在场景被重复访问时保持长期内容一致,并能根据用户输入实现更精确的相机运动控制。相较显式3D重建或直接复用历史帧的方法,该方案在开放场景与细粒度结构上兼顾灵活性与一致性。
- 一句话总结: UCM用时间感知的位置编码变换把“可控相机运动”和“可记忆的世界一致性”统一到同一生成机制中。
- Track: World models / video generation (camera control + long-term memory consistency)
- Core innovation: UCM unifies camera control and memory via time-aware positional encoding warping, enabling precise user-driven camera motion while maintaining long-term content consistency when revisiting scenes. Compared with explicit 3D reconstruction or frame-reuse strategies, it aims to preserve flexibility in unbounded settings and fine-grained structures without sacrificing consistency.
- One-sentence summary: UCM provides a unified mechanism to achieve both controllable camera trajectories and persistent scene memory in generative world models.
- [2026-02-26] SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation
- 赛道归属: 文生视频对齐优化(动态空间关系/偏好优化)
- 核心创新点: 提出SPATIALALIGN作为自我改进框架,面向文本提示中的“动态空间关系(DSR)”对T2V模型进行对齐增强。方法上采用零阶正则化的DPO进行微调,并设计基于几何的DSR-SCORE作为可优化的反馈信号,从而在不依赖昂贵标注的情况下提升空间关系随时间变化的正确性。
- 一句话总结: SPATIALALIGN用几何评分+偏好优化,让文生视频更可靠地遵守“物体之间如何运动与相对位置如何变化”的文本约束。
- Track: Text-to-video alignment (dynamic spatial relationships / preference optimization)
- Core innovation: SPATIALALIGN is a self-improvement framework that aligns T2V models to Dynamic Spatial Relationships (DSR) expressed in prompts. It fine-tunes models with a zeroth-order regularized Direct Preference Optimization objective and introduces DSR-SCORE, a geometry-based metric that provides optimization feedback to improve temporally evolving spatial correctness without heavy annotation.
- One-sentence summary: SPATIALALIGN improves T2V faithfulness to dynamic spatial constraints using geometry-driven scoring and preference-based fine-tuning.
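The preference-optimization backbone here is standard DPO; below is a minimal sketch of the per-pair loss, assuming generations are ranked by a geometry-based score such as DSR-SCORE (the paper's zeroth-order regularization is omitted).

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (preferred, dispreferred)
    pair of generations, assumed ranked by a geometry-based score such as
    DSR-SCORE. Standard DPO form; regularization terms omitted."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log sigmoid(margin)

# The loss shrinks as the policy upweights the higher-scoring video.
weak = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # no preference learned yet
strong = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # winner upweighted vs reference
```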
- [2026-02-26] Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
- 赛道归属: 推理优化(扩散模型采样加速/缓存)
- 核心创新点: 提出DPCache,将扩散去噪过程视为“路径规划”,利用对整条去噪轨迹全局结构的建模来决定缓存复用/预测策略,而非仅用固定或局部自适应步长。该方法训练无关(training-free),在保持生成质量的同时减少多步采样的计算开销,适用于图像与视频扩散推理加速。
- 一句话总结: DPCache以“全局轨迹视角”改造缓存加速策略,在无需训练的前提下更高效地加速扩散采样。
- Track: Inference optimization (diffusion sampling acceleration / caching)
- Core innovation: DPCache reframes diffusion denoising as path planning, leveraging the global structure of the denoising trajectory to guide cache reuse/prediction rather than relying on fixed or purely local adaptive schedules. It is training-free and reduces multi-step sampling cost while preserving generation quality for both image and video diffusion models.
- One-sentence summary: DPCache accelerates diffusion inference more effectively by making cache decisions with a global trajectory-aware strategy, without any additional training.
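Cache reuse along a denoising trajectory can be illustrated with a toy loop that re-runs the model only every few steps and otherwise reuses the cached prediction. This shows the reuse mechanics only; DPCache's contribution is making these decisions from the global trajectory structure rather than a fixed tolerance like the one below.

```python
import numpy as np

def cached_denoise(x, model, steps, tol=0.05):
    """Toy denoising loop with caching: re-run the model only when the last
    fresh prediction is deemed stale; otherwise reuse it. Illustrates cache
    reuse, not DPCache's global path-planning policy."""
    cache, calls = None, 0
    for t in range(steps, 0, -1):
        if cache is None or abs(cache["t"] - t) / steps > tol:
            eps = model(x, t)
            cache, calls = {"t": t, "eps": eps}, calls + 1
        else:
            eps = cache["eps"]  # cheap reuse of the cached prediction
        x = x - eps / steps
    return x, calls

model = lambda x, t: 0.1 * x  # stand-in for an expensive score model
x0 = np.ones(4)
out, calls = cached_denoise(x0, model, steps=20, tol=0.1)
```

With this tolerance the model is evaluated on only 7 of 20 steps, which is the source of the speedup; the research question is how to choose the reuse points without hurting quality.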
- [2026-02-26] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
- 赛道归属: 3D场景生成/新视角合成(稀疏输入 + 扩散先验)
- 核心创新点: 提出BetterScene,利用在海量视频上预训练的Stable Video Diffusion作为强先验,在推理阶段对极稀疏、非受控照片输入进行表示对齐的生成式补全,以提升NVS的视角一致细节恢复并抑制伪影。其方法论重点在于将“视频扩散的时空先验”与3D/NVS表示进行对齐,从而在不增加采集成本的情况下增强真实场景泛化。
- 一句话总结: BetterScene把大规模视频扩散模型的先验迁移到稀疏照片的新视角合成中,显著改善真实场景的细节与一致性。
- Track: 3D scene synthesis / novel view synthesis (sparse inputs + diffusion prior)
- Core innovation: BetterScene leverages the production-scale Stable Video Diffusion prior and performs representation-aligned generative refinement at inference time for extremely sparse, unconstrained photos, improving view-consistent detail recovery and reducing artifacts in NVS. The key methodological idea is aligning spatiotemporal diffusion priors with 3D/NVS representations to boost real-scene generalization without extra capture.
- One-sentence summary: BetterScene transfers large-scale video diffusion priors to sparse-photo NVS, yielding more consistent details and fewer artifacts in real-world scenes.
- [2026-02-25] Flow Matching is Adaptive to Manifold Structures
- 赛道归属: 生成建模理论(Flow Matching/连续归一化流)
- 核心创新点: 从理论与机制层面论证Flow Matching对数据“流形结构”具有自适应性:在高维但低维流形集中的数据分布下,学习到的速度场/ODE采样会自然贴合流形几何,从而解释其训练稳定性与经验性能优势。该工作为在图像/视频等流形数据上的流式生成提供更坚实的理论依据与方法选择指导。
- 一句话总结: 该工作解释了为何Flow Matching在流形数据上更“顺着数据几何走”,为替代扩散的生成路线提供理论支撑。
- Track: Generative modeling theory (flow matching / continuous-time generative models)
- Core innovation: The work argues that flow matching is inherently adaptive to manifold structure: when high-dimensional data concentrate near low-dimensional manifolds, the learned velocity field and ODE sampling tend to align with manifold geometry, helping explain improved stability and empirical performance. This provides theoretical grounding for applying flow-based generative modeling to manifold-structured data such as images and videos.
- One-sentence summary: It offers a principled explanation for flow matching’s effectiveness on manifold data, strengthening the case for flow-based alternatives to diffusion.
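For linear interpolation paths x_t = (1 - t) * x0 + t * x1, the conditional flow matching regression target is simply the displacement x1 - x0; a minimal numpy sketch of the objective being analyzed:

```python
import numpy as np

def flow_matching_loss(v_pred, x0, x1):
    """Conditional flow matching loss for linear paths
    x_t = (1 - t) * x0 + t * x1: the velocity field should predict
    x1 - x0 at every t. Returns the mean-squared error."""
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

rng = np.random.default_rng(3)
x0 = rng.normal(size=(8, 2))          # noise samples
x1 = rng.normal(size=(8, 2)) + 5.0    # data concentrated near a shifted region
perfect = flow_matching_loss(x1 - x0, x0, x1)
off = flow_matching_loss(np.zeros_like(x0), x0, x1)
```

The paper's claim concerns what the learned velocity field looks like when x1 lies near a low-dimensional manifold: the regression target geometry then pulls ODE trajectories toward that manifold.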
- [2026-02-25] Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences 🆕NEW
- 赛道归属: 动态3D重建 / 长序列点云表面重建(Dynamic 3D Reconstruction from Point Clouds)
- 核心创新点: 提出一种“神经预条件化网格(Neural Preconditioned Grids)”的潜在网格编码,将空间特征以网格形式参数化并用于形变优化,从而显著改善长序列动态表面重建中的优化条件数与收敛效率。相较于逐帧增量形变带来的漂移与高耗时,以及依赖类别训练的复杂模型,该方法以更轻量的学习/编码方式实现时间一致性与快速优化,适配超长序列与非结构化点云输入。
- 一句话总结: Neu-PiG通过预条件化的潜在网格编码加速并稳定长序列动态点云的形变优化,实现更快、更不易漂移的时间一致动态表面重建。
- Track: Dynamic 3D reconstruction / long-sequence dynamic surface reconstruction from point clouds
- Core innovation: Introduces Neural Preconditioned Grids, a latent grid encoding that parameterizes spatial features as grids for deformation optimization, substantially improving conditioning and convergence in long-sequence dynamic surface reconstruction. Compared with per-frame incremental deformation (prone to drift and slow) and category-trained models, it achieves temporal consistency and fast optimization with a lighter learning/encoding scheme, handling very long sequences and unstructured point-cloud inputs.
- One-sentence summary: Neu-PiG accelerates and stabilizes deformation optimization for long dynamic point-cloud sequences via preconditioned latent grid encoding, yielding faster, drift-resistant, temporally consistent surface reconstruction.
GitHub
- [2026-03-01] hao-ai-lab/FastVideo ⭐3112
A unified inference and post-training framework for accelerated video generation.
- [2026-03-01] leofan90/Awesome-World-Models ⭐1264 🆕NEW
A comprehensive list of papers for the definition of World Models and using World Models for General Video Generation, Embodied AI, and Autonomous Dri...
- [2026-03-01] alex4727/MotionStream ⭐514 🆕NEW
MotionStream: Real-Time Video Generation with Interactive Motion Controls
- [2026-03-01] thu-ml/Causal-Forcing ⭐402 🆕NEW
Official codebase for "Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation"
- [2026-03-01] YouMind-OpenLab/awesome-seedance-2-prompts ⭐188
🎬 400+ curated Seedance 2.0 video generation prompts — cinematic, anime, UGC, ads, meme styles. Includes Seedance API guides, character consistency ti...
语言大模型 / Large Language Models
arXiv
- [2026-02-26] Instruction-based Image Editing with Planning, Reasoning, and Generation 📖2 🆕NEW
- 赛道归属: 指令驱动图像编辑(Instruction-based Image Editing)
- 核心创新点: 提出将“规划-推理-生成”一体化的多模态模型,用统一框架打通编辑任务中的场景理解与图像生成,避免以往依赖LLM+分割+编辑模型的串联式流水线带来的模态割裂与误差累积。通过在编辑前显式进行步骤规划与视觉语义推理,提升对复杂指令、对象关系与全局一致性的编辑质量。
- 一句话总结: 用端到端多模态的规划推理机制连接理解与生成,使指令图像编辑更可靠、更一致、可控性更强。
- Track: Instruction-based image editing
- Core innovation: Proposes a multimodal model that unifies planning, reasoning, and generation in one framework, connecting scene understanding with image synthesis and avoiding the modality fragmentation and error accumulation of cascaded LLM + segmentation + editor pipelines. Explicit step planning and visual-semantic reasoning before editing improve quality on complex instructions, object relations, and global consistency.
- One-sentence summary: An end-to-end multimodal planning-and-reasoning mechanism bridges understanding and generation, making instruction-based image editing more reliable, consistent, and controllable.
- [2026-02-26] Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization 📖1 🆕NEW
- 赛道归属: LLM智能体强化学习(探索与记忆增强)
- 核心创新点: 提出EMPO²混合强化学习框架,将“记忆驱动探索”与“on-policy + off-policy”联合优化结合:一方面利用外部/内部记忆促进新状态发现,另一方面通过离策略更新提升在无记忆条件下的鲁棒性与泛化。该设计针对“依赖记忆导致脆弱”的常见问题,实现探索能力与稳健性的兼顾。
- 一句话总结: 用记忆增强探索、用混合策略训练保鲁棒,缓解LLM智能体在新环境中“找不到新状态、离开记忆就退化”的核心瓶颈。
- Track: RL for LLM Agents (exploration & memory augmentation)
- Core innovation: EMPO² introduces a hybrid RL scheme that couples memory-driven exploration with joint on-policy and off-policy optimization, improving novel state discovery while explicitly training robustness when memory is absent. This directly targets the common brittleness of memory-reliant agents.
- One-sentence takeaway: It boosts LLM-agent exploration via memory while preserving performance without memory through hybrid policy optimization.
- [2026-02-24] An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems 📖1
- 赛道归属: LLM评测与可靠性(学术问答/错误分析框架)
- 核心创新点: 构建并验证一套面向“学术问答系统”的专家级错误分类与标注schema,强调领域专家在真实科研语境下对错误的判定维度(而非仅依赖自动指标)。该schema将错误类型与可操作的诊断信号对齐,支持更细粒度地定位幻觉、证据使用不当、推断越界等问题。
- 一句话总结: 为学术场景的LLM问答提供了可复用的“专家视角”错误本体,提升可靠性评估的可解释性与可行动性。
- Track: LLM evaluation & reliability (scholarly QA / error taxonomy)
- Core innovation: Develops and validates an expert-driven error evaluation schema for scholarly QA that captures how domain experts judge failures in real scientific contexts, beyond automated metrics. The schema aligns error categories with actionable diagnostic cues to pinpoint hallucinations, misuse of evidence, and overreaching inference.
- One-sentence summary: Provides a reusable expert-grounded error ontology that makes scholarly LLM QA evaluation more interpretable and actionable.
- [2026-02-26] MediX-R1: Open Ended Medical Reinforcement Learning 🆕NEW
- 赛道归属: 医疗多模态大模型对齐(开放式回答的强化学习)
- 核心创新点: 提出MediX-R1开放式医疗RL框架,用Group-based RL对视觉-语言骨干进行微调,并设计面向医疗推理的复合奖励:包含LLM裁判的严格YES/NO语义正确性奖励,以及基于医学嵌入的语义相似奖励以覆盖同义改写,从而把训练目标从选择题扩展到临床可用的自由文本回答。
- 一句话总结: 通过“医疗专用复合奖励 + 组式RL”,把医疗MLLM从答题式评测推进到更贴近临床表达的开放式推理与作答。
- Track: Medical multimodal LLM alignment (RL for open-ended answering)
- Core innovation: MediX-R1 fine-tunes a vision-language backbone with group-based RL and a medical-reasoning composite reward: a strict LLM-judge YES/NO semantic-accuracy signal plus a medical-embedding semantic reward to handle paraphrases, enabling clinically grounded free-form outputs beyond MCQ formats.
- One-sentence takeaway: It aligns medical MLLMs for clinically realistic open-ended responses using domain-tailored rewards and group-based RL.
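The composite reward can be sketched as a weighted sum of a binary judge verdict and embedding similarity to the reference answer, so correct paraphrases still earn partial reward. Names, weights, and toy embeddings below are illustrative, not MediX-R1's exact implementation.

```python
import numpy as np

def composite_reward(judge_yes: bool, answer_emb, reference_emb, w=0.5):
    """Combine a strict binary LLM-judge verdict with cosine similarity
    between answer and reference embeddings. Schematic sketch of the
    reward design; weights and inputs are illustrative."""
    a = answer_emb / np.linalg.norm(answer_emb)
    r = reference_emb / np.linalg.norm(reference_emb)
    sim = float(np.clip(a @ r, 0.0, 1.0))
    return w * float(judge_yes) + (1 - w) * sim

ref = np.array([1.0, 0.0, 0.0])
exact = composite_reward(True, ref, ref)                      # identical answer
paraphrase = composite_reward(True, np.array([0.9, 0.4, 0.1]), ref)
wrong = composite_reward(False, np.array([0.0, 1.0, 0.0]), ref)
```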
- [2026-02-26] Utilizing LLMs for Industrial Process Automation 🆕NEW
- 赛道归属: 工业软件工程LLM应用(工业过程自动化/专用语言)
- 核心创新点: 聚焦工业过程自动化领域的专用/私有语言与工程流程,系统化评估与总结LLM在低公开语料、强领域约束场景下的可用性与实践路径(如提示策略、知识注入、验证与安全约束),弥补现有研究过度偏向Python等通用语言的空白。
- 一句话总结: 为“数据稀缺且高度专用”的工业自动化编程场景提供LLM落地方法论与边界认知。
- Track: LLMs for industrial software engineering (process automation & domain-specific languages)
- Core innovation: The work investigates how to effectively apply LLMs to industrial process automation where proprietary/specialized languages dominate and public training data is scarce, distilling practical strategies (prompting, knowledge injection, verification, safety constraints) and clarifying limitations versus general-purpose language settings.
- One-sentence takeaway: It maps out actionable best practices and constraints for deploying LLMs in real industrial automation workflows with specialized languages.
- [2026-02-26] Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks 🆕NEW
- 赛道归属: 金融交易多智能体系统(任务分解与可解释决策)
- 核心创新点: 提出面向真实投研流程的多智能体LLM框架,将投资分析拆解为细粒度、可执行的交易任务链(而非抽象角色指令),以结构化分工提升推理稳定性与决策透明度,并更贴近投研团队的协作工作流。
- 一句话总结: 用“细粒度任务分解”的多智能体协作替代泛化角色扮演,让LLM交易系统更可控、更可解释、更贴近真实投研。
- Track: Multi-agent LLMs for financial trading (workflow/task decomposition & interpretability)
- Core innovation: The framework replaces abstract role-based prompting with explicit decomposition into fine-grained trading tasks aligned with real investment workflows, improving inference reliability and making intermediate decisions more transparent through structured agent responsibilities.
- One-sentence takeaway: It makes LLM-based trading systems more controllable and interpretable by grounding multi-agent collaboration in concrete, fine-grained tasks.
- [2026-02-26] LLM Novice Uplift on Dual-Use, In Silico Biology Tasks 🆕NEW
- 赛道归属: LLM安全与人类增益评测(生物双用途/生物安全)
- 核心创新点: 通过多模型、多基准的人类受试“uplift”实验,直接测量LLM是否能让生物领域新手在双用途(biosecurity-relevant)的纯计算任务上超越“仅用互联网资源”的表现,从评测范式上把“模型能力”推进到“对人类能力提升与风险外溢”的因果对比。
- 一句话总结: 用严格的人类对照实验量化LLM对生物双用途任务的“新手增益”,为科学加速与安全风险评估提供关键证据链。
- Track: LLM safety & human uplift evaluation (dual-use / biosecurity)
- Core innovation: A multi-model, multi-benchmark human uplift study compares novices with LLM access vs internet-only access on biosecurity-relevant in silico tasks, shifting evaluation from model-only scores to causal measurement of human performance amplification and associated dual-use risk.
- One-sentence takeaway: It quantifies whether and how LLMs materially boost novice capability on dual-use biology tasks, informing both acceleration and risk assessments.
- [2026-02-26] Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction 🆕NEW
- 赛道归属: 小语言模型SLM评测与适配(人机交互/机器人角色识别)
- 核心创新点: 面向资源受限机器人场景,系统评估小语言模型在leader-follower交互中的零样本/一样本适配能力,用以替代高延迟大模型;重点在于对“实时角色分类”这一HRI关键子任务建立可复现的适配与性能基线,明确SLM在低样本条件下的可用边界。
- 一句话总结: 给出SLM在HRI实时角色分配任务上的零/一样本能力画像,为端侧部署提供依据。
- Track: Small language models (SLMs) evaluation & adaptation (HRI / robotics role classification)
- Core innovation: The paper benchmarks zero-shot and one-shot adaptation of SLMs for leader-follower role classification in HRI, targeting on-device constraints (latency/compute) and establishing reproducible baselines that clarify when small models can replace large ones in real-time interaction.
- One-sentence takeaway: It characterizes the practical limits and potential of SLMs for real-time role assignment in resource-constrained robots.
- [2026-02-26] ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding 🆕NEW
- 赛道归属: 多模态推理增强(训练外推理/引导式解码)
- 核心创新点: 提出ThinkOmni,通过“Guidance Decoding(引导式解码)”把文本大推理模型的推理能力迁移到全模态场景:在不进行或尽量少进行额外训练的前提下,用解码阶段的引导机制增强omni-modal LLM的复杂推理表现,绕开高质量多模态推理数据与高算力再训练的瓶颈。
- 一句话总结: 以解码期引导替代重训练,把强文本推理能力更低成本地“抬升”到全模态推理。
- Track: Multimodal reasoning enhancement (training-free / guided decoding)
- Core innovation: ThinkOmni boosts omni-modal LLM reasoning via guidance decoding, transferring strong textual reasoning behaviors to omni-modal inputs with little or no additional training, avoiding the need for expensive multimodal reasoning datasets and compute-heavy retraining.
- One-sentence takeaway: It upgrades omni-modal reasoning largely at inference time, offering a cost-effective path to stronger multimodal inference.
- [2026-02-26] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations 🆕NEW
- 赛道归属: 多模态对话情感理解(MoE融合与上下文建模)
- 核心创新点: 提出MiSTER-E模块化MoE框架,将对话情感识别中的两大难点解耦:分别用语音/文本专家进行模态内的上下文时序建模,再通过MoE机制实现跨模态融合,从结构上提升多轮对话中的信息对齐与鲁棒融合能力。
- 一句话总结: 用“模态专家 + MoE融合”的解耦设计,更稳健地把语音与文本线索整合到多轮对话情感识别中。
- Track: Multimodal conversational emotion recognition (MoE fusion & contextual modeling)
- Core innovation: MiSTER-E is a modular MoE architecture that decouples modality-specific temporal context modeling (speech vs text experts) from multimodal fusion, improving alignment and robustness for multi-turn emotion recognition in conversations.
- One-sentence takeaway: It strengthens ERC by separating per-modality context understanding from cross-modal fusion via a mixture-of-experts design.
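The expert-plus-gate structure can be sketched with two modality feature vectors and a softmax gate. This is a minimal illustration of MoE-style fusion, not the MiSTER-E architecture itself.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_fuse(speech_feat, text_feat, gate_w):
    """Gate over two modality experts: each expert first models its own
    modality's context (identity features here for brevity), then a learned
    gate mixes them. Minimal sketch of the expert + gating structure."""
    experts = np.stack([speech_feat, text_feat])          # (2, d)
    gate_in = np.concatenate([speech_feat, text_feat])    # (2d,)
    weights = softmax(gate_w @ gate_in)                   # (2,) mixing weights
    return weights @ experts, weights

d = 4
rng = np.random.default_rng(4)
gate_w = rng.normal(size=(2, 2 * d))
fused, w = moe_fuse(rng.normal(size=d), rng.normal(size=d), gate_w)
```

The decoupling matters because the gate can downweight an unreliable modality (e.g., noisy speech) per utterance while the per-modality experts stay specialized.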
GitHub
- [2026-03-02] sgl-project/sglang ⭐23928
SGLang is a high-performance serving framework for large language models and multimodal models.
- [2026-03-02] PaddlePaddle/PaddleFormers ⭐12985 🆕NEW
PaddleFormers is an easy-to-use library of pre-trained large language model zoo based on PaddlePaddle.
- [2026-03-02] NVIDIA/TensorRT-LLM ⭐12976
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perfo...
- [2026-03-02] johnhuang316/code-index-mcp ⭐794 🆕NEW
A Model Context Protocol (MCP) server that helps large language models index, search, and analyze code repositories with minimal setup
- [2026-03-02] testtimescaling/testtimescaling.github.io ⭐88
"what, how, where, and how well? a survey on test-time scaling in large language models" repository
HuggingFace Datasets
- [2026-02-27] nvidia/Nemotron-Terminal-Corpus 🆕NEW
Terminal-Corpus: Large-Scale SFT Dataset for Terminal Agents
Terminal-Corpus is a large-scale Supervised Fine-Tuning (SFT) dataset designed...
多模态大模型 / Multimodal Models
arXiv
- [2026-02-26] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning 📖1 🆕NEW
- 赛道归属: 多模态理解(视觉-语言推理数据偏差分析/评测)
- 核心创新点: 该工作将VLM推理不足归因于训练语料中的“报告偏差”(人类描述图像时倾向省略显而易见但对推理监督关键的隐含信息),并系统性分析主流VLM训练数据中这种偏差如何削弱可学习的推理信号。方法上强调从数据生成与标注语用学角度解释“规模化不等于推理提升”,为改进数据构建与评测提供可操作的诊断框架。
- 一句话总结: 通过揭示报告偏差对视觉-语言推理学习信号的系统性抑制,该工作说明仅靠扩大数据与模型规模难以补齐推理能力,关键在于改造数据监督形态。
- Core innovation: The paper attributes weak VLM reasoning to “reporting bias” in training corpora—human captions omit tacit but supervision-critical facts—and empirically analyzes how this bias in popular VLM datasets limits learnable reasoning signals. Methodologically, it reframes “scaling ≠ reasoning” as a pragmatics/data issue and provides a diagnostic lens for dataset design and evaluation.
- One-sentence summary: By pinpointing reporting bias as a root cause of missing reasoning supervision, it argues that scaling alone won’t yield robust VLM reasoning without changing how data is written and supervised.
- [2026-02-25] GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models 📖1
- 赛道归属: 文生图评测与公平性(地理多样性/偏见评估)
- 核心创新点: 提出GeoDiv评测框架,利用大语言模型与视觉-语言模型对T2I生成结果进行“地理语义”层面的可解释评估,避免仅依赖人工标注数据集或表层视觉相似度指标。框架将地理多样性、刻板印象与区域表征偏差转化为可量化、可诊断的评测信号。
- 一句话总结: GeoDiv为文生图模型提供了可解释、可扩展的地理多样性与偏见评估工具,帮助系统性发现“世界表征”失真问题。
- Track: Text-to-image evaluation & fairness (geographical diversity/bias assessment)
- Core innovation: GeoDiv introduces an interpretable evaluation framework that leverages LLMs and vision-language models to assess geographic semantics in T2I outputs, moving beyond curated datasets and shallow visual-similarity metrics. It turns geographic diversity, stereotyping, and regional misrepresentation into quantifiable, diagnosable signals.
- One-sentence summary: GeoDiv enables scalable and interpretable auditing of how T2I models portray regions, exposing geographic bias and diversity failures.
- [2026-02-24] Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination 📖1
- 赛道归属: 多模态智能体 / 模仿学习与可控行为生成
- 核心创新点: 将“内在言语(inner speech)”作为显式的行为指导信号,引入到模仿学习中以刻画人类行为的多样性与非马尔可夫(长程依赖)特征。通过在推理时对内在言语进行条件化,实现对同一任务下不同行为风格/策略的可控“转向(steering)”。
- 一句话总结: 用可操控的“内在言语”把人类式多样行为注入模仿学习,使人机协作中的策略选择更灵活、可解释、可调控。
- Track: Multimodal agents / Imitation learning & controllable behavior generation
- Core innovation: Introduces explicit “inner speech” as a behavior-guiding signal in imitation learning to capture diverse, non-Markovian human behaviors. Enables inference-time steering by conditioning on inner speech to select different behavior modes/policies for the same task.
- One-sentence summary: Inner-speech conditioning makes imitation-learned agents more steerable and adaptable for human–AI coordination.
- [2026-02-26] MediX-R1: Open Ended Medical Reinforcement Learning 🆕NEW
- 赛道归属: 医疗多模态大模型对齐(开放式回答的强化学习)
- 核心创新点: 提出MediX-R1开放式医疗RL框架,用Group-based RL对视觉-语言骨干进行微调,并设计面向医疗推理的复合奖励:包含LLM裁判的严格YES/NO语义正确性奖励,以及基于医学嵌入的语义相似奖励以覆盖同义改写,从而把训练目标从选择题扩展到临床可用的自由文本回答。
- 一句话总结: 通过“医疗专用复合奖励 + 组式RL”,把医疗MLLM从答题式评测推进到更贴近临床表达的开放式推理与作答。
- Track: Medical multimodal LLM alignment (RL for open-ended answering)
- Core innovation: MediX-R1 fine-tunes a vision-language backbone with group-based RL and a medical-reasoning composite reward: a strict LLM-judge YES/NO semantic-accuracy signal plus a medical-embedding semantic reward to handle paraphrases, enabling clinically grounded free-form outputs beyond MCQ formats.
- One-sentence takeaway: It aligns medical MLLMs for clinically realistic open-ended responses using domain-tailored rewards and group-based RL.
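The composite reward described above can be sketched as a weighted blend of the two signals. The weighting, the boolean judge interface, and the toy embeddings below are illustrative assumptions, not MediX-R1's actual implementation.

```python
# Hypothetical sketch of a composite reward in the spirit of MediX-R1:
# a strict binary LLM-judge signal plus an embedding-similarity term.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def composite_reward(judge_says_yes, answer_emb, reference_emb, w_judge=0.7):
    """Blend a strict YES/NO correctness signal with semantic similarity.

    judge_says_yes: bool from an LLM judge ("is the answer clinically correct?")
    answer_emb / reference_emb: embeddings of the free-text answer and the
    reference answer, e.g. from a medical text encoder.
    """
    judge_r = 1.0 if judge_says_yes else 0.0
    sim_r = max(0.0, cosine(answer_emb, reference_emb))  # clip negatives
    return w_judge * judge_r + (1.0 - w_judge) * sim_r

# A paraphrased-but-correct answer earns extra credit via similarity:
r = composite_reward(True, [0.6, 0.8], [0.8, 0.6], w_judge=0.7)
print(round(r, 3))  # 0.988
```

The similarity term is what lets free-text paraphrases of the reference answer score well even when they do not match it verbatim, which is the point of moving beyond MCQ-style exact matching.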
- [2026-02-26] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? 🆕NEW
- Track: Open-vocabulary segmentation (OVS) / Few-shot segmentation
- Core innovation: Introduces a retrieve-and-segment few-shot OVS paradigm that uses a small number of examples plus retrieval to compensate for the pixel-level supervision gap left by VLMs' image-level supervision, and to reduce the semantic ambiguity that natural-language prompts introduce into boundary and category alignment. Example-driven semantic alignment with retrieved, transferable visual evidence grounds open-vocabulary capability at the pixel level.
- One-sentence summary: A few retrieved examples can meaningfully close the supervision gap in open-vocabulary segmentation, offering a practical bridge from prompt-only methods toward stronger, controllable pixel-level grounding.
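The "retrieve" step of such a pipeline can be sketched as nearest-neighbor search in a frozen encoder's feature space; the retrieved exemplars would then condition the segmentation of the query image. The feature vectors and exemplar names below are toy stand-ins, not the paper's data or encoder.

```python
# Minimal sketch of exemplar retrieval for a retrieve-and-segment pipeline,
# assuming query and few-shot support images are already encoded as vectors.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_exemplars(query_feat, exemplars, k=2):
    """Return names of the k exemplars most similar to the query feature.

    exemplars: list of (name, feature_vector) pairs from the support set.
    """
    scored = sorted(exemplars, key=lambda e: cosine(query_feat, e[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

support = [
    ("cat_01", [0.9, 0.1, 0.0]),
    ("dog_07", [0.1, 0.9, 0.1]),
    ("cat_13", [0.8, 0.2, 0.1]),
]
print(retrieve_exemplars([1.0, 0.0, 0.0], support, k=2))
```

Retrieval matters here because the most relevant support masks, not just the text prompt, supply the pixel-level evidence the VLM's image-level pretraining lacks.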
- [2026-02-26] CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays 🆕NEW
- Track: Medical multimodal agents (chest X-ray diagnostic reasoning) / Evidence-traceable reasoning
- Core innovation: Proposes an evidence-grounded diagnostic reasoning agent for chest X-rays whose multi-step conclusions are explicitly tied to verifiable visual evidence, mitigating the plausible-but-unfaithful hallucinated diagnoses of LVLMs; agentic orchestration (rather than costly retraining) supports extension to new diagnostic tasks. The methodological advance is an auditable closed loop of evidence retrieval/localization, reasoning, and conclusion.
- One-sentence summary: By upgrading chest X-ray interpretation from answer generation to evidence-driven, verifiable multi-step reasoning, it improves the reliability and adaptability of medical LVLMs for clinical deployment.
- [2026-02-26] Large Multimodal Models as General In-Context Classifiers
- Track: Multimodal understanding / In-context classification & benchmarking
- Core innovation: Systematically establishes and benchmarks large multimodal models (LMMs/MLLMs) as general in-context classifiers for closed-set classification, challenging the prevailing "use CLIP for classification" view. Multi-dataset comparisons against CLIP-style contrastive VLMs isolate when and why LMM in-context learning works for example-driven classification.
- One-sentence summary: Repositions LMMs as strong, general-purpose in-context classifiers rather than tools reserved only for complex reasoning tasks.
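The in-context classification setup amounts to interleaving labeled demonstration images with the query and constraining the answer to a fixed label set. The segment format below is a generic placeholder, not any specific model's API, and the image references and labels are invented.

```python
# Illustrative sketch of framing closed-set classification as in-context
# learning for a multimodal model: demos + query + fixed label vocabulary.

def build_icl_prompt(demos, query_image, label_set):
    """demos: list of (image_ref, label) pairs; returns prompt segments.

    Each "<image:...>" token stands in for wherever a real multimodal API
    would attach the actual image content.
    """
    segments = [f"Classify each image as one of: {', '.join(sorted(label_set))}."]
    for image_ref, label in demos:
        segments.append(f"<image:{image_ref}> Label: {label}")
    segments.append(f"<image:{query_image}> Label:")
    return segments

prompt = build_icl_prompt(
    demos=[("img_001", "pneumonia"), ("img_002", "normal")],
    query_image="img_queried",
    label_set={"pneumonia", "normal"},
)
print(prompt[-1])  # <image:img_queried> Label:
```

Unlike CLIP-style zero-shot classification, which scores each label text against the image independently, this formulation lets the model see labeled examples at inference time, which is the capability the paper benchmarks.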
- [2026-02-26] MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
- Track: Video understanding & generative summarization / Long-form movie synopsis generation
- Core innovation: Proposes a tool-augmented movie-synopsis framework that uses an ID-consistent (character/entity-consistent), progressively abstracting hierarchical generation strategy to distill long videos from clip-level information into a high-level narrative. External tools/modules assist retrieval, alignment, and consistency tracking, reducing the character confusion, broken event chains, and long-span forgetting common to generic VLMs on long videos.
- One-sentence summary: Tool augmentation plus ID-consistent progressive narrative abstraction yields longer-range, more coherent movie-level synopses.
- [2026-02-26] Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
- Track: Medical multimodal / Weakly supervised microscopy vision-language alignment
- Core innovation: Addresses the real-world scarcity of paired image-text data in human brain microscopy with a weakly supervised vision-language modeling scheme that aligns cytoarchitectural knowledge (cell structure, cortical layering) expressed in text to microscopic visual representations. Learning a usable image-text interface from weak/indirect supervision lets researchers retrieve, describe, and interactively analyze data in natural language.
- One-sentence summary: Achieves usable image-text alignment from weak supervision in a brain-microscopy setting short on high-quality paired annotations, giving research workflows a natural-language entry point.
- [2026-02-26] SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
- Track: Anomaly detection / Few-shot industrial inspection (training-free)
- Core innovation: Introduces SubspaceAD, a training-free few-shot anomaly detector that models normal samples as a subspace directly in foundation-model feature space and scores anomalies via reconstruction error/subspace distance. It avoids memory banks, auxiliary datasets, and VLM tuning, arguing that well-used off-the-shelf representations suffice.
- One-sentence summary: A minimalist, training-free subspace approach turns foundation features into a strong few-shot anomaly detection baseline.
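The subspace-distance idea can be sketched in a few lines: fit a low-rank subspace to normal-sample features via SVD, then score a test feature by the norm of its residual outside that subspace. The synthetic features below stand in for frozen foundation-model features; this is a toy sketch of the general technique, not SubspaceAD's exact procedure.

```python
# Toy sketch of training-free subspace scoring: normal features define a
# low-rank subspace; anomalies are features far from that subspace.
import numpy as np

def fit_subspace(normal_feats, rank=1):
    """Return mean and top-`rank` principal directions of normal features."""
    mean = normal_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_feats - mean, full_matrices=False)
    return mean, vt[:rank]  # rows of vt are principal directions

def anomaly_score(feat, mean, basis):
    """Distance from the subspace: norm of the residual after projection."""
    centered = feat - mean
    projected = basis.T @ (basis @ centered)
    return float(np.linalg.norm(centered - projected))

rng = np.random.default_rng(0)
# Synthetic "normal" features varying along the first axis only.
normal = np.stack([[t, 0.0, 0.0] for t in rng.normal(size=16)])
mean, basis = fit_subspace(normal, rank=1)
print(round(anomaly_score(np.array([0.5, 0.0, 0.0]), mean, basis), 3))  # in-subspace: 0.0
print(round(anomaly_score(np.array([0.5, 2.0, 0.0]), mean, basis), 3))  # off-subspace: 2.0
```

Because the subspace is fit directly on the few normal examples' features, there is nothing to train: the only choices are the feature extractor and the subspace rank.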
GitHub
- [2026-03-01] OmniSVG/OmniSVG ⭐2381 🆕NEW
[NeurIPS 2025] OmniSVG is the first family of end-to-end multimodal SVG generators that leverage pre-trained Vision-Language Models (VLMs), capable of...
- [2026-02-25] xjywhu/Awesome-Multimodal-LLM-for-Code ⭐209
Multimodal Large Language Models for Code Generation under Multimodal Scenarios
- [2026-02-27] EdinburghNLP/MMLongBench ⭐176 🆕NEW
The official repo of the paper "MMLongBench Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly"
- [2026-02-27] YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models ⭐157 🆕NEW
Awesome list for VLM-CL. Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting
- [2026-02-27] yaolinli/MLLM-Token-Compression ⭐115 🆕NEW
Towards Efficient Multimodal Large Language Models: A Survey on Token Compression
Generated automatically by Daily AI Digest Agent at 2026-03-02 01:56:41