arXiv Daily Index - 2026-05-07

#	Title	Categories	Authors	Abstract
astro-ph.IM 1 papers
24	AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification 2605.05573	astro-ph.IMcs.AI	Claire Chen, Jiabao Sean Xiao, Shuze Daniel Liu, Facundo Perez Paolino, Luke Handley	Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ab... Modern astronomical observatories generate a massive volume of multimodal data, creating a critical bottleneck for expert human review. While multimodal large language models (LLMs) have shown promise in interpreting complex visual and textual inputs, their ability to perform specialized scientific classification while providing interpretable reasoning remains understudied. We introduce AstroAlertBench, a comprehensive multimodal benchmark designed to evaluate LLM performance in astronomical eve...
cond-mat.mtrl-sci 1 papers
113	Polarizable atomic multipoles for learning long-range electrostatics 2605.05746	cond-mat.mtrl-scics.LGphysics.chem-phphysics.comp-ph	Dongjin Kim, Daniel S. King, Yoonjae Park, Roya Savoj, Sebastien Hamel	Long-range electrostatics and polarization remain central obstacles to extending machine learning interatomic potentials (MLIPs) to ionic, polar, and interfacial systems. Here, we introduce a semi-local framework for learning electrostatics from energies and f... Long-range electrostatics and polarization remain central obstacles to extending machine learning interatomic potentials (MLIPs) to ionic, polar, and interfacial systems. Here, we introduce a semi-local framework for learning electrostatics from energies and forces using polarizable atomic multipoles. Local equivariant descriptors predict environment-dependent latent monopoles, dipoles, and quadrupoles, while residual non-local charge transfer and polarization are captured by non-self-consistent...
cs.AI 124 papers
7	Housing Potential Common Data Model and City Digital Twin 2605.05535	cs.AI	Megan Katsumi, Mark Fox, Anderson Wong, Divnoor Chatha	The evaluation of housing potential requires consideration of a location from multiple perspectives, ranging from zoning and land use to population characteristics and access to services. This research introduces the Housing Potential Common Data Model (HPCDM)... The evaluation of housing potential requires consideration of a location from multiple perspectives, ranging from zoning and land use to population characteristics and access to services. This research introduces the Housing Potential Common Data Model (HPCDM) to overcome existing data silos, serving as a standard to support integration and interoperability across the diverse range of datasets that are required for housing potential analysis. This report details the evaluation of the model along...
8	AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases 2605.05538	cs.AIcs.IR	Susheel Suresh, Hazel Mak, Shangpo Chou, Fred Kroon, Sahil Bhatnagar	We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen dee... We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process. Our approach reduces this overdependence by layering a lightweight harness on top of existing enterprise search infrastructure, equipping a reasoning LLM with search, find, open, and summarize tools enabling the ...
11	SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs 2605.05546	cs.AI	Hyobin Park, Taeseop Kim, Dong-Geol Choi	Self-play reinforcement learning has shown strong performance in domains with formally verifiable structure, such as mathematics and coding, where both problem generation and reward computation can be grounded in explicit rules. Extending this paradigm to scie... Self-play reinforcement learning has shown strong performance in domains with formally verifiable structure, such as mathematics and coding, where both problem generation and reward computation can be grounded in explicit rules. Extending this paradigm to scientific literature is more challenging: the relationships among multi-modal elements within and across documents are rarely made explicit in text, which makes automatic generation of relational reasoning questions difficult and weakens the r...
17	Who Prices Cognitive Labor in the Age of Agents? Compute-Anchored Wages 2605.05558	cs.AIcs.CY	Siqi Zhu	A natural intuition about the economics of AI agents is that, because agents can be replicated at very low marginal cost, agent labor may be supplied highly elastically, placing downward pressure on cognitive-labor wages when it closely substitutes for human l... A natural intuition about the economics of AI agents is that, because agents can be replicated at very low marginal cost, agent labor may be supplied highly elastically, placing downward pressure on cognitive-labor wages when it closely substitutes for human labor. We argue this framing is wrong in mechanism but partially correct in conclusion, and that the correction matters for both theory and policy. \textbf{Agents are not labor; they are a production technology that converts compute capital ...
18	BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models 2605.05561	cs.AI	Sai Babu Patarlapalli, Surya Teja Avvaru	Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of newly generated tokens, miscali... Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of newly generated tokens, miscalibrated confidence can lead to harmful early halting: the model may surface a plausible final line while the underlying reasoning is still wrong, or the controller may stop before the trace has stabilized. We study this interaction for greed...
19	Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration 2605.05566	cs.AIcs.CLcs.LG	Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang	Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the ``zero-ad... Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the ``zero-advantage problem'': when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While...
20	Locality-aware Private Class Identification for Domain Adaptation with Extreme Label Shift 2605.05567	cs.AI	Chuan-Xian Ren, Cheng-Jun Guo, Hong Yan	Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with different distributions. In real-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist on... Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain with different distributions. In real-world scenarios, the label spaces of the two domains often have an inclusion relationship, where some classes exist only in one domain but not the other. These non-overlapping classes are referred to as private classes. Identifying private class samples and mitigating their adverse effects is critical in the literature. Existing methods rely on the assumpt...
26	AlphaCrafter: A Full-Stack Multi-Agent Framework for Cross-Sectional Quantitative Trading 2605.05580	cs.AI	Yishuo Yuan, Jiayi Sheng, Sirui Zeng, Jiaqi Wang, Jiaheng Liu	Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor d... Financial markets are inherently non-stationary, driven by complex interactions among macroeconomic regimes, microstructural frictions, and behavioral dynamics. Building quantitative strategies that remain profitable demands the continuous coupling of factor discovery, regime-adaptive selection, and risk-constrained execution. Prevailing approaches, however, optimize these components under static or isolated assumptions. Factor mining frameworks typically treat alpha discovery as a one-time sear...
28	Belief Memory: Agent Memory Under Partial Observability 2605.05583	cs.AIcs.CL	Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan	LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API~X failed" from temporary errors), even ... LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring "API~X failed" from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits a...
33	Causal Probing for Internal Visual Representations in Multimodal Large Language Models 2605.05593	cs.AI	Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan	Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework b... Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework based on activation steering to actively probe and manipulate internal visual representations. Through systematic intervention across four visual concept categories, our results reveal a divergence in concept encoding: entities exhibit disti...
35	Prober.ai: Gated Inquiry-Based Feedback via LLM-Constrained Personas for Argumentative Writing Development 2605.05598	cs.AIcs.HC	Ran Bi, Shiyao Wei, Yuanyiyi Zhou	The proliferation of large language models (LLMs) in educational settings has paradoxically undermined the cognitive processes they purport to support. Students increasingly outsource critical thinking to AI assistants that generate polished text on demand, re... The proliferation of large language models (LLMs) in educational settings has paradoxically undermined the cognitive processes they purport to support. Students increasingly outsource critical thinking to AI assistants that generate polished text on demand, resulting in measurable cognitive debt and diminished argumentative reasoning skills. We present Prober.ai, a web-based writing environment that inverts the conventional AI-tutoring paradigm: rather than generating or rewriting student text, ...
49	From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms 2605.06716	cs.AIcs.CL	Jinghao Luo, Yuchen Tian, Chuxue Cao, Ziyang Luo, Hongzhan Lin	Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these systems, current research remain... Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these systems, current research remains fragmented, oscillating between operating system engineering and cognitive science. This theoretical divide prevents a unified view of technological synthesis and a coherent evolutionary perspective. To bridge this gap, this survey propos...
53	Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG 2605.05643	cs.AIcs.IR	Jiarui Zhong, Hong Cai Chen	Retrieval-Augmented Generation (RAG) has become a core paradigm for enhancing factual grounding and multi-hop reasoning in Large Language Models (LLMs). Traditional text-based RAG often retrieves logically irrelevant pseudo-evidence, while graph-based RAG is f... Retrieval-Augmented Generation (RAG) has become a core paradigm for enhancing factual grounding and multi-hop reasoning in Large Language Models (LLMs). Traditional text-based RAG often retrieves logically irrelevant pseudo-evidence, while graph-based RAG is frequently hindered by search-time pruning, which may discard potentially valid reasoning paths. Existing hybrid approaches primarily adopt simple evidence concatenation or unidirectional enhancement, which fails to address the fundamental "...
60	Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation 2605.05657	cs.AIcs.MA	Abhijit Talluri, Pujith Anne, Bhagavan Choudary Pendiyala, Raghavendra Chilukuri	Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We p... Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval-Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code-...
65	Large Vision-Language Models Get Lost in Attention 2605.05668	cs.AIcs.CV	Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin	Despite the rapid evolution of training paradigms, the decoder backbone of large vision--language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is ... Despite the rapid evolution of training paradigms, the decoder backbone of large vision--language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we pro...
68	Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering 2605.05678	cs.AI	Xiaomin Li, Jianheng Hou, Zheyuan Deng, Zhiwei Zhang, Taoran Li	Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when fin... Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness an...
73	Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination 2605.05686	cs.AI	Qiyao Liang, Risto Miikkulainen, Ila Fiete	Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucina... Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive ge...
74	DataDignity: Training Data Attribution for Large Language Models 2605.05687	cs.AI	Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier	Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model... Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening le...
76	GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model 2605.05689	cs.AI	Shaozhen Ma, Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang	Conditional generative models, particularly diffusion-based methods, have recently been applied to graph prediction by modeling the target as a conditional distribution given the input graph, yielding competitive results compared to deterministic predictor. Ho... Conditional generative models, particularly diffusion-based methods, have recently been applied to graph prediction by modeling the target as a conditional distribution given the input graph, yielding competitive results compared to deterministic predictor. However, existing diffusion-based prediction methods typically require expensive iterative denoising at inference and often suffer from unstable sampling, which motivates recent efforts to reduce inference denoising steps and enable stable sa...
78	Saliency-Aware Regularized Quantization Calibration for Large Language Models 2605.05693	cs.AIcs.LG	Yanlong Zhao, Xiaoyuan Cheng, Huihang Liu, Baihua He, Xinyu Zhang	Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predeter... Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, typically optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing PTQ calibration objectives based solely on empirical reconstruction error over limi...
84	Inference-Time Budget Control for LLM Search Agents 2605.05701	cs.AI	Zhengru Fang, Senkang Forest Hu, Zhonghao Chang, Yu Guo, Yihang Tao	LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit con... LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-t...
85	Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents 2605.05702	cs.AI	Huyu Wu, Jun Liu, Xiaochi Wei, Yan Gao, Yi Wu	Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered v... Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-p...
89	Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine 2605.05706	cs.AIq-bio.QM	Peisong Zhang, Manqiang Peng, Yuxuan Wu, Pawit Phadungsaksawasdi, Wesley Yeung	Estimating individualized treatment effects from longitudinal observational data is central to data-driven medicine, yet existing methods face a fundamental limitation: reducing confounding bias often suppresses clinically informative heterogeneity, degrading ... Estimating individualized treatment effects from longitudinal observational data is central to data-driven medicine, yet existing methods face a fundamental limitation: reducing confounding bias often suppresses clinically informative heterogeneity, degrading patient-specific predictions. Here, we identify this tension as a bias-precision paradox in causal representation learning and introduce sampling-based maximum mean discrepancy (sMMD), a stochastic alignment strategy that replaces global ad...
90	Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs 2605.05709	cs.AI	Md Farhamdur Reza, Richeng Jin, Tianfu Wu, Huaiyu Dai	Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a \emph{reconstruction--concealment tradeo... Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a \emph{reconstruction--concealment tradeoff}: the transformed input must hide harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. Through a reconstruction analysis of three representative black-box methods...
95	Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes 2605.05715	cs.AIcs.CLcs.LG	Ming Liu	Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)--a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medic... Can linearly decodable failure signals in LLM hidden states be leveraged to correct those failures? We investigate this classification-correction gap via Overthinking (OT)--a stable behavioral regime (Jaccard >= 0.81, 94% inter-annotator agreement) in medical QA where models answer correctly under resampling yet fail in extended chain-of-thought. OT is linearly decodable at 71.6% balanced accuracy (p < 10^{-16}). Yet five families of fixed linear steering (29 configurations, n=1,273) all yield D...
96	More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding 2605.05716	cs.AIcs.CL	Ming Liu	LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destructively. We run a full factori... LLM agent systems are built by stacking scaffolding components (planning, tools, memory, self-reflection, retrieval) assuming more is better. We study cross-component interference (CCI): degradation when components interact destructively. We run a full factorial experiment over all 2^5=32 subsets of five components on HotpotQA and GSM8K with Llama-3.1-8B/70B (96 conditions, up to 10 seeds). The All-In system is consistently suboptimal: on HotpotQA, a single-tool agent surpasses All-In by 32% (F1...
100	Detecting Time Series Anomalies Like an Expert: A Multi-Agent LLM Framework with Specialized Analyzers 2605.05725	cs.AI	Hyeongwon Kang, Jeongseob Kim, Jinwoo Park, Pilsung Kang	Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliabili... Recent studies have explored large language models for time-series anomaly detection, yet existing approaches often rely on a single general-purpose model to directly infer anomaly indices or intervals, limiting controllability, interpretability, and reliability for complex anomaly patterns. We propose SAGE (Specialized Analyzer Group for Expert-like Detection), a multi-agent framework for structured anomaly diagnosis in univariate time series. It decomposes anomaly analysis into four specialize...
101	SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents 2605.05726	cs.AI	Hongcheol Cho, Ryangkyung Kang, Youngeun Kim	As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks do... As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill librarie...
103	Knee Osteoarthritis Severity Grading Using Optimized Deep Learning and LLM-Driven Intelligent AI on Computationally Limited Systems 2605.05731	cs.AI	Dayam Nadeem, Neha, Safdar Mustafa, Adnan Alvi, Mohd Hussain	Knee osteoarthritis (KOA) is among the musculoskeletal disorders that considerably restrict joint mobility, cause severe chronic pain and impact negatively on quality life. It is one of the persistent health issues worldwide. Generally, subjectivity and inter-... Knee osteoarthritis (KOA) is among the musculoskeletal disorders that considerably restrict joint mobility, cause severe chronic pain and impact negatively on quality life. It is one of the persistent health issues worldwide. Generally, subjectivity and inter-observer variability undermine conventional practices and evaluation process that are adopted to address such health issues. Hence precise and timely diagnosis would be one of the effective ways for the assessment of its severity. This pape...
105	SDFlow: Similarity-Driven Flow Matching for Time Series Generation 2605.05736	cs.AI	Wei Li, Shibo Feng, Pengcheng Wu, Min Wu, Peilin Zhao	Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across seq... Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across sequential predictions, leading to pronounced quality degradation in long-horizon generation. To address this, we propose SDFlow ($\textbf{S}$imilarity-$\textbf{D}$riven $\textbf{Flow}$ Matching), a non-autoregressive framework that operates e...
106	ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning 2605.05737	cs.AIcs.CL	Fan Huang	Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an o... Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a \emph{harness} system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper a...
109	HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory 2605.05741	cs.AI	Chengda Lu, Xiaoyu Fan, Wei Xu	While Large Language Models (LLMs) achieve strong performance across diverse tasks, their inference dynamics remain poorly understood because of the limited resolution of existing analysis tools. In this work, we identify an intrinsic magnification mechanism i... While Large Language Models (LLMs) achieve strong performance across diverse tasks, their inference dynamics remain poorly understood because of the limited resolution of existing analysis tools. In this work, we identify an intrinsic magnification mechanism in transformer architectures: deeper layers inherently magnify the small changes of layer-wise confidence, providing a fine-grained confidence trajectory. Building on this insight, we introduce HyperLens, a high-resolution probe designed to ...
112	Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback 2605.05745	cs.AI	Qirun Zeng, Xuchuang Wang, Jiayi Shen, Xutong Liu, Fang Kong	We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) feedback from an arm pair, both... We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) feedback from an arm pair, both governed by generalized linear models. We introduce a likelihood-ratio--based confidence sequence that unifies heterogeneous generalized linear observations and yields an explicit ellipsoidal confidence set under a self-concordance assumpt...
114	Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI 2605.05748	cs.AI	Vanessa Buhrmester, David Muench, Dimitri Bulatov, Michael Arens	Explainable Artificial Intelligence (XAI) is increasingly rec ognized as essential for deploying machine learning systems in safety critical environments. In Automatic Target Recognition (ATR), where models operate on image, video, radar, and multisensor data,... Explainable Artificial Intelligence (XAI) is increasingly rec ognized as essential for deploying machine learning systems in safety critical environments. In Automatic Target Recognition (ATR), where models operate on image, video, radar, and multisensor data, high pre dictive performance alone is insufficient. Model decisions must also be interpretable, reliable, and suitable for validation. This paper presents a structured evaluation of explainability methods in the context of safety-critical ...
128	Confidence is the key: how conformal prediction enhances the generative design of permeable peptides 2605.05770	cs.AI	Laura van Weesep, Sunay Chankeshwara, Leonardo De Maria, Florian David, Ola Engkvist	Generative models coupled with reinforcement learning (RL), such as REINVENT and PepINVENT, have emerged as a powerful framework for de novo molecular design. During the ideation process these generative frameworks utilize various predictive models as part of ... Generative models coupled with reinforcement learning (RL), such as REINVENT and PepINVENT, have emerged as a powerful framework for de novo molecular design. During the ideation process these generative frameworks utilize various predictive models as part of the optimization objectives. However, the utility of the predictive models can be limited by their domain of applicability. When RL is used to explore the chemical space with predictive models, it can suggest molecules that lie outside the ...
129	CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt 2605.05773	cs.AI	Md Touhidul Islam, Sujan Kumar Saha, Farimah Farahmandi, Mark Tehranipoor	Automating analog circuit design remains a longstanding challenge in Electronic Design Automation (EDA). While Transformer-based Large Language Models (LLMs) have revolutionized software code generation, their application to analog hardware design is hindered ... Automating analog circuit design remains a longstanding challenge in Electronic Design Automation (EDA). While Transformer-based Large Language Models (LLMs) have revolutionized software code generation, their application to analog hardware design is hindered by two critical limitations: (i) the scarcity of analog design datasets containing natural language description of a design and its corresponding netlist, and (ii) the inefficiency of general-purpose tokenizers (e.g., Byte Pair Encoding (BP...
131	HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning 2605.05776	cs.AI	Yu Feng, Zhen Tian, Haoran Luo, Xie Yu, Diancheng Cheng	Domain Incremental Learning is a critical scenario that requires models to continuously adapt to new data domains without retraining. However, domain shifts often cause severe performance degradation. To address this, we propose Hybrid Energy-Distance Prompt, ... Domain Incremental Learning is a critical scenario that requires models to continuously adapt to new data domains without retraining. However, domain shifts often cause severe performance degradation. To address this, we propose Hybrid Energy-Distance Prompt, a domain-incremental framework inspired by Helmholtz free energy. HEDP introduces an energy regularization loss to enhance the separability of domain representations and a hybrid energy-distance weighted mechanism that fuses energy-based an...
133	Von Neumann Networks 2605.05780	cs.AIcs.CVcs.LG	Shekhar S. Chandra	In the mid-twentieth century, mathematician and polymath John von Neumann created a computational system on an array of cells as a simple model of the human brain, where each cell had one of a finite set of roles or states that he predicted would be modelled b... In the mid-twentieth century, mathematician and polymath John von Neumann created a computational system on an array of cells as a simple model of the human brain, where each cell had one of a finite set of roles or states that he predicted would be modelled by a diffusion process. In this work, we show that such a system, when developed in a modern deep learning setting, enables the construction of an artificial neuron having specialized roles that can be learnt. We refer to this neuron as the ...
145	Sheet as Token: A Graph-Enhanced Representation for Multi-Sheet Spreadsheet Understanding 2605.05811	cs.AI	Yiming Lei, Yiqi Wang, Yujia Zhang, Bo Guan, Depei Zhu	Workbook-scale spreadsheet understanding is increasingly important for language-model-based data analysis agents, but remains challenging because relevant information is often distributed across multiple sheets with heterogeneous schemas, layouts, and implicit... Workbook-scale spreadsheet understanding is increasingly important for language-model-based data analysis agents, but remains challenging because relevant information is often distributed across multiple sheets with heterogeneous schemas, layouts, and implicit relationships. Existing retrieval-augmented approaches typically decompose spreadsheets into rows, columns, or blocks to improve scalability; however, such chunk-centric representations can fragment worksheets into isolated text spans and ...
146	Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities 2605.05812	cs.AI	Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn	Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learnin... Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error ...
151	AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD 2605.05826	cs.AI	Yang Xu, Kun Yao, Yiming Deng, Zheng Fang, Kai Ming Ting	Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards co... Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes...
153	MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition 2605.05832	cs.AI	Haote Yang, Hui Wang, Chen Zhu, Jingchao Wang, Linye Li	Optical Chemical Structure Recognition (OCSR) aims to translate molecular diagrams in scientific literature into machine-readable formats, but current systems remain unreliable on real-world images due to substantial visual and chemical complexity. We introduc... Optical Chemical Structure Recognition (OCSR) aims to translate molecular diagrams in scientific literature into machine-readable formats, but current systems remain unreliable on real-world images due to substantial visual and chemical complexity. We introduce MOSAIC, a dual-dimensional difficulty framework with 37 fine-grained labels that jointly characterize visual interference and chemical semantic challenges in molecular diagrams. Based on this framework, we construct MolRecBench-Wild, a be...
154	On the Role of Language Representations in Auto-Bidding: Findings and Implications 2605.05833	cs.AI	Guanyu Zhu, Jining Luan, Hanwen Du, Xinyu Fang, Sibo Xu	Auto-bidding is a crucial task in real-time advertising markets, where policies must optimize long-horizon value under delivery constraints (e.g., budget and CPA). Existing methods for auto-bidding rely on compact numerical state representations: while they ca... Auto-bidding is a crucial task in real-time advertising markets, where policies must optimize long-horizon value under delivery constraints (e.g., budget and CPA). Existing methods for auto-bidding rely on compact numerical state representations: while they can implicitly capture delivery dynamics, they offer limited support for explicitly representing and controlling high-level intent, evolving feedback, and operator-style strategic guidance in real campaigns. Meanwhile, Large Language Models (...
157	Taklif.AI: LLM-Powered Platform for Interest-Based Personalized College Assignments 2605.05842	cs.AI	Zaki Kurdya, Mohammed Zuqlam, Salem Amassi, Shady Telbany, Motaz Saad	Educators face significant challenges in creating engaging, personalized assignments that accommodate students' diverse interests and cognitive abilities. Traditional one-size-fits-all assignments frequently lead to decreased student engagement and increased r... Educators face significant challenges in creating engaging, personalized assignments that accommodate students' diverse interests and cognitive abilities. Traditional one-size-fits-all assignments frequently lead to decreased student engagement and increased reliance on unethical practices such as plagiarism. To address these challenges, we present Taklif.AI, a platform that leverages Large Language Models (LLMs) to automatically generate personalized assignments tailored to individual student i...
162	AirQualityBench: A Realistic Evaluation Benchmark for Global Air Quality Forecasting 2605.05854	cs.AI	Xing Xu, Xu Wang, Yudong Zhang, Huilin Zhao, Zhengyang Zhou	Air-quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring... Air-quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring networks: uneven global coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. We introduce \textbf{AirQualityBench}, a global multi-pollutant benchmark designed to evaluate forecasting models under these re...
166	SANEmerg: An Emergent Communication Framework for Semantic-aware Agentic AI Networking 2605.05861	cs.AIcs.NI	Yong Xiao, Haoran Zhou, Yujie Zhou, Marwan Krunz	Future networking systems are envisioned to become part of an agentic AI-native ecosystem in which a vast number of heterogeneous and specialized AI agents cooperate seamlessly to fulfill complex user requirements in real time. However, traditional networking ... Future networking systems are envisioned to become part of an agentic AI-native ecosystem in which a vast number of heterogeneous and specialized AI agents cooperate seamlessly to fulfill complex user requirements in real time. However, traditional networking paradigms are characterized by a rigid decoupling of communication and computation, which often leads to significant inefficiencies in large-scale agentic AI networking (AgentNet) systems. Emergent communication offers a novel solution by e...
170	XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction 2605.05866	cs.AIcond-mat.mtrl-scics.LG	Hanyu Gao, Bin Cao, Yunyue Su, Tong-Yi Zhang, Qiang Liu	Multiphase powder X-ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real-world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advanc... Multiphase powder X-ray diffraction (PXRD) analysis remains a fundamental bottleneck in structure identification, as real-world synthesis often produces complex mixtures whose constituent phases (components) cannot be reliably disentangled. While recent advances in representation-based crystal retrieval and generation suggest the possibility of inferring structures directly from PXRD, existing approaches largely assume single-phase inputs and break down in multiphase settings. Here, we present X...
171	When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment 2605.06723	cs.AIcs.CLcs.LG	Long Zhang, Wei-neng Chen, Feng-feng Wei, Zi-bo Qin	Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model's answer preference became stable. We study this question through a narrow computable object: \emph{finite-answer preference stabilizat... Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model's answer preference became stable. We study this question through a narrow computable object: \emph{finite-answer preference stabilization}. For a model state and specified answer verbalizers, we project the model's own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, $δ(ξ)=S_θ(\mathrm{yes}\midξ)-S_θ(\mathrm{no}\midξ)...
176	Agentic, Context-Aware Risk Intelligence in the Internet of Value 2605.05878	cs.AI	Basel Magableh, OmniRisk Research	The Internet of Value (IoV) is a heterogeneous, partially-trusted network in which the dominant marginal risk is composite (route, sentiment, liquidity, and the policy a system is willing to commit to) rather than a property of any single chain. We argue that ... The Internet of Value (IoV) is a heterogeneous, partially-trusted network in which the dominant marginal risk is composite (route, sentiment, liquidity, and the policy a system is willing to commit to) rather than a property of any single chain. We argue that a risk primitive adequate for this regime is a composition of five engines: a prediction engine over price, liquidity, volatility, and route health; a Bittensor verification subnet that decentralises and economically scores prediction outpu...
190	Null Space Constrained Contrastive Visual Forgetting for MLLM Unlearning 2605.05909	cs.AI	Yuhang Wang, Zhenxing Niu, Haoxuan Ji, Guangyu He, Linlin Zhang	The core challenge of machine unlearning is to strike a balance between target knowledge removal and non-target knowledge retention. In the context of Multimodal Large Language Models (MLLMs), this challenge becomes even more pronounced, as knowledge is furthe... The core challenge of machine unlearning is to strike a balance between target knowledge removal and non-target knowledge retention. In the context of Multimodal Large Language Models (MLLMs), this challenge becomes even more pronounced, as knowledge is further divided into visual and textual modalities that are tightly intertwined. In this paper, we introduce an MLLM unlearning approach that aims to forget target visual knowledge while preserving non-target visual knowledge and all textual know...
192	PREFER: Personalized Review Summarization with Online Preference Learning 2605.05911	cs.AIcs.GTcs.LGeess.SYmath.OC	Millend Roy, Agostino Capponi, Vineet Goyal	Product reviews significantly influence purchasing decisions on e-commerce platforms. However, the sheer volume of reviews can overwhelm users, obscuring the information most relevant to their specific needs. Current e-commerce summarization systems typically ... Product reviews significantly influence purchasing decisions on e-commerce platforms. However, the sheer volume of reviews can overwhelm users, obscuring the information most relevant to their specific needs. Current e-commerce summarization systems typically produce generic, static summaries that fail to account for the fact that (i) different users care about different product characteristics, and (ii) these preferences may evolve with interactions. To address the challenge of unknown latent p...
194	Wisteria: A Unified Multi-Scale Feature Learning Framework for DNA Language Model 2605.05913	cs.AI	Weihua Wang, Haoji Li, Feilong Bao, Lei Yang, Guanglai Gao	DNA language model aims to decipher the regulatory grammar and semantic of genomes by capturing long range dependencies in DNA sequences. Existing methods emphasize long range token interactions but often ignore the interplay between local motifs and global de... DNA language model aims to decipher the regulatory grammar and semantic of genomes by capturing long range dependencies in DNA sequences. Existing methods emphasize long range token interactions but often ignore the interplay between local motifs and global dependencies. In this paper, we propose Wisteria, a genomic language model that integrates multi scale feature learning within a unified framework for DNA sequence. Specifically, Wisteria augments the Mamba based architecture with gated dilat...
197	Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery 2605.05921	cs.AIcs.HC	Alex Bäuerle, Adam Connors, Alexander Novikov, Adam Zsolt Wagner, Ngân Vũ	Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathemat... Artificial intelligence offers powerful new tools for scientific discovery, but the interaction paradigms required to effectively harness these systems remain underexplored. In this paper, we present findings from a formative user study with 11 expert mathematicians who used AlphaEvolve, an evolutionary coding agent, to tackle advanced problems in their fields of expertise. We identify and characterize a distinct workflow we term intentmaking, the iterative process of discovering, defining, and ...
201	Which Are the Low-Resource Languages of the Semantic Web? 2605.05929	cs.AI	Ndeye-Emilie Mbengue, Pierre Monnin, Miguel Couceiro, Fabien Gandon	Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) co... Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from the global digital transformation. Multilingual Linked Open Data Knowledge Graphs (LOD KGs) could contribute to mitigating this divide through cross-lingual transfer; however, no clear quantitative definition of low-resource languages has yet been established in the context of LOD KGs. In this poster, we present a methodology to ana...
202	In Data or Invisible: Toward a Better Digital Representation of Low-Resource Languages with Knowledge Graphs 2605.05931	cs.AI	Ndeye-Emilie Mbengue	Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from participating in the global digital transformation. In this PhD proposal, we aim to address th... Emerging digital technologies are exacerbating the existing divide in Open Access Data (OAD) between high-and low-resource languages, excluding many communities from participating in the global digital transformation. In this PhD proposal, we aim to address this gap, focusing on the language coverage of Linked Open Data knowledge graphs (LOD KGs). First, we identify key variables that characterize language distribution in LOD, including the number of Wikipedia articles per language edition and t...
204	ICU-Bench:Benchmarking Continual Unlearning in Multimodal Large Language Models 2605.05938	cs.AI	Yuhang Wang, Wenjie Mei, Junkai Zhang, Guangyu He, Zhenxing Niu	Although Multimodal Large Language Models (MLLMs) have achieved remarkable progress across many domains, their training on large-scale multimodal datasets raises serious privacy concerns, making effective machine unlearning increasingly necessary. However, exi... Although Multimodal Large Language Models (MLLMs) have achieved remarkable progress across many domains, their training on large-scale multimodal datasets raises serious privacy concerns, making effective machine unlearning increasingly necessary. However, existing benchmarks mainly focus on static or short-sequence settings, offering limited support for evaluating continual privacy deletion requests in realistic deployments. To bridge this gap, we introduce ICU-Bench, a continual multimodal unl...
209	MAS-Algorithm: A Workflow for Solving Algorithmic Programming Problems with a Multi-Agent System 2605.05949	cs.AIcs.SE	Yuliang Xu, Xiang Xu, Yao Wan, Hu Wei, Tong Jia	Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios. Existing approaches predominantly rely on model-c... Algorithmic problem solving serves as a rigorous testbed for evaluating structured reasoning in AI coding systems, as it directly reflects a model's ability to perform structured reasoning in complex scenarios. Existing approaches predominantly rely on model-centric strategies, such as architectural modifications and data scaling, which are costly and offer limited interpretability. Alternative methods leveraging external tools or prompting techniques (e.g., chain-of-thought) are often fragmente...
211	HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning 2605.05951	cs.AI	Haoyun Tang, Haodong Cui, Keyao Xu, Kun Wang, Zhandong Mei	World models enable model-based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner-facing ... World models enable model-based planning through learned latent dynamics, but imagined rollouts become unstable as the planning horizon grows or the dynamics distribution shifts. We argue that this instability reflects two missing structures in planner-facing latents: history-conditioned memory for approximate Markov completeness, and geometric organization that separates configuration, momentum, and task semantics. We propose HaM-World (HMW), a structured world model that decomposes the latent ...
215	Temporal Smoothness Doubly Robust Learning for Debiased Knowledge Tracing 2605.05958	cs.AI	Peilin Zhan, Wei Chen, Weilin Chen, Shuyi Pan, Ruichu Cai	Knowledge Tracing (KT) is fundamental to intelligent education systems, yet relies on educational logs that are selectively observed. The non-random nature of exercise recommendations and student choices inevitably induces severe selection bias. Most existing ... Knowledge Tracing (KT) is fundamental to intelligent education systems, yet relies on educational logs that are selectively observed. The non-random nature of exercise recommendations and student choices inevitably induces severe selection bias. Most existing KT methods neglect this issue, training on observed logs using standard empirical risk, which yields biased mastery estimates and accumulates errors in subsequent recommendations. To address this, we introduce a doubly robust (DR) formulati...
216	From Coordinate Matching to Structural Alignment: Rethinking Prototype Alignment in Heterogeneous Federated Learning 2605.05959	cs.AIcs.DCcs.LG	Xinghao Wu, Jianwei Niu, Guogang Zhu, Xuefeng Liu, Shaojie Tang	Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model paramet... Heterogeneous federated learning (HtFL) aims to enable collaboration among clients that differ in both data distributions and model architectures. Prototype-based methods, which communicate class-level feature centers (prototypes) instead of full model parameters, have recently shown strong potential for HtFL. Existing prototype-based HtFL methods typically reuse the MSE-based or cosine-based alignment mechanism developed for homogeneous FL when aligning client-specific representations with glob...
218	TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning 2605.05963	cs.AIcs.CL	Junkai Li, Yunghwei Lai, Tianyi Zhu, Zheng Long Lee, Weizhi Ma	Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, ... Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reason...
226	BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning 2605.05977	cs.AI	Yinbo Yu, Xueyu Yin, Jiadai Wang, Chunwei Tian, Sai Xu	Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness... Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger patterns undermine their robustness, and fine-tuning entails high costs, limiting practical utility. Therefore, we shift defense concerns to trigger-agnostic backdoor output behaviors and propose BehaviorGuard, an online behavior-based backdoor detection and mitigation frame...
228	TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering 2605.05980	cs.AI	Yuan Sui, Yulin Chen, Yibo Li, Xue Jiang, Yufei He	When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as agent drift. We focus on two recurring failure modes overthinking and overacting, i.e., where the agent repeatedly reasons... When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as agent drift. We focus on two recurring failure modes overthinking and overacting, i.e., where the agent repeatedly reasons over information it already has, and where it issues tool calls without integrating recent observations or acquiring new evidence. In this paper, we introduce TACT (Think-Act Calibration via activation Steering), to detect and mitigate age...
231	BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine 2605.05985	cs.AIcs.MAq-bio.QM	Remigiusz Kinas, Joanna Krawczyk, Rafał Powalski, Przemysław Pietrzak, Agnieszka Kowalewska	Translational medicine turns underspecified development goals into evidence synthesis that must combine literature, trials, patents, and quantitative multi-omics analysis while preserving identifiers, uncertainty, and retrievable provenance. General-purpose fo... Translational medicine turns underspecified development goals into evidence synthesis that must combine literature, trials, patents, and quantitative multi-omics analysis while preserving identifiers, uncertainty, and retrievable provenance. General-purpose foundation models and off-the-shelf tool-augmented or multi-agent systems are not built for this: they tend to produce single-shot answers or run open-endedly, and fall short on the auditable, scenario-specific workflows that heterogeneous bi...
248	Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals 2605.06024	cs.AI	Wenliang Huang, Zengyi Yu	Large Language Models (LLMs) are evolving into autonomous trading agents, yet existing benchmarks often overlook the interplay between architectural reasoning and strategy consistency. We propose Strat-LLM, a framework grounded in Stratified Strategy Alignment... Large Language Models (LLMs) are evolving into autonomous trading agents, yet existing benchmarks often overlook the interplay between architectural reasoning and strategy consistency. We propose Strat-LLM, a framework grounded in Stratified Strategy Alignment. Operating in a live-forward setting throughout 2025, it integrates heterogeneous data including sequential prices, real-time news, and annual reports to eliminate look-ahead bias. Extensive stress tests on A-share and U.S. markets reveal:...
250	Pathways to AGI 2605.06029	cs.AI	Gordon Fletcher, Saomai Vu Khan	Our focus are five related questions that stem from a critical software studies perspective. Underpinning this view is the acknowledged need to avoid assumptions regarding the inevitability of the current situation relating to AI. What we need to see is the cl... Our focus are five related questions that stem from a critical software studies perspective. Underpinning this view is the acknowledged need to avoid assumptions regarding the inevitability of the current situation relating to AI. What we need to see is the closeness of the linkage between current commercial AI development and our prevailing social, political and economic circumstances. This does mean that the perspectives presented here are done so critically and conditionally. Most importantly...
259	Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning 2605.06040	cs.AIcs.CL	Leon Hamm, Zlatan Ajanovic	Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer ... Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts rel...
269	Visual Fingerprints for LLM Generation Comparison 2605.06054	cs.AIcs.HC	Amal Alnouri, Andreas Hinterreiter, Christina Humer, Furui Cheng, Marc Streit	Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ... Large language model (LLM) outputs arise from complex interactions among prompts, system instructions, model parameters, and architecture. We refer to specific configurations of these factors as generation conditions, each of which can bias outputs in various ways. Understanding how different generation conditions shape model behaviors is essential for tasks such as prompt design and model evaluation, yet it remains challenging due to the stochastic and open-ended nature of text generation. We p...
278	VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? 2605.06068	cs.AIcs.DC	Keisuke Kamahori, Shihang Li, Simon Peter, Baris Kasikci	For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop tha... For years, we have built LLM serving systems like any other critical infrastructure: a single general-purpose stack, hand-tuned over many engineer-years, meant to support every model and workload. In this paper, we take the opposite bet: a multi-agent loop that automatically synthesizes bespoke serving systems for different usage scenarios. We propose VibeServe, the first agentic loop that generates entire LLM serving stacks end-to-end. VibeServe uses an outer loop to plan and track the search o...
291	Safety Certification is Classification 2605.06087	cs.AIeess.SY	Oliver Schön, Licio Romao, Sadegh Soudjani	The goal of this paper is certifying safety of dynamical systems subject to uncertainty. Existing approaches use trajectory data to estimate transition probabilities, and compute safety probabilities recursively via dynamic programming (DP). This recursion may... The goal of this paper is certifying safety of dynamical systems subject to uncertainty. Existing approaches use trajectory data to estimate transition probabilities, and compute safety probabilities recursively via dynamic programming (DP). This recursion may lead to compounding errors in the certified safety probability, thus collapsing to a vacuous lower bound for growing horizons $T$. We propose a kernel embedding framework that treats safety certification as a classification problem on traj...
299	Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility 2605.06105	cs.AI	Jungsuk Oh, Hyeseo Jeon, Hyunjune Ji, Kyongmin Kong, Jay-Yoon Lee	Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce \emph{Shallow Prefill, dEEp Decode} (SPEED), a phas... Long-context inference in decoder-only language models is costly because long prompts are processed during Prefill, cached at every layer, and repeatedly attended to during autoregressive Decode. We introduce \emph{Shallow Prefill, dEEp Decode} (SPEED), a phase-asymmetric KV-visibility policy that materializes non-anchor prompt-token KV states only in lower layers while keeping Decode-phase tokens full-depth. Unlike previous approaches that make upper-layer prompt KV states cheaper to store or c...
301	On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows 2605.06110	cs.AIcs.CL	Xinglin Wang, Zishen Liu, Shaoxiong Feng, Peiwen Yuan, Yiwei Li	Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing ... Agentic systems increasingly solve complex user requests by executing orchestrated workflows, where subtasks are assigned to specialized models or tools and coordinated according to their dependencies. While recent work improves agent efficiency by optimizing the performance--cost--latency frontier, real deployments often impose concrete requirements: a workflow must be completed within a specified budget and before a specified deadline. This shifts the goal from average efficiency optimization ...
305	CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs 2605.06115	cs.AI	Zhen Zeng, Leijiang Gu, Feng Li, Jing Yu, Zenglin Shi	Multimodal Large Language Models (MLLMs), trained primarily on English-centric data, frequently generate culturally inappropriate or misaligned responses in cross-cultural settings. To mitigate this, we introduce the task of cross-cultural knowledge insertion,... Multimodal Large Language Models (MLLMs), trained primarily on English-centric data, frequently generate culturally inappropriate or misaligned responses in cross-cultural settings. To mitigate this, we introduce the task of cross-cultural knowledge insertion, which focuses on adapting models to specific cultural contexts while preserving their original behavior in other cultures. To facilitate research in this area, we introduce CrossCult-KIBench, a comprehensive evaluation benchmark for assess...
306	Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning 2605.06116	cs.AI	Wenwen Si, Insup Lee, Osbert Bastani	Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language mo... Inference-time computation has greatly enhanced the performance of large language models (LLMs) on challenging reasoning tasks, but this strategy can incur high inference costs. One solution is to route intermediate chain-of-thought (CoT) states to language models of different sizes; however, existing approaches rely on handcrafted routing strategies that limit performance, or on training large process reward models that may be infeasible in many applications. We formulate stepwise model routing...
309	Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs 2605.06123	cs.AI	Nguyen Viet Tuan Kiet, Bui Dinh Pham, Dao Van Tung, Tran Cong Dao, Huynh Thi Thanh Binh	Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search over executable programs and ... Large language models (LLMs) have recently advanced automatic heuristic design (AHD) for combinatorial optimization (CO), where candidate heuristics are iteratively proposed, evaluated, and refined. Most existing approaches search over executable programs and distill insights from execution feedback to guide later iterations. Because this process moves from low-level implementations to high-level principles, we refer to it as a bottom-up paradigm. We argue that this view is incomplete and introd...
310	P-Guide: Parameter-Efficient Prior Steering for Single-Pass CFG Inference 2605.06124	cs.AI	Xin Peng, Ang Gao	Classifier-Free Guidance (CFG) is essential for high-fidelity conditional generation in flow matching, yet it imposes significant computational overhead by requiring dual forward passes at each sampling step. In this work, we address this bottleneck by introdu... Classifier-Free Guidance (CFG) is essential for high-fidelity conditional generation in flow matching, yet it imposes significant computational overhead by requiring dual forward passes at each sampling step. In this work, we address this bottleneck by introducing \textbf{P-Guide}, a framework that achieves high-quality guidance through a single inference pass by modulating only the initial latent state. We further show that, under a first-order approximation, P-Guide is equivalent to CFG in the...
312	Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning 2605.06130	cs.AI	Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu	A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from ... A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, ...
327	Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models 2605.06154	cs.AIcs.LG	Kossi Amouzouvi, Robert Wardenga, Jens Lehmann, Sahar Vahdati	Foundation models excel at language, where sentences become tokens, and vision, where images become pixels, because both reduce to discrete symbols on a shared, fixed grid. Knowledge Graphs share the discreteness, but not the geometry. Their entities and relat... Foundation models excel at language, where sentences become tokens, and vision, where images become pixels, because both reduce to discrete symbols on a shared, fixed grid. Knowledge Graphs share the discreteness, but not the geometry. Their entities and relations are discrete symbols, yet their arrangement is relational and lacks a common, fixed grid. Knowledge Graphs (KGs) share the discreteness, but not the geometry. They form irregular, non-Euclidean topologies whose local neighborhoods diff...
330	Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges 2605.06161	cs.AIcs.SE	Shihao Weng, Yang Feng, Xiaofei Xie	LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happe... LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rub...
332	Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost 2605.06165	cs.AI	Richmond Sin Jing Xuan, Rishabh Bhardwaj, Soujanya Poria	As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-world tasks require little to... As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-world tasks require little to no explicit reasoning, with additional reasoning sometimes even degrading performance. In this work, we propose \textbf{Post-Reasoning}, a simple yet effective approach that improves instruction-tuned models by conditioning them to justify...
338	BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents 2605.06177	cs.AI	Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu	Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a ... Building a deep research agent today is an exercise in glue code: the same backbone evaluated on the same benchmark can report different accuracies in different papers because harness and tool registry all differ, and integrating a new foundation model into a comparable evaluation surface costs weeks of model-specific engineering. We call this the per-paper engineering tax and release BioMedArena, an open-source toolkit that not only alleviates it but also provides an arena for fair comparison o...
341	Rethinking Adapter Placement: A Dominant Adaptation Module Perspective 2605.06183	cs.AIcs.CLcs.LG	Suoxin Zhang, Run He, Di Fang, Xiang Tan, Kaixuan Chen	Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but ex... Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensiti...
343	Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios 2605.06185	cs.AIcs.CV	Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang	Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations... Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) appro...
346	OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models 2605.06188	cs.AIcs.CL	Jaehoon Kim, Dongha Lee	On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privile... On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level al...
349	Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction 2605.06191	cs.AI	Shivali Dalmia, Ananya Mantravadi, Prasanna Desikan	The work in this paper evaluates zero-shot and few-shot large language models (LLMs) for safety-critical clinical action extraction using the CLIP discharge-note dataset, with particular emphasis on transitions of care and post-discharge patient safety. To man... The work in this paper evaluates zero-shot and few-shot large language models (LLMs) for safety-critical clinical action extraction using the CLIP discharge-note dataset, with particular emphasis on transitions of care and post-discharge patient safety. To manage the complexity of clinical documentation, we introduce a two-stage extraction framework that decomposes discharge notes, that are written in narrative form, into fine-grained, explicitly actionable clinical tasks through a staged prompt...
351	The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models 2605.06196	cs.AIcs.CL	Chonghan Qin, Xiachong Feng, Ziyun Song, Xiaocheng Feng, Jing Xiong	Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to ... Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the p...
354	Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric 2605.06201	cs.AI	Ying Gu, Mei Chee Leong, Hui Li Tan, Shangbo Mao, Liyuan Li	Dominant accuracy evaluation might reward unwarranted guessing of Large Language Models, and it might not be applicable to novel tasks for model validation without ground-truth (gt) annotation. Based on basic logic principle, we propose a novel framework to ev... Dominant accuracy evaluation might reward unwarranted guessing of Large Language Models, and it might not be applicable to novel tasks for model validation without ground-truth (gt) annotation. Based on basic logic principle, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests, and recent NaturalBench tests withou...
364	Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models 2605.06213	cs.AI	Haoxiang Wang, Da Yu, Huishuai Zhang	Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, whe... Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delive...
369	Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization 2605.06219	cs.AI	Yunzhen Yao, Hongye Wang, Yahong Wang, Michael C. Gastpar, Bo Jiang	This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely on evaluation signals collected from candidate traces in isolation or answer frequencies, while i... This paper studies test-time aggregation, an approach that generates multiple reasoning traces and aggregates them into a final answer. Most existing methods rely on evaluation signals collected from candidate traces in isolation or answer frequencies, while ignoring comparative interactions among candidates. We propose Joint Consistency (JC), formulated as a constrained Ising-type energy minimization problem, where independent evaluation signals act as external fields and pairwise comparisons a...
372	Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries 2605.06223	cs.AIcs.RO	Junhyuk Kwon, Seungjoon Lee, Hyejin Park, Kyle Min, Jungseul Ok	Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only the information needed to distinguish the target fro... Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by actively asking only the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront. Existing approaches often fall short of this goal: they may stop at the first plausible candidate before sufficiently exploring alternatives, or, even after collec...
374	A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization 2605.06226	cs.AIq-bio.GN	Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang	Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygiei... Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhan...
375	Price of Fairness in Short-Term and Long-Term Algorithmic Selections 2605.06227	cs.AI	Shahin Jabbari, Chen Wang	Algorithmic decision-making in high-stakes settings can have profound impacts on individuals and populations. While much prior work studies fairness in static settings, recent results show that enforcing static fairness constraints may exacerbate long-run disp... Algorithmic decision-making in high-stakes settings can have profound impacts on individuals and populations. While much prior work studies fairness in static settings, recent results show that enforcing static fairness constraints may exacerbate long-run disparities. Motivated by this tension, we study a stylized sequential selection problem in which a decision-maker repeatedly selects individuals, affecting both immediate utility and the population distribution over time. We introduce notions ...
378	Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence 2605.06230	cs.AIcs.DC	Xinquan Chen, Zhenyun Yin, Shan He, Bin Huang, Shanzhe Lei	As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agenticinfrastructure remain fragmented across evaluation, data ... As large models evolve from conversational assistants into autonomous agents, challenges increasingly arise from long-horizon decision making, tool use, and real environment interaction. Existing agenticinfrastructure remain fragmented across evaluation, data management, and agent evolution, making it difficult to discover risks systematically and improve models in a continuous closed loop. In this report, we present \textbf{Safactory}, a scalable agent factory for trustworthy autonomous intelli...
408	Data Language Models: A New Foundation Model Class for Tabular Data 2605.06290	cs.AI	Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet	Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. E... Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduc...
414	Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs 2605.06305	cs.AIcs.IR	Thomas Cory, Axel Küpper	Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are ti... Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitte...
416	Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization 2605.06308	cs.AI	Marc Boubnovski Martell, Josefa Lia Stoisser, Kaspar Märtens, Jialin Yu, Robert Kitchen	Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a b... Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry of the trace. We propose a black-box trajectory-confidence score: we embed a CoT as a sliding-window trajectory and measure its convergence to external answer anchors with a one-parameter softmax. The method needs no logits, hidden states, or supervised calibrators. A...
436	A Regime Theory of Controller Class Selection for LLM Action Decisions 2605.06339	cs.AI	Zhaoyang Jiang, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li	Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly benefi... Deployed language and vision-language models must decide, on each input, whether to answer directly, retrieve evidence, defer to a stronger model, or abstain. Contrary to the common monotonicity intuition, greater per-input expressivity is not uniformly beneficial in finite samples: under identical strict cross-validation, different benchmarks prefer different controller classes. This reflects a finite-sample limitation of instance-level uncertainty signals, which can be exhausted at a distribut...
441	Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models 2605.06343	cs.AI	Alex O. Davies, Telmo de Menezes e Silva Filho, Nirav Ajmeri	Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of p... Tabular foundation models are pre-trained on one of three classes of corpus: curated datasets drawn from benchmark repositories, tables harvested at scale from the web, or synthetic tables sampled from a parametric generative prior. Despite the centrality of pre-training data to model performance, little is known about how these corpora relate to one another in distribution, and the impact this has on downstream performance. In this work we take three canonical, archetypal datasets used to train...
442	More Than Can Be Said: A Benchmark and Framework for Pre-Question Scientific Ideation 2605.06345	cs.AI	Jie Yu, Song Qiu	AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often b... AI research agents have shown strong potential in automating literature search and manuscript refinement, yet most assume a clear and actionable initial input, operating only after a research question has been made explicit. In contrast, human research often begins with tacit friction, a sense of misalignment before a question can be formed. We introduce InciteResearch, a multi-agent framework designed to make a researcher's implicit understanding explicit, inspectable, and actionable. InciteRes...
443	Prediction and Empowerment: A Theory of Agency through Bridge Interfaces 2605.06346	cs.AI	Richard Csaky	We study agency under partial observability in deterministic physical or simulated worlds, where apparent randomness arises from uncertainty over initial conditions, fixed law bits, and unrolled exogenous noise. We model sensing and actuation as bridge interfa... We study agency under partial observability in deterministic physical or simulated worlds, where apparent randomness arises from uncertainty over initial conditions, fixed law bits, and unrolled exogenous noise. We model sensing and actuation as bridge interfaces split between agent-controlled parameters and environment-controlled channel state, inducing a deterministic POMDP through a prior over latent microstates and many-to-one observation coarsening. Within this framework, we prove a separat...
454	From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work 2605.06365	cs.AIcs.MAcs.SE	Josh Rosen, Seth Rosen	Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it... Large language model systems are increasingly deployed as agentic workflows that interleave reasoning, tool use, memory, and iterative refinement. These systems are effective at producing answers, but they often rely on implicit conversational state, making it difficult to preserve stable work products, isolate irrelevant updates, or propagate changes through intermediate artifacts. We introduce execution lineage: an execution model in which AI-native work is represented as a directed acyclic ...
458	Debiased Multimodal Personality Understanding through Dual Causal Intervention 2605.06371	cs.AI	Yangfu Zhu, Zitong Han, Nianwen Ning, Yuting Wei, Yuandong Wang	Multimodalpersonalityunderstandingplaysacriticalroleinhuman centered artificial intelligence. Previous work mainly focus on learn-ing rich multimodal representations for video personality under standing. However, they often suffer from potential harm caused by... Multimodalpersonalityunderstandingplaysacriticalroleinhuman centered artificial intelligence. Previous work mainly focus on learn-ing rich multimodal representations for video personality under standing. However, they often suffer from potential harm caused by subject bias (e.g., observable age and unobservable mental states), as subjects originate from diverse demographic backgrounds. Learn ing such spurious associations between multimodal features and traits may lead to unfair personality unde...
464	Rethinking Vacuity for OOD Detection in Evidential Deep Learning 2605.06382	cs.AI	Claire McNamara	Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ($K$) by the total strength of belief ($S$) of the model's p... Vacuity, or Uncertainty Mass (UM), is commonly used as a metric to evaluate Out-of-Distribution (OOD) detection in Evidential Deep Learning (EDL). It generally involves dividing the number of classes ($K$) by the total strength of belief ($S$) of the model's predictions, where $S$ is derived from summing the Dirichlet parameters. As such, UM is sensitive to the cardinality of $K$. In particular, it is unlikely in practice that there is a linear relationship between $K$ and $S$ as $K$ and $S$ inc...
470	Automated alignment is harder than you think 2605.06390	cs.AI	Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving	A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignmen... A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tas...
483	Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification 2605.06434	cs.AI	Vaisakh Naduvodi Viswambharan, Keerthan Kopparam Radhakrishna, Deepak Narayan Gadde, Aman Kumar	Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis rema... Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro-architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured te...
488	SCRuB: Social Concept Reasoning under Rubric-Based Evaluation 2605.06444	cs.AI	Jamelle Watson-Daniels, Himaghna Bhattacharjee, Skyler Wang, Brandon Handoko, Antonio Li	While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is ... While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indetermina...
494	PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors 2605.06455	cs.AI	Xinmiao Huang, Jinwei Hu, Rajarshi Roy, Changshun Wu, Yi Dong	Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and... Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trac...
495	Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems 2605.06457	cs.AI	Donghao Huang, Joon Kiat Chua, Zhaoxia Wang	LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (... LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent Syste...
505	Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features 2605.06475	cs.AI	Ranjith Chodavarapu	We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a con... We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Invers...
509	Patch-Effect Graph Kernels for LLM Interpretability 2605.06480	cs.AIcs.CL	Ruben Fernandez-Boullon, David N. Olivieri	Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured dat... Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We ...
510	ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning 2605.06483	cs.AI	Bowen Ye, Zhijian Li, Junyue Huang, Junkai Ma, Xiang Yin	Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practi... Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic experti...
514	Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors 2605.06490	cs.AIcs.CY	Jonas Wiedermann-Möller, Leonard Dung, Maksym Andriushchenko	AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmar... AI systems have become increasingly capable of dangerous behaviours in many domains. This raises the question: Do models sometimes choose to violate human instructions in order to perform behaviour that is more useful for certain goals? We introduce a benchmark for measuring model propensity for instrumental convergence (IC) behaviour in terminal-based agents. This is behaviour such as self-preservation that has been hypothesised to play a key role in risks from highly capable AI agents. Our ben...
515	From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features 2605.06494	cs.AI	Ruben Fernandez-Boullon, Pablo Magariños-Docampo, Javier Perez-Robles	Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight ... Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens m...
532	Process Matters more than Output for Distinguishing Humans from Machines 2605.06524	cs.AI	Milena Rmus, Mathew D. Hardy, Thomas L. Griffiths, Mayank Agrawal	Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from thos... Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive pr...
534	Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State 2605.06529	cs.AIcs.LG	Peiying Zhu, Sidi Chang	Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-referen... Outcome metrics can certify the wrong behavior. We study this failure in a two-hotel revenue-management simulator where Hotel A trains an agent against a fixed rule-based revenue-management competitor, Hotel B. A standard learning agent can obtain near-reference revenue per available room (RevPAR) while failing to learn market-like yield management: it sells too aggressively, undercuts, or collapses to modal price buckets. We diagnose this as a Goodhart-style failure under partial observability....
535	SpatialEpiBench: Benchmarking Spatial Information and Epidemic Priors in Forecasting 2605.06530	cs.AI	Ruiqi Lyu, Alistair Turcan, Bryan Wilder	Accurate epidemic forecasting is crucial for public health response, resource allocation, and outbreak intervention, but remains difficult with sparse, noisy, and highly non-stationary data. Because epidemics unfold across interacting regions, spatiotemporal m... Accurate epidemic forecasting is crucial for public health response, resource allocation, and outbreak intervention, but remains difficult with sparse, noisy, and highly non-stationary data. Because epidemics unfold across interacting regions, spatiotemporal methods are natural candidates for improving forecasts. Despite growing interest in spatial information, no standardized benchmark exists, and current evaluations often use simple chronological train-test splits that do not reflect real-time...
540	Ex Ante Evaluation of AI-Induced Idea Diversity Collapse 2605.06540	cs.AIcs.GT	Nafis Saami Azad, Raiyan Abdul Baten	Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. This creates an evaluation blind spot, as AI can improve individual out... Creative AI systems are typically evaluated at the level of individual utility, yet creative outputs are consumed in populations: an idea loses value when many others produce similar ones. This creates an evaluation blind spot, as AI can improve individual outputs while increasing population-level crowding. We introduce a human-relative framework for benchmarking AI-induced human diversity collapse without requiring human-AI interaction data, providing an ex ante protocol to estimate crowding ri...
559	Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline 2605.06583	cs.AI	Zhengyi Guo, Jiayuan Sheng, David D. Yao, Wenpin Tang	We propose a deterministic adjoint matching framework that formulates human preference alignment for flow-based generative models as an optimal control problem over velocity fields. One can directly regress the control toward a value-gradient-induced target un... We propose a deterministic adjoint matching framework that formulates human preference alignment for flow-based generative models as an optimal control problem over velocity fields. One can directly regress the control toward a value-gradient-induced target under the current policy, leading to a simple and stable training objective. Building on this perspective, we introduce a truncated adjoint scheme that focuses computation on the terminal portion of the trajectory, where reward-relevant signa...
560	NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research 2605.06584	cs.AI	Lujia Zhong, Yihao Xia, Jianwei Zhang, Shuo huang, Jiaxin Yue	Multimodal neuroimaging analysis often involves complex, modality-specific preprocessing workflows that require careful configuration, quality control, and coordination across heterogeneous toolchains. Beyond preprocessing, downstream statistical analysis and ... Multimodal neuroimaging analysis often involves complex, modality-specific preprocessing workflows that require careful configuration, quality control, and coordination across heterogeneous toolchains. Beyond preprocessing, downstream statistical analysis and disease classification commonly require task-specific code, evaluation protocols, and data-format conventions, creating additional barriers between raw acquisitions and reproducible scientific analysis. We present NeuroAgent, an LLM-driven ...
563	Weblica: Scalable and Reproducible Training Environments for Visual Web Agents 2605.06761	cs.AIcs.CVcs.LG	Oğuzhan Fatih Kar, Roman Bachmann, Yuanzheng Gong, Anders Boesen Lindbo Larsen, Afshin Dehghan	The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environme... The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stabl...
580	SkillOS: Learning Skill Curation for Self-Evolving Agents 2605.06614	cs.AIcs.CL	Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang	LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-... LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex ...
584	MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems 2605.06623	cs.AIcs.CLcs.LGcs.MA	Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang	Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them acr... Large language model (LLM)-based Multi-agent systems (MAS) have shown promise in tackling complex collaborative tasks, where agents are typically orchestrated via role-specific prompts. While the quality of these prompts is pivotal, jointly optimizing them across interacting agents remains a non-trivial challenge, primarily due to the misalignment between local agent objectives and holistic system goals. To address this, we introduce MASPO, a novel framework designed to automatically and iterati...
594	Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key 2605.06638	cs.AIcs.CL	Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang	Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a sy... Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a ...
597	GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation 2605.06641	cs.AIcs.CV	Ziyu Zhai, Siyou Li, Juexi Shao, Juntao Yu	Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale data... Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as...
cs.AR 2 papers
196	LLM-Driven Design Space Exploration of FPGA-based Accelerators 2605.05920	cs.ARcs.AIcs.PF	Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano	Designing field-programmable gate array (FPGA)-based accelerators for modern artificial intelligence workloads requires navigating a large and complex hardware design space encompassing architectural parameters, dataflow strategies, and memory hierarchies, mak... Designing field-programmable gate array (FPGA)-based accelerators for modern artificial intelligence workloads requires navigating a large and complex hardware design space encompassing architectural parameters, dataflow strategies, and memory hierarchies, making the process time-consuming and resource-intensive. While the SECDA methodology enables rapid hardware-software co-design of accelerators through SystemC simulation and FPGA execution, identifying optimal accelerator configurations still...
286	PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs 2605.06082	cs.ARcs.LGcs.PF	Rappy Saha, Jude Haris, Nicolas Bohm Agostini, David Kaeli, José Cano	Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image class... Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. While general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators ...
cs.CE 1 papers
2	Discrete Elastic Ribbons: A Unified Discrete Differential Geometry Framework for One-Dimensional Energy Models 2605.05529	cs.CEcs.GRcs.LG	Shivam Kumar Panda, M Khalid Jawed	Elastic ribbons, slender structures whose length ($L$), width ($W$), and thickness ($b$) satisfy $L \gg W \gg b$, exhibit mechanical behaviors intermediate between one-dimensional rods ($L \gg W, b$) and two-dimensional plates ($L, W \gg b$). In quadratic Kirc... Elastic ribbons, slender structures whose length ($L$), width ($W$), and thickness ($b$) satisfy $L \gg W \gg b$, exhibit mechanical behaviors intermediate between one-dimensional rods ($L \gg W, b$) and two-dimensional plates ($L, W \gg b$). In quadratic Kirchhoff-type rod-based frameworks, such as Discrete Elastic Rods (DER), the governing equilibrium equations are independent of width, and therefore these models cannot capture width-dependent mechanical effects. Reduced centerline-based ribbo...
cs.CL 58 papers
5	A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction 2605.05532	cs.CLcs.CY	Nicole Lincoln, Nick Whitehouse, Jaron Mar, Rivindu Perera	This paper evaluates whether a domain trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self hosted legal domain Mixture of Experts model, agains... This paper evaluates whether a domain trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self hosted legal domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It al...
34	The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation 2605.05594	cs.CLcs.CVcs.LG	Hoin Jung, Xiaoqian Wang	While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and form... While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we s...
44	When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models 2605.05626	cs.CLcs.AI	Vihaan Nama, Shreya Mendi, Zian Ye, Brinnae Bent	Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads... Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 2...
47	One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue 2605.05630	cs.CLcs.AIcs.CR	Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu	Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looki... Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting t...
58	Negative Before Positive: Asymmetric Valence Processing in Large Language Models 2605.05653	cs.CL	Sohan Venkatesh	Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure o... Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive ou...
63	XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity 2605.05662	cs.CLcs.AI	Dasol Choi, Eugenia Kim, Jaewon Noh, Sang Seo, Eunmi Kim	Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal ha... Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench. a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within ...
67	Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning 2605.05676	cs.CLcs.AI	Bing Wang, Ximing Li, Changchun Li, Jinjin Chi, Gang Niu	Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared p... Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that the cross-task interfer...
120	BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models 2605.05758	cs.CL	Xin Gao, Ruiyi Zhang, Meixi Du, Peijia Qin, Pengtao Xie	Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which ... Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain larg...
132	Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation 2605.05777	cs.CL	Huizi Cui, Huan Ma, Qilin Wang, Yuhang Gao, Changqing Zhang	Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing ... Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing uncertainty quantification methods typically depend on computationally expensive multiple sampling or internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process. To ...
155	Evaluation Awareness in Language Models Has Limited Effect on Behaviour 2605.05835	cs.CLcs.CY	Amelie Knecht, Lucas Florin, Thilo Hagendorff	Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evalu... Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight LRMs and benchmarks covering safety, alignment, mo...
182	Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention 2605.05892	cs.CLcs.LG	Zehao Jin, Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang	Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that exi... Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which t...
183	Logic-Regularized Verifier Elicits Reasoning from LLMs 2605.05893	cs.CLcs.AI	Xinyu Wang, Changzhi Sun, Lian Cheng, Yuanbin Wu, Dell Zhang	Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typicalverifiers require resource-intensive superviseddataset construction, which is costly and faceslimitations in data diversity. In this paper, wepropose LOVER, an unsupervise... Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typicalverifiers require resource-intensive superviseddataset construction, which is costly and faceslimitations in data diversity. In this paper, wepropose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats theverifier as a binary latent variable, utilizinginternal activations and enforcing three logical constraints on multiple reasoning paths:negation consistency, intra-group consistency,a...
199	Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM 2605.05927	cs.CLcs.SDeess.AS	Wenqian Cui, Xiao-Hui Li, Daxin Tan, Qiyong Zheng, Irwin King	Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation... Speech large language models (SLMs) are typically built from text large language model (TLM) checkpoints, yet they still suffer from a substantial modality gap. Prior work has mainly attempted to reduce this gap from the output side by making speech generation more text-like, but the gap remains. We argue that the key remaining bottleneck lies on the input side. We propose TextPro-SLM, an SLM that makes spoken input more closely resemble that of a prosody-aware text LLM. TextPro-SLM combines Whi...
210	Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation 2605.05950	cs.CL	Siyuan Li, Aodu Wulianghai, Xi Lin, Xibin Yuan, Qinghua Mao	The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a critical task for multimedia moderation. Existing detectors often rely on statistical cues or ... The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a critical task for multimedia moderation. Existing detectors often rely on statistical cues or model-specific heuristics, making them vulnerable to paraphrasing and adversarial manipulations, and consequently limiting their robustness and interpretability. In this work, we proposeLiSCP , a novel lightweight stylistic consistency prof...
212	Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits 2605.05953	cs.CLcs.AI	Erik Nielsen, Elia Cunegatti, Marcus Vukojevic, Giovanni Iacca	One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation... One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residua...
213	TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity 2605.05955	cs.CLcs.CV	Zheyuan Yang, Liqiang Shang, Junjie Chen, Xun Yang, Chenglong Xu	We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 1... We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples f...
217	Tatarstan Toponyms: A Bilingual Dataset and Hybrid RAG System for Geospatial Question Answering 2605.05962	cs.CL	Mullosharaf K. Arabov	This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrati... This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrative, and coordinate information (93.1% georeferenced). Based on this dataset, a question-answering corpus of approximately 39,000 question-context-answer triples is constructed with guaranteed answer localization. A hybrid retriever integrat...
241	From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence 2605.06006	cs.CL	Premtim Sahitaj, Jawan Kolanowski, Ariana Sahitaj, Veronika Solopova, Max Upravitelev	Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-gra... Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-grained evidence from full fact-checking articles. We compile 13,106 PolitiFact articles with claims, verdicts, and all referenced sources, and we identify 49,718 in-article hyperlinks as natural anchors to pinpoint key evidence. Our framework...
242	PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue 2605.06007	cs.CLcs.AIcs.HC	Hyunbae Jeon, Jinho D. Choi	As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas -- such as authoritative instructors, uncooperative merchants, or distracted workers -- they require distinct, human-like turn-taking behaviors to maintain psych... As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas -- such as authoritative instructors, uncooperative merchants, or distracted workers -- they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating ``always-yield'' policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating ...
252	More Aligned, Less Diverse? Analyzing the Grammar and Lexicon of Two Generations of LLMs 2605.06030	cs.CL	Adrián Gude, Roi Santos-Ríos, Francis Bond, Dan Flickinger, Carlos Gómez-Rodríguez	This study contributes to a growing line of research in comparing LLM-generated texts with human-authored text, in this case, English news text. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks. Our analysis co... This study contributes to a growing line of research in comparing LLM-generated texts with human-authored text, in this case, English news text. We focus in particular on the evaluation of syntactic properties through formal grammar frameworks. Our analysis compares two generations of LLMs in the context of two human-authored English news datasets from two different years. Employing the Head-Driven Phrase Structure Grammar (HPSG) formalism, we investigate the distributions of syntactic structure...
281	Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training 2605.06076	cs.CL	Hang Chen, Jiaying Zhu, Hongyang Chen, Hongxu Liu, Xinyu Yang	The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundame... The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout ...
283	Milestone-Guided Policy Learning for Long-Horizon Language Agents 2605.06078	cs.CLcs.AI	Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li	While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penali... While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional st...
296	Uncovering Entity Identity Confusion in Multimodal Knowledge Editing 2605.06096	cs.CLcs.CV	Shu Wu, Xiaotian Ye, Xinyu Mou, Dongsheng Liu, Xiaohan Wang	Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited model... Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity's identity unexpectedly return information about the new entity. To rigorously investigate EIC, we constru...
313	MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval 2605.06132	cs.CL	Chunyu Li, Jingyi Kang, Ding Chen, Mengyuan Zhang, Jiajun Shen	In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and la... In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three spe...
320	IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences 2605.06142	cs.CLcs.AI	Yehudit Aperstein, Eden Moran, Alexander Apartsin	When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience ... When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Imp...
353	A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping 2605.06200	cs.CL	Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li	Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such... Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity...
367	TIDE: Every Layer Knows the Token Beneath the Context 2605.06216	cs.CLcs.AIcs.LG	Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar	We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare ... We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type distribution of vocabulary causes rare-token embeddings are chronically under-trained due to receiving a fraction of the cumulative gradient signal compared to common tokens; and (ii) the Contextual Collapse...
370	UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification 2605.06221	cs.CL	Qihang Fan, Huaibo Huang, Zhiying Wu, Bingning Wang, Ran He	As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid ... As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency of long-context processing, several novel low-complexity hybrid architectures have recently been proposed, effectively alleviating the computational burden of long-context inference. However, existing research on long-context prefill acceleration remains predominantly focused on sparse attention mechani...
379	YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling 2605.06231	cs.CL	Fengze Guo, Yue Chang	This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, a... This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization, which identifies polarized social media content in 22 languages through three subtasks: binary detection, target classification, and manifestation identification. We propose a heterogeneous ensemble of multilingual pretrained models, combining XLM-RoBERTa-large and mDeBERTa-v3-base. We investigate techniques such as multi-task learning, translation-based data augmenta...
384	Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning 2605.06241	cs.CL	Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna	Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this... Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint ...
399	Linear Semantic Segmentation for Low-Resource Spoken Dialects 2605.06276	cs.CLcs.AI	Kirill Chirkunov, Younes Samih, Abed Alhakim Freihat, Hanan Aldarmaki	Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits in... Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversation...
404	Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement 2605.06283	cs.CL	Jessica Huynh, Alfredo Gomez, Athiya Deviyani, Renee Shelby, Jeffrey P. Bigham	Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agr... Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or \emph{holistic} judgment - for example, rating the ``quality'' of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for...
405	LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG 2605.06285	cs.CLcs.LG	Yijia Zheng, Marcel Worring	Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a... Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due...
409	Log-Likelihood, Simpson's Paradox, and the Detection of Machine-Generated Text 2605.06294	cs.CLcs.AIcs.LG	Tom Kempton, Viktor Drobnyi, Maeve Madigan, Stuart Burrell	The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more pro... The ability to reliably distinguish human-written text from that generated by large language models is of profound societal importance. The dominant approach to this problem exploits the likelihood hypothesis: that machine-generated text should appear more probable to a detector language model than human-written text. However, we demonstrate that the token-level signal distinguishing human and machine text is non-uniform across the hidden space of the detector model, and naively averaging likeli...
417	MultiLinguahah : A New Unsupervised Multilingual Acoustic Laughter Segmentation Method 2605.06309	cs.CL	Callejas Sofia, Gomez Nahuel, Pelachaud Catherine, Ravenet Brian, Barriere Valentin	Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting is e... Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting is even more difficult. Currently, Machine Learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that sets up the laughter segm...
423	Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation 2605.06318	cs.CLcs.CY	Maximilian Maurer, Maximilian Linde, Gabriella Lapesa	Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggrega... Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced. Data collection practices thus shifted towards increasing the annotator numbers and releasing disaggregated datasets, harmful language being most resourced due to its high subjectivity. While this resulted in rich information about \textit{who} annotated (sociodemographics, attitudes, etc.), the \textit{what} (e.g., linguistic properties of i...
427	Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning 2605.06326	cs.CL	Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang	Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no... Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effe...
428	Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity 2605.06327	cs.CLcs.AIcs.LG	Florian A. D. Burnat, Brittany I. Davidson	Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable w... Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation. We define evaluation-context divergence as an observable within-item change in behavior induced by framing a fixed task as an evaluation, a live deployment interaction, or a neutral request, and present a paired-prompt protocol that measures it in open-weight LLMs while controlling for paraphrase ...
433	MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents 2605.06334	cs.CLcs.LGcs.LO	Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck	Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically ... Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack rel...
440	Don't Lose Focus: Activation Steering via Key-Orthogonal Projections 2605.06342	cs.CL	Haoyan Luo, Mateo Espinosa Zarlenga, Mateja Jamnik	Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors ... Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attenti...
447	SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following 2605.06353	cs.CL	Beatriz Canaverde, Duarte M. Alves, José Pombal, Giuseppe Attanasio, André F. T. Martins	In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models... In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions b...
473	GATHER: Convergence-Centric Hyper-Entity Retrieval for Zero-Shot Cell-Type Annotation 2605.06403	cs.CLcs.IR	Zhonghui Zhang, Feng Jiang, Shaowei Qin, Jiahao Zhao, Min Yang	Zero-shot single-cell cell-type annotation aims to determine a cell's type from a given set of expressed genes without any training. Existing knowledge-graph-based RAG approaches retrieve evidence by expanding from source entities and relying on iterative LLM ... Zero-shot single-cell cell-type annotation aims to determine a cell's type from a given set of expressed genes without any training. Existing knowledge-graph-based RAG approaches retrieve evidence by expanding from source entities and relying on iterative LLM reasoning. However, in this setting each query contains tens to hundreds of genes, where no single gene is decisive and the label emerges only from their collective co-occurrence. Such hyper-entity queries fundamentally challenge local, ent...
479	MiA-Signature: Approximating Global Activation for Long-Context Understanding 2605.06416	cs.CL	Yuqing Li, Jiangnan Li, Mo Yu, Zheng Lin, Weiping Wang	A growing body of work in cognitive science suggests that reportable conscious access is associated with \emph{global ignition} over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumera... A growing body of work in cognitive science suggests that reportable conscious access is associated with \emph{global ignition} over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism that cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce th...
481	From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection 2605.06426	cs.CL	Diego Rossini, Lonneke van der Plas	We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which j... We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated o...
484	COVID-19 Infodemic. Understanding content features in detecting fake news using a machine learning approach 2605.06435	cs.CLcs.AIcs.LG	Vimala Balakrishnan, Lee Zing Hii, Eric Laporte	The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selecti... The use of content features, particularly textual and linguistic for fake news detection is under-researched, despite empirical evidence showing the features could contribute to differentiating real and fake news. To this end, this study investigates a selection of content features such as word bigrams, part of speech distribution etc. to improve fake news detection. We performed a series of experiments on a new dataset gathered during the COVID-19 pandemic and using Decision Tree, K-Nearest Nei...
506	Towards Emotion Consistency Analysis of Large Language Models in Emotional Conversational Contexts 2605.06476	cs.CL	Sneha Oram, Ojaswita Bhushan, Pushpak Bhattacharyya	In this work, we conduct an analysis to examine the consistency of Large Language Models (LLMs) with respect to their own generated responses in an emotionally-driven conversational context. Specifically, the text generated by LLM is framed as a query to the s... In this work, we conduct an analysis to examine the consistency of Large Language Models (LLMs) with respect to their own generated responses in an emotionally-driven conversational context. Specifically, the text generated by LLM is framed as a query to the same model, and its responses are subsequently assessed. This is performed with three queries across two dimensions of extreme and moderate emotions. The three queries are, in particular, false claim queries that contain inherently wrong ass...
512	Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks 2605.06485	cs.CLcs.AI	Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal	Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers... Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this stru...
520	The Frequency Confound in Language-Model Surprisal and Metaphor Novelty 2605.06506	cs.CL	Omar Momen, Sina Zarrieß	Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor ... Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surpris...
533	STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? 2605.06527	cs.CL	Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun	Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We iden... Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this ...
542	Efficient Pre-Training with Token Superposition 2605.06546	cs.CL	Bowen Peng, Théo Gigant, Jeffrey Quesnelle	Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-i... Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition pha...
543	Continuous Latent Diffusion Language Model 2605.06548	cs.CLcs.AIcs.CV	Hongcan Guo, Qinyu Zhao, Yian Zhao, Shen Nie, Rui Zhu	Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable ... Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition. Cola DLM first learns a stable ...
546	Long Context Pre-Training with Lighthouse Attention 2605.06554	cs.CL	Bowen Peng, Subho Ghosh, Jeffrey Quesnelle	Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention ... Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only symmetrical selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed towards the end of the training. Our hierarchical selection is also gradient-free, which exempts us from dealing with a complicated and potentially inefficient backward pas...
567	Automated Clinical Report Generation for Remote Cognitive Remediation: Comparing Knowledge-Engineered Templates and LLMs in Low-Resource Settings 2605.06594	cs.CL	Yongxin Zhou, Fabien Ringeval, François Portet	The growing demand for cognitive remediation therapy, combined with limited speech therapist availability, has accelerated the adoption of remote rehabilitation tools. These systems generate large volumes of interaction data that are difficult for clinicians t... The growing demand for cognitive remediation therapy, combined with limited speech therapist availability, has accelerated the adoption of remote rehabilitation tools. These systems generate large volumes of interaction data that are difficult for clinicians to review efficiently. This paper investigates automated clinical report generation for avatar-guided, home-based cognitive remediation sessions in a low-resource setting with no reference reports. We present and compare two approaches: (1) ...
570	UniSD: Towards a Unified Self-Distillation Framework for Large Language Models 2605.06597	cs.CLcs.AIcs.LG	Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo	Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is tas... Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose Uni...
583	Algospeak, Hiding in the Open: The Trade-off Between Legible Meaning and Detection Avoidance 2605.06619	cs.CLcs.CY	Jan Fillies, Ronald E. Robertson, Jeffrey Hancock	As large language models (LLMs) increasingly mediate both content generation and moderation, linguistic evasion strategies known as Algospeak have intensified the coevolution between evaders and detectors. This research formalizes the underlying dynamics groun... As large language models (LLMs) increasingly mediate both content generation and moderation, linguistic evasion strategies known as Algospeak have intensified the coevolution between evaders and detectors. This research formalizes the underlying dynamics grounded in a joint action model: when Algospeak increases, detectability and understandability decrease. Further, the concept of Majority Understandable Modulation (MUM) is introduced and defined as the modulation level at which additional evas...
586	Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation 2605.06625	cs.CL	Hakyung Sung, Gyu-Ho Shin	We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness... We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser di...
592	Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents 2605.06635	cs.CL	Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah	Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, ... Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and...
599	StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction 2605.06642	cs.CLcs.AI	Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr	Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over exte... Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from...
cs.CR 17 papers
48	Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning 2605.05632	cs.CRcs.CLcs.LG	Samuel Korn	Retrieval-Augmented Generation (RAG) systems are vulnerable to knowledge base poisoning, yet existing attacks have been evaluated almost exclusively against vanilla retrieve-then-generate pipelines. Architectures designed to handle conflicting retrieved inform... Retrieval-Augmented Generation (RAG) systems are vulnerable to knowledge base poisoning, yet existing attacks have been evaluated almost exclusively against vanilla retrieve-then-generate pipelines. Architectures designed to handle conflicting retrieved information - multi-agent debate, agentic retrieval, recursive language models - remain untested against adversarially optimized contradictions. We evaluate four RAG architectures (vanilla RAG, agentic RAG, MADAM-RAG, and Recursive Language Model...
59	TUANDROMD-X: Advanced Entropy and Visual Analytics Dataset for Enhanced Malware Detection and Classification 2605.06718	cs.CRcs.LG	Parthajit Borah, Upasana Sarmah, D. K. Bhattacharyya, J. K. Kalita	Malware and malware-based attacks are becoming more prevalent and complex. Attackers regularly come up with new techniques that have the ability to evade conventional and signature-based malware defense. In order to address such threats, there is an increasing... Malware and malware-based attacks are becoming more prevalent and complex. Attackers regularly come up with new techniques that have the ability to evade conventional and signature-based malware defense. In order to address such threats, there is an increasing demand for advanced and better defense solutions. Machine learning-based techniques are efficiently capable of defending against malware and malware-based attacks. Nevertheless, creating and efficiently testing such techniques demand high-...
87	SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety 2605.05704	cs.CRcs.AI	Zhe Liu, Zonghao Ying, Wenxin Zhang, Quanchen Zou, Deyue Zhang	With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into exec... With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate...
135	Stego Battlefield: Evaluating Image Steganography Attacks and Steganalysis Defenses 2605.05789	cs.CRcs.CV	Zhen Sun, Zongmin Zhang, Leyi Sheng, Yule Liu, Yifan Liao	Image steganography is widely used to protect user privacy and enable covert communication. However, it can also be abused by the adversary as a covert channel to bypass content moderation, disseminate harmful semantics, and even hide malicious instructions in... Image steganography is widely used to protect user privacy and enable covert communication. However, it can also be abused by the adversary as a covert channel to bypass content moderation, disseminate harmful semantics, and even hide malicious instructions in images to elicit dangerous outputs from large models, posing a practical security risk that continues to evolve. To address the lack of a unified and systematic evaluation framework, we propose SADBench, a systematic benchmark that assesse...
142	LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution 2605.05807	cs.CRcs.AI	Christopher G. Pedraza Pohlenz, Hassan Jalil Hadi, Ali Hassan, Ali Shoker	LLMs are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and insufficient code-level grounding for identifying malicious and vulnerable code segments. To address these limitat... LLMs are increasingly explored for malware analysis; however, current LLM-based malware attribution remains limited by unsupported indicators and insufficient code-level grounding for identifying malicious and vulnerable code segments. To address these limitations, this research introduces LCC-LLM, a code-centric benchmark dataset and evidence-grounded framework for malware attribution and multi-task static malware analysis. The proposed LCCD dataset contains approximately 34K PE samples process...
148	LeakDojo: Decoding the Leakage Threats of RAG Systems 2605.05818	cs.CRcs.AIcs.CL	Maosen Zhang, Jianshuo Dong, Boting Lu, Wenyue Li, Xiaoping Zhang	Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction-following capabilities,... Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to leverage external knowledge, but also exposes valuable RAG databases to leakage attacks. As RAG systems grow more complex and LLMs exhibit stronger instruction-following capabilities, existing studies fall short of systematically assessing RAG leakage risks. We present LeakDojo, a configurable framework for controlled evaluation of RAG leakage. Using LeakDojo, we benchmark six existing attacks across fourteen LLMs, four...
158	LoopTrap: Termination Poisoning Attacks on LLM Agents 2605.05846	cs.CRcs.AI	Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu, Yaopeng Wang	Modern LLM agents solve complex tasks by operating in iterative execution loops, where they repeatedly reason, act, and self-evaluate progress to determine when a task is complete. In this work, we show that while this self-directed loop facilitates autonomy, ... Modern LLM agents solve complex tasks by operating in iterative execution loops, where they repeatedly reason, act, and self-evaluate progress to determine when a task is complete. In this work, we show that while this self-directed loop facilitates autonomy, it also introduces a critical risk: by injecting malicious prompts into the agent's context, an adversary can distort the agent's termination judgment, making it believe the task remains incomplete and leading to unbounded computation.To un...
224	PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts 2605.05974	cs.CRcs.AI	Qinfeng Li, Yuntai Bao, Jianghui Hu, Wenqi Zhang, Jintao Chen	LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causi... LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non-portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that ...
235	Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks 2605.05995	cs.CRcs.AIcs.CL	Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou	The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under p... The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while...
304	When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents 2605.06731	cs.CRcs.CLcs.LG	Xiaoyu Xu, Minxin Du, Qipeng Xie, Haobin Ke, Qingqing Ye	Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term sta... Personalized LLM agents maintain persistent cross-session state to support long-horizon collaboration. Yet, this persistence introduces a subtle but critical security vulnerability: routine user-agent interactions can gradually reshape an agent's long-term state, inadvertently weakening future confirmation boundaries, expanding tool-use defaults, and escalating autonomous behavior over time. We formalize this risk as \textbf{unintended long-term state poisoning}. To systematically study it, we i...
326	Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles 2605.06153	cs.CRcs.CV	Enoal Gesny, Eva Giboulot	The rapid emergence of generative image models has led to the development of specialized watermarking techniques, particularly in-generation methods such as seed-based embedding. However, current evaluations in this area remain largely empirical, making them h... The rapid emergence of generative image models has led to the development of specialized watermarking techniques, particularly in-generation methods such as seed-based embedding. However, current evaluations in this area remain largely empirical, making them heavily reliant on the specific model architectures used for generation and inversion. This prevents any clear conclusion on the performance of any method, especially regarding security, for which a rigorous definition is lacking. Against th...
415	From Specification to Deployment: Empirical Evidence from a W3C VC + DID Trust Infrastructure for Autonomous Agents 2605.06738	cs.CRcs.AI	Lars Kersten Kroehl	Autonomous AI agents now transact at production scale -- 69,000 bots executing 165 million transactions across 50 million USDC in cumulative volume on a single marketplace -- without any shared trust layer between participants. Regulatory frameworks (Singapore... Autonomous AI agents now transact at production scale -- 69,000 bots executing 165 million transactions across 50 million USDC in cumulative volume on a single marketplace -- without any shared trust layer between participants. Regulatory frameworks (Singapore IMDA, NIST CAISI, EU AI Act) and major AI laboratories (Anthropic, Google) have independently converged on the same structural requirement: an open, portable, cryptographically verifiable trust infrastructure for autonomous agents that no ...
426	Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation 2605.06324	cs.CRcs.CYcs.LG	Florian A. D. Burnat, Brittany I. Davidson	Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by ... Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation grap...
430	Fine-Tuning Small Language Models for Solution-Oriented Windows Event Log Analysis 2605.06330	cs.CRcs.AI	Siraaj Akhtar, Saad Khan, Simon Parkinson	Large language models (LLMs) have shown promise for event log analysis, but their high computational requirements, reliance on cloud infrastructure, and security concerns limit practical deployment. In addition, most existing approaches focus only on the ident... Large language models (LLMs) have shown promise for event log analysis, but their high computational requirements, reliance on cloud infrastructure, and security concerns limit practical deployment. In addition, most existing approaches focus only on the identification of the problem and do not provide actionable remediation. Small language models (SLMs) present a light-weight alternative that can be fine-tuned for a specific purpose and hosted locally. This paper investigates whether SLMs, when...
522	On the Security of Research Artifacts 2605.06508	cs.CRcs.AI	Nanda Rani, Christian Rossow	Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential secur... Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues...
569	FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning 2605.06596	cs.CRcs.LG	Su Zhang, Junfeng Guo, Heng Huang	Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effect... Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effectiveness in centralized LLM fine-tuning. However, this type of method faces several challenges and remains underexplored in federated learning (FL), a widely-applied paradigm for fine-tuning LLMs collaboratively on private data across differ...
572	Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches 2605.06601	cs.CRcs.AI	Isaac David, Arthur Gervais	Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. ... Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diff...
cs.CV 108 papers
4	Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey 2605.06714	cs.CVcs.AI	Yiwen Xu, Tariq M. Khan, Yang Song, Erik Meijering	Edge deep learning, a paradigm change reconciling edge computing and deep learning, facilitates real-time decision making attuned to environmental factors through the close integration of computational resources and data sources. Here we provide a comprehensiv... Edge deep learning, a paradigm change reconciling edge computing and deep learning, facilitates real-time decision making attuned to environmental factors through the close integration of computational resources and data sources. Here we provide a comprehensive review of the current state of the art in edge deep learning, focusing on computer vision applications, in particular medical diagnostics. An overview of the foundational principles and technical advantages of edge deep learning is presen...
12	Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings 2605.05547	cs.CV	Alice Heiman	The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground r... The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restora...
13	A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series 2605.05549	cs.CV	Motasem Alkayid, Zhengsen Xu, Saeid Taleghanidoozdoozan, Yimin Zhu, Megan Greenwood	Although tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series data is critical for supporting various environmental applications, it is a challenging task due to several key difficulties: the subtle signature diffe... Although tree species classification from Moderate Resolution Imaging Spectroradiometer (MODIS) time series data is critical for supporting various environmental applications, it is a challenging task due to several key difficulties: the subtle signature differences among tree species, strong spatial-spectral-temporal information coupling, and the difficulty of modeling large-scale topological context information. To better address these challenges, this paper presents a novel Graph-regulated Di...
16	An extremely coarse feedback signal is sufficient for learning human-aligned visual representations 2605.05556	cs.CV	Yash Mehta, Michael F. Bonner	Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively e... Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively embraced finer-grained supervisory signals, from object classification to contrastive self-supervised objectives that maximize distinctions among individual images, yet the role of supervisory signal granularity on brain alignment remains la...
23	Text-to-CAD Retrieval: a Strong Baseline 2605.05572	cs.CV	Honghu Pan, Zibo Du, Daxiang Liu, Chengliang Liu, Xiaoling Luo	Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalabilit... Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queri...
30	Uncertainty-Guided Edge Learning for Deep Image Regression in Remote Sensing 2605.05590	cs.CV	Anh Vu Nguyen, Dino Sejdinovic, Tat-Jun Chin	Edge learning refers to training machine learning models deployed on edge platforms, typically using new data accumulated onboard. The computational limitations on edge devices affect not only model optimisation, but also calculation of the predictive uncertai... Edge learning refers to training machine learning models deployed on edge platforms, typically using new data accumulated onboard. The computational limitations on edge devices affect not only model optimisation, but also calculation of the predictive uncertainty of the current model on the unlabelled data, which is vital for informing model updating. In this paper, we investigate edge learning in the context of performing deep image regression on a remote sensing satellite, where a deep network...
41	RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis 2605.05616	cs.CVcs.LG	Songxiao Yang, Haolin Wang, Yao Fu, Junmu Peng, Lin Fan	Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, oft... Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 han...
45	Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping 2605.05627	cs.CVcs.AIcs.LGcs.RO	Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan	Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learni... Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imba...
50	Learning a Delighting Prior for Facial Appearance Capture in the Wild 2605.05636	cs.CVcs.GR	Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang	High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflecta... High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm into training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for tr...
52	AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries 2605.05640	cs.CV	Zhen Zhang, Yuhang Yang, Yunxiang Jiang, Yuhuan Lu, Haifeng Lu	Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-cliped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world ... Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-cliped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study \textbf{Vague-Query-driven video Affective Understanding (VQAU)}, a new task that requires models to...
55	MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality 2605.05646	cs.CV	Panqi Yang, Haodong Jing, Jiahao Chao, Tingyan Xiang, Li Lin	Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces ... Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization...
64	Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes 2605.05664	cs.CV	Yiyang Shen, Yin Yang, Kun Zhou, Tianjia Shao	We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restora... We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restoration, a training-free view-consistency conditioned sampling process in the diffusion model for refined Gaussian optimization, and a camera trajectory planning scheme to ensure comprehensive scene coverage. The specialized diffusion model is...
66	EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation 2605.05674	cs.CVcs.AIcs.LG	Dongfang Zhao	Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wron... Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local...
69	MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery 2605.05680	cs.CV	Nanjie Yao, Junlong Ren, Wenhao Shen, Hao Wang	This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveragi... This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, ...
75	R2H-Diff: Guided Spectral Diffusion Model for RGB-to-Hyperspectral Reconstruction 2605.05688	cs.CV	Songyu Ding, Ronggiang Zhao, Mingchun Sun, Jie Liu	RGB-to-hyperspectral image reconstruction is a highly ill-posed inverse problem, since multiple plausible spectral distributions may correspond to the same RGB observation. Existing regression-based methods usually learn a deterministic mapping, which limits t... RGB-to-hyperspectral image reconstruction is a highly ill-posed inverse problem, since multiple plausible spectral distributions may correspond to the same RGB observation. Existing regression-based methods usually learn a deterministic mapping, which limits their ability to model reconstruction uncertainty and often leads to over-smoothed spectral responses. Although diffusion models provide strong distribution modeling capability, their direct application to hyperspectral reconstruction remain...
77	CFE-PPAR: Compression-friendly encryption for privacy-preserving action recognition leveraging video transformers 2605.05692	cs.CVcs.AIcs.CR	Haiwei Lin, Shoko Imaizumi, Hitoshi Kiya	Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaini... Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaining high recognition performance. However, these methods lead to a catastrophic decrease in recognition performance and visual quality when the encrypted videos are compressed. That is, the previous methods are not compression-friendly. To a...
79	Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition 2605.05694	cs.CV	Xiwen Luo, Jia Li, Rencheng Song, Yu Liu, Juan Cheng	Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary ph... Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often dis...
92	Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling 2605.05711	cs.CVcs.GRcs.HCcs.LGcs.MM	Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim	Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive pot... Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constru...
93	EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation 2605.05712	cs.CV	Ziheng Xi, Jiayi Yu, Yitao Wang, Yanbo Duan, Jianjiang Feng	Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlus... Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilater...
94	TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation 2605.05714	cs.CVcs.RO	Hanyu Zhou, Chuanhao Ma, Gim Hee Lee	Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene lay... Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instea...
98	$\mathcal{B}^{3}$-Net: Controlled Posterior Bridge Learning for Multi-Task Dense Prediction 2605.05722	cs.CV	Meihua Zhou, Li Yang	Multi-task dense prediction solves complementary pixel-level tasks in a unified model, such as semantic segmentation, depth estimation, surface normal estimation, and edge detection. Existing decoder-side interactions use attention, prompts, routing, diffusion... Multi-task dense prediction solves complementary pixel-level tasks in a unified model, such as semantic segmentation, depth estimation, surface normal estimation, and edge detection. Existing decoder-side interactions use attention, prompts, routing, diffusion, Mamba, or bridge features to exchange task evidence, but most of them organize this evidence implicitly. They usually fuse task features by similarity or affinity, without explicitly modeling that evidence reliability varies across tasks ...
115	Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction 2605.05749	cs.CV	Feifei Li, Qi Song, Chi Zhang, Rui Huang	Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most ... Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-...
117	Jointly Learning Structured Representations and Stabilized Affinity for Human Motion Segmentation 2605.05753	cs.CV	Xianghan Meng, Zhiyuan Huang, Zhengyu Tong, Chun-Guang Li	Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clusterin... Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clustering, which are grounded on the assumption that the distribution of high-dimensional temporal features well aligns with a Union-of-Subspaces (UoS). For videos in the real world, however, the raw frame-level features often violate the UoS assum...
122	iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models 2605.05761	cs.CV	Fakrul Islam Tushar, Umme Hafsa Momy, Joseph Y. Lo, Geoffrey D. Rubin	We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficul... We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profil...
124	X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction 2605.05765	cs.CV	Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou	Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal unders... Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified ...
130	The autoPET3 Challenge -- Automated Lesion Segmentation in Whole-Body PET/CT - Multitracer Multicenter Generalization 2605.05775	cs.CVcs.AI	Jakob Dexl, Katharina Jeblick, Andreas Mittermeier, Balthasar Schachtner, Anna Theresa Stüber	We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the Un... We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered fou...
134	Steering Visual Generation in Unified Multimodal Models with Understanding Supervision 2605.05781	cs.CVcs.AI	Zeyu Liu, Zanlin Ni, Yang Yue, Cheng Da, Huan Yang	Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for ... Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightw...
140	Na-IRSTD: Enhancing Infrared Small Target Detection via Native-Resolution Feature Selection and Fusion 2605.05804	cs.CV	Qian Xu, Chi Zhang, Qiming Zhang, Xi Li, Haojuan Yuan	Infrared small target detection (IRSTD) faces the inherent challenge of precisely localizing dim targets amid complex background clutter. While progress has been made, existing methods usually follow conventional strategies to downsample features and discard s... Infrared small target detection (IRSTD) faces the inherent challenge of precisely localizing dim targets amid complex background clutter. While progress has been made, existing methods usually follow conventional strategies to downsample features and discard small targets' details, resulting in suboptimal performance. In this paper, we present Na-IRSTD, a native-resolution feature extraction and fusion framework for IRSTD. This framework elegantly incorporates native-resolution features to prese...
144	CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs 2605.05810	cs.CV	Zhengru Fang, Yanan Ma, Yu Guo, Senkang Hu, Yixian Zhang	When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting... When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction...
150	ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction 2605.05820	cs.CV	Md Touhidul Islam, Yasir Mahmud, Sujan Kumar Saha, Mark Tehranipoor, Farimah Farahmandi	Automated data extraction from line charts remains fundamentally bottlenecked by extreme stylistic diversity and a severe scarcity of comprehensively annotated, real-world datasets. Current end-to-end pipelines depend heavily on costly manual annotations, crip... Automated data extraction from line charts remains fundamentally bottlenecked by extreme stylistic diversity and a severe scarcity of comprehensively annotated, real-world datasets. Current end-to-end pipelines depend heavily on costly manual annotations, crippling their ability to generalize across arbitrary aesthetics and grid layouts. Furthermore, existing models suffer from two critical failure modes during reconstruction. First, extracting thin, intersecting curves frequently causes structu...
152	Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media 2605.05831	cs.CV	Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar	The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reason... The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it...
159	VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding 2605.05848	cs.CVcs.AI	Kuanwei Lin, Wenhao Zhang, Ge Li	Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific sett... Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router fra...
160	Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection 2605.05850	cs.CV	Letian Bai, Xuanming Cao, Juan Du, Chengyu Tao	Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly rely on projecting 3D observations into multi-view representations that primarily capture geometric cues rather t... Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly rely on projecting 3D observations into multi-view representations that primarily capture geometric cues rather than realistic visual semantics and process them with vision encoders pretrained on RGB data, leading to a significant domain gap between the encoder and the projected representations. To address this issue, we propose Align3D-AD, a unified ...
169	InkDiffuser: High-Fidelity One-shot Chinese Calligraphy via Differentiable Morphological Optimization 2605.05865	cs.CV	Kunchong Shi, Jing Zhang	Current Chinese calligraphy generation methods suffer from poor stroke rendering and unrealistic ink morphology, resulting in outputs with limited visual fidelity and artistic fluidity. To address this problem, we propose \textbf{InkDiffuser}, a diffusion-base... Current Chinese calligraphy generation methods suffer from poor stroke rendering and unrealistic ink morphology, resulting in outputs with limited visual fidelity and artistic fluidity. To address this problem, we propose \textbf{InkDiffuser}, a diffusion-based generative framework for one-shot Chinese calligraphy synthesis. To guarantee high-fidelity rendering, we introduce two core contributions: a high-frequency enhancement mechanism and a Differentiable Ink Structure (DIS) loss that explicit...
178	Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models 2605.05886	cs.CV	Daniel Sungho Jung, Kyoung Mu Lee	Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabil... Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning of human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to...
179	DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation 2605.05889	cs.CVcs.AIcs.LGmath.NA	Sankarshana Venugopal, Mohammad Mostafavi, Jonghyun Choi	Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-fre... Diffusion-based image-to-image (I2I) translation excels in high-fidelity generation but suffers from slow sampling in state-of-the-art Diffusion Bridge Models (DBMs), often requiring dozens of function evaluations (NFEs). We introduce DBMSolver, a training-free sampler that exploits the semi-linear structure of DBM's underlying SDE and ODE via exponential integrators, yielding highly-efficient 1st- and 2nd-order solutions. This reduces NFEs by up to 5x while boosting quality (e.g., FID drops 53%...
181	MTL-MAD: Multi-Task Learners are Effective Medical Anomaly Detectors 2605.05891	cs.CVcs.AIcs.LG	Bogdan Alexandru Bercean, Florinel Alin Croitoru, Vlad Hondru, Ciprian Mihai Ceausescu, Andreea Iuliana Ionescu	Anomaly detection in medical images is a challenging task, since anomalies are not typically available during training. Recent methods leverage a single pretext task coupled with a large-scale pre-trained model to reach state-of-the-art performance. Instead, w... Anomaly detection in medical images is a challenging task, since anomalies are not typically available during training. Recent methods leverage a single pretext task coupled with a large-scale pre-trained model to reach state-of-the-art performance. Instead, we propose to learn multiple self-supervised and pseudo-labeling tasks from scratch, using a joint model based on Mixture-of-Experts (MoE). By carefully integrating multiple proxy tasks, the joint model effectively learns a robust representa...
184	Detecting AI-Generated Videos with Spiking Neural Networks 2605.05895	cs.CVcs.AI	Minsuk Jang, Yujin Yang, Heeseon Kim, Minseok Son, Younghun Kim	Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a gener... Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation...
187	Understanding Cross-Language Transfer Improvements in Low-Resource HTR: The Role of Sequence Modeling 2605.05900	cs.CV	Sana Al-azzawi, Chang Liu, Nudrat Habib, Elisa Barney, Marcus Liwicki	Handwritten Text Recognition (HTR) for Arabic-script languages benefits from cross-language joint training under low-resource conditions, particularly when using CRNN-based models that combine convolutional encoders with sequence modeling. However, it remains ... Handwritten Text Recognition (HTR) for Arabic-script languages benefits from cross-language joint training under low-resource conditions, particularly when using CRNN-based models that combine convolutional encoders with sequence modeling. However, it remains unclear whether these improvements are better explained by shared visual representations or sequence-level dependencies. In this work, we conduct a controlled architectural study of line-level Arabic-script HTR, comparing CNN-only models wi...
189	Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers 2605.05908	cs.CVcs.AI	Frederik Schäfer, Luis Mandl, Lars Kälber, Tim Ricken	Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classific... Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB...
191	Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model 2605.05910	cs.CV	Junhui Yin, Nan Pu, Xinyu Zhang, Lingfeng Yang, Lin Wu	Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either... Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt d...
198	Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling 2605.05922	cs.CV	Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang	Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human ... Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: \textit{Discriminative RMs} regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning,...
200	Backdoor Mitigation in Object Detection via Adversarial Fine-Tuning 2605.05928	cs.CVcs.CR	Kealan Dunnett, Reza Arablouei, Dimity Miller, Volkan Dedeoglu, Raja Jurdak	Backdoor attacks can implant malicious behaviours into deep models while preserving performance on clean data, posing a serious threat to safety-critical vision systems. Although backdoor mitigation has been studied extensively for image classification, defens... Backdoor attacks can implant malicious behaviours into deep models while preserving performance on clean data, posing a serious threat to safety-critical vision systems. Although backdoor mitigation has been studied extensively for image classification, defenses for object detection remain comparatively underdeveloped. Adversarial fine-tuning is a common backdoor mitigation approach in classification, but adapting it to detection is nontrivial as classification-oriented adversarial generation do...
203	Whole-body CT attenuation and volume charts from routine clinical scans via evidence-grounded LLM report filtering 2605.05933	cs.CV	Christian Wachinger, Bernhard Renger, Christopher Späth, Jan Kirschke, Marcus Makowski	Interpreting quantitative CT biomarkers, such as organ volume and tissue attenuation, requires large-scale healthy reference distributions. However, creating these is challenging because clinical datasets are often heavily enriched with pathology. Here, we dev... Interpreting quantitative CT biomarkers, such as organ volume and tissue attenuation, requires large-scale healthy reference distributions. However, creating these is challenging because clinical datasets are often heavily enriched with pathology. Here, we develop an evidence-grounded, cross-verified large language model (LLM) ensemble to filter pathological findings from radiology reports, enabling the construction of pathology-reduced cohorts from over 350,000 CT examinations. Five LLMs, first...
206	RAWild: Sensor-Agnostic RAW Object Detection via Physics-Guided Curve and Grid Modeling 2605.05941	cs.CV	Shuhong Liu, Gengjia Chang, Jun Liu, Xuangeng Chu, Yinqiang Zheng	Camera sensor RAW data offers intrinsic advantages for object detection, including deeper bit depth, preserved physical information, and freedom from image signal processor (ISP) distortions. However, varying exposure conditions, spectral sensitivities, and bi... Camera sensor RAW data offers intrinsic advantages for object detection, including deeper bit depth, preserved physical information, and freedom from image signal processor (ISP) distortions. However, varying exposure conditions, spectral sensitivities, and bit depths across devices introduce substantially larger domain gaps than sRGB, making sensor-agnostic generalization a fundamental challenge. In this study, we present \textbf{RAWild}, a physics-guided global-local tone mapping framework for...
208	MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware 2605.05945	cs.CVcs.CL	Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor, Pratyush Patnaik, Shubhanshu Khatana	The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to captu... The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fails to capture the long horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour plus egocentric trajectories using ...
227	Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters 2605.05979	cs.CV	Hinako Mitsuoka, Kazuhiro Hotta	Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, par... Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter str...
232	iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring 2605.05990	cs.CVcs.AI	Abdullah Al Shafi, Kazi Saeed Alam	Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-... Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal wi...
237	4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding 2605.05997	cs.CV	Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li	Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which... Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that ...
240	Neuromorphic visual attention for Sign-language recognition on SpiNNaker 2605.06005	cs.CV	Sarka Liskova, Olha Vedmedenko, Mazdak Fatahi, Matej Hoffmann, P. Michael Furlong	Sign-language recognition has achieved substantial gains in classification accuracy in recent years; however, the latency and power requirements of most existing methods limit their suitability for real-time deployment. Neuromorphic sensing and processing offe... Sign-language recognition has achieved substantial gains in classification accuracy in recent years; however, the latency and power requirements of most existing methods limit their suitability for real-time deployment. Neuromorphic sensing and processing offer an alternative paradigm based on sparse, event-driven computation that supports low-latency and energy-efficient perception. In this work, we introduce an end-to-end neuromorphic architecture for American Sign Language (ASL) fingerspellin...
243	Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models 2605.06010	cs.CVcs.AI	Yuchen Guo, Junli Gong, Wenjun Dong, Yiuming Cheung, Weifeng Su	Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary informa... Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independ...
244	T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval 2605.06012	cs.CVcs.AI	Xiao Wang, Ziwen Wang, Weizhe Kong, Wentao Wu, Yuehang Li	Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where on... Vehicle Re-identification (Re-ID) aims to retrieve the most similar image to a given query from images captured by non-overlapping cameras. Extending vehicle Re-ID from image-only queries to text-based queries enables retrieval in real-world scenarios where only a witness description of the target vehicle is available. In this paper, we propose PFCVR, a Part-level Fine-grained Cross-modal Vehicle Retrieval model for text-to-image vehicle re-identification. PFCVR constructs locally paired images ...
247	PlotPick: AI-powered batch extraction of numerical data from scientific figures 2605.06021	cs.CVcs.DL	Tommy Carstensen	Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract st... Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform...
260	Domain Generalization through Spatial Relation Induction over Visual Primitives 2605.06043	cs.CV	Dat Nguyen, Duc-Duy Nguyen	Domain generalization requires identifying stable representations that support reliable classification across domains. Most existing methods seek such stability through improving the training process, for example, through model selection strategies, data augme... Domain generalization requires identifying stable representations that support reliable classification across domains. Most existing methods seek such stability through improving the training process, for example, through model selection strategies, data augmentation, or feature-alignment objectives. Although these strategies can be effective, they leave the representation learning of structural composition implicit, which may limit performance on compositional domain generalization benchmarks. ...
263	Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization 2605.06049	cs.CV	Weijian Su, Songqian Zhang, Yuqi Han, Jian Zhuang, Yongdong Huang	As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods stru... As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, ...
265	RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control 2605.06051	cs.CV	Youcan Xu, Jiaxin Shi, Zhen Wang, Wensong Song, Feifei Shao	Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-cau... Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatib...
275	PersonaGesture: Single-Reference Co-Speech Gesture Personalization for Unseen Speakers 2605.06064	cs.CV	Xiangyue Zhang, Yiyi Cai, Kunhang Li, Kaixing Yang, You Zhou	We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while ... We propose PersonaGesture, a diffusion-based pipeline for single-reference co-speech gesture personalization of unseen speakers. Given target speech and one motion clip from a new speaker, the model must synthesize gestures that follow the new utterance while retaining speaker-specific pose choices, without per-speaker optimization. This setting is useful for avatars and virtual agents, but it is hard because the reference mixes stable speaker habits with utterance-specific trajectories. Persona...
279	Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models 2605.06070	cs.CV	Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang	Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been wide... Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to ...
284	MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation 2605.06080	cs.CV	Shichao Kan, Xuyang Zhang, Haojie Zhang, Zhe Zhu, Yigang Cen	Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric th... Evaluating image captions without references remains challenging because global embedding similarity often misses fine-grained mismatches such as hallucinated objects, missing attributes, or incorrect relations. We propose MSD-Score, a reference-free metric that models image patch and text token embeddings as von Mises-Fisher mixtures on the unit hypersphere. Instead of treating each modality as a single point, MSD-Score formulates image-text matching as a multi-scale distributional scoring prob...
287	Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval 2605.06083	cs.CVcs.IRcs.LGcs.MM	Jun Li, Peifeng Lai, Xuhang Lou, Jinpeng Wang, Yuting Wang	Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval proces... Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this,...
288	AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes 2605.06084	cs.CV	Xiaochen Huang, Honggang Chen, Weicheng Zhang, Xiaobo Dai, Yongyi Li	In multimedia application scenarios, images captured under low-illumination conditions often lead to lower accuracy in visual perception tasks compared to those taken in well-lit environments. To tackle this challenge, we propose AMIEOD, an image enhancement... In multimedia application scenarios, images captured under low-illumination conditions often lead to lower accuracy in visual perception tasks compared to those taken in well-lit environments. To tackle this challenge, we propose AMIEOD, an image enhancement-enabled object detection framework for low-illumination scenes, where the two tasks are jointly optimized in a detection performance-oriented manner. Specifically, to fully exploit the information in poorly lit images, a Multi-Experts Im...
289	LARGO: Low-Rank Hypernetwork for Handling Missing Modalities 2605.06086	cs.CV	Niels Vyncke, Pooya Ashtari, Aleksandra Pižurica	Addressing missing modalities is an important challenge in multimodal image analysis and often relies on complex architectures that do not transfer easily to different datasets without architectural modifications or hyperparameter tuning. While most existing m... Addressing missing modalities is an important challenge in multimodal image analysis and often relies on complex architectures that do not transfer easily to different datasets without architectural modifications or hyperparameter tuning. While most existing methods tackle this problem in feature space by engineering representations that are robust to missing inputs, we instead operate in weight space. We propose LARGO, a hypernetwork that compresses the $2^N-1$ dedicated missing-modality models...
290	OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention 2605.06088	cs.CV	Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab	Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-voc... Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic p...
293	Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning 2605.06092	cs.CV	Yaozong Zheng, Qihua Liang, Bineng Zhong, Shuimu Zeng, Yuanliang Xue	Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic que... Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \textbf{\tracker}, which introduces a dual-modal conte...
294	VISD: Enhancing Video Reasoning via Structured Self-Distillation 2605.06094	cs.CVcs.AI	Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du	Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) ... Training VideoLLMs for complex reasoning remains challenging due to sparse sequence level rewards and the lack of fine grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token level contributions, leading to inefficient learning. Conversely, existing self distillation methods offer dense supervision but lack structure and diagnostic specificity, and often i...
295	Metonymy in vision models undermines attention-based interpretability 2605.06095	cs.CV	Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Massimiliano Mancini, Diego Marcos	Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using pa... Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using part-centric attention mechanisms on top of a latent image representation provided by a standard, black-box model. This approach is based on a locality assumption: that the latent representation of an object part encodes primarily information...
303	Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking 2605.06112	cs.CVcs.AI	Shiao Wang, Xiao Wang, Duoqing Yang, Wenhao Zhang, Bo Jiang	Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing hig... Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is ...
308	Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning 2605.06121	cs.CV	Xueheng Li, Yu Wang, Tao Hu, Ji Huang, Ke Cao	Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their d... Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain's unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker...
311	Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration 2605.06127	cs.CVcs.AI	Haisen He, Xiangyu Zou, SongLin Dong, Heng Li, Yihong Gong	Real-world image degradation is often unknown, spatially non-uniform, and compositional, requiring all-in-one restoration models to adapt a single set of weights to diverse local corruption patterns without test-time degradation labels. Existing methods typica... Real-world image degradation is often unknown, spatially non-uniform, and compositional, requiring all-in-one restoration models to adapt a single set of weights to diverse local corruption patterns without test-time degradation labels. Existing methods typically modulate a shared backbone with global prompts or degradation descriptors, or route features through predefined expert pools. However, compact global conditioning can bottleneck localized degradation evidence, while static expert routin...
316	Autoregressive Visual Generation Needs a Prologue 2605.06137	cs.CVcs.AIcs.LG	Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu	In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue t... In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation throu...
321	AI-Generated Images: What Humans and Machines See When They Look at the Same Image 2605.06143	cs.CVcs.AI	Silvia Poletti, Justin Ilyes, Marcel Hasenbalg, David Fischinger, Martin Boyer	The misuse of generative AI in online disinformation campaigns highlights the urgent need for transparent and explainable detection systems. In this work, we investigate how detectors for AI-generated images can be more effective in providing human-understanda... The misuse of generative AI in online disinformation campaigns highlights the urgent need for transparent and explainable detection systems. In this work, we investigate how detectors for AI-generated images can be more effective in providing human-understandable explanations for their predictions. To this end, we develop a suite of detectors with various architectures and fine-tuning strategies, trained on our large-scale photorealistic fake image dataset, AIText2Image, and assess their perform...
323	Learning Discrete Autoregressive Priors with Wasserstein Gradient Flow 2605.06148	cs.CVcs.AIcs.LG	Bowen Zheng, Yihong Luo, Tianyang Hu	Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result... Discrete image tokenizers are commonly trained in two stages: first for reconstruction, and then with a prior model fitted to the frozen token sequences. This decoupling leaves the tokenizer unaware of the model that will later generate its tokens. As a result, the learned tokens may preserve image information well but still be difficult for an autoregressive (AR) prior to predict from left to right. We analyze this mismatch using Tripartite Variational Consistency (TVC), which decomposes latent...
329	Beyond Forgetting in Continual Medical Image Segmentation: A Comprehensive Benchmark Study 2605.06160	cs.CV	Bomin Wang, Hangqi Zhou, Yibo Gao, Xiahai Zhuang	Continual learning (CL) is essential for deploying medical image segmentation models in clinical environments where imaging domains, anatomical targets, and diagnostic tasks evolve over time. However, continual segmentation still faces three main challenges. F... Continual learning (CL) is essential for deploying medical image segmentation models in clinical environments where imaging domains, anatomical targets, and diagnostic tasks evolve over time. However, continual segmentation still faces three main challenges. First, the scenarios for this task remain insufficiently standardized for real-world clinical settings. Second, existing research has been primarily focused on mitigating forgetting, overlooking the other essential properties such as plastic...
335	DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models 2605.06170	cs.CV	Juntong Wang, Jiarui Wang, Huiyu Duan, Lewei Li, Guangtao Zhai	Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluati... Existing text-to-image (T2I) benchmarks largely rely on fixed prompt sets, leaving them vulnerable to overfitting and benchmark contamination once publicly released and repeatedly reused. In this work, we propose DynT2I-Eval, a fully automated dynamic evaluation framework for T2I models. It constructs a structured visual semantic space from long-form descriptions, decomposing prompts into controllable dimensions (e.g., subject, logical constraint, environment, and composition). This enables the ...
337	Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation 2605.06173	cs.CVcs.AI	Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, Andreas Maier	Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost... Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-...
340	SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision 2605.06179	cs.CV	Zejian Kang, Xuanyang Xu, Wentao Yang, Kai Zheng, Yuanchen Fei	Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient pred... Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitu...
350	EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields 2605.06192	cs.CVcs.AIcs.RO	Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen	Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generati... Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and f...
352	Bridging visual saliency and large language models for explainable deep learning in medical imaging 2605.06197	cs.CVcs.LG	Paul Valery Nguezet, Elie Tagne Fute, Yusuf Brima, Benoit Martin Azanguezet, Marcellin Atemkeng	The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and cli... The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architec...
360	Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation 2605.06207	cs.CVcs.AIcs.LG	Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu	Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental infor... Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet w...
365	Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance 2605.06214	cs.CV	Huakeng Ding, Yaowen Chen, Kun Zhou, Hongzhi Wu	We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single ca... We present a differentiable framework to adaptively compute 4D illumination conditions with respect to an object, for efficient, high-quality simultaneous acquisition of its shape and reflectance, with a unified spatial-angular structured light and a single camera. Using a simple histogram-based pixel-level probability model for depth and reflectance, we differentiably link the next illumination condition(s) with a loss that encourages the reduction in depth uncertainty. As new structured illumi...
377	Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search 2605.06229	cs.CV	Faisal Aljehrai, Mohammed A. Alkhrashi, Alreem Almuhrij, Sarah Abuhimed, Noorh Aldossary	Video semantic search in densely crowded scenes remains a challenging task due to visual encoders tendency to prioritize salient foreground regions while neglecting contextually important, background areas. We propose an Inverse Attention Embedding mechanism t... Video semantic search in densely crowded scenes remains a challenging task due to visual encoders tendency to prioritize salient foreground regions while neglecting contextually important, background areas. We propose an Inverse Attention Embedding mechanism that explicitly captures and highlights these overlooked regions. By combining inverse attention embeddings with traditional visual embeddings, our method significantly enhances semantic retrieval performance without additional training. Ini...
394	ZScribbleSeg: A comprehensive segmentation framework with modeling of efficient annotation and maximization of scribble supervision 2605.06266	cs.CV	Ke Zhang, Bomin Wang, Hangqi Zhou, Xiahai Zhuang	Curating fully annotated datasets for medical image segmentation is labour-intensive and expertise-demanding. To alleviate this problem, prior studies have explored scribble annotations for weakly supervised segmentation. Existing solutions mainly compute loss... Curating fully annotated datasets for medical image segmentation is labour-intensive and expertise-demanding. To alleviate this problem, prior studies have explored scribble annotations for weakly supervised segmentation. Existing solutions mainly compute losses on annotated areas and generate pseudo labels by propagating annotations to adjacent regions. However, these methods often suffer from inaccurate and unrealistic segmentations due to insufficient supervision and incomplete shape informat...
395	Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction 2605.06270	cs.CV	Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu	Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quad... Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignor...
397	On-Orbit Real-Time Wildfire Detection Under On-Board Constraints 2605.06273	cs.CVcs.AR	Matthias Rötzer, Veronika Pörtge, Martin Ickerott, Jayendra Praveen Kumar Chorapalli, Dimitri Scheftelowitsch	We present a deployed system for on-orbit wildfire detection aboard a nine-satellite commercial thermal infrared constellation, operating under demanding joint constraints: sub-megabyte model footprint, sub-150 ms per-batch TensorRT FP16 inference on an NVIDIA... We present a deployed system for on-orbit wildfire detection aboard a nine-satellite commercial thermal infrared constellation, operating under demanding joint constraints: sub-megabyte model footprint, sub-150 ms per-batch TensorRT FP16 inference on an NVIDIA Jetson Xavier NX, and an end-to-end alert pipeline targeting under 10 minutes from satellite overpass to fire event communication. The system operates on uncalibrated mid-wave infrared (MWIR) single-band imagery at 200 m ground sampling di...
402	Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency 2605.06280	cs.CV	Thong Nguyen, Khoi M. Le, Cong-Duy Nguyen, Luu Anh Tuan, See-Kiong Ng	Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. Thi... Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables pa...
411	Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement 2605.06298	cs.CVcs.AI	Roussel Desmond Nzoyem, Mauro Comi	Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves the... Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural re...
422	NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps 2605.06317	cs.CVcs.AI	Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li	Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on in... Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path plann...
432	TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices 2605.06333	cs.CVcs.AIcs.LGstat.APstat.ML	Shouvik Sardar, Sourish Das	Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for ea... Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without...
435	Earth-o1: A Grid-free Observation-native Atmospheric World Model 2605.06337	cs.CV	Junchao Gong, Kaiyi Xu, Wangxu Wei, Siwei Tu, Jingyi Xu	Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inh... Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inherently limiting the full exploitation of raw sensor data and creating severe computational bottlenecks. Here we present Earth-o1, an observation-native atmospheric world model that overcomes these structural limitations. Rather than relyin...
449	SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation 2605.06356	cs.CV	YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu	High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various w... High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structure...
457	eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts 2605.06368	cs.CVcs.AIcs.LG	Paulo Mario P. Medina, Jose Marie Antonio Miñoza, Sebastian C. Ibañez	Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity f... Despite extensive research into mitigating distribution shifts, many existing algorithms yield inconsistent performance, often failing to outperform baseline Empirical Risk Minimization (ERM) across diverse scenarios. Furthermore, high algorithmic complexity frequently limits interpretability and offers only an indirect means of addressing spurious correlations. We propose eXplaining to Learn (eX2L): an interpretable, explanation-based framework that decorrelates confounding features from a clas...
461	Continuous-Time Distribution Matching for Few-Step Diffusion Distillation 2605.06376	cs.CVcs.AI	Tao Liu, Hao Yan, Mengting Chen, Taihang Hu, Zhengrong Yue	Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the... Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation and mode-seeking nature of the reverse KL diverge...
463	Empirical Evidence for Simply Connected Decision Regions in Image Classifiers 2605.06380	cs.CVcs.LG	Arjhun Swaminathan, Mete Akgün	Understanding the topology of decision regions is central to explaining the inner workings of deep neural networks. Prior empirical work has provided evidence that these regions are path connected. We study a stronger topological question: whether closed loops... Understanding the topology of decision regions is central to explaining the inner workings of deep neural networks. Prior empirical work has provided evidence that these regions are path connected. We study a stronger topological question: whether closed loops inside a decision region can be contracted without leaving that region. To this end, we propose an iterative quad-mesh filling procedure that constructs a finite-resolution label-preserving surface bounded by a given loop and lying entirel...
469	Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models 2605.06388	cs.CVcs.LGcs.RO	Nilaksh, Saurav Jha, Artem Zholus, Sarath Chandar	World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right ... World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic l...
476	HumanNet: Scaling Human-centric Video Learning to One Million Hours 2605.06747	cs.CVcs.RO	Yufan Deng, Daquan Zhou	Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human act... Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activi...
480	FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation 2605.06421	cs.CVcs.LG	Mingfeng Lin, Jiakun Chen, Liang Han, Liqiang Nie	Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlookin... Pixel-space diffusion has re-emerged as a promising alternative to latent-space generation because it avoids the representation bottleneck introduced by VAEs. Yet most existing methods still treat image generation as a frequency-homogeneous process, overlooking the distinct roles and learning dynamics of low- and high-frequency components. To address this, we propose FREPix, a FREquency-heterogeneous flow matching framework for Pixel-space image generation. FREPix explicitly decomposes generatio...
507	GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs 2605.06477	cs.CV	Pranav Mantini, Shishir K. Shah	We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that a... We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermor...
513	3D MRI Image Pretraining via Controllable 2D Slice Navigation Task 2605.06487	cs.CVcs.AI	Yu Wang, Qingchao Chen	Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there e... Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there exists an intrinsic form of self-supervision signal that is different from reconstructing the masked patches, through transforming the 3D volumes into controllable 2D rendered sequences: by rendering slices at continuous positions, orientati...
521	MARBLE: Multi-Aspect Reward Balance for Diffusion RL 2605.06507	cs.CVcs.LG	Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li	Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously.... Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deal with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fa...
523	FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction 2605.06509	cs.CV	Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo	Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a gl... Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly co...
525	DCR: Counterfactual Attractor Guidance for Rare Compositional Generation 2605.06512	cs.CV	Taewon Kang, Matthias Zwicker	Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation pro... Diffusion models generate realistic visual content, yet often fail to produce rare but plausible compositions. When prompted with combinations that are valid but underrepresented in training data, such as a snowy beach or a rainbow at night, the generation process frequently collapses toward more common alternatives. We identify this failure mode as default completion bias, where denoising trajectories are implicitly attracted toward high-frequency semantic configurations. Existing guidance mech...
537	Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance 2605.06535	cs.CVcs.AI	Ziyun Zeng, Yiqi Lin, Guoqiang Liang, Mike Zheng Shou	In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original sce... In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining a...
538	MedHorizon: Towards Long-context Medical Video Understanding in the Wild 2605.06537	cs.CV	Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu	Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant ana... Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving th...
547	R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations 2605.06758	cs.CVcs.AIcs.LGcs.RO	Zhifeng Gu, Yuqi Wang, Bing Wang	Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred... Relative spatial relations provide a compact representation of spatial structure and are fundamental to relative spatial reasoning in 3D layout generation. Recent works leverage Multimodal Large Language Models (MLLMs) to infer such relations, but the inferred relations are often unreliable and are typically handled with post-hoc heuristics. In this paper, we propose R$^3$L, a general framework that improves the reliability and consistency of relative spatial reasoning for 3D layout generation. ...
555	Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation 2605.06572	cs.CVmath.NA	Haidong Wu, Snehal Bhayani, Janne Heikkilä	Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion ne... Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed vi...
565	DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency 2605.06592	cs.CVcs.AIcs.LG	Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu	Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck th... Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. ...
593	DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification 2605.06637	cs.CV	Lei Tan, Yingshi Luan, Pincong Zou, Pingyang Dai, Liujuan Cao	Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe... Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for ...
598	Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study 2605.06643	cs.CVcs.AIcs.LGcs.MM	Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi	Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current ... Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world...
cs.CY 3 papers
56	The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness 2605.05648	cs.CYcs.AIcs.HC	Rose Niousha, Samantha Boatright Smith, Bita Akram, Peter Brusilovsky, Arto Hellas	Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do student... Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framewor...
437	A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring 2605.06340	cs.CYcs.GTcs.LG	Florian A. D. Burnat, Brittany I. Davidson	Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work. Regulated systems can delay ... Continuous post-deployment compliance audits, mandated by emerging regulations such as the EU AI Act and Digital Services Act, create a class of strategic gaming distinct from the one-shot input/output gaming studied in prior work. Regulated systems can delay outcome reporting, drift their reports within plausible noise envelopes, exploit longitudinal sample attrition, and cherry-pick among ambiguous metric definitions. We formalize continuous auditing as a $T$-round Stackelberg game between an ...
486	From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles 2605.06439	cs.CYcs.CVcs.ET	Bilal Khana, Waseem Shariff, Rory Coyne, Muhammad Ali Farooq, Peter Corcoran	As vehicles transition toward higher levels of automation, Driver Monitoring Systems (DMS) have become essential for ensuring human oversight, safety, and regulatory compliance in a vehicle. These systems rely on multimodal sensing and AI-driven inference to a... As vehicles transition toward higher levels of automation, Driver Monitoring Systems (DMS) have become essential for ensuring human oversight, safety, and regulatory compliance in a vehicle. These systems rely on multimodal sensing and AI-driven inference to assess driver attention, cognitive state, and readiness to take control. While technologically promising, their deployment introduces a complex set of ethical and legal challenges - ranging from privacy and consent to data ownership and algo...
cs.DB 1 papers
1	Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation 2605.05525	cs.DBcs.CL	Vicki Stover Hertzberg, Eduardo Valverde, Joyce C. Ho	Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resti... Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resting on two independently motivated components: the FAR structural invariant, which holds that every well-formed query reduces to Filter, Aggregate, and Return operations; and the W5H dimensional framework, which holds that all filtering crit...
cs.DC 3 papers
27	A Scalable Digital Twin Framework for Energy Optimization in Data Centers 2605.05581	cs.DCcs.LG	Raphael Hendrigo de Souza Gonçalves, Wendel Marcos dos Santos	This study proposes a scalable Digital Twin framework for energy optimization in data centers.The framework integrates IoT-based data acquisition, cloud computing, and machine learning techniques to enable real-time monitoring, forecasting, and intelligent ene... This study proposes a scalable Digital Twin framework for energy optimization in data centers.The framework integrates IoT-based data acquisition, cloud computing, and machine learning techniques to enable real-time monitoring, forecasting, and intelligent energy management. A controlled small-scale data center environment was developed to monitor variables such as power consumption, temperature, and computational workload. Long Short-Term Memory (LSTM) models were employed to predict energy dem...
80	Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving 2605.05696	cs.DCcs.AIcs.LG	Bole Ma, Jan Eitzinger, Harald Köstler	Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. ... Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe TTFT spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself. Multi-Head Latent Attention, deployed at scale in DeepSeek-V2/V3/R1, Kimi-K2/Moonlight, ...
270	Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend 2605.06055	cs.DCcs.LG	Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang, Xiaoming Bao	Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restorat... Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design...
cs.DL 1 papers
254	When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge 2605.06033	cs.DLcs.AIcs.CYcs.SI	Andrés F. Castro Torres, Joan Giner-Miguelez, Mercè Crosas	The extent to which Artificial Intelligence (AI) can trigger generalized paradigm shifts in science is unclear. Although some of these technologies have revolutionized data collection and analysis in specific scientific fields such as Chemistry, their overall ... The extent to which Artificial Intelligence (AI) can trigger generalized paradigm shifts in science is unclear. Although some of these technologies have revolutionized data collection and analysis in specific scientific fields such as Chemistry, their overall impact depends on the scope of adoption and the ways scholars use them. In this study, we document substantial differences in the timing and extent of AI adoption across countries and scientific domains from 1960 to 2015. After 2015, we fin...
cs.DS 1 papers
36	Nearly Optimal Attention Coresets 2605.05602	cs.DScs.AI	Edo Liberty, Alexandr Andoni, Eldar Kleiner	We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subse... We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O({\sqrt{d} e^{ρ+o(ρ)}/\varepsilon})$ such that \[ \left\\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\\| \le \varepsilon \] simultaneously for all queries whose norm is bounded by $ρ$. This o...
cs.GR 2 papers
175	3DSS: 3D Surface Splatting for Inverse Rendering 2605.05876	cs.GRcs.CV	Mae Younes, Adnane Boukhayma	We present 3D Surface Splatting (3DSS), the first differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Our central insight is that the surface separation problem at the heart of surface splatting admits a dir... We present 3D Surface Splatting (3DSS), the first differentiable surface splatting renderer for physically-based inverse rendering from multi-view images. Our central insight is that the surface separation problem at the heart of surface splatting admits a direct formulation in terms of the reconstruction kernels themselves. From this foundation we derive a coverage-based compositing model whose per-layer opacity arises directly from the accumulated Elliptical Weighted Average reconstruction wei...
274	Reality Check: How Avatar and Face Representation Affect the Perceptual Evaluation of Synthesized Gestures 2605.06063	cs.GRcs.HC	Haoyang Du, Yinghan Xu, John Dingliana, Brian Keegan, Rachel McDonnell	The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture gene... The capacity to create realistic virtual humans has progressed significantly, and such characters can be found in many applications across entertainment, education and health. As an essential element of interactive virtual humans, speech-driven 3D gesture generation still depends heavily on perceptual evaluation, yet studies often vary avatar appearance and facial presentation when judging the generated motions. Prior work suggests these visual choices can bias motion judgments, but controlled e...
cs.GT 2 papers
462	Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics 2605.06377	cs.GTcs.LGcs.MA	Philip Jordan, Maryam Kamgarpour	We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharin... We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume th...
529	Optimizing Social Utility in Sequential Experiments 2605.06520	cs.GTcs.LGcs.MAstat.ME	Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez	Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lac... Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of `moonshot' products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experime...
cs.HC 3 papers
70	PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI 2605.05682	cs.HCcs.AIcs.CY	Wesley Hanwen Deng, Mingxi Yan, Sunnie S. Y. Kim, Akshita Jha, Lauren Wilcox	Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers' backgrounds and perspectives shape their strategies and the risks th... Recent developments in AI safety research have called for red-teaming methods that effectively surface potential risks posed by generative AI models, with growing emphasis on how red-teamers' backgrounds and perspectives shape their strategies and the risks they uncover. While automated red-teaming approaches promise to complement human red-teaming through larger-scale exploration, existing automated approaches do not account for human identities and rarely incorporate human inputs. In this work...
125	Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild 2605.05767	cs.HCcs.CL	Shengqi Zhu, Jeffrey M. Rzeszotarski, David Mimno	User interactions with LLMs are shaped by prior experiences and individual exploration, but in-lab studies do not provide system designers with visibility into these in-the-wild factors. This work explores a new approach to studying real-world user-LLM interac... User interactions with LLMs are shaped by prior experiences and individual exploration, but in-lab studies do not provide system designers with visibility into these in-the-wild factors. This work explores a new approach to studying real-world user-LLM interactions through large-scale chat logs from the wild. Through analysis of 140K chatbot sessions from 7,955 anonymized global users over time, we demonstrate key patterns in user expressions despite varied tasks: (1) LLM users are not tabula ra...
444	Human-AI Co-Evolution and Epistemic Collapse: A Dynamical Systems Perspective 2605.06347	cs.HCcs.AI	Xuening Wu, Yanlan Kang, Qianya Xu, Kexuan Xie, Jiaqi Mi	Large language models (LLMs) are reshaping how knowledge is produced, with increasing reliance on AI systems for generation, summarization, and reasoning. While prior work has studied cognitive offloading in humans and model collapse in recursive training, the... Large language models (LLMs) are reshaping how knowledge is produced, with increasing reliance on AI systems for generation, summarization, and reasoning. While prior work has studied cognitive offloading in humans and model collapse in recursive training, these effects are typically considered in isolation. We propose a unified perspective: humans and language models form a coupled dynamical system linked by a feedback loop of usage, generation, and retraining. We introduce a minimal model with...
cs.IR 2 papers
163	Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling 2605.05855	cs.IRcs.CL	Yiqing Wu, Haoming Li, Guanyu Jiang, Jiahao Liang, Yongchun Zhu	Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendation... Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails t...
380	OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries 2605.06235	cs.IRcs.AI	Diane Tchuindjo, Devavrat Shah, Omar Khattab	Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an i... Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of queries we call oblique, which seek documents that instantiate a latent pattern, like finding all tweets that express an implicit stance, chat logs that demonstrate a particular failure mode, or transcripts that match an abstract scenario. We study three mechanisms through which obliqueness may arise and introduce OBLIQ-Bench, a suite of five oblique search pr...
cs.LG 193 papers
3	Energy Generative Modeling: A Lyapunov-based Energy Matching Perspective 2605.05530	cs.LG	Yixuan Wang, Wenqian Xue, Warren E. Dixon	Generative models based on static scalar energy functions represent an emerging paradigm in which a single time independent potential drives sample generation through its gradient field, eliminating the need for time conditioning entirely. We unify the trainin... Generative models based on static scalar energy functions represent an emerging paradigm in which a single time independent potential drives sample generation through its gradient field, eliminating the need for time conditioning entirely. We unify the training and sampling phases of this paradigm, conventionally treated as separate procedures, within a single framework: density transport on the Wasserstein space, cast as a nonlinear control problem in which the Kullback Leibler (KL) divergence ...
6	Adversarial Graph Neural Network Benchmarks: Towards Practical and Fair Evaluation 2605.05534	cs.LG	Tran Gia Bao Ngo, Zulfikar Alom, Federico Errica, Murat Kantarcioglu, Cuneyt Gurcan Akcora	Adversarial learning and the robustness of Graph Neural Networks (GNNs) are topics of widespread interest in the machine learning community, as documented by the number of adversarial attacks and defenses designed for these purposes. While a rigorous evaluatio... Adversarial learning and the robustness of Graph Neural Networks (GNNs) are topics of widespread interest in the machine learning community, as documented by the number of adversarial attacks and defenses designed for these purposes. While a rigorous evaluation of these adversarial methods is necessary to understand the robustness of GNNs in real-world applications, we posit that many works in the literature do not share the same experimental settings, leading to ambiguous and potentially contra...
9	Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting 2605.05540	cs.LGphysics.flu-dyn	Tianyue Yang, Xiao Xue	Fast surrogate modeling for high-dimensional physical dynamics requires more than low short-term error: useful models must roll out efficiently while preserving the statistical structure of long trajectories. Neural operators provide inexpensive autoregressive... Fast surrogate modeling for high-dimensional physical dynamics requires more than low short-term error: useful models must roll out efficiently while preserving the statistical structure of long trajectories. Neural operators provide inexpensive autoregressive forecasts but can drift in turbulent regimes, whereas rolling diffusion and latent generative surrogates can represent stochastic transitions at the cost of multi-step denoising, noise-schedule design, or auxiliary compression models. We p...
10	Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning 2605.05544	cs.LGcs.RO	Nandiraju Gireesh, Yuanliang Ju, He Wang	Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal: near contact events the age... Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal: near contact events the agent needs short chunks for reactive control, while during free-space motion long chunks provide better credit assignment. The natural solution is to train critics for several chunk sizes and select the best one at each state, but naive compa...
14	FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings 2605.05553	cs.LG	Quang-Huy Nguyen, Jiaqi Wang, Wei-shinn Ku	Federated learning (FL) operates in heterogeneous environments, where variations in data distributions and asymmetric model design often result in negative transfer. While federated knowledge distillation (FKD) avoids direct model parameter sharing, existing m... Federated learning (FL) operates in heterogeneous environments, where variations in data distributions and asymmetric model design often result in negative transfer. While federated knowledge distillation (FKD) avoids direct model parameter sharing, existing methods typically rely on public datasets or assume that transferred knowledge is uniformly reliable, which limits their robustness in practice. This paper presents FedeKD, a reliability-aware FKD framework that makes sample-wise trust estim...
25	Accelerating LMO-Based Optimization via Implicit Gradient Transport 2605.05577	cs.LGcs.AI	Won-Jun Jang, Si-Hyeon Lee	Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs subs... Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations...
29	AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling 2605.05586	cs.LG	Francisco Giral, Abhijeet Vishwasrao, Andrea Arroyo Ramo, Mahmoud Golestanian, Federica Tonti	Aerodynamic surrogate models are increasingly used to replace repeated high-fidelity CFD evaluations in many-query design settings, but current approaches still face two important limitations: they often scale poorly to the very large fields arising in realist... Aerodynamic surrogate models are increasingly used to replace repeated high-fidelity CFD evaluations in many-query design settings, but current approaches still face two important limitations: they often scale poorly to the very large fields arising in realistic 3D aerodynamics, and they rarely produce latent representations that are directly useful for analysis and design. We introduce AeroJEPA, a Joint-Embedding Predictive Architecture for aerodynamic field modeling that addresses both issues....
32	When Can Voting Help, Hurt, or Change Course? Exact Structure of Binary Test-Time Aggregation 2605.05592	cs.LGcs.IT	Yi Liu	Majority voting is one of the few black-box interventions that can improve a fixed stochastic predictor: repeated access can be cheaper than changing a high-capability model. Classical fixed-competence theory makes this intervention look monotone -- more votes... Majority voting is one of the few black-box interventions that can improve a fixed stochastic predictor: repeated access can be cheaper than changing a high-capability model. Classical fixed-competence theory makes this intervention look monotone -- more votes help above the majority threshold and hurt below it. We show that this picture is fundamentally incomplete. Under the de Finetti representation for exchangeable repeated correctness, voting is governed by a latent distribution of per-examp...
38	Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand 2605.05609	cs.LGecon.EMstat.ML	Jianyu Xu, Yu-Xiang Wang	We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-d... We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservativ...
40	LLMSpace: Carbon Footprint Modeling for Large Language Model Inference on LEO Satellites 2605.05615	cs.LGcs.CY	Lei Jiang, Adrian Ildefonso, Daniel Loveless, Fan Chen	Large language models (LLMs) impose rapidly growing energy demands, creating an emerging energy and carbon crisis driven by large-scale inference. Solar-powered, AI-enabled low Earth orbit (LEO) satellites have been proposed to mitigate terrestrial electricity... Large language models (LLMs) impose rapidly growing energy demands, creating an emerging energy and carbon crisis driven by large-scale inference. Solar-powered, AI-enabled low Earth orbit (LEO) satellites have been proposed to mitigate terrestrial electricity consumption, but their lifecycle carbon footprint remains poorly understood due to launch emissions, satellite manufacturing, and radiation-hardened hardware requirements. This paper presents \textit{LLMSpace}, the first carbon modeling fr...
42	Region-adaptable retrieval of coastal biogeochemical parameters from near-surface hyperspectral remote sensing reflectance using physics-aware meta-learning 2605.05623	cs.LG	Yiqing Guo, Nagur R. C. Cherukuru, Eric A. Lehmann, S. L. Kesav Unnithan, Tim J. Malthus	Hyperspectral in situ sensing has shown promise in retrieving aquatic biogeochemical (BGC) parameters, such as total suspended solids, dissolved organic carbon, and total chlorophyll-a, for cost-effective monitoring of coastal water quality. However, generalis... Hyperspectral in situ sensing has shown promise in retrieving aquatic biogeochemical (BGC) parameters, such as total suspended solids, dissolved organic carbon, and total chlorophyll-a, for cost-effective monitoring of coastal water quality. However, generalising such retrieval algorithms across water bodies remains challenging, as the relationship between remote sensing reflectance (Rrs) and BGC parameters can vary considerably from one region to another due to regional distinctions in environm...
51	Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning 2605.05638	cs.LG	Brett Barkley, Preston Culbertson, David Fridovich-Keil	Models trained with deep learning often fail to signal when inputs fall outside their training data manifold, leading to unreliable predictions under distribution shift. Prior work suggests that effective out-of-distribution (OOD) detection often requires clas... Models trained with deep learning often fail to signal when inputs fall outside their training data manifold, leading to unreliable predictions under distribution shift. Prior work suggests that effective out-of-distribution (OOD) detection often requires class-conditional modeling or specialized models obtained through supervised fine-tuning. We revisit this assumption in modern pretrained models and show that their frozen representations already encode sufficient geometric structure for accura...
57	Information-Preserving Domain Transfer with Unlabeled Data in Misspecified Simulation-Based Inference 2605.05652	cs.LG	Joon Jang, Eunho Jeong, Kyu Sung Choi, Hyeonjin Kim	Simulation-based inference (SBI) provides amortized Bayesian parameter inference from simulator-generated data without requiring explicit likelihood evaluation. Its reliability can degrade under model misspecification, where real-world observations are not wel... Simulation-based inference (SBI) provides amortized Bayesian parameter inference from simulator-generated data without requiring explicit likelihood evaluation. Its reliability can degrade under model misspecification, where real-world observations are not well represented by the simulator used for training. Existing methods using unlabeled real-world data often align simulated and real-world data distributions, but marginal alignment alone does not directly preserve parameter-relevant informati...
61	Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks 2605.05659	cs.LG	Ying Chen, Aoxi Li, Jihun Kim, Javad Lavaei	The massive computational costs of scaling modern deep learning architectures have driven the widespread use of parameter-efficient low-rank structures, such as LoRA and low-rank factorization. However, theoretical guarantees for their expressive power are les... The massive computational costs of scaling modern deep learning architectures have driven the widespread use of parameter-efficient low-rank structures, such as LoRA and low-rank factorization. However, theoretical guarantees for their expressive power are less explored, often relying on restrictive priors like a pretrained base matrix, ReLU activations or non-verifiable singularity conditions. We first investigate the limits of neural networks constrained strictly to low-rank manifolds without ...
62	Distributionally Robust Multi-Objective Optimization 2605.05660	cs.LGmath.OC	Yufeng Yang, Fangning Zhuo, Ziyi Chen, Heng Huang, Yi Zhou	Multi-objective optimization (MOO) has received growing attention in applications that require learning under multiple criteria. However, the existing MOO formulations do not explicitly account for distributional shifts in the data. We introduce distributional... Multi-objective optimization (MOO) has received growing attention in applications that require learning under multiple criteria. However, the existing MOO formulations do not explicitly account for distributional shifts in the data. We introduce distributionally robust multi-objective optimization (DR-MOO), which minimizes multiple objectives under their respective worst-case distributions. We propose Pareto-type solution concepts for DR-MOO and develop multi-gradient descent algorithms (MGDA) w...
72	Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting 2605.05685	cs.LGcs.AIstat.ML	Naveen Mysore	Unlike MLPs, Kolmogorov-Arnold Networks (KANs) expose explicit learnable edge functions on every connection, enabling mechanistic explanation in time-series forecasting. This paper introduces Temporal Functional Circuits, a framework that transforms KAN edge f... Unlike MLPs, Kolmogorov-Arnold Networks (KANs) expose explicit learnable edge functions on every connection, enabling mechanistic explanation in time-series forecasting. This paper introduces Temporal Functional Circuits, a framework that transforms KAN edge functions from latent visualizations into faithful, temporally grounded explanations. Built on a gated residual KAN that decomposes forecasts into a linear base and a sparsely activated KAN correction, the framework (i) maps each edge to inp...
81	Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers 2605.05697	cs.LGcs.AI	Amrit Nidhi	Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a requested attention budget. De... Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a requested attention budget. Dense warm-starting is important for stability: on a robust synthetic sequence task, one budgeted model reaches 99.7% accuracy at 0.303 estimated attention cost and 100.0% accuracy at 0.504 cost. On held-out AG News with a custom word-level t...
91	On the Blessing of Pre-training in Weak-to-Strong Generalization 2605.05710	cs.LG	Wei Yao, Wang Zhaoyang, Gengze Xu, Chen Qian, Dongrui Liu	The paradigm of Weak-to-Strong Generalization (W2SG) suggests that a pre-trained strong model can surpass its weak supervisor, yet the decisive role of pre-training remains theoretically and empirically under-explored. In this work, we identify pre-training as... The paradigm of Weak-to-Strong Generalization (W2SG) suggests that a pre-trained strong model can surpass its weak supervisor, yet the decisive role of pre-training remains theoretically and empirically under-explored. In this work, we identify pre-training as the essential prerequisite for the emergence of W2SG. Theoretically, we formalize the W2SG problem within a high-dimensional single-index model framework using spiked Gaussian data, modeling pre-training as a spectral initialization step. ...
97	Enabling Federated Inference via Unsupervised Consensus Embedding 2605.05718	cs.LG	Yui Hashimoto, Takayuki Nishio, Yuichi Kitagawa, Takahito Tanimura	Cooperative inference across independently deployed machine learning models is increasingly desirable in distributed environments, as there is a growing need to leverage multiple models while keeping their data and model parameters private. However, existing c... Cooperative inference across independently deployed machine learning models is increasingly desirable in distributed environments, as there is a growing need to leverage multiple models while keeping their data and model parameters private. However, existing cooperative frameworks typically rely on sharing input data, model parameters, or a common encoder, which limits their applicability in privacy-sensitive or cross-organizational settings. To address this challenge, we propose Consensus Embed...
102	WARP: A Benchmark for Primal-Dual Warm-Starting of Interior-Point Solvers 2605.05728	cs.LGcs.AIeess.SYmath.OC	Dhruv Suri, Helgi Hilmarsson, Shourya Bose	Solving AC Optimal Power Flow (AC-OPF) is of central importance in electricity market operations, where interior-point methods (IPMs) such as IPOPT are the standard solvers. A growing body of work uses machine learning to predict primal warm-start iterates, re... Solving AC Optimal Power Flow (AC-OPF) is of central importance in electricity market operations, where interior-point methods (IPMs) such as IPOPT are the standard solvers. A growing body of work uses machine learning to predict primal warm-start iterates, reporting iteration reductions of 30-46\%. We show that these reported gains rest on an inappropriate evaluation baseline: prior methods benchmark against the flat start $V_m = 1, V_a = 0$, whereas the solver's actual default - the variable-b...
104	CRAFT: Forgetting-Aware Intervention-Based Adaptation for Continual Learning 2605.05732	cs.LGcs.AI	Md Anwar Hossen, Fatema Siddika, Juan Pablo Munoz, Tanya Roosta, Ali Jannesari	Large language models (LLMs) can acquire new capabilities through fine-tuning, but continual adaptation often leads to catastrophic forgetting. We propose CRAFT, a continual learning framework that avoids updating model weights by instead learning low-rank int... Large language models (LLMs) can acquire new capabilities through fine-tuning, but continual adaptation often leads to catastrophic forgetting. We propose CRAFT, a continual learning framework that avoids updating model weights by instead learning low-rank interventions on hidden representations. CRAFT proceeds in three stages: it first routes each task to a group of similar tasks based on output-distribution divergence; it then fine-tunes the model using a Kullback-Leibler (KL) divergence again...
107	CoMemNet: Contrastive Sampling with Memory Replay Network for Continual Traffic Prediction 2605.05738	cs.LGcs.AI	Mei Wu, Wenchao Weng, Wenxin Su, Wenjie Tang, Wei Zhou	In recent years, the integration of non-topological space modeling with temporal learning methods has emerged as an effective approach for capturing spatio-temporal information in non-Euclidean graphs. However, most existing methods rely on static underlying g... In recent years, the integration of non-topological space modeling with temporal learning methods has emerged as an effective approach for capturing spatio-temporal information in non-Euclidean graphs. However, most existing methods rely on static underlying graph structures, which are inadequate for capturing the continuously expanding and evolving patterns in streaming traffic networks. To address this challenge, we propose a simple yet efficient dual-branch continual learning framework for tr...
108	Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback 2605.05739	cs.LGcs.AIcs.CLq-fin.CP	Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman	Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional... Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detectio...
110	Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models) 2605.05742	cs.LG	Scott Geng, Dutch Hansen, Jerry Li	Weak-to-strong generalization is a phenomenon in post-training whereby a strong student model, when finetuned solely with feedback from a weaker teacher, can not only surpass the teacher, but can improve upon its own capabilities. Recent work of Burns et al. (... Weak-to-strong generalization is a phenomenon in post-training whereby a strong student model, when finetuned solely with feedback from a weaker teacher, can not only surpass the teacher, but can improve upon its own capabilities. Recent work of Burns et al. (2023) demonstrated that this can occur in the setting of frontier language models, and subsequently there has been a flurry of both empirical work trying to exploit this phenomenon, as well as theoretical work attempting to understand it. I...
116	RVPO: Risk-Sensitive Alignment via Variance Regularization 2605.05750	cs.LGcs.CL	Ivan Montero, Tomasz Jurczyk, Bhuwan Dhingra	Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), m... Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, sh...
121	Full-Spectrum Graph Neural Network: Expressive and Scalable 2605.05759	cs.LG	Xiaohan Wang, Deyu Bo, Longlong Li, Kelin Xia	It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1-dimensional Weisfeiler-Lehman test, which is mirrored in their lack of universality for higher-... It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1-dimensional Weisfeiler-Lehman test, which is mirrored in their lack of universality for higher-order signals. To go beyond this bound, we propose the Full-Spectrum GNN (FSpecGNN), a second-order generalization of classical spectral GNNs. FSpecGNN advances spectral filtering in two perspectives: (1) it lifts the signal from the node d...
123	Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion 2605.06720	cs.LGcs.AI	Justin Sanders, Luca Giancardo, Lan Guo, Yue Zhao, Kemal Sonmez	Antibody therapeutics are among the most successful modern medicines, yet computationally designing antibodies with desirable binding and developability properties remains challenging. While protein language models (pLMs) have emerged as powerful tools for ant... Antibody therapeutics are among the most successful modern medicines, yet computationally designing antibodies with desirable binding and developability properties remains challenging. While protein language models (pLMs) have emerged as powerful tools for antibody sequence design, existing approaches largely suffer from two key limitations: they predominantly memorize germline sequences rather than modeling biologically meaningful somatic variation, and they offer limited support for flexible c...
127	Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning 2605.05769	cs.LGcs.AIcs.CL	Myoungjun Kim, Sangwoo Park, Yoseob Han, Jin-Hyun Ahn	Differentially private federated fine-tuning of large models with LoRA suffers from aggregation error caused by LoRA's multiplicative structure, which is further amplified by DP noise and degrades both stability and accuracy. Existing remedies apply a single u... Differentially private federated fine-tuning of large models with LoRA suffers from aggregation error caused by LoRA's multiplicative structure, which is further amplified by DP noise and degrades both stability and accuracy. Existing remedies apply a single update mode uniformly across all layers and all communication rounds (or alternate them on a fixed schedule), ignoring both the structural asymmetry between the two LoRA factors and the round-wise dynamics of training. We propose AS-LoRA, an...
136	A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration 2605.05791	cs.LG	Manuel Haussmann, Mustafa Mert Çelikok, Melih Kandemir	While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to ta... While reinforcement learning (RL) promises to revolutionize the control of complex nonlinear robotic systems, a profound gap persists between the heuristic success of model-free off-policy deep RL and the underlying theory, which remains largely confined to tabular or linearizable settings. We identify the cause of this gap as an emergent isolation of three traditions: (i) measure-theoretic MDP foundations on general spaces limit their analysis to exact dynamic programming and ignore all error s...
137	Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio 2605.05794	cs.LGcs.AI	Ziqing Wen, Zhouyang Liu, Jiahuan Wang, Ping Luo, Li Shen	The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) pro... The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) provide per-parameter adaptivity, they do not explicitly account for module-level gradient heterogeneity, resulting in slower convergence, suboptimal performance, or training instability. Existing approaches typically rely on manually tuned mo...
138	Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs 2605.05795	cs.LG	Nicholas Potteiger, Ankita Samaddar, Taylor T. Johnson, Xenofon Koutsoukos	Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefi... Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking, however none of them fully address reactivity to subtask failure and modularity to varying objects for compositional tasks....
139	Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL 2605.05802	cs.LG	Zhiyuan Zhai, Xin Wang	Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environments each rollout is a long multi-turn dialogue with one L... Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environments each rollout is a long multi-turn dialogue with one LLM call per step, so this multi-sample multiplier dominates the total training cost. When every rollout of a prompt ends with the same reward, the group has zero reward variance and contributes no gradient, so the extra rollouts add no info...
141	Retrieval from Within: An Intrinsic Capability of Attention-Based Models 2605.05806	cs.LG	Elad Hoffer, Yochai Blau, Edan Kinderman, Ron Banner, Daniel Soudry	Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval v... Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval via Attention), a framework where decoder attention queries score pre-encoded evidence chunks that are then directly reused as context for generation. By construction, INTRA unifies retrieval and generation, eliminating the retriever-generat...
147	A Testable Certificate for Constant Collapse in Teacher-Guided VAEs 2605.05813	cs.LGcs.AI	Zegu Zhang, Jianhua Peng, Jian Zhang	Posterior collapse in variational autoencoders is often diagnosed by its symptoms: a small KL term, a strong decoder, or weak use of the latent code. These signals are useful, but they do not define a collapse boundary. We study a concrete failure mode, input-... Posterior collapse in variational autoencoders is often diagnosed by its symptoms: a small KL term, a strong decoder, or weak use of the latent code. These signals are useful, but they do not define a collapse boundary. We study a concrete failure mode, input-independent constant collapse, and show that this case admits an exact threshold. For any fixed nonconstant teacher distribution $T(\cdot\mid x)$, the best constant student is the dataset-average teacher distribution, and its alignment co...
149	HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices 2605.05819	cs.LG	Shen Xu, Xiangwen Zhuge, Zhe Xu, Yingkun Hu, Zheng Yang	LLMs often struggle with memory-constrained deployment on consumer-grade hardware due to their massive parameter sizes. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accurac... LLMs often struggle with memory-constrained deployment on consumer-grade hardware due to their massive parameter sizes. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accuracy degradation or severe throughput bottlenecks. Recent error compensation methods recover accuracy through auxiliary LoRA-style branches, and we observe that these branches are inherently amenable to offloading: they require substantial par...
156	MDN: Parallelizing Stepwise Momentum for Delta Linear Attention 2605.05838	cs.LGcs.NE	Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu	Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stoc... Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence in optimization. While momentum-based optimizers provide a natural remedy, they pose challenges in simultaneously achieving ...
161	Hypothesis generation and updating in large language models 2605.05851	cs.LG	Hua-Dong Xiong	Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perf... Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perform this form of inference, and how close it is to optimal, remains unclear. We study this question in the number game, a controlled setting in which a learner infers the hypothesis supported by a few positive integers, such as $\{16, 8, 2,...
164	Measuring Learning Progress via Gradient-Momentum Coupling 2605.05856	cs.LG	Samuel Blad, Martin Längkvist, Amy Loutfi	Measuring learning progress is essential for curiosity-driven exploration in reinforcement learning, but widely used signals such as prediction error often fail to distinguish meaningful, learnable patterns from random noise. This paper proposes Gradient-Momen... Measuring learning progress is essential for curiosity-driven exploration in reinforcement learning, but widely used signals such as prediction error often fail to distinguish meaningful, learnable patterns from random noise. This paper proposes Gradient-Momentum Coupling (GMC), a signal derived from optimization dynamics that quantifies how useful each sample's gradient is for ongoing learning by measuring its per-parameter normalized absolute product with the momentum from previous gradients. ...
165	Offline Reinforcement Learning for Rotation Profile Control in Tokamaks 2605.05857	cs.LG	Rohit Sonker, Hiro Josep Farre Kaga, Jiayu Chen, Andrew Rothstein, Ian Char	Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influences stabili... Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influences stability, confinement, and transport. While the average rotation can be controlled, controlling the full profile is challenging due to high dimensionality, response to multiple actuators and dependence on plasma condition. Learning-based control ...
167	Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning 2605.05862	cs.LG	Yanming Xia, Angelica I. Aviles-Rivero	Neural operators perform well on structured domains, yet their behaviour on irregular geometries remains poorly understood. We show that this limitation is not merely an encoding issue, but a depth-wise failure mode inherent to deep operator architectures. W... Neural operators perform well on structured domains, yet their behaviour on irregular geometries remains poorly understood. We show that this limitation is not merely an encoding issue, but a depth-wise failure mode inherent to deep operator architectures. We formalise the Geometric Forgetting Hypothesis: due to the Markovian structure of operator layers and their reliance on global mixing mechanisms, neural operators progressively lose access to domain geometry as depth increases. Using layer...
168	SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data 2605.05863	cs.LGcs.AI	Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov	Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly mor... Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligne...
172	QuadraSHAP: Stable and Scalable Shapley Values for Product Games via Gauss-Legendre Quadrature 2605.05870	cs.LG	Majid Mohammadi, Grigory Reznikov, Pavel Sinitcyn, Krikamol Muandet, Siu Lun Chau	We study the efficient computation of Shapley values for \emph{product games} -- cooperative games in which the coalition value factorizes as a product of per-player terms. Such games arise in machine learning explainability whenever the value function inherit... We study the efficient computation of Shapley values for \emph{product games} -- cooperative games in which the coalition value factorizes as a product of per-player terms. Such games arise in machine learning explainability whenever the value function inherits a multiplicative structure from the underlying model, as in kernel methods with product kernels and tree-based models. Our key result is that the Shapley value of each player in a product game admits an exact one-dimensional integral repr...
173	Retain-Neutral Surrogates for Min-Max Unlearning 2605.05871	cs.LG	Junhao Cai, Dohun Kim, Dowon Kim, Sung Il Choi, Chengjun Jin	Machine unlearning seeks to remove the influence of designated training data while preserving performance on the remaining data. Approximate unlearning can be viewed as a local editing problem; in min-max unlearning, the key local object is the surrogate point... Machine unlearning seeks to remove the influence of designated training data while preserving performance on the remaining data. Approximate unlearning can be viewed as a local editing problem; in min-max unlearning, the key local object is the surrogate point at which the retain objective is evaluated. When forget and retain gradients are strongly aligned, an unconstrained forget-maximizing perturbation can move to a surrogate point that increases retain loss. We propose Retain-Orthogonal Surro...
180	RepFlow: Representation Enhanced Flow Matching for Causal Effect Estimation 2605.05890	cs.LGstat.ME	Yifei Xie, Jian Huang	Estimating causal effects from observational data has become increasingly critical in diverse fields including healthcare, economics, and social policy. The fundamental challenge in causal inference arises from the missing counterfactuals and the selection bia... Estimating causal effects from observational data has become increasingly critical in diverse fields including healthcare, economics, and social policy. The fundamental challenge in causal inference arises from the missing counterfactuals and the selection bias. Existing methods are largely limited to point estimates and lack the capacity for distribution modeling. In this work, we propose RepFlow, a novel framework that formulates causal effect estimation as a joint optimization problem integra...
185	VARS-FL: Validation-Aligned Client Selection for Non-IID Federated Learning in IoT Systems 2605.05896	cs.LGcs.AI	Mohamed Lakas, Mohamed Amine Ferrag	Federated learning (FL) systems typically employ stateless client selection, treating each communication round independently and ignoring accumulated evidence of client contribution quality. Under non-IID data, this leads to slow convergence and unstable train... Federated learning (FL) systems typically employ stateless client selection, treating each communication round independently and ignoring accumulated evidence of client contribution quality. Under non-IID data, this leads to slow convergence and unstable training, particularly when selection relies on local proxies (e.g., training loss) that are misaligned with the global optimization objective. These challenges are especially pronounced in Internet of Things (IoT) and Industrial IoT (IIoT) envi...
186	VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading 2605.05899	cs.LG	Cheng Xu, Xiaofeng Hou, Jiacheng Liu, Chao Li	Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and... Large-scale vision-language mixture-of-experts (VL-MoE) models provide strong multimodal capability, but efficient deployment on memory-constrained platforms remains difficult. Existing MoE offloading systems are largely designed for text-centric workloads and become much less effective for visual-heavy inputs, where large numbers of visual tokens induce broader and less predictable expert accesses. We present VisMMoE, a VL-MoE offloading system built on a single systems insight: pruning redunda...
188	Quadratic Objective Perturbation: Curvature-Based Differential Privacy 2605.05905	cs.LGmath.OC	Daniel Cortild, Coralia Cartis	Objective perturbation is a standard mechanism in differentially private empirical risk minimization. In particular, Linear Objective Perturbation (LOP) enforces privacy by adding a random linear term, while strong convexity and stability are ensured by an add... Objective perturbation is a standard mechanism in differentially private empirical risk minimization. In particular, Linear Objective Perturbation (LOP) enforces privacy by adding a random linear term, while strong convexity and stability are ensured by an additional deterministic quadratic term. However, this approach requires the strong assumption of bounded gradients of the loss function, which excludes many modern machine learning models. In this work, we introduce Quadratic Objective Pertur...
193	From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation 2605.05912	cs.LGcs.CV	Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Ira Assent	High-resolution rainfall observations are crucial for weather forecasting, water management, and hazard mitigation. Traditional operational measurements are often biased and low-resolution, limiting their ability to capture local rainfall. Accurate high-resolu... High-resolution rainfall observations are crucial for weather forecasting, water management, and hazard mitigation. Traditional operational measurements are often biased and low-resolution, limiting their ability to capture local rainfall. Accurate high-resolution rainfall maps require integrating sparse surface observations, yet existing deep learning densification methods are hindered by rainfall's skewed, localized nature, noise, and limited spatio-temporal fusion. We present DropsToGrid, a N...
205	Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing 2605.05940	cs.LGcs.CL	Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang	Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To... Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous u...
214	Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs 2605.05957	cs.LG	Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang	LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode \emph{correction suppression} and construct a benchmark of 300 fal... LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode \emph{correction suppression} and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19\% to 90\%, with four models exceeding 80\%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveal...
219	Uncertainty Estimation via Hyperspherical Confidence Mapping 2605.05964	cs.LG	Eunseo Choi, Ho-Yeon Kim, Jaewon Lee, Taeyong jo, Myungjun lee	Quantifying uncertainty in neural network predictions is essential for high-stakes domains such as autonomous driving, healthcare, and manufacturing. While existing approaches often depend on costly sampling or restrictive distributional assumptions, we propos... Quantifying uncertainty in neural network predictions is essential for high-stakes domains such as autonomous driving, healthcare, and manufacturing. While existing approaches often depend on costly sampling or restrictive distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for sampling-free and distribution-free uncertainty estimation. HCM decomposes outputs into a magnitude and a normalized direction vector constrained to lie on the...
220	Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR 2605.05965	cs.LGcs.AI	Chaoli Mou, Zhan Zhuang, Xinning Chen, Yu Zhang	Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a ``unifor... Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a ``uniform credit assignment'' assumption that indiscriminately broadcast trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Tra...
221	Sharper Guarantees for Misspecified Kernelized Bandit Optimization 2605.05967	cs.LGmath.OCstat.ML	Davide Maran, Csaba Szepesvári	Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the ... Existing guarantees for misspecified kernelized bandit optimization pay for misspecification through kernel complexity: in generic offline bounds, the misspecification level $\varepsilon$ is multiplied by $\sqrt{d_\mathrm{eff}}$, where $d_\mathrm{eff}$ is the kernel effective dimension, while in online regret bounds, the corresponding penalty is $\sqrt{γ_n}\,n\varepsilon$, where $γ_n$ is the maximum information gain after $n$ rounds of interaction. In this work, we show that, for a large class...
222	Training Transformers for KV Cache Compressibility 2605.05971	cs.LG	Yoav Gelberg, Yam Eitan, Michael Bronstein, Yarin Gal, Haggai Maron	Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summ... Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can ...
225	Physical Fidelity Reconstruction via Improved Consistency-Distilled Flow Matching for Dynamical Systems 2605.05975	cs.LGphysics.flu-dyn	Sicheng Ma, Tianyue Yang, Xiuzhe Wu, Xiao Xue	Reconstructing high-fidelity flow fields from low-fidelity observations is a central problem in scientific machine learning, yet recent diffusion and flow-matching models typically rely on iterative sampling, making them costly for latency-sensitive workflows ... Reconstructing high-fidelity flow fields from low-fidelity observations is a central problem in scientific machine learning, yet recent diffusion and flow-matching models typically rely on iterative sampling, making them costly for latency-sensitive workflows such as ensemble forecasting, real-time visualization, and simulation-in-the-loop inference. We study whether a high-fidelity flow-matching generative model can be compressed into a compact one-step model for fast scientific flow reconstruc...
230	Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions 2605.05983	cs.LG	Yuntai Bao, Qinfeng Li, Xinyan Yu, Xuhong Zhang, Ge Su	Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs ... Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which ca...
234	DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression 2605.05994	cs.LG	Nobutaka Ono	In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and em... In this paper, we propose DiBA (Diagonal and Binary Matrix Approximation), a compact matrix factorization for neural network weight compression. Many components of modern networks, including linear layers, $1\times1$ convolutions, attention projections, and embedding layers, have dense matrix weights. DiBA approximates $A\in\mathbb{R}^{m\times n}$ by $\widehat A=D_1B_1D_2B_2D_3$, where $D_1,D_2,D_3$ are diagonal matrices and $B_1,B_2$ are $0/1$ binary matrices. The intermediate dimension $k$ con...
238	A Fine-Grained Understanding of Uniform Convergence for Halfspaces 2605.06004	cs.LGcs.AImath.ST	Aryeh Kontorovich, Kasper Green Larsen	We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can in... We study the fine-grained uniform convergence behavior of halfspaces beyond worst-case VC bounds. For inhomogeneous halfspaces in $\mathbb{R}^d$ with $d\ge 2$, we show that standard first-order VC bounds are essentially tight: even consistent hypotheses can incur population error $Θ(d\ln(n/d)/n)$, and in the agnostic setting the deviation scales as $\sqrt{τ\ln(1/τ)}$ at true error $τ$. In contrast, homogeneous halfspaces in $\mathbb{R}^2$ exhibit a markedly different behavior. In the realizable ...
239	Enabling Unsupervised Training of Deep EEG Denoisers With Intelligent Partitioning 2605.06724	cs.LGcs.AIeess.SP	Qiyu Rao, Haozhe Tian, Homayoun Hamedmoghadam, Danilo Mandic	Denoising wearable electroencephalogram (EEG) is inherently challenging since neural activity is not only subtle but also inseparable from spectrally overlapping noise artifacts. Classical signal processing methods, relying on fixed or heuristic rules, cannot ... Denoising wearable electroencephalogram (EEG) is inherently challenging since neural activity is not only subtle but also inseparable from spectrally overlapping noise artifacts. Classical signal processing methods, relying on fixed or heuristic rules, cannot handle the time-varying pervasive artifacts in wearable EEGs. Deep learning methods, on the other hand, show promise in decomposition-free EEG denoising using highly expressive neural networks, but the training requires artifact-free EEG, w...
245	Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven 2605.06014	cs.LGcs.AIcs.DScs.NI	Ran Ben-Basat, William Kuszmaul, Michael Mitzenmacher, Amit Portnoy, Shay Vargaftik	Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector database... Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate i...
246	Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards 2605.06017	cs.LGmath.PR	Pei-Sen Li	Sequence-level evaluations in autoregressive Large Language Models (LLMs) rely on highly dependent token generation. Establishing tight concentration bounds for these processes remains a challenge due to two fundamental bottlenecks in existing frameworks: (i) ... Sequence-level evaluations in autoregressive Large Language Models (LLMs) rely on highly dependent token generation. Establishing tight concentration bounds for these processes remains a challenge due to two fundamental bottlenecks in existing frameworks: (i) classical inequalities typically separate dependency structures from target sensitivities, leading to a scalar collapse that inflates the variance proxy to a suboptimal $\mathcal{O}(N)$ for sparse terminal rewards; (ii) conversely, while ce...
249	Multi-agent decision making: A Blackwell's informativeness approach 2605.06028	cs.LG	Zheng Zhang, Cuong C. Nguyen, Kevin Wells, Gustavo Carneiro	The rapid development of large language models (LLMs) has motivated research on decision-making in multi-agent systems, where multiple agents collaborate to achieve shared objectives. Existing aggregation approaches, such as voting and debate, are largely ad-h... The rapid development of large language models (LLMs) has motivated research on decision-making in multi-agent systems, where multiple agents collaborate to achieve shared objectives. Existing aggregation approaches, such as voting and debate, are largely ad-hoc and lack formal guarantees regarding the informativeness of the resulting decisions. In this paper, we provide a principled approach to analyse decisions made in the multi-LLM setting using Blackwell's informativeness framework. Within t...
251	Transformer-Based Wildlife Species Classification from Daily Movement Trajectories 2605.06726	cs.LG	Obed Irakoze, Prasenjit Mitra	Inferring the identity of wildlife species from daily movement data alone is a challenging task. We train sequence models on large-scale, 7-species GPS trajectories from the Movebank platform. Trajectories models are evaluated using a protocol in which entire ... Inferring the identity of wildlife species from daily movement data alone is a challenging task. We train sequence models on large-scale, 7-species GPS trajectories from the Movebank platform. Trajectories models are evaluated using a protocol in which entire telemetry studies or regions are heldout during testing. We compare Transformer-based sequence models to LSTM, CNN, and Temporal Convolutional Networks, and find that Transformers consistently achieve higher balanced accuracy with gains of ...
253	Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters 2605.06032	cs.LGcs.AI	Hugo Cazaux, Eyjólfur Ingi Ásgeirsson, Hlynur Stefánsson	Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large-scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation... Synthetic data has transformed language model training, yet its role in time series forecasting remains poorly understood. We present a large-scale empirical study: nine experiment groups, 4,218 runs systematically evaluating synthetic time series augmentation across five architectures, four synthetic signals and seven datasets. The effect is sharply architecture-conditional: channel-mixing models (TimesNet, iTransformer) benefit in the majority of trials, while channel-independent models (DLine...
256	Optimal Transport for LLM Reward Modeling from Noisy Preference 2605.06036	cs.LGcs.AI	Licheng Pan, Haochen Yang, Haoxuan Li, Yunsheng Lu, Yongqi Tong	Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often r... Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepan...
257	Medical Imaging Classification with Cold-Atom Reservoir Computing using Auto-Encoders and Surrogate-Driven Training 2605.06727	cs.LGcs.ETeess.IV	Nuno Batista, Ana Morgado, Oscar Ferraz, Sagar Silva Pratapsi, Jorge Lobo	We introduce a hybrid quantum-classical pipeline, based on neutral-atom reservoir computing, for medical image classification, focusing on the binary classification task of polyp detection. To deal effectively with the high dimensionality, we integrate a guide... We introduce a hybrid quantum-classical pipeline, based on neutral-atom reservoir computing, for medical image classification, focusing on the binary classification task of polyp detection. To deal effectively with the high dimensionality, we integrate a guided auto-encoder. This pipeline learns compact and discriminative representations of image data that are also well-suited for quantum reservoir computing. A key challenge in such systems is the non-differentiable nature of quantum measurement...
261	Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference 2605.06046	cs.LG	Saksham Rathi, Preeti, Mythili Vutukuru	Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests... Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches -- where all requests share a common prefix -- can achieve hi...
262	TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models 2605.06047	cs.LGcs.AI	Duong Nguyen, Mohammed Jawhar, Nicolas Chesneau	Tabular foundation models (TFMs), such as TabPFN-2.6, TabICLv2, ConTextTab, Mitra, LimiX, and TabDPT, achieve strong zero-shot performance through in-context learning, but their inductive biases remain fixed at inference time. Adapting a pretrained TFM to a sp... Tabular foundation models (TFMs), such as TabPFN-2.6, TabICLv2, ConTextTab, Mitra, LimiX, and TabDPT, achieve strong zero-shot performance through in-context learning, but their inductive biases remain fixed at inference time. Adapting a pretrained TFM to a specific dataset or task typically requires either full fine-tuning, which is computationally expensive, or parameter-efficient tuning methods (PEFT) such as LoRA, which must be tailored to the internal architecture of each TFM. Furthermore, ...
264	When Brain Networks Travel: Learning Beyond Site 2605.06050	cs.LG	Yingxu Wang, Kunyu Zhang, Yanwu Yang, Thomas Wolfers, Yujie Wu	Graph-based learning on functional magnetic resonance imaging (fMRI) has shown strong potential for brain network analysis. However, existing methods degrade under cross-site out-of-distribution (OOD) settings because site-conditioned confounders induce non-pa... Graph-based learning on functional magnetic resonance imaging (fMRI) has shown strong potential for brain network analysis. However, existing methods degrade under cross-site out-of-distribution (OOD) settings because site-conditioned confounders induce non-pathological shortcuts, while functional connectivity constructed by temporal averaging obscures transient neurodynamics, limiting generalization to unseen sites. In this paper, we propose Cross-site OOD Robust brain nEtwork (CORE), a unified...
266	The E$Δ$-MHC-Geo Transformer: Adaptive Geodesic Operations with Guaranteed Orthogonality 2605.06729	cs.LGcs.AI	Arash Shahmansoori	We present the E$Δ$-MHC-Geo Transformer, a novel architecture that unifies Manifold-Constrained Hyper-Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input-adaptive, unconditionally orthogonal residual connections. Unlike DDL, ... We present the E$Δ$-MHC-Geo Transformer, a novel architecture that unifies Manifold-Constrained Hyper-Connections (mHC), Deep Delta Learning (DDL), and the Cayley transform to obtain input-adaptive, unconditionally orthogonal residual connections. Unlike DDL, whose Householder operator is orthogonal only at $β\in \{0,2\}$, our Data-Dependent Cayley rotation $Q(x)=(I+(β/2)A(x))^{-1}(I-(β/2)A(x))$ preserves orthogonality for all $β$ and all inputs. To handle negation, an eigenvalue $-1$ case that ...
267	Semantic State Abstraction Interfaces for LLM-Augmented Portfolio Decisions: Multi-Axis News Decomposition and RL Diagnostics 2605.06730	cs.LG	Likhita Yerra, Remi Uttejitha Allam	We introduce Semantic State Abstraction Interfaces (SSAI): a methodological template for mapping sparse unstructured text into $K$ auditable, named coordinates with neutral defaults on no-news days, designed to separate representation hypotheses from optimisat... We introduce Semantic State Abstraction Interfaces (SSAI): a methodological template for mapping sparse unstructured text into $K$ auditable, named coordinates with neutral defaults on no-news days, designed to separate representation hypotheses from optimisation variance in sequential decision systems. Our contribution is the framework and its evaluation protocol, not a claim that SSAI outperforms denser alternatives. We instantiate SSAI with $K=4$ axes (sentiment, risk, confidence, volatilit...
268	Towards Generation-Efficient Uncertainty Estimation in Large Language Models 2605.06053	cs.LG	Mingcheng Zhu, Yu Liu, Tingting Zhu	Uncertainty estimation is important for deploying LLMs in high-stakes applications such as healthcare and finance, where hallucinations can appear fluent and plausible while being factually incorrect, making it difficult for users to judge whether an output sh... Uncertainty estimation is important for deploying LLMs in high-stakes applications such as healthcare and finance, where hallucinations can appear fluent and plausible while being factually incorrect, making it difficult for users to judge whether an output should be trusted. Existing methods require one or more full autoregressive generations to estimate uncertainty, which introduces substantial inference cost and often delays uncertainty assessment. In this paper, we investigate whether effect...
271	Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions 2605.06058	cs.LGcs.CV	Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya	Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant ... Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process throu...
273	Geometry-Aware Simplicial Message Passing 2605.06061	cs.LGcs.CGmath.AT	Elena Xinyi Wang, Bastian Rieck	The Weisfeiler--Lehman (WL) test and its simplicial extension (SWL) characterize the combinatorial expressivity of message passing networks, but they are blind to geometry, i.e., meshes with identical connectivity but different embeddings are indistinguishable... The Weisfeiler--Lehman (WL) test and its simplicial extension (SWL) characterize the combinatorial expressivity of message passing networks, but they are blind to geometry, i.e., meshes with identical connectivity but different embeddings are indistinguishable. We introduce the Geometric Simplicial Weisfeiler--Lehman (GSWL) test, which incorporates vertex coordinates into color refinement for geometric simplicial complexes. In addition, we show that (i) the expressivity of geometry-aware simplic...
276	Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark 2605.06066	cs.LGcs.AI	Cristiano da Costa Cunha, Ajmal Mian, Tim French, Wei Liu	Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: Th... Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium benchmark built on Magic: The Gathering with a 3,077-dimensional partial observation, a 478-action masked discrete action space, five competitive Standard archetypes, three reward schemes, and a hand-specified Structural Causal Model (SCM) over strategic variables. Ev...
277	Normalized Architectures are Natively 4-Bit 2605.06067	cs.LGcs.AI	Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry, Boris Ginsburg	Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the... Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions-such as applying random Hadamard transforms and performing per-tensor scaling calculations-to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense mo...
280	PRISM: Iterative Cross-Modal Posterior Refinement for Dynamic Text-Attributed Graphs 2605.06073	cs.LG	Trimble Chang, Yihang Liu, Mingjing Han, Han Zhang	Dynamic text-attributed graphs (DyTAGs) provide a powerful framework for modeling evolving systems in which node semantics and time-dependent interactions are tightly coupled. Recently, multimodal learning has emerged as a promising yet underexplored direction... Dynamic text-attributed graphs (DyTAGs) provide a powerful framework for modeling evolving systems in which node semantics and time-dependent interactions are tightly coupled. Recently, multimodal learning has emerged as a promising yet underexplored direction for enhancing DyTAG representation learning. However, existing methods typically rely on rigid modality partitions and one-shot fusion strategies, which limit their ability to capture the intrinsic and evolving dependencies between node se...
282	Understanding diffusion models requires rethinking (again) generalization 2605.06077	cs.LG	Pierre Marion, Yu-Han Wu	This position paper argues that understanding generalization in diffusion models requires fundamentally new theoretical frameworks that go beyond both classical statistical learning theory and the benign overfitting paradigm developed for supervised learning. ... This position paper argues that understanding generalization in diffusion models requires fundamentally new theoretical frameworks that go beyond both classical statistical learning theory and the benign overfitting paradigm developed for supervised learning. In diffusion models, unlike in supervised learning, memorization of training data and generalization to novel samples are incompatible: a model that has fully memorized its training set generates copies rather than novel data. Several theor...
285	Fast Gauss-Newton for Multiclass Cross-Entropy 2605.06081	cs.LG	Mikalai Korbit, Mario Zanon	In multiclass softmax cross-entropy, the full generalized Gauss-Newton (GGN) curvature couples all output logits through the softmax covariance, making curvature-vector products harder to scale as the number of classes grows. We show that the standard multicla... In multiclass softmax cross-entropy, the full generalized Gauss-Newton (GGN) curvature couples all output logits through the softmax covariance, making curvature-vector products harder to scale as the number of classes grows. We show that the standard multiclass GGN can be decomposed exactly into a true-vs-rest term and a positive semidefinite within-competitor covariance term. Fast Gauss-Newton (FGN) retains the first term and drops the second, yielding a positive semidefinite under-approximati...
298	Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer 2605.06104	cs.LGcs.AI	Yongyi Wang, Hanyu Liu, Lingfeng Li, Bozhou Chen, Ang Li	Decision Transformer (DT) formulates offline reinforcement learning as autoregressive sequence modeling, achieving promising results by predicting actions from a sequence of Return-to-Go (RTG), state, and action tokens. However, RTG is a scalar that summarizes... Decision Transformer (DT) formulates offline reinforcement learning as autoregressive sequence modeling, achieving promising results by predicting actions from a sequence of Return-to-Go (RTG), state, and action tokens. However, RTG is a scalar that summarizes future rewards, containing far less information than typical state or action vectors, yet it consumes the same computational budget per token. Worse, the self-attention cost of Transformers grows quadratically with sequence length, so incl...
307	BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification 2605.06117	cs.LG	Yi-Siang Wang, Kuan-Yu Chen, Yu-Chen Den, Darby Tien-Hao Chang	Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work,... Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work, we revisit the boosting paradigm, traditionally associated with tree ensembles, and ask whether it can be applied as a general training principle for LLM fine-tuning. We propose BoostLLM, a framework that transforms parameter-efficient fin...
317	Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex 2605.06139	cs.LGcs.AI	Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang	Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of re... Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex a...
318	SymDrift: One-Shot Generative Modeling under Symmetries 2605.06140	cs.LGcs.AI	Samir Darouich, Vinh Tong, Lluís Pastor-Pérez, Tanja Bien, Loay Mualem	Generative modeling of physical systems, such as molecules, requires learning distributions that are invariant under global symmetries, such as rotations in three-dimensional space. Equivariant diffusion and flow matching models can incorporate such invariance... Generative modeling of physical systems, such as molecules, requires learning distributions that are invariant under global symmetries, such as rotations in three-dimensional space. Equivariant diffusion and flow matching models can incorporate such invariances effectively, even when trained on a non-invariant empirical distribution, but they typically rely on costly multi-step sampling. Recently, drifting models have emerged as an efficient alternative, enabling single-step generation and achie...
319	Matrix-Valued Optimism is Matrix-Valued Augmentation: Additive Hybrid Designs for Constrained Optimization 2605.06141	cs.LG	Jiayi Zhao	Augmented Lagrangian and optimistic primal--dual methods stabilize equality-constrained optimization through seemingly different mechanisms: the former adds constraint-dependent primal curvature, while the latter adds dual memory. Recent work has shown that th... Augmented Lagrangian and optimistic primal--dual methods stabilize equality-constrained optimization through seemingly different mechanisms: the former adds constraint-dependent primal curvature, while the latter adds dual memory. Recent work has shown that these mechanisms are equivalent for scalar parameters. We extend this equivalence to matrix-valued correction. We prove an additivity principle: for symmetric matrix parameters, the ideal primal trajectory depends only on the summed correctio...
322	Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization 2605.06145	cs.LGcs.AIeess.SY	Alireza Modirshanechi, Benjamin Eysenbach, Peter Dayan, Eric Schulz	Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discov... Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information skill learning (MISL), discovers behaviorally diverse skills that can later be used for downstream goal-reaching. However, it remains a theoretical mystery why skills learned through MISL should support goal-reaching. A subtle challenge is that both GCRL and MISL are u...
324	AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning 2605.06149	cs.LGcs.AI	Yaomin Wang, Jianting Pan, Ran Tian, Xiaoyang Li, Yu Zhang	The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naiv... The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor--critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor--critic method for state-dependent discounting that learns a state-dependent discount function tog...
325	Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes 2605.06152	cs.LGcs.CLmath.OCstat.ML	Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou	Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. ... Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-erro...
328	Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning 2605.06156	cs.LGcs.AI	Abdelghani Ghanem, Mounir Ghogho	Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the cont... Integrating expressive generative policies, such as flow-matching models, into offline reinforcement learning (RL) allows agents to capture complex, multi-modal behaviors. While Q-learning with Adjoint Matching (QAM) stabilizes policy optimization via the continuous adjoint method, it remains inherently bound to the fixed behavior distribution. This dependence induces a \textit{popularity bias} that can suppress high-reward actions in low-density regions, and creates a \textit{support binding} t...
331	On Training in Imagination 2605.06732	cs.LG	Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel	State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during po... State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, a...
333	One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning 2605.06166	cs.LG	Xinrui Chen, Liu Yang, Ou Wu	In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-t... In Large Language Model (LLM) fine-tuning, parameter and data selection are common strategies for reducing fine-tuning cost, yet they are typically driven by separate scoring mechanisms. When a parameter mask and data subset jointly determine restricted fine-tuning, this separation incurs redundant overhead and makes coordinated selection difficult. We cast parameter and data selection as two bilevel selection problems under a common validation objective and derive a shared local response-surrog...
334	Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers 2605.06169	cs.LGcs.CV	Pengqi Lu	Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing... Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network ...
339	Beyond Factor Aggregation: Gauge-Aware Low-Rank Server Representations for Federated LoRA 2605.06733	cs.LGcs.AI	Jinqian Chen, Chang Liu, Jihua Zhu	Federated LoRA enables parameter-efficient adaptation of large language models under decentralized data and limited client resources.However, directly averaging LoRA factors is representation-dependent: the same intrinsic update admits infinitely many gauge-eq... Federated LoRA enables parameter-efficient adaptation of large language models under decentralized data and limited client resources.However, directly averaging LoRA factors is representation-dependent: the same intrinsic update admits infinitely many gauge-equivalent factorizations, so factor-level aggregation can change under arbitrary coordinate choices while the underlying update remains unchanged. This reveals a semantic mismatch in existing federated LoRA aggregation rules. We propose \tex...
344	In-Context Black-Box Optimization with Unreliable Feedback 2605.06187	cs.LGcs.AI	Nicolas Samuel Blumer, Julien Martinelli, Samuel Kaski	Black-box optimization in science and engineering often comes with side information: experts, simulators, pretrained predictors, or heuristics can suggest which candidates look promising. This information can accelerate search, but it can also be biased, input... Black-box optimization in science and engineering often comes with side information: experts, simulators, pretrained predictors, or heuristics can suggest which candidates look promising. This information can accelerate search, but it can also be biased, input-dependent, or misleading. Feedback-aware BO methods typically handle one task at a time, limiting their ability to generalize over multiple sources of feedback. In-context optimizers address cross-task adaptation, but usually assume that o...
345	Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning 2605.06734	cs.LGcs.AIquant-ph	Kuo-Chung Peng, Samuel Yen-Chi Chen, Jiun-Cheng Jiang, Chen-Yu Liu, En-Jui Kuo	Fast Weight Programmers (FWPs) encode temporal dependencies through dynamically updated parameters rather than recurrent hidden states. Quantum FWPs (QFWPs) extend this idea with variational quantum circuits (VQCs), but existing implementations rely on multi-q... Fast Weight Programmers (FWPs) encode temporal dependencies through dynamically updated parameters rather than recurrent hidden states. Quantum FWPs (QFWPs) extend this idea with variational quantum circuits (VQCs), but existing implementations rely on multi-qubit architectures that are difficult to scale on noisy intermediate-scale quantum (NISQ) devices and expensive to simulate classically. We propose gated QKAN-FWP, a fast-weight framework that integrates FWP with Quantum-inspired Kolmogorov...
348	Constrained Contextual Bandits with Adversarial Contexts 2605.06190	cs.LG	Dhruv Sarkar, Abhishek Sinha	We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn indepe... We study budget-constrained contextual bandits with adversarial contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We focus on the continuing setting, in which the algorithm operates over the entire horizon even after the budget for cumulative cost is exhausted. In thi...
355	STDA-Net: Spectrogram-Based Domain Adaptation for cross-dataset Sleep Stage Classification 2605.06736	cs.LGcs.AIcs.HC	Unaza Tallal, Shruti Kshirsagar, Ankita Shukla	Accurate sleep stage classification across datasets remains challenging due to variability in EEG channel montages, sampling rates, recording environments, and subject populations. Although deep learning has shown considerable promise for automated sleep stagi... Accurate sleep stage classification across datasets remains challenging due to variability in EEG channel montages, sampling rates, recording environments, and subject populations. Although deep learning has shown considerable promise for automated sleep staging, most existing cross-dataset methods rely on one-dimensional EEG signal representations, whereas the use of two-dimensional spectrogram-based inputs within an unsupervised domain adaptation framework has remained largely unexplored. Here...
356	Bandit Learning in General Open Multi-agent Systems 2605.06202	cs.LGstat.ML	Mengfan Xu	Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions ... Recent developments in digital platforms have highlighted the prevalence of open systems, where agents can arrive and depart over time. While bandit learning in open systems has recently received initial attention, existing work imposes structural assumptions that are frequently violated in practice. A learning paradigm for general open systems creates fresh challenges: newly arriving agents induce endogenous non-stationarity; agent patterns determine how quickly information accumulates; and new...
359	Federation of Experts: Communication Efficient Distributed Inference for Large Language Models 2605.06206	cs.LG	Muhammad Shahir Abdurrahman, Chun Deng, Azalia Mirhoseini, Philip Levis	Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Fede... Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Betwee...
362	Contrastive Identification and Generation in the Limit 2605.06211	cs.LGcs.AIcs.CLcs.DS	Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao	In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in th... In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target's support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than si...
363	Playing the network backward: A Game Theoretic Attribution Framework 2605.06212	cs.LGcs.CV	Jakob Paul Zimmermann, Jim Berend, Georg Loho, Sebastian Lapuschkin, Wojciech Samek	Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared fram... Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradient...
368	AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks 2605.06218	cs.LG	Yi Wei, Xuan Qi, Furao shen, Jian Zhao, Vittorio Murino	Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, ... Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerati...
373	Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs 2605.06225	cs.LGcs.AI	Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker	Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but ... Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activation steering is compact but typically weaker and does not support large structured reminders. We introduce memory inception (MI), a training-free method that steers in latent attention space by inserting text-derived key-value (KV) banks only at selected layers. Rathe...
376	Soft Deterministic Policy Gradient with Gaussian Smoothing 2605.06228	cs.LGcs.AI	Hyunjun Na, Donghwan Lee	Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems invo... Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specif...
381	Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks 2605.06238	cs.LGcs.AI	Guanmeng Xian, Ning Yang, Philip S. Yu	Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion-based promotion attacks. Existing defenses are largely limited to single-modal settings and mainly focus on poison... Multimodal recommender systems exploit visual and textual signals to alleviate data sparsity, but this also makes them more vulnerable to evasion-based promotion attacks. Existing defenses are largely limited to single-modal settings and mainly focus on poisoning-based threats, leaving evasion-based threats underexplored. In this work, we first identify a cross-modal gradient mismatch under the multi-user promotion setting, where visual and textual perturbations are optimized in inconsistent dir...
382	When Graph Language Models Go Beyond Memorization 2605.06239	cs.LG	Masatsugu Yamada, Mahito Sugiyama	It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph ... It remains unclear whether graph language models learn structural regularities or merely memorize training graphs; this cannot be resolved by current aggregate fidelity metrics alone. We develop a calibrated diagnostic protocol that combines frequent subgraph mining, a graph-level bootstrap baseline, and three-level frequency stratification to disentangle memorization from structural alignment. Using this framework, we show that graph language models can acquire structural regularities beyond me...
383	Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant 2605.06240	cs.LGcs.AI	Amirhossein Yousefiramandi	Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free... Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free-riding: under the softplus FF criterion, the class-discrimination gradient reaching block $d$ decays exponentially with the positive margin accumulated by preceding blocks. We then study three local remedies -- per-block, hardness-gated, a...
386	Structure-Preserving Gaussian Processes Via Discrete Euler-Lagrange Equations 2605.06246	cs.LGcs.RO	Jan-Hendrik Ewering, Kathrin Flaßkamp, Niklas Wahlström, Thomas B. Schön, Thomas Seel	In this paper, we propose Lagrangian Gaussian Processes (LGPs) for probabilistic and data-efficient learning of dynamics via discrete forced Euler-Lagrange equations. Importantly, the geometric structure of the Lagrange-d'Alembert principle, which governs the ... In this paper, we propose Lagrangian Gaussian Processes (LGPs) for probabilistic and data-efficient learning of dynamics via discrete forced Euler-Lagrange equations. Importantly, the geometric structure of the Lagrange-d'Alembert principle, which governs the motion of dynamical systems, is preserved by construction in the absence of external forces. This allows learning physically consistent models that overcome erroneous drift in the system's energy, thereby providing stable long-term predicti...
387	The Role of Node Features in Graph Pooling 2605.06250	cs.LG	Jan von Pichowski, Alžbeta Hrabošová, Ingo Scholtes, Christopher Blöcker	Graph pooling is commonly applied in graph classification, yet its empirical gains over standard WL-1 expressive GNNs are often marginal or inconsistent. We study this gap by analysing the interaction between node features and graph topology and their effect o... Graph pooling is commonly applied in graph classification, yet its empirical gains over standard WL-1 expressive GNNs are often marginal or inconsistent. We study this gap by analysing the interaction between node features and graph topology and their effect on pooling objectives. Our analysis reveals that pooling operators require node features that are well-aligned with the graph's topology -- a condition often overlooked and not guaranteed in empirical networks. We formalise fundamental requi...
388	The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks 2605.06258	cs.LGcs.AI	Taehun Cha, Daniel Beaglehole, Adityanarayanan Radhakrishnan, Donghun Lee	Understanding how deep neural networks learn representations remains a central challenge in machine learning theory. In this work, we propose a feature-centric framework for analyzing neural network training by relating weight updates to feature evolution. We ... Understanding how deep neural networks learn representations remains a central challenge in machine learning theory. In this work, we propose a feature-centric framework for analyzing neural network training by relating weight updates to feature evolution. We introduce a simple identity, the Feature Learning Equation, which identifies the weight Gram matrix as the key object capturing feature dynamics. This enables us to interpret gradient descent as implicitly inducing a hypothetical evolution ...
389	Trade-off Functions for DP-SGD with Subsampling based on Random Shuffling: Tight Upper and Lower Bounds 2605.06259	cs.LGcs.CR	Marten van Dijk, Murat Bilgehan Ertan	We derive a tight analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) with subsampling based on random shuffling within the $f$-DP framework. Our analysis covers the regime $σ\geq \sqrt{3/\ln M}$, where $σ$ is the... We derive a tight analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) with subsampling based on random shuffling within the $f$-DP framework. Our analysis covers the regime $σ\geq \sqrt{3/\ln M}$, where $σ$ is the noise multiplier and $M$ is the number of rounds within a single epoch. Unlike $f$-DP analyses for Poisson subsampling, which yield non-closed implicit formulas that can be machine computed but are non-transparent, random shuffling admits ...
390	Beyond Rigid Alignment: Graph Federated Learning via Dual Manifold Calibration 2605.06260	cs.LG	Wentao Yu, Bo Han, Jie Yang, Chen Gong	Graph Federated Learning (GFL) enables collaborative representation learning across distributed subgraphs while preserving privacy. However, heterogeneity remains a critical challenge, as subgraphs across clients typically differ significantly in both semantic... Graph Federated Learning (GFL) enables collaborative representation learning across distributed subgraphs while preserving privacy. However, heterogeneity remains a critical challenge, as subgraphs across clients typically differ significantly in both semantics and structures. Existing methods address heterogeneity by enforcing the rigid alignment of model parameters or prototypes between clients and the server. However, these alignments implicitly rely on a restrictive global linearity assumpti...
391	Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion 2605.06261	cs.LGcs.AI	Eugenio Lomurno, Filippo Balzarini, Francesco Benelle, Francesca Pia Panaccione, Matteo Matteucci	Diffusion-based generators set the current state of the art for synthetic tabular data. These methods approach but rarely exceed real-data utility, and closing this synthetic-real gap has so far been pursued exclusively at training time, via architectural adva... Diffusion-based generators set the current state of the art for synthetic tabular data. These methods approach but rarely exceed real-data utility, and closing this synthetic-real gap has so far been pursued exclusively at training time, via architectural advances, scaling, and retraining of monolithic generators. The inference-time alternative, i.e., refining the outputs of a pre-trained backbone with parameters left untouched, has remained largely unexplored for tabular synthesis. We introduce...
392	Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving 2605.06264	cs.LG	Le Yang, Ruoyu Chen, Haijun Liu, Jiawei Liang, ShangQuan Sun	End-to-end autonomous driving models generate future trajectories from multi-view inputs, improving system integration but introducing opaque decisions and hard-to-localize risks. Existing methods either rely on auxiliary monitoring models or generate textual ... End-to-end autonomous driving models generate future trajectories from multi-view inputs, improving system integration but introducing opaque decisions and hard-to-localize risks. Existing methods either rely on auxiliary monitoring models or generate textual explanations, but are decoupled from the planning process and fail to reveal the visual evidence underlying trajectory generation. While attribution offers a direct alternative, planning differs from image classification by taking six-view ...
396	A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions 2605.06272	cs.LG	Tyler Ingebrand, Ruihan Zhao, Kushagra Gupta, David Fridovich-Keil, Sandeep P. Chinchali	While generative modeling has achieved remarkable success on tasks like natural language-conditioned image generation, enabling model adaptation from example data points remains a relatively underexplored and challenging problem. To this end, we propose Functi... While generative modeling has achieved remarkable success on tasks like natural language-conditioned image generation, enabling model adaptation from example data points remains a relatively underexplored and challenging problem. To this end, we propose Function Projection for Flow Matching (FP-FM), an algorithm that directly conditions generation on samples from the target distribution. FP-FM learns basis functions to span the velocity fields corresponding to a set of training distributions, an...
398	When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy 2605.06274	cs.LGcs.CV	April Chan, Davide D'Ascenzo, Sebastiano Cultrera di Montesano	Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a... Standard cross-entropy is the default classification loss across virtually all of machine learning, yet it treats all misclassifications equally, ignoring the semantic distances that a class hierarchy encodes. We propose Hierarchy-Aware Cross-Entropy (HACE), a drop-in replacement for standard cross-entropy that incorporates a known class hierarchy directly into the loss. HACE combines two components: prediction aggregation, which propagates the model's probability mass upward through the class h...
400	PACE: Prune-And-Compress Ensemble Models 2605.06278	cs.LGmath.OC	Fabian Akkerman, Julien Ferry, Théo Guyard, Thibaut Vidal	Ensemble models achieve state-of-the-art performance on prediction tasks, but usually require aggregating a large number of weak learners. This can hinder deployment, interpretability, and downstream tasks such as robustness verification. Remedies to this issu... Ensemble models achieve state-of-the-art performance on prediction tasks, but usually require aggregating a large number of weak learners. This can hinder deployment, interpretability, and downstream tasks such as robustness verification. Remedies to this issue fall into two main camps: pruning, which discards redundant learners, and compression, which generates new ones from scratch. We introduce PACE, a framework that interleaves these paradigms in a two-phase strategy. First, new learners are...
403	INEUS: Iterative Neural Solver for High-Dimensional PIDEs 2605.06281	cs.LGmath.NAq-fin.CP	Jean-Loup Dupret, Davide Gallon, Patrick Cheridito	In this paper, we introduce INEUS, a meshfree iterative neural solver for partial integro-differential equations (PIDEs). The method replaces the explicit evaluation of nonlocal jump integrals with single-jump sampling and reformulates PIDE solving as a sequen... In this paper, we introduce INEUS, a meshfree iterative neural solver for partial integro-differential equations (PIDEs). The method replaces the explicit evaluation of nonlocal jump integrals with single-jump sampling and reformulates PIDE solving as a sequence of recursive regression problems. Like Physics-Informed Neural Networks (PINNs), INEUS learns global solutions over the entire space-time domain, yet it offers a more efficient treatment of nonlocal terms and avoids the computationally e...
410	Attributions All the Way Down? The Metagame of Interpretability 2605.06295	cs.LGcs.AIstat.ML	Hubert Baniecki, Przemyslaw Biecek, Fabian Fumagalli	We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $φ(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of f... We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $φ(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-at...
412	Region Seeding via Pre-Activation Regularization: A Geometric View from Piecewise Affine Nerual Networks 2605.06300	cs.LG	Yi Wei, Xuan Qi, Furao Shen	Deep networks with continuous piecewise affine activations induce polyhedral partitions of the input space, making the number of realized affine regions a natural measure of expressive capacity and a key determinant of how well the model can approximate nonlin... Deep networks with continuous piecewise affine activations induce polyhedral partitions of the input space, making the number of realized affine regions a natural measure of expressive capacity and a key determinant of how well the model can approximate nonlinear target functions. In practice, standard training realizes far fewer region refinements in data-visited neighborhoods than the architecture could in principle support, while existing region-count theory is primarily architectural and off...
413	Molecules Meet Language: Confound-Aware Representation Learning and Chemical Property Steering in Transformer-VAE Latent Spaces 2605.06303	cs.LG	Zakaria Elabid, Jan Andrzejewski, Bartosz Brzoza, Attila Cangi	Molecular generative models often assume meaningful latent geometry, but apparent property predictability can reflect sequence-level shortcuts rather than chemical organization. We study this issue in an unsupervised autoregressive Transformer-VAE trained on S... Molecular generative models often assume meaningful latent geometry, but apparent property predictability can reflect sequence-level shortcuts rather than chemical organization. We study this issue in an unsupervised autoregressive Transformer-VAE trained on SELFIES. After training, we freeze the model, fit linear probes to RDKit descriptors, and use the probe weights as candidate global steering directions. To separate chemical signal from SELFIES artifacts, we introduce a confound-aware evalua...
418	Perceive, Route and Modulate: Dynamic Pattern Recalibration for Time Series Forecasting 2605.06310	cs.LG	Siru Zhong, Zhao Meng, Haohuan Fu, Haoyang Li, Qingsong Wen	Local temporal patterns in real-world time series continuously shift, rendering globally shared transformations suboptimal. Current deep forecasting models, despite their scale and complexity, rely on fixed weight matrices applied uniformly to all temporal tok... Local temporal patterns in real-world time series continuously shift, rendering globally shared transformations suboptimal. Current deep forecasting models, despite their scale and complexity, rely on fixed weight matrices applied uniformly to all temporal tokens. This creates a static pattern response: models settle into a compromised average, unable to adapt to changing local dynamics. We introduce Dynamic Pattern Recalibration (DPR), a backbone-agnostic mechanism that resolves this via token-...
419	When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias 2605.06314	cs.LG	Ye Su, Jian Li, Yong Liu	Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which inva... Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time $\ell_2$-Boosting over $p$ features and $n$ samples. By coupling the Convex Gaussian Minimax Theorem w...
421	Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization 2605.06316	cs.LGcs.AI	Ruotong Sun, Ermin Wei	Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimiz... Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization -- and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo's ...
425	SMolLM: Small Language Models Learn Small Molecular Grammar 2605.06322	cs.LG	Akhil Jindal, Harang Ju	Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC... Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last, as s...
429	Geometric Kolmogorov--Arnold Network (GeoKAN) 2605.06740	cs.LGcs.AI	Abhijit Sen, Bikram Keshari Parida, Giridas Maiti, Mahima Arya, Denys I. Bondar	We introduce Geometric Kolmogorov--Arnold Networks (GeoKANs), a family of geometry-aware KAN-type models in which approximation is carried out in learned, geometry-adapted coordinates rather than in fixed Euclidean input coordinates. GeoKAN achieves this by le... We introduce Geometric Kolmogorov--Arnold Networks (GeoKANs), a family of geometry-aware KAN-type models in which approximation is carried out in learned, geometry-adapted coordinates rather than in fixed Euclidean input coordinates. GeoKAN achieves this by learning a diagonal Riemannian metric that warps the input before basis expansion and feature mixing. The learned metric provides a geometric inductive bias through local length scaling and volume distortion, and in physics-informed settings ...
431	LINC: Decoupling Local Consequence Scoring from Hidden Matching in Constructive Neural Routing 2605.06332	cs.LG	Shaofeng Qin, Li Wang	Constructive neural routing solvers usually score the next action by matching a decoder context to candidate embeddings, hiding deterministic one-step consequences such as travel, waiting, slack, and capacity changes. We propose LINC (Local Inference via Norme... Constructive neural routing solvers usually score the next action by matching a decoder context to candidate embeddings, hiding deterministic one-step consequences such as travel, waiting, slack, and capacity changes. We propose LINC (Local Inference via Normed Comparison), a decoder-side candidate decision architecture that computes these consequences explicitly. LINC uses them according to their decision role: centered relative consequences are compared by a shared linear local scorer, while f...
434	Eliciting associations between clinical variables from LLMs via comparison questions across populations 2605.06335	cs.LG	Fabian Kabus, Kian Kordtomeikel, Thomas Brox, Heinz Wiendl, Daiana Stolz	The training data of large language models (LLMs) comprises a wide range of biomedical literature, reflecting data from many different patient populations. We investigate how it might be possible to recover information on correlation and causal links between p... The training data of large language models (LLMs) comprises a wide range of biomedical literature, reflecting data from many different patient populations. We investigate how it might be possible to recover information on correlation and causal links between patient characteristics, as a key building block for medical decision making. To avoid the pitfalls of direct elicitation, we propose an approach based on structured comparison questions, specifically patient comparison triplet questions. Th...
438	A Closed-Form Upper Bound for Admissible Learning-Rate Steps in Belief-Space Dynamics 2605.06741	cs.LG	Zixi Li, Youzhen Li	Learning-rate steps are usually treated as hyperparameters. This paper isolates a local beliefspace calculation: when an update is modeled as a projected forward step on the probability simplex, admissibility means contractivity in the natural KL/Bregman geome... Learning-rate steps are usually treated as hyperparameters. This paper isolates a local beliefspace calculation: when an update is modeled as a projected forward step on the probability simplex, admissibility means contractivity in the natural KL/Bregman geometry. Under this model, the upper bound of an admissible step is not a tuning slogan but a formula.
445	Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades 2605.06350	cs.LGcs.AIcs.CL	Dylan Bouchard	Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limite... Model cascades, in which a cheap LLM defers to an expensive one on low-confidence queries, are widely used to navigate the cost-quality tradeoff at deployment. Existing approaches largely treat the deferral threshold as an empirical hyperparameter, with limited guidance on the geometry of the resulting cost-quality frontier over a model pool. We develop a decision-theoretic framework grounded in constrained optimization and duality. For a two-model cascade, we establish piecewise concavity of th...
446	Topological Signatures of Grokking 2605.06352	cs.LGcs.AIstat.ML	Yifan Tang, Qiquan Wang, Inés García-Redondo, Anthea Monod	We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological ... We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured second...
448	Order-Agnostic Autoregressive Modelling with Missing Data 2605.06355	cs.LGstat.ML	Ignacio Peis, Pablo M. Olmos, Jes Frellsen	Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show... Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second...
450	Memory Efficient Full-gradient Attacks (MEFA) Framework for Adversarial Defense Evaluations 2605.06357	cs.LGcs.AIcs.CV	Yuan Du, Mitchel Hill, HanQin Cai	This work studies the robust evaluation of iterative stochastic purification defenses under white-box adversarial attacks. Our key technical insight is that gradient checkpointing makes exact end-to-end gradient computation through long purification trajectori... This work studies the robust evaluation of iterative stochastic purification defenses under white-box adversarial attacks. Our key technical insight is that gradient checkpointing makes exact end-to-end gradient computation through long purification trajectories practical by trading additional recomputation for substantially lower memory usage. This enables full-gradient adaptive attacks against diffusion- and Langevin-based purification defenses, where prior evaluations often resort to approxim...
452	Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction 2605.06361	cs.LG	Alessandro Pagani, Marco Cominelli, Liying Han, Gaofeng Dong, Sergio Benini	This paper presents a preliminary analysis of the ability of Chronos foundation model to process and internally represent frequency domain information. Foundation models that process time-series data offer practitioners a unified architecture capable of learni... This paper presents a preliminary analysis of the ability of Chronos foundation model to process and internally represent frequency domain information. Foundation models that process time-series data offer practitioners a unified architecture capable of learning generic temporal representations across diverse tasks and domains, reducing the need for task-specific feature engineering and enabling transfer across signal modalities. Despite their growing adoption, the extent to which such models en...
453	Flow Matching with Arbitrary Auxiliary Paths 2605.06364	cs.LGcs.AI	Xin Peng, Ang Gao	We introduce a new generative modeling framework, \textbf{Flow Matching with Arbitrary Auxiliary Paths (AuxPath-FM)}, which generalizes conditional flow matching by incorporating an auxiliary variable drawn from an arbitrary distribution into the probability p... We introduce a new generative modeling framework, \textbf{Flow Matching with Arbitrary Auxiliary Paths (AuxPath-FM)}, which generalizes conditional flow matching by incorporating an auxiliary variable drawn from an arbitrary distribution into the probability path. Unlike prior methods that restrict auxiliary components to Gaussian noise, AuxPath-FM allows the variable $η$ to follow any distribution, producing trajectories of the form $X_t = a(t)X_1 + b(t)X_0 + c(t)η$. We theoretically demonstrat...
455	Layer Collapse in Diffusion Language Models 2605.06366	cs.LG	Alexander Conzelmann, Albert Catalan-Tatjer, Shiwei Liu	Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking... Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: prun...
460	A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment 2605.06375	cs.LGmath.ST	Hao Yu	Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning para... Large language model (LLM) alignment via reinforcement learning from human preferences (RLHF) suffers from unstable policy updates, ambiguous gradient directions, poor interpretability, and high gradient variance in mainstream pairwise preference learning paradigms. To systematically address these limitations, we establish a unified theoretical framework for preference-based RL optimization centered on the Pair-GRPO family, comprising two tightly coupled variants: Soft-Pair-GRPO and Hard-Pair-GR...
465	MinMax Recurrent Neural Cascades 2605.06384	cs.LGcs.AIcs.FL	Alessandro Ronca	We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtai... We show that the MinMax algebra provides a form of recurrence that is expressively powerful, efficiently implementable, and most importantly it is not affected by vanishing or exploding gradient. We call MinMax Recurrent Neural Cascades (RNCs) the models obtained by cascading several layers of neurons that employ such recurrence. We show that MinMax RNCs enjoy many favourable theoretical properties. First, their formal expressivity includes all regular languages, arguably the maximal expressivit...
466	Data-Driven Covariate Selection for Nonparametric and Cycle-Agnostic Causal Effect Estimation 2605.06385	cs.LG	Ana Leticia Garcez Vicente, Gijs van Seeventer, Saber Salehkaleybar	Estimating causal effects from observational data requires identifying valid adjustment sets. This task is especially challenging in realistic settings where latent confounding and feedback loops are present. Existing approaches typically assume acyclicity or ... Estimating causal effects from observational data requires identifying valid adjustment sets. This task is especially challenging in realistic settings where latent confounding and feedback loops are present. Existing approaches typically assume acyclicity or rely on global causal structure learning, limiting applicability and computational efficiency. In this work, we study a local, data-driven method for covariate selection based on conditional independence information. While this method is kn...
468	Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level 2605.06387	cs.LGcs.AI	Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang	On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suf... On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillat...
471	Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves 2605.06395	cs.LGcs.AIeess.SP	Kartik Tandon, Julian Gould, Tanishq Bhatia, Francesca Dominici, Alejandro Ribeiro	Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for ... Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with ...
472	SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask 2605.06402	cs.LG	Liu Hanzuo, Chaofan Lin, Weixuan Sun, Yulong Wang, Key	Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing... Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather ...
474	FRInGe: Distribution-Space Integrated Gradients with Fisher--Rao Geometry 2605.06404	cs.LG	Gabriele Martino, Sebastian Tschiatschek	Gradient-based attribution methods are model-faithful and scalable, but Integrated Gradients (IG) can be brittle because explanations depend on heuristic baselines, straight-line paths, discretization, and saturation. We propose Fisher--Rao Integrated Gradient... Gradient-based attribution methods are model-faithful and scalable, but Integrated Gradients (IG) can be brittle because explanations depend on heuristic baselines, straight-line paths, discretization, and saturation. We propose Fisher--Rao Integrated Gradients (FRInGe), which defines both the reference and interpolation schedule in predictive distribution space. FRInGe replaces input baselines with a maximum-entropy predictive reference and follows a Fisher-Rao geodesic on the probability simpl...
478	E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology 2605.06415	cs.LGcs.AIcs.CLcs.CV	Qingjun Zhang	We introduce E = TH/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy w... We introduce E = TH/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead e...
482	Federated Cross-Client Subgraph Pattern Detection 2605.06433	cs.LG	Selin Ceydeli, Rui Wang, Kubilay Atasu	Subgraph pattern detection aims to uncover complex interaction structures in graphs. However, state-of-the-art graph neural network (GNN)-based solutions assume centralized access to the entire graph. When graphs are instead distributed across multiple parties... Subgraph pattern detection aims to uncover complex interaction structures in graphs. However, state-of-the-art graph neural network (GNN)-based solutions assume centralized access to the entire graph. When graphs are instead distributed across multiple parties, client-local GNN computations diverge from those of a centralized model, resulting in a representation-equivalence gap. We formalize this as a structural observability problem, where subgraph patterns crossing partition boundaries become ...
487	Hyperbolic Concept Bottleneck Models 2605.06440	cs.LGcs.CV	Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes	Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, t... Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework tha...
490	FedFrozen: Two-Stage Federated Optimization via Attention Kernel Freezing 2605.06446	cs.LG	Junye Du, Zhenghao Li, Yushi Feng, Long Feng	Federated learning with heterogeneous clients remains a significant challenge for deep learning, primarily due to client drift arising from inconsistent local updates. Existing federated optimization methods typically address this issue through objective-level... Federated learning with heterogeneous clients remains a significant challenge for deep learning, primarily due to client drift arising from inconsistent local updates. Existing federated optimization methods typically address this issue through objective-level regularization or update-correction mechanisms. Recent studies, however, suggest that Transformer-based architectures may be inherently more robust than conventional models under heterogeneous federated training. Motivated by this observat...
491	Scene-Adaptive Continual Learning for CSI-based Human Activity Recognition with Mixture of Experts 2605.06447	cs.LG	Wenhan Zheng, Yuyi Mao, Ivan Wang-Hei Ho	Channel state information (CSI)-based human activity recognition (HAR) is vulnerable to performance degradation under domain shifts across varying physical environments. Continual learning (CL) offers a principled way to learn new domains sequentially while pr... Channel state information (CSI)-based human activity recognition (HAR) is vulnerable to performance degradation under domain shifts across varying physical environments. Continual learning (CL) offers a principled way to learn new domains sequentially while preserving past knowledge, but existing CL solutions for CSI-based HAR scale poorly with accumulating domains, rely on a large replay buffer, or incur linearly growing inference cost. In this letter, we propose Scene-Adaptive Mixture of Exper...
493	ORTHOBO: Orthogonal Bayesian Hyperparameter Optimization 2605.06454	cs.LGcs.AI	Maresa Schröder, Pascal Janetzky, Michael Klar, Stefan Feuerriegel	Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overl... Bayesian optimization is widely used for hyperparameter optimization when model evaluations are expensive; however, noisy acquisition estimates can lead to unstable decisions. We identify acquisition estimation noise as a failure mode that was previously overlooked: even when the surrogate model and acquisition target are correctly specified, finite-sample Monte Carlo error can perturb acquisition values. This can, in turn, flip candidate rankings and lead to suboptimal BO decisions. As a remedy...
496	Invariant Features in Language Models: Geometric Characterization and Model Attribution 2605.06458	cs.LGcs.CL	Agnibh Dasgupta, Abdullah Tanvir, Xin Zhong	Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in wh... Language models exhibit strong robustness to paraphrasing, suggesting that semantic information may be encoded through stable internal representations, yet the structure and origin of such invariance remain unclear. We propose a local geometric framework in which semantically equivalent inputs occupy structured regions in latent space, with paraphrastic variation along nuisance directions and semantic identity preserved in invariant subspaces. Building on this view, we make three contributions: ...
497	MINER: Mining Multimodal Internal Representation for Efficient Retrieval 2605.06460	cs.LG	Weien Li, Rui Song, Zeyu Li, Haochen Liu, Gonghao Zhang	Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vec... Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vectors per page, incurring large index footprints and high serving costs. By contrast, dense single-vector retrievers retain storage and latency advantages but consistently lag in quality because they compress all information into a single fi...
498	Invariant-Based Diagnostics for Graph Benchmarks 2605.06462	cs.LGmath.CO	Richard von Moos, Mathieu Alain, Bastian Rieck	Progress on graph foundation models is hindered by benchmark practices that conflate the contributions of node features and graph structure, making it hard to tell whether a model actually learns from connectivity, or whether it even needs to. We propose addre... Progress on graph foundation models is hindered by benchmark practices that conflate the contributions of node features and graph structure, making it hard to tell whether a model actually learns from connectivity, or whether it even needs to. We propose addressing this using graph invariants, i.e., permutation-invariant, task-agnostic structural descriptors that serve as a diagnostic framework for graph benchmarks. We show that (i) invariants are more expressive than standard GNNs, (ii) invaria...
499	Diversity Curves for Graph Representation Learning 2605.06466	cs.LG	Katharina Limbeck, Nadja Häusermann, Martin Carrasco, Guy Wolf, Bastian Rieck	Graph-level representations are crucial tools for characterising structural differences between graphs. However, comparing graphs with different cardinalities, even when sampled from the same underlying distribution, remains challenging. Unsupervised tasks in ... Graph-level representations are crucial tools for characterising structural differences between graphs. However, comparing graphs with different cardinalities, even when sampled from the same underlying distribution, remains challenging. Unsupervised tasks in particular require interpretable, scalable, and reliable size-aware graph representations. Our work addresses these issues by tracking the structural diversity of a graph across coarsening levels. The resulting graph embeddings, which we de...
500	No Triangulation Without Representation: Generalization in Topological Deep Learning 2605.06467	cs.LGmath.AT	Johannes S. Schmidt, Martin Carrasco, Ernst Röell, Guy Wolf, Nello Blaser	Despite an ever-increasing interest in topological deep learning models that target higher-order datasets, there is no consensus on how to evaluate such models. This is exacerbated by the fact that topological objects permit operations, such as structural refi... Despite an ever-increasing interest in topological deep learning models that target higher-order datasets, there is no consensus on how to evaluate such models. This is exacerbated by the fact that topological objects permit operations, such as structural refinements, that are not appropriate for graph data. In this work, we extend MANTRA, a benchmark dataset containing manifold triangulations, to a larger class of manifolds with more diverse homeomorphism types. We show that, unlike prior claim...
502	Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies 2605.06470	cs.LG	Magnus Victor Boock, Abdullah Akgül, Mustafa Mert Çelikok, Melih Kandemir	We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distan... We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting time observations. While prior art often produces symmetric distances or fails to satisfy the triangle inequality, our framework learns a Hilbert-space displacement geometry where expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists un...
503	Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management 2605.06472	cs.LG	Haoyu Zheng, Fangcheng Fu, Jia Wu, Binhang Yuan, Yongqiang Zhang	LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and fail to exploit the reuse op... LLM-based workflows compose specialized agents to execute complex tasks, and these agents usually share substantial context, allowing KV-Cache reuse to save computation. Existing approaches either manage KV-Cache at agent level and fail to exploit the reuse opportunities within workflows, or manage cache at the workflow level but assume that each workflow calls a static sequence of agents. However, practical workflows are typically dynamic, where the sequence of invoked agents and thus induced c...
504	Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching 2605.06474	cs.LGcs.AIstat.ML	Xiang Li, Nan Jiang	We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weig... We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation ca...
516	Operator-Guided Invariance Learning for Continuous Reinforcement Learning 2605.06500	cs.LGcs.AI	Zuyuan Zhang, Fei Xu Yu, Tian Lan	Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing appr... Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve learning. Most existing approaches focus on special cases, such as prescribed symmetries and exact equivariance, without addressing how to discover more general structures that require nonlinear operators to transform and map between continuous state/action systems wi...
517	Cubit: Token Mixer with Kernel Ridge Regression 2605.06501	cs.LGcs.CL	Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Liangchen Tan	Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing me... Since its introduction in 2017, the Transformer has become one of the most widely adopted architectures in modern deep learning. Despite extensive efforts to improve positional encoding, attention mechanisms, and feed-forward networks, the core token-mixing mechanism in Transformers remains attention. In this work, we show that the attention module in Transformers can be interpreted as performing Nadaraya-Watson regression, where it computes similarities between tokens and aggregates the corresp...
518	Gradient Extrapolation-Based Policy Optimization 2605.06755	cs.LGcs.AI	Ismam Nur Swapnil, Aranya Saha, Tanvir Ahmed Khan, Mohammad Ariful Haque, Ser-Nam Lim	Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead c... Reinforcement learning is widely used to improve the reasoning ability of large language models, especially when answers can be automatically checked. Standard GRPO-style training updates the model using only the current step, while full multi-step lookahead can give a better update direction but is too expensive because it needs many backward passes. We propose Gradient Extrapolation-Based Policy Optimization (GXPO), a plug-compatible policy-update rule for GRPO-style reasoning RL. GXPO approxi...
519	PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization 2605.06505	cs.LGcs.AIcs.CR	Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas	We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prio... We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the...
524	Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models 2605.06510	cs.LGcs.AI	Amir Rezaei Balef, Mykhailo Koshil, Katharina Eggensperger	Transformer-based tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the... Transformer-based tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the-art tabular in-context learning models. We explore how predictions emerge across depth, identify distinct stages of inference and reveal latent-space dynamics that differ from those of language models. Our findings indicate substantial dep...
527	Physics-based Digital Twins for Integrated Thermal Energy Systems Using Active Learning 2605.06756	cs.LGeess.SY	Umme Mahbuba Nabila, Paul Seurin, Linyu Lin, Majdi I. Radaideh	Real-time supervisory control of thermal energy distribution systems requires digital twins that are accurate, interpretable, and uncertainty-aware, yet remain data and computationally efficient. High-fidelity simulations alone are costly, while purely data-dr... Real-time supervisory control of thermal energy distribution systems requires digital twins that are accurate, interpretable, and uncertainty-aware, yet remain data and computationally efficient. High-fidelity simulations alone are costly, while purely data-driven surrogates often lack robustness. To address these challenges, this work proposes an active learning (AL) framework that couples system-level Modelica simulations with four simpler physics-informed and data-driven surrogate modeling ap...
528	Efficient Techniques for Data Reconstruction, with Finite-Width Recovery Guarantees 2605.06519	cs.LG	Edward Tansley, Roy Makhlouf, Estelle Massart, Coralia Cartis	Data reconstruction attacks on trained neural networks aim to recover the data on which the network has been trained and pose a significant threat to privacy, especially if the training dataset contains sensitive information. Here, we propose a unified optimiz... Data reconstruction attacks on trained neural networks aim to recover the data on which the network has been trained and pose a significant threat to privacy, especially if the training dataset contains sensitive information. Here, we propose a unified optimization formulation of the data reconstruction problem based on initial and trained parameter values, incorporating state-of-the-art proposals. We show that in the random feature model, this formulation provably leads to training data reconst...
530	Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models 2605.06522	cs.LGcs.CV	Xin Wang, Haibo Chen, Wenxuan Liu, Wenwu Zhu	Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and o... Foundation models (FMs) are increasingly deployed in open-world settings where distribution shift is the rule rather than the exception. The out-of-distribution (OOD) phenomena they face -- knowledge boundaries, capability ceilings, compositional shifts, and open-ended task variation -- differ in kind from the settings that have shaped prior OOD research, and are further complicated because the pretraining and post-training distributions of modern FMs are often only partially observed. Our posit...
531	On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR 2605.06523	cs.LGcs.AI	Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang	Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we emp... Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when it...
539	Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability 2605.06538	cs.LG	Matias G. Delgadino, Sebastien Motsch, Advait Parulekar, William Porteous, Sanjay Shakkottai	Diffusion-based posterior samplers use pretrained diffusion priors to sample from measurement- or reward-conditioned posteriors, and are widely used for inverse problems. Yet their theoretical behavior remains poorly understood: even with exact prior scores, t... Diffusion-based posterior samplers use pretrained diffusion priors to sample from measurement- or reward-conditioned posteriors, and are widely used for inverse problems. Yet their theoretical behavior remains poorly understood: even with exact prior scores, their outputs are biased, and in low-temperature regimes their discretizations can become unstable. We characterize this bias by introducing a tractable surrogate path connecting the true posterior to a standard Gaussian and comparing it to ...
541	Hedging Memory Horizons for Non-Stationary Prediction via Online Aggregation 2605.06541	cs.LGstat.ML	Yutong Wang, Yannig Goude, Qiwei Yao	We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation mem... We study online prediction under distribution shift, where inputs arrive chronologically and outcomes are revealed only after prediction. In this setting, predictors must remain stable in quiet regimes yet adapt when regimes shift, and the right adaptation memory is unknown in advance. We propose MELO (Memory-hedged Exponentially Weighted Least-Squares Online aggregation), a model-agnostic method that hedges across adaptation scales: it wraps any non-anticipating base-predictor pool with exponen...
544	Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning 2605.06552	cs.LG	Michal Kobiela, Diego A. Oyarzún, Michael U. Gutmann	The design of biological systems is hindered by uncertainty arising from both intrinsic stochasticity of biomolecular reactions and variability across laboratory or experimental conditions. In this work, we present a sequential framework to optimize genetic ci... The design of biological systems is hindered by uncertainty arising from both intrinsic stochasticity of biomolecular reactions and variability across laboratory or experimental conditions. In this work, we present a sequential framework to optimize genetic circuits under both forms of uncertainty. By employing simulator models based on differential equations or Markov jump processes alongside a reinforcement learning (RL) policy-based approach, our method suggests experiments that adapt to unkn...
545	Diverse Sampling in Diffusion Models with Marginal Preserving Particle Guidance 2605.06553	cs.LG	Gal Vinograd, Idan Achituve, Ethan Fetaya	We present EDDY (Exact-marginal Diversification via Divergence-free dYnamics), a guidance mechanism for diffusion and flow matching models that promotes diversity among samples generated while maintaining quality. EDDY exploits symmetries of the Fokker-Planck ... We present EDDY (Exact-marginal Diversification via Divergence-free dYnamics), a guidance mechanism for diffusion and flow matching models that promotes diversity among samples generated while maintaining quality. EDDY exploits symmetries of the Fokker-Planck equation, using drift perturbations that change particle trajectories while preserving the evolving marginal distribution. We instantiate this principle through kernel-based anti-symmetric pairwise matrix fields, constructed from the repuls...
549	Optimal Counterfactual Search in Tree Ensembles: A Study Across Modeling and Solution Paradigms 2605.06561	cs.LG	Awa Khouna, Youssouf Emine, Julien Ferry, Thibaut Vidal	Trust in counterfactual explanations depends critically on whether their recommended changes are truly minimal: suboptimal explanations may vastly overshoot the actual changes needed to alter a decision, and heuristic errors can affect individuals unevenly, gi... Trust in counterfactual explanations depends critically on whether their recommended changes are truly minimal: suboptimal explanations may vastly overshoot the actual changes needed to alter a decision, and heuristic errors can affect individuals unevenly, giving some users relevant recourse while assigning others unnecessarily costly recommendations. Consequently, we study the problem of computing optimal counterfactual explanations for tree ensembles under plausibility and actionability const...
550	Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data 2605.06562	cs.LGq-bio.GN	Meena Al Hasani	Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning model... Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models. In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models wer...
551	Criticality and Saturation in Orthogonal Neural Networks 2605.06563	cs.LG	Max Guillen, Jan E. Gerken	It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statist... It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in $1/\mathrm{width}$. In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized...
553	SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation 2605.06570	cs.LGmath.OCq-fin.CPq-fin.MFq-fin.RM	Dmitri Goloubentsev, Natalija Karpichina	Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming s... Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor chain. Dynamic programming solves small instances exactly but scales exponentially in state dimensions. Black-box reinforcement learning handles high-dimensional states but trains slowly and produces no sensitivities. We introduce SNAPO (Smooth Neural Adjoint Policy O...
554	CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification 2605.06571	cs.LGcs.CRcs.DCcs.NI	Iason Ofeidis, Nikos Papadis, Randeep Bhatia, Leandros Tassiulas, TV Lakshman	The rapid expansion of the Internet of Things (IoT) and Industrial IoT (IIoT) has created a massive, heterogeneous attack surface that challenges traditional network security mechanisms. While Federated Learning (FL) offers a privacy-preserving alternative to ... The rapid expansion of the Internet of Things (IoT) and Industrial IoT (IIoT) has created a massive, heterogeneous attack surface that challenges traditional network security mechanisms. While Federated Learning (FL) offers a privacy-preserving alternative to centralized Intrusion Detection Systems (IDS), standard approaches struggle to generalize across diverse device behaviors and typically fail to utilize the vast amounts of unlabeled data present in realistic edge environments. To bridge the...
556	Directional Consistency as a Complementary Optimization Signal: The GONO Framework 2605.06575	cs.LGcs.AI	Victor Daniel Gera	We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t -> 1, measured via consecutive gradient cos... We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t -> 1, measured via consecutive gradient cosine similarity) while the loss remains high or decreases slowly. This observation reveals that existing optimizers such as Adam, SGD, and RMSprop lack explicit mechanisms to exploit temporal consistency in gradient directions, relying instead ...
557	On the Safety of Graph Representation Learning 2605.06576	cs.LG	Xiaoguang Guo, Zehong Wang, Ziming Li, Shawn Spitzel, Soonwoo Kwon	Graph representation learning (GRL) has evolved from topology-only graph embeddings to task-specific supervised GNNs, and more recently to reusable representations and graph foundation models (GFMs). However, existing evaluations mainly measure clean transfer,... Graph representation learning (GRL) has evolved from topology-only graph embeddings to task-specific supervised GNNs, and more recently to reusable representations and graph foundation models (GFMs). However, existing evaluations mainly measure clean transfer, adaptation, and task coverage. It remains unclear whether GRL methods stay reliable when deployment stresses affect graph signals, graph contexts, label support, structural groups, or predictive evidence. We introduce GRL-Safety, a multi-a...
558	PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization 2605.06582	cs.LGcs.CLcs.SD	Adhiraj Banerjee, Vipul Arora	Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantiz... Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact au...
561	Distributionally-Robust Learning to Optimize 2605.06585	cs.LGmath.OC	Vinit Ranjan, Jisun Park, Bartolomeo Stellato	We propose a distributionally robust approach to learning hyperparameters for first-order methods in convex optimization. Given a dataset of problem instances, we minimize a Wasserstein distributionally robust version of the performance estimation problem (PEP... We propose a distributionally robust approach to learning hyperparameters for first-order methods in convex optimization. Given a dataset of problem instances, we minimize a Wasserstein distributionally robust version of the performance estimation problem (PEP) over algorithm parameters such as step sizes. Our framework unifies two extremes: as the robustness radius vanishes, we recover classical learning to optimize (L2O); as it grows, we recover worst-case optimal algorithm design via PEP. We ...
562	Towards Metric-Faithful Neural Graph Matching 2605.06588	cs.LGcs.AI	Jyotirmaya Shivottam, Subhankar Mishra	Graph Edit Distance (GED) is a fundamental, albeit NP-hard, metric for structural graph similarity. Recent neural graph matching architectures approximate GED by first encoding graphs with a Graph Neural Network (GNN) and then applying either a graph-level reg... Graph Edit Distance (GED) is a fundamental, albeit NP-hard, metric for structural graph similarity. Recent neural graph matching architectures approximate GED by first encoding graphs with a Graph Neural Network (GNN) and then applying either a graph-level regression head or a matching-based alignment module. Despite substantial architectural progress, the role of encoder geometry in neural GED estimation remains poorly understood. In this paper, we develop a theoretical framework that connects ...
564	BRICKS: Compositional Neural Markov Kernels for Zero-Shot Radiation-Matter Simulation 2605.06591	cs.LGhep-ph	Richard Hildebrandt, Evangelos Kourlitis, Baran Hashemi, Manuel Bünstorf, Thierry Meyer	We introduce a new strategy for compositional neural surrogates for radiation-matter interactions, a key task spanning domains from particle physics through nuclear and space engineering to medical physics. Exploiting the locality and the Markov nature of part... We introduce a new strategy for compositional neural surrogates for radiation-matter interactions, a key task spanning domains from particle physics through nuclear and space engineering to medical physics. Exploiting the locality and the Markov nature of particle interactions, we create a \emph{next-particle prediction} kernel using hybrid discrete-continuous transformer models based on Riemannian Flow Matching on product manifolds. The model generates variable-sized typed sets of particles and...
571	Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization 2605.06599	cs.LGeess.AS	Abhijit Das, Sayantan Dutta	Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard ... Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective--cross-entropy loss with $L^2$ regularization--by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $\mathcal{F}$ is infinitely differentiable, grows...
573	How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation 2605.06605	cs.LG	Shai Feldman, Yaniv Romano	Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repea... Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger th...
576	Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent 2605.06609	cs.LGstat.ML	Chenyang Zhang, Yuan Cao	Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction ... Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby enhancing prediction and generation. In this work, we investigate how transformers with softmax attention perform in-context learning on linear classification data. We first construct a class of multi-layer transformers that can perform in-context logistic regr...
577	SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders 2605.06610	cs.LGcs.CV	Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek	Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets ... Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features ...
578	The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity 2605.06611	cs.LGcs.AIstat.ML	Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu	Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for this ... Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), where initial tokens disproportionately monopolize attention scores, its structural origins remain elusive. This work provides a \textit{mechanistic explanation} for this phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of su...
579	Online Bayesian Calibration under Gradual and Abrupt System Changes 2605.06612	cs.LGcs.ETstat.ML	Yang Xu, Chiwoo Park	Bayesian model calibration is central to digital twins and computer experiments, as it aligns model outputs with field observations by estimating calibration parameters and correcting systematic model bias. Classical Bayesian calibration introduces latent para... Bayesian model calibration is central to digital twins and computer experiments, as it aligns model outputs with field observations by estimating calibration parameters and correcting systematic model bias. Classical Bayesian calibration introduces latent parameters and a discrepancy function to model bias, but suffers from parameter--discrepancy confounding and is typically formulated as an offline procedure under a stationary data-generating assumption. These limitations are restrictive in mod...
581	When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds 2605.06615	cs.LGcs.AIcs.CLmath.OC	Hongyi Tao, Dingzhi Yu, Lijun Zhang	Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why th... Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamen...
585	Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache 2605.06763	cs.LG	Mohsen Dehghankar, Abolfazl Asudeh	Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typic... Sparse attention improves LLM inference efficiency by selecting a subset of key-value entries, but at the cost of potential accuracy degradation. In particular, omitting critical KV entries can induce substantial errors in model outputs. Existing methods typically operate under fixed or adaptive token budgets and provide empirical robustness or partial theoretical guarantees, yet they do not ensure zero false negatives in decoding steps, particularly since the set of relevant tokens is both quer...
589	Hybrid Quantum-Classical GANs for the Generation of Adversarial Network Flows 2605.06629	cs.LG	Prateek Paudel, Nitin Jha, Abhishek Parakh, Mahadevan Subramaniam	Classical generative adversarial networks (GANs) have been applied to generate adversarial network traffic capable of attacking intrusion detection systems, but they suffer from shortcomings such as the need for large amounts of high-dimensional datasets, mode... Classical generative adversarial networks (GANs) have been applied to generate adversarial network traffic capable of attacking intrusion detection systems, but they suffer from shortcomings such as the need for large amounts of high-dimensional datasets, mode collapse, and high computational overhead. In this work, we propose a hybrid quantum-classical GAN (QC-GAN) framework where a variational quantum generator is used to generate synthetic network traffic flows mimicking malicious traffic usi...
591	Crafting Reversible SFT Behaviors in Large Language Models 2605.06632	cs.LG	Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding	Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identif... Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post-hoc. However, such correlations do not imply causal necessity, limiting the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative b...
595	Recursive Agent Optimization 2605.06639	cs.LGcs.AIcs.CLcs.MA	Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig	We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling... We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents ...
596	Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models 2605.06640	cs.LGcs.AI	Ronaldo Canizales, Divya Gopinath, Corina Păsăreanu, Ravi Mangal	Concept-based explanations offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and... Concept-based explanations offer a promising approach for explaining the predictions of deep neural networks in terms of high-level, human-understandable concepts. However, existing methods either do not establish a causal connection between the concepts and model predictions or are limited in expressivity and only able to infer causal explanations involving single concepts. At the same time, the parallel line of work on formal abductive and contrastive explanations computes the minimal set ...
600	Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction 2605.06644	cs.LG	Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit	Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local ph... Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned...
cs.MA 4 papers
86	Active Learning for Communication Structure Optimization in LLM-Based Multi-Agent Systems 2605.05703	cs.MAcs.AIcs.LG	Huchen Yang, Xinghao Dong, Dan Negrut, Jin-Long Wu	Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may d... Optimizing the communication structure of large language model based multi-agent systems (LLM-MAS) has been shown to improve downstream performance and reduce token usage. Existing methods typically rely on randomly sampled training tasks. However, tasks may differ substantially in difficulty and domain, and thus they are not equally informative for updating communication structure, making optimization under limited training budgets often unstable and highly sensitive to the particular training ...
99	Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes 2605.05724	cs.MAcs.AI	Jingjie Ning, Xiaochuan Li, Ji Zeng, Hao Kang, Chenyan Xiong	We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or... We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across tria...
424	Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs 2605.06320	cs.MAcs.AIcs.CL	Elizabeth Mieczkowski, Alexander Ku, Tiwalayo Eisape, Dilip Arumugam, John Matters	Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured... Large language models (LLMs) are increasingly deployed in teams, yet existing coordination approaches often occupy two extremes. Highly structured methods rely on fixed roles, pipelines, or task decompositions assigned a priori. In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations). We introduce Language Agent Teams for Task Evol...
548	Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning 2605.06557	cs.MAcs.AIcs.LG	Maria Ana Cardei, Matthew Landers, Afsaneh Doryab	Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where a... Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination-aware evaluation perspective that supplements return with process-level diagnostics. We instantiate this perspective using STAT, a controlled commit...
cs.MM 1 papers
385	Modality-Aware Contrastive and Uncertainty-Regularized Emotion Recognition 2605.06245	cs.MM	Yan Zhuang, Minhao Liu, Yanru Zhang, Jiawen Deng, Fuji Ren	Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogene... Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations and posing significant challenges to achieving consistent and reliable emotion understanding. To address this challenge, we propose the Modality-Aware Contrastive and Uncertainty-Regularized (MCUR) framework, which...
cs.NE 1 papers
439	CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models 2605.06341	cs.NEcs.AImath.OC	Thomas Bömer, Bastian Amberg, Max Disselnmeyer, Anne Meyer	Many real-world optimization problems consist of multiple tightly coupled subproblems whose solutions must be coordinated to achieve high overall performance. However, existing large language model driven automated heuristic design approaches are limited to si... Many real-world optimization problems consist of multiple tightly coupled subproblems whose solutions must be coordinated to achieve high overall performance. However, existing large language model driven automated heuristic design approaches are limited to single-problem settings. In this paper, we propose CoupleEvo. CoupleEvo proposes three evolutionary coordination strategies to evolve heuristics for coupled optimization problems: the sequential strategy evolves heuristics for one subproblem ...
cs.PF 1 papers
82	When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon 2605.05699	cs.PFcs.AI	Mohamed Amine Bergach	KV-cache quantization is framed as a quality--latency trade-off. We show it is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $λ$ $+$ per-group abs-max $+$ int4 nibble pack), exposed as a Hug... KV-cache quantization is framed as a quality--latency trade-off. We show it is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $λ$ $+$ per-group abs-max $+$ int4 nibble pack), exposed as a HuggingFace \texttt{Cache} subclass, runs \emph{faster than fp16} across $256$--$4096$-token prefixes on Gemma-3 1B ($-3$ to $-8\%$ ms/tok) and at short context on Qwen2.5-1.5B ($-0.7$ to $-2.6\%$ through $1$K), with $3\times$ persistent memor...
cs.RO 4 papers
119	MaMi-HOI: Harmonizing Global Kinematics and Local Geometry for Human-Object Interaction Generation 2605.05756	cs.ROcs.CV	Hao Wang, Shiqi Wang, Qi Liu	Generating realistic 3D Human-Object Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, which requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing meth... Generating realistic 3D Human-Object Interactions (HOI) is a fundamental task for applications ranging from embodied AI to virtual content creation, which requires harmonizing high-level semantic intent with strict low-level physical constraints. Existing methods excel at semantic alignment, however, they struggle to maintain precise object contact. We reveal a key finding termed \textit{Geometric Forgetting}: as diffusion model depth increases, semantic feature tend to overshadow object geometr...
371	When to Trust Imagination: Adaptive Action Execution for World Action Models 2605.06222	cs.ROcs.AI	Rui Wang, Yue Zhang, Jiehong Lin, Kuncheng Luo, Jianan Wang	World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model ... World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute long...
566	ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting 2605.06593	cs.ROcs.GRcs.LG	David Müller, Agon Serifi, Sammy Christen, Ruben Grandia, Espen Knoop	Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream im... Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motions, which hinder downstream imitation learning. We propose a bilevel optimization framework that jointly adapts reference motions to a robot's morphology while training a tracking policy using reinforcement learning. To make the optimization tractable, we derive an appr...
568	Cross-Modal Navigation with Multi-Agent Reinforcement Learning 2605.06595	cs.ROcs.AIcs.LGcs.MA	Shuo Liu, Xinzichen Li, Christopher Amato	Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex represe... Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of...
cs.SD 4 papers
39	X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning 2605.05611	cs.SDcs.AIeess.AS	Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu	In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) ... In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation. To eliminate the reliance on prompt text without complex preprocessing like forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional ...
229	Do Melody and Rhythm Coevolve? 2605.05982	cs.SD	Harin Lee, Rainer Polak, Manuel Anglada-Tort, Marc Schönwiesner, Minsu Park	Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract v... Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-interval and percussive inter-onset timing distributions from 27,628 popular songs across 59 countries, enabling large-scale cross-cultural comparison that bypasses traditional music annotations. Musical similarities betw...
255	Quantum Kernels for Audio Deepfake Detection Using Spectrogram Patch Features 2605.06035	cs.SDcs.AI	Lisan Al Amin, Rakib Hossain, Mahbubul Islam, Faisal Quader, Thanh Thi Nguyen	Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature m... Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio that encodes local time-frequency patches from mel-spectrograms into quantum states using shallow, hardware-efficient circuits with adjacency-aware entanglement. Each selected patch is summarized by a compact four-dimen...
587	PianoCoRe: Combined and Refined Piano MIDI Dataset 2605.06627	cs.SDcs.LG	Ilya Borovik	Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsist... Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763...
cs.SE 8 papers
54	Agentic Coding Needs Proactivity, Not Just Autonomy 2605.06717	cs.SEcs.AI	Nghi D. Q. Bui, Georgios Evangelopoulos	Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook triggered routines across the development ... Coding agents are rapidly changing the landscape of software development, moving from inline completion to autonomous systems that edit repositories, open pull requests, respond to issues, and run scheduled or webhook triggered routines across the development life cycle. The next generation is increasingly described as proactive and long-horizon: agents should notice relevant changes before the developer asks, connect signals across tools, decide when to interrupt, and carry preferences across s...
83	An Empirical Study of Proactive Coding Assistants in Real-World Software Development 2605.05700	cs.SEcs.AI	Lehui Li, Ruixuan Jia, Guo-Ye Yang, Jia Li	Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated deve... Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance. However, research in this direction is limited by the scarcity of large-scale real-world developer...
302	Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs 2605.06111	cs.SEcs.AI	Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao	Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. Ho... Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTO...
315	BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases 2605.06136	cs.SEcs.AI	Jhen-Ke Lin	Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. ... Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they ...
342	Teaching LLMs Program Semantics via Symbolic Execution Traces 2605.06184	cs.SEcs.LGcs.PL	Jonas Bayer, Stefan Zetzsche, Olivier Bouissou, Remi Delmas, Michael Tautschnig	We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy m... We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with program length. To close this gap, we train on formal verification artifacts: running the Soteria symb...
357	A Self-Healing Framework for Reliable LLM-Based Autonomous Agents 2605.06737	cs.SEcs.AI	Cheonsu Jeong, Younggun Shin	Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent r... Autonomous agents based on Large Language Models (LLMs) are increasingly being utilized in complex software systems. However, reliability remains a significant challenge due to unpredictable failures such as hallucinations, execution errors, and inconsistent reasoning. This paper proposes a reliability-aware self-healing framework for LLM-based software agents. The framework integrates failure detection, reliability assessment, and automated recovery mechanisms. First, we define a taxonomy of fa...
401	Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions 2605.06279	cs.SEcs.AI	Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao	Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and c... Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack O...
489	Constraint Decay: The Fragility of LLM Agents in Backend Code Generation 2605.06445	cs.SEcs.AI	Francesco Dente, Dario Satriani, Paolo Papotti	Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and ob... Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural ...
econ.EM 1 papers
467	Covariate Balancing and Riesz Regression Should Be Guided by the Neyman Orthogonal Score in Debiased Machine Learning 2605.06386	econ.EMcs.LGmath.STstat.MEstat.ML	Masahiro Kato	This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can... This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can be represented by functions of covariates alone, and it is the natural finite-dimensional approximation for targets such as ATT counterfactual means. For ATE estimation under treatment effect heterogeneity, however, the score error general...
eess.AS 5 papers
15	Optimal Transport Audio Distance with Learned Riemannian Ground Metrics 2605.05554	eess.AScs.SD	Wonwoo Jeong	In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit th... In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic...
300	NDF+: Joint Neural Directional Filtering and Diffuse Sound Extraction 2605.06108	eess.AS	Weilong Huang, Le Nhat Tam Huynh, Oliver Thiergart, Emanuël A. P. Habets	Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enable... Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural directional filtering and diffuse sound extraction. NDF+ reformulates VDM estimation into two coupled subtasks: dereverberated VDM reconstruction and diffuse sound extraction. This reformulation enables NDF+ to manipulate dif...
347	Predictive-Generative Drift Decomposition for Speech Enhancement and Separation 2605.06189	eess.AScs.LG	Julius Richter, Yoshiki Masuyama, Christoph Boeddeker, Takahiro Edo, Gordon Wichern	We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages the... We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly int...
475	WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling 2605.06407	eess.AScs.AIcs.CL	Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song	Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented feat... Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous laten...
590	Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models 2605.06631	eess.AS	Amir Ivry	Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply d... Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for a deployment-critical query family. We study answer-preserving audio compression, judging a compressor by the excess answer-error it induces, especially for the worst-affected family. We formulate this theoretically as ...
eess.IV 2 papers
536	Histogramless Time-Domain Sketched Fluorescence Lifetime Imaging 2605.06532	eess.IV	Zhenya Zang, Istvan Gyongy, Mike Davies	We present a statistics-aware compression strategy that processes photon timestamps directly from time-correlated single-photon counting (TCSPC) modules for time-domain fluorescence lifetime imaging (FLIM). Rather than storing or transmitting the full histogra... We present a statistics-aware compression strategy that processes photon timestamps directly from time-correlated single-photon counting (TCSPC) modules for time-domain fluorescence lifetime imaging (FLIM). Rather than storing or transmitting the full histogram per pixel, timestamps are projected onto sparse, non-uniform one-dimensional spline sketches, with knot positions optimally allocated based on Fisher information. This knot allocation concentrates sketch channels where the decay signal ex...
588	LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation 2605.06628	eess.IVcs.LGcs.MMeess.ASeess.SP	Dan Jacobellis, Neeraja J. Yadwadkar	Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and per... Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and perceptual quality but are designed for human perception, limiting their applicability to machine-perception tasks and non-traditional modalities such as spatial audio arrays, hyperspectral images, and 3D medical images. General-purpose compre...
eess.SP 2 papers
297	CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision 2605.06100	eess.SPcs.AIcs.LGcs.RO	Liang Qian, Penggao Yan, Penghui Xu, Li-Ta Hsu	Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods already learn measu... Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods already learn measurement weighting through the solver, but they still use position-only objectives. As a result, the mean estimate may improve while the reported covariance remains too small, too large, or wrong in shape. In this work, we propose CredibleDFG...
451	The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition, with source-separable uncertainty as a case study 2605.06359	eess.SPcs.CV	Jihwan Woo	Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quanti... Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quantify this leakage effect for the first time, across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p less than 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an a...
hep-lat 1 papers
314	Diffusion model for SU(N) gauge theories 2605.06134	hep-latcs.LG	Javad Komijani, Marina K. Marinkovic, Lara Turgut	Implicit score matching provides a computationally efficient approach for training diffusion models and generating high-quality samples from complex distributions. In this work, we develop a score-matching framework for SU(N) lattice gauge theories, which can ... Implicit score matching provides a computationally efficient approach for training diffusion models and generating high-quality samples from complex distributions. In this work, we develop a score-matching framework for SU(N) lattice gauge theories, which can be extended to other Lie groups. We apply the method to SU(3) gauge configurations with the Wilson gauge action in two and four dimensions and assess the quality of the generated samples by comparison with Hybrid Monte Carlo (HMC) simulatio...
math.NA 1 papers
88	Convex-Geometric Error Bounds for Positive-Weight Kernel Quadrature 2605.05705	math.NAcs.LGmath.PRstat.ML	Satoshi Hayakawa	Kernel quadrature can exploit RKHS spectral structure and outperform Monte Carlo on smooth integrands, but optimized quadrature weights are generally signed and may be numerically unstable. We study whether spectral acceleration remains possible when the weigh... Kernel quadrature can exploit RKHS spectral structure and outperform Monte Carlo on smooth integrands, but optimized quadrature weights are generally signed and may be numerically unstable. We study whether spectral acceleration remains possible when the weights are constrained to be positive, i.e., simplex weights. In the exact-target fixed-pool setting, an evaluated i.i.d. candidate pool of size $N$ is already available and the task is to reweight it so as to approximate the kernel mean embedd...
math.OC 3 papers
22	Stability of the Monge Map in Semi-Dual Optimal Transport 2605.05569	math.OCcs.LG	Anton Selitskiy, David Millard	This paper shows that the semi-dual formulation of the optimal transport problem has a degenerate saddle-point structure, and that its numerical solution is equivalent to solving a constrained optimization problem. We derive necessary and sufficient conditions... This paper shows that the semi-dual formulation of the optimal transport problem has a degenerate saddle-point structure, and that its numerical solution is equivalent to solving a constrained optimization problem. We derive necessary and sufficient conditions for the convergence of Monge maps without requiring optimality of the dual potential. This analysis helps explain why, in practice, numerical algorithms often require more iterations to update the transport map than the potential.
501	Dynamic Controlled Variables Based Dynamic Self-Optimizing Control 2605.06469	math.OCcs.LGeess.SY	Chenchen Zhou, Shaoqi Wang, Hongxin Su, Xinhui Tang, Yi Cao	Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achieve optimi... Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achieve optimization effects, translating the process optimization problem into a process control problem. Currently, self-optimizing control is widely applied to steady-state optimization problems. However, the development of process systems exhibits a ...
526	Learning to Cut: Reinforcement Learning for Benders Decomposition 2605.06516	math.OCcs.AI	Haochen Cai, Xian Yu	Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing numb... Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing number of cuts. In this paper, we propose Reinforcement Learning for BD (RLBD), a framework that adaptively selects cuts using a neural network-based stochastic policy. The policy is trained using a policy gradient method via the REINFORCE algo...
math.ST 2 papers
126	Optimal Confidence Band for Kernel Gradient Flow Estimator 2605.05768	math.STcs.LGstat.ML	Yuqian Cheng, Zhuo Chen, Qian Lin	In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regre... In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regression literature, we first establish convergence rates for the supremum norm generalization error of both continuous and discrete kernel gradient flows under the source condition $s>α_0$, where $α_0\in(0,1)$ denotes the embedding index of t...
292	Time-Inhomogeneous Preconditioned Langevin Dynamics 2605.06091	math.STcs.LGmath.PRstat.CO	Alexander Falk, Laurenz Nagler, Andreas Habring, Thomas Pock	Langevin sampling from distributions of the form $p(x) \propto \exp(-Ψ(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas... Langevin sampling from distributions of the form $p(x) \propto \exp(-Ψ(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential $Ψ$ exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample co...
physics.chem-ph 1 papers
366	FunctionalAgent: Towards end-to-end on-top functional design 2605.06215	physics.chem-phcs.AI	Yuhao Chen, Donald G. Truhlar, Xiao He	Multiconfiguration pair-density functional theory (MC-PDFT) offers an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predicti... Multiconfiguration pair-density functional theory (MC-PDFT) offers an efficient and accurate framework for computing electronic energies in strongly correlated molecular systems, with the quality of the on-top functional being a key determinant of its predictive accuracy. Here we introduce FunctionalAgent, an agentic system for fully automated functional development. FunctionalAgent orchestrates a team of specialized sub-agents to decompose the development process into dataset construction, acti...
physics.flu-dyn 1 papers
574	AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents 2605.06607	physics.flu-dyncs.AI	Nithin Somasekharan, Rabi Pathak, Manushri Dhanakoti, Tingwen Zhang, Ling Yue	Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completio... Recent LLM-based agents have closed substantial portions of the scientific discovery loop in software-only machine-learning research, in chemistry, and in biology. Extending the same loop to high-fidelity physical simulators is harder, because solver completion does not imply physical validity and many failure modes appear only in field-level imagery rather than in solver logs. We present AI CFD Scientist, an open-source AI scientist for computational fluid dynamics (CFD) that, to our knowledge,...
q-bio.GN 2 papers
258	OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning 2605.06728	q-bio.GNcs.AIq-bio.CB	Maciej Sypetkowski, Joanna Krawczyk, Łukasz Smoliński, Remigiusz Kinas, Przemysław Pietrzak	Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to... Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation...
582	A Linear-Transformer Hybrid for SNP-Based Genotype-to-Phenotype Prediction in Grapevine 2605.06762	q-bio.GNcs.AI	Yibin Wang, Murukarthick Jayakodi, Silvas Kirubakaran, Ambika Chandra, Azlan Zahid	Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Tran... Robust genotype-to-phenotype (G2P) prediction is essential for accelerating breeding decisions and genetic gain. However, it remains challenging to measure complex traits under variable field conditions and across years. In this study, we propose a linear-Transformer approach, LiT-G2P (Linear-Transformer Genotype-to-Phenotype), an automated predictive framework that integrates additive genetic variance effects with Transformer-based nonlinear interactions using genome-wide single-nucleotide poly...
quant-ph 3 papers
43	Quantum Kernels for Parity-Structured Classification: A Hybrid Pipeline 2605.05625	quant-phcs.LG	Tushar Pandey	Parity (XOR) classification requires detecting discrete, high-order feature interactions that smooth classical kernels cannot efficiently capture. We study how quantum kernel advantage depends on parity complexity, the number of features entering the XOR rule,... Parity (XOR) classification requires detecting discrete, high-order feature interactions that smooth classical kernels cannot efficiently capture. We study how quantum kernel advantage depends on parity complexity, the number of features entering the XOR rule, and find a clear threshold behavior. We pair a ZZ quantum feature map with binary {0, pi} encoding (features median thresholded before circuit input) to expose parity structure. A binary encoding ablation, RBF SVM trained on the identical ...
195	Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters 2605.05914	quant-phcs.AIcs.LG	Borja Aizpurua, Sukhbinder Singh, Augustine Kshetrimayum, Saeed S. Jahromi, Roman Orus	Large language models (LLMs) have transformed artificial intelligence, yet classical architectures impose a fundamental constraint: every trainable parameter demands classical memory that scales unfavourably with model size. Quantum computing offers a qualitat... Large language models (LLMs) have transformed artificial intelligence, yet classical architectures impose a fundamental constraint: every trainable parameter demands classical memory that scales unfavourably with model size. Quantum computing offers a qualitatively different pathway, but practical demonstrations on real hardware have remained elusive for models of practical relevance. Here we show that Cayley-parameterised unitary adapters -- quantum circuit blocks inserted into the frozen proje...
207	Architecture Shape Governs QNN Trainability: Jacobian Null Space Growth and Parameter Efficiency 2605.05942	quant-phcs.LG	Michael Poppel, David Bucher, Maximilian Zorn, Markus Baumann, Sebastian Wölckert	Variational quantum circuits with angle encoding implement truncated Fourier series, and architectures arranging $N$ qubits with $L$ encoding layers each -- sharing encoding budget $E = NL$ -- generate identical frequency spectra, identical frequency redundanc... Variational quantum circuits with angle encoding implement truncated Fourier series, and architectures arranging $N$ qubits with $L$ encoding layers each -- sharing encoding budget $E = NL$ -- generate identical frequency spectra, identical frequency redundancy, and require the same minimum parameter count for coefficient control. Despite this equivalence, trainability varies substantially with architecture shape $(N,L)$ at fixed $E$. We identify structural rank deficiency of the coefficient mat...
stat.AP 1 papers
272	Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models 2605.06059	stat.APcs.LG	Jose Benitez-Aurioles, Ricardo Silva, Brian McMillan, Matthew Sperrin	In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity may also determine testing frequency. Such heterogeneous detection rates across a population induce ... In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use ...
stat.ME 3 papers
406	A Topological Sorting Criterion for Random Causal Directed Acyclic Graphs 2605.06288	stat.MEcs.AI	Alexander G. Reisach, Antoine Chambaz, Gilles Blanchard, Sebastian Weichwald	Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, inc... Random directed acyclic graphs (DAGs) based on imposing an order on Erdős-Rényi and scale free random graphs are widely used for evaluating causal discovery algorithms. We show that in such DAGs, the set of nodes reachable via open paths, termed relatives, increases monotonically along the causal order. We assess the prevalence of this pattern numerically, and demonstrate that it can be exploited for causal order recovery via sorting by the estimated number of relatives. We note that many simula...
492	A Statistical Framework for Algorithmic Collective Action with Multiple Collectives 2605.06749	stat.MEcs.AI	Claudio Battiloro, Pietro Greiner, Dario Rancati, Bret Nestor, Oumaima Amezgar	As learning systems increasingly shape everyday decisions, Algorithmic Collective Action (ACA), i.e., users coordinating changes to shared data to steer model behavior, offers a complement to regulator-side policy and corporate model design. Real-world collect... As learning systems increasingly shape everyday decisions, Algorithmic Collective Action (ACA), i.e., users coordinating changes to shared data to steer model behavior, offers a complement to regulator-side policy and corporate model design. Real-world collective actions have traditionally been decentralized and fragmented into multiple collectives, despite sharing overarching objectives, with each collective differing in size, strategy, and actionable goals. However, most of the ACA literature ...
511	Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts 2605.06484	stat.MEcs.LGstat.ML	Steven Wilkins-Reeves, Alexandra N. M. Darmon, Deeksha Sinha	In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more ... In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more readily accessible observation for inference, the ultimate goal is to draw statistical inferences about the primary outcome parameter and proxy data are typically imperfect in some ways. To correct for these imperfections, current statistic...
stat.ML 26 papers
21	Relaxed Sparsest-Permutation Formulation for Causal Discovery at Scale 2605.05568	stat.MLcs.LG	Sunmin Oh, Sang-Yun Oh, Gunwoong Park	Despite the growing availability of large datasets, causal structure learning remains computationally prohibitive at scale. We revisit sparsest-permutation learning for linear structural equation models and show that exact Cholesky factorization is unnecessary... Despite the growing availability of large datasets, causal structure learning remains computationally prohibitive at scale. We revisit sparsest-permutation learning for linear structural equation models and show that exact Cholesky factorization is unnecessary for structure recovery. This observation motivates a support-level relaxation that searches for sparse triangular factors over a precision-support screening graph. The relaxed formulation can be efficiently evaluated via masked zero-fill i...
31	In-Context Positive-Unlabeled Learning 2605.05591	stat.MLcs.LGstat.CO	Siyan Liu, Yi Chang, Manli Cheng, Qinglong Tian, Pengfei Li	Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific tra... Positive-unlabeled (PU) learning addresses binary classification when only a set of labeled positives is available alongside a pool of unlabeled samples drawn from a mixture of positives and negatives. Existing PU methods typically require dataset-specific training or iterative optimization, which limits their applicability when many tasks must be solved quickly or with little tuning. We introduce PUICL, a pretrained transformer that solves PU classification entirely through in-context learning....
37	Variational Smoothing and Inference for SDEs from Sparse Data with Dynamic Neural Flows 2605.05606	stat.MLcs.LGmath.PR	Yu Wang, Arnab Ganguly	Stochastic differential equations (SDEs) provide a flexible framework for modeling temporal dynamics in partially observed systems. A central task is to calibrate such models from data, which requires inferring latent trajectories and parameters from sparse, n... Stochastic differential equations (SDEs) provide a flexible framework for modeling temporal dynamics in partially observed systems. A central task is to calibrate such models from data, which requires inferring latent trajectories and parameters from sparse, noisy observations. Classical smoothing methods for this problem are often limited by path degeneracy and poor scalability. In this work, we developed a novel method based on characterization of the posterior SDE in terms of conditional back...
46	Spherical Flows for Sampling Categorical Data 2605.05629	stat.MLcs.CLcs.LG	Jannis Chemseddine, Gregor Kornhardt, Gabriele Steidl	We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the v... We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity eq...
71	Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization 2605.05683	stat.MLcs.LG	Andy Zeyi Liu, Elliot Paquette, John Sous	Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapte... Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual-view reveals three empirical findings and one mechanistic explanation. First, batch size ac...
111	Fourier Feature Methods for Nonlinear Causal Discovery: FFML Scoring and FFCI Testing in Mixed Data 2605.05743	stat.MLcs.AIcs.LG	Joseph D. Ramsey	Gaussian process marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present two complementary RFF-based methods forming a practical toolki... Gaussian process marginal likelihood scores and kernel conditional independence tests are theoretically appealing for nonlinear causal discovery but computationally prohibitive at scale. We present two complementary RFF-based methods forming a practical toolkit for score-based, constraint-based, and hybrid causal discovery. The Fourier Feature Marginal Likelihood (FFML) score approximates the exact GP marginal likelihood by replacing the n x n kernel Gram matrix with a finite-dimensional featu...
118	Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement 2605.05755	stat.MLcs.AIcs.LG	Haodong Liang, Lifeng Lai	We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can ... We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and...
143	Ratio-based Loss Functions 2605.05808	stat.MLcs.LGmath.ST	Lena Helgerth, Andreas Christmann	Algorithms in machine learning and AI do critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probabi... Algorithms in machine learning and AI do critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probability measures, which are allowed for the specified algorithm. This paper gives a survey of a certain class of loss functions, which we call ratio-based. In supervised learning, margin-based loss functions for classification tasks depending ...
174	CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency 2605.05873	stat.MLcs.AIcs.LGmath.STstat.ME	Hirofumi Ota, Naoto Iwase, Yuki Ichihara, Junpei Komiyama, Masaaki Imaizumi	Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when t... Large language models often improve reasoning by sampling multiple outputs and aggregating their final answers, but precise and efficient control of error levels remains a challenging task. In particular, deciding when to stop sampling remains difficult when the stopping rule is data-dependent and the set of possible answers is not known in advance. We study anytime-valid certification of a prespecified target answer as the unique mode of the model's response distribution, a guarantee distinct f...
177	Tuning Derivatives for Causal Fairness in Machine Learning 2605.05882	stat.MLcs.AIcs.CYcs.LG	Filip Edström, Guilherme W. F. Barros, Tetiana Gorbach, Xavier de Luna	Artificial-intelligence systems are becoming ubiquitous in society, yet their predictions typically inherit biases with respect to protected attributes such as race, gender, or age. Classical fairness notions, most notably Statistical Parity (SP), demand that ... Artificial-intelligence systems are becoming ubiquitous in society, yet their predictions typically inherit biases with respect to protected attributes such as race, gender, or age. Classical fairness notions, most notably Statistical Parity (SP), demand that predictions be independent of the protected attributes, but are overly restrictive when these attributes influence mediating variables that are considered business necessities. Recent causal formulations relax SP by distinguishing allowed f...
223	Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking 2605.05973	stat.MLcs.AIcs.LGstat.AP	Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao	Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for ... Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an ite...
233	TabCF: Distributional Control Function Estimation with Tabular Foundation Models 2605.05993	stat.MLcs.LGstat.MEstat.OT	Geping Chen, Chunlin Li, Tianzhong Yang, Zhengyuan Zhu, Jing Zhou	Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. I... Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, s...
236	Gaussian mixture models in Hilbert spaces via kernel methods 2605.05996	stat.MLcs.LG	Daniel López-Montero, Antonio Álvarez-López, Marcos Matabuena	Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measur... Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop effic...
336	Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective 2605.06172	stat.MLcs.LGmath.NAmath.PR	Meira Iske, Carola-Bibiane Schönlieb	Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For ... Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Li...
358	When Does Trimming Help Conformal Prediction? A Retained-Law Diagnostic under Calibration Contamination 2605.06204	stat.MLcs.LG	Congye Wang	Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-thre... Trimming suspicious calibration points is a common response to contamination in conformal prediction. Its effect on clean-target coverage, however, is governed by the retained law induced by trimming, not by the contamination level alone. We analyse fixed-threshold trimming as conditioning rather than purification. It replaces the contaminated calibration law with a retained law, reducing clean-target coverage to a one-dimensional score-CDF transfer problem with an exact finite-sample identity. ...
361	Super-Level-Set Regression: Conditional Quantiles via Volume Minimization 2605.06210	stat.MLcs.AIcs.LGstat.APstat.ME	Sacha Braun, Michael I. Jordan, Francis Bach	Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-st... Constructing minimum-volume prediction regions that satisfy conditional coverage is a fundamental challenge in multivariate regression. Standard approaches rely on explicitly estimating the full conditional density and subsequently thresholding it. This two-step plug-in process is notoriously difficult, sensitive to estimation errors, and computationally expensive. One would like to instead optimize the region directly. Formulating a direct solution is challenging, however, because it requires m...
393	ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees 2605.06265	stat.MLcs.LG	Tianpai Luo, Fangwei Wu, Weichi Wu	Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of \textbf{con}volution-smoothed \textbf{qu}antil\... Quantile regression is a fundamental tool for distributional learning but poses significant optimization challenges for deep models due to the non-smoothness of the pinball loss. We propose ConquerNet, a class of \textbf{con}volution-smoothed \textbf{qu}antil\textbf{e} \textbf{R}eLU neural \textbf{net}works, which yield smooth objectives while preserving the underlying quantile structure. We establish general nonasymptotic risk bounds for ConquerNet under mild conditions, providing minimax guara...
407	Multimodal Deep Generative Model for Semi-Supervised Learning under Class Imbalance 2605.06289	stat.MLcs.AIcs.LG	Heegeon Yoon, Heeyoung Kim	When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicte... When modeling class-imbalanced data, it is crucial to address the imbalance, as models trained on such data tend to be biased towards the majority classes. This problem is amplified under partial supervision, where pseudo-labels for unlabeled data are predicted based on imbalanced labeled data, propagating the bias. While recent semi-supervised models address class imbalance, they typically assume single-modal input data. However, with the growing availability of multimodal data, it is essential...
420	End-to-End Identifiable and Consistent Recurrent Switching Dynamical Systems 2605.06315	stat.MLcs.LG	Carles Balsells-Rodas, Zhengrui Xiang, Xavier Sumba, Yingzhen Li	Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity ... Learning identifiable representations in deep generative models remains a fundamental challenge, particularly for sequential data with regime-switching dynamics. Existing approaches establish identifiability under restrictive assumptions, such as stationarity or limited emission models, and typically rely on variational autoencoder (VAE) estimators, which introduce approximation gaps that limit the recovery of the latent structure. In this work, we address both the theoretical and practical limi...
456	The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models 2605.06367	stat.MLcond-mat.dis-nncs.LG	Flavio Nicoletti, Chenxiao Ma, Enrico Ventura, Luca Saglietti, Stefano Sarao Mannelli	Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models-and potentially exacerbate disparities-remains poorly understood. While models typically transitio... Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models-and potentially exacerbate disparities-remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analy...
459	Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $τ$-Mixing 2605.06373	stat.MLcs.LG	Leon Halgryn, Sophie Langer, Janusz M. Meylahn, E. Moritz Hahn	Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minib... Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $τ$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend ...
477	Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors 2605.06413	stat.MLcs.LG	Richard Bergna, Stefan Depeweg, José Miguel Hernández-Lobato	Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesi... Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic--aleatoric split is not identifiable in general from the posterior...
485	Neural-Actuarial Longevity Forecasting: Anchoring LSTMs for Explainable Risk Management 2605.06438	stat.MLcs.LGq-fin.RM	Davide Rindori	Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity par... Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non-linearities, we propose Hybrid-Lift, a neura...
508	Risk-Controlled Post-Processing of Decision Policies 2605.06479	stat.MLcs.LGmath.ST	Sunay Joshi, Tao Wang, Hamed Hassani, Edgar Dobriban	Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new poli... Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new policy that maximizes agreement with the baseline subject to a chance constraint on a user-specified loss. At the population level, we show that the optimal policy has a threshold structure: it follows the baseline except on contexts where swit...
552	Dynamic Treatment on Networks 2605.06564	stat.MLcs.LG	Bengusu Nar, Jiguang Li, Veronika Ročková, Panos Toulis	In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth tar... In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a ...
575	DARTS: Targeting Prognostic Covariates in Budget-Constrained Sequential Experiments 2605.06608	stat.MLcs.LGstat.ME	Kateryna Husar, Alexander Volfovsky	Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget.... Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget. We introduce Dynamic Adaptive Rerandomization via Thompson Sampling (DARTS), which treats covariate acquisition as a sequential optimization problem embedded within a design-based causal inference task. A budgeted combinatorial Thompson sa...