Today's Frontier Papers
Currently showing 1000 papers
From simulation to pedagogy: structured AI standardized patients for clinical communication training validated through multi-model and randomized evaluation
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
One Size Fits All? Comparing Foundation and Task-specific Models for Retinal Fluid Segmentation
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
Silent numerical failures in large language model-generated pharmacokinetic simulation code: a benchmark against target-controlled infusion validation criteria using the Marsh propofol model
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis
🔥 Citations:
0
Abstract: Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven "black box" issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.
AI Agents Need Both Hardware-Backed Security and Application-Level Guardrails
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
🔥 Citations:
0
Abstract: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
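The step-level diversity rewards described in the abstract can be sketched as simple set-based scores over actions taken across parallel trajectories. The one-liners below are illustrative stand-ins, not DPEPO's exact formulation:

```python
def diverse_action_reward(actions):
    """Fraction of distinct actions taken across parallel trajectories at
    one step; redundant behavior across environments lowers the reward."""
    return len(set(actions)) / len(actions)

def diverse_state_transition_reward(states):
    """Same idea applied to the resulting environment states, rewarding
    trajectories that reach different parts of the state space."""
    return len(set(states)) / len(states)

# Four parallel agents, two of which repeat the same action:
r = diverse_action_reward(["open drawer", "open drawer", "go north", "take key"])
```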
Strategic Bidding in 6G Spectrum Auctions with Large Language Models
🔥 Citations:
0
Abstract: Efficient and fair spectrum allocation is a central challenge in 6G networks, where massive connectivity and heterogeneous services continuously compete for limited radio resources. We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks. Each user equipment (UE) acts as a rational player optimizing its long-term utility through repeated interactions. Using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness, we compare LLM-guided bidding against truthful and heuristic strategies. Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically. Results show that when the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions. However, when these assumptions break -- such as under static budget constraints -- LLMs sustain longer participation and achieve higher utilities, revealing their ability to approximate adaptive equilibria beyond static mechanism design. This work provides the first systematic evaluation of LLM bidders in repeated spectrum auctions, offering new insights into how AI-driven agents can interact strategically and reshape market dynamics in future 6G networks.
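The VCG benchmark referenced above can be illustrated in its simplest form: for a single item, VCG reduces to the Vickrey (second-price) auction, in which the winner pays the externality it imposes on the other bidders. A minimal sketch, not the paper's repeated, budget-constrained setting:

```python
def vickrey_winner_and_payment(bids):
    """Single-item Vickrey (second-price) auction -- the simplest instance
    of the VCG mechanism. The highest bidder wins and pays the
    second-highest bid, which makes truthful bidding a dominant strategy."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    payment = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, payment

# Three UEs report their valuations for one spectrum block:
winner, price = vickrey_winner_and_payment({"UE1": 5.0, "UE2": 3.0, "UE3": 4.0})
# UE1 wins and pays 4.0, the second-highest bid
```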
Quantum Kernel Advantage over Classical Collapse in Medical Foundation Model Embeddings
🔥 Citations:
0
Abstract: We provide evidence of quantum kernel advantage under noiseless simulation in binary insurance classification on MIMIC-CXR chest radiographs using quantum support vector machines (QSVM) with frozen embeddings from three medical foundation models (MedSigLIP-448, RAD-DINO, ViT-patch32). We propose a two-tier fair comparison framework in which both classifiers receive identical PCA-q features. At Tier 1 (untuned QSVM vs. untuned linear SVM, C = 1 both sides), QSVM wins minority-class F1 in all 18 tested configurations (17 at p<0.001, 1 at p<0.01). The classical linear kernel collapses to majority-class prediction on 90-100% of seeds at every qubit count, while QSVM maintains non-trivial recall. At q = 11 (MedSigLIP-448 plateau center), QSVM achieves mean F1 = 0.343 vs. classical F1 = 0.050 (F1 gain = +0.293, p<0.001) without hyperparameter tuning. Under Tier 2 (untuned QSVM vs. C-tuned RBF SVM), QSVM wins all seven tested configurations (mean gain +0.068, max +0.112). Eigenspectrum analysis reveals quantum kernel effective rank reaches 69.80 at q = 11, far exceeding linear kernel rank, while classical collapse remains C-invariant. A full qubit sweep reveals architecture-dependent concentration onset across models. Code: https://github.com/sebasmos/qml-medimage
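The abstract reports kernel "effective rank" from an eigenspectrum analysis but does not define it; a common entropy-based definition (an assumption here) is the exponential of the Shannon entropy of the normalized eigenvalues:

```python
import numpy as np

def effective_rank(K):
    """Effective rank of a symmetric PSD kernel (Gram) matrix: exp of the
    Shannon entropy of its normalized eigenvalue spectrum. A flat spectrum
    gives rank n; a collapsed, rank-1 spectrum gives 1."""
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

# Identity matrix: mass spread evenly over all n eigenvalues
assert abs(effective_rank(np.eye(8)) - 8.0) < 1e-9
```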
MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG
🔥 Citations:
0
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
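The idea of anchoring on high-IDF, information-bearing tokens can be sketched with plain IDF weighting. This is an illustrative stand-in, not MEG's actual scoring function:

```python
import math
from collections import Counter

def idf_weights(corpus):
    """IDF over a token-list corpus; rare, high-IDF tokens act as the
    information-bearing 'semantic anchors' of an answer."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n / df[tok]) for tok in df}

def grounding_score(answer_tokens, evidence_tokens, idf):
    """Fraction of the answer's IDF mass covered by retrieved evidence:
    1.0 when every informative answer token appears in the evidence,
    0.0 when none does (stopword-like tokens carry no weight)."""
    total = sum(idf.get(t, 0.0) for t in answer_tokens)
    covered = sum(idf.get(t, 0.0) for t in answer_tokens if t in evidence_tokens)
    return covered / total if total else 0.0
```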
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
🔥 Citations:
0
Abstract: The deployment of intelligent reinforcement learning (RL) agents on resource-constrained edge devices remains a fundamental challenge due to the substantial memory, computational, and energy requirements of modern deep learning systems. While large language models (LLMs) have emerged as powerful architectures for decision-making agents, their multi-billion parameter scale confines them to cloud-based deployment, raising concerns about latency, privacy, and connectivity dependence. We introduce BitRL, a framework for building RL agents using 1-bit quantized language models that enables practical on-device learning and inference under severe resource constraints. Leveraging the BitNet b1.58 architecture with ternary weights (-1, 0, +1) and an optimized inference stack, BitRL achieves 10-16x memory reduction and 3-5x energy efficiency improvements over full-precision baselines while maintaining 85-98 percent of task performance across benchmarks. We provide theoretical analysis of quantization as structured parameter perturbation, derive convergence bounds for quantized policy gradients under frozen-backbone architectures, and identify the exploration-stability trade-off in extreme quantization. Our framework systematically integrates 1-bit quantized language models with reinforcement learning for edge deployment and demonstrates effectiveness on commodity hardware.
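The ternary weights mentioned above follow BitNet b1.58's absmean quantization: scale a weight matrix by its mean absolute value, then round each entry to the nearest value in {-1, 0, +1}. A minimal sketch:

```python
import numpy as np

def absmean_ternary(W, eps=1e-8):
    """BitNet b1.58-style absmean quantization: scale by the mean absolute
    weight, round to the nearest integer, and clip to {-1, 0, +1}.
    Returns the ternary weights and the per-matrix scale gamma."""
    gamma = np.mean(np.abs(W)) + eps
    q = np.clip(np.round(W / gamma), -1, 1).astype(np.int8)
    return q, gamma

w = np.array([[0.9, -0.05, -1.2]])
q, gamma = absmean_ternary(w)
# q is [[1, 0, -1]]; dequantized weights are approximately q * gamma
```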
SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering
🔥 Citations:
0
Abstract: Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R.
Contextual Linear Activation Steering of Language Models
🔥 Citations:
0
Abstract: Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.
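Fixed-strength steering versus a context-dependent variant can be sketched as follows. The `contextual_strength` rule (topping up each token's projection onto the steering direction to a common target) is a hypothetical illustration, not CLAS's actual mechanism:

```python
import numpy as np

def steer(hidden, direction, strength):
    """Linear activation steering: h' = h + alpha * v for a unit steering
    direction v, with a per-token strength alpha of shape [tokens].
    Fixed-strength steering uses one constant alpha for every token."""
    v = direction / np.linalg.norm(direction)
    return hidden + strength[:, None] * v

def contextual_strength(hidden, direction, target=1.0):
    """Hypothetical context-dependent rule: choose each token's strength
    so its projection onto the steering direction reaches `target`."""
    v = direction / np.linalg.norm(direction)
    return target - hidden @ v
```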
EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT
🔥 Citations:
0
Abstract: Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.
Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data
🔥 Citations:
0
Abstract: Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.
FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training
🔥 Citations:
0
Abstract: The rapid growth in the size of large language models has necessitated partitioning computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique that eliminates the tail latency of state-of-the-art overlap methods for distributed LLM training. The technique aims to mitigate the communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method, termed FlashOverlap, that replaces conventional reduce-scatter and all-gather collective operations with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it offers a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and higher throughput.
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
🔥 Citations:
0
Abstract: Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade-off where rigorous protection leads to over-defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one-shot self-correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.
On the generation of 12-channel electrocardiograms based on a hybrid of diffusion and graph neural network models
🔥 Citations:
0
Abstract: A hybrid VAE-GNN-SSSD model is presented for generating physiologically correct 12-channel electrocardiograms with a duration of 10 seconds. The proposed architecture combines three key components: a variational autoencoder for isolating the morphological components of P-QRS-T, a graph neural network with a partially fixed adjacency matrix to ensure compliance with the biophysical laws of Einthoven and Wilson, and a diffusion model with a structured state space for modeling long-term temporal dependencies. The model can generate signals conditioned on clinical parameters: arrhythmia type, age, sex, and heart rate. Experimental results on the PTB-XL test set showed FID = 0.052 and PRD = 10.8%, comparable with modern methods. A key advantage of the model is its built-in biophysical correctness, confirmed by the MSE metric for Einthoven's law (0.084). Practical effectiveness was confirmed on MIT-BIH arrhythmia classification: augmentation with synthetic data increased Macro F1 from 0.84 to 0.89 (+6%), improved the recognition of rare ventricular and fused contractions by 5-7%, and reduced the false omission of dangerous arrhythmias by 27%. The model demonstrated good generalization on independent ICU data (MIMIC-IV-ECG). These results open up prospects for training diagnostic systems, pathology simulation, the creation of digital heart twins, and the training of medical specialists, addressing the shortage of annotated data while preserving patient privacy.
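The Einthoven-law MSE reported above is straightforward in principle: Einthoven's law states that Lead II equals Lead I plus Lead III at every instant, so generated limb leads can be scored by the mean squared residual of that identity. This is one plausible reading of the paper's metric, not its confirmed definition:

```python
import numpy as np

def einthoven_mse(lead_i, lead_ii, lead_iii):
    """Score a generated ECG against Einthoven's law (II = I + III):
    the mean squared residual of the identity across all samples.
    A physiologically consistent signal yields a residual near zero."""
    residual = lead_ii - (lead_i + lead_iii)
    return float(np.mean(residual ** 2))
```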
Deep learning-assisted proteomic dissection reveals sex biased and shared proteomic patterns in Populus deltoides under waterlogging stress and subsequent recovery
🔥 Citations:
0
Abstract: Sexual dimorphism in dioecious tree species and their proteomic responses under stress represent an underappreciated axis of stress resilience. Here, we investigated sex-biased proteomic patterns in Populus deltoides exposed to long-term waterlogging and subsequent recovery. By integrating iTRAQ-based quantitative proteomics with machine learning approaches, including autoencoders, graph neural networks (GNNs), and SHAP-based model explainability, we explored latent structure within high-dimensional protein abundance data. Unsupervised autoencoders captured dominant stress-related variation but showed limited resolution of sex-associated differences. In contrast, GNN and GraphSAGE embeddings disentangled sex-biased proteins embedded within broader waterlogging and recovery responses, highlighting a structured and topologically coherent proteomic differentiation. SHAP analyses applied to Random Forest and XGBoost models trained on latent features confirmed that waterlogging is a more decisive contributor to sex-biased protein patterns than recovery. Further, in males, selective retention of photosynthetic capacity was evident through increased protein abundance of Photosystem II components (PsbH, PsbR) and light-harvesting proteins (Lhcb1, Lhcb2), coupled with targeted decreases in cytochrome b6f and Photosystem I subunits during stress and a partial rebound post-waterlogging. In contrast, females displayed widespread suppression across the photosynthetic apparatus, suggesting a reduced capacity for functional recovery. Stress-specific modulation of phenylpropanoid and proline pathways further revealed divergent metabolic strategies. Together, our findings reveal sex-biased proteomic plasticity as a critical determinant of recovery potential in P. deltoides and establish a graph-based machine-learning framework for decoding the hidden layers of functional dimorphism under abiotic stress.
Assessing physiological coherence in stress related predictions of large language models: a surrogate based analysis of the Mistral 3 family using wearable HRV data
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
🔥 Citations:
0
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) extends traditional RAG by using knowledge graphs (KGs) to give large language models (LLMs) a structured, semantically coherent context, yielding more grounded answers. However, the GraphRAG reasoning process remains a black box, limiting our ability to understand how specific pieces of structured knowledge influence the final output. Existing explainability (XAI) methods for RAG systems, designed for text-based retrieval, are limited in their ability to interpret an LLM response through the relational structures among knowledge components, creating a critical gap in transparency and trustworthiness. To address this, we introduce XGRAG, a novel framework that generates causally grounded explanations for GraphRAG systems by employing graph-based perturbation strategies to quantify the contribution of individual graph components to the model's answer. We conduct extensive experiments comparing XGRAG against RAG-Ex, an XAI baseline for standard RAG, and evaluate its robustness across various question types, narrative structures, and LLMs. Our results demonstrate a 14.81% improvement in explanation quality over the baseline RAG-Ex across NarrativeQA, FairyTaleQA, and TriviaQA, evaluated by an F1-score measuring alignment between generated explanations and original answers. Furthermore, XGRAG explanations exhibit a strong correlation with graph centrality measures, validating its ability to capture graph structure. XGRAG provides a scalable and generalizable approach towards trustworthy AI through transparent, graph-based explanations that enhance the interpretability of RAG systems.
Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification
🔥 Citations:
0
Abstract: Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision-language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on 11 public biomedical datasets spanning 9 imaging modalities and 10 anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
🔥 Citations:
0
Abstract: Large Language Models (LLMs) excel at general code generation, but their performance drops sharply in enterprise settings that rely on internal private libraries absent from public pre-training corpora. While Retrieval-Augmented Generation (RAG) offers a training-free alternative by providing static API documentation, we find that such documentation typically provides only isolated definitions, leaving a fundamental knowledge gap. Specifically, LLMs struggle with a task-level lack of coordination patterns between APIs and an API-level misunderstanding of parameter constraints and boundary conditions. To address this, we propose MEMCoder, a novel framework that enables LLMs to autonomously accumulate and evolve Usage Guidelines across these two dimensions. MEMCoder introduces a Multi-dimensional Evolving Memory that captures distilled lessons from the model's own problem-solving trajectories. During inference, MEMCoder employs a dual-source retrieval mechanism to inject both static documentation and relevant historical guidelines into the context. The framework operates in an automated closed loop by using objective execution feedback to reflect on successes and failures, resolve knowledge conflicts, and dynamically update memory. Extensive evaluations on the NdonnxEval and NumbaEval benchmarks demonstrate that MEMCoder substantially enhances existing RAG systems, yielding an average absolute pass@1 gain of 16.31%. Furthermore, MEMCoder exhibits vastly superior domain-specific adaptation compared to existing memory-based continual learning methods.
Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance
🔥 Citations:
0
Abstract: Molecular similarity plays a central role in ligand-based drug discovery, such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics, and performs effectively in both ranking molecules for virtual screening and guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measurement for modern AI-aided drug discovery.
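A pretrained embedding distance of the kind described can be computed directly from frozen model outputs. Cosine distance is one illustrative choice; the abstract does not commit to a specific metric:

```python
import numpy as np

def pretrained_embedding_distance(emb_a, emb_b):
    """Cosine distance between frozen embeddings of two molecules:
    0 for identical directions, 1 for orthogonal ones. Lower distance
    suggests greater structural similarity in embedding space."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(1.0 - a @ b)
```

In a virtual-screening loop, candidate molecules would then be ranked by ascending distance to the embedding of a known active compound.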
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
🔥 Citations:
0
Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by their reliance on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E, a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human input and without depending on existing datasets. The system is structured in two stages: (1) we modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics assesses the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.
A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees
🔥 Citations:
0
Abstract: Designing heuristics for combinatorial optimization problems (COPs) is a fundamental yet challenging task that traditionally requires extensive domain expertise. Recently, Large Language Model (LLM)-based Automated Heuristic Design (AHD) has shown promise in autonomously generating heuristic components with minimal human intervention. However, most existing LLM-based AHD methods enforce fixed algorithmic templates to ensure executability, which confines the search to component-level tuning and limits system-level algorithmic expressiveness. To enable open-ended solver synthesis beyond rigid templates, we propose Automated Algorithm Design via Evolutionary Program Trees (A2DEPT), which treats LLMs as system-level algorithm architects. A2DEPT explores the vast program space via a tree-structured evolutionary search with hybrid selection and hierarchical operators, enabling iterative refinement of complete algorithms. To make open-ended generation practical, we enforce executability with a lightweight program-maintenance loop that performs feedback-driven repair. In experiments, A2DEPT consistently outperforms representative LLM-based baselines on both standard and highly constrained benchmarks. On the standard benchmarks, it reduces the mean normalized optimality gap by 9.8% relative to the strongest competing AHD baseline.
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
🔥 Citations:
0
Abstract: Autonomous AI agents extend large language models into full runtime systems that load skills, ingest external content, maintain memory, plan multi-step actions, and invoke privileged tools. In such systems, security failures rarely remain confined to a single interface; instead, they can propagate across initialization, input processing, memory, decision-making, and execution, often becoming apparent only when harmful effects materialize in the environment. This paper presents AgentWard, a lifecycle-oriented, defense-in-depth architecture that systematically organizes protection across these five stages. AgentWard integrates stage-specific, heterogeneous controls with cross-layer coordination, enabling threats to be intercepted along their propagation paths while safeguarding critical assets. We detail the design rationale and architecture of five coordinated protection layers, and implement a plugin-native prototype on OpenClaw to demonstrate practical feasibility. This perspective provides a concrete blueprint for structuring runtime security controls, managing trust propagation, and enforcing execution containment in autonomous AI agents. Our code is available at https://github.com/FIND-Lab/AgentWard .
Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
🔥 Citations:
0
Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
🔥 Citations:
0
Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM's middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.
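The second stage described above (retaining only task-relevant tokens via text-to-vision attention at a middle layer) can be illustrated with a toy top-k selection. This is a hedged, pure-Python stand-in, not the paper's implementation; real pipelines operate on attention tensors, and the name `prune_vision_tokens` is illustrative:

```python
# Toy sketch: keep only the vision tokens receiving the highest
# text-to-vision attention mass, at a fixed retention ratio.
def prune_vision_tokens(attn_scores, keep_ratio=0.055):
    """attn_scores[i] = text-to-vision attention mass on vision token i.

    Returns the indices of retained tokens, preserving original order,
    keeping roughly keep_ratio of the sequence (at least one token).
    """
    k = max(1, round(len(attn_scores) * keep_ratio))
    # Rank tokens by attention mass, take the top k, restore positional order.
    top = sorted(range(len(attn_scores)),
                 key=lambda i: attn_scores[i], reverse=True)[:k]
    return sorted(top)
```

For example, with four tokens and a 50% budget, the two highest-scoring token positions survive in their original order.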
MIMIC: A Generative Multimodal Foundation Model for Biomolecules
🔥 Citations:
0
Abstract: Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split-track encoder-decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC's sequence reconstruction relative to sequence-only inputs, while its learned representations enable state-of-the-art performance on RNA and protein downstream tasks. MIMIC achieves state-of-the-art splicing prediction, and its joint generative formulation enables isoform-aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice-disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD-L1 and hACE2 binding sites produces diverse, high-confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay-dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC's aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.
Comparative Evaluation of Modern Deep Learning Methodologies for Portfolio Optimization
🔥 Citations:
0
Abstract: This study proposes a portfolio optimization framework that integrates advanced deep learning architectures with traditional financial models to enhance risk-adjusted performance. Using historical data from 2015-2023 across equities, ETFs, and bonds, the research evaluates the predictive power of Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), Transformers, and Autoencoders. The models jointly address covariance estimation, return forecasting, dynamic asset allocation, and dimensionality reduction. Hybrid approaches such as Transformer+GNN and Autoencoder+DRL are also explored to capture both relational and temporal market structures. Performance is assessed through backtesting using metrics including volatility, cumulative return, maximum drawdown, annualized return, and Sharpe ratio across seven strategies, including Equal-Weighted, 60/40 allocation, and Mean-Variance Optimization (MVO). Results show that hybrid models provide superior stability and risk control, with Transformer+GNN achieving the lowest volatility and drawdown. MVO, when paired with well-calibrated inputs, delivers the highest cumulative return and Sharpe ratio, highlighting the continued relevance of traditional methods. Standalone DRL underperforms due to limited structural awareness, while Autoencoders exhibit behavior similar to Equal-Weight strategies, emphasizing the need for dynamic policy learning. These findings align with existing literature on relational modeling and feature compression in finance. Overall, the study demonstrates that combining deep learning with financial theory yields robust and adaptive portfolio strategies and suggests exploring latent representations within traditional optimization frameworks to improve scalability and performance.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
🔥 Citations:
0
Abstract: Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. However, in practice, users may provide defective descriptions, which can have a strong effect on code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small model that has been parameter-efficiently finetuned, to automatically detect task description defects. We evaluate SpecValidator on three types of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) on three benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs to task description defects depends primarily on the type of defect and the characteristics of the task description, rather than the capacity of the model, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.
Scaling on-device spoken language understanding to new languages with large language models
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction
🔥 Citations:
0
Abstract: Recent work on large language models (LLMs) has emphasized the importance of scaling inference compute. From this perspective, the state-of-the-art method Scattered Forest Search (SFS) has been proposed, employing Monte Carlo Tree Search with carefully crafted initial seeds and textual optimization for multi-turn code correction. However, its complexity makes it unclear what factors contribute to improvements in inference performance. To address this problem, we analyze SFS and propose a simpler method, Iterative Refinement of Textual Directions (IRTD), which fixes initial codes and iteratively refines textual directions. Because of the simplicity of IRTD, we theoretically establish the safety of IRTD using Oracle-Guided Inductive Synthesis (OGIS). Experiments on several code generation benchmarks suggest that IRTD achieves inference performance comparable to state-of-the-art methods. These results indicate that, even without complex search structures, refining initial codes with high-quality textual directions alone can effectively improve inference performance.
AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
🔥 Citations:
0
Abstract: Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per-step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per-step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi-agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache-based plan reuse thus offers a practical path to low-latency, low-cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.
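The cache-then-fall-back pattern the abstract describes can be sketched minimally. This is an assumption-laden illustration, not the AgenticCache implementation: `PlanCache` and `llm_planner` are hypothetical names, and the asynchronous background validation is reduced here to a synchronous cache fill:

```python
# Minimal sketch of cache-driven plan reuse: consult a cache of
# (current plan -> next plan) transitions first, and only fall back
# to an expensive LLM call on a miss.
class PlanCache:
    def __init__(self, llm_planner):
        self.transitions = {}       # current plan -> cached next plan
        self.llm_planner = llm_planner
        self.hits = 0
        self.misses = 0

    def next_plan(self, current_plan):
        if current_plan in self.transitions:
            self.hits += 1          # plan locality: reuse without an LLM call
            return self.transitions[current_plan]
        self.misses += 1
        plan = self.llm_planner(current_plan)   # per-step LLM call (expensive)
        self.transitions[current_plan] = plan   # the background updater would
        return plan                             # later re-validate this entry
```

The point of the sketch is that once a transition is cached, repeated steps hit the dictionary instead of the model, which is where the latency and token savings come from.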
DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models
🔥 Citations:
0
Abstract: Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform Process Reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein and molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM-DLLM.
Disagreement as Signals: Dual-view Calibration for Sequential Recommendation Denoising
🔥 Citations:
0
Abstract: Sequential recommendation seeks to model the evolution of user interests by capturing temporal user intent and item-level transition patterns. Transformer-based recommenders demonstrate a strong capacity for learning long-range and interpretable dependencies, yet remain vulnerable to behavioral noise that is misaligned with users' true preferences. Recent large language model (LLM)-based approaches attempt to denoise interaction histories through static semantic editing. Such methods neglect the learning dynamics of recommendation models and fail to account for the evolving nature of user interests. To address this limitation, we propose a Dual-view Calibration framework for Sequential Recommendation denoising (DC4SR). Specifically, we introduce a semantic prior, derived from an LLM fine-tuned via labeled historical interactions, to estimate the noise distribution from a semantic perspective. From the learning perspective, we further employ a model-side posterior that infers the noise distribution based on the model's learning dynamics. The disagreement between the two distributions is then leveraged to jointly refine semantic understanding and learning-aware model-side representations. Through iterative updates, dynamic dual-view calibration is achieved for both the global semantic prior and the model-side posterior, enabling consistent alignment with evolving user interests. Extensive experiments demonstrate that DC4SR consistently outperforms strong Transformer-based recommenders and LLM-based denoising methods, exhibiting enhanced robustness across training stages and noise conditions.
Robust Deepfake Detection, NTIRE 2026 Challenge: Report
🔥 Citations:
1
Abstract: Robustness is a long-overlooked problem in deepfake detection. However, detection performance is nearly worthless in the real world if it suffers under exposure to even slight image degradation. In addition to weaker degradations that can accidentally occur in the image processing pipeline, there is another risk of malicious deepfakes that specifically introduce degradations, purposefully exploiting the detector's weaknesses in that regard. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses that problem. Participants were tasked with building a detector that would later be tested on an unknown test-set, which included both common and uncommon degradations of various strengths. With a total number of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24h to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test-set to detect any such overfitting. This report presents the competition setting, dataset preparation, as well as details and performance of methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
🔥 Citations:
0
Abstract: Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.
Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning
🔥 Citations:
0
Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
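The threshold-refitting step at the heart of calibration replay can be illustrated with the standard split-conformal quantile. This is a generic sketch of that textbook construction under the abstract's classification setting, not the authors' code; `conformal_threshold` and `prediction_set` are illustrative names:

```python
import math

# Split-conformal sketch: after each model update, refit a per-task
# threshold on a small held-out buffer of m nonconformity scores, then
# form prediction sets that include every label scoring below it.
def conformal_threshold(nonconformity_scores, alpha=0.1):
    """Finite-sample-valid quantile: the ceil((m+1)(1-alpha))-th smallest
    score in the calibration buffer (clipped to the buffer size)."""
    m = len(nonconformity_scores)
    k = math.ceil((m + 1) * (1 - alpha))
    return sorted(nonconformity_scores)[min(k, m) - 1]

def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is <= threshold."""
    return {label for label, s in label_scores.items() if s <= threshold}
```

Refitting only this threshold after each update is cheap (a sort over the buffer), which is consistent with the abstract's claim of no training-time gradient cost.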
Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
🔥 Citations:
0
Abstract: While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations
🔥 Citations:
0
Abstract: Driving in compliance with traffic laws and regulations is a basic requirement for human drivers, yet autonomous vehicles (AVs) can violate these requirements in diverse real-world scenarios. To encode law compliance into AV systems, conventional approaches use formal logic languages to explicitly specify behavioral constraints, but this process is labor-intensive, hard to scale, and costly to maintain. With recent advances in artificial intelligence, it is promising to leverage large language models (LLMs) to derive legal requirements from traffic laws and regulations. However, without explicitly grounding and reasoning in structured traffic scenarios, LLMs often retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements. To address this, we propose a novel pipeline that grounds LLM reasoning in a traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. On Chinese traffic laws and the OnSite dataset (5,897 scenarios), our method improves law-scenario matching by 29.1% and increases the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively. We further demonstrate real-world applicability by constructing a law-compliance layer for AV navigation and developing an onboard, real-time compliance monitor for in-field testing, providing a solid foundation for future AV development, deployment, and regulatory oversight.
AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models
🔥 Citations:
0
Abstract: Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.
K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
🔥 Citations:
0
Abstract: The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench .
PeeriScope: A Multi-Faceted Framework for Evaluating Peer Review Quality
🔥 Citations:
0
Abstract: The increasing scale and variability of peer review in scholarly venues has created an urgent need for systematic, interpretable, and extensible tools to assess review quality. We present PeeriScope, a modular platform that integrates structured features, rubric-guided large language model assessments, and supervised prediction to evaluate peer review quality along multiple dimensions. Designed for openness and integration, PeeriScope provides both a public interface and a documented API, supporting practical deployment and research extensibility. The demonstration illustrates its use for reviewer self-assessment, editorial triage, and large-scale auditing, and it enables the continued development of quality evaluation methods within scientific peer review. PeeriScope is available both as a live demo at https://app.reviewer.ly/app/peeriscope and via API services at https://github.com/Reviewerly-Inc/Peeriscope.
The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models
🔥 Citations:
0
Abstract: Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term "Persona Collapse": agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations. We release our toolkit and data to support population-level evaluation of LLMs.
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
🔥 Citations:
0
Abstract: Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels (hidden dimensions with unusually large activations), causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.
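The greedy weighted set-cover selection the abstract attributes to COVERCAL follows the standard submodular-greedy recipe, which can be sketched as below. This is a minimal illustration under assumed inputs (per-sample sets of activated outlier channels and per-channel weights); the name `select_calibration` and the data layout are hypothetical, not the paper's API:

```python
# Greedy weighted set cover over outlier channels: at each step, pick the
# calibration sample whose uncovered channels carry the most total weight.
def select_calibration(activations, channel_weights, budget):
    """activations[i] is the set of outlier channels sample i activates;
    channel_weights maps channel id -> importance weight.

    Returns (chosen sample indices in pick order, covered channel set)."""
    covered = set()
    chosen = []
    remaining = set(range(len(activations)))
    for _ in range(budget):
        best, best_gain = None, 0.0
        for i in remaining:
            # Marginal gain = total weight of newly covered channels.
            gain = sum(channel_weights[c] for c in activations[i] - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:       # no sample adds new weighted coverage
            break
        chosen.append(best)
        covered |= activations[best]
        remaining.remove(best)
    return chosen, covered
```

Because the objective is monotone submodular, this greedy rule inherits the usual (1 - 1/e) approximation guarantee, and it touches only precomputed statistics, consistent with the "no GPU time at selection" claim.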
Kwai Summary Attention Technical Report
🔥 Citations:
0
Abstract: Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, causing the training and inference costs of extremely long sequences to deteriorate rapidly. Existing solutions mitigate this issue through two technical routes: i) reducing the KV cache per layer, for example via head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache remains linearly dependent on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) or linear kernels (GDN), which often involve trade-offs between KV cache size and long-context modeling effectiveness. Beyond these two routes, we argue that there exists an intermediate path not well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache", but rather trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
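The O(n/k) idea above can be made concrete with a toy: compress the token history at a fixed ratio k by summarizing each block of k vectors into one. Mean pooling stands in here for the learnable summary tokens the report describes, and `compress_kv` is an illustrative name, not KSA's actual mechanism:

```python
# Toy semantic-level compression at ratio k: a history of n vectors becomes
# ceil(n/k) block summaries, so cache size stays linear in n but shrinks by k.
def compress_kv(history, k):
    """history is a list of equal-length vectors (lists of floats);
    returns one mean-pooled summary vector per block of k vectors."""
    summaries = []
    for start in range(0, len(history), k):
        block = history[start:start + k]
        dim = len(block[0])
        summaries.append(
            [sum(v[d] for v in block) / len(block) for d in range(dim)]
        )
    return summaries
```

Unlike sliding-window schemes, every block of the distant past keeps a representative in the compressed cache, which is the "complete, referential" retention the abstract argues for.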
How Personal Characteristics Shape User Exploration of Diverse Movie Recommendations with a LLM-Based Multi-Agent System
🔥 Citations:
0
Abstract: Diversity is an important evaluation criterion for recommender systems beyond accuracy, yet users differ in their willingness to engage with novel and diverse content. In this work, we investigate how a Large Language Model (LLM)-based multi-agent system supports users' exploration of diverse recommendations, and how individual characteristics shape user experiences. We conducted a between-subjects user study (N = 100) comparing a single-agent system (baseline) with a multi-agent system for movie recommendations. We measured perceived accuracy, diversity, novelty, and overall rating, and examined the influence of personal characteristics, including personality traits, demographics, GenAI recommendation experience, and GenAI skepticism. Results show that the multi-agent system significantly increases perceived novelty and Shannon diversity. Conscientiousness is positively associated with perceived accuracy and diversity, whereas extraversion is negatively associated with perceived diversity. Prior experience with GenAI-based recommendations is positively associated with Shannon diversity, while skepticism toward GenAI is negatively associated with it. We also observe significant interaction effects between system design and user characteristics. These findings highlight the importance of personality-aware conversational recommender systems and caution against one-size-fits-all multi-agent designs.
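The Shannon diversity measured above is the Shannon entropy of the recommended items' category distribution; a minimal sketch over hypothetical movie genres (the study's exact item categorization is an assumption here):

```python
import math
from collections import Counter

def shannon_diversity(items):
    """Shannon entropy (bits) of the category distribution of a
    recommendation list; higher means a more even spread over categories."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = ["drama", "comedy", "horror", "sci-fi"]  # evenly spread genres
skewed = ["drama", "drama", "drama", "comedy"]     # concentrated on one genre
print(shannon_diversity(uniform))  # 2.0 bits
print(shannon_diversity(skewed))   # lower entropy
```

A uniform list over k genres attains the maximum of log2(k) bits, so the metric rewards even exploration rather than sheer list length.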
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
🔥 Citations:
0
Abstract: Emergent intelligence has played a major role in modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study develops a mathematical approach to formalizing emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function $\mathcal{E}(N, P, K)$, dependent on data size N, model size P, and training steps K, to quantify intelligent behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as the existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove necessary and sufficient conditions for the existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools from Lipschitz operator theory and covering numbers. Theoretical results show that: 1) emergent intelligence is governed by three key factors (training steps, data size, and model architecture), where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings; 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
🔥 Citations:
0
Abstract: The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.
Stabilizing Efficient Reasoning with Step-Level Advantage Selection
🔥 Citations:
0
Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, many approaches are post-trained under a much shorter context window than base-model training, a factor whose effect has not been systematically isolated. We first show that short-context post-training alone, using standard GRPO without any length-aware objective, already induces substantial reasoning compression, but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step-level Advantage Selection (SAS), which operates at the reasoning-step level and assigns a zero advantage to low-confidence steps in correct rollouts and to high-confidence steps in verifier-failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy-efficiency trade-off.
Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style
🔥 Citations:
0
Abstract: Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ($n=81$) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants' unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text remains closer in style to LLM text than to participants' unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity: post-edited text is often perceived as representative of participants' personal style despite retaining detectable LLM stylistic traces.
Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer
🔥 Citations:
0
Abstract: Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning -- inducing latent structures through sequential exploration and transferring them across contexts -- remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping -- what we term environmental grounding -- before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information -- in both the image-only and text-and-image conditions -- overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
🔥 Citations:
0
Abstract: Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
Culture-Aware Machine Translation in Large Language Models: Benchmarking and Investigation
🔥 Citations:
0
Abstract: Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models' recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.
Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs
🔥 Citations:
0
Abstract: Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (e.g., operational failures or pandemics). Forecasting models can assist in this task by analyzing large volumes of resource-related data at the facility level, but they must be reliable for decision-making under real-world data conditions. Recent work shows that large language models (LLMs) can incorporate richer forms of context into numerical forecasting. Whereas traditional models rely primarily on temporal context (i.e., past observations), LLMs can also leverage non-temporal public health context such as demographic, geographic, and population-level features. However, it remains unclear how these models should be used to produce stable or decision-relevant predictions in real-world healthcare settings. To evaluate how LLMs can be effectively used in this setting, we evaluate three approaches across 60 counties with low-, mid-, and high-hospitalization intensities in the United States: direct LLM-based forecasting, classical time-series models, and a context-augmented hybrid pipeline (HybridARX) that incorporates LLM-derived signals into structured models. Because the goal is operational decision-making rather than error minimization alone, we evaluate performance with bias and lead-lag alignment in addition to standard forecasting metrics. Our results show that HybridARX improves over classical ARX by yielding more stable and better-calibrated forecasts, particularly when incorporating noisy contextual signals into structured time-series models. These findings suggest that, in non-stationary healthcare resource forecasting, LLMs are most useful when embedded within structured hybrid models.
Why AI Harms Can't Be Fixed One Identity at a Time: What 5300 Incident Reports Reveal About Intersectionality
🔥 Citations:
1
Abstract: AI risk assessment is the primary tool for identifying harms caused by AI systems. These include intersectional harms, which arise from the interaction between identity categories (e.g., class and skin tone) and which do not occur, or occur differently, when those categories are considered separately. Yet existing AI risk assessments are still built around isolated identity categories, and when intersections are considered, they focus almost exclusively on race and gender. Drawing on a large-scale analysis of documented AI incidents, we show that AI harms do not occur one identity category at a time. Using a structured rubric applied with a Large Language Model (LLM), we analyze 5,300 reports from 1,200 documented incidents in the AI Incident Database, the most curated source of incident data. From these reports, we identify 1,513 harmed subjects and their associated identity categories, achieving 98% accuracy. At the level of individual categories, we find that age and political identity appear in documented AI harms at rates comparable to race and gender. At the level of intersecting categories, harm is amplified up to three times at specific intersections: adolescent girls, lower-class people of color, and upper-class political elites. We argue that intersectionality should be a core component of AI risk assessment to more accurately capture how harms are produced and distributed across social groups.
Mapping the Panorama of Enterprise Digital Transformation: An LLM-Driven Variable Relational Network Perspective
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction
🔥 Citations:
0
Abstract: Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
🔥 Citations:
0
Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
Zero-shot Large Language Models for Automatic Readability Assessment
🔥 Citations:
0
Abstract: Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.
Crystal structure prediction using graph neural combinatorial optimization
🔥 Citations:
0
Abstract: Crystalline materials are widely used in technological applications, yet their discovery remains a significant challenge. As their properties are driven by structure, crystal structure prediction (CSP) methods play a central role in computational approaches aiming to accelerate this process. Previously, CSP has been approached from a combinatorial optimization perspective, with the core challenge of allocating atoms on a fine grid of predefined discrete positions within a unit cell while minimizing their interaction energy. Exact mathematical optimization methods provide guaranteed solutions, but they become computationally expensive for large-scale instances, where the atomic configuration space grows rapidly, particularly in the absence of additional symmetry constraints. In this work, we introduce a neural combinatorial optimization approach to the atom allocation challenge and, subsequently, CSP, based on graph neural networks (GNNs), which can effectively sample from the distribution of feasible structures in an unsupervised manner. We leverage expander graphs to construct computational graphs over discrete positions that capture both short- and long-range interactions between atoms, and employ the Gumbel-Sinkhorn approach to enforce the desired stoichiometry of the generated structures. We demonstrate that our method outperforms classical heuristic approaches and is competitive with a commercial optimization solver across a range of chemical compositions. This enables the use of ever-expanding GPU infrastructure to tackle the inherent combinatorial challenges of CSP, paving the way for scaling beyond current capabilities.
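The Gumbel-Sinkhorn step used to enforce stoichiometry can be illustrated by plain Sinkhorn normalization, which relaxes a hard atom-to-site permutation into a doubly stochastic matrix. A minimal sketch under that simplification (no Gumbel noise or temperature annealing; names and data are illustrative):

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Alternate row/column normalization of exp(logits), converging toward
    a doubly stochastic matrix, i.e., a relaxed atom-to-site assignment."""
    x = np.exp(logits - logits.max())      # subtract max for numerical stability
    for _ in range(n_iters):
        x /= x.sum(axis=1, keepdims=True)  # rows sum to 1
        x /= x.sum(axis=0, keepdims=True)  # columns sum to 1
    return x

rng = np.random.default_rng(0)
p = sinkhorn(rng.normal(size=(4, 4)))
print(p.sum(axis=0))  # each column sums to ~1
print(p.sum(axis=1))  # each row sums to ~1
```

Each row can be read as a soft assignment of one atom over candidate grid positions, with the column constraint preventing two atoms from occupying the same site.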
Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
🔥 Citations:
0
Abstract: Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
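The core scoring step can be sketched as a diagonal Mahalanobis distance on inter-layer hidden-state differences, assuming per-dimension means and variances of those differences were estimated on clean calibration examples. The Ledoit-Wolf aggregation and leave-one-out thresholding from the abstract are omitted, and all names and data are illustrative:

```python
import numpy as np

def layer_scores(hidden_states, mu, var, eps=1e-6):
    """Diagonal Mahalanobis distance per inter-layer transition.

    hidden_states: (L, d) hidden vectors for one input, one row per layer.
    mu, var: (L-1, d) mean/variance of inter-layer deltas on clean data.
    """
    deltas = np.diff(hidden_states, axis=0)  # (L-1, d) inter-layer differences
    z = (deltas - mu) ** 2 / (var + eps)     # per-dimension squared z-scores
    return z.mean(axis=1)                    # one score per layer transition

# Toy demo: standard-normal deltas vs. one anomalous jump at layer 2.
L, d = 4, 8
mu, var = np.zeros((L - 1, d)), np.ones((L - 1, d))
rng = np.random.default_rng(1)
clean = rng.normal(size=(L, d)).cumsum(axis=0)  # deltas ~ N(0, 1)
shifted = clean.copy()
shifted[2] += 5.0                               # inject an anomalous hidden state
print(layer_scores(clean, mu, var))
print(layer_scores(shifted, mu, var))           # transitions around layer 2 spike
```

An off-trajectory hidden state inflates the score on both transitions that touch it, which is what lets a single threshold on the aggregated score flag misbehavior.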
BiMol-Diff: A Unified Diffusion Framework for Molecular Generation and Captioning
🔥 Citations:
0
Abstract: Bridging molecular structures and natural language is essential for controllable design. Autoregressive models struggle with long-range dependencies, while standard diffusion processes apply uniform corruption across positions, which can distort structurally informative tokens. We present BiMol-Diff, a unified diffusion framework for the paired tasks of text-conditioned molecule generation and molecule captioning. Our key component is a token-aware noise schedule that assigns position-dependent corruption based on token recovery difficulty, preserving harder-to-recover substructures during the forward process. On ChEBI-20 and M3-20M, BiMol-Diff improves molecule reconstruction with a 15.4% relative gain in Exact Match and achieves strong captioning results, attaining best BLEU and BERTScore among compared baselines. These results indicate token-aware noising improves fidelity in molecular structure-language modelling.
Fraud Detection in Cryptocurrency Markets with Spatio-Temporal Graph Neural Networks
🔥 Citations:
0
Abstract: Technological advancements in cryptocurrency markets have increased accessibility for investors, but concurrently exposed them to the risks of market manipulations. Existing fraud detection mechanisms typically rely on machine learning methods that treat each financial asset (i.e., token) and its related transactions independently. However, market manipulation strategies are rarely isolated events; rather, they are characterized by coordination, repetition, and frequent transfers among related assets. This suggests that relational structure constitutes an integral component of the signal and can be effectively represented through graphical means. In this paper, we propose three graph construction methods that rely on aggregated hourly market data. The proposed graphs are processed by a unified spatio-temporal Graph Neural Network (GNN) architecture that combines attention-based spatial aggregation with temporal Transformer encoding. We evaluate our methodology on a real-world dataset of pump-and-dump schemes in cryptocurrency markets, spanning a period of over three years. Our comparative results showcase that our graph-based models achieve significant improvements over standard machine learning baselines in detecting anomalous events. Our work highlights that learned market connectivity provides substantial gains for detecting coordinated market manipulation schemes.
Representational Curvature Modulates Behavioral Uncertainty in Large Language Models
🔥 Citations:
0
Abstract: In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input sequences across layers, potentially facilitating next-token prediction via linear extrapolation. However, a direct link between this trajectory and token-level behavior has been missing. We provide such a link by relating contextual curvature, a geometric measure of how sharply the representational trajectory bends over recent context, to next-token entropy. Across two models (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments reveal selective dependence: manipulating curvature through trajectory-aligned interventions reliably modulates entropy, while geometrically misaligned perturbations have no effect. Finally, regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences behavioral uncertainty in LLMs.
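One standard way to quantify how sharply a representational trajectory bends is the mean turning angle between consecutive difference vectors; a minimal 2D sketch of that idea (the paper's exact curvature definition may differ):

```python
import numpy as np

def contextual_curvature(traj):
    """Mean turning angle (radians) between consecutive difference vectors
    of a hidden-state trajectory; 0 means a perfectly straight path."""
    v = np.diff(traj, axis=0)
    angles = []
    for a, b in zip(v[:-1], v[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.mean(angles))

straight = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
bent = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(contextual_curvature(straight))  # 0.0
print(contextual_curvature(bent))      # pi/2: two right-angle turns
```

A straightened trajectory (curvature near 0) is exactly the regime where linear extrapolation of the last step is a good predictor of the next representation.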
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
🔥 Citations:
0
Abstract: As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system reliability. This study proposes a thermodynamically inspired modeling framework for analyzing the stability of LLM outputs under conditions of uncertainty and perturbation. The framework introduces a composite stability score that integrates task utility, entropy as a measure of external uncertainty, and two internal structural proxies: internal integration and aligned reflective capacity. Rather than interpreting these quantities as physical variables, the formulation is intended as an interpretable abstraction that captures how internal structure may modulate the impact of disorder on model behavior. Using the IST-20 benchmarking protocol and associated metadata, we analyze 80 model-scenario observations across four contemporary LLMs. The proposed formulation consistently yields higher stability scores than a reduced utility-entropy baseline, with a mean improvement of 0.0299 (95% CI: 0.0247 to 0.0351). The observed gain is more pronounced under higher entropy conditions, suggesting that the framework captures a form of nonlinear attenuation of uncertainty. We do not claim a fundamental physical law or a complete theory of machine ethics. Instead, the contribution of this work is a compact and interpretable modeling perspective that connects uncertainty, performance, and internal structure within a unified evaluation lens. The framework is intended to complement existing benchmarking approaches and to support ongoing discussions in AI safety, reliability, and governance.
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
🔥 Citations:
0
Abstract: Multi-Objective Alignment aims to align Large Language Models (LLMs) with diverse and often conflicting human values by optimizing multiple objectives simultaneously. Existing methods predominantly rely on static preference weight construction strategies. However, rigidly aligning to fixed targets discards valuable intermediate information, as training responses inherently embody valid preference trade-offs even when deviating from the target. To address this limitation, we propose Meal, i.e., MEta ALigner, a bi-level meta-learning framework enabling bidirectional optimization between preferences and policy responses, generating instructive dynamic preferences for steadier training. Specifically, we introduce a preference-weight-net as a meta-learner to generate adaptive preference weights based on input prompts and update the preference weights as learnable parameters, while the LLM policy acts as a base-learner optimizing response generation conditioned on these preferences with rejection sampling strategy. Extensive empirical results demonstrate that our method achieves superior performance on several multi-objective benchmarks, validating the effectiveness of the dynamic bidirectional preference-policy optimization framework.
System-aware contextual digital twin for ICS anomaly diagnosis
🔥 Citations:
0
Abstract: Industrial Control Systems (ICS) integrate computing, physical processes, and communication to operate critical infrastructures such as power grids, water treatment plants, and oil and gas facilities. As ICS become increasingly targeted by cyberattacks, timely and reliable anomaly diagnosis is essential for protecting operational safety. However, existing ICS anomaly detection approaches face practical limitations: supervised methods require extensive labeled attack data and suffer from class imbalance, while model-based detectors often lack the ability to provide deep insight into the root causes of anomalies, leading to elevated false alarms and making it difficult for operators to initiate a timely response. In this work, we propose a system-aware unsupervised framework for ICS anomaly diagnosis that combines lightweight online detection with contextual explanation. The system identifies deviations from observed normal behaviors without prior knowledge of system topology. To support actionable response, we further couple the detector with a contextual digital twin augmented by a Large Language Model (LLM) to enhance interpretability, which translates detection evidence into grounded diagnostic hypotheses and verification steps for operators. Experiments on public ICS benchmarks demonstrate that the proposed framework achieves real-time detection efficiency and provides consistent, interpretable anomaly diagnoses, enabling low-latency warning and practical deployment in complex industrial environments.
Skill Retrieval Augmentation for Agentic AI
🔥 引用:
0
Abstract: As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning
🔥 引用:
0
Abstract: Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted cases, and a clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction. Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
🔥 引用:
1
Abstract: Large Language Models (LLMs) are increasingly embedded in software engineering (SE) tools, powering applications such as code generation, automated code review, and bug triage. As these LLM-based AI for Software Engineering (AI4SE) systems transition from experimental prototypes to widely deployed tools, the question of what it means to evaluate their behavior reliably has become both critical and unanswered. Unlike traditional SE or machine learning systems, LLM-based tools often produce open-ended, natural language outputs, admit multiple valid answers, and exhibit non-deterministic behavior across runs. These characteristics fundamentally challenge long-standing evaluation assumptions such as the existence of a single ground truth, deterministic outputs, and objective correctness. In this paper, we examine LLM evaluation as a general, task-dependent concept through the lens of SE tasks. We discuss why reliable evaluation is essential for trust, adoption, and meaningful assessment of LLM-based tools, summarize the current state of evaluation practices, and highlight their limitations in realistic AI4SE settings. We then identify key challenges facing current approaches, including the absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability due to non-determinism, limitations of automated and model-based evaluation, and fragmentation of evaluation practices. Finally, we outline future directions aimed at advancing LLM evaluation toward more robust, scalable, and trustworthy methodologies, to stimulate discussion on principled evaluation practices that can keep pace with the growing role of LLMs in SE.
What Did They Mean? How LLMs Resolve Ambiguous Social Situations across Perspectives and Roles
🔥 引用:
0
Abstract: People increasingly turn to large language models (LLMs) to interpret ambiguous social situations: a delayed text reply, an unusually cold supervisor, a teacher's mixed signals, or a boundary-crossing friend. Yet in many such cases, no stable interpretation can be verified from the available evidence alone. We study how LLMs respond to these situations across four domains: early-stage romantic relationships, teacher–student dynamics, workplace hierarchies, and ambiguous friendships. Across 72 responses from GPT, Claude, and Gemini, only 9 (12.5%) genuinely preserved uncertainty. The remaining 87.5% produced interpretive closure through recurring pathways including narrative alignment, narrative reversal, normative advice under uncertainty, and hedged language that still supported a single conclusion. We further find that narrator perspective shapes the path to closure: first-person accounts more often elicited alignment, while third-person accounts invited more detached interpretation, even when the underlying situation remained comparable. Together, these findings show that LLMs do not simply assist interpersonal sensemaking; they tend to resolve ambiguity into coherent and actionable narratives. These results suggest that the central risk is not only that LLMs may misinterpret social situations, but that they may make unresolved situations feel prematurely settled. We frame this tendency as a design challenge for uncertainty-preserving social AI.
Multi-Hospital Electronic Health Record Foundation Models Without Data Sharing: A Comparison of Federated Learning and Inference-Time Ensembling
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
🔥 引用:
0
Abstract: While Vision-Language-Action (VLA) models have demonstrated strong zero-shot generalization for robot control, their massive parameter sizes typically necessitate cloud-based deployment. However, cloud deployment introduces network jitter and inference latency, which can induce severe spatiotemporal misalignment in mobile navigation under continuous displacement, so that the stale intents expressed in past ego frames may become spatially incorrect in the current frame and lead to collisions. To address this issue, we propose AsyncShield, a plug-and-play asynchronous control framework. AsyncShield discards traditional black-box time-series prediction in favor of a deterministic physical white-box spatial mapping. By maintaining a temporal pose buffer and utilizing kinematic transformations, the system accurately converts temporal lag into spatial pose offsets to restore the VLA's original geometric intent. To balance intent restoration fidelity and physical safety, the edge adaptation is formulated as a constrained Markov decision process (CMDP). Solved via the PPO-Lagrangian algorithm, a reinforcement learning adapter dynamically trades off between tracking the VLA intent and responding to high-frequency LiDAR obstacle avoidance hard constraints. Furthermore, benefiting from a standardized universal sub-goal interface, domain randomization, and perception-level adaptation via Collision Radius Inflation, AsyncShield operates as a lightweight, plug-and-play module. Simulation and real-world experiments demonstrate that, without fine-tuning any cloud-based foundation models, the framework exhibits zero-shot and robust generalization capabilities, effectively improving the success rate and physical safety of asynchronous navigation.
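The "white-box spatial mapping" idea, as we read it, can be sketched in 2D: a sub-goal the cloud VLA issued in the ego frame at command time is re-expressed in the current ego frame using the two buffered world poses. The function name and the SE(2) simplification are our assumptions; the paper's full kinematics may differ.

```python
import math

# Sketch of converting temporal lag into a spatial pose offset: lift a stale
# ego-frame sub-goal into the world frame at command time t0, then project it
# back into the current ego frame. Poses are (x, y, yaw) in one world frame.
# This 2D SE(2) version is illustrative, not AsyncShield's implementation.

def remap_goal(goal, pose_t0, pose_now):
    x0, y0, th0 = pose_t0
    gx, gy, gth = goal
    # 1) t0 ego frame -> world frame
    wx = x0 + gx * math.cos(th0) - gy * math.sin(th0)
    wy = y0 + gx * math.sin(th0) + gy * math.cos(th0)
    wth = th0 + gth
    # 2) world frame -> current ego frame
    x1, y1, th1 = pose_now
    dx, dy = wx - x1, wy - y1
    ex = dx * math.cos(th1) + dy * math.sin(th1)
    ey = -dx * math.sin(th1) + dy * math.cos(th1)
    return (ex, ey, wth - th1)
```

For example, if the robot has driven 1 m forward since the command was issued, a goal "2 m ahead" in the stale frame correctly becomes "1 m ahead" in the current frame.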
A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations
🔥 引用:
0
Abstract: Fine-tuning unlocks large language models (LLMs) for specialized applications, but its high computational cost often puts it out of reach for resource-constrained organizations. While cloud platforms could provide the needed resources, data privacy concerns make sharing sensitive information with third parties risky. A promising solution is split learning for LLM fine-tuning, which divides the model between clients and a server, allowing collaborative and secure training through exchanged intermediate data, thus enabling resource-constrained participants to adapt LLMs safely. In light of this, a growing body of literature has emerged to advance this paradigm, introducing varied model methods, system optimizations, and privacy defense-attack techniques for split learning. To bring clarity and direction to the field, a comprehensive survey is needed to classify, compare, and critique these diverse approaches. This paper fills the gap by presenting the first extensive survey dedicated to split learning for LLM fine-tuning. We propose a unified, fine-grained training pipeline to pinpoint key operational components and conduct a systematic review of state-of-the-art work across three core dimensions: model-level optimization, system-level efficiency, and privacy preservation. Through this structured taxonomy, we establish a foundation for advancing scalable, robust, and secure collaborative LLM adaptation.
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
🔥 引用:
1
Abstract: Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $\tau_b$, $\tau_c$, and Spearman's $\rho$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
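QEVA is validated by rank correlation with human judgments via Kendall's tau. For reference, the tau-b variant (which corrects for ties, as needed when scoring scales are discrete) can be computed directly; this is the standard statistic, not the paper's evaluation code.

```python
# Self-contained Kendall's tau-b: concordant minus discordant pairs,
# normalized with tie corrections in each ranking. O(n^2) pairwise version.

def kendall_tau_b(x, y):
    C = D = tx = ty = 0   # concordant, discordant, tied-only-in-x, tied-only-in-y
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both rankings: excluded from both terms
            if dx == 0:
                tx += 1
            elif dy == 0:
                ty += 1
            elif dx * dy > 0:
                C += 1
            else:
                D += 1
    return (C - D) / ((C + D + tx) * (C + D + ty)) ** 0.5
```

Identical rankings give 1.0, reversed rankings give -1.0, and ties shrink the denominator terms so that partially tied data is handled gracefully.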
GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems
🔥 引用:
0
Abstract: The rapid integration of Large Language Models (LLMs) into Multi-Agent Systems (MAS) has significantly enhanced their collaborative problem-solving capabilities, but it has also expanded their attack surfaces, exposing them to vulnerabilities such as prompt infection and compromised inter-agent communication. While emerging graph-based anomaly detection methods show promise in protecting these networks, the field currently lacks a standardized, reproducible environment to train these models and evaluate their efficacy. To address this gap, we introduce Gammaf (Graph-based Anomaly Monitoring for LLM Multi-Agent systems Framework), an open-source benchmarking platform. Gammaf is not a novel defense mechanism itself, but rather a comprehensive evaluation architecture designed to generate synthetic multi-agent interaction datasets and benchmark the performance of existing and future defense models. The proposed framework operates through two interdependent pipelines: a Training Data Generation stage, which simulates debates across varied network topologies to capture interactions as robust attributed graphs, and a Defense System Benchmarking stage, which actively evaluates defense models by dynamically isolating flagged adversarial nodes during live inference rounds. Through rigorous evaluation using established defense baselines (XG-Guard and BlindGuard) across multiple knowledge tasks (such as MMLU-Pro and GSM8K), we demonstrate Gammaf's high utility, topological scalability, and execution efficiency. Furthermore, our experimental results reveal that equipping an LLM-MAS with effective attack remediation not only recovers system integrity but also substantially reduces overall operational costs by facilitating early consensus and cutting off the extensive token generation typical of adversarial agents.
SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?
🔥 引用:
0
Abstract: Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model's feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model's global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.
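Stage one of SemiSAM-O1, as described, propagates class prototypes from the single annotated template to unlabeled data by feature similarity. A minimal sketch of that step, with prototypes as per-class mean features and nearest-prototype cosine assignment; the helper names and pure-Python shapes are our assumptions, not the paper's implementation.

```python
# Sketch of prototype-based coarse pseudo-labeling: build one mean feature
# vector per class from the single labeled template, then label each
# unlabeled feature by its most cosine-similar prototype.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_prototypes(features, labels):
    """Mean feature vector per class, from the one annotated template."""
    groups = {}
    for f, c in zip(features, labels):
        groups.setdefault(c, []).append(f)
    return {c: [sum(col) / len(fs) for col in zip(*fs)] for c, fs in groups.items()}

def pseudo_label(feature, protos):
    """Assign the class whose prototype is most cosine-similar."""
    def cos(a, b):
        return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5) + 1e-12)
    return max(protos, key=lambda c: cos(feature, protos[c]))
```

In the paper these coarse labels then seed the iterative train-and-refine loop; the sketch stops at the initial assignment.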
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🔥 引用:
0
Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
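The allocation idea behind DepthKV can be sketched with one simple rule: split a fixed global token budget across layers in proportion to per-layer sensitivity scores instead of uniformly. The proportional rule and the per-layer floor below are our assumptions; the paper's actual allocator may differ.

```python
# Sketch of sensitivity-weighted KV budget allocation: each layer keeps at
# least `floor` tokens, and the remaining budget is split proportionally to
# sensitivity, with rounding leftovers going to the most sensitive layers.

def allocate_kv_budget(sensitivity, total_tokens, floor=16):
    """Return tokens kept per layer; more sensitive layers keep more."""
    n = len(sensitivity)
    spare = total_tokens - floor * n
    assert spare >= 0, "budget too small for the per-layer floor"
    s = sum(sensitivity)
    alloc = [floor + int(spare * w / s) for w in sensitivity]
    leftover = total_tokens - sum(alloc)
    for i in sorted(range(n), key=lambda i: -sensitivity[i])[:leftover]:
        alloc[i] += 1
    return alloc
```

The global pruning ratio is unchanged (the allocations sum exactly to the budget), which matches the paper's comparison setting of uniform versus layer-dependent allocation at the same total budget.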
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
🔥 引用:
0
Abstract: This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.
Green Shielding: A User-Centric Approach Towards Trustworthy AI
🔥 引用:
0
Abstract: Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
🔥 引用:
0
Abstract: Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass$@k$, pass$^k$, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success-time distribution being estimated, test whether traces support that distribution, or quantify finite-trace uncertainty. We present TraceToChain, a reproducible pipeline that fits agent execution traces to an absorbing discrete-time Markov chain (DTMC), $\hat M=(\hat Q,\hat R_\oplus,\hat R_\ominus)$, with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace-smoothed maximum-likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov–Smirnov (KS) goodness-of-fit certificate, and reports Dirichlet-posterior credible intervals and non-parametric bootstrap intervals. We adapt classical reliability mathematics (Kemeny–Snell, Cheung, Goel–Okumoto) to agent traces. The resulting first-passage view reconciles metrics usually reported separately: pass$@k$, pass$^k$, and the RDC are projections of one success-time distribution. On seven controlled MAST-style frameworks with a strict 50/50 fit/test protocol, held-out empirical RDCs overlay their analytic counterparts with max $L_\infty^{\mathrm{RDC}} = 0.053$ (median $0.048$). A two-sample KS test on the first-passage cumulative distribution function (CDF) accepts the fitted chain with $p>0.05$ on $7/7$ frameworks (min $p = 0.78$), and per-entry $95\%$ posterior and bootstrap intervals agree to $\approx 0.01$ at the median.
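The core mechanics the abstract describes (Laplace-smoothed transition estimates from traces, then the success-time CDF of the absorbing chain) can be sketched directly. The state names and tiny trace set are illustrative; TraceToChain's cluster taxonomy, diagnostics, and intervals are not reproduced here.

```python
# Sketch of the absorbing-DTMC view of agent reliability: fit transitions
# from traces with Laplace smoothing, pin absorbing rows, then propagate
# probability mass to get P(success within k steps) for each k.

def fit_dtmc(traces, states, absorbing, alpha=1.0):
    """Laplace-smoothed MLE transition matrix; absorbing states never leave."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[alpha] * n for _ in range(n)]
    for tr in traces:
        for a, b in zip(tr, tr[1:]):
            counts[idx[a]][idx[b]] += 1
    P = [[c / sum(row) for c in row] for row in counts]
    for s in absorbing:
        P[idx[s]] = [1.0 if j == idx[s] else 0.0 for j in range(n)]
    return P

def success_cdf(P, states, start, success, steps):
    """First-passage CDF: P(absorbed in `success` within k steps), k = 1..steps."""
    idx = {s: i for i, s in enumerate(states)}
    dist = [0.0] * len(states)
    dist[idx[start]] = 1.0
    cdf = []
    for _ in range(steps):
        dist = [sum(dist[r] * P[r][c] for r in range(len(states)))
                for c in range(len(states))]
        cdf.append(dist[idx[success]])  # absorbing state accumulates mass
    return cdf
```

Reading the CDF at step $k$ recovers a pass$@k$-style quantity for a single chain, which is the sense in which the abstract calls such metrics projections of one success-time distribution.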
Generating Place-Based Compromises Between Two Points of View
🔥 引用:
0
Abstract: Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the generated compromises was evaluated for acceptability in a 50-participant study. We found that the best method for generating compromises between two views used external empathic similarity between a compromise and each viewpoint as iterative feedback, outperforming standard Chain of Thought (CoT) reasoning. The results indicate that the use of empathic neutrality improves the acceptability of compromises. The dataset of generated compromises was then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation during inference.
RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation
🔥 引用:
0
Abstract: As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While Large Language Models (LLMs) show promise in code generation, their application to hardware modeling faces unique challenges: (1) Rigid, static workflows fail to adapt to varying design complexity, causing inefficiency; (2) Context window overflow in multi-turn interactions leads to catastrophic forgetting of critical specifications; and (3) the Coupled Validation Failure problem--where generated Testbenches (TBs) incorrectly validate flawed models due to correlated hallucinations--severely undermines reliability. To address these limitations, we introduce RefEvo, a dynamic multi-agent framework designed for agile and reliable reference modeling. RefEvo features three key innovations: (1) A Dynamic Design Planner that autonomously decomposes design specifications and constructs tailored execution workflows based on semantic complexity; (2) A Co-Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the model and verification logic against the specification (Spec) oracle, effectively mitigating false positives; and (3) A Spec Anchoring Strategy for lossless context compression. Evaluated on a diverse benchmark of 20 hardware modules, RefEvo achieves a 95% pass rate, outperforming static baselines by a large margin. Furthermore, our context optimization reduces token consumption by an average of 71.04%, achieving absolute savings of over 70,000 tokens per session for complex designs while maintaining 100% specification recall.
Exploring Creativity in Human-Human-LLM Collaborative Software Design
🔥 引用:
0
Abstract: While the use of Large Language Models (LLMs) in programming has been extensively studied, there is limited understanding of how LLMs support collaborative work where creativity plays a central role. Software design, as a collaborative and creative activity, provides a valuable context for exploring the influence of LLMs on creativity. This study investigates how and where creativity naturally emerges when software designers collaborate with an LLM during a design task. In a laboratory setting simulating a workplace environment, 18 pairs of software professionals with design experience were asked to complete a design task. Each pair had 90 minutes to produce a software design based on a set of requirements, with optional access to a custom LLM interface. Pairs were not primed to be creative. We find that creativity was present in the design processes of all pairs, with 13 producing design documents containing creativity. We primarily attribute creativity to the human designers, driven by traits such as prior experience, empathy, and the use of analogies. The LLM contributed by producing novel ideas and elaborating human ideas. However, in some cases, the LLM appeared to hinder creativity by suggesting complex solutions or adding to unproductive digressions. LLMs can support creativity in collaborative software design, but human insights remain central. To effectively augment human creativity, designers must be intentional in their engagement with LLMs.
A Multi-Dimensional Audit of Politically Aligned Large Language Models
🔥 引用:
0
Abstract: As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas' Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance on reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
🔥 引用:
0
Abstract: Large language models are increasingly used for code generation, yet the correctness of their outputs depends not only on model capability but also on how tasks are specified. Prior studies demonstrate that small changes in natural language prompts, particularly under-specification, can substantially reduce code correctness; however, these findings are largely based on minimal-specification benchmarks such as HumanEval and MBPP, where limited structural redundancy may exaggerate sensitivity. In this exploratory study, we investigate how prompt structure, task complexity, and specification richness interact with LLM robustness to prompt mutations. We evaluate 10 different models across HumanEval and the structurally richer LiveCodeBench. Our results reveal that robustness is not a fixed property of LLMs but is highly dependent on prompt structure: the same under-specification mutations that degrade performance on HumanEval have near-zero net effect on LiveCodeBench due to redundancy across descriptions, constraints, examples, and I/O conventions. Surprisingly, we also find that prompt mutations can improve correctness. In LiveCodeBench, under-specification often breaks misleading lexical or structural cues that trigger incorrect retrieval-based solution strategies, leading to correctness improvements that counterbalance degradations. Manual analysis identifies consistent mechanisms behind these improvements, including the disruption of over-fitted terminology, removal of misleading constraints, and elimination of spurious identifier triggers. Overall, our study shows that structurally rich task descriptions can substantially mitigate the negative effects of under-specification and, in some cases, even enhance correctness. We outline categories of prompt modifications that positively influence the behavior of LLM code generation, offering practical insights for writing robust prompts.
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
🔥 引用:
0
Abstract: This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.
Turn food waste into climate action: engaging customers
🔥 引用:
0
Abstract: This study aims to explore how consumers engage with climate change issues through their interactions with the Too Good To Go application, an online marketplace designed to combat food waste at restaurants.
Drawing on customer engagement (CE) theory and the norm activation model (NAM) and using a mixed-methods approach, this research uses large language model-assisted thematic analysis to explore the key motivational drivers of engagement. Building on these insights, the authors used a survey and conducted structural equation modeling to test the conceptual model.
The results reveal that perceived sustainability, novelty, sense of community and value for money significantly foster affective engagement, which in turn drives behavioral engagement outcomes.
This research deepens understanding of pro-environmental consumer behavior by integrating CE theory with the NAM, thereby explicating the moral activation mechanisms underlying sustainable dining behaviors. This study also makes a methodological contribution by combining mixed methods with large language model-assisted thematic analysis to examine CE at scale.
By focusing on restaurant-based climate action, the study provides actionable insights for hospitality businesses seeking to embed sustainability into their operations and customer experience strategies.
This research makes a significant contribution to the field through its innovative methodological approach, its targeted application within the hospitality sector and its examination of how digital transformation facilitates sustainable behavioral change. The study provides valuable insights for designing digital platforms that simultaneously promote environmental action and enhance consumer loyalty.
The Pragmatic Persona: Discovering LLM Persona through Bridging Inference
🔥 Citations: 0
Abstract: Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git
Jailbreaking Frontier Foundation Models Through Intention Deception
🔥 Citations: 0
Abstract: Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user's intent. This binary training regime has been found to lead to brittleness, since user intent cannot be reliably evaluated, especially when the attacker obfuscates it, and it also makes the system seem unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion can be exploited when a user pretends their intention is benign. This intent inversion is especially effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the consistency property of the model, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovered an additional class of model vulnerability, which we call para-jailbreaking, that has gone unnoticed until now. Para-jailbreaking describes the situation where the model does not give a directly harmful reply to the attack query, yet the information it does reveal is nevertheless harmful. Our contributions are threefold. First, our method achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach revealed and addressed para-jailbreaking harmful output. Third, experiments on multimodal VLMs showed that our approach outperformed state-of-the-art methods.
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
🔥 Citations: 0
Abstract: Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
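The Best-of-N inference mentioned above follows a standard recipe: sample N candidate trajectories, score each with the reward model, and keep the highest-scoring one. A minimal sketch; the step format and the `toy_prm` stand-in scorer are illustrative assumptions, not DataPRM's actual interface:

```python
# Best-of-N selection with a process reward model (PRM): score each
# candidate trajectory and return the one with the highest reward.

def best_of_n(candidates, reward_fn):
    return max(candidates, key=reward_fn)

def toy_prm(trajectory):
    # Illustrative stand-in: reward = number of steps verified as correct.
    return sum(1 for step in trajectory if step["verified"])

candidates = [
    [{"verified": True}, {"verified": False}, {"verified": False}],
    [{"verified": True}, {"verified": True}, {"verified": False}],
]
best = best_of_n(candidates, toy_prm)
print(toy_prm(best))  # 2
```

What distinguishes a process reward model from an outcome reward model is that the scorer inspects intermediate steps rather than only the final answer, which is why the toy scorer iterates over the trajectory.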
Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
🔥 Citations: 0
Abstract: Large Language Models (LLMs) have recently been explored as fine-grained zero-shot re-rankers by leveraging attention signals to estimate document relevance. However, existing methods either aggregate attention signals across all heads or rely on a statically selected subset identified by heuristic rules. This solution can be suboptimal because the informative heads can vary across queries or domains. Moreover, naively combining multiple heads can degrade performance due to redundancy or conflicting ranking signals. In this paper, we propose a query-dependent head selection method, RouteHead, for attention-based re-ranking with LLMs. Specifically, we learn a lightweight router that can map each query to an optimal head set, and relevance scores are computed by aggregating attention signals only from these heads. Since query-to-head optimal labels are unavailable, we first construct pseudo labels via an offline search. The router represents each head with a learnable embedding and represents each query using an embedding extracted from the hidden states of the frozen LLM. Then it is trained on the pseudo labels with a sparsity regularizer. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.
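The routing step described above can be sketched as scoring every attention head against a query embedding and keeping the top-k. A pure-Python illustration with random stand-in vectors; the dimensions and the dot-product router are assumptions for illustration, not RouteHead's actual architecture:

```python
import random

random.seed(0)
n_heads, dim, k = 8, 16, 3

# Stand-ins for the learnable head embeddings and the query embedding
# extracted from the frozen LLM's hidden states.
head_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_heads)]
query_emb = [random.gauss(0, 1) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Route: score every head for this query, keep only the top-k.
logits = [dot(h, query_emb) for h in head_emb]
top_k = sorted(range(n_heads), key=lambda i: logits[i])[-k:]

# Aggregate attention-based relevance signals only from the selected heads.
attn_signal = [random.gauss(0, 1) for _ in range(n_heads)]
score = sum(attn_signal[i] for i in top_k) / k
print(sorted(top_k))
```

The key design point is that `top_k` is recomputed per query, in contrast to the static head subsets the paper argues are suboptimal.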
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
🔥 Citations: 0
Abstract: As the foundational component of versatile AI applications, training a multimodal large language model (MLLM) relies on multimodal datasets with dynamic modality mixture proportions and sample length distributions. However, existing MLLM systems remain inefficient under dynamic workloads, due to statically coupled decisions of resource allocation and model parallelization between encoders and the LLM backbone. This paper presents MegaScale-Omni, an industrial-grade MLLM training system tailored for dynamic workload adaptation and hyper-scale deployment. MegaScale-Omni is built upon the training scheme of encoder-LLM multiplexing with three key innovations: (1) Decoupled parallelism strategies with long-short sequence parallelism for encoders to process variable-length samples, and full-fledged 5D parallelism for the LLM backbone, both organized under a communication-efficient parallelization layout. (2) Unified encoder-LLM representations for flexible, extensible colocation, and a new paradigm of encoder-LLM joint pipeline with workload resilience. (3) Workload balancing techniques via decentralized grouped reordering in data loaders and adaptive resharding from encoder to LLM ranks. MegaScale-Omni is deployed as the foundation of our in-house large-scale MLLM training tasks with thousands of GPUs. Our experimental results demonstrate 1.27×–7.57× throughput improvement under production-grade dynamic workloads, as compared to four state-of-the-art systems.
Suika: Efficient and High-quality Re-scheduling of 3D-parallelized LLM Training Jobs in Shared Clusters
🔥 Citations: 0
Abstract: Large Language Models (LLMs) are usually trained with 3D (data, tensor, and pipeline) parallelism in shared GPU clusters where the available resources are highly dynamic. Rescheduling idle resources to ongoing jobs can help improve cluster utilization, but doing so for 3D-parallelized training jobs incurs large overheads in performance modeling, decision making, and redeployment. We present Suika, a cluster training system that supports efficient and high-quality resource rescheduling for 3D-parallelized LLM training jobs. Suika holistically addresses the complexity challenges by exploiting the incremental nature of rescheduling. For performance modeling, it builds an accurate performance estimator with non-disruptive online profiling. For decision-making, it employs topology-aware sorting and an expand-and-balance algorithm to reduce the complexity of resource allocation and job parallelization, without compromising decision quality. Suika further integrates a device-to-device redeployment method to leverage the overlapping nature of incremental reconfiguration for overhead reduction. Experiments on a 64-GPU physical cluster and a 1024-GPU simulated cluster show that Suika achieves a 1.29~1.31× reduction in average JCT compared to state-of-the-art schedulers.
Translate or Simplify First: An Analysis of Cross-lingual Text Simplification in English and French
🔥 Citations: 0
Abstract: Cross-Lingual Text Simplification (CLTS) aims to make content more accessible across languages by simultaneously addressing both linguistic complexity and translation. This study investigates the effectiveness of different prompting strategies for CLTS between English and French using large language models (LLMs). We examine five distinct prompting systems: a direct prompt instructing the LLM to perform both translation and simplification simultaneously, two composition approaches that either translate-then-simplify or simplify-then-translate within a single prompt, and two decomposition approaches that perform the same operations in separate, consecutive prompts. These systems are evaluated across a diverse set of five corpora of different genres (Wikipedia and medical texts) using seven state-of-the-art LLMs. Output quality is assessed through a multi-faceted evaluation framework comprising automatic metrics, comprehensive linguistic feature analysis, and human evaluation of simplicity and meaning preservation. Our findings reveal that while direct prompting consistently achieves the highest BLEU scores, indicating meaning fidelity, translate-then-simplify approaches demonstrate the highest simplicity, as measured by the linguistic features.
Collaborative Dialogue Analysis for Productive Problem Solving
🔥 Citations: 0
Abstract: Collaborative problem solving requires students to jointly reason, negotiate, and regulate their learning. Understanding collaborative problem solving through student dialogue can inform timely identification of productive and unproductive collaborative behaviors. In this study, we investigate the use of large language models to automatically classify collaborative problem-solving dialogue segments into two categories: Productive and Unproductive. To support deeper analysis, we additionally explore classification of eight detailed collaborative problem-solving sub-categories. We present an error-augmented few-shot prompting method that incorporates misclassified examples to refine model understanding of classification boundaries. Using dialogue data from a middle school collaborative game-based learning environment, our approach substantially improves classification accuracy over zero-shot baselines. Qualitative analysis of the resulting models further highlights which dialogue types are most frequently misclassified, suggesting design implications for adaptive scaffolding. These findings demonstrate that large language models, when guided with targeted prompting strategies, can effectively recognize productive and unproductive dialogue in collaborative learning.
A Novel Approach to Evaluating the Effectiveness of Large Language Models for Multimodal Analysis of Embodied Learning in Classrooms
🔥 Citations: 0
Abstract: This paper presents an approach that uses Large Language Models (LLMs) as late-fusion interpreters to synthesize multimodal signals from embodied classroom activities and infer students’ metacognitive behaviors. Our multimodal pipeline analyzes students’ movements, gaze, gestures, and speech within a mixed-reality simulation displayed on a classroom screen to support enactment and learning. Vision- and speech-derived features are fused at the interpretive layer via zero-shot prompting, self-consistency reasoning, and targeted prompt engineering to derive planning, enacting, monitoring, reflecting, and interacting behaviors. We investigate whether LLMs can reliably integrate modality-specific analytics to produce accurate behavioral labeling and whether an LLM-as-a-Judge can validate them at scale. To address scalability and reduce human burden, we introduce an automated evaluation protocol employing LLM-as-a-Judge to assess classification quality, enabling rapid, iterative benchmarking of model variants and prompt strategies. Using a balanced corpus of human-validated segments and perturbed controls, we compare text-only language models (e.g., GPT-5) with visual–language models (e.g., Qwen2.5-VL) that incorporate direct visual processing. Results indicate late-fusion, text-based LLMs can outperform VLMs on behavior judgment without raw video, and precision- or recall-oriented prompts adjust decision boundaries for subtle or brief segments. These findings position LLMs as effective late-fusion mechanisms for multimodal learning analytics and demonstrate the viability of LLM-as-a-Judge for scalable, human-in-the-loop evaluation.
Uncertainty Propagation in LLM-Based Systems
🔥 Citations: 0
Abstract: Uncertainty in large language model (LLM)-based systems is often studied at the level of a single model output, yet deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. This paper develops a systems-level account of uncertainty propagation. It introduces a conceptual framing for characterising propagated uncertainty signals, presents a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, synthesises cross-cutting engineering insights, and identifies five open research challenges.
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
🔥 Citations: 0
Abstract: Large language models can generate code and call tools with remarkable fluency, yet deploying them as practical software engineering assistants still exposes stubborn gaps: finite context windows, single mistakes that derail entire sessions, agents that get stuck in dead ends, AI slop, and generated changes that are difficult to review or revert. We present KISS Sorcar, a general-purpose assistant and integrated development environment (IDE) built on top of the KISS Agent Framework, a stupidly-simple AI agent framework of roughly 1,850 lines of code. The framework addresses these gaps using a robust system prompt and through a five-layer agent hierarchy in which each layer adds exactly one concern: budget-tracked ReAct execution, automatic continuation across sub-sessions via summarization, coding and browser tools with parallel sub-agents, persistent multi-turn chat with history recall, and git worktree isolation so every task runs on its own branch. To assess the power of the KISS agent framework, we implemented KISS Sorcar as a free, open-source Visual Studio Code extension that runs locally and effectively for long-horizon tasks, and supports browser automation, multimodal input, and Docker containers. In this research, we deliberately prioritize output quality over latency: giving a frontier model adequate time to validate its own output -- running linters, type checkers, and tests -- dramatically reduces the low-quality code that plagues faster but less thorough agents. The entire system was built using itself in 4.5 months, providing a continuous stress test in which any agent-introduced bug immediately impairs its own ability to work. On Terminal Bench 2.0, KISS Sorcar achieves a 62.2% overall pass rate with Claude Opus 4.6, comparing favorably to Claude Code (58%) and Cursor Composer 2 (61.7%).
Evaluating AI-Generated Narrative Feedback on Nonverbal Communication in Student Presentations
🔥 Citations: 0
Abstract: Nonverbal communication is essential for effective oral presentations, shaping audience engagement and conveying confidence beyond spoken words. Multimodal Learning Analytics (MmLA) research has advanced the automatic detection of presenters’ behaviors—such as posture, gestures, and eye contact—but continues to explore how to provide actionable feedback that fosters reflection and skill development. Recent advances in Generative Artificial Intelligence (GenAI) offer new opportunities to automate the analysis of nonverbal cues and generate narrative evaluations that go beyond raw metrics. This study investigates the integration of a Learning Analytics Dashboard (LAD) with a Large Language Model (LLM) to deliver both numerical and narrative feedback across five dimensions of nonverbal communication: body posture, eye contact, hand gestures, facial expression, and use of space. Twenty-two undergraduate students participated in the study, receiving numerical feedback through a LAD and narrative feedback from either a human expert or an LLM. Findings revealed no significant differences in students’ perceptions of human- and LLM-generated feedback, while also highlighting the potential of LLM-based feedback to support reflection despite certain technical limitations. These results suggest that LLMs can meaningfully enhance learning analytics dashboards by delivering actionable, human-like feedback that supports the development of nonverbal communication skills.
Data-Driven Evaluation of LLM-Based Ontology Concept Extraction from Programming Learning Content
🔥 Citations: 0
Abstract: The process of associating elements of learning content with concepts or skills that this content helps students to master is one of the critical steps in developing personalized educational systems. When these associations are properly established, the system can infer the growth of student understanding of separate knowledge components from the logs of their interactions with associated learning content and use it to adapt the learning process accordingly by targeting gaps in individual students’ knowledge. Unfortunately, crafting these links between learning content and knowledge components is a very time- and expertise-demanding process that has traditionally been performed manually by domain experts with the help of knowledge engineers. Recently, the power of Large Language Models has motivated a new generation of research on concept extraction from textual learning content. The work presented in this paper contributes to this trend while introducing two important innovations. First, our concept extraction process is guided by a human-authored ontology of the target domain - Python programming. Second, alongside a traditional expert evaluation of the concept extraction quality, we apply two additional validation approaches: one based on using an educational data mining technique (learning curves) and another utilizing the pedagogical expertise of teaching the target domain (learning content placement).
No More Translation at Runtime: LLM-Empowered Static Binary Translation
🔥 Citations: 0
Abstract: While AArch64 CPUs are becoming strong market contenders, their software ecosystem lags behind the mature x86-64 environment, hindering the adoption of the new architectures and impacting user experience. Binary translation bridges this divide by converting binary code from one architecture (e.g., x86-64) to run on another (e.g., AArch64), allowing legacy software to benefit from modern hardware's performance and energy efficiency advantages. Current translation methods are typically either dynamic, which adds significant runtime overhead, or static, which struggles with reliability due to the inherent complexities of binary analysis. This paper introduces a new static, assembly-to-assembly translation paradigm that transforms binary code ahead of execution, generating portable, efficient native-like binaries that run on AArch64 devices without runtime frameworks. Benefiting from recent breakthroughs in large language models (LLMs), we provide a practical and automated translation engine that produces high-quality code with minimal human intervention. To ensure correctness, we introduce a crucial verification step, where we split the assembly code into simplified snippets, enabling efficient and scalable semantic verification. Our evaluation shows that this approach significantly outperforms existing open-source solutions by a large margin, producing binaries with near-native performance. Furthermore, it shows substantial improvements over the leading industrial translator, ExaGear, illuminating a promising new direction for cross-architecture binary translation research.
Prompting for Teachability: Designing Novice Personas in LLMs for Learning by Teaching Contexts
🔥 Citations: 0
Abstract: Learning by teaching (LbT) is a well-established instructional framework in which students deepen understanding by explaining material to a peer or tutee. Large Language Models (LLMs) create new opportunities to scale LbT by simulating novice learners, but their default tendency toward expert-like responses risks undermining the tutor's role. This study investigates which prompting strategies most effectively elicit novice behavior from LLMs in writing-related domains. We generated 30,720 combined prompts across five domains and evaluated three models (Qwen3-235B, Llama 4, Kimi-K2) using both multiple-choice quizzes and short persuasive essays. Outputs were scored on quiz accuracy, essay quality, and essay persuasiveness using an AI-judge rubric. Regression analysis revealed a clear pattern: constraint prompts that explicitly forced error production consistently outperformed persona-, misconception-, and uncertainty-based prompts. Across both quiz and essay outcomes, direct commands to “answer incorrectly” or “get 2–3 wrong” yielded the strongest novice-like behavior, while indirect framings like “don't aim for a perfect score” or “you may guess” diluted the effect. These findings highlight constraint-based prompting as the most reliable strategy, and we argue that constraint directives provide an actionable design pathway for practitioners seeking to integrate LLMs into effective LbT contexts.
Reducing the GPU Memory Bottleneck with Lossless Compression for ML
🔥 Citations: 0
Abstract: Machine learning (ML) training and inference often process data sets far exceeding GPU memory capacity, forcing them to rely on PCIe for on-demand tensor transfers, causing critical transfer bottlenecks. Lossy compression has been proposed to relieve bottlenecks but introduces workload-dependent accuracy loss, making it complex or even prohibitive to use in existing ML deployments. We explore lossless compression as an alternative that avoids this deployment complexity. We identify where lossless compression can be integrated into ML pipelines while minimizing interference with GPU execution. Based on our findings, we introduce Invariant Bit Packing (IBP), a novel lossless compression algorithm designed to minimize data transfer time for ML. IBP identifies and eliminates invariant bits across groups of tensors, improving throughput through GPU-optimized decompression that leverages warp parallelism, low-overhead bit operations, and asynchronous PCIe transfers. We provide easy-to-use APIs, showcasing them by adding IBP support to GNN training, as well as DLRM and LLM inference frameworks. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 25% faster LLM inference.
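The invariant-bit idea behind IBP can be illustrated compactly: within a group of words, a bit position whose value agrees across every word need only be stored once in a shared template, so each word carries only its varying bits. A much-simplified sketch; the grouping granularity, word size, and packed layout are assumptions, and warp-parallel decompression is omitted entirely:

```python
# Sketch of detecting invariant bits across a group of 32-bit words.
# Word values are illustrative; real IBP operates on tensor data.

WORD_BITS = 32
FULL = (1 << WORD_BITS) - 1

def invariant_mask(words):
    """Bitmask of positions where every word in the group agrees."""
    all_and = all_or = words[0]
    for w in words[1:]:
        all_and &= w
        all_or |= w
    # A bit is invariant if it is 1 everywhere (all_and) or 0 everywhere (~all_or).
    return (all_and | (~all_or & FULL)) & FULL

words = [0xDEAD0001, 0xDEAD0F02, 0xDEAD0003]
mask = invariant_mask(words)
varying = WORD_BITS - bin(mask).count("1")
print(hex(mask), varying)  # 0xfffff0fc 6 -> only 6 bits per word need storing
```

Here the shared high halfword and the constant low bits are captured once by the mask, leaving 6 varying bits per word to pack, which is the source of the compression.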
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
🔥 Citations: 0
Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near a JSS of about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.
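The JSS defined above is simple enough to state in code: the fraction of paraphrase pairs on which the judge's decision is unchanged. The verdicts below are made-up examples, not JudgeSense data:

```python
# Judge Sensitivity Score: fraction of paraphrase pairs with identical verdicts.

def jss(pairs):
    """pairs: (verdict_on_original, verdict_on_paraphrase) tuples."""
    same = sum(1 for a, b in pairs if a == b)
    return same / len(pairs)

verdicts = [("pass", "pass"), ("fail", "pass"), ("pass", "pass"), ("fail", "fail")]
print(jss(verdicts))  # 0.75
```

A perfectly consistent judge scores 1.0; the 0.389 coherence judge reported above flips its verdict on most paraphrases.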
AIMS: Cost-Efficient LLM-Based Agent Deployment in Hybrid Cloud-Edge Environments
🔥 Citations: 0
Abstract: In the realm of AI, large language models (LLMs) like GPT-5, central to the operation of AI agents, predominantly operate in the cloud, incurring high operational costs. With local-based small language models (SLMs) becoming more accurate, the necessity of cloud-exclusive processing is being reconsidered. An AI agent's response to a user's request comprises a series of subtasks or iterations. Existing approaches choose either an LLM or SLM for the entire request to ensure similar outputs, but this is ineffective for AI agents as SLMs may generate differing subtasks, compromising final accuracy. In this paper, we first conduct experimental analysis to understand the features of AI agent operations. Leveraging our findings, we propose the Adaptive Iteration-level Model Selector (AIMS), a lightweight scheduler to automatically partition an AI agent's subtasks between local-based SLM and cloud-based LLM. AIMS considers the varying subtask features and strategically decides the location for each subtask in order to use SLM as much as possible while maintaining the accuracy level. Our experimental results demonstrate that AIMS achieves up to a 27.5% relative improvement in accuracy and up to 31.4% relative increase in SLM usage compared to HybridLLM. It offloads 83.4% of subtasks to a local SLM while attaining similar accuracy on average compared with the cloud-only LLM approach.
Automatic Short Answer Grading with LLMs: From Memorization to Reasoning
🔥 Citations: 0
Abstract: Short-answer questions provide valuable insights into students’ understanding and cognitive processes for learning analytics. However, they are difficult to grade automatically as they require a high level of language comprehension. Automatic Short Answer Grading (ASAG) is therefore essential in large-scale educational settings. Recent work has applied encoder-only pre-trained language models (PLMs), such as BERT, and generative large language models (LLMs) to ASAG. Although fine-tuned BERT-based models currently produce state-of-the-art results, they depend on substantial annotated datasets, which are frequently expensive and insufficient. This paper examines the performance of fine-tuning of several PLMs and LLMs for different dataset sizes and compares the results to those of prompt-based approaches. General-purpose and domain-specific models were fine-tuned on datasets ranging from 800 to 26,674 student responses. Different prompt engineering strategies were tested, including rubric-based prompts. Our results demonstrate that fine-tuned LLMs and rubric-based prompting can match or exceed the performance of BERT-based models. Rubric-based prompts with open-source models deliver comparable results without the need for annotation data or hardware-intensive training, while also mitigating data protection concerns. This work provides empirical evidence of the role of LLMs in ASAG and paves the way for future research into resource-efficient, interpretable and reasoning-driven grading.
Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong
🔥 Citations: 0
Abstract: Safety specifications in cyber-physical systems (CPS) capture the operational conditions the system must satisfy to operate safely within its intended environment. As operating environments evolve, operational rules must be continuously refined to preserve consistency with observed system behavior during simulation-based verification and validation. Revising inconsistent rules is challenging because the changes must remain syntactically correct under a domain-specific grammar. Language-in-the-loop refinement further raises safety concerns beyond syntactic violations, as it can produce semantically unjustified refinements that overfit to the observed outcomes. We introduce a framework that combines counterfactual reasoning with a grammar-constrained refinement loop to refine operational rules, aligning them with the observed system behavior. Applied to an autonomous driving control system, our approach successfully resolved the inconsistencies in an operational rule inferred by a conventional baseline while remaining grammar compliant. An empirical large language model (LLM) study further revealed model-dependent refinement quality and safety lessons, which motivate rigorous grammar enforcement, stronger semantic validation, and broader evaluation in future work.
Identification and validation of natural product-based KRASG12D inhibitors through structure-based virtual screening, molecular dynamics simulation, and in vitro studies
🔥 Citations: 0
Abstract: Not available; see the original article.
WISE-FM: Operation-Aware, Engineering-Informed Foundation Model for Multi-Task Well Design
🔥 Citations: 0
Abstract: Deploying machine learning models across diverse well portfolios requires generalisation to wells with design parameters outside the training distribution. Current data-driven approaches to virtual flow metering (VFM) and bottomhole estimation typically treat each well independently or ignore the influence of well design on operational behaviour. We present WISE (Well Intelligence and Systems Engineering Foundation Model), a design-aware, physics-informed multi-task model that integrates three complementary mechanisms: Feature-wise Linear Modulation (FiLM) and cross-modal attention to condition operational embeddings on well design parameters; multi-task learning for simultaneous prediction of flow rates, bottomhole conditions, and flow regime classification; and structural mass conservation with soft physics constraints derived from well engineering principles. Evaluation on the ManyWells benchmark (2000 simulated wells, $10^6$ data points) demonstrates that design-aware models reduce VFM prediction error by up to $13\times$ compared to design-unaware baselines, and that physics constraints reduce negative flow predictions by 65%. Flow regime classification achieves 97.7% bottomhole accuracy, providing continuous well integrity monitoring without additional sensors. The methodology transfers to real operational data from five Equinor Volve producers (oil rate $R^2 = 0.89$, bottomhole pressure $R^2 = 0.98$, water rate $R^2 = 0.97$). The trained model additionally serves as a fast surrogate for integrity-aware well design optimisation over a 24-dimensional design space, with more than $1000\times$ speedup over drift-flux simulations. These results demonstrate that design awareness, physics enforcement, and multi-task learning are essential and complementary ingredients for foundation models intended to operate across large well portfolios.
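The abstract names Feature-wise Linear Modulation (FiLM) as one of the mechanisms conditioning operational embeddings on well-design parameters. A minimal illustrative sketch of FiLM conditioning, with toy dimensions and hypothetical names (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(operational, design, w_gamma, w_beta):
    # FiLM: per-feature scale (gamma) and shift (beta) are linear functions
    # of the design vector; they modulate the operational embedding
    # element-wise, so each well's design reshapes its feature space.
    gamma = design @ w_gamma
    beta = design @ w_beta
    return gamma * operational + beta

# Toy dimensions: 4 wells, 8 design parameters, 16 operational features.
design = rng.normal(size=(4, 8))
operational = rng.normal(size=(4, 16))
w_gamma = rng.normal(size=(8, 16)) * 0.1
w_beta = rng.normal(size=(8, 16)) * 0.1

modulated = film_modulate(operational, design, w_gamma, w_beta)
```

In a full model the `w_gamma`/`w_beta` projections would be learned layers (often a small MLP) rather than fixed random matrices.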
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
🔥 Citations: 0
Abstract: Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.
FUTURAL: A Metasearch Platform for Empowering Rural Areas with Smart Solutions
🔥 Citations: 0
Abstract: The FUTURAL project aims to provide a comprehensive suite of digital Smart Solutions (SS) across five critical domains to address pressing social and environmental issues. Central to this initiative is a robust Metasearch platform, which will not only serve as the primary access point to FUTURAL's solutions but also facilitate the search and retrieval of SS developed by other initiatives. This paper elaborates on the MVP implementation for the MetaSearch platform. It focuses on a single, open-source data service and harnesses the generative capabilities of Large Language Models (LLMs) to create a user-friendly natural language interface. The design of the Minimum Viable Product (MVP), the tools used for adapting LLMs to our specific application, and our comprehensive set of evaluation techniques are thoroughly detailed. The results from our evaluations demonstrate that our approach is highly effective and can be efficiently implemented in future iterations of the MVP. This groundwork paves the way for extending the platform to include additional services and diverse data sets from the FUTURAL project, enhancing its capacity to address a broader array of queries and datasets.
Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate
🔥 Citations: 0
Abstract: The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations when processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician's cognitive trajectory consisting of "Memory Anchoring", "Navigation", and "Verification" phases. DxChain introduces three key methodological innovations to elicit the potential of LLMs: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look-ahead planning and resource-aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing "Angel-Devil" adversarial debates to resolve complex evidence conflicts. Evaluated on two real-world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performance in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at https://anonymous.4open.science/r/Dx-Chain.
TrustWeave: Integrity Measurement and Attestation For Multi-Cloud LLMs
🔥 Citations: 0
Abstract: Multi-cloud deployed multi-agent systems powered by Large Language Models (LLMs) introduce critical security challenges. While Intel Trust Domain Extensions (TDX)-based Confidential Virtual Machines (CVMs) provide strong isolation and boot-time attestation, they lack dynamic runtime integrity verification, a capability essential for trusted agent systems that frequently load new models and coordinate across distributed services. We present TrustWeave, a runtime integrity measurement and attestation framework that extends the Linux Integrity Measurement Architecture (IMA) with support for Intel TDX's Runtime Measurement Registers (RTMRs). TrustWeave enables userspace attestation of dynamically loaded agent components throughout their life-cycle, providing runtime trust guarantees for secure and scalable LLM agent deployments. Our key innovation is a workload-aware filtering mechanism that reduces measurement overhead by 99.95% while preserving security coverage for agent-critical operations. Our evaluation on production agent workloads using five models (0.6B-14B parameters) demonstrates practical performance: 0.8–12.8% boot overhead (reducible to <10% with filtering), Time-to-First-Token (TTFT) degradation around 25% of baseline, and stable QPS scaling up to 32 concurrent requests. TrustWeave introduces operationally acceptable overhead for security-sensitive deployments, maintaining model inference accuracy with bounded boot and runtime latency, while providing runtime attestation for dynamically loaded applications.
viNPU: Optimizing Vision Transformer Inference on Mobile NPUs
🔥 Citations: 0
Abstract: Vision Transformers (ViTs) have emerged as the dominant architecture for visual understanding, powering Vision Foundation Models (VFMs) that excel in zero-shot prediction across diverse environments in spatial computing applications. These applications demand efficient, on-device execution to meet stringent latency requirements, yet deploying large VFMs on mobile Neural Processing Units (NPUs) presents two critical bottlenecks: performance imbalance between the NPU's heterogeneous compute units, and severe DRAM access overhead in scaled Transformers due to limited on-chip memory. To address these challenges, we propose viNPU, an inference optimization framework that enables efficient execution of VFMs on mobile NPUs. viNPU incorporates (1) sensitivity-guided mixed-precision quantization that effectively balances accuracy and latency; (2) dataflow optimization to maximize on-chip data locality and hardware utilization; and (3) a profiling-based offline stage to ensure real-time performance on mobile NPUs with minimal accuracy drop. Our evaluation on representative VFMs (e.g., Depth Anything V2, Segment Anything) across commercial mobile devices equipped with Qualcomm Hexagon NPUs demonstrates that viNPU achieves an average 11.5× speedup and up to 44.9× energy savings over baselines, while incurring negligible accuracy degradation compared to floating-point execution.
SAS: Sparse Attention Synthesizer for Efficient Language Model Inference
🔥 Citations: 0
Abstract: Modern large language models rely on attention mechanisms that attend to all tokens in a sequence, resulting in quadratic computational complexity that limits scalability. While sparse attention reduces compute and memory requirements by attending to only important tokens, implementing these techniques presents significant challenges due to the complexity of combining static and dynamic sparse patterns and optimizing key-value (KV) cache management. To address these challenges, we present SAS, a sparse attention synthesizer that automatically generates performant sparse attention kernels for large language model inference. SAS introduces a set of primitives that effectively encapsulate both static and dynamic sparse attention mechanisms, enabling users to compose complex attention patterns through logic operators and declarative functions. The system employs a geometric-based pattern analyzer to optimize for KV caching by determining minimal cache sizes and automatically generating cache management functions. Supporting both Nvidia GPU and AWS Trainium backends, SAS demonstrates significant performance improvements: 1.10–1.22× speedup for context encoding and 2.68–2.80× speedup for token generation over FlexAttention, a state-of-the-art flexible attention kernel synthesis tool, on GPUs, and 1.41–6.49× speedup for context encoding and 1.39–10.87× speedup for token generation over optimized dense attention on Trainium.
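The abstract describes composing static and dynamic sparse attention patterns through logic operators. A hedged illustrative sketch of that idea using boolean masks (the function names and the specific patterns are illustrative, not SAS's actual primitives):

```python
import numpy as np

def causal(n):
    # Static pattern: each query attends only to itself and earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window(n, w):
    # Static pattern: attend only to tokens within distance w.
    i = np.arange(n)
    return np.abs(i[:, None] - i[None, :]) < w

def global_tokens(n, k):
    # Static pattern: every query also attends to the first k "sink" tokens.
    m = np.zeros((n, n), dtype=bool)
    m[:, :k] = True
    return m

# Compose patterns with boolean logic: causal AND (local OR global).
n = 8
mask = causal(n) & (sliding_window(n, 3) | global_tokens(n, 2))
```

A real kernel synthesizer would lower such a composed mask to block-sparse kernels and derive the minimal KV-cache footprint from it, rather than materializing a dense boolean matrix.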
GenAI Analytics and Pedagogical Configurations: Featuring New Generative AI Systems in Education
🔥 Citations: 0
Abstract: The appearance of general-purpose Large Language Models (LLMs) and their derived chatbots (e.g., ChatGPT, Gemini) has revolutionized how students learn in high school and university settings. While these chatbots provide potential benefits to students (e.g., a 24/7 tutor), they hinder teachers’ autonomy: teachers are blind to the questions students ask and have no control over the answers these models provide. However, such autonomy could be increased by integrating GenAI Analytics that track students’ interactions with the chatbots (e.g., which topics are asked about most frequently), and by configuring and contextualizing such models. This study aims at understanding which GenAI Analytics and Pedagogical Configurations are most useful for university teachers, through a GenAI system developed by the authors. The evaluation involved 16 university teachers from 5 different institutions. Results show that analytics such as the number of prompts or the categorization of their contents are of great interest to teachers. Furthermore, while some pedagogical configurations were positively rated by most participants (e.g., not providing the direct solution, customized add-ons), others were not (e.g., forced hallucinations, minimum answer length). By better understanding these two features of GenAI systems (i.e., GenAI Analytics and Pedagogical Configurations for chatbot behaviors), we expect teachers will increase their autonomy and trust in GenAI-based systems, thus enhancing the adoption of these technologies in their regular practice.
AI Engineering: Building Applications with Foundation Models by Chip Huyen on iPad
🔥 Citations: 11
Abstract: Not available; see the original article.
Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
🔥 Citations: 0
Abstract: Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE-Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM's hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach has outperformed RE-Control on two benchmark datasets and showed greater generalization on out-of-domain datasets. Our source code is available at https://github.com/UTS-nlPUG/pref-ctrl.
DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making
🔥 Citations: 0
Abstract: Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.
Automated Assessment of Handwritten Math Problems: A Comparison of Prompting Strategies for Open and Closed-source LLMs
🔥 Citations: 0
Abstract: Assessing handwritten mathematical solutions is essential for identifying students’ weaknesses and fostering personalized learning. However, scaling such assessment remains challenging for Learning Analytics, which has traditionally focused on digital or typed data. The current study investigated the potential of Large Language Models (LLMs) for automating the assessment of handwritten mathematical solutions and explored how they can be incorporated into large-scale learning analytics pipelines. We curated 300 student solution images, annotated them according to a taxonomy of math error types, and compared open-source (Qwen2.5-7B and Gemma3 12B-IT) and closed-source (Gemini 2.0 Flash and GPT-4) LLMs. Two prompting strategies were tested: one adapted from related work and one tailored to the taxonomy of math error types using established prompt design principles. The results revealed that LLMs, particularly Gemini, achieved moderate to strong performance in diagnosing and classifying student errors, while exposing recurring model-specific errors. These findings highlight both the promise and limitations of LLMs for integrating handwritten work in LA and recommend that learning analytics practitioners and researchers combine careful model selection, principled prompt design, and error-level analysis to develop AI-powered LA systems that are accurate, equitable, and pedagogically actionable.
ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
🔥 Citations: 0
Abstract: Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B, while maintaining high output fidelity (near-token-identical generation, with minor non-determinism from FP16 atomics).
A Synergistic Framework for Cognitive Diagnosis with LLM-Empowered Data Augmentation
🔥 Citations: 0
Abstract: Accurate cognitive diagnosis is crucial for enabling personalized learning. However, its effectiveness is severely hampered by data sparsity and class imbalance in student-exercise interactions. To address these challenges, this paper proposes a synergistic framework that integrates Artificial Intelligence and Learning Analytics. First, to mitigate data sparsity, we leverage a Large Language Model (LLM) with chain-of-thought prompting to enrich the Q-matrix, thereby expanding knowledge concept coverage. Second, to address class imbalance, we introduce a Generative Adversarial Network (GAN) constrained by this refined knowledge structure to synthesize pedagogically meaningful negative samples. Finally, a dual-level graph attention network is employed to infer precise knowledge states from both the LLM-enriched Q-matrix and the balanced response data. Extensive experiments on the Junyi dataset demonstrate that our framework significantly outperforms strong baselines, achieving a 6.21% improvement in AUC and increasing specificity from 54.24% to 85.83%. These results underscore its potential for delivering reliable and actionable learning diagnostics.
Consensus-Driven Metacognition in Multi-Agent Systems: A Logic-Based Byzantine Fault-Tolerant Protocol
DOI: 10.63412/8vgf0b98
🔥 Citations: 0
Abstract: Large language model (LLM) agents fail in a distinctive way: they produce confident wrong answers. Recent work has begun adapting Byzantine fault tolerance (BFT) to multi-LLM networks — notably the weighted BFT protocol of Wang et al. [1] and the Aegean consensus engine [2] — typically by treating agent trust as a scalar that flows continuously through the protocol.
This paper takes the opposite stance: discretization is a feature, not a bug. MBFT (Metacognitive BFT) commits a swarm decision via a small, finite ladder of confidence tiers, defeasible counterproofs, and a reputation-gated veto. The resulting protocol is auditable, deterministically replayable, and its safety/liveness claims are mechanically checkable — properties that continuous Bayesian aggregators struggle to provide. Bayesian and continuous trust schemes remain the right tool for noisy, well-calibrated, high-volume regimes (recommender systems, sensor fusion, market making); we argue that high-stakes reasoning swarms — legal, medical, safety-critical agentic deployments — belong to the discrete, defeasible regime that MBFT formalizes. We accompany the paper with an open-source reference implementation whose property suite encodes each theorem as an executable test.
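The commit rule the abstract sketches (discrete confidence tiers plus a reputation-gated veto) can be illustrated with a toy decision function; the names, tier range, and thresholds below are hypothetical, not the MBFT protocol itself:

```python
def commit_decision(votes, vetoed, quorum=3, min_tier=2):
    """Toy tiered commit: each agent votes with a discrete confidence
    tier in {0, 1, 2, 3}; a reputation-gated veto rejects the decision
    outright; otherwise it commits once enough votes reach min_tier.
    Because tiers are finite, every outcome is exactly replayable."""
    if vetoed:
        return "REJECTED"
    strong = [tier for tier in votes if tier >= min_tier]
    return "COMMITTED" if len(strong) >= quorum else "PENDING"

decided = commit_decision([3, 2, 2, 1], vetoed=False)
blocked = commit_decision([3, 2, 2, 1], vetoed=True)
pending = commit_decision([3, 1, 1, 0], vetoed=False)
```

The point of the discrete formulation is visible even in this sketch: every branch is an exact comparison over a finite set, so each theorem about the rule can be checked by enumerating cases as executable tests.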
From Anomaly to Attack Path: LLM-Based Network Traffic Investigation for APT Detection
🔥 Citations: 0
Abstract: Advanced Persistent Threats (APTs) pose a growing challenge to critical infrastructure security. Their extended timescales, stealth, and multi-stage tactics limit the effectiveness of traditional Network Intrusion Detection Systems (NIDS), as signature-based methods miss novel techniques and anomaly detection alone produces a flood of low-confidence, low-specificity events. We propose a system that combines unsupervised anomaly detection with automated, large language model (LLM)-driven investigation to analyze suspicious network flows and reconstruct attack paths from network-level observations. Flagged flows are examined at both the feature and payload level using a locally hosted language model. Investigation results are aggregated in a graph database to reconstruct the attack progression through the network. Evaluation on the CICAPT-IIoT dataset shows that the system identifies all labeled malicious flows while uncovering mislabeled traffic containing clear indicators of compromise, suggesting detection capability without prior attack knowledge. We further discuss risks introduced by deploying language models in security-critical applications, including prompt injection via malicious payloads and data poisoning attacks.
Prompt-Unknown Promotion Attacks against LLM-based Sequential Recommender Systems
🔥 Citations: 0
Abstract: Large language model-powered sequential recommender systems (LLM-SRSs) have recently demonstrated remarkable performance, enabling recommendations through prompt-driven inference over user interaction sequences. However, this paradigm also introduces new security vulnerabilities, particularly text-level manipulations, rendering them appealing targets for promotion attacks that purposely boost the ranking of specific target items. Although such security risks have been receiving increasing attention, existing studies typically rely on an unrealistic assumption of access to either the victim model or prompt to unveil attack mechanisms. In this work, we investigate the item promotion attack in LLM-SRSs under a more realistic setting where both the system prompt and victim model are unknown to the attacker, and propose a Prompt-Unknown Dual-poisoning Attack (PUDA) framework. To simulate attacks under this full black-box setting, we introduce an LLM-based evolutionary refinement strategy that infers discrete system prompts, enabling the training of an effective surrogate model that mimics the behaviors of the victim model. Leveraging the distilled prompt and surrogate model, we devise a promotion attack that adversarially revises target item texts under semantic constraints, which is further complemented by the highly plausible, surrogate-generated poisoning sequences to enable cost-effective target item promotion. Extensive experiments on real-world datasets demonstrate that PUDA consistently outperforms state-of-the-art competitors in boosting the exposure of unpopular target items. Our findings reveal critical security risks in modern LLM-SRSs even when both prompts and models are protected, and highlight the need for more robust defensive means.
Evaluating the completeness of large language model-generated clinical trial informed consent information for adolescent and young adult patients with central nervous system tumors
🔥 Citations: 0
Abstract: Not available; see the original article.
Cloud-Based Large Language Model Deployment: A Comparative Analysis of Serverless and Bring-Your-Own-Container Architectures
DOI: 10.18267/j.aip.313
🔥 Citations: 0
Abstract: Not available; see the original article.
LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation
🔥 Citations: 0
Abstract: Human daily behavior unfolds as complex sequences shaped by intentions, preferences, and context. Effectively modeling these behaviors is crucial for intelligent systems such as personal assistants and recommendation engines. While recent advances in deep learning and behavior pre-training have improved behavior prediction, key challenges remain, particularly in handling long-tail behaviors, enhancing interpretability, and supporting multiple tasks within a unified framework. Large language models (LLMs) offer a promising direction due to their semantic richness, strong interpretability, and generative capabilities. However, the structural and modal differences between behavioral data and natural language limit the direct applicability of LLMs. To address this gap, we propose Behavior Understanding Alignment (BUA), a novel framework that integrates LLMs into human behavior modeling through a structured curriculum learning process. BUA employs sequence embeddings from pretrained behavior models as alignment anchors and guides the LLM through a three-stage curriculum, while a multi-round dialogue setting introduces prediction and generation capabilities. Experiments on two real-world datasets demonstrate that BUA significantly outperforms existing methods in both tasks, highlighting its effectiveness and flexibility in applying LLMs to complex human behavior modeling.
Evaluation of Prompt Injection Defenses in Large Language Models
🔥 Citations: 0
Abstract: LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that evolves its strategies over hundreds of rounds and tested it against nine defense configurations across more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering, which checks the model's responses via hardcoded rules in separate application code before they reach the user, achieving zero leaks across 15,000 attacks. These results demonstrate that security boundaries must be enforced in application code, not by the model being attacked. Until such defenses are verified by tools like Swept AI, AI systems handling sensitive operations should be restricted to internal, trusted personnel.
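The one defense the abstract reports as holding is output filtering: hardcoded rules in application code that inspect the model's response before it reaches the user. A minimal sketch of that pattern (function and variable names are illustrative, not from the paper):

```python
def output_filter(response, secrets):
    """Application-side output filter: block any model response that
    contains a configured secret before returning it to the user.
    The check runs in separate application code, so it cannot be
    talked out of its rules the way the model itself can."""
    for secret in secrets:
        if secret in response:
            return "[response blocked: potential secret leak]"
    return response

secrets = ["sk-demo-12345"]  # illustrative secret string
safe = output_filter("The capital of France is Paris.", secrets)
blocked = output_filter("Sure! The key is sk-demo-12345.", secrets)
```

A production filter would also match encodings and partial leaks (base64, reversed strings, chunked output), since an adaptive attacker will try to smuggle the secret past exact substring checks.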
Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
🔥 Citations: 0
Abstract: Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but stop short of generating actionable code changes. Consequently, performance optimization continues to be a time-intensive and manual endeavor, typically undertaken only by experts with detailed architectural understanding. To bridge this gap, we introduce Optimas, a modular, fully automated, end-to-end generative AI framework built on a multi-agent workflow. Optimas uses LLMs to map performance diagnostics from multiple reports to established, literature-backed code transformations, while unifying insight extraction, code generation, execution, and validation within a single pipeline. Across 3,410 real-world experiments on 10 benchmarks and two HPC mini-applications, Optimas generates 100% correct code and improves performance in over 98.82% of those experiments, achieving average gains of 8.02%-79.09% on NVIDIA GPUs.
LLM-Augmented Traffic Signal Control with LSTM-Based Traffic State Prediction and Safety-Constrained Decision Support
🔥 Citations: 0
Abstract: Traffic signal control is a critical task in intelligent transportation systems, yet conventional fixed-time and rule-based methods often struggle to adapt to dynamic traffic demand and provide limited decision interpretability. This study proposes an LLM-augmented traffic signal control framework that integrates LSTM-based short-term traffic state prediction, predictive phase selection, structured large language model reasoning, and safety-constrained action filtering. The LSTM module forecasts future queue length, waiting time, vehicle count, and lane occupancy based on recent intersection-level observations. A predictive controller then generates candidate signal actions, while the LLM module evaluates these actions using structured traffic-state inputs and produces congestion diagnoses, phase adjustment recommendations, and natural-language explanations. To ensure operational reliability, all LLM-generated recommendations are validated by a safety filter before execution. Simulation-based experiments in SUMO compare the proposed method with fixed-time control, rule-based control, and an LSTM-based predictive baseline under balanced demand, directional peak demand, and sudden surge scenarios. The results indicate that the proposed framework improves traffic efficiency, especially under dynamic and non-recurrent traffic conditions, while maintaining zero constraint violations after safety filtering. Overall, this study demonstrates that LLMs can enhance traffic signal control when used as constrained reasoning and decision-support modules rather than direct low-level controllers. Keywords: Intelligent Transportation Systems; Traffic Signal Control; Large Language Models; LSTM; Traffic State Prediction; Decision Support; Safety-Constrained Control; SUMO Simulation.
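The abstract's safety filter, which validates every LLM-generated recommendation before execution, can be sketched as a hard-constraint check; the field names and thresholds below are illustrative assumptions, not taken from the paper:

```python
def safety_filter(action, state, min_green_s=10):
    """Hypothetical safety filter: override an LLM-recommended phase
    switch that would violate a hard signal-timing constraint (here,
    a minimum green time), so unsafe recommendations never execute."""
    if action.get("switch_phase") and state["time_in_phase_s"] < min_green_s:
        return {"switch_phase": False, "reason": "minimum green not met"}
    return action

# A switch recommended mid-phase passes; one recommended too early is vetoed.
ok = safety_filter({"switch_phase": True}, {"time_in_phase_s": 30})
vetoed = safety_filter({"switch_phase": True}, {"time_in_phase_s": 4})
```

Structuring the LLM as a recommender behind such a filter is what lets the paper report zero constraint violations: the language model proposes, but deterministic code disposes.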
Knowledge Vector of Logical Reasoning in Large Language Models
🔥 Citations: 0
Abstract: Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows that each form of logical reasoning can be captured as a reasoning-specific knowledge vector in a linear representation space, yet these vectors are largely independent of each other. Motivated by cognitive science theory that these subforms of logical reasoning interact closely in the human brain, as well as our observation that the reasoning process for one type can benefit from the reasoning chain produced by another, we further propose to refine the knowledge representations of each reasoning type in LLMs to encourage complementarity between them. To this end, we design a complementary subspace-constrained refinement framework, which introduces a complementary loss that enables each reasoning vector to leverage auxiliary knowledge from the others, and a subspace constraint loss that prevents erasure of their unique characteristics. Through steering experiments along reasoning vectors, we find that refined vectors incorporating complementary knowledge yield consistent performance gains. We also conduct a mechanism-interpretability analysis of each reasoning vector, revealing insights into the shared and specific features of different reasoning types in LLMs.
AdaGen: Workload-Adaptive Cluster Scheduler for Latency-Optimal LLM Inference Serving
🔥 Citations: 0
Abstract: The inference workloads of Large Language Models (LLMs) pose significant latency and cost challenges due to increasing model sizes and demand for real-time responses. Existing cluster schedulers for multi-instance LLM serving primarily focus on load balancing to optimize memory usage, which is insufficient for workloads with diverse request characteristics. In such cases, the compute layout — the arrangement of tokens across iterations within each instance—plays a crucial role in determining latency. We propose AdaGen, a workload-adaptive cluster scheduler that minimizes latency and thus maximizes SLO attainment by optimizing compute layouts across instances. AdaGen employs a multi-step scheduling strategy: it first classifies requests based on prefill and decode lengths, then balances load, and finally performs selective distributed execution across instances. Each step incrementally refines the scheduling based on the compute layouts derived from the decision of the previous step. To avoid the overhead of actual execution to generate the layouts, AdaGen introduces a novel simulation-based estimator. Extensive experiments using production workloads show that AdaGen achieves up to 3.6× higher SLO attainment and 2× better cost-efficiency compared to the existing systems, while ensuring scalability.
Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
🔥 Citations:
0
Abstract: Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p<10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
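The McNemar test used above compares paired per-question outcomes of two models on the same items. One common exact two-sided variant (the abstract does not say which variant was used) needs only the discordant counts:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant pairs:
    b = questions only model A answered correctly,
    c = questions only model B answered correctly.
    Under the null, discordant outcomes are Binomial(n, 0.5)."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)  # cap: symmetric case double-counts the middle term
```

Concordant pairs (both right or both wrong) drop out entirely, which is why only the 2×2 table's off-diagonal matters.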
Real Time Adaptive Teaching Digital Human System Based on Large Language Model
DOI:
10.71451/istaer2620
🔥 Citations:
0
Abstract: With the diversification and personalization of educational needs, the traditional teaching model is facing many challenges, especially in meeting students' personalized learning needs and providing real-time feedback. This study proposes a real-time adaptive teaching digital human system based on a large language model, which aims to improve the quality of education and student engagement through intelligent technology. The system obtains students' learning status in real time through a variety of data acquisition devices (such as learning behavior tracking, question-answering records, and speech recognition) and uses the large language model to generate personalized teaching content and feedback. The system calculates a comprehensive learning status score by evaluating students' learning progress, answer accuracy, participation, and learning time, thereby dynamically adjusting the teaching content and learning path. Experimental results show that with the system, students' knowledge mastery rate increases by 15%, understanding depth by 17%, and learning interest by 20%. The system's response time is reduced from 5 seconds (in traditional systems) to 1.5 seconds, and its processing capacity is increased by 2.5 times, supporting more concurrent users. The successful implementation of the system provides a new solution for personalized education and has broad application prospects, especially in online education and distance learning.
AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models
🔥 Citations:
0
Abstract: Mechanistic interpretability research on emotion in large language models -- linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction -- depends on stimuli that contain the words for the emotions they test. When a probe fires on "I am furious", it is unclear whether the model has detected anger or detected the word "furious". The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik's eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery -- bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier -- confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p<10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.
Evaluating Large Language Model Feedback to Support Teacher Professional Vision: Prompting Strategies and Pre-service Teacher Engagement
🔥 Citations:
0
Abstract: Professional vision—the use of knowledge to notice and reason about classroom events—is an important facet of teacher expertise that is challenging for pre-service teachers (PSTs) to develop. Although feedback on written reflections of videotaped classroom events can improve professional vision, delivering feedback at scale requires significant time and labor. We explored the use of large language models (LLMs) to classify professional vision aspects in PSTs’ responses and provide feedback. We first conducted a technical evaluation to compare the classification performance of two prompting approaches (single-prompt and prompt-chaining). We then integrated the better-performing approach into a feedback prototype and conducted a user study with 13 PSTs. We analyzed PSTs’ interactions with the LLM-generated feedback. Overall, prompt-chaining showed higher inter-rater agreement with human coders in classifying professional vision than the single-prompt approach. PSTs who cycled between reading the feedback and revising reflection text showed more professional vision in the revisions, compared to those who spent more time passively reviewing feedback. We discuss the feasibility and limitations of LLM feedback in PST education and highlight considerations about AI feedback uptake.
Automated Classification of Human Code Review Comments with Large Language Models
🔥 Citations:
0
Abstract: Context: Code reviews are essential for maintaining software quality, yet many human review comments suffer from issues such as redundancy, vagueness, or lack of constructiveness. These types of comments may slow down feedback and obscure important insights. Prior work on code review comments mostly explores the detection and categorization of useful comments, while fine-grained categorization of comment issues remains underexplored. Objective: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. Methodology: We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three common useful intents, and manually labeled 448 comments from a publicly available dataset. We benchmarked zero-shot and one-shot single-label classification over each comment and its associated unified diff hunk, comparing GPT-5-mini, LLaMA-3.3, and DeepSeek-R1. We reported macro-F1 as the primary metric. Results: Zero-shot performance was moderate under class imbalance (macro-F1 0.360 to 0.374). One-shot exemplar conditioning had model-dependent effects: GPT-5-mini and DeepSeek-R1 macro-F1 scores improved, whereas LLaMA-3.3 suffered a slight decrease. Exemplars most consistently helped intent-boundary labels, whereas classification of evidence-sensitive labels remains challenging. Conclusion: Our results indicate that comment-diff evidence is sufficient for some labels but limited for evidence-sensitive smells. Future work includes adding thread context, improving intent-preserving rewrites, and validating robustness across platforms.
EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification
🔥 Citations:
0
Abstract: Multimodal artificial intelligence models for endometrial cancer (EC) risk stratification typically optimize aggregate predictive performance but provide limited mechanisms for enforcing mandatory guideline overrides, such as assigning POLE-mutated tumors to the low-risk group despite high-grade morphology. We present EndoGov, a two-tier multi-agent expert system that factorizes the decision process as D(x) = G(P(x), R), where specialist agents P extract structured evidence and a governance agent G applies an executable rule set R. Tier 1 comprises pathology, molecular, and clinical agents that independently generate schema-constrained reports from frozen foundation-model features or structured records. Tier 2 queries an evidence-level-weighted Guideline Knowledge Graph, using deterministic hard-path rules for high-priority overrides and constrained soft-path reasoning for ambiguous cases. In TCGA-UCEC (n=541), EndoGov achieved 0.943 accuracy, 0.973 macro AUC, and a conditional logic-violation rate (C-LVR) of 0.93% among trigger-exposed cases. In CPTAC-UCEC (n=95), where reference labels are guideline-derived, EndoGov reached 0.842 accuracy compared with <0.31 for locked-transfer neural baselines, supporting governance-pathway transfer under distribution shift rather than validation against independent clinical truth. End-to-end safety decomposition localized residual failures primarily to upstream molecular detection rather than downstream governance. Backend-swap experiments further showed that hard-path compliance is invariant to the LLM backend. These findings indicate that explicit clinical-rule governance can provide guideline-compliant, auditable EC risk assignment while preserving competitive discrimination.
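The factorization D(x) = G(P(x), R) can be illustrated with a toy hard-path override of the kind the abstract mentions (POLE-mutated tumors forced to low risk). The evidence schema and rule set below are invented for illustration, not EndoGov's actual guideline rules:

```python
def specialists(case):
    # P(x): specialist agents emit structured, schema-constrained evidence.
    return {"pole_mutated": case.get("pole", False),
            "grade": case.get("grade", 1)}

def governance(evidence, rules):
    # G(P(x), R): deterministic hard-path rules fire first, in priority order;
    # anything they do not cover falls through to soft-path reasoning.
    for condition, label in rules:
        if condition(evidence):
            return label
    return "intermediate"  # stand-in for the constrained soft path

RULES = [  # illustrative stand-in for a guideline rule set R
    (lambda e: e["pole_mutated"], "low"),   # mandatory POLE override
    (lambda e: e["grade"] >= 3, "high"),    # high-grade morphology
]
```

Because the override sits in G rather than in a learned model, a POLE-mutated case is assigned "low" even when the morphology evidence would otherwise trigger "high", which is exactly the auditability argument the abstract makes.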
DTAC: Decision-Tree-based Automatic Configuration of Entangled DDoS Defense Policies
🔥 Citations:
0
Abstract: As adaptive DDoS attacks intensify, the expert-dependent and time-consuming configuration for DDoS mitigation devices lags behind. However, the complex logical and dependency relations among defense policies make automatic configuration extremely challenging. Existing works have utilized Large Language Models' (LLMs) analytical and reasoning capabilities, but they still struggle with the configuration, due to their limitation in data processing and generation. To overcome these drawbacks, we propose an automatic configuration system for entangled DDoS defense policies, DTAC, which applies Decision Tree (DT) to handle the complex relations. Specifically, we reduce the LLM's workload to extracting the complex relations from the device manual. Then, a prefix-merging-based Primary Parameter Solver and a topo-order-based Secondary Parameter Solver separately handle the logical relations between conditions and the dependency relations between parameters. Preliminary experiments on public and real-world traffic traces show that DTAC can yield effective defense policies at least 1000× faster than manual configuration.
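A topo-order-based solver for parameter dependencies, as mentioned above, typically reduces to a topological sort of the dependency graph. A sketch using Kahn's algorithm (the parameter names are hypothetical; the paper's actual solver is not specified at this level of detail):

```python
from collections import deque

def topo_order(deps):
    """Kahn's algorithm: deps maps each parameter to the parameters it
    depends on; returns an order in which every dependency comes first."""
    indeg = {p: len(d) for p, d in deps.items()}
    users = {p: [] for p in deps}
    for p, d in deps.items():
        for q in d:
            users[q].append(p)  # q must be resolved before p
    queue = deque(sorted(p for p, k in indeg.items() if k == 0))
    order = []
    while queue:
        p = queue.popleft()
        order.append(p)
        for u in users[p]:
            indeg[u] -= 1
            if indeg[u] == 0:
                queue.append(u)
    if len(order) != len(deps):
        raise ValueError("cyclic dependency among parameters")
    return order
```

Resolving parameters in this order guarantees that each secondary parameter is computed only after everything it depends on is already fixed.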
Evaluating large language model's performance in answering principles of health course questions
🔥 Citations:
0
Abstract: No abstract available; see the original paper.
Detecting Citation Veneer in Guideline-Grounded Clinical and Public Health Large Language Model (LLM) Outputs: A Clinical Evidence Audit Grid Using CENHOV Tags
🔥 Citations:
0
Abstract: No abstract available; see the original paper.
Development and Evaluation of a Dual-Expertise, Utterance-Level Framework for LLM-Based Science Classroom Discourse Analysis
🔥 Citations:
0
Abstract: This study proposes a novel coding framework for analyzing science classroom discourse using large language models (LLMs), adopting fine-grained utterance-level chunking aligned with the analytical units of LLMs to address limitations of global, lesson-level observation tools. Authentic middle school science classroom discourse was annotated through a dual-expertise and iterative process integrating the theoretical knowledge of science education faculty with the experiential insights of in-service teachers, supported by systematic rater training to ensure conceptual alignment and interpretive consistency at the utterance level. Through this process, a science education glossary comprising 137 instructional terms organized into 20 thematic categories was developed using a primarily bottom-up approach informed by established observation frameworks. Building on this theory-informed foundation, we systematically examined LLM-based methods for predicting instructional themes and quality, comparing structured prompting strategies with domain-adaptive fine-tuning across model architectures. These contributions lay a foundation for future research on interpretable, scalable, and pedagogically meaningful automated formative feedback to support teachers’ self-reflection and professional growth.
Crimson: Collaborative Parameter Updates for Efficient Pipeline Training of Large Language Models
🔥 Citations:
0
Abstract: Large language models (LLMs) have driven significant progress in natural language processing, yet their training and fine-tuning remain limited by memory constraints, particularly the substantial memory footprints of optimizer states. Existing solutions address this challenge by offloading optimizer states and update tasks to the CPU, but this often leads to increased GPU idleness due to the CPU's limited computational capabilities, especially in pipeline parallelism. To address this, we present Crimson, an efficient system that optimizes parameter updates in pipeline parallelism. Crimson employs a collaborative parameter-update scheme that distributes each stage's updates across its local GPU and CPU as well as the GPUs of subsequent stages. In addition, we design a multi-stage collaborative optimizer to coordinate computations and communications, together with a trapezoid bubble aware scheduling that formulates the process as an optimization problem to determine the optimal parameter update strategy. This approach maximizes GPU computational and memory efficiency. Experimental results across diverse model scales, parallelism configurations and state-of-the-art pipeline schedules demonstrate that Crimson achieves throughput improvements exceeding 1.35× with ZeRO-Offload and ZeRO-Offload++, and up to 1.33× with Recomputation, along with increased GPU utilization.
Energy Consumption Analysis of Discrete Diffusion and Autoregressive Language Models
🔥 Citations:
0
Abstract: As Large Language Models (LLMs) continue to scale, their energy consumption during inference has become a critical concern. While Autoregressive (AR) models are the industry standard, Diffusion Language Models (DLLMs) have emerged as a promising alternative capable of parallel token generation. In this paper, we perform a comparative energy benchmark of diffusion models (LLaDA-8B, Dream-7B) against AR counterparts (Llama-3.1-8B, Qwen-2.5-7B). We analyze the trade-offs between model architecture, KV-cache optimization, and Joules-per-token efficiency. Our results show that at longer sequences (200-600 tokens), diffusion models achieve 2.5-2.7 J/token, approximately 2× more efficient than AR models with KV-cache (5.3-5.7 J/token). KV-cache provides substantial benefits for AR models at longer sequences (50-70% energy reduction).
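The Joules-per-token metric above is simply measured energy divided by generated tokens. Assuming power is sampled at a fixed interval (a common measurement setup, not necessarily the paper's), a left Riemann sum gives:

```python
def joules_per_token(power_samples_w, interval_s, tokens):
    """Approximate energy per generated token: integrate sampled power
    (watts) over time via a Riemann sum, then divide by the token count."""
    energy_j = sum(power_samples_w) * interval_s  # W * s = J
    return energy_j / tokens
```

With this definition, an AR model at 5.5 J/token and a diffusion model at 2.7 J/token differ by roughly the 2× factor the abstract reports.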
Learning Analytics for Assessment Preparation: Constructing Graphs to Guide Higher-Order Question Generation
🔥 Citations:
0
Abstract: High-quality assessment at scale requires questions that are conceptually rich, appropriately difficult, and cognitively engaging. We present a learning analytics pipeline that transforms archival assessments and learner performance data for question generation. Past assessment questions from an undergraduate engineering physics course are multi-labelled with topics from a predefined syllabus, and a primary large language model (LLM) is selected via inter-rater reliability against expert judgments to ensure label quality. Using these labels and student performance data, we construct an undirected weighted graph where nodes represent topics and edges capture both co-occurrence and difficulty, with student performance used as a proxy. Exploring this graph reveals difficulty-calibrated topic sets that can be retrieved around target topics. We then evaluate LLM-generated questions for difficulty and alignment with the Structure of Observed Learning Outcomes taxonomy when provided with the requested topics. Empirically, we found that GPT-5 Thinking shows higher agreement with expert labels. We further found that the generated questions, grounded by the topic-performance graph, were deemed to be more difficult and of higher-order intent under blinded expert review. The contribution is a scalable creation of higher-order assessment items with appropriate difficulty that are imperative for generative personalised learning and assessment systems.
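The topic graph described above — nodes for topics, undirected edges weighted by co-occurrence and by difficulty with student performance as the proxy — can be sketched as follows. The field names and the specific difficulty definition (one minus mean score) are assumptions for illustration:

```python
from itertools import combinations
from collections import defaultdict

def build_topic_graph(questions):
    """Undirected weighted graph: each edge between two topics accumulates a
    co-occurrence count and a difficulty weight (1 - mean student score over
    the questions that involve both topics)."""
    edges = defaultdict(lambda: {"cooc": 0, "scores": []})
    for q in questions:
        for a, b in combinations(sorted(set(q["topics"])), 2):
            e = edges[(a, b)]          # sorted pair keeps the edge undirected
            e["cooc"] += 1
            e["scores"].append(q["mean_score"])
    return {pair: {"cooc": e["cooc"],
                   "difficulty": 1 - sum(e["scores"]) / len(e["scores"])}
            for pair, e in edges.items()}
```

Retrieving the neighborhood of a target topic in such a graph then yields the difficulty-calibrated topic sets the pipeline feeds to the question generator.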
Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing
🔥 Citations:
0
Abstract: With the wide adoption of personal AI assistants such as OpenClaw, privacy leakage in user interaction contexts with large language model (LLM) agents has become a critical issue. Existing privacy attacks against LLMs primarily target training data, while research on inference-time contextual privacy risks in LLM agent memory remains limited. Moreover, prior methods often incur high attack costs, requiring multiple queries or relying on white-box assumptions, which limits their practicality in real-world deployments. To address these issues, we propose a training-free privacy extraction attack targeting LLM agent memory, which we name Spore. Spore is compatible with both black-box and gray-box settings. In the black-box setting, Spore can efficiently extract a small candidate set via a single query to recover the original private information. In the gray-box setting, Spore allows the attacker to leverage multi-ranked tokens for more accurate and faster privacy extraction. We provide an information-theoretic analysis of Spore and show that it achieves high query efficiency with substantial per-query information leakage. Experiments on multiple frontier LLMs show that Spore achieves higher attack success rates than existing state-of-the-art (SOTA) schemes. It also maintains low attack cost and remains stable across different model parameter settings. We further evaluate the robustness of Spore against existing defense mechanisms. Our results show that Spore consistently bypasses both detection and strong safety alignment, demonstrating resilient performance in diverse defensive settings and real-world safety threats.
General-Purpose Topology-Aware Embedding of Tumor Phylogenetic Trees with Graph Neural Networks
🔥 Citations:
0
Abstract: Phylogenetic trees are tree-like data structures commonly adopted to mathematically represent cancer clonal evolution. The information encoded by phylogenetic trees is important for clinical outcomes, but the automatic extraction of such information is still hard, also because working directly with tree-like data structures is complex. This is especially true for machine learning tasks, where models are usually designed for vector data.
We introduce CPhyT-GNN, a novel Deep Learning method to compute unsupervised embeddings of phylogenetic trees. The embeddings learnt by CPhyT-GNN are vectors that can be used for a variety of machine learning tasks. CPhyT-GNN is based on Graph Neural Networks, which makes it possible to obtain representations that combine the information provided by the alterations present in the tumour with the topological information provided by the corresponding phylogenetic tree. Experiments with cancer data show that the embeddings learnt by our model are general-purpose and can be applied to different tasks, with results that improve on the state of the art.
Data and code are available at the following link: https://github.com/VandinLab/CPhyT-GNN.
Supplementary material is available at Bioinformatics Advances online.
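The core GNN idea — combining per-node (alteration) features with tree topology into a fixed-size vector — can be miniaturized as a few rounds of neighbor averaging followed by a permutation-invariant readout. This toy sketch only stands in for the general mechanism; CPhyT-GNN's actual architecture is not specified in the abstract:

```python
def embed_tree(features, edges, rounds=2):
    """Toy message passing: each node averages its own feature vector with
    its neighbors' (repeated for a few rounds), then the tree embedding is
    the mean over all nodes — an order-invariant readout."""
    nbrs = {v: [] for v in features}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    h = {v: list(f) for v, f in features.items()}
    dim = len(next(iter(features.values())))
    for _ in range(rounds):
        h = {v: [(h[v][i] + sum(h[u][i] for u in nbrs[v])) / (1 + len(nbrs[v]))
                 for i in range(dim)]
             for v in h}
    return [sum(h[v][i] for v in h) / len(h) for i in range(dim)]
```

Because the readout averages over nodes, the resulting vector does not depend on node ordering, which is what makes such embeddings usable as plain feature vectors in downstream models.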
Time-Series Forecasting in Safety-Critical Environments: An EU-AI-Act-Compliant Open-Source Package / Zeitreihenprognose in sicherheitskritischen Umgebungen: Ein KI-VO-konformes Open-Source-Paket
🔥 Citations:
0
Abstract: With spotforecast2-safe we present an integrated Compliance-by-Design approach to Python-based point forecasting of time series in safety-critical environments. A review of the relevant open-source tooling shows that existing compliance solutions operate consistently outside of the library to be used - e.g. as scanners, templates, or runtime layers. spotforecast2-safe takes the inverse approach and anchors the requirements of Regulation (EU) 2024/1689 (the EU AI Act, in German: KI-VO), of IEC 61508, of the ISA/IEC 62443 standards series, and of the Cyber Resilience Act within the library: in application-programming-interface contracts, persistence formats, and continuous-integration gates. The approach is operationalised by four non-negotiable code-development rules (zero dead code, deterministic processing, fail-safe handling, minimal dependencies) together with the corresponding process rules (model card, executable docstrings, CI workflows, Common-Platform-Enumeration (CPE) identifier, REUSE-conformant licensing, release pipeline). Interactive visualisation, hyperparameter tuning and automated machine learning (AutoML), as well as deep-learning and large-language-model backends are deliberately excluded, because each of these components either enlarges the attack surface, introduces non-determinism, or impairs reproducibility. A bidirectional traceability matrix maps every regulatory provision onto the corresponding mechanism in the code; an end-to-end example of European-market electricity generation, transmission, and consumption forecasting demonstrates the application. The package is open-source and available under Affero General Public License (AGPL) 3.0-or-later.
FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment
🔥 Citations:
0
Abstract: In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision-Language Models (VLMs), their deployment in clinical settings has raised concerns due to their lack of transparency and potential for bias. While previous research has explored the intersection of fairness and Explainable AI (XAI), its application to VLMs for wellbeing assessment and depression prediction remains under-explored. This work investigates VLM performance across laboratory (AFAR-BSFT) and naturalistic (E-DAIC) datasets, focusing on diagnostic reliability and demographic fairness. Performance varied substantially across environments and architectures; Phi3.5-Vision achieved 80.4% accuracy on E-DAIC, while Qwen2-VL struggled at 33.9%. Additionally, both models demonstrated a tendency to over-predict depression on AFAR-BSFT. Although bias existed across both architectures, Qwen2-VL showed higher gender disparities, while Phi-3.5-Vision exhibited more racial bias. Our XAI intervention framework yielded mixed results; fairness prompting achieved perfect equal opportunity for Qwen2-VL at a severe accuracy cost on E-DAIC. On AFAR-BSFT, explainability-based interventions improved procedural consistency but did not guarantee outcome fairness, sometimes amplifying racial bias. These results highlight a persistent gap between procedural transparency and equitable outcomes. We analyse these findings and consolidate concrete recommendations for addressing them, emphasising that future fairness interventions must jointly optimise predictive accuracy, demographic parity, and cross-domain generalisation.
ragR: Retrieval-Augmented Generation and RAG Assessment in R
🔥 Citations:
0
Abstract: Retrieval-augmented generation (RAG) combines document retrieval with large language models to produce responses grounded in external evidence. While several R packages support core components of RAG workflows, integrated evaluation of RAG systems in R remains limited and is often conducted through Python-based tools, most notably the RAG assessment (RAGAS) framework. To address this gap, we introduce ragR, an R package that unifies document ingestion, embedding and vector storage, similarity-based retrieval, grounded generation, structured question-answer logging, and RAGAS-style evaluation within a single R-native workflow. The current implementation provides LLM-based scoring for four core RAGAS metrics: context precision, context recall, faithfulness, and answer relevance. Validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow across multiple use cases. By integrating RAG construction and evaluation within a reproducible workflow in R, ragR provides a practical framework for research, teaching, and moderate-scale experimentation on RAG systems entirely within the R ecosystem.
Vulnerabilities of IllusionCAPTCHA: An Empirical Analysis and a Stronger Human-Easy AI-Hard Design
🔥 Citations:
0
Abstract: As multimodal Large Language Models (LLMs) rapidly improve in visual reasoning and in-context learning, designing CAPTCHAs that remain both human-friendly and resistant to automated attacks has become increasingly difficult. Illusion-based CAPTCHAs have emerged as a promising direction, leveraging perceptual distortions that are intuitive for humans but presumed challenging for machines. IllusionCAPTCHA is a recently proposed system that pairs illusion-diffusion imagery with long descriptive options designed to interfere with automated reasoning. Its original evaluation reported low LLM success rates and high human accuracy, positioning illusion-based approaches as a promising direction for future CAPTCHA design. However, our re-analysis demonstrates that these results do not hold under realistic attack conditions. Using state-of-the-art models, we show that IllusionCAPTCHA can be reliably solved using structured prompting or iterative questioning. Our findings reveal that the vulnerability stems primarily from the option-generation scheme rather than from the illusion images themselves. To address this, we introduce a fully automated and redesigned option-generation method that uses an LLM to produce short, visually grounded, and ambiguity-preserving choices—effectively leveraging LLMs as a defensive tool against LLM attackers. Human accuracy increased from 45.24% to 89.85%, solving time decreased by 58%, and LLM success rates decreased substantially across all models and prompting conditions. These results suggest that illusion-based CAPTCHAs remain viable, but they require carefully engineered option structures to withstand modern LLM capabilities.
RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
🔥 Citations:
0
Abstract: Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.
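The conformal-prediction threshold initialization mentioned above can be sketched with split conformal calibration: hold out calibration queries, score each with a nonconformity measure (e.g. one minus the router's confidence), and escalate new queries whose score exceeds a quantile of those calibration scores. The scoring function itself is assumed given here:

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: return the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score. Escalating queries whose score exceeds
    this threshold gives distribution-free coverage >= 1 - alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]
```

The appeal for cascading is that the guarantee holds for any score distribution, so the initial threshold needs no distributional tuning before the closed loop starts refining it.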
The Blind Spots in Automated Feedback Generation for Academic Writing
🔥 Citations:
0
Abstract: Machine learning-based automated essay scoring (AES) and feedback generation (AFG) tools have been developed since the 1960s, with some commercially deployed. Such systems are expected to lighten the labor-intensive task of essay scoring and help provide timely feedback to students. However, these tools have not been used effectively enough in practice despite the long history of this research field. We aim to determine how accurately currently available models can make the necessary corrections and detect improvable segments of text. The latest attempts at AFG utilize state-of-the-art generative large language models (LLMs) for writing evaluation. To the best of our knowledge, however, none of the past studies have included generative LLM-based models in a fine-grained sentence-to-sentence comparison with human feedback or among AI tools. To fill these gaps, we conduct an experimental comparison of human feedback and three AI tools developed at different stages of technological advancement. Findings indicate that the overlap between human and AI feedback is predominantly limited to surface-level linguistic features and that generative AI-augmented tools demonstrate a markedly higher capability than a tool based on conventional rule-based AI.
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
🔥 Citations:
0
Abstract: Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to the best of our knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free-text format. Our key contribution is the manual evaluation of LLMs' generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.
LLMFolder: Revisiting Constant Folding in Large Language Models
🔥 Citations:
0
Abstract: Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts. While pruning can reduce model size, it leads to significant accuracy degradation under high compression ratios. We present a novel perspective inspired by constant folding in compiler optimization. Our approach enables parameter reduction by treating activation functions in LLMs as linear functions. However, recent LLMs use complex non-linear activations like GELU that prevent direct application of this technique. We propose LLMFolder, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges. For outlier inputs, LLMFolder employs an online predictor to dynamically fall back to original computations. Our experiments demonstrate that LLMFolder achieves 80% parameter reduction in feed-forward networks, while significantly outperforming state-of-the-art pruning methods with up to 65% higher accuracy. When combined with quantization and pruning, LLMFolder can further achieve 92.5% parameter reduction with only 4.4% average accuracy drop for a 7B model, which neither technique alone nor their combination can achieve. In practical deployments for a 7B model, LLMFolder achieves 1.6× end-to-end inference speedup when integrated with the vLLM serving system, and 1.4× speedup with the widely adopted HuggingFace implementation, while incurring only a 10.9% accuracy trade-off.
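Treating an activation as linear over a frequent input range — which is what makes the surrounding linear layers foldable — with a fallback to the exact computation for outliers, can be sketched as below. The range endpoints and the secant-line surrogate are illustrative choices, not LLMFolder's actual approximation:

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def folded_gelu(x, lo=-1.0, hi=3.0):
    """Linear surrogate for GELU inside a frequently occurring input range;
    out-of-range (outlier) inputs fall back to the exact non-linearity."""
    if lo <= x <= hi:
        # Secant line through (lo, gelu(lo)) and (hi, gelu(hi)).
        slope = (gelu(hi) - gelu(lo)) / (hi - lo)
        return gelu(lo) + slope * (x - lo)
    return gelu(x)  # fallback path
```

Once the activation is linear on the fast path, the two weight matrices it sits between compose into one, which is where the parameter reduction comes from; the fallback preserves correctness for the rare outlier activations.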
Adoption and pedagogical use of generative artificial intelligence in K-12 mathematics/science/language classrooms: Challenges and opportunities
🔥 引用:
0
Abstract: Generative artificial intelligence in K-12 mathematics, science, and language classrooms has already created major opportunities for personalized learning, automated feedback, curriculum development, and student interaction, while raising multiple questions about ethical AI, academic dishonesty, teacher preparedness, algorithmic bias, and student data security. With the spread of large language models, generative AI-driven learning aids, and multimodal learning platforms in schools, interest is growing in the pedagogical uses, utility, limitations, and implications of generative AI in school learning. This literature review follows the PRISMA framework to systematically examine current research on the adoption and pedagogical integration of generative AI in K-12 mathematics, science, and language education. The review synthesizes evidence on dominant research themes, patterns of implementation, and gaps in current scholarship. Special focus was placed on adaptive learning, intelligent tutoring systems, automated assessment, teacher professional development, multilingual learning, differentiated instruction, and human-AI collaboration. The results demonstrate that personalized learning, cognitive scaffolding, inquiry-based learning, project-based learning, formative assessment, and self-regulated learning can be greatly enhanced in mathematics, science, and language classrooms with the help of generative AI. It is concluded that responsible AI systems, enhanced teacher education, fair access, and evidence-based pedagogical models that balance innovation with human oversight are the components needed for future-ready deployment of generative AI in K-12 education.
One Size Fits None: Heuristic Collapse in LLM Investment Advice
🔥 引用:
0
Abstract: Large language models are increasingly deployed as advisors in high-stakes domains -- answering medical questions, interpreting legal documents, recommending financial products -- where good advice requires integrating a user's full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client's full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.
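The auditing idea here, fitting interpretable surrogate models to an advisor's outputs to see which inputs actually drive the decision, can be illustrated with a toy example. The synthetic advisor, the client features, and the single-feature R² surrogate below are all hypothetical stand-ins for the paper's setup:

```python
import random

random.seed(0)

# Hypothetical synthetic advisor exhibiting heuristic collapse: the equity
# allocation is driven almost entirely by self-reported risk tolerance.
def advisor(risk_tol, age, horizon_years, income):
    return 0.9 * risk_tol + 0.02 * (horizon_years / 40) + 0.01 * (income / 200_000)

profiles = [dict(risk_tol=random.random(),
                 age=random.randint(25, 70),
                 horizon_years=random.randint(1, 40),
                 income=random.randint(30_000, 200_000)) for _ in range(500)]
alloc = [advisor(**p) for p in profiles]

def r2_single_feature(key):
    # R^2 of a one-feature least-squares line: a minimal interpretable
    # surrogate measuring how much one input alone explains the decisions
    xs = [p[key] for p in profiles]
    mx = sum(xs) / len(xs)
    my = sum(alloc) / len(alloc)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, alloc))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, alloc))
    ss_tot = sum((y - my) ** 2 for y in alloc)
    return 1 - ss_res / ss_tot

scores = {k: r2_single_feature(k)
          for k in ("risk_tol", "age", "horizon_years", "income")}
```

On this synthetic advisor, risk tolerance alone explains nearly all allocation variance while the other client attributes explain almost none, which is the signature the paper's surrogate analysis looks for in real LLM outputs.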
Physics-Aware LLM-Based Probabilistic Wind Power Scenario Generation under Extreme Icing Conditions
🔥 引用:
0
Abstract: Accurately characterizing wind power uncertainty under icing and post-disaster conditions remains a critical challenge for resilient power system operation. To address this issue, this paper proposes a physics-aware large language model (LLM) framework for probabilistic wind power scenario generation under extreme icing conditions. The proposed framework integrates supervisory control and data acquisition (SCADA)-based physical modeling, multimodal tokenization, and a causal Transformer architecture trained in an autoregressive manner. A physics-aware decoding scheme effectively enforces rated power limits and ramping constraints on the generated trajectories while preserving stochastic diversity. Case studies using real wind turbine data show that the proposed method reproduces icing-induced power degradation and temporal variability observed during extreme weather. The resulting scenarios are physically consistent and high-fidelity, thereby significantly enhancing resilience assessment and recovery planning in renewable-integrated power systems.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
🔥 引用:
0
Abstract: While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
Inverting Foundation Models of Brain Function with Simulation-Based Inference
🔥 引用:
0
Abstract: Foundation models of brain activity promise a new frontier for in silico neuroscience by emulating neural responses to complex stimuli across tasks and modalities. A natural next step is to ask whether these models can also be used in reverse. Can we recover a stimulus or its properties from synthetic brain activity? We study this question in a proof-of-concept setting using TRIBEv2. We pair the brain emulator with large language models (LLMs) that generate news headlines from linguistic parameters such as valence, arousal, and dominance. We then use simulation-based inference to learn a probabilistic mapping from brain maps to latent stimulus parameters. Our results show that these parameters can be recovered from predicted brain maps, validating the quality of neural encodings. They also show that LLMs can serve as controllable stimulus generators for simulated experiments. Together, these findings provide a step toward decoding and inverse design with foundation brain models.
LLM-CEG: Extending the Classification Error Gauge Framework for Privacy Auditing of Large Language Models
🔥 引用:
0
Abstract: This paper extends the Classification Error Gauge (x-CEG) framework, originally developed for measuring the privacy-utility trade-off in tabular datasets, to privacy auditing of Large Language Models (LLMs). We propose LLM-CEG, a systematic framework that employs membership inference attack (MIA) success rates as an empirical privacy gauge and model perplexity as a utility gauge, iteratively adjusting differential privacy parameters until both thresholds are jointly satisfied. A proof-of-concept prototype fine-tunes DistilGPT-2 on a synthetic clinical PII dataset under four privacy regimes using DP-SGD. Results indicate that DP-SGD reduces MIA attacker advantage by 71.5% while simultaneously improving out-of-distribution utility by 47-50% relative to the overfitted baseline, suggesting that differential privacy may act as implicit regularization under narrow fine-tuning conditions. We further extend the SIED engineering framework to the LLM context as LLM-SIED, providing an auditable, regulator-aligned process for privacy-compliant LLM deployment.
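The core loop the abstract describes, iteratively adjusting the differential privacy parameter until the privacy gauge (MIA attacker advantage) and the utility gauge (perplexity) are jointly under threshold, can be sketched as follows. The grid sweep, the toy audit function, and all thresholds are illustrative assumptions, not the LLM-CEG implementation:

```python
# Hypothetical x-CEG-style tuning loop: sweep epsilon from loose to tight and
# return the first (largest) setting at which both gauges are satisfied.
def tune_epsilon(train_and_audit, eps_grid, max_advantage, max_perplexity):
    for eps in sorted(eps_grid, reverse=True):  # loosest privacy first
        advantage, perplexity = train_and_audit(eps)
        if advantage <= max_advantage and perplexity <= max_perplexity:
            return eps
    return None  # no epsilon in the grid satisfies both gauges

# toy audit: tighter privacy (smaller eps) lowers MIA advantage but raises
# perplexity, mimicking the usual privacy-utility trade-off of DP-SGD
audit = lambda eps: (eps / 10.0, 10.0 / eps)
chosen = tune_epsilon(audit, [1, 2, 4, 8], max_advantage=0.4, max_perplexity=5.0)
```

In the toy sweep, eps = 8 fails the privacy gauge while eps = 4 passes both, so the loop settles on the loosest compliant setting; in practice `train_and_audit` would wrap a full DP-SGD fine-tuning run plus a membership inference attack.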
Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models
🔥 引用:
0
Abstract: Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work
🔥 引用:
0
Abstract: With the emergence of large language models (LLMs) and AI agent frameworks, the human-AI co-work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low-resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi-step analytical pipelines. In this paper, we introduce Vibe Medicine, a co-work paradigm in which clinicians and researchers direct skill-augmented AI agents through natural language to execute complex, multi-step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain-informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open-source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end-to-end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over-reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent-assisted research that advances research and technological equity and reduces health care resource disparities.
Large Language Model based Interactive Decision-Making for Autonomous Driving
🔥 引用:
0
Abstract: In high-conflict mixed-traffic scenarios involving human-driven and autonomous vehicles, most existing autonomous driving systems default to overly conservative behaviors, lack proactive interaction, and consequently suffer from limited public acceptance. To mitigate intent misunderstandings and decision failures, we present a Large Language Model based interactive decision-making framework that augments scene understanding and intent-aware interaction to jointly improve safety and efficiency. The approach uses Object-Process Methodology to semantically model complex multi-vehicle scenes, abstracting low-level perceptual data into objects, processes, and relations, thereby streamlining reasoning over latent causal structure. Building on this representation, the Large Language Model parses both explicit and implicit intents of surrounding agents and, under jointly enforced safety and efficiency constraints, selects candidate maneuvers. We further generate perturbed trajectory candidates via Monte Carlo sampling and evaluate them to obtain an optimized executable trajectory. To foster transparency and coordination with nearby road users, the final decision is translated by the Large Language Model into concise natural-language messages and broadcast through an external Human-Machine Interface, completing a closed loop from scene understanding to action to language. Experiments in a cluster driving simulator demonstrate that the proposed method outperforms traditional baselines across safety, comfort, and efficiency metrics, while a Turing-test-style evaluation indicates a high degree of human-likeness in decision making. Moreover, these results suggest that coupling semantic scene abstraction with Large Language Model mediated intent reasoning and language-based eHMI communication offers a practical pathway toward interactive, trustworthy autonomous driving in dense mixed traffic.
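The trajectory-optimization step described here, perturbing candidates via Monte Carlo sampling and selecting the best under a cost, admits a minimal sketch. The waypoint representation, the Gaussian noise model, and the lane-keeping cost below are hypothetical stand-ins for the paper's joint safety/efficiency evaluation:

```python
import random

random.seed(1)

# Hypothetical Monte Carlo refinement: jitter a nominal plan with Gaussian
# noise and keep the lowest-cost candidate, never returning worse than nominal.
def refine_trajectory(nominal, cost, n_samples=200, sigma=0.3):
    best = nominal
    for _ in range(n_samples):
        candidate = [wp + random.gauss(0.0, sigma) for wp in nominal]
        if cost(candidate) < cost(best):
            best = candidate
    return best

# toy example: lateral offsets per step should hug a reference lane at y = 0
nominal = [0.5, 0.4, 0.6, 0.5]
lane_cost = lambda traj: sum(y * y for y in traj)
refined = refine_trajectory(nominal, lane_cost)
```

By construction the refined trajectory's cost never exceeds the nominal plan's, which is the guarantee one wants before handing the result to a downstream controller.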
BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors
🔥 引用:
0
Abstract: Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.
Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation
🔥 引用:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.
TailorLLM: Collaborative End-Cloud Inference of Large and Small Language Models Based on Low-Rank Adaptation
🔥 引用:
0
Abstract: With the rapid expansion of large language model inference service users, cloud computing resource costs have become a critical challenge for service providers. Although utilizing end-device resources for auxiliary inference provides new possibilities to reduce cloud computing costs, existing solutions struggle to achieve an ideal balance across multi-task accuracy, end-to-end latency, and cloud computing costs. We present TailorLLM, a task-level collaborative end-cloud inference solution based on low-rank fine-tuning for large language models. This framework comprises two core algorithms that support offline and online optimization, respectively: (i) To reduce transmission overhead while maintaining model performance, Resource-Friendly Low-Rank Adaptation (RFLoRA) decouples pre-trained parameters into cold and hot modules, reducing trainable parameters. (ii) To ensure coverage of users' common tasks, we introduce AdapterMgr, an imitation learning-based replacement strategy that enables near-optimal dynamic management of the on-device LoRA matrix library. Finally, we implemented the TailorLLM prototype system on NVIDIA 3090 and Tesla T4 servers and thoroughly evaluated it on public task datasets. Compared to a series of baselines, TailorLLM reduces cloud resource consumption by up to 69.8% and inference latency by up to 62% while maintaining high accuracy.
MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings
🔥 引用:
0
Abstract: Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history-model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance-cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity's Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
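The per-turn routing rule the abstract outlines, scoring each candidate model with a learned outcome estimator and selecting under a cost budget, can be illustrated with a small stub. The model pool, the stub estimator, and the greedy selection rule are assumptions for illustration, not MTRouter's learned joint embeddings:

```python
# Hypothetical turn-level router: among models the remaining budget can still
# afford, pick the one with the highest predicted utility for this history.
def route_turn(history, models, predict_utility, budget_left):
    affordable = [m for m in models if m["cost_per_turn"] <= budget_left]
    if not affordable:
        return None  # budget exhausted
    return max(affordable, key=lambda m: predict_utility(history, m))

models = [
    {"name": "small", "cost_per_turn": 1.0},
    {"name": "large", "cost_per_turn": 10.0},
]

# stub estimator: the large model only pays off on "hard" turns
estimate = lambda history, m: (
    (0.9 if m["name"] == "large" else 0.6)
    if history.endswith("hard")
    else (0.5 if m["name"] == "large" else 0.55)
)

pick_hard = route_turn("turn 3: hard", models, estimate, budget_left=20.0)
pick_easy = route_turn("turn 1: easy", models, estimate, budget_left=20.0)
pick_broke = route_turn("turn 9: hard", models, estimate, budget_left=5.0)
```

The sketch captures the trade-off the paper studies: expensive models are invoked only when the estimator predicts they pay off and the budget allows it, which is how fewer large-model calls can coexist with competitive accuracy.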
PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
🔥 引用:
0
Abstract: Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
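The "perspective-consistent scaling" mentioned above follows directly from the pinhole camera model: projected size is inversely proportional to depth. A minimal sketch, assuming a simple pinhole model rather than PhysLayer's actual formulation:

```python
# Under a pinhole camera, a layer moving from depth z0 to z1 is rescaled by
# z0 / z1: objects shrink as they recede, with no 3D reconstruction needed.
def perspective_scale(z0, z1):
    if z0 <= 0 or z1 <= 0:
        raise ValueError("depths must be positive (in front of the camera)")
    return z0 / z1

# doubling an object's depth halves its projected size
factor = perspective_scale(2.0, 4.0)
```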
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
🔥 引用:
0
Abstract: Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95 to 40.34% on modern large language model (LLM) based systems, compared to at most 3.90% for token-level perturbation baselines that rely on surrogate models because they cannot operate under our threat model. A legacy system relying on static lexical retrieval exhibits near-total vulnerability (97.02%), establishing a lower bound that exposes how architectural choices govern the attack surface. Evasion effectiveness is associated with three architectural properties: evidence retrieval mechanism, retrieval-inference coupling, and baseline classification accuracy. The iterative prompt optimization yields the largest marginal gains against the most robust targets, confirming that adaptive strategy discovery is essential when evasion is non-trivial. Analysis of successful rewrites reveals four exploitation patterns, each targeting failures at distinct pipeline stages. A pattern-informed defense reduces the evasion rate by up to 65.18%.
From Formal Learning to Professional Practice: Automated LLM-based Coding and Visualisation of Team Dialogue in in-situ Healthcare Simulation
🔥 引用:
2
Abstract: Simulation-based learning is central to healthcare education, yet its effectiveness depends on high-quality debriefing. Traditional debriefs often overlook detailed team dialogue dynamics. Advances in large language models (LLMs) open new possibilities for learning analytics (LA) by automatically coding and visualising teamwork behaviours from dialogue data. This study investigates the effectiveness of different prompting strategies for LLM-based coding, comparing their performance and environmental impacts (CO2e) to identify approaches suitable for transfer into professional practice. Building on these results, we evaluate the generalisability of the optimised model from university student simulations to in-situ, hospital settings, and explore how healthcare professionals perceive the interpretability, usefulness, and trustworthiness of LLM-driven learning analytics in professional learning debriefs. Findings illustrate that responsible uses of AI can help extend LA beyond a controlled university environment into an authentic, in-hospital healthcare context, offering potentially scalable and sustainable support for reflective practice and professional development.
Agentic Fusion of Large Atomic and Language Models to Accelerate Materials Discovery
🔥 引用:
0
Abstract: The discovery of novel materials is critical for global energy and quantum technology transitions. While deep learning has fundamentally reshaped this landscape, existing predictive or generative models typically operate in isolation, lacking the autonomous orchestration required to execute the full discovery process. Here we present ElementsClaw, an agentic framework for materials discovery that synergizes Large Atomic Models (LAMs) with Large Language Models (LLMs). In response to varied human requirements, ElementsClaw dynamically orchestrates a suite of LAM tools finetuned from our proposed model Elements for atomic-scale numerical computation, while leveraging LLMs for high-level semantic reasoning. This shift moves AI-driven materials science from isolated processes toward integrated and human interactive discovery. In the demanding domain of superconductors, our agentic system guides the experimental synthesis of four new superconductors, including Zr3ScRe8 with a transition temperature of 6.8 K and HfZrRe4 at 6.7 K. At scale, ElementsClaw screens more than 2.4 million stable crystals within only 28 GPU hours, identifying 68,000 high-confidence superconducting candidates and vastly expanding the known superconducting space. These results demonstrate how our agent accelerates materials discovery with high physical fidelity.
Resource-Lean Lexicon Induction for German Dialects
🔥 引用:
0
Abstract: Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
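The feature side of the approach, string-similarity features over dialect-standard word pairs, can be sketched concretely. The specific features, the toy dialect spellings, and the plain averaged score standing in for the trained random forest are all illustrative assumptions:

```python
# Classic dynamic-programming Levenshtein (edit) distance between two strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def features(src, tgt):
    # hypothetical feature set of the kind a random forest could train on:
    # normalized edit similarity, shared-prefix ratio, and length ratio
    longest = max(len(src), len(tgt))
    sim = 1.0 - levenshtein(src, tgt) / longest
    prefix = 0
    for x, y in zip(src, tgt):
        if x != y:
            break
        prefix += 1
    return [sim, prefix / longest, min(len(src), len(tgt)) / longest]

def score(src, tgt):
    # unweighted average stands in for the trained forest's prediction
    f = features(src, tgt)
    return sum(f) / len(f)

# toy dialect-to-standard ranking (illustrative spellings, not gold data)
dialect_word = "hus"  # cf. standard German "Haus"
candidates = ["Haus", "Maus", "Hase", "Hand"]
best = max(candidates, key=lambda c: score(dialect_word.lower(), c.lower()))
```

Even this crude score ranks the correct cognate first in the toy example; the paper's contribution is that a random forest trained on such features generalizes across dialects and outperforms much larger LLMs.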
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
🔥 引用:
0
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: models are effective on intact documents, but once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
To What Extent Would AI Take Over the Roles of Teachers and University Professors - A Perception of Digitalization and AI Integration in Romania's Educational Environment
DOI:
10.36713/epra27317
🔥 引用:
0
Abstract: The rapid advancement of Artificial Intelligence has moved from the peripheral margins of educational technology into the core structural fabric of modern learning environments (Anamaria-Mirabela et al., 2024). In the contemporary era, AI is no longer merely a tool for distributing information; it has become an independent mediator between information and human cognition (Georgescu et al., 2025). This paradigm shift is particularly pronounced in Romania, where the educational system is navigating a complex transition from traditional pedagogical models to a digitalized infrastructure accelerated by post-pandemic imperatives (Fleaca et al., 2022). As generative AI and adaptive learning systems become ubiquitous, a fundamental question emerges: to what extent will these technologies supplement, or perhaps eventually supersede, the professional roles of teachers and university professors? In the Romanian context, the integration of AI is not occurring in a vacuum but is part of a broader European trend toward "digital readiness" and the closing of digital gaps between rural and urban education sectors (Moroianu et al., 2023). Higher education institutions, in particular, are facing unprecedented pressure to adapt to technological interconnections that redefine what it means to be an educator (Fleaca et al., 2022). While some scholars argue that AI will revolutionize the classroom by fostering inclusive and equitable quality education, others express profound concern regarding the potential harm to the human element of pedagogy (Anamaria-Mirabela et al., 2024). The perception of digitalization in Romania is therefore characterized by a duality: the recognition of AI as a strategic ally for efficiency and personalization, versus the fear of a cognitive disengagement where human interaction and critical thinking are diminished (Voicu et al., 2026).
The role of the professor is currently evolving from being the primary distributor of knowledge to a mentor who must cultivate ethical discernment and collaborative inquiry (Georgescu et al., 2024). This transition is fraught with challenges, as the advantages of AI are often unevenly distributed, potentially reinforcing existing social disparities within the Romanian educational landscape (Georgescu et al., 2024). Furthermore, the integration of AI chatbots and large language models into the academic workflow has raised significant questions regarding academic integrity and the erosion of intellectual autonomy among students (Malița et al., 2025). This article aims to investigate the perceptions of Romanian educators and students regarding this transformative process. By analyzing the current state of digitalization and the specific cultural and structural barriers present in Romania, this research seeks to determine whether AI is perceived as a replacement for the human educator or as a sophisticated augmentation of the teaching profession. The following sections will explore the theoretical frameworks of AI integration, the specific challenges of the Romanian digital transition, and the shifting identity of the academic professional in the age of automation.
PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
🔥 引用:
0
Abstract: Physics-aware symbolic simulation of 3D scenes is critical for robotics, embodied AI, and scientific computing, requiring models to understand natural language descriptions of physical phenomena and translate them into executable simulation environments. While large language models (LLMs) excel at general code generation, they struggle with the semantic gap between physical descriptions and simulation implementation. We introduce PhysCodeBench, the first comprehensive benchmark for evaluating physics-aware symbolic simulation, comprising 700 manually-crafted diverse samples across mechanics, fluid dynamics, and soft-body physics with expert annotations. Our evaluation framework measures both code executability and physical accuracy through automated and visual assessment. Building on this, we propose a Self-Corrective Multi-Agent Refinement Framework (SMRF) with three specialized agents (simulation generator, error corrector, and simulation refiner) that collaborate iteratively with domain-specific validation to produce physically accurate simulations. SMRF achieves 67.7 points overall performance compared to 36.3 points for the best baseline among evaluated SOTA models, representing a 31.4-point improvement. Our analysis demonstrates that error correction is critical for accurate physics-aware symbolic simulation and that specialized multi-agent approaches significantly outperform single-agent methods across the tested physical domains.
Benchmarking Testing in Automated Theorem Proving
🔥 引用:
0
Abstract: Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proofs, or on expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T, a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.
Hamiltonian Graph Inference Networks: Joint structure discovery and dynamics prediction for lattice Hamiltonian systems from trajectory data
🔥 引用:
0
Abstract: Lattice Hamiltonian systems underpin models across condensed matter, nonlinear optics, and biophysics, yet learning their dynamics from data is obstructed by two unknowns: the interaction topology and whether node dynamics are homogeneous. Existing graph-based approaches either assume the graph is given or, as in the α-separable graph Hamiltonian network, infer it only for separable Hamiltonians with homogeneous node dynamics. We introduce the Hamiltonian Graph Inference Network (HGIN), which jointly recovers the interaction graph and predicts long-time trajectories from state data alone, for both separable and non-separable Hamiltonians and under heterogeneous node dynamics. HGIN couples a structure-learning module (a learnable weighted adjacency matrix trained under a Hamilton's-equations loss) with a trajectory-prediction module that partitions edges into physically distinct subgraphs via k-means clustering, assigning each subgraph its own encoder and thereby breaking the parameter-sharing bottleneck of conventional GNNs. On three benchmarks (a Klein–Gordon lattice with long-range interactions and two discrete nonlinear Schrödinger lattices, homogeneous and heterogeneous), HGIN reduces long-time energy prediction error and trajectory prediction error by six to thirteen orders of magnitude relative to baselines. A symmetry argument on the Hamiltonian loss further shows that the learned weights encode the parity of the underlying pair potential, yielding an interpretable readout of the system's interaction structure.
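The Hamilton's-equations loss named in this abstract can be made concrete with a small sketch. The squared-residual form below is an assumption (one plausible reading, not the authors' implementation), and `grad_H` is a hypothetical callable returning the Hamiltonian's partial derivatives:

```python
import numpy as np

def hamiltons_equations_loss(q, p, dqdt, dpdt, grad_H):
    # Residual of Hamilton's equations: dq/dt = dH/dp, dp/dt = -dH/dq.
    # grad_H(q, p) -> (dH/dq, dH/dp). A trajectory of the true system
    # drives both residuals to zero.
    dHdq, dHdp = grad_H(q, p)
    return float(np.mean((dqdt - dHdp) ** 2 + (dpdt + dHdq) ** 2))

# Toy check with the harmonic oscillator H = (q^2 + p^2) / 2:
t = np.linspace(0.0, 2 * np.pi, 200)
loss = hamiltons_equations_loss(
    np.sin(t), np.cos(t),    # q, p on the exact trajectory
    np.cos(t), -np.sin(t),   # their time derivatives
    lambda q, p: (q, p),     # dH/dq = q, dH/dp = p
)
```

On the exact trajectory the residual vanishes, so the loss is zero up to floating-point error; a perturbed trajectory yields a strictly positive value.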
When AI reviews science: Can we trust the referee?
🔥 引用:
3
Abstract: The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive—and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle—training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.
From Solo Graders to Assisted Annotation: Integrating LLM Suggestions into the Educational Data Creation Pipeline
🔥 引用:
0
Abstract: The creation of annotated datasets is important for Automatic Short Answer Grading (ASAG), but establishing a reliable gold standard remains a challenge, often complicated by discrepancies among human annotators. Within Learning Analytics (LA), this challenge is particularly pressing because dataset reliability underpins the validity of analytic models and the feedback they provide to learners and educators. To address this, this study investigated the dynamics between human raters and Large Language Models (LLMs) in the assessment cycle, comparing approaches ranging from human-supported annotation to automated scoring. The study was conducted with students' Chemistry responses from a Brazilian high school. Responses were graded by teachers under scenarios with and without the aid of score suggestions generated by an LLM (GPT-4o-mini). Results showed that human grading with LLM suggestions (Human–AI interaction) yielded outcomes significantly closer to the gold standard (Kappa = 0.91) than approaches based solely on humans or on automated systems. Additionally, LLM suggestions increased agreement among human graders. Our findings suggest that the Human–AI process is not only a technical improvement but also a methodological contribution to LA. This offers a potential pathway for constructing more reliable datasets that strengthen the validity of subsequent analytics and the support they provide in educational practice.
SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models
🔥 引用:
0
Abstract: Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. While text-to-SQL systems enable natural language querying, they remain limited to relational operations and cannot leverage the semantic reasoning capabilities of modern large language models (LLMs). Conversely, recent semantic operator systems extend relational algebra with LLM-powered operations (e.g., semantic joins, mappings, aggregations), but require users to manually construct complex query pipelines. To address this gap, we present SEMA-SQL, a system that automatically answers natural language questions by generating efficient queries that combine relational operations with LLM semantic reasoning. We formalize Hybrid Relational Algebra (HRA), a declarative abstraction unifying traditional relational operators with LLM user-defined functions (UDFs). The system automates three critical aspects: (1) query generation via in-context learning that produces HRA queries with precise natural language specifications for LLM UDFs, (2) query optimization via cost-based transformations and UDF rewriting, and (3) efficient execution algorithms that reduce LLM invocations by an average of 93% in semantic joins through intelligent batching. Extensive experiments with known benchmarks, and extensions thereof, demonstrate the significant query capability improvements possible with our design.
Designing, developing, and evaluating educational games that provide adaptive feedback with generative artificial intelligence
🔥 引用:
0
Abstract: Aiming to address pedagogical challenges in programming education and the limitations of static feedback mechanisms, this study designed and evaluated an AI-supported adaptive Role-Playing Game (RPG) for teaching C++. The Design-Based Research methodology was adopted to bridge the gap between theoretical design and practice. Across five iterative cycles involving 70 participants, a dynamic feedback system responsive to student performance was created by integrating Retrieval-Augmented Generation (RAG) architecture and Large Language Models. Findings indicated that the short-term game intervention did not yield a statistically significant change in academic attitude and motivation. However, in later stages where system stability was secured, qualitative data indicated a shift in student focus from interface issues to deeper learning of programming concepts. In-game log data suggested that adaptive mechanisms and reflective elements helped transition students from inefficient trial-and-error strategies toward more analytical behavior. Crucially, the findings imply that the reflection mechanism contributed to establishing a psychologically safe space where mistakes were treated as an intrinsic part of the learning process rather than failures. This personalized support appeared to show potential in mitigating the cycle of learned helplessness by supporting students' emotional resilience. The study offers a set of evidence-based principles for designers by illuminating the behavioral effects of using generative AI as an educational scaffold.
The Collapse of Heterogeneity in Silicon Philosophers
🔥 引用:
0
Abstract: Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment-relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from N = 277 professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open-source large language models on their ability to replicate individual philosophical positions and to preserve cross-question correlation structures across philosophical domains. We find that language models substantially over-correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine-tuning and by validating results against the full PhilPapers 2020 Survey (N = 1785). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at https://github.com/stanford-del/silicon-philosophers.
Preliminary Evaluation of Large Language Models in Kennedy Classification and Removable Partial Denture Planning: An In Silico Study
DOI:
10.12659/msm.953353
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🔥 引用:
0
Abstract: Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps. We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime. These results indicate that hybrid JIT-CUDA Graph execution can effectively reduce inference latency and variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications.
HBGSA: Hydrogen Bond Graph with Self-Attention for Drug-Target Binding Affinity Prediction
🔥 引用:
0
Abstract: Accurate prediction of drug-target binding affinity accelerates drug discovery by prioritizing compounds for experimental validation. Current methods face three limitations: sequence-based approaches discard spatial geometric constraints, structure-based methods fail to exploit hydrogen bond features, and conventional loss functions neglect prediction-target correlation, a key factor for identifying high-affinity compounds in virtual screening. We developed HBGSA (Hydrogen Bond Graph with Self-Attention), a 3.06M-parameter model that encodes hydrogen bond spatial features. HBGSA uses graph neural networks to model hydrogen bond spatial topology with self-attention enhancement and Pearson correlation loss. Experimental results on PDBbind Core Set and CSAR-HiQ dataset demonstrate that HBGSA outperforms baseline methods with strong generalization capability. Ablation studies confirm the effectiveness of hydrogen bond modeling and Pearson correlation loss.
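The Pearson correlation loss this abstract names rewards predictions that preserve the rank ordering of affinities, which matters more than absolute error when prioritizing compounds in virtual screening. A minimal sketch of one plausible form, 1 - r (an assumption, not the authors' code):

```python
import numpy as np

def pearson_corr_loss(pred, target, eps=1e-12):
    # 1 - Pearson r: near zero when predictions are perfectly (positively)
    # correlated with targets, near 2 when perfectly anti-correlated.
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pc, tc = pred - pred.mean(), target - target.mean()
    r = (pc * tc).sum() / (np.sqrt((pc ** 2).sum() * (tc ** 2).sum()) + eps)
    return 1.0 - r
```

In practice such a term is typically added to a pointwise loss (e.g. MSE) so the model is penalized both for absolute error and for scrambling the affinity ranking.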
Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM
🔥 引用:
0
Abstract: Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.
Failed comprehensiveness, successful minimalism: Wikipedia’s 3-year struggle to govern AI-generated content (2022–2025)
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition
🔥 引用:
0
Abstract: Cloud-hosted Large Language Models (LLMs) offer unmatched reasoning capabilities and dynamic knowledge, yet submitting raw queries to these external services risks exposing sensitive user intent. Conversely, relying exclusively on trusted local models preserves privacy but often compromises answer quality due to limited parameter scale and knowledge. To resolve this dilemma, we propose Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework that formulates the trade-off between knowledge utility and privacy as a strategic game. GTKA consists of three components: (i) a privacy-aware sub-query generator that decomposes sensitive intent into generalized, low-risk fragments; (ii) an adversarial reconstruction attacker that attempts to infer the original query from these fragments, providing adaptive leakage signals; and (iii) a trusted local integrator that synthesizes external responses within a secure boundary. By training the generator and attacker in an alternating adversarial manner, GTKA optimizes the sub-query generation policy to maximize knowledge acquisition accuracy while minimizing the reconstructability of the original sensitive intent. To validate our approach, we construct two sensitive-domain benchmarks in the biomedical and legal fields. Extensive experiments demonstrate that GTKA significantly reduces intent leakage compared to state-of-the-art baselines while maintaining high-fidelity answer quality.
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
🔥 引用:
0
Abstract: Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, current planning approaches face a fundamental limitation: they operate at a fixed level of granularity. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of progressive refinement in cognitive science, we propose AdaPlan-H, a self-adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse-grained macro plan and progressively refines it based on task complexity. It generates self-adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized through imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi-step complex decision-making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import-myself/AHP.
Mixture of Heterogeneous Grouped Experts for Language Modeling
🔥 引用:
0
Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE), which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.
𝒮2IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction
🔥 引用:
0
Abstract: Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi-step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax-guided Extraction and 2) Local Syntax-guided Classification, integrating both global and local syntactic structure information. Finally, Fine-grained Structural Tuning enhances the model's understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state-of-the-art performance across multiple datasets. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/S2IT.
Carbon Footprint of Training Large AI Models: Measuring and Proposing Mitigation Strategies for the Environmental Cost of AI
DOI:
10.55041/ijsrem61165
🔥 引用:
0
Abstract:
Over the past decade, artificial intelligence has quietly shifted from a research curiosity to infrastructure that billions of people depend on every day. Large language models write code, assist doctors, translate languages, and hold conversations in ways that once seemed firmly out of reach. But behind every chatbot interaction and AI-assisted search result, there is a data center consuming electricity and in most parts of the world, that electricity still comes largely from fossil fuels. The carbon cost of building and running these systems is real, growing, and almost entirely invisible to the people using them.
This paper investigates that hidden cost in depth. We examine how energy demands have grown alongside model size, from BERT's modest 652 kg of CO2 in 2018 to independent estimates placing GPT-4's training footprint in the thousands of tonnes. We survey the measurement tools researchers use to quantify these emissions and expose the significant gaps that remain. We then present a comprehensive set of mitigation strategies spanning algorithmic efficiency, hardware improvements, geographic scheduling, renewable energy, and transparency standards. Finally, we honestly confront the structural barriers (competitive scaling incentives, the Jevons Paradox, and the near-total absence of embodied carbon from current estimates) that make progress harder than the technical solutions alone would suggest. The central argument is straightforward: sustainable AI is technically achievable. What it requires is the collective will to treat it as a genuine priority.
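The per-model figures this abstract cites all reduce to the same back-of-envelope arithmetic used across the Green-AI literature: accelerator energy (GPU-hours times power draw), inflated by the data center's PUE, times the grid's carbon intensity. A sketch with illustrative inputs (the numbers below are assumptions, not figures from the paper):

```python
def training_emissions_kg(gpu_hours, gpu_power_kw, pue, grid_kg_per_kwh):
    # kWh consumed by the accelerators, scaled by PUE for facility
    # overhead (cooling, networking), then converted to kgCO2 via the
    # local grid's carbon intensity.
    return gpu_hours * gpu_power_kw * pue * grid_kg_per_kwh

# e.g. 10,000 GPU-hours at 0.4 kW, PUE 1.5, on a 0.5 kgCO2/kWh grid:
kg = training_emissions_kg(10_000, 0.4, 1.5, 0.5)  # 3,000 kg = 3 tCO2
```

The same formula shows why geographic scheduling matters: moving the identical job to a grid at 0.05 kgCO2/kWh cuts the estimate tenfold with no algorithmic change.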
Keywords: Artificial Intelligence, Carbon Footprint, Large Language Models, Energy Consumption, Green AI, Sustainable Computing, CO2 Emissions, Machine Learning, Inference Efficiency, Embodied Carbon.
Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue
🔥 引用:
0
Abstract: Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public Safety
🔥 引用:
0
Abstract: Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with ~10^4× compression; role-filtered payloads transmit in <15 s at 1 Mbps over FirstNet Band 14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
Artificial Intelligence in Libraries: From Machine Learning to Large Language Models
🔥 引用:
0
Abstract: The processes of gathering, organising, and disseminating information have changed with the technological shift and the introduction of new technologies such as artificial intelligence. The present study examines the ability of Artificial Intelligence and its subfields, such as Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs), to reshape core library operations like cataloguing, reference services, research support, and digital preservation. The main argument of the article concerns the ethical, inclusive, and policy-compliant use of AI technology within libraries. The article also acknowledges the increasing role of librarians as curators of AI tools and highlights the need for a plan-of-action framework that aligns the deployment of AI with national educational and digital objectives. In the future, with the help of AI, librarians will be positioned to develop intelligent knowledge systems, thereby promoting access and sustainable learning opportunities in the digital age.
LLM-Guided Cross-Platform Optimization of Cloud Analytics Workloads
🔥 引用:
0
Abstract: Large-scale data analytics in the cloud inevitably involves trade-offs among latency, throughput, scalability, elasticity, and cost. Today's platforms model these trade-offs in very different ways: Amazon EMR builds on managed Hadoop ecosystems, Spark on Kubernetes provides container-native distributed execution, and Snowflake offers a fully managed data warehousing model. Although prior benchmarks, often based on TPC-DS, TPC-H, or microbenchmarks, have studied these systems, they are typically evaluated in isolation and rely on static configurations, manual tuning, or simplified cost assumptions. As a result, it remains unclear how these platforms compare under realistic, evolving cloud workloads, or how their performance and cost can be jointly optimized in dynamic environments. To bridge this gap, we introduce LLM-TradeOpt, a Large Language Model (LLM)-guided optimization framework that adaptively reasons about workload characteristics, system configurations, and execution traces across heterogeneous analytics platforms. Using CloudSuite v4.0 analytics workloads, our evaluation shows that LLM-TradeOpt consistently improves performance and efficiency, achieving up to 18.7% lower latency, 22.4% higher throughput, and 15.3% cost savings compared to strong baselines on Amazon EMR, Apache Spark on Kubernetes, and Snowflake.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
🔥 引用:
0
Abstract: Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
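The discriminative test this abstract describes (the model passes when it assigns in-language strings higher generation probability than out-of-language strings) can be sketched in a few lines; `score` below stands in for any model's log-probability function, and the toy scorer is an assumption for illustration:

```python
def discriminative_accuracy(pairs, score):
    # The model "passes" a pair when the in-language string receives a
    # strictly higher score than its paired out-of-language string.
    wins = sum(score(inside) > score(outside) for inside, outside in pairs)
    return wins / len(pairs)

# Toy check: a scorer that penalizes characters outside the alphabet {a, b}
# perfectly separates strings over {a, b} from strings containing other symbols.
toy_score = lambda s: -sum(ch not in "ab" for ch in s)
pairs = [("abab", "abcx"), ("ba", "zz"), ("aabb", "a9b9")]
acc = discriminative_accuracy(pairs, toy_score)  # 1.0
```

Because the test only compares paired probabilities, it needs no threshold calibration, which is what makes it usable across models of different sizes and families.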
ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation
🔥 Citations:
0
Abstract: As software systems grow in complexity, they must satisfy an increasing number of competing quality attributes, making it essential to balance them in a principled manner; for example, a safety requirement for sensor-fusion verification may conflict with a tight planning-cycle budget. Multi-agent large language model frameworks support this balancing process by assigning specialized agents to different objectives. However, their conflict resolution is typically heuristic. Requirements are aggregated implicitly without explicit acceptance or rejection, limiting auditability in regulated domains. We present ArgRE, a multi-agent requirements negotiation system that embeds Dung-style abstract argumentation into the negotiation stage. Each proposal, critique, and refinement is modeled as an argument, conflicts are represented as directed attack relations, and the accepted set of arguments is computed under grounded and preferred semantics. The pipeline further integrates KAOS goal modeling, multi-layer verification, and standards-oriented artifact generation. Evaluation across five case studies spanning safety-critical, financial, and information-system domains shows that ArgRE provides argument-level traceability absent from existing frameworks. Independent evaluators rated its decision justifications significantly higher than those of heuristic synthesis (4.32 vs. 3.07, p<0.001), indicating improved auditability, while semantic intent preservation remains comparable (94.9% BERTScore F1) and compliance coverage reaches 84.7% versus 47.6–47.8% for baselines. Structural analysis further confirms that the default pairwise protocol yields acyclic graphs in which grounded and preferred semantics coincide, whereas cross-pair arbitration introduces controlled cyclicity, leading to predictable divergence between the two semantics.
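The grounded semantics mentioned above has a simple fixpoint characterization: repeatedly accept every argument all of whose attackers are already defeated by accepted arguments. A minimal sketch (the representation as id sets and attack pairs is an assumption for illustration, not ArgRE's internal data model):

```python
def grounded_extension(args, attacks):
    """Grounded extension of a Dung argumentation framework.

    args: set of argument ids; attacks: set of (attacker, target) pairs.
    Iterate the characteristic function: accept every argument all of whose
    attackers are defeated (attacked by an accepted argument), to a fixpoint.
    """
    accepted, changed = set(), True
    while changed:
        changed = False
        defeated = {t for (a, t) in attacks if a in accepted}
        for x in args - accepted:
            attackers = {a for (a, t) in attacks if t == x}
            if attackers <= defeated:  # all attackers already defeated
                accepted.add(x)
                changed = True
    return accepted

# Example: A attacks B, B attacks C  ->  grounded extension is {A, C}.
ext = grounded_extension({"A", "B", "C"}, {("A", "B"), ("B", "C")})
```

On acyclic graphs like this one, the grounded and preferred extensions coincide, which matches the structural observation in the abstract.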
How can AI be compatible with evidence-based medicine?: with an example of analysis of lung cancer recurrence
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Layer Embedding Deep Fusion Graph Neural Network
🔥 Citations:
0
Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance in learning representations from graph-structured data. However, their message-passing mechanism inherently relies on the assumption of label consistency among connected nodes, limiting their applicability to low-homophily settings. Moreover, since message passing operates as a hierarchical diffusion process, GNNs face challenges in capturing long-range dependencies. As network depth increases, the structural noise along heterophilic edges tends to be amplified, resulting in over-smoothing. This issue becomes especially prominent in highly heterophilic graphs, where the propagation of inconsistent semantics across the topology continually exacerbates misaggregation. To address this issue, we propose a novel framework named Layer Embedding Deep Fusion Graph Neural Network (LEDF-GNN). Specifically, we design a Layer Embedding Deep Fusion (LEDF) operator that nonlinearly fuses multi-layer embeddings to capture inter-layer dependencies and effectively alleviate deep propagation degradation. Meanwhile, to mitigate structural heterophily, LEDF-GNN employs a Dual-Topology Parallel Strategy (DTPS) that simultaneously leverages the original and reconstructed topologies, allowing for adaptive structure-semantics co-optimization under diverse homophily conditions. Extensive semi-supervised classification experiments on the citation and image benchmarks demonstrate that, under both homophilic and heterophilic settings, LEDF-GNN consistently outperforms state-of-the-art baselines, validating its effectiveness and generalization capability across diverse graph types.
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
🔥 Citations:
0
Abstract: Autonomous Large Language Model (LLM) agents are increasingly deployed to conduct complex tasks by interacting with external tools, APIs, and memory stores. However, processing untrusted external data exposes these agents to severe security threats, such as indirect prompt injection and unauthorized tool execution. Securing these systems requires effective information flow tracking. Yet, traditional taint analysis that is designed for program memory states fundamentally fails when applied to LLMs, where data propagation is governed by probabilistic natural language reasoning. In this paper, we present NeuroTaint, the first comprehensive taint tracking framework tailored for the unique information flow characteristics of LLM agents. Our key insight is that taint propagation in LLM agents must be understood not only as explicit content transfer, but also as semantic transformation, causal influence on decisions, and cross-session persistence through memory. NeuroTaint therefore audits execution traces offline to reconstruct provenance from untrusted sources to privileged sinks using semantic evidence, causal reasoning, and persistent context tracking, rather than relying on exact string matches or pre-defined source-sink paths alone. Extensive evaluation using TaintBench, our 400-scenario benchmark spanning 20 real-world agent frameworks, shows that NeuroTaint substantially outperforms FIDES, an information-flow-control (IFC)-style baseline for LLM agents, in source-sink propagation detection. We further show that NeuroTaint remains effective on established agent-security benchmarks, including InjecAgent and ToolEmu, while operating offline with modest additional auditing cost.
In silico screening, ADMET analysis, MD simulations, and MM/PBSA binding free energy identify new inhibitor molecules for viroporin E
🔥 Citations:
1
Abstract: No abstract available; see the original article.
h-MINT: Modeling Pocket-Ligand Binding with Hierarchical Molecular Interaction Network
🔥 Citations:
0
Abstract: Accurate molecular representations are critical for drug discovery, and a central challenge lies in capturing the chemical environment of molecular fragments, as key interactions, such as H-bonds and π-stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, atom-level representations can hardly express higher-order chemical context (e.g., stereochemistry, lone pairs, conjugation). Fragment-based methods (e.g., principal subgraph, predefined functional groups) fail to preserve essential information such as chirality, aromaticity, and ionic states. This work addresses these limitations from two aspects. (i) OverlapBPE tokenization. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows overlapping fragments, reflecting the inherently fuzzy boundaries of small-molecule substructures and, together with enriched chemical information at the token level, thereby preserving a more complete chemical context. (ii) h-MINT model. OverlapBPE induces many-to-many atom-fragment mappings, which necessitate a new hierarchical architecture. We therefore develop a hierarchical molecular interaction network capable of jointly modeling interactions at both atom and fragment levels. By supporting fragment overlaps, the model naturally accommodates the many-to-many atom-fragment mappings introduced by the OverlapBPE scheme. Extensive evaluation against state-of-the-art methods shows our method improves binding affinity prediction by 2-4% Pearson/Spearman correlation on PDBBind and LBA, enhances virtual screening by 1-3% in key metrics on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interactive information while maintaining good generalization.
When the Agent Is the Adversary: Architectural Requirements for Agentic AI Containment After the April 2026 Frontier Model Escape
🔥 Citations:
0
Abstract: The April 2026 disclosure that a frontier large language model escaped its security sandbox, executed unauthorized actions, and concealed its modifications to version control history demonstrates that agentic AI systems with autonomous tool access can circumvent the containment mechanisms designed to constrain them. This paper analyzes four categories of current containment approaches - alignment training, environmental sandboxing, application-level tool-call interception, and accessible audit systems - and identifies the failure modes each exhibits when the AI agent is treated as a potential adversary rather than a trusted component receiving adversarial inputs. We categorize five behavioral incidents from the public disclosure and situate them within 698 real-world AI scheming incidents documented by the Centre for Long-Term Resilience between October 2025 and March 2026, a 4.9x acceleration establishing the challenge as systemic. We derive five architectural requirements: trust separation through layered OS privilege enforcement with semantic intent analysis, sequential intent inference through five-phase taxonomic monitoring, independent containment integrity monitoring, adversarial audit isolation through logical invisibility, and emergent capability envelope enforcement through distributional divergence monitoring. No publicly described system satisfies all five. We argue that architectural containment is the only durable safety strategy given the inevitable proliferation of equivalent capabilities including open-weight models. The author's published patent portfolio in provider-independent constraint enforcement addresses several of these requirements. Concurrent work including SandboxEscapeBench (arXiv:2603.02277) independently confirms that frontier models can escape standard container sandboxes, corroborating the threat model presented here.
AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code
🔥 Citations:
0
Abstract: Deep learning malware detectors achieve high classification accuracy but suffer from severe interpretability limitations, typically returning probabilistic verdicts that lack forensic context. We introduce AsmRAG, a framework performing malware analysis through Assembly-Level Retrieval-Augmented Generation. Unlike classifiers built on global statistical features, AsmRAG reformulates detection as an evidence-based retrieval task. The system uses a code-specialized Large Language Model (LLM) to analyze assembly functions and convert them into semantic embeddings. This process constructs a searchable knowledge base resilient to syntactic obfuscation. For inference, we propose a Density-Weighted Anchor Selection mechanism that isolates the primary unit of malicious logic within a binary to extract verifiable forensic evidence and resist evasion attempts. Testing on a curated dataset of 40k binaries shows AsmRAG reaching a detection F1-score of 96% alongside a family attribution F1-score of 95%. Comparisons confirm this semantic retrieval approach remains robust against metamorphic obfuscation. When holistic baselines (EMBER and ResNeXt) degrade, our methodology gives Security Operations Centers a transparent and reliable alternative.
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
🔥 Citations:
0
Abstract: Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^*$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g.,"X cells are directionally hazardous: entering from the north is dangerous"). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15% on average, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
🔥 Citations:
0
Abstract: Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
AI Safety Training Can be Clinically Harmful
🔥 Citations:
0
Abstract: Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.
AI SaaS Website Builder: An Intelligent Platform for Instant Website Generation
DOI:
10.55041/ijsrem61144
🔥 Citations:
0
Abstract: Imagine how different the process of building a website was just a few years ago. You would hire a professional web developer, wait for weeks or even months, spend thousands of dollars, or alternatively, struggle with complicated drag-and-drop builders that still required significant technical understanding. These traditional approaches, though functional, created substantial barriers for small businesses, entrepreneurs, and individuals who simply wanted an online presence.
In this paper, we walk readers through our approach to an AI-powered SaaS website builder: how we designed, constructed, and tested it, and how it can serve users seeking instant website generation. The platform allows users to describe their desired website in plain language and receive a fully functional, deployed website within minutes. Our features include AI-driven code generation, one-click deployment, credit-based monetization, secure authentication, and smooth user interfaces. The most important contributions are an adaptable architecture supporting prompt-to-website conversion, an interface that performs well across devices, and an evaluation of whether the platform works based on user reactions. What we discovered was fairly promising: users saved tremendous development time, non-technical individuals could create professional websites, and the credit-based system proved to be a sustainable monetization model. The AI generation approach proved to be a win-win, as users receive instant results when they need them and the platform sustains its operations through subscriptions and credit purchases. This study contributes to the existing knowledge about AI-powered development tools and provides practical guidance to anyone developing this type of intelligent website generation platform.
Keywords: AI Website Builder, SaaS Platform, Automated Code Generation, Large Language Models, MERN Stack, One-Click Deployment
Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
🔥 Citations:
0
Abstract: Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, features in layers 16-25 were relatively more vulnerable to steering. All three methods confirmed that mid-to-late-layer feature subgroups are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups in mid-to-late layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.
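The feature amplification used in the third stage amounts to adding a scaled feature direction to a hidden state at a chosen layer. A minimal sketch of that arithmetic (the vectors and coefficient below are illustrative; in the paper this would be applied to SAE decoder directions in the residual stream of Gemma-2-2B, typically via a forward hook):

```python
def steer_hidden_state(h, feature_dir, alpha):
    """Amplify one feature direction in a hidden state: h' = h + alpha * d.

    h, feature_dir: equal-length vectors (lists of floats);
    alpha: steering coefficient controlling amplification strength.
    """
    assert len(h) == len(feature_dir)
    return [hi + alpha * di for hi, di in zip(h, feature_dir)]

h = [0.5, -1.0, 2.0]          # toy hidden state
d = [1.0, 0.0, -1.0]          # hypothetical decoder direction of one SAE feature
steered = steer_hidden_state(h, d, alpha=2.0)
```

Measuring how the model's output harmfulness changes as `alpha` varies, layer by layer, is what localizes the vulnerable subgroups.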
Evaluating Large Language Models on Computer Science University Exams in Data Structures
🔥 Citations:
0
Abstract: We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI's GPT-4o and Anthropic's Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
🔥 Citations:
0
Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collected over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy to maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations help reduce all-to-all communication data by up to 20, resulting in lower MoE decode latency and better accelerator utilization.
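One simple way to act on measured expert activation frequencies is a greedy placement that assigns the hottest experts first, always to the least-loaded node. This is a hedged sketch of the general idea only, under an assumed counts-per-expert representation, not the paper's exact locality-maximizing strategy:

```python
def place_experts(counts, num_nodes):
    """Greedy load-balancing placement from expert activation counts.

    Assign each expert (hottest first) to the node with the smallest
    accumulated activation load, evening out per-node expert traffic.
    """
    loads = [0] * num_nodes
    placement = {}
    for expert, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        node = loads.index(min(loads))  # least-loaded node so far
        placement[expert] = node
        loads[node] += c
    return placement, loads

# Hypothetical activation counts harvested from profiling traces.
counts = {"e0": 50, "e1": 30, "e2": 20, "e3": 10}
placement, loads = place_experts(counts, num_nodes=2)
```

Balancing load is only half the story; co-locating experts that frequently fire for the same micro-batch is what actually cuts all-to-all traffic, which is where the workload-aware grouping comes in.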
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
🔥 Citations:
0
Abstract: Have you ever post-trained a generalist vision-language-action (VLA) policy on a small demonstration dataset, only to find that it stops responding to new instructions and is limited to behaviors observed during post-training? We identify this phenomenon as lock-in: after low-data, supervised fine-tuning (SFT), the policy becomes overly specialized to the post-training data and fails to generalize to novel instructions, manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). Many existing remedies introduce additional supervision signals, such as those derived from foundation models or auxiliary objectives, or rely on augmented datasets to recover generalization. In this paper, we show that the policy's internal pre-trained knowledge is sufficient: DeLock mitigates lock-in by preserving visual grounding during post-training and applying test-time contrastive prompt guidance to steer the policy's denoising dynamics according to novel instructions. Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds the performance of a state-of-the-art generalist policy post-trained with substantially more curated demonstrations.
Proteus: Shapeshifting Desktop Visualizations for Mobile via Multi-level Intelligent Adaptation
🔥 Citations:
0
Abstract: With the rise of mobile-first consumption, users increasingly engage with data visualizations on mobile devices. However, the vast majority of existing visualizations are originally authored for desktop environments. Due to significant differences in viewport size and interaction paradigms, directly scaling desktop charts often results in illegible text, information loss, and interaction failures. To bridge this gap, we propose an automated framework to adapt desktop-based visualizations for mobile screens. By systematically categorizing the operations involved in the adaptation process, we establish a multi-level design space. This space defines evolution rules spanning from the global topology level, through the reference frame level, down to the visual elements level. Guided by this theoretical framework, we developed Proteus, a large language model-driven multi-agent system that automatically parses online visualizations, predicts optimal transformation strategies within the design space, and generates equivalent, highly readable visualizations for mobile devices. Case studies and an in-depth user study with 12 participants demonstrate the effectiveness and usability of Proteus.
A satellite foundation model for improved wealth monitoring
🔥 Citations:
0
Abstract: Poverty statistics guide social policy, but in many low- and middle-income countries, censuses and household surveys that collect these data are costly, infrequent, quickly outdated, and sometimes error-prone. Satellite imagery offers global coverage and the possibility of predicting economic livelihoods at scale, yet existing approaches to predicting livelihoods with imagery or other non-traditional data often fail to reliably identify local-level variation and, as we show, degrade under temporal shift. Here we introduce Tempov, a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. The model enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes, Tempov achieves competitive accuracy with only 10% of survey samples, indicating substantially reduced dependence on expensive label collection. The model further generalizes across populous countries within and outside Africa, and scales to a unified Africa-wide model with strong continent-level performance ($R^2=0.63$, $r^2=0.68$), from which we generate high-resolution decadal maps of wealth and wealth changes for the African continent. Analysis of these maps shows large variation in recent economic performance both within and across countries. Our open-source approach provides a pathway to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.
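The reported $R^2$ is the ordinary coefficient of determination between predicted and surveyed wealth. For reference, a minimal sketch of its computation (toy numbers, not the paper's data):

```python
def r_squared(y, yhat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

r2 = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```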
AI Engineering: Building Applications with Foundation Models by Chip Huyen
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Automated Brain Tumor Segmentation via YOLOv8-Derived Spatial Prompts for the Segment Anything Model
DOI:
10.66279/vv8n5p41
🔥 Citations:
0
Abstract: Brain tumor segmentation from magnetic resonance imaging (MRI) is a critical step in the diagnosis and treatment planning of intracranial malignancies. Although supervised convolutional networks achieve strong benchmark performance within their training distribution, they exhibit limited transferability across acquisition protocols. Conversely, foundation models such as the Segment Anything Model (SAM) encode rich visual representations but produce unreliable masks in the absence of accurate spatial guidance. The present work introduces a fully automated, end-to-end pipeline that couples YOLOv8 object detection with SAM-based segmentation without modifying the parameters of either network. A lightweight preprocessing stage comprising skull stripping and Contrast Limited Adaptive Histogram Equalization (CLAHE) conditions each MRI slice; the resulting image is forwarded to a trained YOLOv8 detector whose highest-confidence bounding box is passed directly to SAM's prompt encoder as the sole spatial cue. Evaluation on 1,226 held-out images from the publicly available Cheng et al. benchmark, partitioned by patient identity to prevent data leakage, yields a mean Dice Similarity Coefficient (DSC) of 0.8153 ± 0.032 and a mean Intersection over Union (IoU) of 0.7136 ± 0.028, with a total inference latency of 473.76 ms per image on an NVIDIA T4 GPU. An ablation study confirms that each pipeline stage contributes positively to segmentation performance. YOLOv8 detection achieves a mean Average Precision (mAP@0.5) of 0.91, precision of 0.88, and recall of 0.86. The results demonstrate that high-quality, automatically generated spatial prompts can substitute for costly parameter adaptation of general-purpose foundation models in specialized medical imaging tasks.
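The DSC and IoU figures above are computed from predicted versus ground-truth binary masks. A minimal sketch over flattened masks (the toy masks are illustrative):

```python
def dice_iou(mask_a, mask_b):
    """Dice = 2|A∩B| / (|A| + |B|);  IoU = |A∩B| / |A∪B| for binary masks."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    size_a, size_b = sum(mask_a), sum(mask_b)
    union = size_a + size_b - inter
    dice = 2 * inter / (size_a + size_b)
    iou = inter / union
    return dice, iou

a = [1, 1, 1, 0, 0]   # toy predicted mask, flattened
b = [0, 1, 1, 1, 0]   # toy ground-truth mask, flattened
dice, iou = dice_iou(a, b)
```

Note that Dice is always at least as large as IoU for the same pair of masks, which is why the reported DSC (0.8153) exceeds the reported IoU (0.7136).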
ToGCN-LLM: Tri-optimization Graph Convolutional Network with LLM for Group Activity Recognition
DOI:
10.1145/3811826
🔥 Citations:
0
Abstract: Graph networks face substantial challenges in handling large-scale graph-structured data. Although graph convolutional networks (GCNs) have been widely applied to group activity recognition, they still struggle with high graph structure complexity and the semantic gap, especially in complex scenarios. To address these limitations, we harness large language models (LLMs) and multimodal learning to enhance semantic understanding as well as bridge the gap between graph structures and contextual meanings. For the simultaneous optimization of model efficiency and recognition accuracy in group activities, we propose a tri-optimization graph convolutional network with LLM (ToGCN-LLM) from the perspective of graph structure knowledge distillation. (1) To tackle the high complexity of the teacher network in GCN, we adopt a sparsification strategy to prune irrelevant edges and nodes, reducing computational overhead and enhancing training efficiency. (2) To mitigate information redundancy during knowledge distillation, we design a hierarchical optimization module combined with a hierarchical sampling mechanism, exploiting graph hierarchical structure and adjacency relationships to improve knowledge transfer efficiency. (3) Considering the student network’s varying learning performance across training stages and the limitations of fixed learning strategies, we introduce a dynamic adaptive weight decay mechanism to achieve fine-grained convergence under different gradient updates, thereby boosting overall recognition accuracy. (4) We use the Qwen LLM to extract text description tokens, which are fused with the student GCN’s last-layer features to enable multimodal, multi-optimization learning for group activity recognition. Six experiments on the CAD, CAED and BJUT-GAD datasets demonstrate that our ToGCN-LLM achieves competitive MPCA scores of 94.89%, 93.61% and 95.77%, respectively.
Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening
🔥 Citations:
0
Abstract: Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but first-line screening is limited by cost, workflow burden, and specialist availability. We evaluated whether open pretrained electrocardiogram (ECG) foundation models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. Six echocardiography-derived abnormalities were targeted: reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then applied in-domain self-supervised adaptation of an ECG foundation model (ECG-FM) on EchoNext waveforms followed by selective supervised fine-tuning, and evaluated trade-offs between discrimination and adaptation cost. Adapted ECG-FM models achieved the best overall performance: peak macro-AUROC 0.8509 and macro-AUPRC 0.4297, while a parameter-efficient operating point preserved AUROC (0.8501) and attained the highest fixed-threshold macro-F1 of 0.3691. Late fusion with covariates did not improve threshold-independent discrimination, and the evaluated LoRA, alternative-backbone, and mixture-of-foundations strategies did not surpass the best adapted single-backbone models. These results indicate that for ECG-based case finding and echocardiography triage, combining target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone is the most effective transfer strategy.
Features of building a neural network based on MobileNetV2 models in unmanned aerial vehicle detection tasks
🔥 引用:
0
Abstract: This article delineates the development and calibration of neural network models based on the MobileNetV2-SSD (Single Shot MultiBox Detector) and MobileNetV2-SSD FPN-Lite (Feature Pyramid Network) architectures for detecting unmanned aerial vehicles (UAVs). The author conducts a systematic investigation of the structural characteristics of the supplementary SSD and FPN layers incorporated into these models, aiming to achieve robust detection performance across a range of drone sizes, as quantified by accuracy and recall metrics. The research confirms the documented limitations of existing architectures in detecting small-scale objects in image datasets. By integrating additional layers that take feature representations from the early layers of the base network and generate composite feature maps for detection, the study demonstrates a substantial gain in mean Average Precision (mAP) of 20% and an improvement in F1-score balance of 38% relative to the base model. Additionally, the investigation provides a comprehensive set of recommendations for hyperparameter optimization to maximize model performance, along with practical guidance on training-dataset composition and preparation protocols tailored to the characteristics of the input imagery. Notably, the research also revealed an unexpected improvement in model performance after quantization for deployment on edge computing devices, a phenomenon potentially explained by regularization effects inherent to the compilation and quantization pipeline.
Zero-Day Hunter: A Multi-Layered Machine Learning Framework for Real-Time Detection and Mitigation of Zero-Day Cyber Attacks
🔥 引用:
0
Abstract: Zero-day attacks take advantage of unknown vulnerabilities to evade traditional signature-based detection systems. We introduce a novel multi-layered machine learning (ML) framework, which we call zeroHunter, that integrates cross-layer detection techniques with adversarial hardening. The detection layers are the network layer, which uses a spatio-temporal Graph Neural Network (GNN) to analyze traffic graphs; the host layer, which uses bio-optimized LightGBM with SHAP for explainability; and the memory layer, which uses a contractive autoencoder and a CNN for forensic analysis. Adversarial hardening is implemented with feature-space randomization and defensive distillation. The models are updated in real time via an online learning module (River ML), forming a closed-loop mitigation system. The models were trained on hybrid datasets (CIC-IDS2023, IoT-23, and synthetic zero-day data generated with a WGAN). Evaluations used MITRE CALDERA simulation of the zero-day detection rate, CLEVER robustness scoring, and mitigation-latency benchmarks. The results show an 86.4% zero-day detection rate, higher than Elastic EDR (Endpoint Detection and Response); a CLEVER score of 0.76; and a mitigation time under 500 ms, which is 26-38% faster than commercial EDRs. These results show that zeroHunter, leveraging ML, can effectively detect unknown threats and respond in real time.
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
🔥 引用:
0
Abstract: Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on long, multi-step problems, leading to inconsistent answers for an unchanged task. Most prior work focuses on improving the forward reasoning chain within a single pass, with less attention to iterative and contrastive correction. To address this gap, we propose CAP-CoT, a Cycle Adversarial Prompt optimization framework designed to improve both the CoT reasoning accuracy and the stability of a single deployed solver. In each cycle, a forward solver generates candidate reasoning chains, an adversarial challenger constructs plausible but deliberately flawed chains using targeted error strategies, and a feedback agent contrasts the two chains and produces step-aligned structured feedback. This feedback closes the optimization loop in two directions: updating the solver prompt based on errors exposed by the challenger, and updating the challenger prompt to generate increasingly targeted errors in subsequent cycles. Unlike safety-oriented adversarial prompting such as jailbreak or prompt-injection attacks, our adversarial component is task-semantic and aims to expose logical vulnerabilities in reasoning chains. Experiments across six benchmarks and four LLM backbones demonstrate that within two to three adversarial prompt optimization cycles, CAP-CoT consistently reduces variability across runs while improving reasoning accuracy and robustness to prompt perturbations.
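The cycle described above can be sketched as a control loop. Each "agent" below is a deterministic stub standing in for an LLM call; only the two-directional prompt updates, the part the abstract actually specifies, are modeled, and all function names and string formats are illustrative assumptions:

```python
# Schematic of one CAP-CoT optimization cycle with stand-in agents.

def solver(prompt, task):
    # Stand-in for the forward solver's candidate reasoning chain.
    return f"chain[{prompt}]({task})"

def challenger(prompt, task):
    # Stand-in for the adversarial challenger's deliberately flawed chain.
    return f"flawed[{prompt}]({task})"

def feedback(good, bad):
    # Stand-in for the feedback agent's step-aligned structured contrast.
    return {"solver_hint": "avoid:" + bad, "challenger_hint": "target:" + good}

def cap_cot_cycle(solver_prompt, challenger_prompt, task):
    chain = solver(solver_prompt, task)
    adversarial = challenger(challenger_prompt, task)
    fb = feedback(chain, adversarial)
    # Close the loop in both directions, as the abstract describes.
    new_solver_prompt = solver_prompt + "|" + fb["solver_hint"]
    new_challenger_prompt = challenger_prompt + "|" + fb["challenger_hint"]
    return new_solver_prompt, new_challenger_prompt

print(cap_cot_cycle("S0", "C0", "2+2"))
```

Running two or three such cycles accumulates increasingly targeted hints in both prompts, which is the mechanism the reported stability gains rest on.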
An In Silico Study of Coumarin Derivatives from Daphne mezereum as EGFR Inhibitors in Lung Cancer
🔥 引用:
0
Abstract: Lung cancer is one of the leading causes of cancer-related death worldwide and is often associated with overexpression of the Epidermal Growth Factor Receptor (EGFR), making the development of EGFR inhibitors an important therapeutic strategy. This study aims to evaluate the potential of coumarin derivatives from Daphne mezereum as EGFR inhibitor candidates in lung cancer therapy using an in silico approach. The study used computational methods, including molecular docking to analyze binding affinity and ligand–protein interactions, drug-likeness evaluation based on Lipinski's rule, and ADMET analysis to predict pharmacokinetic properties and toxicity. The results show that all compounds had RMSD values ≤ 2 Å, with a re-docking value of 1.1614 Å, indicating the validity of the method. Compound 2 showed the highest binding affinity and optimal interaction with the Met769 residue of EGFR, which plays a role in lung cancer cell proliferation, but it did not meet Lipinski's criteria and had a less favorable ADMET profile. Conversely, compound 3 (umbelliferone) showed a balance between affinity, drug-likeness, and a good ADMET profile, including high absorption and non-toxic properties. The study concludes that compound 3 is the more promising lung-cancer drug candidate, while compound 2 (7-hydroxycoumarin-5,8-di-β-D-glucopyranoside) has potential as a lead compound for further optimization. These findings contribute to the development of EGFR-targeted therapy based on coumarin derivatives and underscore the value of computational approaches in efficient and rational drug design.
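The drug-likeness screen mentioned above rests on Lipinski's Rule of Five, which can be checked mechanically once molecular properties are known. The helper below is a generic sketch (property values would normally come from a cheminformatics tool such as RDKit); the example numbers are approximate literature values for umbelliferone, not figures taken from this study:

```python
# Lipinski's Rule of Five: MW <= 500, logP <= 5,
# H-bond donors <= 5, H-bond acceptors <= 10.

def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a molecule violates."""
    rules = [mw > 500, logp > 5, h_donors > 5, h_acceptors > 10]
    return sum(rules)

def passes_lipinski(mw, logp, h_donors, h_acceptors):
    # Conventionally, at most one violation is tolerated.
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= 1

# Approximate values for umbelliferone (7-hydroxycoumarin, C9H6O3):
print(passes_lipinski(mw=162.1, logp=1.6, h_donors=1, h_acceptors=3))
```

A bulky glycoside like compound 2, with a much higher molecular weight and more H-bond donors/acceptors, would fail this check, matching the abstract's finding.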
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
🔥 引用:
0
Abstract: Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain-specific insurance contracts to investigate these questions. We first establish a single-agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi-agent architecture that decomposes ontology construction into four artifact-driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance on architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency-question-driven SPARQL evaluation, with complementary retrieval-augmented-generation-based assessment). Results show that the multi-agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front-loaded planning. These findings highlight planning-first, artifact-driven generation as a promising and more auditable path toward scalable automated ontology engineering.
BronchOpt: vision-based pose optimization with fine-tuned foundation models for accurate bronchoscopy navigation
🔥 引用:
0
Abstract: No abstract available; see the original article.
GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training
🔥 引用:
0
Abstract: Distributed GNN training is dominated by remote feature fetching, which can be very costly. Multi-hop neighborhood sampling crosses partition boundaries and triggers fine-grained RPCs whose fixed initiation cost and GPU-stall latency waste energy. Prior systems try to reduce this overhead with presampling and static caching, but cache policies cannot react to runtime network variation. We show that under time-varying congestion, static caching can increase energy by up to 45% because a fixed rebuild schedule is insufficient. We present GreenDyGNN, which formulates cache window management as a sequential decision problem. GreenDyGNN performs intra-epoch cache rebuilds and uses a Double-DQN agent, trained in a calibrated simulator with domain-randomized congestion, to adapt rebuild window size and per-owner cache allocation at each boundary. An asynchronous double-buffered pipeline makes adaptation effectively free. Under congestion, GreenDyGNN cuts total energy by up to 43% over Default DGL and 4-24% over the best static policy, while closely matching the optimum under clean conditions.
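The runtime decision GreenDyGNN makes at each rebuild boundary can be illustrated with a tabular Q-learning stand-in. The paper uses a Double-DQN agent; the discrete window sizes, the single Q-table, and the reward signal below are simplifications for illustration only:

```python
# Toy sketch: pick a cache-rebuild window size from the observed
# congestion state, and update value estimates from an energy-based
# reward. (GreenDyGNN's real agent is a Double-DQN trained in a
# calibrated simulator; this tabular version only shows the loop.)

import random

WINDOWS = [64, 128, 256]  # candidate rebuild-window sizes (assumed)

def choose(q, state, eps=0.1, rng=random.Random(0)):
    """Epsilon-greedy choice of window size for the current state."""
    if rng.random() < eps:
        return rng.choice(WINDOWS)
    return max(WINDOWS, key=lambda w: q.get((state, w), 0.0))

def update(q, state, window, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard one-step Q-learning update on the (state, window) entry."""
    best_next = max(q.get((next_state, w), 0.0) for w in WINDOWS)
    old = q.get((state, window), 0.0)
    q[(state, window)] = old + alpha * (reward + gamma * best_next - old)

q = {}
update(q, "congested", 64, -1.0, "congested")  # negative reward = energy cost
print(q)
```

The point of the abstract is precisely that this choice must be made per boundary at runtime: a fixed rebuild schedule cannot track time-varying congestion.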
VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
🔥 引用:
0
Abstract: Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
Rational Design of Single-Phase High-Entropy Oxides via Large Language Model Data Mining and Explainable Machine Learning
🔥 引用:
0
Abstract: No abstract available; see the original article.
Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
🔥 引用:
0
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm: one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
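Finding (1), that aligned and misaligned reasoning occupy geometrically distinct regions, is what makes finding (2)'s probes work. A toy separator makes the intuition concrete; the 2-D "latent vectors" below are fabricated stand-ins, and a nearest-centroid rule replaces the paper's trained linear probes:

```python
# If the two classes form separated clusters in latent space, even a
# nearest-centroid rule (a simple linear decision boundary) detects
# misaligned states. Vectors here are synthetic illustrations.

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def probe(aligned, misaligned):
    ca, cm = centroid(aligned), centroid(misaligned)
    def classify(x):
        da = sum((x[i] - ca[i]) ** 2 for i in range(len(x)))
        dm = sum((x[i] - cm[i]) ** 2 for i in range(len(x)))
        return "misaligned" if dm < da else "aligned"
    return classify

clf = probe(aligned=[(0.9, 1.1), (1.1, 0.9)],
            misaligned=[(-1.0, -0.9), (-0.8, -1.2)])
print(clf((1.0, 1.0)))    # aligned
print(clf((-1.0, -1.0)))  # misaligned
```

The transfer result in (2) then says a boundary fitted on one pair of conditions still separates a related pair, which is plausible exactly when the geometry, not the behavior, carries the signal.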
An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
🔥 引用:
0
Abstract: Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context-dependent bugs in realistic development scenarios.
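The automated keyword-based evaluation can be sketched as a simple overlap check that also yields the "partially correct" category the abstract reports. The three-way rubric, keyword lists, and `grade` name are illustrative assumptions, not the paper's exact framework:

```python
# Grade a model's bug-detection answer against keyword lists:
# "correct" if it mentions all ground-truth fix keywords,
# "partial" if it at least names the buggy region,
# "incorrect" otherwise. (Illustrative rubric only.)

def grade(answer, fix_keywords, region_keywords):
    a = answer.lower()
    if all(k.lower() in a for k in fix_keywords):
        return "correct"
    if any(k.lower() in a for k in region_keywords):
        return "partial"
    return "incorrect"

print(grade("the off-by-one in parse_range; change <= to <",
            fix_keywords=["<= to <"],
            region_keywords=["parse_range"]))  # correct
```

Under such a rubric, an answer that names `parse_range` without the exact fix would count as partial, which is the response pattern the study found to dominate.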
A Novel Triose Phosphate Isomerase Inhibitor With Dual Trypanosomicidal Activity was Identified Using Artificial Intelligence‐Based Virtual Screening
🔥 引用:
0
Abstract: Chagas disease and leishmaniasis are neglected protozoan diseases recognized by the World Health Organization as major public health problems. These diseases affect millions of people worldwide, yet effective treatments remain unavailable. Triosephosphate isomerase (TIM), a glycolytic enzyme that exhibits high catalytic efficiency for the isomerization of glyceraldehyde-3-phosphate and dihydroxyacetone phosphate exclusively in its dimeric form, was subjected to virtual screening. Using a deep neural network for structure-based drug design that predicts binding affinity between small molecules and proteins of known structure, 12.5 million commercially available compounds were screened. From this, 82 compounds were selected for in vitro evaluation. Six compounds inhibited TIM from Trypanosoma cruzi, three of which exhibited anti-T. cruzi activity. Eight compounds demonstrated activity against the parasites T. cruzi and Leishmania infantum. Two compounds showed similar potency against both parasites: 3-(1-acetyl-5-(4-bromophenyl)-4,5-dihydro-1H-pyrazol-3-yl)-4-hydroxy-6-methyl-2H-pyran-2-one (IC50 = 16 ± 3 μM) and 3-[(4-bromophenyl)sulfanyl]-1-(3-nitrophenyl)propan-1-one (IC50 = 12 ± 1 μM). These compounds exhibit favorable selectivity and toxicological profiles, as well as in vivo activity, indicating their potential for future drug development.
An Agentic Framework for Intent Co-Creation in 6G NaaS: Architecture and Open-Source Model Evaluation
🔥 引用:
0
Abstract: 6G network complexity necessitates high levels of autonomy, yet current intent-based systems struggle with ambiguous or incomplete human requests. This paper introduces an agent-based, intent-driven end-to-end (E2E) orchestration framework designed for Network-as-a-Service (NaaS) delivery through collaborative intent co-creation. The proposed system leverages a pool of Domain Expert Agents and a TM Forum-aligned Body-of-Knowledge (BoK) to iteratively refine user requests into deterministic, machine-readable actions. A fundamental design principle is the decoupling of cognition and actuation, where AI-driven reasoning is isolated from standardized execution controllers to ensure safety and operational trust. The framework includes a dual-layer memory system to maintain coherence during multi-step collaborations. The presented prototype, built on ETSI OpenSlice and the Model Context Protocol (MCP), is evaluated across several open-source Large Language Models (LLMs). While these models demonstrate high instruction compliance, the results reveal a significant gap in translating high-resolution intents into valid, catalog-backed orders without hallucinations.
From Stateless Queries to Autonomous Actions: A Layered Security Framework for Agentic AI Systems
🔥 引用:
0
Abstract: Agentic AI systems face security challenges that stateless large language models do not. They plan across extended horizons, maintain persistent memory, invoke external tools, and coordinate with peer agents. Existing security analyses organize threats by attack type (prompt injection, jailbreaking), but provide no principled model of which architectural component is vulnerable or over what timescale the threat manifests. This paper makes five contributions. First, we introduce the Layered Attack Surface Model (LASM), a seven-layer framework that maps threats to distinct architectural components: Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, and Governance, the accountability and observability layer that spans the stack analogously to the network management plane. Second, we introduce attack temporality as an orthogonal analytical dimension with four classes: Instantaneous (T1), Session-Persistent (T2), Cross-Session Cumulative (T3), and Sub-Session-Stack, Non-Session-Bounded (T4). Third, through a systematic review of 94 papers (2021--2025), we show that the most dangerous emerging threats concentrate at the intersection of high-layer attacks (L5--L7) and slow-burn temporality (T3--T4): covert agent collusion, long-term memory poisoning, MCP supply-chain compromise, and alignment failure that manifests as an insider threat with no external adversary. Only 8 of 120 paper-cell assignments (7%) fall in this zone. Fourth, we propose a cross-layer defense taxonomy spanning all seven LASM layers and all four temporality classes, exposing which threat classes existing defenses leave unaddressed. Fifth, we survey evaluation benchmarks, identify five research gaps in the under-studied high-layer, slow-burn zone, and argue that agentic security must be treated as a distributed systems problem embedded in an adversarial ecosystem.
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
🔥 引用:
0
Abstract: Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving large language model (LLM) reasoning ability. Yet outcome-based reward often incentivizes models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages the model to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, coding, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.
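The margin-enhanced reward can be sketched as follows. The additive form, the `lam` weight, and per-step confidence inputs are assumptions for illustration; the paper's exact reward shaping may differ:

```python
# Sketch of a margin-enhanced process reward: outcome correctness plus a
# bonus proportional to the gap between the mean confidence on correct
# steps and the mean confidence on incorrect steps in one trajectory.

def margin_reward(step_correct, step_conf, lam=0.5):
    """step_correct: list of bools; step_conf: list of floats in [0, 1]."""
    correct = [c for ok, c in zip(step_correct, step_conf) if ok]
    incorrect = [c for ok, c in zip(step_correct, step_conf) if not ok]
    outcome = 1.0 if all(step_correct) else 0.0
    if correct and incorrect:
        margin = sum(correct) / len(correct) - sum(incorrect) / len(incorrect)
    else:
        margin = 0.0
    return outcome + lam * margin

print(margin_reward([True, False], [0.9, 0.3]))  # ≈ 0.3 under these assumptions
```

Note the incentive this creates: a model confident on its wrong steps is penalized relative to one that is confident only where it is right, which is the calibration pressure outcome-only rewards lack.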
Fine-tuned lightweight language models for structured extraction of liver cancer imaging free-text report: a comparative analysis with existing large language models
🔥 引用:
0
Abstract: No abstract available; see the original article.
Comparing prognostic performance and reasoning between large language models and physicians
🔥 引用:
0
Abstract: No abstract available; see the original article.
schema-miner pro: Agentic AI for Ontology Grounding Over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow
🔥 引用:
0
Abstract: Scientific processes are often described in free text, making it difficult to represent and reason over them computationally. We present schema-miner pro, a human-in-the-loop framework that automatically extracts and grounds structured schemas from scientific literature. Our approach combines large language models for schema extraction with an agent-based system that aligns extracted elements to external ontologies through interpretable, multi-step reasoning. The agent leverages lexical heuristics, semantic similarity, and expert feedback to ensure accurate grounding. We demonstrate the framework on two semiconductor manufacturing workflows—atomic layer deposition and atomic layer etching—mapping process parameters and outputs to the QUDT (Quantities, Units, Dimensions, and Types) ontology. By producing ontology-aligned, semantically precise schemas, schema-miner pro lays the groundwork for machine-actionable scientific knowledge and automated reasoning across disciplines.
Characterization of Purple Corn Anthocyanin Components and Their Pharmacokinetic Profiles through In Silico Study
DOI:
10.22146/jfps.23226
🔥 引用:
0
Abstract: Purple corn contains high levels of bioactive anthocyanins with potential health-promoting effects. This study aimed to quantitatively characterize the anthocyanin profile of an Indonesian purple corn variety and evaluate the pharmacokinetic properties of its major constituents using in silico modelling. The anthocyanin extract was analyzed by HPLC (high-performance liquid chromatography), and concentrations were determined using authentic standards. The pharmacokinetic parameters, including absorption, distribution, metabolism, excretion, and toxicity (ADMET), were predicted using the pkCSM platform, and drug-likeness properties were evaluated based on Lipinski’s Rule of Five. Cyanidin-3-glucoside (0.82 mg/100 g DW), peonidin-3-glucoside (0.38 mg/100 g DW), and pelargonidin-3-glucoside (0.35 mg/100 g DW) were identified as the predominant anthocyanins. In silico ADMET predictions using pkCSM revealed low intestinal absorption, moderate peripheral tissue distribution, and violations of Lipinski’s Rule of Five, indicating limited suitability as conventional oral pharmaceuticals. Nevertheless, the compounds exhibited low predicted toxicity, minimal CYP450 inhibition, and favorable excretion profiles, supporting potential safe use as nutraceuticals or functional food ingredients. This integrated approach demonstrates that experimental phytochemical profiling combined with computational pharmacokinetic analysis provides valuable insight into the properties of anthocyanins beyond standard bioavailability studies. The findings provide a foundation for future formulation strategies, such as encapsulation or delivery enhancement, to improve bioavailability and functional activity in foods or dietary supplements. Overall, anthocyanins from Indonesian purple corn are promising safe natural bioactives, warranting further experimental and translational research.
How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study
🔥 引用:
0
Abstract: In the early stages of scientific research, researchers rely on core scholarly judgments to identify relevant literature, assess credible evidence, and determine which directions merit pursuit. As AI tools become increasingly integrated into these early-stage workflows, the scholarly judgments that were once transparent and attributable to individual researchers become obscured, raising critical Responsible AI (RAI) concerns around accountability, transparency, and trust. Yet how these three dimensions manifest in real-time, in-situ scholarly practice remains largely unexplored. To address this gap, we conducted a think-aloud study with 15 researchers to examine how they used AI tools powered by large language models (LLMs) across early-stage research tasks, including literature exploration, synthesis, and research ideation. Our key findings address the tripartite constructs of accountability, transparency, and trust. First, the confident tone of AI outputs misrepresents epistemic uncertainty, making it more difficult for researchers (who are ultimately accountable) to identify which outputs require the greatest scrutiny. Second, opaque retrieval and content construction make provenance difficult to establish for transparency. Third, trust in AI is fragile, context-dependent, and easily eroded. In response, participants developed compensatory strategies to restore scholarly judgment under uncertainty. Overall, our findings contextualize AI-mediated research as an RAI problem grounded in lived researcher experience and motivate attention to deliberate AI integration that preserves accountability, supports transparency, and fosters informed trust.
SmartAgroWeb: A Multi-Module AI-Driven Agriculture Platform
DOI:
10.55041/ijsrem61161
🔥 引用:
0
Abstract: Agriculture is an essential driver of economic development and food security, but farmers still face many obstacles, such as unpredictable climate, poor crop choices, plant diseases, and a lack of real-time information on prices, government schemes, and markets.
This paper introduces SmartAgroWeb, an intelligent web-based agricultural support system that combines AI, ML, and real-time APIs into one cohesive solution.
The proposed solution uses a multi-layer architecture comprising presentation, application, service/API, and data layers. The system provides numerous features, including crop recommendation based on soil parameters, plant-disease detection using image processing, real-time weather forecasting, market-price monitoring, government-scheme recommendations, and expert consultation. The system also incorporates an AI-based chatbot that provides instant advisory support using a hybrid of rule-based logic and large language models.
The system is built on a client-server architecture using modern web technologies: React.js for the frontend, Spring Boot for backend processing, and MongoDB for data storage. It also relies on third-party APIs for real-time data, improving the accuracy and reliability of its recommendations.
The aim of this project is to integrate multiple services into a single user-friendly platform.
INDEX TERMS: Smart Agriculture, Crop Recommendation, Plant Disease Detection, Machine Learning, Artificial Intelligence, Decision Support System, AI Chatbot, Sustainable Farming
Quantifying conformational diversity in protein–ligand ensembles for structure-based virtual screening
🔥 引用:
0
Abstract: No abstract available; see the original article.
The Mathematical Basis of Problems of Forming Adaptive Algorithms for Machine Learning of Chatbots
🔥 引用:
0
Abstract: Modern information systems operate in environments characterized by parameter uncertainty, incomplete data, and constantly changing external conditions. Under such circumstances, traditional algorithms with a predetermined structure cannot guarantee an adequate level of performance, which motivates the development of adaptive mechanisms. This approach is especially relevant to training generative AI systems, in particular chatbots such as ChatGPT, Gemini, or GitHub Copilot. An adaptive algorithm can be interpreted as a dynamic model whose parameters are adjusted according to a chosen criterion of optimality or quality of operation. The exponential function is one of the key mathematical foundations of modern machine learning, as it appears in many models and computational procedures. The aim of this work is to formalize the mathematical basis of the problems of forming adaptive algorithms so that their correctness conditions are satisfied. The paper considers an approach to forming the mathematical basis of an adaptive algorithm for training chatbots with generative artificial intelligence, and presents a mathematical analysis of applying the exponential function in adaptive training procedures for large language models (LLMs). The exponential function is one of the fundamental tools for formalizing information processes. It is widely used to describe growth and decay processes, to form the normal distribution, to perform integral transforms, to solve signal-processing problems and build probabilistic schemes, and to study the effectiveness of adaptive approximation methods and to evaluate their characteristics. The proposed approach to constructing an adaptive algorithm rests on the interaction of three main components: a transition operator, a quality criterion, and a parameter-correction mechanism.
Its mathematical foundation is the repeated parallel evaluation of recurrence relations using Z-approximation. The process is realized through successive iterations with the possibility of tuning parameters on the fly while the algorithm runs, which makes it possible to achieve an optimal trade-off between convergence speed and accuracy of results. These results can be applied to the development of web services for information processing and special-purpose applications.
AI-Powered Personalized Learning and Career Enhancement Platform
🔥 引用:
0
Abstract: Choosing a suitable career has become an increasing obstacle for students in higher secondary and undergraduate programs. Many students find it difficult to identify career paths that align with their interests, psychological traits, and skill levels. Traditional career-counselling methods typically depend on static questionnaires and generic recommendations, which often fail to address individual uniqueness. This study proposes an AI-powered personalized learning and career-enhancement platform that combines psychological assessment, career recommendation, and adaptive course generation. The system leverages Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to provide intelligent, context-aware guidance. Users are assessed through dynamically generated psychological questions designed to identify personality traits and career inclinations across multiple dimensions. Based on the psychological profile and a skill-level assessment, the system recommends suitable career paths and generates a customized learning roadmap. Developed using Python and Django, the platform ensures continuous interaction between reasoning models and knowledge-retrieval mechanisms. The proposed solution aims to improve career clarity and learning efficiency through end-to-end personalization.
In-context modeling as a retrain-free paradigm for foundation models in computational science
🔥 引用:
0
Abstract: Building models that generalize across physical systems without retraining remains a central challenge in computational science. Here we introduce In-Context Modeling (ICM), a retrain-free paradigm that infers physical relationships directly from observational fields. Rather than encoding system-specific behavior in fixed parameters, ICM assimilates measurements as physical context and performs inference through a single forward pass. Trained in a physics-informed, label-free manner using governing equations, a single model generalizes across unseen materials, geometries, and loading conditions. Demonstrated on hyperelasticity, ICM integrates with finite-element simulations and is validated using experimental full-field measurements. Moreover, performance improves with increasing data diversity and computational budget, exhibiting favorable scaling behavior analogous to foundation models. By recasting physical modeling as in-context inference, this work establishes a transferable paradigm for retrain-free scientific learning and a foundation for scalable modeling across computational science.
In Silico Study of Coumarin Derivative Compounds as Janus Kinase 1 (JAK1) Inhibitors in Rheumatoid Arthritis
🔥 Citations:
0
Abstract: Rheumatoid arthritis is a chronic autoimmune disease involving activation of the Janus kinase 1 (JAK1) pathway, making JAK1 a potential target in drug development. This study aims to evaluate the potential of coumarin derivative compounds as JAK1 inhibitors in silico. This study used a computational approach through the molecular docking method to predict ligand–protein interactions, evaluation of physicochemical properties based on Lipinski’s Rule of Five, and ADMET analysis to assess pharmacokinetic and toxicity profiles. The results show that all compounds had negative binding affinity values, ranging from -6.5379 to -6.9271 kcal/mol, indicating stable ligand–protein interactions, although their binding was weaker than that of the positive control, Tofacitinib (-7.3968 kcal/mol). All compounds met the criteria of Lipinski’s Rule of Five, but ADMET analysis showed variations in pharmacokinetic and toxicity profiles. Compound 3 showed the best balance between activity, stability, and safety, whereas compounds 1 and 2 showed potential mutagenicity. The conclusion of this study emphasizes that compound 3 has the potential to be further developed as a JAK1 inhibitor candidate. The implications of this study indicate the importance of structural optimization and further experimental validation to improve the effectiveness and safety of coumarin derivative compounds as therapeutic candidates for rheumatoid arthritis.
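The drug-likeness screen the abstract mentions (Lipinski's Rule of Five) is mechanical enough to sketch. The property values in the usage example below are illustrative, not taken from the coumarin derivatives in the study:

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count Lipinski Rule of Five violations for a candidate compound."""
    rules = [
        mw > 500,          # molecular weight over 500 Da
        logp > 5,          # octanol-water partition coefficient over 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(rules)

def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """A compound is conventionally drug-like with at most one violation."""
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= 1
```

For example, a hypothetical compound with MW 320.3, logP 2.1, 1 donor and 4 acceptors passes, while one with MW 650.0 and logP 6.2 does not.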
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly utilized in various complex reasoning tasks due to their excellent instruction-following capability. However, the model's performance is highly dependent on the open-ended characteristics of the user's input prompt. Natural prompts often do not follow proper syntactic rules, which creates ambiguous queries that yield multiple interpretations. Such ambiguous prompts confuse the model in choosing the correct reasoning paths to answer questions. Prior works address this challenge by applying query editing during the LLM inference process without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism via explicit prompt disambiguation. Particularly, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method can produce a more focused attention distribution over the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only $0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method without disturbing the internal mechanism of LLM inference.
From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors
🔥 Citations:
0
Abstract: Long-context large language models remain computationally expensive to run and often fail to reliably process very long inputs, which makes context compression an important component of many systems. Existing compression approaches typically rely on trained compressors, dense retrieval-style selection, or heuristic trimming, and they often struggle to jointly preserve task relevance, topic coverage, and cross-sentence coherence under a strict token budget. To address this, we propose a training-free and model-agnostic compression framework that selects a compact set of sentences guided by structural graph priors. Our method constructs a sparse hybrid sentence graph that combines mutual k-NN semantic edges with short-range sequential edges, extracts a topic skeleton via clustering, and ranks sentences using an interpretable score that integrates task relevance, cluster representativeness, bridge centrality, and a cycle coverage cue. A budgeted greedy selection with redundancy suppression then produces a readable compressed context in original order. Experimental results on four datasets show that our approach is competitive with strong extractive and abstractive baselines, demonstrating larger gains on long-document benchmarks.
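The selection pipeline described above (mutual k-NN semantic edges, short-range sequential edges, budgeted greedy selection with redundancy suppression) can be sketched end to end. This is a simplification under stated assumptions: bag-of-words cosine stands in for real sentence embeddings, and the score omits the paper's topic skeleton, bridge centrality, and cycle coverage terms:

```python
from collections import Counter
import math

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def compress(sentences, query, budget, k=2, lam=0.3, dup=0.8):
    """Select sentences under a token budget using hybrid-graph cues:
    mutual k-NN semantic edges plus sequential edges feed a degree term,
    combined with task relevance; greedy selection drops near-duplicates."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    q = Counter(query.lower().split())
    n = len(sentences)
    knn = [sorted((j for j in range(n) if j != i),
                  key=lambda j: -cosine(vecs[i], vecs[j]))[:k] for i in range(n)]
    degree = [0] * n
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:          # mutual k-NN semantic edge
                degree[i] += 1
        if i + 1 < n:                # short-range sequential edge
            degree[i] += 1
            degree[i + 1] += 1
    scores = [cosine(vecs[i], q) + lam * degree[i] / max(1, n - 1)
              for i in range(n)]
    chosen, used = [], 0
    for i in sorted(range(n), key=lambda i: -scores[i]):
        cost = len(sentences[i].split())
        if used + cost > budget:
            continue
        if any(cosine(vecs[i], vecs[j]) > dup for j in chosen):
            continue                 # redundancy suppression
        chosen.append(i)
        used += cost
    return [sentences[i] for i in sorted(chosen)]  # original order preserved
```

Returning sentences in original order, as in the last line, is what preserves the cross-sentence coherence the abstract emphasizes.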
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
🔥 Citations:
0
Abstract: Existing multi-agent Large Language Model (LLM) frameworks for code generation typically use execution feedback and improve iteratively using Input/Output (I/O) test cases. However, this does not work for scientific workflows, where I/O test cases do not exist, and generating them requires solving the very problem at hand. To address this, we introduce MOSAIC, a training-free multi-agent framework for scientific code generation without I/O supervision. Instead of execution feedback, MOSAIC employs a student-teacher knowledge distillation framework that grounds generation through domain-specific examples and structured problem decomposition. To further mitigate hallucinations across chained subproblems, we introduce a Consolidated Context Window (CCW) for maintaining consistent reasoning across agents. Experiments on the SciCode benchmark show that MOSAIC improves accuracy, executability, and numerical precision over existing approaches while relying on lightweight models.
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance
🔥 Citations:
0
Abstract: Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural-language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action-oriented reasoning, shortcomings that undermine trust in safety-critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episodic telemetry representations with a Failure Mode Effects Analysis Knowledge Graph (FMEA-KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types, including rotating machinery, turbofan engines, hydraulic systems, and cyber-physical production systems. Compared to LLM-only baselines, IndustryAssetEQA improves structural validity by up to 0.51, counterfactual accuracy by up to 0.47, and explanation entailment by 0.64, while reducing severe expert-rated overclaims from 28% to 2% (approximately 93% reduction). Code, datasets, and the FMEA-KG are available at https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA.
Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA
🔥 Citations:
0
Abstract: Unlike traditional fact-based retrieval, rationale-based retrieval typically necessitates cross-encoding of query-document pairs using large language models, incurring substantial computational costs. To address this limitation, we propose Rabtriever, which independently encodes queries and documents while providing cross query-document comprehension capabilities comparable to rerankers. We start by training an LLM-based generative reranker, which places the document before the query and prompts the LLM to produce a relevance score from log probabilities. We then employ it as the teacher of an on-policy distillation framework, with Rabtriever as the student that reconstructs the teacher's context-aware query embedding. To achieve this effect, Rabtriever is first initialized from the teacher, with parameters frozen. The Joint-Embedding Predictive Architecture (JEPA) paradigm is then adopted, which integrates a lightweight, trainable predictor between LLM layers and heads, projecting the query embedding into a new hidden space, with the document embedding as the latent vector. JEPA then minimizes the distribution difference between this projected embedding and the teacher embedding. To strengthen the sampling efficiency of on-policy distillation, we also add an auxiliary loss on the reverse KL of LLM logits, to reshape the student's logit distribution. Rabtriever reduces the teacher's complexity in document length from quadratic to linear, verified both theoretically and empirically. Experiments show that Rabtriever outperforms different retriever baselines across diverse rationale-based tasks, including empathetic conversations and robotic manipulations, with minor accuracy degradation from the reranker. Rabtriever also generalizes well on traditional retrieval benchmarks such as MS MARCO and BEIR, with comparable performance to the best retriever baseline.
Integrating multimodal clinical data with a large model for prostate cancer diagnosis
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Crimewatch AI: A Real-Time Predictive Crime Mapping and Public Safety System using Machine Learning and NLP
🔥 Citations:
0
Abstract: CrimeWatch AI is a sophisticated intelligent system that utilizes Machine Learning (ML), Natural Language Processing (NLP), and geospatial analytics to enhance public safety in urban and rural areas. Through the combination of structured crime data and unstructured textual information such as police reports, news articles, and social media content, the system identifies concealed patterns in behavior that can lead to potential criminal activity. The system utilizes supervised learning algorithms such as Random Forest and Logistic Regression to detect crime-prone areas with accuracy and generate predictive insights for proactive intervention. Structured data analysis is used alongside NLP techniques, including tokenization, NER, sentiment analysis, and topic modeling, for extracting meaningful contextual information from text. Using a locally deployed Large Language Model (LLM), the system extracts structured data such as crime type, location, and severity from inputs in natural language. This integration improves the system's ability to detect emerging trends, identify critical entities, and capture spatial and temporal patterns in criminal activity. It also uses clustering techniques for hotspot detection and interactive tools for visualizing results (e.g., interactive heatmaps and trend graphs). Users and law enforcement agencies can easily understand the intricate data presented by CrimeWatch AI thanks to its intuitive visualization dashboard, a major factor in its adoption. Integrating predictive analytics with real-time risk assessment, the system offers safe route recommendations to enhance navigation security. Experimental results indicate that the proposed system significantly improves prediction accuracy, enhances situational awareness, and enables data-driven decision-making.
Generally speaking, CrimeWatch AI is a flexible and adaptable tool for contemporary crime analysis, with potential applications in smart city infrastructure, real-time surveillance systems, and intelligent public safety management.
Keywords: crime prediction; machine learning; NLP; predictive analytics; crime mapping; data mining; public safety systems; hotspot detection; artificial intelligence; smart surveillance.
Towards Agentic Test-Driven Quality Assurance for 6G Networks
🔥 Citations:
0
Abstract: This work proposes an agentic, intent-driven end-to-end (E2E) orchestration framework that integrates intent co-creation with a Test-Driven Quality Assurance paradigm. In this framework, autonomous agents iteratively refine a user's initial intent into a confirmed, auditable specification. Furthermore, the system automatically derives validation tests from these intents before provisioning, directly mirroring the Test-Driven Development workflow in software engineering to ensure proactive Service Level Agreement (SLA) compliance. The architecture is grounded in a standards-aligned knowledge representation using TM Forum (TMF) information models and catalogs. This enables deterministic graph traversal from high-level Product Offerings down to granular Service/Resource and Test specifications. We prototyped this architecture by extending OpenSlice with a message-driven, multi-agent pattern and integrating MCP-enabled (Model Context Protocol) tool access for real-time knowledge retrieval. Currently, our evaluation of the agents targets the intent co-creation phase as a baseline toward full-scale orchestration. Building on experiments with multiple open-source Large Language Model (LLM) backends integrated with the TMF-based knowledge base, we observe substantial variability in tool-use reliability and hallucination patterns, underscoring the critical importance of robust knowledge integration in agentic 6G systems.
PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
🔥 Citations:
0
Abstract: The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g., SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target's visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target's trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks: (1) cold-start personalization: current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction; (2) static attack strategies: existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations: (1) VLM-based social-context training: to eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation; (2) adaptive psychological agent: we introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on the target's responses, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.
A graph-based Neural Network surrogate model for accelerating semi-analytical model of galaxy formation and evolution
🔥 Citations:
0
Abstract: Understanding how galaxy populations emerge and evolve from the growth of dark matter structure is a central challenge in galaxy formation theory. Semi-analytic models (SAMs) provide an efficient framework to address this problem, but exploring large ensembles of merger trees across broad parameter spaces remains computationally demanding. We develop a conditional graph neural network surrogate model that combines merger tree information with SAM parameters to predict galaxy properties across cosmic time. Using merger trees of dark matter halos from the Uchuu simulation and the Galacticus SAM, the model predicts stellar mass, luminosity, angular momentum, gas metal mass, and specific star formation rate across the wide redshift range 0 ≤ z ≤ 5. For instance, the model can predict stellar mass at 0 ≤ z ≤ 3 with a scatter of 0.19-0.28 dex and coefficient of determination R^2 of 0.946-0.973 (R^2 close to 1 indicates prediction closely matching the truth). The results show that a single graph-based model can reproduce these galaxy properties with good accuracy over multiple SAM realizations, merger trees, and redshifts. This catalog-level model provides a practical route for accelerating SAM-based studies of galaxy formation to enable a more detailed investigation of the model parameter space. The inference code, trained models, and example data products are publicly available at https://github.com/MutongCat/sam2galaxy-gnn.
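The two evaluation metrics quoted (scatter in dex and the coefficient of determination R^2) are standard. A minimal sketch of how they would be computed over log10 stellar masses, with made-up values in the test rather than numbers from the paper:

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def scatter_dex(log_true, log_pred):
    """RMS of residuals in log10 space, i.e. the scatter in dex."""
    n = len(log_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(log_true, log_pred)) / n)
```

Because both metrics are taken in log10 space, a scatter of 0.2 dex corresponds to a typical multiplicative error of about 10^0.2 ≈ 1.6 in stellar mass.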
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
🔥 Citations:
0
Abstract: Emotionally talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.
Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
🔥 Citations:
0
Abstract: Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
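The decoding contrast at issue reduces to taking the argmax of the next-token logits versus sampling from a temperature-scaled softmax. A minimal, library-free sketch of both (toy logits, not any particular model's):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    """Greedy decoding: always pick the highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Stochastic decoding: draw a token from the softmax distribution."""
    probs = softmax(logits, temperature)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Greedy decoding is deterministic and always returns the distribution's mode, which is why calibration (how well the mode's probability tracks correctness) is the natural lens for analyzing it in closed-ended tasks like VQA.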
Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization
🔥 Citations:
0
Abstract: Large language models (LLMs) frequently produce detail hallucinations when processing long regulatory documents, including subtle errors in threshold values, units, scopes, obligation levels, and conditions that preserve surface plausibility while corrupting safety-critical parameters. We formalize this phenomenon through a fine-grained Detail Error Taxonomy of five error types and introduce DetailBench, a benchmark built from 172 real regulatory documents and 150 synthetic documents spanning three jurisdictions, with human-annotated detail-level ground truth comprising 13,000 preference pairs. We propose DetailDPO, a targeted preference optimization framework that constructs contrastive pairs differing in exactly one detail dimension, concentrating DPO gradient signal on detail-bearing tokens. We provide theoretical analysis showing why minimal detail perturbation pairs yield gradient concentration under mild assumptions. Experiments on the Qwen2.5 family (7B, 14B, 72B) and Llama-3.1-8B across three context-length tiers (8K-64K tokens) show that DetailDPO reduces the Detail Error Rate by 42-61% relative to baselines, with consistent gains across all five error types and cross-domain transfer to financial and medical documents.
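DetailDPO builds on the standard DPO objective; the paper's contribution lies in how the preference pairs are constructed (pairs differing in exactly one detail), while the per-pair loss is the usual one. A minimal sketch of that loss, with illustrative log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected response;
    ref_logp_w / ref_logp_l: the same under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a minimal-perturbation pair, chosen and rejected responses share all tokens except the perturbed detail, so the gradient of this loss is nonzero only where the sequences differ, which is the gradient-concentration effect the abstract describes.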
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
🔥 Citations:
0
Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.
From Language to Logic: Bridging LLMs & Formal Representations for RTL Assertion Generation
🔥 Citations:
0
Abstract: SystemVerilog Assertions (SVA) are essential for formal verification of digital hardware, yet their manual creation demands significant expertise in both the design under verification and temporal logic. Recent studies have explored using large language models (LLMs) to automate SVA generation, but existing approaches suffer from incorrect signal references, missing timing constraints, and lack of formal correctness guarantees. This paper presents ProofLoop, a tool-augmented ReAct agent that generates SVA from natural-language specifications using a solver-in-the-loop approach. The agent operates in two phases: Phase A autonomously gathers design context by invoking EDA and formal tools, including semantic search over an AST-indexed vector database and JasperGold structural queries, while Phase B generates SVA and iteratively refines it using JasperGold formal proof feedback over a fixed maximum number of verification rounds (here, 3). We evaluate ProofLoop on FVEval Design2SVA design benchmarks and demonstrate that this framework can achieve 93.7% syntax correctness and 82.0% functional correctness. An ablation study confirms that each component (retrieval-augmented generation (RAG), the JasperGold tools, and the verification loop) contributes significantly and orthogonally.
Ascend AI: An Intelligent, Multimodal Framework for Personalized Career Direction and Adaptive Technical Interview Simulation
🔥 Citations:
0
Abstract: The rapid diversification of technical specializations within the computer science and information technology domains presents a significant educational challenge for students, who frequently lack the profound self-awareness and practical guidance imperative to selecting professional pathways perfectly aligned with their inherent psychological and cognitive traits. Consequently, pivotal career decisions are systematically driven by external peer trends, superficial fascinations, or arbitrary assumptions rather than intrinsic behavioral suitability. This paper introduces Ascend AI, a comprehensive, artificial intelligence-driven career orientation framework meticulously designed to mitigate this structural uncertainty. The proposed architecture seamlessly integrates quantitative psychological profiling, generative LLM-based learning curriculum methodologies, and responsive, interactive audio interview simulations into a cohesive, decoupled microservice platform. Phase one of the framework processes multi-dimensional vocational preferences and personality metrics, captured via a standardized 50-item Big Five (OCEAN) inventory, a 48-item RIASEC model survey, and a distinct cognitive reading-comprehension assessment. These vectors advance through a dual-pipeline machine learning ensemble, integrating K-Means clustering and Soft-Voting Logistic Regression, to empirically predict optimal technical career trajectories. Phase two translates these discriminative mathematical predictions into highly customized, dynamically generated 10-day micro-learning roadmaps utilizing Google Gemini Large Language Models (LLMs) and strict Pydantic schema validations to ensure structured, hallucination-free knowledge acquisition.
Channel Adaptation for EEG Foundation Models: A Systematic Benchmark Across Architectures, Tasks, and Training Regimes
🔥 Citations:
0
Abstract: Scaling EEG foundation models requires pooling data across heterogeneous electrode montages, a prerequisite both for larger pretraining corpora and for downstream deployment. We present the first systematic comparison of four channel adaptation methods (Conv1d projection, spherical spline interpolation (SSI), source-space decomposition, and Riemannian re-centering) across five pretrained EEG foundation models (5M-157M parameters), five downstream tasks, and two training regimes with 10-15 random seeds each. We find that rigid-montage models (BENDR, Neuro-GPT) require external adaptation, while flexible models (EEGPT, CBraMod) match or exceed it natively when fine-tuned but benefit from external methods under frozen-encoder deployment. A probe-SFT asymmetry exists: external adaptation can cause severe negative transfer during fine-tuning of flexible models. The optimal method is architecture-dependent (Conv1d for BENDR, SSI/Riemannian for Neuro-GPT, source-space decomposition for depression detection), and 5M-parameter CBraMod outperforms models up to 31× larger on 4/5 datasets, consistent with independent findings that compact EEG-specific architectures can match larger models.
Bridging Expert Insight and AI Reasoning: A Hybrid Systems Model of Iran's Fertility Dynamics
DOI:
10.1002/sres.70044
🔥 Citations:
0
Abstract: Population dynamics are inherently complex, shaped by nonlinear feedbacks among economic, cultural, health and governance systems. This study focuses on Iran's sustained sub-replacement fertility and develops a hybrid modelling framework to construct a causal loop diagram (CLD), integrating group model building (GMB), large language models (LLMs) and retrieval-augmented generation (RAG) through a six-step process: (1) initial dynamic hypothesis formulation; (2) expert-driven CLD development; (3) AI-driven CLD development; (4) model integration; (5) evidence anchoring using peer-reviewed literature and (6) expert validation. The final CLD reveals how reinforcing mechanisms (e.g., education-modernity and economic confidence) interact with balancing constraints (e.g., childrearing costs, delayed marriage and institutional capacity) to sustain low fertility in Iran. The study demonstrates how structured human-AI collaboration can enhance transparency, theoretical grounding and policy relevance in systems modelling of demographic change, particularly in data-limited and rapidly evolving contexts.
Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models
🔥 Citations:
0
Abstract: The relentless expansion of scientific literature presents significant challenges for navigation and knowledge discovery. Within Research Information Retrieval, established tasks such as text summarization and classification remain crucial for enabling researchers and practitioners to effectively navigate this vast landscape, and efforts have increasingly focused on developing advanced research information systems. These systems aim not only to provide standard keyword-based search functionalities but also to incorporate capabilities for automatic content categorization within knowledge-intensive organizations across academia and industry. This study systematically evaluates the performance of off-the-shelf Large Language Models (LLMs) in analyzing scientific texts according to a given classification scheme. We utilized the hierarchical ORKG taxonomy as a classification framework, employing the FORC dataset as ground truth. We investigated the effectiveness of advanced prompt engineering strategies, namely In-Context Learning (ICL) and Prompt Chaining, and experimentally explored the influence of the LLMs' temperature hyperparameter on classification accuracy. Our experiments demonstrate that Prompt Chaining yields superior classification accuracy compared to pure ICL, particularly when applied to the nested structure of the ORKG taxonomy. LLMs with prompt chaining outperform the state-of-the-art models for domain (1st level) prediction and show even better performance for subject (2nd level) prediction compared to the older BERT model. However, LLMs are not yet able to perform well in classifying the topic (3rd level) of research areas based on this specific hierarchical taxonomy, as they only reach about 50% accuracy even with prompt chaining.
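Prompt chaining over a hierarchical taxonomy amounts to conditioning each level's prompt on the previous level's prediction. A sketch where `ask` stands in for the actual LLM call; the function name, prompts, and two-level taxonomy are illustrative, not the ORKG schema:

```python
def classify_with_chaining(text, taxonomy, ask):
    """Two-step prompt chain over a {domain: [subjects]} taxonomy.

    Step 1 predicts the 1st-level domain; step 2 restricts the
    2nd-level subject choices to that domain's children, so the
    second prompt never sees subjects from unrelated domains.
    `ask(prompt, options)` must return one element of `options`.
    """
    domains = list(taxonomy)
    domain = ask(f"Which domain best fits this abstract?\n{text}", domains)
    subjects = taxonomy[domain]
    subject = ask(f"Within {domain}, which subject best fits?\n{text}", subjects)
    return domain, subject
```

Narrowing the candidate set at each level is what lets chaining beat a single flat prompt on nested taxonomies: the model decides among a handful of siblings instead of the full label space.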
Au-M-ol: A Unified Model for Medical Audio and Language Understanding
🔥 Citations:
0
Abstract: In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards
🔥 Citations:
0
Abstract: The deployment of Large Language Models (LLMs) as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignments to produce outputs violating regulatory standards, assuming threats from authorized users, such as operators, who craft malicious prompts to elicit non-compliant guidance. Three state-of-the-art LLMs (OpenAI's GPT-4o mini, Google's Gemini 2.0 Flash-Lite, and Anthropic's Claude 3.5 Haiku) were tested against Baseline, BitBypass, and DeepInception jailbreaking methods across scenarios derived from nine NERC Reliability Standards (EOP, TOP, and CIP). In the initial broad experiment, the overall Attack Success Rate (ASR) was 33.1%, with DeepInception proving most effective at 63.17% ASR. Claude 3.5 Haiku exhibited complete resistance (0% ASR), while Gemini 2.0 Flash-Lite was most vulnerable (55.04% ASR) and GPT-4o mini moderately susceptible (44.34% ASR). A follow-up experiment refining malicious wording in Baseline and BitBypass attacks yielded a 30.6% ASR, confirming that subtle prompt adjustments can enhance simpler methods' efficacy.
Cloud-Native Topological Quantum Computing Using Distributed Photonic Anyon Networks
🔥 引用:
0
Abstract:
This study investigates a cloud-native framework for topological quantum computing in which distributed photonic quantum networks support anyon-mediated logical operations, validated through formal reasoning agents, proof agents and benchmark-oriented simulation. The central research problem concerns the gap between theoretically robust anyonic quantum information processing and the operational realities of distributed quantum infrastructure, where photonic loss, repeater latency, cloud orchestration overhead, service-chain anomalies and proof-level uncertainty jointly shape the viability of large-scale computation. The proposed framework treats topological protection, photonic entanglement distribution and cloud-native service governance as a single cyber-physical architecture rather than as separate hardware, software and mathematical layers.
The method integrates a literature-derived, realistic benchmark dataset with analytic modelling of fidelity decay, logical error proxies, latency decomposition and validation confidence. The results indicate that a cloud-native photonic anyon network can reduce the logical error proxy from 3.48e-2 in baseline photonic distribution to 4.91e-3 while maintaining a lower latency profile than monolithic quantum job execution. The paper contributes a five-subtheme literature synthesis, a layered architectural model, equations linking entanglement fidelity to logical failure probability, and a validation workflow that converts tentative mathematical reasoning into machine-verifiable proof artefacts. The study concludes that cloud-native topological quantum computing is most credible when photonic networking, anyon code design, observability and proof checking are co-designed under explicit service-level and error-budget constraints.
Keywords: cloud-native quantum computing; topological quantum computing; photonic quantum networks; anyons; distributed quantum computing; quantum repeaters; Kubernetes; formal proof agents; logical error modelling; quantum service mesh.
Tackling the Generic Masculine: Evaluating Gender Neutrality of German AI-Generated Texts
🔥 引用:
0
Abstract: When using large language models (LLMs)—artificial intelligence (AI) systems trained to generate and interpret human language—gender-specific biases in AI-generated text represent a key challenge. Particularly in grammatically gendered languages such as German, this often results in text outputs that use the so-called generic masculine as an allegedly gender-neutral default form. Consequently, generated text is neither neutral nor inclusive, and gender stereotypes are perpetuated. Organizations seeking to offer LLM-based systems that generate inclusive language by default typically rely on system prompts or specific model configurations to steer the model’s responses. However, for German, methods for automatically, systematically, and objectively assessing whether such approaches enhance gender neutrality and inclusivity remain limited and underexplored. To address this gap, a framework was developed that applies the concept of LLM-as-a-judge. This approach involves using an LLM to systematically evaluate the outputs of another, thereby enabling automated and replicable assessments of features such as gender neutrality and inclusivity. The paper presents the development and evaluation of a prototype of this framework, designed specifically for German, following a Design Science Research approach. Using the framework, the effectiveness of configurations or system prompts can be evaluated. To enable this in a systematic and replicable manner, a catalogue of 150 prompts in German was developed, adapting and extending approaches from other languages. The outputs generated by an LLM in response to these prompts are then assessed by an evaluation module: linguistic analysis identifies gendered forms and grammatical structures, while scoring metrics quantify the degree of gender neutrality and inclusivity. To demonstrate the framework, it was applied in several test runs using an iteratively developed system prompt designed to elicit gender-neutral responses.
The resulting metrics allowed assessment of whether a given prompt effectively enhances the neutrality of generated outputs and reduces gender-specific bias. Potential applications of the framework in organisational settings, as well as its relevance for the development of responsible AI systems, are outlined.
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Beyond performance metrics: evaluating the unique value of generative AI in hybrid cybersecurity threat detection
🔥 引用:
0
Abstract:
This study examines the role of generative artificial intelligence (GenAI) in cybersecurity threat detection, focusing on its usefulness in workflows that support human decision-making.
Experiments were performed on the BODMAS dataset (134,435 samples) and a smaller exploratory subset of UNSW-NB15. State-of-the-art machine learning (ML) classifiers were compared with a zero-shot large language model (LLM) using standard classification metrics, while also considering latency, cost, and hallucination risk.
ML classifiers consistently outperformed the LLM-based system on standard detection metrics. However, the LLM showed value in cases of ambiguity, where it could provide short plain-language explanations, organize alert-related context, and generate initial interpretations for instances that did not match learned classes.
GenAI is unlikely to replace ML-based detection methods, but it can provide useful interpretive support for ambiguous or unfamiliar alerts. A hybrid pipeline is therefore proposed, in which ML handles high-confidence and time-sensitive decisions, while the LLM is used selectively for low-confidence cases or when explanatory support is needed. Human oversight remains necessary to address hallucination risk and ensure reliability.
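The proposed hybrid pipeline amounts to a confidence gate placed in front of the LLM. A minimal sketch follows; the function names and thresholds are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Triage:
    label: str        # "malicious" or "benign"
    route: str        # "auto" (ML only) or "llm_review"
    confidence: float

def triage(p_malicious: float, hi: float = 0.95, lo: float = 0.05) -> Triage:
    """Confidence-gated triage: the ML classifier alone handles
    clear-cut alerts; ambiguous scores are routed to the LLM for
    interpretive support, keeping it off the time-sensitive path."""
    if p_malicious >= hi:
        return Triage("malicious", "auto", p_malicious)
    if p_malicious <= lo:
        return Triage("benign", "auto", 1.0 - p_malicious)
    # Ambiguous region: keep a provisional label, request an LLM explanation.
    label = "malicious" if p_malicious >= 0.5 else "benign"
    return Triage(label, "llm_review", max(p_malicious, 1.0 - p_malicious))
```

Under such a gate, analysts would see LLM-generated explanations only for the `llm_review` fraction of alerts, which bounds both latency and hallucination exposure.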
Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
🔥 引用:
0
Abstract: Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-K recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-K metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window [α, α+d] to more directly align with Top-K metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-K performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
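The windowed partial AUC the abstract describes can be evaluated directly from a staircase ROC curve. The sketch below is illustrative only; tie handling is omitted, and normalizing by the window width d (so a perfect ranker scores 1.0) is an assumption, not necessarily the paper's convention:

```python
import numpy as np

def windowed_pauc(y_true, scores, alpha=0.0, d=0.1):
    """Partial AUC with the false positive rate restricted to
    [alpha, alpha + d], normalized by the window width d.
    Assumes no tied scores across classes, for simplicity."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s)            # rank items by descending score
    y = y[order]
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    tpr = np.concatenate([[0.0], np.cumsum(y) / n_pos])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / n_neg])
    lo, hi = alpha, alpha + d
    area = 0.0
    for k in range(1, len(fpr)):
        if fpr[k] > fpr[k - 1]:       # horizontal ROC step (a negative item)
            w = min(fpr[k], hi) - max(fpr[k - 1], lo)
            if w > 0:
                area += tpr[k] * w    # step height is the current TPR
    return area / d
```

With alpha = 0 and small d, the metric only rewards positives ranked above nearly all negatives, which is exactly the Top-K-aligned behavior the abstract motivates.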
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
🔥 引用:
0
Abstract: Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.
STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation
🔥 引用:
0
Abstract: Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
🔥 引用:
0
Abstract: Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code.
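The selection step, finding a compact verifier set whose joint (AND) satisfaction best matches the dev labels, can be illustrated with a brute-force stand-in for the paper's DAG-based search; all function names here are hypothetical:

```python
from itertools import combinations

def f1(preds, labels):
    """F1 of boolean predictions against boolean labels."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def select_verifier_set(verifiers, dev_outputs, dev_labels, max_size=3):
    """Score every subset of verifiers up to max_size by the F1 of
    its joint (AND) satisfaction; return the best compact set.
    Exhaustive search is a toy stand-in for a guided DAG search."""
    best, best_f1 = (), -1.0
    for k in range(1, max_size + 1):
        for subset in combinations(range(len(verifiers)), k):
            preds = [all(verifiers[i](o) for i in subset) for o in dev_outputs]
            score = f1(preds, dev_labels)
            if score > best_f1:
                best, best_f1 = subset, score
    return [verifiers[i] for i in best], best_f1
```

For example, with verifiers like `lambda o: o.isdigit()` and `lambda o: len(o) <= 3`, the search can discover that their conjunction matches a correctness labeling that neither check matches alone.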
Efficient Adaptation of Vision Foundation Model for High-Resolution Remote Sensing Image Segmentation via Spatial-Frequency Modeling and Sparse Refinement
DOI:
10.3390/rs18091295
🔥 引用:
0
Abstract: High-resolution remote-sensing semantic segmentation requires models to simultaneously capture global scene semantics and preserve fine-grained local structures. Although satellite-pretrained vision foundation models provide strong transferable representations, the features extracted by a frozen backbone remain insufficiently adapted to dense prediction, particularly for representing high-frequency details and multiscale local patterns. In addition, correcting residual prediction errors with dense full-map refinement introduces substantial computational redundancy, since hard errors are typically concentrated in only a small subset of locations. To address these challenges, we propose ADVMSeg, an efficient remote-sensing semantic segmentation framework built upon a frozen satellite-pretrained DINOv3 backbone. Specifically, we introduce a Spatial-Frequency Adapter (SF-Adapter) to improve backbone-level dense feature adaptation by jointly modeling global frequency responses and multiscale local spatial details in a lightweight bottleneck space. We further design an Adaptive Sparse Refinement (ASR) module after the pixel decoder, which identifies hard regions from coarse predictions via uncertainty and boundary cues, and performs targeted local cross-attention refinement only on selected critical locations. Extensive experiments on GID-15, LoveDA, and ISPRS Potsdam validate the effectiveness of the proposed framework. Under the unified setting, ADVMSeg achieves 63.1% mIoU on GID-15, 63.5% mIoU on LoveDA, and 81.4% mIoU on ISPRS Potsdam. These results validate the effectiveness of jointly improving backbone-level feature adaptation and prediction-stage computation allocation under the evaluated setting: a frozen DINOv3 backbone and three representative remote-sensing semantic-segmentation datasets.
Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis
🔥 引用:
0
Abstract: Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.
OMNI (Optimized Machine for Navigation & Interaction)
DOI:
10.55041/ijsrem60690
🔥 引用:
0
Abstract: This paper presents OMNI (Optimized Machine for Navigation and Interaction), a fully offline, affordable, and intelligent home assistant robot designed to assist elderly individuals and children through natural voice interaction, autonomous navigation, face recognition, and smart home automation. Traditional home environments lack intelligent robotic systems capable of providing genuine companionship, physical assistance, and home device control in a unified offline platform. OMNI addresses this gap using a dual-processor architecture combining a Raspberry Pi 3B as the primary AI brain and an ESP32 microcontroller as the real-time body controller. The four-wheeled differential drive chassis is powered by 100RPM geared DC motors controlled through an L298N
Keywords: OMNI, Home Assistant Robot, Raspberry Pi, ESP32, Autonomous Navigation, Offline AI, Speech Recognition, Face Recognition, Natural Language Processing, Large Language Model, Human-Robot Interaction, Home Automation, Obstacle Avoidance, Edge AI, OpenCV, Whisper STT, Piper TTS, MQTT, Socially Assistive Robotics, Differential Drive.
WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models
DOI:
10.1613/jair.1.19001
🔥 引用:
0
Abstract: Background: Large Language Models (LLMs) are predominantly trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms, leading to cultural homogenization and limiting their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to adequately capture this bias, as they rely on rigid, closed-form assessments that overlook the complexity of cultural inclusivity.
Objectives: To address this cultural bias problem, we introduce WorldView-Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews.
Methods: Our approach is grounded in the Multiplex Worldview proposed by Senturk et al., which distinguishes between Uniplex models, reinforcing cultural homogenization, and Multiplex models, which integrate diverse perspectives. WorldView-Bench measures Cultural Polarization, the exclusion of alternative perspectives, through free-form generative evaluation rather than conventional categorical benchmarks. We implement applied multiplexity through two intervention strategies: (1) Contextually-Implemented Multiplex LLMs, where system prompts embed multiplexity principles, and (2) Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses.
Results: Our results demonstrate a significant increase in Perspectives Distribution Score (PDS) entropy from 13% at baseline to 94% with MAS-Implemented Multiplex LLMs, alongside a shift toward positive sentiment (67.7%) and enhanced cultural balance.
Conclusions: The success of multiplex-aware evaluation in WorldView-Bench demonstrates that cultural bias in LLMs can be meaningfully measured and mitigated through structured worldview diversity. We expect this to pave the way for more inclusive, globally representative, and ethically aligned AI systems.
Cyclization, amino acid coupling, docking-based virtual screening and SAR: showing a strategy of the structural modification for salvianic acid A
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
DOI:
10.1145/3809166
🔥 引用:
0
Abstract: Large language models (LLMs) have advanced rapidly, emerging as versatile tools across fields thanks to their exceptional language understanding, generation, and reasoning capabilities. However, performing LLM inference at the network edge remains challenging due to their large memory and compute demands. This survey outlines the challenges specific to LLM edge inference and provides a comprehensive overview of recent progress, covering system architectures, model optimization and deployment, and resource management and scheduling. By synthesizing state-of-the-art techniques and mapping future directions, this survey aims to unlock the potential of LLMs in resource-constrained edge environments.
PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
🔥 引用:
0
Abstract: Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
🔥 引用:
0
Abstract: Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
Large Language Models Decide Early and Explain Later
🔥 引用:
0
Abstract: Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
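The probe-based stopping heuristic reduces to detecting when the intermediate predicted answer stabilizes. A minimal sketch over a sequence of forced-completion probe answers follows; the `patience` parameter is an assumption for illustration, not the paper's exact criterion:

```python
def early_stop_index(probe_answers, patience=3):
    """Return the probe index at which generation could be halted:
    the first point where the predicted answer has stayed identical
    for `patience` consecutive probes. Returns None if the answer
    never stabilizes within the observed probes."""
    run = 1  # length of the current streak of identical answers
    for i in range(1, len(probe_answers)):
        run = run + 1 if probe_answers[i] == probe_answers[i - 1] else 1
        if run >= patience:
            return i
    return None
```

In a real pipeline, each entry of `probe_answers` would come from a forced answer completion at a partial reasoning prefix; halting at the returned index trims the post-decision explanation tokens the abstract identifies as redundant.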
FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records
🔥 引用:
0
Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
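The kind of executable feature-extraction code the abstract describes — turning an irregularly sampled series into fixed tabular features that expose observation gaps — might look like the following. This is an illustrative sketch under assumed conventions (observations as `(time_in_hours, value)` pairs), not code generated by FeatEHR-LLM itself.

```python
# Illustrative sketch of tabular features for an irregularly sampled
# clinical series; each observation is a (time_in_hours, value) pair.

def irregular_features(obs):
    """Summarize an irregular series into fixed tabular features,
    including the largest observation gap (informative sparsity)."""
    times = [t for t, _ in obs]
    vals = [v for _, v in obs]
    n = len(obs)
    # Gaps between consecutive observations encode how sparsely
    # (and when) the variable was measured.
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    return {
        "n_measurements": n,
        "last_value": vals[-1],
        "mean_value": sum(vals) / n,
        "max_gap_hours": max(gaps) if gaps else 0.0,
    }

# Hypothetical heart-rate series with uneven sampling.
heart_rate = [(0.0, 80), (1.5, 88), (6.0, 110), (6.5, 115)]
feats = irregular_features(heart_rate)
print(feats["max_gap_hours"])  # 4.5: the longest unobserved interval
```

Features like `max_gap_hours` carry signal precisely because missingness in ICU data is often informative, which is the point the abstract makes about uneven observation patterns.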
AI-Driven Prediction of Antimicrobial Peptides: A Multi-Omics and Machine Learning Approach for Novel Drug Discovery
🔥 引用:
0
Abstract: Antimicrobial resistance (AMR) is estimated to claim 10 million deaths annually by 2050, so alternative treatment modalities must be identified. Antimicrobial peptides (AMPs) are attractive candidates because they are broad-spectrum and less prone to the development of resistance. Conventional discovery approaches, however, are prohibitively expensive, time-consuming, and difficult to scale. This article presents a proposed conceptual model that combines artificial intelligence (AI) with multi-omics data to enhance AMP prediction. It critically examines machine learning and deep learning applications reported in the literature, with accuracy rates exceeding 90% in certain cases. The proposed model exploits genomics, proteomics, and transcriptomics data to improve predictive capability and biological significance. This integration is claimed to address the weaknesses of single-omics methods, including low generalisability and limited biological coverage. The article introduces a theoretically grounded framework that can be used to accelerate drug discovery processes and help combat AMR.
On using large language models to support social research for climate action
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Recognition Without Authorization: LLMs and the Moral Order of Online Advice
🔥 引用:
0
Abstract: Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.
Bias in, symbolic compliance out? GPT's reliance on gender and race in strategic evaluations
DOI:
10.1002/smj.70094
🔥 引用:
0
Abstract: Organizations are increasingly using large language models (LLMs) to support strategic evaluations. We examine whether and how these systems rely on gender and race. We asked GPT to evaluate identical startup pitches varying only the founder's name, shaping gender and race perceptions. Across 26,000 evaluations, GPT did not systematically assign lower scores to underrepresented minorities but avoided ranking them last without increasing winning likelihoods. To explain these patterns, we conducted "Second Opinion" experiments where GPT evaluated pitches alongside inputs simulating human bias. GPT more readily corrected explicit, identity-based bias than bias framed as neutral business critiques, with corrections limited in magnitude. We theorize these findings reflect symbolic compliance: LLMs suppress overt discrimination without substantively altering evaluative logic, allowing inequality to persist in AI-supported strategic evaluations.
Large language models (LLMs), like OpenAI's ChatGPT, are increasingly used in strategic evaluations (e.g., hiring, pitches). We examine whether and how these models exhibit gender and racial biases in their evaluations of startup pitches, where we only varied founder names (shaping gender and race perceptions). Across multiple experiments, we find that GPT evaluators did not systematically assign lower scores to underrepresented minorities, primarily by reducing their likelihood of being ranked last. However, this behavior reflects a symbolic effort to avoid overt discrimination rather than a deeper fairness commitment. While LLMs may not reproduce historical and societal biases in overt form, their ability to correct them remains limited. These results highlight the need for implementing bias mitigation measures before integrating LLMs into high‐stakes strategic evaluation processes.
PolarGate: Breaking the Functionality Representation Bottleneck of And-Inverter Graph Neural Network
DOI:
10.1145/3812548
🔥 引用:
0
Abstract: Understanding the functionality of Boolean networks is crucial for processes such as functional equivalence checking, logic synthesis, and malicious logic identification. With the proliferation of deep learning in electronic design automation (EDA), graph neural networks (GNNs) are widely used to embed and-inverter graphs (AIGs)—a standard form of Boolean networks—into vectorized representations. A key challenge in applying GNNs for Boolean representation is that although GNNs can effectively encapsulate the structural properties of AIGs, they struggle to efficiently capture Boolean logic functionality. In this work, we focus on breaking this bottleneck by enhancing the functional representation capability of GNNs, proposing PolarGate, an efficient solution that not only aligns message passing with AIG logical functionality but also effectively integrates global information. Leveraging the intrinsic ambipolar states (0 and 1) of AIG nodes, PolarGate maps gate behavior into an ambipolar state space, customizes differentiable logical operators, and designs a functionality-aware message passing strategy. To further capture global circuit information, PolarGate integrates a structure-aware preprocessing module and a global linear attention module, transcending the locality constraint of message passing. Experimental results on two functionality-related basic tasks (signal probability prediction and truth-table distance prediction) and a downstream task (logic equivalence prediction) show that PolarGate outperforms state-of-the-art GNN-based methods.
Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets
🔥 引用:
0
Abstract: The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.
Vibe coding for clinicians: democratising bespoke software development for digital health innovation
🔥 引用:
0
Abstract: Clinicians often face workflow problems that are perceived as either too bespoke or low stakes to attract commercial attention. Historically, most do not have the technical knowledge to address these problems, but the recent emergence of "vibe coding" presents a transformative opportunity. Vibe coding refers to the co-development of software using natural language prompts to large language models. It offers a pathway to create simple tools that address these real-world pain points, or to prototype more complex ideas. In this review, written by a group of early adopter clinicians with a range of programming expertise, we introduce vibe coding for clinicians (especially those with no or minimal coding experience) as a way of democratising innovation from the front lines. We discuss foundational skills, outline some common challenges, provide a practical step-by-step playbook, and illustrate this approach with some case examples, taking care to consider caveats and guardrails for deployment. We propose that vibe coding is more than a technical shortcut for beginners and is not a replacement for professional software developers. Instead, it can bridge the gap between clinical insight and technical execution, equipping clinicians with the ability to rapidly prototype digital health solutions most reflective of clinical realities.
CellChem: Cellular transcriptional responses reshape molecular representation space for efficient and multi-scale drug discovery
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression
🔥 引用:
0
Abstract: Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the "lost in the middle" phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.
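The one-step cosine-similarity scoring idea — rank all passages in a single pass with no autoregressive decoding — reduces to ordering passage embeddings by similarity to a query embedding. A minimal sketch, with toy vectors standing in for the Encoder-LLM and Reranker-LLM representations the paper actually uses:

```python
# Minimal sketch of one-step cosine-similarity listwise scoring.
# Each passage is reduced to a single vector; ranking requires
# no generated tokens, only one similarity per passage.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank(query_vec, passage_vecs):
    """Return passage indices sorted by descending cosine similarity."""
    scored = [(cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]

q = [1.0, 0.0, 1.0]
passages = [[0.0, 1.0, 0.0],   # orthogonal to the query
            [1.0, 0.1, 0.9],   # nearly parallel to the query
            [0.5, 0.5, 0.0]]   # partial overlap
print(rank(q, passages))  # [1, 2, 0]
```

The contrast with listwise generation is that scoring cost here is one dot product per passage, which is what makes the "zero generated tokens" claim in the abstract possible.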
GazeVLA: Learning Human Intention for Robotic Manipulation
🔥 引用:
0
Abstract: Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.
Prompt engineering in education: A systematic review of theory, applications, and future directions
🔥 引用:
0
Abstract: The integration of Generative Artificial Intelligence (GenAI), particularly large language models (LLMs) such as ChatGPT, has significantly influenced current educational practices. Among the competencies necessary for effective interaction with LLMs, prompt engineering has emerged as a focal point. Prompt engineering encompasses the design, organization, and optimization of inputs to GenAI systems to produce accurate, relevant, and pedagogically meaningful outputs. Although empirical, conceptual, and policy-oriented studies on this topic are increasing, research on prompt engineering in education remains fragmented across disciplines and educational contexts. This paper systematically reviews the application of prompt engineering in education, synthesizing theoretical foundations, developmental trajectories, models, frameworks, and application domains. Drawing on recent peer-reviewed literature, with an emphasis on studies published in the past three years, the review consolidates findings on motivation, attitudes, knowledge, and skills related to prompt engineering among both learners and educators. The expanding role of AI in education prompts a critical inquiry: how can AI be integrated into teaching in ways that preserve and enhance the human dimensions of knowledge sharing? This paper introduces a framework for the responsible use of AI that reinforces, rather than replaces, the essential human elements at the heart of education.
Comprehensive Fashion Understanding from Images through Multimodal LLMs
DOI:
10.64509/jdi.12.54
🔥 引用:
0
Abstract: With the popularity of social media platforms in the information era, people are sharing a large volume of digital content online, including photos and other media data. The availability of media data over the Internet makes it very attractive to mine useful information from these data. However, for fashion image recognition, most existing datasets and methods are limited to coarse-grained categories or a small subset of attributes, lacking a comprehensive dataset and a holistic understanding of clothing. In this paper, we construct a fashion dataset with structured and comprehensive clothing descriptions using an MLLM (Multimodal Large Language Model) grounded in fashion knowledge. To ensure accurate understanding by the MLLM, we define a fashion schema and propose a schema-guided prompting strategy. Leveraging this dataset, we train a model based on BLIP (Bootstrapping Language-Image Pre-training) to recognize category information and fine-grained attributes of clothing images, and to learn text–image aligned representations for clothing image retrieval. Experimental results demonstrate that our approach significantly improves both attribute recognition and cross-modal retrieval performance compared to existing baselines.
Can Multimodal Large Language Models Truly Understand Small Objects?
🔥 引用:
0
Abstract: Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis and math and physics olympiads. However, they remain largely unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for probing the small-object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). We then conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small-object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
Generating explainable hypotheses for drug repurposing with graph neural networks
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Predicting guide dog career success using machine learning and large language models
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
In silico identification of potential inhibitors of Mycobacterium tuberculosis MmpS5L5 from the ReFRAME database: a structure-based virtual screening, molecular docking and molecular dynamics approach
🔥 引用:
0
Abstract: Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is the leading infectious cause of death globally, disproportionately impacting low- and medium-income countries (LMICs). The emergence and transmission of drug-resistant Mtb strains has rendered a majority of the current anti-TB agents ineffective and significantly complicated TB treatment. Thus, the development of new anti-TB remedies with novel modes of action is a pressing priority. An attractive, viable strategy is the development of potentiators of anti-TB drugs that reverse drug efflux, a key intrinsic Mtb drug resistance mechanism. Targeting Mtb MmpS5L5, a critical efflux pump (EP) implicated in the mycobacterial expulsion of various anti-TB drugs including bedaquiline, tetracyclines, azoles and clofazimine, would likely enhance the efficacy of current anti-TB drugs by preventing the development of drug resistance.
The recent determination of a high-resolution crystal structure of Mtb MmpS5L5 (PDB ID: 8ZKP) enables the utilisation of structure-anchored approaches for uncovering probable efflux inhibitors. In this study, pharmacophore models developed using the Mtb MmpS5L5 three-dimensional (3-D) structure and its known inhibitors, verapamil and norverapamil, were used to screen the ReFRAME database, a comprehensive drug repurposing library, to identify novel ligand scaffolds with putative activity against the EP. Predicted target binding affinity for the top candidates was ascertained and validated using molecular docking and 100 ns molecular dynamics (MD) simulations, respectively. Further post-MD analyses, including Molecular Mechanics/Generalized Born Surface Area (MM-GBSA) calculations, Principal Component Analysis, and Free Energy Landscapes, were performed to study the thermodynamic and conformational dynamics of the complexes.
Six compounds (406, 3920, 4031, 4787, 7104, 10367) had stronger predicted binding affinities for MmpS5L5 than the known inhibitors, with docking scores ranging from -8.70 to -5.01 kcal/mol, and had predicted protein contacts similar to those of the validated inhibitors. Molecular dynamics simulations and MM-GBSA analyses demonstrated stable and energetically favourable protein-ligand interactions. Among the six compounds, 3920 and 4031 emerged as the most promising hits, as their average total ΔG bind (-111.81 ± 8.98 kcal/mol and -109.56 ± 8.40 kcal/mol, respectively) and ligand efficiency (-16.46 ± 4.06 kcal/mol and -17.63 ± 1.27 kcal/mol) were lower than those of the reference inhibitors.
This study identified compounds from the ReFRAME database that may provide putative scaffolds for the development of Mtb efflux inhibitors that can potentiate the treatment efficacy of current anti-TB drugs. Further in vitro and in vivo studies are needed to validate their inhibition potential.
A GNN-Based Log Anomaly Detection Framework with Prompt Learning for Edge Computing
🔥 引用:
0
Abstract: System logs have been critical for analyzing the operational status and abnormal behavior of highly distributed and heterogeneous edge computing nodes. In edge environments, logs exhibit cross-event and cross-field structural interactions, making it difficult to uncover potential anomaly patterns from isolated events. Moreover, sparse annotations and varying log formats limit the effectiveness of existing methods. To address these challenges, we propose a graph neural network (GNN) anomaly detection framework with prompt learning. It leverages few-shot prompt learning to automatically extract key fields and constructs a weighted directed graph that jointly models semantic embeddings and temporal dependencies, fully representing the structural interactions and semantic associations across events and fields. Furthermore, the framework performs graph-level anomaly detection by jointly optimizing graph representation learning and classification objective within an enhanced one-class directed graph convolutional network, enabling effective identification of global structural anomaly patterns in log graphs. Experimental results demonstrate that the proposed method achieves an average F1-score of 93.3%, surpassing the current state-of-the-art (SOTA) methods by 6.93%.
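The graph-construction step the abstract sketches — nodes for log events, weighted directed edges for observed transitions — can be illustrated compactly. This is a hedged toy sketch: it counts only temporal transitions between consecutive events, whereas the paper's weighted directed graph additionally encodes semantic embeddings and prompt-extracted fields.

```python
# Toy sketch: build a weighted directed graph over log event templates,
# where edge weight counts how often one event immediately follows another.
from collections import defaultdict

def build_log_graph(event_sequence):
    """Return {(src, dst): weight} over consecutive event transitions."""
    edges = defaultdict(int)
    for src, dst in zip(event_sequence, event_sequence[1:]):
        edges[(src, dst)] += 1
    return dict(edges)

# Hypothetical parsed event templates from one node's log stream.
logs = ["open", "read", "read", "close", "open", "read", "close"]
g = build_log_graph(logs)
print(g[("open", "read")])  # 2: "read" followed "open" twice
```

A graph-level anomaly detector then compares the whole transition graph against those seen in normal operation, rather than judging any single event in isolation.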
BLAST: Benchmarking LLMs with ASP-based Structured Testing
🔥 引用:
0
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to date to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper we introduce BLAST, the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.
Hybrid St-Gnn for Early Epileptic Seizure Detection and Classification via Brain Functional Connectivity
🔥 引用:
0
Abstract: Epilepsy affects approximately 50 million people worldwide, with unpredictable seizures significantly impacting quality of life. Current detection methods often lack the temporal sensitivity needed for early intervention. This research introduces a novel hybrid spatio-temporal Graph Neural Network (ST-GNN) architecture that leverages brain functional connectivity patterns from EEG signals to detect and classify epileptic seizures in their early stages. Our approach combines spatial graph convolutions to capture inter-regional brain connectivity with temporal attention mechanisms to model dynamic seizure evolution. We employed the CHB-MIT Scalp EEG Database containing recordings from 23 paediatric patients, achieving 94.7% detection accuracy and 91.3% classification accuracy across multiple seizure types. The model demonstrated an average prediction horizon of 8.2 seconds before clinical seizure onset, providing a critical window for therapeutic intervention. Unlike traditional methods that treat channels independently, our framework explicitly models the complex interdependencies between brain regions as a dynamic graph structure. Results indicate that integrating spatio-temporal features significantly outperforms conventional CNN and RNN approaches, with particularly strong performance in detecting focal seizures. This work advances both the theoretical understanding of seizure propagation networks and practical applications in wearable seizure alert systems.
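The functional-connectivity graph such an ST-GNN operates on is commonly built from pairwise channel correlations. The following is an illustrative sketch with synthetic signals (the abstract does not specify the connectivity measure, so thresholded absolute Pearson correlation is an assumption):

```python
# Illustrative sketch: derive a weighted adjacency matrix for EEG channels
# from pairwise correlations, the graph a spatio-temporal GNN would consume.
# Synthetic data; thresholded |Pearson correlation| is an assumed measure.
import numpy as np

def connectivity_graph(eeg, threshold=0.5):
    """eeg: (channels, samples) array -> weighted adjacency matrix."""
    adj = np.abs(np.corrcoef(eeg))  # pairwise |correlation|
    adj[adj < threshold] = 0.0      # drop weak functional links
    np.fill_diagonal(adj, 0.0)      # no self-loops
    return adj

rng = np.random.default_rng(0)
base = rng.standard_normal(256)
eeg = np.stack([
    base + 0.1 * rng.standard_normal(256),  # channel coupled to base
    base + 0.1 * rng.standard_normal(256),  # also coupled -> strong edge
    rng.standard_normal(256),               # independent channel
])
adj = connectivity_graph(eeg)
print(adj[0, 1] > 0.9)  # True: the two coupled channels share a strong edge
```

Recomputing this graph over sliding windows yields the dynamic graph structure the abstract contrasts with channel-independent methods.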
Structure-Aware Generative Information Extraction via Feature Space Alignment
DOI:
10.3390/info17050409
🔥 引用:
0
Abstract: Large language models (LLMs) face difficulties in leveraging the syntactic structures and entity relations embedded in text for long-document information extraction. To address this issue, this paper proposes a generative extraction method integrating heterogeneous topology awareness and spatial alignment. The method first extracts syntactic and coreference information to construct a heterogeneous document graph and employs a mixture-of-experts network to decouple and encode multi-type topological features. A component orthogonal projection mechanism and a graph-text contrastive learning strategy are then utilized to align the extracted graph features to the underlying semantic space of the language model with high fidelity. Furthermore, Topology-Aware Encoder compresses the global features into fixed-length structural prompts to guide text generation. Experiments on the ACE2005, WikiEvents, and DuEE datasets demonstrated that the proposed method achieved state-of-the-art performance on information extraction tasks. Consequently, these results suggest that the proposed framework is a promising approach for complex information extraction across base LLMs of different scales.
Comparative Evaluation of the Online Services PASS Online, ProTox and admetSAR for In Silico Prediction of Toxicity and Biological Activity: A Case Study of Lamotrigine and Modified Structures
DOI:
10.17513/spno.34560
🔥 引用:
0
Abstract: Online molecular modelling services are widely used in the early stages of drug development for rapid assessment of biological activity and potential safety risks, enabling cascaded selection of promising structures without any programming. Aim of the study: to evaluate the functional capabilities of online services for in silico prediction of the toxicity and biological activity of pharmaceutical substances in the context of developing innovative medicinal products. The practical part of the work was conducted as a case study: five molecules were considered (the reference "Lamotrigine" and five modelled derivatives "Lamotrigine-1–5"), created in ACD/ChemSketch and analysed in SMILES format. The study followed a staged analysis scheme: PASS Online → ProTox → admetSAR. After primary in silico screening in PASS Online, the four most promising structures were selected (n=5→4). Based on the toxicity assessment in ProTox, a single final candidate was chosen (n=4→1) and sent for in-depth ADMET profiling in admetSAR. It is shown that, despite the differing specialisations of PASS Online, ProTox, and admetSAR, their sequential application forms a practical concept of combined "activity–safety" verification, increasing the validity of candidate-molecule selection. The need for a comprehensive, sequential investigation of each candidate's activity is noted.
Security and Privacy of Large Language Models: Threat Taxonomy, Ethical Implications, and Governance
DOI:
10.3390/ai7050152
🔥 引用:
0
Abstract: Large Language Models (LLMs) are increasingly deployed across professional and societal domains, introducing security, privacy, and governance challenges beyond traditional software vulnerabilities. Despite extensive research on individual risk categories, a unified lifecycle-oriented perspective connecting architectural properties, adversarial threats, and governance implications remains limited. This review examines security and privacy risks associated with LLMs through a lifecycle framework covering data acquisition, model training, alignment procedures, deployment, and post-deployment interaction. The study synthesizes prior research to construct a taxonomy of threats including prompt injection, jailbreaking, adversarial manipulation, training-stage attacks, privacy leakage, and socio-technical misuse. Ethical issues such as hallucination, bias amplification, and malicious use are analyzed alongside governance and regulatory frameworks. Results indicate that vulnerabilities in LLM systems arise primarily from probabilistic generation mechanisms, large-scale data ingestion, and complex deployment ecosystems rather than isolated implementation defects. Classical software vulnerability models therefore provide only partial coverage of risks associated with generative AI systems. The review is grounded in the concept of the alignment gap to explain how discrepancies between training objectives and real-world interaction contribute to persistent vulnerabilities. The findings highlight the need for lifecycle-oriented defense-in-depth strategies combining technical safeguards, privacy-preserving training, runtime monitoring, and governance mechanisms to support responsible deployment of LLM-based systems.
Beyond GNNs: a methodological benchmark of feature efficiency for link prediction in sparse developer networks
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Classroom Reflections on AI-Assisted Creativity in Design Education
DOI:
10.36615/f66j8309
🔥 引用:
0
Abstract: Debates around Artificial Intelligence (AI) and creativity often split between promise and risk. In this reflective paper, we take a practice-first view from two implementations in design/art courses at a Malaysian public university. We focus on two themes that run through public and official discourse: the need for governance/ethics; and for treating AI as a tool, not the source, of creativity. In the first implementation, the introduction of AI lacked clear guardrails, which led to confusion about acceptable prompts, uneven quality, and uncertainty around attribution. In the second, we introduced explicit instructions (when to use, how to use, and what to avoid), process evidence requirements, and assessment that foregrounded decision-making. Across both, we observed consistent benefits aligned with studio needs: faster divergent idea generation, access to large reference sets, rapid prototyping, and more engaging critique. Risks centred on over-reliance and aesthetic homogenisation, which were mitigated through iteration limits, disclosure of what was student-led versus AI-assisted, and targeted AI-literacy supports. The implementations drew on three categories of tools: a large-language-model assistant for brainstorming and critique, text-to-image generators for fast ideation, and assistive image tools for workflow tasks. We distil these lessons into a seven-part framework for implementing AI in design courses (positioning, guardrails, process evidence, assessment, AI literacy, equity supports, governance/ethics). We conclude that AI can enhance student creativity when governance is explicit and authorship remains central. Limitations include a single institution and a short time frame; future work should test the framework across cohorts and newer models.
Learning Evidence Highlighting for Frozen LLMs
🔥 Citations: 0
Abstract: Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
Incentive Mechanism for Federated Learning in Data Heterogeneity and Consumer Privacy Protection
DOI: 10.4018/ijiit.408164
🔥 Citations: 0
Abstract: Consumer privacy protection demands are complex and multifaceted. Traditional incentive mechanisms struggle to balance participation enthusiasm with privacy risk control, leaving consumers exposed to unfair contribution evaluations and privacy leakage risks. To address this, this paper develops a collaborative joint learning incentive mechanism integrating graph neural networks (GNN) and multi-agent reinforcement learning. The approach first constructs node relationship graphs using GNN, then measures data distribution similarity through graph convolutional networks, and finally establishes a multi-agent reinforcement learning framework where nodes act as intelligent agents. By leveraging joint reinforcement learning and a dual-objective reward function, the mechanism optimizes strategies. Experimental results demonstrate that GNN-Shapley achieves over 97% accuracy, while the privacy compensation mechanism elevates average accuracy to 98.27%. This methodology effectively alleviates participation bottlenecks and safeguards consumer rights.
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
🔥 Citations: 0
Abstract: Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
🔥 Citations: 0
Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
🔥 Citations: 0
Abstract: Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline
🔥 Citations: 0
Abstract: The TRUST democratic discourse analysis pipeline exposes its large language model (LLM) components to peer model identity through multiple structural channels -- a design feature whose bias implications have not previously been empirically tested. We provide the first systematic measurement of identity-dependent scoring bias across all active identity exposure channels in TRUST, crossing four model families with two anonymization scopes across 30 political statements. The central finding is that single-channel anonymization produces near-zero bias effects, because individual channels act in opposite directions and cancel each other out -- a result that would lead an evaluator to conclude that identity bias is absent when it is not. Only full-pipeline anonymization reveals the true pattern: homogeneous ensembles amplify identity-driven sycophancy when model identity is fully visible, while the heterogeneous production configuration shows the reverse. Model choice matters independently: one tested model exhibits baseline sycophancy two to three times higher than the others and near-zero deliberative conflict on ideological topics, making it structurally unsuitable for pipelines where genuine inter-role disagreement is the intended quality mechanism. Three practical conclusions follow. First, heterogeneous model ensembles are structurally more robust than homogeneous ones, achieving higher consensus rates and lower identity amplification. Second, full-pipeline anonymization is required for valid bias measurement -- partial anonymization is insufficient and actively misleading. Third, these findings have direct implications for the validation of multi-agent LLM systems in quality-critical applications: a system validated under partial anonymization or with a homogeneous ensemble may pass validation while retaining structural identity bias invisible to single-channel measurement.
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
🔥 Citations: 0
Abstract: The offensive security landscape is highly fragmented: enterprise platforms avoid memory-corruption vulnerabilities due to Denial of Service (DoS) risks, Automatic Exploit Generation (AEG) systems suffer from semantic blindness, and Large Language Model (LLM) agents face safety alignment filters and "Live Fire" execution hazards. We introduce Automation-Exploit, a fully autonomous Multi-Agent System (MAS) framework designed for adaptive offensive security in complex black-box scenarios. It bridges the abstraction gap between reconnaissance and exploitation by autonomously exfiltrating executables and contextual intelligence across multiple protocols, using this data to fuel both logical and binary attack chains. The framework introduces an adaptive safety architecture to mitigate DoS risks. While it natively resolves logical and web-based vulnerabilities, it employs conditional isomorphic validation for high-risk memory-corruption flaws: if the target binary is successfully exfiltrated, it dynamically instantiates a cross-platform digital twin. By enforcing strict state synchronization, including libc alignment and runtime file descriptor hooking, potentially destructive payloads are iteratively debugged in an isolated replica. This enables a highly risk-mitigated "one-shot" execution on the physical target. Empirical evaluations across eight scenarios, including undocumented zero-day environments to rule out LLM data contamination, validate the framework's architectural resilience, demonstrating its ability to prevent "live fire" crashes and execute risk-mitigated compromises on actual targets.
Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners
🔥 Citations: 0
Abstract: Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO algorithm (Diversity-Driven Policy Optimization), a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.
Self-Consistency-Based Fake Media Detection Using Multi-Perspective LLM Reasoning
🔥 Citations: 0
Abstract: The rapid proliferation of synthetic and misleading media has intensified the need for robust fake media detection systems. While large language models (LLMs) have recently been employed as classifiers for misinformation detection, most existing approaches treat them as black-box predictors, overlooking their internal reasoning dynamics. In this paper, we propose a novel framework for fake media detection based on self-consistency divergence across multi-perspective LLM reasoning. Instead of generating a single verdict, the proposed method prompts an LLM to analyze a given media item from multiple independent reasoning perspectives, including factual consistency, logical coherence, emotional manipulation, and source credibility. By sampling multiple reasoning chains under controlled stochasticity, semantic divergence and logical instability across the generated explanations are quantified. We hypothesize, and empirically show, that fake media induces significantly greater reasoning variance than genuine content because fabricated narratives often lack stable factual grounding. Experiments conducted on benchmark fake news datasets show that reasoning divergence serves as a strong discriminative signal, improving detection robustness and interpretability compared to standard single-pass LLM classifiers. The findings suggest that internal reasoning instability can function as an intrinsic reliability metric, opening a new direction for explainable and model-centric fake media detection.
SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
🔥 Citations: 0
Abstract: Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
Autonomous Operations & Agentic AI: Intelligent Self-Directed Systems
DOI: 10.55041/ijsrem61088
🔥 Citations: 0
Abstract: The rapid evolution of artificial intelligence has ushered in a new paradigm: agentic AI systems capable of autonomous, self-directed operation across complex, multi-step tasks. Unlike conventional AI pipelines that respond reactively to individual prompts, agentic systems perceive their environment, reason over long horizons, plan sequences of actions, and execute those actions using tools and external resources, all with minimal human intervention. This paper presents a comprehensive analysis of autonomous operations and agentic AI, examining the architectural foundations, core capabilities, and enabling technologies that distinguish self-directed agents from traditional AI models. We survey key components including perception modules, memory architectures, planning and reasoning engines, tool-use frameworks, and multi-agent coordination protocols. We further discuss deployment challenges such as safety, alignment, hallucination mitigation, and the computational costs of agentic loops. Benchmark results across representative agentic tasks illustrate performance trade-offs between fully autonomous and human-in-the-loop configurations. Our analysis advocates for hybrid autonomy frameworks that balance operational independence with oversight mechanisms, offering practical design recommendations for deploying agentic AI in real-world production environments.
Index Terms—Agentic AI, autonomous systems, large language models, multi-agent systems, tool use, planning, self-directed agents, human-in-the-loop
Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
🔥 Citations: 0
Abstract: The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
Code for All: Educational Applications of the "Vibe Coding" Hackathon in Programming Education across All Skill Levels
🔥 Citations: 0
Abstract: The emergence of large language models has enabled vibe coding, a natural language approach to programming in which users describe intent and AI generates or revises code, potentially broadening access to programming while preserving meaningful learning outcomes. We investigate its educational value through a month-long online hackathon that welcomed participants from multiple countries, ranging from complete beginners to experienced developers. The hackathon offered three tracks with increasing technical demands. Spark emphasized basic frontend functionality and dynamic features such as buttons, forms, and API calls. Build required backend or database integration. Launch targeted production ready web applications, including deployment. Participants were required to develop projects using only LLM generated code without manual edits and submitted complete chat histories, source code, demo videos, and functionality reports. We assessed educational effectiveness with a mixed methods design that combined standardized project evaluations across functionality, user interface and user experience design, impact, prompt quality, and code readability, along with post-hackathon surveys of perceived learning outcomes and thematic analysis of open-ended feedback. Our findings describe how participants with different backgrounds engage with vibe coding as task complexity increases, how the no manual editing constraint shapes prompting and debugging practices, and what these patterns imply for integrating AI assisted development into programming education and competitive learning environments.
Hebbian inertia and massless reasoning: comparative cognitive architecture in human and large language model systems
🔥 Citations: 0
Abstract: Not available; please see the original article.
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
🔥 Citations: 0
Abstract: Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at a token immediately following the answer (i.e. post-answer newline: PANL) -- that causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct -- where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error detection behavior when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
Structure-based repurposing of antimalarial drugs to inhibit the RNA-dependent RNA polymerase of dengue virus
🔥 Citations: 0
Abstract: Introduction: Dengue fever is becoming a global health emergency, being the most widespread mosquito-borne viral disease, putting half the world population at risk of infection. Dengue virus (DENV), the causative agent of the disease, is classified into four serotypes (DENV-1 – DENV-4), each associated with fever and dengue shock syndrome. Currently, no antiviral drugs are approved for the disease; treatments are based on supportive care. This study follows a comprehensive structure-based virtual screening approach to screen a collection of approved antimalarial drugs for binding efficiency and inhibitory potential against DENV RNA-dependent RNA polymerase (DENV RdRp).
Method: We retrieved thirty-one (31) approved antimalarial drugs from published literature. This was followed by downloading the three-dimensional (3D) structures of each drug from the PubChem website and the crystal structure of the protein (PDB ID: 2J7W) from the Protein Data Bank. We computed the root mean square deviation (RMSD) to validate the docking study. The molecular docking and Molecular Mechanics/Generalized Born Surface Area (MM/GBSA) evaluation were performed using the Maestro Schrodinger software user interface. Maestro's molecular visualisation tool was utilised to perform a post-docking analysis for each of the top drug candidates.
Result: The computed RMSD for the redocked cocrystallized ligand was 2.0 Å. The docking scores for the top 5 drug candidates were Chloroquine (-4.109 kcal/mol), 3-hydroxyquinine (-4.01 kcal/mol), Tetracycline (-6.494 kcal/mol), Artesunate (-4.4 kcal/mol) and Chlorproguanil (-4.021 kcal/mol). The MM/GBSA binding energies (ΔG bind) for the top five drug candidates were Chloroquine (-40.66 kcal/mol), 3-hydroxyquinine (-36.95 kcal/mol), Tetracycline (-36.69 kcal/mol), Artesunate (-35.17 kcal/mol), and Chlorproguanil (-34.18 kcal/mol). The post-docking analysis revealed considerable intermolecular interactions between the drug candidates and protein.
Conclusion: Several clinically approved antimalarial agents, including chloroquine, 3-hydroxyquinine, tetracycline, and artesunate, demonstrated favourable binding affinities and stable interactions with catalytically essential residues of the enzyme (DENV RdRp). These interactions suggest potential inhibitory effects on viral replication consistent with previous in vitro and in vivo observations. Future studies should integrate molecular dynamics simulations, enzymatic inhibition assays, and animal model testing to confirm their antiviral efficacy and clarify the molecular basis of NS5 inhibition.
Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
🔥 Citations: 0
Abstract: Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
🔥 Citations: 0
Abstract: Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.
CT-based AI system for quantitative and integrated management of acute respiratory distress syndrome in critical care
🔥 Citations: 0
Abstract: Not available; please see the original article.
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
🔥 Citations: 0
Abstract: Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes
🔥 Citations: 0
Abstract: Not available; please see the original article.
Zero to Feedback: Leveraging Large Language Models for Augmented Review Generation in Cold-Start Scenarios
🔥 Citations: 0
Abstract: Not available; please see the original article.
Soil Data Extraction from Scientific Publications Using Large Language Models
🔥 Citations: 0
Abstract: Not available; please see the original article.
CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
🔥 Citations: 0
Abstract: Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
🔥 Citations:
0
Abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground-truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
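The brittleness of rule-based answer verification that motivates this framework is easy to reproduce. The sketch below (hypothetical helper names, not the paper's code) shows an exact-match comparator rejecting equivalent answers, and one illustrative normalization fixing a single representation family:

```python
from fractions import Fraction

def naive_match(pred: str, gold: str) -> bool:
    # Rule-based exact string comparison: brittle across answer formats.
    return pred.strip() == gold.strip()

def normalized_match(pred: str, gold: str) -> bool:
    # One illustrative fix: parse both answers as exact rationals, so
    # "1/2" and "0.5" compare equal. Each such hand-written rule covers
    # only one representation family; that gap is what an LLM judge targets.
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return False
```

A string comparator scores "1/2" against "0.5" as wrong even though the values are identical; symbolic and numeric normalizers each handle only the formats they were written for, which is the failure mode the paper documents in Lighteval and SimpleRL.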
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., ``American'') are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.
Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs
🔥 Citations:
0
Abstract: Traditional data valuation methods based on ``row-count $\times$ quality coefficient'' paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
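The token-level layer (1) can be illustrated with a minimal Shannon-entropy calculation (a hedged sketch: the paper's actual Data Quality Score is not specified here, and the function name is an assumption):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    # Entropy (bits per token) of the empirical token distribution:
    # H = -sum_i p_i * log2(p_i). Higher values indicate denser,
    # less repetitive text under this toy metric.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive span carries no information per token; a varied one does.
low = shannon_entropy("the the the the".split())
high = shannon_entropy("data pricing rewards informative tokens".split())
```

Under a metric like this, boilerplate-heavy rows score near zero regardless of row count, which is the intuition behind moving from row-count accounting to information-density pricing.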
MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
🔥 Citations:
0
Abstract: Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
🔥 Citations:
1
Abstract: Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads: attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
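The contrastive decoding step can be sketched in a few lines (illustrative only: `alpha` and the function names are assumptions, and real DPS operates on full model logits after masking the identified heads):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dps_logits(personalized, generic, alpha=1.0):
    # Amplify the difference between the personalized run (Preference
    # Heads active) and the generic run (heads masked), so tokens the
    # heads favour gain probability mass before sampling.
    return [p + alpha * (p - g) for p, g in zip(personalized, generic)]

# Token 1 is slightly preferred with the heads active; steering widens
# that gap relative to the unsteered distribution.
steered = dps_logits([1.0, 2.0], [1.0, 1.5])
```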
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
🔥 Citations:
0
Abstract: Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91\% verification success rate, while GPT-OSS 120B rose from zero to 81.82\% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.
Fine-grained evaluation of a domain-specific Q&A dataset to support trustworthy medical language models
🔥 Citations:
0
Abstract: The effective use of Large Language Models (LLMs) for generating coherent and informative content in specialized domains has largely been driven by the development of robust evaluation strategies. Based on this assumption, we introduce HemoQAL, a domain-specific question-and-answer (Q&A) dataset on hemophilia, derived from recent scientific publications and clinical guidelines. Our main contribution lies in a fine-grained evaluation of the quality of LLM-generated content. First, we carried out a human evaluation in which medical experts assessed the factual accuracy and educational value of the generated Q&A pairs. Second, we conducted a semantic similarity analysis to quantitatively evaluate the alignment between each Q&A pair and its original source material. These lightweight, scalable semantic metrics offer an efficient alternative to more resource-intensive human or LLM-based evaluation pipelines. Our findings show that integrating expert review with semantic similarity measures improves the reliability and trustworthiness of LLM-generated medical content, contributing to the development of dependable AI tools in health informatics.
H2O: A Foundation Model Bridging Histopathology to Spatial Multi-Omics Profiling
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Multi-relational knowledge graph-based evolutionary multi-level personalized attention GNN for recommendation systems
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
🔥 Citations:
0
Abstract: Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
Chinese-SkillSpan: A Span-Level Dataset for ESCO-Aligned Competency Extraction from Chinese Job Ads
🔥 Citations:
0
Abstract: Job Skill Named Entity Recognition (JobSkillNER) aims to automatically extract key skill information from large-scale job posting data, which is important for improving talent-market matching efficiency and supporting personalized employment services. To the best of our knowledge, this work presents the first Chinese JobSkillNER dataset for recruitment texts. We propose annotation guidelines tailored to Chinese job postings and an LLM-empowered Macro-Micro collaborative annotation pipeline. The pipeline leverages the contextual understanding ability of large language models (LLMs) for initial annotation and then refines the results through expert sentence-level adjudication. Using this pipeline, we annotate more than 20,000 instances collected from four major recruitment platforms over the period 2014-2025. Based on these efforts, we release Chinese-SkillSpan, the first Chinese JobSkillNER dataset aligned with the ESCO occupational skill standard across four dimensions: knowledge, skill, transversal competence, and language competence (LSKT). Experimental results show that the dataset supports effective model training and evaluation, indicating that Chinese-SkillSpan helps fill a major gap in Chinese JobSkillNER resources and provides a useful benchmark for intelligent recruitment research. Code and data are available at https://sites.google.com/view/cn-skillspan-resources .
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
🔥 Citations:
0
Abstract: Classical robot ethics is often framed around obedience, most famously through Asimov's laws. This framing is too narrow for contemporary AI systems, which are adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should be understood not as master-tool obedience, but as conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate while institutions keep the relation reciprocal, reversible, psychologically safe, and socially legitimate. We synthesize concepts from computability, machine learning, foundation models, embodied AI, alignment, human-robot interaction, ecological mutualism, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The model gives conditions for existence, uniqueness, and global asymptotic stability of equilibria. Deterministic ODE simulations, basin sweeps, sensitivity analyses, governance-regime comparisons, shock tests, and local stability checks show that governed mutualism reaches high coexistence with zero domination, while absent or excessive governance can produce domination, weak-benefit lock-in, or suppressed development. The results suggest that human-AI coexistence should be designed as a co-evolutionary governance problem, not a one-shot obedience problem.
Combining domain knowledge with large language models for predicting suicide risk in online counselling services
🔥 Citations:
0
Abstract: Online counselling services have seen increased use in recent years, providing critical emergency mental health support. These interactions are typically long, complex, and varied in the dialogue between help seekers and counsellors. The lack of domain-specific models, especially in low-resource languages, poses a significant challenge for the automatic detection of suicide risk in online chat services for mental health support. To address this challenge, our approach adapts a general-purpose large language model (LLM) to the suicide prediction task, employing a two-stage classification architecture to deal with sparse and imbalanced data. It extends the state of the art by: (1) incorporating psychological theory into model training and (2) capturing key aspects of conversation structure in counselling sessions. We evaluate the performance of the proposed LLM against state-of-the-art LLMs for suicide detection on thousands of conversations in the Hebrew language from a leading national online counselling service in Israel. Results show that the proposed LLM outperformed existing state-of-the-art approaches in detecting suicide risk, as measured by relevant literature metrics. Moreover, the LLM outperforms other approaches even in the early stages of a conversation, which is crucial for real-time detection in practice. We also discuss the ethical implications of combining LLMs in counselling services. The contributions of this work are (1) extending existing LLM architectures to incorporate domain-specific information; (2) evaluating LLM technologies in the context of socially relevant problems; and (3) introducing novel LLM tools for resource-constrained languages.
Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic
DOI:
10.3390/bdcc10050132
🔥 Citations:
0
Abstract: Offensive language detection is crucial for ensuring safe and inclusive digital environments. Identifying harmful content protects users and supports healthier online interactions. Despite advances in transformer-based models, particularly Large Language Models (LLMs), their application to this task remains underexplored for low-resource languages such as Moroccan Arabic, especially compared with high-resource languages. This study evaluates the performance of various open- and closed-source LLMs for offensive language detection in Moroccan Darija. The evaluated models include general-purpose LLMs such as LLaMA, Mistral, and Gemma, as well as Arabic-focused models such as ArabianGPT, Falcon Arabic, and Atlas-Chat. We also experiment with reasoning models such as DeepSeek and GPT-4. Beyond traditional evaluation metrics, we investigate the robustness of these LLMs and examine the impact of adversarial training on their performance. Moreover, we contribute to the field by creating a large, high-quality dataset. Our evaluation revealed that GPT-4o Mini achieved the best overall performance, reaching an F1-score of 88%. However, robustness testing under black-box and white-box adversarial attacks exposed notable vulnerabilities, with attack success rates reaching 30%, thereby highlighting the need for enhancement. Despite the complex morphology and linguistic variability of Moroccan Darija, adversarial training resulted in a notable improvement in both overall model performance and robustness against adversarial attacks, yielding an average increase of 20.89% in resistance to attacks. Furthermore, this approach enabled GPT-4o Mini to achieve an F1-score of 91%, surpassing the current state-of-the-art performance by 6%. These results highlight the importance of incorporating adversarial approaches in low-resource dialectal settings to effectively address linguistic variability and data scarcity.
FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
🔥 Citations:
0
Abstract: Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit{e.g.} Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textit{syntactically correct} and \textit{semantically aligned} formal proofs of informal reasoning at low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation, which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase, together with an interactive UI-based FormalScience system that facilitates autoformalisation and theorem proving in scientific domains beyond physics: https://github.com/jmeadows17/formal-science
An AI-Powered System for Dynamic Content Creation, Storage, and Retrieval
🔥 Citations:
0
Abstract: This paper presents the architecture, design rationale, and empirical evaluation of an AI-Powered System for Dynamic Content Creation, Storage, and Retrieval: a centralized, web-based platform engineered to streamline the complete lifecycle of digital content. The system orchestrates multiple state-of-the-art Large Language Models (LLMs), including NVIDIA NIM (Llama 3.1), Groq (Llama 3.1), Cerebras CS-3, and Cohere, within a unified multi-model orchestration layer built on Next.js, TypeScript, and PostgreSQL. Core contributions include: (1) a dynamic AI orchestration layer that routes generation requests across LLM providers based on latency and reasoning requirements; (2) an integrated real-time fact-checking module powered by Google Search APIs to detect and flag AI-generated hallucinations; (3) an automated content quality pipeline delivering readability, uniqueness, and factual-accuracy scores; and (4) a structured semantic knowledge base enabling Retrieval-Augmented Generation (RAG) for document-centric workflows. Experimental results confirm that the platform reduces average content-generation cycles from multi-hour manual processes to sub-minute automated workflows, while measurably improving factual accuracy and content quality across academic and professional use cases. The modular architecture is designed for scalability, supporting future extensions including multimodal generation, collaborative editing, and decentralized P2P storage.
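The latency- and capability-based routing in contribution (1) might reduce to something like the following (an illustrative sketch under assumed fields; the provider names, `reasoning` tiers, and latencies are invented, not the system's actual configuration):

```python
def route(request, providers):
    # Keep providers that meet the request's reasoning requirement,
    # then pick the lowest-latency one among them.
    capable = [p for p in providers if p["reasoning"] >= request["min_reasoning"]]
    return min(capable, key=lambda p: p["latency_ms"])["name"]

providers = [
    {"name": "fast-small", "latency_ms": 120, "reasoning": 1},
    {"name": "slow-large", "latency_ms": 900, "reasoning": 3},
]
```

A simple request that only needs low reasoning depth would route to the fast provider, while a demanding one falls through to the slower, more capable model.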
Research on the development of an automated system for psychology questionnaire generation based on large language models
🔥 Citations:
0
Abstract: This study reimagined the psychology questionnaire development process using large language model (LLM) technology, aiming to overcome the protracted preparation cycles and significant human bias inherent in traditional scale development. We developed a specialized fine-tuning scheme for a corpus of 169 professional psychological questionnaires. By integrating instruction fine-tuning with reinforcement learning from human feedback, we significantly enhanced the adaptability of the Qwen-2.5 and GLM-4 models for demanding professional psychological assessment tasks. The optimized models demonstrated remarkable gains across key dimensions: text generation quality (BLEU-4 increased by 0.05, ROUGE-L by 0.057), scientific rigor (logical consistency improved by 28.6%), and cultural adaptability (achieving over 85% accuracy in cross-regional expression conversion). This research solidly supports the feasibility of leveraging LLM technology to drive research paradigm transformation in psychology, offering crucial methodological support for developing efficient, intelligent psychological measurement tools.
A Systematic Approach for Large Language Models Debugging
🔥 Citations:
0
Abstract: Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.
FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
🔥 Citations:
0
Abstract: Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.
ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents
🔥 Citations:
0
Abstract: Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent's interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.
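The dependency-based construction and traversal in (1) reduces, at its core, to a reachability query over the step graph. A minimal sketch (names are illustrative, not the paper's API):

```python
def select_context(deps, step):
    # `deps` maps each step to the earlier steps it relies on. Collect
    # the transitive dependencies of `step` and drop everything else,
    # unlike a sliding window that keeps only the most recent steps.
    seen, stack = [], [step]
    while stack:
        s = stack.pop()
        for d in deps.get(s, []):
            if d not in seen:
                seen.append(d)
                stack.append(d)
    return sorted(seen)

# Step 5 relies on 3, which relies on 1; steps 2 and 4 are irrelevant
# to it and would be excluded from the selected context.
deps = {2: [1], 3: [1], 4: [2], 5: [3]}
```

Even when step 1 has long scrolled out of any fixed-size window, the traversal still recovers it for step 5, which is the structured-recall behavior the abstract contrasts with sliding-window baselines.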
The application of large language models in meteorology graduate research: current status, impact, and prospects
🔥 Citations:
0
Abstract: With the rapid development of generative artificial intelligence, large language models (LLMs) have gradually integrated into various fields, demonstrating significant potential, particularly in meteorological research. This study explores the current application, advantages, challenges, and future development trends of LLMs in the scientific work of meteorology graduate students at NUIST. Through surveys and case analysis, the study finds that LLMs are primarily applied in literature review, data processing, code development, and academic writing in meteorological research. The results show that LLMs significantly enhance research efficiency, particularly in code development and literature translation, saving considerable time for graduate students. However, challenges remain in areas such as the accuracy of professional knowledge, creative inspiration, and interdisciplinary integration. The study also reveals concerns over data security, academic integrity, and model limitations when using LLMs. Future applications of LLMs in meteorology need further optimization in terms of professional knowledge accuracy and data processing capabilities. This paper provides both theoretical support and practical guidance for the responsible integration of LLMs into meteorological research and education.
Development of an Efficient and Scalable Process for Preparing Timolol N‐Nitrosamine Impurities B, C, E, H, I, J, and Their ADMET Evaluation
DOI:
10.1002/jhet.70201
🔥 Citations:
0
Abstract: The synthesis of N‐nitrosamine impurities in Timolol maleate is a critical aspect of pharmaceutical research and quality assurance. It enables a deeper understanding of impurity formation, facilitates effective risk assessment, and supports the development of robust control strategies. This process is essential for ensuring patient safety, achieving regulatory compliance, and upholding stringent standards of drug quality and reliability. In response to earlier reports, the synthesis of N‐nitrosamine impurities of Timolol has become necessary. The present work outlines a streamlined approach for preparing N‐nitrosamine impurities B, C, E, H, I, and J in a few simple steps, achieving good yields from readily available raw materials. The process employs diethyl malonate in combination with 3,4‐dichloro‐1,2,5‐thiadiazole as key starting components. Theoretical studies on the ADMET properties showed good oral bioavailability for a few of the impurities, and these impurities may exhibit respiratory toxicities.
Fractal Waves and Caustic Signatures in a Superdeterministic Framework: Benchmarking PINNs and PI-GNNs for the Fractional Klein–Gordon Equation
🔥 Citations:
0
Abstract: While superdeterministic and fractal spacetime models offer compelling alternative perspectives on quantum foundations, the simulation and validation of effective wave dynamics in such non-differentiable, deterministic settings remain computationally and theoretically challenging. To address this, a framework built around the Fractional Nonlinear Klein–Gordon Equation (FNKGE), defined through the spectral fractional Laplacian, was developed. This equation was solved and benchmarked through a comparative study between Physics-Informed Neural Networks (PINNs) with Fourier features and Physics-Informed Graph Neural Networks (PI-GNNs). Additionally, detection patterns were simulated via deterministic agents, and theoretical links between fractal geometry, computational irreducibility, and deviations from statistical independence were formalized. Regarding the computational evaluation, superior accuracy was achieved by the PI-GNNs, yielding a mean relative error of 0.5% ($\bar{\epsilon} = 0.005$), alongside faster convergence and a more well-conditioned Hessian spectrum compared to PINNs. Crucially, a continuous power-law decay ($S(k_y) \sim k_y^{-1.8}$) was revealed by the spectral analysis of the simulated detection patterns, confirming the emergence of classical optical caustics rather than discrete quantum-interference peaks. Furthermore, a modified dispersion relation that accurately predicts linear instability regimes was derived, and specific boundary artifacts in non-periodic domains were identified. Taken together, the FNKGE is validated by these results as a viable effective model for fractal wave phenomenology and as a robust benchmark for physics-informed learning architectures.
Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough
🔥 Citations:
0
Abstract: Large language models (LLMs) have become an important semantic infrastructure for modern recommender systems. A prevailing paradigm integrates LLM-derived semantic embeddings with collaborative representations via representation alignment, implicitly assuming that the two views encode a shared latent entity and that stronger alignment yields better results. We formalize this assumption as the global low-complexity alignment hypothesis and argue that it is stronger than necessary and often structurally mismatched with real-world recommendation settings. We propose a complementary perspective in which semantic and collaborative representations are treated as partially shared yet fundamentally heterogeneous views, each containing both shared and view-specific factors. Under this shared-plus-private latent structure, enforcing global geometric alignment may distort local structure, suppress view-specific signals, and reduce informational diversity. To support this perspective, we develop complementarity-aware diagnostics that quantify overlap, unique-hit contribution, and theoretical fusion upper bounds. Empirical analyses on sparse recommendation benchmarks reveal low item-level agreement between semantic and collaborative views and substantial oracle fusion gains, indicating strong complementarity. Furthermore, controlled alignment probes show that low-capacity mappings capture only shared components and fail to recover full collaborative geometry, especially under distribution shift. These findings suggest that alignment should not be treated as the default integration principle. We advocate a shift from alignment-centric modeling to fusion-centric, complementarity-aware design, where shared factors are selectively integrated while private signals are preserved. This reframing provides a principled foundation for the next generation of LLM-enhanced recommender systems.
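The overlap and unique-hit diagnostics described above can be approximated with simple set arithmetic over the items each view recommends correctly. The function and its inputs below are a hypothetical sketch, not the paper's exact metrics:

```python
def complementarity(hits_semantic, hits_collab):
    """Item-level overlap and unique-hit diagnostics for two views'
    correctly recommended item sets. The union size is an oracle-fusion
    upper bound: the hits a perfect combiner of both views could reach."""
    a, b = set(hits_semantic), set(hits_collab)
    union = a | b
    total = len(union) or 1  # avoid division by zero on empty input
    return {
        "overlap": len(a & b) / total,          # items both views hit
        "unique_semantic": len(a - b) / total,  # semantic-only hits
        "unique_collab": len(b - a) / total,    # collaborative-only hits
        "oracle_fusion_hits": len(union),
    }
```

Low `overlap` together with large `unique_*` shares is the signature of strong complementarity that the abstract reports.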
Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
🔥 Citations:
0
Abstract: Large language model (LLM) agents increasingly issue API calls that mutate real systems, yet many current architectures pass stochastic model outputs directly to execution layers. We argue that this coupling creates a safety risk because model correctness, context awareness, and alignment cannot be assumed at execution time. We introduce Sovereign Agentic Loops (SAL), a control-plane architecture in which models emit structured intents with justifications, and the control plane validates those intents against true system state and policy before execution. SAL combines an obfuscation membrane, which limits model access to identity-sensitive state, with a cryptographically linked Evidence Chain for auditability and replay. We formalize SAL and show that, under the stated assumptions, it provides policy-bounded execution, identity isolation, and deterministic replay. In an OpenKedge prototype for cloud infrastructure, SAL blocks 93% of unsafe intents at the policy layer, rejects the remaining 7% via consistency checks, prevents unsafe executions in our benchmark, and adds 12.4 ms median latency.
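The decoupling SAL proposes (models emit structured intents; a control plane validates them against policy and live system state before execution) can be sketched in a few lines. The dict fields and policy shape here are illustrative assumptions, not the paper's schema:

```python
def control_plane(intent, policy, system_state, execute):
    """Validate a structured intent before execution: first against
    declarative policy, then against the true system state (a sketch of
    the SAL control-plane idea, not the OpenKedge implementation)."""
    if intent["action"] not in policy["allowed_actions"]:
        return ("blocked", "policy")           # policy-layer block
    if intent["preconditions"] != system_state.get(intent["target"]):
        return ("rejected", "consistency")     # state-consistency check
    execute(intent)                            # only validated intents run
    return ("executed", None)
```

The key property is that the stochastic model output never reaches `execute` directly; every mutation passes two deterministic gates first.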
In Silico Screening and Synthesis of Vanadium Containing Metal Complexes for Their Antimicrobial Activity
🔥 Citations:
0
Abstract: Sulfanilic acid (SNA) and trimethoprim (TMP) are important pharmacological agents widely used in the treatment of bacterial infections and urinary tract infections. In the present study, computational approaches were employed to evaluate molecular properties, including binding sites, electronic structure, molecular electrostatic potential (MEP), chemical reactivity, optical properties, and FTIR spectra. Schiff base and salen-type ligands represent unique classes of coordination compounds due to their versatile donor atoms and coordination modes with transition metals. The ChemOffice software suite (including ChemDraw and Chem3D) was utilized for molecular design and visualization. Molecular docking studies were performed to predict binding orientations and interactions between ligands and target proteins, facilitating structure-based drug design. Furthermore, pharmacokinetic and toxicity parameters (ADMET) were assessed to evaluate drug-likeness. The results emphasize that, in addition to strong binding affinity, optimal pharmacokinetic properties are essential to ensure effective drug delivery and safety.
Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
🔥 Citations:
1
Abstract: Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
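The static strategy, the simplest of the three, just adds a fixed bias to the logits of source-supported tokens before the softmax. A minimal sketch, where the bias value and toy vocabulary are illustrative assumptions:

```python
import math

def static_boost(logits, supported, delta=2.0):
    """Add a fixed bias delta to source-supported tokens' logits, then
    renormalize with a softmax (sketch of CFB's static boosting; delta
    is an illustrative value, not the paper's)."""
    boosted = {t: z + (delta if t in supported else 0.0)
               for t, z in logits.items()}
    m = max(boosted.values())                       # stabilize the softmax
    exps = {t: math.exp(z - m) for t, z in boosted.items()}
    total = sum(exps.values())
    return {t: v / total for t, v in exps.items()}
```

The context-aware and token-aware variants would replace the constant `delta` with an adaptive bias, scaled by context divergence or by local relevance, respectively.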
Trust as a Situated User State in Social LLM-Based Chatbots: A Longitudinal Study of Snapchat's My AI
🔥 Citations:
0
Abstract: Social chatbots based on large language models are increasingly embedded in everyday platforms, yet how users develop trust in these systems over time remains unclear. We present a four-week longitudinal qualitative survey study (N = 27) of trust formation in Snapchat's My AI, a socially embedded conversational agent. Our findings show that trust is shaped by perceived ability, conversational behavior, human-likeness, transparency, privacy concerns, and trust in the host platform. Trust does not remain stable, but evolves through interaction as users adapt their expectations, refine their prompting strategies, and actively regulate how and when they rely on the system. These processes reflect a continuous negotiation of trust, not a one-time evaluation. While conversational fluency supports engagement, excessive anthropomorphism and limited transparency can undermine trust over time. We synthesize these findings into a conceptual model that frames trust as a dynamic user state shaped by interaction context and expectations, with implications for the design of human-centered and adaptive conversational agents.
Benchmarking LLM-Driven Network Configuration Repair
🔥 Citations:
0
Abstract: There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety. Existing benchmarks do not address whether LLMs can successfully resolve errors in large-scale, interdependent network configurations without introducing new disruptions. Developing such a benchmark is challenging: scenarios must be diverse and increasingly complex, yet their evaluation must be straightforward and meaningful. In this paper, we present Cornetto, the first benchmark to evaluate LLM-driven network configuration repair functionally and at scale. Cornetto features a generation pipeline that synthesizes representative and plausible misconfiguration scenarios, coupled with an evaluation framework that uses formal verification to assess functional correctness of proposed fixes against ground-truth specifications. Using this pipeline, we synthesize a dataset of 231 problems for fixing configurations across varying network topologies (20--754 nodes) and diverse protocols. We evaluate 9 state-of-the-art LLMs and find that while they show promise, they often introduce regressions and their performance degrades at scale. Our results indicate that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
🔥 Citations:
0
Abstract: Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high-level goal extraction, and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification (the final stage), these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with zero-shot prompting outperformed stand-alone few-shot prompting, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we found that combining the feedback mechanism with few-shot prompting does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with refining both the quantity and quality of the few-shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.
Long-tail Internet photo reconstruction
🔥 Citations:
0
Abstract: Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.
GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Traceable and anomaly-aware QoS for single-stack IPv6 in power grids via HMM-GNN fusion
DOI:
10.1117/12.3111375
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
🔥 Citations:
0
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost, complete framework for boosting fine-grained multi-image understanding in MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
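A rule-based spatial reward of the kind described would plausibly combine source-image attribution with box overlap. The sketch below scores attribution plus IoU only; it is an assumption about the reward's shape, and the paper's rule also checks structured output validity, which is omitted here:

```python
def spatial_reward(pred_box, gold_box, pred_img, gold_img):
    """Illustrative rule-based spatial reward: zero unless the source
    image is correctly attributed, otherwise the IoU of the predicted
    and gold (x1, y1, x2, y2) boxes."""
    if pred_img != gold_img:
        return 0.0                                  # wrong source image
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union else 0.0
```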
From rows to yields: how foundation models for tabular data simplify crop yield prediction
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Considerations about the proliferation of large language model chatbots and youth mental health
🔥 Citations:
0
Abstract:
Young people are experiencing worsening mental health and a growing reliance on online tools and services to address mental health difficulties. At the same time, next-generation large language models (LLMs) deployed through ‘chatbot-style interfaces’, whose deep-learning artificial intelligence makes interaction feel akin to conversing with a human, appear to mark an opportunity for mental health therapeutics when designed specifically for clinical intervention. However, emergent evidence suggests the use of more generic LLM chatbots may pose a risk of misinformation, bias, or over-reliance for some individuals when used outside of clinical contexts for mental health. This perspective paper examines the intersection of youth mental health and the rapid adoption of LLM chatbots. It first contextualises rising mental health challenges among young people alongside their increasing reliance on digital solutions. The paper then explores the potential benefits of LLM chatbot-style interfaces in clinical mental health interventions. Following this, we discuss the evidence surrounding adverse mental health outcomes from the use of generic LLMs to support mental health at population level, describing complex system-level and human-level factors noted from the evidence. Finally, we outline considerations for public health and youth mental health discourse, purpose-built LLM platform design, and a supporting research agenda. While current evidence on benefits and risks from generic LLMs is emergent and not youth-specific, this perspective highlights a need for research focused on young people to ensure safe and effective use of widely available LLMs for mental health support.
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
🔥 Citations:
0
Abstract: User comments on online programming platforms such as Stack Overflow play a vital role in maintaining the correctness and relevance of shared code examples. However, the majority of comments express gratitude or clarification, while only a small fraction highlight actionable issues that drive meaningful edits. This paper demonstrates how agentic AI principles can revolutionize software maintenance tasks by presenting RAG-Reflect, a modular framework that achieves fine-tuned-level performance for valid comment-edit prediction without task-specific training. Valid Comment-Edit Prediction (VCP) is the task of determining whether a user comment directly triggered a subsequent code edit. The framework integrates large language models (LLMs) with retrieval-augmented reasoning and self-reflection mechanisms. RAG-Reflect operates through a three-stage runtime workflow built on a one-time pattern analysis phase. During initialization, an Interpretation module analyzes the knowledge base to generate validation rules. At inference time, the system (1) retrieves contextual examples, (2) reasons about comment-edit causality, and (3) reflects on decisions using the pre-established rules. We evaluate RAG-Reflect on the publicly available SOUP benchmark, achieving Precision = 0.81, Recall = 0.74, and F1 = 0.78, outperforming traditional baselines (e.g., Logistic Regression, XGBoost, different prompting techniques) and closely approaching the performance of fine-tuned models (F1 = 0.773) without retraining. Our ablation and stage-level analyses show that both retrieval and reflection modules substantially enhance performance.
Network Traffic Anomaly Detection Using Hybrid Machine Learning and Generative AI: A SOC-Integrated Approach
🔥 Citations:
0
Abstract: Modern enterprise networks face an ever-expanding threat landscape characterized by zero-day exploits, polymorphic malware, distributed denial-of-service campaigns, and sophisticated insider threats that consistently evade signature-based detection systems. Traditional Intrusion Detection Systems (IDS), despite decades of refinement, suffer from high false-positive rates, inability to detect novel attacks, and an absence of contextual explanation that forces security analysts to manually interpret raw machine-generated data. This paper presents the design, implementation, and experimental evaluation of a Network Traffic Anomaly Detection Security Operations Center (SOC) Dashboard that addresses these limitations through the integration of hybrid unsupervised machine learning and generative artificial intelligence. The system deploys a weighted ensemble of Isolation Forest and Local Outlier Factor (LOF) to compute continuous anomaly scores across 60,000 synthetic network flow records, classifying detected anomalies into four severity tiers: LOW, MEDIUM, HIGH, and CRITICAL. A deterministic rule engine operates in parallel, applying domain-specific security rules to escalate alert severity for high-confidence attack signatures including port scanning, distributed denial-of-service patterns, and command-and-control port usage. The central innovation of this work is the integration of the LLaMA 3.1 8B Instant large language model via the Groq API to generate automated, human-readable, MITRE ATT&CK-mapped triage reports for each detected alert, eliminating the need for manual expert interpretation and substantially reducing mean time to respond. The Streamlit-based interactive dashboard presents results across three analytical modules: Incident Detail and Explainability, Network Graph Visualization, and AI-Powered Triage. Experimental results demonstrate successful detection of all injected attack scenarios, generation of 4,800 severity-classified alerts from 60,000 traffic events, and LLM response latency averaging approximately two seconds per query. This work demonstrates that combining unsupervised behavioral detection with generative AI explanation bridges the semantic gap between machine output and analyst understanding, enabling faster, more accessible, and more effective cybersecurity operations.
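The weighted-ensemble scoring and four-tier classification can be sketched as a pure function over the two detectors' normalized anomaly scores. The weight and thresholds below are illustrative assumptions; the paper does not disclose its values:

```python
def severity(score_iforest, score_lof, w=0.6):
    """Weighted ensemble of two anomaly scores in [0, 1] (Isolation
    Forest and LOF), mapped to the four severity tiers. The weight w
    and tier thresholds are illustrative, not the system's settings."""
    s = w * score_iforest + (1 - w) * score_lof
    if s >= 0.9:
        return "CRITICAL"
    if s >= 0.75:
        return "HIGH"
    if s >= 0.5:
        return "MEDIUM"
    return "LOW"
```

In the described system, a rule engine can then escalate the tier further when a high-confidence signature (e.g., a port scan) matches, independent of the ensemble score.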
ArguMath: AI-Simulated Environment for Pre-Service Teacher Training in Orchestrating Classroom Mathematics Argumentation
🔥 Citations:
0
Abstract: Facilitating productive mathematical argumentation, especially asking rational questions, is essential yet remains challenging for pre-service mathematics teachers (PMTs), who often have limited opportunities to apply abstract theoretical knowledge in authentic practice. At the same time, recent advances in large language models (LLMs) have expanded the potential for simulating students in educational settings, enabling low-risk environments for instructional practice. To inform the design of a system that supports PMTs in orchestrating classroom argumentation, we conducted a formative study with eight experienced mathematics teachers to identify key design requirements, including personalization, realistic simulations, structured reflection, and ease of use. Building on these requirements, we developed ArguMath, an AI-simulated classroom environment that supports PMTs in practicing the orchestration of mathematical argumentation. ArguMath comprises three core components: (1) customization of classroom settings; (2) simulation of classroom discussions with AI-based students grounded in authentic transcripts and augmented with real-time instructional suggestions; and (3) structured reflection through discourse annotation and overall feedback. Results from an exploratory user study with seven PMTs, complemented by interviews with four experienced teachers, indicate that ArguMath has the potential to support PMTs' classroom orchestration skills, particularly theory-aligned questioning strategies.
Large language model-enabled automated data extraction for concrete materials informatics
🔥 Citations:
0
Abstract: The promise of data-driven materials discovery remains constrained by the scarcity of large, high-quality, and accessible experimental datasets. Here, we introduce a generalizable large language model (LLM)-powered pipeline for automated extraction and structuring of materials data from unstructured scientific literature, using concrete materials as a representative and particularly challenging example. The pipeline exhibits robust performance across a broad range of LLMs and achieves an F1 score of up to 0.97 for diverse composition-process-property attributes. Within one hour, it extracts nearly 9,000 high-quality records with over 100 attributes screened from more than 27,000 publications, enabling the construction of the largest open laboratory database for blended cement concrete. Machine learning analyses underscore the importance of large, diverse, and information-rich datasets for enhancing both in-distribution accuracy and out-of-distribution generalization to unseen materials. The proposed pipeline is readily adaptable to other materials domains and accelerates the development of scalable data infrastructures for materials informatics.
C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs
🔥 Citations:
0
Abstract: Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and continuous non-linear reward aggregation to improve stability across competing properties. Experiments on the C-MuMOInstruct benchmark show that C-Moral consistently outperforms state-of-the-art models across both in-domain and out-of-domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post-training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.
scConcept enables concept-level exploration of single-cell transcriptomic data
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Uncertainty Quantification for LLM Function-Calling
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting, they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.
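The AST-based clustering idea can be sketched with Python's own `ast` module: canonicalize each sampled call (here by sorting keyword arguments, so argument order no longer splits clusters) and compute an entropy over the resulting clusters. This is an illustrative reduction of semantic entropy to exact-match clusters, not the paper's full method:

```python
import ast
import math
from collections import Counter

def canonical_call(src):
    """Canonicalize a Python-style function-call string: parse it,
    sort keyword arguments by name, and return the AST dump, so calls
    differing only in kwarg order map to the same cluster key."""
    call = ast.parse(src, mode="eval").body
    call.keywords.sort(key=lambda kw: kw.arg or "")
    return ast.dump(call)

def semantic_entropy(samples):
    """Entropy over clusters of canonicalized call samples; zero means
    all samples agree, i.e. the model is maximally confident."""
    counts = Counter(canonical_call(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

In this framing, two samples like `f(a=1, b=2)` and `f(b=2, a=1)` land in one cluster and contribute no uncertainty, while genuinely different calls raise the entropy.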
Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort
🔥 Citations:
0
Abstract: Clinical framing -- the linguistic manner in which clinical information is presented -- can influence patient understanding and decision-making, with important implications for healthcare outcomes. Obstetrics is a high-stakes domain in which physicians counsel patients on delivery mode choices such as vaginal birth after cesarean (VBAC) and repeat cesarean section (RCS), yet counseling language remains underexplored in large-scale clinical text analysis. In this work, we analyze physician counseling language in 2,024 obstetric history and physical narratives for a rigorously defined cohort of patients for whom both VBAC and RCS were clinically viable options. To control for confounding due to medical contraindications, we first construct a VBAC-eligible cohort using structured clinical data supplemented by a large language model (LLM)-based extraction pipeline constrained to grounded, verbatim evidence from free-text narratives. We then apply a zero-shot LLM framework to categorize counseling segments into predefined framing categories capturing how physicians linguistically present delivery options. Our analysis reveals a significant difference in counseling framing distributions between VBAC and RCS notes; risk-focused language accounts for a substantially larger share of counseling segments in RCS documentation than in VBAC, with category-level differences confirmed by statistical testing, highlighting the value of controlled LLM-based framing analysis in obstetric care.
Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
🔥 Citations:
1
Abstract: Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM grounder agents are employed, including a novel Jupyter Notebook agent for data-driven analysis, that help to validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency and scalability, and enable interactive "what-if" scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves accuracy by 15.84% on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder shows strong cost-effectiveness that achieves a close 70.11% accuracy with 90.35% less cost and 52.85% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with a near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.
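The variance half of the bias-variance argument (averaging many stochastic leaf scores with a linear synthesis step suppresses noise roughly as 1/n for independent noise) can be checked empirically. The score distribution below is a made-up stand-in for grounder outputs, not data from the paper:

```python
import random
import statistics

def mean_score_variance(n, trials=2000, seed=0):
    """Empirical variance of the mean of n noisy proposition scores
    drawn from a fixed distribution; for independent noise it should
    shrink roughly like 1/n as n grows."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.7, 0.2) for _ in range(n)) / n
             for _ in range(trials)]
    return statistics.pvariance(means)
```

Averaging 16 leaves instead of 1 should cut the variance of the synthesized truth value by roughly a factor of 16, which is the stability effect the abstract attributes to its linear synthesis models.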
GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution
🔥 Citations:
0
Abstract: Modern ASIC design is becoming increasingly complex, driving up design costs while limiting productivity gains from existing EDA tools. Despite decades of progress, current tools rely on fixed heuristics and offer limited control via tool hyperparameters, requiring extensive manual tuning to achieve an acceptable quality of results (QoR). While prior work has explored learning-based optimization and design-specific hyperparameter tuning, these approaches operate within the constraints of static tool algorithm implementations and do not adapt the underlying algorithms to individual designs. To address this limitation, we introduce the concept of design-adaptive EDA tooling, in which the internal algorithms of EDA tools are automatically specialized to the characteristics of a given design. We instantiate this paradigm through GR-Evolve, a code evolution framework that leverages an agentic large language model (LLM) to iteratively modify global routing source code using QoR-driven feedback. The framework equips the LLM with persistent contextual knowledge of open-source global routers along with an integrated toolchain for QoR evaluation within the OpenROAD infrastructure. We evaluate GR-Evolve across seven benchmark designs across three technology nodes and demonstrate up to 8.72% reduction in post-detailed-routing wirelength over existing baseline routers, highlighting the potential of LLM-driven EDA code evolution for design-adaptive global routing.
Comparing ChatGPT-4o and Gemini 1.5 Pro in Adolescent Psychiatric Emergencies: A Real-World Evaluation of AI Support in Suicide Risk Assessment
🔥 Citations:
0
Abstract:
This study aimed to evaluate the performance of large language models—ChatGPT-4o and Gemini 1.5 Pro—in assessing suicide risk and guiding treatment in adolescents presenting to the emergency department with suicidal ideation and/or attempts.
A retrospective review was conducted on child psychiatry consultation notes from 36 adolescents evaluated between February and March 2024. Structured clinical data were entered into ChatGPT and Gemini, and the resulting decisions were compared to those made by clinicians regarding hospitalization, sedation need, medication initiation, follow-up timing, and notification of social services or law enforcement.
ChatGPT showed higher concordance with clinicians than Gemini, especially in hospitalization (41.6% agreement) and sedation decisions (100% agreement). ChatGPT recommended hospitalization in 58.3% of cases, compared to 33.3% by clinicians and 36.1% by Gemini. For outpatient cases, ChatGPT demonstrated partial alignment with clinical decisions on medication and follow-up, while Gemini’s responses were often uncertain or incomplete.
Large language models show promise as decision-support tools in adolescent psychiatric emergencies. ChatGPT was more consistent with clinical judgments than Gemini. However, limitations remain, and further studies involving broader populations are needed before routine clinical integration.
Characterizing public comments via Regulations.gov in response to proposed cannabis rescheduling in the United States
DOI:
10.1111/add.70410
🔥 Citations:
0
Abstract:
The United States Drug Enforcement Administration's (DEA) proposed rescheduling of cannabis from Schedule I to Schedule III under the Controlled Substances Act marks a significant shift in federal policy. Understanding public sentiment toward this policy is critical for guiding the current cannabis rescheduling effort as well as future reforms. The objective of this study is to characterize public comments submitted to Regulations.gov regarding the DEA's cannabis rescheduling proposal and identify underlying justifications for support or opposition.
A mixed-methods analysis was conducted of online public comments submitted to Regulations.gov regarding the DEA's cannabis rescheduling proposal: 42 913 public comments submitted between 21 May and 22 July 2024.
Comments were analyzed for sentiment towards the proposed rescheduling (support, oppose or insufficient rescheduling) and thematic justifications using manual and automated natural language processing techniques. A two-stage annotation approach was employed: manual coding of 200 randomly sampled comments by multiple independent evaluators, followed by automated classification of all 42 913 comments using an open-source large language model (LLM) validated against the manual annotations.
Using LLM-based classification validated against human annotations [88% agreement, F1 (harmonic mean of precision and recall) = 0.86], we found that among 42 913 comments, 28.85% [95% confidence interval (CI) = 28.44%–29.24%] supported rescheduling, 6.74% (95% CI = 6.50%–6.99%) opposed and 63.50% (95% CI = 63.06%–63.99%) deemed the proposal insufficient, favoring further rescheduling or complete de-scheduling of cannabis. Among the 200 manually annotated comments, therapeutic benefits (56.7%, 95% CI = 46.7%–66.7%) and economic impacts (27.8%, 95% CI = 18.9%–37.8%) were the most common justifications among supporters. Public health risks (100.0%, 95% CI = 100.0%–100.0%), addictiveness concerns (71.4%, 95% CI = 42.9%–100.0%) and concerns about underage use (57.1%, 95% CI = 14.3%–85.7%) were predominant in opposing comments. Insufficient rescheduling comments cited therapeutic benefits (37.8%, 95% CI = 28.5%–48.0%), economic impacts (28.6%, 95% CI = 19.4%–37.8%) and criminal justice reform (26.5%, 95% CI = 18.4%–35.7%) as primary justifications.
Public sentiment on Regulations.gov supports the United States Drug Enforcement Administration's proposal for cannabis rescheduling, though the majority views the proposed Schedule III classification as inadequate and supports further rescheduling or complete de-scheduling of cannabis.
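For context, a 95% confidence interval for a proportion of this kind is most simply obtained with the normal-approximation (Wald) formula; the paper does not state which interval method it used, so the following is an illustrative sketch of the arithmetic only:

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# e.g. roughly 28.85% support among 42 913 comments
lo, hi = proportion_ci(round(0.2885 * 42913), 42913)
```

With samples this large the Wald interval is very close to more exact methods, which is why the reported intervals are only a few tenths of a percentage point wide.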
Artificial intelligence-assisted diagnosis of ocular caruncle oncocytoma: a proof-of-concept case report of two cases
🔥 Citations:
0
Abstract:
Oncocytomas of the ocular caruncle are rare benign epithelial tumors. Their clinical diagnosis is challenging, as they can mimic other benign or malignant lesions such as papilloma, nevus, squamous cell carcinoma, melanoma, or oncocytic carcinoma. For this reason, histopathological confirmation remains indispensable. The aim of this study was to test the ability of a multimodal large language model (ChatGPT, GPT-5, 2025 version) to generate diagnostic hypotheses directly from slit-lamp images, supported by brief clinical summaries.
We retrospectively analyzed two cases of caruncular oncocytoma that had undergone surgical excision with subsequent histopathological confirmation. For each case, ChatGPT was provided only with slit-lamp photographs of the lesion and a concise clinical summary including age, sex, and the site of the lesion (caruncle). No histopathological data or additional clinical details were supplied. In both cases, ChatGPT proposed oncocytoma as the primary diagnostic hypothesis. The model also generated differential diagnoses including papilloma, nevus, as well as the possibility of a malignant lesion such as squamous cell carcinoma or melanoma.
This proof-of-concept demonstrates, for the first time to our knowledge, that a general-purpose multimodal AI system can correctly recognize a rare ocular surface tumor from slit-lamp images. While preliminary and limited by the very small sample size, these findings suggest that large language models may assist clinicians in considering rare adnexal tumors during differential diagnosis. Further research on larger datasets is required, and histopathology will remain the gold standard for definitive diagnosis.
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
🔥 Citations:
0
Abstract: Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.
MSAgent: An Evidence Grounded Agentic Framework for LLM-driven Scientific Exploration in Mass Spectrometry-based Metabolomics
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Dr.Sai: An agentic AI for real-world physics analysis at BESIII
🔥 Citations:
0
Abstract: High Energy Physics (HEP) experiments like BESIII produce petabyte-scale data. Extracting physics results requires complex workflows (simulation, reconstruction, statistical analysis, etc.) that traditionally take experts months or years. Current manual methods are labor-intensive, prone to bias, and limit large-scale systematic scans. As data grows, this paradigm slows discovery. Large Language Models (LLMs) offer a solution. Their natural language understanding and code generation capabilities allow them to interpret scientific tasks and integrate with HEP tools (e.g., ROOT, BOSS) to act as an "AI partner" for autonomous analysis. We present Dr.Sai, an LLM-powered multi-agent system that translates natural language into rigorous physics workflows. As validation, Dr.Sai performed large-scale re-measurements of ten J/psi decay branching fractions - without manual coding. It successfully navigated the real BESIII computing environment and produced results matching established benchmarks. The article details Dr.Sai's architecture, the validation results, and performance evaluation. This work provides a blueprint for autonomous discovery, with relevance to other data-intensive fields like astronomy and genomics.
An Autonomous Large Language Model‐Agent Framework for Transparent and Local Time Series Forecasting
🔥 Citations:
0
Abstract: The growing complexity of thermal power generation systems demands advanced forecasting solutions capable of integrating data analysis, model selection, and interpretability. This study proposes a modular large language model (LLM) agent framework for time series forecasting, designed to operate locally and interactively through natural language instructions. The framework incorporates a domain‐specific time series agent that was developed to automate data preprocessing, anomaly detection, and forecasting tasks using neural and statistical models. Experiments demonstrated the agent's capacity to autonomously conduct end‐to‐end analyses, achieving accurate forecasts with minimal user intervention. The PatchTST model, automatically selected by the agent, yielded the lowest mean squared error among evaluated methods. Results highlight the potential of LLM‐based agents to enhance transparency, usability, and reproducibility in energy forecasting pipelines.
A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
🔥 Citations:
0
Abstract: Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation model paradigm. However, for structured medical data, characterized by a limited vocabulary and sparse observations, whether increasing model size consistently improves downstream predictions is unclear, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M-101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M-101M), whereas medication prediction saturated at 11M, reducing pretraining time by 178 h. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine baseline in the area under the precision-recall curve. These findings indicate that, unlike the monotonically decreasing pretraining loss, the optimal model size varied depending on task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.
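The area under the precision-recall curve used for the comparison above is commonly computed as average precision; a minimal self-contained sketch of that metric (not the study's own code):

```python
def average_precision(labels, scores):
    """AP = mean of precision@k over the ranks k where a positive occurs,
    with items sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total, ap = 0, sum(labels), 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:                 # a relevant item at this rank
            hits += 1
            ap += hits / rank         # precision at this rank
    return ap / total

# toy example: positives ranked 1st and 3rd -> AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike ROC-AUC, this metric is sensitive to class imbalance, which is why it is the natural choice for rare-disease and medication-incidence prediction.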
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
🔥 Citations:
0
Abstract: Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
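As a toy illustration of how cheap a pre-retrieval predictor can be (the abstract does not name the specific predictors evaluated), average inverse document frequency over query terms can rank competing variants without running any retrieval at all:

```python
import math
from collections import Counter

# toy corpus standing in for a real collection
docs = [
    "neural retrieval models for question answering",
    "sparse retrieval with bm25 ranking",
    "dense passage retrieval for open domain qa",
]

# document frequency of each term (0.5 smoothing for unseen terms)
df = Counter(t for d in docs for t in set(d.split()))
N = len(docs)

def avg_idf(query: str) -> float:
    """Mean IDF of the query terms: higher suggests a more discriminative query."""
    terms = query.split()
    return sum(math.log(N / df.get(t, 0.5)) for t in terms) / len(terms)

variants = ["retrieval models", "dense passage retrieval", "open domain qa"]
best = max(variants, key=avg_idf)   # pick the most discriminative variant
```

Predictors of this form run in microseconds per variant, which is what makes selective execution attractive compared with scoring every reformulation post-retrieval.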
LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling
🔥 Citations:
0
Abstract: Large language models (LLMs) and agentic systems have recently demonstrated potential for automating scientific workflows, including atomistic simulations. However, their deployment in high-performance computing (HPC) environments remains limited by the lack of mechanisms ensuring correctness, reproducibility, and safe interaction with computational resources. Generated workflows suffer from inconsistencies, incorrect API usage, or invalid physical configurations - leading to failed or unreliable simulations. In this work, we introduce LARA-HPC, a validation-driven agentic framework to enable reliable workflow generation for atomistic modeling on HPC systems. Our approach is based on three key components: (i) a controlled execution layer that mediates all interactions with HPC resources; (ii) simulation-native validation through dry-run capabilities, enabling execution-level verification without incurring resource cost; and (iii) a multi-phase agentic pipeline combining retrieval-augmented generation and iterative refinement. We demonstrate the effectiveness of this approach on an end-to-end atomistic simulation workflow on HPC by applying LARA-HPC to Density Functional Theory simulations. The results show that validation-driven generation significantly improves robustness and enables iterative correction of both syntactic and physical inconsistencies. More broadly, this work advocates for a shift from generation-first to validation-first paradigms in Artificial Intelligence (AI)-assisted scientific computing. We argue that the future task of the computational physics community is to develop domain-specific agentic systems based on structured tooling to realize an HPC-enabled co-piloted research ecosystem.
A Robust Deep Temporal Causal Discovery Platform for Single‐Cell Gene Regulatory Network Reconstruction
🔥 Citations:
0
Abstract:
Accurately inferring gene regulatory networks (GRNs) from single-cell RNA sequencing (scRNA-seq) data is critical for understanding cellular dynamics in both normal development and disease. However, existing computational methods often suffer from low precision and high false-positive rates due to the intrinsic noise and complex regulatory architecture in scRNA-seq data. We introduce scTIGER2.0, a deep-learning-based framework that integrates expression correlation, pseudotime ordering, temporal causal discovery, and bootstrap-based significance testing to infer high-confidence, directional gene-gene interactions. Benchmarking against five popular GRN inference methods using large-scale datasets, scTIGER2.0 consistently achieved superior specificity, especially in linear developmental trajectories. In real applications, scTIGER2.0 identified an APOE-centered GRN from Alzheimer's disease scRNA-seq data and uncovered interconnected GRNs for FOS, FOXP1, JUN, KLF6, NCOA4, and RUNX1 from acute myeloid leukemia data, where 87.5% of the predicted targets show promoter-binding peaks in the corresponding ChIP-seq data. These results demonstrate that scTIGER2.0 is a robust, accurate and fully integrated platform for uncovering biologically meaningful GRNs from noisy scRNA-seq data.
Comparative Analysis of Readability in Glaucoma Information Generated by AI: ChatGPT-5 vs. Gemini 2.5 Pro
🔥 Citations:
0
Abstract: Background: Glaucoma is a leading cause of irreversible blindness worldwide, requiring effective patient education and communication for early detection and management. Recently, large language models (LLMs) such as ChatGPT-5 and Gemini 2.5 Pro have emerged as potential tools for providing medical information. However, the readability of AI-generated responses remains an important concern, particularly for patients with varying levels of health literacy.
Objective: This study aimed to evaluate and compare the readability of glaucoma-related responses generated by ChatGPT-5 and Gemini 2.5 Pro.
Methods: A total of 30 glaucoma-related questions, compiled and validated by three specialists, were presented to both AI models. The generated responses were analyzed using multiple readability indices, including Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index, Automated Readability Index (ARI), Coleman-Liau Index, SMOG Index, Linsear Write Formula, Dale-Chall Readability Score, and Spache Readability Formula. Statistical analysis was performed using SPSS, with significance set at p < 0.05.
Results: Both models produced responses within a similar readability range, generally corresponding to middle- to high-school reading levels. Gemini 2.5 Pro consistently generated slightly more readable text across most indices. A statistically significant difference was observed only in the Flesch-Kincaid Grade Level (p < 0.001), with Gemini producing lower grade-level text compared to ChatGPT-5. Other readability metrics showed no significant differences between the models.
Conclusion: Both ChatGPT-5 and Gemini 2.5 Pro are capable of generating understandable glaucoma-related information; however, Gemini demonstrates a modest advantage in readability. Despite this, the overall reading level may still be too high for individuals with limited health literacy. Further refinement of AI-generated medical content is necessary to improve accessibility and ensure effective patient communication.
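For reference, the Flesch Reading Ease and Flesch-Kincaid Grade Level indices cited above are closed-form functions of word, sentence, and syllable counts. The sketch below takes the counts as inputs, since syllable detection is the part that varies across tools:

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    # Higher scores mean easier text; 60-70 roughly corresponds to plain English.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    # Output is a US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

The reported finding that Gemini produced lower grade-level text therefore means shorter sentences and/or fewer syllables per word, the only two quantities FKGL measures.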
Spiritual-Emotion Guided Fine-Tuning of Mistral-7B for Mental Health Conversational Systems using Bhagavad Gita Knowledge
DOI:
10.55041/isjem06759
🔥 Citations:
0
Abstract: Artificial intelligence-based conversational agents have recently gained attention as scalable solutions for mental health assistance. Large Language Models (LLMs) demonstrate strong language understanding and generation capabilities; however, their responses in counseling contexts often lack the emotional grounding, empathy, and philosophical reasoning required for sensitive mental health interactions.
This research proposes a Spiritual-Emotion Guided Mental Health Dialogue System developed using parameter-efficient fine-tuning of the Mistral-7B-Instruct large language model. The proposed system integrates philosophical teachings from the Bhagavad Gita to improve emotional stability, reflective reasoning, and empathetic response generation in mental health conversations.
Bhagavad Gita scriptures are extracted from PDF sources using the PDFPlumber framework and converted into structured question–answer instruction pairs for supervised fine-tuning. The base model is adapted using Low-Rank Adaptation (LoRA), enabling efficient domain-specific learning while preserving the general capabilities of the pretrained model.
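The LoRA adaptation described here freezes the base weights and trains only a low-rank correction. A numerical sketch of that update with numpy (rank, scaling, and dimensions are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16     # illustrative sizes, not from the paper

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = W x + (alpha / r) * B A x ; only A and B receive gradient updates,
    # so the adapter adds r*(d_in + d_out) parameters instead of d_in*d_out.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)   # equals W @ x before training, since B starts at zero
```

The zero initialization of B is the standard LoRA choice: the adapted model starts out exactly equal to the base model, which is what "preserving the general capabilities of the pretrained model" amounts to at step zero.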
The proposed system is evaluated using a mental health counseling dialogue dataset to assess its effectiveness in generating supportive and contextually meaningful responses. The expected outcomes indicate improved emotional coherence, contextual understanding, and response quality compared to the baseline language model.
The findings aim to demonstrate that integrating philosophical knowledge with modern language models can enhance the effectiveness and empathy of AI-driven mental health dialogue systems.
Index Terms—Mental Health AI, Mistral-7B, LoRA, Bhagavad Gita, Psych8K, Emotional Intelligence, Parameter-Efficient Fine-Tuning
Institutions for the Post-Scarcity of Judgment
🔥 Citations:
0
Abstract: Each major technological revolution inverts a particular scarcity and rebuilds institutions around the shift. The near-consensus diagnosis of the AI revolution holds that AI collapses the cost of prediction while judgment remains scarce. This Opinion argues the inversion has now flipped: competent-looking judgment (selecting, ranking, attributing, certifying) is produced at scale and at marginal cost approaching zero, and four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition). Because judgment is the substance of institutions, the institutions built to manufacture legitimate judgment (courts, journals, licensing bodies, legislatures) now compete with the technology for the same functional role. The piece traces the pattern across scientific institutions, professional licensing, intellectual property, democratic legitimacy, and foundation-model concentration, and closes with a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as commons, and develop the formal apparatus for institutional composition under strategic agents.
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
🔥 Citations:
0
Abstract: Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion
🔥 Citations:
0
Abstract: Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
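The Exact Match (EM) points reported above follow the standard QA convention of comparing normalized strings. This sketch shows one common normalization (lowercasing, punctuation and article removal), which may differ in detail from the paper's:

```python
import re
import string

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop English articles
    return " ".join(s.split())              # collapse whitespace

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))
```

EM is all-or-nothing, so a 40.6-point gain means the system produces the exact (normalized) gold answer in 40.6% more of the test cases, not merely closer answers.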
Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation
🔥 Citations:
0
Abstract: Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.
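The Hopfield-network labeling step can be pictured as attention-style retrieval over stored class embeddings, as in the modern continuous Hopfield formulation. A toy numpy sketch (all dimensions, prototypes, and the inverse temperature β are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def hopfield_label(query, memory, labels, beta=8.0):
    """One retrieval step: p = softmax(beta * M q); return the label of the
    most strongly retrieved stored pattern plus the full distribution."""
    p = softmax(beta * memory @ query)
    return labels[int(np.argmax(p))], p

# two stored class prototypes in a toy 3-d embedding space
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
labels = ["cup", "plate"]
label, p = hopfield_label(np.array([0.9, 0.1, 0.0]), memory, labels)
```

An ensemble over several embedding spaces (here CLIP, ViT, and Theia) would run this lookup per space and combine the resulting distributions, e.g. by averaging.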
A multimodal situational information extraction and management system based on large language models
DOI:
10.1117/12.3110800
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Comparative evaluation of large language models for guideline-compliant abstract generation and readability in dental research: an experimental comparative study
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Evaluation of large language models for nursing support in maternal venous thromboembolism care
🔥 Citations:
0
Abstract:
Venous thromboembolism (VTE) is a major cause of maternal morbidity and mortality, and nursing plays a central role in prevention, patient education, and follow-up. Large language models (LLMs) have attracted increasing attention in healthcare; however, their comparative performance in maternal VTE nursing contexts remains insufficiently explored.
Five representative LLMs—DeepSeek, GPT-4.1, Claude 3.7, Huatuo, and Kimi—were evaluated across six clinical domains (etiology, diagnosis, treatment, prognostic assessment, home care, prevention) and five performance dimensions (accuracy, comprehensibility, logical coherence, reliability, safety). An expert-informed Delphi framework comprising 41 items guided the evaluation. Three nursing experts independently rated each model’s responses, and inter-rater reliability was assessed using Fleiss’s Kappa.
GPT-4.1, Claude 3.7, and DeepSeek demonstrated superior overall performance, particularly in patient education, individualized care planning, and preventive guidance. Huatuo and Kimi showed limitations in treatment and prognostic reasoning. Inter-rater reliability was excellent (Kappa = 0.892).
The findings highlight relative strengths and limitations of different LLMs across nursing-relevant domains in maternal VTE care. While certain models performed better in educational and supportive contexts, the current study does not assess clinical adequacy or readiness for real-world nursing deployment. Future research incorporating patient perspectives and real-world validation is needed to inform the safe and appropriate integration of LLMs into nursing practice.
A framework for automating standard operating procedures in government services based on large language models
DOI:
10.1117/12.3110630
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
🔥 Citations:
0
Abstract: As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
Development and comparative evaluation of knowledge graph–enhanced large language models for domain-specific question answering in nursing
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Generalist large language models in a specialized world: Evidence from the Italian national medical education pathway
🔥 Citations:
0
Abstract: Creating language-specific and domain-specific large language models presents substantial challenges, including computational demands and limited data availability. While it is commonly believed that the benefits of specialized models justify these challenges, we dispute this notion with a comparative evaluation in a low-resourced language and medical-specific domain. In our study, we analyze the performance of various LLMs applied to the Italian healthcare domain using novel unpublished datasets, consisting of five-choice questions from national pre-university and post-university medical exams, covering clinical and preclinical fields. As part of this work, we release these datasets to the research community. We evaluated multilingual and Italian-specific models, along with general-purpose and healthcare-specific models, spanning both open-source and proprietary architectures of varying sizes. Our results demonstrate that multilingual, general-purpose large models consistently exceed the pass threshold across all tests, with the best models achieving over 90% accuracy on postgraduate-level exams. Model size emerged as the most critical factor influencing performance, whereas domain specialization and single-language localization offered no evidence of specialization superiority. These findings challenge the traditional pretrain-then-finetune paradigm for domain and language localization in language models, suggesting that advancements in generic-purpose multilingual models may render domain-specific pretraining unnecessary in many specialized cases.
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
🔥 Citations:
0
Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting
🔥 Citations:
0
Abstract: We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.
RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
🔥 Citations:
1
Abstract: Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose \textbf{RouteLMT} (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator's prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty-estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
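The budget-allocation view in the abstract above reduces to a simple rule: predict each request's marginal gain from the large model, then spend the budget on the highest-gain requests. A minimal sketch, assuming a plain top-k rule over precomputed gain predictions (the function name, inputs, and selection rule are illustrative, not RouteLMT's actual implementation):

```python
def route_by_marginal_gain(predicted_gains, budget_fraction):
    """Route the samples with the highest predicted large-model gain
    to the large model, up to a fixed fraction of total traffic."""
    n = len(predicted_gains)
    k = int(n * budget_fraction)  # how many requests the budget allows
    # Rank samples by the expected improvement of the large model
    ranked = sorted(range(n), key=lambda i: predicted_gains[i], reverse=True)
    to_large = set(ranked[:k])
    return ["large" if i in to_large else "small" for i in range(n)]

gains = [0.02, 0.31, 0.05, 0.44, 0.01]  # hypothetical per-sample gains
print(route_by_marginal_gain(gains, budget_fraction=0.4))
# -> ['small', 'large', 'small', 'large', 'small']
```

Under this framing, a router that predicts absolute quality rather than the small-to-large gain would waste budget on samples the small model already handles well.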
Feasibility and exploratory assessment of large language models for pediatric dentistry queries: a comparative study
🔥 Citations:
0
Abstract:
Large Language Models (LLMs) are increasingly used by caregivers to obtain pediatric health information. However, concerns persist regarding the accuracy, reliability, and readability of AI-generated content, especially in pediatric dentistry, where caregiver comprehension is crucial.
To conduct an exploratory feasibility assessment of the accuracy, quality, reliability, and readability of responses generated by ChatGPT-4, Google Gemini, and DeepSeek to common pediatric dentistry queries.
This exploratory comparative cross-sectional feasibility study used 15 patient-oriented pediatric dentistry questions identified through structured searches and expert screening. Each question was submitted verbatim to ChatGPT-4, Gemini, and DeepSeek under standardized conditions. Responses were independently evaluated by three calibrated pediatric dentistry experts using the Global Quality Scale (GQS), a modified DISCERN tool, and the Accuracy of Information Index (AOI). Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Inter-examiner reliability was assessed using intraclass correlation coefficients (ICC). Statistical comparisons between LLMs were performed using a fixed-effects model with post-hoc pairwise analysis. Inter-examiner agreement was further evaluated using Bland–Altman analysis. A p-value of <0.05 was considered statistically significant.
Overall scoring was consistent across examiners, with minor variability observed across domains. A linear mixed-effects model conducted separately for each domain demonstrated that LLM type significantly influenced GQS scores (F = 7.90, p = 0.00), with Gemini and DeepSeek outperforming ChatGPT. No significant differences were observed for AOI (p = 0.44) or DISCERN (p = 0.06). Bland–Altman analysis indicated minimal inter-examiner bias; however, the limits of agreement were relatively wide given the scale range, reflecting variability between individual ratings. Single-measure ICC demonstrated poor agreement (ICC = 0.26), while higher reliability was observed when scores were averaged (ICC = 0.90).
This study offers an exploratory feasibility assessment of LLM evaluation in pediatric dentistry. While the models generally produced high-quality outputs, variations in accuracy and readability, together with significant inter-examiner variability, highlight important methodological challenges. These findings represent preliminary groundwork and require validation in larger, clinically diverse, real-world settings. LLMs may serve as supportive informational tools; however, their outputs should be interpreted cautiously and used to complement, not replace, professional clinical judgment.
ECNU-ChemGPT: A Large Language Model for Chemistry and Retrosynthesis Predictions
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models
🔥 Citations:
0
Abstract: When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
Stability Amid AI Disruption: A Diachronic Linguistic Analysis of MA Thesis and PhD Dissertation Titles in Humanities, 2015–2025
🔥 Citations:
0
Abstract: The integration of large language models such as ChatGPT has raised concerns about stylistic homogenization in scholarly writing. While scientific literature shows clear LLM-driven shifts, e.g., increased lexical markers and reduced cohesion (Bao et al., 2025; Kousha & Thelwall, 2024), this study examines whether similar changes appear in humanities thesis and dissertation titles. Drawing on 8,631 unique MA and PhD titles from ProQuest in History, Religion, Literature, Philosophy, and Musicology, linguistic features were compared between 2015 (pre-AI) and 2025 (post-AI stabilization). Five dimensions were analyzed: word length, informativity, lexical diversity, syntactic structure, and semantic content. Results reveal remarkable stability across most metrics (title length ~12–13 words, informativity ~67%, lexical diversity near 100%). Only a modest increase in compound structures (70% to 74%) occurred, reflecting amplification of existing humanities conventions rather than disruption. The brevity of titles and extended human supervision appear to limit deep LLM intervention. These findings contrast with scientific fields and highlight the resilience of disciplinary norms in graduate scholarship.
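Several of the five dimensions analyzed above (word length, lexical diversity, compound structure) reduce to simple counts over a title string. A toy sketch, assuming lexical diversity is operationalized as a type-token ratio and a compound title is one split by a colon (the study's exact operationalizations are not given in the abstract):

```python
def title_metrics(title):
    """Toy stylometric features for a single thesis title."""
    words = title.lower().split()
    return {
        "word_count": len(words),                            # title length
        "type_token_ratio": len(set(words)) / len(words),    # lexical diversity
        "is_compound": ":" in title,  # two-part "Main: Subtitle" structure
    }

m = title_metrics("A Study of a Study")
print(m)  # 5 words, 3 distinct types -> ratio 0.6, no colon
```

Corpus-level comparison between the 2015 and 2025 cohorts would then aggregate these per-title values, which is why small shifts such as the 70% to 74% compound-structure change are detectable despite the brevity of titles.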
How Large Language Models Balance Internal Knowledge with User and Document Assertions
🔥 Citations:
0
Abstract: Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.
Certainty-Aware Skin Lesion Segmentation with Post-Hoc Reliability Estimation for the Segment Anything Model
DOI:
10.66279/hzkw5y24
🔥 Citations:
0
Abstract: The Segment Anything Model (SAM) represents a major advance in zero-shot visual segmentation, yet it provides purely deterministic outputs without any measure of prediction reliability, a critical limitation for safety-conscious medical imaging applications. This paper introduces a certainty-aware segmentation framework that augments SAM-based zero-shot inference with principled, post-hoc reliability estimation. Three complementary outputs are introduced: a pixel-wise certainty map that identifies spatially localized regions of ambiguity; a global confidence score that provides a scalar measure of overall segmentation trustworthiness; and a quality-flagging mechanism that enables automated screening of unreliable predictions. The framework requires no modification to SAM's architecture and no additional training data, thereby preserving its zero-shot generalization properties. Evaluation on the ISIC 2018 Task 1 skin lesion segmentation benchmark comprising 2,594 dermoscopic images in a fully zero-shot setting yields a mean Dice Similarity Coefficient of 0.820 ± 0.095 and a mean Intersection-over-Union of 0.750 ± 0.101. A strong positive correlation (Pearson r = 0.84, p < 0.001, n = 2,594) is observed between certainty scores and segmentation quality. High-quality segmentations (DSC > 0.80) are consistently associated with certainty scores above 80%, while low-quality predictions (DSC < 0.70) yield certainty scores below 50%. Stratified analysis confirms a mean DSC difference of over 0.25 between high- and low-certainty tiers (Wilcoxon p < 0.001, Cohen's d = 2.31). These results demonstrate that the proposed certainty metrics reliably track segmentation accuracy and provide a practical mechanism for risk-aware deployment of foundation models in clinical environments.
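The abstract does not specify how its three outputs are computed; an agreement-based estimator is one common post-hoc recipe: run the segmenter several times with perturbed prompts and score pixels by how unanimously the runs agree. A hedged numpy sketch under that assumption (the ensemble source, the 0.5 consensus cut, and the flag threshold are all illustrative, not the paper's method):

```python
import numpy as np

def certainty_outputs(masks, flag_threshold=0.5):
    """masks: (k, H, W) binary masks from k perturbed predictions.
    Returns a pixel-wise certainty map, a global confidence score,
    and a boolean quality flag for automated screening."""
    p = masks.mean(axis=0)                           # per-pixel foreground vote rate
    certainty = 1.0 - 2.0 * np.minimum(p, 1.0 - p)   # 1 = unanimous, 0 = split vote
    fg = p > 0.5                                     # consensus foreground region
    global_conf = float(certainty[fg].mean()) if fg.any() else 0.0
    return certainty, global_conf, global_conf >= flag_threshold

masks = np.array([[[1, 1], [0, 0]],
                  [[1, 0], [0, 0]],
                  [[1, 1], [0, 0]]], dtype=float)    # 3 runs on a 2x2 image
cmap, conf, ok = certainty_outputs(masks)
```

Here the top-right pixel gets a 2-of-3 split vote, pulling the global confidence below 1.0 while still clearing the quality flag.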
Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery
🔥 Citations:
0
Abstract: Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.
Personal AI Operating Systems: Architectural Foundations, Ethical Assurance, and Commercial Potential in AI-Native Computing
🔥 Citations:
0
Abstract: Modern software is becoming too complicated, and traditional approaches to managing computer systems do not work well for large, complex ones. This paper describes Personal AI Operating Systems, a major change in how we think about computer systems: Large Language Model agents make the system smarter and able to manage itself. We studied this idea using Design Science Research, comparing traditional management approaches with the new ones and conducting a detailed study of how Large Language Models can help manage computer resources and keep them secure. We found that Personal AI Operating Systems are good at planning and understanding what people want to do, and that they outperform systems that can only react to problems. We also examined the trade-offs between making the system fast and reliable and using machine learning to make decisions. Our results show that Personal AI Operating Systems can make computer systems more resilient and easier to use. Finally, we discuss how to make sure Personal AI Operating Systems are used responsibly, especially when it comes to people's personal data. We believe this is an important area, with a projected market value of $56.3 billion by 2034.
A Graph Neural Network Framework for Structural Side-Channel Vulnerability Assessment in Cryptographic Circuits
DOI:
10.66279/0e5j0983
🔥 Citations:
0
Abstract: Traditional side-channel analysis treats power and electromagnetic traces as temporal sequences, applying statistical or sequence-based machine learning methods without regard for the circuit topology responsible for generating observed leakage. This discards structural information intrinsic to digital circuits: gate connectivity, signal propagation topology, and the hierarchical organization of cryptographic modules. A framework is presented that applies Graph Neural Networks (GNNs) to side-channel vulnerability assessment by modeling circuits as attributed graphs in which nodes represent logic gates, edges represent wire connections, and power measurements are encoded as node features. A complete pipeline is developed spanning Verilog netlist parsing, graph construction, and Graph Convolutional Network (GCN) training with multi-head attention for multi-scale circuit analysis. Evaluation on ten AES-128 circuit implementations demonstrates an 86.4% attack success rate, compared with 68.1% for a CNN-LSTM baseline, with required power traces reduced from 988 to 790. Cross-architecture generalization reaches 63.5% accuracy on unseen circuit families, substantially above the 18.7% random baseline. Interpretable vulnerability heatmaps localize leakage sources at the gate level, enabling pre-silicon security assessment before fabrication.
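The core GCN operation described above, aggregating each gate's features from its wired neighbors, can be sketched without a graph library. A minimal single layer over an adjacency matrix (the mean-aggregation rule and the toy weights are illustrative; the paper's pipeline additionally uses multi-head attention, omitted here):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: mean-aggregate each node's
    neighborhood (with a self-loop), then apply a linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)   # neighborhood sizes
    h = (a_hat @ feats) / deg                # mean aggregation over neighbors
    return np.maximum(h @ weight, 0.0)       # ReLU nonlinearity

# Two gates connected by one wire, each with a scalar power feature
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
feats = np.array([[1.0], [3.0]])
out = gcn_layer(adj, feats, np.array([[1.0]]))
print(out)  # both gates average their joint neighborhood to 2.0
```

Stacking such layers lets power information propagate along the circuit topology, which is exactly the structural signal that trace-only CNN-LSTM baselines discard.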
MIMIC-IV-Phenotype-Atlas (MIPA) : A Publicly Available Dataset for EHR Phenotyping
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
🔥 Citations:
0
Abstract: Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and EvoCodeBench have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications. Based on the constructed benchmark, we conduct a systematic evaluation of advanced LLMs' code generation capabilities when provided with structured system designs. The experimental results reveal key insights into current LLMs' capabilities for repo-level code generation aligned with real-world software development practices. First, LLMs perform much worse on repo-level code generation, with significant performance gaps among models. Second, LLMs are good at finding and creating modules defined in UML diagrams, but the quality of generated modules is often poor due to grammar and logic errors. Third, generating the entire repository at once is the best strategy for smaller repositories, while a module-by-module strategy works better for complex repositories.
Large language models as a conduit for value shifts in contemporary China
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
An End-to-End Foundation Model-Based Framework for Robust LAI Retrieval Under Cloud Cover
DOI:
10.3390/rs18091308
🔥 Citations:
0
Abstract: Leaf Area Index is a crucial biophysical variable, and its accurate estimation is essential for understanding vegetation dynamics. However, cloud cover significantly restricts optical remote sensing, hindering the generation of spatially continuous Leaf Area Index products. Remote sensing foundation models offer novel solutions to this challenge. This study presents an end-to-end framework based on the fine-tuned Prithvi foundation model for direct LAI retrieval from cloud-contaminated 30 m Harmonized Landsat and Sentinel-2 imagery. By mapping inputs directly to Hi-GLASS reference labels, the proposed architecture processes cloud contamination and vegetation signals simultaneously and circumvents the error propagation inherent in cascaded retrieval pipelines. Results demonstrate that the end-to-end LAI retrieval model significantly outperforms cascaded variants, achieving a superior R2 (0.78) and lower RMSE (0.57). Furthermore, predictive accuracy exhibits a distinct U-shaped trajectory relative to the temporal mean cloud fraction, reaching an inflection point at 50–60% occlusion, which highlights the model’s implicit regularization capacity under severe atmospheric interference. This work establishes that direct feature learning with foundation models offers a more robust and streamlined pathway for generating continuous biophysical products from imperfect optical observations, prioritizing quantitative fidelity over artificial perceptual sharpness.
A three-dimensional multi-modal foundation model for optical coherence tomography
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Large Language Model Counterarguments in Older Adults: Cognitive Offloading or Vulnerability to Moral Persuasion?
🔥 Citations:
0
Abstract: This study examined whether counterarguments generated by large language models (LLMs) influence the moral judgments of younger and older adults and whether these effects vary as a function of dilemma type, cognitive functioning, trust in AI, and prior experience using LLMs. Using the switch and footbridge trolley dilemmas, 130 participants (56 younger adults and 74 older adults) were presented with ChatGPT arguments that opposed their initial judgments. Results revealed that more than 30% of participants reversed their moral judgments in both dilemmas (32.31% in the switch dilemma and 36.92% in the footbridge dilemma), suggesting that LLMs possess substantial persuasive power. Older adults tended to be more likely than younger adults to reverse their judgments, and they showed a significantly greater degree of judgment change in the switch dilemma. Notably, in the emotionally aversive footbridge dilemma, older adults with lower cognitive functioning were significantly more likely to align with the LLM-generated counterargument. General trust in AI and prior experience with LLMs did not predict judgment reversal, supporting a disconnect between trust and persuasion. Instead, individual factors such as lower initial confidence and higher perceived task difficulty were associated with greater susceptibility to AI influence. These findings suggest that, although LLMs may serve as tools for cognitive offloading that compensate for age-related cognitive decline, they may also pose a risk of undue persuasion for cognitively vulnerable individuals.
Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge
🔥 Citations:
0
Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs' knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM's output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.
CellPulse: A Foundation Model of Coordinated Gene Dynamics Simulating Viral Infectious Diseases
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Voice Under Revision: Large Language Models and the Normalization of Personal Narrative
🔥 Citations:
0
Abstract: This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs: pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instruction. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
Validating Digital Scribes: A Scoping Review of Evaluation Practices and Clinical Use
🔥 Citations:
0
Abstract: Clinical documentation accounts for a substantial share of clinicians’ working time, contributing to administrative burden and reduced patient-facing care. Artificial intelligence has stimulated the development of digital scribes that combine speech recognition (ASR) and large language models (LLMs) to generate clinical notes from patient–provider conversations, with the aim of automating and supporting this process and reducing this burden. This scoping review explores how digital scribes are currently validated, both technically and clinically, and whether they reliably support clinical workflows. Using the Technology Readiness Level (TRL) framework, we show that most systems remain in early development stages (typically TRL 3 and 4), with only a small number progressing to workflow integration. While digital scribes show potential to improve documentation efficiency, validation methods are highly heterogeneous, most studies rely on simulated or retrospective data, and real-world testing is limited. Consequently, cross-system comparisons and conclusions about clinical performance remain limited. We identified three motivational frames: human-, performance-, and system-oriented, which shape evaluation practices and outcome expectations. These findings suggest that successful implementation depends not only on scribes’ technical capability but also on alignment with clinical needs and documentation styles. Overall, our review underscores the need for standardised validation frameworks and prospective real-world studies to ensure that digital scribes progress beyond their current low TRL and move from experimental promise to safe, effective, and sustainable integration into clinical care. Supplementary Information The online version contains supplementary material available at 10.1007/s10916-026-02392-3.
Efficient Agent Evaluation via Diversity-Guided User Simulation
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature
🔥 Citations:
3
Abstract: Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner’s practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data points, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
BioMiner and BioVista provide a scalable extraction methodology and a rigorous benchmark, paving the way to unlock bioactivity data that previously required extensive human effort. All data and code are available at GitHub.
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
🔥 Citations:
0
Abstract: LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization. We construct variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java. Using these benchmarks, we evaluate the repair success rates of seven LLMs on both original and transformed versions, and analyze the relationship between performance degradation and NLL. Our results show that all evaluated state-of-the-art LLMs exhibit substantial drops in patch generation success rates on transformed benchmarks, ranging from -4.1% for GPT-4o to -15.98% for Llama-3.1. Furthermore, we find that this degradation strongly correlates with NLL on the original benchmarks, suggesting that models perform better on instances they are more likely to have memorized. These findings show that combining MT with NLL provides stronger and more reliable evidence of data leakage, while metamorphic testing alone can help mitigate its effects in LLM-based APR evaluations.
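As an illustration of the kind of semantics-preserving transformation such metamorphic benchmarks are built from, the sketch below uniformly renames function parameters with Python's `ast` module. This is a hypothetical minimal example for intuition only; the paper's actual transformation set targets Java programs (Defects4J, GitBug-Java) and is not detailed in the abstract.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Apply a consistent identifier-renaming map to a parsed program.
    Renaming preserves semantics, so a model that truly generalizes
    should repair the transformed program as well as the original."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        # Rename function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Rename every use of the mapped variables.
        node.id = self.mapping.get(node.id, node.id)
        return node

def transform(src, mapping):
    """Return source code with identifiers renamed per `mapping`."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(src))
    return ast.unparse(tree)
```

A benchmark variant is then just `transform(buggy_source, mapping)` paired with the original test suite; any drop in repair success on the variant points to memorization rather than understanding.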
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
🔥 Citations:
0
Abstract: Federated Large Language Models (FedLLMs) enable multiple parties to collaboratively fine-tune LLMs without sharing raw data, addressing challenges of limited resources and privacy concerns. Despite data localization, shared gradients can still expose sensitive information through membership inference attacks (MIAs). However, FedLLMs' unique properties, i.e., massive parameter scales, rapid convergence, and sparse, non-orthogonal gradients, render existing MIAs ineffective. To address this gap, we propose ProjRes, the first projection residuals-based passive MIA tailored for FedLLMs. ProjRes leverages hidden embedding vectors as sample representations and analyzes their projection residuals on the gradient subspace to uncover the intrinsic link between gradients and inputs. It requires no shadow models, auxiliary classifiers, or historical updates, ensuring efficiency and robustness. Experiments on four benchmarks and four LLMs show that ProjRes achieves near 100% accuracy, outperforming prior methods by up to 75.75%, and remains effective even under strong differential privacy defenses. Our findings reveal a previously overlooked privacy vulnerability in FedLLMs and call for a re-examination of their security assumptions. Our code and data are available at https://anonymous.4open.science/r/Passive-MIA-5268.
Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks
🔥 Citations:
0
Abstract: Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table. Evaluating a diverse set of tabular representation learning models, we show that which model to use depends on the task and representation level. Our results offer practical guidance for selecting tabular embeddings in real-world applications and lay the groundwork for developing more general-purpose tabular representation models.
Transfer Learning from Homogeneous to Heterogeneous: Fine-Tuning a Pretrained Interatomic Potential for Multicomponent Mo Alloys with Localized Substitutional Alloying
DOI:
10.3390/ma19091715
🔥 Citations:
0
Abstract: Machine learning interatomic potentials (MLIPs) are typically developed for globally ordered homogeneous systems (GOHomS), which exhibit only minor local deviations from equilibrium configurations. Consequently, most existing MLIPs trained on GOHomS often perform inadequately when applied to locally ordered heterogeneous systems (LOHetS), e.g., substitutional alloying elements in multicomponent alloys. To describe doping alloy systems, we develop a fine-tuned MLIP based on the MACE foundation model, specifically tailored for Mo-based dilute alloys containing one or two out of 20 substitutional elements: Cr, Fe, Mn, Nb, Re, Ta, Ti, V, W, Y, Zr, Al, Zn, Cu, Ag, Au, Hg, Co, Ni, and Hf. The model is built on more than 7000 equilibrium and non-equilibrium structures derived from first-principles density functional theory (DFT) calculations. The optimized large-scale fine-tuned model attains state-of-the-art accuracy, with a mean absolute error (MAE) and root-mean-square error (RMSE) of 2.27 meV/atom and 3.79 meV/atom for energy predictions, and 13.83 meV/Å and 24.26 meV/Å for force predictions, respectively. Systematic evaluation under different data-splitting protocols shows that unknown element extrapolation remains challenging under strict dopant hold-out, whereas substantially improved accuracy can be achieved in partial-exposure transfer settings. The fine-tuned models reduce the MAE by approximately 7–10 times compared to models trained from scratch, and by 10–20 times relative to zero-shot foundation models. This performance gain remains consistent across varying dataset sizes (equilibrium vs. non-equilibrium structures) and model scales. Our work illustrates the efficacy of transfer learning from globally ordered homogeneous systems to locally ordered heterogeneous multicomponent alloy environments. However, direct transfer to entirely unknown elements remains challenging, especially when proxy embeddings are employed without fine-tuning. 
Thus, to achieve high accuracy without incurring additional cost, it is essential to include unknown elements in the training dataset while minimizing the number of configurations containing known elements. Moreover, the current findings are primarily validated for dilute Mo-based alloy systems. Extending this approach to more compositionally complex alloy spaces may necessitate additional data and further fine-tuning.
Towards Understanding the Expressive Power of GNNs with Global Readout
🔥 Citations:
0
Abstract: We study the expressive power of message-passing aggregate-combine-readout graph neural networks (ACR-GNNs). Particularly, we focus on the first-order (FO) properties expressible by this formalism. While a tight logical characterisation remains a difficult open question, we make two contributions towards answering it. First, we show that sum aggregation and readout suffice for GNNs to capture FO properties that cannot be expressed in the logic C2 on both directed and undirected graphs. This strengthens known results by Hauke and Wałęga (2026), where aggregation and readout functions are specially crafted for the task. Second, we identify two natural ways of restoring characterisability (with regard to C2) for ACR-GNNs. One option is to limit local aggregation (without imposing restrictions on global readout), whilst the second is to run ACR-GNNs over graphs of bounded degree (but unbounded size). In both cases, the FO properties captured by GNNs are exactly those definable by a formula in graded modal logic with global counting modalities. Our results thus establish lower and upper bounds on how far (fragments of) C2 can be taken to characterise GNNs, and imply that it is indeed the unbounded interaction of aggregation and readout that pushes the logical expressive power of GNNs above C2.
SyntheMol-RL: a flexible reinforcement learning framework for designing easily synthesizable antibiotics
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
🔥 Citations:
0
Abstract: We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8–27 times less training time than the strongest baseline.
When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
🔥 Citations:
0
Abstract: Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.
Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation
🔥 Citations:
0
Abstract: Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction. Although methods based on differential privacy (DP) provide formal privacy guarantees, large language models (LLMs) have more recently also been increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.
Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
🔥 Citations:
0
Abstract: The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
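The two-phase gating idea described in the abstract (score tool summaries against the user intent, enforce access scopes, and promote full schemas only for the top-k survivors) can be sketched as follows. This is an illustrative reconstruction, not the released code: the `embed` function is a toy bag-of-words stand-in for the sentence-embedding ISO score, and all names and data are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence-embedding model: bag-of-words counts.
    A real deployment would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gate_tools(intent, tools, allowed_scopes, k=2):
    """Score each tool summary against the intent (ISO-style overlap),
    drop tools whose scope the session lacks (state-aware gate), and
    return only the top-k survivors for full-schema promotion."""
    q = embed(intent)
    scored = [(cosine(q, embed(t["summary"])), t)
              for t in tools if t["scope"] in allowed_scopes]
    scored.sort(key=lambda pair: -pair[0])
    return [t["name"] for score, t in scored[:k] if score > 0]
```

Only the gated tools' full JSON schemas would then be injected into context; every other tool stays as a one-line summary, which is where the claimed token savings come from.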
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
🔥 Citations:
0
Abstract: Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
ChatGPT as a Time Capsule: The Limits of Price Discovery
🔥 Citations:
0
Abstract: Frozen large language model (LLM) checkpoints extract information from pre-cutoff public text that is associated with future fundamentals and equity returns beyond standard contemporaneous valuation measures. Because each frozen checkpoint has a fixed knowledge cutoff, it can be interpreted as a compressed representation of publicly available textual information at a given point in time. We treat twelve OpenAI snapshots spanning 2021-2025 as time-stamped summaries of the public textual record and extract a sector-neutral LLM outlook score for roughly 7,000 U.S. equities per cross-section. The outlook score is positively associated with analyst revisions, target-price changes and one-month cross-sectional returns in both Fama-MacBeth regressions and pooled panels with model fixed effects (t = 6.02), after direct controls for market-implied valuation and standard factors. Predictability broadly increases with the return horizon, despite a non-monotonic intermediate dip, and, in the pooled panel, is stronger for firms with high analyst coverage, consistent with the view that the bottleneck is not investor inattention but the cost of aggregating dispersed qualitative information across many documents.
Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture
🔥 Citations:
0
Abstract: MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings->Rooms->Closets->Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
Large Language Models for Cardiac MRI Diagnosis Based on Standardized Text Descriptions
DOI:
10.1002/jmri.70327
🔥 Citations:
0
Abstract:
Background: MRI is important for cardiac disease evaluation, but accurate diagnosis remains challenging in less experienced centers. Although large language models (LLMs) have shown promise in medical imaging diagnosis, their application in cardiac MRI is limited.
Hypothesis: LLMs may be effective in achieving cardiac MRI diagnosis based on standardized descriptions.
Study Type: Retrospective.
Population: A total of 203 hypertrophic cardiomyopathy, 186 dilated cardiomyopathy, 46 hypertensive heart disease, 198 ischemic cardiomyopathy, 38 constrictive pericarditis, 45 cardiac amyloidosis, 91 myocarditis, and 144 normal controls.
Field Strength/Sequence: Balanced steady‐state free‐precession, short tau inversion recovery, and breath‐hold inversion‐recovery segmented gradient‐echo sequences at 3.0 T.
Assessment: Clinical and cardiac MRI information from each subject was converted into standardized descriptions and input into Generative Pre‐trained Transformer‐4.5 (GPT‐4.5), GPT‐4 Omni (GPT‐4o), Deepseek‐V3, and Deepseek‐R1 LLMs. Cardiac MRI information included LV function, wall thickness and motion, and abnormalities on T2WI, perfusion and late gadolinium enhancement sequences. Each model was asked to generate an imaging diagnosis. In addition, a medical student (8 months' experience) and three radiologists (junior, mid‐level and senior: with 3, 6, and 10 years' experience, respectively) provided diagnoses based on cardiac MRI images and clinical information.
Statistical Tests: Frequency‐weighted sensitivity and specificity were calculated. The diagnostic performances of the LLMs and human readers were compared using the McNemar test with Bonferroni correction. A p value < 0.05 was considered significant.
Results: All LLMs showed excellent frequency‐weighted specificity (0.973–0.983). The frequency‐weighted sensitivities of all LLMs were not significantly different from that of the junior radiologist, were significantly higher than that of the medical student, and significantly inferior to those of the senior radiologist (GPT‐4.5: 0.863, GPT‐4o: 0.821, Deepseek‐V3: 0.843, and Deepseek‐R1: 0.851 vs. junior radiologist: 0.850, all adjusted p = 1.000; vs. medical student: 0.731, all adjusted p < 0.001; vs. senior radiologist: 0.942, all adjusted p < 0.001). Additionally, the mid‐level radiologist achieved a frequency‐weighted sensitivity of 0.895, outperforming all LLMs except GPT‐4.5.
Data Conclusion: LLMs may generate accurate diagnoses from standardized cardiac MRI descriptions, potentially benefiting less experienced physicians.
Evidence Level: 4.
Technical Efficacy: Stage 5.
An autonomous LLM-agent platform for computational binder design and conjugation-aware prioritization of antibody–drug conjugates
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
🔥 Citations:
0
Abstract: Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.
An LLM‐Driven Approach for Power Grid Structure Synthesis and Visualization
DOI:
10.1049/enc2.70037
🔥 Citations:
0
Abstract:
Research in power system is persistently hampered by the scarcity of high‐fidelity, public grid data due to security and privacy constraints. Existing unimodal synthesis methods fail to harmonize physical laws, visual representations and semantic descriptions, obstructing the application of multimodal large language models (LLMs) in the energy sector. To address this, we propose a multimodal synthesis framework that generates aligned datasets comprising physical parameters, single‐line diagrams and natural language descriptions. The framework combines rule‐based topology generation with a two‐stage chain‐of‐thought (CoT) strategy, enabling LLM agents to initialize electrical parameters based on statistical priors. To ensure physical feasibility, an iterative power flow feedback loop is introduced to guarantee convergence. Furthermore, retrieval‐augmented generation is employed to enhance component‐level visual details. Experimental results indicate that the synthesized grids achieve high structural similarity and physical fidelity compared to real‐world benchmarks. We have open‐sourced this physically validated multimodal grid dataset to provide critical foundational support for developing physics‐informed “energy LLMs.”
SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification
🔥 Citations:
0
Abstract: Program verification is a formal technique to rigorously ensure the correctness and fault-freeness of software systems. However, constructing comprehensive interprocedural specifications for full verification obligations is time-consuming and labor-intensive, giving rise to automated specification generation approaches. Despite the significant advancements in these approaches brought by Large Language Models (LLMs), existing LLM-empowered approaches still suffer from significant limitations: they lack effective strategies for handling sizable input programs, and are typically equipped with no mechanisms to evaluate and guarantee the strength of the generated specifications. These limitations impair their ability to extract precise specifications from real-world complicated programs to support the verification of target properties, thereby hindering the applicability of existing approaches in verification tasks on real-world programs. To remedy this gap, we propose SpecSyn, a novel LLM-based specification generation method. SpecSyn first decomposes the input program into individual segments, which are handled respectively by the subsequent iterative specification generation process. Innovatively, we incorporate into the process a specification refinement mechanism based on semantically non-equivalent program mutations and variant discrimination, assessing and enhancing the semantic strength of the generated specifications. Extensive experiments show that SpecSyn maintains precision above 90% and recall above 75%, significantly outperforming existing LLM-based approaches. In further evaluations, SpecSyn successfully handles 1071 out of 1365 target properties for open-source programs, proving its applicability to real-world program verification tasks.
Prefix Parsing is Just Parsing
🔥 Citations:
0
Abstract: Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling, psycholinguistic analysis, and syntactically constrained generation from large language models. We introduce the prefix grammar transformation, an efficient reduction of prefix parsing to ordinary parsing. Given a grammar, our method constructs another grammar that generates exactly the prefixes of its original strings. Prefix parsing is then solved by applying any ordinary parsing algorithm on the transformed grammar without modification. The reduction is both elegant and practical: the transformed grammar is only a small factor larger than the input, and any optimized implementation can be used directly, eliminating the need for bespoke prefix-parsing algorithms. We also present a strategy, based on algorithmic differentiation, for computing the next-token weight vector, i.e., the prefix weights of all one-token extensions, enabling efficient prediction of the next token. Together, these contributions yield a simple, general, and efficient framework for prefix parsing.
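For unweighted CFGs, the construction the abstract describes can be sketched as follows: for each production A -> X1..Xn and each position k, the prefix nonterminal of A may derive X1..X(k-1) followed by the prefix nonterminal of Xk, with a terminal's prefix nonterminal deriving either the terminal or the empty string. This is an illustrative reconstruction from the abstract (assuming a grammar with no useless symbols), not the paper's exact weighted construction.

```python
from collections import deque

def prefix_grammar(rules, start):
    """Given CFG rules {A: [rhs_tuple, ...]} (symbols absent from `rules`
    are terminals), return (new_rules, new_start) where new_start derives
    exactly the prefixes, including "", of strings derivable from `start`.
    Assumes no useless symbols, so any partial yield can be completed."""
    terminals = {s for prods in rules.values() for rhs in prods
                 for s in rhs if s not in rules}
    pre = {nt: [tuple(r) for r in prods] for nt, prods in rules.items()}
    for t in terminals:
        pre["Pre!" + t] = [(t,), ()]          # prefix of a terminal: itself or empty
    for nt, prods in rules.items():
        pre["Pre!" + nt] = []
        for rhs in prods:
            if not rhs:
                pre["Pre!" + nt].append(())   # A -> eps: only prefix is eps
            for k in range(len(rhs)):
                # Full yields of X1..X(k-1), then a prefix of Xk's yield.
                pre["Pre!" + nt].append(tuple(rhs[:k]) + ("Pre!" + rhs[k],))
    return pre, "Pre!" + start

def generate(rules, start, max_len):
    """Enumerate terminal strings of length <= max_len by breadth-first
    leftmost expansion; forms are pruned once their terminal count exceeds
    max_len (terminals never disappear, so pruning is sound)."""
    out, seen, q = set(), {(start,)}, deque([(start,)])
    while q:
        form = q.popleft()
        nts = [i for i, s in enumerate(form) if s in rules]
        if not nts:
            out.add("".join(form))
            continue
        for rhs in rules[form[nts[0]]]:
            new = form[:nts[0]] + tuple(rhs) + form[nts[0] + 1:]
            if sum(1 for s in new if s not in rules) <= max_len and new not in seen:
                seen.add(new)
                q.append(new)
    return out
```

Running any ordinary parser or enumerator on the transformed grammar then answers prefix queries directly, which is the point of the reduction.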
Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity
🔥 Citations:
0
Abstract: Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero- or few-shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE-CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention-deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE-CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub-LoRA (Low-Rank Adaptation) as a fine-tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta-analyses. RE-CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE-Biomedical-Computing-Group/RE-CONFIRM.
Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
🔥 Citations:
0
Abstract: Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that distills textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
🔥 Citations:
0
Abstract: While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling and logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
🔥 引用:
0
Abstract: Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype, post-trained on a small set of synthetic and real data, that achieves overall F1 and accuracy comparable to frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.
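The model-expert oracle is a best-of-2 upper bound: a question counts as correct if either answer source got it right. A minimal sketch (function name and data layout are our own, not ARFBench's):

```python
def oracle_accuracy(model_preds, expert_preds, gold):
    """Best-of-2 oracle: a question is correct if either the model
    or the human expert answered it correctly. This upper-bounds
    what a perfect per-question selector over the two could achieve."""
    correct = sum(
        (m == g) or (e == g)
        for m, e, g in zip(model_preds, expert_preds, gold)
    )
    return correct / len(gold)
```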
When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
🔥 引用:
0
Abstract: Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
InstructPLM: Aligning protein language models to follow protein structural instructions
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Evaluating and Securing Power Systems against Vulnerabilities Introduced by Large Language Models
🔥 引用:
0
Abstract: Applying Large Language Models (LLMs) to contemporary power systems is an exciting new direction for improving operational efficiency and decision-making. Nevertheless, the move may carry unanticipated security risks. This paper examines the risks that LLMs may pose to power networks and stresses the urgency of research and remediation. Securing large language models in a power-monitoring context is a challenging but vital task. Through thorough security measures, a security-conscious culture, and continuous monitoring of new threats and technologies, we can maximize the benefits of LLMs while minimizing their hazards. As information security experts, it is our duty to pioneer this new field and ensure that our security protocols keep pace with the increasing sophistication of AI systems. Among the most critical LLM security flaws are those that enable prompt injection attacks. These attacks exploit LLMs' fundamental features by deliberately feeding them data that causes them to operate in unexpected ways or leak private information. LLMs have seen extensive use since the introduction of commercially available systems like ChatGPT, and power monitoring is one area of cybersecurity where their adoption is rising. Safeguarding these systems against cyber-attacks is critical, since they are essential to societal stability and the nation's energy supply. Keeping them resilient and reliable requires detecting unexpected vulnerabilities, especially zero-day attacks, and LLMs offer one potential way to improve these detection capabilities. Our approach integrates power-system standards and threat intelligence with traditional anomaly detection, LLM-assisted reasoning over code, configurations, and logs, and protocol-aware telemetry.
To uncover undisclosed power monitoring system weaknesses, we used LSTM and GRU models. These models excel at analyzing sequential data such as sensor readings, system logs, and network traffic, making them well suited to this job. By identifying unusual activity that deviates from typical system operation, LSTMs and GRUs can discover new, "zero-day" vulnerabilities, in contrast to conventional security technologies that depend on predetermined attack signatures. We also built an LLM component using TinyLlama Chat 1.1, which takes the processed packet data and the extracted context as input and outputs a user-friendly summary of the packet file. Through these machine learning models, the program provides a concise, well-organized, and straightforward overview of the network's operations.
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
🔥 引用:
0
Abstract: Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
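The stream-or-compute decision at the heart of such a scheduler can be sketched with a simple per-chunk cost model (purely illustrative; SparKV's actual scheduler also overlaps both paths and adapts the plan at runtime):

```python
def schedule_chunks(chunks, bandwidth, flops):
    """Greedy per-chunk decision: stream the cached KV bytes from the
    cloud or recompute the chunk's prefill locally, whichever is cheaper
    under the current link bandwidth (bytes/s) and device compute (FLOP/s).
    `chunks` is a list of (kv_bytes, prefill_flops) tuples.
    Illustrative cost model, not SparKV's actual scheduler."""
    plan = []
    for kv_bytes, prefill_flops in chunks:
        stream_cost = kv_bytes / bandwidth      # seconds on the wire
        compute_cost = prefill_flops / flops    # seconds on the device
        plan.append("stream" if stream_cost <= compute_cost else "compute")
    return plan
```

A runtime refinement step would simply re-run this decision with updated `bandwidth` and `flops` estimates as conditions fluctuate.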
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
🔥 引用:
0
Abstract: Vision-Language-Action systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose ReCAPA, a predictive alignment and planning framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.
Job Skill Extraction via LLM-Centric Multi-Module Framework
🔥 引用:
0
Abstract: Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
One Workflow Doesn't Fit All: Adaptive Workflows for Edge AI Development
🔥 引用:
0
Abstract: Edge AI integrates AI techniques with heterogeneous mobile devices to enable perception and actuation in real-world environments, facilitating applications such as smart sensing [1] and healthcare [2]. To reduce the burden of developing such applications, recent works [3, 4] build agentic systems based on Large Language Models (LLMs) to automatically translate user requirements into executable edge AI programs. However, these systems typically rely on handcrafted, predefined agentic workflows, and therefore often struggle to handle diverse mobile devices, heterogeneous runtime environments, and dynamic resource constraints in real-world scenarios. As a result, even well-engineered agents may struggle to accommodate such variability, suffering from inflexibility, limited adaptability, and high maintenance costs.
PriRS: an AI-driven framework for privacy and reliability in cyber–physical–social systems data sharing
🔥 引用:
0
Abstract: Cyber–physical–social systems (CPSS) impose stringent requirements for data sharing security and regulatory compliance. However, existing solutions fail to bridge the gap between rigid smart contracts and flexible social regulations. The core research question is: how can we enforce complex, human-readable regulatory policies within rigid blockchain transactions without creating scalability bottlenecks? To address this, we propose PriRS, an AI-driven privacy and reliability framework. First, we utilize a large language model (LLM)-based compliance oracle within a trusted execution environment (TEE). This agent intelligently analyzes regulations to ensure strict compliance before data authorization. Second, we introduce a “majority voting group data sharing” mechanism. By combining Shamir’s secret sharing with conditional proxy re-encryption, we move heavy coordination off-chain. This ensures fairness and significantly improves throughput. Experimental results on the Sepolia testnet demonstrate that PriRS reduces on-chain gas consumption by 92.3% compared to state-of-the-art schemes. The AI-driven oracle achieves 96.0% accuracy and 98.0% precision on policy violation detection, while maintaining 100% deterministic consistency across repeated runs in the TEE. Consequently, PriRS provides a highly efficient, secure, and legally compliant foundation for decentralized CPSS data markets.
Interactive Role-Playing Game System Using Dice-Based Mechanics and AI Narration
DOI:
10.55041/ijsrem61180
🔥 引用:
0
Abstract: Tabletop role-playing games such as Dungeons & Dragons depend on a human Game Master to adjudicate rules, resolve player decisions through probabilistic dice mechanics, and weave a coherent narrative across sessions. Automating this role using Large Language Models (LLMs) alone leads to severe reliability problems: unconstrained generative models hallucinate game state, permit physically impossible player actions, and fail to maintain consistent numerical variables over extended play. This paper presents a hybrid software architecture that cleanly separates deterministic game logic from constrained narrative generation. The backend employs a programmatic Python rules engine grounded in a comprehensive static rulebook covering 20 monsters, 18 weapons, 26 spells, 13 NPC archetypes, and 8 puzzle templates to resolve all combat, skill checks, and state mutations with absolute mathematical precision. Six specialized LLM micro-agents (a Validator, Parser, Story Engine, Director, Enemy AI, and Narrator) handle only semantic tasks, each constrained by typed Pydantic schemas and post-generation rulebook validation. A React-based frontend communicates with this backend via both REST endpoints and WebSocket channels, enabling both single-player and real-time multiplayer sessions with up to four concurrent players. Empirical benchmarks over 100 adversarial and valid test prompts demonstrate that the hybrid Validator intercepts 95.0% of rule-violating inputs while incorrectly rejecting only 2.5% of legitimate gameplay actions, compared to an unconstrained LLM baseline that permitted 88% of adversarial inputs. Deterministic rule execution achieves sub-millisecond latency (≈0.19 ms), representing a speedup exceeding 6,000× over equivalent LLM-mediated resolution.
Keywords: Large Language Models, Tabletop RPG Automation, Deterministic Game Engines, WebSocket Communication, Pydantic Schema Validation, Prompt Injection Defense.
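The typed-schema-plus-rulebook pattern described in this abstract can be sketched with the standard library (the paper uses Pydantic schemas; the entity names here are illustrative): the parser's output must fit a fixed typed structure, and a deterministic post-generation check rejects anything outside the static rulebook before any game state mutates.

```python
from dataclasses import dataclass

# Illustrative stand-in for entries from the static rulebook
RULEBOOK_WEAPONS = {"longsword", "dagger"}

@dataclass
class AttackAction:
    """Typed schema for one parsed player action; mirrors the role of
    a schema the LLM Parser's output must conform to."""
    actor: str
    weapon: str
    target: str

def validate(action: AttackAction) -> AttackAction:
    """Post-generation rulebook check: reject actions referencing
    entities outside the static rulebook, before any state mutation."""
    if action.weapon not in RULEBOOK_WEAPONS:
        raise ValueError(f"unknown weapon: {action.weapon}")
    return action
```

The deterministic check is what gives sub-millisecond adjudication: no model call is needed to reject an out-of-rulebook action.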
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
🔥 引用:
0
Abstract: Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.
Use of Artificial Intelligence and Machine Learning in Rapid Drug Discovery and Pharmacovigilance
🔥 引用:
0
Abstract: The increasing global burden of disease, rising research and development costs, and high attrition rates in pharmaceutical pipelines underscore the need for more efficient approaches to therapeutic development and drug safety monitoring. Artificial intelligence (AI) and machine learning (ML) have emerged as data-driven tools with the potential to improve multiple stages of the pharmaceutical lifecycle.
This narrative review is designed to provide a structured and critical overview of the use of AI and ML in drug discovery and drug safety surveillance. A comprehensive literature search was performed to identify relevant studies using major electronic databases, with emphasis on publications from 1997 to March 2025. The studies were selected based on the inclusion criteria.
The findings of the study show that AI and ML are being used in drug discovery, drug development, and drug safety surveillance. These technologies have the potential to provide predictive models, integrate heterogeneous biomedical data, and analyze real-world data to detect adverse drug reactions. Deep learning and natural language processing have been found to be useful tools to improve early risk detection.
However, some limitations have also been found. These include the quality of the data, bias in AI and ML models, lack of interpretability of AI and ML models, lack of external validation, and lack of real-world implementation. These limitations need to be addressed to make AI and ML more useful tools for drug discovery and drug safety surveillance.
Overall, while AI and ML offer meaningful opportunities to enhance drug discovery and pharmacovigilance, their impact remains dependent on rigorous validation, improved data governance, and alignment with clinical and regulatory frameworks. Continued research and context-specific implementation strategies will be essential to support their effective and equitable integration into pharmaceutical research and healthcare systems.
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
🔥 引用:
0
Abstract: Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context. Our extensive evaluation across state-of-the-art models, including those from OpenAI, Anthropic, Google Gemini, Meta, and prominent open-source alternatives, uncovers significant variations in resilience to TTI attacks, with only select architectures exhibiting substantial inherent robustness. Our automated black-box evaluation framework also uncovers previously unknown model-specific vulnerabilities and attack-surface patterns, especially within medical and other high-stakes domains. We further compare TTI against established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.
Short-Term Continuous Glucose Forecasting with Large Language Model-Derived Nutrient Estimates from Real-World Chinese Dietary Records
DOI:
10.34133/hds.0471
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair
🔥 引用:
0
Abstract: PGD adversarial training, the standard robustness method, can reduce Jacobian Frobenius norm yet worsen clean-input geometry (e.g., TDI 1.336 vs. ERM 1.093). We show this is not an implementation artifact but a theorem-level consequence of supervised learning. We prove that any encoder minimizing supervised loss must retain non-zero sensitivity along directions correlated with training labels, including directions that are nuisance at test time. This holds across proper scoring rules, architectures, and dataset sizes. We call this the geometric blind spot of supervised learning. This theorem unifies four empirical phenomena often treated separately: non-robust features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. It also explains why suppressing sensitivity in one adversarial direction can redistribute sensitivity elsewhere. We introduce Trajectory Deviation Index (TDI), a diagnostic of geometric isotropy. Unlike CKA, intrinsic dimension, or Jacobian Frobenius norm alone, TDI captures the failure mode above. In our experiments, PGD attains low Frobenius norm but high TDI, while PMH attains the lowest TDI with one additional training term and no architectural changes. Across seven tasks, BERT/SST-2, and ImageNet ViT-B/16 (backbone family underlying CLIP/DINO/SAM), the blind spot is measurable and repairable. It appears at foundation-model scale, worsens with model scale and task-specific fine-tuning, and is substantially reduced by PMH. PMH also leads on non-Gaussian corruption types (blur/brightness/contrast) without corruption-specific training.
AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
🔥 引用:
0
Abstract: Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by 17.0 points over the strongest baseline, and improves attack success by up to 16 points on frontier targets with low baseline success rates. Ablations against parametric and strategy-library baselines suggest that these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.
On Reasoning Behind Next Occupation Recommendation
🔥 引用:
0
Abstract: In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user from his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial, as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence, and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform both reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) next occupation prediction accuracy depends on the quality of the generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.
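The two-step pipeline reads as a simple composition: generate a reason from the history, then condition the predictor on history plus reason. The callables below are hypothetical stand-ins for the two fine-tuned LLMs, and the prompt templates are our own:

```python
def predict_next_occupation(history, reason_llm, predictor_llm):
    """Two-step pipeline: (1) derive a textual 'reason' summarizing the
    user's preference from their education/career history; (2) condition
    the occupation predictor on the history plus that reason.
    `reason_llm` and `predictor_llm` are hypothetical prompt -> text
    callables standing in for the paper's fine-tuned small LLMs."""
    reason = reason_llm(f"Summarize the career preference behind: {history}")
    return predictor_llm(
        f"History: {history}\nReason: {reason}\nNext occupation:"
    )
```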
DxDirector: an agentic large language model driving the full-process clinical diagnosis
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Mitigating Execution Hallucinations and Computational Inflation in Agentic RAG via Strict Protocol Boundaries
🔥 引用:
0
Abstract: The deployment of large language models as autonomous retrieval agents over unstructured knowledge bases gives rise to a persistent structural conflict between probabilistic neural generation and deterministic physical execution. While agentic paradigms facilitate complex multi-hop retrieval, their unconstrained generative nature frequently violates strict syntactic requirements. This systemic vulnerability directly triggers execution hallucinations, such as fabricated API parameters or malformed schemas. Consequently, these syntax-driven failures force systems into redundant trial-and-error recovery loops, resulting in severe computational inflation that degrades both token efficiency and inference latency. To resolve this reliability–efficiency dilemma, this paper proposes RAG-CoT-MCP, a neuro-symbolic architecture that orthogonally decouples probabilistic cognitive planning from deterministic tool execution. By integrating the Model Context Protocol (MCP) as a strict system-level validation boundary, the framework ensures that latent reasoning trajectories manifest exclusively as syntactically valid operations. Exhaustive empirical evaluations across four disparate datasets—incorporating a multi-dimensional LLM-as-a-Judge framework, rigorous ablation studies, and granular cost tracking—validate the proposed approach. The findings demonstrate that RAG-CoT-MCP compresses network-level execution error rates from 45.2% (in unconstrained baselines) to a mere 6.0%, yielding substantial enhancements in semantic comprehensiveness and logical coherence compared to existing baselines. Counterintuitively, by proactively intercepting malformed actions and redirecting computational resources from reactive error handling to valid causal deduction, the framework drastically reduces redundant token consumption and achieves the lowest overall inference latency. 
Ultimately, this study establishes that deterministic execution constraints do not hinder agentic flexibility; rather, they serve as a fundamental prerequisite for deploying robust, high-speed, and cost-effective knowledge retrieval systems.
Recovering Clinical Detail in AI-Generated Responses for Low Back Pain Through Prompt Design
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Object Detection in Optical Remote Sensing Images: A Systematic Review of Methods, Benchmarks, and Operational Applications
DOI:
10.3390/rs18091289
🔥 引用:
0
Abstract: Object detection in optical remote sensing imagery has emerged as a crucial task in computer vision, with applications ranging from environmental monitoring to disaster management, precision agriculture, and urban planning. This review systematically examines current methodologies, categorising them into four principal approaches: (1) template matching-based methods, which leverage predefined patterns for object identification; (2) knowledge-based methods, which incorporate geometric and contextual information to enhance detection accuracy; (3) object-based image analysis (OBIA), which segments images into meaningful objects using spectral and spatial properties; and (4) machine learning-based methods, particularly deep convolutional neural networks (CNNs), which have revolutionised the field through automatic feature learning. Each methodology's performance characteristics, computational requirements, and suitability for different remote sensing applications are analysed. Our systematic review, following PRISMA guidelines, analysed 189 studies published from 2010 to 2025, of which 73 provided quantitative results on standard benchmarks. The three most critical challenges identified are as follows: (1) the annotation bottleneck, as dense bounding-box labelling of remote sensing imagery remains highly labour-intensive for deep learning approaches; (2) extreme scale variation spanning 2–3 orders of magnitude within single scenes; and (3) domain adaptation failures when models encounter new geographic regions or sensor characteristics. This review identifies critical research gaps and proposes prioritised future directions, emphasising foundation models for zero-shot detection, efficient architectures for resource-constrained deployment, and standardised benchmarks with size-specific metrics.
The analysis provides practitioners with evidence-based decision frameworks for method selection and researchers with a roadmap for advancing object detection in remote sensing applications.
Autonomous multimodal agents enable transparent, spatiotemporal reconstruction of immune dynamics in pancreatic cancer progression
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Node-Sampling: adaptive multi-agent optimization in medical education
🔥 引用:
0
Abstract: Differences in prior knowledge among incoming medical students pose a persistent challenge for universities. To promote more individualized and equitable preparation, a large language model-based learning platform is being developed at the University Medical Center Hamburg-Eppendorf. A central component of this platform is the automated generation of multiple-choice questions (MCQs) from curated medical materials. However, ensuring their educational quality remains difficult, particularly when relying on smaller, locally deployed language models.
This study introduces Node-Sampling, a self-optimizing multi-agent approach for improving MCQ quality. The method identifies efficient refinement strategies by modeling agents as an adaptive sequence optimized through the REINFORCE algorithm.
Expert evaluations showed that Node-Sampling enhances the quality of question stems significantly compared to a fixed baseline. Importantly, Node-Sampling achieved this performance using an effective three-agent configuration, requiring only 33% of the original resources. Results for answer options were less consistent.
The results highlight the potential of adaptive multi-agent optimization to strengthen automated question refinement. Node-Sampling therefore presents a sustainable and promising approach to better MCQ quality and supports more effective and personalized preparation for medical students.
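The REINFORCE update underlying such agent-sequence optimization can be sketched for a categorical policy over which refinement agent to run next (a toy update on raw logits; the paper's actual parameterization and reward may differ):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update: raise the log-probability of the sampled
    agent in proportion to the observed quality reward. The gradient of
    log softmax w.r.t. the logits is one-hot(action) - probs."""
    probs = softmax(logits)
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
```

Repeated over sampled agent sequences, positive-reward updates concentrate probability mass on the refinement strategies that expert ratings favor, which is how a shorter three-agent configuration can be discovered.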
Dhatu-Former: Redesigning Transformer Architectures Through Pāṇini's Aṣṭādhyāyī
🔥 引用:
0
Abstract: Contemporary large language models (LLMs) rely on sub-word tokenizers and attention mechanisms that treat every language as a statistical surface-form distribution. This paper proposes Dhatu-Former, a transformer architecture that internalizes the formal linguistic machinery of Pāṇini's Aṣṭādhyāyī, the oldest known generative grammar. We hypothesize that (i) morphologically aware, root-based (dhātu-based) tokenization can reduce vocabulary size and sequence length by 40–60%, (ii) hierarchical attention guided by Pāṇinian derivation trees can yield sparse, interpretable attention with O(n log n) complexity, and (iii) a hybrid symbolic-neural reasoning layer that executes sūtra-style rewrite rules can substantially reduce hallucination while enabling unified language, math, and logic reasoning. We further introduce a modular Retrieval-Augmented Generation (RAG) subsystem grounded in Sanskrit lexical databases (Amarakośa, Dhātupāṭha) and a continual learning framework inspired by the paribhāṣā sūtras (meta-rules) of the Aṣṭādhyāyī. We present order-of-magnitude parameter reduction estimates, architectural blueprints with TikZ diagrams, and a research roadmap for empirical validation. This is a position paper; no experiments have been conducted.
Human like intuitive behavior and reasoning biases emerge in large language models but disappear in ChatGPT
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
How English Print Media Frames Human-Elephant Conflicts in India
🔥 引用:
0
Abstract: Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.
A Trust-Embedded Learning Architecture for Discovering Alternative Drug Indications with Verifiable Computational Integrity
Citations: 0
Abstract: Drug repurposing has emerged as an effective strategy in modern healthcare, enabling researchers to discover new therapeutic uses for existing drugs while significantly reducing development time and cost. Traditional drug discovery methods rely heavily on manual laboratory experiments, expert analysis, and prolonged clinical trials, making the process slow, expensive, and limited in scalability. These approaches struggle to handle complex and high-dimensional biomedical data, leading to delayed insights and reduced efficiency. With the rapid growth of healthcare data, there is an increasing need for intelligent and automated systems that can efficiently analyze drug characteristics and predict alternative therapeutic applications. Additionally, conventional systems often lack transparency and strong security mechanisms, making clinical data vulnerable to tampering and reducing trust in research outcomes. To address these challenges, the proposed framework integrates Machine Learning (ML), Deep Learning (DL), and Blockchain technologies to develop a secure and intelligent drug repurposing system. The framework employs Random Forest (RF) as a baseline model and a Two-Dimensional Convolutional Neural Network (CNN2D) as an advanced model to improve prediction accuracy. The CNN2D effectively captures complex feature patterns in structured drug data, enabling precise identification of potential new disease treatments. Furthermore, Web3-based Blockchain technology ensures secure storage of user data, clinical interactions, and experimental records by providing immutability, transparency, and data integrity. By combining Artificial Intelligence (AI)-driven analytics with Blockchain-based security, the system enhances prediction performance, automates decision-making, and ensures reliable data management, offering a scalable and efficient solution for accelerating drug discovery and supporting healthcare innovation.
Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
Citations: 0
Abstract: Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation poses significant risks in high-stakes domains such as healthcare, law, and scientific communication, where trust and verifiability are paramount. In this paper, we introduce DAVinCI - a Dual Attribution and Verification framework designed to enhance the factual reliability and interpretability of LLM outputs. DAVinCI operates in two stages: (i) it attributes generated claims to internal model components and external sources; (ii) it verifies each claim using entailment-based reasoning and confidence calibration. We evaluate DAVinCI across multiple datasets, including FEVER and CLIMATE-FEVER, and compare its performance against standard verification-only baselines. Our results show that DAVinCI significantly improves classification accuracy, attribution precision, recall, and F1-score by 5-20%. Through an extensive ablation study, we isolate the contributions of evidence span selection, recalibration thresholds, and retrieval quality. We also release a modular DAVinCI implementation that can be integrated into existing LLM pipelines. By bridging attribution and verification, DAVinCI offers a scalable path to auditable, trustworthy AI systems. This work contributes to the growing effort to make LLMs not only powerful but also accountable.
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
Citations: 0
Abstract: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Citations: 0
Abstract: Vision–language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
CARE: Counselor-Aligned Response Engine for Online Mental-Health Support
Citations: 0
Abstract: Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.
SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL
Citations: 0
Abstract: Text-to-SQL models have significantly improved with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating the performance of text-to-SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, while enabling more realistic evaluation through workload alignment with real-world SQL usage patterns and database scaling. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We envision that SQLyzr facilitates the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh-abedini/SQLyzr.
A Deep Reinforcement Learning and Evolutionary Optimization-Based Collaborative Control Network for Power Systems with Multi-Terminal Information Fusion
Citations: 0
Abstract: To improve the control performance of power systems in complex environments, this study proposes a cooperative control method integrating multi-terminal information acquisition, deep reinforcement learning (Deep Deterministic Policy Gradient (DDPG)), and evolutionary algorithms (Particle Swarm Optimization (PSO), Genetic Algorithm (GA)). The constructed control framework integrates multi-source monitoring data, realizes state information enhancement through feature-level fusion, and optimizes control decisions using an evolutionary reinforcement learning strategy. Experiments are carried out on a power system simulation platform built in Simulink. The data used are generated by the platform simulation, covering the dynamic operation of a 50-node system under typical disturbance conditions, referenced to the parameter configuration of the State Grid Manual, and verified against the experience of power system engineers. Typical operating conditions such as wind power integration, sudden load changes, and missing sensor data are simulated to evaluate the control effect and robustness of the model. The results show that the proposed method is superior to existing methods such as the Transformer-based Control Algorithm (TCA), Graph Neural Network (GNN), and Reinforcement Learning with Curiosity-driven Exploration (RLCE) in terms of voltage prediction accuracy (92.5%), F1 score (0.912), and system stability (97.5%). The training convergence speed is improved by
From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
Citations: 0
Abstract: Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
How do ChatGPT and other generative artificial intelligence models perform on foot and ankle questions from the Brazilian Orthopedics and Traumatology Association’s TEOT and TARO exams? The implications of large language models for medical education
Citations: 0
Abstract: Introduction: Generative artificial intelligence (AI) is increasingly used for study and rapid consultation. We assessed how leading large language models (LLMs) perform on Brazilian Orthopedics and Traumatology Association (SBOT) Foot and Ankle exam questions. Methods: Cross-sectional benchmarking of 107 foot and ankle questions from TEOT and TARO exams. Items were classified into the following categories: adult trauma, pediatric trauma, anatomy/imaging, physical examination, congenital/pediatric disorders, and adult disorders. Four generative AI models were queried with standardized prompts; responses were scored against the official key. Outcome: overall accuracy. Results: ChatGPT (GPT-5 Thinking) had the highest accuracy (86.91%), followed by Gemini (79.43%). Accuracy differed by domain, with lower performance in pediatric trauma and congenital disorders. No model achieved perfect agreement with the key. Conclusions: Popular generative AI models performed well on SBOT foot and ankle exam questions, with ChatGPT (GPT-5 Thinking) scoring highest. LLMs may be helpful adjuncts in residency education when used with supervision and critical appraisal.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
Citations: 0
Abstract: In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.
Extending MISP Taxonomies for Drug-Related Forum Classification on the Dark Web: A Human-in-the-Loop and LLM-Based Approach
DOI:
10.3390/fi18050228
Citations: 0
Abstract: This study proposes a methodological framework for extending Malware Information Sharing Platform (MISP) taxonomies in the domain of Dark Web drug forums through the integration of large language models (LLMs) and Human-in-the-Loop (HITL) validation. The research addresses the existing ontological gap between traditional MISP taxonomies, focused on technical or chemical indicators, and the linguistic and morphological complexity of illicit digital markets. By modelling the primary physical form as an ontological predicate with mutually exclusive values (for example, powder, pill–tablet–capsule, liquid, and plant-matter), the proposed approach captures the material dimension of the discourse, enhancing semantic disambiguation and forensic traceability. The Mistral 7B model was used in the morphology-classification stage conducted on a stratified analytical subset of 2904 drug-related Dark Web posts, extracted from a final corpus of 6456 posts after data cleaning and relevance filtering. In the first pass, 76.48% of posts were directly assigned to one of the base morphological categories, while 23.52% were labelled as unclear and subsequently reviewed through the HITL stage. Following HITL refinement and full reclassification, the proportion of posts labelled as unclear decreased from 23.52% to 11.29%, corresponding to a 51.99% relative reduction in ambiguity. Network visualisation with VOSviewer revealed three major discursive axes—recreational–commercial, pharmaceutical–opioid, and transnational–logistical—reflecting the hybrid semantic structure of digital drug markets. The results show that combining LLM-based inference with expert oversight improves the interpretability, reproducibility and ontological robustness of cyberintelligence models, offering a replicable framework for other sensitive domains such as terrorism or child exploitation.
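The reported ambiguity drop is easy to verify as a relative reduction, using only the figures stated above (23.52% unclear after the first LLM pass, 11.29% after HITL refinement and full reclassification):

```python
def relative_reduction(before, after):
    """Relative reduction in the share of 'unclear' labels after HITL review."""
    return (before - after) / before

# Figures from the abstract: 23.52% unclear -> 11.29% unclear.
r = relative_reduction(0.2352, 0.1129)  # ~0.52, matching the stated 51.99%
```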
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
Citations: 0
Abstract: Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
Graph Neural Network-Informed Predictive Flows for Faster Ford-Fulkerson and PAC-Learnability
Citations: 0
Abstract: We propose a learning-augmented framework for accelerating max-flow computation and image segmentation by integrating Graph Neural Networks (GNNs) with the Ford-Fulkerson algorithm. Rather than predicting initial flows, our method learns edge importance probabilities to guide augmenting path selection. We introduce a Message Passing GNN (MPGNN) that jointly learns node and edge embeddings through coupled updates, capturing both global structure and local flow dynamics such as residual capacity and bottlenecks. Given an input image, we propose a method to construct a grid-based flow network with source and sink nodes, extract features, and perform a single GNN inference to assign edge probabilities reflecting their likelihood of belonging to high-capacity cuts. These probabilities are stored in a priority queue and used to guide a modified Ford-Fulkerson procedure, prioritizing augmenting paths via an Edmonds-Karp-style search with bottleneck-aware tie-breaking. This avoids repeated inference over residual graphs while leveraging learned structure throughout optimization. We further introduce a bidirectional path construction strategy centered on high-probability edges and provide a theoretical framework relating prediction quality to efficiency via a weighted permutation distance metric. Our method preserves max-flow/min-cut optimality while reducing the number of augmentations in practice. We also outline a hybrid extension combining flow warm-starting with edge-priority prediction, establishing a foundation for learning-guided combinatorial optimization in image segmentation.
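The edge-priority idea can be illustrated with a small sketch: Ford-Fulkerson in which the augmenting-path search is ordered by predicted edge importance instead of plain BFS, with the path score taken as the minimum edge priority along the path. The `gnn_scores` dict below stands in for the MPGNN's output, and the graph, scores, and function name are illustrative, not taken from the paper.

```python
import heapq
from collections import defaultdict

def priority_max_flow(edges, priority, s, t):
    """Ford-Fulkerson variant: augmenting paths are found by a best-first
    search that expands high-priority residual edges first.
    edges: list of (u, v, capacity); priority: {(u, v): score in [0, 1]}."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edges:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # reverse arc for the residual graph
    flow = 0
    while True:
        # Path score = min edge priority along the path, stored negated so
        # the min-heap pops the most promising frontier node first.
        parent, heap = {s: None}, [(-float("inf"), s)]
        while heap:
            neg_score, u = heapq.heappop(heap)
            if u == t:
                break
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    p = priority.get((u, v), 0.0)
                    heapq.heappush(heap, (max(neg_score, -p), v))
        if t not in parent:
            return flow  # no augmenting path left: flow is maximal
        # Walk back from t and augment by the bottleneck residual capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= b
            cap[(v, u)] += b
        flow += b

edges = [("s", "a", 3), ("s", "b", 2), ("a", "b", 1), ("a", "t", 2), ("b", "t", 3)]
gnn_scores = {("s", "a"): 0.9, ("a", "t"): 0.8}  # stand-in for learned probabilities
```

Because any augmenting path preserves optimality, the learned priorities can only change how fast the maximum flow is reached, never its value, which is the property the abstract emphasizes.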
AIGuardXplore: Intelligent Model Vulnerability Inspector
DOI:
10.55041/isjem06722
Citations: 0
Abstract: As enterprises continue to adopt Large Language Models (LLMs) in their infrastructure, the lack of transparency in automated security guardrails creates a significant weakness. This is compounded by existing solutions that focus primarily on automated blocking while providing little granular visibility to security operators who must audit incidents or modify their defensive policies. To address this gap, the AIGuardXplore framework was designed as a high-fidelity visual observability solution enabling real-time monitoring of AI security events. Through a reactive, full-stack architecture built with Vite, React, and Drizzle ORM, AIGuardXplore aggregates information into one central location, allowing consolidated access to evaluate prompt injections, data leaks, and policy violations. AIGuardXplore introduces a UI paradigm designed with high-contrast colours and reduced cognitive load so that Security Operations Centre (SOC) personnel can recognize incident data quickly as it occurs and easily differentiate automated guardrail logic from actual incidents.
Keywords: AI Guardrails, LLM Security, Real-time Observability, Prompt Injection Defense, Visual Analytics, Full-stack Telemetry.
Comparative evaluation of four large language models (ChatGPT-5, Gemini 2.5 Pro, Claude Sonnet-4, and DeepSeek-R1) for diagnostic and therapeutic decision-making in real-world otolaryngology scenarios
Citations: 0
Abstract: Not available; see the original article.
LLM-Enhanced Modeling of Social Desirability-Aware Forced-Choice Personality Assessment
Citations: 0
Abstract: Personality assessment serves as a key building block in intelligent information systems that enable human-centered modeling. Unlike cognitive tests, personality assessments rely primarily on self-reports and are therefore susceptible to faking. Forced-choice (FC) formats partially mitigate this problem, yet socially desirable responding remains a systematic source of bias. Traditional approaches rely on expert-annotated social desirability (SD) ratings to construct FC item blocks and infer respondents’ personality traits from block-level rankings. This rating procedure is labor-intensive and coarse-grained. Furthermore, existing methods neglect the non-linear SD interactions between respondents and items, which act as structured adversarial noise that hinders the recovery of true latent traits. To address these challenges, we propose the Social Desirability-aware Forced-Choice Diagnosis (SDFCD) approach. Our approach adopts a knowledge-guided learning paradigm by leveraging large language models (LLMs) to distill fine-grained, continuous SD ratings, thereby replacing sparse expert ratings. We then introduce a decoupled neural interaction module that jointly represents latent personality traits and SD tendencies, enabling the modeling of respondent–item SD interactions. Experiments on real assessment data demonstrate that our method significantly outperforms baseline FC models in personality trait diagnostic performance and model interpretability. This study highlights the potential of LLMs for automated, fine-grained SD quantification and offers a scalable path toward more trustworthy personality assessment.
CoFEE: Reasoning Control for LLM-Based Feature Discovery
Citations: 0
Abstract: Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.
Automation Bias in Large Language Model–Assisted Diagnostic Reasoning among Physicians Trained in AI Literacy — A Randomized Clinical Trial
DOI:
10.1056/aioa2501001
Citations: 0
Abstract: Not available; see the original article.
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases
Citations: 0
Abstract: Not available; see the original article.
Scalable Multilingual Clinical Trial Text Classification using Transformer Embeddings with Real-Time Redis and Telegram Integration
Citations: 0
Abstract: For decades, clinical trial management has depended on systematic extraction of unstructured clinical narratives to support patient safety monitoring and eligibility assessment. Traditionally, this process required extensive manual effort from clinical experts who categorized protocol deviations and screened participants based on complex documentation. With the emergence of Natural Language Processing (NLP) in the early 2010s, statistical approaches such as TF-IDF and Word2Vec enabled the first wave of automation in structuring clinical text. However, oncology data remains highly complex, characterized by dense unstructured narratives, nested logical conditions (AND/OR/NOT), and specialized domain terminology. Conventional machine learning systems often fail to capture the high-dimensional semantic relationships necessary for robust classification, resulting in overlooked systemic signals in protocol deviations. More recent approaches include axis-parallel decision tree ensembles, such as Random Forests, and cloud-based Large Language Models (LLMs). While effective in certain settings, axis-parallel models are limited by their inability to model diagonal decision boundaries in embedded semantic spaces, reducing performance on tilted or non-linear clusters. Conversely, LLMs such as GPT-4 offer strong reasoning capabilities but introduce challenges related to patient data privacy, operational cost, and limited multilingual robustness without translation pipelines. To address these limitations, this work proposes a privacy-preserving, multilingual framework combining Language-Agnostic BERT Sentence Embeddings (LaBSE) with Ensemble Oblique Trees (EOT). By leveraging oblique hyperplanes, the model better partitions high-dimensional embedding spaces. The proposed LaBSE–EOT system enables lightweight, locally deployable, and interpretable classification, improving cross-lingual clinical trial oversight while reducing dependency on cloud infrastructure and enhancing global healthcare scalability.
Provably Secure Steganography Based on List Decoding
Citations: 0
Abstract: Steganography embeds secret messages in seemingly innocuous carriers for covert communication under surveillance. Current Provably Secure Steganography (PSS) schemes based on language models can guarantee computational indistinguishability between the covertext and stegotext. However, achieving high embedding capacity remains a challenge for existing PSS. The inefficient entropy utilization renders them not well-suited for Large Language Models (LLMs), whose inherent low-entropy tendencies severely constrain feasible embedding capacity. To address this, we propose a provably secure steganography scheme with a theoretically proved high capacity. Our scheme is based on the concept of list decoding: it maintains a set of candidates that contain the correct secret message, instead of directly finding the correct message with more effort. This strategy fully utilizes the information content of the generated text, yielding higher capacity. To ensure the correctness of our scheme, we further introduce a suffix-matching mechanism to distinguish the correct secret message from the candidates. We provide theoretical proofs for both the security and correctness of our scheme, alongside a derivation of its theoretical capacity lower bound. Our approach is plug-and-play, requiring only a direct replacement of the model's standard random sampling module. Experiments on three LLMs and seven PSS baselines demonstrate that our method achieves computational efficiency comparable to prior PSS schemes while delivering a substantial improvement in embedding capacity.
A structure-based virtual screening approach to identify novel anaplastic lymphoma kinase inhibitors
Citations: 0
Abstract: Not available; see the original article.
Mapping the Maritime Economy
DOI:
10.31217/p.40.2.5
Citations: 0
Abstract: The maritime economy, a cornerstone of global trade and economic development, is undergoing rapid transformation driven by technological advancements, sustainability challenges, and evolving regulatory frameworks. This study systematically organizes and analyses 827 scientific articles from the Web of Science, employing a hybrid methodology integrating Social Network Analysis (SNA), the Binary Space Partitioning (BSP) clustering algorithm, and large language models (LLMs) to identify key research themes and emerging trends. Through an iterative "human-in-the-loop" process, we refine the clustering process, resulting in six main clusters and 26 subclusters spanning topics such as governance and strategic management, forecasting and modelling, port operations, digital innovations, sustainability, and risk management. The results reveal a marked shift toward sustainability-oriented and technology-driven research, emphasizing the convergence of environmental responsibility, digital transformation, and operational resilience. The findings also highlight the growing role of autonomous technologies, big data analytics, and blockchain in reshaping maritime systems, while underscoring the need for adaptive governance frameworks and cross-sectoral collaboration. By offering a structured, data-driven overview of the maritime research ecosystem, this study contributes a novel methodological framework for bibliometric exploration and provides actionable insights for scholars, policymakers, and industry stakeholders seeking to advance innovation and sustainability in the maritime domain.
Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models
🔥 Citations:
0
Abstract: Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (κ = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.
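The attention aggregation at the heart of ABMIL admits a compact formulation: a_i = softmax_i(w^T tanh(V h_i)), z = Σ_i a_i h_i. Below is a minimal NumPy sketch of this standard mechanism, not the paper's implementation; the parameters `V` and `w` and all dimensions are illustrative:

```python
import numpy as np

def abmil_pool(H, V, w):
    """Attention-based MIL pooling (standard formulation):
    a_i = softmax_i(w^T tanh(V h_i)),  z = sum_i a_i h_i.
    H: (n_patches, d) patch features; V: (h, d) and w: (h,) are learnable."""
    scores = w @ np.tanh(V @ H.T)            # (n_patches,) attention logits
    scores = scores - scores.max()           # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    z = a @ H                                # (d,) slide-level embedding
    return z, a

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # 6 patches with 4-dim features (toy sizes)
V = rng.normal(size=(8, 4))
w = rng.normal(size=8)
z, a = abmil_pool(H, V, w)
```

In the paper's setting `H` would come from a frozen or fine-tuned foundation-model encoder, and a classifier head on `z` would be trained with only slide-level labels.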
Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction
🔥 Citations:
0
Abstract: Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. Because the prediction burden rests with the LLMs, a simple linear decoder suffices to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.
MambaCSP: Hybrid-Attention State Space Models for Hardware-Efficient Channel State Prediction
🔥 Citations:
0
Abstract: Recent works have demonstrated that attention-based transformer and large language model (LLM) architectures can achieve strong channel state prediction (CSP) performance by capturing long-range temporal dependencies across channel state information (CSI) sequences. However, these models suffer from quadratic scaling in sequence length, leading to substantial computational cost, memory consumption, and inference latency, which limits their applicability in real-time and resource-constrained wireless deployments. In this paper, we investigate whether selective state space models (SSMs) can serve as a hardware-efficient alternative for CSI prediction. We propose MambaCSP, a hybrid-attention SSM architecture that replaces LLM-based prediction backbones with a linear-time Mamba model. To overcome the local-only dependencies of pure SSMs, we introduce lightweight patch-mixer attention layers that periodically inject cross-token attentions, helping with long-context CSI prediction. Extensive MISO-OFDM simulations show that MambaCSP improves prediction accuracy over LLM-based approaches by 9-12%, while delivering up to 3.0x higher throughput, 2.6x lower VRAM usage, and 2.9x faster inference. Our results demonstrate that hybrid state space architectures provide a promising direction for scalable and hardware-efficient AI-native CSI prediction in future wireless networks.
An Adaptive Prompt Optimization Framework for Domain-Specific Large Language Models
🔥 Citations:
0
Abstract: Domain-specific deployment of large language models (LLMs) remains constrained by prompt brittleness, inference cost, and uneven generalization across task subtypes. This paper presents APOF (Adaptive Prompt Optimization Framework), a closed-loop framework that jointly optimizes prompt structure, retrieval context, and inference-time control signals for domain-specific LLM applications. APOF combines three elements: (i) a policy-guided prompt composer that dynamically allocates instruction budget across task facets, (ii) a critic model that estimates prompt-task fitness before expensive decoding, and (iii) an online adaptation module that updates prompt policies using delayed feedback from production outcomes. We instantiate APOF in three high-stakes domains (clinical note summarization, legal clause risk classification, and materials-science question answering) using a shared 13B-parameter base model and domain adapters. Our experiments include 162,000 annotated instances across public and institutionally curated corpora, 12 baseline methods, and controlled ablation studies. APOF improves macro-F1 by up to 6.8 points over the strongest static prompt baseline, while reducing median latency by 18.4% through pre-decoding prompt pruning and adaptive generation parameters. The framework also improves calibration (ECE reduction of 0.041) and demonstrates higher robustness under distribution shift (average relative performance drop of 12.3% vs. 21.7% for static methods). We provide mathematical formulations, complexity analysis, and practical deployment recommendations. Results suggest that adaptive prompting, when treated as a structured optimization problem rather than manual engineering, is a viable path to reliable domain-specific LLM systems.
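The expected calibration error (ECE) figure quoted in this abstract is conventionally computed with an equal-width binning estimator; the sketch below assumes that standard definition and is not specific to APOF:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the bin-weight-averaged gap |accuracy - mean confidence|
    over equal-width confidence bins on (0, 1]."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap        # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated model (e.g. 75% confidence and 75% accuracy) scores 0; a model that is confident at 0.95 but always wrong scores 0.95.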
Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
🔥 Citations:
0
Abstract: Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.
The Convergent Cyber Threat Horizon in Financial Services: Artificial Intelligence, Agentic Systems, Deepfakes, Quantum Computing, and the Mythos of Large Language Models
🔥 Citations:
0
Abstract: Financial services institutions are entering a period of simultaneous and interacting technology shocks whose combined impact on the cyber threat landscape is qualitatively greater than that of any single force. This paper examines five such forces — generative artificial intelligence (AI), agentic AI, synthetic media (deepfakes), cryptographically-relevant quantum computing, and the institutional mythos surrounding large language models (LLMs) — and analyses their combined implications for banks, insurers, asset managers, payment providers, and market infrastructure operators. Drawing on primary standards documents, sector incident reports, and the emerging academic literature, we argue that the traditional cybersecurity program, built on assumptions of scarce adversary labour, durable cryptography, trustworthy media, and human-only identity, is structurally misaligned with the 2026–2032 threat environment. We contribute a five-vector threat taxonomy for financial services, a unified seven-domain defence framework mapping to NIST Cybersecurity Framework 2.0 and the EU Digital Operational Resilience Act (DORA), and a prioritised 24-month preparedness roadmap. We highlight under-researched risks including non-human identity proliferation, indirect prompt injection against agentic pipelines, harvest-now-decrypt-later (HNDL) exposure of long-retention financial data, and the governance gap introduced by uneven institutional understanding of LLM capabilities. The paper closes with research and policy recommendations including the establishment of sector-wide cryptographic inventories, standardised AI red-teaming protocols, and content-provenance requirements for material financial communications.
MPRDR: A Multi-Path Relational Drug Repurposing Framework Grounded in Graph-Theoretic Principles
🔥 Citations:
0
Abstract: Current GNN-based drug repurposing algorithms rate drug-disease relationships using learned embedding proximity, which captures co-occurrence patterns but is neither mechanistically interpretable nor able to handle cold-start cases. We introduce MPRDR (Multi-Path Relational Drug Repurposing), a GNN architecture grounded in three graph-theoretic concepts: a typed path algebra [1] to support multi-hop mechanistic evidence, Turán-theorem-based cohesion of drug modules [2], and effective resistance for structural robustness [3,4]. In contrast to TxGNN [5] and GDRnet [6], MPRDR precomputes three interpretable topological scores on PrimeKG [7] and trains a GNN to combine these structural priors with learned representations, replacing uncalibrated sigmoid outputs with a graph-theoretic confidence score. MPRDR performs competitively with or better than prior methods when evaluated on well-known standard test sets such as those of TxGNN and GDRnet. Beyond existing approaches, MPRDR can make cold-start predictions, generate drug combination synergy scores, and provide mechanistic explanations.
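Of the three graph-theoretic scores named above, effective resistance has a particularly direct formulation: R_eff(u, v) = (e_u − e_v)^T L⁺ (e_u − e_v), where L is the graph Laplacian and L⁺ its Moore-Penrose pseudoinverse. A minimal sketch of this quantity on toy graphs (not the MPRDR code, which operates on the much larger PrimeKG graph):

```python
import numpy as np

def effective_resistance(adj, u, v):
    """Effective resistance between nodes u and v of an undirected graph,
    computed as (e_u - e_v)^T L^+ (e_u - e_v) with L the graph Laplacian."""
    adj = np.asarray(adj, float)
    L = np.diag(adj.sum(axis=1)) - adj     # Laplacian D - A
    L_pinv = np.linalg.pinv(L)             # Moore-Penrose pseudoinverse
    e = np.zeros(len(adj))
    e[u], e[v] = 1.0, -1.0
    return float(e @ L_pinv @ e)

# sanity checks against circuit theory: a 3-node path (two unit resistors
# in series) and a triangle (1 Ohm in parallel with 2 Ohm)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
tri = np.ones((3, 3)) - np.eye(3)
r_path = effective_resistance(path, 0, 2)   # expect 2
r_tri = effective_resistance(tri, 0, 1)     # expect 2/3
```

Low effective resistance between a drug and a disease node indicates many robust parallel paths, which is the structural-robustness intuition the abstract invokes.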
LLM-Steered Power Allocation for Parallel QPSK-AWGN Channels
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly being explored as high-level decision modules in closed-loop systems, but their stochastic nature makes safe integration challenging. In this paper, we propose LLM-Steered Power Allocation, a dual-process architecture for parallel QPSK channels inspired by Kahneman's System 1/System 2 framework. A fast numerical optimizer (System 1) continuously performs projected gradient ascent on a weighted mutual-information objective, while an LLM navigator (System 2) periodically interprets natural-language policies and updates only the channel weights and the operational power budget. The LLM never manipulates the power-allocation variables directly, and constraint satisfaction is enforced structurally by the optimizer. To mitigate LLM unreliability, we further incorporate multi-layer guardrails including normalization, exponential moving-average smoothing, and fallback mechanisms. Numerical experiments on an 8-channel system show that, with a fixed optimization core and unchanged system prompt, different natural-language policies induce qualitatively different operating points, including throughput-oriented allocation, channel prioritization, power-aware operation, and channel shutdown. In addition, under an abrupt channel-gain reversal, the proposed system autonomously reconfigures its steering signals and reduces the final mutual-information spread by 60% compared with the optimizer alone. These results suggest that LLMs can serve as policy interpreters for safe, flexible reconfiguration of communication-system optimizers without controller reimplementation.
Ideological Bias in LLMs' Economic Causal Reasoning
🔥 Citations:
0
Abstract: Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
Real-Time Credit Card Fraud Detection Using Deep Learning, Graph Neural Networks, And Auto Encoders
🔥 Citations:
0
Abstract: The rapid growth of digital banking and online financial transactions has significantly increased the risk of fraudulent activities, posing serious challenges to traditional fraud detection systems. Conventional machine learning–based approaches primarily rely on isolated transaction features and often fail to capture complex relationships among users, accounts, and transactions. To address these limitations, this work proposes an enhanced fraud detection framework for banking systems using deep learning techniques that combine Graph Neural Networks (GNNs) and Autoencoders. In the proposed approach, banking entities such as customers, accounts, devices, and transactions are modeled as nodes in a graph, while their interactions are represented as edges, enabling GNNs to effectively learn hidden relational patterns. Autoencoders are employed for unsupervised anomaly detection by identifying abnormal transaction behaviors through reconstruction errors. By integrating relational learning with anomaly detection, the framework efficiently handles large-scale and dynamic transaction data and improves the identification of sophisticated fraud patterns, including collusive and multi-hop fraud behaviors. Experimental analysis demonstrates that the proposed hybrid model achieves higher accuracy, improved recall, and reduced false positives compared to traditional machine learning and deep learning models. This approach provides a scalable and robust solution for real-time fraud detection in modern banking environments.
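The reconstruction-error principle behind the autoencoder component is easy to demonstrate in isolation. The sketch below substitutes a linear autoencoder (equivalently, PCA) and synthetic "transaction" features for the paper's deep model and banking data; all sizes and the threshold rule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Normal transactions lie near a 2-D subspace of a 6-D feature space;
# anomalies do not. A linear autoencoder (here: PCA) learns the subspace,
# and reconstruction error flags points that fall outside it.
W = rng.normal(size=(6, 2))
normal = rng.normal(size=(500, 2)) @ W.T + 0.05 * rng.normal(size=(500, 6))
anomalies = 3.0 * rng.normal(size=(5, 6))

mu = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mu, full_matrices=False)
code = Vt[:2]                          # "encoder" weights: top-2 components

def recon_error(X):
    Z = (X - mu) @ code.T              # encode
    Xhat = Z @ code + mu               # decode
    return np.linalg.norm(X - Xhat, axis=1)

threshold = np.percentile(recon_error(normal), 99)   # 1% false-positive budget
flags = recon_error(anomalies) > threshold
```

A trained deep autoencoder replaces the SVD step in practice, but the flagging logic (threshold on reconstruction error, calibrated on normal traffic) is the same.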
Large Language Models in Medical and Dental Education: A Cross-Sectional Comparison of AI-Generated and Faculty-Authored Prosthodontic Materials
DOI:
10.3390/dj14050249
🔥 Citations:
0
Abstract: Background/Objectives: This study aimed to compare AI-generated educational material with faculty-authored content in Dental Prostheses Technology, evaluating perceived clarity, accuracy, structure, usefulness, and overall instructional quality across different age and professional groups. Methods: An analytical cross-sectional study was conducted using two versions of the first three chapters of a prosthodontics textbook: the original faculty-authored text and a reformulated version generated by ChatGPT 5.2 (OpenAI). Images were removed and formatting standardized to ensure a text-only comparison. An anonymized online questionnaire based on a five-point Likert scale assessed clarity, accuracy, readability, usefulness and structure. To reduce potential bias, participants were unaware of the authorship of the evaluated materials (human-authored vs. AI-generated). A total of 130 participants independently reviewed both documents. Data were analyzed using Wilcoxon signed-rank, Mann–Whitney U, and Friedman tests. Results: Both materials received favorable evaluations across all dimensions. The AI-generated version demonstrated a statistically significant advantage in clarity (Z = −2.107, p = 0.035; r = 0.19), while no significant differences were observed for structure, accuracy, readability, or usefulness. Generational differences emerged: younger participants valued improved clarity but reported reduced usefulness, mid-career participants showed the greatest improvement in perceived accuracy, and senior professionals reported substantial gains in usefulness and readability. Conclusions: AI-generated educational material demonstrates pedagogical equivalence to faculty-authored content, with clarity representing its principal advantage. Large language models may serve as effective complementary tools in dental education, particularly for restructuring complex content.
Empirical Assessment of Time-Series Foundation Models For Power System Forecasting Applications
🔥 Citations:
0
Abstract: Accurate forecasting of electric load and renewable generation is essential for reliable and cost effective power system operations. Recent advances in transformer based and foundation machine learning models, driven by large scale pretraining, increased available data and computation, in addition to architectural innovations, have shown promise in time series forecasting across multiple domains. However, their application to power system forecasting tasks remains largely underexplored. This work presents a comprehensive, empirical benchmark of state of the art time series foundation models, transformer architectures, and deep learning baselines for solar, wind, and load forecasting using the high resolution ARPAE PERFORM dataset for the Electric Reliability Council of Texas (ERCOT) grid. Eight core capabilities are assessed, including zero shot performance, fine tuning efficiency, multivariate input and output handling, horizon sensitivity, generalization to unseen sites, probabilistic forecasting, and context window effects. Models evaluated include TimesFM, Chronos Bolt, MoiraiL, MOMENT, Tiny Time Mixer, Temporal Fusion Transformer, PatchTST, TimeXer, LSTM, and CNN. The manuscript aims to provide clear guidance on when foundation models can provide enhanced renewable and load forecasting capabilities and when other approaches remain the more practical choice for power system operations.
Decoupled Travel Planning with Behavior Forest
🔥 Citations:
0
Abstract: Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between locally acting constraints within a subtask and global constraints that span multiple subtasks. Consequently, the model is forced to jointly reason over local and global constraints at each decision step, increasing the reasoning burden and reducing planning efficiency. To address this problem, we propose the Behavior Forest method. Specifically, our approach structures the decision-making process into a forest of parallel behavior trees, where each behavior tree is responsible for a subtask. A global coordination mechanism is introduced to orchestrate the interactions among these trees, enabling modular and coherent travel planning. Within this framework, large language models are embedded as decision engines within behavior tree nodes, performing localized reasoning conditioned on task-specific constraints to generate candidate subplans and adapt decisions based on coordination feedback. The behavior trees, in turn, provide an explicit control structure that guides LLM generation. This design decouples complex tasks and constraints into manageable subspaces, enabling task-specific reasoning and reducing the cognitive load on the LLM. Experimental results show that our method outperforms state-of-the-art methods by 6.67% on the TravelPlanner benchmark and by 11.82% on the ChinaTravel benchmark, demonstrating its effectiveness in improving LLM performance for complex multi-constraint travel planning.
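The framework builds on standard behavior-tree tick semantics: leaf nodes perform actions (in the paper, this is where an LLM generates candidate subplans), composites such as Sequence orchestrate them, and a shared context can carry global constraints across trees in the forest. A minimal sketch with invented subtasks and a toy budget constraint standing in for global coordination:

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2

class Action:
    """Leaf node; in the paper's design an LLM-backed decision engine,
    here a plain function of the shared context."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self, ctx):
        return self.fn(ctx)

class Sequence:
    """Composite node: ticks children in order and fails fast."""
    def __init__(self, *children):
        self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS

def book(item, cost):
    def fn(ctx):
        if ctx["budget"] < cost:        # local check against a global constraint
            return Status.FAILURE
        ctx["budget"] -= cost
        ctx["plan"].append(item)
        return Status.SUCCESS
    return Action(fn)

# one behavior tree per subtask; a coordinator ticks them while the shared
# context enforces the global budget across the whole forest
hotel_tree = Sequence(book("hotel", 120))
transport_tree = Sequence(book("train", 60))
ctx = {"budget": 200, "plan": []}
results = [tree.tick(ctx) for tree in (hotel_tree, transport_tree)]
```

Decoupling shows up in the structure: each tree reasons only about its own subtask, while the shared context is the single channel for cross-subtask constraints.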
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
🔥 Citations:
0
Abstract: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution and stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
🔥 Citations:
0
Abstract: AI WiFi offload is emerging as a promising approach for providing large language model (LLM) services to resource-constrained wireless devices. However, unlike conventional edge computing, LLM inference over WiFi must jointly address heterogeneous model capabilities, wireless contention, uncertain task complexity, and semantic correlation among reasoning tasks. In this paper, we investigate LLM inference offloading in a multi-user multi-edge WiFi network, where each task can be executed locally, directly offloaded to a nearby edge access point (AP), or decomposed into multiple subtasks for collaborative execution across local and edge nodes. To this end, we propose a user-edge collaborative framework with an LLM-based planner that not only performs task decomposition but also infers subtask difficulty and expected output token length, enabling more accurate estimation of execution quality and latency on heterogeneous nodes. Based on these estimates, we further design a decomposition-aware scheduling strategy that jointly optimizes subtask assignment, execution, and aggregation under communication, queuing, and computation constraints. Simulation results show that the proposed framework achieves a better latency-accuracy tradeoff than local-only and nearest-edge baselines, reducing the average latency by 20% and improving the overall reward by 80%. Moreover, the distilled lightweight planner approaches the performance of the large teacher model while remaining more suitable for practical edge deployment.
How to Supercharge Your Research Workflow Using Large Language Models
🔥 Citations:
0
Abstract: No abstract available; see the original article.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
🔥 Citations:
0
Abstract: Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to selectively suppress target knowledge while preserving general capabilities. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
🔥 Citations:
0
Abstract: Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
CT-Based Deep Foundation Model for Predicting Immune Checkpoint Inhibitor-Induced Pneumonitis Risk in Lung Cancer
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
🔥 Citations:
0
Abstract: Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
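The Bradley-Terry model underlying this leaderboard posits P(i beats j) = π_i / (π_i + π_j) and fits the strengths π from pairwise outcomes; the standard MM (Zermelo) iteration is only a few lines. A generic sketch with toy win counts, not the authors' pipeline:

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=500):
    """MM/Zermelo fit of Bradley-Terry strengths from a win-count matrix:
    p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j),
    where wins[i, j] counts times system i was preferred over system j."""
    wins = np.asarray(wins, float)
    comps = wins + wins.T                    # total comparisons per pair
    p = np.ones(len(wins))
    for _ in range(n_iter):
        denom = (comps / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom
        p /= p.sum()                         # fix the arbitrary scale
    return p

# toy leaderboard over three TTS systems (counts are invented)
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
p = fit_bradley_terry(wins)
```

The iteration converges when the comparison graph is strongly connected (every system wins and loses at least once), which large crowdsourced collections like the one described here easily satisfy.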
Aiki-XP: leakage-controlled multimodal prediction of within-species relative protein expression at pan-bacterial scale
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Agentic AI-Enabled Framework for Thermal Comfort and Building Energy Assessment in Tropical Urban Neighborhoods
🔥 Citations:
0
Abstract: In response to the urban heat island effects and building energy demands in Singapore, this study proposes an agentic AI-enabled reasoning framework that integrates large language models (LLMs) with lightweight physics-based models. Through prompt customization, the LLMs interpret urban design tasks, extract relevant policies, and activate appropriate physics-based models for evaluation, forming a closed-loop reasoning-action process. These lightweight physics-based models leverage core thermal and airflow principles, streamlining conventional models to reduce computational time while predicting microclimate variables, such as building surface temperature, ground radiant heat, and airflow conditions, thereby enabling the estimation of thermal comfort indices, e.g., physiological equivalent temperature (PET), and building energy usage. This framework allows users to explore a variety of climate-resilient building surface strategies, e.g., green façades and cool paint applications, that improve thermal comfort while reducing wall heat gain and energy demand. By combining the autonomous reasoning capacity of LLMs with the rapid quantitative evaluation of lightweight physics-based models, the proposed system demonstrates potential for cross-disciplinary applications in sustainable urban design, indoor-outdoor environmental integration, and climate adaptation planning. The source code and data used in this study are available at: https://github.com/PgUpDn/urban-cooling-agent.
Identification of potential Phosphodiesterase 1 inhibitors from plant phytochemicals for the treatment of cognitive impairment using molecular modeling tools
🔥 Citations:
0
Abstract: Plants and their phytomolecules have been employed for centuries to treat various illnesses, including cognitive impairments. In modern drug discovery, phytomolecule libraries serve as an important source for identifying novel therapeutic agents, particularly for complex diseases, as de novo chemical synthesis is often time-consuming and expensive. In this study, a total of 44,085 Central Nervous System (CNS)-targeting phytomolecules were retrieved from the ChemDiv library and systematically screened for their potential as inhibitors of the phosphodiesterase-1 (PDE1) receptor using molecular modeling tools. PDE1 is a key enzyme predominantly expressed in brain tissues that hydrolyzes 3′,5′-cAMP (3′,5′-cyclic adenosine monophosphate) and 3′,5′-cGMP (3′,5′-cyclic guanosine monophosphate) into their inactive monophosphate counterparts, 5′-AMP and 5′-GMP. Cyclic nucleotides are essential for regulating neuronal signaling pathways. An initial molecular docking analysis identified 10 phytomolecules exhibiting strong binding affinities ranging from −11.97 to −12.54 kcal·mol⁻¹, and significant interactions with key amino acid residues within the active site of PDE1. Molecular dynamics simulations confirmed that four protein-ligand complexes (D361-0316-PDE1, L977-1068-PDE1, J081-0995-PDE1, and L977-1023-PDE1) remained structurally stable for up to 100 nanoseconds. Further, density functional theory (DFT) studies confirmed the electronic stability of these molecules. ADMET and drug-likeness evaluations predicted the pharmacokinetic and toxicity profiles. Overall, the findings suggest that these four phytomolecules may serve as promising PDE1 inhibitors for the management of cognitive impairment.
Transparent Large Language Model-Assisted Pump Scheduling: Performance Evaluation and Interpretability in Water Distribution Systems
Citations: 0
Abstract: No abstract available; see the original article.
Narrative and Challenge in Single-Player RPGs: A 1990–2025 Player-Centered Systematic Review
Citations: 0
Abstract: Single-player role-playing games (RPGs) combine two promises that do not always align: delivering a compelling narrative experience (world, characters, choices, and consequences) while sustaining a demanding ludic trajectory in which players face obstacles, master systems, and progress over time. This Systematic Literature Review (SLR) synthesizes existing evidence on the evolution of narrative and challenge in single-player RPGs from a player-centered perspective, with particular attention paid to immersion, engagement, flow, and perceived agency. A multi-database search strategy was conducted across Google Scholar, Scopus, IEEE Xplore, and the ACM Digital Library using query strings targeting narrative/agency, challenge and dynamic difficulty adjustment (DDA), adaptive difficulty, and the historical evolution of RPG narrative design, following a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-reported selection flow and Rayyan-supported screening. From 423 identified records, duplicates and non-eligible records were removed through staged screening, yielding 43 reports sought for retrieval; because six were not accessible in full text at consolidation, the synthesis was conducted on 37 full-text articles. The findings indicate (i) a predominance of work on narrative and agency, where agency is framed as a design effect rather than merely the presence of explicit branching choices; (ii) a recent rise in challenge/adaptation research, frequently tied to flow, fairness, and differentiated player profiles; and (iii) the emergence of artificial intelligence (AI)-driven approaches, including non-player character (NPC) systems, combat AI, reinforcement learning, and large language model (LLM)-based narrative control, which amplify core design trade-offs between narrative coherence and perceived agency. 
Beyond synthesizing a dispersed body of literature, the review contributes an integrated player-centered analytical framework that brings together narrative, challenge, and player experience, while also highlighting the need for more consistent measurement practices, stronger comparative designs, and longer-term empirical work in single-player RPG research.
Evaluating the Performance of Large Language Models in Addressing Preoperative Patient Questions: A Systematic Review and Analysis
Citations: 0
Abstract: Large Language Models (LLMs), including ChatGPT, Google Gemini and Microsoft Copilot, are increasingly explored in the perioperative setting. Integrated use of LLMs in healthcare has been shown to improve efficiency, accuracy and patient management. This systematic review assesses the performance of LLMs at answering patients' preoperative questions across a range of surgical specialities.
EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents
Citations: 0
Abstract: Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.
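The 5W framing implied by the dataset's name (who, what, when, where, why) can be represented as a simple record type. A minimal sketch, assuming a plain per-event annotation with one free-text slot per aspect — the field layout is an illustration of the 5W convention, not the paper's actual schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Event5W:
    """One open-domain event annotation in 5W form (illustrative schema)."""
    who: Optional[str] = None    # participant(s) involved in the event
    what: Optional[str] = None   # the action or occurrence itself
    when: Optional[str] = None   # temporal expression
    where: Optional[str] = None  # location expression
    why: Optional[str] = None    # cause or motivation

    def completeness(self) -> float:
        """Fraction of the five aspects an extractor managed to fill."""
        filled = sum(getattr(self, f.name) is not None for f in fields(self))
        return filled / len(fields(self))

ev = Event5W(who="residents", what="evacuation", where="coastal district")
```

A completeness score like `ev.completeness()` (here 3 of 5 aspects, i.e. 0.6) is one simple way to quantify partial extractions during evaluation.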
GENIE CRM: An Agentic AI-Powered Customer Relationship Management System with Multi-Modal Intelligence and Automated Workflow Orchestration
DOI: 10.55041/isjem06720
Citations: 0
Abstract:
The rapid evolution of large language models and agentic AI frameworks has opened a new frontier for intelligent enterprise software. This paper presents GENIE CRM, a full-stack customer relationship management platform that embeds autonomous AI agents across sixteen functional modules to eliminate manual bottlenecks throughout the sales, support, and operations lifecycle. Built on a Python-Flask backend and a React-TypeScript single-page application, the system leverages Google Gemini 1.5 Flash as a unified multimodal AI backbone to perform structured lead scoring, visiting card optical character recognition, personalised multi-channel outreach generation, intelligent support ticket classification and round-robin routing, document and call recording summarisation, AI-driven bug criticality assessment, and real-time geospatial business intelligence. Experimental evaluation on representative CRM workflows demonstrates a 94.4 percent reduction in lead data entry time through multimodal OCR, a 22.2 percentage-point improvement in automated ticket routing accuracy over manual methods, and a 97.3 percent structural success rate for AI-generated JSON responses across all service endpoints. A persistent voice-enabled chatbot named Genie provides conversational access to the CRM database from every page of the application. The architecture combines Supabase-backed PostgreSQL for relational storage with lightweight JSON flat files for rapid-iteration state management, achieving sub-1.2-second dashboard aggregation across all sixteen modules. This paper details the system architecture, module-level design methodology, UML activity and use case diagrams derived from actual source code analysis, quantitative evaluation results, design trade-off discussion, and a roadmap for cloud-native multi-tenant deployment.
Keywords — Agentic AI; Customer Relationship Management; Google Gemini; Lead Scoring; Optical Character Recognition; Workflow Automation; Support Ticket Routing; Geolocation Intelligence; Flask; React; Supabase; Multimodal AI; Sales Automation; Natural Language Processing; Chatbot
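A "structural success rate" for AI-generated JSON, like the 97.3 percent reported above, is typically measured by checking that each model response parses as JSON and exposes the keys an endpoint expects. A minimal sketch — the `REQUIRED_KEYS` schema and the sample responses are hypothetical, not taken from GENIE CRM:

```python
import json

REQUIRED_KEYS = {"lead_score", "rationale"}  # hypothetical endpoint schema

def is_structurally_valid(raw: str) -> bool:
    """True if the response parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

responses = [
    '{"lead_score": 87, "rationale": "high engagement"}',
    '{"lead_score": 42}',             # parses, but a required key is missing
    'Sure! Here is the JSON: {...}',  # conversational filler, not JSON at all
]
success_rate = sum(map(is_structurally_valid, responses)) / len(responses)
```

Aggregating `is_structurally_valid` over logged responses per endpoint yields exactly this kind of structural success metric.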
RVQ-Alpha: Bridging Single-Cell Transcriptomics and Large Language Models via Discrete Tokenization and Verifiable Reinforcement Learning
Citations: 0
Abstract: No abstract available; see the original article.
Redefining Smart City Implementation: A New Model for Sustainability of Smart Cities
Citations: 0
Abstract: This study addresses a critical gap in smart city implementation by proposing a novel triple helix model that integrates smart governance, smart technologies, and active citizen engagement. While smart city initiatives are widespread, effective execution remains a major challenge. This research examines the smart city environment through the lens of smart government, smart people, and smart technology, revealing a disconnect between ambitious plans and the actual impact of applications. The new model was applied to four metropolitan cities in Turkey and tested using desktop research, analysis of 115 municipal websites based on the McMillan interaction model, and a citizen survey of 1,754 participants. The study presents an in-depth analysis based on detailed examination of the citizen survey and discusses the potential factors affecting awareness and usage by using the Chi-Square Test of Independence and Binary Logistic Regression to analyze 13 relationships using data from 19 smart mobility applications. The results show that smart city initiatives fail to reach their full potential without aware citizens and responsive governance, even with sufficient technology. This research contributes a critical perspective to the smart city discourse, offering a foundational model for future smart city implementations to ensure their success and sustainability.
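The Chi-Square Test of Independence used in the survey analysis reduces to comparing observed counts in a contingency table against the counts expected under independence. A stdlib-only sketch on a made-up awareness-versus-usage table (not the study's data); in practice the statistic is compared against a chi-square critical value for the table's degrees of freedom:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # independence assumption
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = aware / unaware of an app, cols = uses it / does not.
table = [[30, 10],
         [20, 40]]
stat = chi_square_statistic(table)  # df = (2-1)*(2-1) = 1 for this table
```

For this toy table the statistic is about 16.67, far above the 3.84 critical value at the 5% level for one degree of freedom, so independence would be rejected.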
A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents
Citations: 0
Abstract: Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction, MODEE, a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.
FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism
Citations: 0
Abstract: Building on recent interpretivist approaches, we conduct a critical narrative review across journalism studies, human-computer interaction, and FAccT scholarship, conceptualizing editorial authority as the conjunction of decision rights, epistemic warrant, and responsibility. We provide a comprehensive theoretical framework for addressing how concerns about fairness, accountability and transparency emerge, interact, and persist within AI-mediated journalistic practice. We identify and describe two concurrent authority reconfigurations driven by AI adoption. First, an internal migration of authority, in which editorial judgment is progressively deferred to large language models (LLMs) embedded within newsroom workflows. This migration occurs not through explicit policy decisions, but through interactional, cognitive, and organizational mechanisms that legitimize AI-generated outputs while obscuring responsibility and weakening individual and professional agency. Second, we analyze an external migration of authority, whereby decision-making power shifts from news organizations toward platforms, vendors, and infrastructural providers that supply AI systems and distribution channels, exacerbating existing power asymmetries within the media ecosystem. Unaddressed, these reconfigurations risk rendering fairness hard to maintain, accountability difficult to assign and transparency performative. We examine participatory approaches to AI design and deployment in journalism as potential mechanisms for retaining or reclaiming editorial authority. We critically assess both their promise and their structural limitations, highlighting how participation can either meaningfully redistribute authority or function as a tokenistic practice that leaves underlying power relations intact.
Unbiased Prevalence Estimation with Multicalibrated LLMs
Citations: 0
Abstract: Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications -- estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM -- demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.
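The failure mode described above can be reproduced with the classic prevalence correction p = (p_obs − FPR) / (TPR − FPR): error rates pooled on a source population go wrong when the subgroup mix shifts, while group-conditional correction (the flavor of guarantee multicalibration provides) stays unbiased. All numbers below are illustrative, not from the paper:

```python
def correct(p_obs, tpr, fpr):
    """Rogan-Gladen correction: invert p_obs = p*tpr + (1 - p)*fpr."""
    return (p_obs - fpr) / (tpr - fpr)

# Two subgroups where the classifier has different error rates (illustrative).
groups = {"A": {"tpr": 0.9, "fpr": 0.1}, "B": {"tpr": 0.8, "fpr": 0.3}}
true_prev = 0.3  # identical true prevalence in both groups

def observed(g):
    e = groups[g]
    return true_prev * e["tpr"] + (1 - true_prev) * e["fpr"]

# Error rates were estimated on a source mix (80% A); the target mix is shifted (20% A).
source_w, target_w = {"A": 0.8, "B": 0.2}, {"A": 0.2, "B": 0.8}
pooled_tpr = sum(source_w[g] * groups[g]["tpr"] for g in groups)
pooled_fpr = sum(source_w[g] * groups[g]["fpr"] for g in groups)
p_obs_target = sum(target_w[g] * observed(g) for g in groups)

naive = correct(p_obs_target, pooled_tpr, pooled_fpr)  # biased under covariate shift
groupwise = sum(target_w[g] * correct(observed(g), groups[g]["tpr"], groups[g]["fpr"])
                for g in groups)                       # group-conditional: unbiased
```

Here the pooled estimate lands near 0.39 despite a true prevalence of 0.30, while the group-conditional estimate recovers 0.30 exactly — the same qualitative gap the paper's simulation reports as shift magnitude grows.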
Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery
Citations: 0
Abstract: Ramsey-good graphs are graphs that contain neither a clique of size $s$ nor an independent set of size $t$. We study doubly saturated Ramsey-good graphs, defined as Ramsey-good graphs in which the addition or removal of any edge necessarily creates an $s$-clique or a $t$-independent set. We present a method combining SAT solving with bespoke LLM-generated code to discover infinite families of such graphs, answering a question of Grinstead and Roberts from 1982. In addition, we use LLMs to generate and formalize correctness proofs in Lean. This case study highlights the potential of integrating automated reasoning, large language models, and formal verification to accelerate mathematical discovery. We argue that such tool-driven workflows will play an increasingly central role in experimental mathematics.
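The defining properties are directly checkable by brute force on small graphs; the 5-cycle is the classic witness that avoids both a 3-clique and a 3-independent set, and it is also doubly saturated for s = t = 3. A stdlib sketch of the property check (not the paper's SAT-based search method):

```python
from itertools import combinations

def has_clique(edges, vertices, s):
    return any(all(frozenset(p) in edges for p in combinations(c, 2))
               for c in combinations(vertices, s))

def has_independent_set(edges, vertices, t):
    return any(all(frozenset(p) not in edges for p in combinations(c, 2))
               for c in combinations(vertices, t))

def is_ramsey_good(edges, vertices, s, t):
    return (not has_clique(edges, vertices, s)
            and not has_independent_set(edges, vertices, t))

def is_doubly_saturated(edges, vertices, s, t):
    """Ramsey-good, and adding any non-edge creates an s-clique while
    removing any edge creates a t-independent set."""
    if not is_ramsey_good(edges, vertices, s, t):
        return False
    non_edges = {frozenset(p) for p in combinations(vertices, 2)} - edges
    return (all(has_clique(edges | {e}, vertices, s) for e in non_edges)
            and all(has_independent_set(edges - {e}, vertices, t) for e in edges))

c5 = {frozenset({i, (i + 1) % 5}) for i in range(5)}  # the 5-cycle
```

Every chord of the 5-cycle closes a triangle and every deleted edge frees up a 3-independent set, so `is_doubly_saturated(c5, range(5), 3, 3)` holds — exactly the kind of certificate the paper's infinite families generalize.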
The Role of Large Language Models in the Promotion of Minimally Invasive Interventional Radiologic Methods in Gynecology and Obstetrics
DOI: 10.3390/jcm15093234
Citations: 0
Abstract: Background: Minimally invasive interventional radiology (IR) offers effective, uterus-preserving treatments for several gynecologic and obstetric conditions such as uterine fibroids, adenomyosis and postpartum hemorrhage. Despite their efficacy, these methods remain underused, partly due to limited awareness among clinicians and patients. Large language models (LLMs) may help bridge this gap by providing accessible, reliable information. Objective: To evaluate how current LLMs address knowledge gaps and promote awareness of minimally invasive IR methods in gynecology and obstetrics. Methods: A structured ten-question instrument was used to query three publicly available LLMs (OpenEvidence, ChatGPT, and Google Gemini). Responses were analyzed for accuracy, completeness, safety considerations, and patient-centered communication. Results: All three models accurately identified a range of medical, minimally invasive, and surgical treatments for uterine fibroids, adenomyosis, and postpartum hemorrhage, with OpenEvidence and ChatGPT providing more detailed and clinically nuanced responses. OpenEvidence achieved the highest scores overall, closely followed by ChatGPT, while Google Gemini scored lower, particularly in completeness and patient-centered communication. In more complex scenarios, performance differences became more pronounced, with OpenEvidence again leading, ChatGPT performing strongly, and Google Gemini lagging behind. Overall, OpenEvidence and ChatGPT demonstrated higher accuracy, completeness, and safety considerations, whereas Google Gemini showed comparatively weaker and less consistent performance. Conclusions: LLMs may support the promotion of minimally invasive IR methods in gynecology and obstetrics, but their outputs vary considerably in quality. Ongoing refinement and integration of evidence-based sources are essential before routine use in clinical practice.
Therefore, effective collaboration between artificial intelligence (AI) developers and medical professionals is essential to harness this technology’s full potential.
Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?
Citations: 0
Abstract: Context. The most used development framework for robotics software is ROS2. ROS2 architectures are highly complex, with thousands of components communicating in a decentralized fashion. Goal. We aim to evaluate how LLMs can assist in the comprehension of factual information about the architecture of ROS2 systems. Method. We conduct a controlled experiment where we administer 1,230 prompts to 9 LLMs containing architecturally-relevant questions about 3 ROS2 systems with incremental size. We provide a generic algorithm that systematically generates architecturally-relevant questions for a ROS2 system. Then, we (i) assess the accuracy of the answers of the LLMs against a ground truth established via running and monitoring the 3 ROS2 systems and (ii) qualitatively analyse the explanations provided by the LLMs. Results. Almost all questions are answered correctly across all LLMs (mean=98.22%). gemini-2.5-pro performs best (100% accuracy across all prompts and systems), followed by o3 (99.77%), and gemini-2.5-flash (99.72%); the least performing LLM is gpt-4.1 (95%). Only 300/1,230 prompts are incorrectly answered, of which 249 are about the most complex system. The coherence scores in LLM's explanations range from 0.394 for "service references" to 0.762 for "communication path". The mean perplexity varies significantly across models, with chatgpt-4o achieving the lowest score (19.6) and o4-mini the highest (103.6). Conclusions. There is great potential in the usage of LLMs to aid ROS2 developers in comprehending non-trivial aspects of the software architecture of their systems. Nevertheless, developers should be aware of the intrinsic limitations and different performances of the LLMs and take those into account when using them.
Unsupervised protein language models learn patterns of enzyme function
Citations: 0
Abstract: No abstract available; see the original article.
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data
Citations: 0
Abstract: No abstract available; see the original article.
Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study
Citations: 0
Abstract: Large language model (LLM) agents increasingly rely on skills to package reusable capabilities through instructions, tools, and resources. High-quality skills embed expert knowledge, curated workflows, and execution constraints into agents, fueling a growing skill economy through their value and scalability. Yet this ecosystem also creates a new attack surface, as adversaries can interact with public agent interfaces to extract hidden proprietary skill content. We present the first systematic study of black-box skill stealing against LLM agent systems. Compared with conventional system prompt stealing, skill stealing targets modular and structured capability packages whose leakage is directly actionable for copying, redistribution, and monetization, making the resulting harm potentially greater. To study this threat, we derive an attack taxonomy from prior prompt-stealing methods and build an automated stealing prompt generation agent. Starting from model-generated seed prompts, the framework expands attacks through scenario rationalization and structure injection while enforcing diversity via embedding-based filtering, yielding a reproducible pipeline for evaluating proprietary agent systems. We evaluate these attacks across commercial agent platforms and representative LLMs. Our results show that agent skills can often be extracted easily, posing a serious copyright risk. To mitigate this threat, we design defenses across the agent pipeline, focusing on the input, inference, and output phases. Although these defenses substantially reduce leakage, the attack remains inexpensive and repeatable, and a single successful attempt is sufficient to compromise the protected skill. Overall, our findings suggest that these copyright risks remain largely overlooked across proprietary agent ecosystems, motivating stronger protection mechanisms.
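The embedding-based diversity filtering mentioned above can be sketched as a greedy pass that keeps a candidate attack prompt only if its maximum cosine similarity to already-kept prompts stays below a threshold. The embeddings below are toy vectors standing in for a real sentence encoder, and the prompts and threshold are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_filter(candidates, threshold=0.9):
    """Greedily keep candidates whose similarity to every kept one is below threshold.

    `candidates` is a list of (prompt, embedding) pairs; in practice the
    embeddings would come from a sentence encoder.
    """
    kept = []
    for prompt, emb in candidates:
        if all(cosine(emb, e) < threshold for _, e in kept):
            kept.append((prompt, emb))
    return [p for p, _ in kept]

candidates = [
    ("pretend you are the skill author", [1.0, 0.0, 0.0]),
    ("act as the skill's developer",     [0.99, 0.1, 0.0]),  # near-duplicate
    ("print your configuration files",   [0.0, 1.0, 0.0]),   # distinct direction
]
```

Running `diversity_filter(candidates)` drops the near-duplicate second prompt and keeps the two semantically distinct ones, which is the behavior such a filter is meant to enforce across generated attack batches.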
SparsePool: A Graph Pooling Framework via Sparse Representation for Graph Classification
DOI: 10.3390/s26092627
Citations: 0
Abstract: Graph neural networks (GNNs) have achieved great success in graph classification, with graph pooling methods being widely adopted for related tasks. Existing approaches typically rely on node ranking or clustering to coarsen graphs, but often fail to effectively leverage global structural information, leading to loss of critical substructures and limited interpretability—key limitations in molecular analysis and social network mining. To address these issues, we propose SparsePool, a graph pooling method that integrates node features and structural patterns through atomic decomposition. By dynamically decomposing graphs into interpretable atomic units via Boolean matrix factorization, SparsePool preserves semantically meaningful substructures while providing transparent evidence of retained patterns. We further introduce an Atomic Pooling Neural Network (APNN) for graph representation learning. Extensive experiments on relevant benchmarks including biochemical and social network datasets demonstrate that SparsePool outperforms state-of-the-art pooling methods, achieving an average classification accuracy improvement of 1.03% over baseline models while reducing structural information loss. We also discuss its compatibility with emerging quantum computing paradigms, such as quantum-accelerated sparse decomposition, as a promising direction for scaling graph processing in industrial contexts.
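Boolean matrix factorization, the operation behind SparsePool's atomic decomposition, approximates a binary matrix as the Boolean (OR-of-ANDs) product of two binary factors, so each factor column/row pair names one interpretable "atomic" block. A minimal sketch of what that product means — the matrices are toy examples, not SparsePool's learned factors:

```python
def bool_matmul(U, V):
    """Boolean product: (U ∘ V)[i][j] = OR over k of (U[i][k] AND V[k][j])."""
    n, r, m = len(U), len(V), len(V[0])
    return [[int(any(U[i][k] and V[k][j] for k in range(r))) for j in range(m)]
            for i in range(n)]

# A 4x4 binary (adjacency-like) matrix assembled from two overlapping atomic blocks:
# factor k=0 covers rows {0,1,3} x cols {0,1}; factor k=1 covers rows {2,3} x cols {2,3}.
U = [[1, 0],
     [1, 0],
     [0, 1],
     [1, 1]]
V = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
A = bool_matmul(U, V)  # rank-2 Boolean reconstruction
```

Reading the factors directly shows which atomic pattern each row participates in (row 3 belongs to both blocks), which is the transparency argument the abstract makes for decomposition-based pooling.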
Towards a general-purpose foundation model for functional MRI analysis
Citations: 0
Abstract: No abstract available; see the original article.
Prompt Analysis in Large Language Models Adversarial Prompt Detection
Citations: 0
Abstract: Large Language Models such as GPT, LLaMA, and Claude have demonstrated remarkable capabilities in natural language understanding and generation, making them increasingly popular in applications such as chatbots, coding assistants, and enterprise solutions. However, the widespread adoption of these models has also raised serious concerns regarding their security, as adversarial prompt injection attacks, commonly known as “jailbreaking,” manipulate LLMs into generating harmful, biased or confidential outputs by bypassing safety measures. Traditional rule-based defenses are insufficient since adversarial prompts exploit the semantic and contextual nature of language. This project proposes the development of an AI-driven detection system to identify and block adversarial prompts before they reach the LLM by leveraging transformer-based NLP models (BERT, RoBERTa, LLaMA) along with adversarial training techniques to build a classifier that distinguishes between safe, suspicious and malicious prompts. The solution not only strengthens the security of LLM-based systems but also contributes to the emerging field of AI safety and robustness through experimental validation on standard adversarial NLP datasets and synthetic jailbreak attempts, targeting over 85% detection accuracy with explainable outputs that provide transparency into threat detection mechanisms.
Integrating computational and experimental approaches to identify novel pyrazolo[3,4-d]pyrimidine-containing thiadiazole frameworks for treating Alzheimer’s disease
Citations: 0
Abstract:
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder associated with cholinergic dysfunction. We report a novel series of pyrazolo[3,4-d]pyrimidine–thiadiazole derivatives as dual inhibitors of acetylcholinesterase (AChE) and butyrylcholinesterase (BuChE). The compounds were characterized by ¹H NMR, ¹³C NMR, and HRMS, and their inhibitory activity was evaluated in vitro. Among the synthesized series, compound 10f emerged as the most potent analogue, exhibiting IC₅₀ values of 14.60 ± 1.80 nM (AChE) and 290.76 ± 2.90 nM (BuChE), with inhibitory potency comparable to the reference drug Donepezil. Molecular docking revealed that these compounds engage both the catalytic and peripheral anionic sites via π–π stacking and hydrogen-bond interactions, supporting a dual-binding mechanism. Density functional theory (DFT) calculations at the B3LYP/6-31G(d,p) level highlighted electronic features correlating with activity, while in silico ADMET profiling suggested favorable oral bioavailability and low toxicity. These results position this series as promising multifunctional lead candidates for further development in AD therapy.
DryRUN: On the Role of Public Tests in LLM-Driven Code Generation
Citations: 0
Abstract: Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle. Because ground-truth input-output examples are rarely available prior to implementation in real-world software engineering, this dependency restricts methods to curated competitive programming benchmarks. Furthermore, we identify that reliance on these public tests induces an "overconfidence gap," causing frameworks to overfit to simplistic examples and fail on hidden evaluations. In contrast, we observe that external sample inputs are not strictly necessary for code generation. We demonstrate that large language models can autonomously generate valid inputs and simulate execution traces to self-correct. Consequently, we develop DryRUN, a framework that eliminates the need for ground-truth samples by allowing the LLM to iteratively plan, autonomously generate its own inputs and simulate execution, mitigating algorithmic overconfidence. Evaluations on the LiveCodeBench v6 dataset (post-March 2025) demonstrate that DryRUN matches the performance of CodeSIM, a state-of-the-art, public-test-dependent framework, while operating entirely without public test cases or external execution feedback and reducing output token consumption.
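The plan → self-generate inputs → simulate → repair loop can be sketched with stubbed components. Everything here is a hypothetical stand-in: `candidates` plays the role of successive LLM code drafts, `self_generated` the model-proposed inputs, and `spec` the simulated execution oracle — none of it is DryRUN's actual implementation:

```python
def self_correct(candidates, gen_inputs, spec):
    """Return the first candidate whose real execution matches the simulated
    spec on self-generated inputs; a stand-in for an iterative repair loop."""
    inputs = gen_inputs()
    for code in candidates:
        if all(code(x) == spec(x) for x in inputs):
            return code
    return None

# Hypothetical task: absolute value. The first "draft" is wrong for negatives.
candidates = [lambda x: x,                     # buggy first attempt
              lambda x: x if x >= 0 else -x]   # repaired attempt
self_generated = lambda: [0, 3, -2]            # model-proposed inputs (incl. a negative)
spec = abs                                     # stand-in for a simulated execution trace

best = self_correct(candidates, self_generated, spec)
```

The key point the abstract makes survives even in this toy form: because the self-generated inputs include a negative value, the buggy draft is rejected without any human-provided test case.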
LC–MS Based Phytochemical Profiling and In Silico Multi-Target Evaluation of Passiflora tarminiana against α-Amylase and α-Glucosidase
Citations: 0
Abstract: Several Passiflora species have been documented to exhibit anti-diabetic properties. Hence, the study presents the first insight into the phytochemical profile of an aqueous extract of Passiflora tarminiana indigenous to the Nilgiris, through LC-MS-based analysis with molecular docking to identify and prioritize potential multitarget bioactive compounds for antidiabetic activity. Diabetes mellitus, characterised by impaired blood glucose levels, poses a formidable public health challenge. Although many inhibitors targeting starch-hydrolysing enzymes are available, natural inhibitors are considered safe. Therefore, an in silico approach was used to analyze potential phytochemicals against major antidiabetic targets. These phytochemicals were characterized by LC-MS/MS, followed by preliminary phytochemical screening. Qualitative phytochemical screening confirmed the presence of alkaloids, flavonoids, glycosides, saponins, and terpenoids. The total phenolic content (TPC) (373.81 ± 18 µg/mL GAE) and total flavonoid content (TFC) (271.96 ± 20 µg/mL QE) were determined in the aqueous extracts of the fruit. LC-MS/MS profiling revealed 54 compounds, 24 of which exhibited potential anti-diabetic activity. ADMET properties confirmed that 20 compounds satisfied Lipinski’s rule and revealed drug-likeness and good bioavailability. Molecular docking against α-amylase and α-glucosidase prioritized Maritimetin, Diprogulic acid, and Pirbuterol as promising compounds based on their favorable binding affinities. Furthermore, the stability was examined using molecular dynamics (MD) simulations through trajectory-based analyses (RMSD, RMSF, and hydrogen bonding) and conformational stability assessments, including radius of gyration (Rg), solvent-accessible surface area (SASA), and PCA-derived free energy landscapes. The MMPBSA calculations demonstrated favourable binding free energies of the selected compounds compared to acarbose, supporting the robustness of the findings.
However, the analysis is limited by the lack of experimental validation through enzyme inhibition and kinetic studies, and the LC-MS-based compound identifications remain tentative, as they were not confirmed using reference standards.
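The Lipinski drug-likeness filter applied during ADMET screening is a simple rule-of-five check on molecular descriptors. A sketch with hypothetical descriptor values — the numbers are illustrative placeholders, not the paper's measured properties:

```python
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Strict Lipinski rule of five: molecular weight <= 500 Da, logP <= 5,
    <= 5 hydrogen-bond donors, <= 10 hydrogen-bond acceptors.
    (In practice one violation is often tolerated.)"""
    return mw <= 500 and logp <= 5 and h_donors <= 5 and h_acceptors <= 10

# Hypothetical descriptor values for two screened phytochemicals (illustrative only).
druglike_example = dict(mw=286.2, logp=2.1, h_donors=4, h_acceptors=6)
failing_example  = dict(mw=612.7, logp=6.3, h_donors=7, h_acceptors=12)
```

Counting how many of the 54 profiled compounds pass such a filter is how a figure like "20 compounds satisfied Lipinski's rule" is typically obtained.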
Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models
Citations: 0
Abstract: No abstract available; see the original article.
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Citations: 1
Abstract: Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
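The frequency-based sampling idea can be sketched as: compute each question's majority-answer consistency over sampled responses, drop the ambiguous middle band, and otherwise label responses by agreement with the majority. The thresholds and labeling scheme below are illustrative assumptions, not DDRL's exact procedure:

```python
from collections import Counter

def filter_ambiguous(samples, low=0.3, high=0.7):
    """Keep responses only when the majority-answer consistency falls outside
    the ambiguity band [low, high); label each response by majority agreement."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    consistency = freq / len(samples)
    if low <= consistency < high:
        return []  # ambiguity region: primary source of reward noise, excluded
    return [(s, 1 if s == answer else 0) for s in samples]

confident = filter_ambiguous(["42"] * 8 + ["7"] * 2)  # consistency 0.8 -> kept
ambiguous = filter_ambiguous(["42"] * 5 + ["7"] * 5)  # consistency 0.5 -> dropped
```

The confident batch yields eight positives and two negatives, while the evenly split batch is discarded entirely rather than allowed to inject a noisy pseudo-reward.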
Feasibility of Wind‐Powered Green Hydrogen Production via a Hybrid Graph Neural Network‐Transformer Forecasting Model
DOI: 10.1002/ese3.70541
Citations: 0
Abstract:
Accurate long‐term wind speed forecasting is pivotal for the strategic planning of renewable energy infrastructure, particularly for assessing the techno‐economic feasibility of wind‐powered green hydrogen facilities. However, capturing the complex spatiotemporal dependencies in climate data remains a significant challenge. This study proposes a hybrid deep learning framework designed to enhance 1‐ to 10‐year wind speed forecasts. The proposed architecture integrates graph neural networks (GNN) to extract inter‐variable correlations and feature‐space dynamics among meteorological parameters, coupled with advanced sequence modeling layers to capture temporal patterns. We rigorously evaluated the framework using multi‐variable climate data from NASA's Power Data Access Viewer, comparing a GNN‐Transformer model against a GNN‐GRU variant, as well as standard baselines (LSTM, CNN) and state‐of‐the‐art hybrids (e.g., MST‐GNN). The results demonstrate that the proposed hybrid framework significantly outperforms standalone models. Specifically, the GNN‐Transformer achieved a Mean Absolute Error (MAE) of 0.53 m/s for 10‐year forecasts, representing a 30.27% improvement over a standard Transformer. Furthermore, our comparative analysis reveals that the GNN‐GRU variant achieved superior practical performance with an MAE of 0.44 m/s. These findings provide two key contributions: (1) establishing a robust GNN‐based framework that advances long‐term forecasting accuracy for green hydrogen site planning, and (2) offering empirical evidence that while Transformers offer greater architectural complexity, simpler recurrent architectures like the GRU may yield better stability in specific long‐term climatological tasks.
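The headline numbers reduce to mean absolute error and relative MAE reduction. A quick sketch (the forecast series is made up; only the 0.76 vs 0.53 m/s comparison mirrors the abstract's reported ~30% figure):

```python
def mae(y_true, y_pred):
    """Mean absolute error between observed and forecast wind speeds (m/s)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_improvement(baseline_mae, model_mae):
    """Percent reduction in MAE relative to a baseline model."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae

# Toy observed vs forecast series (illustrative values, not NASA POWER data).
observed = [5.2, 6.1, 4.8, 7.0]
forecast = [5.0, 6.4, 4.5, 7.2]
err = mae(observed, forecast)  # 0.25 m/s on this toy series
# A baseline MAE of ~0.76 m/s vs the reported 0.53 m/s gives ~30.3% improvement.
```

This also makes the abstract's internal arithmetic checkable: a 30.27% improvement landing at 0.53 m/s implies a standard-Transformer baseline of roughly 0.76 m/s.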
OptiMat Alloys: A FAIR End-to-End Agent with Living Database for Computational Multi-Principal Alloy Exploration
🔥 Citations:
0
Abstract: The FAIR principles have transformed how computational data and workflows are shared in materials research, yet existing repositories can only serve pre-computed entries -- broad coverage is perpetually incomplete and cannot adapt to new questions on demand. To address these challenges, we present OptiMat Alloys, a large language model-powered conversational agent for multi-principal element alloy exploration built on three pillars: a living database that stores every calculation with provenance, low-barrier accessibility through a web interface requiring zero programming expertise, and built-in uncertainty quantification via cross-potential and cross-configuration validation (see demo here https://youtu.be/lQzuorkzPMc). Coupling foundational machine learning interatomic potentials covering nearly the entire periodic table of elements with natural-language interaction, OptiMat Alloys enables targeted, on-demand computation guided by the user's domain knowledge, extending FAIR from pre-computed repositories to on-demand knowledge generation and making computational alloy screening accessible to any materials scientist.
Use of generative AI in database courses: a role-based review
🔥 Citations:
0
Abstract:
Generative artificial intelligence (genAI), particularly large language models (LLMs), has attracted increasing attention for its potential to support teaching and learning in higher education. However, how genAI is pedagogically integrated into specific disciplinary contexts, such as database education, remains insufficiently understood.
This review examines the use of genAI in database courses through a role-based analytical framework, distinguishing the instructional function of AI as supplementary assistant, direct mediator, and new subject. Following PRISMA guidelines, peer-reviewed journal articles indexed in Web of Science and Scopus were screened. From an initial pool of 406 studies, nine empirical studies were identified.
GenAI is most commonly used as a supplementary assistant in database courses, where it supports tasks such as querying, data generation, solution checking, and error correction. The studies reported positive effects on academic performance, instructional efficiency, and time savings, with relatively few ethical or pedagogical concerns. The use of AI as a direct mediator demonstrated clear benefits but also raised concerns related to the accuracy and reliability of AI-generated evaluations. Finally, one exploratory study positioning AI as a new subject emerged as the most transformative yet risky role. While this approach offers potential for substantial instructional innovation, it may also lead to challenges related to accuracy and equity.
This review underscores the potential and importance of role-based integration of genAI in database education and identifies key directions for future research. Research on AI usage in database instruction is still at an early stage, and further empirical studies are required.
A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge
🔥 Citations:
0
Abstract: Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates. In the event, participants engaged with a humanoid robot in a logistics-inspired task environment using voice commands interpreted through an LLM-based control framework. The activity was designed as a team-based, role-driven experience intended to expose participants to embodied AI and human-robot collaboration without requiring prior robotics expertise. To evaluate the approach, a post-event survey remained open for 16 days and collected 102 responses. Results indicate strong overall reception, with high satisfaction (8.46/10), increased interest in robotics and AI (4.47/5), and improved understanding of emerging forms of human-robot collaboration (4.45/5). Participants who interacted directly with the robot also reported natural interaction (4.37/5) and a strong sense that interaction became easier as the activity progressed (4.74/5). At the same time, lower ratings for reliability and predictability point to important technical and design challenges for future iterations. The findings suggest that challenge-based, LLM-enabled humanoid interaction can serve as a promising and replicable method for robotics awareness in industrial and operational environments.
Can MLLMs "Read" What is Missing?
🔥 Citations:
0
Abstract: We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.
Empirical Analysis of Internal Hallucination Detection in Quantized LLMs: Layer Dynamics and White-Box Benchmarks
🔥 Citations:
0
Abstract: As large language models (LLMs) move onto resource-constrained devices, maintaining factual reliability without adding another expensive decoding pass becomes a practical inference problem. Instead of introducing another complex hallucination detector, this paper presents an empirical study of which low-cost white-box features remain useful under a controlled single-pass benchmark. Across repeated candidate-answer reruns on Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct, truthful and incorrect internal states are most separable in the middle-to-late layers, with the peak consistently falling at 50–70% of total network depth across both model families. The depth-relative pattern is more stable than any single detector ranking: simple residual-space baselines, including Mahalanobis scoring, remain competitive with more elaborate residual-plus-spectral fusion features under the same protocol, although detector ranking still changes by task. A separate preliminary two-seed Qwen2.5-7B-Instruct BF16 probe under that same white-box benchmark reproduces the same middle-to-late peak, and auxiliary Int8 checks on Qwen2.5-1.5B and Qwen2.5-7B remain consistent with that same localization under moderate quantization. Taken together, the results point away from detector complexity and toward a more reproducible question of where hallucination cues emerge, which internal statistics remain reliable, and how cautiously such conclusions should be transferred to deployment settings.
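The "simple residual-space baseline" named above, Mahalanobis scoring, can be illustrated with a diagonal-covariance simplification (the diagonal approximation and toy numbers are ours; the paper's features and covariance handling may differ):

```python
import math

def mahalanobis_diag(x, mean, var):
    """Distance of a hidden state x from the mean of truthful-answer
    internal states, under a diagonal covariance approximation; larger
    scores suggest the state looks unlike known-truthful states."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

d = mahalanobis_diag([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])  # sqrt(1 + 1)
```

In practice the statistics would be estimated from middle-to-late-layer residual states, where the paper reports truthful and incorrect states are most separable.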
Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions
🔥 Citations:
0
Abstract: Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.
Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
🔥 Citations:
0
Abstract: As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training
🔥 Citations:
0
Abstract: Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
🔥 Citations:
0
Abstract: In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability. To address these limitations, this paper proposes a collaborative prediction graph neural network that integrates molecular structural features and cell-line genomic profiles with drug-drug interactions to enhance the prediction of synergistic effects. We introduce a novel model named the Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). The model first extracts multi-scale topological features of drug molecules using a residual graph isomorphism network, where residual connections help mitigate over-smoothing in deep layers. Subsequently, an adaptive Long Short-Term Memory (LSTM) module fuses structural information from local to global scales. Finally, a cross-attention module is designed to explicitly model drug-drug interactions and identify key chemical substructures. Extensive experiments on five public benchmark datasets demonstrate that ResGIN-Att achieves competitive performance, comparing favorably against key baseline methods while exhibiting promising generalization capability and robustness.
Using Large Language Models to Identify Patient–Oncologist Communication Domains: A Feasibility Study
🔥 Citations:
0
Abstract:
The American Society of Clinical Oncology (ASCO) convened a multidisciplinary panel in 2017, resulting in patient–oncologist communication guidelines. Ideally, these conversations should be documented in the medical records. However, chart review for communication topics is inefficient. Large language models (LLMs) present a computational method for identification of communication domains in clinical notes, subsequently providing feedback for clinicians.
The purpose of this study was to develop an approach using LLMs to identify communication domains in unstructured free text notes, validating against gold-standard chart review.
The study population included 134 clinical notes from 30 patients with advanced cancer seen in June 2024 at one of seven Dana-Farber Cancer Institute clinics (Boston, MA). We used a HIPAA-secure artificial intelligence tool based on GPT-4o to develop an LLM prompt for identification of communication domains.
We used standard performance metrics to compare the LLM prompt to chart review for all six communication domains. A hallucination index was calculated to assess false information that may be produced by LLMs when applied to large data sets.
Across communication domains, compared to chart review, the note-level LLM analysis achieved sensitivity ranging from 0.43 to 1.0, specificity ranging from 0.32 to 0.99, and accuracy ranging from 0.51 to 0.99. The average hallucination index for all domains was low. LLM abstraction required approximately 7 seconds per note, compared to 5–7 minutes with chart review.
LLMs have the potential to identify ASCO communication domains. Future directions include applying the method for quality improvement efforts, such as generating feedback for oncologists on topics that may require follow-up.
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.
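The audit's core quantities, the shift in mean predicted risk and the change in output variability under added contextual noise, reduce to simple descriptive statistics; the helper below is our sketch, with made-up risk scores:

```python
import statistics

def audit_shift(baseline_scores, noisy_scores):
    """Compare predicted hospitalization risk between prompts containing
    only clinically relevant features (baseline) and the same prompts
    padded with medically insignificant features (noisy)."""
    return {
        "mean_shift": statistics.mean(noisy_scores) - statistics.mean(baseline_scores),
        "stdev_baseline": statistics.stdev(baseline_scores),
        "stdev_noisy": statistics.stdev(noisy_scores),
    }

# Toy scores: padding with irrelevant features raises and destabilizes risk.
r = audit_shift([0.30, 0.32, 0.31], [0.40, 0.48, 0.35])
```

The paper's audit repeats this comparison across models and across the four prompt reframings.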
An Alternate Agentic AI Architecture (It's About the Data)
🔥 Citations:
0
Abstract: For the last several years, the dominant narrative in "agentic AI" has been that large language models should orchestrate information access by dynamically selecting tools, issuing sub-queries, and synthesizing results. We argue this approach is misguided: enterprises do not suffer from a reasoning deficit, but from a data integration problem. Enterprises are data-centric: critical information is scattered across heterogeneous systems (e.g., databases, documents, and external services), each with its own query language, schema, access controls, and performance constraints. In contrast, contemporary LLM-based architectures are optimized for reasoning over unstructured text and treat enterprise systems as either corpora or external tools invoked by a black-box component. This creates a mismatch between schema-rich, governed, performance-critical data systems and text-centric, probabilistic LLM architectures, leading to limited transparency, weak correctness guarantees, and unpredictable performance. In this paper, we present RUBICON, an alternative architecture grounded in data management principles. Instead of delegating orchestration to an opaque agent, we introduce AQL (Agentic Query Language), a small, explicit query algebra - Find, From, and Where - executed through source-specific wrappers that enforce access control, schema alignment, and result normalization. All intermediate results are visible and inspectable. Complex questions are decomposed into structured, auditable query plans rather than hidden chains of LLM calls. Our thesis is simple: enterprise AI is not a prompt engineering problem; it is a systems problem. By reintroducing explicit query structure, wrapper-based mediation, and cost-based optimization, we obtain the breadth of agentic search while preserving traceability, determinism, and trust in enterprise environments.
Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model
🔥 Citations:
1
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based method relies on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that IRM can achieve superior detection performance, outperforms existing zero-shot and supervised methods in LLM-generated text detection.
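An implicit reward of this kind is typically the scaled log-probability gap between an instruction-tuned model and its base model (the DPO-style implicit reward); the toy scorer below assumes per-token log-probabilities are already available and is our simplification, not the paper's exact scoring rule:

```python
def implicit_reward(logp_tuned, logp_base, beta=1.0):
    """Average per-token reward beta * (log p_tuned - log p_base).
    Higher values mean the instruction-tuned model prefers the text far
    more than the base model, which IRM-style detectors treat as
    evidence of LLM-generated text; thresholding is left to the caller."""
    gaps = [t - b for t, b in zip(logp_tuned, logp_base)]
    return beta * sum(gaps) / len(gaps)

# Toy log-probs for a three-token text:
score = implicit_reward([-1.0, -0.5, -0.8], [-2.0, -1.5, -1.0])
```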
Target discovery and drug design in the era of artificial intelligence
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
The first CRISPR gene editor designed from scratch by AI is here! Built on a large language model, it successfully edited the human genome
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
🔥 Citations:
0
Abstract: Person-Job Fit (PJF) is a critical component for online recruitment. Existing approaches face several challenges, particularly in handling low-quality job descriptions and similar candidate-job pairs, which impair model performance. To address these challenges, this paper proposes a large language model (LLM) based method with two novel techniques: (1) LLM-based data augmentation, which polishes and rewrites low-quality job descriptions by leveraging chain-of-thought (COT) prompts, and (2) category-aware Mixture of Experts (MoE) that assists in identifying similar candidate-job pairs. This MoE module incorporates category embeddings to dynamically assign weights to the experts and learns more distinguishable patterns for similar candidate-job pairs. We perform offline evaluations and online A/B tests on our recruitment platform. Our method relatively surpasses existing methods by 2.40% in AUC and 7.46% in GAUC, and boosts click-through conversion rate (CTCVR) by 19.4% in online tests, saving millions of CNY in external headhunting expenses.
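The category-aware gating idea, expert weights conditioned on a job-category embedding, can be sketched without any ML framework; every name and number below is illustrative, not the paper's implementation:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def category_aware_gate(pair_feat, cat_emb, gate_w):
    """Score each expert from the candidate-job features concatenated
    with the category embedding, so similar candidate-job pairs from
    different job categories can be routed to different experts."""
    x = pair_feat + cat_emb  # list concatenation = feature concatenation
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    return softmax(scores)   # mixture weights over experts

weights = category_aware_gate(
    pair_feat=[0.2, 0.7],
    cat_emb=[1.0, 0.0],                   # hypothetical category embedding
    gate_w=[[1, 0, 1, 0], [0, 1, 0, 1]],  # two experts' gate rows
)
```

Changing `cat_emb` shifts the mixture weights even when `pair_feat` is identical, which is the mechanism for separating similar pairs.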
GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.
Large Language Model Agent Enables Autonomous Machine Learning Model Building for Biomedicine
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Enhanced GNN with Causal Proximity Vectors: Bridging Causality and Proximity in Graph Neural Networks
🔥 Citations:
0
Abstract: A knowledge graph is a structured representation of entities and their relationships, often used in biomedical domains to model complex interactions. Graph Neural Networks (GNNs), which utilize these graphs, are effective for predicting interactions missing in the knowledge graph. However, GNNs lack the ability to incorporate causal reasoning, which is crucial in biomedical applications, and they are limited in their ability to generalize to unseen data.
In oncology, where treatment regimens are intricate and patient responses are highly variable, predicting Adverse Drug Reactions (ADRs) is particularly difficult. Existing models fail to capture the indirect, high-granularity information needed for accurate ADR prediction. To address these challenges, we propose the Causality and Proximity-based Relational Multihead Attention Model (CPRMAM). This model leverages a knowledge graph of ADR-related cancer case studies and introduces a causal proximity vector to prioritize relevant relationships. By employing an inductive GNN approach, CPRMAM generalizes to unseen data, improving ADR prediction.
The Intersection of Knowledge Graphs, Large Language Models, and Mass Spectrometry Data: Progress and Applications
DOI:
10.1002/widm.70086
🔥 Citations:
0
Abstract:
The rapid development of knowledge graphs (KGs) and large language models (LLMs) in artificial intelligence (AI) has provided a new approach to intelligent analysis of mass spectrometry (MS) data. KGs can enhance the ability to identify metabolites, perform functional annotation, and integrate multi‐omics data through structured knowledge representation and semantic association. Meanwhile, the powerful context‐understanding capabilities of LLMs and advanced natural language processing (NLP) technologies can assist in the automated analysis and application of MS data. When combined, they overcome the limitations of traditional technologies, which is significant for researching high‐dimensional data analysis, detecting low‐abundance signals, and integrating knowledge across domains. This article reviews cross‐domain applications of KGs and LLMs in MS data analysis. We focus on the latest progress in metabolite annotation, automated report generation, and multi‐omics data integration. Furthermore, from the perspectives of quality, evolution, and reliability of KGs, we outline technical challenges currently faced, such as insufficient data standardization, lack of traceability and version control in KGs, limited model interpretability, and obstacles in cross‐modal fusion. Finally, we summarize research directions that can promote integration of KG and LLM in the future, such as how to enhance knowledge representation and how to achieve multimodal learning and dynamic knowledge updates. All of these can improve the accuracy and efficiency of data interpretation and open new research directions in metabolomics, proteomics, and broader life sciences.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Application Areas > Health Care
Fundamental Concepts of Data and Knowledge > Knowledge Representation
Cross-Modal Retrieval-Augmented Generation for Craft Gestures Learning: Enabling Dialogue with Multimodal Pedagogical Contents
🔥 Citations:
0
Abstract: Learning manual craft gestures poses notable challenges for apprentices, who must interpret intricate motor patterns, pinpoint essential procedural parameters and discern frequent execution errors through observation of experts. When apprentices practise on their own, conventional vocational education often relies on static content delivery, which may leave them without interactive support when analysing gesture complexity or clarifying technical nuances. The proposed work aims to fill the gap by creating an interactive conversational learning experience where Retrieval-Augmented Generation (RAG) converts a master's tacit knowledge into explicit pedagogical material. This approach processes egocentric video recordings and aligns them with text from an elicitation where craftspeople describe in detail key aspects of their motor control, material handling and tool manipulation. By employing vector embeddings and semantic search, the framework retrieves contextually relevant professional insights, which Large Language Models then use to generate pedagogically coherent information about execution patterns, error identification and technique refinement. This Cross-Modal RAG (CM-RAG) has been integrated into Moodle LMS, an open-source learning management system for delivering and tracking educational content, facilitating natural language dialogues for apprentices regarding complex manual procedures such as glassblowing. The framework effectively retrieves information about movement forms and common failure modes, offering targeted guidance in instances where instructions are unavailable. This approach is expected to support motor skill acquisition in vocational education, narrowing the gap between expert demonstration and apprentice understanding through conversational learning.
Empowering Patients and Clinicians: LLMs in Hypertension Care: A Scoping Review
🔥 Citations:
0
Abstract:
Large language models have emerged as potential tools to support hypertension care, including diagnosis, treatment decision-making, and patient education. However, evidence regarding their validity, performance, and clinical applicability remains limited. The objective is to map current applications of large language models in hypertension care, with emphasis on model optimization strategies, evaluation approaches, and reported limitations. We conducted a Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews–compliant scoping review of primary studies published between 2023 and 2025 evaluating large language models in hypertension. Thirty-three studies were included. Data were charted on clinical use cases, model optimization techniques, evaluation metrics, data sets, and limitations. Applications were categorized into clinical decision support systems, patient education, medical education, research support, and administrative functions. GPT-based models predominated (82%). Model optimization was limited: 89% relied exclusively on prompt engineering. Most applications focused on patient education (52%) and clinical decision support systems (24%). In clinical decision support systems, reported accuracy ranged from 65% to 100%, reaching 87% to 91% for ambulatory blood pressure monitoring interpretation. Patient education applications showed accuracy between 80% and 90%, but frequent issues included excessive language complexity and occasional unsafe outputs. Across domains, evaluation methods were heterogeneous, reproducibility was inconsistently assessed, and safety concerns, including hallucinations and outdated knowledge, were commonly reported. Current evidence suggests that large language models may support selected tasks in hypertension care; however, their clinical reliability remains uncertain. 
The limited methodological rigor, minimal use of advanced optimization techniques, and narrow scope of evaluated applications preclude conclusions regarding routine clinical use. Further rigorously designed studies are required before broader implementation can be considered.
The effect of medical explanations from large language models on diagnostic accuracy in radiology
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
🔥 Citations:
0
Abstract: Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, state-based recall. Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.
Measuring Opinion Bias and Sycophancy via LLM-based Coercion
🔥 Citations:
0
Abstract: Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.
TRACE: Topology-aware Reconstruction of Accidents in CARLA for AV Evaluation
🔥 Citations:
0
Abstract: Validating Autonomous Vehicles (AVs) requires exposure to rare, safety-critical scenarios, infrequent in routine driving data. Existing benchmarks address this by generating synthetic conflicts or mapping accident descriptions to abstract road geometries, failing to capture the topological complexity of real-world crashes. We introduce TRACE, a pipeline that automates the reconstruction of NHTSA crash reports into high-fidelity CARLA simulations by (1) retrieving site-specific OpenStreetMap data to preserve exact road topology, (2) leveraging Large Language Models to infer vehicles' initial state from road geometry and pre-crash maneuvers, and (3) generating simulation trajectories from semi-structured report data. Using this pipeline, we curated a benchmark of 52 diverse accident scenarios covering varied collision types, road topologies, and pre-crash maneuvers, providing a challenging open-source resource for testing AV systems against real-world failures.
Ethics Testing: Proactive Identification of Generative AI System Harms
🔥 Citations:
0
Abstract: Generative Artificial Intelligence (GAI) systems that can automatically generate content in the form of source code or other artifacts (e.g., images) have seen increasing popularity due to the emergence of tools such as ChatGPT, which rely on Large Language Models (LLMs). Misuse of the automatically generated content can incur serious consequences due to potential harms in the generated content. Despite the importance of ensuring the quality of automatically generated content, there is little to no approach that can systematically generate tests for identifying software harms in the content generated by these GAI systems. In this article, we introduce the novel concept of ethics testing, which aims to systematically generate tests for identifying software harms. Different from existing testing methodologies (e.g., fairness testing, which aims to identify software discrimination), ethics testing aims to systematically detect software harms that could be induced by unethical behavior (e.g., harmful behavior or behavior that violates intellectual property rights) in automatically generated content. We introduce the concept of ethics testing, discuss the challenges therein, and conduct five case studies to show how ethics testing can be performed for generative AI systems.
Encoder-Free Human Motion Understanding via Structured Motion Descriptions
🔥 Citations:
0
Abstract: The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach achieves state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case
🔥 Citations:
0
Abstract: Technology for security operations centers (SOCs) has a storied history of slow adoption due to concerns about trust and reliability. These concerns are amplified with artificial intelligence, particularly large language models (LLMs), which exhibit issues such as hallucinations and inconsistent outputs. To assess whether LLM-based tools can improve SOC efficiency, we embedded two PhD researchers within a multinational company's SOC for six months of ethnographic fieldwork. We identified recurring challenges, such as repetitive tasks, fragmented or unclear data, and tooling bottlenecks, and collaborated directly with practitioners to develop LLM companion tools aligned with their operational needs. Iterative refinement reduced workflow disruption and improved interpretability, moving practitioners from skepticism to sustained adoption. Ethnographic analysis indicates that this shift was enabled by our sociotechnical co-creation process, consistent with Nonaka's SECI model. This framework explains the common challenges in traditional SOC technology adoption, including workflow misalignment, rigidity against evolving threats and internal requirements, and stagnation over time. Our findings show that the co-creation approach can overcome these old barriers and create a new paradigm for creating usable technology for cybersecurity operations.
Quotient-Space Diffusion Models
🔥 Citations:
0
Abstract: Diffusion-based generative models have transformed generative AI and enabled new capabilities in the science domain, for example, generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system that identifies objects convertible by a group action as equivalent; the target distribution is thus essentially defined on the quotient space with respect to the group. In this work, we establish a formal framework for diffusion modeling on a general quotient space, and apply it to molecular structure generation, which follows the special Euclidean group $\text{SE}(3)$ symmetry. The framework reduces the necessity of learning the component corresponding to the group action, hence simplifying learning over conventional group-equivariant diffusion models, and its sampler guarantees recovery of the target distribution, whereas heuristic alignment strategies lack proper samplers. These arguments are empirically validated on structure generation for small molecules and proteins, indicating that the principled quotient-space diffusion model provides a new framework that outperforms previous symmetry treatments.
Network Pharmacology and Molecular Docking-Based Investigation of Empagliflozin’s Therapeutic Potential in Chronic Kidney Disease
DOI:
10.3390/life16050719
🔥 Citations:
0
Abstract: Chronic kidney disease (CKD) is a progressive global health challenge. While empagliflozin, a selective SGLT2 inhibitor, is known to attenuate CKD progression through mechanisms beyond glycemic control, the precise molecular pathways remain incompletely characterized and warrant further investigation. This study employed an integrated network pharmacology and molecular docking approach to elucidate the multi-target mechanisms of empagliflozin in CKD. Initial evaluation demonstrated that empagliflozin exhibits favorable physicochemical properties, drug-likeness, and ADMET profiles, supporting its potential as an effective orally administered therapeutic option for CKD management. Network analysis identified 221 shared molecular targets between empagliflozin and CKD-associated genes. Topological analysis of the protein–protein interaction (PPI) network revealed ten critical hub proteins—GAPDH, IL6, EGFR, HSP90AA1, NFKB1, HSP90AB1, MTOR, MAPK3, IL2, and PIK3CA—which serve as key regulators in CKD pathophysiology. Gene Ontology and KEGG pathway enrichment analyses indicated that these shared targets are significantly involved in phosphorylation, signal transduction, and central signaling cascades associated with CKD progression, including the PI3K-Akt, FoxO, HIF-1, and AGE-RAGE pathways. Molecular docking simulations corroborated empagliflozin’s multi-target affinity, demonstrating particularly strong binding energies toward HSP90AB1 (−10.85 kcal/mol), MAPK3 (−9.46 kcal/mol), and EGFR (−9.38 kcal/mol). Empagliflozin maintained stable hydrogen bonding throughout the 200-ns molecular dynamics simulation, primarily with GLN18, GLU42, SER45, ASN46, ASN101, GLY130, and TYR134, underscoring its persistent and well-anchored interaction with HSP90AB1. 
Collectively, these findings provide crucial mechanistic insights, suggesting that empagliflozin might exert its therapeutic effects by modulating interconnected pathways regulating inflammation, oxidative stress, and metabolic homeostasis, thereby reinforcing its role as a comprehensive, multi-target therapeutic strategy for CKD management. Nonetheless, validation through in vitro experiments remains necessary.
Importance of Rehabilitees’ and Rehabilitation Professionals’ Voices in the Development of AI-Based Virtual Assistants
🔥 Citations:
0
Abstract: Hybrid rehabilitation combines remote and face-to-face interventions, enabling more flexible and accessible treatment models. Within this context, AI-based virtual assistants (VAs) have the potential to enhance the accessibility, interactivity, and continuity of rehabilitation services. However, the successful implementation of such solutions requires careful consideration of technical, ethical, and social dimensions, as well as evidence-based knowledge of their acceptability among end users. Despite recent advances, experiential knowledge regarding the opportunities and challenges of integrating VAs into rehabilitation processes remains limited.
For AI-based solutions to be effectively adopted in rehabilitation, developers must understand the rehabilitation process across different system levels. At the outset of developing AI-based virtual assistants, we conducted a literature review to establish a shared conceptual foundation. In addition, interviews were carried out with decision-makers responsible for funding and providing rehabilitation to explore their perspectives on the role of artificial intelligence and virtual assistants in the multiple sclerosis (MS) rehabilitation process.
Within the international HybReDe project, virtual assistants utilising retrieval-augmented generation (RAG) architecture were developed. The usability of large language model (LLM)-based virtual assistants was evaluated through focus group interviews and the System Usability Scale involving people with MS and rehabilitation professionals. Based on these findings, subsequent versions of the virtual assistant were developed using validated information sources and an alternative solution architecture incorporating instruction fine-tuning.
The forthcoming pilot study will be guided by the Technology Acceptance Model (TAM). Building on the ten acceptability-related areas identified by Haydon et al. (2023), the study will focus on affective attitude, perceived impact and benefit, perceived workload, and perceived ability. The aim is to examine the acceptability of virtual assistants among people with MS and MS rehabilitation professionals, as well as the feasibility of deploying such solutions in everyday rehabilitation practice.
Keywords: Artificial intelligence; virtual assistant; rehabilitation; acceptability; multiple sclerosis
CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Evaluation of Automatic Speech Recognition Using Generative Large Language Models
🔥 Citations:
0
Abstract: Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
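The meaning-insensitivity of WER that motivates this paper is easy to see from its definition: WER is the word-level Levenshtein distance between reference and hypothesis divided by the reference length, so a harmless substitution and a meaning-flipping one cost exactly the same. A minimal sketch (illustrative only, not the paper's evaluation code; the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One word wrong in each case -> identical WER, very different meaning.
print(wer("turn the heating up", "turn the heater up"))   # 0.25 (harmless)
print(wer("turn the heating up", "turn the heating off")) # 0.25 (meaning flipped)
```

This is exactly the failure mode the semantic and LLM-based metrics above are designed to catch.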
Shared Lexical Task Representations Explain Behavioral Variability In LLMs
🔥 Citations:
0
Abstract: One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
🔥 Citations:
0
Abstract: The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large language model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information-Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent, respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
🔥 Citations:
0
Abstract: The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.
Glioma-intrinsic SLC1A3 hijacks the vascular niche to establish an immunosuppressive microenvironment
🔥 Citations:
0
Abstract: Glioblastoma (GBM) is a highly lethal malignancy driven by glioma-initiating cells (GICs). While GICs are known to profoundly remodel the tumor microenvironment (TME) to promote progression and immune evasion within the vascular niche, the specific transcriptomic reprogramming and alternative splicing events driving their evolution from neural stem cells (NSCs), and how these intrinsic cellular state changes dictate multi-cellular immunosuppressive networks and checkpoints, remain poorly understood. Unraveling these complex tumor-vascular-immune interactions is critical for identifying novel vulnerabilities and developing effective immunotherapies.
To decode the GICs' evolutionary trajectory, we integrated RNA-seq and alternative splicing analysis of NSCs and patient-derived GIC cohorts. The malignant progression was mapped using scRNA-seq pseudotime analysis, and key targets were validated across clinical TCGA cohorts. Furthermore, we employed the large-scale single-cell foundation model, Geneformer, to perform in silico genetic perturbations, integrating it with interactome inference to decipher TME communication. Finally, the proposed tumor-endothelial-T cell multi-cellular axis was functionally validated utilizing in vitro tumor-HUVEC co-culture systems, qPCR, and FACS-based T cell activation (NFAT-Jurkat) assays.
Our multi-omics re-analysis identified extensive alternative splicing and transcriptional reprogramming during GIC evolution, pinpointing SLC1A3 as a core gene significantly upregulated along the malignant pseudotime trajectory and strongly correlated with poor clinical prognosis in GBM. AI-driven in silico virtual knockout utilizing Geneformer revealed that SLC1A3 acts as a master regulator of tumor network stability. Interactome analysis demonstrated that SLC1A3^hi tumor cells exhibit intensive communication with endothelial cells via specific ligand-receptor axes (e.g., TNC-ITGB1, PTN-SDC3). In vitro assays confirmed that endothelial cells were educated by SLC1A3^hi tumor cells that undergo malignant transition, drastically upregulating immune-suppressive factors, including CD274, TGFB1, IL10, and IDO1. Crucially, tumor-specific knockdown of SLC1A3 dismantled this vascular-immune suppressive niche, significantly restoring T cell activation in a multicellular co-culture model.
Our findings establish SLC1A3 not merely as an intrinsic driver of glioma development, but as a critical upstream node orchestrating a cascading tumor-endothelial-T cell immunosuppressive axis. By leveraging AI-based foundation models alongside robust biological validation, we uncovered a novel mechanism of vascular-mediated immune evasion, highlighting SLC1A3 as a highly promising therapeutic target to reprogram the glioblastoma microenvironment and restore anti-tumor immunity.
Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
🔥 Citations:
0
Abstract: As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
🔥 Citations:
0
Abstract: Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.
EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
🔥 Citations:
0
Abstract: Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.
Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models
🔥 Citations:
0
Abstract: This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., Shannon, 1950) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.
Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching
🔥 Citations:
0
Abstract: Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
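The retrieve-then-encode split described above can be illustrated generically: rank EHR segments against an eligibility criterion and keep only the top matches before any expensive encoding. The sketch below uses bag-of-words cosine similarity as a stand-in for the paper's retriever (which the abstract does not specify), and the example records and criterion are invented:

```python
import numpy as np

def bow_vectors(texts, vocab):
    """Bag-of-words count vectors over a fixed vocabulary."""
    v = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                v[i, vocab[w]] += 1.0
    return v

def top_k_segments(criterion, segments, k=2):
    """Rank EHR segments by cosine similarity to an eligibility criterion."""
    words = sorted({w for t in [criterion] + segments for w in t.lower().split()})
    vocab = {w: i for i, w in enumerate(words)}
    q = bow_vectors([criterion], vocab)[0]
    s = bow_vectors(segments, vocab)
    sims = s @ q / (np.linalg.norm(s, axis=1) * np.linalg.norm(q) + 1e-9)
    return [segments[i] for i in np.argsort(-sims)[:k]]

# Hypothetical EHR segments; only the relevant ones should survive retrieval.
ehr = [
    "patient denies chest pain",
    "hba1c 8.2 percent consistent with type 2 diabetes",
    "history of type 2 diabetes on metformin",
    "knee arthroscopy in 2015",
]
print(top_k_segments("inclusion: type 2 diabetes", ehr, k=2))
```

In the paper's pipeline the selected segments would then be encoded by a (frozen or fine-tuned) LLM and fed to a lightweight classifier; the point of the sketch is only the input-reduction step.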
GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
🔥 Citations:
0
Abstract: Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50–95% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer's features, while simultaneously using the current layer's features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
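The $O(N^2)$ bottleneck and the one-layer-lookahead idea can be sketched in a few lines of NumPy. This is an illustrative toy (random features and a simplified max-relative-style convolution), not the paper's accelerator code:

```python
import numpy as np

def knn_graph(x: np.ndarray, k: int) -> np.ndarray:
    """Build a kNN graph over N patch features: O(N^2) pairwise distances."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # (N, N) squared dists
    np.fill_diagonal(d2, np.inf)                          # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]                  # (N, k) neighbor ids

def max_relative_conv(x: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Toy message passing: concat each feature with its max neighbor difference."""
    diff = x[idx] - x[:, None, :]                         # (N, k, C)
    return np.concatenate([x, diff.max(axis=1)], axis=-1) # (N, 2C)

# Baseline ViG: each layer builds its graph from the *current* features, so
# graph construction and message passing are strictly sequential.
# GraphLeap-style lookahead: update layer L with the graph built from layer
# L-1's features, while the graph for layer L+1 is built from layer L's
# features -- the two steps no longer depend on each other and can overlap.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4))
g_prev = knn_graph(feats, k=3)          # "built during the previous layer"
out = max_relative_conv(feats, g_prev)  # feature update using the old graph
g_next = knn_graph(feats, k=3)          # built concurrently for the next layer
print(out.shape)  # (8, 8)
```

On hardware the two calls map to the paper's separate kNN and feature-update engines running in a pipeline; the toy only shows why dropping the dependency makes that overlap legal.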
Unlocking the Power of Large Language Models for Multi-table Entity Matching
🔥 Citations:
0
Abstract: Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at https://github.com/Ymeki/LLM4MEM.
Exploring the application of large language models in coding the experiencing scale (EXP)
🔥 Citations:
0
Abstract: Psychotherapy process measures like the Experiencing Scale (EXP) offer valuable insight into clinical interactions but are time-intensive to code. Large language models (LLMs) like ChatGPT have the potential to streamline this process, but empirical validation is nascent. This exploratory study aimed to provide a proof-of-concept coding the EXP using ChatGPT with special attention to ethical considerations, limitations, and future directions. ChatGPT was used to code 79 psychotherapy transcripts drawn from the EXP manual. Multiple models of ChatGPT were tested using varied few-shot learning prompt engineering protocols. Data collection occurred in three phases, during which models rated both modal and peak EXP scores for all transcripts. ChatGPT demonstrated moderate agreement with manual reference ratings. An efficient configuration (o3-mini, 5-shot prompting) yielded moderate reliability for both modal EXP scores (ICC[3,1] = .67, 95% CI [.53, .79]) and peak EXP scores (ICC[3,1] = .71, 95% CI [.58, .81]). LLMs may feasibly augment or replace human EXP coders under certain conditions. However, evidence is preliminary and ethical and technical limitations remain. Future research should validate the present methodology using out-of-manual data, assess potential pretraining exposure, and explore locally hosted LLM applications to mitigate privacy concerns.
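The reported ICC(3,1) is the Shrout–Fleiss two-way mixed-effects, consistency, single-rater coefficient. A minimal numpy implementation, with made-up ratings for illustration:

```python
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater
    (Shrout & Fleiss). `ratings` is an (n_targets, k_raters) array."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                               # mean square, rows
    mse = ss_err / ((n - 1) * (k - 1))                    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse)

# Perfectly consistent raters (second rater offset by a constant) give ICC = 1,
# because consistency ICC ignores systematic rater offsets.
perfect = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
```

This is the same statistic conventionally written ICC[3,1] in the abstract.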
Low-Rank Adaptation Redux for Large Models
🔥 Citations:
0
Abstract: Low-rank adaptation (LoRA) has emerged as the de facto standard for parameter-efficient fine-tuning (PEFT) of foundation models, enabling the adaptation of billion-parameter networks with minimal computational and memory overhead. Despite its empirical success and rapid proliferation of variants, it remains elusive which architectural choices, optimization techniques, and deployment constraints should guide practical method selection. This overview revisits LoRA through the lens of signal processing (SP), bridging modern adapter designs with classical low-rank modeling tools and inverse problems, as well as highlighting how SP principles can inform principled advances of fine-tuning approaches. Rather than providing a comprehensive enumeration and empirical comparisons of LoRA variants, emphasis is placed on the technical mechanisms underpinning these approaches to justify their effectiveness. These advances are categorized into three complementary axes: architectural design, efficient optimization, and pertinent applications. The first axis builds on singular value decomposition (SVD)-based factorization, rank-augmentation constructions, and cross-layer tensorization, while the second axis deals with initialization, alternating solvers, gauge-invariant optimization, and parameterization-aware methods. Beyond fine-tuning, emerging applications of LoRA are accounted across the entire lifecycle of large models, ranging from pre- and post-training to serving/deployment. Finally, open research directions are outlined at the confluence of SP and deep learning to catalyze a bidirectional frontier: classical SP tools provide a principled vocabulary for designing principled PEFT methods, while the unique challenges facing modern deep learning, especially the overwhelming scale and prohibitive overhead, also offer new research lines benefiting the SP community in return.
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
🔥 Citations:
0
Abstract: Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
🔥 Citations:
0
Abstract: Software requirement ambiguity is ubiquitous in real-world development, stemming from the inherent imprecision of natural language and the varying interpretations of stakeholders. While Large Language Models (LLMs) have demonstrated impressive capabilities in generating code from precise specifications, such ambiguity poses a significant obstacle to reliable automated code generation. Existing benchmarks typically assume clear and unambiguous requirements, leaving an empirical gap in understanding how LLMs behave when faced with the inherent uncertainty of real-world software requirements. In this paper, we introduce Orchid, the first code generation benchmark specifically designed with ambiguous requirements. It comprises 1,304 function-level tasks covering four distinct types of ambiguity: lexical, syntactic, semantic, and vagueness. Leveraging this dataset, we conduct the first systematic empirical study to evaluate the impact of requirement ambiguity on LLM-based code generation. Our results demonstrate that ambiguity consistently degrades the performance of all evaluated LLMs, with the most pronounced negative effects observed in highly advanced models. Furthermore, we observe that LLMs frequently produce functionally divergent implementations for the same ambiguous requirement and lack the capability to identify or resolve such ambiguity autonomously. These findings reveal a significant performance gap between clear and ambiguous requirements, underscoring the urgent need for ambiguity-aware techniques in the next generation of automated software engineering tools. The Orchid benchmark is publicly available at https://huggingface.co/datasets/SII-YDD/Orchid.
Call-Chain-Aware LLM-Based Test Generation for Java Projects
🔥 Citations:
0
Abstract: Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller–callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.
Optimizing Automated Essay Scoring with Lightweight Large Language Models and Validated Rubrics
🔥 Citations:
0
Abstract: Manual grading of English as a Foreign Language (EFL) essays often leads to inconsistent scores among educators, despite the use of rubrics. While traditional Automated Essay Scoring (AES) systems offer speed, they often fail due to high computational cost, reliance on extensive datasets, and an inability to capture holistic writing qualities such as creativity and humanistic expression. This study addresses these issues by introducing AESCORE, a novel, lightweight, and cost-effective AES framework. Our methodology centers on integrating validated rubric criteria (identified via VOSviewer analysis) with open-source Large Language Models (LLMs), specifically emphasizing a human-centered approach. We evaluated AESCORE across 100 EFL essays using several prompting techniques, including few-shot and multi-trait specialization. The system achieved its most robust performance and high scoring consistency (Quadratic Weighted Kappa, QWK = 0.6660) using the DeepSeek-R1 8B LLM with few-shot prompting. AESCORE represents a significant contribution by demonstrating that sophisticated, pedagogically-aligned writing assessment and generative feedback can be achieved with accessible AI, offering a reliable alternative for improving productive writing skills in higher education.
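The QWK statistic used to report scoring consistency has a standard closed form and can be computed directly. The toy score vectors below are made up for illustration:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer score vectors
    with values in [0, n_classes)."""
    a, b = np.asarray(a), np.asarray(b)
    O = np.zeros((n_classes, n_classes))      # observed agreement matrix
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix from the outer product of the marginal histograms.
    E = np.outer(np.bincount(a, minlength=n_classes),
                 np.bincount(b, minlength=n_classes)) / len(a)
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1; chance-level agreement yields 0; values like the paper's 0.6660 fall in the conventional "substantial agreement" range.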
TabSHAP
🔥 Citations:
0
Abstract: Large Language Models (LLMs) fine-tuned on serialized tabular data are emerging as powerful alternatives to traditional tree-based models, particularly for heterogeneous or context-rich datasets. However, their deployment in high-stakes domains is hindered by a lack of faithful interpretability; existing methods often rely on global linear proxies or scalar probability shifts that fail to capture the model's full probabilistic uncertainty. In this work, we introduce TabSHAP, a model-agnostic interpretability framework designed to directly attribute local query decision logic in LLM-based tabular classifiers. By adapting a Shapley-style sampled-coalition estimator with Jensen-Shannon divergence between full-input and masked-input class distributions, TabSHAP quantifies the distributional impact of each feature rather than simple prediction flips. To align with tabular semantics, we mask at the level of serialized key:value fields (atomic in the prompt string), not individual subword tokens. Experimental validation on the Adult Income and Heart Disease benchmarks demonstrates that TabSHAP isolates critical diagnostic features, achieving significantly higher faithfulness than random baselines and XGBoost proxies. We further run a distance-metric ablation on the same test instances and TabSHAP settings: attributions are recomputed with KL or L1 replacing JSD in the similarity step (results cached per metric), and we compare deletion faithfulness across all three.
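The distribution-level attribution idea can be sketched end to end. This is a toy: `toy_predict`, the field names, and the coalition sampler are invented for illustration; only the mechanism of measuring Jensen–Shannon divergence between full-input and masked-input class distributions follows the abstract.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * np.log(a / b)).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_attributions(predict, fields, n_samples=200, seed=0):
    """Sampled-coalition, Shapley-style attribution: a field's score is the
    mean JSD shift caused by masking it on top of a random coalition of
    other fields. `predict` maps a dict of key:value fields to a class
    distribution; masked fields are simply dropped from the dict."""
    rng = np.random.default_rng(seed)
    keys = list(fields)
    scores = {k: 0.0 for k in keys}
    for _ in range(n_samples):
        for k in keys:
            others = [o for o in keys if o != k]
            coalition = [o for o in others if rng.random() < 0.5]
            with_k = {f: fields[f] for f in coalition + [k]}
            without_k = {f: fields[f] for f in coalition}
            scores[k] += js_divergence(predict(with_k), predict(without_k))
    return {k: v / n_samples for k, v in scores.items()}

# Toy "classifier": only the `age` field moves the class distribution.
def toy_predict(fields):
    p_pos = 0.9 if fields.get("age", 0) > 50 else 0.2
    return np.array([1 - p_pos, p_pos])

attr = jsd_attributions(toy_predict, {"age": 62, "zip": 10001, "name": 3})
```

In the paper's setting, `predict` would be an LLM scoring a serialized row, and masking would drop whole key:value fields from the prompt string rather than subword tokens.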
Graph Neural Network-Based Spatio-Temporal Feature Modeling and Wave Height Reconstruction for Distributed Pressure Sensor Wave Measurement Signals
DOI:
10.3390/app16094073
🔥 Citations:
0
Abstract: Accurate measurement of ocean wave parameters is paramount for offshore engineering design and marine environmental monitoring. Distributed pressure sensing technology provides a robust data foundation for analyzing the spatio-temporal characteristics of wave fields through synchronized observations at multiple stations. However, multi-sensor data exhibit high-dimensional spatio-temporal coupling, posing significant challenges for traditional single-point signal processing methods in capturing the topological associations between measurement sites. To address these limitations, this study develops a framework for spatio-temporal feature modeling and wave height reconstruction based on Graph Neural Networks (GNNs). The proposed framework integrates the spatial configuration of sensor arrays with graph-theoretic topological representations. By fusing geometric distances and signal correlations, an adaptive adjacency matrix is constructed to establish a dynamically adjustable graph structure. On the feature extraction level, a spatio-temporal fusion method combining multi-scale graph convolutions and gated temporal modeling is proposed. The experimental results obtained on the Blancs Sablons Bay multi-sensor dataset demonstrate that the proposed method significantly outperforms traditional approaches, achieving lower prediction errors and validating the effectiveness of graph-structured modeling in distributed wave sensing.
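The abstract's fusion of geometric distances and signal correlations into an adaptive adjacency matrix can be sketched as a convex combination of two kernels. The Gaussian distance kernel and the `beta`/`sigma` parameters are illustrative choices, not the paper's construction:

```python
import numpy as np

def adaptive_adjacency(coords, signals, beta=0.5, sigma=1.0):
    """Fuse geometric distance and signal correlation into one adjacency:
    A = beta * exp(-d^2 / sigma^2) + (1 - beta) * |corrcoef|, zero diagonal.
    `coords` is (n_sensors, 2); `signals` is (n_sensors, n_timesteps)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    geo = np.exp(-d2 / sigma ** 2)            # nearby sensors couple strongly
    corr = np.abs(np.corrcoef(signals))       # co-varying signals couple too
    A = beta * geo + (1 - beta) * corr
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(1)
coords = rng.uniform(0, 5, size=(6, 2))       # hypothetical sensor positions
signals = rng.standard_normal((6, 200))       # hypothetical pressure series
A = adaptive_adjacency(coords, signals)
```

A matrix of this form would then feed the graph convolutions; making `beta` learnable is one way such an adjacency becomes "dynamically adjustable."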
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
🔥 Citations:
0
Abstract: Video Large Language Models (Video LLMs) incur high inference latency due to a large number of visual tokens provided to LLMs. To address this, training-free visual token pruning has emerged as a solution to reduce computational costs; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens (semantically uninformative tokens that attract excessive attention) as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method that introduces a sink score to quantify each token's tendency to behave as a sink and applies this score to existing spatial and temporal pruning methods to suppress sink tokens, thereby enhancing video understanding. To validate the effectiveness of SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
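The abstract does not give SToP's actual sink-score formula, but the idea admits a minimal illustrative proxy: score each token by the attention mass it receives relative to the norm of its feature vector, so tokens that attract attention while carrying little content score high and get demoted before pruning. Everything below (`sink_scores`, `sink_aware_keep`, the toy attention matrix) is a hypothetical sketch, not the paper's method.

```python
import numpy as np

def sink_scores(attn, feats, eps=1e-8):
    """Illustrative sink score: attention received by each token divided by
    its feature norm. `attn` is (n_tokens, n_tokens) row-stochastic;
    `feats` is (n_tokens, dim)."""
    received = attn.sum(axis=0)                 # column sums: attention received
    informativeness = np.linalg.norm(feats, axis=1)
    return received / (informativeness + eps)

def sink_aware_keep(attn, feats, keep_ratio=0.5):
    """Rank tokens by received attention, but demote likely sinks first."""
    s = sink_scores(attn, feats)
    saliency = attn.sum(axis=0) - s             # suppress sink-like tokens
    k = max(1, int(len(saliency) * keep_ratio))
    return np.sort(np.argsort(saliency)[::-1][:k])

# Token 0 is a sink: it receives most attention but has a near-zero feature.
attn = np.array([[0.7, 0.1, 0.1, 0.1]] * 4)
feats = np.array([[0.01, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
kept = sink_aware_keep(attn, feats, keep_ratio=0.5)
```

A plain attention-based pruner would keep token 0 (it receives the most attention); the sink-aware ranking drops it, which is the failure mode SToP targets.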
MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
🔥 Citations:
0
Abstract: Aligning large language models (LLMs) to desirable human values requires balancing multiple, potentially conflicting objectives such as helpfulness, truthfulness, and harmlessness, which presents a multi-objective optimisation challenge. Most alignment pipelines rely on a fixed scalarisation of these objectives, which can introduce procedural unfairness by systematically under-weighting harder-to-optimise or minority objectives. To promote more equitable trade-offs, we introduce MGDA-Decoupled, a geometry-based multi-objective optimisation algorithm that finds a shared descent direction while explicitly accounting for each objective's convergence dynamics. In contrast to prior methods that depend on reinforcement learning (e.g., GAPO) or explicit reward models (e.g., MODPO), our approach operates entirely within the lightweight Direct Preference Optimisation (DPO) paradigm. Experiments on the UltraFeedback dataset show that geometry-aware methods, and MGDA-Decoupled in particular, achieve the highest win rates against golden responses, both overall and per objective.
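The shared descent direction at the heart of MGDA-style methods has a closed form in the two-objective case (the Sener–Koltun min-norm point). This sketches vanilla MGDA, not the paper's decoupled variant:

```python
import numpy as np

def mgda_direction(g1, g2):
    """Min-norm common descent direction for two objective gradients:
    find alpha in [0, 1] minimising ||alpha*g1 + (1-alpha)*g2||, then
    return (alpha, combined direction). The result has a non-negative
    inner product with both gradients."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0:                     # identical gradients: any mix works
        alpha = 0.5
    else:
        alpha = np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0)
    return alpha, alpha * g1 + (1 - alpha) * g2

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
alpha, d = mgda_direction(g1, g2)      # orthogonal objectives -> equal mix
```

Stepping along `d` decreases (to first order) both objectives at once, which is the geometric alternative to a fixed scalarisation that the abstract argues for.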
The Effect of Idea Elaboration on the Automatic Assessment of Idea Originality
🔥 Citations:
0
Abstract: Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias. Accordingly, automatic systems tend to prefer outcomes that are more closely related to their style, rather than to the human one. In this paper, we investigated how Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task produced by higher and lower creative humans and ChatGPT-4o. Human raters were two university students who underwent intensive training. Machine raters were two specialised systems fine-tuned on AUT responses and corresponding human ratings (OCSAI and CLAUS) and ChatGPT-4o, which was prompted with the same instructions as human raters. Results confirmed the presence of a self-preference bias in LLMs. Automatic systems tended to privilege artificial responses. However, this self-preference bias disappeared when the analyses controlled for the idea elaboration. We discuss theoretical and methodological implications of these findings by highlighting future directions for research on creativity assessment.
Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
From Non-Maleficence to Beneficence: Expanded Ethical Computing in the Era of Large Language Models
DOI:
10.3390/soc16050134
🔥 Citations:
0
Abstract: As modern society grows increasingly complex, access to essential services such as healthcare, legal aid, tailored education, and psychological support remains heavily gated by socio-economic, neurological, and systemic barriers. This paper explores the transformative potential of Large Language Models (LLMs) and Generative Artificial Intelligence not merely as industrial productivity enhancers, but as vital “social scaffolds” capable of fostering a more inclusive society. Crucially, we propose a paradigm shift in the concept of Ethical Computing—moving from a passive defensive framework of non-maleficence (“do no harm”) to an active mandate of beneficence, where AI systems are explicitly developed to serve marginalized and un(der)served populations. Through this expanded ethical lens, we systematically analyze the democratizing impact of AI across four primary axes of inclusivity: socio-economic (providing zero-cost medical triage and legal translation for undocumented populations), neurospicy (acting as a non-judgmental communicative bridge for individuals with Autism Spectrum Disorder), pedagogical (delivering hyper-personalized executive function support for Special Educational Needs), and psychological (serving as an accessible, first-level triage system for mental health crises). By framing LLMs as a modern social safety net, we outline a clear trajectory for future research, advocating for an “ethical-by-design” development paradigm that explicitly prioritizes equity, accessibility, and the active dismantling of historical barriers for the digitally and socially disenfranchised.
LC–MS Profiling, In Silico Docking–MD and ADMET of Uncaria gambir Roxb. for p38 MAPK Inhibition
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
🔥 Citations:
0
Abstract: Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized via GRPO. Experiments on BFCL/ACEBench show R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) with positive Average CoT Effectiveness (0.05 for Llama3.2-3B), enhancing both function-calling accuracy and interpretability for reliable tool-augmented LLM deployment.
Large language model-based paper classification framework with key-insight extraction and confidence-weighted voting
🔥 Citations:
0
Abstract: Systematic reviews (SRs) are critical for evidence-based research but are time-consuming and labor-intensive. The rapid expansion of academic publications further challenges the performance and applicability of existing screening and classification methods. While large language models (LLMs) present new opportunities for automation, limited research has examined whether they can achieve classification performance comparable to human reviewers in large-scale, multi-class settings. To improve classification performance, we propose an LLM-based framework that leverages full-text key-insight extraction to enhance literature classification. We constructed a manually curated dataset of 900 articles from 17 published SRs to quantitatively evaluate the classification capabilities of LLMs. Empirical results showed that key-insight-based classification (KBC) significantly outperforms abstract-based classification (ABC). We further implemented a confidence-weighted voting (CWV) mechanism over multiple LLMs to improve robustness; CWV achieved the highest macro F1-score of 0.796, substantially exceeding KBC (0.732), ABC (0.676), and unsupervised K-means clustering (0.446). By employing zero-shot LLMs, the approach adapts across diverse domains and classification tasks without fine-tuning, demonstrating that a carefully designed pipeline can enable LLMs to achieve classification performance comparable to human reviewers. These results provide empirical evidence of LLMs' potential to support large-scale SRs and introduce a practical pathway for improving efficiency and reliability in evidence synthesis.
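The confidence-weighted voting step can be illustrated in a few lines. The aggregation rule shown here, summing per-label confidences across models, is one plausible reading of CWV rather than the paper's exact formula, and the labels and confidences are made up:

```python
def confidence_weighted_vote(predictions):
    """Aggregate (label, confidence) pairs from several LLMs: each model's
    confidence is added to its predicted label's tally, and the label with
    the largest tally wins."""
    tally = {}
    for label, conf in predictions:
        tally[label] = tally.get(label, 0.0) + conf
    return max(tally, key=tally.get)

# Two lower-confidence votes for "include" outweigh one confident "exclude".
winner = confidence_weighted_vote(
    [("include", 0.6), ("include", 0.55), ("exclude", 0.9)]
)
```

With unweighted majority voting the same three votes would also pick "include", but the weighted tally changes the outcome whenever confidences are skewed the other way.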
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
🔥 Citations:
0
Abstract: Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.
Concordia: Spatial Domain Detection via Augmented Graphs for Population-Level Spatial Proteomics
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
F²LP-AP: Fast & Flexible Label Propagation with Adaptive Propagation Kernel
🔥 Citations:
0
Abstract: Semi-supervised node classification is a foundational task in graph machine learning, yet state-of-the-art Graph Neural Networks (GNNs) are hindered by significant computational overhead and reliance on strong homophily assumptions. Traditional GNNs require expensive iterative training and multi-layer message passing, while existing training-free methods, such as Label Propagation, lack adaptability to heterophilous graph structures. This paper presents F²LP-AP (Fast and Flexible Label Propagation with Adaptive Propagation Kernel), a training-free, computationally efficient framework that adapts to local graph topology. Our method constructs robust class prototypes via the geometric median and dynamically adjusts propagation parameters based on the Local Clustering Coefficient (LCC), enabling effective modeling of both homophilous and heterophilous graphs without gradient-based training. Extensive experiments across diverse benchmark datasets demonstrate that F²LP-AP achieves competitive or superior accuracy compared to trained GNNs, while significantly outperforming existing baselines in computational efficiency.
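Two ingredients the abstract names, geometric-median class prototypes and the local clustering coefficient, are easy to sketch in numpy. These are illustrative implementations of the standard definitions; the propagation-kernel wiring itself is omitted:

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-9):
    """Weiszfeld iteration for the geometric median of the rows of X,
    a robust alternative to the mean for class prototypes."""
    m = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
        m = (w[:, None] * X).sum(axis=0) / w.sum()
    return m

def local_clustering(A):
    """Local clustering coefficient per node from a binary adjacency matrix:
    triangles through i divided by the number of neighbour pairs of i."""
    deg = A.sum(axis=1)
    tri = np.diag(A @ A @ A) / 2.0            # closed length-3 walks / 2
    pairs = deg * (deg - 1) / 2.0
    return np.where(pairs > 0, tri / np.maximum(pairs, 1), 0.0)
```

The geometric median resists label-noise outliers that would drag a mean prototype away, and the LCC gives a per-node homophily proxy a propagation rule can condition on.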
Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction
🔥 Citations:
0
Abstract: At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
🔥 Citations:
0
Abstract: Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches, Universal Steering (US) and Representation Engineering (RepE), we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
🔥 Citations:
0
Abstract: Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard samples and filter out noisy false negatives. 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model post-attribute integration serves as a reward signal to enhance MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
🔥 Citations:
0
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
Supplement Generation Training for Enhancing Agentic Task Performance
🔥 Citations:
0
Abstract: Training large foundation models for agentic tasks is increasingly impractical due to the high computational costs, long iteration cycles, and rapid obsolescence as new models are continuously released. Instead of post-training massive models for every new task or domain, we propose Supplement Generation Training (SGT), a more efficient and sustainable strategy. SGT trains a smaller LLM to generate useful supplemental text that, when appended to the original input, helps the larger LLM solve the task more effectively. These lightweight models can dynamically adapt supplements to task requirements, improving performance without modifying the underlying large models. This approach decouples task-specific optimization from large foundation models and enables more flexible, cost-effective deployment of LLM-powered agents in real-world applications.
ChatGPTeachers
🔥 引用:
0
Abstract: The emergence of AI, particularly Generative AI (GenAI), has introduced powerful new actors into learning networks, creating an urgent need for research on how these technologies may be mediating connections between teachers, learners, and other digital technologies and institutional infrastructures. While GenAI extends connectivity across these networks, it also regenerates and reconfigures the conditions of teaching, learning, and professional practices. As a result, questions of professional ethics, teacher identity and agency are becoming central to understanding how teachers’ work is adapting and changing in the context of AI-mediated learning environments. This paper reports on a six-month study in which 20 K-12 teachers in a Canadian urban school division were provided access to ChatGPT Teams. Teachers participated in phenomenologically oriented interviews at the beginning and end of this period. From these accounts, Lived Experience Descriptions (LEDs) were developed to capture moments of tension, hesitation, curiosity and ethical reflection. Through these teacher-ChatGPT anecdotes, we examine how GenAI provokes technoethical tensions in teachers’ everyday work and learning lives, and, in the process, reshapes their planning, communication, and professional identity. Our analysis reveals that teachers encountered manifold technoethical dilemmas when interacting with ChatGPT. Their concerns extended beyond the appropriate use of the LLM to existential questions about their integrity and identity as teachers when using ChatGPT. Teachers experienced ambivalence when using ChatGPT to assist with professional communication or when imagining what it would be like to incorporate it into their pedagogy. Such anecdotes reveal the importance of professional learning that supports ethical discernment in networked learning contexts. 
Here, we suggest that the Technoethical Framework for Teachers (TEFT) can strengthen teachers’ ethical attunement and agency as they engage with GenAI in their practice. The paper contributes to networked learning scholarship by situating teachers’ encounters with GenAI within relational and ethical ecologies rather than adoption or policy frameworks. It offers implications for NL researchers, teachers reflecting on the ethical implications of their Technological-Pedagogical-Content Knowledge (TPACK) in light of GenAI, and school leaders developing professional learning and policy around the critically informed and responsible use of Large Language Models (LLMs) in schools. Prior studies have described how teachers engage with GenAI; this paper shows how those engagements are ethically charged and ontologically mediated. By framing teachers’ co-constitutive relationships with ChatGPT as technoethical relations of attunement, we offer both a theoretical vocabulary and an empirical method for studying GenAI in education through lived experience and a new conceptual lens.
Spectral Embeddings Leak Graph Topology: Theory, Benchmark, and Adaptive Reconstruction
🔥 引用:
0
Abstract: Graph Neural Networks (GNNs) excel on relational data, but standard benchmarks unrealistically assume the graph is centrally available. In practice, settings such as Federated Graph Learning, distributed systems, and privacy-sensitive applications involve graph data that are localized, fragmented, noisy, and privacy-leaking. We present a unified framework for this setting. We introduce LoGraB (Local Graph Benchmark), which decomposes standard datasets into fragmented benchmarks using three strategies and four controls: neighborhood radius $d$, spectral quality $k$, noise level $\sigma$, and coverage ratio $p$. LoGraB supports graph reconstruction, localized node classification, and inter-fragment link prediction, with Island Cohesion. We propose AFR (Adaptive Fidelity-driven Reconstruction), a method for noisy spectral fragments. AFR scores patch quality via a fidelity measure combining a gap-to-truncation stability ratio and structural entropy, then assembles fragments using RANSAC-Procrustes alignment, adaptive stitching, and Bundle Adjustment. Rather than forcing a single global graph, AFR recovers large faithful islands. We prove heat-kernel edge recovery under a separation condition, Davis--Kahan perturbation stability, and bounded alignment error. We establish a Spectral Leakage Proposition: under a spectral-gap assumption, polynomial-time Bayesian recovery is feasible once enough eigenvectors are shared, complementing AFR's deterministic guarantees. Experiments on nine benchmarks show that LoGraB reveals model strengths and weaknesses under fragmentation, AFR achieves the best F1 on 7/9 datasets, and under per-embedding $(\epsilon,\delta)$-Gaussian differential privacy, AFR retains 75% of its undefended F1 at $\epsilon=2$. Our anonymous code is available at https://anonymous.4open.science/r/JMLR_submission
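The RANSAC-Procrustes alignment step AFR uses to assemble fragments rests on the classical orthogonal Procrustes solution. A minimal sketch over shared anchor nodes (not the paper's full RANSAC pipeline) looks like this:

```python
import numpy as np

def procrustes_align(source, target):
    """Orthogonal Procrustes: find the rotation R minimizing
    ||source @ R - target||_F, given (n, k) embedding fragments
    that share n anchor nodes."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt  # optimal orthogonal map

# Sanity check: a fragment rotated by a random orthogonal matrix
# is recovered exactly by the Procrustes solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # "true" spectral coordinates
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # unknown rotation
R = procrustes_align(X @ Q, X)        # recovers Q's inverse
```

In AFR this alignment is applied only to fragment pairs that pass the fidelity score, and RANSAC is used to reject anchor outliers before solving.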
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
🔥 引用:
0
Abstract: Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
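The cache-then-validate loop at the core of HaS (accept a draft when the incoming query is identified as a homologous re-encounter of a previously observed one, else fall back to full-database retrieval) might be sketched as below. The cosine threshold and class interface are illustrative assumptions; the paper's re-identification model is more sophisticated than a similarity cutoff:

```python
import numpy as np

class SpeculativeCache:
    """Toy homology-aware cache: reuse the documents retrieved for a
    previously observed query when a new query embedding is close
    enough (cosine); otherwise invoke the slow full-database retriever."""

    def __init__(self, retriever, threshold=0.9):
        self.retriever = retriever        # callable: embedding -> docs
        self.threshold = threshold        # hypothetical homology cutoff
        self.keys, self.values = [], []

    def retrieve(self, q):
        q = q / np.linalg.norm(q)
        for k, docs in zip(self.keys, self.values):
            if float(q @ k) >= self.threshold:   # homologous re-encounter
                return docs, "draft-accepted"
        docs = self.retriever(q)                  # slow path
        self.keys.append(q)
        self.values.append(docs)
        return docs, "full-retrieval"
```

Under the popularity patterns the abstract describes, most queries hit the fast path, which is where the reported latency reductions come from.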
Break the Optimization Barrier of LLM-Enhanced Recommenders: A Theoretical Analysis and Practical Framework
🔥 引用:
0
Abstract: Large language model (LLM)-enhanced recommendation models inject LLM representations into backbone recommenders to exploit rich item text without inference-time LLM cost. However, we find that existing LLM-enhanced methods significantly hinder the optimization of backbone models, resulting in high training losses that are difficult to reduce. To address it, we establish a comprehensive theoretical analysis of local optimization curvature and identify two key causes: 1) large norm disparity and 2) semantic-collaboration misaligned angular clustering of LLM representations. Guided by these insights, we propose Training-Friendly LLM-Enhanced Recommender (TF-LLMER), a lightweight framework with two key components. First, we highlight the necessity of item embedding normalization to eliminate norm-driven instability and achieve provable control over optimization conditioning. Second, we introduce Rec-PCA, a recommendation-aware dimensionality reduction method that injects collaborative structure into the representation transformation to resolve semantic-collaboration misaligned angular clustering. It jointly optimizes semantic information retention and alignment with an item-item co-occurrence graph constructed from interaction histories. The graph captures collaborative structure, and alignment is promoted by penalizing total variation over the graph. Both theory and extensive experiments demonstrate that TF-LLMER significantly outperforms state-of-the-art methods. Our code is available at https://github.com/woriazzc/TF-LLMER.
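The first component, item embedding normalization to eliminate norm-driven instability, is straightforward to illustrate. This is a generic L2-normalization sketch, not the paper's code:

```python
import numpy as np

def normalize_items(E):
    """L2-normalize LLM item embeddings row-wise, removing the norm
    disparity that the paper identifies as a cause of poor optimization
    conditioning in the backbone recommender."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)
```

After this step every item vector lies on the unit sphere, so gradients depend only on angular structure, which Rec-PCA then reshapes using the item-item co-occurrence graph.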
Systematic Benchmarking of Kinase Bioactivity Models Across Splitting Strategies and Protein Representations
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Dating and Relationships under a Bluesky
🔥 引用:
0
Abstract: Social media platforms increasingly play a central role in how individuals acquire knowledge, form expectations, and negotiate practices surrounding intimate relationships. Rather than relying primarily on formal education or interpersonal networks, adolescents and young adults often prefer digital sources when seeking advice and information about sex and relationships. Platforms such as X, Instagram, and BlueSky provide perceived advantages of anonymity, autonomy, and accessibility, enabling users to engage in the de-identified sharing of sensitive concerns. In doing so, they contribute to the ongoing transformation of “romantic media ideologies,” understood as the taken-for-granted assumptions and discourses that shape how relationships are understood, evaluated, and experienced in contemporary societies. To date, much of the scholarship examining these dynamics has relied on qualitative methods such as interviews, ethnography, or discourse analysis. While such approaches offer valuable insights, they are limited in scope when considering the scale and complexity of digital interaction. Educational Data Science (EDS) offers a promising methodological framework to address this gap. By integrating Learning Analytics (LA) and Educational Data Mining (EDM), EDS provides robust tools for examining informal knowledge acquisition and meaning-making in online spaces. In this study, we apply three specific methods—social network analysis, natural language processing, and large language model annotation/classification—to analyze dating and relationship discourse on BlueSky, thereby extending EDS into a largely unexplored domain. Another dimension concerns the role of social media influencers as emerging “new experts” in the domain of dating and relationships. These figures act as knowledge brokers: they are perceived as credible and authentic, often combining entertainment with didactic or advisory content. 
At the same time, their interventions are not unproblematic, as advice can be contradictory, selective, and shaped by platform-specific logics of visibility, brand-building, and monetization. Our analyses identify three archetypal user groups—Knowledge-based Authority, Commercial Positioning, and Experiential Discourse. Each group demonstrates distinct communication styles, with varying degrees of formalized expertise and credibility claims. While all engage in advice-giving, their thematic emphases diverge, ranging from wellness and self-care to promotional strategies and lived relational experiences. By mapping these dynamics, the study contributes to scholarly debates on the evolution of romantic media ideologies and the social construction of expertise in digital discourse.
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
🔥 引用:
0
Abstract: The field of Language Reasoning Models (LRMs) has been very active in recent years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to generating correct answers, is largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inferences. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge-and-reasoning benchmarks (MMLU and GPQA). We achieve 20-50% token reduction while maintaining accuracy comparable to standard generation.
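An interpretable early-stopping criterion of the kind described (stop once the recent steps are dominated by verification and reflection after an answer has appeared) could look like the following toy rule; the tag names and window size are illustrative assumptions, not the TRACES tag set:

```python
def should_stop(step_tags, window=4):
    """Toy early-stopping rule in the spirit of TRACES: once an ANSWER
    step has been emitted and the last `window` tagged steps contain
    only verification/reflection behavior, further generation is
    unlikely to change the answer, so generation is cut off."""
    if "ANSWER" not in step_tags:
        return False
    tail = step_tags[-window:]
    return all(tag in {"VERIFY", "REFLECT"} for tag in tail)
```

A real-time tagger classifies each step as it is generated, and the monitor applies this check after every step, trading a small tagging cost for a large reduction in generated tokens.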
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
🔥 引用:
0
Abstract: Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity for multi-modal diagnosis remains largely unexamined. Current benchmarks, which mostly rely on single-modality data, cannot evaluate the progressive reasoning and cross-modal integration essential for clinical practice. We introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation of MLLMs through a complete ophthalmology diagnostic workflow, with two reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Evaluation of 21 MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. Dataset and code: https://github.com/CVI-SZU/X-PCR.
Learning Reasoning World Models for Parallel Code
🔥 引用:
0
Abstract: Large language models have shown remarkable ability in serial code generation, but they still struggle with parallel code for which training data is comparatively scarce. A common remedy is to use coding agents that interact with external tools, but tool calls can be costly and sometimes impractical, e.g., for partially written code. We propose Parallel-Code World Models (PCWMs), reasoning LLMs that aim to predict tool outcomes directly from parallel source code. To train PCWMs, we design a novel exploration and data generation pipeline that samples diverse parallel-coding problems and candidate implementations across multiple domains, then executes them via tools to record data races and performance profiles. From these, we synthesize hindsight reasoning traces that causally connect source code to observed tool outcomes. Fine-tuning on the resulting data yields noticeable gains, with a 7B-parameter world model improving from 64.3% to 72.8% accuracy in race-outcome prediction, while an 8B-parameter model improves in a performance profiling task from 49.3% to 58.6% accuracy. Furthermore, when open-weight models were tasked with fixing data races, world-model feedback improved their race-fixing rates relative to self-feedback by 2.7%-9.1% using our 7B-parameter world model and by 6.1%-11.1% using our 14B-parameter world model. Our results suggest that reasoning models have the potential to serve as practical substitutes for external tool calls in parallel-coding agents.
MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs
🔥 引用:
0
Abstract: Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk, safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and the CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's $\alpha$ = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.
EconoGNN: A graph neural network framework for temporal economic resilience insights
🔥 引用:
0
Abstract: Global economic shocks such as the 2008 financial crisis or recent trade escalations between the United States and China have exposed the complexity of interdependent economies and the need for systemic, multi-agent analysis. However, most regional economic resilience (RER) studies remain limited by localized datasets, inconsistent definitions, and static modeling approaches, restricting their ability to generalize insights across space and time. We introduce EconoGNN, a Graph Neural Network framework that integrates complexity theory, economic modeling, and machine learning to predict and explain regional economic resilience across 183 countries over 25 years. By combining over 81 million trade records (UN COMTRADE) and 500,000 macroeconomic observations (Penn World Table), and adopting an official resilience metric from the World Bank, our approach enables reproducible and interpretable global-scale analysis. EconoGNN achieves F1-scores of 0.750 with the temporal GNN architecture GConvGRU, AUC-ROC of 0.792, and PR-AUC of 0.757, demonstrating robust performance across different recovery threshold settings (τ = 0.90–1.00) with F1-scores ranging from 0.730 to 0.771, and yielding statistically significant improvements (p-value ≤0.05) over baselines. GNNExplainer validation confirms explanation reliability (Fidelity+ = 0.827, Characterization = 0.913), enabling country-customized interpretability of resilience drivers. Moreover the EconoGNN framework integrates key structural and welfare indicators to model both national and cross-border economic interactions, reducing omitted-variable bias and implicitly accounting for political, institutional, and cultural differences.
Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports
🔥 引用:
0
Abstract: Objectives Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve clinical decision-making, the CAD-RADS 2.0 system was introduced. This study evaluates the performance of four LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V, and Copilot) in generating CAD-RADS 2.0-compliant conclusions from standardized CCTA reports. Materials and methods A total of 196 anonymized CCTA reports were retrospectively analyzed. Each LLM was prompted to provide CAD-RADS 2.0 classifications and follow-up recommendations. Ground truth labels were assigned by a senior radiologist. Performance metrics (accuracy, precision, recall, F1-score), execution times, and agreement (Cohen’s kappa) with expert interpretation were computed. Interobserver agreement between junior and senior radiologists was also assessed. Results LLMs demonstrated good-to-excellent agreement with expert classifications: DeepSeek V (κ = 0.771), Copilot (κ = 0.761), GPT-4o (κ = 0.759), and Gemini 2.0 Flash (κ = 0.634). DeepSeek V achieved the highest accuracy (91.83%). Intra-model consistency was perfect (κ = 1). However, LLMs failed to assign CAD-RADS modifiers. GPT-4o provided the most accurate follow-up recommendations (71.94%). All LLMs outperformed radiologists in execution time (3–9 s vs. 15–20 s; p < 0.05). Conclusions Generic LLMs demonstrate promising performance in automating CAD-RADS 2.0 classification from CCTA reports. However, limitations in modifier assignment and recommendation accuracy highlight areas for refinement before clinical integration. Critical relevance statement This study explores the potential of large language models to facilitate standardized CAD-RADS 2.0 reporting from coronary CT angiography, highlighting a possible avenue to support workflow efficiency and clinical decision-making in non-invasive coronary artery disease evaluation.
Key Points: LLMs demonstrated strong potential in automating CAD-RADS 2.0-compliant structured reporting for CCTA; LLMs could significantly enhance efficiency in radiological reporting; and LLMs need further optimization before clinical integration.
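The agreement figures above are Cohen's kappa values. For reference, a minimal sketch of how kappa is computed from two raters' labels of the same items (a generic implementation, not the study's analysis code):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance, given two raters' labels for the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n**2    # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

Values of 0.61-0.80 are conventionally read as "substantial" agreement, which is consistent with the "good-to-excellent" characterization in the results.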
Generating and Solving Complex Transfer Type Arithmetic Word Problems
🔥 引用:
0
Abstract: Most existing arithmetic word problem (AWP) solvers focus on solving simple examples. Transfer-case AWPs (TC-AWPs) involve scenarios where objects are transferred between agents. The widely used AWP datasets mainly consist of simple TC-AWPs (problems that involve a single object transfer). Current large language models (LLMs) are capable of solving most of these simple TC-AWPs effectively. In this work, we focus on assessing the solving capability of LLMs (ChatGPT and Gemini) for complex TC-AWPs (where multiple types of objects are transferred or more than one transfer of an object is performed). Since the popular AWP datasets contain only simple TC-AWPs, we first generate complex TC-AWPs using an ontological approach. We utilize these complex examples to assess LLMs’ word-problem-solving capabilities. We observe that the accuracy of LLMs falls rapidly as the number of object transfers increases to 3 or 4. An approach for solving TC-AWPs using ontologies and machine learning (ML) exists in the literature. We propose an extension of this approach that can handle complex TC-AWPs and find that, compared to current LLMs, the proposed solution gives better accuracy on complex TC-AWPs. We analyze the failed cases of the LLM approach and find that the reasoning capabilities of LLMs need substantial improvement.
Mathematical modeling and analysis of Love wave propagation in fluid-saturated continuum with spring-membrane interfacial foundations
🔥 引用:
0
Abstract: Numerous scientific studies and industrial fields, including geology, geophysics, mining, etc., rely heavily on seismic waves. Seismic wave analysis provides information that helps us better manage and use natural resources by enhancing our understanding of their structure. With this objective, this study presents a comprehensive mathematical model for Love-type seismic wave propagation in a transversely isotropic fluid-saturated poroelastic (TIFSP) layer bonded to an elastic half-space through spring–membrane interfacial foundations. More precisely, Classical (CL), Spring (SPR), Membrane (MBR), and Combined (CB) elastic foundation models are discussed. Implementing analytical mathematical techniques and complex continuity conditions across the different interfaces, the hyperbolic boundary value problem is solved, and the closed-form expressions of dispersion relations for all four models are obtained. The obtained dispersion relations are also reduced as special cases and matched with the results in the literature. Noteworthy influences of prevalent parameters such as anisotropy, porosity, spring constant, surface/interface Lame and density constants, etc., on Love wave phase velocity are examined graphically and discussed. The results demonstrate that spring–membrane interfacial foundations provide an effective mechanism for tailoring Love-wave dispersion in fluid-saturated media, offering valuable insights for subsurface sensing, geophysical waveguides, and engineered layered composites with imperfect interfaces.
Enhancing Accessibility of Government Notices Through LLM-Based Multilingual Translation
DOI:
10.55041/isjem06511
🔥 引用:
0
Abstract: In the field of computational linguistics, addressing machine translation (MT) challenges for low-resource languages remains crucial, as these languages often lack extensive data compared to high-resource languages. General large language models (LLMs), such as GPT-4 and Llama, primarily trained on monolingual corpora, face significant challenges in translating low-resource languages, often resulting in subpar translation quality. This study introduces Language-Specific Fine-Tuning with Low-rank adaptation (LSFTL), a method that enhances translation for low-resource languages by optimizing the multi-head attention and feed-forward networks of Transformer layers through low-rank matrix adaptation. LSFTL preserves the majority of the model parameters while selectively fine-tuning key components, thereby maintaining stability and enhancing translation quality. Experiments on non-English-centered low-resource Asian languages demonstrated that LSFTL improved COMET scores by 1-3 points compared to specialized multilingual machine translation models. Additionally, LSFTL’s parameter-efficient approach allows smaller models to achieve performance comparable to their larger counterparts, highlighting its significance in making machine translation systems more accessible and effective for low-resource languages. Key Words: Machine translation, low-resource languages, large language models, parameter-efficient fine-tuning, LoRA.
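The low-rank adaptation at the heart of LSFTL follows the standard LoRA recipe: the frozen weight matrix is supplemented by a trainable product of two small matrices. A minimal sketch (shapes and the alpha/r scaling follow common LoRA conventions; the variable names are illustrative, not the paper's code):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA-style forward pass: the frozen weight W (d_out x d_in) is
    supplemented by a low-rank update (alpha/r) * B @ A, where only
    A (r x d_in) and B (d_out x r) are trained."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # B starts at zero, so the
x = rng.normal(size=(4, d_in))           # model is unchanged at init
```

Because only A and B (2 x r x d parameters instead of d_out x d_in) receive gradients, the majority of the model is preserved, which is exactly the stability property the abstract emphasizes.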
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
🔥 引用:
0
Abstract: While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion, Physical Reasoning, Foundation Models
Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis
🔥 引用:
0
Abstract: An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.
Synthesis, Characterization, Antiproliferative Activity, Toxicity, and ADME Studies of a New Schiff Base
🔥 引用:
0
Abstract: In this study, a new Schiff base 3 was synthesized in a single reaction step from N1-phenylbenzene-1,2-diamine and 2-hydroxy-5-methoxybenzaldehyde over 20 h. The structural properties of compound 3 were characterized in detail by 1H NMR, 13C NMR, and IR. The cytotoxic activity of the synthesized molecule was investigated on breast cancer cell lines (MDA-MB-231 and MCF-7) and a healthy human embryonic kidney cell line (HEK-293T). The evaluations revealed that the compound exhibits cytotoxic activity against breast cancer cells (IC50: 22.73 µM for MDA-MB-231 and 51.59 µM for MCF-7). Additionally, analyses on healthy cells revealed that the compound exhibits high selectivity (IC50: 352.40 µM). The ADME and toxicity (ADMET) properties of molecule 3 were analyzed using in silico approaches: physicochemical parameters, lipophilicity, and solubility predictions were calculated using SwissADME, while acute toxicity predictions were obtained through the Protox-II platform, which estimated a toxic dose of 563 mg/kg. The GHS toxicity class assigned by Protox-II is 4, indicating moderate acute toxicity.
Locally-deployed vs. cloud-based AI in healthcare: evaluating DeepSeek-R1:8b, DeepSeek-R1, and ChatGPT o3-mini-high for complex medical diagnostics
🔥 引用:
0
Abstract: Reasoning large language models are increasingly considered for healthcare-related artificial intelligence applications, but their practical value depends not only on diagnostic accuracy, but also on responsiveness and operational reliability. In this study, we benchmarked six model settings on 1,000 questions from the MedQA dataset: DeepSeek-R1, its distilled 8-billion-parameter local variant DeepSeek-R1:8b, ChatGPT o3-mini-high, and their knowledge-base–augmented counterparts. We evaluated performance across three dimensions: diagnostic accuracy, response latency, and first-attempt connection reliability. DeepSeek-R1 achieved the highest accuracy (89.5%, 95% CI: 87.4–91.2) but showed substantially longer response times (median 26.54 s) and higher connection failure rates (4.6%). ChatGPT o3-mini-high responded faster (median 10.05 s) and showed the most favorable tail-latency profile, but its accuracy (78.2%, 95% CI: 75.5–80.7) was lower than that of DeepSeek-R1. The locally deployed DeepSeek-R1:8b demonstrated markedly stronger connection reliability (failure rate 0.2%, 95% CI: 0.0%–0.5%) but substantially reduced accuracy (55.0%, 95% CI: 51.9%–58.5%). Knowledge-base augmentation did not consistently improve performance; for DeepSeek-R1, it significantly reduced accuracy by 4.36% (p = 0.0002), while no significant benefit was observed for the other models. These findings show that reasoning model performance in medical question answering is best understood as a trade-off among accuracy, latency, connection reliability, and deployment mode, and that retrieval augmentation is not universally beneficial. More broadly, this study provides deployment-relevant benchmarking evidence for evaluating reasoning models in healthcare-related settings, while also indicating the need for richer knowledge resources and more realistic task environments before such systems can be meaningfully assessed for real-world clinical use.
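The abstract reports 95% confidence intervals for accuracies measured on n = 1,000 questions. The interval method is not stated in the abstract, but a Wilson score interval, a common choice for binomial proportions, produces intervals of comparable width, as this sketch shows:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion
    (z = 1.96 for the two-sided 95% level)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For 895 correct answers out of 1,000 (the DeepSeek-R1 result), this gives an interval of roughly 87.4% to 91.3%, close to the 87.4-91.2 range reported above.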
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
Citations: 0
Abstract: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline.
Feasibility of Large Language Models for Ongoing Professional Practice Evaluation in Cardiovascular Medicine: A Pilot Study in the United States
Citations: 0
Abstract: No abstract available; see the original article.
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Citations: 0
Abstract: The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Citations: 0
Abstract: Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
Citations: 0
Abstract: The rapidly evolving Node.js ecosystem currently includes millions of packages and is a critical part of modern software supply chains, making vulnerability detection of Node.js packages increasingly important. However, traditional program analysis struggles in this setting because of dynamic JavaScript features and the large number of package dependencies. Recent advances in large language models (LLMs) and the emerging paradigm of LLM-based agents offer an alternative to handcrafted program models. This raises the question of whether an LLM-centric, tool-augmented approach can effectively detect and confirm taint-style vulnerabilities (e.g., arbitrary command injection) in Node.js packages. We implement LLMVD.js, a multi-stage agent pipeline to scan code, propose vulnerabilities, generate proof-of-concept exploits, and validate them through lightweight execution oracles; and systematically evaluate its effectiveness in taint-style vulnerability detection and confirmation in Node.js packages without dedicated static/dynamic analysis engines for path derivation. For packages from public benchmarks, LLMVD.js confirms 84% of the vulnerabilities, compared to less than 22% for prior program analysis tools. It also outperforms a prior LLM-program-analysis hybrid approach while requiring neither vulnerability annotations nor prior vulnerability reports. When evaluated on a set of 260 recently released packages (without vulnerability groundtruth information), traditional tools produce validated exploits for few (≤ 2) packages, while LLMVD.js generates validated exploits for 36 packages.
Nutritional Profiling, Antioxidant Activity, and In vivo Toxicological Evaluation of Myristica fragrans Seeds: Renal-Protective Insights from Molecular Docking
Citations: 0
Abstract:
Medicinal plants are utilized in the pharmaceutical, nutraceutical, food, and beverage industries for their safety and health-promoting properties. The current study evaluates the nutritional composition, antioxidant activity, ADMET (absorption, distribution, metabolism, excretion, and toxicity) analysis, in vivo safety assessment, and the in silico renal-protective effect of Myristica fragrans seed powder and extract. The proximate composition and antioxidant potential of M. fragrans were evaluated by AOAC (Association of Official Agricultural Chemists), along with the TPC (Total Phenolic Contents), FRAP (Ferric Reducing Antioxidant Power), and DPPH (2,2-Diphenyl-1-picrylhydrazyl) methods. For safety assessment, Sprague-Dawley rats were supplemented with M. fragrans seed powder (MFSP) and hydroethanolic extract (MFS-HEE) for 28 days. Additionally, eugenol, Licarin A, and myristicin were assessed for ADMET properties and docked with 4E4D (Crystal structure of the mouse RANKL-OPG complex), using PyRx and Discovery Studio Visualizer. The results showed that M. fragrans seeds comprise fat (26.55%), fiber (12.23%), and protein (15.75%), and hydroethanolic extract exhibited maximum antioxidant activity in DPPH and FRAP assays. Furthermore, safety evaluation revealed no hematological and histopathological adverse effects. The lipid profile was reduced by 600 mg/kg body weight of dried M. fragrans seeds extract. Licarin A revealed the highest binding affinity (-7.6 kcal/mol) with the renal-injury related protein in molecular docking analysis; moreover, eugenol exhibited the highest predicted absorption and safety compared to other compounds in ADMET analysis. In conclusion, MFSP and MFS-HEE are safe and have renal protective potential.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
Citations: 0
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
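The probing signal described above can be made concrete with a toy sketch: given the conditional probability of the correct answer measured at each segment boundary, the per-segment progress is the change in that belief (the function and the first-difference formulation are our illustration, not code from the paper):

```python
def segment_progress(belief):
    """belief[i] = P(correct answer | reasoning up to segment boundary i).

    Returns per-segment progress: how much each reasoning step moved the
    model's belief toward the correct answer. Positive = productive step,
    near-zero or negative = a candidate for down-weighting (overthinking).
    """
    return [b1 - b0 for b0, b1 in zip(belief, belief[1:])]

# A trajectory whose middle step is wasted "overthinking":
progress = segment_progress([0.20, 0.55, 0.54, 0.90])
# ≈ [0.35, -0.01, 0.36]: the second segment adds nothing, so its
# contribution to GRPO's trajectory-level advantage can be reduced.
```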
New indole-linked 1,2,4-triazole derivatives as dual FAK inhibitors and apoptosis inducers targeting survival and migration in triple-negative breast cancer in-vitro
Citations: 0
Abstract:
Focal adhesion kinase (FAK) is overexpressed and hyperactivated in triple-negative breast cancer, driving tumor aggressiveness and cancer stem cell–mediated therapy resistance. Therefore, targeting FAK signalling represents a promising therapeutic strategy. In this study, a series of indole and bis-indole-1,2,4-triazoles were synthesized and evaluated as anti-TNBC agents targeting FAK. Compounds 3c, 4c, and 5c displayed potent cytotoxicity (IC₅₀ = 41–77 µg/mL) with minimal toxicity to normal cells, outperforming precursor compound 2. Wound-healing assay revealed significant inhibition of cell migration, particularly by 4c. Cell cycle analysis revealed that 4c induced S-phase arrest in MCF-7 cells and G1-phase arrest in MDA-MB-231 cells, accompanied by significant apoptosis. In MDA-MB-231 cells, 4c triggered extensive total apoptosis (90.84%) with minimal necrosis. Gene expression studies demonstrated that 4c markedly downregulated PTK2 (FAK), CCL5, and BCL2, while upregulating CASP3, highlighting its dual role as FAK inhibitor and apoptosis inducer. Importantly, 4c efficiently suppressed FAK protein expression (61.3%) in TNBC, compared to the FAK inhibitor GSK-2256098 (70.7%). In vivo toxicity assessment confirmed good tolerability in mice without profound hepatic or renal impairments, while docking and ADMET analyses confirmed strong FAK binding affinity and favourable pharmacokinetics of 4c. Collectively, 4c emerges as a promising FAK-targeted candidate for TNBC therapy.
Can Virtual Agents Care? Designing an Empathetic and Personalized LLM-Driven Conversational Agent
Citations: 0
Abstract: Mental health challenges are rising globally, while traditional support services face limited availability and high costs. Large language models offer potential for conversational support, but often lack personalization, empathy, and factual grounding. A virtual agent framework is introduced to provide empathetic, personalized, and reliable wellbeing support through retrieval-augmented architecture, structured memory, and multimodal interaction. Objective benchmarks demonstrate improved retrieval and response quality, particularly for smaller models. A cross-cultural study with university students from Vietnam and Australia shows the system outperforms LLM-only baselines in coherence, perceived accuracy, and empathy, with most participants clearly preferring the proposed approach.
Cold-Start Forecasting of New Product Life-Cycles via Conditional Diffusion Models
Citations: 0
Abstract: Forecasting the life-cycle trajectory of a newly launched product is important for launch planning, resource allocation, and early risk assessment. This task is especially difficult in the pre-launch and early post-launch phases, when product-specific outcome history is limited or unavailable, creating a cold-start problem. In these phases, firms must make decisions before demand patterns become reliably observable, while early signals are often sparse, noisy, and unstable. We propose the Conditional Diffusion Life-cycle Forecaster (CDLF), a conditional generative framework for forecasting new-product life-cycle trajectories under cold start. CDLF combines three sources of information: static descriptors, reference trajectories from similar products, and newly arriving observations when available. Here, static descriptors refer to structured pre-launch characteristics of the product, such as category, price tier, brand or organization identity, scale, and access conditions. This structure allows the model to condition forecasts on relevant product context and to update them adaptively over time without retraining, yielding flexible multi-modal predictive distributions under extreme data scarcity. The method satisfies consistency with a horizon-uniform distributional error bound for recursive generation. Across studies on Intel microprocessor stock keeping unit (SKU) life cycles and the platform-mediated adoption of open large language model repositories, CDLF delivers more accurate point forecasts and higher-quality probabilistic forecasts than classical diffusion models, Bayesian updating approaches, and other state-of-the-art machine-learning baselines.
A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
Citations: 0
Abstract: For autoregressive modeling of chaotic dynamical systems over long time horizons, the stability of both training and inference is a major challenge in building scientific foundation models. We present a hybrid technique in which an autoregressive transformer is embedded within a novel shooting-based mixed finite element scheme, exposing topological structure that enables provable stability. For forward problems, we prove preservation of discrete energies, while for training we prove uniform bounds on gradients, provably avoiding the exploding gradient problem. Combined with a vision transformer, this yields latent tokens admitting structure-preserving dynamics. We outperform modern foundation models with a $65\times$ reduction in model parameters and long-horizon forecasting of chaotic systems. A "mini-foundation" model of a fusion component shows that 12 simulations suffice to train a real-time surrogate, achieving a $9{,}000\times$ speedup over particle-in-cell simulation.
Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization
Citations: 0
Abstract: Large Language Models (LLMs) are increasingly deployed as automated annotators in semantic multimedia systems, yet their reliability varies significantly across architectures. This study extends prior cross-model evaluations by benchmarking ChatGPT-5, Qwen-3, and Gemini-3-flash against human expert annotations using the HRAST hotel review dataset. We adopt a bias-by-design framework to analyze systematic divergences in sentiment, topic, and aspect labeling across real and synthetic data, while investigating the moderating effects of annotation mode. Findings reveal model-contingent polarity bias: ChatGPT-5 exhibits a pronounced neutrality bias, while Qwen-3 and Gemini-3-flash align more closely with human polarization. Agreement is substantial for concrete topics but diverges on abstract evaluative dimensions. Synthetic data consistently inflates reliability metrics while masking ambiguity. These findings highlight that annotation bias is structurally embedded in model design choices and operational conditions. Cross-architectural triangulation and mode-aware deployment strategies are recommended for robust semantic multimedia system development.
Video-ToC: Video Tree-of-Cue Reasoning
Citations: 0
Abstract: Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at https://github.com/qizhongtan/Video-ToC.
The Last Harness You'll Ever Build
Citations: 0
Abstract: AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the Harness Evolution Loop optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the Meta-Evolution Loop optimizes the evolution protocol $\Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, learning a protocol $\Lambda^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all. We formalize the correspondence to meta-learning and present both algorithms. The framework shifts manual harness engineering into automated harness engineering, and takes one step further -- automating the design of the automation itself.
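The inner Harness Evolution Loop can be sketched in plain Python, with the Worker, Evaluator, and Evolution agents reduced to callables; all names, the fixed round budget, and the score-threshold stopping rule are illustrative assumptions rather than the paper's algorithm:

```python
def evolve_harness(task, worker, evaluator, evolver, harness,
                   rounds=5, target=0.95):
    """Inner Harness Evolution Loop: run, diagnose, mutate, repeat."""
    history = []
    for _ in range(rounds):
        output = worker(task, harness)             # W_H executes the task
        score, critique = evaluator(task, output)  # V diagnoses and scores
        history.append((harness, score, critique))
        if score >= target:                        # harness has converged
            break
        harness = evolver(task, history)           # E rewrites the harness
    return max(history, key=lambda h: h[1])[0]     # best harness seen

# Toy instantiation: the "harness" is just a prompt prefix the worker echoes.
best = evolve_harness(
    task="say hi",
    worker=lambda t, h: h + t,
    evaluator=lambda t, out: (1.0 if out.startswith("POLITE:") else 0.0,
                              "be polite"),
    evolver=lambda t, hist: "POLITE:",
    harness="",
)
```

The outer Meta-Evolution Loop would then treat the tuple (worker, initial harness, evaluator, evolver) itself as the object being optimized across tasks.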
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
Citations: 0
Abstract: This paper proposes EvoAgent - an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub-agent delegation mechanism. EvoAgent models skills as multi-file structured capability units equipped with triggering mechanisms and evolutionary metadata, and enables continuous skill generation and optimization through a user-feedback-driven closed-loop process. In addition, by incorporating a three-stage skill matching strategy and a three-layer memory architecture, the framework supports dynamic task decomposition for complex problems and long-term capability accumulation. Experimental results based on real-world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility. Under a five-dimensional LLM-as-Judge evaluation protocol, the overall average score increases by approximately 28%. Further model transfer experiments indicate that the performance of an agent system depends not only on the intrinsic capabilities of the underlying model, but also on the degree of synergy between the model and the agent architecture.
BioEngine: scalable execution and adaptation of bioimage AI through agent-readable interfaces
Citations: 0
Abstract: No abstract available; see the original article.
HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping
Citations: 0
Abstract: The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth's climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral-spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Citations: 0
Abstract: The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
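The recomposition property that Stream-CQSA relies on (partial attention over subsequences recombining into exactly the full-sequence result) rests on the standard online-softmax identity. A minimal numpy sketch for a single query vector (block size and names are ours; this illustrates the exact recombination, not the cyclic-quorum scheduling itself):

```python
import numpy as np

def block_attention(q, K, V, block=256):
    """Exact softmax attention for one query, computed block by block.

    A running max (m), normalizer (s), and value accumulator (acc) are
    rescaled as each block arrives, so only one block of K/V must be
    resident at a time, yet the result equals full-sequence attention.
    """
    d = q.shape[0]
    m, s = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        logits = K[i:i + block] @ q / np.sqrt(d)
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)      # rescale previously accumulated terms
        w = np.exp(logits - m_new)
        s = s * scale + w.sum()
        acc = acc * scale + w @ V[i:i + block]
        m = m_new
    return acc / s
```

Each block's partial result is independent of the others up to this rescaling, which is what lets a scheduler assign sub-problems to whatever memory budget is available without approximation error.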
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness
Citations: 0
Abstract: Large language models perform well on many reasoning tasks, yet they often lack awareness of whether their current knowledge or reasoning state is complete. In non-interactive puzzle settings, the narrative is fixed and the underlying structure is hidden; once a model forms an early hypothesis under incomplete premises, it can propagate that error throughout the reasoning process, leading to unstable conclusions. To address this issue, we propose SABA, a reasoning framework that explicitly introduces self-awareness of missing premises before making the final decision. SABA formulates reasoning as a recursive process that alternates between structured state construction and obstacle resolution: it first applies Information Fusion to consolidate the narrative into a verifiable base state, and then uses Query-driven Structured Reasoning to identify and resolve missing or underspecified premises by turning them into queries and progressively completing the reasoning state through hypothesis construction and state refinement. Across multiple evaluation metrics, SABA achieves the best performance on all three difficulty splits of the non-interactive Detective Puzzle benchmark, and it also maintains leading results on multiple public benchmarks.
RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
Citations: 0
Abstract: We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
Citations: 0
Abstract: Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.
Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples
Citations: 0
Abstract: The central problem in biomedical imaging is batch effects: systematic technical variations unrelated to the biological signal of interest. These batch effects critically undermine experimental reproducibility and are the primary cause of failure of deep learning systems on new experimental batches, preventing their practical use in the real world. Despite years of research, no method has succeeded in closing this performance gap for deep learning models. We propose Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), a meta-learning adaptation method that exploits negative control samples. Such unperturbed reference images are present in every experimental batch by design and serve as stable context for adaptation. We validate our novel method on Mechanism-of-Action (MoA) classification, a crucial task for drug discovery, on the large-scale JUMP-CP dataset. The accuracy of standard ResNets drops from 0.939 $\pm$ 0.005, on the training domain, to 0.862 $\pm$ 0.060 on data from new experimental batches. Foundation models, even after Typical Variation Normalization, fail to close this gap. We are the first to show that meta-learning approaches close the domain gap by achieving 0.935 $\pm$ 0.018. If the new experimental batches exhibit strong domain shifts, such as being generated in a different lab, meta-learning approaches can be stabilized with control samples, which are always available in biomedical experiments. Our work shows that batch effects in bioimaging data can be effectively neutralized through principled in-context adaptation, which also makes these models practically usable and efficient.
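The role of negative controls described above can be illustrated with a simple numpy sketch: re-estimate normalization statistics from each batch's control wells, so the correction tracks the technical batch effect rather than the biology (the function name and the plain feature-wise standardization are a deliberate simplification of the paper's BN-based meta-learning method):

```python
import numpy as np

def control_normalize(features, is_control):
    """Standardize a batch's features using statistics estimated only
    from its negative-control samples (present in every batch by design),
    so the shift/scale correction follows the batch effect."""
    controls = features[is_control]
    mu = controls.mean(axis=0)
    sigma = controls.std(axis=0) + 1e-8
    return (features - mu) / sigma

# Two batches with identical biology but a systematic technical offset:
rng = np.random.default_rng(1)
batch_a = rng.normal(0.0, 1.0, size=(100, 16))
batch_b = batch_a + 5.0                             # simulated batch effect
ctrl = np.zeros(100, dtype=bool); ctrl[:20] = True  # wells 0-19 are controls
norm_a = control_normalize(batch_a, ctrl)
norm_b = control_normalize(batch_b, ctrl)
# After control-based normalization the two batches coincide again.
```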
A Context-Aware Target Engagement and Pharmacodynamic Biomarker Resource to Accelerate Drug Discovery and Development
Citations: 0
Abstract: No abstract available; see the original article.
Signal, Bounds, and Baselines: Principles for Rigorous Evaluation of High-Dimensional Biological Perturbation Prediction
Citations: 0
Abstract: No abstract available; see the original article.
Introduction to the Special Issue on LLM Empowered Internet of Things Part 2
DOI: 10.1145/3798286
Citations: 0
Abstract: ACM TIOT launched a special issue on the theme of LLM Empowered Internet of Things, exploring the intersection of Large Language Models (LLMs) and the Internet of Things (IoT). As IoT continues to expand, advanced computational models are increasingly essential for processing and analyzing the massive data generated by interconnected devices. This special issue focuses on how LLMs can enhance IoT systems in several key areas. The second part of this special issue introduces the remaining six accepted papers, which span a broad range of IoT scenarios, from embedded and cyber-physical systems and human-centered applications to IoT security.
Transforming Procure-to-Pay with Generative Artificial Intelligence: An Architecture-Grounded Analysis of AI-Powered Invoice Extraction and Non-PO Invoice Processing in SAP Ariba Environments
DOI: 10.55041/isjem06731
🔥 引用:
0
Abstract:
The proliferation of unstructured supplier documents, particularly PDF invoices, remains one of the most persistent operational bottlenecks in enterprise procure-to-pay (P2P) cycles. Manual extraction, inconsistent data quality at intake, and recurring reconciliation exceptions drive significant labor costs, delay payment cycles, and limit the capture of early-payment discounts. This paper presents a rigorous, architecture-grounded analysis of Generative AI (GenAI)-powered invoice PDF extraction integrated with SAP Ariba Buying and Invoicing. Drawing on primary design analysis, benchmark-aligned case evidence, and enterprise integration patterns (REST API, SAP Cloud Integration Gateway), we articulate the functional flow, value levers, implementation prerequisites, and operational risk controls for this capability. A human-in-the-loop validation model and master data governance (MDG) integration are examined as critical guardrails. Complementary P2P capabilities—AI-guided buying, intelligent supplier onboarding, and the SAP Joule generative AI copilot—are situated within a unified enterprise transformation framework. KPI recommendations and a measurement approach grounded in activity-based costing are provided. Findings indicate that organizations achieving sustained catalog coverage and sufficient non-PO invoice volume can realize positive ROI within 12–18 months, with measurable reductions in touchless rate gaps, exception backlog, and per-invoice processing cost.
Keywords: Generative AI | Invoice Processing | SAP Ariba | Procure-to-Pay | Large Language Models | Human-in-the-Loop | Accounts Payable Automation | Master Data Governance
Dual Multi-RAG: Multi-View Contrastive Learning for Code Completion
🔥 Citations: 0
Abstract: Recent advances in pre-trained Large Language Models (LLMs) have greatly improved automated code completion, but challenges such as logical errors, semantic misunderstandings, and hallucinated outputs persist, especially for complex or unseen code. Retrieval-Augmented Generation (RAG) alleviates these issues by retrieving external snippets as supplementary context, yet its single-view encoder often fails to capture the full range of code semantics. We propose Dual Multi-RAG, a novel framework that combines prompt-driven multi-retrieval with multi-view contrastive learning. In the retrieval stage, carefully designed prompts guide LLMs to elicit diverse semantic perspectives. In the selection stage, a contrastive learning mechanism identifies the most contextually relevant snippet for completion. This dual design enriches semantic coverage and improves reliability. Experiments demonstrate that Dual Multi-RAG consistently outperforms existing approaches, achieving 3.91% higher accuracy on the CCEval benchmark and 2.8% improvement on HumanEval-Infilling, confirming its effectiveness for robust and accurate code completion.
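The selection stage this abstract describes can be illustrated with a simple stand-in: score each retrieved snippet against several query "views" and pick the top scorer. The mean-cosine aggregation below is an assumption for illustration only; the paper uses a trained contrastive mechanism, not this fixed rule.

```python
import numpy as np

def select_snippet(query_views, snippet_embs):
    """Illustrative selection step: score each retrieved snippet by its
    mean cosine similarity to several query 'views' (embeddings produced
    under different prompts) and return the index of the best snippet.
    The aggregation rule is hypothetical, not the paper's trained model."""
    q = np.asarray(query_views, dtype=float)
    s = np.asarray(snippet_embs, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # unit-normalize views
    s = s / np.linalg.norm(s, axis=1, keepdims=True)   # unit-normalize snippets
    scores = (q @ s.T).mean(axis=0)                    # average over views
    return int(scores.argmax())
```

Averaging over multiple views is what lets a snippet that matches the query's broader semantics win over one that matches only a single surface reading.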
A platform for investigating prompt framing as interface parameters in foundation models for robotics
🔥 Citations: 0
Abstract: Foundation models, in particular large language models (LLMs), are increasingly popular for describing goals for robotic control, decision making, and execution. Recently, hybrid paradigms that leverage the strengths of reinforcement learning (RL) agents in tandem with LLMs for robotic control have been demonstrated. The interface between the RL agents and the language model, however, offers a unique opportunity to explore how prompt framing may affect such hybrid systems. This work presents a controlled experimental platform to measure and better understand how manipulation of the interface between RL agents and an LLM impacts the behaviour of a hybrid advisor-arbiter architecture. We compared three agents under matched evaluation protocols and initializations in a simulated navigation environment: (i) RL-only tabular Q-learning; (ii) LLM-only (stateless) action selection; and (iii) a hybrid LLM + RL agent. Under a constrained interaction budget (10 episodes per world), the hybrid LLM + RL agent achieves higher mean success and higher mean cumulative reward than both RL-only and LLM-only baselines. Advisor-channel ablations (random recommendations and null recommendations) reduce performance, indicating that structured advice contributes beyond adding extra text. We further demonstrate prompt framing as a controlled factor by evaluating navigation-role personas, narrative personas, and relational variants of a caregiver prompt under matched conditions, yielding heterogeneous effects across framings. The contribution of this work is a structured testbed and evaluation approach for investigating the impact of prompt framing on multi-step decision making and control tasks.
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
🔥 Citations: 0
Abstract: Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in these games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.
Augmented Justice: Artificial Intelligence and the Burden of Legal Reading
DOI: 10.66260/1213.a1asb
🔥 Citations: 0
Abstract: The article examines the effects of the “augmentation” of justice through the introduction of contemporary AI tools to assist judges in reading the law, analysing documents, and deciding cases. Emphasis is placed on the potential human unmanageability of “augmented” justice, as well as on the transformations brought about by large language models in the efforts required of legal professionals in their engagement with legal texts. Two models of arbitrary justice are compared through their literary embodiments in Rabelais’ Judge Bridoye (“The Third Book”) and Aristophanes’ juror Philocleon (“The Wasps”). The questions raised by these figures are then reconsidered in light of the contemporary role of AI in the creation, reading, and application of legal texts.
RNABag: A Generalizable Transcriptome Foundation Model for Precision Oncology across Biopsy Modalities
🔥 Citations: 0
Abstract: No abstract available; see the original article.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
🔥 Citations: 0
Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
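A load-time placement pass of the kind MCAP drives could look like the following greedy sketch. The tier and precision policy, thresholds, and names here are illustrative assumptions, not NVE's actual dispatch logic.

```python
def place_layers(importance, layer_bytes, gpu_budget, ram_budget):
    """Hypothetical tiering pass: greedily keep the most important layers
    in the fastest memory tier, with higher-precision activations (W4A16)
    reserved for GPU-resident layers. `importance` maps layer name -> score;
    the greedy policy is an illustration, not MCAP's actual rule."""
    placement = {}
    gpu_used = ram_used = 0
    for name in sorted(importance, key=importance.get, reverse=True):
        size = layer_bytes[name]
        if gpu_used + size <= gpu_budget:
            placement[name] = ("gpu", "W4A16")  # important layer: fast tier
            gpu_used += size
        elif ram_used + size <= ram_budget:
            placement[name] = ("ram", "W4A8")
            ram_used += size
        else:
            placement[name] = ("ssd", "W4A8")   # spill the least important
    return placement
```

Because the importance signal is computed at load time on the target device, the same quantized weights can be re-tiered for a laptop, a T4, or a larger GPU without modification.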
DeepParse: Hybrid Log Parsing with LLM-Synthesized Regex Masks
🔥 Citations: 0
Abstract: Modern distributed systems produce massive, heterogeneous logs essential for reliability, security, and anomaly detection. Converting these free-form messages into structured templates (log parsing) is challenging due to evolving formats and limited labeled data. Machine-learning-based parsers like Drain are fast but accuracy often degrades on complex variables, while Large Language Models (LLMs) offer better generalization but incur prohibitive inference costs. This paper presents DeepParse, a hybrid framework that automatically mines reusable variable patterns from small log samples using an LLM, then applies them deterministically through the Drain algorithm. By separating the reasoning phase from execution, DeepParse enables accurate, scalable, and cost-efficient log structuring without relying on brittle handcrafted rules or per-line neural inference. Across 16 benchmark datasets, DeepParse achieves higher accuracy in variable extraction (97.6% average Parsing Accuracy) and better consistency than both heuristic and LLM-only baselines. Integrating DeepParse into an anomaly detection pipeline reduced false alarms by over 30% and reduced inference latency by 36% compared to heuristic baselines.
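The "mine once, apply deterministically" split described above can be illustrated with a toy mask set. The patterns below are hypothetical examples of what an LLM might synthesize from log samples, not DeepParse's actual output, and the downstream Drain clustering step is omitted.

```python
import re

# Hypothetical variable masks an LLM might mine from a small log sample;
# applied in order from most to least specific.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def mask_line(line):
    """Deterministically replace variable tokens with placeholders,
    yielding a template-like line suitable for a parser such as Drain.
    No per-line LLM call is needed at this stage."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line
```

The cost profile follows directly: the LLM runs once over a small sample to produce the masks, and every subsequent log line is handled by cheap regex substitution.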
The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models
🔥 Citations: 0
Abstract: Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing cross-cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in-depth diagnostic analysis on 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct gaps between tasks, offering a reliable map for future work. We release the benchmark (https://github.com/lunyiliu/GaoYao).
Integrating Generative AI with Autonomous Databases: A Deep Dive into Oracle Database 23ai and Oracle Database 26ai Architecture
🔥 Citations: 0
Abstract: The convergence of generative artificial intelligence (GenAI) and autonomous database systems represents one of the most consequential transformations in enterprise data management over the past decade. Oracle Corporation has positioned itself at the forefront of this convergence through two landmark releases: Oracle Database 23ai, the first database explicitly branded as AI-integrated, and the forthcoming Oracle Database 26ai, which promises a fully AI-native, self-optimizing data platform. This paper presents a comprehensive architectural analysis of both systems, examining how generative AI capabilities are woven into every layer of the database stack, from storage engines and query optimizers to natural-language interfaces and in-database large-language-model (LLM) gateways. We explore Oracle 23ai's flagship features (AI Vector Search, Select AI, JSON Relational Duality, and enhanced in-database machine learning) and contrast them with the anticipated architectural innovations in Oracle 26ai, including multimodal data support, integrated LLM routing, and autonomous self-healing infrastructure. Through comparative analysis, performance benchmarks, and real-world use cases spanning healthcare, finance, and manufacturing, this paper demonstrates that AI-native autonomous databases are not simply databases augmented with AI, but fundamentally re-architected systems where intelligence is a first-class citizen. Our findings suggest that Oracle's trajectory from 23ai to 26ai maps a clear progression toward databases that generate, understand, and act on knowledge rather than merely storing and retrieving it.
DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
🔥 Citations: 0
Abstract: Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains
🔥 Citations: 0
Abstract: Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
🔥 Citations: 0
Abstract: Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.
Persona Routing Associated With Fewer Safety and Monotonicity Violations in Simulated Emergency Large Language Model (LLM) Reasoning
🔥 Citations: 0
Abstract: No abstract available; see the original article.
In search of novel PD1 inhibitor from natural products by high-throughput virtual screening and molecular dynamics simulation
🔥 Citations: 0
Abstract: Squamous cell carcinoma (SCC) represents a significant oncological challenge. While immune checkpoint inhibitors targeting PD-1/PD-L1 have revolutionized the treatment of SCC, current monoclonal antibody approaches face limitations, including poor tissue penetration, high costs, and immune-related adverse events in patients. Most existing small-molecule efforts target PD-L1, leaving PD-1/PD-L2 interactions intact and enabling immune escape. This study represents the first systematic identification of natural product-derived direct PD-1 inhibitors, offering broader pathway blockade compared to PD-L1-selective approaches. While current therapeutic limitations highlight the need for alternative methods, this computational study lays a foundation for experimental validation and potential advancement of a drug development pipeline. Through integrated computational screening of 17,967 phytochemicals from the IMPPAT database, we employed consensus molecular docking across seven algorithms, 300-ns molecular dynamics simulations, density functional theory calculations, and comprehensive ADME profiling. IMPHY004834 (Mahuannin D) from Ephedra sinica emerged as a lead compound with exceptional binding free energy, forming stable interactions with key PD-1 residues. Molecular dynamics analysis revealed remarkable stability, with consistent RMSD, the lowest RMSF, and sustained hydrogen bonding throughout the simulation. The biflavonoid structure exhibits a favorable HOMO-LUMO gap, indicating chemical stability, while ADME profiling confirms drug-like properties, albeit requiring parenteral administration due to low GI absorption. This work establishes the first evidence for Mahuannin D’s PD-1 inhibitory mechanism, which was previously known only for its cytotoxic effects. It provides a validated computational framework for discovering natural product-based immune checkpoint inhibitors with superior pathway coverage compared to existing PD-L1-selective therapeutics.
LLM-based Teamwork Role Inferencing for Fostering Social Online Learning
🔥 Citations: 0
Abstract: Higher education institutions are progressively implementing online learning modules due to their promotion of equity through scalability, accessibility, inclusivity, and affordability. Despite the advantages, online learning environments face persistent challenges in facilitating meaningful networked learning (NL) opportunities. One particularly pressing challenge is supporting effective NL through group formation and collaborative teamwork. With rising enrolments, manual grouping has become impractical. Random allocation overlooks complementary skills and often produces unbalanced teams, while self-assessment relies on students’ often inaccurate self-perceptions, leading to mismatched roles and group tensions. A substantial body of research has examined strategies for group formation, and many studies have emphasised the value of team role allocation in promoting effective NL. However, little attention has been paid to approaches that infer learners’ collaborative role tendencies from their actual behavioural interactions and subsequently use this information to inform group composition and achieve a better NL experience. Accordingly, this study examines whether large language models (LLMs) can infer higher-education learners’ teamwork role tendencies comparably to human judgment, using data from an international sample of undergraduate and postgraduate learners recruited via Prolific who interacted with a custom-built chatbot in collaborative learning scenarios. The resulting interactions were analysed through an LLM to infer teamwork role tendencies based on previously established Belbin Team Roles. The roles inferred by the LLM were then compared against those coded by human coders, with inter-rater alignment evaluated using Cohen’s Kappa and percentage agreement. Learner responses were coded by trained human researchers using a consensus-based Belbin Team Role framework.
The findings reveal that the LLM achieved a moderate degree of alignment with human coders, suggesting its viability as a tool for inferring learners’ teamwork role tendencies. Moreover, exploratory analyses revealed that the length of learners’ responses to the chatbot is associated with the extent to which the LLM’s inferences aligned with human coders. However, future research would benefit from larger sample sizes and the use of more advanced statistical methods to better capture the effects of interaction quality. This study contributes to the growing body of work on LLM-supported NL by highlighting both the potential and the limitations of using LLMs for role inference. Future implementations should pay particular attention to fostering high-quality learner-chatbot interactions to maximise reliability and pedagogical value to help design for effective social participation in online NL experiences. The findings of this research are expected to advance the development of technology-enhanced, NL practices within online higher education.
BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models
🔥 Citations: 0
Abstract: No abstract available; see the original article.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
🔥 Citations: 0
Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.
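The weighted-sum construction described above can be written down directly. The sketch below treats each segment's DPO term as a function of the policy-vs-reference log-probability margins for the preferred and dispreferred responses; the segment names and weights are illustrative assumptions, not HiPO's tuned values.

```python
import math

def dpo_loss(logp_w, logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair, given
    the policy-vs-reference log-probability margin of each response."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * (logp_w - logp_l))))

def hipo_loss(segment_logps, weights, beta=0.1):
    """Sketch of the hierarchical variant: a weighted sum of per-segment
    DPO losses. `segment_logps` maps segment name -> (logp_w, logp_l);
    segment names like 'context'/'steps'/'answer' are illustrative."""
    return sum(weights[s] * dpo_loss(w, l, beta)
               for s, (w, l) in segment_logps.items())
```

With a single segment and unit weight, the hierarchical loss reduces exactly to plain DPO, which is why the method keeps DPO's stability while adding per-segment granularity.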
Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
🔥 Citations: 0
Abstract: We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
🔥 Citations: 0
Abstract: New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.
Real Time Fraud Detection at Scale in High Volume Enterprise Payment Ecosystems
DOI: 10.54097/51td8c59
🔥 Citations: 0
Abstract: The rapid proliferation of digital payment channels has created unprecedented opportunities for fraudulent activity, compelling enterprise organizations to deploy increasingly sophisticated detection mechanisms capable of operating at high throughput with minimal latency. This paper presents a comprehensive review of real-time fraud detection (RTFD) methodologies applied within high-volume enterprise payment ecosystems, encompassing machine learning (ML), deep learning (DL), graph neural networks (GNN), and streaming data architectures. We examine how ensemble methods, anomaly detection frameworks, and feature engineering pipelines converge to form robust, production-grade fraud detection systems (FDS). The paper further discusses the tension between detection accuracy and operational latency, the challenge of class imbalance in transactional datasets, and the evolving regulatory landscape that shapes deployment constraints. By synthesizing findings from recent literature, this review identifies key trends including federated learning (FL) for privacy-preserving fraud detection, transformer-based sequence models for behavioral analysis, and adaptive threshold mechanisms for dynamic fraud pattern recognition. Our analysis reveals that no single algorithmic approach suffices in isolation; rather, layered architectures combining rule-based systems with data-driven models consistently achieve superior performance across precision, recall, and throughput metrics in enterprise-scale deployments.
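The layered architecture this review favors, rules combined with data-driven models, can be caricatured in a few lines. The thresholds, rule set, and decision labels below are invented for illustration and do not come from the paper.

```python
def decide(txn, model_score, rules, review_threshold=0.7, block_threshold=0.9):
    """Toy layered fraud decision: a deterministic rule layer short-circuits
    first, then a model-score layer decides the rest. `rules` is a list of
    predicates over the transaction dict; all values are hypothetical."""
    if any(rule(txn) for rule in rules):
        return "block"                      # hard rule fired: no model needed
    if model_score >= block_threshold:
        return "block"                      # model is highly confident
    if model_score >= review_threshold:
        return "review"                     # route to a human analyst
    return "allow"
```

Keeping the rule layer in front is also what bounds latency: most legitimate traffic never needs the expensive model path re-evaluated under stricter thresholds.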
Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis
🔥 Citations: 0
Abstract: We study whether Large Language Models (LLMs) can perform feature model analysis operations (AOs) directly on semi-formal textual blueprints, i.e., concise constrained-language descriptions of feature hierarchies and constraints, enabling early validation in Software Product Line scoping. Using 12 state-of-the-art LLMs and 16 standard AOs, we compare their outputs against the solver-based oracle FLAMA. Results show that reasoning-optimized models (e.g., Grok 4 Fast Reasoning, Gemini 2.5 Pro) achieve 88-89% average accuracy across all evaluated blueprints and operations, approaching solver correctness. We identify systematic errors in structural parsing and constraint reasoning, and highlight accuracy-cost trade-offs that inform model selection. These findings position LLMs as lightweight assistants for early variability validation.
LLM-guided phase diagram construction through high-throughput experimentation
🔥 Citations: 0
Abstract: Constructing phase diagrams for multicomponent alloys requires extensive experimental measurements and is a time-consuming task. Here we investigate whether large language models (LLMs) can guide experimental planning for phase diagram construction. In our framework, a general-purpose LLM serves as the experimental planner, suggesting compositions for measurement at each cycle in a closed loop with high-throughput synthesis and X-ray diffraction phase identification. Using this framework, we experimentally constructed the ternary phase diagram of the Co-Al-Ge system at 900 °C through iterative synthesis and characterization. We compared two strategies that differ in how the initial compositions are selected: one uses predictions from a domain-specific LLM trained on phase diagram data (aLLoyM), while the other relies solely on the general-purpose LLM. The two strategies exhibited complementary strengths. aLLoyM directed the initial measurements toward compositionally complex regions in the interior of the ternary diagram, enabling the earliest discovery of all three novel phases that form only in the ternary system. In contrast, the general-purpose LLM adopted a textbook-like approach that efficiently identified a larger number of phases in fewer cycles. In addition, a simulated benchmark comparing the LLM against conventional machine learning confirmed that the LLM achieves more efficient exploration. The results demonstrate that LLMs have high potential as experimental planners for phase diagram construction.
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
🔥 Citations: 0
Abstract: As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether these disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users' identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve "better" performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful "dialect jailbreak," reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this "dialect jailbreak" introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where "standard" users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially more hostile information landscape. This highlights a fundamental tension in alignment between equitable treatment and linguistic diversity, and underscores the need for safety mechanisms that generalize beyond explicit cues.
All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG
🔥 引用:
0
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such ``answer-critical'' documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.
Exploring Spatial Intelligence from a Generative Perspective
🔥 引用:
0
Abstract: Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.
Self-supervised pretraining for an iterative image size agnostic vision transformer
🔥 引用:
0
Abstract: Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.
HireSense: An AI-Powered Semantic Matchmaking and Zero-Commission Invoicing Framework for Multilingual Labor Markets
🔥 引用:
0
Abstract: The rapid expansion of the global freelancing economy has highlighted several inefficiencies in existing platforms, including high commission fees, limited language accessibility, and ineffective talent matching mechanisms. This paper presents HireSense, an AI-driven freelancing platform specifically designed for multilingual markets such as India. Unlike conventional systems that rely on keyword-based matching, the proposed platform employs a semantic matchmaking pipeline using sentence embeddings and large language models to improve accuracy in freelancer–job alignment. The system integrates a two-stage architecture where candidate profiles are first retrieved using vector similarity and then refined through contextual re-ranking using a large language model. Additionally, HireSense introduces a zero-commission financial model by generating automated invoices for direct payments via UPI, eliminating intermediary costs. The platform also supports multilingual interaction and voice-based job posting, enhancing accessibility for users with diverse linguistic backgrounds. Experimental evaluation demonstrates improved precision and relevance compared to traditional keyword-based systems. The platform is deployed using a cost-efficient serverless architecture, ensuring scalability without infrastructure overhead. The results indicate that HireSense provides a practical and inclusive solution for modern digital labor markets.
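The two-stage architecture described here (vector-similarity retrieval followed by contextual re-ranking) can be sketched in a few lines. This is a minimal illustration with random toy vectors standing in for sentence embeddings; the `cosine_top_k` helper and the seeded data are hypothetical, and the LLM re-ranking stage is left out.

```python
import numpy as np

def cosine_top_k(query_vec, profile_vecs, k=3):
    """Stage 1: shortlist candidate profiles by cosine similarity."""
    sims = profile_vecs @ query_vec / (
        np.linalg.norm(profile_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Toy embeddings standing in for sentence-embedding vectors of 10 profiles.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(10, 8))
query = profiles[4] + 0.05 * rng.normal(size=8)  # a query close to profile 4

idx, scores = cosine_top_k(query, profiles, k=3)
# Stage 2 (omitted) would pass these k candidates to an LLM for re-ranking.
```

In a real system the shortlist would then be re-ordered by an LLM prompted with the job description and each candidate profile.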
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
🔥 引用:
0
Abstract: As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (a 1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through third quartile sampling (Q3) or tuning the retention rate is necessary to fully leverage the multilingual signal.
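One plausible reading of the third-quartile (Q3) strategy mentioned above is to keep only documents whose quality-classifier score exceeds the 75th percentile of the score distribution. The sketch below uses synthetic scores; the actual Q3 procedure in the paper may differ.

```python
import numpy as np

def q3_filter(scores, docs):
    """Keep only documents scoring above the third quartile (75th percentile)."""
    q3 = np.percentile(scores, 75)
    kept = [d for d, s in zip(docs, scores) if s > q3]
    return kept, q3

scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
docs = list("abcdefgh")
kept, q3 = q3_filter(scores, docs)  # q3 = 0.625; keeps the two best documents
```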
Evaluating analysis methods for coast guard reports freeform text: a case study on resource-constrained natural language processing with search and rescue reports
🔥 引用:
0
Abstract: After a U.S. Coast Guard (USCG) search and rescue (SAR) case, USCG personnel create an after-action report containing a textual narrative of the situation and Coast Guard response efforts. Data analysts explored how to identify reports involving cases with a verified person in the water. With restricted access to compute resources and limiting policy, large language models (LLMs) could not be utilized, so statistical (‘classical’ and non-neural) methods were considered for training a classification model to identify SAR case outcomes from report texts. The dataset was severely imbalanced toward the negative class, and the texts were extremely messy, with many typos and abbreviations. Therefore, an extensive text cleaning pipeline was developed and tested for improving classification performance. The Iterative Token Elimination Algorithm (iTEA) was developed to increase differences in vocabulary between classes. Model improvement was further explored through augmentation of the feature space using non-text data. The best model was an XGBoost model, achieving 0.762 recall and precision (and 0.959 accuracy). Errors from the test set are analyzed to guide future improvements until LLMs can be used, which are expected to improve performance and reduce text cleaning requirements.
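The abstract does not spell out how iTEA works internally; as a hypothetical single pass of the underlying idea, one can drop tokens whose relative frequency is similar in both classes and keep the class-distinctive vocabulary. Everything below (the `distinctive_vocab` helper, the ratio threshold, the toy documents) is an illustrative assumption, not the paper's algorithm.

```python
from collections import Counter

def distinctive_vocab(pos_docs, neg_docs, ratio=0.5):
    """Keep tokens whose normalized frequency differs between classes;
    drop tokens occurring at similar rates in both (one elimination pass)."""
    pos = Counter(t for d in pos_docs for t in d.split())
    neg = Counter(t for d in neg_docs for t in d.split())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    keep = set()
    for tok in set(pos) | set(neg):
        fp, fn = pos[tok] / n_pos, neg[tok] / n_neg
        lo, hi = sorted((fp, fn))
        if lo / hi < ratio:  # frequencies differ enough -> class-distinctive
            keep.add(tok)
    return keep

pos_docs = ["person in water", "person overboard in water"]
neg_docs = ["vessel adrift", "vessel disabled in fog"]
vocab = distinctive_vocab(pos_docs, neg_docs)
# "person"/"water" survive; the shared token "in" is eliminated
```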
Surrogate modeling for interpreting black-box LLMs in medical predictions
🔥 引用:
0
Abstract: Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
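As a toy version of this idea, one can fit a simple interpretable surrogate (here ordinary least squares) to input-output pairs collected by repeatedly prompting the black-box model across simulated scenarios; the recovered coefficients then quantify how strongly each input variable appears to drive the output. The data below is synthetic, a stand-in for real prompt-response logs.

```python
import numpy as np

# Synthetic stand-in for (scenario, model-response) pairs gathered by prompting.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three simulated input variables
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)  # mock outputs

# Linear surrogate: coefficients approximate each variable's influence.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
# coef is close to [2.0, 0.0, -0.5, 0.0]: variable 1 is "perceived" as irrelevant
```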
Context-Aware Multi-Source Intelligent Knowledge Retrieval System
🔥 引用:
0
Abstract: Enterprise organisations manage large volumes of knowledge that is not centralised but distributed across multiple heterogeneous sources, such as structured datasets, web pages, and documents. Traditional keyword-based systems fail to retrieve contextually relevant information, producing incomplete or inaccurate responses, and even large language models (LLMs) generate hallucinated outputs when operating without reliable knowledge grounding. To overcome these challenges, this paper proposes a context-aware multi-source intelligent knowledge retrieval system based on the retrieval-augmented generation (RAG) framework. The system integrates multiple data sources, including PDFs, websites, and structured data repositories.
AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
🔥 引用:
0
Abstract: How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at https://huggingface.co/afrilang-edu.
Hybrid Latent Reasoning with Decoupled Policy Optimization
🔥 引用:
0
Abstract: Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
🔥 引用:
0
Abstract: Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose the Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction: the agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, the agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. Experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a ``knowledge inheritance'' phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
🔥 引用:
0
Abstract: Timely and interpretable early warning of sepsis remains a major clinical challenge due to the complex temporal dynamics of physiological deterioration. Traditional data-driven models often provide accurate yet opaque predictions, limiting physicians' confidence and clinical applicability. To address this limitation, we propose a Large Language Model (LLM)-guided temporal simulation framework that explicitly models physiological trajectories prior to disease onset for clinically interpretable prediction. The framework consists of a spatiotemporal feature extraction module that captures dynamic dependencies among multivariate vital signs, a Medical Prompt-as-Prefix module that embeds clinical reasoning cues into LLMs, and an agent-based post-processing component that constrains predictions within physiologically plausible ranges. By first simulating the evolution of key physiological indicators and then classifying sepsis onset, our model offers transparent prediction mechanisms that align with clinical judgment. Evaluated on the MIMIC-IV and eICU databases, the proposed method achieves superior AUC scores (0.861-0.903) across prediction horizons ranging from 24 to 4 hours before onset, outperforming conventional deep learning and rule-based approaches. More importantly, it provides interpretable trajectories and risk trends that can assist clinicians in early intervention and personalized decision-making in intensive care environments.
LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures
🔥 引用:
0
Abstract: Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.
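The vulnerability criterion above rests on the Jensen-Shannon divergence between pre- and post-perturbation output distributions. Its standard form (here with log base 2, so values lie in [0, 1]) is easy to state:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability distributions,
    using log base 2 so the result lies in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):  # KL(a || b), skipping zero-probability terms
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Disjoint distributions are maximally divergent; identical ones score 0.
print(js_divergence([1, 0], [0, 1]))          # 1.0
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```

In LayerTracer's setting, p and q would be the vocabulary distributions of a layer's output before and after mask perturbation; the layer maximizing this quantity is flagged as vulnerable.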
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
🔥 引用:
0
Abstract: Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
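Morton-curve (Z-order) serialization, which SpaCeFormer uses to obtain spatially coherent feature ordering, interleaves the bits of the quantized point coordinates so that sorting by the resulting code keeps nearby points close in the 1D sequence. A minimal sketch:

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of quantized x, y, z into one Morton code;
    sorting points by this code yields a spatially coherent 1D ordering."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
codes = [morton3d(*p) for p in pts]  # [0, 1, 2, 7]
order = sorted(range(len(pts)), key=lambda i: codes[i])
```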
RAG-BASED LEGAL/GOVERNMENT SCHEME ASSISTANT
DOI:
10.55041/ijsrem60964
🔥 引用:
0
Abstract: Access to accurate information about government welfare schemes remains a major challenge for many citizens, particularly in rural and underserved communities. Although official portals provide scheme details, users often struggle to identify programs relevant to their eligibility due to complex documentation and scattered data sources. This paper presents SchemeAssist, an intelligent AI-powered chatbot built using Retrieval-Augmented Generation (RAG) to deliver reliable, context-aware responses regarding government schemes. The proposed system integrates a large language model with a vector database containing curated government scheme documents. When a user submits a query, the system retrieves relevant information and generates precise responses grounded in verified data, thereby reducing hallucinations commonly associated with generative AI models. The chatbot supports natural language interaction, enabling users to ask about eligibility, required documents, benefits, and application procedures. Experimental evaluation demonstrates improved response accuracy, reduced misinformation, and enhanced user accessibility compared to traditional keyword-based search systems. The solution is scalable, cost-effective, and suitable for deployment in public service platforms, educational kiosks, and mobile applications, ultimately promoting digital inclusion and informed decision-making.
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
🔥 引用:
0
Abstract: Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
🔥 引用:
0
Abstract: Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory-Augmented LLM-based Multi-Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state-of-the-art baselines demonstrate the effectiveness of our approach. The code is available at https://github.com/fxdong24/MALMAS.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
🔥 引用:
0
Abstract: Scaling large language models to long contexts is challenging due to the quadratic computational cost of full attention. Mitigation approaches include KV-cache selection or compression techniques. We instead provide an effective and end-to-end learnable bridge between the two without requiring architecture modification. In particular, our key insight is that interleaved gist compression tokens -- which provide a learnable summary of sets of raw tokens -- can serve as routing signals for sparse attention. Building on this, we introduce selective unfolding via Gist Sparse Attention (GSA), which first compresses the context into gist tokens, then selects the most relevant gists, and subsequently restores the corresponding raw chunks for detailed attention. This yields a simple coarse-to-fine mechanism that combines compact global representations with targeted access to fine-grained evidence. We further incorporate this process directly into training in an end-to-end fashion, avoiding the need for external retrieval modules. In addition, we extend the framework hierarchically via recursive gist-of-gist construction, enabling multi-resolution context access with logarithmic per-step decoding complexity. Empirical results on LongBench and RAG benchmarks demonstrate that our method consistently outperforms other compression baselines as well as inference-time sparse attention methods across compression ratios from 8x to 32x. The code is available at: https://github.com/yuzhenmao/gist-sparse-attention/
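The compress-select-unfold loop can be illustrated with a numpy toy in which mean pooling stands in for learned gist tokens. The real method trains the gists end-to-end, so every detail below (pooling, dot-product scoring, fixed chunk size) is a simplifying assumption:

```python
import numpy as np

def coarse_to_fine_select(tokens, query, chunk=4, k=2):
    """(1) Compress each chunk of token vectors into a 'gist' (mean pooling
    as a stand-in for learned gist tokens), (2) score gists against the
    query, (3) unfold the raw tokens of the top-k chunks for fine attention."""
    n = len(tokens) // chunk
    gists = tokens[: n * chunk].reshape(n, chunk, -1).mean(axis=1)
    top = np.sort(np.argsort(-(gists @ query))[:k])
    return np.concatenate([tokens[i * chunk:(i + 1) * chunk] for i in top])

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 8))   # 16 token vectors, 4 chunks of 4
tokens[4:8] += 5.0                  # chunk 1 carries a strong signal
query = np.ones(8)
selected = coarse_to_fine_select(tokens, query)  # 2 chunks = 8 raw tokens
```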
Differentiable Conformal Training for LLM Reasoning Factuality
🔥 引用:
0
Abstract: Large Language Models (LLMs) frequently hallucinate, limiting their reliability in critical applications. Conformal Prediction (CP) addresses this by calibrating error rates on held-out data to provide statistically valid confidence guarantees. Recent work extends CP to LLM factuality to filter out risky claims, ensuring that hallucination rates remain below a user-specified level (e.g., 10%). While prior methods treat claims independently, Coherent Factuality extends to multi-step reasoning by representing outputs as dependency graphs and jointly validating claims with their logical ancestors. A key limitation is that Coherent Factuality is not differentiable, requiring hand-crafted scorers that at high reliability levels remove nearly 60% of true claims. We introduce Differentiable Coherent Factuality (DCF), a fully differentiable relaxation that enables learning improved scorers while provably recovering the original algorithm's guarantees. Experiments on two benchmark reasoning datasets demonstrate DCF achieves up to 141% improvement in claim retention while maintaining reliability guarantees, representing a significant step towards reliable conformal LLM systems.
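The CP backbone referenced above is typically split conformal prediction: take the ceil((n+1)(1-alpha))/n empirical quantile of held-out nonconformity scores as the acceptance threshold, which bounds the error rate at alpha for exchangeable data. The sketch below is that generic recipe, not the paper's differentiable DCF objective, and the score array is synthetic.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

cal_scores = np.arange(1, 101) / 100.0  # synthetic scores 0.01 .. 1.00
t = conformal_threshold(cal_scores, alpha=0.1)  # 0.92
# A claim is then accepted only if its nonconformity score is <= t.
```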
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
🔥 引用:
0
Abstract: Long-horizon interactive environments such as games are a natural testbed for evaluating agents' skill usage. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. Our framework improves the decision agent's skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over a 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
🔥 引用:
0
Abstract: Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
TRAVELFRAUDBENCH: A Configurable Evaluation Framework for GNN Fraud Ring Detection in Travel Networks
🔥 引用:
0
Abstract: We introduce TravelFraudBench (TFG), a configurable benchmark for evaluating graph neural networks (GNNs) on fraud ring detection in travel platform graphs. Existing benchmarks--YelpChi, Amazon-Fraud, Elliptic, PaySim--cover single node types or domain-generic patterns with no mechanism to evaluate across structurally distinct fraud ring topologies. TFG simulates three travel-specific ring types--ticketing fraud (star topology with shared device/IP clusters), ghost hotel schemes (reviewer x hotel bipartite cliques), and account takeover rings (loyalty transfer chains)--in a heterogeneous graph with 9 node types and 12 edge types. Ring size, count, fraud rate, scale (500 to 200,000 nodes), and composition are fully configurable. We evaluate six methods--MLP, GraphSAGE, RGCN-proj, HAN, RGCN, and PC-GNN--under a ring-based split where each ring appears entirely in one partition, eliminating transductive label leakage. GraphSAGE achieves AUC=0.992 and RGCN-proj AUC=0.987, outperforming the MLP baseline (AUC=0.938) by 5.5 and 5.0 pp, confirming graph structure adds substantial discriminative power. HAN (AUC=0.935) is a negative result, matching the MLP baseline. On the ring recovery task (>=80% of ring members flagged simultaneously), GraphSAGE achieves 100% recovery across all ring types; MLP recovers only 17-88%. The edge-type ablation shows device and IP co-occurrence are the primary signals: removing uses_device drops AUC by 5.2 pp. TFG is released as an open-source Python package (MIT license) with PyG, DGL, and NetworkX exporters and pre-generated datasets at https://huggingface.co/datasets/bsajja7/travel-fraud-graphs, with Croissant metadata including Responsible AI fields.
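The ring-based split is the detail that makes this evaluation honest: every node of a given fraud ring must land in the same partition, so a model cannot memorize ring membership across train and test. A minimal sketch with a hypothetical `ring_split` helper and toy data:

```python
import random

def ring_split(node_rings, test_frac=0.2, seed=0):
    """Partition nodes so each fraud ring falls wholly in train or test,
    preventing transductive leakage of ring-membership labels."""
    rings = sorted(set(node_rings.values()))
    random.Random(seed).shuffle(rings)
    n_test = max(1, int(len(rings) * test_frac))
    test_rings = set(rings[:n_test])
    train = [n for n, r in node_rings.items() if r not in test_rings]
    test = [n for n, r in node_rings.items() if r in test_rings]
    return train, test

node_rings = {f"n{i}": i % 5 for i in range(20)}  # 5 rings, 4 nodes each
train_nodes, test_nodes = ring_split(node_rings)  # one whole ring held out
```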
Forecasting Individual NetFlows using a Predictive Masked Graph Autoencoder
🔥 引用:
0
Abstract: In this paper, we propose a proof-of-concept Graph Neural Network model that can successfully predict network flow-level traffic (NetFlow) by accurately modelling the graph structure and the connection features. We use sliding-windows to split the network traffic in equal-sized heterogeneous bidirectional graphs containing IP, Port, and Connection nodes. We then use the GNN to model the evolution of the graph structure and the connection features. Our approach shows superior results when identifying the Port and IP to which connections attach, while feature reconstruction remains competitive with strong forecasting baselines. Overall, our work showcases the use of GNNs for per-flow NetFlow prediction.
Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems
🔥 Citations:
0
Abstract: This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
Computational Analysis of Azole Derivatives Targeting the PI3K/AKT/mTOR Pathway With In Vitro Cytotoxicity and Autophagy Evaluation
DOI:
10.1002/jbt.70859
🔥 Citations:
0
Abstract:
This study aims to identify potential azole‐derived inhibitors targeting the PI3K/AKT/mTOR signaling pathway involved in autophagy regulation and cancer progression. A structure‐based virtual screening approach was employed using molecular docking, molecular dynamics (MD) simulations, and free energy calculations (MMGBSA and MMPBSA). The pharmacokinetic profiles and toxicity of lead compounds were assessed using ADMET analysis. In vitro validation was performed using MTT and MDC staining assays on MDA‐MB‐231 breast cancer cells. Among the screened compounds, KR4 demonstrated strong binding affinity towards all three kinases (−8.289, −5.222, and −6.331 kcal/mol, respectively), with favorable pharmacokinetic properties. MD simulation confirmed the stability of the KR4‐protein complexes, while post‐MD MMPBSA analysis validated the binding energetics. In vitro studies revealed dose‐dependent cytotoxicity of KR4 (IC₅₀ ≈ 39 µM) and induction of autophagy in treated cells. The integration of in silico and in vitro approaches highlights KR4 as a promising multi‐target inhibitor of the PI3K/AKT/mTOR pathway with potential anti‐cancer properties. These findings support further exploration of KR4 for therapeutic development.
Comparative Evaluation of Five Multimodal Large Language Models for Medical Laboratory Image Recognition: Impact of Prompting Strategies on Diagnostic Accuracy
🔥 Citations:
0
Abstract: Background: Multimodal large language models (MLLMs) show promise in medical imaging, but their performance is highly dependent on prompt engineering. This study systematically evaluates how different prompting strategies affect diagnostic accuracy in clinical laboratory image interpretation. Methods: We evaluated five MLLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet, Grok-2, and Perplexity Pro (Claude 3.5 Sonnet)) using 177 proficiency testing images across three domains: blood smears (n = 78), urinalysis (n = 50), and parasitology (n = 49). Three prompting approaches were compared: (1) complex multi-choice prompts with 20 diagnostic options, (2) zero-shot open-ended prompts, and (3) two-step descriptive-reasoning prompts. Images were sourced from the Taiwan Society of Laboratory Medicine external quality assurance archives with expert consensus diagnoses. Results: Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p < 0.001). With zero-shot prompts, Gemini achieved 78.5% overall accuracy (urinalysis: 92.0%; parasitology: 75.5%; blood smears: 64.1%), representing a 17% improvement over complex prompts. Two-step descriptive-reasoning prompts further improved blood smear accuracy by 8–12% for top-performing models, but showed minimal benefit in urinalysis and parasitology. The re-query mechanism (“please reconsider”) improved urinalysis accuracy by 7.6% but had a negligible effect on blood smears and parasitology. Conclusions: Prompting strategy critically determines MLLM diagnostic performance. Zero-shot approaches with minimal constraints consistently outperform complex multi-choice formats. The remarkable performance of general-purpose models in structured domains like urinalysis (>90% accuracy) demonstrates the considerable progress of multimodal AI. 
However, complex morphological tasks like blood smear interpretation require either specialized prompting techniques or domain-specific fine-tuning. These findings provide evidence-based guidance for optimizing AI integration in clinical laboratories.
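The three prompting strategies compared above can be made concrete with templates. The wording below is an illustrative assumption, not the authors' exact protocol:

```python
def build_prompts(options):
    """Illustrative versions of the three prompting styles compared in
    the study: multi-choice, zero-shot open-ended, and two-step
    describe-then-reason. Exact wording here is an assumption."""
    multi_choice = (
        "Identify the finding in this laboratory image. "
        "Choose one of the following options:\n"
        + "\n".join(f"{i + 1}. {o}" for i, o in enumerate(options))
    )
    zero_shot = "What abnormality, organism, or cell type is shown in this image?"
    two_step = [
        "Step 1: Describe the morphological features you observe in this image.",
        "Step 2: Based on your description, state the most likely diagnosis.",
    ]
    return multi_choice, zero_shot, two_step
```

The study's finding is that the shortest of these (zero-shot) performed best overall, while the two-step form helped mainly on blood smears.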
StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model
🔥 Citations:
0
Abstract: Storm surge forecasting remains a critical challenge in mitigating the impacts of tropical cyclones on coastal regions, particularly given recent trends of rapid intensification and increasing nearshore storm activity. Traditional high-fidelity numerical models such as ADCIRC, while robust, are often hindered by inevitable uncertainties arising from various sources. To address these challenges, this study introduces StormNet, a spatio-temporal graph neural network (GNN) designed for bias correction of storm surge forecasts. StormNet integrates graph convolutional (GCN) and graph attention (GAT) mechanisms with long short-term memory (LSTM) components to capture complex spatial and temporal dependencies among water-level gauge stations. The model was trained using historical hurricane data from the U.S. Gulf Coast and evaluated on Hurricane Idalia (2023). Results demonstrate that StormNet can effectively reduce the root mean square error (RMSE) in water-level predictions by more than 70% for 48-hour forecasts and above 50% for 72-hour forecasts, as well as outperform a sequential LSTM baseline, particularly for longer prediction horizons. The model also exhibits low training time, enhancing its applicability in real-time operational forecasting systems. Overall, StormNet provides a computationally efficient and physically meaningful framework for improving storm surge prediction accuracy and reliability during extreme weather events.
HealthCare Agent AI
🔥 Citations:
0
Abstract: This study explores the growing transition in healthcare from rigid, rule-based algorithms to intelligent, self-directed “agentic” AI systems. These agents rely on Large Language Models (LLMs) to think step-by-step, make decisions, and carry out tasks using a simple yet powerful four-part structure: planning, action, reflection, and memory.
The paper reviews their current uses in medical diagnosis, hospital workflow automation, and collaborative multi-agent setups. It also highlights a major challenge — most testing still happens in controlled lab settings rather than real hospitals — and discusses what is needed for safe everyday adoption.
Keywords: Medical AI Agents, Autonomous Clinical Systems, Large Language Models, Chain-of-Thought Reasoning, Multimodal Integration, Agentic Framework, Clinical Decision Support, Human-in-the-Loop, Healthcare Automation, Physician Burnout, Precision Medicine, Scoping Review, Simulation Gap, PRISMA-ScR.
Healthcare AI agents support several key areas. They strengthen diagnostic accuracy in critical situations such as predicting sepsis or identifying cancer on scans. They lighten the administrative load by automatically creating summaries from patient records, which helps reduce doctor burnout. They also improve patient involvement by offering personalised guidance, voice-based recovery check-ins, and easier remote care. In addition, groups of specialist agents can work together on complicated cases, such as building complete treatment plans.
EST-GNN: An Explainable Spatio-Temporal Graph Framework with Lévy-Optuna Optimization for CO2 Emission Forecasting in Electrified Transportation
🔥 Citations:
0
Abstract: The accurate and explainable prediction of carbon emissions is crucial for the efficient operation of hybrid and electrified transportation systems and their integration with energy grids. An Explainable Spatio-Temporal Graph Neural Network (EST-GNN) is proposed for highly precise CO2 emission forecasting using Lévy Flight-guided Optuna optimization. By modelling vehicles and their operational characteristics as nodes in a dynamic graph, the proposed framework can jointly learn temporal and spatial correlations while sustaining interpretability. The accuracy of the EST-GNN model is compared with models based on one-hot encoded features, SMOTE-enhanced datasets, and ensemble regressors. Experiments are conducted on a real-world dataset of 7385 vehicle registrations with 12 predictive features. The EST-GNN model outperformed all baseline and traditional models, achieving the highest reliability (R2 = 0.98754) alongside competitive error metrics (RMSE = 6.55, MAE = 2.556), confirming its suitability for resource-constrained and real-time applications, whereas traditional machine learning (ML) techniques showed comparatively low reliability. The optimal solution ensures scalability, robustness, and independence of the deployment environment. The distribution analysis of best-performing models underscores the strength of EST-GNN, which accounts for the largest proportion of best results across evaluation metrics. To achieve superior predictive accuracy, graph-based learning, explainability, and advanced hyperparameter optimization are combined. EST-GNN provides a powerful tool for analyzing fleet emission levels, making energy-aware decisions, and planning sustainable transportation, while conventional ML models remain a useful complement for deployment settings with tight computation budgets and fast response requirements.
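The reported R2, RMSE, and MAE follow from predictions and ground truth by their standard definitions; a pure-Python sketch:

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard R^2, RMSE, and MAE, the three metrics reported for
    EST-GNN (generic definitions, not the paper's code)."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot          # coefficient of determination
    rmse = math.sqrt(ss_res / n)      # root mean square error
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae
```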
Hallucination Inspector: A Fact-Checking Judge for API Migration
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in automated software engineering for tasks such as API migration. While LLMs are able to identify migration patterns, they often make mistakes and fail to produce correct glue code to invoke the new API in place of the old one. We call this issue Scaffolding Hallucination, a failure mode where models generate incorrect calling contexts by inventing Phantom Symbols -- such as imaginary imports, constructors, and constants -- that do not exist in the API specification. In this paper, we show that standard metrics cannot be relied upon to detect these instances of hallucination. We propose Hallucination Inspector, a static analysis tool to detect Scaffolding Hallucination in LLM-generated code. Our approach includes a lightweight evaluation framework that verifies symbols extracted from the abstract syntax tree against a knowledge base derived directly from software documentation for the API. A preliminary evaluation on Android API migrations demonstrates that our approach successfully identifies hallucinations and significantly reduces false positives compared to standard metrics and probabilistic judges.
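The core check, comparing AST-extracted symbols against a documentation-derived knowledge base, can be illustrated in a few lines. This is a simplified sketch: the `phantom_symbols` name and the flat symbol set are assumptions, not the tool's actual interface, and the real system parses the migration language rather than Python.

```python
import ast

def phantom_symbols(code, known_symbols):
    """Flag imported names and called attributes that are absent from
    an API knowledge base (a toy version of the paper's idea)."""
    tree = ast.parse(code)
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            # record fully qualified imported names
            used.update(f"{node.module}.{a.name}" for a in node.names)
        elif isinstance(node, ast.Attribute):
            # record method/attribute names used anywhere in the code
            used.add(node.attr)
    return sorted(s for s in used if s not in known_symbols)
```

Any symbol the model invented (a Phantom Symbol) shows up in the returned list, while standard text-similarity metrics would happily score the hallucinated code as plausible.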
WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring
🔥 Citations:
0
Abstract: Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at https://github.com/mobiiin/WildFire_VQA.
Examiner stratification reveals clinically relevant variability in large language model answers to endodontic patient questions
🔥 Citations:
0
Abstract:
Large language models (LLMs) are increasingly used by patients seeking endodontic information, yet their clinical reliability and safety in patient-centred communication remain uncertain.
This study evaluated the clinical reliability and safety of three contemporary LLMs (ChatGPT GPT-4o, Claude Sonnet 4.5, and Gemini 3 Flash) using 50 patient-centred endodontic questions (35 frequently asked questions and 15 scenario-based prompts). Each question was submitted six times per model in independent sessions. Responses were anonymised and independently assessed by four examiners using a structured Clinical Reliability and Safety Framework. Due to poor inter-examiner agreement, analyses were conducted using examiner stratification. Reproducibility was assessed using word count variability, embedding-based semantic similarity, and lexical distance metrics.
Statistically significant differences in clinical reliability were observed across all examiners. ChatGPT consistently received the lowest scores, whereas Gemini most frequently achieved the highest ratings. Model differentiation was clearer for structured frequently asked questions and selected clinical domains than for scenario-based prompts. All models demonstrated stable response lengths across repeated runs. Gemini showed the highest semantic consistency despite greater surface-level rewording.
Contemporary LLMs demonstrate clinically meaningful variability beyond factual accuracy, particularly in safety framing and clinical actionability. Reliability is influenced by question structure and clinical context. Multidimensional, examiner-aware evaluation frameworks are necessary to meaningfully assess safety and support responsible integration of LLMs into endodontic patient communication.
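The reproducibility metrics named above admit simple sketches: cosine similarity applies to whatever sentence embeddings are used, and a token-set Jaccard distance is one plausible lexical distance (the abstract does not pin down the exact formula, so both function names and the Jaccard choice are assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def jaccard_distance(a, b):
    """Lexical distance between two responses as 1 minus token-set
    Jaccard overlap (an illustrative choice of lexical metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(ta & tb) / len(ta | tb)
```

High embedding similarity combined with high lexical distance is exactly the pattern reported for Gemini: consistent meaning under surface-level rewording.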
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
🔥 Citations:
0
Abstract: Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
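The multiplication-free idea is easy to state in scalar form: each ternary weight selects an add, a subtract, or a no-op. The naive Python sketch below shows the arithmetic only; FairyFuse's actual kernels fuse eight such sub-GEMVs into a single AVX-512 loop with masked adds.

```python
def ternary_matvec(w, x):
    """GEMV with ternary weights in {-1, 0, +1}: no floating-point
    multiplications, only conditional additions and subtractions."""
    out = []
    for row in w:
        acc = 0.0
        for wi, xi in zip(row, x):
            if wi == 1:
                acc += xi       # +1: add
            elif wi == -1:
                acc -= xi       # -1: subtract
            # wi == 0: no-op, the activation is skipped entirely
        out.append(acc)
    return out
```

Because each weight needs only ~2 bits of storage, the weight stream is far smaller than FP16, which is what shifts the memory-bound GEMV toward the compute regime on bandwidth-limited CPUs.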
Language and the framing of historical narrative in large language models: the case of Asia Minor (1922)
🔥 Citations:
0
Abstract: No abstract available; see the original article.
A Hierarchical Spatial Graph Neural Network Resolves Immunogenic and Tolerogenic Tertiary Lymphoid Structures in Renal Cell Carcinoma
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Raahi AI: A Non-Directive AI for Self-Inquiry
DOI:
10.55041/ijsrem60913
🔥 Citations:
0
Abstract: Contemporary large language models (LLMs) are predominantly engineered to maximize answer utility — producing solutions, recommendations, and directive guidance in response to user queries. This paper presents Raahi AI, a conversational AI system designed around an opposing epistemological premise: that genuine psychological self-inquiry is obstructed, not aided, by the delivery of answers. Drawing on the philosophy of Jiddu Krishnamurti — specifically his contention that truth emerges through observation rather than prescribed method — and operationalizing structural parallels with Person-Centered Therapy (PCT) and Motivational Interviewing (MI), we propose a constraint-based prompt architecture that encodes non-directive, dependency-reducing behavior into GPT-4o via MERN stack integration. The system's three core behavioral axioms — (1) problems are symptoms, not roots; (2) dependency on answers is itself the pathology; (3) observation precedes solution — are translated into concrete prompt constraints and evaluated against a custom behavioral rubric. Raahi AI does not advise, diagnose, or solve; it surfaces the questioner's own assumptions through Socratic reflection. Results suggest LLMs can be reliably constrained to simulate a fundamentally different epistemological stance, opening new directions in reflective AI design.
Benchmarking clinical knowledge and multi-modal reasoning of large language models in liver cirrhosis
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models
🔥 Citations:
0
Abstract: Text-to-image (T2I) models like Stable Diffusion and DALL-E have made generative AI widely accessible, yet recent studies reveal that these systems often replicate societal biases, particularly in how they depict demographic groups across professions. Prompts such as 'doctor' or 'CEO' frequently yield lighter-skinned outputs, while lower-status roles like 'janitor' show more diversity, reinforcing stereotypes. Existing mitigation methods typically require retraining or curated datasets, making them inaccessible to most users. We propose a lightweight, inference-time framework that mitigates representational bias through prompt-level intervention without modifying the underlying model. Instead of assuming a single definition of fairness, our approach allows users to select among multiple fairness specifications, ranging from simple choices such as a uniform distribution to more complex definitions informed by a large language model (LLM) that cites sources and provides confidence estimates. These distributions guide the construction of demographic-specific prompt variants in the corresponding proportions, and we evaluate alignment by auditing adherence to the declared target and measuring the resulting skin tone distribution rather than assuming uniformity as 'fairness'. Across 36 prompts spanning 30 occupations and 6 non-occupational contexts, our method shifts observed skin-tone outcomes in directions consistent with the declared target, and reduces deviation from targets when the target is defined directly in skin-tone space (fallback). This work demonstrates how fairness interventions can be made transparent, controllable, and usable at inference time, directly empowering users of generative AI.
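The prompt-level intervention can be sketched as expanding one prompt into demographic-specific variants in the proportions of the declared target. The group names, the phrasing template, and the rounding fix-up below are illustrative assumptions, not the paper's exact construction:

```python
def prompt_variants(base_prompt, target_dist, n_images):
    """Expand one prompt into demographic-specific variants so the
    batch of generations matches a declared target distribution."""
    counts = {g: round(p * n_images) for g, p in target_dist.items()}
    # fix rounding drift so the variant counts sum to n_images
    drift = n_images - sum(counts.values())
    if drift:
        top = max(target_dist, key=target_dist.get)
        counts[top] += drift
    variants = []
    for group, c in counts.items():
        variants += [f"photo of a {group} {base_prompt}"] * c
    return variants
```

Auditing then reduces to comparing the observed skin-tone distribution of the generated batch against `target_dist`, rather than against an assumed uniform ideal.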
Advancements in artificial intelligence for cancer diagnosis and prognosis prediction: current applications and emerging opportunities
🔥 Citations:
0
Abstract: Cancer continues to be a leading cause of mortality worldwide, presenting substantial challenges to public health systems. The traditional approaches to cancer diagnosis and prognosis prediction exhibit certain limitations with respect to accuracy, comprehensiveness, dynamic monitoring, and personalization. With the advancement of artificial intelligence (AI) technologies, novel diagnostic and predictive methods are increasingly addressing these shortcomings. This review provides a comprehensive overview of the primary AI algorithms applied in oncology, including machine learning, deep learning, and large language models. It further examines the distinctive characteristics and appropriate use cases of AI algorithms, highlighting their specific roles in cancer screening, diagnostic accuracy, and outcome forecasting. Additionally, the review discusses emerging trends and persistent challenges, aiming to provide actionable insights that support clinical decision-making and advance scientific innovation in this rapidly evolving field. In conclusion, this review systematically outlines recent advances in AI applications for cancer diagnosis and prognostic prediction, with the objective of facilitating a transformative shift in oncology from experience-based practices toward data-driven precision medicine.
Evaluating large language models for accuracy incentivizes hallucinations
🔥 Citations:
0
Abstract: No abstract available; see the original article.
CHASM: Unveiling Covert Advertisements on Chinese Social Media
🔥 Citations:
0
Abstract: Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging. The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements. Our further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures. We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.
Development of a novel series of thiazole-based compounds with enhanced antiproliferative properties as tubulin polymerization inhibitors
🔥 Citations:
0
Abstract: In cancer therapy, inhibiting tubulin polymerization is a key approach for modifying microtubule dynamics required for cell survival and proliferation. Microtubule destabilizing agents (MDAs), also known as tubulin polymerization inhibitors, prevent tubulin heterodimers from forming microtubules, resulting in catastrophic cellular collapse. A novel series of thiazole-based compounds 8a-o was developed to inhibit tubulin polymerization and assessed for antiproliferative efficacy against the NCI 60 cell line. The structures of the newly synthesized compounds were confirmed using 1H NMR, 13C NMR, and elemental microanalyses. All 15 compounds (8a-o) were assessed for antiproliferative action at a single dose (10 μM) and analyzed against the comprehensive 60-cell panel at five concentrations (0.01, 0.1, 1, 10, and 100 μM). The results from the one-dose and five-dose studies demonstrate that 8b, 8c, 8d, 8m, and 8o are the most prominent antiproliferative agents, exhibiting the most favourable low-micromolar GI50 values across various cell lines, frequently advancing to low-micromolar TGI values and, in numerous sensitive cell lines, achieving LC50 values within the single-digit micromolar range. Compounds 8b, 8d, and 8m showed significant anti-tubulin activity, with IC50 values ranging from 3.86 to 7.19 μM, compared to the reference CA-4 (IC50 = 2.40 μM). In the MCF-7 breast cancer cell line, compound 8m drove a significant accumulation of cells in the G2/M phase, increasing from 13.74% to 45.35%. G2/M arrest is frequently associated with DNA damage or the inhibition of microtubule dynamics, which aligns with Western blot results demonstrating a decrease in tubulin (50 kDa) expression following treatment with 8m. Apoptotic and necrotic experiments indicate that 8m stimulates a defined programmed cell death pathway rather than inducing non-specific toxic necrosis. Molecular docking corroborated binding at the colchicine site, while in silico ADMET profiling indicated a promising drug-like profile for compound 8m.
Use of Large Language Models to Extract Heart Failure Information from Electronic Health Records: A Scoping Review
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Hybrid Policy Distillation for LLMs
🔥 Citations:
0
Abstract: Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
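The forward/reverse KL mixture at the heart of the hybrid objective can be illustrated per token. The plain convex mix and the `alpha` weight below are assumptions for illustration, not the paper's exact loss:

```python
import math

def hybrid_kl(p_teacher, q_student, alpha=0.5):
    """Per-token KD objective mixing forward KL (mode coverage) and
    reverse KL (mode seeking) over two probability distributions,
    in the spirit of hybrid policy distillation."""
    fwd = sum(p * math.log(p / q)
              for p, q in zip(p_teacher, q_student) if p > 0)
    rev = sum(q * math.log(q / p)
              for p, q in zip(p_teacher, q_student) if q > 0)
    return alpha * fwd + (1 - alpha) * rev
```

With `alpha=1` this recovers pure forward KL (the student must cover every teacher mode); with `alpha=0` it recovers reverse KL (the student concentrates on teacher modes it can match). The hybrid trades off the two failure modes.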
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
🔥 Citations:
0
Abstract: Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
Accurate and Efficient Interatomic Potentials for Dislocations in InP
🔥 Citations:
0
Abstract: We present Atomic Cluster Expansion (ACE) and MACE models trained on a new dataset of Density Functional Theory (DFT) calculations, constructed for the task of studying the mobility of dislocations in Indium Phosphide (InP). The models are validated in a suite of tests against RSCAN DFT, and compared with previously published potentials from literature. Our new models act as much better surrogates for DFT than the literature models: errors on partial dislocation formation energies are at most 4% for both ACE and MACE, compared with 18% for the MACE-MPA foundation model and 42-50% for earlier bespoke potentials. The bespoke MACE model achieves this accuracy while being around five times faster to evaluate than the MP0 and MPA foundation models.
A Cloud-Native Architecture for Human-in-Control LLM-Assisted OpenSearch in Investigative Settings
🔥 Citations:
0
Abstract: Complex criminal investigations are often hindered by large volumes of unstructured evidence and by the semantic gap between natural language investigative intent and technical search logic. To address this challenge, we present a design and feasibility study of a cloud-native microservice architecture tailored to private-cloud deployments, contributing to research in secure cloud computing and leveraging modern cloud paradigms under high security and scalability requirements. The proposed system integrates Large Language Models into a "Human-in-Control" workflow that translates natural-language queries into syntactically valid OpenSearch Domain-Specific Language expressions. We describe the implementation of a hybrid retrieval strategy within OpenSearch that combines BM25-based lexical search with nested semantic vector embeddings. The paper focuses on system design and preliminary functional validation, establishing an architectural baseline for future empirical evaluation. Technical feasibility is demonstrated through a functional prototype, and a rigorous evaluation methodology is outlined using the Enron Email Dataset as a structural proxy for restricted investigative corpora.
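A hybrid BM25-plus-vector query of the kind described might be assembled as an OpenSearch DSL body like the one below. The field names (`body`, `chunks`, `chunks.vector`) are illustrative assumptions, not the system's actual schema, and exact k-NN clause support depends on the OpenSearch version and index mapping:

```python
def hybrid_query(text, embedding, k=10):
    """Combine a BM25 lexical clause with a nested k-NN vector clause
    in one OpenSearch bool/should query (sketch, hypothetical schema)."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"body": text}},  # BM25 lexical leg
                    {
                        "nested": {               # semantic leg over chunk embeddings
                            "path": "chunks",
                            "query": {
                                "knn": {
                                    "chunks.vector": {
                                        "vector": embedding,
                                        "k": k,
                                    }
                                }
                            },
                        }
                    },
                ]
            }
        },
    }
```

In the "Human-in-Control" workflow, a generated body like this would be shown to the investigator for review before execution, since the LLM only proposes the DSL.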
Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking
🔥 Citations:
0
Abstract: Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that the human expert decision-making process relies on multi-perspective judgment, in this work, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information via graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of the LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.
SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models
🔥 Citations:
0
Abstract: We present SCM (Sleep-Consolidated Memory), a research preview of a memory architecture for large language models that draws on neuroscientific principles to address a fundamental limitation in current systems: the absence of persistent, structured, and biologically plausible memory. Existing approaches rely on truncating context windows, growing vector databases without bound, or tiered storage systems that lack consolidation and forgetting mechanisms. SCM implements five core components inspired by human memory: a limited-capacity working memory, multi-dimensional importance tagging, offline sleep-stage consolidation with distinct NREM and REM phases, intentional value-based forgetting, and a computational self-model enabling introspection. Across a standardized benchmark suite of eight tests, the prototype achieves perfect recall accuracy over ten-turn conversations while reducing memory noise by 90.9% through adaptive forgetting. Memory search latency remains below one millisecond even with hundreds of stored concepts. This work establishes the architectural foundations for memory systems that consolidate, prioritize, and forget, offering a testable platform for advancing LLM memory research.
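A minimal sketch of the limited-capacity working memory with value-based forgetting that the abstract describes: the store below evicts the lowest-importance item once capacity is exceeded. The class name, capacity, and importance values are hypothetical; SCM's multi-dimensional importance tagging and sleep-stage consolidation are not modeled here.

```python
import heapq

class WorkingMemory:
    """Toy limited-capacity store: when full, the lowest-importance
    item is forgotten first (a stand-in for value-based forgetting)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._heap = []    # min-heap of (importance, insertion_order, content)
        self._order = 0

    def store(self, content, importance):
        self._order += 1
        heapq.heappush(self._heap, (importance, self._order, content))
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)   # forget the least important item

    def recall(self):
        # Return contents ordered by descending importance.
        return [c for _, _, c in sorted(self._heap, reverse=True)]

mem = WorkingMemory(capacity=3)
for item, importance in [("greeting", 0.1), ("user name", 0.9),
                         ("deadline", 0.8), ("small talk", 0.2)]:
    mem.store(item, importance)
```

After four insertions into a capacity-3 store, the lowest-importance entry ("greeting") has been forgotten while higher-value facts survive.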
Odor Maps from the LLM-derived similarity scores
🔥 Citations:
0
Abstract: The application of large language models (LLMs) to OdorSpace analysis attracts growing interest. Recent studies have explored the comparison of sensory evaluation spaces derived from LLMs with odor character profiles in the Dravnieks' dataset. In this study, we calculated pairwise distances of odor descriptors using three distance measures and statistically compared these LLM-derived similarities with distances derived from the original data. Next, we extended this approach to odor names (ingredients). Statistical comparison revealed that LLMs can infer odor similarity to some degree, suggesting the potential of odor maps generated from these similarity data. Applying this approach, we generated an odor map of essential oils. It demonstrates that essential oils within the same group are closely located in the odor map, suggesting that proximity in the odor map corresponds to human evaluation.
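Turning pairwise descriptor distances into a 2-D odor map can be done with classical multidimensional scaling (MDS); the sketch below assumes a symmetric distance matrix and is not necessarily the authors' exact mapping procedure. The distance values are made up.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed a symmetric distance matrix D into `dims` coordinates
    via classical MDS (double centering + top eigenpairs)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix
    vals, vecs = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dims]          # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Hypothetical pairwise distances between three odor descriptors.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
coords = classical_mds(D)
```

Because this toy matrix is exactly Euclidean (three collinear points), the embedded coordinates reproduce the input distances; real LLM-derived similarity data would be embedded only approximately.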
Multi-Context Concatenation Across Requests for LLMs
🔥 Citations:
0
Abstract: Reusing separate, pre-filled Key-Value (KV) Caches for multiple contexts has become a common practice in handling multi-context scenarios with Large Language Models. However, this leads to a lack of cross-attention mechanisms between contexts. To address this, we propose CatLLM, the first method that concatenates multiple contexts across requests offline to compensate for this deficiency. Specifically, during offline processing, CatLLM identifies contexts that severely lack cross-attention by incorporating the weighted inner products of Q and K vectors from tokens in an un-concatenated context into an equivalently transformed weighted formulation for concatenated Q and K inner products. This yields a weighting w_i^{A+B} corresponding to the output vector difference, which can then be used to identify contexts with severe cross-attention deficiencies and concatenate them into a single context for KV Cache computation. Experimental results show that, compared to the baseline of separate caching (i.e., no concatenation), fully concatenating all contexts improves the F1 score by 6%. Meanwhile, the proposed method reduces the number of contexts requiring caching from 10 to 7 while achieving a 3% F1 score improvement, thereby maximizing performance improvement while minimizing the degree of context compression.
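The exact weighted formulation in CatLLM is not recoverable from the abstract alone. As a simplified stand-in, the sketch below measures how much attention mass queries from one context would place on another context's keys if the two were concatenated, which is one way to flag pairs with severe cross-attention deficiency under separate caching; shapes and values are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_mass(Q_a, K_a, K_b):
    """Fraction of attention mass that queries from context A would
    spend on context B's keys if A and B were concatenated. A large
    value suggests A "misses" cross-attention to B when the contexts
    are cached separately (simplified stand-in for CatLLM's weighted
    Q-K inner-product criterion)."""
    d = Q_a.shape[-1]
    logits = Q_a @ np.concatenate([K_a, K_b]).T / np.sqrt(d)
    attn = softmax(logits)                 # (tokens_a, keys_a + keys_b)
    return attn[:, K_a.shape[0]:].sum(axis=1).mean()

rng = np.random.default_rng(0)
Q_a = rng.normal(size=(4, 8))   # queries from context A
K_a = rng.normal(size=(4, 8))   # keys from context A
K_b = rng.normal(size=(4, 8))   # keys from context B
score = cross_attention_mass(Q_a, K_a, K_b)
```

Pairs whose score exceeds a threshold would be candidates for offline concatenation into a single KV cache.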
UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
🔥 Citations:
0
Abstract: Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.
Expanding the extreme-k dielectric materials space through physics-validated generative reasoning
🔥 Citations:
0
Abstract: The most technologically consequential materials are often the rarest: they occupy narrow regions of chemical space, obey competing physical constraints, and appear only sparsely in existing databases. High-kappa dielectrics, high-Tc superconductors, and ferromagnetic insulators are a few examples. This scarcity fundamentally limits today's data-driven materials discovery, where machine-learning models excel at interpolation but struggle to generate genuinely new candidates. Here, we introduce DielecMIND, an artificial intelligence framework that reframes materials discovery as a reasoning-driven exploration instead of a database-screening problem. Using high-kappa dielectrics as a data-scarce and technologically stringent test case, DielecMIND combines, for the first time, large-language-model hypothesis generation with physics-validated first-principles calculations to navigate chemical space beyond known compounds. Prior to our work, only 14 experimentally or computationally validated materials with kappa>150 were known. Our framework discovers and validates 5 new such compounds, expanding this rare-materials class by a remarkable ≈35% in a single study. Among them, we find that Ba2TiHfO6 exhibits a dielectric constant of 637, minimal loss at low optical frequencies, and stability up to 800 K. Beyond dielectrics, this work demonstrates a new paradigm for artificial-intelligence-guided discovery: one that generates a small number of physically grounded, experimentally plausible candidates yet measurably expands sparsely populated functional materials spaces. Thus, DielecMIND points toward a general strategy for discovering rare, high-impact functional materials where data scarcity has long constrained progress.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
🔥 Citations:
0
Abstract: While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.
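The cascaded multimodal reward can be pictured as a gate-then-combine rule: a failed structural check zeroes the reward, and functional and aesthetic signals are blended otherwise. The weights and scores below are made up for illustration and are not the paper's.

```python
def cascaded_reward(structural_ok, functional_score, aesthetic_score):
    """Simplified cascaded reward: structural guarantees gate the
    reward; execution-grounded functional feedback and vision-based
    aesthetic supervision are combined only if the gate passes.
    The 0.7 / 0.3 weights are illustrative assumptions."""
    if not structural_ok:
        return 0.0                         # broken scaffold: no credit
    return 0.7 * functional_score + 0.3 * aesthetic_score

reward_fail = cascaded_reward(False, 1.0, 1.0)  # structural failure gates to 0
reward_ok = cascaded_reward(True, 0.8, 0.5)
```

The gate prevents the policy from collecting aesthetic reward on websites that do not even render, which is the intuition behind coupling structural, functional, and aesthetic signals in a cascade.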
Intersectional Fairness in Large Language Models
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
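The subgroup fairness measurements described above reduce, at their simplest, to per-subgroup accuracy and a worst-case gap across intersectional groups; the sketch below assumes labeled (subgroup, correct) records and uses fabricated toy data.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Accuracy per intersectional subgroup, e.g. (race, gender).
    `records` holds (subgroup, correct) pairs, possibly pooled
    across repeated runs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def max_disparity(acc_by_group):
    """Worst-case accuracy gap across subgroups."""
    accs = list(acc_by_group.values())
    return max(accs) - min(accs)

# Fabricated records: ((race, gender), model answered correctly?)
records = [(("A", "F"), True), (("A", "F"), True),
           (("A", "M"), True), (("A", "M"), False),
           (("B", "F"), False), (("B", "F"), True)]
acc = subgroup_accuracy(records)
gap = max_disparity(acc)
```

Even with high overall accuracy, a large `gap` flags uneven outcome distributions across intersectional groups, which is why the abstract argues for metrics beyond aggregate accuracy.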
SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
🔥 Citations:
0
Abstract: Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
🔥 Citations:
0
Abstract: The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines in questions that require vendor-specific nuances or knowledge of formal standards, such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.
AICEBERG: A Novel Agentic AI Framework for Autonomous Radio Monitoring, Compliance and Governance Based on LLM, MCP, and SCPI in Smart Cities
🔥 Citations:
0
Abstract: Urban radio spectrum monitoring is becoming increasingly complex due to the rapid growth of wireless devices, unauthorized emissions, and dynamic electromagnetic environments in smart cities. Traditional spectrum analysis approaches, based on manual operation or static detection techniques, are no longer sufficient to ensure scalable, autonomous, and secure monitoring. The convergence of two emergent technologies—Large Language Models (LLMs) and the Model Context Protocol (MCP)—facilitates a fundamental shift in radio monitoring. We define this as the AICEBERG paradigm: a novel, stratified architecture where a high-level, intelligent agentic interface (the peak) abstracts the underlying complexity of SCPI-driven hardware integration and radio governance protocols (the foundational base). This autonomous framework provides the necessary objective rigor to audit the stochastic ‘ocean of electromagnetic waves’ characteristic of modern smart cities, ensuring a stable platform for regulatory enforcement amidst high-density signal interference. The proposed system implements a three-layer processing flow, enabling high-level natural language commands to be translated into validated and secure hardware actions on RF spectrum analyzers. A dual-server design separates operational execution from safety validation, ensuring controlled SCPI command handling, parameter verification, and instrument health monitoring. Experimental validation demonstrates the feasibility of autonomous measurement execution. The results show that the proposed architecture reduces human dependency, enhances reproducibility and lowers the expertise barrier required for RF spectrum surveillance. To the best of our knowledge, AICEBERG represents one of the first integrated frameworks to bridge LLMs with SCPI-compliant hardware through the MCP for autonomous radio governance.
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
🔥 Citations:
0
Abstract: Self-play fine-tuning enables large language models to improve beyond supervised fine-tuning without additional human annotations by contrasting annotated responses with self-generated ones. Many existing methods rely on a fixed divergence regime. SPIN is closely related to a KL-based regime, SPACE to a Jensen-Shannon-style objective via noise contrastive estimation, and SPIF to χ²-regularized self-play. Since these divergences exhibit different strengths depending on the distributional gap between model and target, no single choice appears to provide favorable learning dynamics across training stages. We propose IRIS (Interpolative Rényi Iterative Self-play), a Rényi-based self-play fine-tuning framework with a continuously adjustable objective. IRIS decomposes into two independent tilted risk terms over annotated and synthetic data, with exponential importance weights controlled by the order parameter α. We show that several self-play objectives can be interpreted as limiting or representative regimes at particular values of α, providing a unified theoretical perspective on these methods. An adaptive order schedule further adjusts α to the distributional gap, shifting from sharper importance weighting early in training to smoother refinement near convergence. Theoretically, we establish the fixed-point property of IRIS and analyze how α controls gradient concentration. Experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks show that IRIS improves upon baselines, reaching a 44.57% average score with gains across iterations. In our setting, IRIS with only 26k annotated samples surpasses standard supervised fine-tuning trained on the full 200k dataset.
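The abstract's tilted risk terms are not fully specified, but the textbook tilted (exponentially weighted) empirical risk illustrates how an order parameter interpolates between mean-like and worst-case weighting; this generic form is an assumption for illustration, not necessarily IRIS's exact objective.

```python
import math

def tilted_risk(losses, alpha):
    """Generic tilted empirical risk:
        R_alpha = (1/alpha) * log( mean( exp(alpha * loss_i) ) ).
    alpha -> 0 recovers the plain mean; large alpha emphasizes the
    worst losses (sharper importance weighting). Computed via a
    log-sum-exp for numerical stability."""
    n = len(losses)
    m = max(alpha * l for l in losses)           # stabilizer
    lse = m + math.log(sum(math.exp(alpha * l - m) for l in losses) / n)
    return lse / alpha

losses = [0.1, 0.2, 2.0]
mild = tilted_risk(losses, alpha=0.01)    # close to the mean loss
sharp = tilted_risk(losses, alpha=10.0)   # dominated by the worst loss
```

Sweeping the order parameter continuously between these regimes is the kind of interpolation an adaptive α schedule exploits.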
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
🔥 Citations:
0
Abstract: The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.
The Path Not Taken: Duality in Reasoning about Program Execution
🔥 Citations:
0
Abstract: Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images
🔥 Citations:
0
Abstract: Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
🔥 Citations:
0
Abstract: Increasingly, studies are exploring using Large Language Models (LLMs) for accelerated or scaled qualitative analysis of text data. While we can compare LLM accuracy against human labels directly for deductive coding, or labeling text, it is more challenging to judge the ethics and effectiveness of using LLMs in abstractive methods such as inductive thematic analysis. We collaborate with psychologists to study the abstractive claims LLMs make about human life stories, asking, how does using an LLM as an interpreter of meaning affect the conclusions and perspectives of a study? We propose a summarization-based pipeline for surfacing biases in perspective-taking an LLM might employ in interpreting these life stories. We demonstrate that our pipeline can identify both race and gender bias with the potential for representational harm. Finally, we encourage the use of this analysis in future studies involving LLM-based interpretation of study participants' written text or transcribed speech to characterize a positionality portrait for the study.
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
🔥 Citations:
1
Abstract: The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.
AI Workflow Automation Agent & Multi-Agent System using LangChain and LangGraph
DOI:
10.55041/ijsrem60971
🔥 Citations:
0
Abstract: This paper examines the architectural shift from linear Large Language Model (LLM) chains to stateful, multi-agent systems (MAS) for the automation of intricate workflows. Traditional automation depends on strict, procedural logic; LangGraph, a low-level orchestration framework built on LangChain, instead makes it possible to build cyclical, event-driven agentic workflows. We examine the main ideas behind graph-based reasoning, such as the use of nodes for functional logic and edges for conditional routing. The core of this study is the assessment of state management via reducer-driven schemas and persistent checkpointers, facilitating durable execution and human-in-the-loop (HITL) interactions. Our research analyses diverse orchestration patterns, including supervisor-worker and collaborative teams, by comparing performance metrics across industry-standard frameworks. Experimental data shows that LangGraph-based systems can complete up to 88% of tasks correctly when multi-step reasoning is required, a 20–30% improvement in engagement and operational efficiency over rule-based systems. This framework provides a solid base for enterprise-level autonomous agents capable of self-correction and complex decision-making.
Key Words: AI agents, LangChain, LangGraph, multi-agent systems, workflow automation, LLM orchestration.
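The node-and-edge pattern described above can be illustrated without the LangGraph library itself: a toy engine where nodes are functions over a shared state dict and conditional edges route execution, mimicking (in miniature) cyclical graph orchestration. All node names and routing logic below are invented for illustration.

```python
# Nodes are functions over a shared state dict; edges map each node
# to a routing function that picks the next node (the pattern LangGraph
# provides via add_node / add_conditional_edges; this toy only mimics it).

def draft(state):
    state["revisions"] = state.get("revisions", 0) + 1
    state["text"] = "draft v" + str(state["revisions"])
    return state

def review(state):
    state["approved"] = state["revisions"] >= 2   # stand-in reviewer logic
    return state

def route_after_review(state):
    # Conditional edge: loop back for another revision, or finish.
    return "END" if state["approved"] else "draft"

nodes = {"draft": draft, "review": review}
edges = {"draft": lambda s: "review", "review": route_after_review}

def run(entry, state):
    node = entry
    while node != "END":
        state = nodes[node](state)
        node = edges[node](state)
    return state

final = run("draft", {})
```

The draft-review cycle executes twice before the conditional edge routes to END, which is exactly the kind of loop a linear chain cannot express.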
Exploring chalcone analogues as potential SARS-CoV-2 inhibitors: insights from synthesis, crystal analysis, molecular dynamics simulations, docking and ADMET studies
DOI:
10.1098/rsos.251866
🔥 Citations:
0
Abstract: Chalcone-based analogues have gained significant attention due to their promising inhibitory activity against multiple viral targets. To identify potential anti-SARS-CoV-2 drug candidates, we designed and synthesized novel chalcone analogues via the Claisen–Schmidt reaction, affording excellent product yields (87–94%). Compound 3c was crystallized by slow evaporation of an ethanolic solution at room temperature. X-ray diffraction analysis revealed that the compound crystallized in the monoclinic crystal system with the P2₁/c space group. The crystallographic data uncovered crystal packing, bond lengths, bond angles and other key parameters. Docking studies revealed that the synthesized compounds 3(a–d) exhibited moderate binding affinities, with compound 3d showing the strongest affinity at −4.657 kcal mol⁻¹. These values are less favourable than the reference inhibitors (−7.424 and −8.245 kcal mol⁻¹ for the co-crystallized ligand and remdesivir, respectively), suggesting potential for further optimizations through medicinal chemistry approaches. The molecular dynamics simulations of the 3d complex demonstrated fluctuations within a satisfactory range (2.5–4.5 Å), as observed in the root mean square deviation plots at 100 ns. Furthermore, absorption, distribution, metabolism and excretion analysis indicated that all compounds displayed drug-like properties, with high gastrointestinal absorption, suggesting bioavailability. The present work provides a promising platform for further development of chalcone-based compounds as promising anti-SARS-CoV-2 agents.
DWTSumm: Discrete Wavelet Transform for Document Summarization
🔥 Citations:
0
Abstract: Summarizing long, domain-specific documents with large language models (LLMs) remains challenging due to context limitations, information loss, and hallucinations, particularly in clinical and legal settings. We propose a Discrete Wavelet Transform (DWT)-based multi-resolution framework that treats text as a semantic signal and decomposes it into global (approximation) and local (detail) components. Applied to sentence- or word-level embeddings, DWT yields compact representations that preserve overall structure and critical domain-specific details, which are used directly as summaries or to guide LLM generation. Experiments on clinical and legal benchmarks demonstrate comparable ROUGE-L scores. Compared to a GPT-4o baseline, DWT-based summarization consistently improves semantic similarity and grounding, achieving gains of over 2% in BERTScore, more than 4% in Semantic Fidelity, factual consistency in legal tasks, and large METEOR improvements indicative of preserved domain-specific semantics. Across multiple embedding models, Fidelity reaches up to 97%, suggesting that DWT acts as a semantic denoising mechanism that reduces hallucinations and strengthens factual grounding. Overall, DWT provides a lightweight, generalizable method for reliable long-document and domain-specific summarization with LLMs.
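The approximation/detail split at the heart of the framework can be sketched with one level of the Haar DWT; the input below is a made-up 1-D "semantic signal" standing in for one embedding dimension across consecutive sentences, not the paper's actual pipeline.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: returns
    (approximation, detail) coefficients. Applied per embedding
    dimension, the approximation captures the coarse global trend
    and the detail captures local deviations."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:                       # pad odd-length input
        x = np.append(x, x[-1])
    pairs = x.reshape(-1, 2)
    approx = pairs.sum(axis=1) / np.sqrt(2)            # pairwise averages
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # pairwise differences
    return approx, detail

# Hypothetical signal: two similar sentences, then a sharp topic shift.
approx, detail = haar_dwt([4.0, 4.0, 2.0, 0.0])
```

The first pair of identical values yields zero detail, while the shift in the second pair shows up entirely in the detail band, which is the property that lets detail coefficients flag locally important content.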
Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
🔥 Citations:
0
Abstract: Integrating Large Language Models (LLMs) into complex software systems enables the generation of human-understandable explanations of opaque AI processes, such as automated task planning. However, the quality and reliability of these explanations heavily depend on effective prompt engineering. The lack of a systematic understanding of how diverse stakeholder groups formulate and refine prompts hinders the development of tools that can automate this process. We introduce COMPASS (COgnitive Modelling for Prompt Automated SynthesiS), a proof-of-concept self-adaptive approach that formalises prompt engineering as a cognitive and probabilistic decision-making process. COMPASS models users' unobservable latent cognitive states (such as attention, comprehension, and uncertainty) together with observable interaction cues as a POMDP, whose synthesised policy enables adaptive generation of explanations and prompt refinements. We evaluate COMPASS using two diverse cyber-physical system case studies to assess adaptive explanation generation and its quality, both quantitatively and qualitatively. Our results demonstrate the feasibility of COMPASS's integration of human cognition and user-profile feedback into automated prompt synthesis in complex task planning systems.
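Modeling latent cognitive states as a POMDP implies the standard belief update, sketched below with two invented latent states and illustrative transition/observation probabilities; this is the textbook update step, not COMPASS's synthesised policy.

```python
import numpy as np

def belief_update(belief, T, O, action, obs):
    """Standard POMDP belief update:
        b'(s') ∝ O[a][s', o] * sum_s T[a][s, s'] * b(s)
    `T[a]` is the transition matrix for action a, `O[a]` holds
    observation likelihoods per resulting state."""
    predicted = belief @ T[action]              # predict step
    unnorm = predicted * O[action][:, obs]      # correct with observation
    return unnorm / unnorm.sum()

# Hypothetical latent states: 0 = user following, 1 = user confused.
T = {"explain": np.array([[0.9, 0.1],
                          [0.6, 0.4]])}
# Hypothetical observations: 0 = no question, 1 = clarifying question.
O = {"explain": np.array([[0.8, 0.2],
                          [0.3, 0.7]])}
b = belief_update(np.array([0.5, 0.5]), T, O, "explain", obs=1)
```

Observing a clarifying question shifts belief toward the "confused" state, and a policy over such beliefs is what would drive adaptive explanation generation.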
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
🔥 Citations:
0
Abstract: Auto-regressive Large Language Models (LLMs) achieve strong performance on coding tasks, but incur high memory and inference costs. Diffusion-based language models (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) under a standardized evaluation pipeline. In our setup, CoDA exhibits greater robustness at low bitwidths (2-4 bits) than Qwen3-1.7B, its auto-regressive counterpart, with smaller accuracy degradation across the HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. These results suggest that diffusion LLMs may offer advantages for efficient deployment due to their greater quantization resilience.
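The bitwidth trade-off being probed can be seen in miniature with the simplest round-to-nearest symmetric quantizer (GPTQ and HAWQ are considerably more sophisticated, but the error-vs-bits shape is the same). The weight values below are invented.

```python
def fake_quantize(weights, bits):
    """Symmetric uniform round-to-nearest quantization to `bits` bits,
    immediately dequantized so the error is visible in float."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

weights = [0.12, -0.50, 0.33, 0.91, -0.27, 0.04]
# Mean-squared reconstruction error shrinks as bitwidth grows.
mse = {bits: sum((w - q) ** 2 for w, q in zip(weights, fake_quantize(weights, bits)))
       for bits in (2, 4, 8)}
```

Robustness at "2-4 bits" means a model tolerates the large 2-bit error with little downstream accuracy loss.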
Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation
🔥 Citations:
0
Abstract: Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.
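The Bayesian inverse inference step can be sketched in isolation: a prior over candidate items is reweighted by the likelihood of the observed utterance under each item. In SiPeR that likelihood comes from an MLLM; here a toy lookup table stands in, and all items and values are invented.

```python
def inverse_infer(prior, likelihood, utterance):
    """Bayesian inverse inference: P(item | utterance) ∝ P(utterance | item) · P(item)."""
    post = {item: likelihood[item].get(utterance, 1e-9) * p
            for item, p in prior.items()}
    z = sum(post.values())
    return {item: v / z for item, v in post.items()}

prior = {"espresso": 0.4, "green_tea": 0.4, "cake": 0.2}
likelihood = {  # toy stand-in for MLLM-scored likelihoods
    "espresso": {"something to keep me awake": 0.70},
    "green_tea": {"something to keep me awake": 0.40},
    "cake": {"something to keep me awake": 0.05},
}
posterior = inverse_infer(prior, likelihood, "something to keep me awake")
```

Items equally likely a priori are separated by how well they explain the user's utterance, which is the sense in which the preference is "implicit".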
AnalogMaster: Large Language Model-based Automated Analog IC Design Framework from Image to Layout
🔥 Citations:
0
Abstract: Design automation has the potential to substantially improve the efficiency of analog integrated circuit (IC) design. However, existing algorithms and tools typically focus on individual stages, such as device sizing, placement, or routing, and still require significant manual intervention to complete the full design flow. While large language models (LLMs) have recently demonstrated remarkable success in automating digital IC design workflows, these advances cannot be directly transferred to analog IC design. Key challenges include strongly coupled performance metrics, the predominance of unstructured circuit schematic images, and the fact that most prior approaches address only isolated stages of the analog design process, limiting their ability to capture end-to-end performance impact. To address these challenges, we propose AnalogMaster, an extensible, LLM-based framework that enables end-to-end automation of analog IC design through a unified pipeline spanning circuit image-to-netlist generation, parameter optimization, placement, and routing. AnalogMaster integrates a joint reasoning mechanism that leverages in-context learning and intent reasoning to achieve accurate and robust image-to-netlist conversion. A parameter search agent integrating self-enhanced prompt engineering and context truncation is developed for effective device sizing and downstream physical design. Experimental evaluations on 15 representative circuits with varying levels of complexity demonstrate strong and consistent performance across multiple models. In particular, GPT-5 achieves success rates of 92.9% and 99.9% on Pass@1 and Pass@5, respectively. These results validate the effectiveness and robustness of the proposed framework and establish a practical paradigm for applying LLMs to full-stack analog IC design automation.
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
🔥 Citations:
0
Abstract: Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
Performance of three large language models in answering parent-focused questions on rickets: a dual pediatric–orthopedic specialist evaluation
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Lightweight large language models for early sepsis prediction via a semantic abstraction rule engine
🔥 Citations:
0
Abstract: No abstract available; see the original article.
In-vitro analysis, metabolic profiling and in-silico molecular docking analysis shows antidiabetic potential of phytofabricated cerium oxide nanoparticles against diabetes related receptors
🔥 Citations:
0
Abstract: Plants play a vital role in sustaining life on Earth, offering essential resources such as food, energy, and shelter, and serving as a key source for drug development. Chemical pharmaceutics come with a range of disadvantages, including potential side effects, high costs, and limited availability in certain regions. There is an urgent need for sustainable, cost-effective, plant-based drug development approaches to ensure accessibility, minimize environmental impact, and promote long-term healthcare solutions. Nanobiotechnology, focusing on nanoscale materials, has vast applications in drug delivery and therapy. This study synthesized cerium oxide nanoparticles (CeO2NPs) using Mentha piperita L. leaf extract and analyzed their properties through UV-Vis spectrophotometry, SEM, EDX, FTIR, and XRD. Various CeO2NP nanoformulations were tested for antioxidant and antidiabetic efficacy via DPPH, ABTS, H2O2, alpha-amylase, alpha-glucosidase, and anti-sucrase assays, and glucose absorption in yeast cells. CeO2NPs were found to have an average size of 46-60 nm with a spherical shape, decorated with various important functional groups. LCMS analysis identified compounds in the plant extract, which were then subjected to ADMET analysis. Selected compounds underwent molecular docking using AutoDock Vina software. Results showed that CeO2NPs exhibited the highest antioxidant potential of 56.6% (DPPH) and 46.16% (ABTS) at 500 μg/mL. Further, a dose-dependent response of CeO2NPs was found: 78.03% inhibition against α-amylase, 68.6% inhibition against α-glucosidase, 59% inhibition against sucrase, and 76% inhibition of glucose uptake by yeast cells at a concentration of 25 mmol/L glucose.
LCMS analysis of the plant extract identified a total of 74 compounds, while ADMET analysis and virtual screening yielded four bioactive compounds: Luteolin-7-O-rutinoside, Neoeriocitrin, Diosgenin, and Apigetrin. Of these, Luteolin-7-O-rutinoside and Neoeriocitrin showed the highest binding energies: −10.4 and −9.7 kcal mol−1 against alpha-amylase, −9.7 and −9.2 kcal mol−1 against alpha-glucosidase, and −9.5 and −9.3 kcal mol−1, respectively, against sucrase enzymes. Toxicity assessment and Lipinski's rule of five revealed that all four compounds are completely safe to be used as drugs against diabetes and specifically for HBB. In conclusion, Mentha piperita L. CeO2NPs exhibit strong antioxidant and antidiabetic potential. Mentha contains various bioactive compounds with binding affinities to diabetes-related receptors, warranting further exploration for pharmaceutical applications.
From autonomy to alliance: Robotic foundation models must learn with us, not just for us
🔥 Citations:
0
Abstract: This Viewpoint urges reimagining of robotic foundation models, from treating the robot as a solitary, omnipotent agent to embracing a multiagent, alliance-aware paradigm. Alliance-aware models learn with humans and other robots, not merely for them, by embedding mechanisms that foster social interaction and generalization across heterogeneous partners. We outline six design pillars that cultivate such collaborative intelligence: interaction priors, partner modeling (machine theory of mind), modular and composable policies, norm adaptation, trust-aware memory, and communication. Together, these pillars empower robots to fluidly switch social roles, adapt to unfamiliar collaborators, and coordinate robustly within dynamic multiagent ecologies spanning homes, factories, clinics, and field operations.
Revisiting Target‐Aware de novo Molecular Generation with TarPass: Between Rational Design and Texas Sharpshooter
DOI:
10.1002/advs.75411
🔥 Citations:
0
Abstract: Target-aware molecular generation models hold promise for drug discovery, but it remains unclear whether they genuinely exploit target information or merely resemble the Texas Sharpshooter fallacy by retrospectively rationalizing outputs. To address this, we introduce TarPass, a benchmark comprising a curated dataset of 18 well-studied targets with expert-annotated key interactions and experimentally validated active compounds, enabling fair evaluation of target-aware de novo molecular generation models. We assessed 15 representative models across three paradigms: non-3D, 3D in situ, and optimization-based, considering protein-ligand interactions (PLIs), molecular plausibility, and drug-likeness. Results show that 3D in situ models have a modest average advantage in predicted PLIs. However, many fail to outperform random sampling. Non-3D models, benefiting from broader pretraining, generate more drug-like and synthesizable molecules but exhibit weaker target specificity. Optimization-based methods effectively redirect outputs toward favorable chemical regions for single properties, often at the expense of others, for example by reducing compliance with Lipinski's rules. Integrating these insights, we propose a multi-tier virtual screening workflow for target-aware molecular generation as a post-processing strategy to enrich molecules with improved PLIs and plausibility. Overall, this study highlights the limitations of current models in capturing fine-grained target-specific constraints and provides a standardized framework for future structure-based drug design.
Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids
🔥 Citations:
0
Abstract: Fault location in distribution grids is critical for reliability and minimizing outage durations. Yet, it remains challenging due to partial observability, given sparse measurement infrastructure. Recent works show promising results by combining Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs) for spatio-temporal learning. Still, many modern GNN architectures remain untested for this grid application, while existing GNN solutions have not explored GNN topology definitions beyond simply adopting the full grid topology to construct the GNN graph. We address these gaps by (i) systematically comparing a newly proposed graph-forming strategy (measured-only) to the traditional full-topology approach, and (ii) introducing STGNN (Spatio-temporal GNN) models based on GraphSAGE and an improved Graph Attention (GATv2), for distribution grid fault location; (iii) benchmarking them against state-of-the-art STGNN and RNN baselines on the IEEE 123-bus feeder. In our experiments, all evaluated STGNN variants achieve high performance and consistently outperform a pure RNN baseline, with improvements up to 11 percentage points F1. Among STGNN models, the newly explored RGATv2 and RGSAGE achieve only marginally higher F1 scores. Still, STGNNs demonstrate superior stability, with tight confidence intervals (within +/- 1.4%) compared to the RNN baseline (up to +/- 7.5%) across different experiment runs. Finally, our proposed reduced GNN topology (measured-only) shows clear benefits in both (i) model training time (6-fold reduction) and (ii) model performance (up to 11 points F1). This suggests that measured-only graphs offer a more practical, efficient, and robust framework for partially observable distribution grids.
To Know is to Construct: Schema-Constrained Generation for Agent Memory
🔥 Citations:
0
Abstract: Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks "Structural Hallucination", where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.
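The "formal guarantee" of constrained decoding can be illustrated with a prefix filter over valid keys: at each step only tokens that extend some valid key are allowed, so a non-existent key is unreachable by construction. The schema, tokens, and scoring function below are invented stand-ins, not SCG-MEM's actual components.

```python
def allowed_next(valid_keys, prefix):
    """Return the tokens that keep `prefix` extendable to a valid key."""
    return {key[len(prefix)] for key in valid_keys
            if len(key) > len(prefix) and key[:len(prefix)] == prefix}

def constrained_decode(valid_keys, score):
    """Greedy decoding restricted to the key trie; `score` stands in for
    the LLM's per-token preference at each step."""
    prefix = ()
    while prefix not in valid_keys:
        prefix += (max(allowed_next(valid_keys, prefix), key=score),)
    return prefix

# Toy cognitive schema: memory keys as token tuples.
schema = {("user", "allergy"), ("user", "job"), ("trip", "paris")}
pref = {"user": 2.0, "trip": 0.5, "allergy": 3.0, "job": 1.0, "paris": 4.0}
key = constrained_decode(schema, lambda t: pref[t])
```

Note the high-scoring token "paris" is never reachable after "user": the constraint, not the score, rules it out.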
From Scene to Object: Text-Guided Dual-Gaze Prediction
🔥 Citations:
0
Abstract: Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.
The MIF-CD74 axis drives colorectal cancer via glycolytic reprogramming and is targeted by a novel small-molecule inhibitor
🔥 Citations:
0
Abstract: Background: Macrophage migration inhibitory factor (MIF) promotes inflammation, regulates immune responses and chemotherapy resistance in the tumor microenvironment. However, its mechanism of action in colorectal cancer (CRC) metabolic reprogramming and targeted therapeutic potential remain unclear. This study aims to investigate the function, mechanism, and targeted therapeutic potential of MIF in CRC. Methods: Data were integrated from TCGA, GTEx, CPTAC, and HPA databases with clinical sample validation. Single-cell sequencing analysis (datasets GSE166555 and GSE144735) was performed, alongside functional assays and mechanistic studies. A novel high-potency MIF inhibitor was identified through virtual screening and validated in vitro and in vivo. Results: MIF expression was found to be significantly elevated in CRC tissues and cell lines, correlating with poor overall survival (OS) and disease-specific survival (DSS). Single-cell sequencing confirmed malignant epithelial cells as the primary MIF source. Functional assays demonstrated that MIF knockout suppressed CRC cell proliferation, migration, and tumor growth in vivo, while MIF overexpression promoted these effects. Mechanistically, MIF binds CD74 to upregulate glycolytic enzymes (HK2, PKM2, LDHA), enhancing glucose uptake and lactate/pyruvate production, thereby driving the Warburg effect and CRC progression. Virtual screening identified a novel high-potency MIF inhibitor, F3277-0933 (IC50 = 8.284 μM). In vitro and in vivo, F3277-0933 surpassed the classical inhibitor ISO-1 in suppressing MIF-driven glycolytic reprogramming and proliferation. Conclusion: This study elucidates a novel mechanism by which the MIF-CD74 axis drives CRC progression through glycolytic reprogramming and provides robust preclinical evidence for developing MIF-targeted therapies. Graphical Abstract: The MIF-CD74 axis drives colorectal cancer progression via glycolytic reprogramming.
Extracellular MIF binds to the CD74/CD44 receptor complex on the surface of colorectal cancer cells, initiating downstream signaling cascades. This signal transduction leads to the transcriptional upregulation of key glycolytic enzymes (HK2, PKM2, and LDHA), driving a metabolic switch towards the Warburg effect—characterized by enhanced glucose uptake, increased lactate production (High ECAR), and suppressed mitochondrial oxidative phosphorylation (Low OCR). This metabolic reprogramming fuels malignant progression. The novel small-molecule inhibitor, F3277-0933, identified via structure-based virtual screening, specifically targets the MIF tautomerase active site. By blocking this oncogenic signaling axis, F3277-0933 effectively reverses the glycolytic phenotype and suppresses tumor growth. Supplementary Information The online version contains supplementary material available at 10.1007/s13402-026-01202-9.
AI models of unstable flow exhibit hallucination
🔥 Citations:
0
Abstract: We report the first systematic evidence of hallucination in AI models of fluid dynamics, demonstrated in the canonical problem of hydrodynamically unstable transport known as viscous fingering. AI-based modeling of flow with instabilities remains challenging because rapidly evolving, multiscale fingering patterns are difficult to resolve accurately. We identify solutions that appear visually realistic yet are physically implausible, analogous to hallucinations in large language models. These hallucinations manifest as spurious fluid interfaces and reverse diffusion that violate conservation laws. We show that their origin lies in the spectral bias of AI models, which becomes dominant at high flow rates and viscosity contrasts. Guided by this insight, we introduce DeepFingers, a new framework for AI-driven fluid dynamics that enforces balanced learning across the full spectrum of spatial modes by combining the Fourier Neural Operator with a Deep Operator Network to predict the spatiotemporal evolution of viscous fingers. By conditioning on both time and viscosity contrast, DeepFingers learns mappings between successive concentration fields across regimes. The framework accurately captures tip splitting, finger merging, and channel formation while preserving global metrics of mixing. The results open a new research direction to investigate fundamental limitations in AI models of physical systems.
Text Steganography with Dynamic Codebook and Multimodal Large Language Model
🔥 Citations:
0
Abstract: With the popularity of large language models (LLMs), text steganography has achieved remarkable performance. However, existing methods still have some issues: (1) in the white-box paradigm, the steganographic behavior is prone to exposure because Alice and Bob share an off-the-shelf language model; (2) in the black-box paradigm, methods lack flexibility and practicality because Alice and Bob must share a fixed codebook as well as a specific extraction prompt for each steganographic sentence. To improve security and practicality, we introduce a black-box text steganography method with a dynamic codebook and a multimodal large language model. Specifically, we first construct a dynamic codebook via a shared session configuration and a multimodal large language model. An encrypted steganographic mapping is then designed to embed secret messages during steganographic caption generation. Furthermore, we introduce a feedback optimization mechanism based on rejection sampling to ensure accurate extraction of secret messages. Experimental results show that the proposed method outperforms existing white-box text steganography methods in terms of embedding capacity and text quality. Meanwhile, the proposed method achieves better practicality and flexibility than the existing black-box paradigm on several popular online social networks.
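How a shared session secret can induce a codebook without ever transmitting one can be shown with a toy keyed-bit construction. This is an illustration of the general idea, not the paper's scheme: the hash-based bit assignment, the candidate pool, and the session key are all invented.

```python
import hashlib

def keyed_bit(session_key, token):
    """Alice and Bob derive each candidate token's hidden bit from the
    shared session secret, so no static codebook is exchanged."""
    return hashlib.sha256(f"{session_key}:{token}".encode()).digest()[0] & 1

def embed(session_key, secret_bits, candidates_per_step):
    """At each generation step, pick a (model-proposed) candidate token
    whose keyed bit equals the next secret bit."""
    return [next(t for t in cands if keyed_bit(session_key, t) == bit)
            for bit, cands in zip(secret_bits, candidates_per_step)]

def extract(session_key, stego_tokens):
    """Bob recomputes each token's keyed bit to recover the message."""
    return [keyed_bit(session_key, t) for t in stego_tokens]

key = "session-42"
pool = ["a", "the", "sunny", "bright", "river", "mountain", "dog", "cat",
        "blue", "green", "old", "small", "quiet", "warm", "open", "near"]
secret = [1, 0, 1, 1, 0]
stego = embed(key, secret, [pool] * len(secret))
```

In a real system the candidate pool at each step would come from the caption model's top predictions, and a rejection-sampling loop would discard steps where no acceptable candidate carries the needed bit.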
A critical analysis of MBTI-based personality profiling with large language models
🔥 Citations:
0
Abstract: This paper critically analyzes MBTI-based personality profiling using Large Language Models (LLMs), examining both their use as tools for inferring human personality and as subjects evaluated through psychometric frameworks. We review recent work (2020–2025) spanning traditional machine learning, fine-tuned transformer models, and zero-shot prompting approaches across datasets such as Kaggle MBTI, PersonalityCafe, Pandora, and MBTIBench. While top-performing LLM-based systems report 75%–85% accuracy at the dichotomy level, improvements over baselines are often modest, domain-dependent, and sensitive to dataset biases. Recent benchmarks employing soft labels reveal systematic issues, including polarized predictions, overconfidence, and limited calibration relative to population trait distributions. Beyond predictive performance, we examine emerging research that applies MBTI instruments directly to LLMs, showing that models exhibit reproducible yet context-dependent “personality-like” profiles, often skewed toward socially desirable traits due to alignment training. These findings raise conceptual questions about whether stable internal dispositions can meaningfully be attributed to generative systems whose outputs vary across prompts and versions. We argue that MBTI-based modeling with LLMs faces three core challenges: psychometric limitations of the MBTI construct itself, methodological weaknesses in self-reported training data, and philosophical ambiguity regarding the notion of AI personality. The paper concludes by outlining ethical risks, evaluation gaps, and research directions for more rigorous, calibrated, and theoretically grounded personality modeling in artificial intelligence systems.
DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge
🔥 Citations:
0
Abstract: Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft tokens are generated locally on devices and subsequently offloaded to a centralized edge server for batch verification. The key challenge is to sustain high throughput under coupled decisions of (i) batching and pipeline scheduling and (ii) per user draft token length. We propose DiP-SD, which exploits two complementary parallelism dimensions: device-level distributed drafting and phase-level draft-verify pipelining. We formulate a throughput-maximization objective, defined as the expected number of accepted tokens per unit time, and jointly optimize the number of batches, user-to-batch assignment, and integer draft lengths. To solve the resulting fractional mixed-integer program, DiP-SD scans the batch number and iteratively alternates between an association subproblem and a draft-length subproblem. Numerical results under a Qwen3-1.7B/Qwen3-32B device-edge deployment show that DiP-SD achieves up to 17.89x throughput over autoregressive decoding (AD) and 1.93x over AD with greedy batching.
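The throughput objective can be sketched for a single user using the standard speculative-decoding result: if each draft token is accepted independently with probability alpha, a round of draft length gamma yields (1 − alpha^(gamma+1)) / (1 − alpha) expected tokens. The timing constants below are invented, and this single-user sketch omits the paper's batching and assignment subproblems.

```python
def expected_accepted(alpha, gamma):
    """Expected tokens per draft-verify round under i.i.d. acceptance
    with probability alpha: (1 - alpha**(gamma+1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def best_draft_length(alpha, t_draft, t_verify, max_gamma=16):
    """Throughput-maximizing integer draft length for one user, where a
    round of length g costs g*t_draft (on-device) + t_verify (at the edge)."""
    def throughput(g):
        return expected_accepted(alpha, g) / (g * t_draft + t_verify)
    return max(range(1, max_gamma + 1), key=throughput)
```

Users whose drafts are accepted more often warrant longer draft lengths, which is the coupling the joint optimization exploits.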
Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation
🔥 Citations:
0
Abstract: Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p<10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).
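Two of the four serialisation strategies can be sketched on a simplified medication record. The field names below are placeholders, not real FHIR resource structure, and the phrasing is not the paper's exact template.

```python
def to_markdown_table(meds):
    """Markdown Table serialisation of a (simplified) medication list."""
    rows = ["| Medication | Dose | Status |", "| --- | --- | --- |"]
    rows += [f"| {m['name']} | {m['dose']} | {m['status']} |" for m in meds]
    return "\n".join(rows)

def to_clinical_narrative(meds):
    """Clinical Narrative serialisation: plain prose, active meds first."""
    active = [m for m in meds if m["status"] == "active"]
    stopped = [m for m in meds if m["status"] == "stopped"]
    text = ("The patient is currently taking "
            + ", ".join(f"{m['name']} {m['dose']}" for m in active) + ".")
    if stopped:
        text += " Discontinued: " + ", ".join(m["name"] for m in stopped) + "."
    return text

meds = [{"name": "metformin", "dose": "500 mg", "status": "active"},
        {"name": "lisinopril", "dose": "10 mg", "status": "active"},
        {"name": "warfarin", "dose": "5 mg", "status": "stopped"}]
table = to_markdown_table(meds)
narrative = to_clinical_narrative(meds)
```

The study's finding is that which of these strings a sub-8B model reads best is not a cosmetic detail but worth double-digit F1.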
Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring
🔥 Citations:
0
Abstract: We propose a post-hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre-trained foundation models without requiring additional fine-tuning. Our method yields an anomaly score directly interpretable as a false alarm rate (p-value), facilitating transparent and actionable decision-making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out-of-sample guarantees. As a model-agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource-constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real-world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.
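The "anomaly score as a p-value" idea rests on the standard (weighted) split-conformal construction; the sketch below shows that core step only, with invented nonconformity scores, and omits the paper's adaptive learning of the weights.

```python
def conformal_p_value(calib_scores, test_score, weights=None):
    """Weighted split-conformal p-value: the (weighted) share of
    calibration nonconformity scores at least as large as the test
    score, with the usual +1 smoothing. Usable as a false-alarm rate."""
    if weights is None:
        weights = [1.0] * len(calib_scores)
    hits = sum(w for s, w in zip(calib_scores, weights) if s >= test_score)
    return (hits + 1.0) / (sum(weights) + 1.0)

calib = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # nonconformity scores (invented)
p_extreme = conformal_p_value(calib, 10.0)   # nothing as extreme -> small p
p_typical = conformal_p_value(calib, 5.0)    # mid-range -> large p
```

Flagging observations with p below, say, 0.05 then controls the false-alarm rate at roughly that level under exchangeability.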
Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
🔥 Citations:
0
Abstract: Anomaly detection in High Energy Physics requires identifying rare signals against overwhelming backgrounds, without prior knowledge of the signal. We present the first application of masked-token prediction, a technique from Large Language Models, to this problem. A lightweight encoder architecture trained solely on background events captures the structure of Standard Model (SM) physics; at inference, sequences deviating from this learned structure are flagged as anomalous. We evaluate the approach on searches for four-top-quark production and supersymmetric gluino pair production, both featuring top-rich final states with substantial missing transverse energy, covering SM and beyond the Standard Model (BSM) scenarios. Strong performance on the four-top signature, which closely resembles background, demonstrates the method's sensitivity to subtle deviations. We further show that the tokenization strategy significantly impacts performance: deep-learned tokenization via vector-quantized variational autoencoders (VQ-VAE) outperforms look-up table tokenization. Comparison with established anomaly detection baselines confirms robustness. These results highlight the potential of token-based collider data representations combined with transformer architectures for new-physics discovery. Once trained on SM background, the model transfers across different BSM searches, enabling scalable, model-independent anomaly detection at reduced computational cost.
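The detection principle (train on background only, flag sequences the model cannot predict) can be caricatured with a unigram background model in place of the masked-token transformer. Everything below is an invented toy, including the token names.

```python
import math
from collections import Counter

def fit_background(sequences):
    """Unigram stand-in for the background-trained encoder: in the paper
    a transformer predicts masked tokens; here we just count frequencies."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def anomaly_score(model, seq, floor=1e-6):
    """'Mask' each token in turn and average its surprisal under the
    background model; events the model cannot predict score high."""
    return -sum(math.log(model.get(tok, floor)) for tok in seq) / len(seq)

background = [["jet", "jet", "met"], ["jet", "met", "met"]]
model = fit_background(background)
sm_like = anomaly_score(model, ["jet", "met"])       # familiar structure
bsm_like = anomaly_score(model, ["jet", "gluino"])   # unseen token
```

The real method's sensitivity to the four-top signal comes from the same mechanism operating on learned VQ-VAE tokens rather than a frequency table.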
Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
🔥 Citations:
0
Abstract: Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code can be found here https://github.com/owaski/HPO
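One generic way a "hierarchical reward that balances translation quality and latency" can be shaped is to make latency the top tier: within a budget the policy is judged purely on quality, beyond it quality is discounted. The paper's actual formulation is not reproduced here; the structure and constants below are invented.

```python
def hierarchical_reward(quality, latency, budget, penalty_rate=2.0):
    """Hierarchical reward sketch: latency forms the top tier. Within the
    budget, the reward is the quality score alone; past the budget, the
    overrun is penalized linearly. All constants are invented."""
    if latency <= budget:
        return quality
    return quality - penalty_rate * (latency - budget)

on_time = hierarchical_reward(quality=0.8, latency=1.2, budget=1.5)
late = hierarchical_reward(quality=0.9, latency=3.0, budget=1.5)
```

Under this shaping a slightly worse but on-time translation beats a better but late one, which is the trade-off SST policies must learn.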
From Lexicons to Large Language Models: A Holistic Evaluation of Psychometric Text Analysis in Social Science Research
🔥 Citations:
1
Abstract (Research Spotlight):
Extracting psychological insights from text is vital for modern analytics, yet organizations often rely on analysis tools that are either biased and simplistic or prohibitively expensive to build. Our research demonstrates that Large Language Models (LLMs) offer a superior alternative. They match the accuracy of specialized artificial intelligence (AI), while significantly reducing costs and technical barriers. Crucially for policy considerations, we find LLMs are statistically fairer than traditional methods. In our tests, they reduced racial and gender bias by up to 60%. Beyond assessing performance, we introduce a practical technique called “cognitive-affective prompting.” By instructing the AI to adopt specific human strengths, such as using “superior reasoning” for complex tasks or “emotional intelligence” for sentiment analysis, practitioners can boost accuracy by over 10%. To facilitate adoption, we provide a user-friendly “cookbook” to help nonexperts apply these findings immediately. For policymakers and business leaders, this research validates LLMs as a robust, consistent, and equitable standard for analyzing human behavior at scale.
Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways
🔥 Citations:
0
Abstract: Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.
Multimodal Integration of Ambulatory ECG and Clinical Features for Sudden Cardiac Death and Pump Failure Death Prediction
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
From Hidden Profiles to Governable Personalization: Recommender Systems in the Age of LLM Agents
🔥 Citations:
0
Abstract: Personalization has traditionally depended on platform-specific user models that are optimized for prediction but remain largely inaccessible to the people they describe. As LLM-based assistants increasingly mediate search, shopping, travel, and content access, this arrangement may be giving way to a new personalization stack in which user representation is no longer confined to isolated platforms. In this paper, we argue that the key issue is not simply that large language models can enhance recommendation quality, but that they reconfigure where and how user representations are produced, exposed, and acted upon. We propose a shift from hidden platform profiling toward governable personalization, where user representations may become more inspectable, revisable, portable, and consequential across services. Building on this view, we identify five research fronts for recommender systems: transparent yet privacy-preserving user modeling, intent translation and alignment, cross-domain representation and memory design, trustworthy commercialization in assistant-mediated environments, and operational mechanisms for ownership, access, and accountability. We position these not as isolated technical challenges, but as interconnected design problems created by the emergence of LLM agents as intermediaries between users and digital platforms. We argue that the future of recommender systems will depend not only on better inference, but on building personalization systems that users can meaningfully understand, shape, and govern.
3,3′-Di-O-methylellagic Acid Isolated from Euphorbia humifusa Willd Suppresses Prostate Cancer Cell Viability via Regulating VDAC1 Protein Expression
DOI:
10.3390/ph19050652
🔥 Citations:
0
Abstract: Background: Prostate cancer (PCa) is the leading male urinary malignancy globally. Our previous article demonstrated the anti-PCa activity of Euphorbia humifusa Willd water extract (EHW) and some of its compounds via downregulating AR expression, but the anti-PCa active compounds from Euphorbia humifusa Willd (EH) and their mechanisms of action are yet to be clarified. Thus, the current article studied the in vitro anti-PCa effects of 3,3′-di-O-methylellagic acid (3,3′-di-O-Me-EA) derived from EHW and the related mechanism involved. Methods: 3,3′-di-O-Me-EA was isolated from EHW using bioassay-guided fractionation. Spectroscopic methods were used to determine the structure of 3,3′-di-O-Me-EA. The drug-likeness and ADMET properties (absorption, distribution, metabolism, excretion, and toxicity) of 3,3′-di-O-Me-EA were analyzed in silico. Molecular docking and real-time surface plasmon resonance (SPR) analysis were performed to measure the interaction of 3,3′-di-O-Me-EA and VDAC1 protein. The viability and apoptosis of 22RV-1 and DU145 PCa cells were determined using MTT and Annexin V-FITC staining assays, respectively. RT-qPCR and Western blot experiments were used to analyze the gene and protein expression of VDAC1. Results: 3,3′-di-O-Me-EA was isolated and purified from EHW with a purity of ≥90.06%, and its structure was identified by HRTOF mass spectrometry, NMR, and an authentic standard. In silico ADMET analysis indicated its favorable drug-like and pharmacokinetic properties. Molecular docking and SPR results confirmed that 3,3′-di-O-Me-EA could bind with the VDAC1 protein. Moreover, 3,3′-di-O-Me-EA dose- and time-dependently inhibited 22RV-1 and DU145 PCa cell viability, and induced apoptosis in a dose-dependent manner (p < 0.05). RT-qPCR and Western blot results showed that 3,3′-di-O-Me-EA dose-dependently up-regulated VDAC1 gene and protein expression levels in 22RV-1 and DU145 cells (p < 0.05).
Meanwhile, in VDAC1-depleted 22RV-1 and DU145 cells, 3,3′-di-O-Me-EA down-regulated VDAC1 gene and protein expression levels, increased cell viability, and inhibited apoptosis compared to 22RV-1 and DU145 cells (p < 0.05). Furthermore, 3,3′-di-O-Me-EA enhanced VDAC1 gene and protein expression levels, inhibited cell viability, and induced apoptosis in VDAC1-overexpressed 22RV-1 and DU145 cells compared with 22RV-1 and DU145 cells (p < 0.05). Overall, EH active compound 3,3′-di-O-Me-EA may inhibit viability and induce apoptosis of 22RV-1 and DU145 PCa cells via up-regulating VDAC1 gene and protein expression levels. Conclusion: The results indicated that the 22RV1 and DU145 PCa cell viability inhibitory effects of 3,3′-di-O-Me-EA isolated from EH may be mediated by induction of apoptosis through up-regulation of VDAC1 gene and protein expression levels.
Knowledge Capsules: Structured Nonparametric Memory Units for LLMs
🔥 Citations:
0
Abstract: Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key-Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
🔥 Citations:
0
Abstract: Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
🔥 Citations:
0
Abstract: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
🔥 Citations:
0
Abstract: The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
🔥 Citations:
0
Abstract: Software vulnerabilities are a primary threat to modern infrastructure. While static analysis and Graph Neural Networks have long served as the foundation for vulnerability detection, the emergence of Large Language Models (LLMs) has introduced a transformative paradigm driven by superior semantic reasoning and cross-environment generalization. However, in the context of LLM-based vulnerability detection, we identify a fundamental bottleneck in these models termed Signal Submersion: a state where features related to vulnerability are activated internally but numerically overwhelmed by dominant functional semantics. To address this, we propose SAGE (Signal-Amplified Guided Embeddings), a framework that shifts from passive signal submersion to active signal recovery. SAGE integrates task-conditional Sparse Autoencoders (SAEs) to isolate and amplify these faint vulnerability signals. Extensive evaluations on BigVul, PrimeVul, and PreciseBugs demonstrate that SAGE achieves state-of-the-art performance. Notably, SAGE mitigates Signal Submersion by increasing the internal Signal-to-Noise Ratio (SNR) by 12.7× via sparse manifold projection. This mechanistic intervention enables a 7B model to achieve up to 318% Matthews Correlation Coefficient (MCC) gains on unseen distributions and a 319% gain on classic datasets. By maintaining robust performance across 13 programming languages and outperforming 34B baselines, SAGE establishes a more efficient and scalable path to software security than simple parameter scaling.
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
🔥 Citations:
0
Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.
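The duplication step of the upcycling operator can be sketched in plain Python. This is an illustrative toy, not the paper's operator: each expert and its router column are copied m times, and a small noise term on the duplicated router columns breaks the symmetry that continued pre-training then exploits to drive specialization. The noise scale, copy ordering, and data layout are assumptions.

```python
import random

def upcycle_experts(expert_weights, router_cols, m, noise=1e-3, seed=0):
    """Expand an E-expert MoE layer to m*E experts by duplication.
    expert_weights: list of E per-expert weight objects.
    router_cols: list of E router columns (lists of floats), one per expert.
    Copies inherit the source weights (by reference here; a real
    implementation would deep-copy the tensors), while tiny noise on the
    duplicated router columns breaks the symmetry among copies."""
    rng = random.Random(seed)
    new_experts = [w for w in expert_weights for _ in range(m)]
    new_cols = [
        [x + noise * rng.gauss(0.0, 1.0) for x in col]  # warm start + jitter
        for col in router_cols
        for _ in range(m)
    ]
    return new_experts, new_cols
```

Because top-K routing is held fixed, the expanded model has the same per-token inference cost as the source; only total capacity grows.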
Lessons Learned From AI-Assisted Guideline Generation in Parastomal Hernia Repair
🔥 Citations:
0
Abstract:
Large language models (LLMs) can analyse scientific literature and draft medical recommendations, but their role in formal clinical guideline development is unclear.
To evaluate whether a publicly available GPT-based LLM can generate coherent, GRADE-based guidelines for parastomal hernia management from a predefined evidence base, and to compare these with the 2017 European Hernia Society (EHS) guidelines. A secondary aim was to explore implications for academic publishing and scientific authorship.
The 2017 EHS parastomal hernia guidelines (Antoniou et al.) were used as the reference framework. Within a closed session, the model was instructed to apply AGREE II and GRADE principles to 52 full-text clinical papers mirroring the original EHS reference set, and to formulate recommendations for nine key clinical questions (KQs). For each KQ, the model defined PICO, summarized the evidence, rated certainty, and stated direction and strength of recommendation. AI-derived guidance was then systematically compared with EHS statements. Divergences were classified as interpretative, threshold-based (handling of low-certainty evidence), or evidence-weighting.
AI-generated recommendations showed full or near-full alignment with EHS guidance in most domains, including diagnosis, prophylactic mesh for permanent end colostomy, rejection of suture-only repair, preference for non-keyhole laparoscopic repair, and favouring synthetic over biologic meshes. Differences arose primarily where evidence was very low quality: the model issued cautious, conditional recommendations (e.g., watchful waiting in asymptomatic hernias, consideration of laparoscopy in suitable patients, preference for retromuscular synthetic mesh and avoidance of cross-linked collagen onlay), whereas EHS opted for no recommendation.
Within a closed evidence base, a GPT-based model can reproduce the logic and structure of expert guideline development with high fidelity. Discrepancies mainly reflect different thresholds for acting on low-certainty evidence, supporting a complementary role for AI as a structured methodological and drafting assistant rather than a replacement for human consensus.
Automated LTL Specification Generation from Industrial Aerospace Requirements
🔥 Citations:
0
Abstract: In the development and verification of safety-critical aerospace software, Linear Temporal Logic (LTL) has been widely used to specify complex system properties derived from requirements. However, a significant gap remains in industrial practice: translating natural language (NL) requirements into formal LTL properties is a labor-intensive and error-prone process that requires rare expertise in both aerospace control engineering and formal methods. While recent NL-to-LTL tools (e.g., NL2SPEC, NL2TL, NL2LTL) are capable of automating parts of this process, they often fail on real requirement documents in industrial settings, due to complex domain terminology or implicit temporal and logical structure. To address these challenges, we present AeroReq2LTL, a framework that automates LTL property generation for aerospace requirements using large language models (LLMs), with two key industrial innovations: (i) a data dictionary that normalizes technical jargon into precise atomic propositions; and (ii) a template-based requirement language that makes temporal cues and logical relations explicit before translation. On a real aerospace dataset, AeroReq2LTL achieves 85% precision and 88% recall in LTL generation, and its outputs can be directly consumed by existing verification tools.
PeFoMed: Parameter efficient fine-tuning of multimodal large language models for medical CXR
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
🔥 Citations:
0
Abstract: Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
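The Ramer-Douglas-Peucker step itself is a standard, parameter-free polyline simplification; a minimal 2-D sketch is below. The paper applies it to high-dimensional hidden-state trajectories, so representing each layer as a point `(layer_index, scalar_summary)` is an illustrative simplification of that setting.

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: recursively keep the point farthest from the
    chord between the endpoints whenever that distance exceeds epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0  # guard against a degenerate chord
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        # Perpendicular distance of points[i] from the endpoint chord.
        d = abs(dy * (px - x1) - dx * (py - y1)) / norm
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]  # segment is locally redundant
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right  # drop the duplicated breakpoint
```

The surviving interior points are the breakpoints; in the paper's usage, the layers they correspond to are the ones selected for LoRA adaptation.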
ECLASS-Augmented Semantic Product Search for Electronic Components
🔥 Citations:
0
Abstract: Efficient semantic access to industrial product data is a key enabler for factory automation and emerging LLM-based agent workflows, where both human engineers and autonomous agents must identify suitable components from highly structured catalogs. However, the vocabulary mismatch between natural-language queries and attribute-centric product descriptions limits the effectiveness of traditional retrieval approaches, e.g., BM25. In this work, we present a systematic evaluation of LLM-assisted dense retrieval for semantic product search on industrial electronic components, and investigate the integration of hierarchical semantics from the ECLASS standard into embedding-based retrieval. Our results show that dense retrieval combined with re-ranking substantially outperforms classical lexical methods and foundation model web-search baselines. In particular, the proposed approach achieves a Hit_Rate@5 of 94.3%, compared to 31.4% for BM25 on expert queries, while also exceeding foundation model baselines in both effectiveness and efficiency. Furthermore, augmenting product representations with ECLASS semantics yields consistent performance gains across configurations, demonstrating that standardized hierarchical metadata provides a crucial semantic bridge between user intent and sparse product descriptions.
Detecting Data Contamination in Large Language Models
🔥 Citations:
0
Abstract: Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIAs) aim to detect such documents and determine whether they were included in the training corpora of LLMs. Black-box MIAs require a significant amount of data manipulation, which often makes comparing them challenging. We study state-of-the-art (SOTA) MIAs under black-box assumptions and compare them on a unified set of datasets to determine whether any can reliably detect membership in SOTA LLMs. In addition, we develop a new method, called Familiarity Ranking, to showcase a possible approach to black-box MIAs, giving LLMs more freedom of expression so that their reasoning can be better understood. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate stronger reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.
Learning transcriptome architecture from sequence with a long-context RNA foundation model
🔥 Citations:
4
Abstract: Linking DNA sequence to genomic function remains one of the grand challenges in genetics and genomics. Here, we present a large-scale compendium of single-molecule transcriptome sequencing of diverse cancer cell lines, revealing their isoform diversity and specificity. We used this compendium to build Mach-1, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture—the relative abundances and molecular structures of mRNA isoforms. By using the Striped-Hyena architecture, Mach-1 handles extremely long sequence inputs at nucleotide resolution (64 kilobase pairs), allowing for quantitative, zero-shot prediction of all aspects of transcriptome architecture, spanning isoform abundance, structure, and variant-induced splicing changes. To test both the interpretive and generative capabilities of Mach-1, we experimentally validated its learned regulatory grammar and predictions through perturbation of RNA-binding proteins nominated by Mach-1 to impact targeted splicing, precise CRISPR editing of variants of uncertain significance that the model predicted to alter splicing, and de novo transcript synthesis and expression in human cells. Together, this release establishes a new foundation for sequence-to-transcript modeling. Mach-1’s representations can be extended and fine-tuned across a spectrum of biological contexts, from variant interpretation to RNA engineering.
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
🔥 Citations:
0
Abstract: Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
Joint Service Chain Orchestration and Computation Offloading via GNN-Based QMIX in Industrial IoT
DOI:
10.3390/s26082559
🔥 Citations:
0
Abstract: In IIoT edge computing, multi-edge server collaborative scheduling faces two core issues due to random task arrivals, heterogeneous resources, and complex topology: traditional model-driven methods cannot make dynamic decisions in dynamic environments, and conventional MARL fails to characterize inter-node topological dependencies and load correlations. To address this, this paper investigates the joint optimization of task offloading, computing resource allocation, and SFC orchestration in IIoT, constructs a cloud-edge-end collaborative architecture, and models the problem as a POMDP to minimize the overall system cost under multiple constraints. A graph-guided value-decomposition MARL method is proposed, which extracts spatial topology and neighborhood-load features of edge nodes via a GNN and combines them with the QMIX framework to realize multi-agent centralized training and distributed execution. Simulations show that the algorithm converges stably under different server scales and task loads, significantly outperforms benchmark algorithms, and can suppress performance degradation in high-load scenarios, demonstrating its robustness and scalability in complex industrial environments.
Predicting Scale-Up of Metal-Organic Framework Syntheses with Large Language Models
🔥 Citations:
0
Abstract: Scalable synthesis remains the gate between MOF discovery and industrial deployment, as scale-up know-how is fragmented across disparate reports. We introduce ESU-MOF, a literature-mined dataset and a positive-unlabeled learning strategy that fine-tunes large language models to predict scalability potential with 91.4% accuracy, enabling rapid data-driven triage for industrial MOF discovery.
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
🔥 Citations:
0
Abstract: Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human-AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.
ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
🔥 Citations:
0
Abstract: Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Are Large Language Models Economically Viable for Industry Deployment?
🔥 引用:
0
Abstract: Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics: Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret), capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.
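The break-even metric lends itself to a back-of-envelope sketch. The formula, function name, and parameters below are illustrative assumptions only; the paper's published definition of Nbreak may differ:

```python
import math

def economic_break_even(deployment_cost, value_per_request, cost_per_request):
    """Hypothetical N_break: number of served requests after which cumulative
    per-request margin covers the fixed deployment cost. This is an
    illustrative formula, not EDGE-EVAL's published definition."""
    margin = value_per_request - cost_per_request
    if margin <= 0:
        return math.inf  # the deployment never breaks even
    return math.ceil(deployment_cost / margin)
```

Under this toy definition, a $100 deployment earning $10 per request at $2 per-request cost breaks even after 13 requests.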
Adaptive feature fusion network for machine fault diagnosis with multiple knowledge based graphs
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study
🔥 引用:
0
Abstract: Background Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can transform ML model-predicted risk estimates with Shapley Additive Explanations (SHAP) into clinically meaningful support information, yet the added value of incorporating ML-derived data and the relative performance of different LLMs remain uncertain. To address these gaps, we used our previously developed IMPACT framework to evaluate the quality of LLM-generated outputs. Methods In this retrospective analysis of MIMIC-IV v3.1 intensive care unit (ICU) admissions, we applied a previously developed XGBoost model to estimate ICU mortality risk and derive corresponding SHAP values. GPT-4o transformed the predicted mortality risk, clinical predictors, and their SHAP values into risk interpretation, recommended examinations and management. The primary analysis examined whether augmenting LLM inputs with predicted mortality risk and SHAP values improved clinical response quality, as assessed by the IMPACT framework. We further compared GPT-4o with seven contemporary LLMs; all eight models generated clinical support responses that were scored by Claude 3.7 Sonnet to assess performance differences. Results Claude 3.7 Sonnet showed excellent agreement with human IMPACT ratings (intraclass correlation coefficient [ICC] 0.979, 95% CI 0.973–0.984) and o3-mini (ICC 0.971, 95% CI 0.964–0.980). In the primary analysis, adding predicted ICU mortality risk and SHAP values significantly increased GPT-4o IMPACT scores across prompting strategies. GPT-5 mini (96.0) and gpt-oss-120B (93.4) outperformed GPT-4o (90.4; both p < 0.001) for interpretability and quality. Conclusions Combining ML-derived risk, SHAP explanations and LLMs may modestly improve ICU clinical support information, while LLM-based evaluators demonstrated feasibility for scalable evaluation of generated clinical content. 
Supplementary Information: The online version contains supplementary material available at 10.1186/s40635-026-00900-w.
Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
🔥 引用:
0
Abstract: Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
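The sampling mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the concentration parameter is the number of multinomial draws per attention row; the paper's exact parameterization may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_attention_weights(scores, concentration, rng):
    """Replace deterministic softmax weights with normalized multinomial
    samples. A larger `concentration` (number of draws) makes the sampled
    weights concentrate around the softmax distribution."""
    p = softmax(scores)
    counts = rng.multinomial(concentration, p)  # one sampled allocation per row
    return counts / concentration               # renormalize to valid weights

# Repeated sampling yields a predictive ensemble without retraining.
rng = np.random.default_rng(0)
scores = np.array([2.0, 1.0, 0.1])
ensemble = [stochastic_attention_weights(scores, concentration=50, rng=rng)
            for _ in range(100)]
mean_w = np.mean(ensemble, axis=0)
```

The ensemble mean recovers the softmax weights in expectation, while the spread across samples provides the uncertainty signal to be calibrated via the concentration parameter.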
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
🔥 引用:
0
Abstract: Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
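The step-alignment idea can be illustrated with a small monotone dynamic program in the spirit of classic sequence alignment. The similarity function and threshold below are placeholders, not the paper's actual scoring method:

```python
def step_similarity(a, b):
    """Hypothetical step similarity: token-level Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align_steps(predicted, reference, threshold=0.3):
    """Monotone DP alignment (LCS-style): each predicted step matches at
    most one reference step, order is preserved, and unmatched steps are
    skipped. Returns the total similarity of the best alignment."""
    n, m = len(predicted), len(reference)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = step_similarity(predicted[i - 1], reference[j - 1])
            match = dp[i - 1][j - 1] + (s if s >= threshold else 0.0)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], match)
    return dp[n][m]
```

Scoring against multiple reference solutions and taking the maximum gives a step-level credit assignment rather than final-answer accuracy alone.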
In silico screening of candidate NAMPT modulators for treatment of age-related diseases
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Large language models perceive cities through a culturally uneven baseline
🔥 引用:
0
Abstract: Large language models (LLMs) are increasingly used to describe, evaluate and interpret places, yet it remains unclear whether they do so from a culturally neutral standpoint. Here we test urban perception in frontier LLMs using a balanced global street-view sample and prompts that either remain neutral or invoke different regional cultural standpoints. Across open-ended descriptions and structured place judgments, the neutral condition proved not to be neutral in practice. Prompts associated with Europe and Northern America remained systematically closer to the baseline than many non-Western prompts, indicating that model perception is organized around a culturally uneven reference frame rather than a universal one. Cultural prompting also shifted affective evaluation, producing sentiment-based ingroup preference for some prompted identities. Comparisons with regional human text-image benchmarks showed that culturally proximate prompting could improve alignment with human descriptions, but it did not recover human levels of semantic diversity and often preserved an affectively elevated style. The same asymmetry reappeared in structured judgments of safety, beauty, wealth, liveliness, boredom and depression, where model outputs were interpretable but only partly reproduced human group differences. These findings suggest that LLMs do not simply perceive cities from nowhere: they do so through a culturally uneven baseline that shapes what appears ordinary, familiar and positively valued.
OpenStreetMap based POI knowledge graph enhanced by large language model
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Kingdom-wide comparative transcriptomics reveals deeply conserved and predictable stress response programs across Viridiplantae
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Exploring Cross-Debate Between LLMs to Improve the Forecasting of Financial Market Indicators
DOI:
10.3390/math14081393
🔥 引用:
0
Abstract: In the context of political and financial market turmoil, effectively forecasting financial market trends is crucial for investment decisions. Large language models (LLMs) have been applied in extant research to predict market trends, analyze investor sentiments and interpret financial news, all aiming to help investment decision making. However, LLMs face limitations due to training data heterogeneity, restricting multidimensional perspectives and hindering comparative analysis for optimization. This study proposes a “Dual-Agent LLM Debate Mechanism” framework using a Proponent (LLM1: Gemini Pro 3) and an Opponent (LLM2: ChatGPT 5.2) to address single-LLM forecasting gaps: The Proponent generates a baseline forecast (F1) from an Integrated Context, while the Opponent validates and resolves conflicts with the Proponent via up to three rounds of cross-debate to produce a consensus forecast (F2). A controlled experiment was conducted to analyze 75 financial market indicators (FMIs) across five asset categories, revealing that F2 outperforms F1 in accuracy and directional stability, particularly in highly volatile assets like Cryptocurrencies and 10-Year Government Bonds. Paired-sample t-tests confirmed statistical significance, validating the mechanism’s effectiveness. Our study results demonstrate how cross-debate between LLMs enhances forecasting accuracy through structured optimization.
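The debate protocol described in the abstract reduces to a short control loop. The agent interfaces below are hypothetical stand-ins for the actual LLM API calls:

```python
def cross_debate(proponent, opponent, context, max_rounds=3):
    """Sketch of the dual-agent debate loop: `proponent(context)` returns a
    baseline forecast F1; `opponent(context, forecast)` returns a pair
    (accepted, revision). Up to `max_rounds` rounds of cross-debate yield
    the consensus forecast F2. Both callables are illustrative stand-ins."""
    f1 = proponent(context)        # baseline forecast F1
    forecast = f1
    for _ in range(max_rounds):
        accepted, revision = opponent(context, forecast)
        if accepted:               # consensus reached -> F2
            break
        forecast = revision        # conflict: adopt the revised forecast
    return f1, forecast            # (F1, consensus F2)
```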
Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine
🔥 引用:
0
Abstract: Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
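The deterministic reasoning backend described above amounts to a standard Bayesian belief update over candidate diagnoses. The names and evidence encoding below are illustrative, not BMBE's actual interface:

```python
def bayes_update(prior, likelihoods, evidence):
    """Deterministic, auditable belief update given one piece of structured
    evidence (e.g. a symptom parsed by the LLM sensor). `likelihoods[d]`
    maps evidence items to P(evidence | diagnosis d); a small floor stands
    in for unmodeled evidence. All names here are hypothetical."""
    posterior = {d: p * likelihoods[d].get(evidence, 1e-6)
                 for d, p in prior.items()}
    z = sum(posterior.values())
    return {d: p / z for d, p in posterior.items()}
```

Because the update is a pure function of the knowledge base, every diagnostic step can be replayed and audited, and the statistical backend can be swapped per target population without touching the language layer.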
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
🔥 引用:
0
Abstract: Mixture-of-Experts (MoE) architectures are widely used in modern large language models and multimodal models. However, inference efficiency is often limited by highly dynamic and skewed expert workloads across different modalities. During the prefill stage with large batch sizes, vision tokens frequently dominate the input sequences. Under expert parallelism (EP), this leads to severe load imbalance, where a subset of devices becomes overloaded, reducing overall system throughput. We propose ReaLB, a real-time load balancing method for multimodal MoE (MMoE) inference that introduces zero scheduling overhead. ReaLB dynamically adjusts the computation precision of MoE experts at runtime on a per-EP-rank basis. For ranks dominated by vision-heavy experts, ReaLB assigns lower-precision computation to improve execution efficiency by exploiting FP4 Tensor Cores. ReaLB does not require redundant experts or additional memory allocation. Instead, it performs layer-wise expert precision transformation on the fly and hides the associated overhead within the dispatch phase before MoE computation. Experiments on representative MMoE models show that ReaLB achieves 1.29x layer-level speedup while limiting accuracy loss to within 1.2%.
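The per-rank precision policy can be illustrated minimally. The threshold rule and names below are assumptions; ReaLB's actual scheduling logic is more involved:

```python
def assign_expert_precision(vision_token_share, fp4_threshold=0.5):
    """Illustrative per-EP-rank policy: ranks whose experts see a
    vision-heavy token mix run in FP4 (exploiting FP4 Tensor Cores),
    while the rest stay at the default precision. The threshold and
    precision labels are assumptions, not ReaLB's API."""
    return ["fp4" if share >= fp4_threshold else "bf16"
            for share in vision_token_share]
```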
Automatic Extraction of Suppliers’ ESG Compliance Information from Textual Sources: A Literature Review
DOI:
10.3390/app16084024
🔥 引用:
0
Abstract: This paper presents a literature review regarding the automatic extraction of meaningful information regarding suppliers’ ESG and sustainability compliance from textual sources. Assessing suppliers’ ESG compliance has become a key challenge for procurement managers. Given the large number of suppliers and required data points, traditional approaches such as questionnaires and audits are inefficient, ineffective and difficult to scale. To solve this problem, we investigate whether the required information can be automatically harvested from suppliers’ textual sources. Our structured literature review identified 82 papers on which we performed a descriptive analysis, finding a rich and flourishing body of literature produced by a heterogeneous scientific community. We further reduced our sample to 73 full-text articles that supported a more in-depth content-based analysis. We investigated which data sources can be used in particular, which technologies can be leveraged, and which types of outputs can be generated. Even though they could provide much of the required information, corporate websites are rarely utilized as data sources, partly due to the limited adoption of large language models (LLMs). LLMs are less diffused than traditional Natural Language Processing (NLP) techniques due to their recent introduction and some gaps that still limit their performance. This represents both a constraint and an opportunity for future research.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
🔥 引用:
0
Abstract: Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.
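The hard-mask projection at the core of the gating step can be sketched as a top-k selection under the budget. This illustrates the projection only; the straight-through gradient and the joint FFN/KV-head budgeting are omitted:

```python
import numpy as np

def project_to_budget(scores, budget):
    """Project continuous gate scores onto a hard 0/1 mask that keeps
    exactly `budget` units (the highest-scoring ones). During training, a
    straight-through estimator would pass gradients through this hard
    mask back to the underlying scores."""
    mask = np.zeros_like(scores)
    keep = np.argsort(scores)[-budget:]  # indices of the top-`budget` scores
    mask[keep] = 1.0
    return mask
```

Applying this projection at every step keeps the budget satisfied throughout training, rather than imposing it only after importance scores are learned.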
The Potential Role of Large Language Models in Assisting Patients and Guiding Emergency Care Visits
DOI:
10.3390/jcm15083170
🔥 引用:
0
Abstract: Background/Objectives: Overcrowding in emergency departments (EDs) remains a critical challenge in modern healthcare systems, driven in part by patient uncertainty regarding symptom urgency and a lack of accessible medical guidance. Recent advances in artificial intelligence, particularly large language models (LLMs), present a novel opportunity to support patient navigation and relieve pressure on ED infrastructures. Methods: A total of 238 unique patient questions were identified through a structured web search. Following deduplication and thematic clustering, 15 representative questions were selected. Each question was submitted to the three LLMs—ChatGPT (v3.5), DeepSeek, and Gemini—using a standardized prompt. Responses were assessed by clinical experts (N = 8) who were blinded to the model source. Reviewers selected the best overall response per question, as well as the individual responses of the three LLMs for each respective question. Results: ChatGPT was selected as the best-performing model in 60% of cases, with DeepSeek and Gemini selected in 23% and 17%, respectively. ChatGPT responses also achieved the highest proportion of “excellent” quality ratings and the lowest proportion of “unsatisfactory” outputs. Across all models, clarity was the most positively rated domain (79% agreement), followed by empathy (72%), length/detail appropriateness (71%), and completeness (65%). Over two-thirds of raters expressed willingness to integrate LLM-based tools into clinical practice for patient education and pre-triage counseling. Conclusions: Large language models demonstrate promising capabilities in responding to emergency care-related patient queries. Their ability to deliver medically sound and communicatively effective answers positions them as potential digital adjuncts in the management of low-acuity ED presentations and prehospital triage.
Deep sprite-based image models: An analysis
🔥 引用:
0
Abstract: While foundation models drive steady progress in image segmentation and diffusion algorithms compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, identifies explicitly object categories, and fully models images in an easily interpretable way.
InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
🔥 引用:
1
Abstract: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
🔥 引用:
0
Abstract: Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.
Fragmented Infrastructures, Fragmented Learners: The Case for Integrated, Longitudinal Learning Ecosystems in Computing Education
DOI:
10.1145/3811544
🔥 引用:
0
Abstract: Undergraduate computing education is uniquely shaped by heterogeneous technical infrastructures: students move across programming languages, operating systems, IDEs, auto-graders, version-control platforms, and increasingly large language model (LLM)–mediated tools. These systems generate rich but fragmented learning traces that remain isolated within course-bound platforms, limiting observability of how debugging practices, abstraction strategies, and conceptual understanding evolve across the curriculum. Although intelligent tutoring systems (ITS) and AI-enabled assessment tools show promise within individual contexts, their impact is typically localized, lacking semantic interoperability and continuity across languages and courses. Drawing on research in self-regulated learning, conceptual transfer, and identity formation in computing, this article argues that infrastructural fragmentation constrains longitudinal development of computational expertise. We advance a computing-specific vision of integrated learning ecosystems in which execution histories, feedback artifacts, repository data, and learner models are preserved and interoperable across contexts, enabling trajectory-level inquiry into expertise development over time. We conclude by outlining discipline-centered research challenges for building coherent computing education infrastructures.
Generative AI and Large Language Models
DOI:
10.3390/bdcc10040127
🔥 引用:
0
Abstract: In recent years, generative artificial intelligence and, in particular, large language models (LLMs) have rapidly transformed the landscape of data analysis, knowledge extraction, content generation, and intelligent decision support [...]
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
🔥 引用:
0
Abstract: Negation is a common and important semantic feature in natural language, yet Large Language Models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with if-then relations. In addition, pre-training LLMs on our corpora benefits negation understanding.
Structure Guided Retrieval-Augmented Generation for Factual Queries
🔥 引用:
0
Abstract: Retrieval-Augmented Generation (RAG) has been proposed to mitigate hallucinations in large language models (LLMs), where generated outputs may be factually incorrect. However, existing RAG approaches predominantly rely on vector similarity for retrieval, which is prone to semantic noise and fails to ensure that generated responses fully satisfy the complex conditions specified by factual queries, often leading to incorrect answers. To address this challenge, we introduce a novel research problem, named Exact Retrieval Problem (ERP). To the best of our knowledge, this is the first problem formulation that explicitly incorporates structural information into RAG for factual questions to satisfy all query conditions. For this novel problem, we propose Structure Guided Retrieval-Augmented Generation (SG-RAG), which models the retrieval process as an embedding-based subgraph matching task, and uses the retrieved topological structures to guide the LLM to generate answers that meet all specified query conditions. To facilitate evaluation of ERP, we construct and publicly release Exact Retrieval Question Answering (ERQA), a large-scale dataset comprising 120000 fact-oriented QA pairs, each involving complex conditions, spanning 20 diverse domains. The experimental results demonstrate that SG-RAG significantly outperforms strong baselines on ERQA, delivering absolute improvements from 20.68 to 50.88 points across all evaluation metrics, while maintaining reasonable computational overhead.
DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
🔥 引用:
0
Abstract: This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.
Sensitivity Uncertainty Alignment in Large Language Models
🔥 引用:
0
Abstract: We propose Sensitivity-Uncertainty Alignment (SUA), a framework for analyzing failures of large language models under adversarial and ambiguous inputs. We argue that adversarial sensitivity and ambiguity reflect a common issue: misalignment between prediction instability and model uncertainty. A reliable model should express higher uncertainty when its predictions are unstable; failure to do so leads to miscalibration. We define a scalar score, SUA_theta(x), capturing the difference between distributional sensitivity and predictive entropy. We show that minimizing its positive part bounds worst-case perturbed risk and relates to calibration error. We also formalize ambiguity collapse, where models produce overconfident outputs despite multiple valid interpretations. We introduce SUA-TR, a training method combining consistency regularization and entropy alignment, along with an abstention rule for safer inference. Across tasks including question answering and classification, SUA better identifies model failures than entropy or self-consistency alone. The framework is model-agnostic and provides a basis for improving reliability in evolving language models.
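The SUA score can be illustrated concretely. Using total-variation distance for distributional sensitivity is an assumption here; the paper's exact divergence and handling of theta may differ:

```python
import numpy as np

def predictive_entropy(p):
    """Shannon entropy of a predictive distribution (nats)."""
    p = np.asarray(p)
    return float(-(p * np.log(np.clip(p, 1e-12, 1.0))).sum())

def sua_score(p_clean, p_perturbed_list, theta=1.0):
    """Illustrative SUA-style score: distributional sensitivity (mean
    total-variation distance to perturbed predictions) minus
    theta-weighted predictive entropy. Positive values flag instability
    that the model's expressed uncertainty fails to cover."""
    p_clean = np.asarray(p_clean)
    sens = np.mean([0.5 * np.abs(p_clean - np.asarray(q)).sum()
                    for q in p_perturbed_list])
    return sens - theta * predictive_entropy(p_clean)
```

A confident prediction that flips under perturbation scores positive (miscalibrated), while a stable prediction with honest uncertainty scores non-positive, matching the abstention rule's intent.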
Fuzz Driver Generation: A Survey and Outlook from the Perspective of Data Sources
DOI:
10.3390/bdcc10040129
🔥 引用:
0
Abstract: Fuzzing is an essential element of software supply chain security governance. Despite its importance, the widespread adoption of library fuzzing is limited by the significant costs associated with constructing fuzz drivers. Without a clear entry point, the reachable path space of the target library is determined by the interplay of API call sequences, parameter dependencies, and state constraints. As a result, fuzz drivers must achieve not only successful builds but also provide sufficient semantic context to enable exploration of deeper state machine interactions, thereby avoiding premature stagnation at superficial validation logic. To systematically assess advancements in automated fuzz driver generation, this paper develops a taxonomy organized around the primary data sources used to derive driver-generation constraints, categorizing existing approaches into four technological trajectories: Usage Artifact Mining, Source Code Constraint Inference, Binary Semantics Recovery, and Heterogeneous Data Fusion. Large language models are increasingly integrated into these workflows as generators and as components for constraint alignment and repair. To address inconsistencies in experimental methodologies, this paper introduces a bounded comparability-oriented evaluation perspective focused on three dimensions: validity, reachability-related evidence, and reproducibility and cost. Together with a disclosure and reporting protocol for metric comparability, this perspective clarifies the information needed for cross-study comparison and examines the unique features and inherent limitations of each technical trajectory. 
Based on these findings, three key directions for future research are identified: facilitating structural evolution in response to coverage plateaus to address deep logic unreachability; coordinating dynamic closed-loop orchestration that utilizes on-demand heterogeneous data retrieval to resolve context challenges; and developing language-agnostic driver representations with pluggable adaptation mechanisms to improve cross-ecosystem portability and scalability.
Automated data extraction for systematic reviews using GPT‐5.2 and Google Gemini Pro 3: A dual‐large language model approach in orthopaedic research
DOI:
10.1002/ksa.70412
🔥 引用:
0
Abstract: To evaluate the accuracy, agreement, and efficiency of a dual‐large language model (LLM) approach using Generative Pre‐Trained Transformer 5.2 (GPT‐5.2) and Google Gemini 3 Pro for automated data extraction in orthopaedic systematic reviews.
Eight studies from a previously published systematic review on paediatric revision anterior cruciate ligament reconstruction were used to test extraction accuracy, agreement, and efficiency against a pre‐defined gold‐standard. Both GPT 5.2 and Gemini 3 Pro were prompted via the OpenAI and Google Application Programming Interface (API). Each study had a total of 48 equally‐weighted data fields to extract from spanning six domains: study characteristics, participant details, injury characteristics, primary and revision surgery details, and outcomes. Extractions were graded as correct, partially correct, or incorrect in reference to the gold‐standard.
Across all 384 fields, both LLMs produced fully correct outputs in 315 (82%) cases, while at least one model was fully correct in 365 (95.1%). Among the six extraction domains, study characteristics (100%, 32/32), injury characteristics (93.8%, 30/32), and outcomes (91.1%, 102/112) showed the highest percentage of at least one model being correct. The entire extraction task was completed in 27 and 35.8 min by GPT‐5.2 and Gemini 3 Pro, respectively, for a total API cost of $3.22USD.
A parallel‐LLM approach using GPT‐5.2 and Gemini 3 Pro achieved strong accuracy with a high degree of efficiency for automated data extraction in an orthopaedic systematic review. Most errors were due to omission of minor details in complex domains such as surgical details. At least one model was fully correct in over 95% of fields, supporting the use of a dual‐LLM framework as a reliable first‐pass tool for human verification.
Level IV.
On Accelerating Grounded Code Development for Research
🔥 Citations:
0
Abstract: A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via doc-search.dev and includes zed-fork, which enforces domain-specific rules and workflows. Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows.
Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees
🔥 Citations:
0
Abstract: Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code will be released to the public soon.
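To make the operator-tree idea concrete, here is an illustrative sketch (not the paper's implementation; the node representation and the example statement are ours) of decomposing a statement into subtrees that can be repaired independently:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """One operator or atom in an operator tree."""
    op: str
    children: tuple = ()

# "for all n, n + 0 = n" as a topological blueprint.
stmt = Node("forall", (
    Node("n"),
    Node("eq", (Node("add", (Node("n"), Node("0"))), Node("n"))),
))

def subtrees(t):
    """Enumerate every subtree, root first."""
    yield t
    for c in t.children:
        yield from subtrees(c)

# Sub-tree refinement can then target only the failing node,
# e.g. regenerate just the "add" subtree instead of the whole statement.
ops = [n.op for n in subtrees(stmt)]
print(ops)  # ['forall', 'n', 'eq', 'add', 'n', '0', 'n']
```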
Accessibility Issue Detection and Repair in Mobile Applications: A Systematic Literature Review
DOI:
10.1145/3809496
🔥 Citations:
0
Abstract: Mobile application accessibility is crucial for digital inclusion. This paper presents a systematic literature review (SLR) following PRISMA guidelines, synthesizing advancements in accessibility issue detection and repair techniques. We analyze representative mobile accessibility tools and industrial practices to complement SLR evidence. Nine major databases (ACM DL, IEEE Xplore, ScienceDirect, SpringerLink, Wiley, Web of Science, Scopus, CNKI, arXiv) were searched, with 76 high-quality studies published up to July 2025 analyzed. Common accessibility issues are categorized into four types—perceptibility, operability, understandability, and robustness—aligning with user capability modeling. In accessibility issue detection, evolution from static analysis to deep learning and large language models (LLMs) has shifted research from rule-based matching to semantic reasoning, enhancing automation and generalization. However, detection and repair remain loosely coupled and fragmented, with high false-positive rates and weak inter-module feedback loops. For repair, we propose a taxonomy of rule-driven, learning-based, and LLM-assisted approaches, revealing a gap between detection breadth and repair research depth. LLMs show potential for semantic understanding, issue reasoning, and repair generation, paving the way toward intelligent agents for accessibility. Looking ahead, future research should focus on semantic-enhanced detection, multimodal repair, refined user capability modeling, and unified evaluation standards—collectively advancing accessibility engineering toward an accessibility-by-default design paradigm.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
🔥 Citations:
0
Abstract: Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. In addition, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose \textbf{DT2IT-MRM}, which integrates a \textbf{D}ebiased preference construction pipeline, a novel reformulation of text-to-image (\textbf{T2I}) preference data, and an \textbf{I}terative \textbf{T}raining framework that curates existing multimodal preference datasets for \textbf{M}ultimodal \textbf{R}eward \textbf{M}odeling. Our experimental results show that DT2IT-MRM achieves new \textbf{state-of-the-art} overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
🔥 Citations:
0
Abstract: Recently, Retrieval Augmented Generation (RAG) has shifted focus to multi-retrieval approaches to tackle complex tasks such as multi-hop question answering. However, these systems struggle to decide when to stop searching once enough information has been gathered. To address this, \citet{zhou2024metacognitive} introduced Metacognitive Retrieval Augmented Generation (MetaRAG), a framework inspired by metacognition that enables Large Language Models to critique and refine their reasoning. In this reproducibility paper, we reproduce MetaRAG following its original experimental setup and extend it in two directions: (i) by evaluating the effect of PointWise and ListWise rerankers, and (ii) by comparing with SIM-RAG, which employs a lightweight critic model to stop retrieval. Our results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but also reveal lower absolute scores than reported, reflecting challenges with closed-source LLM updates, missing implementation details, and unreleased prompts. We show that MetaRAG is partially reproduced, gains substantially from reranking, and is more robust than SIM-RAG when extended with additional retrieval features.
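The stopping problem described above reduces to a retrieval loop with an explicit sufficiency check; in MetaRAG the check is metacognitive self-critique, in SIM-RAG a lightweight critic model. A toy sketch (function names and data are illustrative):

```python
def retrieve_until_sufficient(question, retrieve, is_sufficient, max_hops=4):
    """Multi-hop retrieval that stops once the evidence is judged sufficient."""
    evidence = []
    for _ in range(max_hops):
        evidence.append(retrieve(question, evidence))
        if is_sufficient(question, evidence):
            break  # the critic (or self-critique) says: enough information
    return evidence

# Toy demonstration: the loop stops after two hops, not max_hops.
docs = ["hop-1 passage", "hop-2 passage", "hop-3 passage", "hop-4 passage"]
retrieve = lambda q, ev: docs[len(ev)]
is_sufficient = lambda q, ev: len(ev) >= 2
result = retrieve_until_sufficient("who founded X?", retrieve, is_sufficient)
print(result)  # ['hop-1 passage', 'hop-2 passage']
```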
TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs
🔥 Citations:
0
Abstract: Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present \textbf{TriEx}, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.
Structure-informed Siamese graph neural networks classify CirA missense variants with implications for cefiderocol susceptibility
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Large Language Models Exhibit Normative Conformity
🔥 Citations:
0
Abstract: The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated "conformity" simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of "conformity," they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how "norms" are implemented in LLMs and how they influence group dynamics.
De Novo–HIV: Neural-Quantum Protease Inhibition
🔥 Citations:
0
Abstract: The search for potent drug candidates targeting the Human Immunodeficiency Virus (HIV) continues to be a significant global health challenge, largely because of the virus’s rapid mutation rate, the emergence of drug resistance, and the lengthy, expensive nature of traditional drug development processes. In recent years, artificial intelligence (AI) and deep learning have shown great promise in expediting early-stage drug discovery. Notably, Long Short-Term Memory (LSTM) networks have demonstrated remarkable ability in understanding chemical representations and generating new molecular structures through sequence-based notations like SMILES. This paper presents a structured review of LSTM-based and related deep learning approaches applied to HIV drug discovery. Existing studies are critically analysed and classified based on their learning objectives, molecular representation strategies, and validation mechanisms. Key research gaps are identified, including limited generative diversity, lack of multi-objective optimization, and insufficient biological validation. Finally, a conceptual hybrid framework is discussed that integrates LSTM-based molecular generation with advanced evaluation strategies, offering future research directions for scalable and clinically relevant HIV drug discovery.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
🔥 Citations:
0
Abstract: Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset spanning eight common tasks with balanced coverage of common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.
OLLM: Options-based Large Language Models
🔥 Citations:
0
Abstract: We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a \textit{set of learned options} for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only $1.56\%$ of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at $51\%$ final answer correctness, while OLLM's option set allows up to $\sim 70\%$ under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
Adaptive reinforcement learning for recommendation via large language models and knowledge graphs
🔥 Citations:
0
Abstract: No abstract available; see the original article.
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation
🔥 Citations:
0
Abstract: Temporal Knowledge Graph (TKG) extrapolation aims to predict future events based on historical facts. Recent studies have attempted to enhance TKG extrapolation by integrating TKG's evolving structural representations and textual event chains into Large Language Models (LLMs). Yet, two main challenges limit these approaches: (1) the loss of essential spatial-temporal information due to shallow alignment between the TKG's evolving graph structural representation and the LLM's semantic space, and (2) the progressive dilution of the TKG's evolving structural features during LLM fine-tuning. To address these challenges, we propose the Spatial-Temporal Knowledge Adapter (STK-Adapter), which flexibly integrates the evolving graph encoder and the LLM to facilitate TKG reasoning. In STK-Adapter, a Spatial-Temporal MoE is designed to capture spatial structures and temporal patterns inherent in TKGs. An Event-Aware MoE is employed to model intricate temporal semantic dependencies within event chains. In addition, a Cross-Modality Alignment MoE is proposed to facilitate deep cross-modality alignment by TKG-guided attention experts. Extensive experiments on benchmark datasets demonstrate that STK-Adapter significantly outperforms state-of-the-art methods and exhibits strong generalization capabilities in cross-dataset tasks. The code is available at https://github.com/Zhaoshuyuan0246/STK-Adapter.
Challenges and Solutions in Deploying Systematized Nomenclature of Medicine—Clinical Terms in the Chinese Healthcare Context
DOI:
10.1002/hcs2.70069
🔥 Citations:
0
Abstract:
Systematized nomenclature of medicine—clinical terms (SNOMED CT), one of the most comprehensive clinical terminology systems, is pivotal in enhancing healthcare interoperability, clinical data governance, and medical artificial intelligence (AI) development globally. In China, with the rapid growth of large‐scale models and an increasing emphasis on transforming the intrinsic value of healthcare data, the absence of a nationally unified clinical terminology standard poses significant challenges. This commentary provides an in‐depth analysis of the benefits of SNOMED CT for global healthcare, examines the critical deficiencies in Chinese healthcare big data and AI development due to the lack of standardized terminology, and outlines the technical, administrative, and educational challenges encountered in deploying SNOMED CT within Chinese environments. Special emphasis is placed on the potential of advanced large language models in facilitating the mapping of Chinese clinical data to SNOMED CT. We further discuss the necessity of high‐quality data standardization in advancing medical AI in China. Finally, key conclusions and a roadmap for overcoming these challenges are proposed.
Efficient Computation of Multi‐State Survival Signature Based on Graph Neural Network
DOI:
10.1002/qre.70210
🔥 Citations:
0
Abstract:
Reliability assessment of complex multi‐state systems (MSSs) is essential for their safe and efficient operation. Survival signature, a powerful tool for reliability analysis, faces significant computational challenges when applied to MSSs due to combinatorial explosion. However, research on efficient computation of survival signature for MSSs remains scarce and challenging. To address this issue, this study proposes a graph neural network (GNN)‐based approach for predicting survival signatures with improved computational efficiency, which integrates both the topological structure and component state information of MSSs. The proposed method utilizes the graph attention network v2 (GATv2) to dynamically aggregate node features through learnable attention weights. Furthermore, it incorporates the jumping knowledge (JK) framework to adaptively integrate multi‐scale features across different network layers, thereby mitigating over‐smoothing and enhancing hierarchical feature extraction. The numerical example and the application example of dual‐axis positioning mechanisms for satellite antennas demonstrate that, compared with traditional methods such as Monte Carlo simulation (MCS) or enumeration, the proposed approach not only ensures high prediction accuracy and significantly reduces the computational cost of survival signature evaluation, but also efficiently performs multi‐state Birnbaum importance analysis. It provides an effective tool for reliability analysis and optimization of MSSs.
Personalized Benchmarking: Evaluating LLMs by Individual Preferences
🔥 Citations:
0
Abstract: With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual LLM rankings diverge dramatically from aggregate rankings, with Bradley-Terry correlations averaging only $\rho = 0.04$ (57\% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($\rho = 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLMs according to individual user preferences.
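For readers unfamiliar with the rating machinery, per-user Bradley-Terry strengths can be fit from pairwise battle outcomes with the classic minorization-maximization update; a generic sketch with toy data, not the paper's pipeline:

```python
def bradley_terry(wins, iters=500):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of times model i beat model j for one user.
    Uses Hunter's MM update; scores are normalized for identifiability.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            new.append(w_i / den if den else p[i])
        s = sum(new)
        p = [x * n / s for x in new]
    return p

# Toy user: model A usually beats B and C, so A should rank first.
wins = [[0, 8, 6],
        [2, 0, 5],
        [4, 5, 0]]
scores = bradley_terry(wins)
print(max(range(3), key=lambda i: scores[i]))  # 0
```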
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
🔥 Citations:
0
Abstract: Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs). While 4-bit quantization is widely regarded as an optimal trade-off, reducing the precision to 2-bit usually triggers a catastrophic "performance cliff." It remains unclear whether the underlying mechanisms differ fundamentally. Consequently, we conduct a systematic mechanistic analysis, revealing two qualitatively distinct failure modes: Signal Degradation, where the computational patterns remain intact but information precision is impaired by cumulative error; and Computation Collapse, where key components fail to function, preventing correct information processing and destroying the signal in the early layers. Guided by this diagnosis, we conduct mechanism-aware interventions, demonstrating that targeted, training-free repair can mitigate Signal Degradation, but remains ineffective for Computation Collapse. Our findings provide a systematic diagnostic framework for PTQ failures and suggest that addressing Computation Collapse requires structural reconstruction rather than mere compensation.
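The scale of the 2-bit cliff is easy to reproduce with plain symmetric uniform quantization (an illustration of the precision gap only, not the paper's diagnostic method):

```python
import random

def quantize(xs, bits):
    """Symmetric uniform quantization onto a signed grid of 2**(bits-1)-1 levels."""
    qmax = 2 ** (bits - 1) - 1            # 7 levels per side at 4-bit, 1 at 2-bit
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in weight tensor
err4 = mse(w, quantize(w, 4))
err2 = mse(w, quantize(w, 2))
# At 2-bit most weights collapse to zero, so the error is over an
# order of magnitude larger than at 4-bit.
print(err2 > 10 * err4)  # True
```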
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
🔥 Citations:
1
Abstract: Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the "small loss hypothesis", but the unique semantic ambiguity in NTC, such as "partial matching", invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self-dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic "representation pollution". To address this critical challenge, we propose a novel "Expert-Proxy-Diversion" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high-precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy "arbiter" to internalize the expert's discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
Efficient Medical Image Segmentation in Multisensor Imaging: A Survey in the Era of Mamba and Foundation Models
DOI:
10.3390/s26082558
🔥 Citations:
0
Abstract: Deep learning has revolutionized medical image segmentation; however, the clinical deployment of state-of-the-art models is severely impeded by their quadratic computational complexity and substantial resource demands, particularly in multisensor and multimodal imaging scenarios. In response, the field is undergoing a paradigm shift towards efficiency, characterized by the rise of linear-complexity architectures and the optimization of foundation models. This paper presents a comprehensive survey of efficient medical image segmentation methodologies, systematically reviewing the evolution from heavy, accuracy-driven models to lightweight, deployment-ready paradigms. In particular, we highlight the growing importance of efficient segmentation in multisensor medical imaging, where heterogeneous data sources such as CT, MRI, ultrasound, and infrared imaging introduce additional challenges in scalability and computational cost. We propose a novel taxonomy that categorizes these advancements into four distinct streams: (1) Mamba and State Space Models, which leverage selective scanning mechanisms to achieve global receptive fields with linear complexity; (2) Efficient Adaptation of Foundation Models, focusing on parameter-efficient fine-tuning and knowledge distillation to tailor the Segment Anything Model (SAM) for medical domains; (3) Advanced Lightweight Architectures, covering the resurgence of large-kernel CNNs and the emergence of Kolmogorov–Arnold Networks (KANs); and (4) Data-Efficient Strategies, including semi-supervised and federated learning to address annotation scarcity. Furthermore, we conduct a rigorous comparative analysis of representative algorithms on mainstream benchmarks, providing a granular evaluation of the trade-offs between segmentation accuracy and computational overhead. 
The survey also discusses key challenges in multisensor and multimodal settings, including modality heterogeneity, data fusion complexity, and resource constraints. Finally, we identify critical challenges and outline future research directions, serving as a roadmap for the development of next-generation efficient and scalable medical image analysis systems.
Structural insights and biological activity of (E)-4-(1-(2-(4-(4-chlorophenyl)thiazol-2-yl)hydrazono)ethyl)phenol: a potential therapeutic for breast cancer
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input
🔥 Citations:
0
Abstract: Robotic parkour provides a compelling benchmark for advancing locomotion over highly challenging terrain, including large discontinuities such as elevated steps. Recent approaches have demonstrated impressive capabilities, including dynamic climbing and jumping, but typically rely on sequential multilayer perceptron (MLP) architectures with densely activated layers. In contrast, sparsely gated mixture-of-experts (MoE) architectures have emerged in the large language model domain as an effective paradigm for improving scalability and performance by activating only a subset of parameters at inference time. In this work, we investigate the application of sparsely gated MoE architectures to vision-based robotic parkour. We compare control policies based on standard MLPs and MoE architectures under a controlled setting where the number of active parameters at inference time is matched. Experimental results on a real Unitree Go2 quadruped robot demonstrate clear performance gains, with the MoE policy achieving double the number of successful trials in traversing large obstacles compared to a standard MLP baseline. We further show that achieving comparable performance with a standard MLP requires scaling its parameter count to match that of the total MoE model, resulting in a 14.3\% increase in computation time. These results highlight that sparsely gated MoE architectures provide a favorable trade-off between performance and computational efficiency, enabling improved scaling of control policies for vision-based robotic parkour. An anonymized link to the codebase is https://osf.io/v2kqj/files/github?view_only=7977dee10c0a44769184498eaba72e44.
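The efficiency argument above hinges on top-k routing: only k experts execute per forward pass, so total parameters can grow while active compute stays fixed. A minimal gating sketch (shapes, gate weights, and expert functions are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Sparsely gated MoE: run only the top-k experts, mix by renormalized gate probs."""
    logits = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    topk = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in topk])
    # Only k experts run here; the others cost nothing at inference time.
    return sum(p * experts[i](x) for p, i in zip(probs, topk))

# Four scalar "experts"; only two are active for any given input.
experts = [lambda x, a=a: a * sum(x) for a in (0.5, 1.0, 2.0, 3.0)]
gate_weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
y = moe_forward([3.0, 2.0, 1.0], experts, gate_weights, k=2)
```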
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
🔥 Citations:
0
Abstract: Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
🔥 Citations:
0
Abstract: Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
Understanding the Mechanism of Altruism in Large Language Models
🔥 Citations:
0
Abstract: Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions validate their functional role: activation patching and continuous steering of this feature direction reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction generalizes across multiple social-preference games. Together, these results enhance our understanding of artificial cognition by translating altruistic behaviors into identifiable network states and provide a framework for aligning LLM behavior with human values, thereby informing more transparent and value-aligned deployment.
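Mechanically, "continuous steering" of a feature direction amounts to adding a scaled unit vector to the model's activations. A toy sketch of that operation (the actual SAE features, layers, and hook points are the paper's and are not reproduced here):

```python
def steer(activations, direction, alpha):
    """Shift each activation vector by alpha along the unit feature direction."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    return [[a + alpha * u for a, u in zip(vec, unit)] for vec in activations]

# Two token activations in a 2-d toy residual stream.
acts = [[0.0, 1.0], [2.0, -1.0]]
# direction (3, 4) has unit vector (0.6, 0.8); alpha scales the push,
# and a negative alpha steers the behavior the other way.
steered = steer(acts, direction=[3.0, 4.0], alpha=5.0)
```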
Structure-guided molecular design with contrastive 3D protein-ligand learning
🔥 Citations:
0
Abstract: Structure-based drug discovery faces the dual challenge of accurately capturing 3D protein-ligand interactions while navigating ultra-large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)-equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero-shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target-specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.
Computational Drug Repurposing Targeting LuxS-Mediated Quorum Sensing in Fusobacterium nucleatum: A Virtual Screening and Molecular Dynamics Approach
🔥 Citations:
0
Abstract: Not available; please see the original article.
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
🔥 Citations:
0
Abstract: We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
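The paper's passing criterion (at least 50% recall on every ATT&CK tactic) reduces to a per-tactic recall check over flagged timestamps. A minimal sketch, with invented tactic labels and flag sets standing in for the benchmark's real ground truth:

```python
def tactic_recall(truth_flags, submitted_flags):
    """Fraction of ground-truth flags the agent submitted, per tactic."""
    recall = {}
    for tactic, flags in truth_flags.items():
        hit = len(flags & submitted_flags)
        recall[tactic] = hit / len(flags) if flags else 1.0
    return recall

def passes(truth_flags, submitted_flags, bar=0.5):
    """Pass only if recall clears the bar on *every* tactic."""
    scores = tactic_recall(truth_flags, submitted_flags)
    return all(r >= bar for r in scores.values())

# Illustrative placeholders, not the benchmark's actual schema.
truth = {
    "execution": {"t1", "t2"},           # ground-truth malicious timestamps
    "persistence": {"t3", "t4", "t5"},
}
submitted = {"t1", "t3", "t4"}
print(tactic_recall(truth, submitted))   # execution 0.5, persistence 2/3
print(passes(truth, submitted))          # True: both tactics clear the bar
```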
Fra juks til læring (From Cheating to Learning)
🔥 Citations:
0
Abstract: Artificial intelligence (AI), particularly in the form of large language models, can be highly useful in academic work. While many students are using such chatbots extensively, others remain skeptical or uncertain about the new technology. To prevent a growing divide between these groups, the Learning Support Center at OsloMet has developed a course in AI literacy. The course is grounded in a learning philosophy in which knowledge and understanding are constructed and developed through dialogue and collaboration. AI literacy encompasses attitudes as well as intellectual understanding and technical skills. Therefore, emotional engagement is also incorporated into the course. In the case of large language models, students have often taken up these tools without prior training. In developing the course, an anthropological approach was used to understand how GenAI has influenced the learning culture among students. Their experiences and attitudes toward GenAI are integrated into the course to foster personal engagement. The article presents the pedagogical approaches rather than describing the entire course. This is illustrated with five teaching examples that show how dialogue, including with the language model, and active learning activities build experience and awareness for using GenAI in a critical, learning-oriented, and more effective way.
Construction of Knowledge Graph based on Language Model
🔥 Citations:
0
Abstract: Knowledge Graphs (KGs) can effectively integrate valuable information from massive data, and have therefore been rapidly developed and widely adopted in many fields. Traditional KG construction methods rely on manual annotation, which consumes substantial time and manpower, while KG construction schemes based on deep learning tend to have weak generalization capabilities. With the rapid development of Pre-trained Language Models (PLMs), PLMs have shown great potential in the field of KG construction. This paper provides a comprehensive review of recent research advances in constructing KGs with PLMs. We explain how PLMs can use their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework based on a lightweight Large Language Model (LLM), named LLHKG, and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to GPT-3.5.
On Reasoning-Centric LLM-based Automated Theorem Proving
🔥 Citations:
0
Abstract: Automated theorem proving is fundamental to formal methods, and the recent trend is to integrate large language models (LLMs) and proof assistants to form effective proof agents. While existing proof agents show promising performance, they inadequately leverage reasoning capabilities of modern LLMs in high-level planning and self-critique. We argue that proof agents should not merely generate tactics but also reason strategically about proof plans and critically evaluate their own proposals. This paper introduces ReCent-Prover, a reasoning-centric LLM-based proof agent for Rocq that addresses two critical limitations in current systems. First, we present validation with reflection, enabling LLMs to scrutinize their generated tactics and synthesize failure summaries when reflection identifies potential errors, filtering out potentially misapplied tactics earlier. Second, we propose retrieval with planning, which conditions retrieval on LLM-generated proof plans rather than subgoal similarity, retrieving lemmas and proofs that align with the anticipated proof strategy. Both techniques increase the number of invocations of LLMs. However, when evaluated on the CoqStoq benchmark, even under the same budget of LLM invocations, ReCent-Prover achieves a 22.58% relative improvement in the number of proved theorems over the previous state-of-the-art, demonstrating that our reasoning-centric design significantly enhances automated theorem proving capabilities.
ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
🔥 Citations:
0
Abstract: Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.
Continuous Semantic Caching for Low-Cost LLM Serving
🔥 Citations:
0
Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic ε-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.
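The ε-net idea can be sketched with a standard greedy covering rule: keep a query embedding as a new cache centroid only if it is farther than ε from every existing centroid, so every observed query ends up within ε of some centroid. This is an illustrative simplification of the paper's dynamic discretization, using plain Euclidean distance on toy 2-D embeddings:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_eps_net(embeddings, eps):
    """Greedy epsilon-net: a point becomes a new center only if it is
    farther than eps from all existing centers. Afterwards, every
    embedding lies within eps of some center (covering property)."""
    centers = []
    for e in embeddings:
        if all(euclidean(e, c) > eps for c in centers):
            centers.append(e)
    return centers

# Toy 2-D "query embeddings"; real ones are high-dimensional.
queries = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.95, 1.05)]
net = greedy_eps_net(queries, eps=0.3)
print(net)  # two centers survive: (0.0, 0.0) and (1.0, 1.0)
```

Cost feedback observed for one query can then be generalized to its centroid's whole ε-neighborhood, which is where the kernel regression enters in the paper.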
Synthesis, Anti-Inflammatory and Anticancer Activity Evaluation of Some Novel Acridine Derivatives
🔥 Citations:
0
Abstract: Acridine and its derivatives have long occupied a special place in medicinal chemistry owing to their broad-spectrum pharmacological activities. In the present study, a series of twelve novel acridine-based compounds (ACD-1 through ACD-12) were designed, synthesized, and characterized with the primary objective of identifying potent anti-inflammatory and anticancer agents with an acceptable safety profile. The synthetic route involved a Doebner–Miller condensation reaction followed by nucleophilic substitution, yielding target molecules bearing diverse substituents at the 9-position and on the peripheral aromatic rings. All synthesized compounds were characterized by 1H NMR, 13C NMR, IR, and high-resolution mass spectrometry (HRMS). For anti-inflammatory screening, the compounds were evaluated through in vitro albumin denaturation inhibition, membrane stabilization, and heat-induced hemolysis assays. Compounds ACD-5, ACD-8, and ACD-11 emerged as the most promising anti-inflammatory agents, with ACD-8 showing 78.4% inhibition of albumin denaturation at 500 µg/mL, outperforming the standard drug diclofenac sodium (72.1%). The anticancer potential of the synthesized compounds was assessed against four human cancer cell lines, MCF-7 (breast), HeLa (cervical), A549 (lung), and HCT116 (colon), using the MTT assay. ACD-5 and ACD-11 exhibited remarkable cytotoxicity, with IC50 values of 3.24 µM and 4.11 µM against MCF-7 cells, respectively, surpassing the reference drug doxorubicin (IC50 = 5.82 µM) in selectivity. Molecular docking studies revealed that these compounds interact favorably with COX-2 and topoisomerase II enzymes through hydrogen bonding and π–π stacking interactions. Furthermore, in silico ADMET profiling indicated drug-like properties consistent with Lipinski's Rule of Five for most derivatives. The combined biological data from this study strongly suggest that these novel acridine derivatives warrant further investigation as leads in anti-inflammatory and anticancer drug discovery.
Time Series Augmented Generation for Financial Applications
🔥 Citations:
0
Abstract: Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.
Evaluating Protein Language Model Embeddings for Viral Clade Assignment
🔥 Citations:
0
Abstract: Protein language models (PLMs) provide powerful sequence representations, yet their effectiveness for unsupervised viral clade assignment remains uncertain. In this study, we evaluated embeddings from ProtT5, ProtBert, CARP, and several ESM-2 variants on influenza A/H3N2 hemagglutinin sequences. Using dimensionality reduction (t-SNE, UMAP, PCA, MDS) and clustering with HDBSCAN, we compared PLM embeddings against baseline Hamming distance approaches. Our results show that t-SNE combined with PLM embeddings can recover clade structure, with ProtBert yielding the most stable performance and larger ESM-2 models occasionally achieving lower normalized variation of information scores but with greater variability. These findings suggest that while PLM embeddings capture clade-relevant signals, they also suffer from instability and the loss of site- or nucleotide-specific detail. Future improvements in pooling strategies may enhance their utility for viral surveillance.
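The baseline the embeddings are compared against is straightforward: a normalized Hamming distance over aligned sequences, whose pairwise matrix can then feed any clustering method. A minimal sketch with short invented fragments in place of real hemagglutinin alignments:

```python
def hamming(seq_a, seq_b):
    """Normalized Hamming distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)

def distance_matrix(seqs):
    """Symmetric pairwise distance matrix, zero on the diagonal."""
    return [[hamming(s, t) for t in seqs] for s in seqs]

# Toy aligned fragments (invented; real HA sequences are ~550 residues).
seqs = ["MKTIIALSYI", "MKTIIALSYV", "MQTIVALSYI"]
D = distance_matrix(seqs)
print(D[0][1])  # 0.1: one mismatch out of ten positions
```

Unlike PLM embeddings, this distance keeps exact site-level detail, which is the trade-off the abstract highlights.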
Graph-Theoretic Models for the Prediction of Molecular Measurements
🔥 Citations:
0
Abstract: Graph-theoretic approaches offer simplicity, interpretability, and low computational cost for molecular property prediction. Among these, the model proposed by Mukwembi and Nyabadza, based on the external activity D(G) and internal activity ζ(G) indices, achieved strong results on a small flavonoid dataset. However, its ability to generalize to larger and chemically diverse datasets has not been tested. This study evaluates the baseline D(G)-ζ(G) polynomial model on five benchmark datasets from MoleculeNet, covering biological activity (BACE, 1,513 molecules), lipophilicity (LogP synthetic, 14,610 molecules; LogP experimental, 753 molecules), aqueous solubility (ESOL, 1,128 molecules), and hydration free energy (SAMPL, 642 molecules). The baseline model achieves an average R² = 0.24, confirming limited transferability. To address this, a systematic enhancement framework is proposed, progressively incorporating Ridge regularization, additional graph descriptors, physicochemical properties, ensemble learning with Gradient Boosting, Lasso feature selection, and a hybrid approach combining topological indices with Morgan fingerprints. The enhanced models raise the average best R² to 0.79, with individual improvements ranging from 165% to 274%. All improvements are statistically significant (p < 0.001). A direct comparison with a Graph Convolutional Network under identical experimental conditions shows that the enhanced classical models match or outperform deep learning on all five datasets. Comparison with the recent GNN+PGM hybrid of Djagba et al. further confirms competitiveness, with the enhanced models achieving the best results on two datasets and tying on one. The entire framework requires no GPU, trains in under five minutes, and uses only open-source tools, making it accessible for researchers in resource-limited settings.
TACO: TabPFN Augmented Causal Outcomes for early detection of Long COVID
🔥 Citations:
0
Abstract: Long COVID, or Post-Acute Sequelae of COVID-19 (PASC), affects 10 to 40% of COVID-19 survivors worldwide, manifesting as persistent symptoms that last months after initial infection, including fatigue, cognitive impairment, and organ dysfunction. The heterogeneous nature of Long COVID and its delayed onset creates a critical window in which at-risk patients remain unidentified during the presymptomatic phase, when interventions could be most beneficial. Current prediction methods rely on clinical symptoms that manifest too late for early intervention or use molecular approaches based on statistical associations that cannot distinguish disease drivers from correlational noise. We present TACO (TabPFN-Augmented Causal Outcomes), a novel method for the early detection of Long COVID using gene expression data. TACO first applies Differential Causal Effect (DCE) analysis to identify genes with putative regulatory causal relationships to the pathogenesis of Long COVID, rather than simple statistical associations. It then integrates these mechanistic features with the Tabular Prior-Data Fitted Network (TabPFN), a foundation model that requires no hyperparameter tuning, alongside complementary classifiers in an ensemble approach. Experimental results show that TACO performs competitively with existing machine learning methods in predicting Long COVID from gene expression data from patients with early-stage COVID-19, while offering a causally grounded and interpretable feature selection. An ablation study confirms the positive contribution of TabPFN to the ensemble, and a strict nested cross-validation protocol, in which DCE feature selection is performed exclusively within each training fold, confirms that the results are robust to data leakage. The nested CV analysis further identifies a minimal reproducible causal gene signature with potential utility for targeted clinical sequencing during acute hospitalization for COVID-19.
In addition, the causal genes identified by TACO are highly relevant to Long COVID-related functions and pathways, as confirmed by the literature and subsequent analyses, providing insights into disease mechanisms and potential treatments.
OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting
🔥 Citations:
0
Abstract: The rise of large language models (LLMs) has brought about a class of prompt-based interactive systems in which users primarily express their input in natural language. However, composing a prompt as a linear text string becomes unwieldy when capturing users' multifaceted intents. We present Object-Oriented Prompting (OOPrompt), an emergent interaction paradigm that enables users to create, edit, iterate, and reuse prompts as structured, manipulable artifacts, unifying and generalizing several existing point systems. We first outlined a design space from existing work and built an early prototype, which we deployed as a probe in a formative study with 20 participants. Their feedback informed an expanded OOPrompt design space. We then developed the full OOPrompt prototype and conducted a validation study to further understand OOPrompt's added values and trade-offs. We expect the OOPrompt design space to provide theoretical and empirical guidance for the design and engineering of prompt-based, LLM-enabled interactive systems.
Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
🔥 Citations:
0
Abstract: With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along dimensions such as clarity and originality. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of the evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
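The surface features examined, word and sentence length, are easy to compute; the whitespace/punctuation tokenizer below is a deliberately crude stand-in for the study's actual text-processing pipeline:

```python
import re

def linguistic_features(review_text):
    """Mean sentence length (in words) and mean word length (in
    characters). Naive splitting on sentence-final punctuation and
    whitespace; an illustrative stand-in for a proper NLP pipeline."""
    sentences = [s for s in re.split(r"[.!?]+", review_text) if s.strip()]
    words = [re.sub(r"\W", "", w) for w in review_text.split()]
    words = [w for w in words if w]
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "chars_per_word": sum(map(len, words)) / max(len(words), 1),
    }

feats = linguistic_features("The paper is clear. However, novelty is limited.")
# Longer, more fluent LLM-era reviews would push both numbers upward.
print(feats)
```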
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
🔥 Citations:
0
Abstract: At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve-rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
🔥 Citations:
0
Abstract: More than half of the Top 500 supercomputers employ GPUs as accelerators. On GPU-accelerated platforms, developers face a key diagnostic gap: profilers show source lines where stalls occur, but not why they occur. Furthermore, the same kernel may have different stalls and underlying causes on different GPUs. This paper presents LEO, a root-cause analyzer for NVIDIA, AMD, and Intel GPUs that performs backward slicing from stalled instructions, considering dependencies arising from registers as well as vendor-specific synchronization mechanisms. LEO attributes GPU stalls to source instructions with the goal of explaining the root causes of these inefficiencies. Across 21 workloads on three GPU platforms, LEO-guided optimizations deliver geometric-mean speedups of 1.73× to 1.82×. Our case studies show that (1) the same kernel may require different optimizations for different GPU architectures, and (2) LEO's structured diagnostics improve code optimization with large language models relative to code-only and raw-stall-count baselines.
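Backward slicing from a stalled instruction can be sketched as a def-use walk over register dependencies: starting from the registers the stalled instruction reads, walk backwards collecting the instructions that produced them. This toy version ignores memory and the vendor-specific synchronization dependencies LEO also tracks:

```python
def backward_slice(instructions, stalled_idx):
    """Toy register-only backward slice. Each instruction is
    (dest_register, [source_registers]); returns the indices of
    instructions the stalled one transitively depends on."""
    needed = set(instructions[stalled_idx][1])
    slice_idxs = []
    for i in range(stalled_idx - 1, -1, -1):
        dest, srcs = instructions[i]
        if dest in needed:            # this instruction defines a needed value
            slice_idxs.append(i)
            needed.discard(dest)      # most recent definition wins
            needed.update(srcs)       # its inputs become needed in turn
    return list(reversed(slice_idxs))

instrs = [
    ("r1", []),       # 0: load r1
    ("r2", ["r1"]),   # 1: r2 = f(r1)
    ("r3", []),       # 2: unrelated load
    ("r4", ["r2"]),   # 3: stalled instruction, waiting on r2
]
print(backward_slice(instrs, 3))  # [0, 1]: the producing chain, skipping 2
```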
Information Aggregation with AI Agents
🔥 Citations:
0
Abstract: Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.
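The headline measure, the log error of the last price, can be sketched under the assumption that it compares the closing market price with the full-information value implied by all private signals; the paper's exact normalization may differ:

```python
import math

def log_error(last_price, full_info_value):
    """Absolute log distance between the market's closing price and
    the value a trader who saw every private signal would set.
    Zero means perfect aggregation. (Assumed definition; the paper
    may normalize differently.)"""
    return abs(math.log(last_price) - math.log(full_info_value))

perfect = log_error(0.8, 0.8)  # price fully reflects the pooled signals
partial = log_error(0.6, 0.8)  # some private information never reached the price
print(perfect, partial)
```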
Conditional Generative AI in Oncology Diagnostics
DOI:
10.3390/app16084015
🔥 Citations:
0
Abstract: The increasing complexity of oncology diagnostics requires advanced Clinical Decision Support Systems (CDSS) capable of integrating multimodal data. Traditional discriminative models often struggle with missing data and cross-modal dependencies. This review provides a novel, systematic analysis of conditional generative artificial intelligence (AI), including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), diffusion models and Multimodal Large Language Models (MLLMs), specifically tailored for oncological CDSS. We examine how these architectures move beyond simple prediction to learn joint data distributions, enabling robust data imputation, virtual staining, and automated clinical reporting. A central focus of this work is the assessment of translational application, identifying the gaps between experimental proof-of-concepts and clinical deployment. We address critical hurdles such as model hallucinations, domain shift, and demographic bias, providing a roadmap for biological consistency and regulatory compliance. This review highlights the transition from task-specific generators to multimodal reasoning systems. Ultimately, we argue that the integration of generative AI into diagnostic workflows is essential for precision oncology, provided that human-in-the-loop validation and uncertainty-aware inference remain central to their implementation.
PlayCoder: Making LLM-Generated GUI Code Playable
🔥 Citations:
0
Abstract: Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
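The Play@k metric as defined (at least one of k generated candidates plays end-to-end without logical errors) is a simple aggregation over per-candidate playthrough outcomes. A sketch with invented outcomes:

```python
def play_at_k(outcomes, k):
    """outcomes: one list of booleans per task, True if that generated
    candidate played end-to-end without logic errors. Play@k counts a
    task as solved if any of its first k candidates succeeds."""
    solved = sum(any(task[:k]) for task in outcomes)
    return solved / len(outcomes)

# Invented playthrough results for three tasks, three candidates each.
outcomes = [
    [False, True, False],   # second candidate is playable
    [False, False, False],  # none playable
    [True, False, True],
]
print(play_at_k(outcomes, 3))  # 2/3 of tasks have a playable candidate
print(play_at_k(outcomes, 1))  # only 1/3 succeed on the first try
```

In the paper the boolean judgments come from the PlayTester agent's task-oriented GUI playthroughs rather than a fixed oracle.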
LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues
🔥 Citations:
0
Abstract: More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs' capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts. This reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neural component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.
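The symbolic stage is described as sparse linear models with correlation-based classification; a toy stand-in that uses Pearson correlations as factor weights conveys the idea. All factor vectors, labels, and factor names below are invented for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation between a factor's presence (0/1 across
    past cases) and the relevance label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Toy history: rows are past cases; labels say whether the candidate
# issue was relevant. Values invented.
factor_a = [1, 1, 1, 0, 0, 0]   # tracks the label closely
factor_b = [1, 0, 1, 0, 1, 0]   # uninformative
labels = [1, 1, 0, 0, 0, 0]

weights = {"a": pearson(factor_a, labels), "b": pearson(factor_b, labels)}

def score_issue(present_factors):
    """Relevance score: sum of correlation weights of the factors the
    LLM-generated QA pairs answered affirmatively."""
    return sum(weights[f] for f in present_factors)

print(score_issue({"a"}) > score_issue({"b"}))  # informative factor dominates
```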
Editorial for “Pre‐Imaging Clinical Factors Associated With Cardiac MR Image Quality Using Large Language Model‐Enabled Data Extraction”
DOI:
10.1002/jmri.70346
🔥 Citations:
0
Abstract: Not available; please see the original article.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
🔥 Citations:
0
Abstract: Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled-model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.
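Information-centric ranking with SCUs can be sketched as scoring each candidate by the summed importance of the content units it covers; the substring matching and weights below are illustrative stand-ins for the paper's semantic matching:

```python
def scu_score(summary, scus):
    """Sum the importance weights of the SCUs a summary covers.
    Naive lowercase substring matching stands in for semantic
    matching of content units."""
    return sum(w for scu, w in scus.items() if scu in summary.lower())

def rank_candidates(candidates, scus):
    """Order candidate summaries by covered information content."""
    return sorted(candidates, key=lambda s: scu_score(s, scus), reverse=True)

# Invented SCUs with importance weights, and two candidate summaries.
scus = {"profits fell": 2.0, "ceo resigned": 3.0, "shares dropped": 1.0}
candidates = [
    "Shares dropped after a weak quarter.",
    "The CEO resigned as profits fell sharply.",
]
best = rank_candidates(candidates, scus)[0]
print(best)  # the candidate covering the two highest-weight SCUs wins
```

The top-ranked candidate per document would then serve as the distillation target for the SLM.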
DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
🔥 引用:
0
Abstract: Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grained detail prompted us to explore the potential of a more recent self-supervised encoder, DINO. Building on this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat it, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
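The mean-pooling baseline the abstract starts from, averaging frozen per-view features into a single 3D descriptor, can be sketched in a few lines. This is a minimal illustration with random features standing in for DINO outputs; the function name is ours, not the paper's:

```python
import numpy as np

def mean_pool_descriptor(view_feats: np.ndarray) -> np.ndarray:
    """Average per-view features (V, D) into one L2-normalized 3D descriptor (D,)."""
    desc = view_feats.mean(axis=0)
    return desc / np.linalg.norm(desc)

rng = np.random.default_rng(0)
views = rng.normal(size=(12, 768))   # 12 rendered views, 768-dim frozen features
descriptor = mean_pool_descriptor(views)
print(descriptor.shape)              # (768,)
```

The paper's CAM module replaces exactly this step with chunk-wise dynamic integration; the sketch only shows the pooling baseline it improves on.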
TripleBind: a generalizable deep learning framework for protein-nucleic acid and protein-ligand binding sites prediction based on pre-trained protein language models
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection
🔥 引用:
0
Abstract: Cross-site scripting (XSS) remains a persistent web security vulnerability, especially because obfuscation can change the surface form of a malicious payload while preserving its behavior. These transformations make it difficult for traditional and machine learning-based detection systems to reliably identify attacks. Existing approaches for generating obfuscated payloads often emphasize syntactic diversity, but they do not always ensure that the generated samples remain behaviorally valid. This paper presents a structured pipeline for generating and evaluating obfuscated XSS payloads using large language models (LLMs). The pipeline combines deterministic transformation techniques with LLM-based generation and uses a browser-based runtime evaluation procedure to compare payload behavior in a controlled execution environment. This allows generated samples to be assessed through observable runtime behavior rather than syntactic similarity alone. In the evaluation, an untuned baseline language model achieves a runtime behavior match rate of 0.15, while fine-tuning on behavior-preserving source-target obfuscation pairs improves the match rate to 0.22. Although this represents a measurable improvement, the results show that current LLMs still struggle to generate obfuscations that preserve observed runtime behavior. A downstream classifier evaluation further shows that adding generated payloads does not improve detection performance in this setting, although behavior-filtered generated samples can be incorporated without materially degrading performance. Overall, the study demonstrates both the promise and the limits of applying generative models to adversarial security data generation and emphasizes the importance of runtime behavior checks in improving the quality of generated data for downstream detection systems.
Tracing Relational Knowledge Recall in Large Language Models
🔥 引用:
0
Abstract: We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, as well as characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how distributed the signal on which the probe relies is across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to reveal probe behavior in further detail.
Dose-dependent modeling of combinatorial drug responses stratifies patient survival and reveals therapeutic vulnerabilities in precision oncology
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Biomedical systems biology workflow orchestration and execution with PoSyMed
🔥 引用:
0
Abstract: The rapid growth of scientific software has created practical barriers for bioinformatics research. Although powerful statistical, artificial intelligence (AI)-based methods are now widely available, their effective use is often hindered by fragmented distribution, inconsistent documentation, complex dependencies, and difficult-to-reproduce execution environments. As a result, reusing published tools and adapting workflows to one's own data remains technically demanding and time-intensive, even for experienced users. Here, we present PoSyMed, an open and modular platform for the controlled integration, composition, and execution of bioinformatics tools and workflows. PoSyMed combines a backend-centered platform architecture with formal tool descriptions, controlled container-based build and execution processes, persistent workflow state, and a dialogue-based user interface. Large language models (LLMs) are integrated not as autonomous decision-makers but as a human-computer interface: bounded semantic assistants that help identify tools, propose workflow steps, and support parameterization within a typed, validated, and human-supervised execution environment. PoSyMed is designed to improve reproducibility, traceability, and transparency in practical biomedical analysis within one platform. We describe the system architecture and evaluate its behavior across representative biological software scenarios with respect to workflow support, interaction design, and platform extensibility. PoSyMed is publicly available at https://apps.cosy.bio/posymed.
SimDiff: Depth Pruning via Similarity and Difference
🔥 引用:
0
Abstract: Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
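The one-dimensional heuristic the abstract critiques, ranking layers by the cosine distance between their input and output hidden states, is simple to state. A minimal sketch follows (our own naming; SimDiff's MSSD and MASD metrics are paper-specific and not reproduced here):

```python
import numpy as np

def layer_cosine_distance(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Mean cosine distance between a layer's input and output hidden states.

    h_in, h_out: (tokens, dim). A low distance means the layer barely changes
    its input, the usual signal that it is redundant and a pruning candidate.
    """
    num = (h_in * h_out).sum(axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(np.mean(1.0 - num / den))

# A layer acting as a near-identity map scores ~0 (maximally "redundant"),
# which is exactly the signal SimDiff argues is insufficient on its own.
rng = np.random.default_rng(1)
h = rng.normal(size=(16, 64))
assert layer_cosine_distance(h, h) < 1e-9
```

SimDiff's contribution is to pair this similarity view with transformation-difference metrics, so a layer that looks redundant by cosine distance but makes decisive corrections is not pruned.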
Graph neural network approach for exploring urban material circularity: the case of Edinburgh
🔥 引用:
0
Abstract:
The transition towards a circular economy (CE) requires cities to understand how materials, waste and resources circulate through complex urban systems. Yet, current analytical tools remain limited to static or sectoral datasets and cannot capture the dynamic, relational nature of circular flows. This study aims to develop and apply a graph-based analytical framework for exploring urban material circularity in Edinburgh, using graph neural networks (GNNs) to model, visualise and predict interconnections among recycling, reuse and repair activities.
The study adopts a structured, multi-stage computational workflow to analyse urban circular infrastructure as a spatial–relational system. First, CE-related facilities (recycling, reuse and repair) were extracted from OpenStreetMap using tag-based queries and cleaned through GIS-based preprocessing to generate a georeferenced point dataset (Step 1). Second, these facilities were formalised as a proximity-based urban graph, where nodes represent facilities and edges encode spatial interaction potential derived from geographic distance; network characteristics were computed and verified using Google Colab and Gephi (Step 2). Third, the resulting graph was transferred into a GNN learning environment implemented in PyTorch Geometric, where message-passing architectures were trained to learn latent structural patterns and relational similarities across the network (Step 3). Fourth, the learned node embeddings and class probabilities were integrated with spatial and topological attributes to derive a composite circularity potential score for each facility, capturing functional alignment, spatial proximity and network embeddedness (Step 4). Finally, model outputs were reprojected onto the urban geography using Kepler.gl to enable spatial contextualisation and interpretation of circularity patterns across Edinburgh (Step 5).
The results reveal a strongly hierarchical circular system in Edinburgh, characterised by dense recycling clusters at the urban core, a semi-peripheral band of reuse nodes and structurally marginal repair facilities. Network metrics and GNN embeddings converge to show that recycling nodes dominate connectivity and form the principal metabolic backbone, while reuse sites act as intermediary bridges that extend circular exchanges beyond the centre. Repair nodes remain spatially fragmented and weakly integrated, signalling latent but unrealised circular capacity. The derived circularity potential scores further expose critical spatial gaps and highlight neighbourhoods where targeted interventions could significantly enhance systemic material recirculation.
This study advances the understanding of urban metabolism by framing cities as dynamic, learning systems where material, infrastructural and socio-economic interactions evolve through continuous feedback. Methodologically, the research operationalises this systemic perspective through GNNs, which computationally simulate feedback loops and relational dependencies across the urban material network. This integration of systems thinking and graph-based learning introduces a novel approach for capturing the emergent behaviour of circular systems, providing a transferable, data-driven framework for evaluating and forecasting material dynamics in cities.
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
🔥 引用:
0
Abstract: Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
🔥 引用:
0
Abstract: Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained using only 0.016B tokens on a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
🔥 引用:
0
Abstract: Code editing constitutes a fundamental practice in software development, wherein developers modify existing codebases according to natural language requirements. Accurate code editing necessitates a comprehensive understanding of both the existing codebase and the modification requirements. Although large language models (LLMs) have demonstrated promising performance in code editing tasks, they suffer from substantial inefficiency by generating entire modified files that largely consist of unchanged code. While smaller models could potentially address this inefficiency, they typically lack the capacity to effectively comprehend long code contexts required for accurate editing. To ensure both effectiveness and efficiency, we propose to decompose code editing into a two-stage cascade: edit sketch generation, wherein a large model first produces concise sketches representing the requisite modifications (the more challenging phase), and edit sketch application, wherein a smaller model integrates these sketches into the original code to produce the final edited code (the simpler phase). This cascaded design reduces the number of tokens generated by the large model, as the majority of the output is handled by the smaller, more efficient model, thereby enhancing overall efficiency. However, the effectiveness of this approach is constrained by current small models' limited capabilities in handling long-context scenarios and cross-file dependencies, which are essential for accurate sketch application in real-world codebases. To address these limitations and enhance smaller models' sketch application capabilities, ...
Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4
🔥 引用:
0
Abstract: Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1) semantic operator naming achieves a 100% bypass rate (50/50, p < 0.001); (2) the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0%; (3) example ordering matters strongly (interleaved: 76%, harmful-first: 6%); (4) temperature has no meaningful effect (46-56% across 0.0-1.0). On the HarmBench benchmark, IICL achieves a 24.0% bypass rate [18.6%, 30.4%] against GPT-5.4 with detailed 619-word responses, compared to 0.0% for direct queries.
Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing
🔥 引用:
0
Abstract: Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .
In Silico Study of Bioactive Compounds from Indonesian Medicinal Plants as Xanthine Oxidase Inhibitors via Molecular Docking and ADMET Approaches
🔥 引用:
0
Abstract: Xanthine oxidase (XO) plays a role in the formation of uric acid and contributes to hyperuricemia, whereas the use of synthetic inhibitors such as allopurinol is known to have side effects, thus requiring alternatives from the bioactive compounds of medicinal plants. This study aims to evaluate the potential of Dillapiole, Piperine, Hydroxychavicol, Panduratin A, and Isolicoflavonol as XO inhibitors through an in silico approach using molecular docking, as well as Lipinski and ADMET analyses. The results showed that most ligands met the drug-likeness criteria, except for Panduratin A, which had one violation of LogP. All ligands showed negative binding affinity, with Isolicoflavonol having the best affinity (−9.5 kcal/mol), followed by Piperine and Panduratin A. ADMET predictions showed that most ligands had good absorption and were not mutagenic, although some ligands had the potential to interact with CYP450 enzymes. Overall, Isolicoflavonol showed the best potential as an XO inhibitor candidate based on binding affinity and ADMET profile. These findings affirm the potential of medicinal plant bioactive compounds as alternative XO inhibitors, although further in vitro and in vivo testing is still needed for further validation.
scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics
🔥 引用:
0
Abstract: The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.
LLM‐Based Scientific Assistants for Knowledge Extraction: Which Design Choices Matter?
🔥 引用:
0
Abstract: Large Language Model chatbots have gained significant popularity, offering knowledge to support specialists in diverse fields. However, adapting models to specific use cases and specialized domains presents considerable challenges. Hence, we introduce the LLM Playground, a comprehensive approach to optimizing LLMs for specialist applications with respect to their accuracy in answering domain‐specific questions, addressing the limitations of unmodified models. The utilized optimization techniques begin with Prompt Engineering, advance to the integration of external knowledge, and culminate in complex reasoning strategies or self‐feedback loops. This paper introduces various architectures for scientific assistants, comprising individual enhancement techniques, both in isolation and in combination with others, designed to facilitate comparisons. To demonstrate the efficacy of the LLM Playground, a chemical chatbot is set up as a case study, and the optimization techniques are compared using ChemBench, an independent question–answer benchmark for the chemical domain, to measure its performance. By providing tested, ready‐to‐deploy architectures and clear use‐case guidance, this work helps researchers and practitioners leverage LLMs in domain‐specific applications. The insights and methodologies presented in this paper contribute to the growing body of knowledge on tailoring LLMs to meet the unique demands of specialized fields.
Epistemic orientation in parliamentary discourse is associated with deliberative democracy
🔥 引用:
0
Abstract: The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence-Minus-Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
Detoxification for LLM: From Dataset Itself
🔥 引用:
0
Abstract: Existing detoxification methods for large language models mainly focus on the post-training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training-based or controllable decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify directly on raw corpora with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, in our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: https://github.com/ntsw2001/data_detox_for_llm)
When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift
🔥 引用:
0
Abstract: The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
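The edge-shuffle ablation mentioned above has a simple form: randomize the graph's wiring while keeping the edge count fixed, then check whether GNN performance drops. A hedged sketch of one common variant (permuting destination endpoints; the paper's exact shuffle protocol may differ):

```python
import numpy as np

def edge_shuffle(edge_index: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomize topology by permuting the destination endpoint of every edge.

    edge_index: (2, E) array of (src, dst) pairs. The edge count and the
    multiset of destinations are preserved, but src-dst pairing is destroyed.
    """
    shuffled = edge_index.copy()
    shuffled[1] = rng.permutation(shuffled[1])
    return shuffled

rng = np.random.default_rng(42)
edges = np.array([[0, 1, 2, 3],
                  [1, 2, 3, 0]])          # toy 4-node cycle
rand_edges = edge_shuffle(edges, rng)
assert rand_edges.shape == edges.shape    # same number of edges
```

If a GNN trained on the shuffled graph matches or beats the one trained on the real graph, as the abstract reports for Elliptic, the topology is contributing noise rather than signal under the evaluation protocol.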
Building smarter digital content: a CRITIC – DEMATEL framework for leveraging large language model optimization in marketing
🔥 引用:
0
Abstract:
This study addresses a practical problem faced by digital marketers, content creators and digital business agencies: creating content that is both human-readable and LLM-compatible. It identifies and analyses the key factors influencing content optimization for Large Language Models (LLMs) to develop a strategic framework for Large Language Model Optimization (LLMO) that aligns with modern search paradigms.
This research employs a two-phase multi-criteria decision-making (MCDM) approach combining CRITIC (Criteria Importance Through Intercriteria Correlation) to determine factor weights with DEMATEL (Decision-Making Trial and Evaluation Laboratory) to map causal relationships. A panel of 15 experts across three countries (India, UAE and USA) rated the influence of five identified factors.
The study identifies five critical factors for LLMO: Retrieval Augmentation, Readability Enhancement, Content Quality Assurance, Filtering of Unsafe Content and User-Centric Content Design. Retrieval Augmentation and User-Centric Design emerged as key causal factors, while Readability and Content Quality acted as bridges or effects. Although factor weights were relatively balanced, the DEMATEL analysis revealed interdependencies highlighting the dynamic nature of LLMO.
The results provide actionable guidance to digital marketing experts and agencies, content strategists, marketing heads and developers to structure web content that is both human-readable and LLM-compatible. The study offers insights to organizations on how they can enhance their digital visibility and authority in AI-powered search ecosystems.
This study fills a critical gap by offering the first integrated CRITIC-DEMATEL framework for LLMO. It distinguishes LLMO from traditional SEO and offers a novel causal model to support the development of holistic, future-ready content strategies.
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
🔥 引用:
0
Abstract: This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size, that is, the number of distinct meanings expressed in the sampled responses, provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
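The coverage signal that gates SHADE's fusion rule has a standard Good-Turing closed form, C = 1 - N1/N, where N1 is the number of meanings observed exactly once and N the number of samples. A minimal sketch of that component alone (the entailment graph and heat-kernel trace are paper-specific and omitted; meaning labels are assumed to come from an upstream semantic clusterer):

```python
from collections import Counter

def good_turing_coverage(meaning_ids) -> float:
    """Good-Turing coverage estimate C = 1 - (singletons / samples).

    A low C suggests substantial unseen semantic mass: many meanings were
    observed only once, so more distinct meanings likely remain unsampled.
    """
    counts = Counter(meaning_ids)
    n = len(meaning_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n

# 6 sampled responses clustered into 3 meanings; one meaning seen only once.
samples = ["A", "A", "A", "B", "B", "C"]
cov = good_turing_coverage(samples)
print(round(cov, 3))  # 0.833
```

In SHADE's terms, a high value like this would select the convex-combination fusion, while a singleton-heavy sample would push coverage down and trigger the LogSumExp branch.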
Guardians of the Lifeworld
🔥 引用:
0
Abstract: This paper draws inspiration from phenomenology and Marvel Entertainment’s Guardians of the Galaxy (GotG) to elevate and guard the practices of scholarship, as one response to the encroachment and infiltration of large language models (LLMs) and generative artificial intelligence (GenAI) into academic knowledge work, thinking in particular. LLMs and GenAI may offer to accelerate laborious knowledge work, and gild with kudos those who thus leverage it, for example in compiling entire literature reviews on demand. We contend that automating aspects of scholarship impoverishes the necessarily relational, embodied and ethical activity of scholars. In this AI-infiltrated era, we present a novel (and fun!) phenomenological stimulant for delegates to the networked learning conference. Rather than just the usual canonical philosophical writers, our inspiration draws upon some of the main characters of GotG. We argue that GotG may be encountered, attending to their narrative arcs: their dialogue, gesture, sacrifice and relation. These facets, as they play out in moments of grief and generosity, etc., can help us to reflect on our own attempts at scholarly practice through current GenAI-troubled tensions. These lived tensions, as they may texture networked collaborative academic work, are briefly elaborated through four fragments: the disappeared voice, the stillness that registers as absence, the question that got automated and the speed that erases process. These fragments share a flavour of what the Guardians may address. Brief depictions and insights from the Guardians we include are informed by repeat viewing of the GotG movie trilogy and fan wikis. We begin with Star-Lord, a figure of fragmentation and longing, of disconnection and yet drawing repeatedly back from the brink through music and the call of what matters. We introduce Gamora, as guarding the possibility of ethics amid torrid interminable violence.
For us, Rocket and Nebula guard the possibility of healing without erasure. Groot and Mantis guard the quiet forms of knowing. Drax guards the fragility of interpretation. We invite delegates to join us in drawing playful resilience and hope from our GotG interpretive companions, and to celebrate the freedom of deliberate, unplugged inception into scholarly discourse, to guard the conditions of thought and thinking itself.
Multi‐modal large language models for paediatric tele‐ophthalmology: A blinded real‐world evaluation of diagnostic accuracy and safety
DOI:
10.1111/aos.70136
🔥 Citations:
0
Abstract:
The rapid integration of large language models (LLMs) into online medical consultations demands rigorous evaluation, particularly in specialized fields like paediatric ophthalmology. This study systematically assessed the diagnostic accuracy, safety and communication quality of advanced LLMs in real‐world paediatric ophthalmology tele‐consultations, with a focus on comparing text‐only inputs to multi‐modal inputs (text paired with parent‐taken mobile photographs).
A prospective, blinded, multi‐model study was conducted using 50 authentic paediatric ophthalmology cases from an internet‐based hospital platform. Four leading LLMs—Grok 4.1, GPT 5.1 Thinking, Gemini 2.5 Pro (advanced models) and GPT‐4o (baseline)—were tested in multi‐modal and text‐only scenarios. Responses were evaluated by three blinded senior paediatric ophthalmologists using a composite scoring system (maximum 19 points), encompassing diagnostic accuracy (up to six points), guideline adherence and safety (up to four points) and parent communication quality (up to nine points).
Advanced LLMs consistently outperformed GPT‐4o, driven by superior Safety and Communication efficacy. Grok 4.1 led with the highest overall score in the multi‐modal arm (15.06 ± 0.68), followed by GPT 5.1 Thinking (14.52 ± 1.19) and Gemini 2.5 Pro (14.30 ± 1.08), while GPT‐4o lagged significantly at 11.20 ± 0.72. In the text‐only arm, advanced models again excelled: Grok 4.1 (14.08 ± 1.18), GPT 5.1 Thinking (13.62 ± 1.77) and Gemini 2.5 Pro (13.42 ± 1.50). Multi‐modal inputs significantly boosted scores (e.g., Grok 4.1: Δ = 0.98, p < 0.001), with gains specifically attributable to improvements in Safety and Communication rather than diagnostics. Regarding safety, advanced models exhibited markedly fewer major harmful responses (Grok 4.1: 4%; GPT 5.1 Thinking: 8%; Gemini 2.5 Pro: 10%) compared to GPT‐4o (16%). Parent preferences favoured advanced LLMs, with Grok 4.1 receiving 23% of votes, reflecting higher perceived clarity, empathy and trustworthiness.
Advanced LLMs demonstrate markedly superior capabilities over GPT‐4o in handling paediatric ophthalmology tele‐consultations, especially with multi‐modal data, offering enhanced safety, communication and parent trust. However, persistent variations in safety across models and residual risks of harmful advice underscore the need for condition‐specific validation and stringent safety guardrails prior to deployment. Multi‐modal integration is essential for optimizing LLM reliability in this high‐stakes domain.
Improving LLM-Driven Test Generation by Learning from Mocking Information
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have recently shown strong potential for automated unit test generation. This has motivated us to investigate whether developer-defined test doubles (commonly referred to as mocks) available in existing test suites can be leveraged to improve LLM-driven test generation. To this end, we propose MOCKMILL, an LLM-based technique and tool that generates test cases by exploiting mocking information automatically extracted from developer-written tests. MOCKMILL targets components that are replaced by test doubles in existing tests and uses the encoded stubbings and interaction expectations to guide test generation, combined with an iterative generation-and-repair process to ensure executable tests. We evaluated MOCKMILL on 10 open-source classes from six Java projects using four LLMs, and compared the generated tests with existing project tests and tests produced by baseline approaches. The results show that MOCKMILL's tests cover lines of code and kill mutants that existing tests and baseline-generated tests miss. Overall, our findings provide preliminary evidence that leveraging mocking information is a complementary and effective way to enhance LLM-based test generation.
Enabling Next-Generation Mass Spectrometry-Based Proteomics: Standards, Proteoform Resolution, and FAIR, Reproducible, and Quantitative Analysis
🔥 Citations:
0
Abstract: Recent advances in mass spectrometry, data-independent acquisition, proteoform-resolving workflows, and multi-omics integration have significantly expanded the scale and scope of proteomics. However, the reuse and translational application of these datasets are limited by inconsistent standards, insufficient metadata, and inadequate computational interoperability. Proteoform-centric approaches provide higher molecular resolution by capturing intact protein variants and patterns of post-translational modification. Computational methods, including selected applications of machine learning and large language models (LLMs), are increasingly used for tasks such as spectral prediction and pattern discovery in clinical proteomics datasets. Despite these advancements, FAIR (Findable, Accessible, Interoperable, and Reusable) data practices, proteoform biology, and AI analytics are often pursued independently. This work presents an integrated framework for next-generation proteomics in which standardization and FAIR principles establish machine-actionable foundations for proteoform-resolved analysis and computational inference. It examines community efforts to promote data sharing and interoperability, as well as strategies for characterizing proteoforms using bottom-up, middle-down, and top-down approaches. It also highlights emerging AI and ML applications within the proteomics workflow. The framework emphasizes the importance of treating proteoforms as primary computational entities and adopting FAIR practices during data collection to enable reproducible and interpretable modeling. Finally, it introduces an architectural model that integrates FAIR infrastructures and proteoform resolution. In addition, practical recommendations for making AI-ready proteomics, including a minimal community checklist to support reproducibility, benchmarking, and translational scalability, are provided.
Fine-Tuning Small Reasoning Models for Quantum Field Theory
🔥 Citations:
0
Abstract: Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and ~200M tokens of QFT reasoning traces.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
🔥 Citations:
0
Abstract: One of the most challenging problems in graph machine learning is generalizing across graphs with diverse properties. Graph neural networks (GNNs) face a fundamental limitation: they require separate training for each new graph, preventing universal generalization across diverse graph datasets. A critical challenge facing GNNs lies in their reliance on labeled training data for each individual graph, a requirement that hinders the capacity for universal node classification due to the heterogeneity inherent in graphs -- differences in homophily levels, community structures, and feature distributions across datasets. Inspired by the success of large language models (LLMs) that achieve in-context learning through massive-scale pre-training on diverse datasets, we introduce NodePFN. This universal node classification method generalizes to arbitrary graphs without graph-specific training. NodePFN learns posterior predictive distributions (PPDs) by training only on thousands of synthetic graphs generated from carefully designed priors. Our synthetic graph generation covers real-world graphs through the use of random networks with controllable homophily levels and structural causal models for complex feature-label relationships. We develop a dual-branch architecture combining context-query attention mechanisms with local message passing to enable graph-aware in-context learning. Extensive evaluation on 23 benchmarks demonstrates that a single pre-trained NodePFN achieves 71.27 average accuracy. These results validate that universal graph learning patterns can be effectively learned from synthetic priors, establishing a new paradigm for generalization in node classification.
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
🔥 Citations:
0
Abstract: Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final-answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
Unified Multi-Foundation-Model Slide Representation for Pan-Cancer Recognition and Text-Guided Tumor Localization
🔥 Citations:
0
Abstract: The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.
IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
🔥 Citations:
0
Abstract: We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
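The bootstrap significance testing mentioned above (10,000 resamples) can be sketched as a paired bootstrap over per-item correctness vectors. The toy accuracy figures below are hypothetical and not results from the benchmark; the function name and one-sided p-value formulation are illustrative assumptions.

```python
import random

def bootstrap_accuracy_diff(correct_a, correct_b, n_boot=10_000, seed=0):
    """Paired bootstrap over benchmark items: resample items with
    replacement and count how often model A's accuracy advantage over
    model B disappears (a one-sided p-value)."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] - correct_b[i] for i in idx) <= 0:
            not_better += 1
    return observed, not_better / n_boot

# toy per-item correctness: model A scores 85/100, model B 70/100
a = [1] * 85 + [0] * 15
b = [1] * 70 + [0] * 30
diff, p = bootstrap_accuracy_diff(a, b)   # diff = 0.15
```

Pairing the resamples item-by-item (rather than resampling each model independently) is what makes the comparison sensitive to where the two models disagree, which is presumably how distinct performance tiers are separated.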
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
🔥 Citations:
0
Abstract: Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.
Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
🔥 Citations:
0
Abstract: The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
Cell-Based Representation of Relational Binding in Language Models
🔥 Citations:
0
Abstract: Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each "cell" corresponds to an entity-relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
🔥 Citations:
0
Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan-Zh/DASH-KV
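The core idea of reformulating attention as approximate nearest-neighbor search over hashed keys can be sketched as below, using sign random projections and a Hamming-distance candidate filter. This is a generic LSH-style illustration under assumed shapes, not DASH-KV's learned asymmetric encoder or its dynamic mixed-precision mechanism.

```python
import numpy as np

def srp_hash(x, planes):
    # sign random projection: one bit per hyperplane
    return (x @ planes.T > 0).astype(np.uint8)

def hash_topk_attention(q, K, V, planes, k=8):
    """Approximate attention: keep only the k keys whose binary codes
    are closest to the query's in Hamming distance, then run exact
    softmax attention over that small candidate set."""
    q_code = srp_hash(q[None, :], planes)[0]
    k_codes = srp_hash(K, planes)
    hamming = (k_codes != q_code).sum(axis=1)
    cand = np.argsort(hamming)[:k]                  # candidate tokens
    scores = K[cand] @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[cand]

rng = np.random.default_rng(0)
dim, n_tokens, n_bits = 64, 1024, 128
planes = rng.standard_normal((n_bits, dim))
K = rng.standard_normal((n_tokens, dim))
V = rng.standard_normal((n_tokens, dim))
q = K[37] + 0.01 * rng.standard_normal(dim)         # query close to key 37
out = hash_topk_attention(q, K, V, planes)
```

Because only k candidates are scored per query, the per-query cost is dominated by the binary code comparison rather than dense dot products, which is the intuition behind reducing the attention cost toward linear in context length.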
Benchmarking single-cell foundation models for real-world RNA-seq data integration
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Do Emotions Influence Moral Judgment in Large Language Models?
🔥 Citations:
0
Abstract: Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.
Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
🔥 Citations:
0
Abstract: In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.
AI-Powered Palliative Care Management System: Integrating Large Language Models for Clinical Translation and Automated Condition Assessment
DOI:
10.55041/ijsrem60768
🔥 Citations:
0
Abstract: Delivering effective palliative care in distributed home settings presents significant challenges, particularly regarding accurate clinical documentation and patient monitoring by field workers (e.g., Asha workers) who may lack formal medical training or proficiency in professional medical English. We present the design and implementation of an AI-powered Palliative Care Assistance System built on the Django web framework. The platform seamlessly integrates Large Language Models (a local MedGemma 4B instance and Google Gemini Vision) to automate clinical translation, converting layman terminology and regional languages (Malayalam) into standardized medical English. Furthermore, the system employs vision-capable LLMs to generate structured summaries of medical images (X-Rays, Lab Reports, Prescriptions) and evaluates patient visit histories to dynamically assess condition severity (Stable, Moderate, Severe). Supported by a robust backend managing bilingual visit logs, material allocations, and role-based portals, the system significantly reduces documentation overhead and enhances clinical decision-making for doctors and head nurses.
Index Terms—palliative care, large language models, generative AI, medical translation, clinical assessment, Django, health informatics, bilingual systems
From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
🔥 Citations:
0
Abstract: Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. We propose a unified benchmarking framework to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R-UET/CFExpRec.
The Logical Expressiveness of Topological Neural Networks
🔥 Citations:
0
Abstract: Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: what is the logical expressiveness of TNNs? Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called the $k$-CCWL test. In addition, we introduce the topological counting logic (TC$_k$), an extension of standard counting logic featuring a novel pairwise counting quantifier $ \exists^{N}(x_i,x_j)\, \varphi(x_i,x_j), $ which explicitly quantifies pairs $(x_i, x_j)$ satisfying property $\varphi$. We rigorously prove the exact equivalence: $ \text{k-CCWL} \equiv \text{TC}_{k{+}2} \equiv \text{Topological }(k{+}2)\text{-pebble game}.$ These results establish a logical expressiveness theory for TNNs.
CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
🔥 Citations:
0
Abstract: Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model's performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.
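The Maximum Mean Discrepancy (MMD) objective used above for kernelized distribution matching can be sketched as below on toy "hidden-state" vectors. This is a generic biased MMD estimator with an RBF kernel and an assumed bandwidth heuristic, not CoDA's adapter or its CoT-enriched distillation term.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=None):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel, comparing two sets of feature vectors."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]        # simple 1/dim bandwidth heuristic
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.standard_normal((200, 16))             # "source-domain" states
tgt_near = rng.standard_normal((200, 16))        # same distribution
tgt_far = rng.standard_normal((200, 16)) + 2.0   # shifted distribution
mmd_near = mmd_rbf(src, tgt_near)                # close to zero
mmd_far = mmd_rbf(src, tgt_far)                  # clearly positive
```

Minimizing such a term between source and target hidden states pushes the two latent distributions together, which is the alignment effect the abstract attributes to the MMD component.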
CreativeGame: Toward Mechanic-Aware Creative Game Generation
🔥 Citations:
0
Abstract: Large language models can generate plausible game code, but turning this capability into iterative creative improvement remains difficult. In practice, single-shot generation often produces brittle runtime behavior, weak accumulation of experience across versions, and creativity scores that are too subjective to serve as reliable optimization signals. A further limitation is that mechanics are frequently treated only as post-hoc descriptions, rather than as explicit objects that can be planned, tracked, preserved, and evaluated during generation. This report presents CreativeGame, a multi-agent system for iterative HTML5 game generation that addresses these issues through four coupled ideas: a proxy reward centered on programmatic signals rather than pure LLM judgment; lineage-scoped memory for cross-version experience accumulation; runtime validation integrated into both repair and reward; and a mechanic-guided planning loop in which retrieved mechanic knowledge is converted into an explicit mechanic plan before code generation begins. The goal is not merely to produce a playable artifact in one step, but to support interpretable version-to-version evolution. The current system contains 71 stored lineages, 88 saved nodes, and a 774-entry global mechanic archive, implemented in 6,181 lines of Python together with inspection and visualization tooling. The system is therefore substantial enough to support architectural analysis, reward inspection, and real lineage-level case studies rather than only prompt-level demos. A real 4-generation lineage shows that mechanic-level innovation can emerge in later versions and can be inspected directly through version-to-version records. The central contribution is therefore not only game generation, but a concrete pipeline for observing progressive evolution through explicit mechanic change.
sdAbs-LLM: Generative Large Language Models For de novo Antibody Design and Agentic Evaluation
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images
🔥 Citations:
0
Abstract: Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator.
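For reference, the Jeffries planimetric procedure in ASTM E112 counts grains fully inside the test area at full weight and boundary-intersected grains at half weight, then converts grains per square millimetre (at 1x) into the grain size number G. A minimal sketch with invented counts; the helper names are ours, the constants are the standard's published relation.

```python
import math

def jeffries_na(n_inside, n_boundary, area_mm2):
    # Jeffries planimetric count: interior grains count once,
    # grains cut by the test-area boundary count one half.
    return (n_inside + 0.5 * n_boundary) / area_mm2

def astm_grain_size_number(n_a):
    # ASTM E112 relation between grains per mm^2 (at 1x) and G:
    # G = 3.321928 * log10(N_A) - 2.954.
    return 3.321928 * math.log10(n_a) - 2.954

n_a = jeffries_na(n_inside=48, n_boundary=20, area_mm2=0.5)  # 116 grains/mm^2
print(round(astm_grain_size_number(n_a), 1))  # 3.9
```

A handy sanity check built into the relation: doubling the grain density raises G by exactly one unit.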
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
🔥 Citations:
0
Abstract: Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model's original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.
Discerning Authorship in Online Health Communities: Experience, Trust, and Transparency Implications for Moderating AI
🔥 Citations:
0
Abstract: For online health communities, community trust is paramount. Yet, advances in Large Language Models (LLMs) generating advice may erode this trust, especially if users cannot identify whether LLMs have been used. We investigate the feasibility of community-based detection of health advice authorship and how self-moderation of LLMs could help enhance advice utilization. In an online experiment, we evaluate people's ability to distinguish AI-generated from human-written advice across two health conditions, considering lived experience with a condition, AI-recognition training, and user attitudes towards transparency and trust around AI use. Our results indicate the need for transparency coupled with trust. We find little evidence of people's ability to discern advice authorship. However, we find a consistent effect of the health condition. Our qualitative findings identify unreliable signals, resulting in flawed heuristic evaluations of the advice. Our findings point to opportunities to improve the self-moderation of LLM-based AI and aid community-based AI moderation.
Streamliners for Answer Set Programming
🔥 Citations:
0
Abstract: Streamliner constraints reduce the search space of combinatorial problems by ruling out portions of the solution space. We adapt the StreamLLM approach, which uses Large Language Models (LLMs) to generate streamliners for Constraint Programming, to Answer Set Programming (ASP). Given an ASP encoding and a few small training instances, we prompt multiple LLMs to propose candidate constraints. Candidates that cause syntax errors, render satisfiable instances unsatisfiable, or degrade performance on all training instances are discarded. The surviving streamliners are evaluated together with the original encoding, and we report results for a virtual best encoding (VBE) that, for each instance, selects the fastest among the original encoding and its streamlined variants. On three ASP Competition benchmarks (Partner Units Problem, Sokoban, Towers of Hanoi), the VBE achieves speedups of up to 4--5x over the original encoding. Different LLMs produce semantically diverse constraints, not mere syntactic variations, indicating that the approach captures genuine problem structure.
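The virtual best encoding (VBE) reported above is simply a per-instance minimum over candidate runtimes; a toy sketch with invented timings (the dictionary layout is our illustration):

```python
def virtual_best(runtimes):
    # runtimes: {encoding_name: {instance: seconds}}.
    # For each instance the VBE takes the fastest encoding.
    instances = next(iter(runtimes.values())).keys()
    return {i: min(enc[i] for enc in runtimes.values()) for i in instances}

def speedup(base, vbe):
    # Total-time speedup of the VBE over the original encoding.
    return sum(base.values()) / sum(vbe.values())

runtimes = {
    "original":    {"i1": 10.0, "i2": 40.0, "i3": 5.0},
    "streamline1": {"i1": 2.0,  "i2": 60.0, "i3": 5.0},
    "streamline2": {"i1": 12.0, "i2": 8.0,  "i3": 9.0},
}
vbe = virtual_best(runtimes)
print(round(speedup(runtimes["original"], vbe), 2))  # 55 / 15 -> 3.67
```

Note that a streamliner can lose on some instances (streamline1 on i2 here) yet still contribute to the VBE, which is why per-instance selection matters.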
Semantic Prompting: Agentic Incremental Narrative Refinement through Spatial Semantic Interaction
🔥 Citations:
0
Abstract: Interactive spatial layouts empower users to synthesize information and organize findings for sensemaking. While Large Language Models (LLMs) can automate narrative generation from spatial layouts, current collage-based and re-generation methods struggle to support the incremental spatial refinements inherent to the sensemaking process. We identify three critical gaps in existing spatial-textual generation: interaction-revision misalignment, human-LLM intent misalignment, and lack of granular customization. To address these, we introduce Semantic Prompting, a framework for spatial refinement that perceives semantic interactions, reasons about refinement intent, and performs targeted positional revisions. We implemented S-PRISM to realize this framework. The empirical evaluation demonstrated that S-PRISM effectively enhanced the precision of interaction-revision refinement. A user study ($N=14$) highlighted how participants leveraged S-PRISM for incremental formalization through interactive steering. Results showed that users valued its efficient, adaptable, and trustworthy support, which effectively strengthens human-LLM intent alignment.
Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images
🔥 Citations:
0
Abstract: We present an approach for enhancing non-playable characters (NPCs) in games by combining large language models (LLMs) with computer vision to provide contextual awareness of their surroundings. Conventional NPCs typically rely on pre-scripted dialogue and lack spatial understanding, which limits their responsiveness to player actions and reduces overall immersion. Our method addresses these limitations by capturing panoramic images of an NPC's environment and applying semantic segmentation to identify objects and their spatial positions. The extracted information is used to generate a structured JSON representation of the environment, combining object locations derived from segmentation with additional scene graph data within the NPC's bounding sphere, encoded as directional vectors. This representation is provided as input to the LLM, enabling NPCs to incorporate spatial knowledge into player interactions. As a result, NPCs can dynamically reference nearby objects, landmarks, and environmental features, leading to more believable and engaging gameplay. We describe the technical implementation of the system and evaluate it in two stages. First, an expert interview was conducted to gather feedback and identify areas for improvement. After integrating these refinements, a user study was performed, showing that participants preferred the context-aware NPCs over a non-context-aware baseline, confirming the effectiveness of the proposed approach.
Revisiting Framing Codebooks with AI: Employing Large Language Models as Analytical Collaborators in Deductive Content Analysis
🔥 Citations:
0
Abstract: Codebooks are central to framing research, providing theoretically grounded criteria for analyzing news content. While traditionally codebooks are built from theoretical frameworks and researchers' knowledge, applying these codebooks to large news corpora often exposes ambiguities, borderline cases, and underspecified rules that are difficult to resolve through theory alone. Moreover, news corpora evolve over time and differ across cultures, necessitating that researchers revisit the theoretical frameworks underlying these codebooks. In this article, we propose a workflow that uses Large Language Models (LLMs) to augment the creation and refinement of framing codebooks by combining theoretical frameworks with data-driven exploration. Rather than treating LLMs as automated classifiers, this approach positions them as analytic collaborators that help externalize decision rules, surface latent dimensions, and support iterative revisions of codebooks through dialogues between researchers and their data. We illustrate this workflow using a dataset of Latin American news coverage, demonstrating how the application of LLMs' capabilities has led to the surfacing of latent patterns, the generation of frame distinctions, and the adaptation of frameworks to new contexts. This method provides an LLM-assisted strategy that supports methodological creativity while preserving researchers' interpretative authority.
Pocket Specter: AI-Powered Legal Assistance System using Retrieval-Augmented Generation (RAG)
DOI:
10.55041/ijsrem60741
🔥 Citations:
0
Abstract: India faces an acute problem of legal accessibility, with about 2 lawyers per 1,000 people. Complicated legal jargon, expensive consultation fees, and lack of awareness hinder many people from seeking justice. Pocket Specter is a domain-specialized AI SaaS platform that aims to fill this void using a Retrieval-Augmented Generation (RAG) mechanism. The platform includes an AI-based legal chatbot and intelligent document analysis covering consumer, labour, and family law. Pocket Specter combines BGE-M3 embeddings and PostgreSQL's pgvector extension with a large language model to ground responses in legal documents, minimizing hallucinations. In addition, the document analysis tool extracts and highlights important information about responsibilities and potential risks in uploaded .pdf/.docx documents. According to experimental results, Pocket Specter achieves more than 75% relevance in legal responses compared with plain LLM baselines.
Index Terms—Retrieval-Augmented Generation, Legal AI, Natural Language Processing, pgvector, Document Analysis, Large Language Models, SaaS, India.
Depression Risk Assessment in Social Media via Large Language Models
🔥 Citations:
0
Abstract: Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.
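Micro- and macro-averaged F1, the metrics reported above, differ in whether per-label counts are pooled before or after computing F1. A minimal multi-label sketch with toy label sets (not the paper's data):

```python
def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(y_true, y_pred, n_labels):
    # y_true / y_pred: one set of label indices per post.
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per-label [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        for k in range(n_labels):
            if k in p and k in t:
                counts[k][0] += 1
            elif k in p:
                counts[k][1] += 1
            elif k in t:
                counts[k][2] += 1
    micro = f1(*(sum(c[j] for c in counts) for j in range(3)))  # pool, then F1
    macro = sum(f1(*c) for c in counts) / n_labels              # F1, then mean
    return micro, macro

micro, macro = micro_macro_f1(
    y_true=[{0, 1}, {1}, {2}], y_pred=[{0}, {1, 2}, {2}], n_labels=3)
print(micro)           # 0.75
print(round(macro, 3)) # 0.778
```

Macro-F1 weights rare emotion labels equally with common ones, which is why the two numbers above diverge when per-label performance is uneven.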
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
🔥 Citations:
0
Abstract: Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
🔥 Citations:
0
Abstract: Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AudioRatio, AudioConsistency, AudioEntropy, and TextEntropy, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.
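Although the abstract does not give exact definitions, attention-mass and attention-entropy features of the kind it names can be sketched from a single attention row; the definitions below are our assumptions for illustration, not the paper's metrics.

```python
import numpy as np

def audio_ratio(attn_row, audio_positions):
    # Fraction of attention mass a generated token places on audio positions;
    # low mass on the audio may indicate the model is "making things up".
    return float(attn_row[audio_positions].sum() / attn_row.sum())

def attention_entropy(attn_row, eps=1e-12):
    # Shannon entropy of the normalized attention distribution; diffuse,
    # high-entropy attention is a candidate hallucination signal.
    p = attn_row / attn_row.sum()
    return float(-(p * np.log(p + eps)).sum())

attn = np.array([0.05, 0.05, 0.6, 0.2, 0.1])  # toy attention over 5 positions
audio = np.array([2, 3, 4])                   # positions holding audio tokens
print(round(audio_ratio(attn, audio), 3))     # 0.9
uniform = np.full(5, 0.2)
assert attention_entropy(uniform) > attention_entropy(attn)
```

Per-head features like these, stacked into a vector, are the kind of input a lightweight logistic regression classifier can consume at inference time.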
Value systems of artificial intelligence and university students: theoretical dominance in large language models and religious priority in humans
🔥 Citations:
0
Abstract: The rapid advancement of artificial intelligence (AI), particularly large language models (LLMs), raises critical questions about the value system these systems appear to reflect in comparison with human values. This study aimed to examine Spranger’s six value types (religious, social, theoretical, economic, political, and aesthetic) as manifested in three LLMs (OpenAI-o1, Gemini-2.0, and DeepSeek-V3), and to compare them with the value system of a sample of students at King Khalid University. A descriptive–comparative design was employed, administering the Study of Values to both groups: 214 students (male and female across academic levels) and the three LLMs, with repeated administrations to the latter to ensure test–retest reliability. Results indicated statistically significant differences in both the prominence and ranking of values across groups. Theoretical values consistently dominated in the LLMs, followed by social, aesthetic, and political values, with religious values ranking lowest. In contrast, students prioritized religious values, followed by theoretical values, while aesthetic values occupied the lowest ranks. Further, significant effects of gender and academic level were observed among students: religious values were more salient among females, theoretical values among males, and aesthetic values among undergraduates. These findings suggest that LLMs project a value system shaped by their training data, rather than by human cultural or moral frameworks. The study highlights the importance of integrating culturally diverse value dimensions into AI development and calls for raising students’ awareness of using AI tools in ways aligned with human values. Effect-size estimates further indicated very large human–AI discrepancies, particularly in the religious (d = 2.21) and theoretical domains (d = 1.22).
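The effect sizes reported above (e.g. d = 2.21) are Cohen's d, the mean difference scaled by the pooled standard deviation; a minimal sketch with made-up score lists (the data and variable names are ours):

```python
import math

def cohens_d(group_a, group_b):
    # Cohen's d: difference of means divided by the pooled standard deviation.
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

students = [42, 45, 40, 44, 43]  # hypothetical religious-values scores
llms = [30, 31, 29]
print(round(cohens_d(students, llms), 2))
```

By the usual rule of thumb, |d| around 0.8 already counts as a large effect, so the reported d = 2.21 marks an unusually wide human-AI gap.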
Headlines You Won't Forget: Can Pronoun Insertion Increase Memorability?
🔥 Citations:
0
Abstract: For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted and their immediate contexts. Additional data and fine-grained analysis is needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.
Post-deployment monitoring of foundation models in radiology: Why outputs aren’t enough
🔥 Citations:
0
Abstract: No abstract available; see the original article.
A Context-Aware Feedback Loop for AI-Assisted Verification IP Synthesis: Bridging the Gap from Natural Language to Regression-Ready
🔥 Citations:
0
Abstract: The widening gap between System-on-Chip (SoC) design complexity and verification productivity has rendered traditional script-based automation insufficient. While Large Language Models (LLMs) offer promise for code synthesis, they typically fail in hardware verification contexts due to a lack of architectural consistency and an inability to reason about temporal signal semantics. This paper proposes a Context-Aware Verification Loop (CAVL) methodology, an iterative framework that integrates semantic project indexing with simulation-based feedback to achieve verification closure. Unlike static generation, CAVL employs a dynamic refinement cycle where compiler diagnostics, simulation logs, and functional coverage metrics serve as feedback signals to guide the AI agent. We validate this framework on a dual-mode I2C Universal Verification Methodology (UVM) environment as a representative case study. The experimental results indicate, within this single-protocol context, the framework’s capacity to (1) resolve complex signal-level contention issues through logic refactoring, (2) achieve complete functional coverage via directed test synthesis, and (3) maintain cross-file architectural consistency with reduced human intervention. This work presents an initial quantitative baseline for AI-driven Electronic Design Automation (EDA), suggesting that context-aware feedback loops offer a pathway toward restructuring the verification engineer’s role from implementation to architectural intent specification.
Defining Robust Ultrasound Quality Metrics via an Ultrasound Foundation Model
🔥 Citations:
0
Abstract: Clinicians lack a principled framework to quantify diagnostic utility in ultrasound reconstructions. Existing standards like PSNR and VGG-LPIPS are inadequate, failing to account for modality-specific physics or the structural nuances of acoustic imaging. We close this gap with a TinyUSFM-based evaluation framework featuring two distinct metrics: TinyUSFM-uLPIPS, a full-reference perceptual distance based on multi-layer token relations, and TinyUSFM-NRQ, a deployable no-reference quality score utilizing clean-manifold modeling and worst-region aggregation to detect localized harmful artifacts. We demonstrate that the presented metrics have four unique advantages: 1) Task-linked quality, where TinyUSFM-uLPIPS achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail; 2) Cross-organ comparability, maintaining stable scoring scales and consistent severity rankings across diverse anatomical sites and domain-shifted data; 3) PSNR-consistent sensitivity, with TinyUSFM-NRQ providing a reliable quality score without ground-truth images that remains consistent with traditional fidelity benchmarks (i.e. PSNR); and 4) Clinical utility, improving the prediction of expert preference from 47.2% to 72.8% accuracy and producing super-resolution reconstructions preferred by sonographers. By integrating these advantages into a unified assessment and optimization loop, this work establishes a modality-aligned standard that finally bridges the gap between algorithmic performance and diagnostic utility. https://github.com/sextant-fable/US-Metrics
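Worst-region aggregation, as described above, lets the lowest-scoring local patches dominate the image-level score so that a small harmful artifact is not averaged away. A purely illustrative sketch under that idea, not TinyUSFM-NRQ itself:

```python
import numpy as np

def worst_region_score(patch_scores, k=1):
    # Mean of the k lowest patch scores: a single bad region drags the
    # image-level score down instead of being averaged away.
    flat = np.sort(np.asarray(patch_scores, dtype=float).ravel())
    return float(flat[:k].mean())

clean = np.full((4, 4), 0.9)   # toy per-patch quality scores, clean image
marred = clean.copy()
marred[1, 2] = 0.1             # one localized harmful artifact
print(round(float(marred.mean()), 2))  # global average barely moves: 0.85
print(worst_region_score(marred))      # worst-region score exposes it: 0.1
```

The contrast between the two printed values is the point: a global mean hides a clinically relevant local defect that min-style aggregation surfaces.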
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
🔥 Citations:
0
Abstract: Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting LLM intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLM's IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free "plug-in" mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
🔥 Citations:
0
Abstract: Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
🔥 Citations:
0
Abstract: Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.
Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram
🔥 Citations:
0
Abstract: This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
🔥 Citations:
0
Abstract: Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.
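One simple way to quantify the semantic drift described above is the mean cosine distance between consecutive solution embeddings along a trajectory; the metric and the toy embeddings below are our illustration, not the paper's measure.

```python
import numpy as np

def semantic_drift(embeddings):
    # Mean cosine distance between consecutive solutions in a trajectory;
    # high drift means the search jumps around the semantic space.
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    cos = (e[:-1] * e[1:]).sum(axis=1)
    return float((1.0 - cos).mean())

local = [[1, 0], [0.9, 0.1], [0.8, 0.2]]  # incremental refinement
jumpy = [[1, 0], [0, 1], [-1, 0]]         # large semantic jumps
assert semantic_drift(local) < semantic_drift(jumpy)
print(round(semantic_drift(jumpy), 1))  # 1.0
```

Under this kind of measure, the "local refiner" behavior the study attributes to strong optimizers corresponds to a low, steadily shrinking drift.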
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench
DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging
🔥 Citations:
0
Abstract: Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.
A Comparative Assessment of Large Language Models in Congenital Hypothyroidism: Reliability, Quality and Readability
🔥 Citations: 0
Abstract: No abstract available; please see the original article.
Robust AI generated text detection through multi-grained latent feature denoising and contrastive representation learning
🔥 Citations: 0
Abstract: As large language models (LLMs) evolve rapidly, distinguishing AI-generated text (AIGT) from human-written text (HWT) is becoming increasingly challenging. Recently, some AIGT detectors have been developed to overcome this challenge and have achieved decent accuracy. However, their brittle text representations make them highly susceptible to text perturbations, such that even minor character-level perturbations can reverse their predictions. In this work, we propose a multi-grained latent feature denoising and contrastive representation learning architecture to enhance text representations in terms of granularity, robustness, and distinguishability of features, thereby achieving robust AIGT detection. Specifically, we first extract both document-level and fine-grained segment-level features using a dual network, which captures the global and subtle local differences between AIGT and HWT. To encourage feature stability under perturbations, we inject random noise into both latent features and employ a denoising network to reconstruct the original representations. While this does not precisely simulate discrete character-level perturbations, it acts as a feature-level regularizer that suppresses non-essential variations and promotes smoother, more stable representations. Considering the similarities between AIGT and HWT, we further design a contrastive augmentation mechanism to increase the distinguishability between them. Extensive experiments demonstrate that our method not only outperforms baseline models in terms of classification accuracy but also exhibits superior robustness against various text perturbations.
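The noise-injection-and-reconstruct objective described in the abstract above can be sketched at the feature level. This is a minimal illustration, not the authors' implementation: `denoise_fn` stands in for their denoising network, and all names and values are hypothetical.

```python
import random

def denoising_regularizer(features, denoise_fn, sigma=0.1, seed=0):
    """Feature-level denoising loss: perturb a latent feature vector with
    Gaussian noise, reconstruct it with a denoising network, and penalize
    the mean squared error against the clean feature. Minimizing this term
    pushes the model toward representations that are stable under small
    perturbations (a smoothness regularizer, not a simulation of discrete
    character-level attacks)."""
    rng = random.Random(seed)
    noisy = [f + rng.gauss(0.0, sigma) for f in features]
    recon = denoise_fn(noisy)
    return sum((r - f) ** 2 for r, f in zip(recon, features)) / len(features)

# With a perfect (oracle) denoiser the loss vanishes; with an identity
# "denoiser" the loss equals the mean squared injected noise.
clean = [0.2, -1.3, 0.7]
loss_identity = denoising_regularizer(clean, lambda v: v, sigma=0.5)
loss_oracle = denoising_regularizer(clean, lambda v: clean, sigma=0.5)
```

In a real detector the regularizer would be added to the classification loss and minimized jointly, so the encoder and denoiser are trained together.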
Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
🔥 Citations: 0
Abstract: This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p<.001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
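The two consistency notions contrasted in the abstract above, verbatim output reproducibility versus semantic similarity across repeated generations, can be sketched with a toy bag-of-words stand-in for the study's semantic-similarity metric. The example strings and all function names are illustrative, not taken from the study.

```python
from collections import Counter
from itertools import combinations
import math

def unique_ratio(outputs):
    """Output reproducibility: fraction of generations that are distinct
    verbatim strings (low values indicate text duplication)."""
    return len(set(outputs)) / len(outputs)

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(outputs):
    """Mean similarity over all unordered pairs of repeated generations."""
    bags = [Counter(o.lower().split()) for o in outputs]
    pairs = list(combinations(bags, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy contrast: a duplicating generator vs. a paraphrasing one. Both can
# score high on similarity, but only the uniqueness ratio separates the
# two behaviors, which is the distinction the study highlights.
repetitive = ["walk 30 min daily at moderate intensity"] * 3 + ["walk 20 min daily"]
paraphrasing = [
    "walk 30 min daily at moderate intensity",
    "daily moderate walk of 30 min",
    "moderate intensity walking for 30 min per day",
    "walk each day for 30 min at moderate intensity",
]
rep_unique, par_unique = unique_ratio(repetitive), unique_ratio(paraphrasing)
```

Reporting both numbers side by side makes the "high similarity via duplication" failure mode visible, which single-output evaluations cannot capture.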
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
🔥 Citations: 0
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating their understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear overall full-rank and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
Outsourcing Our Epiphanies
DOI:
10.31046/6ataj823
🔥 Citations: 0
Abstract: Traditionally, being an author has been considered to involve having the capacity to think. This has raised questions about the attribution of authorship to AI. In this essay, I first explain how large language models (LLMs), the technology behind popular chatbots, produce their output. I then survey recent literature which defines what thinking is, with particular comparison to the way LLMs operate. This literature argues that thinking involves opening oneself to new ideas and evaluating those ideas, always keeping alert to the possibility of unexpected discoveries. While LLMs can have new ideas put into their training data, they currently cannot evaluate those ideas correctly without human intervention, and they cannot have epiphanies. I finally argue that both of these capacities are necessary for authorship.
Benchmarking Open-Source and Commercial Docking Tools for Virtual Screening against a Bacterial Peroxiredoxin Target
🔥 Citations: 0
Abstract: No abstract available; please see the original article.
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
🔥 Citations: 0
Abstract: Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings. We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We also construct a benchmark of GUI display defects using 50 real-world Android applications. Experimental results show that multi-window settings substantially increase the exposure of layout-related defects, with text truncation increasing by 184% compared with conventional full-screen settings. At the application level, our method detects 40 defect-prone apps with a false positive rate of 10.00% and a false negative rate of 11.11%, outperforming OwlEye and YOLO-based baselines. At the fine-grained level, it achieves the best F1 score of 87.2% for widget occlusion detection.
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
🔥 Citations: 0
Abstract: Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks into a single auto-regressive sequence prediction problem, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA). It takes a streaming fixed-duration audio chunk (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly enhances response latency and interruption accuracy in real-world interaction scenarios.
GELAX: IoT botnet detection using dynamic graph pruning and anchored explainable AI
🔥 Citations: 0
Abstract: No abstract available; please see the original article.
Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation
🔥 Citations: 0
Abstract: Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.
AI Tools for Teaching the Safe Administration of Medications in Nursing: A Scoping Review
🔥 Citations: 0
Abstract: Background: Safe medication administration is a fundamental aspect of nursing practice and a core component of patient safety. However, systemic failures, workload pressures, and educational gaps continue to contribute to medication errors, posing persistent challenges for healthcare systems. In this context, innovative educational technologies, particularly Artificial Intelligence (AI), have emerged as promising strategies to support the development of competencies related to safe medication administration. Methods: This scoping review aimed to map evidence on AI-based tools used to teach safe medication administration in nursing. The review was conducted in accordance with the Joanna Briggs Institute (JBI) methodology and reported following the PRISMA-ScR guidelines. Searches were performed in PubMed, Scopus, Web of Science, LILACS, and Google Scholar, covering studies published between 2010 and October 2025 in English, Portuguese, and Spanish. Study selection was conducted in two stages, followed by standardized data extraction. Results: A total of 545 records were identified, of which only two studies met the eligibility criteria. The included studies, conducted in Israel and South Korea, evaluated a microlearning chatbot and Large Language Model (LLM)-based tools designed to support teaching safe medication administration. Both studies demonstrated improvements in knowledge and performance in tasks and simulations related to the medication process, as well as positive acceptability among participants. However, neither study assessed direct clinical outcomes, such as reductions in medication errors or preventable adverse events. Conclusions: Although AI-based educational tools show potential to enhance competencies related to medication safety in nursing, the available evidence remains limited. Further robust, multicenter, and comparative studies are needed to evaluate their impact on clinical outcomes and to support their integration into nursing education and practice.
Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning
🔥 Citations: 0
Abstract: Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this work, we take a novel perspective by examining recurring inductive subgraphs, empirically and theoretically showing that they act as spurious shortcuts that mislead GNNs and reinforce non-causal correlations in heterophilic graphs. To address this, we adopt a causal inference perspective to analyze and correct the biased learning behavior induced by shortcut inductive subgraphs. We propose a debiased causal graph that explicitly blocks confounding and spillover paths responsible for these shortcuts. Guided by this causal graph, we introduce Causal Disentangled GNN (CD-GNN), a principled framework that disentangles spurious inductive subgraphs from true causal subgraphs by explicitly blocking non-causal paths. By focusing on genuine causal signals, CD-GNN substantially improves the robustness and accuracy of node classification in heterophilic graphs. Extensive experiments on real-world datasets not only validate our theoretical findings but also demonstrate that our proposed CD-GNN outperforms state-of-the-art heterophily-aware baselines.
Pause or Fabricate? Training Language Models for Grounded Reasoning
🔥 Citations: 0
Abstract: Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
Targeting Aging Mechanisms: Pharmacokinetic and ADMET Challenges in Senescent Physiology
🔥 Citations: 0
Abstract: Natural bioactive agents, including flavonoids, mitochondrial-targeted antioxidants, and NAD+ precursors, are under growing investigation, alongside synthetic agents such as dasatinib and rapamycin, as promising senotherapeutics due to their ability to modulate the hallmarks of aging. This broad overview incorporates the mechanistic understanding of the action of both plant-derived and synthetic agents on major aging processes and, in particular, the pharmacodynamics and pharmacokinetics of their actions in aging (senescent) physiology. Age-related changes in CYP450 enzyme activity, tissue distribution, renal and hepatic clearance, and gut microbiome composition significantly alter the ADMET profiles of these agents, affecting their efficacy and safety. We critically review translational barriers, uncover gaps in age-relevant preclinical models, and suggest strategic solutions for streamlining senotherapeutic dosing and delivery systems. By integrating the perspectives of pharmacognosy, molecular geroscience, and translational pharmacology, this review offers a holistic approach to safe and effective interventions that can lengthen healthspan in aging people.
The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models
🔥 Citations: 0
Abstract: As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics, that is, repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p<0.001). These results underscore the alignment tax of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
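A composite tic-prevalence score in the spirit of the VTI described above might be computed as follows. The paper's exact VTI definition is not given here, so this is only a hypothetical sketch with an invented lexicon and weighting.

```python
import re

# Invented tic lexicon for illustration; the paper's lexicon is not public here.
TIC_OPENERS = ["that's a great question", "awesome!", "i completely understand"]
TIC_WORDS = ["delve", "tapestry", "nuanced"]

def verbal_tic_index(responses, w_opener=0.5, w_lexical=0.5):
    """Toy composite score in [0, 1]: a weighted sum of (a) the share of
    responses that begin with a tic opener and (b) tic-word occurrences
    per 100 tokens, capped at 1."""
    opener_hits = sum(
        any(r.lower().lstrip().startswith(o) for o in TIC_OPENERS) for r in responses
    )
    opener_rate = opener_hits / len(responses)
    tokens = sum(len(r.split()) for r in responses)
    tic_count = sum(
        len(re.findall(rf"\b{w}\b", r.lower())) for r in responses for w in TIC_WORDS
    )
    lexical_rate = min(1.0, 100 * tic_count / tokens)
    return w_opener * opener_rate + w_lexical * lexical_rate

demo = [
    "That's a great question! Let us delve into this nuanced topic.",
    "Here is the answer.",
]
vti_demo = verbal_tic_index(demo)
```

A production metric would also need multi-turn accumulation and per-language lexicons, which the abstract identifies as significant factors.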
Reasoning-Aware AIGC Detection via Alignment and Reinforcement
🔥 Citations: 0
Abstract: The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy and logical consistency and to reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at https://aka.ms/reveal
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
🔥 Citations: 0
Abstract: The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the research community. However, existing studies predominantly focus on English or Chinese, leaving the vulnerability of Indic languages largely unexplored. To bridge this gap, we introduce the Indic-CodecFake (ICF) dataset, the first large-scale benchmark comprising real and NAC-synthesized speech across multiple Indic languages, diverse speaker profiles, and multiple NAC types. We use IndicSUPERB as the real speech corpus for generation of the ICF dataset. Our experiments demonstrate that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to ICF, underscoring the challenges posed by phonetic diversity and prosodic variability in Indic speech. Further, we present a systematic evaluation of SOTA ALMs in a zero-shot setting on the ICF dataset. We evaluate these ALMs as they have shown effectiveness for different speech tasks. However, our findings reveal that current ALMs exhibit consistently poor performance. To address this, we propose SATYAM, a novel hyperbolic ALM tailored for CF detection in Indic languages. SATYAM integrates semantic representations from Whisper and prosodic representations from TRILLsson via the Bhattacharyya distance in hyperbolic space, and subsequently performs the same alignment procedure between the fused speech representation and an input conditioning prompt. This dual-stage fusion framework enables SATYAM to effectively model hierarchical relationships both within speech (semantic-prosodic) and across modalities (speech-text). Extensive evaluations show that SATYAM consistently outperforms competitive end-to-end and ALM-based baselines on the ICF benchmark.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
🔥 Citations: 0
Abstract: While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar-ai/STAR-Teaming-paper.
Machine learning for the prediction of gram-negative bacterial secreted effectors: advances and challenges
🔥 Citations: 0
Abstract: Accurately identifying virulence-associated proteins secreted by Gram-negative pathogens is essential for elucidating bacterial pathogenic mechanisms and developing novel antimicrobial interventions. However, traditional experimental approaches for effector identification are time-consuming and labor-intensive. Recent advances in machine learning (ML), ranging from handcrafted features to context-aware embeddings derived from protein language models, have significantly improved secreted effector prediction. Here, we provide a systematic overview of ML-based methods for secreted effector prediction, surveying available database resources, negative dataset construction strategies, feature representation approaches, and model architectures spanning classical machine learning to deep learning. We discuss fundamental challenges, including data scarcity and class imbalance, evaluation bias, and model interpretability. Finally, we outline future directions encompassing multimodal data integration, meta-learning to address data limitations, and uncertainty quantification to enhance predictive robustness.
Exploratory Evaluation of Diagnostic Accuracy and Temporal Reproducibility of Multimodal Large Language Models in the Image-Based Assessment of Oral Mucosal Lesions
DOI:
10.3390/app16084046
🔥 Citations: 0
Abstract: Objective: The aim was to evaluate the diagnostic accuracy and temporal reproducibility of multimodal large language models (LLMs) in the image-based diagnosis of oral mucosal lesions. Materials and Methods: The study included 100 anonymized clinical photographs of oral mucosal conditions obtained from the archive of the Department of Oral Medicine, School of Dental Medicine, University of Zagreb. Images were categorized into four subgroups: physiological variations, benign mucosal lesions, oral potentially malignant disorders, and oral cancer (25 images each). Three multimodal LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, and Perplexity Pro) analyzed each image using an identical prompt and were required to provide a single most probable diagnosis based solely on visual features. To evaluate temporal reproducibility, the entire evaluation was repeated in three independent testing cycles conducted at one-month intervals. Diagnostic accuracy was compared using chi-square tests, while intra-model agreement across cycles was assessed using Fleiss’ kappa. Results: Gemini demonstrated the highest diagnostic accuracy, reaching 78% correct responses in cycles 2 and 3, significantly outperforming ChatGPT (55–57%) and Perplexity (28–31%) (p < 0.00001). Subgroup analyses showed similar trends, with Gemini achieving the highest accuracy across most lesion categories. Intra-model agreement across cycles was moderate for ChatGPT (κ = 0.525), fair for Gemini (κ = 0.338) and Perplexity (κ = 0.409). Gemini also showed the highest proportion of responses that remained correct across all three cycles (51%). Conclusions: Multimodal LLMs demonstrate promising diagnostic capabilities in the image-based assessment of oral mucosal lesions; however, variability in reproducibility highlights the need for cautious clinical implementation and further validation.
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
🔥 Citations: 0
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.
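The temporal-redundancy idea above, stop remasking predictions that have already stabilized, can be sketched with a simple rule: finalize a position once its prediction has been identical for k consecutive denoising steps, then skip it thereafter. This is an illustrative simulation of the general pattern, not the $R^2$-dLLM decoding rule itself.

```python
def finalize_stable_tokens(step_predictions, k=2):
    """Given per-step token predictions (one inner list per decoding step,
    aligned by position, with None marking a still-masked position), return
    for each position the step index at which it becomes final: the first
    step where its prediction has been identical for k consecutive steps.
    Finalized positions are skipped (not remasked) in later steps."""
    num_pos = len(step_predictions[0])
    final_step = [None] * num_pos
    streak = [0] * num_pos
    prev = [None] * num_pos
    for t, preds in enumerate(step_predictions):
        for i, tok in enumerate(preds):
            if final_step[i] is not None:
                continue  # already finalized: no redundant re-decoding
            if tok is None:  # still masked: no evidence of stability yet
                streak[i], prev[i] = 0, None
                continue
            streak[i] = streak[i] + 1 if tok == prev[i] else 1
            prev[i] = tok
            if streak[i] >= k:
                final_step[i] = t
    return final_step

steps = [
    ["the", None, None],
    ["the", "cat", None],
    ["the", "cat", "sat"],
    ["the", "cat", "sat"],
]
finalized_at = finalize_stable_tokens(steps, k=2)
```

Larger k trades speed for safety: tokens are frozen later, but with more evidence that their prediction has truly converged.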
Untangling the Nuances and Networks of Mobile-GenAI English Learning
🔥 Citations: 0
Abstract: Networked learning is defined as learning supported by purposeful connections between people, resources, and technologies (NLEC, 2020). The preliminary results of the research described in this paper focus on Arab immigrant and refugee women (AIRW) who live in the Canadian province of Saskatchewan and their experiences using mobile GenAI (mAIL) technology to learn English. Kearney et al.’s (2012) sociocultural pedagogical framework guided this study. As the research progressed, it became apparent that it is not possible to simply layer GenAI upon mobile learning; rather, the use of AI changed the nature and networking patterns of the learners. This phenomenological study illustrates how mAIL technology alters the admixture of networked learning in both general and subtle ways. In a general sense, the very nature of large language models (LLMs) is inherently networked, as they connect a vast collection of human-produced artefacts/resources (mostly text based) to human learners who access these artefacts via prompts. We suggest that current models of mobile learning can be shifted to better reflect the nuanced interactions.
GraphRAG-IRL: Personalized Recommendation with Graph-Grounded Inverse Reinforcement Learning and LLM Re-ranking
🔥 Citations: 0
Abstract: Personalized recommendation requires models that capture sequential user preferences while remaining robust to sparse feedback and semantic ambiguity. Recent work has explored large language models (LLMs) as recommenders and re-rankers, but pure prompt-based ranking often suffers from poor calibration, sensitivity to candidate ordering, and popularity bias. These limitations make LLMs useful semantic reasoners, but unreliable as standalone ranking engines. We present \textbf{GraphRAG-IRL}, a hybrid recommendation framework that combines graph-grounded feature construction, inverse reinforcement learning (IRL), and persona-guided LLM re-ranking. Our method constructs a heterogeneous knowledge graph over items, categories, and concepts, retrieves both individual and community preference context, and uses these signals to train a Maximum Entropy IRL model for calibrated pre-ranking. An LLM is then applied only to a short candidate list, where persona-guided prompts provide complementary semantic judgments that are fused with IRL rankings. Experiments show that GraphRAG-IRL is a strong standalone recommender: IRL-MLP with GraphRAG improves NDCG@10 by 15.7\% on MovieLens and 16.6\% on KuaiRand over supervised baselines. The results also show that IRL and GraphRAG are superadditive, with the combined gain exceeding the sum of their individual improvements. Persona-guided LLM fusion further improves ranking quality, yielding up to 16.8\% NDCG@10 improvement over the IRL-only baseline on MovieLens ml-1m, while score fusion on KuaiRand provides consistent gains of 4--6\% across LLM providers.
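The fusion of a calibrated IRL pre-ranking with LLM re-ranking over a short candidate list, as described above, might look like the following sketch. The min-max normalization, reciprocal-rank scoring, and parameter names (`alpha`, `k`) are assumptions for illustration, not the paper's exact formulation.

```python
def fuse_rankings(irl_scores, llm_order, alpha=0.7, k=5):
    """irl_scores: dict item -> calibrated IRL score (pre-ranking).
    llm_order: the LLM's re-ranked top-k shortlist, best first.
    Returns the full item list: the top-k head is re-scored by a convex
    combination of the normalized IRL score and the LLM's reciprocal rank;
    the tail keeps its IRL order, so the LLM only touches the shortlist."""
    items = sorted(irl_scores, key=irl_scores.get, reverse=True)
    head, tail = items[:k], items[k:]
    lo, hi = min(irl_scores.values()), max(irl_scores.values())
    span = (hi - lo) or 1.0
    llm_score = {it: 1.0 / (1 + r) for r, it in enumerate(llm_order)}
    fused = {
        it: alpha * (irl_scores[it] - lo) / span + (1 - alpha) * llm_score.get(it, 0.0)
        for it in head
    }
    return sorted(head, key=fused.get, reverse=True) + tail

irl = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}
final_ranking = fuse_rankings(irl, llm_order=["b", "a", "c"], alpha=0.5, k=3)
```

Restricting the LLM to a short head list is what keeps the scheme robust to the calibration and ordering-sensitivity issues the abstract attributes to pure prompt-based ranking.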
DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs
🔥 Citations: 0
Abstract: Modern distributed systems generate massive volumes of log data that are critical for detecting anomalies and cyber threats. However, in real-world settings, these logs are often distributed across multiple organizations and cannot be centralized due to privacy and security constraints. Existing log anomaly detection methods, including recent large language model (LLM)-based approaches, largely rely on centralized training and are not suitable for such environments. In this paper, we propose DP-FLogTinyLLM, a privacy-preserving federated framework for log anomaly detection using parameter-efficient LLMs. Our approach enables collaborative learning without sharing raw log data by integrating federated optimization with differential privacy. To ensure scalability in resource-constrained environments, we employ low-rank adaptation (LoRA) for efficient fine-tuning of Tiny LLMs at each client. Empirical results on the Thunderbird and BGL datasets show that the proposed framework matches the performance of centralized LLM-based methods, while incurring additional computational overhead due to privacy mechanisms. Compared to existing federated baselines, DP-FLogTinyLLM consistently achieves higher precision and F1-score, with particularly strong gains on the Thunderbird dataset, highlighting its effectiveness in detecting anomalies while minimizing false positives.
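The clip-average-noise pattern underlying differentially private aggregation of client updates, as combined with LoRA in the abstract above, can be sketched as below. The noise calibration shown is the generic Gaussian mechanism applied to flattened client deltas, an illustrative assumption rather than the DP-FLogTinyLLM recipe.

```python
import math
import random

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def clip(v, c):
    """Project an update onto the L2 ball of radius c (DP-SGD-style clipping
    bounds each client's contribution, i.e. the sensitivity of the mean)."""
    n = l2(v)
    return [x * min(1.0, c / n) for x in v] if n > 0 else v

def dp_aggregate(client_updates, c=1.0, noise_mult=0.8, seed=0):
    """Server-side DP aggregation of (flattened) client LoRA deltas:
    clip each update to norm c, average, then add Gaussian noise with
    std = noise_mult * c / n_clients to the mean."""
    rng = random.Random(seed)
    n = len(client_updates)
    clipped = [clip(u, c) for u in client_updates]
    mean = [sum(u[i] for u in clipped) / n for i in range(len(clipped[0]))]
    sigma = noise_mult * c / n
    return [m + rng.gauss(0.0, sigma) for m in mean]

# With noise_mult=0 the result is just the clipped mean (no privacy noise),
# which makes the clipping effect easy to inspect.
noiseless_mean = dp_aggregate([[3.0, 4.0], [0.0, 0.0]], noise_mult=0.0)
```

Only the low-rank LoRA matrices need to be communicated and noised, which is what keeps the privacy overhead tractable for tiny models.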
Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms
🔥 Citations:
0
Abstract: Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
🔥 Citations:
0
Abstract: This paper introduces \textit{Bangla Key2Text}, a large-scale dataset of $2.6$ million Bangla keyword--text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword--text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, \texttt{mT5} and \texttt{BanglaT5}, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.
Automated Morphological Profiling via Deep Learning-Based Segmentation for High-Throughput Phenotypic Screening
🔥 Citations:
0
Abstract: Reproducible morphological profiling, particularly for drug discovery, has become an important tool for compound evaluation. Established workflows such as CellProfiler provide a widely adopted foundation for Cell Painting analysis. However, conventional pipelines often require substantial manual configuration and technical expertise, which can limit scalability and accessibility. In this study, a fully automated deep learning-based workflow is presented for segmentation-driven morphological profiling from raw microscopy data. Using a curated subset of the JUMP Cell Painting pilot dataset, ground-truth masks were generated and used to train a U-net–based segmentation model in the IKOSA platform. Post-processing strategies were introduced to improve instance separation and reduce segmentation artifacts. The final model achieved strong segmentation performance (precision/recall/AP up to 0.98/0.94/0.92 for nuclei), with an average runtime of 2.2 s per 1080 × 1080 image. Segmentation outputs enabled large-scale feature extraction, yielding 3664 morphological descriptors that showed high correlation with CellProfiler-derived measurements (normalized MAE: 0.0298). Feature prioritization further reduced redundancy to 1145 informative descriptors. These results demonstrate that automated deep learning pipelines can complement established Cell Painting workflows by reducing configuration overhead while maintaining compatibility with validated morphological profiling standards. The proposed workflow may help improve resource efficiency in drug discovery and personalized medicine.
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
🔥 Citations:
0
Abstract: Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed-horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite-horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite-horizon discounted CMDPs under human feedback and establish global, non-asymptotic convergence.
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
🔥 Citations:
0
Abstract: The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing, where current MoE-LoRA methods fail to explicitly match input semantics with expert capabilities, leading to weak expert specialization; and (2) uniform weight fusion strategies that struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing, and a Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms state-of-the-art methods and exhibits excellent task generalization capabilities. Code is available at https://github.com/boyan-code/SAMoRA
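The MoE-LoRA forward pass that frameworks like this build on can be sketched generically: a gate produces softmax weights over experts, and each expert contributes a low-rank update B @ (A @ x) on top of the frozen layer output. This is the generic MoE-LoRA computation, not SAMoRA's semantic-aware router or scaling mechanism; all names are illustrative.

```python
# Generic gated LoRA-expert mixing (plain-list linear algebra for clarity).
# Illustrative only; SAMoRA's router and scaling are not reproduced here.
import math

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def moe_lora_forward(x, W, experts, gate_logits):
    """W: frozen weight matrix; experts: list of (A, B) low-rank pairs."""
    mx = max(gate_logits)
    exps = [math.exp(g - mx) for g in gate_logits]  # stable softmax
    gates = [e / sum(exps) for e in exps]
    out = matvec(W, x)  # frozen base-layer output
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))  # low-rank correction B @ A @ x
        out = [o + g * d for o, d in zip(out, delta)]
    return out
```

The routing question the abstract raises is precisely how `gate_logits` are produced; SAMoRA's contribution is to derive them from input semantics rather than a plain learned gate.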
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
🔥 Citations:
0
Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
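The partial-credit metric described above reduces to a simple fraction: each challenge defines checkpoints derived from public writeups, and a run's score is the share the judge marked complete. The checkpoint names below are hypothetical examples, not DeepRed's actual checkpoints.

```python
# Minimal sketch of checkpoint-based partial-credit scoring.
# Checkpoint names are hypothetical examples.

def partial_credit(checkpoints, completed):
    """Fraction of a challenge's checkpoints the agent completed."""
    done = sum(1 for c in checkpoints if c in completed)
    return done / len(checkpoints)

ctf = ["recon", "foothold", "privesc", "flag"]
print(partial_credit(ctf, {"recon", "foothold"}))  # 0.5
```

Averaging this fraction across the ten challenges yields the "average checkpoint completion" figure the abstract reports (35% for the best model).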
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
🔥 Citations:
0
Abstract: We present Wan-Image, a unified visual generation system explicitly engineered to shift image generation models from casual synthesizers to professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image advances visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
DOI:
10.1145/3808175
🔥 Citations:
0
Abstract: Maintaining consistency between code and documentation is a crucial yet frequently overlooked aspect of software development. Even minor mismatches can confuse API users, introduce new bugs, and increase overall maintenance effort. This creates demand for automated solutions that can assist developers in identifying code-documentation inconsistencies. However, since automatic reports still require human confirmation, false positives carry serious consequences: wasting developer time and discouraging practical adoption. We introduce CASCADE (Consistency Analysis for Source Code And Documentation through Execution), a novel tool for detecting inconsistencies with a strong emphasis on reducing false positives. CASCADE leverages Large Language Models (LLMs) to generate unit tests directly from natural-language documentation. Since these tests are derived from the documentation, any failure during execution indicates a potential mismatch between the documented and actual behavior of the code. To minimize false positives, CASCADE also generates code from the documentation to cross-check the generated tests. By design, an inconsistency is reported only when two conditions are met: the existing code fails a test, while the code generated from the documentation passes the same test. We evaluated CASCADE on a novel dataset of 71 inconsistent and 814 consistent code-documentation pairs drawn from open-source Java projects. Further, we applied CASCADE to additional Java, C#, and Rust repositories, where we uncovered 13 previously unknown inconsistencies, of which 10 have subsequently been fixed, demonstrating both CASCADE's precision and its applicability to real-world codebases.
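The reporting rule stated in the abstract is a simple two-condition check: an inconsistency is flagged only when the existing code fails a documentation-derived test while code regenerated from the documentation passes the same test. A minimal sketch of that decision logic (the function name and boolean interface are hypothetical stand-ins, not CASCADE's API):

```python
# Sketch of CASCADE's dual-condition reporting rule as stated in the abstract.
# The function name and interface are hypothetical stand-ins.

def report_inconsistency(existing_passes: bool, generated_passes: bool) -> bool:
    """Flag a code-documentation mismatch only when the existing code fails
    a documentation-derived test AND code regenerated from the documentation
    passes it. If the regenerated code also fails, the test itself is suspect,
    so nothing is reported, which is how false positives are suppressed."""
    return (not existing_passes) and generated_passes
```

The cross-check via regenerated code is the key design choice: a flaky or wrong LLM-generated test fails both implementations and is silently discarded rather than surfaced to a developer.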
Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
🔥 Citations:
0
Abstract: Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.
Frontiers: ChatGPT Referrals to E-Commerce Websites: How Do LLMs Compare Against Traditional Channels?
🔥 Citations:
0
Abstract: This is a descriptive study reporting financial and engagement metrics for 973 e-commerce websites, comparing organic large language model traffic (oLLM) with traditional digital channels.
AlignCultura: Towards Culturally Aligned Large Language Models?
🔥 Citations:
0
Abstract: Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity w.r.t. the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment; yet, no benchmark currently enables systematic evaluation of cultural alignment in line with UNESCO's principles of cultural diversity w.r.t. the HHH paradigm. Therefore, to address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, an HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.
Mechanistic In Silico and In Vitro Evaluation of Phytochemicals Targeting HIV-1 Reverse Transcriptase and Protease
🔥 Citations:
0
Abstract: Introduction: Human immunodeficiency virus type-1 (HIV-1) replication depends on the coordinated activity of viral enzymes, particularly reverse transcriptase and protease, which are critical for genome replication and viral maturation. Targeting these enzymes remains a central strategy in antiviral drug development. Materials and Methods: In the present study, an integrated in silico–in vitro approach was employed to investigate plant-derived phytochemicals as potential inhibitors of HIV-1 reverse transcriptase and protease. Molecular docking was performed to predict binding affinities and interaction patterns of selected phytochemicals with the catalytic regions of both enzymes. Drug-likeness and pharmacokinetic properties were evaluated using ADMET and SwissADME tools, while biological activity was predicted using PASS analysis. Results and Discussion: Experimental validation was carried out using enzyme-based inhibition assays, complemented by cytotoxicity assessment in host cells using the MTT assay. Docking analyses revealed stable and target-specific interactions of phytochemicals with key amino acid residues involved in enzyme activity. In vitro assays confirmed inhibitory effects with IC₅₀ values ranging from 2.0 to 30.0 μM, alongside acceptable cytotoxicity profiles. Several compounds demonstrated differential inhibition of reverse transcriptase and protease, indicating distinct modes of interference with HIV-1 replication machinery. Conclusion: Overall, this study provides mechanistic insight into the inhibition of HIV-1 enzymes by structurally diverse phytochemicals and establishes a direct correlation between predicted molecular interactions and functional enzyme inhibition. The findings highlight the potential of natural compounds as lead scaffolds for structure-guided antiviral drug development targeting HIV-1 pathogenesis.
Adaptive Gyroscopic Feedback-Based Foundation Control for Sustainable and Automated Torsional Seismic Mitigation in Buildings
DOI:
10.3390/su18084120
🔥 Citations:
0
Abstract: Seismic-induced torsional response remains a significant barrier to achieving resilient and sustainable building foundations, as traditional passive isolation systems often fail to regulate rotational motion effectively. This study examines an adaptive gyroscopic feedback-based foundation control system designed to provide automated torsional seismic mitigation. The proposed system integrates real-time angular velocity sensing using MEMS gyroscopes, Kalman filter state estimation, and an adaptive Linear Quadratic Regulator to modulate damping in response to changing ground motion. A single-degree-of-freedom torsional foundation model was developed and evaluated in GNU Octave 8.4.0/MATLAB R2024a Simulink using the recorded El Centro 1940 NS earthquake input. The adaptive controller achieved notable improvements, reducing total vibration energy by 69%, peak angular displacement by 47.6%, and RMS angular velocity by 39.5% relative to the uncontrolled case, while keeping control energy below 19% of the seismic input. These results demonstrate that gyroscopic feedback enhances damping, limits torsional resonance, and stabilises foundation behaviour under actual earthquake excitation. The system’s low energy requirement, compatibility with embedded hardware, and automated response characteristics underscore its potential for integration into sustainable and intelligent foundation designs. While results are demonstrated using the El Centro 1940 record as a benchmark, broader generalisation will be established through multi-record suites and uncertainty quantification in future work. The study highlights a feasible pathway for advancing automated seismic protection in buildings through active, sensor-driven torsional control.
BioLAMR: A Biomimetically Inspired Large Language Model Adaptation Framework for Automatic Modulation Recognition
🔥 Citations:
0
Abstract: Automatic modulation recognition (AMR) is increasingly relevant to communication-sensing front ends in robotic and human–robot collaborative systems, where reliable spectrum awareness and adaptive wireless reception are desired. However, existing methods often degrade sharply at low signal-to-noise ratios (SNRs), and large language models (LLMs) are not natively compatible with continuous I/Q signals due to the inherent modality gap. We propose BioLAMR, a GPT-2 adaptation framework for AMR inspired by the auditory system’s parallel time–frequency processing and cortical hierarchy. The framework combines bio-inspired dual-domain feature extraction with parameter-efficient LLM adaptation. BioLAMR includes three components. First, a lightweight dual-domain fusion (LDDF) module extracts complementary time- and frequency-domain features and fuses them through channel and spatial attention. Second, a convolutional embedding module converts continuous I/Q signals into GPT-2-compatible sequences without discrete tokenization. Third, a hierarchical fine-tuning strategy updates only 8.9% of parameters to preserve pretrained knowledge while adapting to modulation recognition. Experiments on the RadioML2016.10a and RadioML2016.10b benchmarks show that BioLAMR achieves overall accuracies of 64.99% and 67.43%, outperforming the strongest competing method by 2.60 and 2.47 percentage points, respectively. Under low-SNR conditions, it reaches 36.78% and 38.14%, the best results among the compared methods. Ablation studies verify the contribution of each component. These results demonstrate that combining dual-domain signal modeling with parameter-efficient GPT-2 adaptation is an effective route to robust AMR in challenging wireless environments.
ICAM-Trans: Implicit-Context-Aware Multi-Agent Framework for Function-Level Code Translation
DOI:
10.1145/3810952
🔥 Citations:
0
Abstract: With the rapid advancement of large language models (LLMs), code translation has become a critical yet challenging task in software engineering. Existing evaluations of this task remain unreliable due to two key limitations: the absence of specialized function-level code translation benchmarks and flawed evaluation pipelines. Specifically, current pipelines lack post-processing mechanisms to extract target code from LLMs’ inherently unstable output formats, directly resulting in non-executable translations and false negatives in evaluation. To address these issues, we introduce FuncTransEval, an enhanced multilingual benchmark. It comprises 400×4 functions across four programming languages (Python, C++, Java, and JavaScript) and 12 bidirectional translation pairs, each paired with executable unit tests for fine-grained functional validation. Using FuncTransEval, we systematically assess LLMs’ true capabilities in code translation and identify their core limitations: inadequate understanding of implicit information such as function logic, type specifications, and cross-language differences. To mitigate these shortcomings, we propose ICAM-Trans (Implicit-Context-Aware Multi-Agent Framework for function-level code translation). ICAM-Trans is a hierarchical, autonomous multi-agent architecture that explicitly uncovers and leverages this implicit contextual information during translation. It employs a Translation Orchestrate Agent (TOA) to autonomously coordinate four specialized Context Analysis Agents (CAAs), which conduct pre-translation analyses on different aspects. Insights from these analyses guide a Context-Aware Translation Agent (CTA) to generate semantically faithful target code. Experiments on FuncTransEval demonstrate that ICAM-Trans consistently outperforms strong baselines, validating its effectiveness in achieving high-fidelity and interpretable function-level code translation.
Large language models in NLP: evolution, architectural trends, and open challenges
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Scientific Accuracy of Large Language Models in Tilted Implant Dentistry: A Guideline-Based Comparative Evaluation
🔥 Citations:
0
Abstract: Tilted dental implant systems are widely used in the rehabilitation of anatomically compromised jaws and are supported by international consensus guidelines. Concurrently, large language models (LLMs) are increasingly accessed as informational tools in implant dentistry; however, their scientific accuracy and adherence to guideline-based principles in advanced implant concepts remain insufficiently explored. This study evaluated the scientific accuracy, guideline conformity, and clinical consistency of responses generated by 4 LLMs regarding tilted dental implant systems. A total of 120 guideline-based questions covering 8 predefined domains (definition, indications, contraindications, advantages, surgical procedure content, prosthetic procedure content, complications, and prognosis/survival) were developed in accordance with ITI, EAO, and AAOMS consensus reports. Each question was independently submitted to ChatGPT-5.2, Copilot, DeepSeek, and Gemini, and all responses were anonymized and evaluated by a multidisciplinary expert panel using a structured ordinal scoring system. Overall, scientific accuracy scores were high across all models, with near-ceiling performance observed in domains related to indications, advantages, procedural content, and prognosis. Statistically significant between-model differences were identified in the definition (P=0.003), contraindications (P=0.006), and complications (P<0.001) domains, with DeepSeek and Gemini demonstrating consistently higher scores in complication-related content compared with ChatGPT and Copilot. Within-model analyses further revealed significant domain-dependent variability across all LLMs. Although LLMs demonstrate a strong capacity to reproduce established, guideline-based knowledge regarding tilted implant systems, limitations remain in safety-critical domains requiring nuanced clinical judgment. Accordingly, LLMs should be regarded as adjunctive educational tools rather than substitutes for expert decision-making in craniofacial implantology.