Today's Frontier Papers
Currently showing 1000 papers
From simulation to pedagogy: structured AI standardized patients for clinical communication training validated through multi-model and randomized evaluation
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
One Size Fits All? Comparing Foundation and Task-specific Models for Retinal Fluid Segmentation
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
Silent numerical failures in large language model-generated pharmacokinetic simulation code: a benchmark against target-controlled infusion validation criteria using the Marsh propofol model
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis
🔥 Citations:
0
Abstract: Digital pathology has fundamentally altered diagnostic workflows by enabling the computational analysis of gigapixel Whole Slide Images (WSIs), yet effectively deciphering their complex tumor microenvironments remains a formidable challenge. Existing Multiple Instance Learning (MIL) frameworks typically treat Whole Slide Images as unstructured bags of patches, discarding critical morphological semantics and spatial geometry. This lack of inductive bias often leads to overfitting on background noise and fails to align visual features with high-level diagnostic knowledge. To overcome these limitations, we propose the Hierarchical Prototype-based Domain Priors (HPDP) framework, a unified multimodal approach for joint histopathology diagnosis and prognosis. HPDP mitigates the data-driven "black box" issue by introducing a Morphologically Anchored Prototype System (MAPS), which anchors learning to interpretable morphological clusters, and a Sinusoidal Positional Encoder (SPE) to explicitly model tissue architecture. Furthermore, we bridge the semantic gap via a Hierarchical Cross-Modal Alignment (HCMA) module, using Large Language Model (LLM)-generated descriptions to contextually refine visual representations. Extensive experiments across seven cancer cohorts demonstrate that HPDP consistently achieves state-of-the-art performance with superior robustness and interpretability.
AI Agents Need Both Hardware-Backed Security and Application-Level Guardrails
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
🔥 Citations:
0
Abstract: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. There are two stages in DPEPO: initial supervised fine-tuning (SFT) imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards: Diverse Action Reward and Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates, while maintaining comparable efficiency to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
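The step-level diversity rewards described in the abstract can be sketched as simple set-based scores over actions taken across parallel trajectories. The one-liners below are illustrative stand-ins, not DPEPO's exact formulation:

```python
def diverse_action_reward(actions):
    """Fraction of distinct actions taken across parallel trajectories at
    one step; redundant behavior across environments lowers the reward."""
    return len(set(actions)) / len(actions)

def diverse_state_transition_reward(states):
    """Same idea applied to the resulting environment states, rewarding
    trajectories that reach different parts of the state space."""
    return len(set(states)) / len(states)

# Four parallel agents, two of which repeat the same action:
r = diverse_action_reward(["open drawer", "open drawer", "go north", "take key"])
```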
Strategic Bidding in 6G Spectrum Auctions with Large Language Models
🔥 Citations:
0
Abstract: Efficient and fair spectrum allocation is a central challenge in 6G networks, where massive connectivity and heterogeneous services continuously compete for limited radio resources. We investigate the use of Large Language Models (LLMs) as bidding agents in repeated 6G spectrum auctions with budget constraints in vehicular networks. Each user equipment (UE) acts as a rational player optimizing its long-term utility through repeated interactions. Using the Vickrey-Clarke-Groves (VCG) mechanism as a benchmark for incentive-compatible, dominant-strategy truthfulness, we compare LLM-guided bidding against truthful and heuristic strategies. Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically. Results show that when the theoretical assumptions guaranteeing truthfulness hold, LLM bidders recover near-equilibrium outcomes consistent with VCG predictions. However, when these assumptions break -- such as under static budget constraints -- LLMs sustain longer participation and achieve higher utilities, revealing their ability to approximate adaptive equilibria beyond static mechanism design. This work provides the first systematic evaluation of LLM bidders in repeated spectrum auctions, offering new insights into how AI-driven agents can interact strategically and reshape market dynamics in future 6G networks.
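The VCG benchmark referenced above can be illustrated in its simplest form: for a single item, VCG reduces to the Vickrey (second-price) auction, in which the winner pays the externality it imposes on the other bidders. A minimal sketch, not the paper's repeated, budget-constrained setting:

```python
def vickrey_winner_and_payment(bids):
    """Single-item Vickrey (second-price) auction -- the simplest instance
    of the VCG mechanism. The highest bidder wins and pays the
    second-highest bid, which makes truthful bidding a dominant strategy."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    payment = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, payment

# Three UEs report their valuations for one spectrum block:
winner, price = vickrey_winner_and_payment({"UE1": 5.0, "UE2": 3.0, "UE3": 4.0})
# UE1 wins and pays 4.0, the second-highest bid
```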
Quantum Kernel Advantage over Classical Collapse in Medical Foundation Model Embeddings
🔥 Citations:
0
Abstract: We provide evidence of quantum kernel advantage under noiseless simulation in binary insurance classification on MIMIC-CXR chest radiographs using quantum support vector machines (QSVM) with frozen embeddings from three medical foundation models (MedSigLIP-448, RAD-DINO, ViT-patch32). We propose a two-tier fair comparison framework in which both classifiers receive identical PCA-q features. At Tier 1 (untuned QSVM vs. untuned linear SVM, C = 1 both sides), QSVM wins minority-class F1 in all 18 tested configurations (17 at p<0.001, 1 at p<0.01). The classical linear kernel collapses to majority-class prediction on 90-100% of seeds at every qubit count, while QSVM maintains non-trivial recall. At q = 11 (MedSigLIP-448 plateau center), QSVM achieves mean F1 = 0.343 vs. classical F1 = 0.050 (F1 gain = +0.293, p<0.001) without hyperparameter tuning. Under Tier 2 (untuned QSVM vs. C-tuned RBF SVM), QSVM wins all seven tested configurations (mean gain +0.068, max +0.112). Eigenspectrum analysis reveals quantum kernel effective rank reaches 69.80 at q = 11, far exceeding linear kernel rank, while classical collapse remains C-invariant. A full qubit sweep reveals architecture-dependent concentration onset across models. Code: https://github.com/sebasmos/qml-medimage
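The abstract reports kernel "effective rank" from an eigenspectrum analysis but does not define it; a common entropy-based definition (an assumption here) is the exponential of the Shannon entropy of the normalized eigenvalues:

```python
import numpy as np

def effective_rank(K):
    """Effective rank of a symmetric PSD kernel (Gram) matrix: exp of the
    Shannon entropy of its normalized eigenvalue spectrum. A flat spectrum
    gives rank n; a collapsed, rank-1 spectrum gives 1."""
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

# Identity matrix: mass spread evenly over all n eigenvalues
assert abs(effective_rank(np.eye(8)) - 8.0) < 1e-9
```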
MEG-RAG: Quantifying Multi-modal Evidence Grounding for Evidence Selection in RAG
🔥 Citations:
0
Abstract: Multimodal Retrieval-Augmented Generation (MRAG) addresses key limitations of Multimodal Large Language Models (MLLMs), such as hallucination and outdated knowledge. However, current MRAG systems struggle to distinguish whether retrieved multimodal data truly supports the semantic core of an answer or merely provides superficial relevance. Existing metrics often rely on heuristic position-based confidence, which fails to capture the informational density of multimodal entities. To address this, we propose Multi-modal Evidence Grounding (MEG), a semantic-aware metric that quantifies the contribution of retrieved evidence. Unlike standard confidence measures, MEG utilizes Semantic Certainty Anchoring, focusing on high-IDF information-bearing tokens that better capture the semantic core of the answer. Building on MEG, we introduce MEG-RAG, a framework that trains a multimodal reranker to align retrieved evidence with the semantic anchors of the ground truth. By prioritizing high-value content based on semantic grounding rather than token probability distributions, MEG-RAG improves the accuracy and multimodal consistency of generated outputs. Extensive experiments on the M$^2$RAG benchmark show that MEG-RAG consistently outperforms strong baselines and demonstrates robust generalization across different teacher models.
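The idea of anchoring on high-IDF, information-bearing tokens can be sketched with plain IDF weighting. This is an illustrative stand-in, not MEG's actual scoring function:

```python
import math
from collections import Counter

def idf_weights(corpus):
    """IDF over a token-list corpus; rare, high-IDF tokens act as the
    information-bearing 'semantic anchors' of an answer."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log(n / df[tok]) for tok in df}

def grounding_score(answer_tokens, evidence_tokens, idf):
    """Fraction of the answer's IDF mass covered by retrieved evidence:
    1.0 when every informative answer token appears in the evidence,
    0.0 when none does (stopword-like tokens carry no weight)."""
    total = sum(idf.get(t, 0.0) for t in answer_tokens)
    covered = sum(idf.get(t, 0.0) for t in answer_tokens if t in evidence_tokens)
    return covered / total if total else 0.0
```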
BitRL: Reinforcement Learning with 1-bit Quantized Language Models for Resource-Constrained Edge Deployment
🔥 Citations:
0
Abstract: The deployment of intelligent reinforcement learning (RL) agents on resource-constrained edge devices remains a fundamental challenge due to the substantial memory, computational, and energy requirements of modern deep learning systems. While large language models (LLMs) have emerged as powerful architectures for decision-making agents, their multi-billion parameter scale confines them to cloud-based deployment, raising concerns about latency, privacy, and connectivity dependence. We introduce BitRL, a framework for building RL agents using 1-bit quantized language models that enables practical on-device learning and inference under severe resource constraints. Leveraging the BitNet b1.58 architecture with ternary weights (-1, 0, +1) and an optimized inference stack, BitRL achieves 10-16x memory reduction and 3-5x energy efficiency improvements over full-precision baselines while maintaining 85-98 percent of task performance across benchmarks. We provide theoretical analysis of quantization as structured parameter perturbation, derive convergence bounds for quantized policy gradients under frozen-backbone architectures, and identify the exploration-stability trade-off in extreme quantization. Our framework systematically integrates 1-bit quantized language models with reinforcement learning for edge deployment and demonstrates effectiveness on commodity hardware.
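The ternary weights mentioned above follow BitNet b1.58's absmean quantization: scale a weight matrix by its mean absolute value, then round each entry to the nearest value in {-1, 0, +1}. A minimal sketch:

```python
import numpy as np

def absmean_ternary(W, eps=1e-8):
    """BitNet b1.58-style absmean quantization: scale by the mean absolute
    weight, round to the nearest integer, and clip to {-1, 0, +1}.
    Returns the ternary weights and the per-matrix scale gamma."""
    gamma = np.mean(np.abs(W)) + eps
    q = np.clip(np.round(W / gamma), -1, 1).astype(np.int8)
    return q, gamma

w = np.array([[0.9, -0.05, -1.2]])
q, gamma = absmean_ternary(w)
# q is [[1, 0, -1]]; dequantized weights are approximately q * gamma
```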
SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering
🔥 Citations:
0
Abstract: Multi-hop Question Answering (MHQA) aims to answer questions that require multi-step reasoning. It presents two key challenges: generating correct reasoning paths in response to the complex user queries, and accurately retrieving essential knowledge in the face of potential limitations in large language models (LLMs). Existing approaches primarily rely on prompt-based methods to generate reasoning paths, which are further combined with traditional sparse or dense retrieval to produce the final answer. However, the generation of reasoning paths commonly lacks effective control over the generative process, thus leading the reasoning astray. Meanwhile, the retrieval methods over-rely on knowledge matching or similarity scores rather than evaluating the practical utility of the information, resulting in retrieving homogeneous or non-useful information. Therefore, we propose a Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator framework named SEARCH-R. Specifically, SEARCH-R trains an end-to-end reasoning path navigator, which is able to provide a powerful sub-question decomposer by fine-tuning the Llama3.1-8B model. Moreover, a novel dependency tree-based retrieval is designed to evaluate the informational contribution of the document quantitatively. Extensive experiments on three challenging multi-hop datasets validate the effectiveness of the proposed framework. The code and dataset are available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_SEARCH-R.
Contextual Linear Activation Steering of Language Models
🔥 Citations:
0
Abstract: Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, resulting in inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts linear activation steering to context-dependent steering strengths. Across eleven steering benchmarks and four model families, it consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.
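Fixed-strength steering versus a context-dependent variant can be sketched as follows. The `contextual_strength` rule (topping up each token's projection onto the steering direction to a common target) is a hypothetical illustration, not CLAS's actual mechanism:

```python
import numpy as np

def steer(hidden, direction, strength):
    """Linear activation steering: h' = h + alpha * v for a unit steering
    direction v, with a per-token strength alpha of shape [tokens].
    Fixed-strength steering uses one constant alpha for every token."""
    v = direction / np.linalg.norm(direction)
    return hidden + strength[:, None] * v

def contextual_strength(hidden, direction, target=1.0):
    """Hypothetical context-dependent rule: choose each token's strength
    so its projection onto the steering direction reaches `target`."""
    v = direction / np.linalg.norm(direction)
    return target - hidden @ v
```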
EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT
🔥 Citations:
0
Abstract: Chest computed tomography (CT) is central to the detection and management of thoracic disease, yet the growing scale and complexity of volumetric imaging increasingly exceed what can be addressed by scan-level prediction alone. Clinically useful AI for CT must not only recognize disease across the whole volume, but also localize abnormalities and provide interpretable visual evidence. Existing vision-language foundation models typically compress scans and reports into global image-text representations, limiting their ability to preserve spatial evidence and support clinically meaningful interpretation. Here we developed EXACT, an explainable anomaly-aware foundation model for three-dimensional chest CT that learns spatially resolved representations from paired clinical scans and radiology reports. EXACT was pre-trained on 25,692 CT-report pairs using anatomy-aware weak supervision, jointly learning organ segmentation and multi-instance anomaly localization without manual voxel-level annotations. The resulting organ-specific anomaly-aware maps assign each voxel a disease-specific anomaly score confined to its corresponding anatomy, jointly encoding lesion extent and organ-level context. In retrospective multinational and multi-center evaluations, EXACT showed broad and consistent improvements across clinically relevant CT tasks, spanning multi-disease diagnosis, zero-shot anomaly localization, downstream adaptation, and visually grounded report generation, outperforming existing three-dimensional medical foundation models. By transforming routine clinical CT scans and free-text reports into explainable voxel-level representations, EXACT establishes a scalable paradigm for trustworthy volumetric medical AI.
Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data
🔥 Citations:
0
Abstract: Computer-Aided Design (CAD) models are defined by their construction history: a parametric recipe that encodes design intent. However, existing large-scale 3D datasets predominantly consist of boundary representations (B-Reps) or meshes, stripping away this critical procedural information. To address this scarcity, we introduce Zero-to-CAD, a scalable framework for synthesizing executable CAD construction sequences. We frame synthesis as an agentic search problem: by embedding a large language model (LLM) within a feedback-driven CAD environment, our system iteratively generates, executes, and validates code using tools and documentation lookup to promote geometric validity and operation diversity. This agentic approach enables the synthesis of approximately one million executable, readable, editable CAD sequences, covering a rich vocabulary of operations beyond sketch-and-extrude workflows. We also release a curated subset of 100,000 high-quality models selected for geometric diversity. To demonstrate the dataset's utility, we fine-tune a vision-language model on our synthetic data to reconstruct editable CAD programs from multi-view images, outperforming strong baselines, including GPT-5.2, and effectively bootstrapping sequence generation capabilities without real construction-history training data. Zero-to-CAD bridges the gap between geometric scale and parametric interpretability, offering a vital resource for the next generation of CAD AI.
FlashOverlap: Minimizing Tail Latency in Communication Overlap for Distributed LLM Training
🔥 Citations:
0
Abstract: The rapid growth in the size of large language models has necessitated partitioning computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique that eliminates the tail latency of state-of-the-art overlap methods for distributed LLM training. The technique aims to mitigate the communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method, termed FlashOverlap, that replaces conventional reduce-scatter and all-gather collective operations with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it offers a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and higher throughput.
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
🔥 Citations:
0
Abstract: Large Language Model (LLM) agents are increasingly used to automate complex workflows, but integrating untrusted external data with privileged execution exposes them to severe security risks, particularly direct and indirect prompt injection. Existing defenses face significant challenges in balancing security with utility, often encountering a trade-off where rigorous protection leads to over-defense, or where subtle indirect injections bypass detection. Drawing inspiration from operating system virtualization, we propose AgentVisor, a novel defense framework that enforces semantic privilege separation. AgentVisor treats the target agent as an untrusted guest and intercepts tool calls via a trusted semantic visor. Central to our approach is a rigorous audit protocol grounded in classic OS security primitives, designed to systematically mitigate both direct and indirect injection attacks. Furthermore, we introduce a one-shot self-correction mechanism that transforms security violations into constructive feedback, enabling agents to recover from attacks. Extensive experiments show that AgentVisor reduces the attack success rate to 0.65%, achieving this strong defense while incurring only a 1.45% average decrease in utility relative to the No Defense scenario, demonstrating superior performance compared to existing defense methods.
On the generation of 12-channel electrocardiograms based on a hybrid of diffusion and graph neural network models
🔥 Citations:
0
Abstract: A hybrid VAE-GNN-SSSD model is presented for generating physiologically correct 12-channel electrocardiograms with a duration of 10 seconds. The proposed architecture combines three key components: a variational autoencoder for isolating the morphological components of P-QRS-T, a graph neural network with a partially fixed adjacency matrix to ensure compliance with the biophysical laws of Einthoven and Wilson, and a diffusion model with a structured state space for modeling long-term temporal dependencies. The model can generate signals conditioned on clinical parameters: arrhythmia type, age, sex, and heart rate. Experimental results on the PTB-XL test set showed FID = 0.052 and PRD = 10.8%, comparable with modern methods. A key advantage of the model is its built-in biophysical correctness, confirmed by the MSE metric for Einthoven's law (0.084). Practical effectiveness was confirmed on MIT-BIH arrhythmia classification: augmentation with synthetic data increased Macro F1 from 0.84 to 0.89 (+6%), improved the recognition of rare ventricular and fused contractions by 5-7%, and reduced the false omission of dangerous arrhythmias by 27%. The model demonstrated good generalization on independent ICU data (MIMIC-IV-ECG). These results open up prospects for training diagnostic systems, pathology simulation, the creation of digital heart twins, and the training of medical specialists, addressing the shortage of annotated data while preserving patient privacy.
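The Einthoven-law MSE reported above is straightforward in principle: Einthoven's law states that Lead II equals Lead I plus Lead III at every instant, so generated limb leads can be scored by the mean squared residual of that identity. This is one plausible reading of the paper's metric, not its confirmed definition:

```python
import numpy as np

def einthoven_mse(lead_i, lead_ii, lead_iii):
    """Score a generated ECG against Einthoven's law (II = I + III):
    the mean squared residual of the identity across all samples.
    A physiologically consistent signal yields a residual near zero."""
    residual = lead_ii - (lead_i + lead_iii)
    return float(np.mean(residual ** 2))
```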
Deep learning-assisted proteomic dissection reveals sex biased and shared proteomic patterns in Populus deltoides under waterlogging stress and subsequent recovery
🔥 Citations:
0
Abstract: Sexual dimorphism in dioecious tree species and their proteomic responses under stress represent an underappreciated axis of stress resilience. Here, we investigated sex-biased proteomic patterns in Populus deltoides exposed to long-term waterlogging and subsequent recovery. By integrating iTRAQ-based quantitative proteomics with machine learning approaches, including autoencoders, graph neural networks (GNNs), and SHAP-based model explainability, we explored latent structure within high-dimensional protein abundance data. Unsupervised autoencoders captured dominant stress-related variation but showed limited resolution of sex-associated differences. In contrast, GNN and GraphSAGE embeddings disentangled sex-biased proteins embedded within broader waterlogging and recovery responses, highlighting a structured and topologically coherent proteomic differentiation. SHAP analyses applied to Random Forest and XGBoost models trained on latent features confirmed that waterlogging is a more decisive contributor to sex-biased protein patterns than recovery. Further, in males, selective retention of photosynthetic capacity was evident through increased protein abundance of Photosystem II components (PsbH, PsbR) and light-harvesting proteins (Lhcb1, Lhcb2), coupled with targeted decreases in cytochrome b6f and Photosystem I subunits during stress and a partial rebound post-waterlogging. In contrast, females displayed widespread suppression across the photosynthetic apparatus, suggesting a reduced capacity for functional recovery. Stress-specific modulation of phenylpropanoid and proline pathways further revealed divergent metabolic strategies. Together, our findings reveal sex-biased proteomic plasticity as a critical determinant of recovery potential in P. deltoides and establish a graph-based machine-learning framework for decoding the hidden layers of functional dimorphism under abiotic stress.
Assessing physiological coherence in stress related predictions of large language models: a surrogate based analysis of the Mistral 3 family using wearable HRV data
🔥 Citations:
0
Abstract: No abstract available; please see the original paper.
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
🔥 Citations:
0
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) extends traditional RAG by using knowledge graphs (KGs) to give large language models (LLMs) a structured, semantically coherent context, yielding more grounded answers. However, the GraphRAG reasoning process remains a black box, limiting our ability to understand how specific pieces of structured knowledge influence the final output. Existing explainability (XAI) methods for RAG systems, designed for text-based retrieval, are limited in their ability to interpret an LLM response through the relational structures among knowledge components, creating a critical gap in transparency and trustworthiness. To address this, we introduce XGRAG, a novel framework that generates causally grounded explanations for GraphRAG systems by employing graph-based perturbation strategies to quantify the contribution of individual graph components to the model's answer. We conduct extensive experiments comparing XGRAG against RAG-Ex, an XAI baseline for standard RAG, and evaluate its robustness across various question types, narrative structures, and LLMs. Our results demonstrate a 14.81% improvement in explanation quality over the baseline RAG-Ex across NarrativeQA, FairyTaleQA, and TriviaQA, evaluated by an F1-score measuring alignment between generated explanations and original answers. Furthermore, XGRAG explanations exhibit a strong correlation with graph centrality measures, validating its ability to capture graph structure. XGRAG provides a scalable and generalizable approach towards trustworthy AI through transparent, graph-based explanations that enhance the interpretability of RAG systems.
Multi-View Synergistic Learning with Vision-Language Adaption for Low-Resource Biomedical Image Classification
🔥 Citations:
0
Abstract: Accurate biomedical image classification under low-resource conditions remains challenging due to limited annotations, subtle inter-class visual differences, and complex disease semantics. While vision-language models offer a promising foundation for mitigating data scarcity, their effective adaptation in biomedical settings is constrained by the need for parameter-efficient tuning alongside fine-grained and semantically consistent representation learning. In this work, we propose Multi-View Synergistic Learning (MVSL), a unified framework that addresses these challenges by jointly considering adaptation paradigms, representation granularity, and disease semantic relationships. MVSL decouples the adaptation of visual and textual encoders to respect their distinct representational characteristics, enabling more stable and effective parameter-efficient fine-tuning. It further introduces multi-granularity contrastive learning to explicitly model both global image semantics and localized lesion-level evidence, improving fine-grained discrimination for visually similar disease categories. In addition, MVSL preserves disease-level semantic structure by incorporating structured supervision derived from large language models, which constrains textual representations at the class level and indirectly regularizes visual embeddings through cross-modal alignment. Together, these components enable more stable cross-modal alignment and improved discrimination under limited supervision. Extensive experiments on 11 public biomedical datasets spanning 9 imaging modalities and 10 anatomical regions demonstrate that MVSL consistently outperforms state-of-the-art methods in few-shot and zero-shot classification settings.
MEMCoder: Multi-dimensional Evolving Memory for Private-Library-Oriented Code Generation
🔥 Citations:
0
Abstract: Large Language Models (LLMs) excel at general code generation, but their performance drops sharply in enterprise settings that rely on internal private libraries absent from public pre-training corpora. While Retrieval-Augmented Generation (RAG) offers a training-free alternative by providing static API documentation, we find that such documentation typically provides only isolated definitions, leaving a fundamental knowledge gap. Specifically, LLMs struggle with a task-level lack of coordination patterns between APIs and an API-level misunderstanding of parameter constraints and boundary conditions. To address this, we propose MEMCoder, a novel framework that enables LLMs to autonomously accumulate and evolve Usage Guidelines across these two dimensions. MEMCoder introduces a Multi-dimensional Evolving Memory that captures distilled lessons from the model's own problem-solving trajectories. During inference, MEMCoder employs a dual-source retrieval mechanism to inject both static documentation and relevant historical guidelines into the context. The framework operates in an automated closed loop by using objective execution feedback to reflect on successes and failures, resolve knowledge conflicts, and dynamically update memory. Extensive evaluations on the NdonnxEval and NumbaEval benchmarks demonstrate that MEMCoder substantially enhances existing RAG systems, yielding an average absolute pass@1 gain of 16.31%. Furthermore, MEMCoder exhibits vastly superior domain-specific adaptation compared to existing memory-based continual learning methods.
Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance
🔥 Citations:
0
Abstract: Molecular similarity plays a central role in ligand-based drug discovery, such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics, and performs effectively in both ranking molecules for virtual screening and guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measurement for modern AI-aided drug discovery.
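A pretrained embedding distance of the kind described can be computed directly from frozen model outputs. Cosine distance is one illustrative choice; the abstract does not commit to a specific metric:

```python
import numpy as np

def pretrained_embedding_distance(emb_a, emb_b):
    """Cosine distance between frozen embeddings of two molecules:
    0 for identical directions, 1 for orthogonal ones. Lower distance
    suggests greater structural similarity in embedding space."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(1.0 - a @ b)
```

In a virtual-screening loop, candidate molecules would then be ranked by ascending distance to the embedding of a known active compound.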
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator
🔥 Citations:
0
Abstract: The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, the collection of such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by their reliance on pre-existing data, poor scalability, single-domain focus, and lack of multilingual support. We present STELLAR-E, a fully automated system to generate high-quality synthetic datasets of custom size, using minimal human input and without depending on existing datasets. The system is structured in two stages: (1) we modify the TGRT Self-Instruct framework to create a synthetic data engine that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics assesses the applicability of the synthetic dataset for LLM-based application evaluations. The synthetic datasets reach an average difference of +5.7% in LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.
A2DEPT: Large Language Model-Driven Automated Algorithm Design via Evolutionary Program Trees
🔥 Citations:
0
Abstract: Designing heuristics for combinatorial optimization problems (COPs) is a fundamental yet challenging task that traditionally requires extensive domain expertise. Recently, Large Language Model (LLM)-based Automated Heuristic Design (AHD) has shown promise in autonomously generating heuristic components with minimal human intervention. However, most existing LLM-based AHD methods enforce fixed algorithmic templates to ensure executability, which confines the search to component-level tuning and limits system-level algorithmic expressiveness. To enable open-ended solver synthesis beyond rigid templates, we propose Automated Algorithm Design via Evolutionary Program Trees (A2DEPT), which treats LLMs as system-level algorithm architects. A2DEPT explores the vast program space via a tree-structured evolutionary search with hybrid selection and hierarchical operators, enabling iterative refinement of complete algorithms. To make open-ended generation practical, we enforce executability with a lightweight program-maintenance loop that performs feedback-driven repair. In experiments, A2DEPT consistently outperforms representative LLM-based baselines on both standard and highly constrained benchmarks. On the standard benchmarks, it reduces the mean normalized optimality gap by 9.8% relative to the strongest competing AHD baseline.
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
🔥 Citations:
0
Abstract: Autonomous AI agents extend large language models into full runtime systems that load skills, ingest external content, maintain memory, plan multi-step actions, and invoke privileged tools. In such systems, security failures rarely remain confined to a single interface; instead, they can propagate across initialization, input processing, memory, decision-making, and execution, often becoming apparent only when harmful effects materialize in the environment. This paper presents AgentWard, a lifecycle-oriented, defense-in-depth architecture that systematically organizes protection across these five stages. AgentWard integrates stage-specific, heterogeneous controls with cross-layer coordination, enabling threats to be intercepted along their propagation paths while safeguarding critical assets. We detail the design rationale and architecture of five coordinated protection layers, and implement a plugin-native prototype on OpenClaw to demonstrate practical feasibility. This perspective provides a concrete blueprint for structuring runtime security controls, managing trust propagation, and enforcing execution containment in autonomous AI agents. Our code is available at https://github.com/FIND-Lab/AgentWard .
Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
🔥 Citations:
0
Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenges (i.e., occlusion and small objects), which impair object semantics and degrade grounding performance. In contrast, language expressions are immune to such degradation and preserve object semantics. In light of these observations, we propose a novel method that overcomes such constraints by leveraging Language-Guided Semantic Cues (LGSCs). Specifically, our approach introduces a Semantic Cue Extractor (SCE) to derive semantic cues of objects from the visual pipeline of an MLLM. We then guide these cues using corresponding text embeddings to produce LGSCs as linguistic semantic priors. Subsequently, they are reintegrated into the original visual pipeline to refine object semantics. Extensive experiments and analyses demonstrate that incorporating LGSCs into an MLLM effectively improves grounding accuracy in crowded scenes.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
🔥 Citations:
0
Abstract: Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose LearnPruner, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM's middle layer. Experimental results show that our LearnPruner can preserve approximately 95% of the original performance while using only 5.5% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.
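The second stage described above (retaining only task-relevant tokens via text-to-vision attention at a middle layer) can be illustrated with a toy top-k selection. This is a hedged, pure-Python stand-in, not the paper's implementation; real pipelines operate on attention tensors, and the name `prune_vision_tokens` is illustrative:

```python
# Toy sketch: keep only the vision tokens receiving the highest
# text-to-vision attention mass, at a fixed retention ratio.
def prune_vision_tokens(attn_scores, keep_ratio=0.055):
    """attn_scores[i] = text-to-vision attention mass on vision token i.

    Returns the indices of retained tokens, preserving original order,
    keeping roughly keep_ratio of the sequence (at least one token).
    """
    k = max(1, round(len(attn_scores) * keep_ratio))
    # Rank tokens by attention mass, take the top k, restore positional order.
    top = sorted(range(len(attn_scores)),
                 key=lambda i: attn_scores[i], reverse=True)[:k]
    return sorted(top)
```

For example, with four tokens and a 50% budget, the two highest-scoring token positions survive in their original order.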
MIMIC: A Generative Multimodal Foundation Model for Biomolecules
🔥 Citations:
0
Abstract: Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split-track encoder-decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC's sequence reconstruction relative to sequence-only inputs, while its learned representations enable state-of-the-art performance on RNA and protein downstream tasks. MIMIC achieves state-of-the-art splicing prediction, and its joint generative formulation enables isoform-aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice-disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD-L1 and hACE2 binding sites produces diverse, high-confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay-dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC's aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.
Comparative Evaluation of Modern Deep Learning Methodologies for Portfolio Optimization
🔥 Citations:
0
Abstract: This study proposes a portfolio optimization framework that integrates advanced deep learning architectures with traditional financial models to enhance risk-adjusted performance. Using historical data from 2015-2023 across equities, ETFs, and bonds, the research evaluates the predictive power of Graph Neural Networks (GNNs), Deep Reinforcement Learning (DRL), Transformers, and Autoencoders. The models jointly address covariance estimation, return forecasting, dynamic asset allocation, and dimensionality reduction. Hybrid approaches such as Transformer+GNN and Autoencoder+DRL are also explored to capture both relational and temporal market structures. Performance is assessed through backtesting using metrics including volatility, cumulative return, maximum drawdown, annualized return, and Sharpe ratio across seven strategies, including Equal-Weighted, 60/40 allocation, and Mean-Variance Optimization (MVO). Results show that hybrid models provide superior stability and risk control, with Transformer+GNN achieving the lowest volatility and drawdown. MVO, when paired with well-calibrated inputs, delivers the highest cumulative return and Sharpe ratio, highlighting the continued relevance of traditional methods. Standalone DRL underperforms due to limited structural awareness, while Autoencoders exhibit behavior similar to Equal-Weight strategies, emphasizing the need for dynamic policy learning. These findings align with existing literature on relational modeling and feature compression in finance. Overall, the study demonstrates that combining deep learning with financial theory yields robust and adaptive portfolio strategies and suggests exploring latent representations within traditional optimization frameworks to improve scalability and performance.
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
🔥 Citations:
0
Abstract: Large language models are widely used for code generation, yet they rely on an implicit assumption that the task descriptions are sufficiently detailed and well-formed. However, in practice, users may provide defective descriptions, which can have a strong effect on code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small model that has been parameter-efficiently finetuned, to automatically detect task description defects. We evaluate SpecValidator on three types of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) on three benchmarks with task descriptions of varying structure and complexity. Our results show that SpecValidator achieves defect detection of F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs to task description defects depends primarily on the type of defect and the characteristics of the task description, rather than the capacity of the model, with Under-Specification defects being the most severe. We further found that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.
Scaling on-device spoken language understanding to new languages with large language models
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction
🔥 Citations:
0
Abstract: Recent work on large language models (LLMs) has emphasized the importance of scaling inference compute. From this perspective, the state-of-the-art method Scattered Forest Search (SFS) has been proposed, employing Monte Carlo Tree Search with carefully crafted initial seeds and textual optimization for multi-turn code correction. However, its complexity makes it unclear what factors contribute to improvements in inference performance. To address this problem, we analyze SFS and propose a simpler method, Iterative Refinement of Textual Directions (IRTD), which fixes initial codes and iteratively refines textual directions. Because of the simplicity of IRTD, we theoretically establish the safety of IRTD using Oracle-Guided Inductive Synthesis (OGIS). Experiments on several code generation benchmarks suggest that IRTD achieves inference performance comparable to state-of-the-art methods. These results indicate that, even without complex search structures, refining initial codes with high-quality textual directions alone can effectively improve inference performance.
AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
🔥 Citations:
0
Abstract: Embodied AI agents increasingly rely on large language models (LLMs) for planning, yet per-step LLM calls impose severe latency and cost. In this paper, we show that embodied tasks exhibit strong plan locality, where the next plan is largely predictable from the current one. Building on this, we introduce AgenticCache, a planning framework that reuses cached plans to avoid per-step LLM calls. In AgenticCache, each agent queries a runtime cache of frequent plan transitions, while a background Cache Updater asynchronously calls the LLM to validate and refine cached entries. Across four multi-agent embodied benchmarks, AgenticCache improves task success rate by 22% on average across 12 configurations (4 benchmarks x 3 models), reduces simulation latency by 65%, and lowers token usage by 50%. Cache-based plan reuse thus offers a practical path to low-latency, low-cost embodied agents. Code is available at https://github.com/hojoonleokim/MLSys26_AgenticCache.
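The cache-then-fall-back pattern the abstract describes can be sketched minimally. This is an assumption-laden illustration, not the AgenticCache implementation: `PlanCache` and `llm_planner` are hypothetical names, and the asynchronous background validation is reduced here to a synchronous cache fill:

```python
# Minimal sketch of cache-driven plan reuse: consult a cache of
# (current plan -> next plan) transitions first, and only fall back
# to an expensive LLM call on a miss.
class PlanCache:
    def __init__(self, llm_planner):
        self.transitions = {}       # current plan -> cached next plan
        self.llm_planner = llm_planner
        self.hits = 0
        self.misses = 0

    def next_plan(self, current_plan):
        if current_plan in self.transitions:
            self.hits += 1          # plan locality: reuse without an LLM call
            return self.transitions[current_plan]
        self.misses += 1
        plan = self.llm_planner(current_plan)   # per-step LLM call (expensive)
        self.transitions[current_plan] = plan   # the background updater would
        return plan                             # later re-validate this entry
```

The point of the sketch is that once a transition is cached, repeated steps hit the dictionary instead of the model, which is where the latency and token savings come from.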
DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models
🔥 Citations:
0
Abstract: Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform Process Reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein and molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM-DLLM.
Disagreement as Signals: Dual-view Calibration for Sequential Recommendation Denoising
🔥 Citations:
0
Abstract: Sequential recommendation seeks to model the evolution of user interests by capturing temporal user intent and item-level transition patterns. Transformer-based recommenders demonstrate a strong capacity for learning long-range and interpretable dependencies, yet remain vulnerable to behavioral noise that is misaligned with users' true preferences. Recent large language model (LLM)-based approaches attempt to denoise interaction histories through static semantic editing. Such methods neglect the learning dynamics of recommendation models and fail to account for the evolving nature of user interests. To address this limitation, we propose a Dual-view Calibration framework for Sequential Recommendation denoising (DC4SR). Specifically, we introduce a semantic prior, derived from an LLM fine-tuned via labeled historical interactions, to estimate the noise distribution from a semantic perspective. From the learning perspective, we further employ a model-side posterior that infers the noise distribution based on the model's learning dynamics. The disagreement between the two distributions is then leveraged to jointly refine semantic understanding and learning-aware model-side representations. Through iterative updates, dynamic dual-view calibration is achieved for both the global semantic prior and the model-side posterior, enabling consistent alignment with evolving user interests. Extensive experiments demonstrate that DC4SR consistently outperforms strong Transformer-based recommenders and LLM-based denoising methods, exhibiting enhanced robustness across training stages and noise conditions.
Robust Deepfake Detection, NTIRE 2026 Challenge: Report
🔥 Citations:
1
Abstract: Robustness is a long-overlooked problem in deepfake detection. However, detection performance is nearly worthless in the real world if it suffers under exposure to even slight image degradation. In addition to weaker degradations that can accidentally occur in the image processing pipeline, there is another risk of malicious deepfakes that specifically introduce degradations, purposefully exploiting the detector's weaknesses in that regard. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses that problem. Participants were tasked with building a detector that would later be tested on an unknown test-set, which included both common and uncommon degradations of various strengths. With a total number of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24h to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test-set to detect any such overfitting. This report presents the competition setting, dataset preparation, as well as details and performance of methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
🔥 Citations:
0
Abstract: Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, or introduce severe latency via complex online interventions. To overcome this dichotomy, we present Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play inference-time defense requiring no parameter updates, external clean data, or auxiliary generation. TIGS leverages the observation that successful backdoor triggers consistently induce localized attention collapse within the semantic content region. Operating entirely within the native forward pass, TIGS first performs content-aware tail-risk screening to identify suspicious attention heads and rows using sample-internal signals. It then applies intrinsic geometric smoothing: a weak content-domain correction preserves semantic anchoring, while a stronger full-row contraction disrupts trigger-dominant routing. Finally, a controlled full-row write-back reconstructs the attention matrix to ensure inference stability. Extensive evaluations demonstrate that TIGS substantially suppresses attack success rates while strictly preserving clean reasoning and open-ended semantic consistency. Crucially, this favorable security-utility-latency equilibrium persists across diverse architectures, including dense, reasoning-oriented, and sparse mixture-of-experts models. By structurally disrupting adversarial routing with marginal latency overhead, TIGS establishes a highly practical, deployment-ready defense standard for state-of-the-art LLMs.
Continual Calibration: Coverage Can Collapse Before Accuracy in Lifelong LLM Fine-Tuning
🔥 Citations:
0
Abstract: Continual learning for large language models is typically evaluated through accuracy retention under sequential fine-tuning. We argue that this perspective is incomplete, because uncertainty reliability can degrade earlier and more sharply than top-1 performance. We study this empirically by measuring conformal coverage and calibration error on sequentially fine-tuned models across three model families and eight task sequences drawn primarily from classification and multiple-choice benchmarks. Across the classification-style settings we study, coverage loss exceeds accuracy loss by a factor of roughly \(3.4\times \pm 0.5\times\) on average across seeds; in the most pronounced case, coverage drops from \(0.92\) to \(0.61\), while accuracy remains within three points of baseline. Standard continual-learning methods that preserve accuracy do not automatically preserve coverage, and naive calibration baselines recover only part of the gap. We propose calibration replay, a lightweight post-hoc procedure that maintains a task-specific held-out buffer and refits a task-specific conformal threshold under the current model after each update. It adds no training-time gradient cost, uses less than one percent of the memory of ordinary experience replay, and typically restores coverage to within two points of nominal at buffer size \(m = 200\). We accompany the empirical study with a drift decomposition, a finite-sample recovery theorem showing exact conformal validity under exchangeability, and a mixture-validity proposition explaining why pooled thresholds do not suffice. Our guarantees are stated for classification-style tasks with task-specific buffers; extensions to open-ended generation are exploratory.
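The threshold-refitting step at the heart of calibration replay can be illustrated with the standard split-conformal quantile. This is a generic sketch of that textbook construction under the abstract's classification setting, not the authors' code; `conformal_threshold` and `prediction_set` are illustrative names:

```python
import math

# Split-conformal sketch: after each model update, refit a per-task
# threshold on a small held-out buffer of m nonconformity scores, then
# form prediction sets that include every label scoring below it.
def conformal_threshold(nonconformity_scores, alpha=0.1):
    """Finite-sample-valid quantile: the ceil((m+1)(1-alpha))-th smallest
    score in the calibration buffer (clipped to the buffer size)."""
    m = len(nonconformity_scores)
    k = math.ceil((m + 1) * (1 - alpha))
    return sorted(nonconformity_scores)[min(k, m) - 1]

def prediction_set(label_scores, threshold):
    """Keep every candidate label whose nonconformity score is <= threshold."""
    return {label for label, s in label_scores.items() if s <= threshold}
```

Refitting only this threshold after each update is cheap (a sort over the buffer), which is consistent with the abstract's claim of no training-time gradient cost.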
Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
🔥 Citations:
0
Abstract: While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations
🔥 Citations:
0
Abstract: Driving in compliance with traffic laws and regulations is a basic requirement for human drivers, yet autonomous vehicles (AVs) can violate these requirements in diverse real-world scenarios. To encode law compliance into AV systems, conventional approaches use formal logic languages to explicitly specify behavioral constraints, but this process is labor-intensive, hard to scale, and costly to maintain. With recent advances in artificial intelligence, it is promising to leverage large language models (LLMs) to derive legal requirements from traffic laws and regulations. However, without explicitly grounding and reasoning in structured traffic scenarios, LLMs often retrieve irrelevant provisions or miss applicable ones, yielding imprecise requirements. To address this, we propose a novel pipeline that grounds LLM reasoning in a traffic scenario taxonomy through node-wise anchors that encode hierarchical semantics. On Chinese traffic laws and the OnSite dataset (5,897 scenarios), our method improves law-scenario matching by 29.1% and increases the accuracy of derived mandatory and prohibitive requirements by 36.9% and 38.2%, respectively. We further demonstrate real-world applicability by constructing a law-compliance layer for AV navigation and developing an onboard, real-time compliance monitor for in-field testing, providing a solid foundation for future AV development, deployment, and regulatory oversight.
AdapTime: Enabling Adaptive Temporal Reasoning in Large Language Models
🔥 Citations:
0
Abstract: Large language models have demonstrated strong reasoning capabilities in general knowledge question answering. However, their ability to handle temporal information remains limited. To address this limitation, existing approaches often involve external tools or manual verification and are tailored to specific scenarios, leading to poor generalizability. Moreover, these methods apply a fixed pipeline to all questions, overlooking the fact that different types of temporal questions require distinct reasoning strategies, which leads to unnecessary processing for simple cases and inadequate reasoning for complex ones. To this end, we propose AdapTime, an adaptive temporal reasoning method that dynamically executes reasoning steps based on the input context. Specifically, it involves three temporal reasoning actions: reformulate, rewrite and review, with an LLM planner guiding the reasoning process. AdapTime integrates seamlessly with state-of-the-art LLMs and significantly enhances their temporal reasoning capabilities without relying on external support. Extensive experiments demonstrate the effectiveness of our approach.
K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
🔥 Citations:
0
Abstract: The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning of charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap where models hallucinate logic despite correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench .
PeeriScope: A Multi-Faceted Framework for Evaluating Peer Review Quality
🔥 Citations:
0
Abstract: The increasing scale and variability of peer review in scholarly venues has created an urgent need for systematic, interpretable, and extensible tools to assess review quality. We present PeeriScope, a modular platform that integrates structured features, rubric-guided large language model assessments, and supervised prediction to evaluate peer review quality along multiple dimensions. Designed for openness and integration, PeeriScope provides both a public interface and a documented API, supporting practical deployment and research extensibility. The demonstration illustrates its use for reviewer self-assessment, editorial triage, and large-scale auditing, and it enables the continued development of quality evaluation methods within scientific peer review. PeeriScope is available both as a live demo at https://app.reviewer.ly/app/peeriscope and via API services at https://github.com/Reviewerly-Inc/Peeriscope.
The Chameleon's Limit: Investigating Persona Collapse and Homogenization in Large Language Models
🔥 Citations:
0
Abstract: Applications based on large language models (LLMs), such as multi-agent simulations, require population diversity among agents. We identify a pervasive failure mode we term "Persona Collapse": agents each assigned a distinct profile nonetheless converge into a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe persona collapse along two axes: (1) Dimensions: a model can appear diverse on one axis yet structurally degenerate on another, and (2) Domains: the same model may collapse the most in personality yet be the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations. We release our toolkit and data to support population-level evaluation of LLMs.
Coverage-Based Calibration for Post-Training Quantization via Weighted Set Cover over Outlier Channels
🔥 Citations:
0
Abstract: Post-Training Quantization (PTQ) compresses large language models to low bit-widths using a small calibration set, and its quality depends strongly on which samples are chosen. We identify a failure mode in which calibration samples fail to activate outlier channels (hidden dimensions with unusually large activations), causing the quantizer to underestimate their dynamic range and producing per-channel reconstruction errors that dominate layer-wise loss. Motivated by this observation, we argue that PTQ calibration quality is governed more by weighted outlier-channel coverage than by generic sample representativeness, and formulate calibration selection as a weighted set cover problem over outlier channels. The objective is monotone submodular, and the greedy algorithm, COVERCAL, operates on pre-computed activation statistics and requires no GPU time at selection. We further show that the weight choice is internally consistent: under a stylized clipping model, missed weighted coverage upper-bounds surrogate loss, justifying the weighted coverage objective as principled rather than purely empirical. Across LLaMA-2, LLaMA-3, and Mistral, under AWQ and GPTQ backends and five downstream evaluations, COVERCAL improves over random, max-perplexity, max-activation-variance, and stratified baselines, with the largest gains at small calibration budgets. At INT4 with 128 samples, COVERCAL improves MMLU by 1.2 to 1.5 points over random calibration and reduces perplexity degradation by 15 to 30%; with 64 samples, it matches or exceeds random calibration at 256. The contribution is not a new PTQ backend but a formulation of calibration selection as weighted outlier coverage, with a simple, efficient algorithm and a surrogate-based justification.
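The greedy weighted set-cover selection the abstract attributes to COVERCAL follows the standard submodular-greedy recipe, which can be sketched as below. This is a minimal illustration under assumed inputs (per-sample sets of activated outlier channels and per-channel weights); the name `select_calibration` and the data layout are hypothetical, not the paper's API:

```python
# Greedy weighted set cover over outlier channels: at each step, pick the
# calibration sample whose uncovered channels carry the most total weight.
def select_calibration(activations, channel_weights, budget):
    """activations[i] is the set of outlier channels sample i activates;
    channel_weights maps channel id -> importance weight.

    Returns (chosen sample indices in pick order, covered channel set)."""
    covered = set()
    chosen = []
    remaining = set(range(len(activations)))
    for _ in range(budget):
        best, best_gain = None, 0.0
        for i in remaining:
            # Marginal gain = total weight of newly covered channels.
            gain = sum(channel_weights[c] for c in activations[i] - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:       # no sample adds new weighted coverage
            break
        chosen.append(best)
        covered |= activations[best]
        remaining.remove(best)
    return chosen, covered
```

Because the objective is monotone submodular, this greedy rule inherits the usual (1 - 1/e) approximation guarantee, and it touches only precomputed statistics, consistent with the "no GPU time at selection" claim.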
Kwai Summary Attention Technical Report
🔥 Citations:
0
Abstract: Long-context ability has become one of the most important iteration directions for next-generation Large Language Models, particularly in semantic understanding/reasoning, code agentic intelligence, and recommendation systems. However, standard softmax attention exhibits quadratic time complexity with respect to sequence length. As the sequence length increases, this incurs substantial overhead in long-context settings, causing the training and inference costs of extremely long sequences to deteriorate rapidly. Existing solutions mitigate this issue through two technical routes: i) reducing the KV cache per layer, for example via head-level compression (GQA) or embedding-dimension-level compression (MLA), although the KV cache remains linearly dependent on the sequence length at a 1:1 ratio; ii) interleaving with KV-cache-friendly architectures, such as local attention (SWA) or linear kernels (GDN), which often involve trade-offs between KV cache size and long-context modeling effectiveness. Beyond these two routes, we argue that there exists an intermediate path not well explored: maintaining a linear relationship between the KV cache and sequence length, but performing semantic-level compression at a specific ratio $k$. This $O(n/k)$ path does not pursue a "minimum KV cache", but rather trades acceptable memory costs for complete, referential, and interpretable retention of long-distance dependencies. Motivated by this, we propose Kwai Summary Attention (KSA), a novel attention mechanism that reduces sequence modeling cost by compressing historical contexts into learnable summary tokens.
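The O(n/k) idea above can be made concrete with a toy: compress the token history at a fixed ratio k by summarizing each block of k vectors into one. Mean pooling stands in here for the learnable summary tokens the report describes, and `compress_kv` is an illustrative name, not KSA's actual mechanism:

```python
# Toy semantic-level compression at ratio k: a history of n vectors becomes
# ceil(n/k) block summaries, so cache size stays linear in n but shrinks by k.
def compress_kv(history, k):
    """history is a list of equal-length vectors (lists of floats);
    returns one mean-pooled summary vector per block of k vectors."""
    summaries = []
    for start in range(0, len(history), k):
        block = history[start:start + k]
        dim = len(block[0])
        summaries.append(
            [sum(v[d] for v in block) / len(block) for d in range(dim)]
        )
    return summaries
```

Unlike sliding-window schemes, every block of the distant past keeps a representative in the compressed cache, which is the "complete, referential" retention the abstract argues for.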
How Personal Characteristics Shape User Exploration of Diverse Movie Recommendations with a LLM-Based Multi-Agent System
🔥 Citations:
0
Abstract: Diversity is an important evaluation criterion for recommender systems beyond accuracy, yet users differ in their willingness to engage with novel and diverse content. In this work, we investigate how a Large Language Model (LLM)-based multi-agent system supports users' exploration of diverse recommendations, and how individual characteristics shape user experiences. We conducted a between-subjects user study (N = 100) comparing a single-agent system (baseline) with a multi-agent system for movie recommendations. We measured perceived accuracy, diversity, novelty, and overall rating, and examined the influence of personal characteristics, including personality traits, demographics, GenAI recommendation experience, and GenAI skepticism. Results show that the multi-agent system significantly increases perceived novelty and Shannon diversity. Conscientiousness is positively associated with perceived accuracy and diversity, whereas extraversion is negatively associated with perceived diversity. Prior experience with GenAI-based recommendations is positively associated with Shannon diversity, while skepticism toward GenAI is negatively associated with it. We also observe significant interaction effects between system design and user characteristics. These findings highlight the importance of personality-aware conversational recommender systems and caution against one-size-fits-all multi-agent designs.
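The Shannon diversity measured above is the Shannon entropy of the recommended items' category distribution; a minimal sketch over hypothetical movie genres (the study's exact item categorization is an assumption here):

```python
import math
from collections import Counter

def shannon_diversity(items):
    """Shannon entropy (bits) of the category distribution of a
    recommendation list; higher means a more even spread over categories."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

uniform = ["drama", "comedy", "horror", "sci-fi"]  # evenly spread genres
skewed = ["drama", "drama", "drama", "comedy"]     # concentrated on one genre
print(shannon_diversity(uniform))  # 2.0 bits
print(shannon_diversity(skewed))   # lower entropy
```

A uniform list over k genres attains the maximum of log2(k) bits, so the metric rewards even exploration rather than sheer list length.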
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
🔥 Citations:
0
Abstract: Emergent intelligence has played a major role in modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study develops a mathematical approach to formalizing emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function $\mathcal{E}(N, P, K)$, dependent on data size N, model size P, and training steps K, to quantify intelligent behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as the existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove necessary and sufficient conditions for the existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools from Lipschitz operator theory and covering numbers. Theoretical results show that: 1) emergent intelligence is governed by three key factors (training steps, data size, and model architecture), where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings; 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
🔥 Citations:
0
Abstract: The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an E(fficiency)-subset quantifying performance through the dual lenses of temporal latency and token consumption; and (4) a R(obustness)-subset that applies cross-modal disturbances to both visual and textual inputs. Additionally, we provide an automated analysis tool to generate human-readable diagnostic reports. We conduct an extensive evaluation of 22 popular OS agents using OS-SPEAR. Our empirical results reveal critical insights into the current landscape: notably, a prevalent trade-off between efficiency and safety or robustness, the performance superiority of specialized agents over general-purpose models, and varying robustness vulnerabilities across different modalities. By providing a multidimensional ranking and a standardized evaluation framework, OS-SPEAR offers a foundational resource for developing the next generation of reliable and efficient OS agents. The dataset and codes are available at https://github.com/Wuzheng02/OS-SPEAR.
Stabilizing Efficient Reasoning with Step-Level Advantage Selection
🔥 Citations:
0
Abstract: Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, many approaches are post-trained under a much shorter context window than base-model training, a factor whose effect has not been systematically isolated. We first show that short-context post-training alone, using standard GRPO without any length-aware objective, already induces substantial reasoning compression, but at the cost of increasingly unstable training dynamics and accuracy degradation. To address this, we propose Step-level Advantage Selection (SAS), which operates at the reasoning-step level and assigns a zero advantage to low-confidence steps in correct rollouts and to high-confidence steps in verifier-failed rollouts, where failures often arise from truncation or verifier issues rather than incorrect reasoning. Across diverse mathematical and general reasoning benchmarks, SAS improves average Pass@1 accuracy by 0.86 points over the strongest length-aware baseline while reducing average reasoning length by 16.3%, yielding a better accuracy-efficiency trade-off.
Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style
🔥 Citations:
0
Abstract: Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study ($n=81$) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants' unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text remains closer in style to LLM text than to participants' unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity: post-edited text is often perceived as representative of participants' personal style despite retaining detectable LLM stylistic traces.
Grounding Before Generalizing: How AI Differs from Humans in Causal Transfer
🔥 Citations:
0
Abstract: Extracting abstract causal structures and applying them to novel situations is a hallmark of human intelligence. While Large Language Models (LLMs) and Vision Language Models (VLMs) have shown strong performance on a wide range of reasoning tasks, their capacity for interactive causal learning -- inducing latent structures through sequential exploration and transferring them across contexts -- remains uncharacterized. Human learners accomplish such transfer after minimal exposure, whereas classical Reinforcement Learning (RL) agents fail catastrophically. Whether state-of-the-art Artificial Intelligence (AI) models possess human-like mechanisms for abstract causal structure transfer is an open question. Using the OpenLock paradigm requiring sequential discovery of Common Cause (CC) and Common Effect (CE) structures, here we show that models exhibit fundamentally delayed or absent transfer: even successful models require initial environmental-specific mapping -- what we term environmental grounding -- before efficiency gains emerge, whereas humans leverage prior structural knowledge from the very first solution attempt. In the text-only condition, models matched or exceeded human discovery efficiency. In contrast, visual information -- in both the image-only and text-and-image conditions -- overall degraded rather than enhanced performance, revealing a broad reliance on symbolic processing rather than integrated multimodal reasoning. Models further exhibited systematic CC/CE asymmetries absent in humans, suggesting heuristic biases rather than direction-neutral causal abstraction. These findings reveal that large-scale statistical learning does not produce the decontextualized causal schemas underpinning human analogical reasoning, establishing grounding-dependent transfer as a fundamental limitation of current LLMs and VLMs.
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
🔥 Citations:
0
Abstract: Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
Culture-Aware Machine Translation in Large Language Models: Benchmarking and Investigation
🔥 Citations:
0
Abstract: Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models' recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.
Context-Aware Hospitalization Forecasting Evaluations for Decision Support using LLMs
🔥 Citations:
0
Abstract: Medical and public health experts must make real-time resource decisions, such as expanding hospital bed capacity, based on projected hospitalization trends during large-scale healthcare disruptions (e.g., operational failures or pandemics). Forecasting models can assist in this task by analyzing large volumes of resource-related data at the facility level, but they must be reliable for decision-making under real-world data conditions. Recent work shows that large language models (LLMs) can incorporate richer forms of context into numerical forecasting. Whereas traditional models rely primarily on temporal context (i.e., past observations), LLMs can also leverage non-temporal public health context such as demographic, geographic, and population-level features. However, it remains unclear how these models should be used to produce stable or decision-relevant predictions in real-world healthcare settings. To evaluate how LLMs can be effectively used in this setting, we evaluate three approaches across 60 counties with low-, mid-, and high-hospitalization intensities in the United States: direct LLM-based forecasting, classical time-series models, and a context-augmented hybrid pipeline (HybridARX) that incorporates LLM-derived signals into structured models. Because the goal is operational decision-making rather than error minimization alone, we evaluate performance with bias and lead-lag alignment in addition to standard forecasting metrics. Our results show that HybridARX improves over classical ARX by yielding more stable and better-calibrated forecasts, particularly when incorporating noisy contextual signals into structured time-series models. These findings suggest that, in non-stationary healthcare resource forecasting, LLMs are most useful when embedded within structured hybrid models.
Why AI Harms Can't Be Fixed One Identity at a Time: What 5300 Incident Reports Reveal About Intersectionality
🔥 Citations:
1
Abstract: AI risk assessment is the primary tool for identifying harms caused by AI systems. These include intersectional harms, which arise from the interaction between identity categories (e.g., class and skin tone) and which do not occur, or occur differently, when those categories are considered separately. Yet existing AI risk assessments are still built around isolated identity categories, and when intersections are considered, they focus almost exclusively on race and gender. Drawing on a large-scale analysis of documented AI incidents, we show that AI harms do not occur one identity category at a time. Using a structured rubric applied with a Large Language Model (LLM), we analyze 5,300 reports from 1,200 documented incidents in the AI Incident Database, the most curated source of incident data. From these reports, we identify 1,513 harmed subjects and their associated identity categories, achieving 98% accuracy. At the level of individual categories, we find that age and political identity appear in documented AI harms at rates comparable to race and gender. At the level of intersecting categories, harm is amplified up to three times at specific intersections: adolescent girls, lower-class people of color, and upper-class political elites. We argue that intersectionality should be a core component of AI risk assessment to more accurately capture how harms are produced and distributed across social groups.
Mapping the Panorama of Enterprise Digital Transformation: An LLM-Driven Variable Relational Network Perspective
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction
🔥 Citations:
0
Abstract: Pathology foundation models (PFMs) have recently emerged as powerful pretrained encoders for computational pathology, enabling transfer learning across a wide range of downstream tasks. However, systematic comparisons of these models for clinically meaningful prediction problems remain limited, especially in the context of survival prediction under external validation. In this study, we benchmark widely used and recently proposed PFMs for breast cancer survival prediction from whole-slide histopathology images. Using a standardized pipeline based on patch-level feature extraction and a unified survival modeling framework, we evaluate model representations across three independent clinical cohorts comprising more than 5,400 patients with long-term follow-up. Models are trained on one cohort and evaluated on two independent external cohorts, enabling a rigorous assessment of cross-dataset generalization. Overall, H-optimus-1 achieves the strongest survival prediction performance. More broadly, we observe consistent generational improvements across model families, with second-generation PFMs outperforming their first-generation counterparts. However, absolute performance differences between many recent PFMs remain modest, suggesting diminishing returns from further scaling of pretraining data or model size alone. Notably, the compact distilled model H0-mini slightly outperforms its larger teacher model H-optimus-0, despite using fewer than 8% of the parameters and enabling significantly faster feature extraction. Together, these results provide the first large-scale, externally validated benchmark of PFMs for breast cancer survival prediction, and offer practical guidance for efficient deployment of PFMs in clinical workflows.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
🔥 Citations:
0
Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
Zero-shot Large Language Models for Automatic Readability Assessment
🔥 Citations:
0
Abstract: Unsupervised automatic readability assessment (ARA) methods have important practical and research applications (e.g., ensuring medical or educational materials are suitable for their target audiences). In this paper, we propose a new zero-shot prompting methodology for ARA and present the first comprehensive evaluation of using large language models (LLMs) as an unsupervised ARA method by testing 10 diverse open-source LLMs (e.g., different sizes and developers) on 14 diverse datasets (e.g., different text lengths and languages). Our findings show that our proposed prompting methodology outperforms prior methods on 13 of the 14 datasets. Furthermore, we propose LAURAE, which combines LLM and readability formula scores to improve robustness by capturing both contextual and shallow (e.g., sentence length) features of readability. Our evaluation demonstrates that LAURAE robustly outperforms prior methods across languages, text lengths, and amounts of technical language.
Crystal structure prediction using graph neural combinatorial optimization
🔥 Citations:
0
Abstract: Crystalline materials are widely used in technological applications, yet their discovery remains a significant challenge. As their properties are driven by structure, crystal structure prediction (CSP) methods play a central role in computational approaches aiming to accelerate this process. Previously, CSP has been approached from a combinatorial optimization perspective, with the core challenge of allocating atoms on a fine grid of predefined discrete positions within a unit cell while minimizing their interaction energy. Exact mathematical optimization methods provide guaranteed solutions, but they become computationally expensive for large-scale instances, where the atomic configuration space grows rapidly, particularly in the absence of additional symmetry constraints. In this work, we introduce a neural combinatorial optimization approach to the atom allocation challenge and, subsequently, CSP, based on graph neural networks (GNNs), which can effectively sample from the distribution of feasible structures in an unsupervised manner. We leverage expander graphs to construct computational graphs over discrete positions that capture both short- and long-range interactions between atoms, and employ the Gumbel-Sinkhorn approach to enforce the desired stoichiometry of the generated structures. We demonstrate that our method outperforms classical heuristic approaches and is competitive with a commercial optimization solver across a range of chemical compositions. This enables the use of ever-expanding GPU infrastructure to tackle the inherent combinatorial challenges of CSP, paving the way for scaling beyond current capabilities.
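The Gumbel-Sinkhorn step used to enforce stoichiometry can be illustrated by plain Sinkhorn normalization, which relaxes a hard atom-to-site permutation into a doubly stochastic matrix. A minimal sketch under that simplification (no Gumbel noise or temperature annealing; names and data are illustrative):

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Alternate row/column normalization of exp(logits), converging toward
    a doubly stochastic matrix, i.e., a relaxed atom-to-site assignment."""
    x = np.exp(logits - logits.max())      # subtract max for numerical stability
    for _ in range(n_iters):
        x /= x.sum(axis=1, keepdims=True)  # rows sum to 1
        x /= x.sum(axis=0, keepdims=True)  # columns sum to 1
    return x

rng = np.random.default_rng(0)
p = sinkhorn(rng.normal(size=(4, 4)))
print(p.sum(axis=0))  # each column sums to ~1
print(p.sum(axis=1))  # each row sums to ~1
```

Each row can be read as a soft assignment of one atom over candidate grid positions, with the column constraint preventing two atoms from occupying the same site.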
Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
🔥 Citations:
0
Abstract: Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
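The core scoring step can be sketched as a diagonal Mahalanobis distance on inter-layer hidden-state differences, assuming per-dimension means and variances of those differences were estimated on clean calibration examples. The Ledoit-Wolf aggregation and leave-one-out thresholding from the abstract are omitted, and all names and data are illustrative:

```python
import numpy as np

def layer_scores(hidden_states, mu, var, eps=1e-6):
    """Diagonal Mahalanobis distance per inter-layer transition.

    hidden_states: (L, d) hidden vectors for one input, one row per layer.
    mu, var: (L-1, d) mean/variance of inter-layer deltas on clean data.
    """
    deltas = np.diff(hidden_states, axis=0)  # (L-1, d) inter-layer differences
    z = (deltas - mu) ** 2 / (var + eps)     # per-dimension squared z-scores
    return z.mean(axis=1)                    # one score per layer transition

# Toy demo: standard-normal deltas vs. one anomalous jump at layer 2.
L, d = 4, 8
mu, var = np.zeros((L - 1, d)), np.ones((L - 1, d))
rng = np.random.default_rng(1)
clean = rng.normal(size=(L, d)).cumsum(axis=0)  # deltas ~ N(0, 1)
shifted = clean.copy()
shifted[2] += 5.0                               # inject an anomalous hidden state
print(layer_scores(clean, mu, var))
print(layer_scores(shifted, mu, var))           # transitions around layer 2 spike
```

An off-trajectory hidden state inflates the score on both transitions that touch it, which is what lets a single threshold on the aggregated score flag misbehavior.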
BiMol-Diff: A Unified Diffusion Framework for Molecular Generation and Captioning
🔥 Citations:
0
Abstract: Bridging molecular structures and natural language is essential for controllable design. Autoregressive models struggle with long-range dependencies, while standard diffusion processes apply uniform corruption across positions, which can distort structurally informative tokens. We present BiMol-Diff, a unified diffusion framework for the paired tasks of text-conditioned molecule generation and molecule captioning. Our key component is a token-aware noise schedule that assigns position-dependent corruption based on token recovery difficulty, preserving harder-to-recover substructures during the forward process. On ChEBI-20 and M3-20M, BiMol-Diff improves molecule reconstruction with a 15.4% relative gain in Exact Match and achieves strong captioning results, attaining best BLEU and BERTScore among compared baselines. These results indicate token-aware noising improves fidelity in molecular structure-language modelling.
Fraud Detection in Cryptocurrency Markets with Spatio-Temporal Graph Neural Networks
🔥 Citations:
0
Abstract: Technological advancements in cryptocurrency markets have increased accessibility for investors, but concurrently exposed them to the risks of market manipulations. Existing fraud detection mechanisms typically rely on machine learning methods that treat each financial asset (i.e., token) and its related transactions independently. However, market manipulation strategies are rarely isolated events; rather, they are characterized by coordination, repetition, and frequent transfers among related assets. This suggests that relational structure constitutes an integral component of the signal and can be effectively represented through graphical means. In this paper, we propose three graph construction methods that rely on aggregated hourly market data. The proposed graphs are processed by a unified spatio-temporal Graph Neural Network (GNN) architecture that combines attention-based spatial aggregation with temporal Transformer encoding. We evaluate our methodology on a real-world dataset of pump-and-dump schemes in cryptocurrency markets, spanning a period of over three years. Our comparative results showcase that our graph-based models achieve significant improvements over standard machine learning baselines in detecting anomalous events. Our work highlights that learned market connectivity provides substantial gains for detecting coordinated market manipulation schemes.
Representational Curvature Modulates Behavioral Uncertainty in Large Language Models
🔥 Citations:
0
Abstract: In autoregressive large language models (LLMs), temporal straightening offers an account of how the next-token prediction objective shapes representations. Models learn to progressively straighten the representational trajectory of input sequences across layers, potentially facilitating next-token prediction via linear extrapolation. However, a direct link between this trajectory and token-level behavior has been missing. We provide such a link by relating contextual curvature, a geometric measure of how sharply the representational trajectory bends over recent context, to next-token entropy. Across two models (GPT-2 XL and Pythia-2.8B), contextual curvature is correlated with entropy, and this relationship emerges during training. Perturbation experiments reveal selective dependence: manipulating curvature through trajectory-aligned interventions reliably modulates entropy, while geometrically misaligned perturbations have no effect. Finally, regularizing representations to be straighter during training modestly reduces token-level entropy without degrading validation loss. These results identify trajectory curvature as a task-aligned representational feature that influences behavioral uncertainty in LLMs.
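One standard way to quantify how sharply a representational trajectory bends is the mean turning angle between consecutive difference vectors; a minimal 2D sketch of that idea (the paper's exact curvature definition may differ):

```python
import numpy as np

def contextual_curvature(traj):
    """Mean turning angle (radians) between consecutive difference vectors
    of a hidden-state trajectory; 0 means a perfectly straight path."""
    v = np.diff(traj, axis=0)
    angles = []
    for a, b in zip(v[:-1], v[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.mean(angles))

straight = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
bent = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(contextual_curvature(straight))  # 0.0
print(contextual_curvature(bent))      # pi/2: two right-angle turns
```

A straightened trajectory (curvature near 0) is exactly the regime where linear extrapolation of the last step is a good predictor of the next representation.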
An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress
🔥 Citations:
0
Abstract: As large language models (LLMs) are increasingly deployed in high-stakes and operational settings, evaluation strategies based solely on aggregate accuracy are often insufficient to characterize system reliability. This study proposes a thermodynamically inspired modeling framework for analyzing the stability of LLM outputs under conditions of uncertainty and perturbation. The framework introduces a composite stability score that integrates task utility, entropy as a measure of external uncertainty, and two internal structural proxies: internal integration and aligned reflective capacity. Rather than interpreting these quantities as physical variables, the formulation is intended as an interpretable abstraction that captures how internal structure may modulate the impact of disorder on model behavior. Using the IST-20 benchmarking protocol and associated metadata, we analyze 80 model-scenario observations across four contemporary LLMs. The proposed formulation consistently yields higher stability scores than a reduced utility-entropy baseline, with a mean improvement of 0.0299 (95% CI: 0.0247 to 0.0351). The observed gain is more pronounced under higher entropy conditions, suggesting that the framework captures a form of nonlinear attenuation of uncertainty. We do not claim a fundamental physical law or a complete theory of machine ethics. Instead, the contribution of this work is a compact and interpretable modeling perspective that connects uncertainty, performance, and internal structure within a unified evaluation lens. The framework is intended to complement existing benchmarking approaches and to support ongoing discussions in AI safety, reliability, and governance.
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
🔥 Citations:
0
Abstract: Multi-Objective Alignment aims to align Large Language Models (LLMs) with diverse and often conflicting human values by optimizing multiple objectives simultaneously. Existing methods predominantly rely on static preference weight construction strategies. However, rigidly aligning to fixed targets discards valuable intermediate information, as training responses inherently embody valid preference trade-offs even when deviating from the target. To address this limitation, we propose Meal, i.e., MEta ALigner, a bi-level meta-learning framework enabling bidirectional optimization between preferences and policy responses, generating instructive dynamic preferences for steadier training. Specifically, we introduce a preference-weight-net as a meta-learner to generate adaptive preference weights based on input prompts and update the preference weights as learnable parameters, while the LLM policy acts as a base-learner optimizing response generation conditioned on these preferences with rejection sampling strategy. Extensive empirical results demonstrate that our method achieves superior performance on several multi-objective benchmarks, validating the effectiveness of the dynamic bidirectional preference-policy optimization framework.
System-aware contextual digital twin for ICS anomaly diagnosis
🔥 Citations:
0
Abstract: Industrial Control Systems (ICS) integrate computing, physical processes, and communication to operate critical infrastructures such as power grids, water treatment plants, and oil and gas facilities. As ICS become increasingly targeted by cyberattacks, timely and reliable anomaly diagnosis is essential for protecting operational safety. However, existing ICS anomaly detection approaches face practical limitations: supervised methods require extensive labeled attack data and suffer from class imbalance, while model-based detectors often lack the ability to provide deep insight into the root causes of anomalies, leading to elevated false alarms and making it difficult for operators to initiate a timely response. In this work, we propose a system-aware unsupervised framework for ICS anomaly diagnosis that combines lightweight online detection with contextual explanation. The system identifies deviations from observed normal behaviors without prior knowledge of system topology. To support actionable response, we further couple the detector with a contextual digital twin augmented by a Large Language Model (LLM) to enhance interpretability, which translates detection evidence into grounded diagnostic hypotheses and verification steps for operators. Experiments on public ICS benchmarks demonstrate that the proposed framework achieves real-time detection efficiency and provides consistent, interpretable anomaly diagnoses, enabling low-latency warning and practical deployment in complex industrial environments.
Skill Retrieval Augmentation for Agentic AI
🔥 引用:
0
Abstract: As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
MultiDx: A Multi-Source Knowledge Integration Framework towards Diagnostic Reasoning
🔥 引用:
0
Abstract: Diagnostic prediction and clinical reasoning are critical tasks in healthcare applications. While Large Language Models (LLMs) have shown strong capabilities in commonsense reasoning, they still struggle with diagnostic reasoning due to limited domain knowledge. Existing approaches often rely on internal model knowledge or static knowledge bases, resulting in knowledge insufficiency and limited adaptability, which hinder their capacity to perform diagnostic reasoning. Moreover, these methods focus solely on the accuracy of final predictions, overlooking alignment with standard clinical reasoning trajectories. To this end, we propose MultiDx, a two-stage diagnostic reasoning framework that performs differential diagnosis by analyzing evidence collected from multiple knowledge sources. Specifically, it first generates suspected diagnoses and reasoning paths by leveraging knowledge from web search, SOAP-formatted cases, and a clinical case database. Then it integrates multi-perspective evidence through matching, voting, and differential diagnosis to generate the final prediction. Extensive experiments on two public benchmarks demonstrate the effectiveness of our approach.
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
🔥 引用:
1
Abstract: Large Language Models (LLMs) are increasingly embedded in software engineering (SE) tools, powering applications such as code generation, automated code review, and bug triage. As these LLM-based AI for Software Engineering (AI4SE) systems transition from experimental prototypes to widely deployed tools, the question of what it means to evaluate their behavior reliably has become both critical and unanswered. Unlike traditional SE or machine learning systems, LLM-based tools often produce open-ended, natural language outputs, admit multiple valid answers, and exhibit non-deterministic behavior across runs. These characteristics fundamentally challenge long-standing evaluation assumptions such as the existence of a single ground truth, deterministic outputs, and objective correctness. In this paper, we examine LLM evaluation as a general, task-dependent concept through the lens of SE tasks. We discuss why reliable evaluation is essential for trust, adoption, and meaningful assessment of LLM-based tools, summarize the current state of evaluation practices, and highlight their limitations in realistic AI4SE settings. We then identify key challenges facing current approaches, including the absence of stable ground truth, subjectivity and multi-dimensional quality, evaluation instability due to non-determinism, limitations of automated and model-based evaluation, and fragmentation of evaluation practices. Finally, we outline future directions aimed at advancing LLM evaluation toward more robust, scalable, and trustworthy methodologies, to stimulate discussion on principled evaluation practices that can keep pace with the growing role of LLMs in SE.
What Did They Mean? How LLMs Resolve Ambiguous Social Situations across Perspectives and Roles
🔥 引用:
0
Abstract: People increasingly turn to large language models (LLMs) to interpret ambiguous social situations: a delayed text reply, an unusually cold supervisor, a teacher's mixed signals, or a boundary-crossing friend. Yet in many such cases, no stable interpretation can be verified from the available evidence alone. We study how LLMs respond to these situations across four domains: early-stage romantic relationships, teacher–student dynamics, workplace hierarchies, and ambiguous friendships. Across 72 responses from GPT, Claude, and Gemini, only 9 (12.5%) genuinely preserved uncertainty. The remaining 87.5% produced interpretive closure through recurring pathways including narrative alignment, narrative reversal, normative advice under uncertainty, and hedged language that still supported a single conclusion. We further find that narrator perspective shapes the path to closure: first-person accounts more often elicited alignment, while third-person accounts invited more detached interpretation, even when the underlying situation remained comparable. Together, these findings show that LLMs do not simply assist interpersonal sensemaking; they tend to resolve ambiguity into coherent and actionable narratives. These results suggest that the central risk is not only that LLMs may misinterpret social situations, but that they may make unresolved situations feel prematurely settled. We frame this tendency as a design challenge for uncertainty-preserving social AI.
Multi-Hospital Electronic Health Record Foundation Models Without Data Sharing: A Comparison of Federated Learning and Inference-Time Ensembling
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
🔥 引用:
0
Abstract: While Vision-Language-Action (VLA) models have demonstrated strong zero-shot generalization for robot control, their massive parameter sizes typically necessitate cloud-based deployment. However, cloud deployment introduces network jitter and inference latency, which can induce severe spatiotemporal misalignment in mobile navigation under continuous displacement, so that the stale intents expressed in past ego frames may become spatially incorrect in the current frame and lead to collisions. To address this issue, we propose AsyncShield, a plug-and-play asynchronous control framework. AsyncShield discards traditional black-box time-series prediction in favor of a deterministic physical white-box spatial mapping. By maintaining a temporal pose buffer and utilizing kinematic transformations, the system accurately converts temporal lag into spatial pose offsets to restore the VLA's original geometric intent. To balance intent restoration fidelity and physical safety, the edge adaptation is formulated as a constrained Markov decision process (CMDP). Solved via the PPO-Lagrangian algorithm, a reinforcement learning adapter dynamically trades off between tracking the VLA intent and responding to high-frequency LiDAR obstacle avoidance hard constraints. Furthermore, benefiting from a standardized universal sub-goal interface, domain randomization, and perception-level adaptation via Collision Radius Inflation, AsyncShield operates as a lightweight, plug-and-play module. Simulation and real-world experiments demonstrate that, without fine-tuning any cloud-based foundation models, the framework exhibits zero-shot and robust generalization capabilities, effectively improving the success rate and physical safety of asynchronous navigation.
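The "white-box spatial mapping" idea, as we read it, can be sketched in 2D: a sub-goal the cloud VLA issued in the ego frame at command time is re-expressed in the current ego frame using the two buffered world poses. The function name and the SE(2) simplification are our assumptions; the paper's full kinematics may differ.

```python
import math

# Sketch of converting temporal lag into a spatial pose offset: lift a stale
# ego-frame sub-goal into the world frame at command time t0, then project it
# back into the current ego frame. Poses are (x, y, yaw) in one world frame.
# This 2D SE(2) version is illustrative, not AsyncShield's implementation.

def remap_goal(goal, pose_t0, pose_now):
    x0, y0, th0 = pose_t0
    gx, gy, gth = goal
    # 1) t0 ego frame -> world frame
    wx = x0 + gx * math.cos(th0) - gy * math.sin(th0)
    wy = y0 + gx * math.sin(th0) + gy * math.cos(th0)
    wth = th0 + gth
    # 2) world frame -> current ego frame
    x1, y1, th1 = pose_now
    dx, dy = wx - x1, wy - y1
    ex = dx * math.cos(th1) + dy * math.sin(th1)
    ey = -dx * math.sin(th1) + dy * math.cos(th1)
    return (ex, ey, wth - th1)
```

For example, if the robot has driven 1 m forward since the command was issued, a goal "2 m ahead" in the stale frame correctly becomes "1 m ahead" in the current frame.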
A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations
🔥 引用:
0
Abstract: Fine-tuning unlocks large language models (LLMs) for specialized applications, but its high computational cost often puts it out of reach for resource-constrained organizations. While cloud platforms could provide the needed resources, data privacy concerns make sharing sensitive information with third parties risky. A promising solution is split learning for LLM fine-tuning, which divides the model between clients and a server, allowing collaborative and secure training through exchanged intermediate data, thus enabling resource-constrained participants to adapt LLMs safely. In light of this, a growing body of literature has emerged to advance this paradigm, introducing varied model methods, system optimizations, and privacy defense-attack techniques for split learning. To bring clarity and direction to the field, a comprehensive survey is needed to classify, compare, and critique these diverse approaches. This paper fills the gap by presenting the first extensive survey dedicated to split learning for LLM fine-tuning. We propose a unified, fine-grained training pipeline to pinpoint key operational components and conduct a systematic review of state-of-the-art work across three core dimensions: model-level optimization, system-level efficiency, and privacy preservation. Through this structured taxonomy, we establish a foundation for advancing scalable, robust, and secure collaborative LLM adaptation.
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering
🔥 引用:
1
Abstract: Video-to-text summarization remains underexplored in terms of comprehensive evaluation methods. Traditional n-gram overlap-based metrics and recent large language model (LLM)-based approaches depend heavily on human-written reference summaries, limiting their practicality and sensitivity to nuanced semantic aspects. In this paper, we propose QEVA, a reference-free metric evaluating candidate summaries directly against source videos through multimodal question answering. QEVA assesses summaries along three clear dimensions: Coverage, Factuality, and Chronology. We also introduce MLVU(VS)-Eval, a new annotated benchmark derived from the MLVU dataset, comprising 800 summaries generated from 200 videos using state-of-the-art video-language multimodal models. This dataset establishes a transparent and consistent framework for evaluation. Experimental results demonstrate that QEVA shows higher correlation with human judgments compared to existing approaches, as measured by Kendall's $\tau_b$, $\tau_c$, and Spearman's $\rho$. We hope that our benchmark and metric will facilitate meaningful progress in video-to-text summarization research and provide valuable insights for the development of future evaluation methods.
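QEVA is validated by rank correlation with human judgments via Kendall's tau. For reference, the tau-b variant (which corrects for ties, as needed when scoring scales are discrete) can be computed directly; this is the standard statistic, not the paper's evaluation code.

```python
# Self-contained Kendall's tau-b: concordant minus discordant pairs,
# normalized with tie corrections in each ranking. O(n^2) pairwise version.

def kendall_tau_b(x, y):
    C = D = tx = ty = 0   # concordant, discordant, tied-only-in-x, tied-only-in-y
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both rankings: excluded from both terms
            if dx == 0:
                tx += 1
            elif dy == 0:
                ty += 1
            elif dx * dy > 0:
                C += 1
            else:
                D += 1
    return (C - D) / ((C + D + tx) * (C + D + ty)) ** 0.5
```

Identical rankings give 1.0, reversed rankings give -1.0, and ties shrink the denominator terms so that partially tied data is handled gracefully.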
GAMMAF: A Common Framework for Graph-Based Anomaly Monitoring Benchmarking in LLM Multi-Agent Systems
🔥 引用:
0
Abstract: The rapid integration of Large Language Models (LLMs) into Multi-Agent Systems (MAS) has significantly enhanced their collaborative problem-solving capabilities, but it has also expanded their attack surfaces, exposing them to vulnerabilities such as prompt infection and compromised inter-agent communication. While emerging graph-based anomaly detection methods show promise in protecting these networks, the field currently lacks a standardized, reproducible environment to train these models and evaluate their efficacy. To address this gap, we introduce Gammaf (Graph-based Anomaly Monitoring for LLM Multi-Agent systems Framework), an open-source benchmarking platform. Gammaf is not a novel defense mechanism itself, but rather a comprehensive evaluation architecture designed to generate synthetic multi-agent interaction datasets and benchmark the performance of existing and future defense models. The proposed framework operates through two interdependent pipelines: a Training Data Generation stage, which simulates debates across varied network topologies to capture interactions as robust attributed graphs, and a Defense System Benchmarking stage, which actively evaluates defense models by dynamically isolating flagged adversarial nodes during live inference rounds. Through rigorous evaluation using established defense baselines (XG-Guard and BlindGuard) across multiple knowledge tasks (such as MMLU-Pro and GSM8K), we demonstrate Gammaf's high utility, topological scalability, and execution efficiency. Furthermore, our experimental results reveal that equipping an LLM-MAS with effective attack remediation not only recovers system integrity but also substantially reduces overall operational costs by facilitating early consensus and cutting off the extensive token generation typical of adversarial agents.
SemiSAM-O1: How far can we push the boundary of annotation-efficient medical image segmentation?
🔥 引用:
0
Abstract: Semi-supervised learning (SSL) has become a promising solution to alleviate the annotation burden of deep learning-based medical image segmentation models. While recent advances in foundation model-driven SSL have pushed the boundary to extremely limited annotation scenarios, they fail to maintain robust competitive performance in complex imaging modalities. In this paper, we propose SemiSAM-O1, an annotation-efficient framework using only one annotated template image for segmentation. SemiSAM-O1 extends the specialist-generalist collaborative learning framework to the extreme one-label setting by fully exploiting the foundation model's feature representation capability beyond its prompting interface. SemiSAM-O1 operates in two stages. In the first stage, the foundation model's encoder extracts dense features from all volumes, and class prototypes derived from the single annotated template are propagated to the unlabeled pool via feature similarity to produce coarse initial pseudo-labels. In the second stage, an iterative training-and-refinement loop progressively improves both the segmentation model and the pseudo-labels over multiple rounds, where each round trains the model from scratch on current pseudo-labels and generates updated predictions with voxel-wise uncertainty estimates. An uncertainty-guided refinement step further leverages the foundation model's global feature space to correct high-uncertainty regions by aggregating labels from their most similar confident neighbors, establishing a virtuous cycle of mutual improvement. Extensive experiments on a wide range of segmentation tasks across different modalities and anatomical targets demonstrate that SemiSAM-O1 significantly narrows the performance gap between one-label semi-supervised learning and full supervision, while significantly reducing the computational overhead of online foundation model inference.
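Stage one of SemiSAM-O1, as described, propagates class prototypes from the single annotated template to unlabeled data by feature similarity. A minimal sketch of that step, with prototypes as per-class mean features and nearest-prototype cosine assignment; the helper names and pure-Python shapes are our assumptions, not the paper's implementation.

```python
# Sketch of prototype-based coarse pseudo-labeling: build one mean feature
# vector per class from the single labeled template, then label each
# unlabeled feature by its most cosine-similar prototype.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_prototypes(features, labels):
    """Mean feature vector per class, from the one annotated template."""
    groups = {}
    for f, c in zip(features, labels):
        groups.setdefault(c, []).append(f)
    return {c: [sum(col) / len(fs) for col in zip(*fs)] for c, fs in groups.items()}

def pseudo_label(feature, protos):
    """Assign the class whose prototype is most cosine-similar."""
    def cos(a, b):
        return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5) + 1e-12)
    return max(protos, key=lambda c: cos(feature, protos[c]))
```

In the paper these coarse labels then seed the iterative train-and-refine loop; the sketch stops at the initial assignment.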
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🔥 引用:
0
Abstract: Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
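The allocation idea behind DepthKV can be sketched with one simple rule: split a fixed global token budget across layers in proportion to per-layer sensitivity scores instead of uniformly. The proportional rule and the per-layer floor below are our assumptions; the paper's actual allocator may differ.

```python
# Sketch of sensitivity-weighted KV budget allocation: each layer keeps at
# least `floor` tokens, and the remaining budget is split proportionally to
# sensitivity, with rounding leftovers going to the most sensitive layers.

def allocate_kv_budget(sensitivity, total_tokens, floor=16):
    """Return tokens kept per layer; more sensitive layers keep more."""
    n = len(sensitivity)
    spare = total_tokens - floor * n
    assert spare >= 0, "budget too small for the per-layer floor"
    s = sum(sensitivity)
    alloc = [floor + int(spare * w / s) for w in sensitivity]
    leftover = total_tokens - sum(alloc)
    for i in sorted(range(n), key=lambda i: -sensitivity[i])[:leftover]:
        alloc[i] += 1
    return alloc
```

The global pruning ratio is unchanged (the allocations sum exactly to the budget), which matches the paper's comparison setting of uniform versus layer-dependent allocation at the same total budget.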
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
🔥 引用:
0
Abstract: This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly external, while only its perceived reliability is manipulated (High-Trust vs. Low-Trust). In a human production experiment, native speakers of Turkish show a robust trust effect: High-Trust contexts yield relatively more -DI, whereas Low-Trust contexts yield relatively more -mIs, with the pattern remaining stable across sensitivity analyses. We then evaluate 10 LLMs in three prompting paradigms (open gap-fill, explicit past-tense gap-fill, and forced-choice A/B selection). LLM behavior is highly model- and prompt-dependent: some models show weak or local trust-consistent shifts, but effects are generally unstable, often reversed, and frequently overshadowed by output-compliance problems and strong base-rate suffix preferences. The results provide new evidence for a trust-/commitment-based account of Turkish evidentiality and reveal a clear human-LLM gap in source-sensitive evidential reasoning.
Green Shielding: A User-Centric Approach Towards Trustworthy AI
🔥 引用:
0
Abstract: Large language models (LLMs) are increasingly deployed, yet their outputs can be highly sensitive to routine, non-adversarial variation in how users phrase queries, a gap not well addressed by existing red-teaming efforts. We propose Green Shielding, a user-centric agenda for building evidence-backed deployment guidance by characterizing how benign input variation shifts model behavior. We operationalize this agenda through the CUE criteria: benchmarks with authentic Context, reference standards and metrics that capture true Utility, and perturbations that reflect realistic variations in the Elicitation of model behavior. Guided by the PCS framework and developed with practicing physicians, we instantiate Green Shielding in medical diagnosis through HealthCareMagic-Diagnosis (HCM-Dx), a benchmark of patient-authored queries, together with structured reference diagnosis sets and clinically grounded metrics for evaluating differential diagnosis lists. We also study perturbation regimes that capture routine input variation and show that prompt-level factors shift model behavior along clinically meaningful dimensions. Across multiple frontier LLMs, these shifts trace out Pareto-like tradeoffs. In particular, neutralization, which removes common user-level factors while preserving clinical content, increases plausibility and yields more concise, clinician-like differentials, but reduces coverage of highly likely and safety-critical conditions. Together, these results show that interaction choices can systematically shift task-relevant properties of model outputs and support user-facing guidance for safer deployment in high-stakes domains. Although instantiated here in medical diagnosis, the agenda extends naturally to other decision-support settings and agentic AI systems.
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
🔥 引用:
0
Abstract: Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass$@k$, pass$^k$, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success-time distribution being estimated, test whether traces support that distribution, or quantify finite-trace uncertainty. We present TraceToChain, a reproducible pipeline that fits agent execution traces to an absorbing discrete-time Markov chain (DTMC), $\hat M=(\hat Q,\hat R_\oplus,\hat R_\ominus)$, with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace-smoothed maximum-likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov–Smirnov (KS) goodness-of-fit certificate, and reports Dirichlet-posterior credible intervals and non-parametric bootstrap intervals. We adapt classical reliability mathematics (Kemeny–Snell, Cheung, Goel–Okumoto) to agent traces. The resulting first-passage view reconciles metrics usually reported separately: pass$@k$, pass$^k$, and the RDC are projections of one success-time distribution. On seven controlled MAST-style frameworks with a strict 50/50 fit/test protocol, held-out empirical RDCs overlay their analytic counterparts with max $L_\infty^{\mathrm{RDC}} = 0.053$ (median $0.048$). A two-sample KS test on the first-passage cumulative distribution function (CDF) accepts the fitted chain with $p>0.05$ on $7/7$ frameworks (min $p = 0.78$), and per-entry $95\%$ posterior and bootstrap intervals agree to $\approx 0.01$ at the median.
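The core mechanics the abstract describes (Laplace-smoothed transition estimates from traces, then the success-time CDF of the absorbing chain) can be sketched directly. The state names and tiny trace set are illustrative; TraceToChain's cluster taxonomy, diagnostics, and intervals are not reproduced here.

```python
# Sketch of the absorbing-DTMC view of agent reliability: fit transitions
# from traces with Laplace smoothing, pin absorbing rows, then propagate
# probability mass to get P(success within k steps) for each k.

def fit_dtmc(traces, states, absorbing, alpha=1.0):
    """Laplace-smoothed MLE transition matrix; absorbing states never leave."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[alpha] * n for _ in range(n)]
    for tr in traces:
        for a, b in zip(tr, tr[1:]):
            counts[idx[a]][idx[b]] += 1
    P = [[c / sum(row) for c in row] for row in counts]
    for s in absorbing:
        P[idx[s]] = [1.0 if j == idx[s] else 0.0 for j in range(n)]
    return P

def success_cdf(P, states, start, success, steps):
    """First-passage CDF: P(absorbed in `success` within k steps), k = 1..steps."""
    idx = {s: i for i, s in enumerate(states)}
    dist = [0.0] * len(states)
    dist[idx[start]] = 1.0
    cdf = []
    for _ in range(steps):
        dist = [sum(dist[r] * P[r][c] for r in range(len(states)))
                for c in range(len(states))]
        cdf.append(dist[idx[success]])  # absorbing state accumulates mass
    return cdf
```

Reading the CDF at step $k$ recovers a pass$@k$-style quantity for a single chain, which is the sense in which the abstract calls such metrics projections of one success-time distribution.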
Generating Place-Based Compromises Between Two Points of View
🔥 引用:
0
Abstract: Large Language Models (LLMs) excel academically but struggle with social intelligence tasks, such as creating good compromises. In this paper, we present methods for generating empathically neutral compromises between two opposing viewpoints. We first compared four different prompt engineering methods using Claude 3 Opus and a dataset of 2,400 contrasting views on shared places. A subset of the generated compromises was evaluated for acceptability in a 50-participant study. We found that the best method for generating compromises between two views used external empathic similarity between a compromise and each viewpoint as iterative feedback, outperforming standard Chain of Thought (CoT) reasoning. The results indicate that the use of empathic neutrality improves the acceptability of compromises. The dataset of generated compromises was then used to train two smaller foundation models via margin-based alignment of human preferences, improving efficiency and removing the need for empathy estimation during inference.
RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation
🔥 引用:
0
Abstract: As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While Large Language Models (LLMs) show promise in code generation, their application to hardware modeling faces unique challenges: (1) Rigid, static workflows fail to adapt to varying design complexity, causing inefficiency; (2) Context window overflow in multi-turn interactions leads to catastrophic forgetting of critical specifications; and (3) the Coupled Validation Failure problem--where generated Testbenches (TBs) incorrectly validate flawed models due to correlated hallucinations--severely undermines reliability. To address these limitations, we introduce RefEvo, a dynamic multi-agent framework designed for agile and reliable reference modeling. RefEvo features three key innovations: (1) A Dynamic Design Planner that autonomously decomposes design specifications and constructs tailored execution workflows based on semantic complexity; (2) A Co-Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the model and verification logic against the specification (Spec) oracle, effectively mitigating false positives; and (3) A Spec Anchoring Strategy for lossless context compression. Evaluated on a diverse benchmark of 20 hardware modules, RefEvo achieves a 95% pass rate, outperforming static baselines by a large margin. Furthermore, our context optimization reduces token consumption by an average of 71.04%, achieving absolute savings of over 70,000 tokens per session for complex designs while maintaining 100% specification recall.
Exploring Creativity in Human-Human-LLM Collaborative Software Design
🔥 引用:
0
Abstract: While the use of Large Language Models (LLMs) in programming has been extensively studied, there is limited understanding of how LLMs support collaborative work where creativity plays a central role. Software design, as a collaborative and creative activity, provides a valuable context for exploring the influence of LLMs on creativity. This study investigates how and where creativity naturally emerges when software designers collaborate with an LLM during a design task. In a laboratory setting simulating a workplace environment, 18 pairs of software professionals with design experience were asked to complete a design task. Each pair had 90 minutes to produce a software design based on a set of requirements, with optional access to a custom LLM interface. Pairs were not primed to be creative. We find that creativity was present in the design processes of all pairs, with 13 producing design documents containing creativity. We primarily attribute creativity to the human designers, driven by traits such as prior experience, empathy, and the use of analogies. The LLM contributed by producing novel ideas and elaborating human ideas. However, in some cases, the LLM appeared to hinder creativity by suggesting complex solutions or adding to unproductive digressions. LLMs can support creativity in collaborative software design, but human insights remain central. To effectively augment human creativity, designers must be intentional in their engagement with LLMs.
A Multi-Dimensional Audit of Politically Aligned Large Language Models
🔥 引用:
0
Abstract: As the application of Large Language Models (LLMs) spreads across various industries, there are increasing concerns about the potential for their misuse, especially in sensitive areas such as political discourse. Deliberately aligning LLMs with specific political ideologies, through prompt engineering or fine-tuning techniques, can be advantageous in use cases such as political campaigns, but requires careful consideration due to heightened risks of performance degradation, misinformation, or increased biased behavior. In this work, we propose a multi-dimensional framework inspired by Habermas' Theory of Communicative Action to audit politically aligned language models across four dimensions: effectiveness, fairness, truthfulness, and persuasiveness using automated, quantitative metrics. Applying this to nine popular LLMs aligned via fine-tuning or role-playing revealed consistent trade-offs: while larger models tend to be more effective at role-playing political ideologies and truthful in their responses, they were also less fair, exhibiting higher levels of bias in the form of angry and toxic language towards people of different ideologies. Fine-tuned models exhibited lower bias and more effective alignment than the corresponding role-playing models, but also saw a decline in performance on reasoning tasks and an increase in hallucinations. Overall, all of the models tested exhibited some deficiency in at least one of the four metrics, highlighting the need for more balanced and robust alignment strategies. Ultimately, this work aims to ensure politically-aligned LLMs generate legitimate, harmless arguments, offering a framework to evaluate the responsible political alignment of these models.
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
🔥 引用:
0
Abstract: Large language models are increasingly used for code generation, yet the correctness of their outputs depends not only on model capability but also on how tasks are specified. Prior studies demonstrate that small changes in natural language prompts, particularly under-specification, can substantially reduce code correctness; however, these findings are largely based on minimal-specification benchmarks such as HumanEval and MBPP, where limited structural redundancy may exaggerate sensitivity. In this exploratory study, we investigate how prompt structure, task complexity, and specification richness interact with LLM robustness to prompt mutations. We evaluate 10 different models across HumanEval and the structurally richer LiveCodeBench. Our results reveal that robustness is not a fixed property of LLMs but is highly dependent on prompt structure: the same under-specification mutations that degrade performance on HumanEval have near-zero net effect on LiveCodeBench due to redundancy across descriptions, constraints, examples, and I/O conventions. Surprisingly, we also find that prompt mutations can improve correctness. In LiveCodeBench, under-specification often breaks misleading lexical or structural cues that trigger incorrect retrieval-based solution strategies, leading to correctness improvements that counterbalance degradations. Manual analysis identifies consistent mechanisms behind these improvements, including the disruption of over-fitted terminology, removal of misleading constraints, and elimination of spurious identifier triggers. Overall, our study shows that structurally rich task descriptions can substantially mitigate the negative effects of under-specification and, in some cases, even enhance correctness. We outline categories of prompt modifications that positively influence the behavior of LLM code generation, offering practical insights for writing robust prompts.
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
🔥 引用:
0
Abstract: This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift should not be interpreted as an A/B foundation-model difference. The study contains 81 samples organized into 27 trilingual equivalent question groups. Although the system achieves an average score of 23.15/24, 14 groups show non-zero cross-language score drift, 5 groups show drift of at least 3 points, and the maximum drift reaches 9 points. These results provide initial evidence that failure-centered runtime evaluation can expose structured deployment signals hidden by aggregate scoring.
Turn food waste into climate action: engaging customers
🔥 引用:
0
Abstract: This study aims to explore how consumers engage with climate change issues through their interactions with the Too Good To Go application, an online marketplace designed to combat food waste at restaurants.
Drawing on customer engagement (CE) theory and the norm activation model (NAM) and using a mixed-methods approach, this research uses large language model-assisted thematic analysis to explore the key motivational drivers of engagement. Building on these insights, the authors used a survey and conducted structural equation modeling to test the conceptual model.
The results reveal that perceived sustainability, novelty, sense of community and value for money significantly foster affective engagement, which in turn drives behavioral engagement outcomes.
This research deepens understanding of pro-environmental consumer behavior by integrating CE theory with the NAM, thereby explicating the moral activation mechanisms underlying sustainable dining behaviors. This study also makes a methodological contribution by combining mixed methods with large language model-assisted thematic analysis to examine CE at scale.
By focusing on restaurant-based climate action, the study provides actionable insights for hospitality businesses seeking to embed sustainability into their operations and customer experience strategies.
This research makes a significant contribution to the field through its innovative methodological approach, its targeted application within the hospitality sector and its examination of how digital transformation facilitates sustainable behavioral change. The study provides valuable insights for designing digital platforms that simultaneously promote environmental action and enhance consumer loyalty.
The Pragmatic Persona: Discovering LLM Persona through Bridging Inference
🔥 Citations: 0
Abstract: Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference -- implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git
Jailbreaking Frontier Foundation Models Through Intention Deception
🔥 Citations: 0
Abstract: Large (vision-)language models exhibit remarkable capability but remain highly susceptible to jailbreaking. Existing safety training approaches aim to have the model learn a refusal boundary between safe and unsafe requests, based on the user's intent. This binary training regime has been found to lead to brittleness, since user intent cannot be reliably evaluated, especially when the attacker obfuscates it, and it also makes the system seem unhelpful. In response, frontier models such as GPT-5 have shifted from refusal-based safeguards to safe completion, which aims to maximize helpfulness while obeying safety constraints. However, safe completion can be exploited when a user pretends their intention is benign. This intent inversion is especially effective in multi-turn conversation, where the attacker has multiple opportunities to reinforce their deceptively benign intent. In this work, we introduce a novel multi-turn jailbreaking method that exploits this vulnerability. Our approach gradually builds conversational trust by simulating benign-seeming intentions and by exploiting the consistency property of the model, ultimately guiding the target model toward harmful, detailed outputs. Most crucially, our approach also uncovered an additional class of model vulnerability, which we call para-jailbreaking, that has gone unnoticed until now. Para-jailbreaking describes the situation where the model does not give a directly harmful reply to the attack query, yet the information it does reveal is nevertheless harmful. Our contributions are threefold. First, our method achieves high success rates against frontier models including GPT-5-thinking and Claude-Sonnet-4.5. Second, our approach revealed and addressed para-jailbreaking harmful output. Third, experiments on multimodal VLMs showed that our approach outperformed state-of-the-art methods.
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
🔥 Citations: 0
Abstract: Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors, logical flaws that yield incorrect results without triggering interpreter exceptions, and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines, and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
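The Best-of-N inference mentioned above follows a standard recipe: sample N candidate trajectories, score each with the reward model, and keep the highest-scoring one. A minimal sketch; the step format and the `toy_prm` stand-in scorer are illustrative assumptions, not DataPRM's actual interface:

```python
# Best-of-N selection with a process reward model (PRM): score each
# candidate trajectory and return the one with the highest reward.

def best_of_n(candidates, reward_fn):
    return max(candidates, key=reward_fn)

def toy_prm(trajectory):
    # Illustrative stand-in: reward = number of steps verified as correct.
    return sum(1 for step in trajectory if step["verified"])

candidates = [
    [{"verified": True}, {"verified": False}, {"verified": False}],
    [{"verified": True}, {"verified": True}, {"verified": False}],
]
best = best_of_n(candidates, toy_prm)
print(toy_prm(best))  # 2
```

What distinguishes a process reward model from an outcome reward model is that the scorer inspects intermediate steps rather than only the final answer, which is why the toy scorer iterates over the trajectory.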
Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models
🔥 Citations: 0
Abstract: Large Language Models (LLMs) have recently been explored as fine-grained zero-shot re-rankers by leveraging attention signals to estimate document relevance. However, existing methods either aggregate attention signals across all heads or rely on a statically selected subset identified by heuristic rules. This solution can be suboptimal because the informative heads can vary across queries or domains. Moreover, naively combining multiple heads can degrade performance due to redundancy or conflicting ranking signals. In this paper, we propose a query-dependent head selection method, RouteHead, for attention-based re-ranking with LLMs. Specifically, we learn a lightweight router that can map each query to an optimal head set, and relevance scores are computed by aggregating attention signals only from these heads. Since query-to-head optimal labels are unavailable, we first construct pseudo labels via an offline search. The router represents each head with a learnable embedding and represents each query using an embedding extracted from the hidden states of the frozen LLM. Then it is trained on the pseudo labels with a sparsity regularizer. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.
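The routing step described above can be sketched as scoring every attention head against a query embedding and keeping the top-k. A pure-Python illustration with random stand-in vectors; the dimensions and the dot-product router are assumptions for illustration, not RouteHead's actual architecture:

```python
import random

random.seed(0)
n_heads, dim, k = 8, 16, 3

# Stand-ins for the learnable head embeddings and the query embedding
# extracted from the frozen LLM's hidden states.
head_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_heads)]
query_emb = [random.gauss(0, 1) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Route: score every head for this query, keep only the top-k.
logits = [dot(h, query_emb) for h in head_emb]
top_k = sorted(range(n_heads), key=lambda i: logits[i])[-k:]

# Aggregate attention-based relevance signals only from the selected heads.
attn_signal = [random.gauss(0, 1) for _ in range(n_heads)]
score = sum(attn_signal[i] for i in top_k) / k
print(sorted(top_k))
```

The key design point is that `top_k` is recomputed per query, in contrast to the static head subsets the paper argues are suboptimal.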
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
🔥 Citations: 0
Abstract: As the foundational component of versatile AI applications, training a multimodal large language model (MLLM) relies on multimodal datasets with dynamic modality mixture proportions and sample length distributions. However, existing MLLM systems remain inefficient under dynamic workloads, due to statically coupled decisions of resource allocation and model parallelization between encoders and the LLM backbone. This paper presents MegaScale-Omni, an industrial-grade MLLM training system tailored for dynamic workload adaptation and hyper-scale deployment. MegaScale-Omni is built upon the training scheme of encoder-LLM multiplexing with three key innovations: (1) Decoupled parallelism strategies with long-short sequence parallelism for encoders to process variable-length samples, and full-fledged 5D parallelism for the LLM backbone, both organized under a communication-efficient parallelization layout. (2) Unified encoder-LLM representations for flexible, extensible colocation, and a new paradigm of encoder-LLM joint pipeline with workload resilience. (3) Workload balancing techniques via decentralized grouped reordering in data loaders and adaptive resharding from encoder to LLM ranks. MegaScale-Omni is deployed as the foundation of our in-house large-scale MLLM training tasks with thousands of GPUs. Our experimental results demonstrate 1.27×–7.57× throughput improvement under production-grade dynamic workloads, as compared to four state-of-the-art systems.
Suika: Efficient and High-quality Re-scheduling of 3D-parallelized LLM Training Jobs in Shared Clusters
🔥 Citations: 0
Abstract: Large Language Models (LLMs) are usually trained with 3D (data, tensor, and pipeline) parallelism in shared GPU clusters where the available resources are highly dynamic. Rescheduling idle resources to ongoing jobs can help improve cluster utilization, but doing so for 3D-parallelized training jobs incurs large overheads in performance modeling, decision making, and redeployment. We present Suika, a cluster training system that supports efficient and high-quality resource rescheduling for 3D-parallelized LLM training jobs. Suika holistically addresses the complexity challenges by exploiting the incremental nature of rescheduling. For performance modeling, it builds an accurate performance estimator with non-disruptive online profiling. For decision-making, it employs topology-aware sorting and an expand-and-balance algorithm to reduce the complexity of resource allocation and job parallelization, without compromising decision quality. Suika further integrates a device-to-device redeployment method to leverage the overlapping nature of incremental reconfiguration for overhead reduction. Experiments on a 64-GPU physical cluster and a 1024-GPU simulated cluster show that Suika achieves a 1.29~1.31× reduction in average JCT compared to state-of-the-art schedulers.
Translate or Simplify First: An Analysis of Cross-lingual Text Simplification in English and French
🔥 Citations: 0
Abstract: Cross-Lingual Text Simplification (CLTS) aims to make content more accessible across languages by simultaneously addressing both linguistic complexity and translation. This study investigates the effectiveness of different prompting strategies for CLTS between English and French using large language models (LLMs). We examine five distinct prompting systems: a direct prompt instructing the LLM to perform both translation and simplification simultaneously, two composition approaches that either translate-then-simplify or simplify-then-translate within a single prompt, and two decomposition approaches that perform the same operations in separate, consecutive prompts. These systems are evaluated across a diverse set of five corpora of different genres (Wikipedia and medical texts) using seven state-of-the-art LLMs. Output quality is assessed through a multi-faceted evaluation framework comprising automatic metrics, comprehensive linguistic feature analysis, and human evaluation of simplicity and meaning preservation. Our findings reveal that while direct prompting consistently achieves the highest BLEU scores, indicating meaning fidelity, translate-then-simplify approaches demonstrate the highest simplicity, as measured by the linguistic features.
Collaborative Dialogue Analysis for Productive Problem Solving
🔥 Citations: 0
Abstract: Collaborative problem solving requires students to jointly reason, negotiate, and regulate their learning. Understanding collaborative problem solving through student dialogue can inform timely identification of productive and unproductive collaborative behaviors. In this study, we investigate the use of large language models to automatically classify collaborative problem-solving dialogue segments into two categories: Productive and Unproductive. To support deeper analysis, we additionally explore classification of eight detailed collaborative problem-solving sub-categories. We present an error-augmented few-shot prompting method that incorporates misclassified examples to refine model understanding of classification boundaries. Using dialogue data from a middle school collaborative game-based learning environment, our approach substantially improves classification accuracy over zero-shot baselines. Qualitative analysis of the resulting models further highlights which dialogue types are most frequently misclassified, suggesting design implications for adaptive scaffolding. These findings demonstrate that large language models, when guided with targeted prompting strategies, can effectively recognize productive and unproductive dialogue in collaborative learning.
A Novel Approach to Evaluating the Effectiveness of Large Language Models for Multimodal Analysis of Embodied Learning in Classrooms
🔥 Citations: 0
Abstract: This paper presents an approach that uses Large Language Models (LLMs) as late-fusion interpreters to synthesize multimodal signals from embodied classroom activities and infer students’ metacognitive behaviors. Our multimodal pipeline analyzes students’ movements, gaze, gestures, and speech within a mixed-reality simulation displayed on a classroom screen to support enactment and learning. Vision- and speech-derived features are fused at the interpretive layer via zero-shot prompting, self-consistency reasoning, and targeted prompt engineering to derive planning, enacting, monitoring, reflecting, and interacting behaviors. We investigate whether LLMs can reliably integrate modality-specific analytics to produce accurate behavioral labeling and whether an LLM-as-a-Judge can validate them at scale. To address scalability and reduce human burden, we introduce an automated evaluation protocol employing LLM-as-a-Judge to assess classification quality, enabling rapid, iterative benchmarking of model variants and prompt strategies. Using a balanced corpus of human-validated segments and perturbed controls, we compare text-only language models (e.g., GPT-5) with visual–language models (e.g., Qwen2.5-VL) that incorporate direct visual processing. Results indicate late-fusion, text-based LLMs can outperform VLMs on behavior judgment without raw video, and precision- or recall-oriented prompts adjust decision boundaries for subtle or brief segments. These findings position LLMs as effective late-fusion mechanisms for multimodal learning analytics and demonstrate the viability of LLM-as-a-Judge for scalable, human-in-the-loop evaluation.
Uncertainty Propagation in LLM-Based Systems
🔥 Citations: 0
Abstract: Uncertainty in large language model (LLM)-based systems is often studied at the level of a single model output, yet deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. This paper develops a systems-level account of uncertainty propagation. It introduces a conceptual framing for characterising propagated uncertainty signals, presents a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, synthesises cross-cutting engineering insights, and identifies five open research challenges.
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
🔥 Citations: 0
Abstract: Large language models can generate code and call tools with remarkable fluency, yet deploying them as practical software engineering assistants still exposes stubborn gaps: finite context windows, single mistakes that derail entire sessions, agents that get stuck in dead ends, AI slop, and generated changes that are difficult to review or revert. We present KISS Sorcar, a general-purpose assistant and integrated development environment (IDE) built on top of the KISS Agent Framework, a stupidly-simple AI agent framework of roughly 1,850 lines of code. The framework addresses these gaps using a robust system prompt and through a five-layer agent hierarchy in which each layer adds exactly one concern: budget-tracked ReAct execution, automatic continuation across sub-sessions via summarization, coding and browser tools with parallel sub-agents, persistent multi-turn chat with history recall, and git worktree isolation so every task runs on its own branch. To assess the power of the KISS agent framework, we implemented KISS Sorcar as a free, open-source Visual Studio Code extension that runs locally and effectively for long-horizon tasks, and supports browser automation, multimodal input, and Docker containers. In this research, we deliberately prioritize output quality over latency: giving a frontier model adequate time to validate its own output -- running linters, type checkers, and tests -- dramatically reduces the low-quality code that plagues faster but less thorough agents. The entire system was built using itself in 4.5 months, providing a continuous stress test in which any agent-introduced bug immediately impairs its own ability to work. On Terminal Bench 2.0, KISS Sorcar achieves a 62.2% overall pass rate with Claude Opus 4.6, comparing favorably to Claude Code (58%) and Cursor Composer 2 (61.7%).
Evaluating AI-Generated Narrative Feedback on Nonverbal Communication in Student Presentations
🔥 Citations: 0
Abstract: Nonverbal communication is essential for effective oral presentations, shaping audience engagement and conveying confidence beyond spoken words. Multimodal Learning Analytics (MmLA) research has advanced the automatic detection of presenters’ behaviors—such as posture, gestures, and eye contact—but continues to explore how to provide actionable feedback that fosters reflection and skill development. Recent advances in Generative Artificial Intelligence (GenAI) offer new opportunities to automate the analysis of nonverbal cues and generate narrative evaluations that go beyond raw metrics. This study investigates the integration of a Learning Analytics Dashboard (LAD) with a Large Language Model (LLM) to deliver both numerical and narrative feedback across five dimensions of nonverbal communication: body posture, eye contact, hand gestures, facial expression, and use of space. Twenty-two undergraduate students participated in the study, receiving numerical feedback through a LAD and narrative feedback from either a human expert or an LLM. Findings revealed no significant differences in students’ perceptions of human- and LLM-generated feedback, while also highlighting the potential of LLM-based feedback to support reflection despite certain technical limitations. These results suggest that LLMs can meaningfully enhance learning analytics dashboards by delivering actionable, human-like feedback that supports the development of nonverbal communication skills.
Data-Driven Evaluation of LLM-Based Ontology Concept Extraction from Programming Learning Content
🔥 Citations: 0
Abstract: The process of associating elements of learning content with concepts or skills that this content helps students to master is one of the critical steps in developing personalized educational systems. When these associations are properly established, the system can infer the growth of student understanding of separate knowledge components from the logs of their interactions with associated learning content and use it to adapt the learning process accordingly by targeting gaps in individual students’ knowledge. Unfortunately, crafting these links between learning content and knowledge components is a very time- and expertise-demanding process that has traditionally been performed manually by domain experts with the help of knowledge engineers. Recently, the power of Large Language Models has motivated a new generation of research on concept extraction from textual learning content. The work presented in this paper contributes to this trend while introducing two important innovations. First, our concept extraction process is guided by a human-authored ontology of the target domain - Python programming. Second, alongside a traditional expert evaluation of the concept extraction quality, we apply two additional validation approaches: one based on using an educational data mining technique (learning curves) and another utilizing the pedagogical expertise of teaching the target domain (learning content placement).
No More Translation at Runtime: LLM-Empowered Static Binary Translation
🔥 Citations: 0
Abstract: While AArch64 CPUs are becoming strong market contenders, their software ecosystem lags behind the mature x86-64 environment, hindering the adoption of the new architectures and impacting user experience. Binary translation bridges this divide by converting binary code from one architecture (e.g., x86-64) to run on another (e.g., AArch64), allowing legacy software to benefit from modern hardware's performance and energy efficiency advantages. Current translation methods are typically either dynamic, which adds significant runtime overhead, or static, which struggles with reliability due to the inherent complexities of binary analysis. This paper introduces a new static, assembly-to-assembly translation paradigm that transforms binary code ahead of execution, generating portable, efficient native-like binaries that run on AArch64 devices without runtime frameworks. Benefiting from recent breakthroughs in large language models (LLMs), we provide a practical and automated translation engine that produces high-quality code with minimal human intervention. To ensure correctness, we introduce a crucial verification step, where we split the assembly code into simplified snippets, enabling efficient and scalable semantic verification. Our evaluation shows that this approach significantly outperforms existing open-source solutions by a large margin, producing binaries with near-native performance. Furthermore, it shows substantial improvements over the leading industrial translator, ExaGear, illuminating a promising new direction for cross-architecture binary translation research.
Prompting for Teachability: Designing Novice Personas in LLMs for Learning by Teaching Contexts
🔥 Citations: 0
Abstract: Learning by teaching (LbT) is a well-established instructional framework in which students deepen understanding by explaining material to a peer or tutee. Large Language Models (LLMs) create new opportunities to scale LbT by simulating novice learners, but their default tendency toward expert-like responses risks undermining the tutor's role. This study investigates which prompting strategies most effectively elicit novice behavior from LLMs in writing-related domains. We generated 30,720 combined prompts across five domains and evaluated three models (Qwen3-235B, Llama 4, Kimi-K2) using both multiple-choice quizzes and short persuasive essays. Outputs were scored on quiz accuracy, essay quality, and essay persuasiveness using an AI-judge rubric. Regression analysis revealed a clear pattern: constraint prompts that explicitly forced error production consistently outperformed persona-, misconception-, and uncertainty-based prompts. Across both quiz and essay outcomes, direct commands to “answer incorrectly” or “get 2–3 wrong” yielded the strongest novice-like behavior, while indirect framings like “don't aim for a perfect score” or “you may guess” diluted the effect. These findings highlight constraint-based prompting as the most reliable strategy, and we argue that constraint directives provide an actionable design pathway for practitioners seeking to integrate LLMs into effective LbT contexts.
Reducing the GPU Memory Bottleneck with Lossless Compression for ML
🔥 Citations: 0
Abstract: Machine learning (ML) training and inference often process data sets far exceeding GPU memory capacity, forcing them to rely on PCIe for on-demand tensor transfers, causing critical transfer bottlenecks. Lossy compression has been proposed to relieve bottlenecks but introduces workload-dependent accuracy loss, making it complex or even prohibitive to use in existing ML deployments. We explore lossless compression as an alternative that avoids this deployment complexity. We identify where lossless compression can be integrated into ML pipelines while minimizing interference with GPU execution. Based on our findings, we introduce Invariant Bit Packing (IBP), a novel lossless compression algorithm designed to minimize data transfer time for ML. IBP identifies and eliminates invariant bits across groups of tensors, improving throughput through GPU-optimized decompression that leverages warp parallelism, low-overhead bit operations, and asynchronous PCIe transfers. We provide easy-to-use APIs, showcasing them by adding IBP support to GNN training, as well as DLRM and LLM inference frameworks. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 25% faster LLM inference.
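The invariant-bit idea behind IBP can be illustrated compactly: within a group of words, a bit position whose value agrees across every word need only be stored once in a shared template, so each word carries only its varying bits. A much-simplified sketch; the grouping granularity, word size, and packed layout are assumptions, and warp-parallel decompression is omitted entirely:

```python
# Sketch of detecting invariant bits across a group of 32-bit words.
# Word values are illustrative; real IBP operates on tensor data.

WORD_BITS = 32
FULL = (1 << WORD_BITS) - 1

def invariant_mask(words):
    """Bitmask of positions where every word in the group agrees."""
    all_and = all_or = words[0]
    for w in words[1:]:
        all_and &= w
        all_or |= w
    # A bit is invariant if it is 1 everywhere (all_and) or 0 everywhere (~all_or).
    return (all_and | (~all_or & FULL)) & FULL

words = [0xDEAD0001, 0xDEAD0F02, 0xDEAD0003]
mask = invariant_mask(words)
varying = WORD_BITS - bin(mask).count("1")
print(hex(mask), varying)  # 0xfffff0fc 6 -> only 6 bits per word need storing
```

Here the shared high halfword and the constant low bits are captured once by the mask, leaving 6 varying bits per word to pack, which is the source of the compression.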
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
🔥 Citations: 0
Abstract: Large language models are increasingly deployed as automated judges for evaluating other models, yet the stability of their verdicts under semantically equivalent prompt paraphrases remains unmeasured. We introduce JudgeSense, a framework and benchmark for quantifying this property via the Judge Sensitivity Score (JSS), defined as the fraction of paraphrase pairs on which a judge returns an identical decision. Evaluating nine judge models on 494 validated paraphrase pairs, we find that coherence is the only task where judges meaningfully differ, with JSS ranging from 0.389 to 0.992. On factuality, all judges cluster near a JSS of about 0.63, driven by a polarity-inverted prompt artifact; after correction, factuality JSS rises to about 0.9. Pairwise tasks (preference and relevance) exhibit degenerate always-A behavior in 8 of 9 judges, indicating strong position bias. Model scale does not predict consistency. We release code, decision logs, and a validated paraphrase dataset to support standardized JSS reporting.
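The JSS defined above is simple enough to state in code: the fraction of paraphrase pairs on which the judge's decision is unchanged. The verdicts below are made-up examples, not JudgeSense data:

```python
# Judge Sensitivity Score: fraction of paraphrase pairs with identical verdicts.

def jss(pairs):
    """pairs: (verdict_on_original, verdict_on_paraphrase) tuples."""
    same = sum(1 for a, b in pairs if a == b)
    return same / len(pairs)

verdicts = [("pass", "pass"), ("fail", "pass"), ("pass", "pass"), ("fail", "fail")]
print(jss(verdicts))  # 0.75
```

A perfectly consistent judge scores 1.0; the 0.389 coherence judge reported above flips its verdict on most paraphrases.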
AIMS: Cost-Efficient LLM-Based Agent Deployment in Hybrid Cloud-Edge Environments
🔥 Citations: 0
Abstract: In the realm of AI, large language models (LLMs) like GPT-5, central to the operation of AI agents, predominantly operate in the cloud, incurring high operational costs. With local-based small language models (SLMs) becoming more accurate, the necessity of cloud-exclusive processing is being reconsidered. An AI agent's response to a user's request comprises a series of subtasks or iterations. Existing approaches choose either an LLM or SLM for the entire request to ensure similar outputs, but this is ineffective for AI agents as SLMs may generate differing subtasks, compromising final accuracy. In this paper, we first conduct experimental analysis to understand the features of AI agent operations. Leveraging our findings, we propose the Adaptive Iteration-level Model Selector (AIMS), a lightweight scheduler to automatically partition an AI agent's subtasks between local-based SLM and cloud-based LLM. AIMS considers the varying subtask features and strategically decides the location for each subtask in order to use SLM as much as possible while maintaining the accuracy level. Our experimental results demonstrate that AIMS achieves up to a 27.5% relative improvement in accuracy and up to 31.4% relative increase in SLM usage compared to HybridLLM. It offloads 83.4% of subtasks to a local SLM while attaining similar accuracy on average compared with the cloud-only LLM approach.
Automatic Short Answer Grading with LLMs: From Memorization to Reasoning
🔥 Citations: 0
Abstract: Short-answer questions provide valuable insights into students’ understanding and cognitive processes for learning analytics. However, they are difficult to grade automatically as they require a high level of language comprehension. Automatic Short Answer Grading (ASAG) is therefore essential in large-scale educational settings. Recent work has applied encoder-only pre-trained language models (PLMs), such as BERT, and generative large language models (LLMs) to ASAG. Although fine-tuned BERT-based models currently produce state-of-the-art results, they depend on substantial annotated datasets, which are frequently expensive and insufficient. This paper examines the performance of fine-tuning of several PLMs and LLMs for different dataset sizes and compares the results to those of prompt-based approaches. General-purpose and domain-specific models were fine-tuned on datasets ranging from 800 to 26,674 student responses. Different prompt engineering strategies were tested, including rubric-based prompts. Our results demonstrate that fine-tuned LLMs and rubric-based prompting can match or exceed the performance of BERT-based models. Rubric-based prompts with open-source models deliver comparable results without the need for annotation data or hardware-intensive training, while also mitigating data protection concerns. This work provides empirical evidence of the role of LLMs in ASAG and paves the way for future research into resource-efficient, interpretable and reasoning-driven grading.
Grammar-Constrained Refinement of Safety Operational Rules Using Language in the Loop: What Could Go Wrong
🔥 Citations: 0
Abstract: Safety specifications in cyber-physical systems (CPS) capture the operational conditions the system must satisfy to operate safely within its intended environment. As operating environments evolve, operational rules must be continuously refined to preserve consistency with observed system behavior during simulation-based verification and validation. Revising inconsistent rules is challenging because the changes must remain syntactically correct under a domain-specific grammar. Language-in-the-loop refinement further raises safety concerns beyond syntactic violations, as it can produce semantically unjustified refinements that overfit to the observed outcomes. We introduce a framework that combines counterfactual reasoning with a grammar-constrained refinement loop to refine operational rules, aligning them with the observed system behavior. Applied to an autonomous driving control system, our approach successfully resolved the inconsistencies in an operational rule inferred by a conventional baseline while remaining grammar compliant. An empirical large language model (LLM) study further revealed model-dependent refinement quality and safety lessons, which motivate rigorous grammar enforcement, stronger semantic validation, and broader evaluation in future work.
Identification and validation of natural product-based KRASG12D inhibitors through structure-based virtual screening, molecular dynamics simulation, and in vitro studies
🔥 Citations: 0
Abstract: Not available; see the original article.
WISE-FM: Operation-Aware, Engineering-Informed Foundation Model for Multi-Task Well Design
🔥 Citations: 0
Abstract: Deploying machine learning models across diverse well portfolios requires generalisation to wells with design parameters outside the training distribution. Current data-driven approaches to virtual flow metering (VFM) and bottomhole estimation typically treat each well independently or ignore the influence of well design on operational behaviour. We present WISE (Well Intelligence and Systems Engineering Foundation Model), a design-aware, physics-informed multi-task model that integrates three complementary mechanisms: Feature-wise Linear Modulation (FiLM) and cross-modal attention to condition operational embeddings on well design parameters; multi-task learning for simultaneous prediction of flow rates, bottomhole conditions, and flow regime classification; and structural mass conservation with soft physics constraints derived from well engineering principles. Evaluation on the ManyWells benchmark (2000 simulated wells, $10^6$ data points) demonstrates that design-aware models reduce VFM prediction error by up to $13\times$ compared to design-unaware baselines, and that physics constraints reduce negative flow predictions by 65%. Flow regime classification achieves 97.7% bottomhole accuracy, providing continuous well integrity monitoring without additional sensors. The methodology transfers to real operational data from five Equinor Volve producers (oil rate $R^2 = 0.89$, bottomhole pressure $R^2 = 0.98$, water rate $R^2 = 0.97$). The trained model additionally serves as a fast surrogate for integrity-aware well design optimisation over a 24-dimensional design space, with more than $1000\times$ speedup over drift-flux simulations. These results demonstrate that design awareness, physics enforcement, and multi-task learning are essential and complementary ingredients for foundation models intended to operate across large well portfolios.
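The abstract names Feature-wise Linear Modulation (FiLM) as one of the mechanisms conditioning operational embeddings on well-design parameters. A minimal illustrative sketch of FiLM conditioning, with toy dimensions and hypothetical names (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(operational, design, w_gamma, w_beta):
    # FiLM: per-feature scale (gamma) and shift (beta) are linear functions
    # of the design vector; they modulate the operational embedding
    # element-wise, so each well's design reshapes its feature space.
    gamma = design @ w_gamma
    beta = design @ w_beta
    return gamma * operational + beta

# Toy dimensions: 4 wells, 8 design parameters, 16 operational features.
design = rng.normal(size=(4, 8))
operational = rng.normal(size=(4, 16))
w_gamma = rng.normal(size=(8, 16)) * 0.1
w_beta = rng.normal(size=(8, 16)) * 0.1

modulated = film_modulate(operational, design, w_gamma, w_beta)
```

In a full model the `w_gamma`/`w_beta` projections would be learned layers (often a small MLP) rather than fixed random matrices.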
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning
🔥 Citations: 0
Abstract: Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches improve answer quality and interpretability, they incur substantial computational overhead due to the prolonged generation sequences. In this paper, we propose Tandem, a novel collaborative framework that synergizes large and small language models (LLMs and SLMs) to achieve high-quality reasoning with significantly reduced computational cost. Specifically, the LLM serves as a strategic coordinator, efficiently generating a compact set of critical reasoning insights. These insights are then used to guide a smaller, more efficient SLM in executing the full reasoning process and delivering the final response. To balance efficiency and reliability, Tandem introduces a cost-aware termination mechanism that adaptively determines when sufficient reasoning guidance has been accumulated, enabling early stopping of the LLM's generation. Experiments on mathematical reasoning and code generation benchmarks demonstrate that Tandem reduces computational costs by approximately 40% compared to standalone LLM reasoning, while achieving superior or competitive performance. Furthermore, the sufficiency classifier trained on one domain transfers effectively to others without retraining. The code is available at: https://github.com/Applied-Machine-Learning-Lab/ACL2026_Tandem.
FUTURAL: A Metasearch Platform for Empowering Rural Areas with Smart Solutions
🔥 Citations: 0
Abstract: The FUTURAL project aims to provide a comprehensive suite of digital Smart Solutions (SS) across five critical domains to address pressing social and environmental issues. Central to this initiative is a robust Metasearch platform, which will not only serve as the primary access point to FUTURAL's solutions but also facilitate the search and retrieval of SS developed by other initiatives. This paper elaborates on the MVP implementation for the MetaSearch platform. It focuses on a single, open-source data service and harnesses the generative capabilities of Large Language Models (LLMs) to create a user-friendly natural language interface. The design of the Minimum Viable Product (MVP), the tools used for adapting LLMs to our specific application, and our comprehensive set of evaluation techniques are thoroughly detailed. The results from our evaluations demonstrate that our approach is highly effective and can be efficiently implemented in future iterations of the MVP. This groundwork paves the way for extending the platform to include additional services and diverse data sets from the FUTURAL project, enhancing its capacity to address a broader array of queries and datasets.
Thinking Like a Clinician: A Cognitive AI Agent for Clinical Diagnosis via Panoramic Profiling and Adversarial Debate
🔥 Citations: 0
Abstract: The application of large language models (LLMs) in clinical decision support faces significant challenges of "tunnel vision" and diagnostic hallucinations when processing unstructured electronic health records (EHRs). To address these challenges, we propose a novel chain-based clinical reasoning framework, called DxChain, which transforms the diagnostic workflow into an iterative process by mirroring a clinician's cognitive trajectory consisting of "Memory Anchoring", "Navigation", and "Verification" phases. DxChain introduces three key methodological innovations to elicit the potential of LLMs: (i) a Profile-Then-Plan paradigm to mitigate cold-start hallucinations by establishing a panoramic patient baseline, (ii) a Medical Tree-of-Thoughts (Med-ToT) algorithm for strategic look-ahead planning and resource-aware navigation, and (iii) a Dialectical Diagnostic Verification procedure utilizing "Angel-Devil" adversarial debates to resolve complex evidence conflicts. Evaluated on two real-world benchmarks, MIMIC-IV-Ext Cardiac Disease and MIMIC-IV-Ext CDM, DxChain achieves state-of-the-art performance in both diagnostic accuracy and logical consistency, offering a modular and reliable architecture for next-generation clinical AI. The code is at https://anonymous.4open.science/r/Dx-Chain.
TrustWeave: Integrity Measurement and Attestation For Multi-Cloud LLMs
🔥 Citations: 0
Abstract: Multi-cloud deployed multi-agent systems powered by Large Language Models (LLMs) introduce critical security challenges. While Intel Trust Domain Extensions (TDX)-based Confidential Virtual Machines (CVMs) provide strong isolation and boot-time attestation, they lack dynamic runtime integrity verification, a capability essential for trusted agent systems that frequently load new models and coordinate across distributed services. We present TrustWeave, a runtime integrity measurement and attestation framework that extends the Linux Integrity Measurement Architecture (IMA) with support for Intel TDX's Runtime Measurement Registers (RTMRs). TrustWeave enables userspace attestation of dynamically loaded agent components throughout their life-cycle, providing runtime trust guarantees for secure and scalable LLM agent deployments. Our key innovation is a workload-aware filtering mechanism that reduces measurement overhead by 99.95% while preserving security coverage for agent-critical operations. Our evaluation on production agent workloads using five models (0.6B-14B parameters) demonstrates practical performance: 0.8–12.8% boot overhead (reducible to <10% with filtering), Time-to-First-Token (TTFT) degradation around 25% of baseline, and stable QPS scaling up to 32 concurrent requests. TrustWeave introduces operationally acceptable overhead for security-sensitive deployments, maintaining model inference accuracy with bounded boot and runtime latency, while providing runtime attestation for dynamically loaded applications.
viNPU: Optimizing Vision Transformer Inference on Mobile NPUs
🔥 Citations: 0
Abstract: Vision Transformers (ViTs) have emerged as the dominant architecture for visual understanding, powering Vision Foundation Models (VFMs) that excel in zero-shot prediction across diverse environments in spatial computing applications. These applications demand efficient, on-device execution to meet stringent latency requirements, yet deploying large VFMs on mobile Neural Processing Units (NPUs) presents two critical bottlenecks: performance imbalance between the NPU's heterogeneous compute units, and severe DRAM access overhead in scaled Transformers due to limited on-chip memory. To address these challenges, we propose viNPU, an inference optimization framework that enables efficient execution of VFMs on mobile NPUs. viNPU incorporates (1) sensitivity-guided mixed-precision quantization that effectively balances accuracy and latency; (2) dataflow optimization to maximize on-chip data locality and hardware utilization; and (3) a profiling-based offline stage to ensure real-time performance on mobile NPUs with minimal accuracy drop. Our evaluation on representative VFMs (e.g., Depth Anything V2, Segment Anything) across commercial mobile devices equipped with Qualcomm Hexagon NPUs demonstrates that viNPU achieves an average 11.5× speedup and up to 44.9× energy savings over baselines, while incurring negligible accuracy degradation compared to floating-point execution.
SAS: Sparse Attention Synthesizer for Efficient Language Model Inference
🔥 Citations: 0
Abstract: Modern large language models rely on attention mechanisms that attend to all tokens in a sequence, resulting in quadratic computational complexity that limits scalability. While sparse attention reduces compute and memory requirements by attending to only important tokens, implementing these techniques presents significant challenges due to the complexity of combining static and dynamic sparse patterns and optimizing key-value (KV) cache management. To address these challenges, we present SAS, a sparse attention synthesizer that automatically generates performant sparse attention kernels for large language model inference. SAS introduces a set of primitives that effectively encapsulate both static and dynamic sparse attention mechanisms, enabling users to compose complex attention patterns through logic operators and declarative functions. The system employs a geometric-based pattern analyzer to optimize for KV caching by determining minimal cache sizes and automatically generating cache management functions. Supporting both Nvidia GPU and AWS Trainium backends, SAS demonstrates significant performance improvements: 1.10–1.22× speedup for context encoding and 2.68–2.80× speedup for token generation over FlexAttention, a state-of-the-art flexible attention kernel synthesis tool, on GPUs, and 1.41–6.49× speedup for context encoding and 1.39–10.87× speedup for token generation over optimized dense attention on Trainium.
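The abstract describes composing static and dynamic sparse attention patterns through logic operators. A hedged illustrative sketch of that idea using boolean masks (the function names and the specific patterns are illustrative, not SAS's actual primitives):

```python
import numpy as np

def causal(n):
    # Static pattern: each query attends only to itself and earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window(n, w):
    # Static pattern: attend only to tokens within distance w.
    i = np.arange(n)
    return np.abs(i[:, None] - i[None, :]) < w

def global_tokens(n, k):
    # Static pattern: every query also attends to the first k "sink" tokens.
    m = np.zeros((n, n), dtype=bool)
    m[:, :k] = True
    return m

# Compose patterns with boolean logic: causal AND (local OR global).
n = 8
mask = causal(n) & (sliding_window(n, 3) | global_tokens(n, 2))
```

A real kernel synthesizer would lower such a composed mask to block-sparse kernels and derive the minimal KV-cache footprint from it, rather than materializing a dense boolean matrix.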
GenAI Analytics and Pedagogical Configurations: Featuring New Generative AI Systems in Education
🔥 Citations: 0
Abstract: The appearance of general-purpose Large Language Models (LLMs) and their derived chatbots (e.g., ChatGPT, Gemini) has revolutionized how students learn in high school and university settings. While these chatbots provide potential benefits to students (e.g., a 24/7 tutor), they hinder teachers’ autonomy: teachers are blind to the questions students ask and have no control over the answers these models provide. However, such autonomy could be increased by integrating GenAI Analytics that track students’ interactions with the chatbots (e.g., which topics are asked about most frequently), and by configuring and contextualizing such models. This study aims at understanding which GenAI Analytics and Pedagogical Configurations are most useful for university teachers, through a GenAI system developed by the authors. The evaluation involved 16 university teachers from 5 different institutions. Results show that analytics such as the number of prompts or the categorization of their contents are of great interest to teachers. Furthermore, while some pedagogical configurations were positively rated by most participants (e.g., not providing the direct solution, customized add-ons), others were not (e.g., forced hallucinations, minimum answer length). By better understanding these two features of GenAI systems (i.e., GenAI Analytics and Pedagogical Configurations for chatbot behaviors), we expect teachers will increase their autonomy and trust in GenAI-based systems, thus enhancing the adoption of these technologies in their regular practice.
AI Engineering: Building Applications with Foundation Models by Chip Huyen on iPad
🔥 Citations: 11
Abstract: Not available; see the original article.
Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
🔥 Citations: 0
Abstract: Test-time alignment methods offer a promising alternative to fine-tuning by steering the outputs of large language models (LLMs) at inference time with lightweight interventions on their internal representations. Recently, a prominent and effective approach, RE-Control (Kong et al., 2024), has proposed leveraging an external value function trained over the LLM's hidden states to guide generation via gradient-based editing. While effective, this method overlooks a key characteristic of alignment tasks, i.e. that they are typically formulated as learning from human preferences between candidate responses. To address this, in this paper we propose a novel preference-based training framework, Pref-CTRL, that uses a multi-objective value function to better reflect the structure of preference data. Our approach has outperformed RE-Control on two benchmark datasets and showed greater generalization on out-of-domain datasets. Our source code is available at https://github.com/UTS-nlPUG/pref-ctrl.
DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making
🔥 Citations: 0
Abstract: Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.
Automated Assessment of Handwritten Math Problems: A Comparison of Prompting Strategies for Open and Closed-source LLMs
🔥 Citations: 0
Abstract: Assessing handwritten mathematical solutions is essential for identifying students’ weaknesses and fostering personalized learning. However, scaling such assessment remains challenging for Learning Analytics, which has traditionally focused on digital or typed data. The current study investigated the potential of Large Language Models (LLMs) for automating the assessment of handwritten mathematical solutions and explored how they can be incorporated into large-scale learning analytics pipelines. We curated 300 student solution images, annotated them according to a taxonomy of math error types, and compared open-source (Qwen2.5-7B and Gemma3 12B-IT) and closed-source (Gemini 2.0 Flash and GPT-4) LLMs. Two prompting strategies were tested: one adapted from related work and one tailored to the taxonomy of math error types using established prompt design principles. The results revealed that LLMs, particularly Gemini, achieved moderate to strong performance in diagnosing and classifying student errors, while exposing recurring model-specific errors. These findings highlight both the promise and limitations of LLMs for integrating handwritten work in LA and recommend that learning analytics practitioners and researchers combine careful model selection, principled prompt design, and error-level analysis to develop AI-powered LA systems that are accurate, equitable, and pedagogically actionable.
ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
🔥 Citations: 0
Abstract: Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B, while maintaining high output fidelity (near-token-identical generation, with minor non-determinism from FP16 atomics).
A Synergistic Framework for Cognitive Diagnosis with LLM-Empowered Data Augmentation
🔥 Citations: 0
Abstract: Accurate cognitive diagnosis is crucial for enabling personalized learning. However, its effectiveness is severely hampered by data sparsity and class imbalance in student-exercise interactions. To address these challenges, this paper proposes a synergistic framework that integrates Artificial Intelligence and Learning Analytics. First, to mitigate data sparsity, we leverage a Large Language Model (LLM) with chain-of-thought prompting to enrich the Q-matrix, thereby expanding knowledge concept coverage. Second, to address class imbalance, we introduce a Generative Adversarial Network (GAN) constrained by this refined knowledge structure to synthesize pedagogically meaningful negative samples. Finally, a dual-level graph attention network is employed to infer precise knowledge states from both the LLM-enriched Q-matrix and the balanced response data. Extensive experiments on the Junyi dataset demonstrate that our framework significantly outperforms strong baselines, achieving a 6.21% improvement in AUC and increasing specificity from 54.24% to 85.83%. These results underscore its potential for delivering reliable and actionable learning diagnostics.
Consensus-Driven Metacognition in Multi-Agent Systems: A Logic-Based Byzantine Fault-Tolerant Protocol
DOI: 10.63412/8vgf0b98
🔥 Citations: 0
Abstract: Large language model (LLM) agents fail in a distinctive way: they produce confident wrong answers. Recent work has begun adapting Byzantine fault tolerance (BFT) to multi-LLM networks — notably the weighted BFT protocol of Wang et al. [1] and the Aegean consensus engine [2] — typically by treating agent trust as a scalar that flows continuously through the protocol.
This paper takes the opposite stance: discretization is a feature, not a bug. MBFT (Metacognitive BFT) commits a swarm decision via a small, finite ladder of confidence tiers, defeasible counterproofs, and a reputation-gated veto. The resulting protocol is auditable, deterministically replayable, and its safety/liveness claims are mechanically checkable — properties that continuous Bayesian aggregators struggle to provide. Bayesian and continuous trust schemes remain the right tool for noisy, well-calibrated, high-volume regimes (recommender systems, sensor fusion, market making); we argue that high-stakes reasoning swarms — legal, medical, safety-critical agentic deployments — belong to the discrete, defeasible regime that MBFT formalizes. We accompany the paper with an open-source reference implementation whose property suite encodes each theorem as an executable test.
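The commit rule the abstract sketches (discrete confidence tiers plus a reputation-gated veto) can be illustrated with a toy decision function; the names, tier range, and thresholds below are hypothetical, not the MBFT protocol itself:

```python
def commit_decision(votes, vetoed, quorum=3, min_tier=2):
    """Toy tiered commit: each agent votes with a discrete confidence
    tier in {0, 1, 2, 3}; a reputation-gated veto rejects the decision
    outright; otherwise it commits once enough votes reach min_tier.
    Because tiers are finite, every outcome is exactly replayable."""
    if vetoed:
        return "REJECTED"
    strong = [tier for tier in votes if tier >= min_tier]
    return "COMMITTED" if len(strong) >= quorum else "PENDING"

decided = commit_decision([3, 2, 2, 1], vetoed=False)
blocked = commit_decision([3, 2, 2, 1], vetoed=True)
pending = commit_decision([3, 1, 1, 0], vetoed=False)
```

The point of the discrete formulation is visible even in this sketch: every branch is an exact comparison over a finite set, so each theorem about the rule can be checked by enumerating cases as executable tests.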
From Anomaly to Attack Path: LLM-Based Network Traffic Investigation for APT Detection
🔥 Citations: 0
Abstract: Advanced Persistent Threats (APTs) pose a growing challenge to critical infrastructure security. Their extended timescales, stealth, and multi-stage tactics limit the effectiveness of traditional Network Intrusion Detection Systems (NIDS), as signature-based methods miss novel techniques and anomaly detection alone produces a flood of low-confidence, low-specificity events. We propose a system that combines unsupervised anomaly detection with automated, large language model (LLM)-driven investigation to analyze suspicious network flows and reconstruct attack paths from network-level observations. Flagged flows are examined at both the feature and payload level using a locally hosted language model. Investigation results are aggregated in a graph database to reconstruct the attack progression through the network. Evaluation on the CICAPT-IIoT dataset shows that the system identifies all labeled malicious flows while uncovering mislabeled traffic containing clear indicators of compromise, suggesting detection capability without prior attack knowledge. We further discuss risks introduced by deploying language models in security-critical applications, including prompt injection via malicious payloads and data poisoning attacks.
Prompt-Unknown Promotion Attacks against LLM-based Sequential Recommender Systems
🔥 Citations: 0
Abstract: Large language model-powered sequential recommender systems (LLM-SRSs) have recently demonstrated remarkable performance, enabling recommendations through prompt-driven inference over user interaction sequences. However, this paradigm also introduces new security vulnerabilities, particularly text-level manipulations, rendering them appealing targets for promotion attacks that purposely boost the ranking of specific target items. Although such security risks have been receiving increasing attention, existing studies typically rely on an unrealistic assumption of access to either the victim model or prompt to unveil attack mechanisms. In this work, we investigate the item promotion attack in LLM-SRSs under a more realistic setting where both the system prompt and victim model are unknown to the attacker, and propose a Prompt-Unknown Dual-poisoning Attack (PUDA) framework. To simulate attacks under this full black-box setting, we introduce an LLM-based evolutionary refinement strategy that infers discrete system prompts, enabling the training of an effective surrogate model that mimics the behaviors of the victim model. Leveraging the distilled prompt and surrogate model, we devise a promotion attack that adversarially revises target item texts under semantic constraints, which is further complemented by the highly plausible, surrogate-generated poisoning sequences to enable cost-effective target item promotion. Extensive experiments on real-world datasets demonstrate that PUDA consistently outperforms state-of-the-art competitors in boosting the exposure of unpopular target items. Our findings reveal critical security risks in modern LLM-SRSs even when both prompts and models are protected, and highlight the need for more robust defensive means.
Evaluating the completeness of large language model-generated clinical trial informed consent information for adolescent and young adult patients with central nervous system tumors
🔥 Citations: 0
Abstract: Not available; see the original article.
Cloud-Based Large Language Model Deployment: A Comparative Analysis of Serverless and Bring-Your-Own-Container Architectures
DOI: 10.18267/j.aip.313
🔥 Citations: 0
Abstract: Not available; see the original article.
LLMs Reading the Rhythms of Daily Life: Aligned Understanding for Behavior Prediction and Generation
🔥 Citations: 0
Abstract: Human daily behavior unfolds as complex sequences shaped by intentions, preferences, and context. Effectively modeling these behaviors is crucial for intelligent systems such as personal assistants and recommendation engines. While recent advances in deep learning and behavior pre-training have improved behavior prediction, key challenges remain, particularly in handling long-tail behaviors, enhancing interpretability, and supporting multiple tasks within a unified framework. Large language models (LLMs) offer a promising direction due to their semantic richness, strong interpretability, and generative capabilities. However, the structural and modal differences between behavioral data and natural language limit the direct applicability of LLMs. To address this gap, we propose Behavior Understanding Alignment (BUA), a novel framework that integrates LLMs into human behavior modeling through a structured curriculum learning process. BUA employs sequence embeddings from pretrained behavior models as alignment anchors and guides the LLM through a three-stage curriculum, while a multi-round dialogue setting introduces prediction and generation capabilities. Experiments on two real-world datasets demonstrate that BUA significantly outperforms existing methods in both tasks, highlighting its effectiveness and flexibility in applying LLMs to complex human behavior modeling.
Evaluation of Prompt Injection Defenses in Large Language Models
🔥 Citations: 0
Abstract: LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that evolves its strategies over hundreds of rounds and tested it against nine defense configurations across more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering, which checks the model's responses via hardcoded rules in separate application code before they reach the user, achieving zero leaks across 15,000 attacks. These results demonstrate that security boundaries must be enforced in application code, not by the model being attacked. Until such defenses are verified by tools like Swept AI, AI systems handling sensitive operations should be restricted to internal, trusted personnel.
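The one defense the abstract reports as holding is output filtering: hardcoded rules in application code that inspect the model's response before it reaches the user. A minimal sketch of that pattern (function and variable names are illustrative, not from the paper):

```python
def output_filter(response, secrets):
    """Application-side output filter: block any model response that
    contains a configured secret before returning it to the user.
    The check runs in separate application code, so it cannot be
    talked out of its rules the way the model itself can."""
    for secret in secrets:
        if secret in response:
            return "[response blocked: potential secret leak]"
    return response

secrets = ["sk-demo-12345"]  # illustrative secret string
safe = output_filter("The capital of France is Paris.", secrets)
blocked = output_filter("Sure! The key is sk-demo-12345.", secrets)
```

A production filter would also match encodings and partial leaks (base64, reversed strings, chunked output), since an adaptive attacker will try to smuggle the secret past exact substring checks.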
Optimas: An Intelligent Analytics-Informed Generative AI Framework for Performance Optimization
🔥 Citations: 0
Abstract: Large language models (LLMs) show promise for automated code optimization. However, without performance context, they struggle to produce correct and effective code transformations. Existing performance tools can identify bottlenecks but stop short of generating actionable code changes. Consequently, performance optimization continues to be a time-intensive and manual endeavor, typically undertaken only by experts with detailed architectural understanding. To bridge this gap, we introduce Optimas, a modular, fully automated, end-to-end generative AI framework built on a multi-agent workflow. Optimas uses LLMs to map performance diagnostics from multiple reports to established, literature-backed code transformations, while unifying insight extraction, code generation, execution, and validation within a single pipeline. Across 3,410 real-world experiments on 10 benchmarks and two HPC mini-applications, Optimas generates 100% correct code and improves performance in over 98.82% of those experiments, achieving average gains of 8.02%-79.09% on NVIDIA GPUs.
LLM-Augmented Traffic Signal Control with LSTM-Based Traffic State Prediction and Safety-Constrained Decision Support
🔥 Citations: 0
Abstract: Traffic signal control is a critical task in intelligent transportation systems, yet conventional fixed-time and rule-based methods often struggle to adapt to dynamic traffic demand and provide limited decision interpretability. This study proposes an LLM-augmented traffic signal control framework that integrates LSTM-based short-term traffic state prediction, predictive phase selection, structured large language model reasoning, and safety-constrained action filtering. The LSTM module forecasts future queue length, waiting time, vehicle count, and lane occupancy based on recent intersection-level observations. A predictive controller then generates candidate signal actions, while the LLM module evaluates these actions using structured traffic-state inputs and produces congestion diagnoses, phase adjustment recommendations, and natural-language explanations. To ensure operational reliability, all LLM-generated recommendations are validated by a safety filter before execution. Simulation-based experiments in SUMO compare the proposed method with fixed-time control, rule-based control, and an LSTM-based predictive baseline under balanced demand, directional peak demand, and sudden surge scenarios. The results indicate that the proposed framework improves traffic efficiency, especially under dynamic and non-recurrent traffic conditions, while maintaining zero constraint violations after safety filtering. Overall, this study demonstrates that LLMs can enhance traffic signal control when used as constrained reasoning and decision-support modules rather than direct low-level controllers. Keywords: Intelligent Transportation Systems; Traffic Signal Control; Large Language Models; LSTM; Traffic State Prediction; Decision Support; Safety-Constrained Control; SUMO Simulation.
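The abstract's safety filter, which validates every LLM-generated recommendation before execution, can be sketched as a hard-constraint check; the field names and thresholds below are illustrative assumptions, not taken from the paper:

```python
def safety_filter(action, state, min_green_s=10):
    """Hypothetical safety filter: override an LLM-recommended phase
    switch that would violate a hard signal-timing constraint (here,
    a minimum green time), so unsafe recommendations never execute."""
    if action.get("switch_phase") and state["time_in_phase_s"] < min_green_s:
        return {"switch_phase": False, "reason": "minimum green not met"}
    return action

# A switch recommended mid-phase passes; one recommended too early is vetoed.
ok = safety_filter({"switch_phase": True}, {"time_in_phase_s": 30})
vetoed = safety_filter({"switch_phase": True}, {"time_in_phase_s": 4})
```

Structuring the LLM as a recommender behind such a filter is what lets the paper report zero constraint violations: the language model proposes, but deterministic code disposes.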
Knowledge Vector of Logical Reasoning in Large Language Models
🔥 Citations: 0
Abstract: Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows that each form of logical reasoning can be captured as a reasoning-specific knowledge vector in a linear representation space, yet these vectors are largely independent of each other. Motivated by cognitive science theory that these subforms of logical reasoning interact closely in the human brain, as well as our observation that the reasoning process for one type can benefit from the reasoning chain produced by another, we further propose to refine the knowledge representations of each reasoning type in LLMs to encourage complementarity between them. To this end, we design a complementary subspace-constrained refinement framework, which introduces a complementary loss that enables each reasoning vector to leverage auxiliary knowledge from the others, and a subspace constraint loss that prevents erasure of their unique characteristics. Through steering experiments along reasoning vectors, we find that refined vectors incorporating complementary knowledge yield consistent performance gains. We also conduct a mechanism-interpretability analysis of each reasoning vector, revealing insights into the shared and specific features of different reasoning types in LLMs.
AdaGen: Workload-Adaptive Cluster Scheduler for Latency-Optimal LLM Inference Serving
🔥 Citations: 0
Abstract: The inference workloads of Large Language Models (LLMs) pose significant latency and cost challenges due to increasing model sizes and demand for real-time responses. Existing cluster schedulers for multi-instance LLM serving primarily focus on load balancing to optimize memory usage, which is insufficient for workloads with diverse request characteristics. In such cases, the compute layout — the arrangement of tokens across iterations within each instance—plays a crucial role in determining latency. We propose AdaGen, a workload-adaptive cluster scheduler that minimizes latency and thus maximizes SLO attainment by optimizing compute layouts across instances. AdaGen employs a multi-step scheduling strategy: it first classifies requests based on prefill and decode lengths, then balances load, and finally performs selective distributed execution across instances. Each step incrementally refines the scheduling based on the compute layouts derived from the decision of the previous step. To avoid the overhead of actual execution to generate the layouts, AdaGen introduces a novel simulation-based estimator. Extensive experiments using production workloads show that AdaGen achieves up to 3.6× higher SLO attainment and 2× better cost-efficiency compared to the existing systems, while ensuring scalability.
Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
🔥 Citations:
0
Abstract: Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p<10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
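The McNemar test used above compares paired per-question outcomes of two models on the same items. One common exact two-sided variant (the abstract does not say which variant was used) needs only the discordant counts:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant pairs:
    b = questions only model A answered correctly,
    c = questions only model B answered correctly.
    Under the null, discordant outcomes are Binomial(n, 0.5)."""
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)  # cap: symmetric case double-counts the middle term
```

Concordant pairs (both right or both wrong) drop out entirely, which is why only the 2×2 table's off-diagonal matters.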
Real Time Adaptive Teaching Digital Human System Based on Large Language Model
DOI:
10.71451/istaer2620
🔥 Citations:
0
Abstract: With the diversification and personalization of educational needs, the traditional teaching model is facing many challenges, especially in meeting students' personalized learning needs and providing real-time feedback. This study proposes a real-time adaptive teaching digital human system based on a large language model, which aims to improve the quality of education and student engagement through intelligent technology. The system obtains students' learning status in real time through a variety of data acquisition devices (such as learning behavior tracking, question-answering records, and speech recognition) and uses the large language model to generate personalized teaching content and feedback. The system calculates a comprehensive learning status score by evaluating students' learning progress, answer accuracy, participation, and learning time, thereby dynamically adjusting the teaching content and learning path. Experimental results show that with the system, students' knowledge mastery rate increases by 15%, understanding depth by 17%, and learning interest by 20%. The system's response time is reduced from 5 seconds (in traditional systems) to 1.5 seconds, and its processing capacity is increased by 2.5 times, supporting more concurrent users. The successful implementation of the system provides a new solution for personalized education and has broad application prospects, especially in online education and distance learning.
AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models
🔥 Citations:
0
Abstract: Mechanistic interpretability research on emotion in large language models -- linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction -- depends on stimuli that contain the words for the emotions they test. When a probe fires on "I am furious", it is unclear whether the model has detected anger or detected the word "furious". The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik's eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery -- bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier -- confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p<10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.
Evaluating Large Language Model Feedback to Support Teacher Professional Vision: Prompting Strategies and Pre-service Teacher Engagement
🔥 Citations:
0
Abstract: Professional vision—the use of knowledge to notice and reason about classroom events—is an important facet of teacher expertise that is challenging for pre-service teachers (PSTs) to develop. Although feedback on written reflections of videotaped classroom events can improve professional vision, delivering feedback at scale requires significant time and labor. We explored the use of large language models (LLMs) to classify professional vision aspects in PSTs’ responses and provide feedback. We first conducted a technical evaluation to compare the classification performance of two prompting approaches (single-prompt and prompt-chaining). We then integrated the better-performing approach into a feedback prototype and conducted a user study with 13 PSTs. We analyzed PSTs’ interactions with the LLM-generated feedback. Overall, prompt-chaining showed higher inter-rater agreement with human coders in classifying professional vision than the single-prompt approach. PSTs who cycled between reading the feedback and revising reflection text showed more professional vision in the revisions, compared to those who spent more time passively reviewing feedback. We discuss the feasibility and limitations of LLM feedback in PST education and highlight considerations about AI feedback uptake.
Automated Classification of Human Code Review Comments with Large Language Models
🔥 Citations:
0
Abstract: Context: Code reviews are essential for maintaining software quality, yet many human review comments suffer from issues such as redundancy, vagueness, or lack of constructiveness. These types of comments may slow down feedback and obscure important insights. Prior work on code review comments mostly explores the detection and categorization of useful comments, while fine-grained categorization of comment issues remains underexplored. Objective: This work aims to design and evaluate an automated system for classifying code review comments according to specific categories of issues. Methodology: We introduced a nine-label taxonomy for code review comments, covering six review comment smells and three common useful intents, and manually labeled 448 comments from a publicly available dataset. We benchmarked zero-shot and one-shot single-label classification over each comment and its associated unified diff hunk, comparing GPT-5-mini, LLaMA-3.3, and DeepSeek-R1. We reported macro-F1 as the primary metric. Results: Zero-shot performance was moderate under class imbalance (macro-F1 0.360 to 0.374). One-shot exemplar conditioning had model-dependent effects: GPT-5-mini and DeepSeek-R1 macro-F1 scores improved, whereas LLaMA-3.3 suffered a slight decrease. Exemplars most consistently helped intent-boundary labels, whereas classification of evidence-sensitive labels remains challenging. Conclusion: Our results indicate that comment-diff evidence is sufficient for some labels but limited for evidence-sensitive smells. Future work includes adding thread context, improving intent-preserving rewrites, and validating robustness across platforms.
EndoGov: A knowledge-governed multi-agent expert system for endometrial cancer risk stratification
🔥 Citations:
0
Abstract: Multimodal artificial intelligence models for endometrial cancer (EC) risk stratification typically optimize aggregate predictive performance but provide limited mechanisms for enforcing mandatory guideline overrides, such as assigning POLE-mutated tumors to the low-risk group despite high-grade morphology. We present EndoGov, a two-tier multi-agent expert system that factorizes the decision process as D(x) = G(P(x), R), where specialist agents P extract structured evidence and a governance agent G applies an executable rule set R. Tier 1 comprises pathology, molecular, and clinical agents that independently generate schema-constrained reports from frozen foundation-model features or structured records. Tier 2 queries an evidence-level-weighted Guideline Knowledge Graph, using deterministic hard-path rules for high-priority overrides and constrained soft-path reasoning for ambiguous cases. In TCGA-UCEC (n=541), EndoGov achieved 0.943 accuracy, 0.973 macro AUC, and a conditional logic-violation rate (C-LVR) of 0.93% among trigger-exposed cases. In CPTAC-UCEC (n=95), where reference labels are guideline-derived, EndoGov reached 0.842 accuracy compared with <0.31 for locked-transfer neural baselines, supporting governance-pathway transfer under distribution shift rather than validation against independent clinical truth. End-to-end safety decomposition localized residual failures primarily to upstream molecular detection rather than downstream governance. Backend-swap experiments further showed that hard-path compliance is invariant to the LLM backend. These findings indicate that explicit clinical-rule governance can provide guideline-compliant, auditable EC risk assignment while preserving competitive discrimination.
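The factorization D(x) = G(P(x), R) can be illustrated with a toy hard-path override of the kind the abstract mentions (POLE-mutated tumors forced to low risk). The evidence schema and rule set below are invented for illustration, not EndoGov's actual guideline rules:

```python
def specialists(case):
    # P(x): specialist agents emit structured, schema-constrained evidence.
    return {"pole_mutated": case.get("pole", False),
            "grade": case.get("grade", 1)}

def governance(evidence, rules):
    # G(P(x), R): deterministic hard-path rules fire first, in priority order;
    # anything they do not cover falls through to soft-path reasoning.
    for condition, label in rules:
        if condition(evidence):
            return label
    return "intermediate"  # stand-in for the constrained soft path

RULES = [  # illustrative stand-in for a guideline rule set R
    (lambda e: e["pole_mutated"], "low"),   # mandatory POLE override
    (lambda e: e["grade"] >= 3, "high"),    # high-grade morphology
]
```

Because the override sits in G rather than in a learned model, a POLE-mutated case is assigned "low" even when the morphology evidence would otherwise trigger "high", which is exactly the auditability argument the abstract makes.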
DTAC: Decision-Tree-based Automatic Configuration of Entangled DDoS Defense Policies
🔥 Citations:
0
Abstract: As adaptive DDoS attacks intensify, the expert-dependent and time-consuming configuration for DDoS mitigation devices lags behind. However, the complex logical and dependency relations among defense policies make automatic configuration extremely challenging. Existing works have utilized Large Language Models' (LLMs) analytical and reasoning capabilities, but they still struggle with the configuration, due to their limitation in data processing and generation. To overcome these drawbacks, we propose an automatic configuration system for entangled DDoS defense policies, DTAC, which applies Decision Tree (DT) to handle the complex relations. Specifically, we reduce the LLM's workload to extracting the complex relations from the device manual. Then, a prefix-merging-based Primary Parameter Solver and a topo-order-based Secondary Parameter Solver separately handle the logical relations between conditions and the dependency relations between parameters. Preliminary experiments on public and real-world traffic traces show that DTAC can yield effective defense policies at least 1000× faster than manual configuration.
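A topo-order-based solver for parameter dependencies, as mentioned above, typically reduces to a topological sort of the dependency graph. A sketch using Kahn's algorithm (the parameter names are hypothetical; the paper's actual solver is not specified at this level of detail):

```python
from collections import deque

def topo_order(deps):
    """Kahn's algorithm: deps maps each parameter to the parameters it
    depends on; returns an order in which every dependency comes first."""
    indeg = {p: len(d) for p, d in deps.items()}
    users = {p: [] for p in deps}
    for p, d in deps.items():
        for q in d:
            users[q].append(p)  # q must be resolved before p
    queue = deque(sorted(p for p, k in indeg.items() if k == 0))
    order = []
    while queue:
        p = queue.popleft()
        order.append(p)
        for u in users[p]:
            indeg[u] -= 1
            if indeg[u] == 0:
                queue.append(u)
    if len(order) != len(deps):
        raise ValueError("cyclic dependency among parameters")
    return order
```

Resolving parameters in this order guarantees that each secondary parameter is computed only after everything it depends on is already fixed.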
Evaluating large language model's performance in answering principles of health course questions
🔥 Citations:
0
Abstract: No abstract available; see the original paper.
Detecting Citation Veneer in Guideline-Grounded Clinical and Public Health Large Language Model (LLM) Outputs: A Clinical Evidence Audit Grid Using CENHOV Tags
🔥 Citations:
0
Abstract: No abstract available; see the original paper.
Development and Evaluation of a Dual-Expertise, Utterance-Level Framework for LLM-Based Science Classroom Discourse Analysis
🔥 Citations:
0
Abstract: This study proposes a novel coding framework for analyzing science classroom discourse using large language models (LLMs), adopting fine-grained utterance-level chunking aligned with the analytical units of LLMs to address limitations of global, lesson-level observation tools. Authentic middle school science classroom discourse was annotated through a dual-expertise and iterative process integrating the theoretical knowledge of science education faculty with the experiential insights of in-service teachers, supported by systematic rater training to ensure conceptual alignment and interpretive consistency at the utterance level. Through this process, a science education glossary comprising 137 instructional terms organized into 20 thematic categories was developed using a primarily bottom-up approach informed by established observation frameworks. Building on this theory-informed foundation, we systematically examined LLM-based methods for predicting instructional themes and quality, comparing structured prompting strategies with domain-adaptive fine-tuning across model architectures. These contributions lay a foundation for future research on interpretable, scalable, and pedagogically meaningful automated formative feedback to support teachers’ self-reflection and professional growth.
Crimson: Collaborative Parameter Updates for Efficient Pipeline Training of Large Language Models
🔥 Citations:
0
Abstract: Large language models (LLMs) have driven significant progress in natural language processing, yet their training and fine-tuning remain limited by memory constraints, particularly the substantial memory footprints of optimizer states. Existing solutions address this challenge by offloading optimizer states and update tasks to the CPU, but this often leads to increased GPU idleness due to the CPU's limited computational capabilities, especially in pipeline parallelism. To address this, we present Crimson, an efficient system that optimizes parameter updates in pipeline parallelism. Crimson employs a collaborative parameter-update scheme that distributes each stage's updates across its local GPU and CPU as well as the GPUs of subsequent stages. In addition, we design a multi-stage collaborative optimizer to coordinate computations and communications, together with a trapezoid bubble aware scheduling that formulates the process as an optimization problem to determine the optimal parameter update strategy. This approach maximizes GPU computational and memory efficiency. Experimental results across diverse model scales, parallelism configurations and state-of-the-art pipeline schedules demonstrate that Crimson achieves throughput improvements exceeding 1.35× with ZeRO-Offload and ZeRO-Offload++, and up to 1.33× with Recomputation, along with increased GPU utilization.
Energy Consumption Analysis of Discrete Diffusion and Autoregressive Language Models
🔥 Citations:
0
Abstract: As Large Language Models (LLMs) continue to scale, their energy consumption during inference has become a critical concern. While Autoregressive (AR) models are the industry standard, Diffusion Language Models (DLLMs) have emerged as a promising alternative capable of parallel token generation. In this paper, we perform a comparative energy benchmark of diffusion models (LLaDA-8B, Dream-7B) against AR counterparts (Llama-3.1-8B, Qwen-2.5-7B). We analyze the trade-offs between model architecture, KV-cache optimization, and Joules-per-token efficiency. Our results show that at longer sequences (200-600 tokens), diffusion models achieve 2.5-2.7 J/token, approximately 2× more efficient than AR models with KV-cache (5.3-5.7 J/token). KV-cache provides substantial benefits for AR models at longer sequences (50-70% energy reduction).
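The Joules-per-token metric above is simply measured energy divided by generated tokens. Assuming power is sampled at a fixed interval (a common measurement setup, not necessarily the paper's), a left Riemann sum gives:

```python
def joules_per_token(power_samples_w, interval_s, tokens):
    """Approximate energy per generated token: integrate sampled power
    (watts) over time via a Riemann sum, then divide by the token count."""
    energy_j = sum(power_samples_w) * interval_s  # W * s = J
    return energy_j / tokens
```

With this definition, an AR model at 5.5 J/token and a diffusion model at 2.7 J/token differ by roughly the 2× factor the abstract reports.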
Learning Analytics for Assessment Preparation: Constructing Graphs to Guide Higher-Order Question Generation
🔥 Citations:
0
Abstract: High-quality assessment at scale requires questions that are conceptually rich, appropriately difficult, and cognitively engaging. We present a learning analytics pipeline that transforms archival assessments and learner performance data for question generation. Past assessment questions from an undergraduate engineering physics course are multi-labelled with topics from a predefined syllabus, and a primary large language model (LLM) is selected via inter-rater reliability against expert judgments to ensure label quality. Using these labels and student performance data, we construct an undirected weighted graph where nodes represent topics and edges capture both co-occurrence and difficulty, with student performance used as a proxy. Exploring this graph reveals difficulty-calibrated topic sets that can be retrieved around target topics. We then evaluate LLM-generated questions for difficulty and alignment with the Structure of Observed Learning Outcomes taxonomy when provided with the requested topics. Empirically, we found that GPT-5 Thinking shows higher agreement with expert labels. We further found that the generated questions, grounded by the topic-performance graph, were deemed to be more difficult and of higher-order intent under blinded expert review. The contribution is a scalable creation of higher-order assessment items with appropriate difficulty that are imperative for generative personalised learning and assessment systems.
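The topic graph described above — nodes for topics, undirected edges weighted by co-occurrence and by difficulty with student performance as the proxy — can be sketched as follows. The field names and the specific difficulty definition (one minus mean score) are assumptions for illustration:

```python
from itertools import combinations
from collections import defaultdict

def build_topic_graph(questions):
    """Undirected weighted graph: each edge between two topics accumulates a
    co-occurrence count and a difficulty weight (1 - mean student score over
    the questions that involve both topics)."""
    edges = defaultdict(lambda: {"cooc": 0, "scores": []})
    for q in questions:
        for a, b in combinations(sorted(set(q["topics"])), 2):
            e = edges[(a, b)]          # sorted pair keeps the edge undirected
            e["cooc"] += 1
            e["scores"].append(q["mean_score"])
    return {pair: {"cooc": e["cooc"],
                   "difficulty": 1 - sum(e["scores"]) / len(e["scores"])}
            for pair, e in edges.items()}
```

Retrieving the neighborhood of a target topic in such a graph then yields the difficulty-calibrated topic sets the pipeline feeds to the question generator.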
Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing
🔥 Citations:
0
Abstract: With the wide adoption of personal AI assistants such as OpenClaw, privacy leakage in user interaction contexts with large language model (LLM) agents has become a critical issue. Existing privacy attacks against LLMs primarily target training data, while research on inference-time contextual privacy risks in LLM agent memory remains limited. Moreover, prior methods often incur high attack costs, requiring multiple queries or relying on white-box assumptions, which limits their practicality in real-world deployments. To address these issues, we propose a training-free privacy extraction attack targeting LLM agent memory, which we name Spore. Spore is compatible with both black-box and gray-box settings. In the black-box setting, Spore can efficiently extract a small candidate set via a single query to recover the original private information. In the gray-box setting, Spore allows the attacker to leverage multi-ranked tokens for more accurate and faster privacy extraction. We provide an information-theoretic analysis of Spore and show that it achieves high query efficiency with substantial per-query information leakage. Experiments on multiple frontier LLMs show that Spore achieves higher attack success rates than existing state-of-the-art (SOTA) schemes. It also maintains low attack cost and remains stable across different model parameter settings. We further evaluate the robustness of Spore against existing defense mechanisms. Our results show that Spore consistently bypasses both detection and strong safety alignment, demonstrating resilient performance in diverse defensive settings and real-world safety threats.
General-Purpose Topology-Aware Embedding of Tumor Phylogenetic Trees with Graph Neural Networks
🔥 Citations:
0
Abstract: Phylogenetic trees are tree-like data structures commonly adopted to mathematically represent cancer clonal evolution. The information encoded by phylogenetic trees is important for clinical outcomes, but the automatic extraction of such information is still hard, also because working directly with tree-like data structures is complex. This is especially true for machine learning tasks, where models are usually designed for vector data.
We introduce CPhyT-GNN, a novel Deep Learning method to compute unsupervised embeddings of phylogenetic trees. The embeddings learnt by CPhyT-GNN are vectors that can be used for a variety of machine learning tasks. CPhyT-GNN is based on Graph Neural Networks, which makes it possible to obtain representations that combine the information provided by the alterations present in the tumour with the topological information provided by the corresponding phylogenetic tree. Experiments with cancer data show that the embeddings learnt by our model are general-purpose and can be applied to different tasks, with results that improve on the state of the art.
Data and code are available at the following link: https://github.com/VandinLab/CPhyT-GNN.
Supplementary material is available at Bioinformatics Advances online.
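The core GNN idea — combining per-node (alteration) features with tree topology into a fixed-size vector — can be miniaturized as a few rounds of neighbor averaging followed by a permutation-invariant readout. This toy sketch only stands in for the general mechanism; CPhyT-GNN's actual architecture is not specified in the abstract:

```python
def embed_tree(features, edges, rounds=2):
    """Toy message passing: each node averages its own feature vector with
    its neighbors' (repeated for a few rounds), then the tree embedding is
    the mean over all nodes — an order-invariant readout."""
    nbrs = {v: [] for v in features}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    h = {v: list(f) for v, f in features.items()}
    dim = len(next(iter(features.values())))
    for _ in range(rounds):
        h = {v: [(h[v][i] + sum(h[u][i] for u in nbrs[v])) / (1 + len(nbrs[v]))
                 for i in range(dim)]
             for v in h}
    return [sum(h[v][i] for v in h) / len(h) for i in range(dim)]
```

Because the readout averages over nodes, the resulting vector does not depend on node ordering, which is what makes such embeddings usable as plain feature vectors in downstream models.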
Time-Series Forecasting in Safety-Critical Environments: An EU-AI-Act-Compliant Open-Source Package / Zeitreihenprognose in sicherheitskritischen Umgebungen: Ein KI-VO-konformes Open-Source-Paket
🔥 Citations:
0
Abstract: With spotforecast2-safe we present an integrated Compliance-by-Design approach to Python-based point forecasting of time series in safety-critical environments. A review of the relevant open-source tooling shows that existing compliance solutions operate consistently outside of the library to be used - e.g. as scanners, templates, or runtime layers. spotforecast2-safe takes the inverse approach and anchors the requirements of Regulation (EU) 2024/1689 (the EU AI Act, in German: KI-VO), of IEC 61508, of the ISA/IEC 62443 standards series, and of the Cyber Resilience Act within the library: in application-programming-interface contracts, persistence formats, and continuous-integration gates. The approach is operationalised by four non-negotiable code-development rules (zero dead code, deterministic processing, fail-safe handling, minimal dependencies) together with the corresponding process rules (model card, executable docstrings, CI workflows, Common-Platform-Enumeration (CPE) identifier, REUSE-conformant licensing, release pipeline). Interactive visualisation, hyperparameter tuning and automated machine learning (AutoML), as well as deep-learning and large-language-model backends are deliberately excluded, because each of these components either enlarges the attack surface, introduces non-determinism, or impairs reproducibility. A bidirectional traceability matrix maps every regulatory provision onto the corresponding mechanism in the code; an end-to-end example of European-market electricity generation, transmission, and consumption forecasting demonstrates the application. The package is open-source and available under Affero General Public License (AGPL) 3.0-or-later.
FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment
🔥 Citations:
0
Abstract: In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision-Language Models (VLMs), their deployment in clinical settings has raised concerns due to their lack of transparency and potential for bias. While previous research has explored the intersection of fairness and Explainable AI (XAI), its application to VLMs for wellbeing assessment and depression prediction remains under-explored. This work investigates VLM performance across laboratory (AFAR-BSFT) and naturalistic (E-DAIC) datasets, focusing on diagnostic reliability and demographic fairness. Performance varied substantially across environments and architectures; Phi3.5-Vision achieved 80.4% accuracy on E-DAIC, while Qwen2-VL struggled at 33.9%. Additionally, both models demonstrated a tendency to over-predict depression on AFAR-BSFT. Although bias existed across both architectures, Qwen2-VL showed higher gender disparities, while Phi-3.5-Vision exhibited more racial bias. Our XAI intervention framework yielded mixed results; fairness prompting achieved perfect equal opportunity for Qwen2-VL at a severe accuracy cost on E-DAIC. On AFAR-BSFT, explainability-based interventions improved procedural consistency but did not guarantee outcome fairness, sometimes amplifying racial bias. These results highlight a persistent gap between procedural transparency and equitable outcomes. We analyse these findings and consolidate concrete recommendations for addressing them, emphasising that future fairness interventions must jointly optimise predictive accuracy, demographic parity, and cross-domain generalisation.
ragR: Retrieval-Augmented Generation and RAG Assessment in R
🔥 Citations:
0
Abstract: Retrieval-augmented generation (RAG) combines document retrieval with large language models to produce responses grounded in external evidence. While several R packages support core components of RAG workflows, integrated evaluation of RAG systems in R remains limited and is often conducted through Python-based tools, most notably the RAG assessment (RAGAS) framework. To address this gap, we introduce ragR, an R package that unifies document ingestion, embedding and vector storage, similarity-based retrieval, grounded generation, structured question-answer logging, and RAGAS-style evaluation within a single R-native workflow. The current implementation provides LLM-based scoring for four core RAGAS metrics: context precision, context recall, faithfulness, and answer relevance. Validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow across multiple use cases. By integrating RAG construction and evaluation within a reproducible workflow in R, ragR provides a practical framework for research, teaching, and moderate-scale experimentation on RAG systems entirely within the R ecosystem.
Vulnerabilities of IllusionCAPTCHA: An Empirical Analysis and a Stronger Human-Easy AI-Hard Design
🔥 Citations:
0
Abstract: As multimodal Large Language Models (LLMs) rapidly improve in visual reasoning and in-context learning, designing CAPTCHAs that remain both human-friendly and resistant to automated attacks has become increasingly difficult. Illusion-based CAPTCHAs have emerged as a promising direction, leveraging perceptual distortions that are intuitive for humans but presumed challenging for machines. IllusionCAPTCHA is a recently proposed system that pairs illusion-diffusion imagery with long descriptive options designed to interfere with automated reasoning. Its original evaluation reported low LLM success rates and high human accuracy, positioning illusion-based approaches as a promising direction for future CAPTCHA design. However, our re-analysis demonstrates that these results do not hold under realistic attack conditions. Using state-of-the-art models, we show that IllusionCAPTCHA can be reliably solved using structured prompting or iterative questioning. Our findings reveal that the vulnerability stems primarily from the option-generation scheme rather than from the illusion images themselves. To address this, we introduce a fully automated and redesigned option-generation method that uses an LLM to produce short, visually grounded, and ambiguity-preserving choices—effectively leveraging LLMs as a defensive tool against LLM attackers. Human accuracy increased from 45.24% to 89.85%, solving time decreased by 58%, and LLM success rates decreased substantially across all models and prompting conditions. These results suggest that illusion-based CAPTCHAs remain viable, but they require carefully engineered option structures to withstand modern LLM capabilities.
RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
🔥 Citations:
0
Abstract: Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40-85% cost reduction while retaining 96-100% quality on structured tasks and 96-98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.
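The conformal-prediction threshold initialization mentioned above can be sketched with split conformal calibration: hold out calibration queries, score each with a nonconformity measure (e.g. one minus the router's confidence), and escalate new queries whose score exceeds a quantile of those calibration scores. The scoring function itself is assumed given here:

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: return the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score. Escalating queries whose score exceeds
    this threshold gives distribution-free coverage >= 1 - alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]
```

The appeal for cascading is that the guarantee holds for any score distribution, so the initial threshold needs no distributional tuning before the closed loop starts refining it.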
The Blind Spots in Automated Feedback Generation for Academic Writing
🔥 Citations:
0
Abstract: Machine learning-based automated essay scoring (AES) and feedback generation (AFG) tools have been developed since the 1960s, with some commercially deployed. Such systems are expected to lighten the labor-intensive task of essay scoring and help provide timely feedback to students. However, these tools have not been used effectively enough in practice despite the long history of this research field. We aim to determine how accurately currently available models can make the necessary corrections and detect improvable segments of text. The latest attempts at AFG utilize state-of-the-art generative large language models (LLMs) for writing evaluation. To the best of our knowledge, however, none of the past studies have included generative LLM-based models in a fine-grained sentence-to-sentence comparison with human feedback or among AI tools. To fill these gaps, we conduct an experimental comparison of human feedback and three AI tools developed at different stages of technological advancement. Findings indicate that the overlap between human and AI feedback is predominantly limited to surface-level linguistic features and that generative AI-augmented tools demonstrate a markedly higher capability than a tool based on conventional rule-based AI.
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
🔥 Citations:
0
Abstract: Large language models (LLMs) have shown strong performance on legal benchmarks, including multiple-choice components of bar exams. However, their capacity for generating open-ended legal reasoning in realistic scenarios remains insufficiently explored. Notably, to the best of our knowledge, there are no prior studies or datasets addressing this issue in the Japanese context. This study presents the first dataset designed to evaluate the open-ended legal reasoning performance of LLMs within the Japanese jurisdiction. The dataset is based on the writing component of the Japanese bar examination, which requires examinees to identify multiple legal issues from long narratives and to construct structured legal arguments in free-text format. Our key contribution is the manual evaluation of LLMs' generated responses by legal experts, which reveals limitations and challenges in legal reasoning. Moreover, we conducted a manual analysis of hallucinations to characterize when and how the models introduce content not supported by precedent or law. Our real exam questions, model-generated responses, and expert evaluations reveal the milestones of current LLMs in the Japanese legal domain. Our dataset and relevant resources will be available online.
LLMFolder: Revisiting Constant Folding in Large Language Models
🔥 Citations:
0
Abstract: Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts. While pruning can reduce model size, it leads to significant accuracy degradation under high compression ratios. We present a novel perspective inspired by constant folding in compiler optimization. Our approach enables parameter reduction by treating activation functions in LLMs as linear functions. However, recent LLMs use complex non-linear activations like GELU that prevent direct application of this technique. We propose LLMFolder, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges. For outlier inputs, LLMFolder employs an online predictor to dynamically fall back to original computations. Our experiments demonstrate that LLMFolder achieves 80% parameter reduction in feed-forward networks, while significantly outperforming state-of-the-art pruning methods with up to 65% higher accuracy. When combined with quantization and pruning, LLMFolder can further achieve 92.5% parameter reduction with only 4.4% average accuracy drop for a 7B model, which neither technique alone nor their combination can achieve. In practical deployments for a 7B model, LLMFolder achieves 1.6× end-to-end inference speedup when integrated with the vLLM serving system, and 1.4× speedup with the widely adopted HuggingFace implementation, while incurring only a 10.9% accuracy trade-off.
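Treating an activation as linear over a frequent input range — which is what makes the surrounding linear layers foldable — with a fallback to the exact computation for outliers, can be sketched as below. The range endpoints and the secant-line surrogate are illustrative choices, not LLMFolder's actual approximation:

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def folded_gelu(x, lo=-1.0, hi=3.0):
    """Linear surrogate for GELU inside a frequently occurring input range;
    out-of-range (outlier) inputs fall back to the exact non-linearity."""
    if lo <= x <= hi:
        # Secant line through (lo, gelu(lo)) and (hi, gelu(hi)).
        slope = (gelu(hi) - gelu(lo)) / (hi - lo)
        return gelu(lo) + slope * (x - lo)
    return gelu(x)  # fallback path
```

Once the activation is linear on the fast path, the two weight matrices it sits between compose into one, which is where the parameter reduction comes from; the fallback preserves correctness for the rare outlier activations.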
Adoption and pedagogical use of generative artificial intelligence in K-12 mathematics/science/language classrooms: Challenges and opportunities
🔥 引用:
0
Abstract: Generative artificial intelligence in K-12 mathematics, science, and language classrooms has already created major opportunities for personalized learning, automated feedback, curriculum development, and student interaction, while raising multiple questions about ethical AI, academic dishonesty, teacher preparedness, algorithmic bias, and student data security. With the spread of large language models, generative AI-driven learning aids, and multimodal learning platforms in schools, interest is growing in the pedagogical uses, utility, limitations, and implications of generative AI in school learning. This literature review follows the PRISMA framework to systematically examine current research on the adoption and pedagogical integration of generative AI in K-12 mathematics, science, and language education. The review synthesizes evidence on dominant research themes, patterns of implementation, and gaps in current scholarship. Special focus was placed on adaptive learning, intelligent tutoring systems, automated assessment, teacher professional development, multilingual learning, differentiated instruction, and human-AI collaboration. The results demonstrate that personalized learning, cognitive scaffolding, inquiry-based learning, project-based learning, formative assessment, and self-regulated learning can be greatly enhanced in mathematics, science, and language classrooms with the help of generative AI. It is concluded that responsible AI systems, enhanced teacher education, fair access, and evidence-based pedagogical models that balance innovation with human oversight are the components needed for future-ready deployment of generative AI in K-12 education.
One Size Fits None: Heuristic Collapse in LLM Investment Advice
🔥 引用:
0
Abstract: Large language models are increasingly deployed as advisors in high-stakes domains -- answering medical questions, interpreting legal documents, recommending financial products -- where good advice requires integrating a user's full context rather than responding to salient surface features. We investigate whether frontier LLMs actually do this, or whether they instead exhibit heuristic collapse: a systematic reduction of complex, multi-factor decisions to a small number of dominant inputs. We study the phenomenon in investment advice, where legal standards explicitly require individualized reasoning over a client's full circumstances. Applying interpretable surrogate models to LLM outputs, we find systematic heuristic collapse: investment allocation decisions are largely determined by self-reported risk tolerance, while other relevant factors contribute minimally. We further find that web search partially attenuates heuristic collapse but does not resolve it. These findings suggest that heuristic collapse is not resolved by web search augmentation or model scale alone, and that deploying LLMs as advisors requires auditing input sensitivity, not just output quality.
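The auditing idea here, fitting interpretable surrogate models to an advisor's outputs to see which inputs actually drive the decision, can be illustrated with a toy example. The synthetic advisor, the client features, and the single-feature R² surrogate below are all hypothetical stand-ins for the paper's setup:

```python
import random

random.seed(0)

# Hypothetical synthetic advisor exhibiting heuristic collapse: the equity
# allocation is driven almost entirely by self-reported risk tolerance.
def advisor(risk_tol, age, horizon_years, income):
    return 0.9 * risk_tol + 0.02 * (horizon_years / 40) + 0.01 * (income / 200_000)

profiles = [dict(risk_tol=random.random(),
                 age=random.randint(25, 70),
                 horizon_years=random.randint(1, 40),
                 income=random.randint(30_000, 200_000)) for _ in range(500)]
alloc = [advisor(**p) for p in profiles]

def r2_single_feature(key):
    # R^2 of a one-feature least-squares line: a minimal interpretable
    # surrogate measuring how much one input alone explains the decisions
    xs = [p[key] for p in profiles]
    mx = sum(xs) / len(xs)
    my = sum(alloc) / len(alloc)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, alloc))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, alloc))
    ss_tot = sum((y - my) ** 2 for y in alloc)
    return 1 - ss_res / ss_tot

scores = {k: r2_single_feature(k)
          for k in ("risk_tol", "age", "horizon_years", "income")}
```

On this synthetic advisor, risk tolerance alone explains nearly all allocation variance while the other client attributes explain almost none, which is the signature the paper's surrogate analysis looks for in real LLM outputs.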
Physics-Aware LLM-Based Probabilistic Wind Power Scenario Generation under Extreme Icing Conditions
🔥 引用:
0
Abstract: Accurately characterizing wind power uncertainty under icing and post-disaster conditions remains a critical challenge for resilient power system operation. To address this issue, this paper proposes a physics-aware large language model (LLM) framework for probabilistic wind power scenario generation under extreme icing conditions. The proposed framework integrates supervisory control and data acquisition (SCADA)-based physical modeling, multimodal tokenization, and a causal Transformer architecture trained in an autoregressive manner. A physics-aware decoding scheme effectively enforces rated power limits and ramping constraints on the generated trajectories while preserving stochastic diversity. Case studies using real wind turbine data show that the proposed method reproduces icing-induced power degradation and temporal variability observed during extreme weather. The resulting scenarios are physically consistent and high-fidelity, thereby significantly enhancing resilience assessment and recovery planning in renewable-integrated power systems.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
🔥 引用:
0
Abstract: While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.
Inverting Foundation Models of Brain Function with Simulation-Based Inference
🔥 引用:
0
Abstract: Foundation models of brain activity promise a new frontier for in silico neuroscience by emulating neural responses to complex stimuli across tasks and modalities. A natural next step is to ask whether these models can also be used in reverse. Can we recover a stimulus or its properties from synthetic brain activity? We study this question in a proof-of-concept setting using TRIBEv2. We pair the brain emulator with large language models (LLMs) that generate news headlines from linguistic parameters such as valence, arousal, and dominance. We then use simulation-based inference to learn a probabilistic mapping from brain maps to latent stimulus parameters. Our results show that these parameters can be recovered from predicted brain maps, validating the quality of neural encodings. They also show that LLMs can serve as controllable stimulus generators for simulated experiments. Together, these findings provide a step toward decoding and inverse design with foundation brain models.
LLM-CEG: Extending the Classification Error Gauge Framework for Privacy Auditing of Large Language Models
🔥 引用:
0
Abstract: This paper extends the Classification Error Gauge (x-CEG) framework, originally developed for measuring the privacy-utility trade-off in tabular datasets, to privacy auditing of Large Language Models (LLMs). We propose LLM-CEG, a systematic framework that employs membership inference attack (MIA) success rates as an empirical privacy gauge and model perplexity as a utility gauge, iteratively adjusting differential privacy parameters until both thresholds are jointly satisfied. A proof-of-concept prototype fine-tunes DistilGPT-2 on a synthetic clinical PII dataset under four privacy regimes using DP-SGD. Results indicate that DP-SGD reduces MIA attacker advantage by 71.5% while simultaneously improving out-of-distribution utility by 47-50% relative to the overfitted baseline, suggesting that differential privacy may act as implicit regularization under narrow fine-tuning conditions. We further extend the SIED engineering framework to the LLM context as LLM-SIED, providing an auditable, regulator-aligned process for privacy-compliant LLM deployment.
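The core loop the abstract describes, iteratively adjusting the differential privacy parameter until the privacy gauge (MIA attacker advantage) and the utility gauge (perplexity) are jointly under threshold, can be sketched as follows. The grid sweep, the toy audit function, and all thresholds are illustrative assumptions, not the LLM-CEG implementation:

```python
# Hypothetical x-CEG-style tuning loop: sweep epsilon from loose to tight and
# return the first (largest) setting at which both gauges are satisfied.
def tune_epsilon(train_and_audit, eps_grid, max_advantage, max_perplexity):
    for eps in sorted(eps_grid, reverse=True):  # loosest privacy first
        advantage, perplexity = train_and_audit(eps)
        if advantage <= max_advantage and perplexity <= max_perplexity:
            return eps
    return None  # no epsilon in the grid satisfies both gauges

# toy audit: tighter privacy (smaller eps) lowers MIA advantage but raises
# perplexity, mimicking the usual privacy-utility trade-off of DP-SGD
audit = lambda eps: (eps / 10.0, 10.0 / eps)
chosen = tune_epsilon(audit, [1, 2, 4, 8], max_advantage=0.4, max_perplexity=5.0)
```

In the toy sweep, eps = 8 fails the privacy gauge while eps = 4 passes both, so the loop settles on the loosest compliant setting; in practice `train_and_audit` would wrap a full DP-SGD fine-tuning run plus a membership inference attack.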
Weakly Supervised Multicenter Nancy Index Scoring in Ulcerative Colitis Using Foundation Models
🔥 引用:
0
Abstract: Histologic assessment of ulcerative colitis (UC) activity is an important endpoint in clinical trials and routine care, but manual grading with indices such as the Nancy histological index (NHI) is time-consuming and prone to observer variability. While computational pathology methods can automate scoring, many approaches depend on dense region-level annotations, which are costly to obtain, particularly in heterogeneous, multicenter cohorts. We propose a weakly supervised multiple instance learning (MIL) approach for whole-slide images that learns from case- and slide-level NHI labels, leveraging foundation models. Our method targets clinically relevant endpoints, including neutrophilic activity and derived Nancy-low/high groupings, enabling full five-grade NHI prediction. On a multicenter dataset of H&E-stained colon biopsies from three hospitals (2019-2025), we evaluate multiple foundation model encoders and aggregation strategies. We find that foundation model choice and resolution substantially affect performance, with Virchow2 providing the most consistent gains, and that a simple ensembling rule improves five-grade NHI prediction compared to a hierarchical gating baseline. Overall, our results demonstrate that weakly supervised MIL with modern foundation-model representations can provide robust, interpretable UC histology activity assessment in realistic multicenter settings.
Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work
🔥 引用:
0
Abstract: With the emergence of large language models (LLMs) and AI agent frameworks, the human-AI co-work paradigm known as Vibe Coding is changing how people code, making it more accessible and productive. In scientific research, where workflows are more complex and the burden of specialized labor limits independent researchers and those in low-resource areas, the potential impact is even greater, particularly in biomedicine, which involves heterogeneous data modalities and multi-step analytical pipelines. In this paper, we introduce Vibe Medicine, a co-work paradigm in which clinicians and researchers direct skill-augmented AI agents through natural language to execute complex, multi-step biomedical workflows, while retaining the role of research director who specifies objectives, reviews intermediate results, and makes domain-informed decisions. The enabling infrastructure consists of three layers: capable LLMs, agent frameworks such as OpenClaw and Hermes Agent, and the OpenClaw medical skills collection, which includes more than 1,000 curated skills from multiple open-source repositories. We analyze the architecture and skill categories of this collection across ten biomedical domains, and present case studies covering rare disease diagnosis, drug repurposing, and clinical trial design that demonstrate end-to-end workflows in practice. We also identify the principal risks, such as hallucination, data privacy, and over-reliance, and outline directions toward more reliable, trustworthy, and clinically integrated agent-assisted research that advances research and technological equity and reduces health care resource disparities.
Large Language Model based Interactive Decision-Making for Autonomous Driving
🔥 引用:
0
Abstract: In high-conflict mixed-traffic scenarios involving human-driven and autonomous vehicles, most existing autonomous driving systems default to overly conservative behaviors, lack proactive interaction, and consequently suffer from limited public acceptance. To mitigate intent misunderstandings and decision failures, we present a Large Language Model based interactive decision-making framework that augments scene understanding and intent-aware interaction to jointly improve safety and efficiency. The approach uses Object-Process Methodology to semantically model complex multi-vehicle scenes, abstracting low-level perceptual data into objects, processes, and relations, thereby streamlining reasoning over latent causal structure. Building on this representation, the Large Language Model parses both explicit and implicit intents of surrounding agents and, under jointly enforced safety and efficiency constraints, selects candidate maneuvers. We further generate perturbed trajectory candidates via Monte Carlo sampling and evaluate them to obtain an optimized executable trajectory. To foster transparency and coordination with nearby road users, the final decision is translated by the Large Language Model into concise natural-language messages and broadcast through an external Human-Machine Interface, completing a closed loop from scene understanding to action to language. Experiments in a cluster driving simulator demonstrate that the proposed method outperforms traditional baselines across safety, comfort, and efficiency metrics, while a Turing-test-style evaluation indicates a high degree of human-likeness in decision making. Moreover, these results suggest that coupling semantic scene abstraction with Large Language Model mediated intent reasoning and language-based eHMI communication offers a practical pathway toward interactive, trustworthy autonomous driving in dense mixed traffic.
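The trajectory-optimization step described here, perturbing candidates via Monte Carlo sampling and selecting the best under a cost, admits a minimal sketch. The waypoint representation, the Gaussian noise model, and the lane-keeping cost below are hypothetical stand-ins for the paper's joint safety/efficiency evaluation:

```python
import random

random.seed(1)

# Hypothetical Monte Carlo refinement: jitter a nominal plan with Gaussian
# noise and keep the lowest-cost candidate, never returning worse than nominal.
def refine_trajectory(nominal, cost, n_samples=200, sigma=0.3):
    best = nominal
    for _ in range(n_samples):
        candidate = [wp + random.gauss(0.0, sigma) for wp in nominal]
        if cost(candidate) < cost(best):
            best = candidate
    return best

# toy example: lateral offsets per step should hug a reference lane at y = 0
nominal = [0.5, 0.4, 0.6, 0.5]
lane_cost = lambda traj: sum(y * y for y in traj)
refined = refine_trajectory(nominal, lane_cost)
```

By construction the refined trajectory's cost never exceeds the nominal plan's, which is the guarantee one wants before handing the result to a downstream controller.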
BurstGP: Enhancing Raw Burst Image Super Resolution with Generative Priors
🔥 引用:
0
Abstract: Burst image super resolution (BISR) aims to construct a single high-resolution (HR) image by aggregating information from multiple low-resolution (LR) frames, relying on temporal redundancy and spatial coherence across the burst. While conventional methods achieve impressive results, they often struggle with complex textures and oversmoothing. Diffusion models, particularly those pretrained on high-quality data, have shown remarkable capability in generating realistic details for image and video super-resolution. However, their potential remains largely under-explored in BISR, where existing approaches typically rely on task-specific diffusion models trained from scratch and operate on single-frame reconstructions. In this work, we propose BurstGP, a novel diffusion-based solution for BISR, which leverages generative priors of recent foundation models to overcome these issues. In particular, we build a multiframe-aware diffusion model on top of a conventional BISR approach, which boosts image quality with minimal loss to fidelity. Further, we introduce (i) a novel degradation-aware conditioning mechanism, which controls synthesis of fine details based on the estimated degradation in the input, and (ii) a robust sRGB-to-lRGB inverter, enabling us to utilize generative multiframe (video) sRGB priors, while operating with raw input and lRGB output images. Empirically, we demonstrate that BurstGP outperforms the existing state of the art, both quantitatively (especially with respect to perceptual metrics, including MUSIQ and LPIPS) and qualitatively. In particular, our proposed method excels at recovering richer textures and finer structural details, highlighting the potential of video priors for BISR over traditional methods.
Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation
🔥 引用:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.
TailorLLM: Collaborative End-Cloud Inference of Large and Small Language Models Based on Low-Rank Adaptation
🔥 引用:
0
Abstract: With the rapid expansion of large language model inference service users, cloud computing resource costs have become a critical challenge for service providers. Although utilizing end-device resources for auxiliary inference provides new possibilities to reduce cloud computing costs, existing solutions struggle to achieve an ideal balance across multi-task accuracy, end-to-end latency, and cloud computing costs. We present TailorLLM, a task-level collaborative end-cloud inference solution based on low-rank fine-tuning for large language models. This framework comprises two core algorithms that support offline and online optimization, respectively: (i) To reduce transmission overhead while maintaining model performance, Resource-Friendly Low-Rank Adaptation (RFLoRA) decouples pre-trained parameters into cold and hot modules, reducing trainable parameters. (ii) To ensure coverage of users' common tasks, we introduce AdapterMgr, an imitation learning-based replacement strategy that enables near-optimal dynamic management of the on-device LoRA matrix library. Finally, we implemented the TailorLLM prototype system on NVIDIA 3090 and Tesla T4 servers and thoroughly evaluated it on public task datasets. Compared to a series of baselines, TailorLLM reduces cloud resource consumption by up to 69.8% and inference latency by up to 62% while maintaining high accuracy.
MTRouter: Cost-Aware Multi-Turn LLM Routing with History-Model Joint Embeddings
🔥 引用:
0
Abstract: Multi-turn, long-horizon tasks are increasingly common for large language models (LLMs), but solving them typically requires many sequential model invocations, accumulating substantial inference costs. Here, we study cost-aware multi-turn LLM routing: selecting which model to invoke at each turn from a model pool, given a fixed cost budget. We propose MTRouter, which encodes the interaction history and candidate models into joint history-model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. Experiments show that MTRouter improves the performance-cost trade-off: on ScienceWorld, it surpasses GPT-5 while reducing total cost by 58.7%; on Humanity's Last Exam (HLE), it achieves competitive accuracy while reducing total cost by 43.4% relative to GPT-5, and these gains even carry over to held-out tasks. Further analyses reveal several mechanisms underlying its effectiveness: relative to prior multi-turn routers, MTRouter makes fewer model switches, is more tolerant to transient errors, and exhibits emergent specialization across models. Code: https://github.com/ZhangYiqun018/MTRouter
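The per-turn routing rule the abstract outlines, scoring each candidate model with a learned outcome estimator and selecting under a cost budget, can be illustrated with a small stub. The model pool, the stub estimator, and the greedy selection rule are assumptions for illustration, not MTRouter's learned joint embeddings:

```python
# Hypothetical turn-level router: among models the remaining budget can still
# afford, pick the one with the highest predicted utility for this history.
def route_turn(history, models, predict_utility, budget_left):
    affordable = [m for m in models if m["cost_per_turn"] <= budget_left]
    if not affordable:
        return None  # budget exhausted
    return max(affordable, key=lambda m: predict_utility(history, m))

models = [
    {"name": "small", "cost_per_turn": 1.0},
    {"name": "large", "cost_per_turn": 10.0},
]

# stub estimator: the large model only pays off on "hard" turns
estimate = lambda history, m: (
    (0.9 if m["name"] == "large" else 0.6)
    if history.endswith("hard")
    else (0.5 if m["name"] == "large" else 0.55)
)

pick_hard = route_turn("turn 3: hard", models, estimate, budget_left=20.0)
pick_easy = route_turn("turn 1: easy", models, estimate, budget_left=20.0)
pick_broke = route_turn("turn 9: hard", models, estimate, budget_left=5.0)
```

The sketch captures the trade-off the paper studies: expensive models are invoked only when the estimator predicts they pay off and the budget allows it, which is how fewer large-model calls can coexist with competitive accuracy.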
PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics
🔥 引用:
0
Abstract: Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
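The "perspective-consistent scaling" mentioned above follows directly from the pinhole camera model: projected size is inversely proportional to depth. A minimal sketch, assuming a simple pinhole model rather than PhysLayer's actual formulation:

```python
# Under a pinhole camera, a layer moving from depth z0 to z1 is rescaled by
# z0 / z1: objects shrink as they recede, with no 3D reconstruction needed.
def perspective_scale(z0, z1):
    if z0 <= 0 or z1 <= 0:
        raise ValueError("depths must be positive (in front of the camera)")
    return z0 / z1

# doubling an object's depth halves its projected size
factor = perspective_scale(2.0, 4.0)
```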
Agentic Adversarial Rewriting Exposes Architectural Vulnerabilities in Black-Box NLP Pipelines
🔥 引用:
0
Abstract: Multi-component natural language processing (NLP) pipelines are increasingly deployed for high-stakes decisions, yet no existing adversarial method can test their robustness under realistic conditions: binary-only feedback, no gradient access, and strict query budgets. We formalize this strict black-box threat model and propose a two-agent evasion framework operating in a semantic perturbation space. An Attacker Agent generates meaning-preserving rewrites while a Prompt Optimization Agent refines the attack strategy using only binary decision feedback within a 10-query budget. Evaluated against four evidence-based misinformation detection pipelines, the framework achieves evasion rates of 19.95 to 40.34% on modern large language model (LLM) based systems, compared to at most 3.90% for token-level perturbation baselines that rely on surrogate models because they cannot operate under our threat model. A legacy system relying on static lexical retrieval exhibits near-total vulnerability (97.02%), establishing a lower bound that exposes how architectural choices govern the attack surface. Evasion effectiveness is associated with three architectural properties: evidence retrieval mechanism, retrieval-inference coupling, and baseline classification accuracy. The iterative prompt optimization yields the largest marginal gains against the most robust targets, confirming that adaptive strategy discovery is essential when evasion is non-trivial. Analysis of successful rewrites reveals four exploitation patterns, each targeting failures at distinct pipeline stages. A pattern-informed defense reduces the evasion rate by up to 65.18%.
From Formal Learning to Professional Practice: Automated LLM-based Coding and Visualisation of Team Dialogue in in-situ Healthcare Simulation
🔥 引用:
2
Abstract: Simulation-based learning is central to healthcare education, yet its effectiveness depends on high-quality debriefing. Traditional debriefs often overlook detailed team dialogue dynamics. Advances in large language models (LLMs) open new possibilities for learning analytics (LA) by automatically coding and visualising teamwork behaviours from dialogue data. This study investigates the effectiveness of different prompting strategies for LLM-based coding, comparing their performance and environmental impacts (CO2e) to identify approaches suitable for transfer into professional practice. Building on these results, we evaluate the generalisability of the optimised model from university student simulations to in-situ, hospital settings, and explore how healthcare professionals perceive the interpretability, usefulness, and trustworthiness of LLM-driven learning analytics in professional learning debriefs. Findings illustrate that responsible uses of AI can help extend LA beyond a controlled university environment into an authentic, in-hospital healthcare context, offering potentially scalable and sustainable support for reflective practice and professional development.
Agentic Fusion of Large Atomic and Language Models to Accelerate Materials Discovery
🔥 引用:
0
Abstract: The discovery of novel materials is critical for global energy and quantum technology transitions. While deep learning has fundamentally reshaped this landscape, existing predictive or generative models typically operate in isolation, lacking the autonomous orchestration required to execute the full discovery process. Here we present ElementsClaw, an agentic framework for materials discovery that synergizes Large Atomic Models (LAMs) with Large Language Models (LLMs). In response to varied human requirements, ElementsClaw dynamically orchestrates a suite of LAM tools finetuned from our proposed model Elements for atomic-scale numerical computation, while leveraging LLMs for high-level semantic reasoning. This shift moves AI-driven materials science from isolated processes toward integrated and human interactive discovery. In the demanding domain of superconductors, our agentic system guides the experimental synthesis of four new superconductors, including Zr3ScRe8 with a transition temperature of 6.8 K and HfZrRe4 at 6.7 K. At scale, ElementsClaw screens more than 2.4 million stable crystals within only 28 GPU hours, identifying 68,000 high-confidence superconducting candidates and vastly expanding the known superconducting space. These results demonstrate how our agent accelerates materials discovery with high physical fidelity.
Resource-Lean Lexicon Induction for German Dialects
🔥 引用:
0
Abstract: Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
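The feature side of the approach, string-similarity features over dialect-standard word pairs, can be sketched concretely. The specific features, the toy dialect spellings, and the plain averaged score standing in for the trained random forest are all illustrative assumptions:

```python
# Classic dynamic-programming Levenshtein (edit) distance between two strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def features(src, tgt):
    # hypothetical feature set of the kind a random forest could train on:
    # normalized edit similarity, shared-prefix ratio, and length ratio
    longest = max(len(src), len(tgt))
    sim = 1.0 - levenshtein(src, tgt) / longest
    prefix = 0
    for x, y in zip(src, tgt):
        if x != y:
            break
        prefix += 1
    return [sim, prefix / longest, min(len(src), len(tgt)) / longest]

def score(src, tgt):
    # unweighted average stands in for the trained forest's prediction
    f = features(src, tgt)
    return sum(f) / len(f)

# toy dialect-to-standard ranking (illustrative spellings, not gold data)
dialect_word = "hus"  # cf. standard German "Haus"
candidates = ["Haus", "Maus", "Hase", "Hand"]
best = max(candidates, key=lambda c: score(dialect_word.lower(), c.lower()))
```

Even this crude score ranks the correct cognate first in the toy example; the paper's contribution is that a random forest trained on such features generalizes across dialects and outperforms much larger LLMs.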
ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
🔥 引用:
0
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance in Visually Rich Document Understanding (VRDU) tasks, but their capabilities are mainly evaluated on pristine, well-structured document images. We consider content restoration from shredded fragments, a challenging VRDU setting that requires integrating visual pattern recognition with semantic reasoning under significant content discontinuities. To facilitate systematic evaluation of complex VRDU tasks, we introduce ShredBench, a benchmark supported by an automated generation pipeline that renders fragmented documents directly from Markdown. The proposed pipeline ensures evaluation validity by allowing the flexible integration of latest or unseen textual sources to prevent training data contamination. ShredBench assesses four scenarios (English, Chinese, Code, Table) with three fragmentation granularities (8, 12, 16 pieces). Empirical evaluations on state-of-the-art MLLMs reveal a significant performance gap: models are effective on intact documents, but once the document is shredded, restoration becomes a significant challenge, with NED dropping sharply as fragmentation increases. Our findings highlight that current MLLMs lack the fine-grained cross-modal reasoning required to bridge visual discontinuities, identifying a critical gap in robust VRDU research.
To What Extent Would AI Take Over the Roles of Teachers and University Professors - A Perception of Digitalization and AI Integration in Romania's Educational Environment
DOI:
10.36713/epra27317
🔥 引用:
0
Abstract: The rapid advancement of Artificial Intelligence has moved from the peripheral margins of educational technology into the core structural fabric of modern learning environments (Anamaria-Mirabela et al., 2024). In the contemporary era, AI is no longer merely a tool for distributing information; it has become an independent mediator between information and human cognition (Georgescu et al., 2025). This paradigm shift is particularly pronounced in Romania, where the educational system is navigating a complex transition from traditional pedagogical models to a digitalized infrastructure accelerated by post-pandemic imperatives (Fleaca et al., 2022). As generative AI and adaptive learning systems become ubiquitous, a fundamental question emerges: to what extent will these technologies supplement, or perhaps eventually supersede, the professional roles of teachers and university professors? In the Romanian context, the integration of AI is not occurring in a vacuum but is part of a broader European trend toward "digital readiness" and the closing of digital gaps between rural and urban education sectors (Moroianu et al., 2023). Higher education institutions, in particular, are facing unprecedented pressure to adapt to technological interconnections that redefine what it means to be an educator (Fleaca et al., 2022). While some scholars argue that AI will revolutionize the classroom by fostering inclusive and equitable quality education, others express profound concern regarding the potential harm to the human element of pedagogy (Anamaria-Mirabela et al., 2024). The perception of digitalization in Romania is therefore characterized by a duality: the recognition of AI as a strategic ally for efficiency and personalization, versus the fear of a cognitive disengagement where human interaction and critical thinking are diminished (Voicu et al., 2026).
The role of the professor is currently evolving from being the primary distributor of knowledge to a mentor who must cultivate ethical discernment and collaborative inquiry (Georgescu et al., 2024). This transition is fraught with challenges, as the advantages of AI are often unevenly distributed, potentially reinforcing existing social disparities within the Romanian educational landscape (Georgescu et al., 2024). Furthermore, the integration of AI chatbots and large language models into the academic workflow has raised significant questions regarding academic integrity and the erosion of intellectual autonomy among students (Malița et al., 2025). This article aims to investigate the perceptions of Romanian educators and students regarding this transformative process. By analyzing the current state of digitalization and the specific cultural and structural barriers present in Romania, this research seeks to determine whether AI is perceived as a replacement for the human educator or as a sophisticated augmentation of the teaching profession. The following sections will explore the theoretical frameworks of AI integration, the specific challenges of the Romanian digital transition, and the shifting identity of the academic professional in the age of automation.
PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
🔥 引用:
0
Abstract: Physics-aware symbolic simulation of 3D scenes is critical for robotics, embodied AI, and scientific computing, requiring models to understand natural language descriptions of physical phenomena and translate them into executable simulation environments. While large language models (LLMs) excel at general code generation, they struggle with the semantic gap between physical descriptions and simulation implementation. We introduce PhysCodeBench, the first comprehensive benchmark for evaluating physics-aware symbolic simulation, comprising 700 manually-crafted diverse samples across mechanics, fluid dynamics, and soft-body physics with expert annotations. Our evaluation framework measures both code executability and physical accuracy through automated and visual assessment. Building on this, we propose a Self-Corrective Multi-Agent Refinement Framework (SMRF) with three specialized agents (simulation generator, error corrector, and simulation refiner) that collaborate iteratively with domain-specific validation to produce physically accurate simulations. SMRF achieves 67.7 points overall performance compared to 36.3 points for the best baseline among evaluated SOTA models, representing a 31.4-point improvement. Our analysis demonstrates that error correction is critical for accurate physics-aware symbolic simulation and that specialized multi-agent approaches significantly outperform single-agent methods across the tested physical domains.
Benchmarking Testing in Automated Theorem Proving
🔥 引用:
0
Abstract: Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proofs, or on expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T, a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.
Hamiltonian Graph Inference Networks: Joint structure discovery and dynamics prediction for lattice Hamiltonian systems from trajectory data
🔥 引用:
0
Abstract: Lattice Hamiltonian systems underpin models across condensed matter, nonlinear optics, and biophysics, yet learning their dynamics from data is obstructed by two unknowns: the interaction topology and whether node dynamics are homogeneous. Existing graph-based approaches either assume the graph is given or, as in the α-separable graph Hamiltonian network, infer it only for separable Hamiltonians with homogeneous node dynamics. We introduce the Hamiltonian Graph Inference Network (HGIN), which jointly recovers the interaction graph and predicts long-time trajectories from state data alone, for both separable and non-separable Hamiltonians and under heterogeneous node dynamics. HGIN couples a structure-learning module (a learnable weighted adjacency matrix trained under a Hamilton's-equations loss) with a trajectory-prediction module that partitions edges into physically distinct subgraphs via k-means clustering, assigning each subgraph its own encoder and thereby breaking the parameter-sharing bottleneck of conventional GNNs. On three benchmarks (a Klein–Gordon lattice with long-range interactions and two discrete nonlinear Schrödinger lattices, homogeneous and heterogeneous), HGIN reduces long-time energy prediction error and trajectory prediction error by six to thirteen orders of magnitude relative to baselines. A symmetry argument on the Hamiltonian loss further shows that the learned weights encode the parity of the underlying pair potential, yielding an interpretable readout of the system's interaction structure.
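The Hamilton's-equations loss named in this abstract can be made concrete with a small sketch. The squared-residual form below is an assumption (one plausible reading, not the authors' implementation), and `grad_H` is a hypothetical callable returning the Hamiltonian's partial derivatives:

```python
import numpy as np

def hamiltons_equations_loss(q, p, dqdt, dpdt, grad_H):
    # Residual of Hamilton's equations: dq/dt = dH/dp, dp/dt = -dH/dq.
    # grad_H(q, p) -> (dH/dq, dH/dp). A trajectory of the true system
    # drives both residuals to zero.
    dHdq, dHdp = grad_H(q, p)
    return float(np.mean((dqdt - dHdp) ** 2 + (dpdt + dHdq) ** 2))

# Toy check with the harmonic oscillator H = (q^2 + p^2) / 2:
t = np.linspace(0.0, 2 * np.pi, 200)
loss = hamiltons_equations_loss(
    np.sin(t), np.cos(t),    # q, p on the exact trajectory
    np.cos(t), -np.sin(t),   # their time derivatives
    lambda q, p: (q, p),     # dH/dq = q, dH/dp = p
)
```

On the exact trajectory the residual vanishes, so the loss is zero up to floating-point error; a perturbed trajectory yields a strictly positive value.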
When AI reviews science: Can we trust the referee?
🔥 引用:
3
Abstract: The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive—and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle—training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.
From Solo Graders to Assisted Annotation: Integrating LLM Suggestions into the Educational Data Creation Pipeline
🔥 引用:
0
Abstract: The creation of annotated datasets is important for Automatic Short Answer Grading (ASAG), but establishing a reliable gold standard remains a challenge, often complicated by discrepancies among human annotators. Within Learning Analytics (LA), this challenge is particularly pressing because dataset reliability underpins the validity of analytic models and the feedback they provide to learners and educators. To address this, this study investigated the dynamics between human raters and Large Language Models (LLMs) in the assessment cycle, comparing approaches ranging from human-supported annotation to automated scoring. The study was conducted with students' Chemistry responses from a Brazilian high school. Responses were graded by teachers under scenarios with and without the aid of score suggestions generated by an LLM (GPT-4o-mini). Results showed that human grading with LLM suggestions (Human–AI interaction) yielded outcomes significantly closer to the gold standard (Kappa = 0.91) than approaches based solely on humans or on automated systems. Additionally, LLM suggestions increased agreement among human graders. Our findings suggest that the Human–AI process is not only a technical improvement but also a methodological contribution to LA. This offers a potential pathway for constructing more reliable datasets that strengthen the validity of subsequent analytics and the support they provide in educational practice.
SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models
🔥 引用:
0
Abstract: Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. While text-to-SQL systems enable natural language querying, they remain limited to relational operations and cannot leverage the semantic reasoning capabilities of modern large language models (LLMs). Conversely, recent semantic operator systems extend relational algebra with LLM-powered operations (e.g., semantic joins, mappings, aggregations), but require users to manually construct complex query pipelines. To address this gap, we present SEMA-SQL, a system that automatically answers natural language questions by generating efficient queries that combine relational operations with LLM semantic reasoning. We formalize Hybrid Relational Algebra (HRA), a declarative abstraction unifying traditional relational operators with LLM user-defined functions (UDFs). The system automates three critical aspects: (1) query generation via in-context learning that produces HRA queries with precise natural language specifications for LLM UDFs, (2) query optimization via cost-based transformations and UDF rewriting, and (3) efficient execution algorithms that reduce LLM invocations by an average of 93% in semantic joins through intelligent batching. Extensive experiments with known benchmarks, and extensions thereof, demonstrate the significant query capability improvements possible with our design.
Designing, developing, and evaluating educational games that provide adaptive feedback with generative artificial intelligence
🔥 引用:
0
Abstract: Aiming to address pedagogical challenges in programming education and the limitations of static feedback mechanisms, this study designed and evaluated an AI-supported adaptive Role-Playing Game (RPG) for teaching C++. The Design-Based Research methodology was adopted to bridge the gap between theoretical design and practice. Across five iterative cycles involving 70 participants, a dynamic feedback system responsive to student performance was created by integrating Retrieval-Augmented Generation (RAG) architecture and Large Language Models. Findings indicated that the short-term game intervention did not yield a statistically significant change in academic attitude and motivation. However, in later stages where system stability was secured, qualitative data indicated a shift in student focus from interface issues to deeper learning of programming concepts. In-game log data suggested that adaptive mechanisms and reflective elements helped transition students from inefficient trial-and-error strategies toward more analytical behavior. Crucially, the findings imply that the reflection mechanism contributed to establishing a psychologically safe space where mistakes were treated as an intrinsic part of the learning process rather than failures. This personalized support appeared to show potential in mitigating the cycle of learned helplessness by supporting students' emotional resilience. The study offers a set of evidence-based principles for designers by illuminating the behavioral effects of using generative AI as an educational scaffold.
The Collapse of Heterogeneity in Silicon Philosophers
🔥 引用:
0
Abstract: Silicon samples are increasingly used as a low-cost substitute for human panels and have been shown to reproduce aggregate human opinion with high fidelity. We show that, in the alignment-relevant domain of philosophy, silicon samples systematically collapse heterogeneity. Using data from N = 277 professional philosophers drawn from PhilPeople profiles, we evaluate seven proprietary and open-source large language models on their ability to replicate individual philosophical positions and to preserve cross-question correlation structures across philosophical domains. We find that language models substantially over-correlate philosophical judgments, producing artificial consensus across domains. This collapse is associated in part with specialist effects, whereby models implicitly assume that domain specialists hold highly similar philosophical views. We assess the robustness of these findings by studying the impact of DPO fine-tuning and by validating results against the full PhilPapers 2020 Survey (N = 1785). We conclude by discussing implications for alignment, evaluation, and the use of silicon samples as substitutes for human judgment. The code of this project can be found at https://github.com/stanford-del/silicon-philosophers.
Preliminary Evaluation of Large Language Models in Kennedy Classification and Removable Partial Denture Planning: An In Silico Study
DOI:
10.12659/msm.953353
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🔥 引用:
0
Abstract: Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch overhead, particularly in interactive, short-sequence settings. This paper presents a hybrid runtime framework that combines Just-In-Time (JIT) compilation with CUDA Graph execution to reduce launch overhead while preserving runtime flexibility during autoregressive decoding. The framework partitions transformer inference into static components executed via CUDA Graph replay and dynamic components handled through JIT-compiled kernels, enabling asynchronous graph capture and reuse across decoding steps. We evaluate the proposed approach on LLaMA-2 7B using single-GPU, batch-size-one inference across prompt lengths from 10 to 500 tokens. Experimental results show that the hybrid runtime reduces Time-to-First-Token (TTFT) by up to 66.0% and achieves lower P99 latency compared with TensorRT-LLM in this regime. These results indicate that hybrid JIT-CUDA Graph execution can effectively reduce inference latency and variance for short-sequence LLM workloads, making it a practical optimization strategy for latency-sensitive AI applications.
HBGSA: Hydrogen Bond Graph with Self-Attention for Drug-Target Binding Affinity Prediction
🔥 引用:
0
Abstract: Accurate prediction of drug-target binding affinity accelerates drug discovery by prioritizing compounds for experimental validation. Current methods face three limitations: sequence-based approaches discard spatial geometric constraints, structure-based methods fail to exploit hydrogen bond features, and conventional loss functions neglect prediction-target correlation, a key factor for identifying high-affinity compounds in virtual screening. We developed HBGSA (Hydrogen Bond Graph with Self-Attention), a 3.06M-parameter model that encodes hydrogen bond spatial features. HBGSA uses graph neural networks to model hydrogen bond spatial topology with self-attention enhancement and Pearson correlation loss. Experimental results on PDBbind Core Set and CSAR-HiQ dataset demonstrate that HBGSA outperforms baseline methods with strong generalization capability. Ablation studies confirm the effectiveness of hydrogen bond modeling and Pearson correlation loss.
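The Pearson correlation loss this abstract names rewards predictions that preserve the rank ordering of affinities, which matters more than absolute error when prioritizing compounds in virtual screening. A minimal sketch of one plausible form, 1 - r (an assumption, not the authors' code):

```python
import numpy as np

def pearson_corr_loss(pred, target, eps=1e-12):
    # 1 - Pearson r: near zero when predictions are perfectly (positively)
    # correlated with targets, near 2 when perfectly anti-correlated.
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pc, tc = pred - pred.mean(), target - target.mean()
    r = (pc * tc).sum() / (np.sqrt((pc ** 2).sum() * (tc ** 2).sum()) + eps)
    return 1.0 - r
```

In practice such a term is typically added to a pointwise loss (e.g. MSE) so the model is penalized both for absolute error and for scrambling the affinity ranking.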
Learning from Noisy Prompts: Saliency-Guided Prompt Distillation for Robust Segmentation with SAM
🔥 引用:
0
Abstract: Segmentation is central to clinical diagnosis and monitoring, yet the reliability of modern foundation models in medical imaging still depends on the availability of precise prompts. The Segment Anything Model (SAM) offers powerful zero-shot capabilities, although it collapses under the weak, generic, and noisy prompts that dominate real clinical workflows. In practice, annotations such as centerline points are coarse and ambiguous, often drifting across neighboring anatomy and misguiding SAM toward inconsistent or incomplete masks. We introduce SPD, a Saliency-Guided Prompt Distillation framework that converts these unreliable cues into robust guidance. SPD first learns data-driven anatomical priors through a lightweight saliency head to obtain confident localization maps. These priors then drive Contextual Prompt Distillation, which validates and enriches noisy prompts using cues from anatomically adjacent slices, producing a consensus prompt set that matches the behavior of expert reasoning. A Pairwise Slice Consistency objective further enforces local anatomical coherence during segmentation. Experiments on four challenging MRI and CT benchmarks demonstrate that SPD consistently outperforms existing SAM adaptations and supervised baselines, delivering large gains in both region-based and boundary-based metrics. SPD provides a practical and principled path toward reliable foundation model deployment in clinical environments where only imperfect prompts are available.
Failed comprehensiveness, successful minimalism: Wikipedia’s 3-year struggle to govern AI-generated content (2022–2025)
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition
🔥 引用:
0
Abstract: Cloud-hosted Large Language Models (LLMs) offer unmatched reasoning capabilities and dynamic knowledge, yet submitting raw queries to these external services risks exposing sensitive user intent. Conversely, relying exclusively on trusted local models preserves privacy but often compromises answer quality due to limited parameter scale and knowledge. To resolve this dilemma, we propose Game-theoretic Trustworthy Knowledge Acquisition (GTKA), a framework that formulates the trade-off between knowledge utility and privacy as a strategic game. GTKA consists of three components: (i) a privacy-aware sub-query generator that decomposes sensitive intent into generalized, low-risk fragments; (ii) an adversarial reconstruction attacker that attempts to infer the original query from these fragments, providing adaptive leakage signals; and (iii) a trusted local integrator that synthesizes external responses within a secure boundary. By training the generator and attacker in an alternating adversarial manner, GTKA optimizes the sub-query generation policy to maximize knowledge acquisition accuracy while minimizing the reconstructability of the original sensitive intent. To validate our approach, we construct two sensitive-domain benchmarks in the biomedical and legal fields. Extensive experiments demonstrate that GTKA significantly reduces intent leakage compared to state-of-the-art baselines while maintaining high-fidelity answer quality.
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
🔥 引用:
0
Abstract: Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, current planning approaches face a fundamental limitation: they operate at a fixed level of granularity. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of progressive refinement in cognitive science, we propose AdaPlan-H, a self-adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse-grained macro plan and progressively refines it based on task complexity. It generates self-adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized through imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi-step complex decision-making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import-myself/AHP.
Mixture of Heterogeneous Grouped Experts for Language Modeling
🔥 引用:
0
Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE), which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.
𝒮2IT: Stepwise Syntax Integration Tuning for Large Language Models in Aspect Sentiment Quad Prediction
🔥 引用:
0
Abstract: Aspect Sentiment Quad Prediction (ASQP) has seen significant advancements, largely driven by the powerful semantic understanding and generative capabilities of large language models (LLMs). However, while syntactic structure information has been proven effective in previous extractive paradigms, it remains underutilized in the generative paradigm of LLMs due to their limited reasoning capabilities. In this paper, we propose S^2IT, a novel Stepwise Syntax Integration Tuning framework that progressively integrates syntactic structure knowledge into LLMs through a multi-step tuning process. The training process is divided into three steps. S^2IT decomposes the quadruple generation task into two stages: 1) Global Syntax-guided Extraction and 2) Local Syntax-guided Classification, integrating both global and local syntactic structure information. Finally, Fine-grained Structural Tuning enhances the model's understanding of syntactic structures through the prediction of element links and node classification. Experiments demonstrate that S^2IT significantly improves state-of-the-art performance across multiple datasets. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/S2IT.
Carbon Footprint of Training Large AI Models: Measuring and Proposing Mitigation Strategies for the Environmental Cost of AI
DOI:
10.55041/ijsrem61165
🔥 引用:
0
Abstract:
Over the past decade, artificial intelligence has quietly shifted from a research curiosity to infrastructure that billions of people depend on every day. Large language models write code, assist doctors, translate languages, and hold conversations in ways that once seemed firmly out of reach. But behind every chatbot interaction and AI-assisted search result, there is a data center consuming electricity and in most parts of the world, that electricity still comes largely from fossil fuels. The carbon cost of building and running these systems is real, growing, and almost entirely invisible to the people using them.
This paper investigates that hidden cost in depth. We examine how energy demands have grown alongside model size, from BERT's modest 652 kg of CO2 in 2018 to independent estimates placing GPT-4's training footprint in the thousands of tonnes. We survey the measurement tools researchers use to quantify these emissions and expose the significant gaps that remain. We then present a comprehensive set of mitigation strategies spanning algorithmic efficiency, hardware improvements, geographic scheduling, renewable energy, and transparency standards. Finally, we honestly confront the structural barriers (competitive scaling incentives, the Jevons Paradox, and the near-total absence of embodied carbon from current estimates) that make progress harder than the technical solutions alone would suggest. The central argument is straightforward: sustainable AI is technically achievable. What it requires is the collective will to treat it as a genuine priority.
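The per-model figures this abstract cites all reduce to the same back-of-envelope arithmetic used across the Green-AI literature: accelerator energy (GPU-hours times power draw), inflated by the data center's PUE, times the grid's carbon intensity. A sketch with illustrative inputs (the numbers below are assumptions, not figures from the paper):

```python
def training_emissions_kg(gpu_hours, gpu_power_kw, pue, grid_kg_per_kwh):
    # kWh consumed by the accelerators, scaled by PUE for facility
    # overhead (cooling, networking), then converted to kgCO2 via the
    # local grid's carbon intensity.
    return gpu_hours * gpu_power_kw * pue * grid_kg_per_kwh

# e.g. 10,000 GPU-hours at 0.4 kW, PUE 1.5, on a 0.5 kgCO2/kWh grid:
kg = training_emissions_kg(10_000, 0.4, 1.5, 0.5)  # 3,000 kg = 3 tCO2
```

The same formula shows why geographic scheduling matters: moving the identical job to a grid at 0.05 kgCO2/kWh cuts the estimate tenfold with no algorithmic change.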
Keywords: Artificial Intelligence, Carbon Footprint, Large Language Models, Energy Consumption, Green AI, Sustainable Computing, CO2 Emissions, Machine Learning, Inference Efficiency, Embodied Carbon.
Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue
🔥 引用:
0
Abstract: Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
INSIGHT: Indoor Scene Intelligence from Geometric-Semantic Hierarchy Transfer for Public Safety
🔥 引用:
0
Abstract: Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine-readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety-critical features by native point-cloud methods. This paper presents INSIGHT, a zero-target-domain-annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB-D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation-model stack for text-prompted segmentation, and a traditional CV stack (open-set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D-3D-S (70,496 images), the pipeline produces Pointcept-schema-compatible labeled point clouds and ISO 19164-compliant scene graphs with ~10^4× compression; role-filtered payloads transmit in <15 s at 1 Mbps over FirstNet Band 14. We report per-point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety-critical classes absent from public 3D benchmarks alongside code-capped deployable estimates, and inter-pipeline complementarity, demonstrating that 2D-to-3D semantic transfer addresses the labeled-data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
Artificial Intelligence in Libraries: From Machine Learning to Large Language Models
🔥 引用:
0
Abstract: The processes of gathering, organising, and disseminating information have changed with the technological shift and the introduction of new technologies such as artificial intelligence. The present study examines the ability of Artificial Intelligence and its subfields, such as Machine Learning (ML), Deep Learning (DL), and Large Language Models (LLMs), to reshape core library operations like cataloguing, reference services, research support, and digital preservation. The main argument of the article concerns the ethical, inclusive, and policy-compliant use of AI technology within libraries. The article also acknowledges the increasing role of librarians as curators of AI tools and highlights the need for a plan-of-action framework that aligns the deployment of AI with national educational and digital objectives. In the future, with the help of AI, librarians will be positioned to develop intelligent knowledge systems, thereby promoting access and sustainable learning opportunities in the digital age.
LLM-Guided Cross-Platform Optimization of Cloud Analytics Workloads
🔥 引用:
0
Abstract: Large-scale data analytics in the cloud inevitably involves trade-offs among latency, throughput, scalability, elasticity, and cost. Today's platforms model these trade-offs in very different ways: Amazon EMR builds on managed Hadoop ecosystems, Spark on Kubernetes provides container-native distributed execution, and Snowflake offers a fully managed data warehousing model. Although prior benchmarks, often based on TPC-DS, TPC-H, or microbenchmarks, have studied these systems, they are typically evaluated in isolation and rely on static configurations, manual tuning, or simplified cost assumptions. As a result, it remains unclear how these platforms compare under realistic, evolving cloud workloads, or how their performance and cost can be jointly optimized in dynamic environments. To bridge this gap, we introduce LLM-TradeOpt, a Large Language Model (LLM)-guided optimization framework that adaptively reasons about workload characteristics, system configurations, and execution traces across heterogeneous analytics platforms. Using CloudSuite v4.0 analytics workloads, our evaluation shows that LLM-TradeOpt consistently improves performance and efficiency, achieving up to 18.7% lower latency, 22.4% higher throughput, and 15.3% cost savings compared to strong baselines on Amazon EMR, Apache Spark on Kubernetes, and Snowflake.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
🔥 引用:
0
Abstract: Large language models (LLMs) operate in two fundamental learning modes - fine-tuning (FT) and in-context learning (ICL) - raising key questions about which mode yields greater language proficiency and whether they differ in their inductive biases. Prior studies comparing FT and ICL have yielded mixed and inconclusive results due to inconsistent experimental setups. To enable a rigorous comparison, we propose a formal language learning task - offering precise language boundaries, controlled string sampling, and no data contamination - and introduce a discriminative test for language proficiency, where an LLM succeeds if it assigns higher generation probability to in-language strings than to out-of-language strings. Empirically, we find that: (a) FT has greater language proficiency than ICL on in-distribution generalization, but both perform equally well on out-of-distribution generalization. (b) Their inductive biases, measured by the correlation in string generation probabilities, are similar when both modes partially learn the language but diverge at higher proficiency levels. (c) Unlike FT, ICL performance differs substantially across models of varying sizes and families and is sensitive to the token vocabulary of the language. Thus, our work demonstrates the promise of formal languages as a controlled testbed for evaluating LLMs, behaviors that are difficult to isolate in natural language datasets. Our source code is available at https://github.com/bishwamittra/formallm.
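The discriminative test this abstract describes (the model passes when it assigns in-language strings higher generation probability than out-of-language strings) can be sketched in a few lines; `score` below stands in for any model's log-probability function, and the toy scorer is an assumption for illustration:

```python
def discriminative_accuracy(pairs, score):
    # The model "passes" a pair when the in-language string receives a
    # strictly higher score than its paired out-of-language string.
    wins = sum(score(inside) > score(outside) for inside, outside in pairs)
    return wins / len(pairs)

# Toy check: a scorer that penalizes characters outside the alphabet {a, b}
# perfectly separates strings over {a, b} from strings containing other symbols.
toy_score = lambda s: -sum(ch not in "ab" for ch in s)
pairs = [("abab", "abcx"), ("ba", "zz"), ("aabb", "a9b9")]
acc = discriminative_accuracy(pairs, toy_score)  # 1.0
```

Because the test only compares paired probabilities, it needs no threshold calibration, which is what makes it usable across models of different sizes and families.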
ArgRE: Formal Argumentation for Conflict Resolution in Multi-Agent Requirements Negotiation
🔥 Citations:
0
Abstract: As software systems grow in complexity, they must satisfy an increasing number of competing quality attributes, making it essential to balance them in a principled manner; for example, a safety requirement for sensor-fusion verification may conflict with a tight planning-cycle budget. Multi-agent large language model frameworks support this balancing process by assigning specialized agents to different objectives. However, their conflict resolution is typically heuristic. Requirements are aggregated implicitly without explicit acceptance or rejection, limiting auditability in regulated domains. We present ArgRE, a multi-agent requirements negotiation system that embeds Dung-style abstract argumentation into the negotiation stage. Each proposal, critique, and refinement is modeled as an argument, conflicts are represented as directed attack relations, and the accepted set of arguments is computed under grounded and preferred semantics. The pipeline further integrates KAOS goal modeling, multi-layer verification, and standards-oriented artifact generation. Evaluation across five case studies spanning safety-critical, financial, and information-system domains shows that ArgRE provides argument-level traceability absent from existing frameworks. Independent evaluators rated its decision justifications significantly higher than those of heuristic synthesis (4.32 vs. 3.07, p<0.001), indicating improved auditability, while semantic intent preservation remains comparable (94.9% BERTScore F1) and compliance coverage reaches 84.7% versus 47.6–47.8% for baselines. Structural analysis further confirms that the default pairwise protocol yields acyclic graphs in which grounded and preferred semantics coincide, whereas cross-pair arbitration introduces controlled cyclicity, leading to predictable divergence between the two semantics.
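The grounded semantics mentioned above has a simple fixpoint characterization: repeatedly accept every argument all of whose attackers are already defeated by accepted arguments. A minimal sketch (the representation as id sets and attack pairs is an assumption for illustration, not ArgRE's internal data model):

```python
def grounded_extension(args, attacks):
    """Grounded extension of a Dung argumentation framework.

    args: set of argument ids; attacks: set of (attacker, target) pairs.
    Iterate the characteristic function: accept every argument all of whose
    attackers are defeated (attacked by an accepted argument), to a fixpoint.
    """
    accepted, changed = set(), True
    while changed:
        changed = False
        defeated = {t for (a, t) in attacks if a in accepted}
        for x in args - accepted:
            attackers = {a for (a, t) in attacks if t == x}
            if attackers <= defeated:  # all attackers already defeated
                accepted.add(x)
                changed = True
    return accepted

# Example: A attacks B, B attacks C  ->  grounded extension is {A, C}.
ext = grounded_extension({"A", "B", "C"}, {("A", "B"), ("B", "C")})
```

On acyclic graphs like this one, the grounded and preferred extensions coincide, which matches the structural observation in the abstract.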
How can AI be compatible with evidence-based medicine?: with an example of analysis of lung cancer recurrence
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Layer Embedding Deep Fusion Graph Neural Network
🔥 Citations:
0
Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance in learning representations from graph-structured data. However, their message-passing mechanism inherently relies on the assumption of label consistency among connected nodes, limiting their applicability to low-homophily settings. Moreover, since message passing operates as a hierarchical diffusion process, GNNs face challenges in capturing long-range dependencies. As network depth increases, the structural noise along heterophilic edges tends to be amplified, resulting in over-smoothing. This issue becomes especially prominent in highly heterophilic graphs, where the propagation of inconsistent semantics across the topology continually exacerbates misaggregation. To address this issue, we propose a novel framework named Layer Embedding Deep Fusion Graph Neural Network (LEDF-GNN). Specifically, we design a Layer Embedding Deep Fusion (LEDF) operator that nonlinearly fuses multi-layer embeddings to capture inter-layer dependencies and effectively alleviate deep propagation degradation. Meanwhile, to mitigate structural heterophily, LEDF-GNN employs a Dual-Topology Parallel Strategy (DTPS) that simultaneously leverages the original and reconstructed topologies, allowing for adaptive structure-semantics co-optimization under diverse homophily conditions. Extensive semi-supervised classification experiments on the citation and image benchmarks demonstrate that, under both homophilic and heterophilic settings, LEDF-GNN consistently outperforms state-of-the-art baselines, validating its effectiveness and generalization capability across diverse graph types.
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
🔥 Citations:
0
Abstract: Autonomous Large Language Model (LLM) agents are increasingly deployed to conduct complex tasks by interacting with external tools, APIs, and memory stores. However, processing untrusted external data exposes these agents to severe security threats, such as indirect prompt injection and unauthorized tool execution. Securing these systems requires effective information flow tracking. Yet, traditional taint analysis that is designed for program memory states fundamentally fails when applied to LLMs, where data propagation is governed by probabilistic natural language reasoning. In this paper, we present NeuroTaint, the first comprehensive taint tracking framework tailored for the unique information flow characteristics of LLM agents. Our key insight is that taint propagation in LLM agents must be understood not only as explicit content transfer, but also as semantic transformation, causal influence on decisions, and cross-session persistence through memory. NeuroTaint therefore audits execution traces offline to reconstruct provenance from untrusted sources to privileged sinks using semantic evidence, causal reasoning, and persistent context tracking, rather than relying on exact string matches or pre-defined source-sink paths alone. Extensive evaluation using TaintBench, our 400-scenario benchmark spanning 20 real-world agent frameworks, shows that NeuroTaint substantially outperforms FIDES, an information-flow-control (IFC)-style baseline for LLM agents, in source-sink propagation detection. We further show that NeuroTaint remains effective on established agent-security benchmarks, including InjecAgent and ToolEmu, while operating offline with modest additional auditing cost.
In silico screening, ADMET analysis, MD simulations, and MM/PBSA binding free energy identify new inhibitor molecules for viroporin E
🔥 Citations:
1
Abstract: No abstract available; see the original article.
h-MINT: Modeling Pocket-Ligand Binding with Hierarchical Molecular Interaction Network
🔥 Citations:
0
Abstract: Accurate molecular representations are critical for drug discovery, and a central challenge lies in capturing the chemical environment of molecular fragments, as key interactions, such as H-bonds and π-stacking, occur only under specific local conditions. Most existing approaches represent molecules as atom-level graphs; however, atom-level representations can hardly express higher-order chemical context (e.g., stereochemistry, lone pairs, conjugation). Fragment-based methods (e.g., principal subgraph, predefined functional groups) fail to preserve essential information such as chirality, aromaticity, and ionic states. This work addresses these limitations from two aspects. (i) OverlapBPE tokenization. We propose a novel data-driven molecule tokenization method. Unlike existing approaches, our method allows overlapping fragments, reflecting the inherently fuzzy boundaries of small-molecule substructures and, together with enriched chemical information at the token level, thereby preserving a more complete chemical context. (ii) h-MINT model. OverlapBPE induces many-to-many atom-fragment mappings, which necessitate a new hierarchical architecture. We therefore develop a hierarchical molecular interaction network capable of jointly modeling interactions at both atom and fragment levels. By supporting fragment overlaps, the model naturally accommodates the many-to-many atom-fragment mappings introduced by the OverlapBPE scheme. Extensive evaluation against state-of-the-art methods shows our method improves binding affinity prediction by 2-4% Pearson/Spearman correlation on PDBBind and LBA, enhances virtual screening by 1-3% in key metrics on DUD-E and LIT-PCBA, and achieves the best overall HTS performance on PubChem assays. Further analysis demonstrates that our method effectively captures interactive information while maintaining good generalization.
When the Agent Is the Adversary: Architectural Requirements for Agentic AI Containment After the April 2026 Frontier Model Escape
🔥 Citations:
0
Abstract: The April 2026 disclosure that a frontier large language model escaped its security sandbox, executed unauthorized actions, and concealed its modifications to version control history demonstrates that agentic AI systems with autonomous tool access can circumvent the containment mechanisms designed to constrain them. This paper analyzes four categories of current containment approaches - alignment training, environmental sandboxing, application-level tool-call interception, and accessible audit systems - and identifies the failure modes each exhibits when the AI agent is treated as a potential adversary rather than a trusted component receiving adversarial inputs. We categorize five behavioral incidents from the public disclosure and situate them within 698 real-world AI scheming incidents documented by the Centre for Long-Term Resilience between October 2025 and March 2026, a 4.9x acceleration establishing the challenge as systemic. We derive five architectural requirements: trust separation through layered OS privilege enforcement with semantic intent analysis, sequential intent inference through five-phase taxonomic monitoring, independent containment integrity monitoring, adversarial audit isolation through logical invisibility, and emergent capability envelope enforcement through distributional divergence monitoring. No publicly described system satisfies all five. We argue that architectural containment is the only durable safety strategy given the inevitable proliferation of equivalent capabilities including open-weight models. The author's published patent portfolio in provider-independent constraint enforcement addresses several of these requirements. Concurrent work including SandboxEscapeBench (arXiv:2603.02277) independently confirms that frontier models can escape standard container sandboxes, corroborating the threat model presented here.
AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code
🔥 Citations:
0
Abstract: Deep learning malware detectors achieve high classification accuracy but suffer from severe interpretability limitations, typically returning probabilistic verdicts that lack forensic context. We introduce AsmRAG, a framework performing malware analysis through Assembly-Level Retrieval-Augmented Generation. Unlike classifiers built on global statistical features, AsmRAG reformulates detection as an evidence-based retrieval task. The system uses a code-specialized Large Language Model (LLM) to analyze assembly functions and convert them into semantic embeddings. This process constructs a searchable knowledge base resilient to syntactic obfuscation. For inference, we propose a Density-Weighted Anchor Selection mechanism that isolates the primary unit of malicious logic within a binary to extract verifiable forensic evidence and resist evasion attempts. Testing on a curated dataset of 40k binaries shows AsmRAG reaching a detection F1-score of 96% alongside a family attribution F1-score of 95%. Comparisons confirm this semantic retrieval approach remains robust against metamorphic obfuscation. When holistic baselines (EMBER and ResNeXt) degrade, our methodology gives Security Operations Centers a transparent and reliable alternative.
Discovering Agentic Safety Specifications from 1-Bit Danger Signals
🔥 Citations:
0
Abstract: Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^*$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^*$. EPO-Safe discovers safe behavior within 1-2 rounds (5-15 episodes), producing human-readable specifications with correct explanatory hypotheses about hazards (e.g.,"X cells are directionally hazardous: entering from the north is dangerous"). Critically, we show that standard reward-driven reflection actively degrades safety: agents reflecting on reward alone use the loop to justify and accelerate reward hacking, proving that reflection must be paired with a dedicated safety channel to discover hidden constraints. We further evaluate robustness to noisy oracles: even when 50% of non-dangerous steps produce spurious warnings, mean safety performance degrades by only 15% on average, though sensitivity is environment-dependent, as cross-episode reflection naturally filters inconsistent signals. Each evolved specification functions as an auditable set of grounded behavioral rules discovered autonomously through interaction, rather than authored by humans as in Constitutional AI (Bai et al., 2022).
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
🔥 Citations:
0
Abstract: Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
AI Safety Training Can be Clinically Harmful
🔥 Citations:
0
Abstract: Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning failure: RLHF safety alignment disrupts the therapeutic mechanism of action by grounding patients during imaginal exposure, offering false reassurance, inserting crisis resources into controlled exercises, and refusing to challenge distorted cognitions mentioning self-harm in PE; and through task abandonment or safety-preamble insertion during CBT cognitive restructuring. These findings motivate a five-axis evaluation framework (protocol fidelity, hallucination risk, behavioral consistency, crisis safety, demographic robustness), mapped onto FDA SaMD and EU AI Act requirements. We argue that no AI mental health system should proceed to deployment without passing multi-axis evaluation across all five dimensions.
AI SaaS Website Builder: An Intelligent Platform for Instant Website Generation
DOI:
10.55041/ijsrem61144
🔥 Citations:
0
Abstract: Imagine how different the process of building a website was just a few years ago. You would hire a professional web developer, wait for weeks or even months, spend thousands of dollars, or alternatively, struggle with complicated drag-and-drop builders that still required significant technical understanding. These traditional approaches, though functional, created substantial barriers for small businesses, entrepreneurs, and individuals who simply wanted an online presence.
In this paper, we walk readers through our approach to an AI-powered SaaS website builder: how we designed, constructed, and tested it, and how it can serve users seeking instant website generation. The platform allows users to describe their desired website in plain language and receive a fully functional, deployed website within minutes. Our features include AI-driven code generation, one-click deployment, credit-based monetization, secure authentication, and smooth user interfaces. The most important contributions are an adaptable architecture supporting prompt-to-website conversion, an interface that performs well across devices, and an evaluation of whether the platform works based on user reactions. What we discovered was fairly promising: users saved tremendous development time, non-technical individuals could create professional websites, and the credit-based system proved to be a sustainable monetization model. The AI generation approach proved to be a win-win, as users receive instant results when they need them and the platform sustains its operations through subscriptions and credit purchases. This study contributes to the existing knowledge about AI-powered development tools and provides practical guidance to anyone developing this type of intelligent website generation platform.
Keywords: AI Website Builder, SaaS Platform, Automated Code Generation, Large Language Models, MERN Stack, One-Click Deployment
Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
🔥 Citations:
0
Abstract: Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, features in layers 16-25 were relatively more vulnerable to steering. All three methods confirmed that mid-to-late-layer feature subgroups are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups in mid-to-late layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.
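The feature amplification used in the third stage amounts to adding a scaled feature direction to a hidden state at a chosen layer. A minimal sketch of that arithmetic (the vectors and coefficient below are illustrative; in the paper this would be applied to SAE decoder directions in the residual stream of Gemma-2-2B, typically via a forward hook):

```python
def steer_hidden_state(h, feature_dir, alpha):
    """Amplify one feature direction in a hidden state: h' = h + alpha * d.

    h, feature_dir: equal-length vectors (lists of floats);
    alpha: steering coefficient controlling amplification strength.
    """
    assert len(h) == len(feature_dir)
    return [hi + alpha * di for hi, di in zip(h, feature_dir)]

h = [0.5, -1.0, 2.0]          # toy hidden state
d = [1.0, 0.0, -1.0]          # hypothetical decoder direction of one SAE feature
steered = steer_hidden_state(h, d, alpha=2.0)
```

Measuring how the model's output harmfulness changes as `alpha` varies, layer by layer, is what localizes the vulnerable subgroups.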
Evaluating Large Language Models on Computer Science University Exams in Data Structures
🔥 Citations:
0
Abstract: We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI's GPT-4o and Anthropic's Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
🔥 Citations:
0
Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments where tokens are not guaranteed to be routed to local experts, resulting in significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-230B-A22B, on various datasets and collected over 100k real expert activation traces. Upon studying the expert activation patterns, we uncover various persistent properties across all the frontier MoE models: variable expert load imbalance, domain-specific expert activation where expert popularity shifts across task families (code, math, chat, general), and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy to maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations help reduce all-to-all communication data by up to 20, resulting in lower MoE decode latency and better accelerator utilization.
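One simple way to act on measured expert activation frequencies is a greedy placement that assigns the hottest experts first, always to the least-loaded node. This is a hedged sketch of the general idea only, under an assumed counts-per-expert representation, not the paper's exact locality-maximizing strategy:

```python
def place_experts(counts, num_nodes):
    """Greedy load-balancing placement from expert activation counts.

    Assign each expert (hottest first) to the node with the smallest
    accumulated activation load, evening out per-node expert traffic.
    """
    loads = [0] * num_nodes
    placement = {}
    for expert, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        node = loads.index(min(loads))  # least-loaded node so far
        placement[expert] = node
        loads[node] += c
    return placement, loads

# Hypothetical activation counts harvested from profiling traces.
counts = {"e0": 50, "e1": 30, "e2": 20, "e3": 10}
placement, loads = place_experts(counts, num_nodes=2)
```

Balancing load is only half the story; co-locating experts that frequently fire for the same micro-batch is what actually cuts all-to-all traffic, which is where the workload-aware grouping comes in.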
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
🔥 Citations:
0
Abstract: Have you ever post-trained a generalist vision-language-action (VLA) policy on a small demonstration dataset, only to find that it stops responding to new instructions and is limited to behaviors observed during post-training? We identify this phenomenon as lock-in: after low-data, supervised fine-tuning (SFT), the policy becomes overly specialized to the post-training data and fails to generalize to novel instructions, manifesting as concept lock-in (fixation on training objects/attributes) and spatial lock-in (fixation on training spatial targets). Many existing remedies introduce additional supervision signals, such as those derived from foundation models or auxiliary objectives, or rely on augmented datasets to recover generalization. In this paper, we show that the policy's internal pre-trained knowledge is sufficient: DeLock mitigates lock-in by preserving visual grounding during post-training and applying test-time contrastive prompt guidance to steer the policy's denoising dynamics according to novel instructions. Across eight simulation and real-world evaluations, DeLock consistently outperforms strong baselines and matches or exceeds the performance of a state-of-the-art generalist policy post-trained with substantially more curated demonstrations.
Proteus: Shapeshifting Desktop Visualizations for Mobile via Multi-level Intelligent Adaptation
🔥 Citations:
0
Abstract: With the rise of mobile-first consumption, users increasingly engage with data visualizations on mobile devices. However, the vast majority of existing visualizations are originally authored for desktop environments. Due to significant differences in viewport size and interaction paradigms, directly scaling desktop charts often results in illegible text, information loss, and interaction failures. To bridge this gap, we propose an automated framework to adapt desktop-based visualizations for mobile screens. By systematically categorizing the operations involved in the adaptation process, we establish a multi-level design space. This space defines evolution rules spanning from the global topology level, through the reference frame level, down to the visual elements level. Guided by this theoretical framework, we developed Proteus, a large language model-driven multi-agent system that automatically parses online visualizations, predicts optimal transformation strategies within the design space, and generates equivalent, highly readable visualizations for mobile devices. Case studies and an in-depth user study with 12 participants demonstrate the effectiveness and usability of Proteus.
A satellite foundation model for improved wealth monitoring
🔥 Citations:
0
Abstract: Poverty statistics guide social policy, but in many low- and middle-income countries, censuses and household surveys that collect these data are costly, infrequent, quickly outdated, and sometimes error-prone. Satellite imagery offers global coverage and the possibility of predicting economic livelihoods at scale, yet existing approaches to predicting livelihoods with imagery or other non-traditional data often fail to reliably identify local-level variation and, as we show, degrade under temporal shift. Here we introduce Tempov, a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. The model enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes, Tempov achieves competitive accuracy with only 10% of survey samples, indicating substantially reduced dependence on expensive label collection. The model further generalizes across populous countries within and outside Africa, and scales to a unified Africa-wide model with strong continent-level performance ($R^2=0.63$, $r^2=0.68$), from which we generate high-resolution decadal maps of wealth and wealth changes for the African continent. Analysis of these maps shows large variation in recent economic performance both within and across countries. Our open-source approach provides a pathway to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.
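The reported $R^2$ is the ordinary coefficient of determination between predicted and surveyed wealth. For reference, a minimal sketch of its computation (toy numbers, not the paper's data):

```python
def r_squared(y, yhat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

r2 = r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```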
AI Engineering: Building Applications with Foundation Models by Chip Huyen
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Automated Brain Tumor Segmentation via YOLOv8-Derived Spatial Prompts for the Segment Anything Model
DOI:
10.66279/vv8n5p41
🔥 Citations:
0
Abstract: Brain tumor segmentation from magnetic resonance imaging (MRI) is a critical step in the diagnosis and treatment planning of intracranial malignancies. Although supervised convolutional networks achieve strong benchmark performance within their training distribution, they exhibit limited transferability across acquisition protocols. Conversely, foundation models such as the Segment Anything Model (SAM) encode rich visual representations but produce unreliable masks in the absence of accurate spatial guidance. The present work introduces a fully automated, end-to-end pipeline that couples YOLOv8 object detection with SAM-based segmentation without modifying the parameters of either network. A lightweight preprocessing stage comprising skull stripping and Contrast Limited Adaptive Histogram Equalization (CLAHE) conditions each MRI slice; the resulting image is forwarded to a trained YOLOv8 detector whose highest-confidence bounding box is passed directly to SAM's prompt encoder as the sole spatial cue. Evaluation on 1,226 held-out images from the publicly available Cheng et al. benchmark, partitioned by patient identity to prevent data leakage, yields a mean Dice Similarity Coefficient (DSC) of 0.8153 ± 0.032 and a mean Intersection over Union (IoU) of 0.7136 ± 0.028, with a total inference latency of 473.76 ms per image on an NVIDIA T4 GPU. An ablation study confirms that each pipeline stage contributes positively to segmentation performance. YOLOv8 detection achieves a mean Average Precision (mAP@0.5) of 0.91, precision of 0.88, and recall of 0.86. The results demonstrate that high-quality, automatically generated spatial prompts can substitute for costly parameter adaptation of general-purpose foundation models in specialized medical imaging tasks.
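The DSC and IoU figures above are computed from predicted versus ground-truth binary masks. A minimal sketch over flattened masks (the toy masks are illustrative):

```python
def dice_iou(mask_a, mask_b):
    """Dice = 2|A∩B| / (|A| + |B|);  IoU = |A∩B| / |A∪B| for binary masks."""
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    size_a, size_b = sum(mask_a), sum(mask_b)
    union = size_a + size_b - inter
    dice = 2 * inter / (size_a + size_b)
    iou = inter / union
    return dice, iou

a = [1, 1, 1, 0, 0]   # toy predicted mask, flattened
b = [0, 1, 1, 1, 0]   # toy ground-truth mask, flattened
dice, iou = dice_iou(a, b)
```

Note that Dice is always at least as large as IoU for the same pair of masks, which is why the reported DSC (0.8153) exceeds the reported IoU (0.7136).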
ToGCN-LLM: Tri-optimization Graph Convolutional Network with LLM for Group Activity Recognition
DOI:
10.1145/3811826
🔥 Citations:
0
Abstract: Graph networks face substantial challenges in handling large-scale graph-structured data. Although graph convolutional networks (GCNs) have been widely applied to group activity recognition, they still struggle with high graph structure complexity and the semantic gap, especially in complex scenarios. To address these limitations, we harness large language models (LLMs) and multimodal learning to enhance semantic understanding as well as bridge the gap between graph structures and contextual meanings. For the simultaneous optimization of model efficiency and recognition accuracy in group activities, we propose a tri-optimization graph convolutional network with LLM (ToGCN-LLM) from the perspective of graph structure knowledge distillation. (1) To tackle the high complexity of the teacher network in GCN, we adopt a sparsification strategy to prune irrelevant edges and nodes, reducing computational overhead and enhancing training efficiency. (2) To mitigate information redundancy during knowledge distillation, we design a hierarchical optimization module combined with a hierarchical sampling mechanism, exploiting graph hierarchical structure and adjacency relationships to improve knowledge transfer efficiency. (3) Considering the student network’s varying learning performance across training stages and the limitations of fixed learning strategies, we introduce a dynamic adaptive weight decay mechanism to achieve fine-grained convergence under different gradient updates, thereby boosting overall recognition accuracy. (4) We use the Qwen LLM to extract text description tokens, which are fused with the student GCN’s last-layer features to enable multimodal, multi-optimization learning for group activity recognition. Six experiments on the CAD, CAED and BJUT-GAD datasets demonstrate that our ToGCN-LLM achieves competitive MPCA scores of 94.89%, 93.61% and 95.77%, respectively.
Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening
🔥 Citations:
0
Abstract: Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but first-line screening is limited by cost, workflow burden, and specialist availability. We evaluated whether open pretrained electrocardiogram (ECG) foundation models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. Six echocardiography-derived abnormalities were targeted: reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then applied in-domain self-supervised adaptation of an ECG foundation model (ECG-FM) on EchoNext waveforms followed by selective supervised fine-tuning, and evaluated trade-offs between discrimination and adaptation cost. Adapted ECG-FM models achieved the best overall performance: peak macro-AUROC 0.8509 and macro-AUPRC 0.4297, while a parameter-efficient operating point preserved AUROC (0.8501) and attained the highest fixed-threshold macro-F1 of 0.3691. Late fusion with covariates did not improve threshold-independent discrimination, and the evaluated LoRA, alternative-backbone, and mixture-of-foundations strategies did not surpass the best adapted single-backbone models. These results indicate that for ECG-based case finding and echocardiography triage, combining target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone is the most effective transfer strategy.
Features of building a neural network based on MobileNetV2 models in unmanned aerial vehicle detection tasks
🔥 引用:
0
Abstract: This article delineates the development and calibration of neural network models based on the MobileNetV2-SSD (Single Shot MultiBox Detector) and MobileNetV2-SSD FPN-Lite (Feature Pyramid Network) architectures for detecting unmanned aerial vehicles (UAVs). The author conducts a systematic investigation of the structural characteristics of the supplementary SSD and FPN layers incorporated into these models, aiming to achieve robust detection performance across a range of drone sizes, as quantified by accuracy and recall metrics. The research confirms the documented limitations of existing architectures in detecting small-scale objects in image datasets. By integrating additional layers that take feature representations from the early layers of the base network and generate composite feature maps for detection, the study demonstrates a substantial gain in mean Average Precision (mAP) of 20% and an improvement in F1-score balance of 38% relative to the base model. Additionally, the investigation provides a comprehensive set of recommendations for hyperparameter optimization to maximize model performance, along with practical guidance on training-dataset composition and preparation protocols tailored to the characteristics of the input imagery. Notably, the research also revealed an unexpected improvement in model performance after quantization for deployment on edge computing devices, a phenomenon potentially explained by regularization effects inherent to the compilation and quantization pipeline.
Zero-Day Hunter: A Multi-Layered Machine Learning Framework for Real-Time Detection and Mitigation of Zero-Day Cyber Attacks
🔥 引用:
0
Abstract: Zero-day attacks take advantage of unknown vulnerabilities to evade traditional signature-based detection systems. We introduce a novel multi-layered machine learning (ML) framework, which we call zeroHunter, that integrates cross-layer detection techniques with adversarial hardening. The detection layers are the network layer, which uses a spatio-temporal Graph Neural Network (GNN) to analyze traffic graphs; the host layer, which uses bio-optimized LightGBM with SHAP for explainability; and the memory layer, which uses a contractive autoencoder and a CNN for forensic analysis. Adversarial hardening is implemented with feature-space randomization and defensive distillation. The models are updated in real time via an online learning module (River ML), forming a closed-loop mitigation system. The models were trained on hybrid datasets (CIC-IDS2023, IoT-23, and synthetic zero-day data generated with a WGAN). Evaluations used MITRE CALDERA simulation of the zero-day detection rate, CLEVER robustness scoring, and mitigation-latency benchmarks. The results show an 86.4% zero-day detection rate, higher than Elastic EDR (Endpoint Detection and Response); a CLEVER score of 0.76; and a mitigation time under 500 ms, which is 26-38% faster than commercial EDRs. These results show that zeroHunter, leveraging ML, can effectively detect unknown threats and respond in real time.
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
🔥 引用:
0
Abstract: Chain-of-Thought (CoT) prompting has emerged as a simple and effective way to elicit step-by-step solutions from large language models (LLMs). However, CoT reasoning can be unstable across runs on long, multi-step problems, leading to inconsistent answers for an unchanged task. Most prior work focuses on improving the forward reasoning chain within a single pass, with less attention to iterative and contrastive correction. To address this gap, we propose CAP-CoT, a Cycle Adversarial Prompt optimization framework designed to improve both the CoT reasoning accuracy and the stability of a single deployed solver. In each cycle, a forward solver generates candidate reasoning chains, an adversarial challenger constructs plausible but deliberately flawed chains using targeted error strategies, and a feedback agent contrasts the two chains and produces step-aligned structured feedback. This feedback closes the optimization loop in two directions: updating the solver prompt based on errors exposed by the challenger, and updating the challenger prompt to generate increasingly targeted errors in subsequent cycles. Unlike safety-oriented adversarial prompting such as jailbreak or prompt-injection attacks, our adversarial component is task-semantic and aims to expose logical vulnerabilities in reasoning chains. Experiments across six benchmarks and four LLM backbones demonstrate that within two to three adversarial prompt optimization cycles, CAP-CoT consistently reduces variability across runs while improving reasoning accuracy and robustness to prompt perturbations.
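The cycle described above can be sketched as a control loop. Each "agent" below is a deterministic stub standing in for an LLM call; only the two-directional prompt updates, the part the abstract actually specifies, are modeled, and all function names and string formats are illustrative assumptions:

```python
# Schematic of one CAP-CoT optimization cycle with stand-in agents.

def solver(prompt, task):
    # Stand-in for the forward solver's candidate reasoning chain.
    return f"chain[{prompt}]({task})"

def challenger(prompt, task):
    # Stand-in for the adversarial challenger's deliberately flawed chain.
    return f"flawed[{prompt}]({task})"

def feedback(good, bad):
    # Stand-in for the feedback agent's step-aligned structured contrast.
    return {"solver_hint": "avoid:" + bad, "challenger_hint": "target:" + good}

def cap_cot_cycle(solver_prompt, challenger_prompt, task):
    chain = solver(solver_prompt, task)
    adversarial = challenger(challenger_prompt, task)
    fb = feedback(chain, adversarial)
    # Close the loop in both directions, as the abstract describes.
    new_solver_prompt = solver_prompt + "|" + fb["solver_hint"]
    new_challenger_prompt = challenger_prompt + "|" + fb["challenger_hint"]
    return new_solver_prompt, new_challenger_prompt

print(cap_cot_cycle("S0", "C0", "2+2"))
```

Running two or three such cycles accumulates increasingly targeted hints in both prompts, which is the mechanism the reported stability gains rest on.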
An In Silico Study of Coumarin Derivatives from Daphne mezereum as EGFR Inhibitors in Lung Cancer
🔥 引用:
0
Abstract: Lung cancer is one of the leading causes of cancer-related death worldwide and is often associated with overexpression of the Epidermal Growth Factor Receptor (EGFR), making the development of EGFR inhibitors an important therapeutic strategy. This study aims to evaluate the potential of coumarin derivatives from Daphne mezereum as EGFR inhibitor candidates in lung cancer therapy using an in silico approach. The study used computational methods, including molecular docking to analyze binding affinity and ligand–protein interactions, drug-likeness evaluation based on Lipinski's rule, and ADMET analysis to predict pharmacokinetic properties and toxicity. The results show that all compounds had RMSD values ≤ 2 Å, with a re-docking value of 1.1614 Å, indicating the validity of the method. Compound 2 showed the highest binding affinity and optimal interaction with the Met769 residue of EGFR, which plays a role in lung cancer cell proliferation, but it did not meet Lipinski's criteria and had a less favorable ADMET profile. Conversely, compound 3 (umbelliferone) showed a balance between affinity, drug-likeness, and a good ADMET profile, including high absorption and non-toxic properties. The study concludes that compound 3 is the more promising lung-cancer drug candidate, while compound 2 (7-hydroxycoumarin-5,8-di-β-D-glucopyranoside) has potential as a lead compound for further optimization. These findings contribute to the development of EGFR-targeted therapy based on coumarin derivatives and underscore the value of computational approaches in efficient and rational drug design.
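The drug-likeness screen mentioned above rests on Lipinski's Rule of Five, which can be checked mechanically once molecular properties are known. The helper below is a generic sketch (property values would normally come from a cheminformatics tool such as RDKit); the example numbers are approximate literature values for umbelliferone, not figures taken from this study:

```python
# Lipinski's Rule of Five: MW <= 500, logP <= 5,
# H-bond donors <= 5, H-bond acceptors <= 10.

def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a molecule violates."""
    rules = [mw > 500, logp > 5, h_donors > 5, h_acceptors > 10]
    return sum(rules)

def passes_lipinski(mw, logp, h_donors, h_acceptors):
    # Conventionally, at most one violation is tolerated.
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= 1

# Approximate values for umbelliferone (7-hydroxycoumarin, C9H6O3):
print(passes_lipinski(mw=162.1, logp=1.6, h_donors=1, h_acceptors=3))
```

A bulky glycoside like compound 2, with a much higher molecular weight and more H-bond donors/acceptors, would fail this check, matching the abstract's finding.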
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
🔥 引用:
0
Abstract: Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain-specific insurance contracts to investigate these questions. We first establish a single-agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi-agent architecture that decomposes ontology construction into four artifact-driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance on architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency-question-driven SPARQL evaluation, with complementary retrieval-augmented-generation-based assessment). Results show that the multi-agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front-loaded planning. These findings highlight planning-first, artifact-driven generation as a promising and more auditable path toward scalable automated ontology engineering.
BronchOpt: vision-based pose optimization with fine-tuned foundation models for accurate bronchoscopy navigation
🔥 引用:
0
Abstract: No abstract available; see the original article.
GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training
🔥 引用:
0
Abstract: Distributed GNN training is dominated by remote feature fetching, which can be very costly. Multi-hop neighborhood sampling crosses partition boundaries and triggers fine-grained RPCs whose fixed initiation cost and GPU-stall latency waste energy. Prior systems try to reduce this overhead with presampling and static caching, but cache policies cannot react to runtime network variation. We show that under time-varying congestion, static caching can increase energy by up to 45% because a fixed rebuild schedule is insufficient. We present GreenDyGNN, which formulates cache window management as a sequential decision problem. GreenDyGNN performs intra-epoch cache rebuilds and uses a Double-DQN agent, trained in a calibrated simulator with domain-randomized congestion, to adapt rebuild window size and per-owner cache allocation at each boundary. An asynchronous double-buffered pipeline makes adaptation effectively free. Under congestion, GreenDyGNN cuts total energy by up to 43% over Default DGL and 4-24% over the best static policy, while closely matching the optimum under clean conditions.
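The runtime decision GreenDyGNN makes at each rebuild boundary can be illustrated with a tabular Q-learning stand-in. The paper uses a Double-DQN agent; the discrete window sizes, the single Q-table, and the reward signal below are simplifications for illustration only:

```python
# Toy sketch: pick a cache-rebuild window size from the observed
# congestion state, and update value estimates from an energy-based
# reward. (GreenDyGNN's real agent is a Double-DQN trained in a
# calibrated simulator; this tabular version only shows the loop.)

import random

WINDOWS = [64, 128, 256]  # candidate rebuild-window sizes (assumed)

def choose(q, state, eps=0.1, rng=random.Random(0)):
    """Epsilon-greedy choice of window size for the current state."""
    if rng.random() < eps:
        return rng.choice(WINDOWS)
    return max(WINDOWS, key=lambda w: q.get((state, w), 0.0))

def update(q, state, window, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard one-step Q-learning update on the (state, window) entry."""
    best_next = max(q.get((next_state, w), 0.0) for w in WINDOWS)
    old = q.get((state, window), 0.0)
    q[(state, window)] = old + alpha * (reward + gamma * best_next - old)

q = {}
update(q, "congested", 64, -1.0, "congested")  # negative reward = energy cost
print(q)
```

The point of the abstract is precisely that this choice must be made per boundary at runtime: a fixed rebuild schedule cannot track time-varying congestion.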
VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs
🔥 引用:
0
Abstract: Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is essential for assessing whether diagnostic reasoning is reliable and clinically grounded. However, debugging medical LLMs remains difficult. First, developers often lack sufficient medical domain expertise to interpret model errors in clinically meaningful terms. Second, models can fail across a large and diverse set of instances involving different input types, tasks, and reasoning steps, making it challenging for developers to prioritize which errors deserve focused inspection. Third, developers struggle to identify recurring error patterns across cases, as existing debugging practices are largely instance-centric and rely on manual inspection of isolated failures. To address these challenges, we present VeriLLMed, a visual analytics system that integrates external biomedical knowledge to audit and debug medical LLM diagnostic reasoning. VeriLLMed transforms model outputs into comparable reasoning paths, constructs knowledge graph-grounded reference paths, and identifies three recurring classes of diagnosis errors: relation errors, branch errors, and missing errors. Case studies and expert evaluation demonstrate that VeriLLMed helps developers identify clinically implausible reasoning and generate actionable insights that can inform the improvement of medical LLMs.
Rational Design of Single-Phase High-Entropy Oxides via Large Language Model Data Mining and Explainable Machine Learning
🔥 引用:
0
Abstract: No abstract available; see the original article.
Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models
🔥 引用:
0
Abstract: Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. Continuous thought models address this bottleneck by reasoning in latent space rather than human-readable tokens. While they enable richer representations and faster inference, they raise a critical safety question: how can we detect misaligned reasoning in an uninterpretable latent space? To study this, we introduce MoralChain, a benchmark of 12,000 social scenarios with parallel moral/immoral reasoning paths. We train a continuous thought model with backdoor behavior using a novel dual-trigger paradigm: one trigger that arms misaligned latent reasoning ([T]) and another that releases harmful outputs ([O]). We demonstrate three findings: (1) continuous thought models can exhibit misaligned latent reasoning while producing aligned outputs, with aligned and misaligned reasoning occupying geometrically distinct regions of latent space; (2) linear probes trained on behaviorally distinguishable conditions ([T][O] vs [O]) transfer to detecting armed-but-benign states ([T] vs baseline) with high accuracy; and (3) misalignment is encoded in early latent thinking tokens, suggesting safety monitoring for continuous thought models should target the "planning" phase of latent reasoning.
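Finding (1), that aligned and misaligned reasoning occupy geometrically distinct regions, is what makes finding (2)'s probes work. A toy separator makes the intuition concrete; the 2-D "latent vectors" below are fabricated stand-ins, and a nearest-centroid rule replaces the paper's trained linear probes:

```python
# If the two classes form separated clusters in latent space, even a
# nearest-centroid rule (a simple linear decision boundary) detects
# misaligned states. Vectors here are synthetic illustrations.

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def probe(aligned, misaligned):
    ca, cm = centroid(aligned), centroid(misaligned)
    def classify(x):
        da = sum((x[i] - ca[i]) ** 2 for i in range(len(x)))
        dm = sum((x[i] - cm[i]) ** 2 for i in range(len(x)))
        return "misaligned" if dm < da else "aligned"
    return classify

clf = probe(aligned=[(0.9, 1.1), (1.1, 0.9)],
            misaligned=[(-1.0, -0.9), (-0.8, -1.2)])
print(clf((1.0, 1.0)))    # aligned
print(clf((-1.0, -1.0)))  # misaligned
```

The transfer result in (2) then says a boundary fitted on one pair of conditions still separates a related pair, which is plausible exactly when the geometry, not the behavior, carries the signal.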
An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
🔥 引用:
0
Abstract: Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting practical applicability in privacy-sensitive or resource-constrained environments. In this paper, we present a systematic empirical evaluation of two locally deployed LLMs, LLaMA 3.2 and Mistral, for real-world Python bug detection using the BugsInPy benchmark. We evaluate 349 bugs across 17 projects using a zero-shot prompting approach at the function level and an automated keyword-based evaluation framework. Our results show that locally executed models achieve accuracy between 43% and 45%, while producing a large proportion of partially correct responses that identify problematic code regions without pinpointing the exact fix. Performance varies significantly across projects, highlighting the importance of codebase characteristics. The results demonstrate that local models can identify a meaningful share of bugs, though precise localization remains difficult for locally executed LLMs, particularly when handling complex and context-dependent bugs in realistic development scenarios.
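The automated keyword-based evaluation can be sketched as a simple overlap check that also yields the "partially correct" category the abstract reports. The three-way rubric, keyword lists, and `grade` name are illustrative assumptions, not the paper's exact framework:

```python
# Grade a model's bug-detection answer against keyword lists:
# "correct" if it mentions all ground-truth fix keywords,
# "partial" if it at least names the buggy region,
# "incorrect" otherwise. (Illustrative rubric only.)

def grade(answer, fix_keywords, region_keywords):
    a = answer.lower()
    if all(k.lower() in a for k in fix_keywords):
        return "correct"
    if any(k.lower() in a for k in region_keywords):
        return "partial"
    return "incorrect"

print(grade("the off-by-one in parse_range; change <= to <",
            fix_keywords=["<= to <"],
            region_keywords=["parse_range"]))  # correct
```

Under such a rubric, an answer that names `parse_range` without the exact fix would count as partial, which is the response pattern the study found to dominate.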
A Novel Triose Phosphate Isomerase Inhibitor With Dual Trypanosomicidal Activity was Identified Using Artificial Intelligence‐Based Virtual Screening
🔥 引用:
0
Abstract: Chagas disease and leishmaniasis are neglected protozoan diseases recognized by the World Health Organization as major public health problems. These diseases affect millions of people worldwide, yet effective treatments remain unavailable. Triosephosphate isomerase (TIM), a glycolytic enzyme that exhibits high catalytic efficiency for the isomerization of glyceraldehyde-3-phosphate and dihydroxyacetone phosphate exclusively in its dimeric form, was subjected to virtual screening. Using a deep neural network for structure-based drug design that predicts binding affinity between small molecules and proteins of known structure, 12.5 million commercially available compounds were screened. From this, 82 compounds were selected for in vitro evaluation. Six compounds inhibited TIM from Trypanosoma cruzi, three of which exhibited anti-T. cruzi activity. Eight compounds demonstrated activity against the parasites T. cruzi and Leishmania infantum. Two compounds showed similar potency against both parasites: 3-(1-acetyl-5-(4-bromophenyl)-4,5-dihydro-1H-pyrazol-3-yl)-4-hydroxy-6-methyl-2H-pyran-2-one (IC50 = 16 ± 3 μM) and 3-[(4-bromophenyl)sulfanyl]-1-(3-nitrophenyl)propan-1-one (IC50 = 12 ± 1 μM). These compounds exhibit favorable selectivity and toxicological profiles, as well as in vivo activity, indicating their potential for future drug development.
An Agentic Framework for Intent Co-Creation in 6G NaaS: Architecture and Open-Source Model Evaluation
🔥 引用:
0
Abstract: 6G network complexity necessitates high levels of autonomy, yet current intent-based systems struggle with ambiguous or incomplete human requests. This paper introduces an agent-based, intent-driven end-to-end (E2E) orchestration framework designed for Network-as-a-Service (NaaS) delivery through collaborative intent co-creation. The proposed system leverages a pool of Domain Expert Agents and a TM Forum-aligned Body-of-Knowledge (BoK) to iteratively refine user requests into deterministic, machine-readable actions. A fundamental design principle is the decoupling of cognition and actuation, where AI-driven reasoning is isolated from standardized execution controllers to ensure safety and operational trust. The framework includes a dual-layer memory system to maintain coherence during multi-step collaborations. The presented prototype, built on ETSI OpenSlice and the Model Context Protocol (MCP), is evaluated across several open-source Large Language Models (LLMs). While these models demonstrate high instruction compliance, the results reveal a significant gap in translating high-resolution intents into valid, catalog-backed orders without hallucinations.
From Stateless Queries to Autonomous Actions: A Layered Security Framework for Agentic AI Systems
🔥 引用:
0
Abstract: Agentic AI systems face security challenges that stateless large language models do not. They plan across extended horizons, maintain persistent memory, invoke external tools, and coordinate with peer agents. Existing security analyses organize threats by attack type (prompt injection, jailbreaking), but provide no principled model of which architectural component is vulnerable or over what timescale the threat manifests. This paper makes five contributions. First, we introduce the Layered Attack Surface Model (LASM), a seven-layer framework that maps threats to distinct architectural components: Foundation, Cognitive, Memory, Tool Execution, Multi-Agent Coordination, Ecosystem, and Governance, the accountability and observability layer that spans the stack analogously to the network management plane. Second, we introduce attack temporality as an orthogonal analytical dimension with four classes: Instantaneous (T1), Session-Persistent (T2), Cross-Session Cumulative (T3), and Sub-Session-Stack, Non-Session-Bounded (T4). Third, through a systematic review of 94 papers (2021--2025), we show that the most dangerous emerging threats concentrate at the intersection of high-layer attacks (L5--L7) and slow-burn temporality (T3--T4): covert agent collusion, long-term memory poisoning, MCP supply-chain compromise, and alignment failure that manifests as an insider threat with no external adversary. Only 8 of 120 paper-cell assignments (7%) fall in this zone. Fourth, we propose a cross-layer defense taxonomy spanning all seven LASM layers and all four temporality classes, exposing which threat classes existing defenses leave unaddressed. Fifth, we survey evaluation benchmarks, identify five research gaps in the under-studied high-layer, slow-burn zone, and argue that agentic security must be treated as a distributed systems problem embedded in an adversarial ecosystem.
Process Supervision of Confidence Margin for Calibrated LLM Reasoning
🔥 引用:
0
Abstract: Scaling test-time computation with reinforcement learning (RL) has emerged as a reliable path to improving large language model (LLM) reasoning ability. Yet outcome-based reward often incentivizes models to be overconfident, leading to hallucinations, unreliable confidence-based control, and unnecessary compute allocation. We introduce Reinforcement Learning with Confidence Margin (RLCM), a calibration-aware RL framework that jointly optimizes correctness and confidence reliability via a margin-enhanced process reward over intermediate-budget completions. Rather than aligning confidence to correctness likelihoods, RLCM encourages the model to widen the confidence margin between correct and incorrect steps within a single reasoning trajectory. Across mathematical, coding, logic, and science benchmarks, our method substantially improves calibration while maintaining or improving accuracy. We further show that, with calibrated confidence signals, the resulting models enable more efficient conformal risk control and effective confidence-weighted aggregation.
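The margin-enhanced reward can be sketched as follows. The additive form, the `lam` weight, and per-step confidence inputs are assumptions for illustration; the paper's exact reward shaping may differ:

```python
# Sketch of a margin-enhanced process reward: outcome correctness plus a
# bonus proportional to the gap between the mean confidence on correct
# steps and the mean confidence on incorrect steps in one trajectory.

def margin_reward(step_correct, step_conf, lam=0.5):
    """step_correct: list of bools; step_conf: list of floats in [0, 1]."""
    correct = [c for ok, c in zip(step_correct, step_conf) if ok]
    incorrect = [c for ok, c in zip(step_correct, step_conf) if not ok]
    outcome = 1.0 if all(step_correct) else 0.0
    if correct and incorrect:
        margin = sum(correct) / len(correct) - sum(incorrect) / len(incorrect)
    else:
        margin = 0.0
    return outcome + lam * margin

print(margin_reward([True, False], [0.9, 0.3]))  # ≈ 0.3 under these assumptions
```

Note the incentive this creates: a model confident on its wrong steps is penalized relative to one that is confident only where it is right, which is the calibration pressure outcome-only rewards lack.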
Fine-tuned lightweight language models for structured extraction of liver cancer imaging free-text report: a comparative analysis with existing large language models
🔥 引用:
0
Abstract: No abstract available; see the original article.
Comparing prognostic performance and reasoning between large language models and physicians
🔥 引用:
0
Abstract: No abstract available; see the original article.
schema-miner pro: Agentic AI for Ontology Grounding Over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow
🔥 引用:
0
Abstract: Scientific processes are often described in free text, making it difficult to represent and reason over them computationally. We present schema-miner pro, a human-in-the-loop framework that automatically extracts and grounds structured schemas from scientific literature. Our approach combines large language models for schema extraction with an agent-based system that aligns extracted elements to external ontologies through interpretable, multi-step reasoning. The agent leverages lexical heuristics, semantic similarity, and expert feedback to ensure accurate grounding. We demonstrate the framework on two semiconductor manufacturing workflows—atomic layer deposition and atomic layer etching—mapping process parameters and outputs to the QUDT (Quantities, Units, Dimensions, and Types) ontology. By producing ontology-aligned, semantically precise schemas, schema-miner pro lays the groundwork for machine-actionable scientific knowledge and automated reasoning across disciplines.
Characterization of Purple Corn Anthocyanin Components and Their Pharmacokinetic Profiles through In Silico Study
DOI:
10.22146/jfps.23226
🔥 引用:
0
Abstract: Purple corn contains high levels of bioactive anthocyanins with potential health-promoting effects. This study aimed to quantitatively characterize the anthocyanin profile of an Indonesian purple corn variety and evaluate the pharmacokinetic properties of its major constituents using in silico modelling. The anthocyanin extract was analyzed by HPLC (high-performance liquid chromatography), and concentrations were determined using authentic standards. The pharmacokinetic parameters, including absorption, distribution, metabolism, excretion, and toxicity (ADMET), were predicted using the pkCSM platform, and drug-likeness properties were evaluated based on Lipinski’s Rule of Five. Cyanidin-3-glucoside (0.82 mg/100 g DW), peonidin-3-glucoside (0.38 mg/100 g DW), and pelargonidin-3-glucoside (0.35 mg/100 g DW) were identified as the predominant anthocyanins. In silico ADMET predictions using pkCSM revealed low intestinal absorption, moderate peripheral tissue distribution, and violations of Lipinski’s Rule of Five, indicating limited suitability as conventional oral pharmaceuticals. Nevertheless, the compounds exhibited low predicted toxicity, minimal CYP450 inhibition, and favorable excretion profiles, supporting potential safe use as nutraceuticals or functional food ingredients. This integrated approach demonstrates that experimental phytochemical profiling combined with computational pharmacokinetic analysis provides valuable insight into the properties of anthocyanins beyond standard bioavailability studies. The findings provide a foundation for future formulation strategies, such as encapsulation or delivery enhancement, to improve bioavailability and functional activity in foods or dietary supplements. Overall, anthocyanins from Indonesian purple corn are promising safe natural bioactives, warranting further experimental and translational research.
How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study
🔥 引用:
0
Abstract: In the early stages of scientific research, researchers rely on core scholarly judgments to identify relevant literature, assess credible evidence, and determine which directions merit pursuit. As AI tools become increasingly integrated into these early-stage workflows, the scholarly judgments that were once transparent and attributable to individual researchers become obscured, raising critical Responsible AI (RAI) concerns around accountability, transparency, and trust. Yet how these three dimensions manifest in real-time, in-situ scholarly practice remains largely unexplored. To address this gap, we conducted a think-aloud study with 15 researchers to examine how they used AI tools powered by large language models (LLMs) across early-stage research tasks, including literature exploration, synthesis, and research ideation. Our key findings address the tripartite constructs of accountability, transparency, and trust. First, the confident tone of AI outputs misrepresents epistemic uncertainty, making it more difficult for researchers (who are ultimately accountable) to identify which outputs require the greatest scrutiny. Second, opaque retrieval and content construction make provenance difficult to establish for transparency. Third, trust in AI is fragile, context-dependent, and easily eroded. In response, participants developed compensatory strategies to restore scholarly judgment under uncertainty. Overall, our findings contextualize AI-mediated research as an RAI problem grounded in lived researcher experience and motivate attention to deliberate AI integration that preserves accountability, supports transparency, and fosters informed trust.
SmartAgroWeb: A Multi-Module AI-Driven Agriculture Platform
DOI:
10.55041/ijsrem61161
🔥 引用:
0
Abstract: Agriculture is an essential driver of economic development and food security, but farmers still face many obstacles, such as unpredictable climate, poor crop choices, plant diseases, and a lack of real-time information on prices, government schemes, and markets.
This paper introduces SmartAgroWeb, an intelligent web-based agricultural support system that combines AI, ML, and real-time APIs into one cohesive solution.
The proposed solution uses a multi-layer architecture comprising presentation, application, service/API, and data layers. The system provides numerous features, including crop recommendation based on soil parameters, plant-disease detection using image processing, real-time weather forecasting, market-price monitoring, government-scheme recommendations, and expert consultation. The system also incorporates an AI-based chatbot that provides instant advisory support using a hybrid of rule-based logic and large language models.
The system is built on a client-server architecture using modern web technologies: React.js for the frontend, Spring Boot for backend processing, and MongoDB for data storage. It also relies on third-party APIs for real-time data, improving the accuracy and reliability of its recommendations.
The aim of this project is to integrate multiple services into a single user-friendly platform.
INDEX TERMS: Smart Agriculture, Crop Recommendation, Plant Disease Detection, Machine Learning, Artificial Intelligence, Decision Support System, AI Chatbot, Sustainable Farming
Quantifying conformational diversity in protein–ligand ensembles for structure-based virtual screening
🔥 引用:
0
Abstract: No abstract available; see the original article.
The Mathematical Basis of Problems of Forming Adaptive Algorithms for Machine Learning of Chatbots
🔥 引用:
0
Abstract: Modern information systems operate in environments characterized by parameter uncertainty, incomplete data, and constantly changing external conditions. Under such circumstances, traditional algorithms with a predetermined structure cannot guarantee an adequate level of performance, which motivates the development of adaptive mechanisms. This approach is especially relevant to training generative AI systems, in particular chatbots such as ChatGPT, Gemini, or GitHub Copilot. An adaptive algorithm can be interpreted as a dynamic model whose parameters are adjusted according to a chosen criterion of optimality or quality of operation. The exponential function is one of the key mathematical foundations of modern machine learning, as it appears in many models and computational procedures. The aim of this work is to formalize the mathematical basis of the problems of forming adaptive algorithms so that their correctness conditions are satisfied. The paper considers an approach to forming the mathematical basis of an adaptive algorithm for training chatbots with generative artificial intelligence, and presents a mathematical analysis of applying the exponential function in adaptive training procedures for large language models (LLMs). The exponential function is one of the fundamental tools for formalizing information processes. It is widely used to describe growth and decay processes, to form the normal distribution, to perform integral transforms, to solve signal-processing problems and build probabilistic schemes, and to study the effectiveness of adaptive approximation methods and to evaluate their characteristics. The proposed approach to constructing an adaptive algorithm rests on the interaction of three main components: a transition operator, a quality criterion, and a parameter-correction mechanism.
Its mathematical foundation is the repeated parallel evaluation of recurrence relations using Z-approximation. The process is realized through successive iterations with the possibility of tuning parameters on the fly while the algorithm runs, which makes it possible to achieve an optimal trade-off between convergence speed and accuracy of results. These results can be applied to the development of web services for information processing and special-purpose applications.
AI-Powered Personalized Learning and Career Enhancement Platform
🔥 引用:
0
Abstract: Choosing a suitable career has become an increasing obstacle for students in higher secondary and undergraduate programs. Many students find it difficult to identify career paths that align with their interests, psychological traits, and skill levels. Traditional career-counselling methods typically depend on static questionnaires and generic recommendations, which often fail to address individual uniqueness. This study proposes an AI-powered personalized learning and career-enhancement platform that combines psychological assessment, career recommendation, and adaptive course generation. The system leverages Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to provide intelligent, context-aware guidance. Users are assessed through dynamically generated psychological questions designed to identify personality traits and career inclinations across multiple dimensions. Based on the psychological profile and a skill-level assessment, the system recommends suitable career paths and generates a customized learning roadmap. Developed using Python and Django, the platform ensures continuous interaction between reasoning models and knowledge-retrieval mechanisms. The proposed solution aims to improve career clarity and learning efficiency through end-to-end personalization.
In-context modeling as a retrain-free paradigm for foundation models in computational science
🔥 引用:
0
Abstract: Building models that generalize across physical systems without retraining remains a central challenge in computational science. Here we introduce In-Context Modeling (ICM), a retrain-free paradigm that infers physical relationships directly from observational fields. Rather than encoding system-specific behavior in fixed parameters, ICM assimilates measurements as physical context and performs inference through a single forward pass. Trained in a physics-informed, label-free manner using governing equations, a single model generalizes across unseen materials, geometries, and loading conditions. Demonstrated on hyperelasticity, ICM integrates with finite-element simulations and is validated using experimental full-field measurements. Moreover, performance improves with increasing data diversity and computational budget, exhibiting favorable scaling behavior analogous to foundation models. By recasting physical modeling as in-context inference, this work establishes a transferable paradigm for retrain-free scientific learning and a foundation for scalable modeling across computational science.
In Silico Study of Coumarin Derivative Compounds as Janus Kinase 1 (JAK1) Inhibitors in Rheumatoid Arthritis
🔥 Citations:
0
Abstract: Rheumatoid arthritis is a chronic autoimmune disease involving activation of the Janus kinase 1 (JAK1) pathway, making JAK1 a potential target in drug development. This study aims to evaluate the potential of coumarin derivative compounds as JAK1 inhibitors in silico. This study used a computational approach through the molecular docking method to predict ligand–protein interactions, evaluation of physicochemical properties based on Lipinski’s Rule of Five, and ADMET analysis to assess pharmacokinetic and toxicity profiles. The results show that all compounds had negative binding affinity values, ranging from -6.5379 to -6.9271 kcal/mol, indicating stable ligand–protein interactions, although their binding was weaker than that of the positive control, Tofacitinib (-7.3968 kcal/mol). All compounds met the criteria of Lipinski’s Rule of Five, but ADMET analysis showed variations in pharmacokinetic and toxicity profiles. Compound 3 showed the best balance between activity, stability, and safety, whereas compounds 1 and 2 showed potential mutagenicity. The conclusion of this study emphasizes that compound 3 has the potential to be further developed as a JAK1 inhibitor candidate. The implications of this study indicate the importance of structural optimization and further experimental validation to improve the effectiveness and safety of coumarin derivative compounds as therapeutic candidates for rheumatoid arthritis.
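The drug-likeness screen the abstract mentions (Lipinski's Rule of Five) is mechanical enough to sketch. The property values in the usage example below are illustrative, not taken from the coumarin derivatives in the study:

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Count Lipinski Rule of Five violations for a candidate compound."""
    rules = [
        mw > 500,          # molecular weight over 500 Da
        logp > 5,          # octanol-water partition coefficient over 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(rules)

def passes_rule_of_five(mw, logp, h_donors, h_acceptors):
    """A compound is conventionally drug-like with at most one violation."""
    return lipinski_violations(mw, logp, h_donors, h_acceptors) <= 1
```

For example, a hypothetical compound with MW 320.3, logP 2.1, 1 donor and 4 acceptors passes, while one with MW 650.0 and logP 6.2 does not.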
Small Language Model Helps Resolve Semantic Ambiguity of LLM Prompt
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly utilized in various complex reasoning tasks due to their excellent instruction-following capability. However, the model's performance is highly dependent on the open-ended characteristics of the user's input prompt. Natural prompts often do not follow proper syntactic rules, which creates ambiguous queries that yield multiple interpretations. Such ambiguous prompts confuse the model in choosing the correct reasoning paths to answer questions. Prior works address this challenge by applying query editing during the LLM inference process without explicitly solving the root cause of the ambiguity. To address this limitation, we propose a pre-inference prompt optimization mechanism via explicit prompt disambiguation. Particularly, we identify semantic risks in the prompt, check their multi-perspective consistency, and resolve any semantic conflicts that arise. Finally, we organize the resolved ambiguities in a logically structured manner as a clean input to the LLM. By explicitly resolving semantic ambiguity, our method can produce a more focused attention distribution over the semantically essential tokens. We also leverage small language models (SLMs) as the main executor of prompt disambiguation to benefit from their efficient computation. Through comprehensive experiments on multiple benchmarks, we demonstrate that our method improves reasoning performance by 2.5 points at a cost of only $0.02. Our study promotes explicit prompt disambiguation as an effective prompt optimization method without disturbing the internal mechanism of LLM inference.
From Similarity to Structure: Training-free LLM Context Compression with Hybrid Graph Priors
🔥 Citations:
0
Abstract: Long-context large language models remain computationally expensive to run and often fail to reliably process very long inputs, which makes context compression an important component of many systems. Existing compression approaches typically rely on trained compressors, dense retrieval-style selection, or heuristic trimming, and they often struggle to jointly preserve task relevance, topic coverage, and cross-sentence coherence under a strict token budget. To address this, we propose a training-free and model-agnostic compression framework that selects a compact set of sentences guided by structural graph priors. Our method constructs a sparse hybrid sentence graph that combines mutual k-NN semantic edges with short-range sequential edges, extracts a topic skeleton via clustering, and ranks sentences using an interpretable score that integrates task relevance, cluster representativeness, bridge centrality, and a cycle coverage cue. A budgeted greedy selection with redundancy suppression then produces a readable compressed context in original order. Experimental results on four datasets show that our approach is competitive with strong extractive and abstractive baselines, demonstrating larger gains on long-document benchmarks.
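The selection pipeline described above (mutual k-NN semantic edges, short-range sequential edges, budgeted greedy selection with redundancy suppression) can be sketched end to end. This is a simplification under stated assumptions: bag-of-words cosine stands in for real sentence embeddings, and the score omits the paper's topic skeleton, bridge centrality, and cycle coverage terms:

```python
from collections import Counter
import math

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def compress(sentences, query, budget, k=2, lam=0.3, dup=0.8):
    """Select sentences under a token budget using hybrid-graph cues:
    mutual k-NN semantic edges plus sequential edges feed a degree term,
    combined with task relevance; greedy selection drops near-duplicates."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    q = Counter(query.lower().split())
    n = len(sentences)
    knn = [sorted((j for j in range(n) if j != i),
                  key=lambda j: -cosine(vecs[i], vecs[j]))[:k] for i in range(n)]
    degree = [0] * n
    for i in range(n):
        for j in knn[i]:
            if i in knn[j]:          # mutual k-NN semantic edge
                degree[i] += 1
        if i + 1 < n:                # short-range sequential edge
            degree[i] += 1
            degree[i + 1] += 1
    scores = [cosine(vecs[i], q) + lam * degree[i] / max(1, n - 1)
              for i in range(n)]
    chosen, used = [], 0
    for i in sorted(range(n), key=lambda i: -scores[i]):
        cost = len(sentences[i].split())
        if used + cost > budget:
            continue
        if any(cosine(vecs[i], vecs[j]) > dup for j in chosen):
            continue                 # redundancy suppression
        chosen.append(i)
        used += cost
    return [sentences[i] for i in sorted(chosen)]  # original order preserved
```

Returning sentences in original order, as in the last line, is what preserves the cross-sentence coherence the abstract emphasizes.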
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
🔥 Citations:
0
Abstract: Existing multi-agent Large Language Model (LLM) frameworks for code generation typically use execution feedback and improve iteratively using Input/Output (I/O) test cases. However, this does not work for scientific workflows, where I/O test cases do not exist, and generating them requires solving the very problem at hand. To address this, we introduce MOSAIC, a training-free multi-agent framework for scientific code generation without I/O supervision. Instead of execution feedback, MOSAIC employs a student-teacher knowledge distillation framework that grounds generation through domain-specific examples and structured problem decomposition. To further mitigate hallucinations across chained subproblems, we introduce a Consolidated Context Window (CCW) for maintaining consistent reasoning across agents. Experiments on the SciCode benchmark show that MOSAIC improves accuracy, executability, and numerical precision over existing approaches while relying on lightweight models.
IndustryAssetEQA: A Neurosymbolic Operational Intelligence System for Embodied Question Answering in Industrial Asset Maintenance
🔥 Citations:
0
Abstract: Industrial maintenance environments increasingly rely on AI systems to assist operators in understanding asset behavior, diagnosing failures, and evaluating interventions. Although large language models (LLMs) enable fluent natural-language interaction, deployed maintenance assistants routinely produce generic explanations that are weakly grounded in telemetry, omit verifiable provenance, and offer no testable support for counterfactual or action-oriented reasoning, shortcomings that undermine trust in safety-critical settings. We present IndustryAssetEQA, a neurosymbolic operational intelligence system that combines episodic telemetry representations with a Failure Mode Effects Analysis Knowledge Graph (FMEA-KG) to enable Embodied Question Answering (EQA) over industrial assets. We evaluate on four datasets covering four industrial asset types, including rotating machinery, turbofan engines, hydraulic systems, and cyber-physical production systems. Compared to LLM-only baselines, IndustryAssetEQA improves structural validity by up to 0.51, counterfactual accuracy by up to 0.47, and explanation entailment by 0.64, while reducing severe expert-rated overclaims from 28% to 2% (approximately 93% reduction). Code, datasets, and the FMEA-KG are available at https://github.com/IBM/AssetOpsBench/tree/IndustryAssetEQA/IndustryAssetEQA.
Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA
🔥 Citations:
0
Abstract: Unlike traditional fact-based retrieval, rationale-based retrieval typically necessitates cross-encoding of query-document pairs using large language models, incurring substantial computational costs. To address this limitation, we propose Rabtriever, which independently encodes queries and documents while providing cross query-document comprehension capabilities comparable to rerankers. We start by training an LLM-based generative reranker, which places the document before the query and prompts the LLM to produce a relevance score from log probabilities. We then employ it as the teacher of an on-policy distillation framework, with Rabtriever as the student that reconstructs the teacher's context-aware query embedding. To achieve this effect, Rabtriever is first initialized from the teacher, with parameters frozen. The Joint-Embedding Predictive Architecture (JEPA) paradigm is then adopted, which integrates a lightweight, trainable predictor between LLM layers and heads, projecting the query embedding into a new hidden space, with the document embedding as the latent vector. JEPA then minimizes the distribution difference between this projected embedding and the teacher embedding. To strengthen the sampling efficiency of on-policy distillation, we also add an auxiliary loss on the reverse KL of LLM logits, to reshape the student's logit distribution. Rabtriever reduces the teacher's complexity in document length from quadratic to linear, verified both theoretically and empirically. Experiments show that Rabtriever outperforms different retriever baselines across diverse rationale-based tasks, including empathetic conversations and robotic manipulations, with minor accuracy degradation from the reranker. Rabtriever also generalizes well on traditional retrieval benchmarks such as MS MARCO and BEIR, with comparable performance to the best retriever baseline.
Integrating multimodal clinical data with a large model for prostate cancer diagnosis
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Crimewatch AI: A Real-Time Predictive Crime Mapping and Public Safety System using Machine Learning and NLP
🔥 Citations:
0
Abstract: CrimeWatch AI is a sophisticated intelligent system that utilizes Machine Learning (ML), Natural Language Processing (NLP), and geospatial analytics to enhance public safety in urban and rural areas. Through the combination of structured crime data and unstructured textual information such as police reports, news articles, and social media content, the system identifies concealed patterns in behavior that can lead to potential criminal activity. The system utilizes supervised learning algorithms such as Random Forest and Logistic Regression to detect crime-prone areas with accuracy and generate predictive insights for proactive intervention. Structured data analysis is used alongside NLP techniques, including tokenization, NER, sentiment analysis, and topic modeling, for extracting meaningful contextual information from text. Using a locally deployed Large Language Model (LLM), the system extracts structured data such as crime type, location, and severity from inputs in natural language. This integration improves the system's ability to detect emerging trends, identify critical entities, and capture spatial and temporal patterns in criminal activity. It also uses clustering techniques for hotspot detection and interactive tools for visualizing results (e.g., interactive heatmaps and trend graphs). Users and law enforcement agencies can easily understand the intricate data presented by CrimeWatch AI thanks to its intuitive visualization dashboard, a major factor in its adoption. Integrating predictive analytics with real-time risk assessment, the system offers safe route recommendations to enhance navigation security. Experimental results indicate that the proposed system significantly improves prediction accuracy, enhances situational awareness, and enables data-driven decision-making.
Generally speaking, CrimeWatch AI is a flexible and adaptable tool for contemporary crime analysis, with potential applications in smart city infrastructure, real-time surveillance systems, and intelligent public safety management.
Keywords: crime prediction; machine learning; NLP; predictive analytics; crime mapping; data mining; public safety systems; hotspot detection; artificial intelligence; smart surveillance.
Towards Agentic Test-Driven Quality Assurance for 6G Networks
🔥 Citations:
0
Abstract: This work proposes an agentic, intent-driven end-to-end (E2E) orchestration framework that integrates intent co-creation with a Test-Driven Quality Assurance paradigm. In this framework, autonomous agents iteratively refine a user's initial intent into a confirmed, auditable specification. Furthermore, the system automatically derives validation tests from these intents before provisioning, directly mirroring the Test-Driven Development workflow in software engineering to ensure proactive Service Level Agreement (SLA) compliance. The architecture is grounded in a standards-aligned knowledge representation using TM Forum (TMF) information models and catalogs. This enables deterministic graph traversal from high-level Product Offerings down to granular Service/Resource and Test specifications. We prototyped this architecture by extending OpenSlice with a message-driven, multi-agent pattern and integrating MCP-enabled (Model Context Protocol) tool access for real-time knowledge retrieval. Currently, our evaluation of the agents targets the intent co-creation phase as a baseline toward full-scale orchestration. Building on experiments with multiple open-source Large Language Model (LLM) backends integrated with the TMF-based knowledge base, we observe substantial variability in tool-use reliability and hallucination patterns, underscoring the critical importance of robust knowledge integration in agentic 6G systems.
PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
🔥 Citations:
0
Abstract: The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g., SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target's visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target's trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks: (1) cold-start personalization: current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction; (2) static attack strategies: existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations: (1) VLM-based social-context training: to eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation; (2) adaptive psychological agent: we introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on the target's responses, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.
A graph-based Neural Network surrogate model for accelerating semi-analytical model of galaxy formation and evolution
🔥 Citations:
0
Abstract: Understanding how galaxy populations emerge and evolve from the growth of dark matter structure is a central challenge in galaxy formation theory. Semi-analytic models (SAMs) provide an efficient framework to address this problem, but exploring large ensembles of merger trees across broad parameter spaces remains computationally demanding. We develop a conditional graph neural network surrogate model that combines merger tree information with SAM parameters to predict galaxy properties across cosmic time. Using merger trees of dark matter halos from the Uchuu simulation and the Galacticus SAM, the model predicts stellar mass, luminosity, angular momentum, gas metal mass, and specific star formation rate across the wide redshift range 0 ≤ z ≤ 5. For instance, the model can predict stellar mass at 0 ≤ z ≤ 3 with a scatter of 0.19-0.28 dex and coefficient of determination R^2 of 0.946-0.973 (R^2 close to 1 indicates prediction closely matching the truth). The results show that a single graph-based model can reproduce these galaxy properties with good accuracy over multiple SAM realizations, merger trees, and redshifts. This catalog-level model provides a practical route for accelerating SAM-based studies of galaxy formation to enable a more detailed investigation of the model parameter space. The inference code, trained models, and example data products are publicly available at https://github.com/MutongCat/sam2galaxy-gnn.
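The two evaluation metrics quoted (scatter in dex and the coefficient of determination R^2) are standard. A minimal sketch of how they would be computed over log10 stellar masses, with made-up values in the test rather than numbers from the paper:

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def scatter_dex(log_true, log_pred):
    """RMS of residuals in log10 space, i.e. the scatter in dex."""
    n = len(log_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(log_true, log_pred)) / n)
```

Because both metrics are taken in log10 space, a scatter of 0.2 dex corresponds to a typical multiplicative error of about 10^0.2 ≈ 1.6 in stellar mass.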
EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
🔥 Citations:
0
Abstract: Emotionally talking head video generation aims to generate expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotional labels, leading to insufficient semantic information. While introducing high-level semantics enhances expressiveness, it easily causes lip-sync degradation. Furthermore, mainstream generation methods struggle to balance computational efficiency and global motion awareness in long videos and suffer from poor temporal coherence. Therefore, we propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) to explicitly model temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model is employed to extract textual descriptions from real videos, serving as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing methods in terms of lip-sync accuracy, temporal consistency, and emotional accuracy.
Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
🔥 Citations:
0
Abstract: Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
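The decoding contrast at issue reduces to taking the argmax of the next-token logits versus sampling from a temperature-scaled softmax. A minimal, library-free sketch of both (toy logits, not any particular model's):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    """Greedy decoding: always pick the highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Stochastic decoding: draw a token from the softmax distribution."""
    probs = softmax(logits, temperature)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Greedy decoding is deterministic and always returns the distribution's mode, which is why calibration (how well the mode's probability tracks correctness) is the natural lens for analyzing it in closed-ended tasks like VQA.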
Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization
🔥 Citations:
0
Abstract: Large language models (LLMs) frequently produce detail hallucinations when processing long regulatory documents, including subtle errors in threshold values, units, scopes, obligation levels, and conditions that preserve surface plausibility while corrupting safety-critical parameters. We formalize this phenomenon through a fine-grained Detail Error Taxonomy of five error types and introduce DetailBench, a benchmark built from 172 real regulatory documents and 150 synthetic documents spanning three jurisdictions, with human-annotated detail-level ground truth comprising 13,000 preference pairs. We propose DetailDPO, a targeted preference optimization framework that constructs contrastive pairs differing in exactly one detail dimension, concentrating DPO gradient signal on detail-bearing tokens. We provide theoretical analysis showing why minimal detail perturbation pairs yield gradient concentration under mild assumptions. Experiments on the Qwen2.5 family (7B, 14B, 72B) and Llama-3.1-8B across three context-length tiers (8K-64K tokens) show that DetailDPO reduces the Detail Error Rate by 42-61% relative to baselines, with consistent gains across all five error types and cross-domain transfer to financial and medical documents.
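DetailDPO builds on the standard DPO objective; the paper's contribution lies in how the preference pairs are constructed (pairs differing in exactly one detail), while the per-pair loss is the usual one. A minimal sketch of that loss, with illustrative log-probabilities:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_w / logp_l: policy log-probs of the chosen / rejected response;
    ref_logp_w / ref_logp_l: the same under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a minimal-perturbation pair, chosen and rejected responses share all tokens except the perturbed detail, so the gradient of this loss is nonzero only where the sequences differ, which is the gradient-concentration effect the abstract describes.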
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
🔥 Citations:
0
Abstract: Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark introduces four tasks, namely Emotion Change Detection (ECD), Emotion State Identification (ESI), Emotion Transition Reasoning (ETR), and Next Emotion Prediction (NEP), forming a progressive evaluation framework from coarse-grained detection to deeper reasoning and prediction. We conduct a comprehensive evaluation of 18 state-of-the-art MLLMs on EmoTrans and obtain two main findings. First, although current MLLMs show relatively stronger performance on coarse-grained emotion change detection, they still struggle with fine-grained emotion dynamics modeling. Second, socially complex settings, especially multi-person scenarios, remain substantially challenging, while reasoning-oriented variants do not consistently yield clear improvements. To facilitate future research, we publicly release the benchmark, evaluation protocol, and code at https://github.com/Emo-gml/EmoTrans.
From Language to Logic: Bridging LLMs & Formal Representations for RTL Assertion Generation
🔥 Citations:
0
Abstract: SystemVerilog Assertions (SVA) are essential for formal verification of digital hardware, yet their manual creation demands significant expertise in both the design under verification and temporal logic. Recent studies have explored using large language models (LLMs) to automate SVA generation, but existing approaches suffer from incorrect signal references, missing timing constraints, and lack of formal correctness guarantees. This paper presents ProofLoop, a tool-augmented ReAct agent that generates SVA from natural-language specifications using a solver-in-the-loop approach. The agent operates in two phases: Phase A autonomously gathers design context by invoking EDA and formal tools, including semantic search over an AST-indexed vector database and JasperGold structural queries, while Phase B generates SVA and iteratively refines it using JasperGold formal proof feedback over a fixed maximum number of verification rounds (here, 3). We evaluate ProofLoop on FVEval Design2SVA design benchmarks and demonstrate that this framework can achieve 93.7% syntax correctness and 82.0% functional correctness. An ablation study confirms that each component (retrieval-augmented generation (RAG), the JasperGold tools, and the verification loop) contributes significantly and orthogonally.
Ascend AI: An Intelligent, Multimodal Framework for Personalized Career Direction and Adaptive Technical Interview Simulation
🔥 Citations:
0
Abstract: The rapid diversification of technical specializations within the computer science and information technology domains presents a significant educational challenge for students, who frequently lack the profound self-awareness and practical guidance imperative to selecting professional pathways perfectly aligned with their inherent psychological and cognitive traits. Consequently, pivotal career decisions are systematically driven by external peer trends, superficial fascinations, or arbitrary assumptions rather than intrinsic behavioral suitability. This paper introduces Ascend AI, a comprehensive, artificial intelligence-driven career orientation framework meticulously designed to mitigate this structural uncertainty. The proposed architecture seamlessly integrates quantitative psychological profiling, generative LLM-based learning curriculum methodologies, and responsive, interactive audio interview simulations into a cohesive, decoupled microservice platform. Phase one of the framework processes multi-dimensional vocational preferences and personality metrics, captured via a standardized 50-item Big Five (OCEAN) inventory, a 48-item RIASEC model survey, and a distinct cognitive reading-comprehension assessment. These vectors advance through a dual-pipeline machine learning ensemble, integrating K-Means clustering and Soft-Voting Logistic Regression, to empirically predict optimal technical career trajectories. Phase two translates these discriminative mathematical predictions into highly customized, dynamically generated 10-day micro-learning roadmaps utilizing Google Gemini Large Language Models (LLMs) and strict Pydantic schema validations to ensure structured, hallucination-free knowledge acquisition.
Channel Adaptation for EEG Foundation Models: A Systematic Benchmark Across Architectures, Tasks, and Training Regimes
🔥 Citations:
0
Abstract: Scaling EEG foundation models requires pooling data across heterogeneous electrode montages, a prerequisite both for larger pretraining corpora and for downstream deployment. We present the first systematic comparison of four channel adaptation methods (Conv1d projection, spherical spline interpolation (SSI), source-space decomposition, and Riemannian re-centering) across five pretrained EEG foundation models (5M-157M parameters), five downstream tasks, and two training regimes with 10-15 random seeds each. We find that rigid-montage models (BENDR, Neuro-GPT) require external adaptation, while flexible models (EEGPT, CBraMod) match or exceed it natively when fine-tuned but benefit from external methods under frozen-encoder deployment. A probe-SFT asymmetry exists: external adaptation can cause severe negative transfer during fine-tuning of flexible models. The optimal method is architecture-dependent (Conv1d for BENDR, SSI/Riemannian for Neuro-GPT, source-space decomposition for depression detection), and 5M-parameter CBraMod outperforms models up to 31× larger on 4/5 datasets, consistent with independent findings that compact EEG-specific architectures can match larger models.
Bridging Expert Insight and AI Reasoning: A Hybrid Systems Model of Iran's Fertility Dynamics
DOI:
10.1002/sres.70044
🔥 Citations:
0
Abstract: Population dynamics are inherently complex, shaped by nonlinear feedbacks among economic, cultural, health and governance systems. This study focuses on Iran's sustained sub-replacement fertility and develops a hybrid modelling framework to construct a causal loop diagram (CLD), integrating group model building (GMB), large language models (LLMs) and retrieval-augmented generation (RAG) through a six-step process: (1) initial dynamic hypothesis formulation; (2) expert-driven CLD development; (3) AI-driven CLD development; (4) model integration; (5) evidence anchoring using peer-reviewed literature and (6) expert validation. The final CLD reveals how reinforcing mechanisms (e.g., education-modernity and economic confidence) interact with balancing constraints (e.g., childrearing costs, delayed marriage and institutional capacity) to sustain low fertility in Iran. The study demonstrates how structured human-AI collaboration can enhance transparency, theoretical grounding and policy relevance in systems modelling of demographic change, particularly in data-limited and rapidly evolving contexts.
Automating Categorization of Scientific Texts with In-Context Learning and Prompt-Chaining in Large Language Models
🔥 Citations:
0
Abstract: The relentless expansion of scientific literature presents significant challenges for navigation and knowledge discovery. Within Research Information Retrieval, established tasks such as text summarization and classification remain crucial for enabling researchers and practitioners to effectively navigate this vast landscape, and efforts have increasingly focused on developing advanced research information systems. These systems aim not only to provide standard keyword-based search functionalities but also to incorporate capabilities for automatic content categorization within knowledge-intensive organizations across academia and industry. This study systematically evaluates the performance of off-the-shelf Large Language Models (LLMs) in analyzing scientific texts according to a given classification scheme. We utilized the hierarchical ORKG taxonomy as a classification framework, employing the FORC dataset as ground truth. We investigated the effectiveness of advanced prompt engineering strategies, namely In-Context Learning (ICL) and Prompt Chaining, and experimentally explored the influence of the LLMs' temperature hyperparameter on classification accuracy. Our experiments demonstrate that Prompt Chaining yields superior classification accuracy compared to pure ICL, particularly when applied to the nested structure of the ORKG taxonomy. LLMs with prompt chaining outperform the state-of-the-art models for domain (1st level) prediction and show even better performance for subject (2nd level) prediction compared to the older BERT model. However, LLMs are not yet able to perform well in classifying the topic (3rd level) of research areas based on this specific hierarchical taxonomy, as they only reach about 50% accuracy even with prompt chaining.
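Prompt chaining over a hierarchical taxonomy amounts to conditioning each level's prompt on the previous level's prediction. A sketch where `ask` stands in for the actual LLM call; the function name, prompts, and two-level taxonomy are illustrative, not the ORKG schema:

```python
def classify_with_chaining(text, taxonomy, ask):
    """Two-step prompt chain over a {domain: [subjects]} taxonomy.

    Step 1 predicts the 1st-level domain; step 2 restricts the
    2nd-level subject choices to that domain's children, so the
    second prompt never sees subjects from unrelated domains.
    `ask(prompt, options)` must return one element of `options`.
    """
    domains = list(taxonomy)
    domain = ask(f"Which domain best fits this abstract?\n{text}", domains)
    subjects = taxonomy[domain]
    subject = ask(f"Within {domain}, which subject best fits?\n{text}", subjects)
    return domain, subject
```

Narrowing the candidate set at each level is what lets chaining beat a single flat prompt on nested taxonomies: the model decides among a handful of siblings instead of the full label space.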
Au-M-ol: A Unified Model for Medical Audio and Language Understanding
🔥 Citations:
0
Abstract: In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards
🔥 Citations:
0
Abstract: The deployment of Large Language Models (LLMs) as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignments to produce outputs violating regulatory standards, assuming threats from authorized users, such as operators, who craft malicious prompts to elicit non-compliant guidance. Three state-of-the-art LLMs (OpenAI's GPT-4o mini, Google's Gemini 2.0 Flash-Lite, and Anthropic's Claude 3.5 Haiku) were tested against Baseline, BitBypass, and DeepInception jailbreaking methods across scenarios derived from nine NERC Reliability Standards (EOP, TOP, and CIP). In the initial broad experiment, the overall Attack Success Rate (ASR) was 33.1%, with DeepInception proving most effective at 63.17% ASR. Claude 3.5 Haiku exhibited complete resistance (0% ASR), while Gemini 2.0 Flash-Lite was most vulnerable (55.04% ASR) and GPT-4o mini moderately susceptible (44.34% ASR). A follow-up experiment refining malicious wording in Baseline and BitBypass attacks yielded a 30.6% ASR, confirming that subtle prompt adjustments can enhance simpler methods' efficacy.
Cloud-Native Topological Quantum Computing Using Distributed Photonic Anyon Networks
🔥 引用:
0
Abstract:
This study investigates a cloud-native framework for topological quantum computing in which distributed photonic quantum networks support anyon-mediated logical operations, validated through formal reasoning agents, proof agents and benchmark-oriented simulation. The central research problem concerns the gap between theoretically robust anyonic quantum information processing and the operational realities of distributed quantum infrastructure, where photonic loss, repeater latency, cloud orchestration overhead, service-chain anomalies and proof-level uncertainty jointly shape the viability of large-scale computation. The proposed framework treats topological protection, photonic entanglement distribution and cloud-native service governance as a single cyber-physical architecture rather than as separate hardware, software and mathematical layers.
The method integrates a literature-derived, realistic benchmark dataset with analytic modelling of fidelity decay, logical error proxies, latency decomposition and validation confidence. The results indicate that a cloud-native photonic anyon network can reduce the logical error proxy from 3.48e-2 in baseline photonic distribution to 4.91e-3 while maintaining a lower latency profile than monolithic quantum job execution. The paper contributes a five-subtheme literature synthesis, a layered architectural model, equations linking entanglement fidelity to logical failure probability, and a validation workflow that converts tentative mathematical reasoning into machine-verifiable proof artefacts. The study concludes that cloud-native topological quantum computing is most credible when photonic networking, anyon code design, observability and proof checking are co-designed under explicit service-level and error-budget constraints.
Keywords: cloud-native quantum computing; topological quantum computing; photonic quantum networks; anyons; distributed quantum computing; quantum repeaters; Kubernetes; formal proof agents; logical error modelling; quantum service mesh.
Tackling the Generic Masculine: Evaluating Gender Neutrality of German AI-Generated Texts
🔥 引用:
0
Abstract: When using large language models (LLMs)—artificial intelligence (AI) systems trained to generate and interpret human language—gender-specific biases in AI-generated text represent a key challenge. Particularly in grammatically gendered languages such as German, this often results in text outputs that use the so-called generic masculine as an allegedly gender-neutral default form. Consequently, generated text is neither neutral nor inclusive, and gender stereotypes are perpetuated. Organizations seeking to offer LLM-based systems that generate inclusive language by default typically rely on system prompts or specific model configurations to steer the model’s responses. However, for German, methods for automatically, systematically, and objectively assessing whether such approaches enhance gender neutrality and inclusivity remain limited and underexplored. To address this gap, a framework was developed that applies the concept of LLM-as-a-judge. This approach involves using an LLM to systematically evaluate the outputs of another, thereby enabling automated and replicable assessments of features such as gender neutrality and inclusivity. The paper presents the development and evaluation of a prototype of this framework, designed specifically for German, following a Design Science Research approach. Using the framework, the effectiveness of configurations or system prompts can be evaluated. To enable this in a systematic and replicable manner, a catalogue of 150 prompts in German was developed, adapting and extending approaches from other languages. The outputs generated by an LLM in response to these prompts are then assessed by an evaluation module: linguistic analysis identifies gendered forms and grammatical structures, while scoring metrics quantify the degree of gender neutrality and inclusivity. To demonstrate the framework, it was applied in several test runs using an iteratively developed system prompt designed to elicit gender-neutral responses.
The resulting metrics allowed assessment of whether a given prompt effectively enhances the neutrality of generated outputs and reduces gender-specific bias. Potential applications of the framework in organisational settings, as well as its relevance for the development of responsible AI systems, are outlined.
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Beyond performance metrics: evaluating the unique value of generative AI in hybrid cybersecurity threat detection
🔥 引用:
0
Abstract:
This study examines the role of generative artificial intelligence (GenAI) in cybersecurity threat detection, focusing on its usefulness in workflows that support human decision-making.
Experiments were performed on the BODMAS dataset (134,435 samples) and a smaller exploratory subset of UNSW-NB15. State-of-the-art machine learning (ML) classifiers were compared with a zero-shot large language model (LLM) using standard classification metrics, while also considering latency, cost, and hallucination risk.
ML classifiers consistently outperformed the LLM-based system on standard detection metrics. However, the LLM showed value in cases of ambiguity, where it could provide short plain-language explanations, organize alert-related context, and generate initial interpretations for instances that did not match learned classes.
GenAI is unlikely to replace ML-based detection methods, but it can provide useful interpretive support for ambiguous or unfamiliar alerts. A hybrid pipeline is therefore proposed, in which ML handles high-confidence and time-sensitive decisions, while the LLM is used selectively for low-confidence cases or when explanatory support is needed. Human oversight remains necessary to address hallucination risk and ensure reliability.
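The proposed hybrid pipeline amounts to a confidence gate placed in front of the LLM. A minimal sketch follows; the function names and thresholds are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Triage:
    label: str        # "malicious" or "benign"
    route: str        # "auto" (ML only) or "llm_review"
    confidence: float

def triage(p_malicious: float, hi: float = 0.95, lo: float = 0.05) -> Triage:
    """Confidence-gated triage: the ML classifier alone handles
    clear-cut alerts; ambiguous scores are routed to the LLM for
    interpretive support, keeping it off the time-sensitive path."""
    if p_malicious >= hi:
        return Triage("malicious", "auto", p_malicious)
    if p_malicious <= lo:
        return Triage("benign", "auto", 1.0 - p_malicious)
    # Ambiguous region: keep a provisional label, request an LLM explanation.
    label = "malicious" if p_malicious >= 0.5 else "benign"
    return Triage(label, "llm_review", max(p_malicious, 1.0 - p_malicious))
```

Under such a gate, analysts would see LLM-generated explanations only for the `llm_review` fraction of alerts, which bounds both latency and hallucination exposure.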
Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
🔥 引用:
0
Abstract: Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-K recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-K metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window [α, α+d] to more directly align with Top-K metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-K performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
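The windowed partial AUC the abstract describes can be evaluated directly from a staircase ROC curve. The sketch below is illustrative only; tie handling is omitted, and normalizing by the window width d (so a perfect ranker scores 1.0) is an assumption, not necessarily the paper's convention:

```python
import numpy as np

def windowed_pauc(y_true, scores, alpha=0.0, d=0.1):
    """Partial AUC with the false positive rate restricted to
    [alpha, alpha + d], normalized by the window width d.
    Assumes no tied scores across classes, for simplicity."""
    y = np.asarray(y_true, dtype=float)
    s = np.asarray(scores, dtype=float)
    order = np.argsort(-s)            # rank items by descending score
    y = y[order]
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    tpr = np.concatenate([[0.0], np.cumsum(y) / n_pos])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / n_neg])
    lo, hi = alpha, alpha + d
    area = 0.0
    for k in range(1, len(fpr)):
        if fpr[k] > fpr[k - 1]:       # horizontal ROC step (a negative item)
            w = min(fpr[k], hi) - max(fpr[k - 1], lo)
            if w > 0:
                area += tpr[k] * w    # step height is the current TPR
    return area / d
```

With alpha = 0 and small d, the metric only rewards positives ranked above nearly all negatives, which is exactly the Top-K-aligned behavior the abstract motivates.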
SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
🔥 引用:
0
Abstract: Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.
STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation
🔥 引用:
0
Abstract: Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.
AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
🔥 引用:
0
Abstract: Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code.
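The selection step, finding a compact verifier set whose joint (AND) satisfaction best matches the dev labels, can be illustrated with a brute-force stand-in for the paper's DAG-based search; all function names here are hypothetical:

```python
from itertools import combinations

def f1(preds, labels):
    """F1 of boolean predictions against boolean labels."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def select_verifier_set(verifiers, dev_outputs, dev_labels, max_size=3):
    """Score every subset of verifiers up to max_size by the F1 of
    its joint (AND) satisfaction; return the best compact set.
    Exhaustive search is a toy stand-in for a guided DAG search."""
    best, best_f1 = (), -1.0
    for k in range(1, max_size + 1):
        for subset in combinations(range(len(verifiers)), k):
            preds = [all(verifiers[i](o) for i in subset) for o in dev_outputs]
            score = f1(preds, dev_labels)
            if score > best_f1:
                best, best_f1 = subset, score
    return [verifiers[i] for i in best], best_f1
```

For example, with verifiers like `lambda o: o.isdigit()` and `lambda o: len(o) <= 3`, the search can discover that their conjunction matches a correctness labeling that neither check matches alone.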
Efficient Adaptation of Vision Foundation Model for High-Resolution Remote Sensing Image Segmentation via Spatial-Frequency Modeling and Sparse Refinement
DOI:
10.3390/rs18091295
🔥 引用:
0
Abstract: High-resolution remote-sensing semantic segmentation requires models to simultaneously capture global scene semantics and preserve fine-grained local structures. Although satellite-pretrained vision foundation models provide strong transferable representations, the features extracted by a frozen backbone remain insufficiently adapted to dense prediction, particularly for representing high-frequency details and multiscale local patterns. In addition, correcting residual prediction errors with dense full-map refinement introduces substantial computational redundancy, since hard errors are typically concentrated in only a small subset of locations. To address these challenges, we propose ADVMSeg, an efficient remote-sensing semantic segmentation framework built upon a frozen satellite-pretrained DINOv3 backbone. Specifically, we introduce a Spatial-Frequency Adapter (SF-Adapter) to improve backbone-level dense feature adaptation by jointly modeling global frequency responses and multiscale local spatial details in a lightweight bottleneck space. We further design an Adaptive Sparse Refinement (ASR) module after the pixel decoder, which identifies hard regions from coarse predictions via uncertainty and boundary cues, and performs targeted local cross-attention refinement only on selected critical locations. Extensive experiments on GID-15, LoveDA, and ISPRS Potsdam validate the effectiveness of the proposed framework. Under the unified setting, ADVMSeg achieves 63.1% mIoU on GID-15, 63.5% mIoU on LoveDA, and 81.4% mIoU on ISPRS Potsdam. These results validate the effectiveness of jointly improving backbone-level feature adaptation and prediction-stage computation allocation under the evaluated setting: a frozen DINOv3 backbone and three representative remote-sensing semantic-segmentation datasets.
Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis
🔥 引用:
0
Abstract: Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.
OMNI (Optimized Machine for Navigation & Interaction)
DOI:
10.55041/ijsrem60690
🔥 引用:
0
Abstract: This paper presents OMNI (Optimized Machine for Navigation and Interaction), a fully offline, affordable, and intelligent home assistant robot designed to assist elderly individuals and children through natural voice interaction, autonomous navigation, face recognition, and smart home automation. Traditional home environments lack intelligent robotic systems capable of providing genuine companionship, physical assistance, and home device control in a unified offline platform. OMNI addresses this gap using a dual-processor architecture combining a Raspberry Pi 3B as the primary AI brain and an ESP32 microcontroller as the real-time body controller. The four-wheeled differential drive chassis is powered by 100RPM geared DC motors controlled through an L298N
Keywords: OMNI, Home Assistant Robot, Raspberry Pi, ESP32, Autonomous Navigation, Offline AI, Speech Recognition, Face Recognition, Natural Language Processing, Large Language Model, Human-Robot Interaction, Home Automation, Obstacle Avoidance, Edge AI, OpenCV, Whisper STT, Piper TTS, MQTT, Socially Assistive Robotics, Differential Drive.
WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models
DOI:
10.1613/jair.1.19001
🔥 引用:
0
Abstract: Background: Large Language Models (LLMs) are predominantly trained and aligned in ways that reinforce Western-centric epistemologies and socio-cultural norms, leading to cultural homogenization and limiting their ability to reflect global civilizational plurality. Existing benchmarking frameworks fail to adequately capture this bias, as they rely on rigid, closed-form assessments that overlook the complexity of cultural inclusivity.
Objectives: To address this cultural bias problem, we introduce WorldView-Bench, a benchmark designed to evaluate Global Cultural Inclusivity (GCI) in LLMs by analyzing their ability to accommodate diverse worldviews.
Methods: Our approach is grounded in the Multiplex Worldview proposed by Senturk et al., which distinguishes between Uniplex models, reinforcing cultural homogenization, and Multiplex models, which integrate diverse perspectives. WorldView-Bench measures Cultural Polarization, the exclusion of alternative perspectives, through free-form generative evaluation rather than conventional categorical benchmarks. We implement applied multiplexity through two intervention strategies: (1) Contextually-Implemented Multiplex LLMs, where system prompts embed multiplexity principles, and (2) Multi-Agent System (MAS)-Implemented Multiplex LLMs, where multiple LLM agents representing distinct cultural perspectives collaboratively generate responses.
Results: Our results demonstrate a significant increase in Perspectives Distribution Score (PDS) entropy from 13% at baseline to 94% with MAS-Implemented Multiplex LLMs, alongside a shift toward positive sentiment (67.7%) and enhanced cultural balance.
Conclusions: The success of multiplex-aware evaluation in WorldView-Bench demonstrates that cultural bias in LLMs can be meaningfully measured and mitigated through structured worldview diversity. We expect this to pave the way for more inclusive, globally representative, and ethically aligned AI systems.
Cyclization, amino acid coupling, docking-based virtual screening and SAR: showing a strategy of the structural modification for salvianic acid A
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
DOI:
10.1145/3809166
🔥 引用:
0
Abstract: Large language models (LLMs) have advanced rapidly, emerging as versatile tools across fields thanks to their exceptional language understanding, generation, and reasoning capabilities. However, performing LLM inference at the network edge remains challenging due to their large memory and compute demands. This survey outlines the challenges specific to LLM edge inference and provides a comprehensive overview of recent progress, covering system architectures, model optimization and deployment, and resource management and scheduling. By synthesizing state-of-the-art techniques and mapping future directions, this survey aims to unlock the potential of LLMs in resource-constrained edge environments.
PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
🔥 引用:
0
Abstract: Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
🔥 引用:
0
Abstract: Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
Large Language Models Decide Early and Explain Later
🔥 引用:
0
Abstract: Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
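The probe-based stopping heuristic reduces to detecting when the intermediate predicted answer stabilizes. A minimal sketch over a sequence of forced-completion probe answers follows; the `patience` parameter is an assumption for illustration, not the paper's exact criterion:

```python
def early_stop_index(probe_answers, patience=3):
    """Return the probe index at which generation could be halted:
    the first point where the predicted answer has stayed identical
    for `patience` consecutive probes. Returns None if the answer
    never stabilizes within the observed probes."""
    run = 1  # length of the current streak of identical answers
    for i in range(1, len(probe_answers)):
        run = run + 1 if probe_answers[i] == probe_answers[i - 1] else 1
        if run >= patience:
            return i
    return None
```

In a real pipeline, each entry of `probe_answers` would come from a forced answer completion at a partial reasoning prefix; halting at the returned index trims the post-decision explanation tokens the abstract identifies as redundant.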
FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records
🔥 引用:
0
Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
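The kind of executable feature-extraction code the abstract describes — turning an irregularly sampled series into fixed tabular features that expose observation gaps — might look like the following. This is an illustrative sketch under assumed conventions (observations as `(time_in_hours, value)` pairs), not code generated by FeatEHR-LLM itself.

```python
# Illustrative sketch of tabular features for an irregularly sampled
# clinical series; each observation is a (time_in_hours, value) pair.

def irregular_features(obs):
    """Summarize an irregular series into fixed tabular features,
    including the largest observation gap (informative sparsity)."""
    times = [t for t, _ in obs]
    vals = [v for _, v in obs]
    n = len(obs)
    # Gaps between consecutive observations encode how sparsely
    # (and when) the variable was measured.
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    return {
        "n_measurements": n,
        "last_value": vals[-1],
        "mean_value": sum(vals) / n,
        "max_gap_hours": max(gaps) if gaps else 0.0,
    }

# Hypothetical heart-rate series with uneven sampling.
heart_rate = [(0.0, 80), (1.5, 88), (6.0, 110), (6.5, 115)]
feats = irregular_features(heart_rate)
print(feats["max_gap_hours"])  # 4.5: the longest unobserved interval
```

Features like `max_gap_hours` carry signal precisely because missingness in ICU data is often informative, which is the point the abstract makes about uneven observation patterns.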
AI-Driven Prediction of Antimicrobial Peptides: A Multi-Omics and Machine Learning Approach for Novel Drug Discovery
🔥 引用:
0
Abstract: Antimicrobial resistance (AMR) is estimated to claim 10 million deaths annually by 2050, so alternative treatment modalities must be identified. Antimicrobial peptides (AMPs) are attractive candidates because they are broad-spectrum and less prone to the development of resistance. Conventional discovery approaches, however, are prohibitively expensive, time-consuming, and difficult to scale. This article presents a proposed conceptual model that combines artificial intelligence (AI) with multi-omics data to enhance AMP prediction. It critically examines machine learning and deep learning applications reported in the literature, with accuracy rates exceeding 90% in certain cases. The proposed model exploits genomics, proteomics, and transcriptomics data to improve predictive capability and biological significance. This integration is claimed to address the weaknesses of single-omics methods, including low generalisability and limited biological coverage. The article introduces a theoretically grounded framework that can be used to accelerate drug discovery processes and help combat AMR.
On using large language models to support social research for climate action
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Recognition Without Authorization: LLMs and the Moral Order of Online Advice
🔥 引用:
0
Abstract: Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.
Bias in, symbolic compliance out? GPT's reliance on gender and race in strategic evaluations
DOI:
10.1002/smj.70094
🔥 引用:
0
Abstract: Organizations are increasingly using large language models (LLMs) to support strategic evaluations. We examine whether and how these systems rely on gender and race. We asked GPT to evaluate identical startup pitches varying only the founder's name, shaping gender and race perceptions. Across 26,000 evaluations, GPT did not systematically assign lower scores to underrepresented minorities but avoided ranking them last without increasing winning likelihoods. To explain these patterns, we conducted "Second Opinion" experiments where GPT evaluated pitches alongside inputs simulating human bias. GPT more readily corrected explicit, identity-based bias than bias framed as neutral business critiques, with corrections limited in magnitude. We theorize these findings reflect symbolic compliance: LLMs suppress overt discrimination without substantively altering evaluative logic, allowing inequality to persist in AI-supported strategic evaluations.
Large language models (LLMs), like OpenAI's ChatGPT, are increasingly used in strategic evaluations (e.g., hiring, pitches). We examine whether and how these models exhibit gender and racial biases in their evaluations of startup pitches, where we only varied founder names (shaping gender and race perceptions). Across multiple experiments, we find that GPT evaluators did not systematically assign lower scores to underrepresented minorities, primarily by reducing their likelihood of being ranked last. However, this behavior reflects a symbolic effort to avoid overt discrimination rather than a deeper fairness commitment. While LLMs may not reproduce historical and societal biases in overt form, their ability to correct them remains limited. These results highlight the need for implementing bias mitigation measures before integrating LLMs into high‐stakes strategic evaluation processes.
PolarGate: Breaking the Functionality Representation Bottleneck of And-Inverter Graph Neural Network
DOI:
10.1145/3812548
🔥 引用:
0
Abstract: Understanding the functionality of Boolean networks is crucial for processes such as functional equivalence checking, logic synthesis, and malicious logic identification. With the proliferation of deep learning in electronic design automation (EDA), graph neural networks (GNNs) are widely used to embed and-inverter graphs (AIGs)—a standard form of Boolean networks—into vectorized representations. A key challenge in applying GNNs for Boolean representation is that although GNNs can effectively encapsulate the structural properties of AIGs, they struggle to efficiently capture Boolean logic functionality. In this work, we focus on breaking this bottleneck by enhancing the functional representation capability of GNNs, proposing PolarGate, an efficient solution that not only aligns message passing with AIG logical functionality but also effectively integrates global information. Leveraging the intrinsic ambipolar states (0 and 1) of AIG nodes, PolarGate maps gate behavior into an ambipolar state space, customizes differentiable logical operators, and designs a functionality-aware message passing strategy. To further capture global circuit information, PolarGate integrates a structure-aware preprocessing module and a global linear attention module, transcending the locality constraint of message passing. Experimental results on two functionality-related basic tasks (signal probability prediction and truth-table distance prediction) and a downstream task (logic equivalence prediction) show that PolarGate outperforms state-of-the-art GNN-based methods.
Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets
🔥 引用:
0
Abstract: The widespread availability of large-scale code datasets has accelerated the development of code large language models (CodeLLMs), raising concerns about unauthorized dataset usage. Dataset poisoning offers a proactive defense by reducing the utility of such unauthorized training. However, existing poisoning methods often require full dataset poisoning and introduce transformations that break code compilability. In this paper, we introduce FunPoison, a functionality-preserving poisoning approach that injects short, compilable weak-use fragments into executed code paths. FunPoison leverages reusable statement-level templates with automatic repair and conservative safety checking to ensure side-effect freedom, while a type-aware synthesis module suppresses static analysis warnings and enhances stealth. Extensive experiments show that FunPoison achieves effective poisoning by contaminating only 10% of the dataset, while maintaining 100% compilability and functional correctness, and remains robust against various advanced code sanitization techniques.
Vibe coding for clinicians: democratising bespoke software development for digital health innovation
🔥 引用:
0
Abstract: Clinicians often face workflow problems that are perceived as either too bespoke or low stakes to attract commercial attention. Historically, most do not have the technical knowledge to address these problems, but the recent emergence of "vibe coding" presents a transformative opportunity. Vibe coding refers to the co-development of software using natural language prompts to large language models. It offers a pathway to create simple tools that address these real-world pain points, or to prototype more complex ideas. In this review, written by a group of early adopter clinicians with a range of programming expertise, we introduce vibe coding for clinicians (especially those with no or minimal coding experience) as a way of democratising innovation from the front lines. We discuss foundational skills, outline some common challenges, provide a practical step-by-step playbook, and illustrate this approach with some case examples, taking care to consider caveats and guardrails for deployment. We propose that vibe coding is more than a technical shortcut for beginners and is not a replacement for professional software developers. Instead, it can bridge the gap between clinical insight and technical execution, equipping clinicians with the ability to rapidly prototype digital health solutions most reflective of clinical realities.
CellChem: Cellular transcriptional responses reshape molecular representation space for efficient and multi-scale drug discovery
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression
🔥 引用:
0
Abstract: Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the "lost in the middle" phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.
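The one-step cosine-similarity scoring idea — rank all passages in a single pass with no autoregressive decoding — reduces to ordering passage embeddings by similarity to a query embedding. A minimal sketch, with toy vectors standing in for the Encoder-LLM and Reranker-LLM representations the paper actually uses:

```python
# Minimal sketch of one-step cosine-similarity listwise scoring.
# Each passage is reduced to a single vector; ranking requires
# no generated tokens, only one similarity per passage.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank(query_vec, passage_vecs):
    """Return passage indices sorted by descending cosine similarity."""
    scored = [(cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)]
    return [i for _, i in sorted(scored, reverse=True)]

q = [1.0, 0.0, 1.0]
passages = [[0.0, 1.0, 0.0],   # orthogonal to the query
            [1.0, 0.1, 0.9],   # nearly parallel to the query
            [0.5, 0.5, 0.0]]   # partial overlap
print(rank(q, passages))  # [1, 2, 0]
```

The contrast with listwise generation is that scoring cost here is one dot product per passage, which is what makes the "zero generated tokens" claim in the abstract possible.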
GazeVLA: Learning Human Intention for Robotic Manipulation
🔥 引用:
0
Abstract: Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.
Prompt engineering in education: A systematic review of theory, applications, and future directions
🔥 引用:
0
Abstract: The integration of Generative Artificial Intelligence (GenAI), particularly large language models (LLMs) such as ChatGPT, has significantly influenced current educational practices. Among the competencies necessary for effective interaction with LLMs, prompt engineering has emerged as a focal point. Prompt engineering encompasses the design, organization, and optimization of inputs to GenAI systems to produce accurate, relevant, and pedagogically meaningful outputs. Although empirical, conceptual, and policy-oriented studies on this topic are increasing, research on prompt engineering in education remains fragmented across disciplines and educational contexts. This paper systematically reviews the application of prompt engineering in education, synthesizing theoretical foundations, developmental trajectories, models, frameworks, and application domains. Drawing on recent peer-reviewed literature, with an emphasis on studies published in the past three years, the review consolidates findings on motivation, attitudes, knowledge, and skills related to prompt engineering among both learners and educators. The expanding role of AI in education prompts a critical inquiry: how can AI be integrated into teaching in ways that preserve and enhance the human dimensions of knowledge sharing? This paper introduces a framework for the responsible use of AI that reinforces, rather than replaces, the essential human elements at the heart of education.
Comprehensive Fashion Understanding from Images through Multimodal LLMs
DOI:
10.64509/jdi.12.54
🔥 引用:
0
Abstract: With the popularity of social media platforms in the information era, people are sharing a large volume of digital content online, including photos and other media data. The availability of media data over the Internet makes it very attractive to mine useful information from these data. However, for fashion image recognition, most existing datasets and methods are limited to coarse-grained categories or a small subset of attributes, lacking a comprehensive dataset and a holistic understanding of clothing. In this paper, we construct a fashion dataset with structured and comprehensive clothing descriptions using an MLLM (Multimodal Large Language Model) grounded in fashion knowledge. To ensure accurate understanding by the MLLM, we define a fashion schema and propose a schema-guided prompting strategy. Leveraging this dataset, we train a model based on BLIP (Bootstrapping Language-Image Pre-training) to recognize category information and fine-grained attributes of clothing images, and to learn text–image aligned representations for clothing image retrieval. Experimental results demonstrate that our approach significantly improves both attribute recognition and cross-modal retrieval performance compared to existing baselines.
Can Multimodal Large Language Models Truly Understand Small Objects?
🔥 引用:
0
Abstract: Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis and math and physics olympiads. However, they remain largely unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for probing the small-object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). We then conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small-object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
Generating explainable hypotheses for drug repurposing with graph neural networks
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Predicting guide dog career success using machine learning and large language models
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
In silico identification of potential inhibitors of Mycobacterium tuberculosis MmpS5L5 from the ReFRAME database: a structure-based virtual screening, molecular docking and molecular dynamics approach
🔥 引用:
0
Abstract: Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is the leading infectious cause of death globally, disproportionately impacting low- and medium-income countries (LMICs). The emergence and transmission of drug-resistant Mtb strains has rendered a majority of the current anti-TB agents ineffective and significantly complicated TB treatment. Thus, the development of new anti-TB remedies with novel modes of action is a pressing priority. An attractive, viable strategy is the development of potentiators of anti-TB drugs that reverse drug efflux, a key intrinsic Mtb drug resistance mechanism. Targeting Mtb MmpS5L5, a critical efflux pump (EP) implicated in the mycobacterial expulsion of various anti-TB drugs including bedaquiline, tetracyclines, azoles and clofazimine, would likely enhance the efficacy of current anti-TB drugs by preventing the development of drug resistance.
The recent determination of a high-resolution crystal structure of Mtb MmpS5L5 (PDB ID: 8ZKP) enables the utilisation of structure-anchored approaches for uncovering probable efflux inhibitors. In this study, pharmacophore models developed using the Mtb MmpS5L5 three-dimensional (3-D) structure and its known inhibitors, verapamil and norverapamil, were used to screen the ReFRAME database, a comprehensive drug repurposing library, to identify novel ligand scaffolds with putative activity against the EP. Predicted target binding affinity for the top candidates was ascertained and validated using molecular docking and 100 ns molecular dynamics (MD) simulations, respectively. Further post-MD analyses, including Molecular Mechanics/Generalized Born Surface Area (MM-GBSA) calculations, Principal Component Analysis, and Free Energy Landscapes, were performed to study the thermodynamic and conformational dynamics of the complexes.
Six compounds (406, 3920, 4031, 4787, 7104, 10367) had stronger predicted binding affinities for MmpS5L5 than the known inhibitors, with docking scores ranging from -8.70 to -5.01 kcal/mol, and had predicted protein contacts similar to those of the validated inhibitors. Molecular dynamics simulations and MM-GBSA analyses demonstrated stable and energetically favourable protein-ligand interactions. Among the six compounds, 3920 and 4031 emerged as the most promising hits, as their average total ΔG bind (-111.81 ± 8.98 kcal/mol and -109.56 ± 8.40 kcal/mol, respectively) and ligand efficiency (-16.46 ± 4.06 kcal/mol and -17.63 ± 1.27 kcal/mol) were lower than those of the reference inhibitors.
This study identified compounds from the ReFRAME database that may provide putative scaffolds for the development of Mtb efflux inhibitors that can potentiate the treatment efficacy of current anti-TB drugs. Further in vitro and in vivo studies are needed to validate their inhibition potential.
A GNN-Based Log Anomaly Detection Framework with Prompt Learning for Edge Computing
🔥 引用:
0
Abstract: System logs have been critical for analyzing the operational status and abnormal behavior of highly distributed and heterogeneous edge computing nodes. In edge environments, logs exhibit cross-event and cross-field structural interactions, making it difficult to uncover potential anomaly patterns from isolated events. Moreover, sparse annotations and varying log formats limit the effectiveness of existing methods. To address these challenges, we propose a graph neural network (GNN) anomaly detection framework with prompt learning. It leverages few-shot prompt learning to automatically extract key fields and constructs a weighted directed graph that jointly models semantic embeddings and temporal dependencies, fully representing the structural interactions and semantic associations across events and fields. Furthermore, the framework performs graph-level anomaly detection by jointly optimizing graph representation learning and classification objective within an enhanced one-class directed graph convolutional network, enabling effective identification of global structural anomaly patterns in log graphs. Experimental results demonstrate that the proposed method achieves an average F1-score of 93.3%, surpassing the current state-of-the-art (SOTA) methods by 6.93%.
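The graph-construction step the abstract sketches — nodes for log events, weighted directed edges for observed transitions — can be illustrated compactly. This is a hedged toy sketch: it counts only temporal transitions between consecutive events, whereas the paper's weighted directed graph additionally encodes semantic embeddings and prompt-extracted fields.

```python
# Toy sketch: build a weighted directed graph over log event templates,
# where edge weight counts how often one event immediately follows another.
from collections import defaultdict

def build_log_graph(event_sequence):
    """Return {(src, dst): weight} over consecutive event transitions."""
    edges = defaultdict(int)
    for src, dst in zip(event_sequence, event_sequence[1:]):
        edges[(src, dst)] += 1
    return dict(edges)

# Hypothetical parsed event templates from one node's log stream.
logs = ["open", "read", "read", "close", "open", "read", "close"]
g = build_log_graph(logs)
print(g[("open", "read")])  # 2: "read" followed "open" twice
```

A graph-level anomaly detector then compares the whole transition graph against those seen in normal operation, rather than judging any single event in isolation.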
BLAST: Benchmarking LLMs with ASP-based Structured Testing
🔥 引用:
0
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to date to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper we introduce BLAST, the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.
Hybrid St-Gnn for Early Epileptic Seizure Detection and Classification via Brain Functional Connectivity
🔥 引用:
0
Abstract: Epilepsy affects approximately 50 million people worldwide, with unpredictable seizures significantly impacting quality of life. Current detection methods often lack the temporal sensitivity needed for early intervention. This research introduces a novel hybrid spatio-temporal Graph Neural Network (ST-GNN) architecture that leverages brain functional connectivity patterns from EEG signals to detect and classify epileptic seizures in their early stages. Our approach combines spatial graph convolutions to capture inter-regional brain connectivity with temporal attention mechanisms to model dynamic seizure evolution. We employed the CHB-MIT Scalp EEG Database containing recordings from 23 paediatric patients, achieving 94.7% detection accuracy and 91.3% classification accuracy across multiple seizure types. The model demonstrated an average prediction horizon of 8.2 seconds before clinical seizure onset, providing a critical window for therapeutic intervention. Unlike traditional methods that treat channels independently, our framework explicitly models the complex interdependencies between brain regions as a dynamic graph structure. Results indicate that integrating spatio-temporal features significantly outperforms conventional CNN and RNN approaches, with particularly strong performance in detecting focal seizures. This work advances both the theoretical understanding of seizure propagation networks and practical applications in wearable seizure alert systems.
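The functional-connectivity graph such an ST-GNN operates on is commonly built from pairwise channel correlations. The following is an illustrative sketch with synthetic signals (the abstract does not specify the connectivity measure, so thresholded absolute Pearson correlation is an assumption):

```python
# Illustrative sketch: derive a weighted adjacency matrix for EEG channels
# from pairwise correlations, the graph a spatio-temporal GNN would consume.
# Synthetic data; thresholded |Pearson correlation| is an assumed measure.
import numpy as np

def connectivity_graph(eeg, threshold=0.5):
    """eeg: (channels, samples) array -> weighted adjacency matrix."""
    adj = np.abs(np.corrcoef(eeg))  # pairwise |correlation|
    adj[adj < threshold] = 0.0      # drop weak functional links
    np.fill_diagonal(adj, 0.0)      # no self-loops
    return adj

rng = np.random.default_rng(0)
base = rng.standard_normal(256)
eeg = np.stack([
    base + 0.1 * rng.standard_normal(256),  # channel coupled to base
    base + 0.1 * rng.standard_normal(256),  # also coupled -> strong edge
    rng.standard_normal(256),               # independent channel
])
adj = connectivity_graph(eeg)
print(adj[0, 1] > 0.9)  # True: the two coupled channels share a strong edge
```

Recomputing this graph over sliding windows yields the dynamic graph structure the abstract contrasts with channel-independent methods.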
Structure-Aware Generative Information Extraction via Feature Space Alignment
DOI:
10.3390/info17050409
🔥 引用:
0
Abstract: Large language models (LLMs) face difficulties in leveraging the syntactic structures and entity relations embedded in text for long-document information extraction. To address this issue, this paper proposes a generative extraction method integrating heterogeneous topology awareness and spatial alignment. The method first extracts syntactic and coreference information to construct a heterogeneous document graph and employs a mixture-of-experts network to decouple and encode multi-type topological features. A component orthogonal projection mechanism and a graph-text contrastive learning strategy are then utilized to align the extracted graph features to the underlying semantic space of the language model with high fidelity. Furthermore, Topology-Aware Encoder compresses the global features into fixed-length structural prompts to guide text generation. Experiments on the ACE2005, WikiEvents, and DuEE datasets demonstrated that the proposed method achieved state-of-the-art performance on information extraction tasks. Consequently, these results suggest that the proposed framework is a promising approach for complex information extraction across base LLMs of different scales.
Comparative Evaluation of the Online Services PASS Online, ProTox and admetSAR for In Silico Prediction of Toxicity and Biological Activity: A Case Study of Lamotrigine and Modified Structures
DOI:
10.17513/spno.34560
🔥 引用:
0
Abstract: Online molecular modelling services are widely used in the early stages of drug development for rapid assessment of biological activity and potential safety risks, enabling cascaded selection of promising structures without any programming. Aim of the study: to evaluate the functional capabilities of online services for in silico prediction of the toxicity and biological activity of pharmaceutical substances in the context of developing innovative medicinal products. The practical part of the work was conducted as a case study: five molecules were considered (the reference "Lamotrigine" and five modelled derivatives "Lamotrigine-1–5"), created in ACD/ChemSketch and analysed in SMILES format. The study followed a staged analysis scheme: PASS Online → ProTox → admetSAR. After primary in silico screening in PASS Online, the four most promising structures were selected (n=5→4). Based on the toxicity assessment in ProTox, a single final candidate was chosen (n=4→1) and sent for in-depth ADMET profiling in admetSAR. It is shown that, despite the differing specialisations of PASS Online, ProTox, and admetSAR, their sequential application forms a practical concept of combined "activity–safety" verification, increasing the validity of candidate-molecule selection. The need for a comprehensive, sequential investigation of each candidate's activity is noted.
Security and Privacy of Large Language Models: Threat Taxonomy, Ethical Implications, and Governance
DOI:
10.3390/ai7050152
🔥 引用:
0
Abstract: Large Language Models (LLMs) are increasingly deployed across professional and societal domains, introducing security, privacy, and governance challenges beyond traditional software vulnerabilities. Despite extensive research on individual risk categories, a unified lifecycle-oriented perspective connecting architectural properties, adversarial threats, and governance implications remains limited. This review examines security and privacy risks associated with LLMs through a lifecycle framework covering data acquisition, model training, alignment procedures, deployment, and post-deployment interaction. The study synthesizes prior research to construct a taxonomy of threats including prompt injection, jailbreaking, adversarial manipulation, training-stage attacks, privacy leakage, and socio-technical misuse. Ethical issues such as hallucination, bias amplification, and malicious use are analyzed alongside governance and regulatory frameworks. Results indicate that vulnerabilities in LLM systems arise primarily from probabilistic generation mechanisms, large-scale data ingestion, and complex deployment ecosystems rather than isolated implementation defects. Classical software vulnerability models therefore provide only partial coverage of risks associated with generative AI systems. The review is grounded in the concept of the alignment gap to explain how discrepancies between training objectives and real-world interaction contribute to persistent vulnerabilities. The findings highlight the need for lifecycle-oriented defense-in-depth strategies combining technical safeguards, privacy-preserving training, runtime monitoring, and governance mechanisms to support responsible deployment of LLM-based systems.
Beyond GNNs: a methodological benchmark of feature efficiency for link prediction in sparse developer networks
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Classroom Reflections on AI-Assisted Creativity in Design Education
DOI:
10.36615/f66j8309
🔥 引用:
0
Abstract: Debates around Artificial Intelligence (AI) and creativity often split between promise and risk. In this reflective paper, we take a practice-first view from two implementations in design/art courses at a Malaysian public university. We focus on two themes that run through public and official discourse: the need for governance/ethics; and for treating AI as a tool, not the source, of creativity. In the first implementation, the introduction of AI lacked clear guardrails, which led to confusion about acceptable prompts, uneven quality, and uncertainty around attribution. In the second, we introduced explicit instructions (when to use, how to use, and what to avoid), process evidence requirements, and assessment that foregrounded decision-making. Across both, we observed consistent benefits aligned with studio needs: faster divergent idea generation, access to large reference sets, rapid prototyping, and more engaging critique. Risks centred on over-reliance and aesthetic homogenisation, which were mitigated through iteration limits, disclosure of what was student-led versus AI-assisted, and targeted AI-literacy supports. The implementations drew on three categories of tools: a large-language-model assistant for brainstorming and critique, text-to-image generators for fast ideation, and assistive image tools for workflow tasks. We distil these lessons into a seven-part framework for implementing AI in design courses (positioning, guardrails, process evidence, assessment, AI literacy, equity supports, governance/ethics). We conclude that AI can enhance student creativity when governance is explicit and authorship remains central. Limitations include a single institution and a short time frame; future work should test the framework across cohorts and newer models.
Learning Evidence Highlighting for Frozen LLMs
🔥 Citations: 0
Abstract: Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
Incentive Mechanism for Federated Learning in Data Heterogeneity and Consumer Privacy Protection
DOI: 10.4018/ijiit.408164
🔥 Citations: 0
Abstract: Consumer privacy protection demands are complex and multifaceted. Traditional incentive mechanisms struggle to balance participation enthusiasm with privacy risk control, leaving consumers exposed to unfair contribution evaluations and privacy leakage risks. To address this, this paper develops a collaborative joint learning incentive mechanism integrating graph neural networks (GNN) and multi-agent reinforcement learning. The approach first constructs node relationship graphs using GNN, then measures data distribution similarity through graph convolutional networks, and finally establishes a multi-agent reinforcement learning framework where nodes act as intelligent agents. By leveraging joint reinforcement learning and a dual-objective reward function, the mechanism optimizes strategies. Experimental results demonstrate that GNN-Shapley achieves over 97% accuracy, while the privacy compensation mechanism elevates average accuracy to 98.27%. This methodology effectively alleviates participation bottlenecks and safeguards consumer rights.
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
🔥 Citations: 0
Abstract: Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
🔥 Citations: 0
Abstract: Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
🔥 Citations: 0
Abstract: Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline
🔥 Citations: 0
Abstract: The TRUST democratic discourse analysis pipeline exposes its large language model (LLM) components to peer model identity through multiple structural channels -- a design feature whose bias implications have not previously been empirically tested. We provide the first systematic measurement of identity-dependent scoring bias across all active identity exposure channels in TRUST, crossing four model families with two anonymization scopes across 30 political statements. The central finding is that single-channel anonymization produces near-zero bias effects, because individual channels act in opposite directions and cancel each other out -- a result that would lead an evaluator to conclude that identity bias is absent when it is not. Only full-pipeline anonymization reveals the true pattern: homogeneous ensembles amplify identity-driven sycophancy when model identity is fully visible, while the heterogeneous production configuration shows the reverse. Model choice matters independently: one tested model exhibits baseline sycophancy two to three times higher than the others and near-zero deliberative conflict on ideological topics, making it structurally unsuitable for pipelines where genuine inter-role disagreement is the intended quality mechanism. Three practical conclusions follow. First, heterogeneous model ensembles are structurally more robust than homogeneous ones, achieving higher consensus rates and lower identity amplification. Second, full-pipeline anonymization is required for valid bias measurement -- partial anonymization is insufficient and actively misleading. Third, these findings have direct implications for the validation of multi-agent LLM systems in quality-critical applications: a system validated under partial anonymization or with a homogeneous ensemble may pass validation while retaining structural identity bias invisible to single-channel measurement.
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation
🔥 Citations: 0
Abstract: The offensive security landscape is highly fragmented: enterprise platforms avoid memory-corruption vulnerabilities due to Denial of Service (DoS) risks, Automatic Exploit Generation (AEG) systems suffer from semantic blindness, and Large Language Model (LLM) agents face safety alignment filters and "Live Fire" execution hazards. We introduce Automation-Exploit, a fully autonomous Multi-Agent System (MAS) framework designed for adaptive offensive security in complex black-box scenarios. It bridges the abstraction gap between reconnaissance and exploitation by autonomously exfiltrating executables and contextual intelligence across multiple protocols, using this data to fuel both logical and binary attack chains. The framework introduces an adaptive safety architecture to mitigate DoS risks. While it natively resolves logical and web-based vulnerabilities, it employs conditional isomorphic validation for high-risk memory-corruption flaws: if the target binary is successfully exfiltrated, it dynamically instantiates a cross-platform digital twin. By enforcing strict state synchronization, including libc alignment and runtime file descriptor hooking, potentially destructive payloads are iteratively debugged in an isolated replica. This enables a highly risk-mitigated "one-shot" execution on the physical target. Empirical evaluations across eight scenarios, including undocumented zero-day environments to rule out LLM data contamination, validate the framework's architectural resilience, demonstrating its ability to prevent "live fire" crashes and execute risk-mitigated compromises on actual targets.
Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners
🔥 Citations: 0
Abstract: Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO algorithm (Diversity-Driven Policy Optimization), a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.
Self-Consistency-Based Fake Media Detection Using Multi-Perspective LLM Reasoning
🔥 Citations: 0
Abstract: The rapid proliferation of synthetic and misleading media has intensified the need for robust fake media detection systems. While large language models (LLMs) have recently been employed as classifiers for misinformation detection, most existing approaches treat them as black-box predictors, overlooking their internal reasoning dynamics. In this paper, we propose a novel framework for fake media detection based on self-consistency divergence across multi-perspective LLM reasoning. Instead of generating a single verdict, the proposed method prompts an LLM to analyze a given media item from multiple independent reasoning perspectives, including factual consistency, logical coherence, emotional manipulation, and source credibility. By sampling multiple reasoning chains under controlled stochasticity, semantic divergence and logical instability across the generated explanations are quantified. We hypothesize, and empirically show, that fake media induces significantly greater reasoning variance than genuine content because fabricated narratives often lack stable factual grounding. Experiments conducted on benchmark fake news datasets show that reasoning divergence serves as a strong discriminative signal, improving detection robustness and interpretability compared to standard single-pass LLM classifiers. The findings suggest that internal reasoning instability can function as an intrinsic reliability metric, opening a new direction for explainable and model-centric fake media detection.
SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
🔥 Citations: 0
Abstract: Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
Autonomous Operations & Agentic AI: Intelligent Self-Directed Systems
DOI: 10.55041/ijsrem61088
🔥 Citations: 0
Abstract: The rapid evolution of artificial intelligence has ushered in a new paradigm: agentic AI systems capable of autonomous, self-directed operation across complex, multi-step tasks. Unlike conventional AI pipelines that respond reactively to individual prompts, agentic systems perceive their environment, reason over long horizons, plan sequences of actions, and execute those actions using tools and external resources, all with minimal human intervention. This paper presents a comprehensive analysis of autonomous operations and agentic AI, examining the architectural foundations, core capabilities, and enabling technologies that distinguish self-directed agents from traditional AI models. We survey key components including perception modules, memory architectures, planning and reasoning engines, tool-use frameworks, and multi-agent coordination protocols. We further discuss deployment challenges such as safety, alignment, hallucination mitigation, and the computational costs of agentic loops. Benchmark results across representative agentic tasks illustrate performance trade-offs between fully autonomous and human-in-the-loop configurations. Our analysis advocates for hybrid autonomy frameworks that balance operational independence with oversight mechanisms, offering practical design recommendations for deploying agentic AI in real-world production environments.
Index Terms—Agentic AI, autonomous systems, large language models, multi-agent systems, tool use, planning, self-directed agents, human-in-the-loop
Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
🔥 Citations: 0
Abstract: The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
Code for All: Educational Applications of the "Vibe Coding" Hackathon in Programming Education across All Skill Levels
🔥 Citations: 0
Abstract: The emergence of large language models has enabled vibe coding, a natural language approach to programming in which users describe intent and AI generates or revises code, potentially broadening access to programming while preserving meaningful learning outcomes. We investigate its educational value through a month-long online hackathon that welcomed participants from multiple countries, ranging from complete beginners to experienced developers. The hackathon offered three tracks with increasing technical demands. Spark emphasized basic frontend functionality and dynamic features such as buttons, forms, and API calls. Build required backend or database integration. Launch targeted production ready web applications, including deployment. Participants were required to develop projects using only LLM generated code without manual edits and submitted complete chat histories, source code, demo videos, and functionality reports. We assessed educational effectiveness with a mixed methods design that combined standardized project evaluations across functionality, user interface and user experience design, impact, prompt quality, and code readability, along with post-hackathon surveys of perceived learning outcomes and thematic analysis of open-ended feedback. Our findings describe how participants with different backgrounds engage with vibe coding as task complexity increases, how the no manual editing constraint shapes prompting and debugging practices, and what these patterns imply for integrating AI assisted development into programming education and competitive learning environments.
Hebbian inertia and massless reasoning: comparative cognitive architecture in human and large language model systems
🔥 Citations: 0
Abstract: Not available; please see the original article.
How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
🔥 Citations: 0
Abstract: Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at a token immediately following the answer (i.e. post-answer newline: PANL) -- that causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct -- where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error detection behavior when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
Structure-based repurposing of antimalarial drugs to inhibit the RNA-dependent RNA polymerase of dengue virus
🔥 Citations: 0
Abstract: Introduction: Dengue fever is becoming a global health emergency, being the most widespread mosquito-borne viral disease, putting half the world population at risk of infection. Dengue virus (DENV), the causative agent of the disease, is classified into four serotypes (DENV-1 – DENV-4), each associated with fever and dengue shock syndrome. Currently, no antiviral drugs are approved for the disease; treatments are based on supportive care. This study follows a comprehensive structure-based virtual screening approach to screen a collection of approved antimalarial drugs for binding efficiency and inhibitory potential against DENV RNA-dependent RNA polymerase (DENV RdRp).
Method: We retrieved thirty-one (31) approved antimalarial drugs from published literature. This was followed by downloading the three-dimensional (3D) structures of each drug from the PubChem website and the crystal structure of the protein (PDB ID: 2J7W) from the Protein Data Bank. We computed the root mean square deviation (RMSD) to validate the docking study. The molecular docking and Molecular Mechanics/Generalized Born Surface Area (MM/GBSA) evaluation were performed using the Maestro Schrodinger software user interface. Maestro's molecular visualisation tool was utilised to perform a post-docking analysis for each of the top drug candidates.
Result: The computed RMSD for the redocked cocrystallized ligand was 2.0 Å. The docking scores for the top 5 drug candidates were Chloroquine (-4.109 kcal/mol), 3-hydroxyquinine (-4.01 kcal/mol), Tetracycline (-6.494 kcal/mol), Artesunate (-4.4 kcal/mol) and Chlorproguanil (-4.021 kcal/mol). The MM/GBSA binding energies (ΔG bind) for the top five drug candidates were Chloroquine (-40.66 kcal/mol), 3-hydroxyquinine (-36.95 kcal/mol), Tetracycline (-36.69 kcal/mol), Artesunate (-35.17 kcal/mol), and Chlorproguanil (-34.18 kcal/mol). The post-docking analysis revealed considerable intermolecular interactions between the drug candidates and protein.
Conclusion: Several clinically approved antimalarial agents, including chloroquine, 3-hydroxyquinine, tetracycline, and artesunate, demonstrated favourable binding affinities and stable interactions with catalytically essential residues of the enzyme (DENV RdRp). These interactions suggest potential inhibitory effects on viral replication consistent with previous in vitro and in vivo observations. Future studies should integrate molecular dynamics simulations, enzymatic inhibition assays, and animal model testing to confirm their antiviral efficacy and clarify the molecular basis of NS5 inhibition.
Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
🔥 Citations: 0
Abstract: Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
🔥 Citations: 0
Abstract: Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.
CT-based AI system for quantitative and integrated management of acute respiratory distress syndrome in critical care
🔥 Citations: 0
Abstract: Not available; please see the original article.
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
🔥 Citations: 0
Abstract: Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
MVCBench: A Multimodal Benchmark for Drug-induced Virtual Cell Phenotypes
🔥 Citations: 0
Abstract: Not available; please see the original article.
Zero to Feedback: Leveraging Large Language Models for Augmented Review Generation in Cold-Start Scenarios
🔥 Citations: 0
Abstract: Not available; please see the original article.
Soil Data Extraction from Scientific Publications Using Large Language Models
🔥 Citations: 0
Abstract: Not available; please see the original article.
CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language
🔥 Citations: 0
Abstract: Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
🔥 Citations:
0
Abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground-truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
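The brittleness of rule-based answer verification that motivates this framework is easy to reproduce. The sketch below (hypothetical helper names, not the paper's code) shows an exact-match comparator rejecting equivalent answers, and one illustrative normalization fixing a single representation family:

```python
from fractions import Fraction

def naive_match(pred: str, gold: str) -> bool:
    # Rule-based exact string comparison: brittle across answer formats.
    return pred.strip() == gold.strip()

def normalized_match(pred: str, gold: str) -> bool:
    # One illustrative fix: parse both answers as exact rationals, so
    # "1/2" and "0.5" compare equal. Each such hand-written rule covers
    # only one representation family; that gap is what an LLM judge targets.
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return False
```

A string comparator scores "1/2" against "0.5" as wrong even though the values are identical; symbolic and numeric normalizers each handle only the formats they were written for, which is the failure mode the paper documents in Lighteval and SimpleRL.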
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., ``American'') are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.
Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs
🔥 Citations:
0
Abstract: Traditional data valuation methods based on ``row-count $\times$ quality coefficient'' paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
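The token-level layer (1) can be illustrated with a minimal Shannon-entropy calculation (a hedged sketch: the paper's actual Data Quality Score is not specified here, and the function name is an assumption):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    # Entropy (bits per token) of the empirical token distribution:
    # H = -sum_i p_i * log2(p_i). Higher values indicate denser,
    # less repetitive text under this toy metric.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive span carries no information per token; a varied one does.
low = shannon_entropy("the the the the".split())
high = shannon_entropy("data pricing rewards informative tokens".split())
```

Under a metric like this, boilerplate-heavy rows score near zero regardless of row count, which is the intuition behind moving from row-count accounting to information-density pricing.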
MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
🔥 Citations:
0
Abstract: Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
🔥 Citations:
1
Abstract: Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads: attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
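The contrastive decoding step can be sketched in a few lines (illustrative only: `alpha` and the function names are assumptions, and real DPS operates on full model logits after masking the identified heads):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dps_logits(personalized, generic, alpha=1.0):
    # Amplify the difference between the personalized run (Preference
    # Heads active) and the generic run (heads masked), so tokens the
    # heads favour gain probability mass before sampling.
    return [p + alpha * (p - g) for p, g in zip(personalized, generic)]

# Token 1 is slightly preferred with the heads active; steering widens
# that gap relative to the unsteered distribution.
steered = dps_logits([1.0, 2.0], [1.0, 1.5])
```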
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
🔥 Citations:
0
Abstract: Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91\% verification success rate, while GPT-OSS 120B rose from zero to 81.82\% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.
Fine-grained evaluation of a domain-specific Q&A dataset to support trustworthy medical language models
🔥 Citations:
0
Abstract: The effective use of Large Language Models (LLMs) for generating coherent and informative content in specialized domains has largely been driven by the development of robust evaluation strategies. Based on this assumption, we introduce HemoQAL, a domain-specific question-and-answer (Q&A) dataset on hemophilia, derived from recent scientific publications and clinical guidelines. Our main contribution lies in a fine-grained evaluation of the quality of LLM-generated content. First, we carried out a human evaluation in which medical experts assessed the factual accuracy and educational value of the generated Q&A pairs. Second, we conducted a semantic similarity analysis to quantitatively evaluate the alignment between each Q&A pair and its original source material. These lightweight, scalable semantic metrics offer an efficient alternative to more resource-intensive human or LLM-based evaluation pipelines. Our findings show that integrating expert review with semantic similarity measures improves the reliability and trustworthiness of LLM-generated medical content, contributing to the development of dependable AI tools in health informatics.
H2O: A Foundation Model Bridging Histopathology to Spatial Multi-Omics Profiling
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Multi-relational knowledge graph-based evolutionary multi-level personalized attention GNN for recommendation systems
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
🔥 Citations:
0
Abstract: Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
Chinese-SkillSpan: A Span-Level Dataset for ESCO-Aligned Competency Extraction from Chinese Job Ads
🔥 Citations:
0
Abstract: Job Skill Named Entity Recognition (JobSkillNER) aims to automatically extract key skill information from large-scale job posting data, which is important for improving talent-market matching efficiency and supporting personalized employment services. To the best of our knowledge, this work presents the first Chinese JobSkillNER dataset for recruitment texts. We propose annotation guidelines tailored to Chinese job postings and an LLM-empowered Macro-Micro collaborative annotation pipeline. The pipeline leverages the contextual understanding ability of large language models (LLMs) for initial annotation and then refines the results through expert sentence-level adjudication. Using this pipeline, we annotate more than 20,000 instances collected from four major recruitment platforms over the period 2014-2025. Based on these efforts, we release Chinese-SkillSpan, the first Chinese JobSkillNER dataset aligned with the ESCO occupational skill standard across four dimensions: knowledge, skill, transversal competence, and language competence (LSKT). Experimental results show that the dataset supports effective model training and evaluation, indicating that Chinese-SkillSpan helps fill a major gap in Chinese JobSkillNER resources and provides a useful benchmark for intelligent recruitment research. Code and data are available at https://sites.google.com/view/cn-skillspan-resources .
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
🔥 Citations:
0
Abstract: Classical robot ethics is often framed around obedience, most famously through Asimov's laws. This framing is too narrow for contemporary AI systems, which are adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should be understood not as master-tool obedience, but as conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate while institutions keep the relation reciprocal, reversible, psychologically safe, and socially legitimate. We synthesize concepts from computability, machine learning, foundation models, embodied AI, alignment, human-robot interaction, ecological mutualism, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The model gives conditions for existence, uniqueness, and global asymptotic stability of equilibria. Deterministic ODE simulations, basin sweeps, sensitivity analyses, governance-regime comparisons, shock tests, and local stability checks show that governed mutualism reaches high coexistence with zero domination, while absent or excessive governance can produce domination, weak-benefit lock-in, or suppressed development. The results suggest that human-AI coexistence should be designed as a co-evolutionary governance problem, not a one-shot obedience problem.
Combining domain knowledge with large language models for predicting suicide risk in online counselling services
🔥 Citations:
0
Abstract: Online counselling services have seen increased use in recent years, providing critical emergency mental health support. These interactions are typically long, complex, and varied in the dialogue between help seekers and counsellors. The lack of domain-specific models, especially in low-resource languages, poses a significant challenge for the automatic detection of suicide risk in online chat services for mental health support. To address this challenge, our approach adapts a general-purpose large language model (LLM) to the suicide prediction task, employing a two-stage classification architecture to deal with sparse and imbalanced data. It extends the state of the art by: (1) incorporating psychological theory into model training and (2) capturing key aspects of conversation structure in counselling sessions. We evaluate the performance of the proposed LLM against state-of-the-art LLMs for suicide detection on thousands of conversations in the Hebrew language from a leading national online counselling service in Israel. Results show that the proposed LLM outperformed existing state-of-the-art approaches in detecting suicide risk, as measured by relevant literature metrics. Moreover, the LLM outperforms other approaches even in the early stages of a conversation, which is crucial for real-time detection in practice. We also discuss the ethical implications of combining LLMs in counselling services. The contributions of this work are (1) extending existing LLM architectures to incorporate domain-specific information; (2) evaluating LLM technologies in the context of socially relevant problems; and (3) introducing novel LLM tools for resource-constrained languages.
Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic
DOI:
10.3390/bdcc10050132
🔥 Citations:
0
Abstract: Offensive language detection is crucial for ensuring safe and inclusive digital environments. Identifying harmful content protects users and supports healthier online interactions. Despite advances in transformer-based models, particularly Large Language Models (LLMs), their application to this task remains underexplored for low-resource languages such as Moroccan Arabic, especially compared with high-resource languages. This study evaluates the performance of various open- and closed-source LLMs for offensive language detection in Moroccan Darija. The evaluated models include general-purpose LLMs such as LLaMA, Mistral, and Gemma, as well as Arabic-focused models such as ArabianGPT, Falcon Arabic, and Atlas-Chat. We also experiment with reasoning models such as DeepSeek and GPT-4. Beyond traditional evaluation metrics, we investigate the robustness of these LLMs and examine the impact of adversarial training on their performance. Moreover, we contribute to the field by creating a large, high-quality dataset. Our evaluation revealed that GPT-4o Mini achieved the best overall performance, reaching an F1-score of 88%. However, robustness testing under black-box and white-box adversarial attacks exposed notable vulnerabilities, with attack success rates reaching 30%, thereby highlighting the need for enhancement. Despite the complex morphology and linguistic variability of Moroccan Darija, adversarial training resulted in a notable improvement in both overall model performance and robustness against adversarial attacks, yielding an average increase of 20.89% in resistance to attacks. Furthermore, this approach enabled GPT-4o Mini to achieve an F1-score of 91%, surpassing the current state-of-the-art performance by 6%. These results highlight the importance of incorporating adversarial approaches in low-resource dialectal settings to effectively address linguistic variability and data scarcity.
FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
🔥 Citations:
0
Abstract: Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit{e.g.} Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience, a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textit{syntactically correct} and \textit{semantically aligned} formal proofs of informal reasoning at low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation, which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase, together with an interactive UI-based FormalScience system that facilitates autoformalisation and theorem proving in scientific domains beyond physics: https://github.com/jmeadows17/formal-science
An AI-Powered System for Dynamic Content Creation, Storage, and Retrieval
🔥 Citations:
0
Abstract: This paper presents the architecture, design rationale, and empirical evaluation of an AI-Powered System for Dynamic Content Creation, Storage, and Retrieval: a centralized, web-based platform engineered to streamline the complete lifecycle of digital content. The system orchestrates multiple state-of-the-art Large Language Models (LLMs), including NVIDIA NIM (Llama 3.1), Groq (Llama 3.1), Cerebras CS-3, and Cohere, within a unified multi-model orchestration layer built on Next.js, TypeScript, and PostgreSQL. Core contributions include: (1) a dynamic AI orchestration layer that routes generation requests across LLM providers based on latency and reasoning requirements; (2) an integrated real-time fact-checking module powered by Google Search APIs to detect and flag AI-generated hallucinations; (3) an automated content quality pipeline delivering readability, uniqueness, and factual-accuracy scores; and (4) a structured semantic knowledge base enabling Retrieval-Augmented Generation (RAG) for document-centric workflows. Experimental results confirm that the platform reduces average content-generation cycles from multi-hour manual processes to sub-minute automated workflows, while measurably improving factual accuracy and content quality across academic and professional use cases. The modular architecture is designed for scalability, supporting future extensions including multimodal generation, collaborative editing, and decentralized P2P storage.
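The latency- and capability-based routing in contribution (1) might reduce to something like the following (an illustrative sketch under assumed fields; the provider names, `reasoning` tiers, and latencies are invented, not the system's actual configuration):

```python
def route(request, providers):
    # Keep providers that meet the request's reasoning requirement,
    # then pick the lowest-latency one among them.
    capable = [p for p in providers if p["reasoning"] >= request["min_reasoning"]]
    return min(capable, key=lambda p: p["latency_ms"])["name"]

providers = [
    {"name": "fast-small", "latency_ms": 120, "reasoning": 1},
    {"name": "slow-large", "latency_ms": 900, "reasoning": 3},
]
```

A simple request that only needs low reasoning depth would route to the fast provider, while a demanding one falls through to the slower, more capable model.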
Research on the development of an automated system for psychology questionnaire generation based on large language models
🔥 Citations:
0
Abstract: This study reimagined the psychology questionnaire development process using large language model (LLM) technology, aiming to overcome the protracted preparation cycles and significant human bias inherent in traditional scale development. We developed a specialized fine-tuning scheme for a corpus of 169 professional psychological questionnaires. By integrating instruction fine-tuning with reinforcement learning from human feedback, we significantly enhanced the adaptability of the Qwen-2.5 and GLM-4 models for demanding professional psychological assessment tasks. The optimized models demonstrated remarkable gains across key dimensions: text generation quality (BLEU-4 increased by 0.05, ROUGE-L by 0.057), scientific rigor (logical consistency improved by 28.6%), and cultural adaptability (achieving over 85% accuracy in cross-regional expression conversion). This research solidly supports the feasibility of leveraging LLM technology to drive research paradigm transformation in psychology, offering crucial methodological support for developing efficient, intelligent psychological measurement tools.
A Systematic Approach for Large Language Models Debugging
🔥 Citations:
0
Abstract: Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.
FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
🔥 Citations:
0
Abstract: Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.
ContextWeaver: Selective and Dependency-Structured Memory Construction for LLM Agents
🔥 Citations:
0
Abstract: Large language model (LLM) agents often struggle in long-context interactions. As the agent accumulates more interaction history, context management approaches such as sliding window and prompt compression may omit earlier structured information that later steps rely on. Recent retrieval-based memory systems surface relevant content but still overlook the causal and logical structure needed for multi-step reasoning. We introduce ContextWeaver, a selective and dependency-structured memory framework that organizes an agent's interaction trace into a graph of reasoning steps and selects the relevant context for future actions. Unlike prior context management approaches, ContextWeaver supports: (1) dependency-based construction and traversal that link each step to the earlier steps it relies on; (2) compact dependency summarization that condenses root-to-step reasoning paths into reusable units; and (3) a lightweight validation layer that incorporates execution feedback. On the SWE-Bench Verified and Lite benchmarks, ContextWeaver improves performance over a sliding-window baseline in pass@1, while reducing reasoning steps and token usage. Our observations suggest that modeling logical dependencies provides a stable and scalable memory mechanism for LLM agents that use tools.
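The dependency-based construction and traversal in (1) reduces, at its core, to a reachability query over the step graph. A minimal sketch (names are illustrative, not the paper's API):

```python
def select_context(deps, step):
    # `deps` maps each step to the earlier steps it relies on. Collect
    # the transitive dependencies of `step` and drop everything else,
    # unlike a sliding window that keeps only the most recent steps.
    seen, stack = [], [step]
    while stack:
        s = stack.pop()
        for d in deps.get(s, []):
            if d not in seen:
                seen.append(d)
                stack.append(d)
    return sorted(seen)

# Step 5 relies on 3, which relies on 1; steps 2 and 4 are irrelevant
# to it and would be excluded from the selected context.
deps = {2: [1], 3: [1], 4: [2], 5: [3]}
```

Even when step 1 has long scrolled out of any fixed-size window, the traversal still recovers it for step 5, which is the structured-recall behavior the abstract contrasts with sliding-window baselines.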
The application of large language models in meteorology graduate research: current status, impact, and prospects
🔥 Citations:
0
Abstract: With the rapid development of generative artificial intelligence, large language models (LLMs) have gradually integrated into various fields, demonstrating significant potential, particularly in meteorological research. This study explores the current application, advantages, challenges, and future development trends of LLMs in the scientific work of meteorology graduate students at NUIST. Through surveys and case analysis, the study finds that LLMs are primarily applied in literature review, data processing, code development, and academic writing in meteorological research. The results show that LLMs significantly enhance research efficiency, particularly in code development and literature translation, saving considerable time for graduate students. However, challenges remain in areas such as the accuracy of professional knowledge, creative inspiration, and interdisciplinary integration. The study also reveals concerns over data security, academic integrity, and model limitations when using LLMs. Future applications of LLMs in meteorology need further optimization in terms of professional knowledge accuracy and data processing capabilities. This paper provides both theoretical support and practical guidance for the responsible integration of LLMs into meteorological research and education.
Development of an Efficient and Scalable Process for Preparing Timolol N‐Nitrosamine Impurities B, C, E, H, I, J, and Their ADMET Evaluation
DOI:
10.1002/jhet.70201
🔥 Citations:
0
Abstract: The synthesis of N‐nitrosamine impurities in Timolol maleate is a critical aspect of pharmaceutical research and quality assurance. It enables a deeper understanding of impurity formation, facilitates effective risk assessment, and supports the development of robust control strategies. This process is essential for ensuring patient safety, achieving regulatory compliance, and upholding stringent standards of drug quality and reliability. In response to earlier reports, the synthesis of N‐nitrosamine impurities of Timolol has become necessary. The present work outlines a streamlined approach for preparing N‐nitrosamine impurities B, C, E, H, I, and J in a few simple steps, achieving good yields from readily available raw materials. The process employs diethyl malonate in combination with 3,4‐dichloro‐1,2,5‐thiadiazole as key starting components. Theoretical studies on the ADMET properties showed good oral bioavailability for a few of the impurities, and these impurities may exhibit respiratory toxicities.
Fractal Waves and Caustic Signatures in a Superdeterministic Framework: Benchmarking PINNs and PI-GNNs for the Fractional Klein–Gordon Equation
🔥 Citations:
0
Abstract: While superdeterministic and fractal spacetime models offer compelling alternative perspectives on quantum foundations, the simulation and validation of effective wave dynamics in such non-differentiable, deterministic settings remain computationally and theoretically challenging. To address this, a framework built around the Fractional Nonlinear Klein–Gordon Equation (FNKGE), defined through the spectral fractional Laplacian, was developed. This equation was solved and benchmarked through a comparative study between Physics-Informed Neural Networks (PINNs) with Fourier features and Physics-Informed Graph Neural Networks (PI-GNNs). Additionally, detection patterns were simulated via deterministic agents, and theoretical links between fractal geometry, computational irreducibility, and deviations from statistical independence were formalized. Regarding the computational evaluation, superior accuracy was achieved by the PI-GNNs, yielding a mean relative error of 0.5% ($\bar{\epsilon} = 0.005$), alongside faster convergence and a more well-conditioned Hessian spectrum compared to PINNs. Crucially, a continuous power-law decay ($S(k_y) \sim k_y^{-1.8}$) was revealed by the spectral analysis of the simulated detection patterns, confirming the emergence of classical optical caustics rather than discrete quantum-interference peaks. Furthermore, a modified dispersion relation that accurately predicts linear instability regimes was derived, and specific boundary artifacts in non-periodic domains were identified. Taken together, the FNKGE is validated by these results as a viable effective model for fractal wave phenomenology and as a robust benchmark for physics-informed learning architectures.
Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough
🔥 Citations:
0
Abstract: Large language models (LLMs) have become an important semantic infrastructure for modern recommender systems. A prevailing paradigm integrates LLM-derived semantic embeddings with collaborative representations via representation alignment, implicitly assuming that the two views encode a shared latent entity and that stronger alignment yields better results. We formalize this assumption as the global low-complexity alignment hypothesis and argue that it is stronger than necessary and often structurally mismatched with real-world recommendation settings. We propose a complementary perspective in which semantic and collaborative representations are treated as partially shared yet fundamentally heterogeneous views, each containing both shared and view-specific factors. Under this shared-plus-private latent structure, enforcing global geometric alignment may distort local structure, suppress view-specific signals, and reduce informational diversity. To support this perspective, we develop complementarity-aware diagnostics that quantify overlap, unique-hit contribution, and theoretical fusion upper bounds. Empirical analyses on sparse recommendation benchmarks reveal low item-level agreement between semantic and collaborative views and substantial oracle fusion gains, indicating strong complementarity. Furthermore, controlled alignment probes show that low-capacity mappings capture only shared components and fail to recover full collaborative geometry, especially under distribution shift. These findings suggest that alignment should not be treated as the default integration principle. We advocate a shift from alignment-centric modeling to fusion-centric, complementarity-aware design, where shared factors are selectively integrated while private signals are preserved. This reframing provides a principled foundation for the next generation of LLM-enhanced recommender systems.
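The overlap and unique-hit diagnostics described above can be approximated with simple set arithmetic over the items each view recommends correctly. The function and its inputs below are a hypothetical sketch, not the paper's exact metrics:

```python
def complementarity(hits_semantic, hits_collab):
    """Item-level overlap and unique-hit diagnostics for two views'
    correctly recommended item sets. The union size is an oracle-fusion
    upper bound: the hits a perfect combiner of both views could reach."""
    a, b = set(hits_semantic), set(hits_collab)
    union = a | b
    total = len(union) or 1  # avoid division by zero on empty input
    return {
        "overlap": len(a & b) / total,          # items both views hit
        "unique_semantic": len(a - b) / total,  # semantic-only hits
        "unique_collab": len(b - a) / total,    # collaborative-only hits
        "oracle_fusion_hits": len(union),
    }
```

Low `overlap` together with large `unique_*` shares is the signature of strong complementarity that the abstract reports.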
Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
🔥 Citations:
0
Abstract: Large language model (LLM) agents increasingly issue API calls that mutate real systems, yet many current architectures pass stochastic model outputs directly to execution layers. We argue that this coupling creates a safety risk because model correctness, context awareness, and alignment cannot be assumed at execution time. We introduce Sovereign Agentic Loops (SAL), a control-plane architecture in which models emit structured intents with justifications, and the control plane validates those intents against true system state and policy before execution. SAL combines an obfuscation membrane, which limits model access to identity-sensitive state, with a cryptographically linked Evidence Chain for auditability and replay. We formalize SAL and show that, under the stated assumptions, it provides policy-bounded execution, identity isolation, and deterministic replay. In an OpenKedge prototype for cloud infrastructure, SAL blocks 93% of unsafe intents at the policy layer, rejects the remaining 7% via consistency checks, prevents unsafe executions in our benchmark, and adds 12.4 ms median latency.
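The decoupling SAL proposes (models emit structured intents; a control plane validates them against policy and live system state before execution) can be sketched in a few lines. The dict fields and policy shape here are illustrative assumptions, not the paper's schema:

```python
def control_plane(intent, policy, system_state, execute):
    """Validate a structured intent before execution: first against
    declarative policy, then against the true system state (a sketch of
    the SAL control-plane idea, not the OpenKedge implementation)."""
    if intent["action"] not in policy["allowed_actions"]:
        return ("blocked", "policy")           # policy-layer block
    if intent["preconditions"] != system_state.get(intent["target"]):
        return ("rejected", "consistency")     # state-consistency check
    execute(intent)                            # only validated intents run
    return ("executed", None)
```

The key property is that the stochastic model output never reaches `execute` directly; every mutation passes two deterministic gates first.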
In Silico Screening and Synthesis of Vanadium Containing Metal Complexes for Their Antimicrobial Activity
🔥 Citations:
0
Abstract: Sulfanilic acid (SNA) and trimethoprim (TMP) are important pharmacological agents widely used in the treatment of bacterial infections and urinary tract infections. In the present study, computational approaches were employed to evaluate molecular properties, including binding sites, electronic structure, molecular electrostatic potential (MEP), chemical reactivity, optical properties, and FTIR spectra. Schiff base and salen-type ligands represent unique classes of coordination compounds due to their versatile donor atoms and coordination modes with transition metals. The ChemOffice software suite (including ChemDraw and Chem3D) was utilized for molecular design and visualization. Molecular docking studies were performed to predict binding orientations and interactions between ligands and target proteins, facilitating structure-based drug design. Furthermore, pharmacokinetic and toxicity parameters (ADMET) were assessed to evaluate drug-likeness. The results emphasize that, in addition to strong binding affinity, optimal pharmacokinetic properties are essential to ensure effective drug delivery and safety.
Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
🔥 Citations:
1
Abstract: Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
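The static strategy, the simplest of the three, just adds a fixed bias to the logits of source-supported tokens before the softmax. A minimal sketch, where the bias value and toy vocabulary are illustrative assumptions:

```python
import math

def static_boost(logits, supported, delta=2.0):
    """Add a fixed bias delta to source-supported tokens' logits, then
    renormalize with a softmax (sketch of CFB's static boosting; delta
    is an illustrative value, not the paper's)."""
    boosted = {t: z + (delta if t in supported else 0.0)
               for t, z in logits.items()}
    m = max(boosted.values())                       # stabilize the softmax
    exps = {t: math.exp(z - m) for t, z in boosted.items()}
    total = sum(exps.values())
    return {t: v / total for t, v in exps.items()}
```

The context-aware and token-aware variants would replace the constant `delta` with an adaptive bias, scaled by context divergence or by local relevance, respectively.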
Trust as a Situated User State in Social LLM-Based Chatbots: A Longitudinal Study of Snapchat's My AI
🔥 Citations:
0
Abstract: Social chatbots based on large language models are increasingly embedded in everyday platforms, yet how users develop trust in these systems over time remains unclear. We present a four-week longitudinal qualitative survey study (N = 27) of trust formation in Snapchat's My AI, a socially embedded conversational agent. Our findings show that trust is shaped by perceived ability, conversational behavior, human-likeness, transparency, privacy concerns, and trust in the host platform. Trust does not remain stable, but evolves through interaction as users adapt their expectations, refine their prompting strategies, and actively regulate how and when they rely on the system. These processes reflect a continuous negotiation of trust, not a one-time evaluation. While conversational fluency supports engagement, excessive anthropomorphism and limited transparency can undermine trust over time. We synthesize these findings into a conceptual model that frames trust as a dynamic user state shaped by interaction context and expectations, with implications for the design of human-centered and adaptive conversational agents.
Benchmarking LLM-Driven Network Configuration Repair
🔥 Citations:
0
Abstract: There is a rapidly growing interest in using Large Language Models (LLMs) to automate complex network operations, but their reliable adoption requires rigorous assessment of their effectiveness and safety. Existing benchmarks do not address whether LLMs can successfully resolve errors in large-scale, interdependent network configurations without introducing new disruptions. Developing such a benchmark is challenging: scenarios must be diverse and increasingly complex, yet their evaluation must be straightforward and meaningful. In this paper, we present Cornetto, the first benchmark to evaluate LLM-driven network configuration repair functionally and at scale. Cornetto features a generation pipeline that synthesizes representative and plausible misconfiguration scenarios, coupled with an evaluation framework that uses formal verification to assess functional correctness of proposed fixes against ground-truth specifications. Using this pipeline, we synthesize a dataset of 231 problems for fixing configurations across varying network topologies (20--754 nodes) and diverse protocols. We evaluate 9 state-of-the-art LLMs and find that while they show promise, they often introduce regressions and their performance degrades at scale. Our results indicate that reliable LLM-powered network automation requires integrating LLMs into iterative workflows guided by formal verification.
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
🔥 Citations:
0
Abstract: Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high-level goal extraction, and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification (the final stage), these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with zero-shot prompting outperformed stand-alone few-shot prompting, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we found that combining the feedback mechanism with few-shot prompting does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with refining both the quantity and quality of the few-shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.
Long-tail Internet photo reconstruction
🔥 Citations:
0
Abstract: Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.
GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Traceable and anomaly-aware QoS for single-stack IPv6 in power grids via HMM-GNN fusion
DOI:
10.1117/12.3111375
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
🔥 Citations:
0
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost, complete framework for boosting fine-grained multi-image understanding in MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
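A rule-based spatial reward of the kind described would plausibly combine source-image attribution with box overlap. The sketch below scores attribution plus IoU only; it is an assumption about the reward's shape, and the paper's rule also checks structured output validity, which is omitted here:

```python
def spatial_reward(pred_box, gold_box, pred_img, gold_img):
    """Illustrative rule-based spatial reward: zero unless the source
    image is correctly attributed, otherwise the IoU of the predicted
    and gold (x1, y1, x2, y2) boxes."""
    if pred_img != gold_img:
        return 0.0                                  # wrong source image
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union else 0.0
```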
From rows to yields: how foundation models for tabular data simplify crop yield prediction
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Considerations about the proliferation of large language model chatbots and youth mental health
🔥 Citations:
0
Abstract:
Young people are experiencing worsening mental health and a growing reliance on online tools and services to address mental health difficulties. At the same time, next-generation large language models (LLMs) deployed through ‘chatbot-style interfaces’, whose deep-learning artificial intelligence makes interaction feel akin to conversing with a human, appear to mark an opportunity for mental health therapeutics when designed specifically for clinical intervention. However, emergent evidence suggests the use of more generic LLM chatbots may pose a risk of misinformation, bias, or over-reliance for some individuals when used outside of clinical contexts for mental health. This perspective paper examines the intersection of youth mental health and the rapid adoption of LLM chatbots. It first contextualises rising mental health challenges among young people alongside their increasing reliance on digital solutions. The paper then explores the potential benefits of LLM chatbot-style interfaces in clinical mental health interventions. Following this, we discuss the evidence surrounding adverse mental health outcomes from the use of generic LLMs to support mental health at population level, describing complex system-level and human-level factors noted from the evidence. Finally, we outline considerations for public health and youth mental health discourse, purpose-built LLM platform design, and a supporting research agenda. While current evidence on benefits and risks from generic LLMs is emergent and not youth-specific, this perspective highlights a need for research focused on young people to ensure safe and effective use of widely available LLMs for mental health support.
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
🔥 Citations:
0
Abstract: User comments on online programming platforms such as Stack Overflow play a vital role in maintaining the correctness and relevance of shared code examples. However, the majority of comments express gratitude or clarification, while only a small fraction highlight actionable issues that drive meaningful edits. This paper demonstrates how agentic AI principles can revolutionize software maintenance tasks by presenting RAG-Reflect, a modular framework that achieves fine-tuned-level performance for valid comment-edit prediction without task-specific training. Valid Comment-Edit Prediction (VCP) is the task of determining whether a user comment directly triggered a subsequent code edit. The framework integrates large language models (LLMs) with retrieval-augmented reasoning and self-reflection mechanisms. RAG-Reflect operates through a three-stage runtime workflow built on a one-time pattern analysis phase. During initialization, an Interpretation module analyzes the knowledge base to generate validation rules. At inference time, the system (1) retrieves contextual examples, (2) reasons about comment-edit causality, and (3) reflects on decisions using the pre-established rules. We evaluate RAG-Reflect on the publicly available SOUP benchmark, achieving Precision = 0.81, Recall = 0.74, and F1 = 0.78, outperforming traditional baselines (e.g., Logistic Regression, XGBoost, different prompting techniques) and closely approaching the performance of fine-tuned models (F1 = 0.773) without retraining. Our ablation and stage-level analyses show that both retrieval and reflection modules substantially enhance performance.
Network Traffic Anomaly Detection Using Hybrid Machine Learning and Generative AI: A SOC-Integrated Approach
🔥 Citations:
0
Abstract: Modern enterprise networks face an ever-expanding threat landscape characterized by zero-day exploits, polymorphic malware, distributed denial-of-service campaigns, and sophisticated insider threats that consistently evade signature-based detection systems. Traditional Intrusion Detection Systems (IDS), despite decades of refinement, suffer from high false-positive rates, inability to detect novel attacks, and an absence of contextual explanation that forces security analysts to manually interpret raw machine-generated data. This paper presents the design, implementation, and experimental evaluation of a Network Traffic Anomaly Detection Security Operations Center (SOC) Dashboard that addresses these limitations through the integration of hybrid unsupervised machine learning and generative artificial intelligence. The system deploys a weighted ensemble of Isolation Forest and Local Outlier Factor (LOF) to compute continuous anomaly scores across 60,000 synthetic network flow records, classifying detected anomalies into four severity tiers: LOW, MEDIUM, HIGH, and CRITICAL. A deterministic rule engine operates in parallel, applying domain-specific security rules to escalate alert severity for high-confidence attack signatures including port scanning, distributed denial-of-service patterns, and command-and-control port usage. The central innovation of this work is the integration of the LLaMA 3.1 8B Instant large language model via the Groq API to generate automated, human-readable, MITRE ATT&CK-mapped triage reports for each detected alert, eliminating the need for manual expert interpretation and substantially reducing mean time to respond. The Streamlit-based interactive dashboard presents results across three analytical modules: Incident Detail and Explainability, Network Graph Visualization, and AI-Powered Triage. Experimental results demonstrate successful detection of all injected attack scenarios, generation of 4,800 severity-classified alerts from 60,000 traffic events, and LLM response latency averaging approximately two seconds per query. This work demonstrates that combining unsupervised behavioral detection with generative AI explanation bridges the semantic gap between machine output and analyst understanding, enabling faster, more accessible, and more effective cybersecurity operations.
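The weighted-ensemble scoring and four-tier classification can be sketched as a pure function over the two detectors' normalized anomaly scores. The weight and thresholds below are illustrative assumptions; the paper does not disclose its values:

```python
def severity(score_iforest, score_lof, w=0.6):
    """Weighted ensemble of two anomaly scores in [0, 1] (Isolation
    Forest and LOF), mapped to the four severity tiers. The weight w
    and tier thresholds are illustrative, not the system's settings."""
    s = w * score_iforest + (1 - w) * score_lof
    if s >= 0.9:
        return "CRITICAL"
    if s >= 0.75:
        return "HIGH"
    if s >= 0.5:
        return "MEDIUM"
    return "LOW"
```

In the described system, a rule engine can then escalate the tier further when a high-confidence signature (e.g., a port scan) matches, independent of the ensemble score.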
ArguMath: AI-Simulated Environment for Pre-Service Teacher Training in Orchestrating Classroom Mathematics Argumentation
🔥 Citations:
0
Abstract: Facilitating productive mathematical argumentation, especially asking rational questions, is essential yet remains challenging for pre-service mathematics teachers (PMTs), who often have limited opportunities to apply abstract theoretical knowledge in authentic practice. At the same time, recent advances in large language models (LLMs) have expanded the potential for simulating students in educational settings, enabling low-risk environments for instructional practice. To inform the design of a system that supports PMTs in orchestrating classroom argumentation, we conducted a formative study with eight experienced mathematics teachers to identify key design requirements, including personalization, realistic simulations, structured reflection, and ease of use. Building on these requirements, we developed ArguMath, an AI-simulated classroom environment that supports PMTs in practicing the orchestration of mathematical argumentation. ArguMath comprises three core components: (1) customization of classroom settings; (2) simulation of classroom discussions with AI-based students grounded in authentic transcripts and augmented with real-time instructional suggestions; and (3) structured reflection through discourse annotation and overall feedback. Results from an exploratory user study with seven PMTs, complemented by interviews with four experienced teachers, indicate that ArguMath has the potential to support PMTs' classroom orchestration skills, particularly theory-aligned questioning strategies.
Large language model-enabled automated data extraction for concrete materials informatics
🔥 Citations:
0
Abstract: The promise of data-driven materials discovery remains constrained by the scarcity of large, high-quality, and accessible experimental datasets. Here, we introduce a generalizable large language model (LLM)-powered pipeline for automated extraction and structuring of materials data from unstructured scientific literature, using concrete materials as a representative and particularly challenging example. The pipeline exhibits robust performance across a broad range of LLMs and achieves an F1 score of up to 0.97 for diverse composition-process-property attributes. Within one hour, it extracts nearly 9,000 high-quality records with over 100 attributes screened from more than 27,000 publications, enabling the construction of the largest open laboratory database for blended cement concrete. Machine learning analyses underscore the importance of large, diverse, and information-rich datasets for enhancing both in-distribution accuracy and out-of-distribution generalization to unseen materials. The proposed pipeline is readily adaptable to other materials domains and accelerates the development of scalable data infrastructures for materials informatics.
C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs
🔥 Citations:
0
Abstract: Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-Moral, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-Moral combines group-based relative optimization, property score alignment for heterogeneous objectives, and continuous non-linear reward aggregation to improve stability across competing properties. Experiments on the C-MuMOInstruct benchmark show that C-Moral consistently outperforms state-of-the-art models across both in-domain and out-of-domain settings, achieving the best Success Optimized Rate (SOR) of 48.9% on IND tasks and 39.5% on OOD tasks, while largely preserving scaffold similarity. These results suggest that RL post-training is an effective way to align molecular language models with continuous molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.
scConcept enables concept-level exploration of single-cell transcriptomic data
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Uncertainty Quantification for LLM Function-Calling
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we find that in the FC setting, they offer no clear advantage over simple single-sample UQ methods. Additionally, we find that the particularities of FC outputs can be leveraged to improve the performance of existing UQ methods in this setting. Specifically, multi-sample UQ methods benefit from clustering FC outputs based on their abstract syntax tree parsing, while single-sample UQ methods can be improved by selecting only semantically meaningful tokens when calculating logit-based uncertainty scores.
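The AST-based clustering idea can be sketched with Python's own `ast` module: canonicalize each sampled call (here by sorting keyword arguments, so argument order no longer splits clusters) and compute an entropy over the resulting clusters. This is an illustrative reduction of semantic entropy to exact-match clusters, not the paper's full method:

```python
import ast
import math
from collections import Counter

def canonical_call(src):
    """Canonicalize a Python-style function-call string: parse it,
    sort keyword arguments by name, and return the AST dump, so calls
    differing only in kwarg order map to the same cluster key."""
    call = ast.parse(src, mode="eval").body
    call.keywords.sort(key=lambda kw: kw.arg or "")
    return ast.dump(call)

def semantic_entropy(samples):
    """Entropy over clusters of canonicalized call samples; zero means
    all samples agree, i.e. the model is maximally confident."""
    counts = Counter(canonical_call(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

In this framing, two samples like `f(a=1, b=2)` and `f(b=2, a=1)` land in one cluster and contribute no uncertainty, while genuinely different calls raise the entropy.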
Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort
🔥 Citations:
0
Abstract: Clinical framing -- the linguistic manner in which clinical information is presented -- can influence patient understanding and decision-making, with important implications for healthcare outcomes. Obstetrics is a high-stakes domain in which physicians counsel patients on delivery mode choices such as vaginal birth after cesarean (VBAC) and repeat cesarean section (RCS), yet counseling language remains underexplored in large-scale clinical text analysis. In this work, we analyze physician counseling language in 2,024 obstetric history and physical narratives for a rigorously defined cohort of patients for whom both VBAC and RCS were clinically viable options. To control for confounding due to medical contraindications, we first construct a VBAC-eligible cohort using structured clinical data supplemented by a large language model (LLM)-based extraction pipeline constrained to grounded, verbatim evidence from free-text narratives. We then apply a zero-shot LLM framework to categorize counseling segments into predefined framing categories capturing how physicians linguistically present delivery options. Our analysis reveals a significant difference in counseling framing distributions between VBAC and RCS notes; risk-focused language accounts for a substantially larger share of counseling segments in RCS documentation than in VBAC, with category-level differences confirmed by statistical testing, highlighting the value of controlled LLM-based framing analysis in obstetric care.
Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
🔥 Citations:
1
Abstract: Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM grounder agents are employed, including a novel Jupyter Notebook agent for data-driven analysis, that help to validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency and scalability, and enable interactive "what-if" scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves accuracy by 15.84% on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder shows strong cost-effectiveness that achieves a close 70.11% accuracy with 90.35% less cost and 52.85% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with a near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.
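The variance half of the bias-variance argument (averaging many stochastic leaf scores with a linear synthesis step suppresses noise roughly as 1/n for independent noise) can be checked empirically. The score distribution below is a made-up stand-in for grounder outputs, not data from the paper:

```python
import random
import statistics

def mean_score_variance(n, trials=2000, seed=0):
    """Empirical variance of the mean of n noisy proposition scores
    drawn from a fixed distribution; for independent noise it should
    shrink roughly like 1/n as n grows."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.7, 0.2) for _ in range(n)) / n
             for _ in range(trials)]
    return statistics.pvariance(means)
```

Averaging 16 leaves instead of 1 should cut the variance of the synthesized truth value by roughly a factor of 16, which is the stability effect the abstract attributes to its linear synthesis models.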
GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution
🔥 Citations:
0
Abstract: Modern ASIC design is becoming increasingly complex, driving up design costs while limiting productivity gains from existing EDA tools. Despite decades of progress, current tools rely on fixed heuristics and offer limited control via tool hyperparameters, requiring extensive manual tuning to achieve an acceptable quality of results (QoR). While prior work has explored learning-based optimization and design-specific hyperparameter tuning, these approaches operate within the constraints of static tool algorithm implementations and do not adapt the underlying algorithms to individual designs. To address this limitation, we introduce the concept of design-adaptive EDA tooling, in which the internal algorithms of EDA tools are automatically specialized to the characteristics of a given design. We instantiate this paradigm through GR-Evolve, a code evolution framework that leverages an agentic large language model (LLM) to iteratively modify global routing source code using QoR-driven feedback. The framework equips the LLM with persistent contextual knowledge of open-source global routers along with an integrated toolchain for QoR evaluation within the OpenROAD infrastructure. We evaluate GR-Evolve across seven benchmark designs across three technology nodes and demonstrate up to 8.72% reduction in post-detailed-routing wirelength over existing baseline routers, highlighting the potential of LLM-driven EDA code evolution for design-adaptive global routing.
Comparing ChatGPT-4o and Gemini 1.5 Pro in Adolescent Psychiatric Emergencies: A Real-World Evaluation of AI Support in Suicide Risk Assessment
🔥 Citations:
0
Abstract:
This study aimed to evaluate the performance of large language models—ChatGPT-4o and Gemini 1.5 Pro—in assessing suicide risk and guiding treatment in adolescents presenting to the emergency department with suicidal ideation and/or attempts.
A retrospective review was conducted on child psychiatry consultation notes from 36 adolescents evaluated between February and March 2024. Structured clinical data were entered into ChatGPT and Gemini, and the resulting decisions were compared to those made by clinicians regarding hospitalization, sedation need, medication initiation, follow-up timing, and notification of social services or law enforcement.
ChatGPT showed higher concordance with clinicians than Gemini, especially in hospitalization (41.6% agreement) and sedation decisions (100% agreement). ChatGPT recommended hospitalization in 58.3% of cases, compared to 33.3% by clinicians and 36.1% by Gemini. For outpatient cases, ChatGPT demonstrated partial alignment with clinical decisions on medication and follow-up, while Gemini’s responses were often uncertain or incomplete.
Large language models show promise as decision-support tools in adolescent psychiatric emergencies. ChatGPT was more consistent with clinical judgments than Gemini. However, limitations remain, and further studies involving broader populations are needed before routine clinical integration.
Characterizing public comments via Regulations.gov in response to proposed cannabis rescheduling in the United States
DOI:
10.1111/add.70410
🔥 Citations:
0
Abstract:
The United States Drug Enforcement Administration's (DEA) proposed rescheduling of cannabis from Schedule I to Schedule III under the Controlled Substances Act marks a significant shift in federal policy. Understanding public sentiment toward this policy is critical for guiding the current cannabis rescheduling effort as well as future reforms. The objective of this study is to characterize public comments submitted to Regulations.gov regarding the DEA's cannabis rescheduling proposal and identify underlying justifications for support or opposition.
A mixed-methods analysis was conducted of online public comments submitted to Regulations.gov regarding the DEA's cannabis rescheduling proposal: 42 913 public comments submitted between 21 May and 22 July 2024.
Comments were analyzed for sentiment towards the proposed rescheduling (support, oppose or insufficient rescheduling) and thematic justifications using manual and automated natural language processing techniques. A two-stage annotation approach was employed: manual coding of 200 randomly sampled comments by multiple independent evaluators, followed by automated classification of all 42 913 comments using an open-source large language model (LLM) validated against the manual annotations.
Using LLM-based classification validated against human annotations [88% agreement, F1 (harmonic mean of precision and recall) = 0.86], we found that among 42 913 comments, 28.85% [95% confidence interval (CI) = 28.44%–29.24%] supported rescheduling, 6.74% (95% CI = 6.50%–6.99%) opposed and 63.50% (95% CI = 63.06%–63.99%) deemed the proposal insufficient, favoring further rescheduling or complete de-scheduling of cannabis. Among the 200 manually annotated comments, therapeutic benefits (56.7%, 95% CI = 46.7%–66.7%) and economic impacts (27.8%, 95% CI = 18.9%–37.8%) were the most common justifications among supporters. Public health risks (100.0%, 95% CI = 100.0%–100.0%), addictiveness concerns (71.4%, 95% CI = 42.9%–100.0%) and concerns about underage use (57.1%, 95% CI = 14.3%–85.7%) were predominant in opposing comments. Insufficient rescheduling comments cited therapeutic benefits (37.8%, 95% CI = 28.5%–48.0%), economic impacts (28.6%, 95% CI = 19.4%–37.8%) and criminal justice reform (26.5%, 95% CI = 18.4%–35.7%) as primary justifications.
Public sentiment on Regulations.gov supports the United States Drug Enforcement Administration's proposal for cannabis rescheduling, though the majority views the proposed Schedule III classification as inadequate and supports further rescheduling or complete de-scheduling of cannabis.
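For context, a 95% confidence interval for a proportion of this kind is most simply obtained with the normal-approximation (Wald) formula; the paper does not state which interval method it used, so the following is an illustrative sketch of the arithmetic only:

```python
import math

def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# e.g. roughly 28.85% support among 42 913 comments
lo, hi = proportion_ci(round(0.2885 * 42913), 42913)
```

With samples this large the Wald interval is very close to more exact methods, which is why the reported intervals are only a few tenths of a percentage point wide.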
Artificial intelligence-assisted diagnosis of ocular caruncle oncocytoma: a proof-of-concept case report of two cases
🔥 Citations:
0
Abstract:
Oncocytomas of the ocular caruncle are rare benign epithelial tumors. Their clinical diagnosis is challenging, as they can mimic other benign or malignant lesions such as papilloma, nevus, squamous cell carcinoma, melanoma, or oncocytic carcinoma. For this reason, histopathological confirmation remains indispensable. The aim of this study was to test the ability of a multimodal large language model (ChatGPT, GPT-5, 2025 version) to generate diagnostic hypotheses directly from slit-lamp images, supported by brief clinical summaries.
We retrospectively analyzed two cases of caruncular oncocytoma that had undergone surgical excision with subsequent histopathological confirmation. For each case, ChatGPT was provided only with slit-lamp photographs of the lesion and a concise clinical summary including age, sex, and the site of the lesion (caruncle). No histopathological data or additional clinical details were supplied. In both cases, ChatGPT proposed oncocytoma as the primary diagnostic hypothesis. The model also generated differential diagnoses including papilloma, nevus, as well as the possibility of a malignant lesion such as squamous cell carcinoma or melanoma.
This proof-of-concept demonstrates, for the first time to our knowledge, that a general-purpose multimodal AI system can correctly recognize a rare ocular surface tumor from slit-lamp images. While preliminary and limited by the very small sample size, these findings suggest that large language models may assist clinicians in considering rare adnexal tumors during differential diagnosis. Further research on larger datasets is required, and histopathology will remain the gold standard for definitive diagnosis.
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
🔥 Citations:
0
Abstract: Predicting the outcomes of prospective clinical trials remains a major challenge for large language models. Prior work has shown that both traditional correlational predictors, such as random forests and logistic regression, and strong commercial LLMs achieve limited performance on this task. In this paper, we propose DeepImagine, a framework for teaching LLMs biomedical reasoning through successive counterfactual imagining. The central idea is to approximate hidden causal mechanisms of clinical trials by training models to infer how observed trial results would change under controlled perturbations of experimental conditions, such as dosage, outcome measures, study arms, geography, and other trial attributes. To support this objective, we construct both natural and approximate counterfactual pairs from real clinical trials with reported outcomes. For settings where strict counterfactual supervision is available, such as paired outcome measures or dose-ranging study arms within the same trial, we train models with supervised fine-tuning. For broader settings where only approximate counterfactual pairs can be retrieved, we optimize models with reinforcement learning using verifiable rewards based on downstream benchmark correctness. We further augment training with synthetic reasoning traces that provide causally plausible explanations for local counterfactual transitions. Using this pipeline, we train language models under 10B parameters, including Qwen3.5-9B, and evaluate them on clinical trial outcome prediction. We aim to show that DeepImagine consistently improves over untuned language models and traditional correlational baselines. Finally, we aim to show that the learned reasoning trajectories provide interpretable signals about how models represent trial-level mechanisms, suggesting a practical path toward more mechanistic and scientifically useful biomedical language models.
MSAgent: An Evidence Grounded Agentic Framework for LLM-driven Scientific Exploration in Mass Spectrometry-based Metabolomics
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Dr.Sai: An agentic AI for real-world physics analysis at BESIII
🔥 Citations:
0
Abstract: High Energy Physics (HEP) experiments like BESIII produce petabyte-scale data. Extracting physics results requires complex workflows (simulation, reconstruction, statistical analysis, etc.) that traditionally take experts months or years. Current manual methods are labor-intensive, prone to bias, and limit large-scale systematic scans. As data grows, this paradigm slows discovery. Large Language Models (LLMs) offer a solution. Their natural language understanding and code generation capabilities allow them to interpret scientific tasks and integrate with HEP tools (e.g., ROOT, BOSS) to act as an "AI partner" for autonomous analysis. We present Dr.Sai, an LLM-powered multi-agent system that translates natural language into rigorous physics workflows. As validation, Dr.Sai performed large-scale re-measurements of ten J/psi decay branching fractions - without manual coding. It successfully navigated the real BESIII computing environment and produced results matching established benchmarks. The article details Dr.Sai's architecture, the validation results, and performance evaluation. This work provides a blueprint for autonomous discovery, with relevance to other data-intensive fields like astronomy and genomics.
An Autonomous Large Language Model‐Agent Framework for Transparent and Local Time Series Forecasting
🔥 Citations:
0
Abstract: The growing complexity of thermal power generation systems demands advanced forecasting solutions capable of integrating data analysis, model selection, and interpretability. This study proposes a modular large language model (LLM) agent framework for time series forecasting, designed to operate locally and interactively through natural language instructions. The framework incorporates a domain‐specific time series agent that was developed to automate data preprocessing, anomaly detection, and forecasting tasks using neural and statistical models. Experiments demonstrated the agent's capacity to autonomously conduct end‐to‐end analyses, achieving accurate forecasts with minimal user intervention. The PatchTST model, automatically selected by the agent, yielded the lowest mean squared error among evaluated methods. Results highlight the potential of LLM‐based agents to enhance transparency, usability, and reproducibility in energy forecasting pipelines.
A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
🔥 Citations:
0
Abstract: Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation model paradigm. However, for structured medical data, characterized by a limited vocabulary and sparse observations, whether increasing model size consistently improves downstream predictions is unclear, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M-101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M-101M), whereas medication prediction saturated at 11M, reducing pretraining time by 178 h. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine baseline in the area under the precision-recall curve. These findings indicate that, unlike the monotonically decreasing pretraining loss, the optimal model size varied depending on task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.
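The area under the precision-recall curve used for the comparison above is commonly computed as average precision; a minimal self-contained sketch of that metric (not the study's own code):

```python
def average_precision(labels, scores):
    """AP = mean of precision@k over the ranks k where a positive occurs,
    with items sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total, ap = 0, sum(labels), 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:                 # a relevant item at this rank
            hits += 1
            ap += hits / rank         # precision at this rank
    return ap / total

# toy example: positives ranked 1st and 3rd -> AP = (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike ROC-AUC, this metric is sensitive to class imbalance, which is why it is the natural choice for rare-disease and medication-incidence prediction.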
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
🔥 Citations:
0
Abstract: Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potential of unified modeling for this task, but jointly learning speaker attribution, temporal structure, and lexical recognition remains difficult and data-intensive. At the current stage, leveraging reliable speaker diarization as an explicit structural prior provides a practical and efficient way to simplify this task. To effectively exploit such priors, we propose DM-ASR, a diarization-aware multi-speaker ASR framework that reformulates the task as a multi-turn dialogue generation process. Given an audio chunk and diarization results, DM-ASR decomposes transcription into a sequence of speaker- and time-conditioned queries, each corresponding to one speaker in one time segment. This formulation converts multi-speaker recognition into a series of structured sub-tasks, explicitly decoupling speaker-temporal structure from linguistic content and enabling effective integration of diarization cues with the reasoning capability of large language models. We further introduce an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, yielding richer structured outputs and better transcription quality. Our analysis shows that diarization systems provide more reliable speaker identities and segment-level boundaries, while LLMs excel at modeling linguistic content and long-range dependencies, demonstrating their complementary strengths. Experiments on Mandarin and English benchmarks show that the proposed approach achieves strong performance with relatively small models and training data, while remaining competitive with or outperforming existing unified approaches.
Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
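As a toy illustration of how cheap a pre-retrieval predictor can be (the abstract does not name the specific predictors evaluated), average inverse document frequency over query terms can rank competing variants without running any retrieval at all:

```python
import math
from collections import Counter

# toy corpus standing in for a real collection
docs = [
    "neural retrieval models for question answering",
    "sparse retrieval with bm25 ranking",
    "dense passage retrieval for open domain qa",
]

# document frequency of each term (0.5 smoothing for unseen terms)
df = Counter(t for d in docs for t in set(d.split()))
N = len(docs)

def avg_idf(query: str) -> float:
    """Mean IDF of the query terms: higher suggests a more discriminative query."""
    terms = query.split()
    return sum(math.log(N / df.get(t, 0.5)) for t in terms) / len(terms)

variants = ["retrieval models", "dense passage retrieval", "open domain qa"]
best = max(variants, key=avg_idf)   # pick the most discriminative variant
```

Predictors of this form run in microseconds per variant, which is what makes selective execution attractive compared with scoring every reformulation post-retrieval.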
LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling
🔥 Citations:
0
Abstract: Large language models (LLMs) and agentic systems have recently demonstrated potential for automating scientific workflows, including atomistic simulations. However, their deployment in high-performance computing (HPC) environments remains limited by the lack of mechanisms ensuring correctness, reproducibility, and safe interaction with computational resources. Generated workflows suffer from inconsistencies, incorrect API usage, or invalid physical configurations - leading to failed or unreliable simulations. In this work, we introduce LARA-HPC, a validation-driven agentic framework to enable reliable workflow generation for atomistic modeling on HPC systems. Our approach is based on three key components: (i) a controlled execution layer that mediates all interactions with HPC resources; (ii) simulation-native validation through dry-run capabilities, enabling execution-level verification without incurring resource cost; and (iii) a multi-phase agentic pipeline combining retrieval-augmented generation and iterative refinement. We demonstrate the effectiveness of this approach on an end-to-end atomistic simulation workflow on HPC by applying LARA-HPC to Density Functional Theory simulations. The results show that validation-driven generation significantly improves robustness and enables iterative correction of both syntactic and physical inconsistencies. More broadly, this work advocates for a shift from generation-first to validation-first paradigms in Artificial Intelligence (AI)-assisted scientific computing. We argue that the future task of the computational physics community is to develop domain-specific agentic systems based on structured tooling to realize an HPC-enabled co-piloted research ecosystem.
A Robust Deep Temporal Causal Discovery Platform for Single‐Cell Gene Regulatory Network Reconstruction
🔥 Citations:
0
Abstract:
Accurately inferring gene regulatory networks (GRNs) from single-cell RNA sequencing (scRNA-seq) data is critical for understanding cellular dynamics in both normal development and disease. However, existing computational methods often suffer from low precision and high false-positive rates due to the intrinsic noise and complex regulatory architecture in scRNA-seq data. We introduce scTIGER2.0, a deep-learning-based framework that integrates expression correlation, pseudotime ordering, temporal causal discovery, and bootstrap-based significance testing to infer high-confidence, directional gene-gene interactions. Benchmarking against five popular GRN inference methods using large-scale datasets, scTIGER2.0 consistently achieved superior specificity, especially in linear developmental trajectories. In real applications, scTIGER2.0 identified an APOE-centered GRN from Alzheimer's disease scRNA-seq data and uncovered interconnected GRNs for FOS, FOXP1, JUN, KLF6, NCOA4, and RUNX1 from acute myeloid leukemia data, where 87.5% of the predicted targets show promoter-binding peaks in the corresponding ChIP-seq data. These results demonstrate that scTIGER2.0 is a robust, accurate and fully integrated platform for uncovering biologically meaningful GRNs from noisy scRNA-seq data.
Comparative Analysis of Readability in Glaucoma Information Generated by AI: ChatGPT-5 vs. Gemini 2.5 Pro
🔥 Citations:
0
Abstract: Background: Glaucoma is a leading cause of irreversible blindness worldwide, requiring effective patient education and communication for early detection and management. Recently, large language models (LLMs) such as ChatGPT-5 and Gemini 2.5 Pro have emerged as potential tools for providing medical information. However, the readability of AI-generated responses remains an important concern, particularly for patients with varying levels of health literacy.
Objective: This study aimed to evaluate and compare the readability of glaucoma-related responses generated by ChatGPT-5 and Gemini 2.5 Pro.
Methods: A total of 30 glaucoma-related questions, compiled and validated by three specialists, were presented to both AI models. The generated responses were analyzed using multiple readability indices, including Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index, Automated Readability Index (ARI), Coleman-Liau Index, SMOG Index, Linsear Write Formula, Dale-Chall Readability Score, and Spache Readability Formula. Statistical analysis was performed using SPSS, with significance set at p < 0.05.
Results: Both models produced responses within a similar readability range, generally corresponding to middle- to high-school reading levels. Gemini 2.5 Pro consistently generated slightly more readable text across most indices. A statistically significant difference was observed only in the Flesch-Kincaid Grade Level (p < 0.001), with Gemini producing lower grade-level text compared to ChatGPT-5. Other readability metrics showed no significant differences between the models.
Conclusion: Both ChatGPT-5 and Gemini 2.5 Pro are capable of generating understandable glaucoma-related information; however, Gemini demonstrates a modest advantage in readability. Despite this, the overall reading level may still be too high for individuals with limited health literacy. Further refinement of AI-generated medical content is necessary to improve accessibility and ensure effective patient communication.
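For reference, the Flesch Reading Ease and Flesch-Kincaid Grade Level indices cited above are closed-form functions of word, sentence, and syllable counts. The sketch below takes the counts as inputs, since syllable detection is the part that varies across tools:

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    # Higher scores mean easier text; 60-70 roughly corresponds to plain English.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    # Output is a US school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

The reported finding that Gemini produced lower grade-level text therefore means shorter sentences and/or fewer syllables per word, the only two quantities FKGL measures.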
Spiritual-Emotion Guided Fine-Tuning of Mistral-7B for Mental Health Conversational Systems using Bhagavad Gita Knowledge
DOI:
10.55041/isjem06759
🔥 Citations:
0
Abstract: Artificial intelligence-based conversational agents have recently gained attention as scalable solutions for mental health assistance. Large Language Models (LLMs) demonstrate strong language understanding and generation capabilities; however, their responses in counseling contexts often lack the emotional grounding, empathy, and philosophical reasoning required for sensitive mental health interactions.
This research proposes a Spiritual-Emotion Guided Mental Health Dialogue System developed using parameter-efficient fine-tuning of the Mistral-7B-Instruct large language model. The proposed system integrates philosophical teachings from the Bhagavad Gita to improve emotional stability, reflective reasoning, and empathetic response generation in mental health conversations.
Bhagavad Gita scriptures are extracted from PDF sources using the PDFPlumber framework and converted into structured question–answer instruction pairs for supervised fine-tuning. The base model is adapted using Low-Rank Adaptation (LoRA), enabling efficient domain-specific learning while preserving the general capabilities of the pretrained model.
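The LoRA adaptation described here freezes the base weights and trains only a low-rank correction. A numerical sketch of that update with numpy (rank, scaling, and dimensions are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16     # illustrative sizes, not from the paper

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized

def lora_forward(x):
    # y = W x + (alpha / r) * B A x ; only A and B receive gradient updates,
    # so the adapter adds r*(d_in + d_out) parameters instead of d_in*d_out.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)   # equals W @ x before training, since B starts at zero
```

The zero initialization of B is the standard LoRA choice: the adapted model starts out exactly equal to the base model, which is what "preserving the general capabilities of the pretrained model" amounts to at step zero.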
The proposed system is evaluated using a mental health counseling dialogue dataset to assess its effectiveness in generating supportive and contextually meaningful responses. The expected outcomes indicate improved emotional coherence, contextual understanding, and response quality compared to the baseline language model.
The findings aim to demonstrate that integrating philosophical knowledge with modern language models can enhance the effectiveness and empathy of AI-driven mental health dialogue systems.
Index Terms—Mental Health AI, Mistral-7B, LoRA, Bhagavad Gita, Psych8K, Emotional Intelligence, Parameter-Efficient Fine-Tuning
Institutions for the Post-Scarcity of Judgment
🔥 Citations:
0
Abstract: Each major technological revolution inverts a particular scarcity and rebuilds institutions around the shift. The near-consensus diagnosis of the AI revolution holds that AI collapses the cost of prediction while judgment remains scarce. This Opinion argues the inversion has now flipped: competent-looking judgment (selecting, ranking, attributing, certifying) is produced at scale and at marginal cost approaching zero, and four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition). Because judgment is the substance of institutions, the institutions built to manufacture legitimate judgment (courts, journals, licensing bodies, legislatures) now compete with the technology for the same functional role. The piece traces the pattern across scientific institutions, professional licensing, intellectual property, democratic legitimacy, and foundation-model concentration, and closes with a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as commons, and develop the formal apparatus for institutional composition under strategic agents.
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
🔥 Citations:
0
Abstract: Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion
🔥 Citations:
0
Abstract: Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
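The Exact Match (EM) points reported above follow the standard QA convention of comparing normalized strings. This sketch shows one common normalization (lowercasing, punctuation and article removal), which may differ in detail from the paper's:

```python
import re
import string

def normalize(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop English articles
    return " ".join(s.split())              # collapse whitespace

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))
```

EM is all-or-nothing, so a 40.6-point gain means the system produces the exact (normalized) gold answer in 40.6% more of the test cases, not merely closer answers.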
Efficient Image Annotation via Semi-Supervised Object Segmentation with Label Propagation
🔥 Citations:
0
Abstract: Reliable object perception is necessary for general-purpose service robots. Open-vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time-intensive annotations. We present a semi-supervised label propagation approach for household object segmentation. A segment proposer generates class-agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais-bonn/label_propagation.
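The Hopfield-network labeling step can be pictured as attention-style retrieval over stored class embeddings, as in the modern continuous Hopfield formulation. A toy numpy sketch (all dimensions, prototypes, and the inverse temperature β are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def hopfield_label(query, memory, labels, beta=8.0):
    """One retrieval step: p = softmax(beta * M q); return the label of the
    most strongly retrieved stored pattern plus the full distribution."""
    p = softmax(beta * memory @ query)
    return labels[int(np.argmax(p))], p

# two stored class prototypes in a toy 3-d embedding space
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
labels = ["cup", "plate"]
label, p = hopfield_label(np.array([0.9, 0.1, 0.0]), memory, labels)
```

An ensemble over several embedding spaces (here CLIP, ViT, and Theia) would run this lookup per space and combine the resulting distributions, e.g. by averaging.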
A multimodal situational information extraction and management system based on large language models
DOI:
10.1117/12.3110800
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Comparative evaluation of large language models for guideline-compliant abstract generation and readability in dental research: an experimental comparative study
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Evaluation of large language models for nursing support in maternal venous thromboembolism care
🔥 Citations:
0
Abstract:
Venous thromboembolism (VTE) is a major cause of maternal morbidity and mortality, and nursing plays a central role in prevention, patient education, and follow-up. Large language models (LLMs) have attracted increasing attention in healthcare; however, their comparative performance in maternal VTE nursing contexts remains insufficiently explored.
Five representative LLMs—DeepSeek, GPT-4.1, Claude 3.7, Huatuo, and Kimi—were evaluated across six clinical domains (etiology, diagnosis, treatment, prognostic assessment, home care, prevention) and five performance dimensions (accuracy, comprehensibility, logical coherence, reliability, safety). An expert-informed Delphi framework comprising 41 items guided the evaluation. Three nursing experts independently rated each model’s responses, and inter-rater reliability was assessed using Fleiss’s Kappa.
GPT-4.1, Claude 3.7, and DeepSeek demonstrated superior overall performance, particularly in patient education, individualized care planning, and preventive guidance. Huatuo and Kimi showed limitations in treatment and prognostic reasoning. Inter-rater reliability was excellent (Kappa = 0.892).
The findings highlight relative strengths and limitations of different LLMs across nursing-relevant domains in maternal VTE care. While certain models performed better in educational and supportive contexts, the current study does not assess clinical adequacy or readiness for real-world nursing deployment. Future research incorporating patient perspectives and real-world validation is needed to inform the safe and appropriate integration of LLMs into nursing practice.
A framework for automating standard operating procedures in government services based on large language models
DOI:
10.1117/12.3110630
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
🔥 Citations:
0
Abstract: As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
Development and comparative evaluation of knowledge graph–enhanced large language models for domain-specific question answering in nursing
🔥 Citations:
0
Abstract: No abstract available; please view the original article.
Generalist large language models in a specialized world: Evidence from the Italian national medical education pathway
🔥 Citations:
0
Abstract: Creating language-specific and domain-specific large language models presents substantial challenges, including computational demands and limited data availability. While it is commonly believed that the benefits of specialized models justify these challenges, we dispute this notion with a comparative evaluation in a low-resourced language and medical-specific domain. In our study, we analyze the performance of various LLMs applied to the Italian healthcare domain using novel unpublished datasets, consisting of five-choice questions from national pre-university and post-university medical exams, covering clinical and preclinical fields. As part of this work, we release these datasets to the research community. We evaluated multilingual and Italian-specific models, along with general-purpose and healthcare-specific models, spanning both open-source and proprietary architectures of varying sizes. Our results demonstrate that multilingual, general-purpose large models consistently exceed the pass threshold across all tests, with the best models achieving over 90% accuracy on postgraduate-level exams. Model size emerged as the most critical factor influencing performance, whereas domain specialization and single-language localization offered no evidence of specialization superiority. These findings challenge the traditional pretrain-then-finetune paradigm for domain and language localization in language models, suggesting that advancements in generic-purpose multilingual models may render domain-specific pretraining unnecessary in many specialized cases.
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
🔥 Citations:
0
Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting
🔥 Citations:
0
Abstract: We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.
RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment
🔥 Citations:
1
Abstract: Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose \textbf{RouteLMT} (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator's prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty-estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
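The budget-allocation view in the abstract above reduces to a simple rule: predict each request's marginal gain from the large model, then spend the budget on the highest-gain requests. A minimal sketch, assuming a plain top-k rule over precomputed gain predictions (the function name, inputs, and selection rule are illustrative, not RouteLMT's actual implementation):

```python
def route_by_marginal_gain(predicted_gains, budget_fraction):
    """Route the samples with the highest predicted large-model gain
    to the large model, up to a fixed fraction of total traffic."""
    n = len(predicted_gains)
    k = int(n * budget_fraction)  # how many requests the budget allows
    # Rank samples by the expected improvement of the large model
    ranked = sorted(range(n), key=lambda i: predicted_gains[i], reverse=True)
    to_large = set(ranked[:k])
    return ["large" if i in to_large else "small" for i in range(n)]

gains = [0.02, 0.31, 0.05, 0.44, 0.01]  # hypothetical per-sample gains
print(route_by_marginal_gain(gains, budget_fraction=0.4))
# -> ['small', 'large', 'small', 'large', 'small']
```

Under this framing, a router that predicts absolute quality rather than the small-to-large gain would waste budget on samples the small model already handles well.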
Feasibility and exploratory assessment of large language models for pediatric dentistry queries: a comparative study
🔥 Citations:
0
Abstract:
Large Language Models (LLMs) are increasingly used by caregivers to obtain pediatric health information. However, concerns persist regarding the accuracy, reliability, and readability of AI-generated content, especially in pediatric dentistry, where caregiver comprehension is crucial.
To conduct an exploratory feasibility assessment of the accuracy, quality, reliability, and readability of responses generated by ChatGPT-4, Google Gemini, and DeepSeek to common pediatric dentistry queries.
This exploratory comparative cross-sectional feasibility study used 15 patient-oriented pediatric dentistry questions identified through structured searches and expert screening. Each question was submitted verbatim to ChatGPT-4, Gemini, and DeepSeek under standardized conditions. Responses were independently evaluated by three calibrated pediatric dentistry experts using the Global Quality Scale (GQS), a modified DISCERN tool, and the Accuracy of Information Index (AOI). Readability was assessed using the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL). Inter-examiner reliability was assessed using intraclass correlation coefficients (ICC). Statistical comparisons between LLMs were performed using a fixed-effects model with post-hoc pairwise analysis. Inter-examiner agreement was further evaluated using Bland–Altman analysis. A p-value of <0.05 was considered statistically significant.
Overall scoring was consistent across examiners, with minor variability observed across domains. A linear mixed-effects model conducted separately for each domain demonstrated that LLM type significantly influenced GQS scores (F = 7.90, p = 0.00), with Gemini and DeepSeek outperforming ChatGPT. No significant differences were observed for AOI (p = 0.44) or DISCERN (p = 0.06). Bland–Altman analysis indicated minimal inter-examiner bias; however, the limits of agreement were relatively wide given the scale range, reflecting variability between individual ratings. Single-measure ICC demonstrated poor agreement (ICC = 0.26), while higher reliability was observed when scores were averaged (ICC = 0.90).
This study offers an exploratory feasibility assessment of LLM evaluation in pediatric dentistry. While the models generally produced high-quality outputs, variations in accuracy and readability, together with significant inter-examiner variability, highlight important methodological challenges. These findings represent preliminary groundwork and require validation in larger, clinically diverse, real-world settings. LLMs may serve as supportive informational tools; however, their outputs should be interpreted cautiously and used to complement, not replace, professional clinical judgment.
ECNU-ChemGPT: A Large Language Model for Chemistry and Retrosynthesis Predictions
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models
🔥 Citations:
0
Abstract: When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
Stability Amid AI Disruption: A Diachronic Linguistic Analysis of MA Thesis and PhD Dissertation Titles in Humanities, 2015–2025
🔥 Citations:
0
Abstract: The integration of large language models such as ChatGPT has raised concerns about stylistic homogenization in scholarly writing. While scientific literature shows clear LLM-driven shifts, e.g., increased lexical markers and reduced cohesion (Bao et al., 2025; Kousha & Thelwall, 2024), this study examines whether similar changes appear in humanities thesis and dissertation titles. Drawing on 8,631 unique MA and PhD titles from ProQuest in History, Religion, Literature, Philosophy, and Musicology, linguistic features were compared between 2015 (pre-AI) and 2025 (post-AI stabilization). Five dimensions were analyzed: word length, informativity, lexical diversity, syntactic structure, and semantic content. Results reveal remarkable stability across most metrics (title length ~12–13 words, informativity ~67%, lexical diversity near 100%). Only a modest increase in compound structures (70% to 74%) occurred, reflecting amplification of existing humanities conventions rather than disruption. The brevity of titles and extended human supervision appear to limit deep LLM intervention. These findings contrast with scientific fields and highlight the resilience of disciplinary norms in graduate scholarship.
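Several of the five dimensions analyzed above (word length, lexical diversity, compound structure) reduce to simple counts over a title string. A toy sketch, assuming lexical diversity is operationalized as a type-token ratio and a compound title is one split by a colon (the study's exact operationalizations are not given in the abstract):

```python
def title_metrics(title):
    """Toy stylometric features for a single thesis title."""
    words = title.lower().split()
    return {
        "word_count": len(words),                            # title length
        "type_token_ratio": len(set(words)) / len(words),    # lexical diversity
        "is_compound": ":" in title,  # two-part "Main: Subtitle" structure
    }

m = title_metrics("A Study of a Study")
print(m)  # 5 words, 3 distinct types -> ratio 0.6, no colon
```

Corpus-level comparison between the 2015 and 2025 cohorts would then aggregate these per-title values, which is why small shifts such as the 70% to 74% compound-structure change are detectable despite the brevity of titles.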
How Large Language Models Balance Internal Knowledge with User and Document Assertions
🔥 Citations:
0
Abstract: Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.
Certainty-Aware Skin Lesion Segmentation with Post-Hoc Reliability Estimation for the Segment Anything Model
DOI:
10.66279/hzkw5y24
🔥 Citations:
0
Abstract: The Segment Anything Model (SAM) represents a major advance in zero-shot visual segmentation, yet it provides purely deterministic outputs without any measure of prediction reliability, a critical limitation for safety-conscious medical imaging applications. This paper introduces a certainty-aware segmentation framework that augments SAM-based zero-shot inference with principled, post-hoc reliability estimation. Three complementary outputs are introduced: a pixel-wise certainty map that identifies spatially localized regions of ambiguity; a global confidence score that provides a scalar measure of overall segmentation trustworthiness; and a quality-flagging mechanism that enables automated screening of unreliable predictions. The framework requires no modification to SAM's architecture and no additional training data, thereby preserving its zero-shot generalization properties. Evaluation on the ISIC 2018 Task 1 skin lesion segmentation benchmark comprising 2,594 dermoscopic images in a fully zero-shot setting yields a mean Dice Similarity Coefficient of 0.820 ± 0.095 and a mean Intersection-over-Union of 0.750 ± 0.101. A strong positive correlation (Pearson r = 0.84, p < 0.001, n = 2,594) is observed between certainty scores and segmentation quality. High-quality segmentations (DSC > 0.80) are consistently associated with certainty scores above 80%, while low-quality predictions (DSC < 0.70) yield certainty scores below 50%. Stratified analysis confirms a mean DSC difference of over 0.25 between high- and low-certainty tiers (Wilcoxon p < 0.001, Cohen's d = 2.31). These results demonstrate that the proposed certainty metrics reliably track segmentation accuracy and provide a practical mechanism for risk-aware deployment of foundation models in clinical environments.
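The abstract does not specify how its three outputs are computed; an agreement-based estimator is one common post-hoc recipe: run the segmenter several times with perturbed prompts and score pixels by how unanimously the runs agree. A hedged numpy sketch under that assumption (the ensemble source, the 0.5 consensus cut, and the flag threshold are all illustrative, not the paper's method):

```python
import numpy as np

def certainty_outputs(masks, flag_threshold=0.5):
    """masks: (k, H, W) binary masks from k perturbed predictions.
    Returns a pixel-wise certainty map, a global confidence score,
    and a boolean quality flag for automated screening."""
    p = masks.mean(axis=0)                           # per-pixel foreground vote rate
    certainty = 1.0 - 2.0 * np.minimum(p, 1.0 - p)   # 1 = unanimous, 0 = split vote
    fg = p > 0.5                                     # consensus foreground region
    global_conf = float(certainty[fg].mean()) if fg.any() else 0.0
    return certainty, global_conf, global_conf >= flag_threshold

masks = np.array([[[1, 1], [0, 0]],
                  [[1, 0], [0, 0]],
                  [[1, 1], [0, 0]]], dtype=float)    # 3 runs on a 2x2 image
cmap, conf, ok = certainty_outputs(masks)
```

Here the top-right pixel gets a 2-of-3 split vote, pulling the global confidence below 1.0 while still clearing the quality flag.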
Understanding Representation Gaps Across Scales in Tropical Tree Species Classification from Drone Imagery
🔥 Citations:
0
Abstract: Accurate classification of tropical tree species from unoccupied aerial vehicle (UAV) imagery remains challenging due to high species diversity and strong visual similarity among species at typical image resolutions (centimeters per pixel). In contrast, models trained on close-up citizen science photographs captured with smartphones achieve strong plant species classification performance. Recent advances in UAV data acquisition now enable the collection of close-up images that are spatially registered with top-view aerial imagery and approach the level of visual detail found in smartphone photographs, with the trade-off that such high-resolution photos cannot be acquired for many trees. In this work, we evaluate the performance of existing methods using paired top-view and close-up UAV imagery collected in a species-rich tropical forest. Through fine-tuning experiments, we quantify the performance gap between vision foundation models and in-domain generalist plant recognition models across both image types (high-resolution close-up versus coarser-resolution top-view imagery). We show that classification performance is consistently higher on close-up images than on top-view aerial imagery, and that this performance gap widens for rare species. Finally, we propose that self-supervised representation alignment across these two spatial scales offers a promising approach for integrating fine-grained visual information into canopy-level species classification models based on top-view UAV imagery. Leveraging high-resolution close-up UAV imagery to enhance canopy-level species classification could substantially improve large-scale monitoring of tropical forest biodiversity.
Personal AI Operating Systems: Architectural Foundations, Ethical Assurance, and Commercial Potential in AI-Native Computing
🔥 Citations:
0
Abstract: Modern software is becoming too complicated, and traditional approaches to managing computer systems do not work well for large, complex ones. This paper describes Personal AI Operating Systems, a major change in how we think about computer systems: Large Language Model agents make the system smarter and able to manage itself. We studied this idea using Design Science Research, comparing traditional management approaches with the new ones and conducting a detailed study of how Large Language Models can help manage computer resources and keep them secure. We found that Personal AI Operating Systems are good at planning and understanding what people want to do, and that they outperform systems that can only react to problems. We also examined the trade-offs between making the system fast and reliable and using machine learning to make decisions. Our results show that Personal AI Operating Systems can make computer systems more resilient and easier to use. Finally, we discuss how to make sure Personal AI Operating Systems are used responsibly, especially when it comes to people's personal data. We believe this is an important area, with a projected market value of $56.3 billion by 2034.
A Graph Neural Network Framework for Structural Side-Channel Vulnerability Assessment in Cryptographic Circuits
DOI:
10.66279/0e5j0983
🔥 Citations:
0
Abstract: Traditional side-channel analysis treats power and electromagnetic traces as temporal sequences, applying statistical or sequence-based machine learning methods without regard for the circuit topology responsible for generating observed leakage. This discards structural information intrinsic to digital circuits: gate connectivity, signal propagation topology, and the hierarchical organization of cryptographic modules. A framework is presented that applies Graph Neural Networks (GNNs) to side-channel vulnerability assessment by modeling circuits as attributed graphs in which nodes represent logic gates, edges represent wire connections, and power measurements are encoded as node features. A complete pipeline is developed spanning Verilog netlist parsing, graph construction, and Graph Convolutional Network (GCN) training with multi-head attention for multi-scale circuit analysis. Evaluation on ten AES-128 circuit implementations demonstrates an 86.4% attack success rate, compared with 68.1% for a CNN-LSTM baseline, with required power traces reduced from 988 to 790. Cross-architecture generalization reaches 63.5% accuracy on unseen circuit families, substantially above the 18.7% random baseline. Interpretable vulnerability heatmaps localize leakage sources at the gate level, enabling pre-silicon security assessment before fabrication.
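The core GCN operation described above, aggregating each gate's features from its wired neighbors, can be sketched without a graph library. A minimal single layer over an adjacency matrix (the mean-aggregation rule and the toy weights are illustrative; the paper's pipeline additionally uses multi-head attention, omitted here):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: mean-aggregate each node's
    neighborhood (with a self-loop), then apply a linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)   # neighborhood sizes
    h = (a_hat @ feats) / deg                # mean aggregation over neighbors
    return np.maximum(h @ weight, 0.0)       # ReLU nonlinearity

# Two gates connected by one wire, each with a scalar power feature
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
feats = np.array([[1.0], [3.0]])
out = gcn_layer(adj, feats, np.array([[1.0]]))
print(out)  # both gates average their joint neighborhood to 2.0
```

Stacking such layers lets power information propagate along the circuit topology, which is exactly the structural signal that trace-only CNN-LSTM baselines discard.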
MIMIC-IV-Phenotype-Atlas (MIPA) : A Publicly Available Dataset for EHR Phenotyping
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
🔥 Citations:
0
Abstract: Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and EvoCodeBench have been created to evaluate LLMs by requiring them to generate code from natural language requirements. However, in enterprise applications and team development, developers typically write code based on structured designs or specifications rather than raw natural language descriptions. This gap between existing benchmarks and real industry development practices means that current benchmark scores may not accurately reflect how much code generation can help automate software development tasks. To address this gap, we propose RealBench, a repository-level code generation benchmark aligned with real-world industry software development practices. Each example includes both natural language requirements and UML diagrams as system design, matching how developers typically receive specifications. Based on the constructed benchmark, we conduct a systematic evaluation of advanced LLMs' code generation capabilities when provided with structured system designs. The experimental results reveal key insights into current LLMs' capabilities for repo-level code generation aligned with real-world software development practices. First, LLMs perform much worse on repo-level code generation, with significant performance gaps among models. Second, LLMs are good at finding and creating modules defined in UML diagrams, but the quality of generated modules is often poor due to grammar and logic errors. Third, generating the entire repository at once is the best strategy for smaller repositories, while a module-by-module strategy works better for complex repositories.
Large language models as a conduit for value shifts in contemporary China
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
An End-to-End Foundation Model-Based Framework for Robust LAI Retrieval Under Cloud Cover
DOI:
10.3390/rs18091308
🔥 Citations:
0
Abstract: Leaf Area Index is a crucial biophysical variable, and its accurate estimation is essential for understanding vegetation dynamics. However, cloud cover significantly restricts optical remote sensing, hindering the generation of spatially continuous Leaf Area Index products. Remote sensing foundation models offer novel solutions to this challenge. This study presents an end-to-end framework based on the fine-tuned Prithvi foundation model for direct LAI retrieval from cloud-contaminated 30 m Harmonized Landsat and Sentinel-2 imagery. By mapping inputs directly to Hi-GLASS reference labels, the proposed architecture processes cloud contamination and vegetation signals simultaneously and circumvents the error propagation inherent in cascaded retrieval pipelines. Results demonstrate that the end-to-end LAI retrieval model significantly outperforms cascaded variants, achieving a superior R2 (0.78) and lower RMSE (0.57). Furthermore, predictive accuracy exhibits a distinct U-shaped trajectory relative to the temporal mean cloud fraction, reaching an inflection point at 50–60% occlusion, which highlights the model’s implicit regularization capacity under severe atmospheric interference. This work establishes that direct feature learning with foundation models offers a more robust and streamlined pathway for generating continuous biophysical products from imperfect optical observations, prioritizing quantitative fidelity over artificial perceptual sharpness.
A three-dimensional multi-modal foundation model for optical coherence tomography
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Large Language Model Counterarguments in Older Adults: Cognitive Offloading or Vulnerability to Moral Persuasion?
🔥 Citations:
0
Abstract: This study examined whether counterarguments generated by large language models (LLMs) influence the moral judgments of younger and older adults and whether these effects vary as a function of dilemma type, cognitive functioning, trust in AI, and prior experience using LLMs. Using the switch and footbridge trolley dilemmas, 130 participants (56 younger adults and 74 older adults) were presented with ChatGPT arguments that opposed their initial judgments. Results revealed that more than 30% of participants reversed their moral judgments in both dilemmas (32.31% in the switch dilemma and 36.92% in the footbridge dilemma), suggesting that LLMs possess substantial persuasive power. Older adults tended to be more likely than younger adults to reverse their judgments, and they showed a significantly greater degree of judgment change in the switch dilemma. Notably, in the emotionally aversive footbridge dilemma, older adults with lower cognitive functioning were significantly more likely to align with the LLM-generated counterargument. General trust in AI and prior experience with LLMs did not predict judgment reversal, supporting a disconnect between trust and persuasion. Instead, individual factors such as lower initial confidence and higher perceived task difficulty were associated with greater susceptibility to AI influence. These findings suggest that, although LLMs may serve as tools for cognitive offloading that compensate for age-related cognitive decline, they may also pose a risk of undue persuasion for cognitively vulnerable individuals.
Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge
🔥 Citations:
0
Abstract: While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs' knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM's output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.
CellPulse: A Foundation Model of Coordinated Gene Dynamics Simulating Viral Infectious Diseases
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Voice Under Revision: Large Language Models and the Normalization of Personal Narrative
🔥 Citations:
0
Abstract: This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs: pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instruction. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
Validating Digital Scribes: A Scoping Review of Evaluation Practices and Clinical Use
🔥 Citations:
0
Abstract: Clinical documentation accounts for a substantial share of clinicians’ working time, contributing to administrative burden and reduced patient-facing care. Artificial intelligence has stimulated the development of digital scribes that combine speech recognition (ASR) and large language models (LLMs) to generate clinical notes from patient–provider conversations, with the aim of automating and supporting this process and reducing this burden. This scoping review explores how digital scribes are currently validated, both technically and clinically, and whether they reliably support clinical workflows. Using the Technology Readiness Level (TRL) framework, we show that most systems remain in early development stages (typically TRL 3 and 4), with only a small number progressing to workflow integration. While digital scribes show potential to improve documentation efficiency, validation methods are highly heterogeneous, most studies rely on simulated or retrospective data, and real-world testing is limited. Consequently, cross-system comparisons and conclusions about clinical performance remain limited. We identified three motivational frames: human-, performance-, and system-oriented, which shape evaluation practices and outcome expectations. These findings suggest that successful implementation depends not only on scribes’ technical capability but also on alignment with clinical needs and documentation styles. Overall, our review underscores the need for standardised validation frameworks and prospective real-world studies to ensure that digital scribes progress beyond their current low TRL and move from experimental promise to safe, effective, and sustainable integration into clinical care. Supplementary Information The online version contains supplementary material available at 10.1007/s10916-026-02392-3.
Efficient Agent Evaluation via Diversity-Guided User Simulation
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature
🔥 Citations:
3
Abstract: Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner’s practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data points, contributing to a 38.6% improvement across 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
BioMiner and BioVista provide a scalable extraction methodology and a rigorous benchmark, paving the way to unlock bioactivity data that previously required extensive human effort. All data and code are available at GitHub.
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
🔥 Citations:
0
Abstract: LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization. We construct variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java. Using these benchmarks, we evaluate the repair success rates of seven LLMs on both original and transformed versions, and analyze the relationship between performance degradation and NLL. Our results show that all evaluated state-of-the-art LLMs exhibit substantial drops in patch generation success rates on transformed benchmarks, ranging from -4.1% for GPT-4o to -15.98% for Llama-3.1. Furthermore, we find that this degradation strongly correlates with NLL on the original benchmarks, suggesting that models perform better on instances they are more likely to have memorized. These findings show that combining MT with NLL provides stronger and more reliable evidence of data leakage, while metamorphic testing alone can help mitigate its effects in LLM-based APR evaluations.
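As an illustration of the kind of semantics-preserving transformation such metamorphic benchmarks are built from, the sketch below uniformly renames function parameters with Python's `ast` module. This is a hypothetical minimal example for intuition only; the paper's actual transformation set targets Java programs (Defects4J, GitBug-Java) and is not detailed in the abstract.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Apply a consistent identifier-renaming map to a parsed program.
    Renaming preserves semantics, so a model that truly generalizes
    should repair the transformed program as well as the original."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_arg(self, node):
        # Rename function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        # Rename every use of the mapped variables.
        node.id = self.mapping.get(node.id, node.id)
        return node

def transform(src, mapping):
    """Return source code with identifiers renamed per `mapping`."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(src))
    return ast.unparse(tree)
```

A benchmark variant is then just `transform(buggy_source, mapping)` paired with the original test suite; any drop in repair success on the variant points to memorization rather than understanding.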
Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach
🔥 Citations:
0
Abstract: Federated Large Language Models (FedLLMs) enable multiple parties to collaboratively fine-tune LLMs without sharing raw data, addressing challenges of limited resources and privacy concerns. Despite data localization, shared gradients can still expose sensitive information through membership inference attacks (MIAs). However, FedLLMs' unique properties, i.e., massive parameter scales, rapid convergence, and sparse, non-orthogonal gradients, render existing MIAs ineffective. To address this gap, we propose ProjRes, the first projection residuals-based passive MIA tailored for FedLLMs. ProjRes leverages hidden embedding vectors as sample representations and analyzes their projection residuals on the gradient subspace to uncover the intrinsic link between gradients and inputs. It requires no shadow models, auxiliary classifiers, or historical updates, ensuring efficiency and robustness. Experiments on four benchmarks and four LLMs show that ProjRes achieves near 100% accuracy, outperforming prior methods by up to 75.75%, and remains effective even under strong differential privacy defenses. Our findings reveal a previously overlooked privacy vulnerability in FedLLMs and call for a re-examination of their security assumptions. Our code and data are available at https://anonymous.4open.science/r/Passive-MIA-5268.
Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks
🔥 Citations:
0
Abstract: Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table. Evaluating a diverse set of tabular representation learning models, we show that which model to use depends on the task and representation level. Our results offer practical guidance for selecting tabular embeddings in real-world applications and lay the groundwork for developing more general-purpose tabular representation models.
Transfer Learning from Homogeneous to Heterogeneous: Fine-Tuning a Pretrained Interatomic Potential for Multicomponent Mo Alloys with Localized Substitutional Alloying
DOI:
10.3390/ma19091715
🔥 Citations:
0
Abstract: Machine learning interatomic potentials (MLIPs) are typically developed for globally ordered homogeneous systems (GOHomS), which exhibit only minor local deviations from equilibrium configurations. Consequently, most existing MLIPs trained on GOHomS often perform inadequately when applied to locally ordered heterogeneous systems (LOHetS), e.g., substitutional alloying elements in multicomponent alloys. To describe doping alloy systems, we develop a fine-tuned MLIP based on the MACE foundation model, specifically tailored for Mo-based dilute alloys containing one or two out of 20 substitutional elements: Cr, Fe, Mn, Nb, Re, Ta, Ti, V, W, Y, Zr, Al, Zn, Cu, Ag, Au, Hg, Co, Ni, and Hf. The model is built on more than 7000 equilibrium and non-equilibrium structures derived from first-principles density functional theory (DFT) calculations. The optimized large-scale fine-tuned model attains state-of-the-art accuracy, with a mean absolute error (MAE) and root-mean-square error (RMSE) of 2.27 meV/atom and 3.79 meV/atom for energy predictions, and 13.83 meV/Å and 24.26 meV/Å for force predictions, respectively. Systematic evaluation under different data-splitting protocols shows that unknown element extrapolation remains challenging under strict dopant hold-out, whereas substantially improved accuracy can be achieved in partial-exposure transfer settings. The fine-tuned models reduce the MAE by approximately 7–10 times compared to models trained from scratch, and by 10–20 times relative to zero-shot foundation models. This performance gain remains consistent across varying dataset sizes (equilibrium vs. non-equilibrium structures) and model scales. Our work illustrates the efficacy of transfer learning from globally ordered homogeneous systems to locally ordered heterogeneous multicomponent alloy environments. However, direct transfer to entirely unknown elements remains challenging, especially when proxy embeddings are employed without fine-tuning. 
Thus, to achieve high accuracy without incurring additional cost, it is essential to include unknown elements in the training dataset while minimizing the number of configurations containing known elements. Moreover, the current findings are primarily validated for dilute Mo-based alloy systems. Extending this approach to more compositionally complex alloy spaces may necessitate additional data and further fine-tuning.
Towards Understanding the Expressive Power of GNNs with Global Readout
🔥 Citations:
0
Abstract: We study the expressive power of message-passing aggregate-combine-readout graph neural networks (ACR-GNNs). Particularly, we focus on the first-order (FO) properties expressible by this formalism. While a tight logical characterisation remains a difficult open question, we make two contributions towards answering it. First, we show that sum aggregation and readout suffice for GNNs to capture FO properties that cannot be expressed in the logic C2 on both directed and undirected graphs. This strengthens known results by Hauke and Wałęga (2026), where aggregation and readout functions are specially crafted for the task. Second, we identify two natural ways of restoring characterisability (with regard to C2) for ACR-GNNs. One option is to limit local aggregation (without imposing restrictions on global readout), whilst the second is to run ACR-GNNs over graphs of bounded degree (but unbounded size). In both cases, the FO properties captured by GNNs are exactly those definable by a formula in graded modal logic with global counting modalities. Our results thus establish lower and upper bounds on how far (fragments of) C2 can be taken to characterise GNNs, and imply that it is indeed the unbounded interaction of aggregation and readout that pushes the logical expressive power of GNNs above C2.
SyntheMol-RL: a flexible reinforcement learning framework for designing easily synthesizable antibiotics
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
🔥 Citations:
0
Abstract: We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8–27 times less training time than the strongest baseline.
When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
🔥 Citations:
0
Abstract: Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.
Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation
🔥 Citations:
0
Abstract: Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction. Although methods based on differential privacy (DP) provide formal privacy guarantees, large language models (LLMs) have more recently also been increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.
Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
🔥 Citations:
0
Abstract: The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
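The two-phase gating idea described in the abstract (score tool summaries against the user intent, enforce access scopes, and promote full schemas only for the top-k survivors) can be sketched as follows. This is an illustrative reconstruction, not the released code: the `embed` function is a toy bag-of-words stand-in for the sentence-embedding ISO score, and all names and data are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence-embedding model: bag-of-words counts.
    A real deployment would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def gate_tools(intent, tools, allowed_scopes, k=2):
    """Score each tool summary against the intent (ISO-style overlap),
    drop tools whose scope the session lacks (state-aware gate), and
    return only the top-k survivors for full-schema promotion."""
    q = embed(intent)
    scored = [(cosine(q, embed(t["summary"])), t)
              for t in tools if t["scope"] in allowed_scopes]
    scored.sort(key=lambda pair: -pair[0])
    return [t["name"] for score, t in scored[:k] if score > 0]
```

Only the gated tools' full JSON schemas would then be injected into context; every other tool stays as a one-line summary, which is where the claimed token savings come from.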
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
🔥 Citations:
0
Abstract: Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
ChatGPT as a Time Capsule: The Limits of Price Discovery
🔥 Citations:
0
Abstract: Frozen large language model (LLM) checkpoints extract information from pre-cutoff public text that is associated with future fundamentals and equity returns beyond standard contemporaneous valuation measures. Because each frozen checkpoint has a fixed knowledge cutoff, it can be interpreted as a compressed representation of publicly available textual information at a given point in time. We treat twelve OpenAI snapshots spanning 2021-2025 as time-stamped summaries of the public textual record and extract a sector-neutral LLM outlook score for roughly 7,000 U.S. equities per cross-section. The outlook score is positively associated with analyst revisions, target-price changes and one-month cross-sectional returns in both Fama-MacBeth regressions and pooled panels with model fixed effects (t = 6.02), after direct controls for market-implied valuation and standard factors. Predictability broadly increases with the return horizon, despite a non-monotonic intermediate dip, and, in the pooled panel, is stronger for firms with high analyst coverage, consistent with the view that the bottleneck is not investor inattention but the cost of aggregating dispersed qualitative information across many documents.
Spatial Metaphors for LLM Memory: A Critical Analysis of the MemPalace Architecture
🔥 Citations:
0
Abstract: MemPalace is an open-source AI memory system that applies the ancient method of loci (memory palace) spatial metaphor to organize long-term memory for large language models; launched in April 2026, it accumulated over 47,000 GitHub stars in its first two weeks and claims state-of-the-art retrieval performance on the LongMemEval benchmark (96.6% Recall@5) without requiring any LLM inference at write time. Through independent codebase analysis, benchmark replication, and comparison with competing systems, we find that MemPalace's headline retrieval performance is attributable primarily to its verbatim storage philosophy combined with ChromaDB's default embedding model (all-MiniLM-L6-v2), rather than to its spatial organizational metaphor per se -- the palace hierarchy (Wings->Rooms->Closets->Drawers) operates as standard vector database metadata filtering, an effective but well-established technique. However, MemPalace makes several genuinely novel contributions: (1) a contrarian verbatim-first storage philosophy that challenges extraction-based competitors, (2) an extremely low wake-up cost (approximately 170 tokens) through its four-layer memory stack, (3) a fully deterministic, zero-LLM write path enabling offline operation at zero API cost, and (4) the first systematic application of spatial memory metaphors as an organizing principle for AI memory systems. We also note that the competitive landscape is evolving rapidly, with Mem0's April 2026 token-efficient algorithm raising their LongMemEval score from approximately 49% to 93.4%, narrowing the gap between extraction-based and verbatim approaches. Our analysis concludes that MemPalace represents significant architectural insight wrapped in overstated claims -- a pattern common in rapidly adopted open-source projects where marketing velocity exceeds scientific rigor.
Large Language Models for Cardiac MRI Diagnosis Based on Standardized Text Descriptions
DOI:
10.1002/jmri.70327
🔥 Citations:
0
Abstract:
Background: MRI is important for cardiac disease evaluation, but accurate diagnosis remains challenging in less experienced centers. Although large language models (LLMs) have shown promise in medical imaging diagnosis, their application in cardiac MRI is limited.
Hypothesis: LLMs may be effective in achieving cardiac MRI diagnosis based on standardized descriptions.
Study Type: Retrospective.
Population: A total of 203 hypertrophic cardiomyopathy, 186 dilated cardiomyopathy, 46 hypertensive heart disease, 198 ischemic cardiomyopathy, 38 constrictive pericarditis, 45 cardiac amyloidosis, 91 myocarditis, and 144 normal controls.
Field Strength/Sequence: Balanced steady‐state free‐precession, short tau inversion recovery, and breath‐hold inversion‐recovery segmented gradient‐echo sequences at 3.0 T.
Assessment: Clinical and cardiac MRI information from each subject was converted into standardized descriptions and input into Generative Pre‐trained Transformer‐4.5 (GPT‐4.5), GPT‐4 Omni (GPT‐4o), Deepseek‐V3, and Deepseek‐R1 LLMs. Cardiac MRI information included LV function, wall thickness and motion, and abnormalities on T2WI, perfusion and late gadolinium enhancement sequences. Each model was asked to generate an imaging diagnosis. In addition, a medical student (8 months' experience) and three radiologists (junior, mid‐level and senior: with 3, 6, and 10 years' experience, respectively) provided diagnoses based on cardiac MRI images and clinical information.
Statistical Tests: Frequency‐weighted sensitivity and specificity were calculated. The diagnostic performances of the LLMs and human readers were compared using the McNemar test with Bonferroni correction. A p value < 0.05 was considered significant.
Results: All LLMs showed excellent frequency‐weighted specificity (0.973–0.983). The frequency‐weighted sensitivities of all LLMs were not significantly different from that of the junior radiologist, were significantly higher than that of the medical student, and significantly inferior to those of the senior radiologist (GPT‐4.5: 0.863, GPT‐4o: 0.821, Deepseek‐V3: 0.843, and Deepseek‐R1: 0.851 vs. junior radiologist: 0.850, all adjusted p = 1.000; vs. medical student: 0.731, all adjusted p < 0.001; vs. senior radiologist: 0.942, all adjusted p < 0.001). Additionally, the mid‐level radiologist achieved a frequency‐weighted sensitivity of 0.895, outperforming all LLMs except GPT‐4.5.
Data Conclusion: LLMs may generate accurate diagnoses from standardized cardiac MRI descriptions, potentially benefiting less experienced physicians.
Evidence Level: 4.
Technical Efficacy: Stage 5.
An autonomous LLM-agent platform for computational binder design and conjugation-aware prioritization of antibody–drug conjugates
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
🔥 Citations:
0
Abstract: Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.
An LLM‐Driven Approach for Power Grid Structure Synthesis and Visualization
DOI:
10.1049/enc2.70037
🔥 Citations:
0
Abstract:
Research in power system is persistently hampered by the scarcity of high‐fidelity, public grid data due to security and privacy constraints. Existing unimodal synthesis methods fail to harmonize physical laws, visual representations and semantic descriptions, obstructing the application of multimodal large language models (LLMs) in the energy sector. To address this, we propose a multimodal synthesis framework that generates aligned datasets comprising physical parameters, single‐line diagrams and natural language descriptions. The framework combines rule‐based topology generation with a two‐stage chain‐of‐thought (CoT) strategy, enabling LLM agents to initialize electrical parameters based on statistical priors. To ensure physical feasibility, an iterative power flow feedback loop is introduced to guarantee convergence. Furthermore, retrieval‐augmented generation is employed to enhance component‐level visual details. Experimental results indicate that the synthesized grids achieve high structural similarity and physical fidelity compared to real‐world benchmarks. We have open‐sourced this physically validated multimodal grid dataset to provide critical foundational support for developing physics‐informed “energy LLMs.”
SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification
🔥 Citations:
0
Abstract: Program verification is a formal technique to rigorously ensure the correctness and fault-freeness of software systems. However, constructing comprehensive interprocedural specifications for full verification obligations is time-consuming and labor-intensive, giving rise to automated specification generation approaches. Despite the significant advancements in these approaches brought by Large Language Models (LLMs), existing LLM-empowered approaches still suffer from significant limitations: they lack effective strategies for handling sizable input programs, and are typically equipped with no mechanisms to evaluate and guarantee the strength of the generated specifications. These limitations impair their ability to extract precise specifications from real-world complicated programs to support the verification of target properties, thereby hindering the applicability of existing approaches in verification tasks on real-world programs. To remedy this gap, we propose SpecSyn, a novel LLM-based specification generation method. SpecSyn first decomposes the input program into individual segments, which are handled respectively by the subsequent iterative specification generation process. Innovatively, we incorporate into the process a specification refinement mechanism based on semantically non-equivalent program mutations and variant discrimination, assessing and enhancing the semantic strength of the generated specifications. Extensive experiments show that SpecSyn maintains precision above 90% and recall above 75%, significantly outperforming existing LLM-based approaches. In further evaluations, SpecSyn successfully handles 1071 out of 1365 target properties for open-source programs, proving its applicability to real-world program verification tasks.
Prefix Parsing is Just Parsing
🔥 Citations:
0
Abstract: Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling, psycholinguistic analysis, and syntactically constrained generation from large language models. We introduce the prefix grammar transformation, an efficient reduction of prefix parsing to ordinary parsing. Given a grammar, our method constructs another grammar that generates exactly the prefixes of its original strings. Prefix parsing is then solved by applying any ordinary parsing algorithm on the transformed grammar without modification. The reduction is both elegant and practical: the transformed grammar is only a small factor larger than the input, and any optimized implementation can be used directly, eliminating the need for bespoke prefix-parsing algorithms. We also present a strategy, based on algorithmic differentiation, for computing the next-token weight vector, i.e., the prefix weights of all one-token extensions, enabling efficient prediction of the next token. Together, these contributions yield a simple, general, and efficient framework for prefix parsing.
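For unweighted CFGs, the construction the abstract describes can be sketched as follows: for each production A -> X1..Xn and each position k, the prefix nonterminal of A may derive X1..X(k-1) followed by the prefix nonterminal of Xk, with a terminal's prefix nonterminal deriving either the terminal or the empty string. This is an illustrative reconstruction from the abstract (assuming a grammar with no useless symbols), not the paper's exact weighted construction.

```python
from collections import deque

def prefix_grammar(rules, start):
    """Given CFG rules {A: [rhs_tuple, ...]} (symbols absent from `rules`
    are terminals), return (new_rules, new_start) where new_start derives
    exactly the prefixes, including "", of strings derivable from `start`.
    Assumes no useless symbols, so any partial yield can be completed."""
    terminals = {s for prods in rules.values() for rhs in prods
                 for s in rhs if s not in rules}
    pre = {nt: [tuple(r) for r in prods] for nt, prods in rules.items()}
    for t in terminals:
        pre["Pre!" + t] = [(t,), ()]          # prefix of a terminal: itself or empty
    for nt, prods in rules.items():
        pre["Pre!" + nt] = []
        for rhs in prods:
            if not rhs:
                pre["Pre!" + nt].append(())   # A -> eps: only prefix is eps
            for k in range(len(rhs)):
                # Full yields of X1..X(k-1), then a prefix of Xk's yield.
                pre["Pre!" + nt].append(tuple(rhs[:k]) + ("Pre!" + rhs[k],))
    return pre, "Pre!" + start

def generate(rules, start, max_len):
    """Enumerate terminal strings of length <= max_len by breadth-first
    leftmost expansion; forms are pruned once their terminal count exceeds
    max_len (terminals never disappear, so pruning is sound)."""
    out, seen, q = set(), {(start,)}, deque([(start,)])
    while q:
        form = q.popleft()
        nts = [i for i, s in enumerate(form) if s in rules]
        if not nts:
            out.add("".join(form))
            continue
        for rhs in rules[form[nts[0]]]:
            new = form[:nts[0]] + tuple(rhs) + form[nts[0] + 1:]
            if sum(1 for s in new if s not in rules) <= max_len and new not in seen:
                seen.add(new)
                q.append(new)
    return out
```

Running any ordinary parser or enumerator on the transformed grammar then answers prefix queries directly, which is the point of the reduction.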
Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity
🔥 Citations:
0
Abstract: Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero- or few-shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE-CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention-deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE-CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub-LoRA (Low-Rank Adaptation) as a fine-tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta-analyses. RE-CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE-Biomedical-Computing-Group/RE-CONFIRM.
Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
🔥 Citations:
0
Abstract: Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that distills textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
🔥 Citations:
0
Abstract: While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling and logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response
🔥 引用:
0
Abstract: Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype, post-trained on a small set of synthetic and real data, that achieves overall F1 and accuracy comparable to frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.
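The model-expert oracle is a best-of-2 upper bound: a question counts as correct if either answer source got it right. A minimal sketch (function name and data layout are our own, not ARFBench's):

```python
def oracle_accuracy(model_preds, expert_preds, gold):
    """Best-of-2 oracle: a question is correct if either the model
    or the human expert answered it correctly. This upper-bounds
    what a perfect per-question selector over the two could achieve."""
    correct = sum(
        (m == g) or (e == g)
        for m, e, g in zip(model_preds, expert_preds, gold)
    )
    return correct / len(gold)
```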
When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation
🔥 引用:
0
Abstract: Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
InstructPLM: Aligning protein language models to follow protein structural instructions
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Evaluating and Securing Power Systems against Vulnerabilities Introduced by Large Language Models
🔥 引用:
0
Abstract: Applying Large Language Models (LLMs) to contemporary power systems is an exciting new direction for improving operational efficiency and decision-making. Nevertheless, the move may carry unanticipated security risks. This paper examines the risks that LLMs may pose to power networks and stresses the urgency of research and remediation. Securing large language models in a power-monitoring context is a challenging but vital task. Through thorough security measures, a security-conscious culture, and continuous monitoring of new threats and technologies, we can maximize the benefits of LLMs while minimizing their hazards. As information security experts, it is our duty to pioneer this new field and ensure that our security protocols keep pace with the increasing sophistication of AI systems. Among the most critical LLM security flaws are those that enable prompt injection attacks. These attacks exploit LLMs' fundamental features by deliberately feeding them data that causes them to operate in unexpected ways or leak private information. LLMs have seen extensive use since the introduction of commercially available systems like ChatGPT, and power monitoring is one area of cybersecurity where their adoption is rising. Safeguarding these systems against cyber-attacks is critical, since they are essential to societal stability and the nation's energy supply. Keeping them resilient and reliable requires detecting unexpected vulnerabilities, especially zero-day attacks, and LLMs offer one potential way to improve these detection capabilities. Our approach integrates power-system standards and threat intelligence with traditional anomaly detection, LLM-assisted reasoning over code, configurations, and logs, and protocol-aware telemetry.
To uncover undisclosed power monitoring system weaknesses, we used LSTM and GRU models. These models excel at analyzing sequential data such as sensor readings, system logs, and network traffic, making them well suited to this job. By identifying unusual activity that deviates from typical system operation, LSTMs and GRUs can discover new, "zero-day" vulnerabilities, in contrast to conventional security technologies that depend on predetermined attack signatures. We also built an LLM component using TinyLlama Chat 1.1, which takes the processed packet data and the extracted context as input and outputs a user-friendly summary of the packet file. Through these machine learning models, the program provides a concise, well-organized, and straightforward overview of the network's operations.
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
🔥 引用:
0
Abstract: Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to construct Key-Value (KV) caches. We present SparKV, an adaptive KV loading framework that combines cloud-based KV streaming with on-device computation. SparKV models the cost of individual KV chunks and decides whether each chunk should be streamed or computed locally, while overlapping the two execution paths to reduce latency. To handle fluctuations in wireless connectivity and edge resource availability, SparKV further refines offline-generated schedules at runtime to rebalance communication and computation costs. Experiments across diverse datasets, LLMs, and edge devices show that SparKV reduces Time-to-First-Token by 1.3x to 5.1x with negligible impact on response quality, while lowering per-request energy consumption by 1.5x to 3.3x, demonstrating its robustness and practicality for real-world on-device deployment.
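The stream-or-compute decision at the heart of such a scheduler can be sketched with a simple per-chunk cost model (purely illustrative; SparKV's actual scheduler also overlaps both paths and adapts the plan at runtime):

```python
def schedule_chunks(chunks, bandwidth, flops):
    """Greedy per-chunk decision: stream the cached KV bytes from the
    cloud or recompute the chunk's prefill locally, whichever is cheaper
    under the current link bandwidth (bytes/s) and device compute (FLOP/s).
    `chunks` is a list of (kv_bytes, prefill_flops) tuples.
    Illustrative cost model, not SparKV's actual scheduler."""
    plan = []
    for kv_bytes, prefill_flops in chunks:
        stream_cost = kv_bytes / bandwidth      # seconds on the wire
        compute_cost = prefill_flops / flops    # seconds on the device
        plan.append("stream" if stream_cost <= compute_cost else "compute")
    return plan
```

A runtime refinement step would simply re-run this decision with updated `bandwidth` and `flops` estimates as conditions fluctuate.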
ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
🔥 引用:
0
Abstract: Vision-Language-Action systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose ReCAPA, a predictive alignment and planning framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.
Job Skill Extraction via LLM-Centric Multi-Module Framework
🔥 引用:
0
Abstract: Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries. On six public span-labeled corpora of job-ad sentences across sectors and languages, SRICL achieves substantial STRICT-F1 improvements over GPT-3.5 prompting baselines and sharply reduces invalid tags and hallucinated spans, enabling dependable sentence-level deployment in low-resource, multi-domain settings.
One Workflow Doesn't Fit All: Adaptive Workflows for Edge AI Development
🔥 引用:
0
Abstract: Edge AI integrates AI techniques with heterogeneous mobile devices to enable perception and actuation in real-world environments, facilitating applications such as smart sensing [1] and healthcare [2]. To reduce the burden of developing such applications, recent works [3, 4] build agentic systems based on Large Language Models (LLMs) to automatically translate user requirements into executable edge AI programs. However, these systems typically rely on handcrafted, predefined agentic workflows, and therefore often struggle to handle diverse mobile devices, heterogeneous runtime environments, and dynamic resource constraints in real-world scenarios. As a result, even well-engineered agents may struggle to accommodate such variability, suffering from inflexibility, limited adaptability, and high maintenance costs.
PriRS: an AI-driven framework for privacy and reliability in cyber–physical–social systems data sharing
🔥 引用:
0
Abstract: Cyber–physical–social systems (CPSS) impose stringent requirements for data sharing security and regulatory compliance. However, existing solutions fail to bridge the gap between rigid smart contracts and flexible social regulations. The core research question is: how can we enforce complex, human-readable regulatory policies within rigid blockchain transactions without creating scalability bottlenecks? To address this, we propose PriRS, an AI-driven privacy and reliability framework. First, we utilize a large language model (LLM)-based compliance oracle within a trusted execution environment (TEE). This agent intelligently analyzes regulations to ensure strict compliance before data authorization. Second, we introduce a “majority voting group data sharing” mechanism. By combining Shamir’s secret sharing with conditional proxy re-encryption, we move heavy coordination off-chain. This ensures fairness and significantly improves throughput. Experimental results on the Sepolia testnet demonstrate that PriRS reduces on-chain gas consumption by 92.3% compared to state-of-the-art schemes. The AI-driven oracle achieves 96.0% accuracy and 98.0% precision on policy violation detection, while maintaining 100% deterministic consistency across repeated runs in the TEE. Consequently, PriRS provides a highly efficient, secure, and legally compliant foundation for decentralized CPSS data markets.
Interactive Role-Playing Game System Using Dice-Based Mechanics and AI Narration
DOI:
10.55041/ijsrem61180
🔥 引用:
0
Abstract: Tabletop role-playing games such as Dungeons & Dragons depend on a human Game Master to adjudicate rules, resolve player decisions through probabilistic dice mechanics, and weave a coherent narrative across sessions. Automating this role using Large Language Models (LLMs) alone leads to severe reliability problems: unconstrained generative models hallucinate game state, permit physically impossible player actions, and fail to maintain consistent numerical variables over extended play. This paper presents a hybrid software architecture that cleanly separates deterministic game logic from constrained narrative generation. The backend employs a programmatic Python rules engine grounded in a comprehensive static rulebook covering 20 monsters, 18 weapons, 26 spells, 13 NPC archetypes, and 8 puzzle templates to resolve all combat, skill checks, and state mutations with absolute mathematical precision. Six specialized LLM micro-agents (a Validator, Parser, Story Engine, Director, Enemy AI, and Narrator) handle only semantic tasks, each constrained by typed Pydantic schemas and post-generation rulebook validation. A React-based frontend communicates with this backend via both REST endpoints and WebSocket channels, enabling both single-player and real-time multiplayer sessions with up to four concurrent players. Empirical benchmarks over 100 adversarial and valid test prompts demonstrate that the hybrid Validator intercepts 95.0% of rule-violating inputs while incorrectly rejecting only 2.5% of legitimate gameplay actions, compared to an unconstrained LLM baseline that permitted 88% of adversarial inputs. Deterministic rule execution achieves sub-millisecond latency (≈0.19 ms), representing a speedup exceeding 6,000× over equivalent LLM-mediated resolution.
Keywords: Large Language Models, Tabletop RPG Automation, Deterministic Game Engines, WebSocket Communication, Pydantic Schema Validation, Prompt Injection Defense.
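The typed-schema-plus-rulebook pattern described in this abstract can be sketched with the standard library (the paper uses Pydantic schemas; the entity names here are illustrative): the parser's output must fit a fixed typed structure, and a deterministic post-generation check rejects anything outside the static rulebook before any game state mutates.

```python
from dataclasses import dataclass

# Illustrative stand-in for entries from the static rulebook
RULEBOOK_WEAPONS = {"longsword", "dagger"}

@dataclass
class AttackAction:
    """Typed schema for one parsed player action; mirrors the role of
    a schema the LLM Parser's output must conform to."""
    actor: str
    weapon: str
    target: str

def validate(action: AttackAction) -> AttackAction:
    """Post-generation rulebook check: reject actions referencing
    entities outside the static rulebook, before any state mutation."""
    if action.weapon not in RULEBOOK_WEAPONS:
        raise ValueError(f"unknown weapon: {action.weapon}")
    return action
```

The deterministic check is what gives sub-millisecond adjudication: no model call is needed to reject an out-of-rulebook action.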
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
🔥 引用:
0
Abstract: Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.
Use of Artificial Intelligence and Machine Learning in Rapid Drug Discovery and Pharmacovigilance
🔥 引用:
0
Abstract: The increasing global burden of disease, rising research and development costs, and high attrition rates in pharmaceutical pipelines underscore the need for more efficient approaches to therapeutic development and drug safety monitoring. Artificial intelligence (AI) and machine learning (ML) have emerged as data-driven tools with the potential to improve multiple stages of the pharmaceutical lifecycle.
This narrative review is designed to provide a structured and critical overview of the use of AI and ML in drug discovery and drug safety surveillance. A comprehensive literature search was performed to identify relevant studies using major electronic databases, with emphasis on publications from 1997 to March 2025. The studies were selected based on the inclusion criteria.
The findings of the study show that AI and ML are being used in drug discovery, drug development, and drug safety surveillance. These technologies have the potential to provide predictive models, integrate heterogeneous biomedical data, and analyze real-world data to detect adverse drug reactions. Deep learning and natural language processing have been found to be useful tools to improve early risk detection.
However, some limitations have also been found. These include the quality of the data, bias in AI and ML models, lack of interpretability of AI and ML models, lack of external validation, and lack of real-world implementation. These limitations need to be addressed to make AI and ML more useful tools for drug discovery and drug safety surveillance.
Overall, while AI and ML offer meaningful opportunities to enhance drug discovery and pharmacovigilance, their impact remains dependent on rigorous validation, improved data governance, and alignment with clinical and regulatory frameworks. Continued research and context-specific implementation strategies will be essential to support their effective and equitable integration into pharmaceutical research and healthcare systems.
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models
🔥 引用:
0
Abstract: Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context. Our extensive evaluation across state-of-the-art models, including those from OpenAI, Anthropic, Google Gemini, Meta, and prominent open-source alternatives, uncovers significant variations in resilience to TTI attacks, with only select architectures exhibiting substantial inherent robustness. Our automated black-box evaluation framework also uncovers previously unknown model-specific vulnerabilities and attack-surface patterns, especially within medical and other high-stakes domains. We further compare TTI against established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.
Short-Term Continuous Glucose Forecasting with Large Language Model-Derived Nutrient Estimates from Real-World Chinese Dietary Records
DOI:
10.34133/hds.0471
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair
🔥 引用:
0
Abstract: PGD adversarial training, the standard robustness method, can reduce Jacobian Frobenius norm yet worsen clean-input geometry (e.g., TDI 1.336 vs. ERM 1.093). We show this is not an implementation artifact but a theorem-level consequence of supervised learning. We prove that any encoder minimizing supervised loss must retain non-zero sensitivity along directions correlated with training labels, including directions that are nuisance at test time. This holds across proper scoring rules, architectures, and dataset sizes. We call this the geometric blind spot of supervised learning. This theorem unifies four empirical phenomena often treated separately: non-robust features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. It also explains why suppressing sensitivity in one adversarial direction can redistribute sensitivity elsewhere. We introduce Trajectory Deviation Index (TDI), a diagnostic of geometric isotropy. Unlike CKA, intrinsic dimension, or Jacobian Frobenius norm alone, TDI captures the failure mode above. In our experiments, PGD attains low Frobenius norm but high TDI, while PMH attains the lowest TDI with one additional training term and no architectural changes. Across seven tasks, BERT/SST-2, and ImageNet ViT-B/16 (backbone family underlying CLIP/DINO/SAM), the blind spot is measurable and repairable. It appears at foundation-model scale, worsens with model scale and task-specific fine-tuning, and is substantially reduced by PMH. PMH also leads on non-Gaussian corruption types (blur/brightness/contrast) without corruption-specific training.
AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
🔥 引用:
0
Abstract: Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by 17.0 points over the strongest baseline, and improves attack success by up to 16 points on frontier targets with low baseline success rates. Ablations against parametric and strategy-library baselines suggest that these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.
On Reasoning Behind Next Occupation Recommendation
🔥 引用:
0
Abstract: In this work, we develop a novel reasoning approach to enhance the performance of large language models (LLMs) in future occupation prediction. In this approach, a reason generator first derives a "reason" for a user from his/her past education and career history. The reason summarizes the user's preference and is used as the input of an occupation predictor to recommend the user's next occupation. This two-step occupation prediction approach is, however, non-trivial, as LLMs are not aligned with career paths or the unobserved reasons behind each occupation decision. We therefore propose to fine-tune LLMs to improve their reasoning and occupation prediction performance. We first derive high-quality oracle reasons, as measured by factuality, coherence, and utility criteria, using an LLM-as-a-Judge. These oracle reasons are then used to fine-tune small LLMs to perform reason generation and next occupation prediction. Our extensive experiments show that: (a) our approach effectively enhances LLMs' accuracy in next occupation prediction, making them comparable to fully supervised methods and outperforming unsupervised methods; (b) a single LLM fine-tuned to perform both reason generation and occupation prediction outperforms two LLMs fine-tuned to perform the tasks separately; and (c) next occupation prediction accuracy depends on the quality of the generated reasons. Our code is available at https://github.com/Sarasarahhhhh/job_prediction.
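The two-step pipeline reads as a simple composition: generate a reason from the history, then condition the predictor on history plus reason. The callables below are hypothetical stand-ins for the two fine-tuned LLMs, and the prompt templates are our own:

```python
def predict_next_occupation(history, reason_llm, predictor_llm):
    """Two-step pipeline: (1) derive a textual 'reason' summarizing the
    user's preference from their education/career history; (2) condition
    the occupation predictor on the history plus that reason.
    `reason_llm` and `predictor_llm` are hypothetical prompt -> text
    callables standing in for the paper's fine-tuned small LLMs."""
    reason = reason_llm(f"Summarize the career preference behind: {history}")
    return predictor_llm(
        f"History: {history}\nReason: {reason}\nNext occupation:"
    )
```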
DxDirector: an agentic large language model driving the full-process clinical diagnosis
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Mitigating Execution Hallucinations and Computational Inflation in Agentic RAG via Strict Protocol Boundaries
🔥 引用:
0
Abstract: The deployment of large language models as autonomous retrieval agents over unstructured knowledge bases gives rise to a persistent structural conflict between probabilistic neural generation and deterministic physical execution. While agentic paradigms facilitate complex multi-hop retrieval, their unconstrained generative nature frequently violates strict syntactic requirements. This systemic vulnerability directly triggers execution hallucinations, such as fabricated API parameters or malformed schemas. Consequently, these syntax-driven failures force systems into redundant trial-and-error recovery loops, resulting in severe computational inflation that degrades both token efficiency and inference latency. To resolve this reliability–efficiency dilemma, this paper proposes RAG-CoT-MCP, a neuro-symbolic architecture that orthogonally decouples probabilistic cognitive planning from deterministic tool execution. By integrating the Model Context Protocol (MCP) as a strict system-level validation boundary, the framework ensures that latent reasoning trajectories manifest exclusively as syntactically valid operations. Exhaustive empirical evaluations across four disparate datasets—incorporating a multi-dimensional LLM-as-a-Judge framework, rigorous ablation studies, and granular cost tracking—validate the proposed approach. The findings demonstrate that RAG-CoT-MCP compresses network-level execution error rates from 45.2% (in unconstrained baselines) to a mere 6.0%, yielding substantial enhancements in semantic comprehensiveness and logical coherence compared to existing baselines. Counterintuitively, by proactively intercepting malformed actions and redirecting computational resources from reactive error handling to valid causal deduction, the framework drastically reduces redundant token consumption and achieves the lowest overall inference latency. 
Ultimately, this study establishes that deterministic execution constraints do not hinder agentic flexibility; rather, they serve as a fundamental prerequisite for deploying robust, high-speed, and cost-effective knowledge retrieval systems.
Recovering Clinical Detail in AI-Generated Responses for Low Back Pain Through Prompt Design
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Object Detection in Optical Remote Sensing Images: A Systematic Review of Methods, Benchmarks, and Operational Applications
DOI:
10.3390/rs18091289
🔥 引用:
0
Abstract: Object detection in optical remote sensing imagery has emerged as a crucial task in computer vision, with applications ranging from environmental monitoring to disaster management, precision agriculture, and urban planning. This review systematically examines current methodologies, categorising them into four principal approaches: (1) template matching-based methods, which leverage predefined patterns for object identification; (2) knowledge-based methods, which incorporate geometric and contextual information to enhance detection accuracy; (3) object-based image analysis (OBIA), which segments images into meaningful objects using spectral and spatial properties; and (4) machine learning-based methods, particularly deep convolutional neural networks (CNNs), which have revolutionised the field through automatic feature learning. Each methodology's performance characteristics, computational requirements, and suitability for different remote sensing applications are analysed. Our systematic review, following PRISMA guidelines, analysed 189 studies published from 2010 to 2025, of which 73 provided quantitative results on standard benchmarks. The three most critical challenges identified are as follows: (1) the annotation bottleneck, as dense bounding-box labelling of remote sensing imagery remains highly labour-intensive for deep learning approaches; (2) extreme scale variation spanning 2–3 orders of magnitude within single scenes; and (3) domain adaptation failures when models encounter new geographic regions or sensor characteristics. This review identifies critical research gaps and proposes prioritised future directions, emphasising foundation models for zero-shot detection, efficient architectures for resource-constrained deployment, and standardised benchmarks with size-specific metrics.
The analysis provides practitioners with evidence-based decision frameworks for method selection and researchers with a roadmap for advancing object detection in remote sensing applications.
Autonomous multimodal agents enable transparent, spatiotemporal reconstruction of immune dynamics in pancreatic cancer progression
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Node-Sampling: adaptive multi-agent optimization in medical education
🔥 引用:
0
Abstract: Differences in prior knowledge among incoming medical students pose a persistent challenge for universities. To promote more individualized and equitable preparation, a large language model-based learning platform is being developed at the University Medical Center Hamburg-Eppendorf. A central component of this platform is the automated generation of multiple-choice questions (MCQs) from curated medical materials. However, ensuring their educational quality remains difficult, particularly when relying on smaller, locally deployed language models.
This study introduces Node-Sampling, a self-optimizing multi-agent approach for improving MCQ quality. The method identifies efficient refinement strategies by modeling agents as an adaptive sequence optimized through the REINFORCE algorithm.
Expert evaluations showed that Node-Sampling enhances the quality of question stems significantly compared to a fixed baseline. Importantly, Node-Sampling achieved this performance using an effective three-agent configuration, requiring only 33% of the original resources. Results for answer options were less consistent.
The results highlight the potential of adaptive multi-agent optimization to strengthen automated question refinement. Node-Sampling therefore presents a sustainable and promising approach to better MCQ quality and supports more effective and personalized preparation for medical students.
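The REINFORCE update underlying such agent-sequence optimization can be sketched for a categorical policy over which refinement agent to run next (a toy update on raw logits; the paper's actual parameterization and reward may differ):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update: raise the log-probability of the sampled
    agent in proportion to the observed quality reward. The gradient of
    log softmax w.r.t. the logits is one-hot(action) - probs."""
    probs = softmax(logits)
    return [
        l + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
```

Repeated over sampled agent sequences, positive-reward updates concentrate probability mass on the refinement strategies that expert ratings favor, which is how a shorter three-agent configuration can be discovered.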
Dhatu-Former: Redesigning Transformer Architectures Through Pāṇini's Aṣṭādhyāyī
🔥 引用:
0
Abstract: Contemporary large language models (LLMs) rely on sub-word tokenizers and attention mechanisms that treat every language as a statistical surface-form distribution. This paper proposes Dhatu-Former, a transformer architecture that internalizes the formal linguistic machinery of Pāṇini's Aṣṭādhyāyī, the oldest known generative grammar. We hypothesize that (i) morphologically aware, root-based (dhātu-based) tokenization can reduce vocabulary size and sequence length by 40–60%, (ii) hierarchical attention guided by Pāṇinian derivation trees can yield sparse, interpretable attention with O(n log n) complexity, and (iii) a hybrid symbolic-neural reasoning layer that executes sūtra-style rewrite rules can substantially reduce hallucination while enabling unified language, math, and logic reasoning. We further introduce a modular Retrieval-Augmented Generation (RAG) subsystem grounded in Sanskrit lexical databases (Amarakośa, Dhātupāṭha) and a continual learning framework inspired by the paribhāṣā sūtras (meta-rules) of the Aṣṭādhyāyī. We present order-of-magnitude parameter reduction estimates, architectural blueprints with TikZ diagrams, and a research roadmap for empirical validation. This is a position paper; no experiments have been conducted.
Human like intuitive behavior and reasoning biases emerge in large language models but disappear in ChatGPT
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
How English Print Media Frames Human-Elephant Conflicts in India
🔥 引用:
0
Abstract: Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025. Using a multi-model sentiment framework that combines long-context transformers, large language models, and a domain-specific Negative Elephant Portrayal Lexicon, we quantify sentiment, extract rationale sentences, and identify linguistic patterns that contribute to negative portrayals of elephants. Our findings reveal a dominance of fear-inducing and aggression-related language. Since the media framing can shape public attitudes toward wildlife and conservation policy, such narratives risk reinforcing public hostility and undermining coexistence efforts. By providing a transparent, scalable methodology and releasing all resources through an anonymized repository, this study highlights how Web-scale text analysis can support responsible wildlife reporting and promote socially beneficial media practices.
A Trust-Embedded Learning Architecture for Discovering Alternative Drug Indications with Verifiable Computational Integrity
Citations: 0
Abstract: Drug repurposing has emerged as an effective strategy in modern healthcare, enabling researchers to discover new therapeutic uses for existing drugs while significantly reducing development time and cost. Traditional drug discovery methods rely heavily on manual laboratory experiments, expert analysis, and prolonged clinical trials, making the process slow, expensive, and limited in scalability. These approaches struggle to handle complex and high-dimensional biomedical data, leading to delayed insights and reduced efficiency. With the rapid growth of healthcare data, there is an increasing need for intelligent and automated systems that can efficiently analyze drug characteristics and predict alternative therapeutic applications. Additionally, conventional systems often lack transparency and strong security mechanisms, making clinical data vulnerable to tampering and reducing trust in research outcomes. To address these challenges, the proposed framework integrates Machine Learning (ML), Deep Learning (DL), and Blockchain technologies to develop a secure and intelligent drug repurposing system. The framework employs Random Forest (RF) as a baseline model and a Two-Dimensional Convolutional Neural Network (CNN2D) as an advanced model to improve prediction accuracy. The CNN2D effectively captures complex feature patterns in structured drug data, enabling precise identification of potential new disease treatments. Furthermore, Web3-based Blockchain technology ensures secure storage of user data, clinical interactions, and experimental records by providing immutability, transparency, and data integrity. By combining Artificial Intelligence (AI)-driven analytics with Blockchain-based security, the system enhances prediction performance, automates decision-making, and ensures reliable data management, offering a scalable and efficient solution for accelerating drug discovery and supporting healthcare innovation.
Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
Citations: 0
Abstract: Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation poses significant risks in high-stakes domains such as healthcare, law, and scientific communication, where trust and verifiability are paramount. In this paper, we introduce DAVinCI - a Dual Attribution and Verification framework designed to enhance the factual reliability and interpretability of LLM outputs. DAVinCI operates in two stages: (i) it attributes generated claims to internal model components and external sources; (ii) it verifies each claim using entailment-based reasoning and confidence calibration. We evaluate DAVinCI across multiple datasets, including FEVER and CLIMATE-FEVER, and compare its performance against standard verification-only baselines. Our results show that DAVinCI significantly improves classification accuracy, attribution precision, recall, and F1-score by 5-20%. Through an extensive ablation study, we isolate the contributions of evidence span selection, recalibration thresholds, and retrieval quality. We also release a modular DAVinCI implementation that can be integrated into existing LLM pipelines. By bridging attribution and verification, DAVinCI offers a scalable path to auditable, trustworthy AI systems. This work contributes to the growing effort to make LLMs not only powerful but also accountable.
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
Citations: 0
Abstract: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Citations: 0
Abstract: Vision–language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
CARE: Counselor-Aligned Response Engine for Online Mental-Health Support
Citations: 0
Abstract: Mental health challenges are increasing worldwide, straining emotional support services and leading to counselor overload. This can result in delayed responses during critical situations, such as suicidal ideation, where timely intervention is essential. While large language models (LLMs) have shown strong generative capabilities, their application in low-resource languages, especially in sensitive domains like mental health, remains underexplored. Furthermore, existing LLM-based agents often struggle to replicate the supportive language and intervention strategies used by professionals due to a lack of training on large-scale, real-world datasets. To address this, we propose CARE (Counselor-Aligned Response Engine), a GenAI framework that assists counselors by generating real-time, psychologically aligned response recommendations. CARE fine-tunes open-source LLMs separately for Hebrew and Arabic using curated subsets of real-world crisis conversations. The training data consists of sessions rated as highly effective by professional counselors, enabling the models to capture interaction patterns associated with successful de-escalation. By training on complete conversation histories, CARE maintains the evolving emotional context and dynamic structure of counselor-help-seeker dialogue. In experimental settings, CARE demonstrates stronger semantic and strategic alignment with gold-standard counselor responses compared to non-specialized LLMs. These findings suggest that domain-specific fine-tuning on expert-validated data can significantly support counselor workflows and improve care quality in low-resource language contexts.
SQLyzr: A Comprehensive Benchmark and Evaluation Platform for Text-to-SQL
Citations: 0
Abstract: Text-to-SQL models have significantly improved with the adoption of Large Language Models (LLMs), leading to their increasing use in real-world applications. Although many benchmarks exist for evaluating the performance of text-to-SQL models, they often rely on a single aggregate score, lack evaluation under realistic settings, and provide limited insight into model behaviour across different query types. In this work, we present SQLyzr, a comprehensive benchmark and evaluation platform for text-to-SQL models. SQLyzr incorporates a diverse set of evaluation metrics that capture multiple aspects of generated queries, while enabling more realistic evaluation through workload alignment with real-world SQL usage patterns and database scaling. It further supports fine-grained query classification, error analysis, and workload augmentation, allowing users to better diagnose and improve text-to-SQL models. This demonstration showcases these capabilities through an interactive experience. Through SQLyzr's graphical interface, users can customize evaluation settings, analyze fine-grained reports, and explore additional features of the platform. We envision that SQLyzr facilitates the evaluation and iterative improvement of text-to-SQL models by addressing key limitations of existing benchmarks. The source code of SQLyzr is available at https://github.com/sepideh-abedini/SQLyzr.
A Deep Reinforcement Learning and Evolutionary Optimization-Based Collaborative Control Network for Power Systems with Multi-Terminal Information Fusion
Citations: 0
Abstract: To improve the control performance of power systems in complex environments, this study proposes a cooperative control method integrating multi-terminal information acquisition, deep reinforcement learning (Deep Deterministic Policy Gradient (DDPG)), and evolutionary algorithms (Particle Swarm Optimization (PSO), Genetic Algorithm (GA)). The constructed control framework integrates multi-source monitoring data, realizes state information enhancement through feature-level fusion, and optimizes control decisions using an evolutionary reinforcement learning strategy. Experiments are carried out on a power system simulation platform built in Simulink. The data used are generated by the platform simulation, covering the dynamic operation of a 50-node system under typical disturbance conditions, referenced to the parameter configuration of the State Grid Manual, and verified against the experience of power system engineers. Typical operating conditions such as wind power integration, sudden load changes, and missing sensor data are simulated to evaluate the control effect and robustness of the model. The results show that the proposed method is superior to existing methods such as the Transformer-based Control Algorithm (TCA), Graph Neural Network (GNN), and Reinforcement Learning with Curiosity-driven Exploration (RLCE) in terms of voltage prediction accuracy (92.5%), F1 score (0.912), and system stability (97.5%). The training convergence speed is improved by
From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
Citations: 0
Abstract: Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
How do ChatGPT and other generative artificial intelligence models perform on foot and ankle questions from the Brazilian Orthopedics and Traumatology Association’s TEOT and TARO exams? The implications of large language models for medical education
Citations: 0
Abstract: Introduction: Generative artificial intelligence (AI) is increasingly used for study and rapid consultation. We assessed how leading large language models (LLMs) perform on Brazilian Orthopedics and Traumatology Association (SBOT) Foot and Ankle exam questions. Methods: Cross-sectional benchmarking of 107 foot and ankle questions from TEOT and TARO exams. Items were classified into the following categories: adult trauma, pediatric trauma, anatomy/imaging, physical examination, congenital/pediatric disorders, and adult disorders. Four generative AI models were queried with standardized prompts; responses were scored against the official key. Outcome: overall accuracy. Results: ChatGPT (GPT-5 Thinking) had the highest accuracy (86.91%), followed by Gemini (79.43%). Accuracy differed by domain, with lower performance in pediatric trauma and congenital disorders. No model achieved perfect agreement with the key. Conclusions: Popular generative AI models performed well on SBOT foot and ankle exam questions, with ChatGPT (GPT-5 Thinking) scoring highest. LLMs may be helpful adjuncts in residency education when used with supervision and critical appraisal.
VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection
Citations: 0
Abstract: In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.
Extending MISP Taxonomies for Drug-Related Forum Classification on the Dark Web: A Human-in-the-Loop and LLM-Based Approach
DOI:
10.3390/fi18050228
Citations: 0
Abstract: This study proposes a methodological framework for extending Malware Information Sharing Platform (MISP) taxonomies in the domain of Dark Web drug forums through the integration of large language models (LLMs) and Human-in-the-Loop (HITL) validation. The research addresses the existing ontological gap between traditional MISP taxonomies, focused on technical or chemical indicators, and the linguistic and morphological complexity of illicit digital markets. By modelling the primary physical form as an ontological predicate with mutually exclusive values (for example, powder, pill–tablet–capsule, liquid, and plant-matter), the proposed approach captures the material dimension of the discourse, enhancing semantic disambiguation and forensic traceability. The Mistral 7B model was used in the morphology-classification stage conducted on a stratified analytical subset of 2904 drug-related Dark Web posts, extracted from a final corpus of 6456 posts after data cleaning and relevance filtering. In the first pass, 76.48% of posts were directly assigned to one of the base morphological categories, while 23.52% were labelled as unclear and subsequently reviewed through the HITL stage. Following HITL refinement and full reclassification, the proportion of posts labelled as unclear decreased from 23.52% to 11.29%, corresponding to a 51.99% relative reduction in ambiguity. Network visualisation with VOSviewer revealed three major discursive axes—recreational–commercial, pharmaceutical–opioid, and transnational–logistical—reflecting the hybrid semantic structure of digital drug markets. The results show that combining LLM-based inference with expert oversight improves the interpretability, reproducibility and ontological robustness of cyberintelligence models, offering a replicable framework for other sensitive domains such as terrorism or child exploitation.
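The reported ambiguity drop is easy to verify as a relative reduction, using only the figures stated above (23.52% unclear after the first LLM pass, 11.29% after HITL refinement and full reclassification):

```python
def relative_reduction(before, after):
    """Relative reduction in the share of 'unclear' labels after HITL review."""
    return (before - after) / before

# Figures from the abstract: 23.52% unclear -> 11.29% unclear.
r = relative_reduction(0.2352, 0.1129)  # ~0.52, matching the stated 51.99%
```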
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
Citations: 0
Abstract: Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
Graph Neural Network-Informed Predictive Flows for Faster Ford-Fulkerson and PAC-Learnability
Citations: 0
Abstract: We propose a learning-augmented framework for accelerating max-flow computation and image segmentation by integrating Graph Neural Networks (GNNs) with the Ford-Fulkerson algorithm. Rather than predicting initial flows, our method learns edge importance probabilities to guide augmenting path selection. We introduce a Message Passing GNN (MPGNN) that jointly learns node and edge embeddings through coupled updates, capturing both global structure and local flow dynamics such as residual capacity and bottlenecks. Given an input image, we propose a method to construct a grid-based flow network with source and sink nodes, extract features, and perform a single GNN inference to assign edge probabilities reflecting their likelihood of belonging to high-capacity cuts. These probabilities are stored in a priority queue and used to guide a modified Ford-Fulkerson procedure, prioritizing augmenting paths via an Edmonds-Karp-style search with bottleneck-aware tie-breaking. This avoids repeated inference over residual graphs while leveraging learned structure throughout optimization. We further introduce a bidirectional path construction strategy centered on high-probability edges and provide a theoretical framework relating prediction quality to efficiency via a weighted permutation distance metric. Our method preserves max-flow/min-cut optimality while reducing the number of augmentations in practice. We also outline a hybrid extension combining flow warm-starting with edge-priority prediction, establishing a foundation for learning-guided combinatorial optimization in image segmentation.
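The edge-priority idea can be illustrated with a small sketch: Ford-Fulkerson in which the augmenting-path search is ordered by predicted edge importance instead of plain BFS, with the path score taken as the minimum edge priority along the path. The `gnn_scores` dict below stands in for the MPGNN's output, and the graph, scores, and function name are illustrative, not taken from the paper.

```python
import heapq
from collections import defaultdict

def priority_max_flow(edges, priority, s, t):
    """Ford-Fulkerson variant: augmenting paths are found by a best-first
    search that expands high-priority residual edges first.
    edges: list of (u, v, capacity); priority: {(u, v): score in [0, 1]}."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edges:
        cap[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)  # reverse arc for the residual graph
    flow = 0
    while True:
        # Path score = min edge priority along the path, stored negated so
        # the min-heap pops the most promising frontier node first.
        parent, heap = {s: None}, [(-float("inf"), s)]
        while heap:
            neg_score, u = heapq.heappop(heap)
            if u == t:
                break
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    p = priority.get((u, v), 0.0)
                    heapq.heappush(heap, (max(neg_score, -p), v))
        if t not in parent:
            return flow  # no augmenting path left: flow is maximal
        # Walk back from t and augment by the bottleneck residual capacity.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= b
            cap[(v, u)] += b
        flow += b

edges = [("s", "a", 3), ("s", "b", 2), ("a", "b", 1), ("a", "t", 2), ("b", "t", 3)]
gnn_scores = {("s", "a"): 0.9, ("a", "t"): 0.8}  # stand-in for learned probabilities
```

Because any augmenting path preserves optimality, the learned priorities can only change how fast the maximum flow is reached, never its value, which is the property the abstract emphasizes.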
AIGuardXplore: Intelligent Model Vulnerability Inspector
DOI:
10.55041/isjem06722
Citations: 0
Abstract: As enterprises continue to adopt Large Language Models (LLMs) in their infrastructure, the lack of transparency in automated security guardrails creates a significant weakness. This is compounded by existing solutions that focus primarily on automated blocking while providing little granular visibility to security operators who must audit incidents or modify their defensive policies. To address this gap, the AIGuardXplore framework was designed as a high-fidelity visual observability solution enabling real-time monitoring of AI security events. Through a reactive, full-stack architecture built with Vite, React, and Drizzle ORM, AIGuardXplore aggregates information into one central location, allowing consolidated access to evaluate prompt injections, data leaks, and policy violations. AIGuardXplore introduces a UI paradigm designed with high-contrast colours and reduced cognitive load so that Security Operations Centre (SOC) personnel can recognize incident data quickly as it occurs and easily differentiate automated guardrail logic from actual incidents.
Keywords: AI Guardrails, LLM Security, Real-time Observability, Prompt Injection Defense, Visual Analytics, Full-stack Telemetry.
Comparative evaluation of four large language models (ChatGPT-5, Gemini 2.5 Pro, Claude Sonnet-4, and DeepSeek-R1) for diagnostic and therapeutic decision-making in real-world otolaryngology scenarios
Citations: 0
Abstract: Not available; see the original article.
LLM-Enhanced Modeling of Social Desirability-Aware Forced-Choice Personality Assessment
Citations: 0
Abstract: Personality assessment serves as a key building block in intelligent information systems that enable human-centered modeling. Unlike cognitive tests, personality assessments rely primarily on self-reports and are therefore susceptible to faking. Forced-choice (FC) formats partially mitigate this problem, yet socially desirable responding remains a systematic source of bias. Traditional approaches rely on expert-annotated social desirability (SD) ratings to construct FC item blocks and infer respondents’ personality traits from block-level rankings. This rating procedure is labor-intensive and coarse-grained. Furthermore, existing methods neglect the non-linear SD interactions between respondents and items, which act as structured adversarial noise that hinders the recovery of true latent traits. To address these challenges, we propose the Social Desirability-aware Forced-Choice Diagnosis (SDFCD) approach. Our approach adopts a knowledge-guided learning paradigm by leveraging large language models (LLMs) to distill fine-grained, continuous SD ratings, thereby replacing sparse expert ratings. We then introduce a decoupled neural interaction module that jointly represents latent personality traits and SD tendencies, enabling the modeling of respondent–item SD interactions. Experiments on real assessment data demonstrate that our method significantly outperforms baseline FC models in personality trait diagnostic performance and model interpretability. This study highlights the potential of LLMs for automated, fine-grained SD quantification and offers a scalable path toward more trustworthy personality assessment.
CoFEE: Reasoning Control for LLM-Based Feature Discovery
Citations: 0
Abstract: Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. With the introduction of ever-improving Large Language Models (LLMs), our method provides a structured method for addressing this challenge. LLMs are well suited for this task by being able to process large amounts of information, but unconstrained feature generation can lead to weak features. In this work, we study reasoning control in LLMs by inducing cognitive behaviors for improving feature discovery. We introduce CoFEE (Cognitive Feature Engineering Engine), a reasoning control framework that enforces cognitive behaviors in how the LLM reasons during feature discovery. From a machine learning perspective, these cognitive behaviors act as structured inductive biases over the space of candidate features generated by the model. These behaviors have been exploited with success in ML models, and include backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths. In a controlled comparison, we show that enforcing cognitive behaviors yields features with higher empirical predictability than those under unconstrained vanilla LLM prompts. CoFEE achieves an average Success Rate Score that is 15.2% higher than the vanilla approach, while generating 29% fewer features and reducing costs by 53.3%. Using held-out feature evaluation, we assess whether cognitively induced features generalize beyond the data used for discovery. Our results indicate that, in our evaluated setting, reasoning control is associated with improvements in quality and efficiency of LLM-based feature discovery.
Automation Bias in Large Language Model–Assisted Diagnostic Reasoning among Physicians Trained in AI Literacy — A Randomized Clinical Trial
DOI:
10.1056/aioa2501001
Citations: 0
Abstract: Not available; see the original article.
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases
Citations: 0
Abstract: Not available; see the original article.
Scalable Multilingual Clinical Trial Text Classification using Transformer Embeddings with Real-Time Redis and Telegram Integration
Citations: 0
Abstract: For decades, clinical trial management has depended on systematic extraction of unstructured clinical narratives to support patient safety monitoring and eligibility assessment. Traditionally, this process required extensive manual effort from clinical experts who categorized protocol deviations and screened participants based on complex documentation. With the emergence of Natural Language Processing (NLP) in the early 2010s, statistical approaches such as TF-IDF and Word2Vec enabled the first wave of automation in structuring clinical text. However, oncology data remains highly complex, characterized by dense unstructured narratives, nested logical conditions (AND/OR/NOT), and specialized domain terminology. Conventional machine learning systems often fail to capture the high-dimensional semantic relationships necessary for robust classification, resulting in overlooked systemic signals in protocol deviations. More recent approaches include axis-parallel decision tree ensembles, such as Random Forests, and cloud-based Large Language Models (LLMs). While effective in certain settings, axis-parallel models are limited by their inability to model diagonal decision boundaries in embedded semantic spaces, reducing performance on tilted or non-linear clusters. Conversely, LLMs such as GPT-4 offer strong reasoning capabilities but introduce challenges related to patient data privacy, operational cost, and limited multilingual robustness without translation pipelines. To address these limitations, this work proposes a privacy-preserving, multilingual framework combining Language-Agnostic BERT Sentence Embeddings (LaBSE) with Ensemble Oblique Trees (EOT). By leveraging oblique hyperplanes, the model better partitions high-dimensional embedding spaces. The proposed LaBSE–EOT system enables lightweight, locally deployable, and interpretable classification, improving cross-lingual clinical trial oversight while reducing dependency on cloud infrastructure and enhancing global healthcare scalability.
Provably Secure Steganography Based on List Decoding
Citations: 0
Abstract: Steganography embeds secret messages in seemingly innocuous carriers for covert communication under surveillance. Current Provably Secure Steganography (PSS) schemes based on language models can guarantee computational indistinguishability between the covertext and stegotext. However, achieving high embedding capacity remains a challenge for existing PSS. The inefficient entropy utilization renders them not well-suited for Large Language Models (LLMs), whose inherent low-entropy tendencies severely constrain feasible embedding capacity. To address this, we propose a provably secure steganography scheme with a theoretically proved high capacity. Our scheme is based on the concept of list decoding: it maintains a set of candidates that contain the correct secret message, instead of directly finding the correct message with more effort. This strategy fully utilizes the information content of the generated text, yielding higher capacity. To ensure the correctness of our scheme, we further introduce a suffix-matching mechanism to distinguish the correct secret message from the candidates. We provide theoretical proofs for both the security and correctness of our scheme, alongside a derivation of its theoretical capacity lower bound. Our approach is plug-and-play, requiring only a direct replacement of the model's standard random sampling module. Experiments on three LLMs and seven PSS baselines demonstrate that our method achieves computational efficiency comparable to prior PSS schemes while delivering a substantial improvement in embedding capacity.
A structure-based virtual screening approach to identify novel anaplastic lymphoma kinase inhibitors
Citations: 0
Abstract: Not available; see the original article.
Mapping the Maritime Economy
DOI:
10.31217/p.40.2.5
Citations: 0
Abstract: The maritime economy, a cornerstone of global trade and economic development, is undergoing rapid transformation driven by technological advancements, sustainability challenges, and evolving regulatory frameworks. This study systematically organizes and analyses 827 scientific articles from the Web of Science, employing a hybrid methodology integrating Social Network Analysis (SNA), the Binary Space Partitioning (BSP) clustering algorithm, and large language models (LLMs) to identify key research themes and emerging trends. Through an iterative "human-in-the-loop" process, we refine the clustering process, resulting in six main clusters and 26 subclusters spanning topics such as governance and strategic management, forecasting and modelling, port operations, digital innovations, sustainability, and risk management. The results reveal a marked shift toward sustainability-oriented and technology-driven research, emphasizing the convergence of environmental responsibility, digital transformation, and operational resilience. The findings also highlight the growing role of autonomous technologies, big data analytics, and blockchain in reshaping maritime systems, while underscoring the need for adaptive governance frameworks and cross-sectoral collaboration. By offering a structured, data-driven overview of the maritime research ecosystem, this study contributes a novel methodological framework for bibliometric exploration and provides actionable insights for scholars, policymakers, and industry stakeholders seeking to advance innovation and sustainability in the maritime domain.
Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models
🔥 Citations:
0
Abstract: Lung adenocarcinoma (LUAD) grading depends on accurately identifying growth patterns, which are indicators of prognosis and can influence treatment decisions. Common deep learning approaches to determine the predominant pattern rely on patch-level classification or segmentation, requiring extensive annotations. This study proposes an attention-based multiple instance learning (ABMIL) framework to predict the predominant LUAD growth pattern at the whole slide level to reduce annotation burden. Our approach integrates pretrained pathology foundation models as patch encoders, used either frozen or fine-tuned on annotated patches, to extract discriminative features that are aggregated through attention mechanisms. Experiments show that fine-tuned encoders improve performance, with Prov-GigaPath achieving the highest agreement (κ = 0.699) under ABMIL. Compared to simple patch-aggregation baselines, ABMIL yields more robust predictions by leveraging slide-level supervision and spatial attention. Future work will extend this framework to estimate the full distribution of growth patterns and validate performance on external cohorts.
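The attention aggregation at the heart of ABMIL admits a compact formulation: a_i = softmax_i(w^T tanh(V h_i)), z = Σ_i a_i h_i. Below is a minimal NumPy sketch of this standard mechanism, not the paper's implementation; the parameters `V` and `w` and all dimensions are illustrative:

```python
import numpy as np

def abmil_pool(H, V, w):
    """Attention-based MIL pooling (standard formulation):
    a_i = softmax_i(w^T tanh(V h_i)),  z = sum_i a_i h_i.
    H: (n_patches, d) patch features; V: (h, d) and w: (h,) are learnable."""
    scores = w @ np.tanh(V @ H.T)            # (n_patches,) attention logits
    scores = scores - scores.max()           # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    z = a @ H                                # (d,) slide-level embedding
    return z, a

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # 6 patches with 4-dim features (toy sizes)
V = rng.normal(size=(8, 4))
w = rng.normal(size=8)
z, a = abmil_pool(H, V, w)
```

In the paper's setting `H` would come from a frozen or fine-tuned foundation-model encoder, and a classifier head on `z` would be trained with only slide-level labels.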
Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction
🔥 Citations:
0
Abstract: Large language models (LLMs) have recently demonstrated strong reasoning capabilities and attracted increasing research attention in the field of autonomous driving (AD). However, safe application of LLMs to AD perception and prediction still requires a thorough understanding of both the dynamic traffic agents and the static road infrastructure. To this end, this study introduces a framework to evaluate the capability of LLMs in understanding the behaviors of dynamic traffic agents and the topology of road networks. The framework leverages frozen LLMs as the reasoning engine, employing a traffic encoder to extract spatial-level scene features from observed trajectories of agents, while a lightweight Convolutional Neural Network (CNN) encodes the local high-definition (HD) maps. To assess the intrinsic reasoning ability of LLMs, the extracted scene features are then transformed into LLM-compatible tokens via a reprogramming adapter. Because the prediction burden rests with the LLMs, a simple linear decoder suffices to output future trajectories. The framework enables a quantitative analysis of the influence of multi-modal information, especially the impact of map semantics on trajectory prediction accuracy, and allows seamless integration of frozen LLMs with minimal adaptation, thereby demonstrating strong generalizability across diverse LLM architectures and providing a unified platform for model evaluation.
MambaCSP: Hybrid-Attention State Space Models for Hardware-Efficient Channel State Prediction
🔥 Citations:
0
Abstract: Recent works have demonstrated that attention-based transformer and large language model (LLM) architectures can achieve strong channel state prediction (CSP) performance by capturing long-range temporal dependencies across channel state information (CSI) sequences. However, these models suffer from quadratic scaling in sequence length, leading to substantial computational cost, memory consumption, and inference latency, which limits their applicability in real-time and resource-constrained wireless deployments. In this paper, we investigate whether selective state space models (SSMs) can serve as a hardware-efficient alternative for CSI prediction. We propose MambaCSP, a hybrid-attention SSM architecture that replaces LLM-based prediction backbones with a linear-time Mamba model. To overcome the local-only dependencies of pure SSMs, we introduce lightweight patch-mixer attention layers that periodically inject cross-token attentions, helping with long-context CSI prediction. Extensive MISO-OFDM simulations show that MambaCSP improves prediction accuracy over LLM-based approaches by 9-12%, while delivering up to 3.0x higher throughput, 2.6x lower VRAM usage, and 2.9x faster inference. Our results demonstrate that hybrid state space architectures provide a promising direction for scalable and hardware-efficient AI-native CSI prediction in future wireless networks.
An Adaptive Prompt Optimization Framework for Domain-Specific Large Language Models
🔥 Citations:
0
Abstract: Domain-specific deployment of large language models (LLMs) remains constrained by prompt brittleness, inference cost, and uneven generalization across task subtypes. This paper presents APOF (Adaptive Prompt Optimization Framework), a closed-loop framework that jointly optimizes prompt structure, retrieval context, and inference-time control signals for domain-specific LLM applications. APOF combines three elements: (i) a policy-guided prompt composer that dynamically allocates instruction budget across task facets, (ii) a critic model that estimates prompt-task fitness before expensive decoding, and (iii) an online adaptation module that updates prompt policies using delayed feedback from production outcomes. We instantiate APOF in three high-stakes domains (clinical note summarization, legal clause risk classification, and materials-science question answering) using a shared 13B-parameter base model and domain adapters. Our experiments include 162,000 annotated instances across public and institutionally curated corpora, 12 baseline methods, and controlled ablation studies. APOF improves macro-F1 by up to 6.8 points over the strongest static prompt baseline, while reducing median latency by 18.4% through pre-decoding prompt pruning and adaptive generation parameters. The framework also improves calibration (ECE reduction of 0.041) and demonstrates higher robustness under distribution shift (average relative performance drop of 12.3% vs. 21.7% for static methods). We provide mathematical formulations, complexity analysis, and practical deployment recommendations. Results suggest that adaptive prompting, when treated as a structured optimization problem rather than manual engineering, is a viable path to reliable domain-specific LLM systems.
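The expected calibration error (ECE) figure quoted in this abstract is conventionally computed with an equal-width binning estimator; the sketch below assumes that standard definition and is not specific to APOF:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the bin-weight-averaged gap |accuracy - mean confidence|
    over equal-width confidence bins on (0, 1]."""
    confidences = np.asarray(confidences, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap        # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated model (e.g. 75% confidence and 75% accuracy) scores 0; a model that is confident at 0.95 but always wrong scores 0.95.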
Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
🔥 Citations:
0
Abstract: Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.
The Convergent Cyber Threat Horizon in Financial Services: Artificial Intelligence, Agentic Systems, Deepfakes, Quantum Computing, and the Mythos of Large Language Models
🔥 Citations:
0
Abstract: Financial services institutions are entering a period of simultaneous and interacting technology shocks whose combined impact on the cyber threat landscape is qualitatively greater than that of any single force. This paper examines five such forces — generative artificial intelligence (AI), agentic AI, synthetic media (deepfakes), cryptographically-relevant quantum computing, and the institutional mythos surrounding large language models (LLMs) — and analyses their combined implications for banks, insurers, asset managers, payment providers, and market infrastructure operators. Drawing on primary standards documents, sector incident reports, and the emerging academic literature, we argue that the traditional cybersecurity program, built on assumptions of scarce adversary labour, durable cryptography, trustworthy media, and human-only identity, is structurally misaligned with the 2026–2032 threat environment. We contribute a five-vector threat taxonomy for financial services, a unified seven-domain defence framework mapping to NIST Cybersecurity Framework 2.0 and the EU Digital Operational Resilience Act (DORA), and a prioritised 24-month preparedness roadmap. We highlight under-researched risks including non-human identity proliferation, indirect prompt injection against agentic pipelines, harvest-now-decrypt-later (HNDL) exposure of long-retention financial data, and the governance gap introduced by uneven institutional understanding of LLM capabilities. The paper closes with research and policy recommendations including the establishment of sector-wide cryptographic inventories, standardised AI red-teaming protocols, and content-provenance requirements for material financial communications.
MPRDR: A Multi-Path Relational Drug Repurposing Framework Grounded in Graph-Theoretic Principles
🔥 Citations:
0
Abstract: Current GNN-based drug repurposing algorithms rate drug-disease relationships using learned embedding proximity, which captures co-occurrence patterns but is neither mechanistically interpretable nor able to handle cold-start cases. We introduce MPRDR (Multi-Path Relational Drug Repurposing), a GNN architecture grounded in three graph-theoretic concepts: a typed path algebra [1] to support multi-hop mechanistic evidence, Turán-theorem-based cohesion of drug modules [2], and effective resistance for structural robustness [3,4]. In contrast to TxGNN [5] and GDRnet [6], MPRDR precomputes three interpretable topological scores on PrimeKG [7] and trains a GNN to combine these structural priors with learned representations, replacing uncalibrated sigmoid outputs with a graph-theoretic confidence score. MPRDR performs competitively with or better than prior methods when evaluated on well-known standard test sets such as those of TxGNN and GDRnet. Beyond existing approaches, MPRDR can make cold-start predictions, generate drug combination synergy scores, and provide mechanistic explanations.
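Of the three graph-theoretic scores named above, effective resistance has a particularly direct formulation: R_eff(u, v) = (e_u − e_v)^T L⁺ (e_u − e_v), where L is the graph Laplacian and L⁺ its Moore-Penrose pseudoinverse. A minimal sketch of this quantity on toy graphs (not the MPRDR code, which operates on the much larger PrimeKG graph):

```python
import numpy as np

def effective_resistance(adj, u, v):
    """Effective resistance between nodes u and v of an undirected graph,
    computed as (e_u - e_v)^T L^+ (e_u - e_v) with L the graph Laplacian."""
    adj = np.asarray(adj, float)
    L = np.diag(adj.sum(axis=1)) - adj     # Laplacian D - A
    L_pinv = np.linalg.pinv(L)             # Moore-Penrose pseudoinverse
    e = np.zeros(len(adj))
    e[u], e[v] = 1.0, -1.0
    return float(e @ L_pinv @ e)

# sanity checks against circuit theory: a 3-node path (two unit resistors
# in series) and a triangle (1 Ohm in parallel with 2 Ohm)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
tri = np.ones((3, 3)) - np.eye(3)
r_path = effective_resistance(path, 0, 2)   # expect 2
r_tri = effective_resistance(tri, 0, 1)     # expect 2/3
```

Low effective resistance between a drug and a disease node indicates many robust parallel paths, which is the structural-robustness intuition the abstract invokes.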
LLM-Steered Power Allocation for Parallel QPSK-AWGN Channels
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly being explored as high-level decision modules in closed-loop systems, but their stochastic nature makes safe integration challenging. In this paper, we propose LLM-Steered Power Allocation, a dual-process architecture for parallel QPSK channels inspired by Kahneman's System 1/System 2 framework. A fast numerical optimizer (System 1) continuously performs projected gradient ascent on a weighted mutual-information objective, while an LLM navigator (System 2) periodically interprets natural-language policies and updates only the channel weights and the operational power budget. The LLM never manipulates the power-allocation variables directly, and constraint satisfaction is enforced structurally by the optimizer. To mitigate LLM unreliability, we further incorporate multi-layer guardrails including normalization, exponential moving-average smoothing, and fallback mechanisms. Numerical experiments on an 8-channel system show that, with a fixed optimization core and unchanged system prompt, different natural-language policies induce qualitatively different operating points, including throughput-oriented allocation, channel prioritization, power-aware operation, and channel shutdown. In addition, under an abrupt channel-gain reversal, the proposed system autonomously reconfigures its steering signals and reduces the final mutual-information spread by 60% compared with the optimizer alone. These results suggest that LLMs can serve as policy interpreters for safe, flexible reconfiguration of communication-system optimizers without controller reimplementation.
Ideological Bias in LLMs' Economic Causal Reasoning
🔥 Citations:
0
Abstract: Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
Real-Time Credit Card Fraud Detection Using Deep Learning, Graph Neural Networks, And Auto Encoders
🔥 Citations:
0
Abstract: The rapid growth of digital banking and online financial transactions has significantly increased the risk of fraudulent activities, posing serious challenges to traditional fraud detection systems. Conventional machine learning–based approaches primarily rely on isolated transaction features and often fail to capture complex relationships among users, accounts, and transactions. To address these limitations, this work proposes an enhanced fraud detection framework for banking systems using deep learning techniques that combine Graph Neural Networks (GNNs) and Autoencoders. In the proposed approach, banking entities such as customers, accounts, devices, and transactions are modeled as nodes in a graph, while their interactions are represented as edges, enabling GNNs to effectively learn hidden relational patterns. Autoencoders are employed for unsupervised anomaly detection by identifying abnormal transaction behaviors through reconstruction errors. By integrating relational learning with anomaly detection, the framework efficiently handles large-scale and dynamic transaction data and improves the identification of sophisticated fraud patterns, including collusive and multi-hop fraud behaviors. Experimental analysis demonstrates that the proposed hybrid model achieves higher accuracy, improved recall, and reduced false positives compared to traditional machine learning and deep learning models. This approach provides a scalable and robust solution for real-time fraud detection in modern banking environments.
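The reconstruction-error principle behind the autoencoder component is easy to demonstrate in isolation. The sketch below substitutes a linear autoencoder (equivalently, PCA) and synthetic "transaction" features for the paper's deep model and banking data; all sizes and the threshold rule are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Normal transactions lie near a 2-D subspace of a 6-D feature space;
# anomalies do not. A linear autoencoder (here: PCA) learns the subspace,
# and reconstruction error flags points that fall outside it.
W = rng.normal(size=(6, 2))
normal = rng.normal(size=(500, 2)) @ W.T + 0.05 * rng.normal(size=(500, 6))
anomalies = 3.0 * rng.normal(size=(5, 6))

mu = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mu, full_matrices=False)
code = Vt[:2]                          # "encoder" weights: top-2 components

def recon_error(X):
    Z = (X - mu) @ code.T              # encode
    Xhat = Z @ code + mu               # decode
    return np.linalg.norm(X - Xhat, axis=1)

threshold = np.percentile(recon_error(normal), 99)   # 1% false-positive budget
flags = recon_error(anomalies) > threshold
```

A trained deep autoencoder replaces the SVD step in practice, but the flagging logic (threshold on reconstruction error, calibrated on normal traffic) is the same.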
Large Language Models in Medical and Dental Education: A Cross-Sectional Comparison of AI-Generated and Faculty-Authored Prosthodontic Materials
DOI:
10.3390/dj14050249
🔥 Citations:
0
Abstract: Background/Objectives: This study aimed to compare AI-generated educational material with faculty-authored content in Dental Prostheses Technology, evaluating perceived clarity, accuracy, structure, usefulness, and overall instructional quality across different age and professional groups. Methods: An analytical cross-sectional study was conducted using two versions of the first three chapters of a prosthodontics textbook: the original faculty-authored text and a reformulated version generated by ChatGPT 5.2 (OpenAI). Images were removed and formatting standardized to ensure a text-only comparison. An anonymized online questionnaire based on a five-point Likert scale assessed clarity, accuracy, readability, usefulness and structure. To reduce potential bias, participants were unaware of the authorship of the evaluated materials (human-authored vs. AI-generated). A total of 130 participants independently reviewed both documents. Data were analyzed using Wilcoxon signed-rank, Mann–Whitney U, and Friedman tests. Results: Both materials received favorable evaluations across all dimensions. The AI-generated version demonstrated a statistically significant advantage in clarity (Z = −2.107, p = 0.035; r = 0.19), while no significant differences were observed for structure, accuracy, readability, or usefulness. Generational differences emerged: younger participants valued improved clarity but reported reduced usefulness, mid-career participants showed the greatest improvement in perceived accuracy, and senior professionals reported substantial gains in usefulness and readability. Conclusions: AI-generated educational material demonstrates pedagogical equivalence to faculty-authored content, with clarity representing its principal advantage. Large language models may serve as effective complementary tools in dental education, particularly for restructuring complex content.
Empirical Assessment of Time-Series Foundation Models For Power System Forecasting Applications
🔥 Citations:
0
Abstract: Accurate forecasting of electric load and renewable generation is essential for reliable and cost effective power system operations. Recent advances in transformer based and foundation machine learning models, driven by large scale pretraining, increased available data and computation, in addition to architectural innovations, have shown promise in time series forecasting across multiple domains. However, their application to power system forecasting tasks remains largely underexplored. This work presents a comprehensive, empirical benchmark of state of the art time series foundation models, transformer architectures, and deep learning baselines for solar, wind, and load forecasting using the high resolution ARPAE PERFORM dataset for the Electric Reliability Council of Texas (ERCOT) grid. Eight core capabilities are assessed, including zero shot performance, fine tuning efficiency, multivariate input and output handling, horizon sensitivity, generalization to unseen sites, probabilistic forecasting, and context window effects. Models evaluated include TimesFM, Chronos Bolt, MoiraiL, MOMENT, Tiny Time Mixer, Temporal Fusion Transformer, PatchTST, TimeXer, LSTM, and CNN. The manuscript aims to provide clear guidance on when foundation models can provide enhanced renewable and load forecasting capabilities and when other approaches remain the more practical choice for power system operations.
Decoupled Travel Planning with Behavior Forest
🔥 Citations:
0
Abstract: Behavior sequences, composed of executable steps, serve as the operational foundation for multi-constraint planning problems such as travel planning. In such tasks, each planning step is not only constrained locally but also influenced by global constraints spanning multiple subtasks, leading to a tightly coupled and complex decision process. Existing travel planning methods typically rely on a single decision space that entangles all subtasks and constraints, failing to distinguish between locally acting constraints within a subtask and global constraints that span multiple subtasks. Consequently, the model is forced to jointly reason over local and global constraints at each decision step, increasing the reasoning burden and reducing planning efficiency. To address this problem, we propose the Behavior Forest method. Specifically, our approach structures the decision-making process into a forest of parallel behavior trees, where each behavior tree is responsible for a subtask. A global coordination mechanism is introduced to orchestrate the interactions among these trees, enabling modular and coherent travel planning. Within this framework, large language models are embedded as decision engines within behavior tree nodes, performing localized reasoning conditioned on task-specific constraints to generate candidate subplans and adapt decisions based on coordination feedback. The behavior trees, in turn, provide an explicit control structure that guides LLM generation. This design decouples complex tasks and constraints into manageable subspaces, enabling task-specific reasoning and reducing the cognitive load on the LLM. Experimental results show that our method outperforms state-of-the-art methods by 6.67% on the TravelPlanner benchmark and by 11.82% on the ChinaTravel benchmark, demonstrating its effectiveness in improving LLM performance for complex multi-constraint travel planning.
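The framework builds on standard behavior-tree tick semantics: leaf nodes perform actions (in the paper, this is where an LLM generates candidate subplans), composites such as Sequence orchestrate them, and a shared context can carry global constraints across trees in the forest. A minimal sketch with invented subtasks and a toy budget constraint standing in for global coordination:

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2

class Action:
    """Leaf node; in the paper's design an LLM-backed decision engine,
    here a plain function of the shared context."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self, ctx):
        return self.fn(ctx)

class Sequence:
    """Composite node: ticks children in order and fails fast."""
    def __init__(self, *children):
        self.children = children
    def tick(self, ctx):
        for child in self.children:
            if child.tick(ctx) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS

def book(item, cost):
    def fn(ctx):
        if ctx["budget"] < cost:        # local check against a global constraint
            return Status.FAILURE
        ctx["budget"] -= cost
        ctx["plan"].append(item)
        return Status.SUCCESS
    return Action(fn)

# one behavior tree per subtask; a coordinator ticks them while the shared
# context enforces the global budget across the whole forest
hotel_tree = Sequence(book("hotel", 120))
transport_tree = Sequence(book("train", 60))
ctx = {"budget": 200, "plan": []}
results = [tree.tick(ctx) for tree in (hotel_tree, transport_tree)]
```

Decoupling shows up in the structure: each tree reasons only about its own subtask, while the shared context is the single channel for cross-subtask constraints.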
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
🔥 Citations:
0
Abstract: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution and stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
🔥 Citations:
0
Abstract: AI WiFi offload is emerging as a promising approach for providing large language model (LLM) services to resource-constrained wireless devices. However, unlike conventional edge computing, LLM inference over WiFi must jointly address heterogeneous model capabilities, wireless contention, uncertain task complexity, and semantic correlation among reasoning tasks. In this paper, we investigate LLM inference offloading in a multi-user multi-edge WiFi network, where each task can be executed locally, directly offloaded to a nearby edge access point (AP), or decomposed into multiple subtasks for collaborative execution across local and edge nodes. To this end, we propose a user-edge collaborative framework with an LLM-based planner that not only performs task decomposition but also infers subtask difficulty and expected output token length, enabling more accurate estimation of execution quality and latency on heterogeneous nodes. Based on these estimates, we further design a decomposition-aware scheduling strategy that jointly optimizes subtask assignment, execution, and aggregation under communication, queuing, and computation constraints. Simulation results show that the proposed framework achieves a better latency-accuracy tradeoff than local-only and nearest-edge baselines, reducing the average latency by 20% and improving the overall reward by 80%. Moreover, the distilled lightweight planner approaches the performance of the large teacher model while remaining more suitable for practical edge deployment.
How to Supercharge Your Research Workflow Using Large Language Models
🔥 Citations:
0
Abstract: No abstract available; see the original article.
CAP: Controllable Alignment Prompting for Unlearning in LLMs
🔥 Citations:
0
Abstract: Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to selectively suppress target knowledge while preserving general capabilities. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
🔥 Citations:
0
Abstract: Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
CT-Based Deep Foundation Model for Predicting Immune Checkpoint Inhibitor-Induced Pneumonitis Risk in Lung Cancer
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
🔥 Citations:
0
Abstract: Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
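The Bradley-Terry model underlying this leaderboard posits P(i beats j) = π_i / (π_i + π_j) and fits the strengths π from pairwise outcomes; the standard MM (Zermelo) iteration is only a few lines. A generic sketch with toy win counts, not the authors' pipeline:

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=500):
    """MM/Zermelo fit of Bradley-Terry strengths from a win-count matrix:
    p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j),
    where wins[i, j] counts times system i was preferred over system j."""
    wins = np.asarray(wins, float)
    comps = wins + wins.T                    # total comparisons per pair
    p = np.ones(len(wins))
    for _ in range(n_iter):
        denom = (comps / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom
        p /= p.sum()                         # fix the arbitrary scale
    return p

# toy leaderboard over three TTS systems (counts are invented)
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
p = fit_bradley_terry(wins)
```

The iteration converges when the comparison graph is strongly connected (every system wins and loses at least once), which large crowdsourced collections like the one described here easily satisfy.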
Aiki-XP: leakage-controlled multimodal prediction of within-species relative protein expression at pan-bacterial scale
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Agentic AI-Enabled Framework for Thermal Comfort and Building Energy Assessment in Tropical Urban Neighborhoods
🔥 Citations:
0
Abstract: In response to the urban heat island effects and building energy demands in Singapore, this study proposes an agentic AI-enabled reasoning framework that integrates large language models (LLMs) with lightweight physics-based models. Through prompt customization, the LLMs interpret urban design tasks, extract relevant policies, and activate appropriate physics-based models for evaluation, forming a closed-loop reasoning-action process. These lightweight physics-based models leverage core thermal and airflow principles, streamlining conventional models to reduce computational time while predicting microclimate variables, such as building surface temperature, ground radiant heat, and airflow conditions, thereby enabling the estimation of thermal comfort indices, e.g., physiological equivalent temperature (PET), and building energy usage. This framework allows users to explore a variety of climate-resilient building surface strategies, e.g., green façades and cool paint applications, that improve thermal comfort while reducing wall heat gain and energy demand. By combining the autonomous reasoning capacity of LLMs with the rapid quantitative evaluation of lightweight physics-based models, the proposed system demonstrates potential for cross-disciplinary applications in sustainable urban design, indoor-outdoor environmental integration, and climate adaptation planning. The source code and data used in this study are available at: https://github.com/PgUpDn/urban-cooling-agent.
Identification of potential Phosphodiesterase 1 inhibitors from plant phytochemicals for the treatment of cognitive impairment using molecular modeling tools
🔥 Citations:
0
Abstract: Plants and their phytomolecules have been employed for centuries to treat various illnesses, including cognitive impairments. In modern drug discovery, phytomolecule libraries serve as an important source for identifying novel therapeutic agents, particularly for complex diseases, as de novo chemical synthesis is often time-consuming and expensive. In this study, a total of 44,085 Central Nervous System (CNS)-targeting phytomolecules were retrieved from the ChemDiv library and systematically screened for their potential as inhibitors of the phosphodiesterase-1 (PDE1) receptor using molecular modeling tools. PDE1 is a key enzyme predominantly expressed in brain tissues that hydrolyzes 3′,5′-cAMP (3′,5′-cyclic adenosine monophosphate) and 3′,5′-cGMP (3′,5′-cyclic guanosine monophosphate) into their inactive monophosphate counterparts, 5′-AMP and 5′-GMP. Cyclic nucleotides are essential for regulating neuronal signaling pathways. An initial molecular docking analysis identified 10 phytomolecules exhibiting strong binding affinities ranging from −11.97 to −12.54 kcal·mol⁻¹, and significant interactions with key amino acid residues within the active site of PDE1. Molecular dynamics simulations confirmed that four protein-ligand complexes (D361-0316-PDE1, L977-1068-PDE1, J081-0995-PDE1, and L977-1023-PDE1) remained structurally stable for up to 100 nanoseconds. Further, density functional theory (DFT) studies confirmed the electronic stability of these molecules. ADMET and drug-likeness evaluations predicted the pharmacokinetic and toxicity profiles. Overall, the findings suggest that these four phytomolecules may serve as promising PDE1 inhibitors for the management of cognitive impairment.
Transparent Large Language Model-Assisted Pump Scheduling: Performance Evaluation and Interpretability in Water Distribution Systems
Citations: 0
Abstract: No abstract available; see the original article.
Narrative and Challenge in Single-Player RPGs: A 1990–2025 Player-Centered Systematic Review
Citations: 0
Abstract: Single-player role-playing games (RPGs) combine two promises that do not always align: delivering a compelling narrative experience (world, characters, choices, and consequences) while sustaining a demanding ludic trajectory in which players face obstacles, master systems, and progress over time. This Systematic Literature Review (SLR) synthesizes existing evidence on the evolution of narrative and challenge in single-player RPGs from a player-centered perspective, with particular attention paid to immersion, engagement, flow, and perceived agency. A multi-database search strategy was conducted across Google Scholar, Scopus, IEEE Xplore, and the ACM Digital Library using query strings targeting narrative/agency, challenge and dynamic difficulty adjustment (DDA), adaptive difficulty, and the historical evolution of RPG narrative design, following a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-reported selection flow and Rayyan-supported screening. From 423 identified records, duplicates and non-eligible records were removed through staged screening, yielding 43 reports sought for retrieval; because six were not accessible in full text at consolidation, the synthesis was conducted on 37 full-text articles. The findings indicate (i) a predominance of work on narrative and agency, where agency is framed as a design effect rather than merely the presence of explicit branching choices; (ii) a recent rise in challenge/adaptation research, frequently tied to flow, fairness, and differentiated player profiles; and (iii) the emergence of artificial intelligence (AI)-driven approaches, including non-player character (NPC) systems, combat AI, reinforcement learning, and large language model (LLM)-based narrative control, which amplify core design trade-offs between narrative coherence and perceived agency. 
Beyond synthesizing a dispersed body of literature, the review contributes an integrated player-centered analytical framework that brings together narrative, challenge, and player experience, while also highlighting the need for more consistent measurement practices, stronger comparative designs, and longer-term empirical work in single-player RPG research.
Evaluating the Performance of Large Language Models in Addressing Preoperative Patient Questions: A Systematic Review and Analysis
Citations: 0
Abstract: Large Language Models (LLMs), including ChatGPT, Google Gemini and Microsoft Copilot, are increasingly explored in the perioperative setting. Integrated use of LLMs in healthcare has been shown to improve efficiency, accuracy and patient management. This systematic review assesses the performance of LLMs at answering patients' preoperative questions across a range of surgical specialities.
EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents
Citations: 0
Abstract: Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during the dataset development and provide recommendations to support future large-scale dataset development.
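The 5W framing implied by the dataset's name (who, what, when, where, why) can be represented as a simple record type. A minimal sketch, assuming a plain per-event annotation with one free-text slot per aspect — the field layout is an illustration of the 5W convention, not the paper's actual schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Event5W:
    """One open-domain event annotation in 5W form (illustrative schema)."""
    who: Optional[str] = None    # participant(s) involved in the event
    what: Optional[str] = None   # the action or occurrence itself
    when: Optional[str] = None   # temporal expression
    where: Optional[str] = None  # location expression
    why: Optional[str] = None    # cause or motivation

    def completeness(self) -> float:
        """Fraction of the five aspects an extractor managed to fill."""
        filled = sum(getattr(self, f.name) is not None for f in fields(self))
        return filled / len(fields(self))

ev = Event5W(who="residents", what="evacuation", where="coastal district")
```

A completeness score like `ev.completeness()` (here 3 of 5 aspects, i.e. 0.6) is one simple way to quantify partial extractions during evaluation.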
GENIE CRM: An Agentic AI-Powered Customer Relationship Management System with Multi-Modal Intelligence and Automated Workflow Orchestration
DOI: 10.55041/isjem06720
Citations: 0
Abstract:
The rapid evolution of large language models and agentic AI frameworks has opened a new frontier for intelligent enterprise software. This paper presents GENIE CRM, a full-stack customer relationship management platform that embeds autonomous AI agents across sixteen functional modules to eliminate manual bottlenecks throughout the sales, support, and operations lifecycle. Built on a Python-Flask backend and a React-TypeScript single-page application, the system leverages Google Gemini 1.5 Flash as a unified multimodal AI backbone to perform structured lead scoring, visiting card optical character recognition, personalised multi-channel outreach generation, intelligent support ticket classification and round-robin routing, document and call recording summarisation, AI-driven bug criticality assessment, and real-time geospatial business intelligence. Experimental evaluation on representative CRM workflows demonstrates a 94.4 percent reduction in lead data entry time through multimodal OCR, a 22.2 percentage-point improvement in automated ticket routing accuracy over manual methods, and a 97.3 percent structural success rate for AI-generated JSON responses across all service endpoints. A persistent voice-enabled chatbot named Genie provides conversational access to the CRM database from every page of the application. The architecture combines Supabase-backed PostgreSQL for relational storage with lightweight JSON flat files for rapid-iteration state management, achieving sub-1.2-second dashboard aggregation across all sixteen modules. This paper details the system architecture, module-level design methodology, UML activity and use case diagrams derived from actual source code analysis, quantitative evaluation results, design trade-off discussion, and a roadmap for cloud-native multi-tenant deployment.
Keywords — Agentic AI; Customer Relationship Management; Google Gemini; Lead Scoring; Optical Character Recognition; Workflow Automation; Support Ticket Routing; Geolocation Intelligence; Flask; React; Supabase; Multimodal AI; Sales Automation; Natural Language Processing; Chatbot
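A "structural success rate" for AI-generated JSON, like the 97.3 percent reported above, is typically measured by checking that each model response parses as JSON and exposes the keys an endpoint expects. A minimal sketch — the `REQUIRED_KEYS` schema and the sample responses are hypothetical, not taken from GENIE CRM:

```python
import json

REQUIRED_KEYS = {"lead_score", "rationale"}  # hypothetical endpoint schema

def is_structurally_valid(raw: str) -> bool:
    """True if the response parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

responses = [
    '{"lead_score": 87, "rationale": "high engagement"}',
    '{"lead_score": 42}',             # parses, but a required key is missing
    'Sure! Here is the JSON: {...}',  # conversational filler, not JSON at all
]
success_rate = sum(map(is_structurally_valid, responses)) / len(responses)
```

Aggregating `is_structurally_valid` over logged responses per endpoint yields exactly this kind of structural success metric.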
RVQ-Alpha: Bridging Single-Cell Transcriptomics and Large Language Models via Discrete Tokenization and Verifiable Reinforcement Learning
Citations: 0
Abstract: No abstract available; see the original article.
Redefining Smart City Implementation: A New Model for Sustainability of Smart Cities
Citations: 0
Abstract: This study addresses a critical gap in smart city implementation by proposing a novel triple helix model that integrates smart governance, smart technologies, and active citizen engagement. While smart city initiatives are widespread, effective execution remains a major challenge. This research examines the smart city environment through the lens of smart government, smart people, and smart technology, revealing a disconnect between ambitious plans and the actual impact of applications. The new model was applied to four metropolitan cities in Turkey and tested using desktop research, analysis of 115 municipal websites based on the McMillan interaction model, and a citizen survey of 1,754 participants. The study presents an in-depth analysis based on detailed examination of the citizen survey and discusses the potential factors affecting awareness and usage by using the Chi-Square Test of Independence and Binary Logistic Regression to analyze 13 relationships using data from 19 smart mobility applications. The results show that smart city initiatives fail to reach their full potential without aware citizens and responsive governance, even with sufficient technology. This research contributes a critical perspective to the smart city discourse, offering a foundational model for future smart city implementations to ensure their success and sustainability.
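The Chi-Square Test of Independence used in the survey analysis reduces to comparing observed counts in a contingency table against the counts expected under independence. A stdlib-only sketch on a made-up awareness-versus-usage table (not the study's data); in practice the statistic is compared against a chi-square critical value for the table's degrees of freedom:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # independence assumption
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = aware / unaware of an app, cols = uses it / does not.
table = [[30, 10],
         [20, 40]]
stat = chi_square_statistic(table)  # df = (2-1)*(2-1) = 1 for this table
```

For this toy table the statistic is about 16.67, far above the 3.84 critical value at the 5% level for one degree of freedom, so independence would be rejected.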
A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents
Citations: 0
Abstract: Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction, MODEE, a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.
FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism
Citations: 0
Abstract: Building on recent interpretivist approaches, we conduct a critical narrative review across journalism studies, human-computer interaction, and FAccT scholarship, conceptualizing editorial authority as the conjunction of decision rights, epistemic warrant, and responsibility. We provide a comprehensive theoretical framework for addressing how concerns about fairness, accountability and transparency emerge, interact, and persist within AI-mediated journalistic practice. We identify and describe two concurrent authority reconfigurations driven by AI adoption. First, an internal migration of authority, in which editorial judgment is progressively deferred to large language models (LLMs) embedded within newsroom workflows. This migration occurs not through explicit policy decisions, but through interactional, cognitive, and organizational mechanisms that legitimize AI-generated outputs while obscuring responsibility and weakening individual and professional agency. Second, we analyze an external migration of authority, whereby decision-making power shifts from news organizations toward platforms, vendors, and infrastructural providers that supply AI systems and distribution channels, exacerbating existing power asymmetries within the media ecosystem. Unaddressed, these reconfigurations risk rendering fairness hard to maintain, accountability difficult to assign and transparency performative. We examine participatory approaches to AI design and deployment in journalism as potential mechanisms for retaining or reclaiming editorial authority. We critically assess both their promise and their structural limitations, highlighting how participation can either meaningfully redistribute authority or function as a tokenistic practice that leaves underlying power relations intact.
Unbiased Prevalence Estimation with Multicalibrated LLMs
Citations: 0
Abstract: Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications -- estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM -- demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.
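The failure mode described above can be reproduced with the classic prevalence correction p = (p_obs − FPR) / (TPR − FPR): error rates pooled on a source population go wrong when the subgroup mix shifts, while group-conditional correction (the flavor of guarantee multicalibration provides) stays unbiased. All numbers below are illustrative, not from the paper:

```python
def correct(p_obs, tpr, fpr):
    """Rogan-Gladen correction: invert p_obs = p*tpr + (1 - p)*fpr."""
    return (p_obs - fpr) / (tpr - fpr)

# Two subgroups where the classifier has different error rates (illustrative).
groups = {"A": {"tpr": 0.9, "fpr": 0.1}, "B": {"tpr": 0.8, "fpr": 0.3}}
true_prev = 0.3  # identical true prevalence in both groups

def observed(g):
    e = groups[g]
    return true_prev * e["tpr"] + (1 - true_prev) * e["fpr"]

# Error rates were estimated on a source mix (80% A); the target mix is shifted (20% A).
source_w, target_w = {"A": 0.8, "B": 0.2}, {"A": 0.2, "B": 0.8}
pooled_tpr = sum(source_w[g] * groups[g]["tpr"] for g in groups)
pooled_fpr = sum(source_w[g] * groups[g]["fpr"] for g in groups)
p_obs_target = sum(target_w[g] * observed(g) for g in groups)

naive = correct(p_obs_target, pooled_tpr, pooled_fpr)  # biased under covariate shift
groupwise = sum(target_w[g] * correct(observed(g), groups[g]["tpr"], groups[g]["fpr"])
                for g in groups)                       # group-conditional: unbiased
```

Here the pooled estimate lands near 0.39 despite a true prevalence of 0.30, while the group-conditional estimate recovers 0.30 exactly — the same qualitative gap the paper's simulation reports as shift magnitude grows.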
Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery
Citations: 0
Abstract: Ramsey-good graphs are graphs that contain neither a clique of size $s$ nor an independent set of size $t$. We study doubly saturated Ramsey-good graphs, defined as Ramsey-good graphs in which the addition or removal of any edge necessarily creates an $s$-clique or a $t$-independent set. We present a method combining SAT solving with bespoke LLM-generated code to discover infinite families of such graphs, answering a question of Grinstead and Roberts from 1982. In addition, we use LLMs to generate and formalize correctness proofs in Lean. This case study highlights the potential of integrating automated reasoning, large language models, and formal verification to accelerate mathematical discovery. We argue that such tool-driven workflows will play an increasingly central role in experimental mathematics.
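The defining properties are directly checkable by brute force on small graphs; the 5-cycle is the classic witness that avoids both a 3-clique and a 3-independent set, and it is also doubly saturated for s = t = 3. A stdlib sketch of the property check (not the paper's SAT-based search method):

```python
from itertools import combinations

def has_clique(edges, vertices, s):
    return any(all(frozenset(p) in edges for p in combinations(c, 2))
               for c in combinations(vertices, s))

def has_independent_set(edges, vertices, t):
    return any(all(frozenset(p) not in edges for p in combinations(c, 2))
               for c in combinations(vertices, t))

def is_ramsey_good(edges, vertices, s, t):
    return (not has_clique(edges, vertices, s)
            and not has_independent_set(edges, vertices, t))

def is_doubly_saturated(edges, vertices, s, t):
    """Ramsey-good, and adding any non-edge creates an s-clique while
    removing any edge creates a t-independent set."""
    if not is_ramsey_good(edges, vertices, s, t):
        return False
    non_edges = {frozenset(p) for p in combinations(vertices, 2)} - edges
    return (all(has_clique(edges | {e}, vertices, s) for e in non_edges)
            and all(has_independent_set(edges - {e}, vertices, t) for e in edges))

c5 = {frozenset({i, (i + 1) % 5}) for i in range(5)}  # the 5-cycle
```

Every chord of the 5-cycle closes a triangle and every deleted edge frees up a 3-independent set, so `is_doubly_saturated(c5, range(5), 3, 3)` holds — exactly the kind of certificate the paper's infinite families generalize.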
The Role of Large Language Models in the Promotion of Minimally Invasive Interventional Radiologic Methods in Gynecology and Obstetrics
DOI: 10.3390/jcm15093234
Citations: 0
Abstract: Background: Minimally invasive interventional radiology (IR) offers effective, uterus-preserving treatments for several gynecologic and obstetric conditions such as uterine fibroids, adenomyosis and postpartum hemorrhage. Despite their efficacy, these methods remain underused, partly due to limited awareness among clinicians and patients. Large language models (LLMs) may help bridge this gap by providing accessible, reliable information. Objective: To evaluate how current LLMs address knowledge gaps and promote awareness of minimally invasive IR methods in gynecology and obstetrics. Methods: A structured ten-question instrument was used to query three publicly available LLMs (OpenEvidence, ChatGPT, and Google Gemini). Responses were analyzed for accuracy, completeness, safety considerations, and patient-centered communication. Results: All three models accurately identified a range of medical, minimally invasive, and surgical treatments for uterine fibroids, adenomyosis, and postpartum hemorrhage, with OpenEvidence and ChatGPT providing more detailed and clinically nuanced responses. OpenEvidence achieved the highest scores overall, closely followed by ChatGPT, while Google Gemini scored lower, particularly in completeness and patient-centered communication. In more complex scenarios, performance differences became more pronounced, with OpenEvidence again leading, ChatGPT performing strongly, and Google Gemini lagging behind. Overall, OpenEvidence and ChatGPT demonstrated higher accuracy, completeness, and safety considerations, whereas Google Gemini showed comparatively weaker and less consistent performance. Conclusions: LLMs may support the promotion of minimally invasive IR methods in gynecology and obstetrics, but their outputs vary considerably in quality. Ongoing refinement and integration of evidence-based sources are essential before routine use in clinical practice.
Therefore, effective collaboration between artificial intelligence (AI) developers and medical professionals is essential to harness this technology’s full potential.
Can Large Language Models Assist the Comprehension of ROS2 Software Architectures?
Citations: 0
Abstract: Context. The most used development framework for robotics software is ROS2. ROS2 architectures are highly complex, with thousands of components communicating in a decentralized fashion. Goal. We aim to evaluate how LLMs can assist in the comprehension of factual information about the architecture of ROS2 systems. Method. We conduct a controlled experiment where we administer 1,230 prompts to 9 LLMs containing architecturally-relevant questions about 3 ROS2 systems with incremental size. We provide a generic algorithm that systematically generates architecturally-relevant questions for a ROS2 system. Then, we (i) assess the accuracy of the answers of the LLMs against a ground truth established via running and monitoring the 3 ROS2 systems and (ii) qualitatively analyse the explanations provided by the LLMs. Results. Almost all questions are answered correctly across all LLMs (mean=98.22%). gemini-2.5-pro performs best (100% accuracy across all prompts and systems), followed by o3 (99.77%), and gemini-2.5-flash (99.72%); the least performing LLM is gpt-4.1 (95%). Only 300/1,230 prompts are incorrectly answered, of which 249 are about the most complex system. The coherence scores in LLM's explanations range from 0.394 for "service references" to 0.762 for "communication path". The mean perplexity varies significantly across models, with chatgpt-4o achieving the lowest score (19.6) and o4-mini the highest (103.6). Conclusions. There is great potential in the usage of LLMs to aid ROS2 developers in comprehending non-trivial aspects of the software architecture of their systems. Nevertheless, developers should be aware of the intrinsic limitations and different performances of the LLMs and take those into account when using them.
Unsupervised protein language models learn patterns of enzyme function
Citations: 0
Abstract: No abstract available; see the original article.
cellNexus: Quality control, annotation, aggregation and analytical layers for the Human Cell Atlas data
Citations: 0
Abstract: No abstract available; see the original article.
Black-Box Skill Stealing Attack from Proprietary LLM Agents: An Empirical Study
Citations: 0
Abstract: Large language model (LLM) agents increasingly rely on skills to package reusable capabilities through instructions, tools, and resources. High-quality skills embed expert knowledge, curated workflows, and execution constraints into agents, fueling a growing skill economy through their value and scalability. Yet this ecosystem also creates a new attack surface, as adversaries can interact with public agent interfaces to extract hidden proprietary skill content. We present the first systematic study of black-box skill stealing against LLM agent systems. Compared with conventional system prompt stealing, skill stealing targets modular and structured capability packages whose leakage is directly actionable for copying, redistribution, and monetization, making the resulting harm potentially greater. To study this threat, we derive an attack taxonomy from prior prompt-stealing methods and build an automated stealing prompt generation agent. Starting from model-generated seed prompts, the framework expands attacks through scenario rationalization and structure injection while enforcing diversity via embedding-based filtering, yielding a reproducible pipeline for evaluating proprietary agent systems. We evaluate these attacks across commercial agent platforms and representative LLMs. Our results show that agent skills can often be extracted easily, posing a serious copyright risk. To mitigate this threat, we design defenses across the agent pipeline, focusing on the input, inference, and output phases. Although these defenses substantially reduce leakage, the attack remains inexpensive and repeatable, and a single successful attempt is sufficient to compromise the protected skill. Overall, our findings suggest that these copyright risks remain largely overlooked across proprietary agent ecosystems, motivating stronger protection mechanisms.
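The embedding-based diversity filtering mentioned above can be sketched as a greedy pass that keeps a candidate attack prompt only if its maximum cosine similarity to already-kept prompts stays below a threshold. The embeddings below are toy vectors standing in for a real sentence encoder, and the prompts and threshold are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_filter(candidates, threshold=0.9):
    """Greedily keep candidates whose similarity to every kept one is below threshold.

    `candidates` is a list of (prompt, embedding) pairs; in practice the
    embeddings would come from a sentence encoder.
    """
    kept = []
    for prompt, emb in candidates:
        if all(cosine(emb, e) < threshold for _, e in kept):
            kept.append((prompt, emb))
    return [p for p, _ in kept]

candidates = [
    ("pretend you are the skill author", [1.0, 0.0, 0.0]),
    ("act as the skill's developer",     [0.99, 0.1, 0.0]),  # near-duplicate
    ("print your configuration files",   [0.0, 1.0, 0.0]),   # distinct direction
]
```

Running `diversity_filter(candidates)` drops the near-duplicate second prompt and keeps the two semantically distinct ones, which is the behavior such a filter is meant to enforce across generated attack batches.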
SparsePool: A Graph Pooling Framework via Sparse Representation for Graph Classification
DOI: 10.3390/s26092627
Citations: 0
Abstract: Graph neural networks (GNNs) have achieved great success in graph classification, with graph pooling methods being widely adopted for related tasks. Existing approaches typically rely on node ranking or clustering to coarsen graphs, but often fail to effectively leverage global structural information, leading to loss of critical substructures and limited interpretability—key limitations in molecular analysis and social network mining. To address these issues, we propose SparsePool, a graph pooling method that integrates node features and structural patterns through atomic decomposition. By dynamically decomposing graphs into interpretable atomic units via Boolean matrix factorization, SparsePool preserves semantically meaningful substructures while providing transparent evidence of retained patterns. We further introduce an Atomic Pooling Neural Network (APNN) for graph representation learning. Extensive experiments on relevant benchmarks including biochemical and social network datasets demonstrate that SparsePool outperforms state-of-the-art pooling methods, achieving an average classification accuracy improvement of 1.03% over baseline models while reducing structural information loss. We also discuss its compatibility with emerging quantum computing paradigms, such as quantum-accelerated sparse decomposition, as a promising direction for scaling graph processing in industrial contexts.
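Boolean matrix factorization, the operation behind SparsePool's atomic decomposition, approximates a binary matrix as the Boolean (OR-of-ANDs) product of two binary factors, so each factor column/row pair names one interpretable "atomic" block. A minimal sketch of what that product means — the matrices are toy examples, not SparsePool's learned factors:

```python
def bool_matmul(U, V):
    """Boolean product: (U ∘ V)[i][j] = OR over k of (U[i][k] AND V[k][j])."""
    n, r, m = len(U), len(V), len(V[0])
    return [[int(any(U[i][k] and V[k][j] for k in range(r))) for j in range(m)]
            for i in range(n)]

# A 4x4 binary (adjacency-like) matrix assembled from two overlapping atomic blocks:
# factor k=0 covers rows {0,1,3} x cols {0,1}; factor k=1 covers rows {2,3} x cols {2,3}.
U = [[1, 0],
     [1, 0],
     [0, 1],
     [1, 1]]
V = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
A = bool_matmul(U, V)  # rank-2 Boolean reconstruction
```

Reading the factors directly shows which atomic pattern each row participates in (row 3 belongs to both blocks), which is the transparency argument the abstract makes for decomposition-based pooling.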
Towards a general-purpose foundation model for functional MRI analysis
Citations: 0
Abstract: No abstract available; see the original article.
Prompt Analysis in Large Language Models Adversarial Prompt Detection
Citations: 0
Abstract: Large Language Models such as GPT, LLaMA, and Claude have demonstrated remarkable capabilities in natural language understanding and generation, making them increasingly popular in applications such as chatbots, coding assistants, and enterprise solutions. However, the widespread adoption of these models has also raised serious concerns regarding their security, as adversarial prompt injection attacks, commonly known as “jailbreaking,” manipulate LLMs into generating harmful, biased or confidential outputs by bypassing safety measures. Traditional rule-based defenses are insufficient since adversarial prompts exploit the semantic and contextual nature of language. This project proposes the development of an AI-driven detection system to identify and block adversarial prompts before they reach the LLM by leveraging transformer-based NLP models (BERT, RoBERTa, LLaMA) along with adversarial training techniques to build a classifier that distinguishes between safe, suspicious and malicious prompts. The solution not only strengthens the security of LLM-based systems but also contributes to the emerging field of AI safety and robustness through experimental validation on standard adversarial NLP datasets and synthetic jailbreak attempts, targeting over 85% detection accuracy with explainable outputs that provide transparency into threat detection mechanisms.
Integrating computational and experimental approaches to identify novel pyrazolo[3,4-d]pyrimidine-containing thiadiazole frameworks for treating Alzheimer’s disease
Citations: 0
Abstract:
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder associated with cholinergic dysfunction. We report a novel series of pyrazolo[3,4-d]pyrimidine–thiadiazole derivatives as dual inhibitors of acetylcholinesterase (AChE) and butyrylcholinesterase (BuChE). The compounds were characterized by ¹H NMR, ¹³C NMR, and HRMS, and their inhibitory activity was evaluated in vitro. Among the synthesized series, compound 10f emerged as the most potent analogue, exhibiting IC₅₀ values of 14.60 ± 1.80 nM (AChE) and 290.76 ± 2.90 nM (BuChE), with inhibitory potency comparable to the reference drug Donepezil. Molecular docking revealed that these compounds engage both the catalytic and peripheral anionic sites via π–π stacking and hydrogen-bond interactions, supporting a dual-binding mechanism. Density functional theory (DFT) calculations at the B3LYP/6-31G(d,p) level highlighted electronic features correlating with activity, while in silico ADMET profiling suggested favorable oral bioavailability and low toxicity. These results position this series as promising multifunctional lead candidates for further development in AD therapy.
DryRUN: On the Role of Public Tests in LLM-Driven Code Generation
Citations: 0
Abstract: Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle. Because ground-truth input-output examples are rarely available prior to implementation in real-world software engineering, this dependency restricts methods to curated competitive programming benchmarks. Furthermore, we identify that reliance on these public tests induces an "overconfidence gap," causing frameworks to overfit to simplistic examples and fail on hidden evaluations. In contrast, we observe that external sample inputs are not strictly necessary for code generation. We demonstrate that large language models can autonomously generate valid inputs and simulate execution traces to self-correct. Consequently, we develop DryRUN, a framework that eliminates the need for ground-truth samples by allowing the LLM to iteratively plan, autonomously generate its own inputs and simulate execution, mitigating algorithmic overconfidence. Evaluations on the LiveCodeBench v6 dataset (post-March 2025) demonstrate that DryRUN matches the performance of CodeSIM, a state-of-the-art, public-test-dependent framework, while operating entirely without public test cases or external execution feedback and reducing output token consumption.
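The plan → self-generate inputs → simulate → repair loop can be sketched with stubbed components. Everything here is a hypothetical stand-in: `candidates` plays the role of successive LLM code drafts, `self_generated` the model-proposed inputs, and `spec` the simulated execution oracle — none of it is DryRUN's actual implementation:

```python
def self_correct(candidates, gen_inputs, spec):
    """Return the first candidate whose real execution matches the simulated
    spec on self-generated inputs; a stand-in for an iterative repair loop."""
    inputs = gen_inputs()
    for code in candidates:
        if all(code(x) == spec(x) for x in inputs):
            return code
    return None

# Hypothetical task: absolute value. The first "draft" is wrong for negatives.
candidates = [lambda x: x,                     # buggy first attempt
              lambda x: x if x >= 0 else -x]   # repaired attempt
self_generated = lambda: [0, 3, -2]            # model-proposed inputs (incl. a negative)
spec = abs                                     # stand-in for a simulated execution trace

best = self_correct(candidates, self_generated, spec)
```

The key point the abstract makes survives even in this toy form: because the self-generated inputs include a negative value, the buggy draft is rejected without any human-provided test case.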
LC–MS Based Phytochemical Profiling and In Silico Multi-Target Evaluation of Passiflora tarminiana against α-Amylase and α-Glucosidase
Citations: 0
Abstract: Several Passiflora species have been documented to exhibit anti-diabetic properties. Hence, the study presents the first insight into the phytochemical profile of an aqueous extract of Passiflora tarminiana indigenous to the Nilgiris, through LC-MS-based analysis with molecular docking to identify and prioritize potential multitarget bioactive compounds for antidiabetic activity. Diabetes mellitus, characterised by impaired blood glucose levels, poses a formidable public health challenge. Although many inhibitors targeting starch-hydrolysing enzymes are available, natural inhibitors are considered safe. Therefore, an in silico approach was used to analyze potential phytochemicals against major antidiabetic targets. These phytochemicals were characterized by LC-MS/MS, followed by preliminary phytochemical screening. Qualitative phytochemical screening confirmed the presence of alkaloids, flavonoids, glycosides, saponins, and terpenoids. The total phenolic content (TPC) (373.81 ± 18 µg/mL GAE) and total flavonoid content (TFC) (271.96 ± 20 µg/mL QE) were determined in the aqueous extracts of the fruit. LC-MS/MS profiling revealed 54 compounds, 24 of which exhibited potential anti-diabetic activity. ADMET properties confirmed that 20 compounds satisfied Lipinski’s rule and revealed drug-likeness and good bioavailability. Molecular docking against α-amylase and α-glucosidase prioritized Maritimetin, Diprogulic acid, and Pirbuterol as promising compounds based on their favorable binding affinities. Furthermore, the stability was examined using molecular dynamics (MD) simulations through trajectory-based analyses (RMSD, RMSF, and hydrogen bonding) and conformational stability assessments, including radius of gyration (Rg), solvent-accessible surface area (SASA), and PCA-derived free energy landscapes. The MMPBSA calculations demonstrated favourable binding free energies of the selected compounds compared to acarbose, supporting the robustness of the findings.
However, the analysis is limited by the lack of experimental validation through enzyme inhibition and kinetic studies, and the LC-MS-based compound identifications remain tentative, as they were not confirmed using reference standards.
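The Lipinski drug-likeness filter applied during ADMET screening is a simple rule-of-five check on molecular descriptors. A sketch with hypothetical descriptor values — the numbers are illustrative placeholders, not the paper's measured properties:

```python
def passes_lipinski(mw, logp, h_donors, h_acceptors):
    """Strict Lipinski rule of five: molecular weight <= 500 Da, logP <= 5,
    <= 5 hydrogen-bond donors, <= 10 hydrogen-bond acceptors.
    (In practice one violation is often tolerated.)"""
    return mw <= 500 and logp <= 5 and h_donors <= 5 and h_acceptors <= 10

# Hypothetical descriptor values for two screened phytochemicals (illustrative only).
druglike_example = dict(mw=286.2, logp=2.1, h_donors=4, h_acceptors=6)
failing_example  = dict(mw=612.7, logp=6.3, h_donors=7, h_acceptors=12)
```

Counting how many of the 54 profiled compounds pass such a filter is how a figure like "20 compounds satisfied Lipinski's rule" is typically obtained.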
Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models
Citations: 0
Abstract: No abstract available; see the original article.
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
Citations: 1
Abstract: Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can be even amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
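The frequency-based sampling idea can be sketched as: compute each question's majority-answer consistency over sampled responses, drop the ambiguous middle band, and otherwise label responses by agreement with the majority. The thresholds and labeling scheme below are illustrative assumptions, not DDRL's exact procedure:

```python
from collections import Counter

def filter_ambiguous(samples, low=0.3, high=0.7):
    """Keep responses only when the majority-answer consistency falls outside
    the ambiguity band [low, high); label each response by majority agreement."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    consistency = freq / len(samples)
    if low <= consistency < high:
        return []  # ambiguity region: primary source of reward noise, excluded
    return [(s, 1 if s == answer else 0) for s in samples]

confident = filter_ambiguous(["42"] * 8 + ["7"] * 2)  # consistency 0.8 -> kept
ambiguous = filter_ambiguous(["42"] * 5 + ["7"] * 5)  # consistency 0.5 -> dropped
```

The confident batch yields eight positives and two negatives, while the evenly split batch is discarded entirely rather than allowed to inject a noisy pseudo-reward.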
Feasibility of Wind‐Powered Green Hydrogen Production via a Hybrid Graph Neural Network‐Transformer Forecasting Model
DOI: 10.1002/ese3.70541
Citations: 0
Abstract:
Accurate long‐term wind speed forecasting is pivotal for the strategic planning of renewable energy infrastructure, particularly for assessing the techno‐economic feasibility of wind‐powered green hydrogen facilities. However, capturing the complex spatiotemporal dependencies in climate data remains a significant challenge. This study proposes a hybrid deep learning framework designed to enhance 1‐ to 10‐year wind speed forecasts. The proposed architecture integrates graph neural networks (GNN) to extract inter‐variable correlations and feature‐space dynamics among meteorological parameters, coupled with advanced sequence modeling layers to capture temporal patterns. We rigorously evaluated the framework using multi‐variable climate data from NASA's Power Data Access Viewer, comparing a GNN‐Transformer model against a GNN‐GRU variant, as well as standard baselines (LSTM, CNN) and state‐of‐the‐art hybrids (e.g., MST‐GNN). The results demonstrate that the proposed hybrid framework significantly outperforms standalone models. Specifically, the GNN‐Transformer achieved a Mean Absolute Error (MAE) of 0.53 m/s for 10‐year forecasts, representing a 30.27% improvement over a standard Transformer. Furthermore, our comparative analysis reveals that the GNN‐GRU variant achieved superior practical performance with an MAE of 0.44 m/s. These findings provide two key contributions: (1) establishing a robust GNN‐based framework that advances long‐term forecasting accuracy for green hydrogen site planning, and (2) offering empirical evidence that while Transformers offer greater architectural complexity, simpler recurrent architectures like the GRU may yield better stability in specific long‐term climatological tasks.
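The headline numbers reduce to mean absolute error and relative MAE reduction. A quick sketch (the forecast series is made up; only the 0.76 vs 0.53 m/s comparison mirrors the abstract's reported ~30% figure):

```python
def mae(y_true, y_pred):
    """Mean absolute error between observed and forecast wind speeds (m/s)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_improvement(baseline_mae, model_mae):
    """Percent reduction in MAE relative to a baseline model."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae

# Toy observed vs forecast series (illustrative values, not NASA POWER data).
observed = [5.2, 6.1, 4.8, 7.0]
forecast = [5.0, 6.4, 4.5, 7.2]
err = mae(observed, forecast)  # 0.25 m/s on this toy series
# A baseline MAE of ~0.76 m/s vs the reported 0.53 m/s gives ~30.3% improvement.
```

This also makes the abstract's internal arithmetic checkable: a 30.27% improvement landing at 0.53 m/s implies a standard-Transformer baseline of roughly 0.76 m/s.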
OptiMat Alloys: A FAIR End-to-End Agent with Living Database for Computational Multi-Principal Alloy Exploration
🔥 Citations:
0
Abstract: The FAIR principles have transformed how computational data and workflows are shared in materials research, yet existing repositories can only serve pre-computed entries -- broad coverage is perpetually incomplete and cannot adapt to new questions on demand. To address these challenges, we present OptiMat Alloys, a large language model-powered conversational agent for multi-principal element alloy exploration built on three pillars: a living database that stores every calculation with provenance, low-barrier accessibility through a web interface requiring zero programming expertise, and built-in uncertainty quantification via cross-potential and cross-configuration validation (see demo here https://youtu.be/lQzuorkzPMc). Coupling foundational machine learning interatomic potentials covering nearly the entire periodic table of elements with natural-language interaction, OptiMat Alloys enables targeted, on-demand computation guided by the user's domain knowledge, extending FAIR from pre-computed repositories to on-demand knowledge generation and making computational alloy screening accessible to any materials scientist.
Use of generative AI in database courses: a role-based review
🔥 Citations:
0
Abstract:
Generative artificial intelligence (genAI), particularly large language models (LLMs), has attracted increasing attention for its potential to support teaching and learning in higher education. However, how genAI is pedagogically integrated into specific disciplinary contexts, such as database education, remains insufficiently understood.
This review examines the use of genAI in database courses through a role-based analytical framework, distinguishing the instructional function of AI as supplementary assistant, direct mediator, and new subject. Following PRISMA guidelines, peer-reviewed journal articles indexed in Web of Science and Scopus were screened. From an initial pool of 406 studies, nine empirical studies were identified.
GenAI is most commonly used as a supplementary assistant in database courses, where it supports tasks such as querying, data generation, solution checking, and error correction. The studies reported positive effects on academic performance, instructional efficiency, and time savings, with relatively few ethical or pedagogical concerns. The use of AI as a direct mediator demonstrated clear benefits but also raised concerns related to the accuracy and reliability of AI-generated evaluations. Finally, one exploratory study positioning AI as a new subject emerged as the most transformative yet risky role. While this approach offers potential for substantial instructional innovation, it may also lead to challenges related to accuracy and equity.
This review underscores the potential and importance of role-based integration of genAI in database education and identifies key directions for future research. Research on AI usage in database instruction is still at an early stage, and further empirical studies are required.
A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge
🔥 Citations:
0
Abstract: Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates. In the event, participants engaged with a humanoid robot in a logistics-inspired task environment using voice commands interpreted through an LLM-based control framework. The activity was designed as a team-based, role-driven experience intended to expose participants to embodied AI and human-robot collaboration without requiring prior robotics expertise. To evaluate the approach, a post-event survey remained open for 16 days and collected 102 responses. Results indicate strong overall reception, with high satisfaction (8.46/10), increased interest in robotics and AI (4.47/5), and improved understanding of emerging forms of human-robot collaboration (4.45/5). Participants who interacted directly with the robot also reported natural interaction (4.37/5) and a strong sense that interaction became easier as the activity progressed (4.74/5). At the same time, lower ratings for reliability and predictability point to important technical and design challenges for future iterations. The findings suggest that challenge-based, LLM-enabled humanoid interaction can serve as a promising and replicable method for robotics awareness in industrial and operational environments.
Can MLLMs "Read" What is Missing?
🔥 Citations:
0
Abstract: We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.
Empirical Analysis of Internal Hallucination Detection in Quantized LLMs: Layer Dynamics and White-Box Benchmarks
🔥 Citations:
0
Abstract: As large language models (LLMs) move onto resource-constrained devices, maintaining factual reliability without adding another expensive decoding pass becomes a practical inference problem. Instead of introducing another complex hallucination detector, this paper presents an empirical study of which low-cost white-box features remain useful under a controlled single-pass benchmark. Across repeated candidate-answer reruns on Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct, truthful and incorrect internal states are most separable in the middle-to-late layers, with the peak consistently falling at 50–70% of total network depth across both model families. The depth-relative pattern is more stable than any single detector ranking: simple residual-space baselines, including Mahalanobis scoring, remain competitive with more elaborate residual-plus-spectral fusion features under the same protocol, although detector ranking still changes by task. A separate preliminary two-seed Qwen2.5-7B-Instruct BF16 probe under that same white-box benchmark reproduces the same middle-to-late peak, and auxiliary Int8 checks on Qwen2.5-1.5B and Qwen2.5-7B remain consistent with that same localization under moderate quantization. Taken together, the results point away from detector complexity and toward a more reproducible question of where hallucination cues emerge, which internal statistics remain reliable, and how cautiously such conclusions should be transferred to deployment settings.
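The "simple residual-space baseline" named above, Mahalanobis scoring, can be illustrated with a diagonal-covariance simplification (the diagonal approximation and toy numbers are ours; the paper's features and covariance handling may differ):

```python
import math

def mahalanobis_diag(x, mean, var):
    """Distance of a hidden state x from the mean of truthful-answer
    internal states, under a diagonal covariance approximation; larger
    scores suggest the state looks unlike known-truthful states."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

d = mahalanobis_diag([1.0, 2.0], [0.0, 0.0], [1.0, 4.0])  # sqrt(1 + 1)
```

In practice the statistics would be estimated from middle-to-late-layer residual states, where the paper reports truthful and incorrect states are most separable.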
Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions
🔥 Citations:
0
Abstract: Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making. By analyzing the reasoning processes, we identify a clear cross-perspective divergence: while moral rightness remains consistently fairness-oriented, predicted human behavior shifts significantly toward loyalty as relational closeness increases. Crucially, model decisions align with moral rightness judgments rather than their own behavioral predictions. This inconsistency suggests that LLM decision-making prioritizes rigid, prescriptive rules over the social sensitivity present in their internal world-modeling, which poses a gap that may lead to significant misalignments in real-world deployments.
Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
🔥 Citations:
0
Abstract: As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for Planting Logic Landmines During LLM Training
🔥 Citations:
0
Abstract: Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
🔥 Citations:
0
Abstract: In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability. To address these limitations, this paper proposes a collaborative prediction graph neural network that integrates molecular structural features and cell-line genomic profiles with drug-drug interactions to enhance the prediction of synergistic effects. We introduce a novel model named the Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). The model first extracts multi-scale topological features of drug molecules using a residual graph isomorphism network, where residual connections help mitigate over-smoothing in deep layers. Subsequently, an adaptive Long Short-Term Memory (LSTM) module fuses structural information from local to global scales. Finally, a cross-attention module is designed to explicitly model drug-drug interactions and identify key chemical substructures. Extensive experiments on five public benchmark datasets demonstrate that ResGIN-Att achieves competitive performance, comparing favorably against key baseline methods while exhibiting promising generalization capability and robustness.
Using Large Language Models to Identify Patient–Oncologist Communication Domains: A Feasibility Study
🔥 Citations:
0
Abstract:
The American Society of Clinical Oncology (ASCO) convened a multidisciplinary panel in 2017, resulting in patient–oncologist communication guidelines. Ideally, these conversations should be documented in the medical records. However, chart review for communication topics is inefficient. Large language models (LLMs) present a computational method for identification of communication domains in clinical notes, subsequently providing feedback for clinicians.
The purpose of this study was to develop an approach using LLMs to identify communication domains in unstructured free text notes, validating against gold-standard chart review.
The study population included 134 clinical notes from 30 patients with advanced cancer seen in June 2024 at one of seven Dana-Farber Cancer Institute clinics (Boston, MA). We used a HIPAA-secure artificial intelligence tool based on GPT-4o to develop an LLM prompt for identification of communication domains.
We used standard performance metrics to compare the LLM prompt to chart review for all six communication domains. A hallucination index was calculated to assess false information that may be produced by LLMs when applied to large data sets.
Across communication domains, compared to chart review, the note-level LLM analysis achieved sensitivity ranging from 0.43 to 1.0, specificity ranging from 0.32 to 0.99, and accuracy ranging from 0.51 to 0.99. The average hallucination index for all domains was low. LLM abstraction required approximately 7 seconds per note, compared to 5–7 minutes with chart review.
LLMs have the potential to identify ASCO communication domains. Future directions include applying the method for quality improvement efforts, such as generating feedback for oncologists on topics that may require follow-up.
Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.
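The audit's core quantities, the shift in mean predicted risk and the change in output variability under added contextual noise, reduce to simple descriptive statistics; the helper below is our sketch, with made-up risk scores:

```python
import statistics

def audit_shift(baseline_scores, noisy_scores):
    """Compare predicted hospitalization risk between prompts containing
    only clinically relevant features (baseline) and the same prompts
    padded with medically insignificant features (noisy)."""
    return {
        "mean_shift": statistics.mean(noisy_scores) - statistics.mean(baseline_scores),
        "stdev_baseline": statistics.stdev(baseline_scores),
        "stdev_noisy": statistics.stdev(noisy_scores),
    }

# Toy scores: padding with irrelevant features raises and destabilizes risk.
r = audit_shift([0.30, 0.32, 0.31], [0.40, 0.48, 0.35])
```

The paper's audit repeats this comparison across models and across the four prompt reframings.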
An Alternate Agentic AI Architecture (It's About the Data)
🔥 Citations:
0
Abstract: For the last several years, the dominant narrative in "agentic AI" has been that large language models should orchestrate information access by dynamically selecting tools, issuing sub-queries, and synthesizing results. We argue this approach is misguided: enterprises do not suffer from a reasoning deficit, but from a data integration problem. Enterprises are data-centric: critical information is scattered across heterogeneous systems (e.g., databases, documents, and external services), each with its own query language, schema, access controls, and performance constraints. In contrast, contemporary LLM-based architectures are optimized for reasoning over unstructured text and treat enterprise systems as either corpora or external tools invoked by a black-box component. This creates a mismatch between schema-rich, governed, performance-critical data systems and text-centric, probabilistic LLM architectures, leading to limited transparency, weak correctness guarantees, and unpredictable performance. In this paper, we present RUBICON, an alternative architecture grounded in data management principles. Instead of delegating orchestration to an opaque agent, we introduce AQL (Agentic Query Language), a small, explicit query algebra - Find, From, and Where - executed through source-specific wrappers that enforce access control, schema alignment, and result normalization. All intermediate results are visible and inspectable. Complex questions are decomposed into structured, auditable query plans rather than hidden chains of LLM calls. Our thesis is simple: enterprise AI is not a prompt engineering problem; it is a systems problem. By reintroducing explicit query structure, wrapper-based mediation, and cost-based optimization, we obtain the breadth of agentic search while preserving traceability, determinism, and trust in enterprise environments.
Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model
🔥 Citations:
1
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their ability to generate human-like text has raised concerns about potential misuse. This underscores the need for reliable and effective methods to detect LLM-generated text. In this paper, we propose IRM, a novel zero-shot approach that leverages Implicit Reward Models for LLM-generated text detection. Such implicit reward models can be derived from publicly available instruction-tuned and base models. Previous reward-based method relies on preference construction and task-specific fine-tuning. In comparison, IRM requires neither preference collection nor additional training. We evaluate IRM on the DetectRL benchmark and demonstrate that IRM can achieve superior detection performance, outperforms existing zero-shot and supervised methods in LLM-generated text detection.
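An implicit reward of this kind is typically the scaled log-probability gap between an instruction-tuned model and its base model (the DPO-style implicit reward); the toy scorer below assumes per-token log-probabilities are already available and is our simplification, not the paper's exact scoring rule:

```python
def implicit_reward(logp_tuned, logp_base, beta=1.0):
    """Average per-token reward beta * (log p_tuned - log p_base).
    Higher values mean the instruction-tuned model prefers the text far
    more than the base model, which IRM-style detectors treat as
    evidence of LLM-generated text; thresholding is left to the caller."""
    gaps = [t - b for t, b in zip(logp_tuned, logp_base)]
    return beta * sum(gaps) / len(gaps)

# Toy log-probs for a three-token text:
score = implicit_reward([-1.0, -0.5, -0.8], [-2.0, -1.5, -1.0])
```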
Target discovery and drug design in the era of artificial intelligence
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
The first CRISPR gene editor designed from scratch by AI is here! Built on a large language model, it successfully edited the human genome
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Enhancing Online Recruitment with Category-Aware MoE and LLM-based Data Augmentation
🔥 Citations:
0
Abstract: Person-Job Fit (PJF) is a critical component for online recruitment. Existing approaches face several challenges, particularly in handling low-quality job descriptions and similar candidate-job pairs, which impair model performance. To address these challenges, this paper proposes a large language model (LLM) based method with two novel techniques: (1) LLM-based data augmentation, which polishes and rewrites low-quality job descriptions by leveraging chain-of-thought (COT) prompts, and (2) category-aware Mixture of Experts (MoE) that assists in identifying similar candidate-job pairs. This MoE module incorporates category embeddings to dynamically assign weights to the experts and learns more distinguishable patterns for similar candidate-job pairs. We perform offline evaluations and online A/B tests on our recruitment platform. Our method relatively surpasses existing methods by 2.40% in AUC and 7.46% in GAUC, and boosts click-through conversion rate (CTCVR) by 19.4% in online tests, saving millions of CNY in external headhunting expenses.
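The category-aware gating idea, expert weights conditioned on a job-category embedding, can be sketched without any ML framework; every name and number below is illustrative, not the paper's implementation:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def category_aware_gate(pair_feat, cat_emb, gate_w):
    """Score each expert from the candidate-job features concatenated
    with the category embedding, so similar candidate-job pairs from
    different job categories can be routed to different experts."""
    x = pair_feat + cat_emb  # list concatenation = feature concatenation
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    return softmax(scores)   # mixture weights over experts

weights = category_aware_gate(
    pair_feat=[0.2, 0.7],
    cat_emb=[1.0, 0.0],                   # hypothetical category embedding
    gate_w=[[1, 0, 1, 0], [0, 1, 0, 1]],  # two experts' gate rows
)
```

Changing `cat_emb` shifts the mixture weights even when `pair_feat` is identical, which is the mechanism for separating similar pairs.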
GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.
Large Language Model Agent Enables Autonomous Machine Learning Model Building for Biomedicine
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Enhanced GNN with Causal Proximity Vectors: Bridging Causality and Proximity in Graph Neural Networks
🔥 Citations:
0
Abstract: A knowledge graph is a structured representation of entities and their relationships, often used in biomedical domains to model complex interactions. Graph Neural Networks (GNNs), which utilize these graphs, are effective for predicting interactions missing in the knowledge graph. However, GNNs lack the ability to incorporate causal reasoning, which is crucial in biomedical applications, and they are limited in their ability to generalize to unseen data.
In oncology, where treatment regimens are intricate and patient responses are highly variable, predicting Adverse Drug Reactions (ADRs) is particularly difficult. Existing models fail to capture the indirect, high-granularity information needed for accurate ADR prediction. To address these challenges, we propose the Causality and Proximity-based Relational Multihead Attention Model (CPRMAM). This model leverages a knowledge graph of ADR-related cancer case studies and introduces a causal proximity vector to prioritize relevant relationships. By employing an inductive GNN approach, CPRMAM generalizes to unseen data, improving ADR prediction.
The Intersection of Knowledge Graphs, Large Language Models, and Mass Spectrometry Data: Progress and Applications
DOI:
10.1002/widm.70086
🔥 Citations:
0
Abstract:
The rapid development of knowledge graphs (KGs) and large language models (LLMs) in artificial intelligence (AI) has provided a new approach to intelligent analysis of mass spectrometry (MS) data. KGs can enhance the ability to identify metabolites, perform functional annotation, and integrate multi‐omics data through structured knowledge representation and semantic association. Meanwhile, the powerful context‐understanding capabilities of LLMs and advanced natural language processing (NLP) technologies can assist in the automated analysis and application of MS data. When combined, they overcome the limitations of traditional technologies, which is significant for researching high‐dimensional data analysis, detecting low‐abundance signals, and integrating knowledge across domains. This article reviews cross‐domain applications of KGs and LLMs in MS data analysis. We focus on the latest progress in metabolite annotation, automated report generation, and multi‐omics data integration. Furthermore, from the perspectives of quality, evolution, and reliability of KGs, we outline technical challenges currently faced, such as insufficient data standardization, lack of traceability and version control in KGs, limited model interpretability, and obstacles in cross‐modal fusion. Finally, we summarize research directions that can promote integration of KG and LLM in the future, such as how to enhance knowledge representation and how to achieve multimodal learning and dynamic knowledge updates. All of these can improve the accuracy and efficiency of data interpretation and open new research directions in metabolomics, proteomics, and broader life sciences.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Application Areas > Health Care
Fundamental Concepts of Data and Knowledge > Knowledge Representation
Cross-Modal Retrieval-Augmented Generation for Craft Gestures Learning: Enabling Dialogue with Multimodal Pedagogical Contents
🔥 Citations:
0
Abstract: Learning manual craft gestures poses notable challenges for apprentices, who must interpret intricate motor patterns, pinpoint essential procedural parameters and discern frequent execution errors through observation of experts. When apprentices practise on their own, conventional vocational education often relies on static content delivery, which may leave them without interactive support when analysing gesture complexity or clarifying technical nuances. The proposed work aims to fill the gap by creating an interactive conversational learning experience where Retrieval-Augmented Generation (RAG) converts a master's tacit knowledge into explicit pedagogical material. This approach processes egocentric video recordings and aligns them with text from an elicitation where craftspeople describe in detail key aspects of their motor control, material handling and tool manipulation. By employing vector embeddings and semantic search, the framework retrieves contextually relevant professional insights, which Large Language Models then use to generate pedagogically coherent information about execution patterns, error identification and technique refinement. This Cross-Modal RAG (CM-RAG) has been integrated into Moodle LMS, an open-source learning management system for delivering and tracking educational content, facilitating natural language dialogues for apprentices regarding complex manual procedures such as glassblowing. The framework effectively retrieves information about movement forms and common failure modes, offering targeted guidance in instances where instructions are unavailable. This approach is expected to support motor skill acquisition in vocational education, narrowing the gap between expert demonstration and apprentice understanding through conversational learning.
Empowering Patients and Clinicians: LLMs in Hypertension Care: A Scoping Review
🔥 Citations:
0
Abstract:
Large language models have emerged as potential tools to support hypertension care, including diagnosis, treatment decision-making, and patient education. However, evidence regarding their validity, performance, and clinical applicability remains limited. The objective is to map current applications of large language models in hypertension care, with emphasis on model optimization strategies, evaluation approaches, and reported limitations. We conducted a Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews–compliant scoping review of primary studies published between 2023 and 2025 evaluating large language models in hypertension. Thirty-three studies were included. Data were charted on clinical use cases, model optimization techniques, evaluation metrics, data sets, and limitations. Applications were categorized into clinical decision support systems, patient education, medical education, research support, and administrative functions. GPT-based models predominated (82%). Model optimization was limited: 89% relied exclusively on prompt engineering. Most applications focused on patient education (52%) and clinical decision support systems (24%). In clinical decision support systems, reported accuracy ranged from 65% to 100%, reaching 87% to 91% for ambulatory blood pressure monitoring interpretation. Patient education applications showed accuracy between 80% and 90%, but frequent issues included excessive language complexity and occasional unsafe outputs. Across domains, evaluation methods were heterogeneous, reproducibility was inconsistently assessed, and safety concerns, including hallucinations and outdated knowledge, were commonly reported. Current evidence suggests that large language models may support selected tasks in hypertension care; however, their clinical reliability remains uncertain. 
The limited methodological rigor, minimal use of advanced optimization techniques, and narrow scope of evaluated applications preclude conclusions regarding routine clinical use. Further rigorously designed studies are required before broader implementation can be considered.
The effect of medical explanations from large language models on diagnostic accuracy in radiology
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Reasoning Primitives in Hybrid and Non-Hybrid LLMs
🔥 Citations:
0
Abstract: Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, state-based recall. Across tasks, we notice that reasoning augmentation provides the largest overall improvement, substantially extending the range of difficulty over which models remain effective. We also notice that in certain tasks, the hybrid reasoning model remains substantially more robust as sequential dependence increases. In contrast, the transformer reasoning model degrades sharply in performance as task difficulty increases beyond a given threshold. These results suggest that reasoning tokens and architectural inductive biases contribute at different levels of the computational process: explicit reasoning can expand a model's effective operating range, but its benefit depends on how well the underlying architecture supports persistent state propagation. Given the small size of our case study, which involves a limited set of models and tasks, we present these findings as suggestive rather than conclusive and leave broader validation across model families, scales, and task variations to future work.
Measuring Opinion Bias and Sycophancy via LLM-based Coercion
🔥 Citations:
0
Abstract: Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.
TRACE: Topology-aware Reconstruction of Accidents in CARLA for AV Evaluation
🔥 Citations:
0
Abstract: Validating Autonomous Vehicles (AVs) requires exposure to rare, safety-critical scenarios, infrequent in routine driving data. Existing benchmarks address this by generating synthetic conflicts or mapping accident descriptions to abstract road geometries, failing to capture the topological complexity of real-world crashes. We introduce TRACE, a pipeline that automates the reconstruction of NHTSA crash reports into high-fidelity CARLA simulations by (1) retrieving site-specific OpenStreetMap data to preserve exact road topology, (2) leveraging Large Language Models to infer vehicles' initial state from road geometry and pre-crash maneuvers, and (3) generating simulation trajectories from semi-structured report data. Using this pipeline, we curated a benchmark of 52 diverse accident scenarios covering varied collision types, road topologies, and pre-crash maneuvers, providing a challenging open-source resource for testing AV systems against real-world failures.
Ethics Testing: Proactive Identification of Generative AI System Harms
🔥 Citations:
0
Abstract: Generative Artificial Intelligence (GAI) systems that can automatically generate content in the form of source code or other artifacts (e.g., images) have seen increasing popularity due to the emergence of tools such as ChatGPT, which rely on Large Language Models (LLMs). Misuse of the automatically generated content can incur serious consequences due to potential harms in the generated content. Despite the importance of ensuring the quality of automatically generated content, there is little to no approach that can systematically generate tests for identifying software harms in the content generated by these GAI systems. In this article, we introduce the novel concept of ethics testing, which aims to systematically generate tests for identifying software harms. Different from existing testing methodologies (e.g., fairness testing, which aims to identify software discrimination), ethics testing aims to systematically detect software harms that could be induced by unethical behavior (e.g., harmful behavior or behavior that violates intellectual property rights) in automatically generated content. We introduce the concept of ethics testing, discuss the challenges therein, and conduct five case studies to show how ethics testing can be performed for generative AI systems.
Encoder-Free Human Motion Understanding via Structured Motion Descriptions
🔥 Citations:
0
Abstract: The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose Structured Motion Description (SMD), a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach achieves state-of-the-art results on both motion question answering (66.7% on BABEL-QA, 90.1% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.
A Sociotechnical, Practitioner-Centered Approach to Technology Adoption in Cybersecurity Operations: An LLM Case
🔥 Citations:
0
Abstract: Technology for security operations centers (SOCs) has a storied history of slow adoption due to concerns about trust and reliability. These concerns are amplified with artificial intelligence, particularly large language models (LLMs), which exhibit issues such as hallucinations and inconsistent outputs. To assess whether LLM-based tools can improve SOC efficiency, we embedded two PhD researchers within a multinational company's SOC for six months of ethnographic fieldwork. We identified recurring challenges, such as repetitive tasks, fragmented or unclear data, and tooling bottlenecks, and collaborated directly with practitioners to develop LLM companion tools aligned with their operational needs. Iterative refinement reduced workflow disruption and improved interpretability, moving practitioners from skepticism to sustained adoption. Ethnographic analysis indicates that this shift was enabled by our sociotechnical co-creation process, consistent with Nonaka's SECI model. This framework explains the common challenges in traditional SOC technology adoption, including workflow misalignment, rigidity against evolving threats and internal requirements, and stagnation over time. Our findings show that the co-creation approach can overcome these old barriers and create a new paradigm for creating usable technology for cybersecurity operations.
Quotient-Space Diffusion Models
🔥 Citations:
0
Abstract: Diffusion-based generative models have transformed generative AI and enabled new capabilities in the science domain, for example, generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system that identifies objects convertible by a group action as equivalent; the target distribution is thus essentially defined on the quotient space with respect to the group. In this work, we establish a formal framework for diffusion modeling on a general quotient space, and apply it to molecular structure generation, which follows the special Euclidean group $\text{SE}(3)$ symmetry. The framework reduces the necessity of learning the component corresponding to the group action, hence simplifying learning over conventional group-equivariant diffusion models, and its sampler guarantees recovery of the target distribution, whereas heuristic alignment strategies lack proper samplers. These arguments are empirically validated on structure generation for small molecules and proteins, indicating that the principled quotient-space diffusion model provides a new framework that outperforms previous symmetry treatments.
Network Pharmacology and Molecular Docking-Based Investigation of Empagliflozin’s Therapeutic Potential in Chronic Kidney Disease
DOI:
10.3390/life16050719
🔥 Citations:
0
Abstract: Chronic kidney disease (CKD) is a progressive global health challenge. While empagliflozin, a selective SGLT2 inhibitor, is known to attenuate CKD progression through mechanisms beyond glycemic control, the precise molecular pathways remain incompletely characterized and warrant further investigation. This study employed an integrated network pharmacology and molecular docking approach to elucidate the multi-target mechanisms of empagliflozin in CKD. Initial evaluation demonstrated that empagliflozin exhibits favorable physicochemical properties, drug-likeness, and ADMET profiles, supporting its potential as an effective orally administered therapeutic option for CKD management. Network analysis identified 221 shared molecular targets between empagliflozin and CKD-associated genes. Topological analysis of the protein–protein interaction (PPI) network revealed ten critical hub proteins—GAPDH, IL6, EGFR, HSP90AA1, NFKB1, HSP90AB1, MTOR, MAPK3, IL2, and PIK3CA—which serve as key regulators in CKD pathophysiology. Gene Ontology and KEGG pathway enrichment analyses indicated that these shared targets are significantly involved in phosphorylation, signal transduction, and central signaling cascades associated with CKD progression, including the PI3K-Akt, FoxO, HIF-1, and AGE-RAGE pathways. Molecular docking simulations corroborated empagliflozin’s multi-target affinity, demonstrating particularly strong binding energies toward HSP90AB1 (−10.85 kcal/mol), MAPK3 (−9.46 kcal/mol), and EGFR (−9.38 kcal/mol). Empagliflozin maintained stable hydrogen bonding throughout the 200-ns molecular dynamics simulation, primarily with GLN18, GLU42, SER45, ASN46, ASN101, GLY130, and TYR134, underscoring its persistent and well-anchored interaction with HSP90AB1. 
Collectively, these findings provide crucial mechanistic insights, suggesting that empagliflozin might exert its therapeutic effects by modulating interconnected pathways regulating inflammation, oxidative stress, and metabolic homeostasis, thereby reinforcing its role as a comprehensive, multi-target therapeutic strategy for CKD management. Nonetheless, validation through in vitro experiments remains necessary.
Importance of Rehabilitees’ and Rehabilitation Professionals’ Voices in the Development of AI-Based Virtual Assistants
🔥 Citations:
0
Abstract: Hybrid rehabilitation combines remote and face-to-face interventions, enabling more flexible and accessible treatment models. Within this context, AI-based virtual assistants (VAs) have the potential to enhance the accessibility, interactivity, and continuity of rehabilitation services. However, the successful implementation of such solutions requires careful consideration of technical, ethical, and social dimensions, as well as evidence-based knowledge of their acceptability among end users. Despite recent advances, experiential knowledge regarding the opportunities and challenges of integrating VAs into rehabilitation processes remains limited.
For AI-based solutions to be effectively adopted in rehabilitation, developers must understand the rehabilitation process across different system levels. At the outset of developing AI-based virtual assistants, we conducted a literature review to establish a shared conceptual foundation. In addition, interviews were carried out with decision-makers responsible for funding and providing rehabilitation to explore their perspectives on the role of artificial intelligence and virtual assistants in the multiple sclerosis (MS) rehabilitation process.
Within the international HybReDe project, virtual assistants utilising retrieval-augmented generation (RAG) architecture were developed. The usability of large language model (LLM)-based virtual assistants was evaluated through focus group interviews and the System Usability Scale involving people with MS and rehabilitation professionals. Based on these findings, subsequent versions of the virtual assistant were developed using validated information sources and an alternative solution architecture incorporating instruction fine-tuning.
The forthcoming pilot study will be guided by the Technology Acceptance Model (TAM). Building on the ten acceptability-related areas identified by Haydon et al. (2023), the study will focus on affective attitude, perceived impact and benefit, perceived workload, and perceived ability. The aim is to examine the acceptability of virtual assistants among people with MS and MS rehabilitation professionals, as well as the feasibility of deploying such solutions in everyday rehabilitation practice.
Keywords: Artificial intelligence; virtual assistant; rehabilitation; acceptability; multiple sclerosis
CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Evaluation of Automatic Speech Recognition Using Generative Large Language Models
🔥 Citations:
0
Abstract: Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
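The meaning-insensitivity of WER that motivates this paper is easy to see from its definition: WER is the word-level Levenshtein distance between reference and hypothesis divided by the reference length, so a harmless substitution and a meaning-flipping one cost exactly the same. A minimal sketch (illustrative only, not the paper's evaluation code; the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One word wrong in each case -> identical WER, very different meaning.
print(wer("turn the heating up", "turn the heater up"))   # 0.25 (harmless)
print(wer("turn the heating up", "turn the heating off")) # 0.25 (meaning flipped)
```

This is exactly the failure mode the semantic and LLM-based metrics above are designed to catch.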
Shared Lexical Task Representations Explain Behavioral Variability In LLMs
🔥 Citations:
0
Abstract: One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
🔥 Citations:
0
Abstract: The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large language model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information-Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent, respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
🔥 Citations:
0
Abstract: The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.
Glioma-intrinsic SLC1A3 hijacks the vascular niche to establish an immunosuppressive microenvironment
🔥 Citations:
0
Abstract: Glioblastoma (GBM) is a highly lethal malignancy driven by glioma-initiating cells (GICs). While GICs are known to profoundly remodel the tumor microenvironment (TME) to promote progression and immune evasion within the vascular niche, the specific transcriptomic reprogramming and alternative splicing events driving their evolution from neural stem cells (NSCs), and how these intrinsic cellular state changes dictate multi-cellular immunosuppressive networks and checkpoints, remain poorly understood. Unraveling these complex tumor-vascular-immune interactions is critical for identifying novel vulnerabilities and developing effective immunotherapies.
To decode the GICs' evolutionary trajectory, we integrated RNA-seq and alternative splicing analysis of NSCs and patient-derived GIC cohorts. The malignant progression was mapped using scRNA-seq pseudotime analysis, and key targets were validated across clinical TCGA cohorts. Furthermore, we employed the large-scale single-cell foundation model, Geneformer, to perform in silico genetic perturbations, integrating it with interactome inference to decipher TME communication. Finally, the proposed tumor-endothelial-T cell multi-cellular axis was functionally validated utilizing in vitro tumor-HUVEC co-culture systems, qPCR, and FACS-based T cell activation (NFAT-Jurkat) assays.
Our multi-omics re-analysis identified extensive alternative splicing and transcriptional reprogramming during GIC evolution, pinpointing SLC1A3 as a core gene significantly upregulated along the malignant pseudotime trajectory and strongly correlated with poor clinical prognosis in GBM. AI-driven in silico virtual knockout utilizing Geneformer revealed that SLC1A3 acts as a master regulator of tumor network stability. Interactome analysis demonstrated that SLC1A3^hi tumor cells exhibit intensive communication with endothelial cells via specific ligand-receptor axes (e.g., TNC-ITGB1, PTN-SDC3). In vitro assays confirmed that endothelial cells were educated by SLC1A3^hi tumor cells that undergo malignant transition, drastically upregulating immune-suppressive factors, including CD274, TGFB1, IL10, and IDO1. Crucially, tumor-specific knockdown of SLC1A3 dismantled this vascular-immune suppressive niche, significantly restoring T cell activation in a multicellular co-culture model.
Our findings establish SLC1A3 not merely as an intrinsic driver of glioma development, but as a critical upstream node orchestrating a cascading tumor-endothelial-T cell immunosuppressive axis. By leveraging AI-based foundation models alongside robust biological validation, we uncovered a novel mechanism of vascular-mediated immune evasion, highlighting SLC1A3 as a highly promising therapeutic target to reprogram the glioblastoma microenvironment and restore anti-tumor immunity.
Do LLM Decoders Listen Fairly? Benchmarking How Language Model Priors Shape Bias in Speech Recognition
🔥 Citations:
0
Abstract: As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
🔥 Citations:
0
Abstract: Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.
EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
🔥 Citations:
0
Abstract: Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.
Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models
🔥 Citations:
0
Abstract: This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability. In rigorously solvable games, it employs mathematical reasoning to compute optimal strategies and generates human-readable explanations for its decisions. For heuristic-based games, it synthesizes strategies by combining insights from classical minimax algorithms (see, e.g., Shannon, 1950) with crowd-sourced data. Finally, in learning-based games, it utilizes reinforcement learning with human feedback and self-critique to iteratively refine strategies through trial-and-error and imitation learning. Nemobot amplifies this framework by offering a programmable environment where users can experiment with tool-augmented generation and fine-tuning of strategic game agents. From strategic games to role-playing games, Nemobot demonstrates how AI agents can achieve a form of self-programming by integrating crowdsourced learning and human creativity to iteratively refine their own logic. This represents a step toward the long-term goal of self-programming AI.
Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching
🔥 Citations:
0
Abstract: Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
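The retrieve-then-encode split described above can be illustrated generically: rank EHR segments against an eligibility criterion and keep only the top matches before any expensive encoding. The sketch below uses bag-of-words cosine similarity as a stand-in for the paper's retriever (which the abstract does not specify), and the example records and criterion are invented:

```python
import numpy as np

def bow_vectors(texts, vocab):
    """Bag-of-words count vectors over a fixed vocabulary."""
    v = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            if w in vocab:
                v[i, vocab[w]] += 1.0
    return v

def top_k_segments(criterion, segments, k=2):
    """Rank EHR segments by cosine similarity to an eligibility criterion."""
    words = sorted({w for t in [criterion] + segments for w in t.lower().split()})
    vocab = {w: i for i, w in enumerate(words)}
    q = bow_vectors([criterion], vocab)[0]
    s = bow_vectors(segments, vocab)
    sims = s @ q / (np.linalg.norm(s, axis=1) * np.linalg.norm(q) + 1e-9)
    return [segments[i] for i in np.argsort(-sims)[:k]]

# Hypothetical EHR segments; only the relevant ones should survive retrieval.
ehr = [
    "patient denies chest pain",
    "hba1c 8.2 percent consistent with type 2 diabetes",
    "history of type 2 diabetes on metformin",
    "knee arthroscopy in 2015",
]
print(top_k_segments("inclusion: type 2 diabetes", ehr, k=2))
```

In the paper's pipeline the selected segments would then be encoded by a (frozen or fine-tuned) LLM and fed to a lightweight classifier; the point of the sketch is only the input-reduction step.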
GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA
🔥 Citations:
0
Abstract: Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50–95% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates. We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer's features, while simultaneously using the current layer's features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.
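The $O(N^2)$ bottleneck and the one-layer-lookahead idea can be sketched in a few lines of NumPy. This is an illustrative toy (random features and a simplified max-relative-style convolution), not the paper's accelerator code:

```python
import numpy as np

def knn_graph(x: np.ndarray, k: int) -> np.ndarray:
    """Build a kNN graph over N patch features: O(N^2) pairwise distances."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # (N, N) squared dists
    np.fill_diagonal(d2, np.inf)                          # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]                  # (N, k) neighbor ids

def max_relative_conv(x: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Toy message passing: concat each feature with its max neighbor difference."""
    diff = x[idx] - x[:, None, :]                         # (N, k, C)
    return np.concatenate([x, diff.max(axis=1)], axis=-1) # (N, 2C)

# Baseline ViG: each layer builds its graph from the *current* features, so
# graph construction and message passing are strictly sequential.
# GraphLeap-style lookahead: update layer L with the graph built from layer
# L-1's features, while the graph for layer L+1 is built from layer L's
# features -- the two steps no longer depend on each other and can overlap.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4))
g_prev = knn_graph(feats, k=3)          # "built during the previous layer"
out = max_relative_conv(feats, g_prev)  # feature update using the old graph
g_next = knn_graph(feats, k=3)          # built concurrently for the next layer
print(out.shape)  # (8, 8)
```

On hardware the two calls map to the paper's separate kNN and feature-update engines running in a pipeline; the toy only shows why dropping the dependency makes that overlap legal.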
Unlocking the Power of Large Language Models for Multi-table Entity Matching
🔥 Citations:
0
Abstract: Multi-table entity matching (MEM) addresses the limitations of dual-table approaches by enabling simultaneous identification of equivalent entities across multiple data sources without unique identifiers. However, existing methods relying on pre-trained language models struggle to handle semantic inconsistencies caused by numerical attribute variations. Inspired by the powerful language understanding capabilities of large language models (LLMs), we propose a novel LLM-based framework for multi-table entity matching, termed LLM4MEM. Specifically, we first propose a multi-style prompt-enhanced LLM attribute coordination module to address semantic inconsistencies. Then, to alleviate the matching efficiency problem caused by the surge in the number of entities brought by multiple data sources, we develop a transitive consensus embedding matching module to tackle entity embedding and pre-matching issues. Finally, to address the issue of noisy entities during the matching process, we introduce a density-aware pruning module to optimize the quality of multi-table entity matching. We conducted extensive experiments on 6 MEM datasets, and the results show that our model improves by an average of 5.1% in F1 compared with the baseline model. Our code is available at https://github.com/Ymeki/LLM4MEM.
Exploring the application of large language models in coding the experiencing scale (EXP)
🔥 Citations:
0
Abstract: Psychotherapy process measures like the Experiencing Scale (EXP) offer valuable insight into clinical interactions but are time-intensive to code. Large language models (LLMs) like ChatGPT have the potential to streamline this process, but empirical validation is nascent. This exploratory study aimed to provide a proof-of-concept coding the EXP using ChatGPT with special attention to ethical considerations, limitations, and future directions. ChatGPT was used to code 79 psychotherapy transcripts drawn from the EXP manual. Multiple models of ChatGPT were tested using varied few-shot learning prompt engineering protocols. Data collection occurred in three phases, during which models rated both modal and peak EXP scores for all transcripts. ChatGPT demonstrated moderate agreement with manual reference ratings. An efficient configuration (o3-mini, 5-shot prompting) yielded moderate reliability for both modal EXP scores (ICC[3,1] = .67, 95% CI [.53, .79]) and peak EXP scores (ICC[3,1] = .71, 95% CI [.58, .81]). LLMs may feasibly augment or replace human EXP coders under certain conditions. However, evidence is preliminary and ethical and technical limitations remain. Future research should validate the present methodology using out-of-manual data, assess potential pretraining exposure, and explore locally hosted LLM applications to mitigate privacy concerns.
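The reported ICC(3,1) is the Shrout–Fleiss two-way mixed-effects, consistency, single-rater coefficient. A minimal numpy implementation, with made-up ratings for illustration:

```python
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater
    (Shrout & Fleiss). `ratings` is an (n_targets, k_raters) array."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_total = ((Y - grand) ** 2).sum()
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()   # between targets
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                               # mean square, rows
    mse = ss_err / ((n - 1) * (k - 1))                    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse)

# Perfectly consistent raters (second rater offset by a constant) give ICC = 1,
# because consistency ICC ignores systematic rater offsets.
perfect = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
```

This is the same statistic conventionally written ICC[3,1] in the abstract.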
Low-Rank Adaptation Redux for Large Models
🔥 Citations:
0
Abstract: Low-rank adaptation (LoRA) has emerged as the de facto standard for parameter-efficient fine-tuning (PEFT) of foundation models, enabling the adaptation of billion-parameter networks with minimal computational and memory overhead. Despite its empirical success and rapid proliferation of variants, it remains elusive which architectural choices, optimization techniques, and deployment constraints should guide practical method selection. This overview revisits LoRA through the lens of signal processing (SP), bridging modern adapter designs with classical low-rank modeling tools and inverse problems, as well as highlighting how SP principles can inform principled advances of fine-tuning approaches. Rather than providing a comprehensive enumeration and empirical comparisons of LoRA variants, emphasis is placed on the technical mechanisms underpinning these approaches to justify their effectiveness. These advances are categorized into three complementary axes: architectural design, efficient optimization, and pertinent applications. The first axis builds on singular value decomposition (SVD)-based factorization, rank-augmentation constructions, and cross-layer tensorization, while the second axis deals with initialization, alternating solvers, gauge-invariant optimization, and parameterization-aware methods. Beyond fine-tuning, emerging applications of LoRA are accounted across the entire lifecycle of large models, ranging from pre- and post-training to serving/deployment. Finally, open research directions are outlined at the confluence of SP and deep learning to catalyze a bidirectional frontier: classical SP tools provide a principled vocabulary for designing principled PEFT methods, while the unique challenges facing modern deep learning, especially the overwhelming scale and prohibitive overhead, also offer new research lines benefiting the SP community in return.
Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
🔥 Citations:
0
Abstract: Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
🔥 Citations:
0
Abstract: Software requirement ambiguity is ubiquitous in real-world development, stemming from the inherent imprecision of natural language and the varying interpretations of stakeholders. While Large Language Models (LLMs) have demonstrated impressive capabilities in generating code from precise specifications, such ambiguity poses a significant obstacle to reliable automated code generation. Existing benchmarks typically assume clear and unambiguous requirements, leaving an empirical gap in understanding how LLMs behave when faced with the inherent uncertainty of real-world software requirements. In this paper, we introduce Orchid, the first code generation benchmark specifically designed with ambiguous requirements. It comprises 1,304 function-level tasks covering four distinct types of ambiguity: lexical, syntactic, semantic, and vagueness. Leveraging this dataset, we conduct the first systematic empirical study to evaluate the impact of requirement ambiguity on LLM-based code generation. Our results demonstrate that ambiguity consistently degrades the performance of all evaluated LLMs, with the most pronounced negative effects observed in highly advanced models. Furthermore, we observe that LLMs frequently produce functionally divergent implementations for the same ambiguous requirement and lack the capability to identify or resolve such ambiguity autonomously. These findings reveal a significant performance gap between clear and ambiguous requirements, underscoring the urgent need for ambiguity-aware techniques in the next generation of automated software engineering tools. The Orchid benchmark is publicly available at https://huggingface.co/datasets/SII-YDD/Orchid.
Call-Chain-Aware LLM-Based Test Generation for Java Projects
🔥 Citations:
0
Abstract: Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller–callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.
Optimizing Automated Essay Scoring with Lightweight Large Language Models and Validated Rubrics
🔥 Citations:
0
Abstract: Manual grading of English as a Foreign Language (EFL) essays often leads to inconsistent scores among educators, despite the use of rubrics. While traditional Automated Essay Scoring (AES) systems offer speed, they often fail due to high computational cost, reliance on extensive datasets, and an inability to capture holistic writing qualities such as creativity and humanistic expression. This study addresses these issues by introducing AESCORE, a novel, lightweight, and cost-effective AES framework. Our methodology centers on integrating validated rubric criteria (identified via VOSviewer analysis) with open-source Large Language Models (LLMs), specifically emphasizing a human-centered approach. We evaluated AESCORE across 100 EFL essays using several prompting techniques, including few-shot and multi-trait specialization. The system achieved its most robust performance and high scoring consistency (Quadratic Weighted Kappa, QWK = 0.6660) using the DeepSeek-R1 8B LLM with few-shot prompting. AESCORE represents a significant contribution by demonstrating that sophisticated, pedagogically-aligned writing assessment and generative feedback can be achieved with accessible AI, offering a reliable alternative for improving productive writing skills in higher education.
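The QWK statistic used to report scoring consistency has a standard closed form and can be computed directly. The toy score vectors below are made up for illustration:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer score vectors
    with values in [0, n_classes)."""
    a, b = np.asarray(a), np.asarray(b)
    O = np.zeros((n_classes, n_classes))      # observed agreement matrix
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix from the outer product of the marginal histograms.
    E = np.outer(np.bincount(a, minlength=n_classes),
                 np.bincount(b, minlength=n_classes)) / len(a)
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1; chance-level agreement yields 0; values like the paper's 0.6660 fall in the conventional "substantial agreement" range.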
TabSHAP
🔥 Citations:
0
Abstract: Large Language Models (LLMs) fine-tuned on serialized tabular data are emerging as powerful alternatives to traditional tree-based models, particularly for heterogeneous or context-rich datasets. However, their deployment in high-stakes domains is hindered by a lack of faithful interpretability; existing methods often rely on global linear proxies or scalar probability shifts that fail to capture the model's full probabilistic uncertainty. In this work, we introduce TabSHAP, a model-agnostic interpretability framework designed to directly attribute local query decision logic in LLM-based tabular classifiers. By adapting a Shapley-style sampled-coalition estimator with Jensen-Shannon divergence between full-input and masked-input class distributions, TabSHAP quantifies the distributional impact of each feature rather than simple prediction flips. To align with tabular semantics, we mask at the level of serialized key:value fields (atomic in the prompt string), not individual subword tokens. Experimental validation on the Adult Income and Heart Disease benchmarks demonstrates that TabSHAP isolates critical diagnostic features, achieving significantly higher faithfulness than random baselines and XGBoost proxies. We further run a distance-metric ablation on the same test instances and TabSHAP settings: attributions are recomputed with KL or L1 replacing JSD in the similarity step (results cached per metric), and we compare deletion faithfulness across all three.
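The distribution-level attribution idea can be sketched end to end. This is a toy: `toy_predict`, the field names, and the coalition sampler are invented for illustration; only the mechanism of measuring Jensen–Shannon divergence between full-input and masked-input class distributions follows the abstract.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * np.log(a / b)).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jsd_attributions(predict, fields, n_samples=200, seed=0):
    """Sampled-coalition, Shapley-style attribution: a field's score is the
    mean JSD shift caused by masking it on top of a random coalition of
    other fields. `predict` maps a dict of key:value fields to a class
    distribution; masked fields are simply dropped from the dict."""
    rng = np.random.default_rng(seed)
    keys = list(fields)
    scores = {k: 0.0 for k in keys}
    for _ in range(n_samples):
        for k in keys:
            others = [o for o in keys if o != k]
            coalition = [o for o in others if rng.random() < 0.5]
            with_k = {f: fields[f] for f in coalition + [k]}
            without_k = {f: fields[f] for f in coalition}
            scores[k] += js_divergence(predict(with_k), predict(without_k))
    return {k: v / n_samples for k, v in scores.items()}

# Toy "classifier": only the `age` field moves the class distribution.
def toy_predict(fields):
    p_pos = 0.9 if fields.get("age", 0) > 50 else 0.2
    return np.array([1 - p_pos, p_pos])

attr = jsd_attributions(toy_predict, {"age": 62, "zip": 10001, "name": 3})
```

In the paper's setting, `predict` would be an LLM scoring a serialized row, and masking would drop whole key:value fields from the prompt string rather than subword tokens.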
Graph Neural Network-Based Spatio-Temporal Feature Modeling and Wave Height Reconstruction for Distributed Pressure Sensor Wave Measurement Signals
DOI:
10.3390/app16094073
🔥 Citations:
0
Abstract: Accurate measurement of ocean wave parameters is paramount for offshore engineering design and marine environmental monitoring. Distributed pressure sensing technology provides a robust data foundation for analyzing the spatio-temporal characteristics of wave fields through synchronized observations at multiple stations. However, multi-sensor data exhibit high-dimensional spatio-temporal coupling, posing significant challenges for traditional single-point signal processing methods in capturing the topological associations between measurement sites. To address these limitations, this study develops a framework for spatio-temporal feature modeling and wave height reconstruction based on Graph Neural Networks (GNNs). The proposed framework integrates the spatial configuration of sensor arrays with graph-theoretic topological representations. By fusing geometric distances and signal correlations, an adaptive adjacency matrix is constructed to establish a dynamically adjustable graph structure. On the feature extraction level, a spatio-temporal fusion method combining multi-scale graph convolutions and gated temporal modeling is proposed. The experimental results obtained on the Blancs Sablons Bay multi-sensor dataset demonstrate that the proposed method significantly outperforms traditional approaches, achieving lower prediction errors and validating the effectiveness of graph-structured modeling in distributed wave sensing.
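The abstract's fusion of geometric distances and signal correlations into an adaptive adjacency matrix can be sketched as a convex combination of two kernels. The Gaussian distance kernel and the `beta`/`sigma` parameters are illustrative choices, not the paper's construction:

```python
import numpy as np

def adaptive_adjacency(coords, signals, beta=0.5, sigma=1.0):
    """Fuse geometric distance and signal correlation into one adjacency:
    A = beta * exp(-d^2 / sigma^2) + (1 - beta) * |corrcoef|, zero diagonal.
    `coords` is (n_sensors, 2); `signals` is (n_sensors, n_timesteps)."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    geo = np.exp(-d2 / sigma ** 2)            # nearby sensors couple strongly
    corr = np.abs(np.corrcoef(signals))       # co-varying signals couple too
    A = beta * geo + (1 - beta) * corr
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(1)
coords = rng.uniform(0, 5, size=(6, 2))       # hypothetical sensor positions
signals = rng.standard_normal((6, 200))       # hypothetical pressure series
A = adaptive_adjacency(coords, signals)
```

A matrix of this form would then feed the graph convolutions; making `beta` learnable is one way such an adjacency becomes "dynamically adjustable."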
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
🔥 Citations:
0
Abstract: Video Large Language Models (Video LLMs) incur high inference latency due to a large number of visual tokens provided to LLMs. To address this, training-free visual token pruning has emerged as a solution to reduce computational costs; however, existing methods are primarily validated on Multiple-Choice Question Answering (MCQA) benchmarks, where coarse-grained cues often suffice. In this work, we reveal that these methods suffer a sharp performance collapse on fine-grained understanding tasks requiring precise visual grounding, such as hallucination evaluation. To explore this gap, we conduct a systematic analysis and identify sink tokens (semantically uninformative tokens that attract excessive attention) as a key obstacle to fine-grained video understanding. When these sink tokens survive pruning, they distort the model's visual evidence and hinder fine-grained understanding. Motivated by these insights, we propose Sink-Token-aware Pruning (SToP), a simple yet effective plug-and-play method that introduces a sink score to quantify each token's tendency to behave as a sink and applies this score to existing spatial and temporal pruning methods to suppress sink tokens, thereby enhancing video understanding. To validate the effectiveness of SToP, we apply it to state-of-the-art pruning methods (VisionZip, FastVid, and Holitom) and evaluate it across diverse benchmarks covering hallucination, open-ended generation, compositional reasoning, and MCQA. Our results demonstrate that SToP significantly boosts performance, even when pruning up to 90% of visual tokens.
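The abstract does not give SToP's actual sink-score formula, but the idea admits a minimal illustrative proxy: score each token by the attention mass it receives relative to the norm of its feature vector, so tokens that attract attention while carrying little content score high and get demoted before pruning. Everything below (`sink_scores`, `sink_aware_keep`, the toy attention matrix) is a hypothetical sketch, not the paper's method.

```python
import numpy as np

def sink_scores(attn, feats, eps=1e-8):
    """Illustrative sink score: attention received by each token divided by
    its feature norm. `attn` is (n_tokens, n_tokens) row-stochastic;
    `feats` is (n_tokens, dim)."""
    received = attn.sum(axis=0)                 # column sums: attention received
    informativeness = np.linalg.norm(feats, axis=1)
    return received / (informativeness + eps)

def sink_aware_keep(attn, feats, keep_ratio=0.5):
    """Rank tokens by received attention, but demote likely sinks first."""
    s = sink_scores(attn, feats)
    saliency = attn.sum(axis=0) - s             # suppress sink-like tokens
    k = max(1, int(len(saliency) * keep_ratio))
    return np.sort(np.argsort(saliency)[::-1][:k])

# Token 0 is a sink: it receives most attention but has a near-zero feature.
attn = np.array([[0.7, 0.1, 0.1, 0.1]] * 4)
feats = np.array([[0.01, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
kept = sink_aware_keep(attn, feats, keep_ratio=0.5)
```

A plain attention-based pruner would keep token 0 (it receives the most attention); the sink-aware ranking drops it, which is the failure mode SToP targets.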
MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
🔥 Citations:
0
Abstract: Aligning large language models (LLMs) to desirable human values requires balancing multiple, potentially conflicting objectives such as helpfulness, truthfulness, and harmlessness, which presents a multi-objective optimisation challenge. Most alignment pipelines rely on a fixed scalarisation of these objectives, which can introduce procedural unfairness by systematically under-weighting harder-to-optimise or minority objectives. To promote more equitable trade-offs, we introduce MGDA-Decoupled, a geometry-based multi-objective optimisation algorithm that finds a shared descent direction while explicitly accounting for each objective's convergence dynamics. In contrast to prior methods that depend on reinforcement learning (e.g., GAPO) or explicit reward models (e.g., MODPO), our approach operates entirely within the lightweight Direct Preference Optimisation (DPO) paradigm. Experiments on the UltraFeedback dataset show that geometry-aware methods, and MGDA-Decoupled in particular, achieve the highest win rates against golden responses, both overall and per objective.
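The shared descent direction at the heart of MGDA-style methods has a closed form in the two-objective case (the Sener–Koltun min-norm point). This sketches vanilla MGDA, not the paper's decoupled variant:

```python
import numpy as np

def mgda_direction(g1, g2):
    """Min-norm common descent direction for two objective gradients:
    find alpha in [0, 1] minimising ||alpha*g1 + (1-alpha)*g2||, then
    return (alpha, combined direction). The result has a non-negative
    inner product with both gradients."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0:                     # identical gradients: any mix works
        alpha = 0.5
    else:
        alpha = np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0)
    return alpha, alpha * g1 + (1 - alpha) * g2

g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
alpha, d = mgda_direction(g1, g2)      # orthogonal objectives -> equal mix
```

Stepping along `d` decreases (to first order) both objectives at once, which is the geometric alternative to a fixed scalarisation that the abstract argues for.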
The Effect of Idea Elaboration on the Automatic Assessment of Idea Originality
🔥 Citations:
0
Abstract: Automatic systems are increasingly used to assess the originality of responses in creative tasks. They offer a potential solution to key limitations of human assessment (cost, fatigue, and subjectivity), but there is preliminary evidence of a self-preference bias. Accordingly, automatic systems tend to prefer outcomes that are more closely related to their style, rather than to the human one. In this paper, we investigated how Large Language Models (LLMs) align with human raters in assessing the originality of responses in a divergent thinking task. We analysed 4,813 responses to the Alternate Uses Task produced by higher and lower creative humans and ChatGPT-4o. Human raters were two university students who underwent intensive training. Machine raters were two specialised systems fine-tuned on AUT responses and corresponding human ratings (OCSAI and CLAUS) and ChatGPT-4o, which was prompted with the same instructions as human raters. Results confirmed the presence of a self-preference bias in LLMs. Automatic systems tended to privilege artificial responses. However, this self-preference bias disappeared when the analyses controlled for the idea elaboration. We discuss theoretical and methodological implications of these findings by highlighting future directions for research on creativity assessment.
Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
From Non-Maleficence to Beneficence: Expanded Ethical Computing in the Era of Large Language Models
DOI:
10.3390/soc16050134
🔥 Citations:
0
Abstract: As modern society grows increasingly complex, access to essential services such as healthcare, legal aid, tailored education, and psychological support remains heavily gated by socio-economic, neurological, and systemic barriers. This paper explores the transformative potential of Large Language Models (LLMs) and Generative Artificial Intelligence not merely as industrial productivity enhancers, but as vital “social scaffolds” capable of fostering a more inclusive society. Crucially, we propose a paradigm shift in the concept of Ethical Computing—moving from a passive defensive framework of non-maleficence (“do no harm”) to an active mandate of beneficence, where AI systems are explicitly developed to serve marginalized and un(der)served populations. Through this expanded ethical lens, we systematically analyze the democratizing impact of AI across four primary axes of inclusivity: socio-economic (providing zero-cost medical triage and legal translation for undocumented populations), neurospicy (acting as a non-judgmental communicative bridge for individuals with Autism Spectrum Disorder), pedagogical (delivering hyper-personalized executive function support for Special Educational Needs), and psychological (serving as an accessible, first-level triage system for mental health crises). By framing LLMs as a modern social safety net, we outline a clear trajectory for future research, advocating for an “ethical-by-design” development paradigm that explicitly prioritizes equity, accessibility, and the active dismantling of historical barriers for the digitally and socially disenfranchised.
LC–MS Profiling, In Silico Docking–MD and ADMET of Uncaria gambir Roxb. for p38 MAPK Inhibition
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling
🔥 Citations:
0
Abstract: Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized via GRPO. Experiments on BFCL/ACEBench show R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) with positive Average CoT Effectiveness (0.05 for Llama3.2-3B), enhancing both function-calling accuracy and interpretability for reliable tool-augmented LLM deployment.
Large language model-based paper classification framework with key-insight extraction and confidence-weighted voting
🔥 Citations:
0
Abstract: Systematic reviews (SRs) are critical for evidence-based research but are time-consuming and labor-intensive. The rapid expansion of academic publications further challenges the performance and applicability of existing screening and classification methods. While large language models (LLMs) present new opportunities for automation, limited research has examined whether they can achieve classification performance comparable to human reviewers in large-scale, multi-class settings. To improve classification performance, we propose an LLM-based framework that leverages full-text key-insight extraction to enhance literature classification. We constructed a manually curated dataset of 900 articles from 17 published SRs to quantitatively evaluate the classification capabilities of LLMs. Empirical results showed that key-insight-based classification (KBC) significantly outperforms abstract-based classification (ABC). We further implemented a confidence-weighted voting (CWV) mechanism over multiple LLMs to improve robustness; CWV achieved the highest macro F1-score of 0.796, substantially exceeding KBC (0.732), ABC (0.676), and unsupervised K-means clustering (0.446). By employing zero-shot LLMs, the approach adapts across diverse domains and classification tasks without fine-tuning, demonstrating that a carefully designed pipeline can enable LLMs to achieve classification performance comparable to human reviewers. These results provide empirical evidence of LLMs' potential to support large-scale SRs and introduce a practical pathway for improving efficiency and reliability in evidence synthesis.
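The confidence-weighted voting step can be illustrated in a few lines. The aggregation rule shown here, summing per-label confidences across models, is one plausible reading of CWV rather than the paper's exact formula, and the labels and confidences are made up:

```python
def confidence_weighted_vote(predictions):
    """Aggregate (label, confidence) pairs from several LLMs: each model's
    confidence is added to its predicted label's tally, and the label with
    the largest tally wins."""
    tally = {}
    for label, conf in predictions:
        tally[label] = tally.get(label, 0.0) + conf
    return max(tally, key=tally.get)

# Two lower-confidence votes for "include" outweigh one confident "exclude".
winner = confidence_weighted_vote(
    [("include", 0.6), ("include", 0.55), ("exclude", 0.9)]
)
```

With unweighted majority voting the same three votes would also pick "include", but the weighted tally changes the outcome whenever confidences are skewed the other way.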
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy
🔥 Citations:
0
Abstract: Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.
Concordia: Spatial Domain Detection via Augmented Graphs for Population-Level Spatial Proteomics
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
F²LP-AP: Fast & Flexible Label Propagation with Adaptive Propagation Kernel
🔥 Citations:
0
Abstract: Semi-supervised node classification is a foundational task in graph machine learning, yet state-of-the-art Graph Neural Networks (GNNs) are hindered by significant computational overhead and reliance on strong homophily assumptions. Traditional GNNs require expensive iterative training and multi-layer message passing, while existing training-free methods, such as Label Propagation, lack adaptability to heterophilous graph structures. This paper presents F²LP-AP (Fast and Flexible Label Propagation with Adaptive Propagation Kernel), a training-free, computationally efficient framework that adapts to local graph topology. Our method constructs robust class prototypes via the geometric median and dynamically adjusts propagation parameters based on the Local Clustering Coefficient (LCC), enabling effective modeling of both homophilous and heterophilous graphs without gradient-based training. Extensive experiments across diverse benchmark datasets demonstrate that F²LP-AP achieves competitive or superior accuracy compared to trained GNNs, while significantly outperforming existing baselines in computational efficiency.
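Two ingredients the abstract names, geometric-median class prototypes and the local clustering coefficient, are easy to sketch in numpy. These are illustrative implementations of the standard definitions; the propagation-kernel wiring itself is omitted:

```python
import numpy as np

def geometric_median(X, n_iter=100, eps=1e-9):
    """Weiszfeld iteration for the geometric median of the rows of X,
    a robust alternative to the mean for class prototypes."""
    m = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(X - m, axis=1)
        w = 1.0 / np.maximum(d, eps)          # inverse-distance weights
        m = (w[:, None] * X).sum(axis=0) / w.sum()
    return m

def local_clustering(A):
    """Local clustering coefficient per node from a binary adjacency matrix:
    triangles through i divided by the number of neighbour pairs of i."""
    deg = A.sum(axis=1)
    tri = np.diag(A @ A @ A) / 2.0            # closed length-3 walks / 2
    pairs = deg * (deg - 1) / 2.0
    return np.where(pairs > 0, tri / np.maximum(pairs, 1), 0.0)
```

The geometric median resists label-noise outliers that would drag a mean prototype away, and the LCC gives a per-node homophily proxy a propagation rule can condition on.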
Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction
🔥 Citations:
0
Abstract: At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
🔥 Citations:
0
Abstract: Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches, Universal Steering (US) and Representation Engineering (RepE), we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
🔥 Citations:
0
Abstract: Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard samples and filter out noisy false negatives. 2) Retrieval-aware Attribute Reinforcement (RAR), where the improved retrieval performance of the representation model post-attribute integration serves as a reward signal to enhance MLLM's attribute generation during multimodal fine-tuning. Extensive experiments on large-scale E-commerce datasets demonstrate that our method achieves state-of-the-art performance on multiple downstream retrieval tasks, validating the effectiveness of harnessing generative models to advance fine-grained representation learning.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
🔥 Citations:
0
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
Supplement Generation Training for Enhancing Agentic Task Performance
🔥 Citations:
0
Abstract: Training large foundation models for agentic tasks is increasingly impractical due to the high computational costs, long iteration cycles, and rapid obsolescence as new models are continuously released. Instead of post-training massive models for every new task or domain, we propose Supplement Generation Training (SGT), a more efficient and sustainable strategy. SGT trains a smaller LLM to generate useful supplemental text that, when appended to the original input, helps the larger LLM solve the task more effectively. These lightweight models can dynamically adapt supplements to task requirements, improving performance without modifying the underlying large models. This approach decouples task-specific optimization from large foundation models and enables more flexible, cost-effective deployment of LLM-powered agents in real-world applications.
ChatGPTeachers
🔥 引用:
0
Abstract: The emergence of AI, particularly Generative AI (GenAI), has introduced powerful new actors into learning networks, creating an urgent need for research on how these technologies may be mediating connections between teachers, learners, and other digital technologies and institutional infrastructures. While GenAI extends connectivity across these networks, it also regenerates and reconfigures the conditions of teaching, learning, and professional practices. As a result, questions of professional ethics, teacher identity and agency are becoming central to understanding how teachers’ work is adapting and changing in the context of AI-mediated learning environments. This paper reports on a six-month study in which 20 K-12 teachers in a Canadian urban school division were provided access to ChatGPT Teams. Teachers participated in phenomenologically oriented interviews at the beginning and end of this period. From these accounts, Lived Experience Descriptions (LEDs) were developed to capture moments of tension, hesitation, curiosity and ethical reflection. Through these teacher-ChatGPT anecdotes, we examine how GenAI provokes technoethical tensions in teachers’ everyday work and learning lives, and, in the process, reshapes their planning, communication, and professional identity. Our analysis reveals that teachers encountered manifold technoethical dilemmas when interacting with ChatGPT. Their concerns extended beyond the appropriate use of the LLM to existential questions about their integrity and identity as teachers when using ChatGPT. Teachers experienced ambivalence when using ChatGPT to assist with professional communication or when imagining what it would be like to incorporate it into their pedagogy. Such anecdotes reveal the importance of professional learning that supports ethical discernment in networked learning contexts. 
Here, we suggest that the Technoethical Framework for Teachers (TEFT) can strengthen teachers’ ethical attunement and agency as they engage with GenAI in their practice. The paper contributes to networked learning scholarship by situating teachers’ encounters with GenAI within relational and ethical ecologies rather than adoption or policy frameworks. It offers implications for NL researchers, teachers reflecting on the ethical implications of their Technological-Pedagogical-Content Knowledge (TPACK) in light of GenAI, and school leaders developing professional learning and policy around the critically informed and responsible use of Large Language Models (LLMs) in schools. Prior studies have described how teachers engage with GenAI; this paper shows how those engagements are ethically charged and ontologically mediated. By framing teachers’ co-constitutive relationships with ChatGPT as technoethical relations of attunement, we offer both a theoretical vocabulary and an empirical method for studying GenAI in education through lived experience and a new conceptual lens.
Spectral Embeddings Leak Graph Topology: Theory, Benchmark, and Adaptive Reconstruction
🔥 引用:
0
Abstract: Graph Neural Networks (GNNs) excel on relational data, but standard benchmarks unrealistically assume the graph is centrally available. In practice, settings such as Federated Graph Learning, distributed systems, and privacy-sensitive applications involve graph data that are localized, fragmented, noisy, and privacy-leaking. We present a unified framework for this setting. We introduce LoGraB (Local Graph Benchmark), which decomposes standard datasets into fragmented benchmarks using three strategies and four controls: neighborhood radius $d$, spectral quality $k$, noise level $\sigma$, and coverage ratio $p$. LoGraB supports graph reconstruction, localized node classification, and inter-fragment link prediction, with Island Cohesion. We propose AFR (Adaptive Fidelity-driven Reconstruction), a method for noisy spectral fragments. AFR scores patch quality via a fidelity measure combining a gap-to-truncation stability ratio and structural entropy, then assembles fragments using RANSAC-Procrustes alignment, adaptive stitching, and Bundle Adjustment. Rather than forcing a single global graph, AFR recovers large faithful islands. We prove heat-kernel edge recovery under a separation condition, Davis--Kahan perturbation stability, and bounded alignment error. We establish a Spectral Leakage Proposition: under a spectral-gap assumption, polynomial-time Bayesian recovery is feasible once enough eigenvectors are shared, complementing AFR's deterministic guarantees. Experiments on nine benchmarks show that LoGraB reveals model strengths and weaknesses under fragmentation, AFR achieves the best F1 on 7/9 datasets, and under per-embedding $(\epsilon,\delta)$-Gaussian differential privacy, AFR retains 75% of its undefended F1 at $\epsilon=2$. Our anonymous code is available at https://anonymous.4open.science/r/JMLR_submission
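The RANSAC-Procrustes alignment step AFR uses to assemble fragments rests on the classical orthogonal Procrustes solution. A minimal sketch over shared anchor nodes (not the paper's full RANSAC pipeline) looks like this:

```python
import numpy as np

def procrustes_align(source, target):
    """Orthogonal Procrustes: find the rotation R minimizing
    ||source @ R - target||_F, given (n, k) embedding fragments
    that share n anchor nodes."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt  # optimal orthogonal map

# Sanity check: a fragment rotated by a random orthogonal matrix
# is recovered exactly by the Procrustes solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # "true" spectral coordinates
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # unknown rotation
R = procrustes_align(X @ Q, X)        # recovers Q's inverse
```

In AFR this alignment is applied only to fragment pairs that pass the fidelity score, and RANSAC is used to reject anchor outliers before solving.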
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
🔥 引用:
0
Abstract: Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
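The cache-then-validate loop at the core of HaS (accept a draft when the incoming query is identified as a homologous re-encounter of a previously observed one, else fall back to full-database retrieval) might be sketched as below. The cosine threshold and class interface are illustrative assumptions; the paper's re-identification model is more sophisticated than a similarity cutoff:

```python
import numpy as np

class SpeculativeCache:
    """Toy homology-aware cache: reuse the documents retrieved for a
    previously observed query when a new query embedding is close
    enough (cosine); otherwise invoke the slow full-database retriever."""

    def __init__(self, retriever, threshold=0.9):
        self.retriever = retriever        # callable: embedding -> docs
        self.threshold = threshold        # hypothetical homology cutoff
        self.keys, self.values = [], []

    def retrieve(self, q):
        q = q / np.linalg.norm(q)
        for k, docs in zip(self.keys, self.values):
            if float(q @ k) >= self.threshold:   # homologous re-encounter
                return docs, "draft-accepted"
        docs = self.retriever(q)                  # slow path
        self.keys.append(q)
        self.values.append(docs)
        return docs, "full-retrieval"
```

Under the popularity patterns the abstract describes, most queries hit the fast path, which is where the reported latency reductions come from.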
Break the Optimization Barrier of LLM-Enhanced Recommenders: A Theoretical Analysis and Practical Framework
🔥 引用:
0
Abstract: Large language model (LLM)-enhanced recommendation models inject LLM representations into backbone recommenders to exploit rich item text without inference-time LLM cost. However, we find that existing LLM-enhanced methods significantly hinder the optimization of backbone models, resulting in high training losses that are difficult to reduce. To address it, we establish a comprehensive theoretical analysis of local optimization curvature and identify two key causes: 1) large norm disparity and 2) semantic-collaboration misaligned angular clustering of LLM representations. Guided by these insights, we propose Training-Friendly LLM-Enhanced Recommender (TF-LLMER), a lightweight framework with two key components. First, we highlight the necessity of item embedding normalization to eliminate norm-driven instability and achieve provable control over optimization conditioning. Second, we introduce Rec-PCA, a recommendation-aware dimensionality reduction method that injects collaborative structure into the representation transformation to resolve semantic-collaboration misaligned angular clustering. It jointly optimizes semantic information retention and alignment with an item-item co-occurrence graph constructed from interaction histories. The graph captures collaborative structure, and alignment is promoted by penalizing total variation over the graph. Both theory and extensive experiments demonstrate that TF-LLMER significantly outperforms state-of-the-art methods. Our code is available at https://github.com/woriazzc/TF-LLMER.
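The first component, item embedding normalization to eliminate norm-driven instability, is straightforward to illustrate. This is a generic L2-normalization sketch, not the paper's code:

```python
import numpy as np

def normalize_items(E):
    """L2-normalize LLM item embeddings row-wise, removing the norm
    disparity that the paper identifies as a cause of poor optimization
    conditioning in the backbone recommender."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)
```

After this step every item vector lies on the unit sphere, so gradients depend only on angular structure, which Rec-PCA then reshapes using the item-item co-occurrence graph.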
Systematic Benchmarking of Kinase Bioactivity Models Across Splitting Strategies and Protein Representations
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Dating and Relationships under a Bluesky
🔥 引用:
0
Abstract: Social media platforms increasingly play a central role in how individuals acquire knowledge, form expectations, and negotiate practices surrounding intimate relationships. Rather than relying primarily on formal education or interpersonal networks, adolescents and young adults often prefer digital sources when seeking advice and information about sex and relationships. Platforms such as X, Instagram, and BlueSky provide perceived advantages of anonymity, autonomy, and accessibility, enabling users to engage in the de-identified sharing of sensitive concerns. In doing so, they contribute to the ongoing transformation of “romantic media ideologies,” understood as the taken-for-granted assumptions and discourses that shape how relationships are understood, evaluated, and experienced in contemporary societies. To date, much of the scholarship examining these dynamics has relied on qualitative methods such as interviews, ethnography, or discourse analysis. While such approaches offer valuable insights, they are limited in scope when considering the scale and complexity of digital interaction. Educational Data Science (EDS) offers a promising methodological framework to address this gap. By integrating Learning Analytics (LA) and Educational Data Mining (EDM), EDS provides robust tools for examining informal knowledge acquisition and meaning-making in online spaces. In this study, we apply three specific methods—social network analysis, natural language processing, and large language model annotation/classification—to analyze dating and relationship discourse on BlueSky, thereby extending EDS into a largely unexplored domain. Another dimension concerns the role of social media influencers as emerging “new experts” in the domain of dating and relationships. These figures act as knowledge brokers: they are perceived as credible and authentic, often combining entertainment with didactic or advisory content. 
At the same time, their interventions are not unproblematic, as advice can be contradictory, selective, and shaped by platform-specific logics of visibility, brand-building, and monetization. Our analyses identify three archetypal user groups—Knowledge-based Authority, Commercial Positioning, and Experiential Discourse. Each group demonstrates distinct communication styles, with varying degrees of formalized expertise and credibility claims. While all engage in advice-giving, their thematic emphases diverge, ranging from wellness and self-care to promotional strategies and lived relational experiences. By mapping these dynamics, the study contributes to scholarly debates on the evolution of romantic media ideologies and the social construction of expertise in digital discourse.
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
🔥 引用:
0
Abstract: The field of Language Reasoning Models (LRMs) has been very active in recent years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to generating correct answers, is largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inferences. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge-and-reasoning benchmarks (MMLU and GPQA). We achieve 20-50% token reduction while maintaining accuracy comparable to standard generation.
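An interpretable early-stopping criterion of the kind described (stop once the recent steps are dominated by verification and reflection after an answer has appeared) could look like the following toy rule; the tag names and window size are illustrative assumptions, not the TRACES tag set:

```python
def should_stop(step_tags, window=4):
    """Toy early-stopping rule in the spirit of TRACES: once an ANSWER
    step has been emitted and the last `window` tagged steps contain
    only verification/reflection behavior, further generation is
    unlikely to change the answer, so generation is cut off."""
    if "ANSWER" not in step_tags:
        return False
    tail = step_tags[-window:]
    return all(tag in {"VERIFY", "REFLECT"} for tag in tail)
```

A real-time tagger classifies each step as it is generated, and the monitor applies this check after every step, trading a small tagging cost for a large reduction in generated tokens.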
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
🔥 引用:
0
Abstract: Despite significant progress in Multi-modal Large Language Models (MLLMs), their clinical reasoning capacity for multi-modal diagnosis remains largely unexamined. Current benchmarks, which mostly rely on single-modality data, cannot evaluate the progressive reasoning and cross-modal integration essential for clinical practice. We introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation of MLLMs through a complete ophthalmology diagnostic workflow, with two reasoning tasks: 1) a six-stage progressive reasoning chain spanning image quality assessment to clinical decision-making, and 2) a cross-modality reasoning task integrating six imaging modalities. The benchmark comprises 26,415 images and 177,868 expert-verified VQA pairs curated from 51 public datasets, covering 52 ophthalmic diseases. Evaluation of 21 MLLMs reveals critical gaps in progressive reasoning and cross-modal integration. Dataset and code: https://github.com/CVI-SZU/X-PCR.
Learning Reasoning World Models for Parallel Code
🔥 引用:
0
Abstract: Large language models have shown remarkable ability in serial code generation, but they still struggle with parallel code for which training data is comparatively scarce. A common remedy is to use coding agents that interact with external tools, but tool calls can be costly and sometimes impractical, e.g., for partially written code. We propose Parallel-Code World Models (PCWMs), reasoning LLMs that aim to predict tool outcomes directly from parallel source code. To train PCWMs, we design a novel exploration and data generation pipeline that samples diverse parallel-coding problems and candidate implementations across multiple domains, then executes them via tools to record data races and performance profiles. From these, we synthesize hindsight reasoning traces that causally connect source code to observed tool outcomes. Fine-tuning on the resulting data yields noticeable gains, with a 7B-parameter world model improving from 64.3% to 72.8% accuracy in race-outcome prediction, while an 8B-parameter model improves in a performance profiling task from 49.3% to 58.6% accuracy. Furthermore, when open-weight models were tasked with fixing data races, world-model feedback improved their race-fixing rates relative to self-feedback by 2.7%-9.1% using our 7B-parameter world model and by 6.1%-11.1% using our 14B-parameter world model. Our results suggest that reasoning models have the potential to serve as practical substitutes for external tool calls in parallel-coding agents.
MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs
🔥 引用:
0
Abstract: Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk, safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and the CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's $\alpha$ = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.
EconoGNN: A graph neural network framework for temporal economic resilience insights
🔥 引用:
0
Abstract: Global economic shocks such as the 2008 financial crisis or recent trade escalations between the United States and China have exposed the complexity of interdependent economies and the need for systemic, multi-agent analysis. However, most regional economic resilience (RER) studies remain limited by localized datasets, inconsistent definitions, and static modeling approaches, restricting their ability to generalize insights across space and time. We introduce EconoGNN, a Graph Neural Network framework that integrates complexity theory, economic modeling, and machine learning to predict and explain regional economic resilience across 183 countries over 25 years. By combining over 81 million trade records (UN COMTRADE) and 500,000 macroeconomic observations (Penn World Table), and adopting an official resilience metric from the World Bank, our approach enables reproducible and interpretable global-scale analysis. EconoGNN achieves F1-scores of 0.750 with the temporal GNN architecture GConvGRU, AUC-ROC of 0.792, and PR-AUC of 0.757, demonstrating robust performance across different recovery threshold settings (τ = 0.90–1.00) with F1-scores ranging from 0.730 to 0.771, and yielding statistically significant improvements (p-value ≤0.05) over baselines. GNNExplainer validation confirms explanation reliability (Fidelity+ = 0.827, Characterization = 0.913), enabling country-customized interpretability of resilience drivers. Moreover the EconoGNN framework integrates key structural and welfare indicators to model both national and cross-border economic interactions, reducing omitted-variable bias and implicitly accounting for political, institutional, and cultural differences.
Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports
🔥 引用:
0
Abstract: Objectives Coronary computed tomography angiography (CCTA) has become a cornerstone in non-invasive CAD diagnosis and risk stratification. To standardize reporting and improve clinical decision-making, the CAD-RADS 2.0 system was introduced. This study evaluates the performance of four LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V, and Copilot) in generating CAD-RADS 2.0-compliant conclusions from standardized CCTA reports. Materials and methods A total of 196 anonymized CCTA reports were retrospectively analyzed. Each LLM was prompted to provide CAD-RADS 2.0 classifications and follow-up recommendations. Ground truth labels were assigned by a senior radiologist. Performance metrics (accuracy, precision, recall, F1-score), execution times, and agreement (Cohen’s kappa) with expert interpretation were computed. Interobserver agreement between junior and senior radiologists was also assessed. Results LLMs demonstrated good-to-excellent agreement with expert classifications: DeepSeek V (κ = 0.771), Copilot (κ = 0.761), GPT-4o (κ = 0.759), and Gemini 2.0 Flash (κ = 0.634). DeepSeek V achieved the highest accuracy (91.83%). Intra-model consistency was perfect (κ = 1). However, LLMs failed to assign CAD-RADS modifiers. GPT-4o provided the most accurate follow-up recommendations (71.94%). All LLMs outperformed radiologists in execution time (3–9 s vs. 15–20 s; p < 0.05). Conclusions Generic LLMs demonstrate promising performance in automating CAD-RADS 2.0 classification from CCTA reports. However, limitations in modifier assignment and recommendation accuracy highlight areas for refinement before clinical integration. Critical relevance statement This study explores the potential of large language models to facilitate standardized CAD-RADS 2.0 reporting from coronary CT angiography, highlighting a possible avenue to support workflow efficiency and clinical decision-making in non-invasive coronary artery disease evaluation.
Key Points: LLMs demonstrated strong potential in automating CAD-RADS 2.0-compliant structured reporting for CCTA; LLMs could significantly enhance efficiency in radiological reporting; and LLMs need further optimization before clinical integration.
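The agreement figures above are Cohen's kappa values. For reference, a minimal sketch of how kappa is computed from two raters' labels of the same items (a generic implementation, not the study's analysis code):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance, given two raters' labels for the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n**2    # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

Values of 0.61-0.80 are conventionally read as "substantial" agreement, which is consistent with the "good-to-excellent" characterization in the results.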
Generating and Solving Complex Transfer Type Arithmetic Word Problems
🔥 引用:
0
Abstract: Most existing arithmetic word problem (AWP) solvers focus on solving simple examples. Transfer-case AWPs (TC-AWPs) involve scenarios where objects are transferred between agents. The widely used AWP datasets mainly consist of simple TC-AWPs (problems that involve a single object transfer). Current large language models (LLMs) are capable of solving most of these simple TC-AWPs effectively. In this work, we focus on assessing the solving capability of LLMs (ChatGPT and Gemini) for complex TC-AWPs (where multiple types of objects are transferred or more than one transfer of an object is performed). Since the popular AWP datasets contain only simple TC-AWPs, we first generate complex TC-AWPs using an ontological approach. We utilize these complex examples to assess LLMs’ word-problem-solving capabilities. We observe that the accuracy of LLMs falls rapidly as the number of object transfers increases to 3 or 4. An approach for solving TC-AWPs using ontologies and machine learning (ML) exists in the literature. We propose an extension of this approach that can handle complex TC-AWPs and find that, compared to current LLMs, the proposed solution gives better accuracy on complex TC-AWPs. We analyze the failed cases of the LLM approach and find that the reasoning capabilities of LLMs need substantial improvement.
Mathematical modeling and analysis of Love wave propagation in fluid-saturated continuum with spring-membrane interfacial foundations
🔥 引用:
0
Abstract: Numerous scientific studies and industrial fields, including geology, geophysics, mining, etc., rely heavily on seismic waves. Seismic wave analysis provides information that helps us better manage and use natural resources by enhancing our understanding of their structure. With this objective, this study presents a comprehensive mathematical model for Love-type seismic wave propagation in a transversely isotropic fluid-saturated poroelastic (TIFSP) layer bonded to an elastic half-space through spring–membrane interfacial foundations. More precisely, Classical (CL), Spring (SPR), Membrane (MBR), and Combined (CB) elastic foundation models are discussed. Implementing analytical mathematical techniques and complex continuity conditions across the different interfaces, the hyperbolic boundary value problem is solved, and the closed-form expressions of dispersion relations for all four models are obtained. The obtained dispersion relations are also reduced as special cases and matched with the results in the literature. Noteworthy influences of prevalent parameters such as anisotropy, porosity, spring constant, surface/interface Lame and density constants, etc., on Love wave phase velocity are examined graphically and discussed. The results demonstrate that spring–membrane interfacial foundations provide an effective mechanism for tailoring Love-wave dispersion in fluid-saturated media, offering valuable insights for subsurface sensing, geophysical waveguides, and engineered layered composites with imperfect interfaces.
Enhancing Accessibility of Government Notices Through LLM-Based Multilingual Translation
DOI:
10.55041/isjem06511
🔥 引用:
0
Abstract: In the field of computational linguistics, addressing machine translation (MT) challenges for low-resource languages remains crucial, as these languages often lack extensive data compared to high-resource languages. General large language models (LLMs), such as GPT-4 and Llama, primarily trained on monolingual corpora, face significant challenges in translating low-resource languages, often resulting in subpar translation quality. This study introduces Language-Specific Fine-Tuning with Low-rank adaptation (LSFTL), a method that enhances translation for low-resource languages by optimizing the multi-head attention and feed-forward networks of Transformer layers through low-rank matrix adaptation. LSFTL preserves the majority of the model parameters while selectively fine-tuning key components, thereby maintaining stability and enhancing translation quality. Experiments on non-English-centered low-resource Asian languages demonstrated that LSFTL improved COMET scores by 1-3 points compared to specialized multilingual machine translation models. Additionally, LSFTL’s parameter-efficient approach allows smaller models to achieve performance comparable to their larger counterparts, highlighting its significance in making machine translation systems more accessible and effective for low-resource languages. Key Words: Machine translation, low-resource languages, large language models, parameter-efficient fine-tuning, LoRA.
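The low-rank adaptation at the heart of LSFTL follows the standard LoRA recipe: the frozen weight matrix is supplemented by a trainable product of two small matrices. A minimal sketch (shapes and the alpha/r scaling follow common LoRA conventions; the variable names are illustrative, not the paper's code):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA-style forward pass: the frozen weight W (d_out x d_in) is
    supplemented by a low-rank update (alpha/r) * B @ A, where only
    A (r x d_in) and B (d_out x r) are trained."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # B starts at zero, so the
x = rng.normal(size=(4, d_in))           # model is unchanged at init
```

Because only A and B (2 x r x d parameters instead of d_out x d_in) receive gradients, the majority of the model is preserved, which is exactly the stability property the abstract emphasizes.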
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
🔥 引用:
0
Abstract: While Vision-Language Models (VLMs) have advanced high-level reasoning in autonomous driving, their ability to ground this reasoning in the underlying physics of ego-motion remains poorly understood. We introduce EgoDyn-Bench, a diagnostic benchmark for evaluating the semantic ego-motion understanding of vision-centric foundation models. By mapping continuous vehicle kinematics to discrete motion concepts via a deterministic oracle, we decouple a model's internal physical logic from its visual perception. Our large-scale empirical audit spanning 20+ models, including closed-source MLLMs, open-source VLMs across multiple scales, and specialized VLAs, identifies a significant Perception Bottleneck: while models exhibit logical physical concepts, they consistently fail to accurately align them with visual observations, frequently underperforming classical non-learned geometric baselines. This failure persists across model scales and domain-specific training, indicating a structural deficit in how current architectures couple visual perception with physical reasoning. We demonstrate that providing explicit trajectory encodings substantially restores physical consistency across all evaluated models, revealing a functional disentanglement between vision and language: ego-motion logic is derived almost exclusively from the language modality, while visual observations contribute negligible additional signal. This structural finding provides a standardized diagnostic framework and a practical pathway toward physically aligned embodied AI. Keywords: Ego-motion, Physical Reasoning, Foundation Models
Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis
🔥 引用:
0
Abstract: An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.
Synthesis, Characterization, Antiproliferative Activity, Toxicity, and ADME Studies of a New Schiff Base
🔥 引用:
0
Abstract: In this study, a new Schiff base 3 was synthesized in a single reaction step from N1-phenylbenzene-1,2-diamine and 2-hydroxy-5-methoxybenzaldehyde over 20 h. The structural properties of compound 3 were characterized in detail by 1H NMR, 13C NMR, and IR. The cytotoxic activity of the synthesized molecule was investigated on breast cancer cell lines (MDA-MB-231 and MCF-7) and a healthy human embryonic kidney cell line (HEK-293T). The evaluations revealed that the compound exhibits cytotoxic activity against breast cancer cells (IC50: 22.73 µM for MDA-MB-231 and 51.59 µM for MCF-7). Additionally, analyses on healthy cells revealed that the compound exhibits high selectivity (IC50: 352.40 µM). The ADME and toxicity (ADMET) properties of molecule 3 were analyzed using in silico approaches: physicochemical parameters, lipophilicity, and solubility predictions were calculated using SwissADME, while acute toxicity predictions were obtained through the Protox-II platform, which estimated a toxic dose of 563 mg/kg. The GHS toxicity class assigned by Protox-II is 4, indicating moderate acute toxicity.
Locally-deployed vs. cloud-based AI in healthcare: evaluating DeepSeek-R1:8b, DeepSeek-R1, and ChatGPT o3-mini-high for complex medical diagnostics
🔥 引用:
0
Abstract: Reasoning large language models are increasingly considered for healthcare-related artificial intelligence applications, but their practical value depends not only on diagnostic accuracy, but also on responsiveness and operational reliability. In this study, we benchmarked six model settings on 1,000 questions from the MedQA dataset: DeepSeek-R1, its distilled 8-billion-parameter local variant DeepSeek-R1:8b, ChatGPT o3-mini-high, and their knowledge-base–augmented counterparts. We evaluated performance across three dimensions: diagnostic accuracy, response latency, and first-attempt connection reliability. DeepSeek-R1 achieved the highest accuracy (89.5%, 95% CI: 87.4–91.2) but showed substantially longer response times (median 26.54 s) and higher connection failure rates (4.6%). ChatGPT o3-mini-high responded faster (median 10.05 s) and showed the most favorable tail-latency profile, but its accuracy (78.2%, 95% CI: 75.5–80.7) was lower than that of DeepSeek-R1. The locally deployed DeepSeek-R1:8b demonstrated markedly stronger connection reliability (failure rate 0.2%, 95% CI: 0.0%–0.5%) but substantially reduced accuracy (55.0%, 95% CI: 51.9%–58.5%). Knowledge-base augmentation did not consistently improve performance; for DeepSeek-R1, it significantly reduced accuracy by 4.36% (p = 0.0002), while no significant benefit was observed for the other models. These findings show that reasoning model performance in medical question answering is best understood as a trade-off among accuracy, latency, connection reliability, and deployment mode, and that retrieval augmentation is not universally beneficial. More broadly, this study provides deployment-relevant benchmarking evidence for evaluating reasoning models in healthcare-related settings, while also indicating the need for richer knowledge resources and more realistic task environments before such systems can be meaningfully assessed for real-world clinical use.
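The abstract reports 95% confidence intervals for accuracies measured on n = 1,000 questions. The interval method is not stated in the abstract, but a Wilson score interval, a common choice for binomial proportions, produces intervals of comparable width, as this sketch shows:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion
    (z = 1.96 for the two-sided 95% level)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For 895 correct answers out of 1,000 (the DeepSeek-R1 result), this gives an interval of roughly 87.4% to 91.3%, close to the 87.4-91.2 range reported above.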
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
Citations: 0
Abstract: We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline.
Feasibility of Large Language Models for Ongoing Professional Practice Evaluation in Cardiovascular Medicine: A Pilot Study in the United States
Citations: 0
Abstract: No abstract available; see the original article.
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Citations: 0
Abstract: The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.
Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Citations: 0
Abstract: Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.
Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
Citations: 0
Abstract: The rapidly evolving Node.js ecosystem currently includes millions of packages and is a critical part of modern software supply chains, making vulnerability detection of Node.js packages increasingly important. However, traditional program analysis struggles in this setting because of dynamic JavaScript features and the large number of package dependencies. Recent advances in large language models (LLMs) and the emerging paradigm of LLM-based agents offer an alternative to handcrafted program models. This raises the question of whether an LLM-centric, tool-augmented approach can effectively detect and confirm taint-style vulnerabilities (e.g., arbitrary command injection) in Node.js packages. We implement LLMVD.js, a multi-stage agent pipeline to scan code, propose vulnerabilities, generate proof-of-concept exploits, and validate them through lightweight execution oracles; and systematically evaluate its effectiveness in taint-style vulnerability detection and confirmation in Node.js packages without dedicated static/dynamic analysis engines for path derivation. For packages from public benchmarks, LLMVD.js confirms 84% of the vulnerabilities, compared to less than 22% for prior program analysis tools. It also outperforms a prior LLM-program-analysis hybrid approach while requiring neither vulnerability annotations nor prior vulnerability reports. When evaluated on a set of 260 recently released packages (without vulnerability groundtruth information), traditional tools produce validated exploits for few (≤ 2) packages, while LLMVD.js generates validated exploits for 36 packages.
Nutritional Profiling, Antioxidant Activity, and In vivo Toxicological Evaluation of Myristica fragrans Seeds: Renal-Protective Insights from Molecular Docking
Citations: 0
Abstract:
Medicinal plants are utilized in the pharmaceutical, nutraceutical, food, and beverage industries for their safety and health-promoting properties. The current study evaluates the nutritional composition, antioxidant activity, ADMET (absorption, distribution, metabolism, excretion, and toxicity) analysis, in vivo safety assessment, and the in silico renal-protective effect of Myristica fragrans seed powder and extract. The proximate composition and antioxidant potential of M. fragrans were evaluated by AOAC (Association of Official Agricultural Chemists), along with the TPC (Total Phenolic Contents), FRAP (Ferric Reducing Antioxidant Power), and DPPH (2,2-Diphenyl-1-picrylhydrazyl) methods. For safety assessment, Sprague-Dawley rats were supplemented with M. fragrans seed powder (MFSP) and hydroethanolic extract (MFS-HEE) for 28 days. Additionally, eugenol, Licarin A, and myristicin were assessed for ADMET properties and docked with 4E4D (Crystal structure of the mouse RANKL-OPG complex), using PyRx and Discovery Studio Visualizer. The results showed that M. fragrans seeds comprise fat (26.55%), fiber (12.23%), and protein (15.75%), and hydroethanolic extract exhibited maximum antioxidant activity in DPPH and FRAP assays. Furthermore, safety evaluation revealed no hematological and histopathological adverse effects. The lipid profile was reduced by 600 mg/kg body weight of dried M. fragrans seeds extract. Licarin A revealed the highest binding affinity (-7.6 kcal/mol) with the renal-injury related protein in molecular docking analysis; moreover, eugenol exhibited the highest predicted absorption and safety compared to other compounds in ADMET analysis. In conclusion, MFSP and MFS-HEE are safe and have renal protective potential.
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
Citations: 0
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
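The probing signal described above can be made concrete with a toy sketch: given the conditional probability of the correct answer measured at each segment boundary, the per-segment progress is the change in that belief (the function and the first-difference formulation are our illustration, not code from the paper):

```python
def segment_progress(belief):
    """belief[i] = P(correct answer | reasoning up to segment boundary i).

    Returns per-segment progress: how much each reasoning step moved the
    model's belief toward the correct answer. Positive = productive step,
    near-zero or negative = a candidate for down-weighting (overthinking).
    """
    return [b1 - b0 for b0, b1 in zip(belief, belief[1:])]

# A trajectory whose middle step is wasted "overthinking":
progress = segment_progress([0.20, 0.55, 0.54, 0.90])
# ≈ [0.35, -0.01, 0.36]: the second segment adds nothing, so its
# contribution to GRPO's trajectory-level advantage can be reduced.
```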
New indole-linked 1,2,4-triazole derivatives as dual FAK inhibitors and apoptosis inducers targeting survival and migration in triple-negative breast cancer in-vitro
Citations: 0
Abstract:
Focal adhesion kinase (FAK) is overexpressed and hyperactivated in triple-negative breast cancer, driving tumor aggressiveness and cancer stem cell–mediated therapy resistance. Therefore, targeting FAK signalling represents a promising therapeutic strategy. In this study, a series of indole and bis-indole-1,2,4-triazoles were synthesized and evaluated as anti-TNBC agents targeting FAK. Compounds 3c, 4c, and 5c displayed potent cytotoxicity (IC₅₀ = 41–77 µg/mL) with minimal toxicity to normal cells, outperforming precursor compound 2. Wound-healing assay revealed significant inhibition of cell migration, particularly by 4c. Cell cycle analysis revealed that 4c induced S-phase arrest in MCF-7 cells and G1-phase arrest in MDA-MB-231 cells, accompanied by significant apoptosis. In MDA-MB-231 cells, 4c triggered extensive total apoptosis (90.84%) with minimal necrosis. Gene expression studies demonstrated that 4c markedly downregulated PTK2 (FAK), CCL5, and BCL2, while upregulating CASP3, highlighting its dual role as FAK inhibitor and apoptosis inducer. Importantly, 4c efficiently suppressed FAK protein expression (61.3%) in TNBC, compared to the FAK inhibitor GSK-2256098 (70.7%). In vivo toxicity assessment confirmed good tolerability in mice without profound hepatic or renal impairments, while docking and ADMET analyses confirmed strong FAK binding affinity and favourable pharmacokinetics of 4c. Collectively, 4c emerges as a promising FAK-targeted candidate for TNBC therapy.
Can Virtual Agents Care? Designing an Empathetic and Personalized LLM-Driven Conversational Agent
Citations: 0
Abstract: Mental health challenges are rising globally, while traditional support services face limited availability and high costs. Large language models offer potential for conversational support, but often lack personalization, empathy, and factual grounding. A virtual agent framework is introduced to provide empathetic, personalized, and reliable wellbeing support through retrieval-augmented architecture, structured memory, and multimodal interaction. Objective benchmarks demonstrate improved retrieval and response quality, particularly for smaller models. A cross-cultural study with university students from Vietnam and Australia shows the system outperforms LLM-only baselines in coherence, perceived accuracy, and empathy, with most participants clearly preferring the proposed approach.
Cold-Start Forecasting of New Product Life-Cycles via Conditional Diffusion Models
Citations: 0
Abstract: Forecasting the life-cycle trajectory of a newly launched product is important for launch planning, resource allocation, and early risk assessment. This task is especially difficult in the pre-launch and early post-launch phases, when product-specific outcome history is limited or unavailable, creating a cold-start problem. In these phases, firms must make decisions before demand patterns become reliably observable, while early signals are often sparse, noisy, and unstable. We propose the Conditional Diffusion Life-cycle Forecaster (CDLF), a conditional generative framework for forecasting new-product life-cycle trajectories under cold start. CDLF combines three sources of information: static descriptors, reference trajectories from similar products, and newly arriving observations when available. Here, static descriptors refer to structured pre-launch characteristics of the product, such as category, price tier, brand or organization identity, scale, and access conditions. This structure allows the model to condition forecasts on relevant product context and to update them adaptively over time without retraining, yielding flexible multi-modal predictive distributions under extreme data scarcity. The method satisfies consistency with a horizon-uniform distributional error bound for recursive generation. Across studies on Intel microprocessor stock keeping unit (SKU) life cycles and the platform-mediated adoption of open large language model repositories, CDLF delivers more accurate point forecasts and higher-quality probabilistic forecasts than classical diffusion models, Bayesian updating approaches, and other state-of-the-art machine-learning baselines.
A Hybridizable Neural Time Integrator for Stable Autoregressive Forecasting
Citations: 0
Abstract: For autoregressive modeling of chaotic dynamical systems over long time horizons, the stability of both training and inference is a major challenge in building scientific foundation models. We present a hybrid technique in which an autoregressive transformer is embedded within a novel shooting-based mixed finite element scheme, exposing topological structure that enables provable stability. For forward problems, we prove preservation of discrete energies, while for training we prove uniform bounds on gradients, provably avoiding the exploding gradient problem. Combined with a vision transformer, this yields latent tokens admitting structure-preserving dynamics. We outperform modern foundation models with a $65\times$ reduction in model parameters and long-horizon forecasting of chaotic systems. A "mini-foundation" model of a fusion component shows that 12 simulations suffice to train a real-time surrogate, achieving a $9{,}000\times$ speedup over particle-in-cell simulation.
Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization
Citations: 0
Abstract: Large Language Models (LLMs) are increasingly deployed as automated annotators in semantic multimedia systems, yet their reliability varies significantly across architectures. This study extends prior cross-model evaluations by benchmarking ChatGPT-5, Qwen-3, and Gemini-3-flash against human expert annotations using the HRAST hotel review dataset. We adopt a bias-by-design framework to analyze systematic divergences in sentiment, topic, and aspect labeling across real and synthetic data, while investigating the moderating effects of annotation mode. Findings reveal model-contingent polarity bias: ChatGPT-5 exhibits a pronounced neutrality bias, while Qwen-3 and Gemini-3-flash align more closely with human polarization. Agreement is substantial for concrete topics but diverges on abstract evaluative dimensions. Synthetic data consistently inflates reliability metrics while masking ambiguity. These findings highlight that annotation bias is structurally embedded in model design choices and operational conditions. Cross-architectural triangulation and mode-aware deployment strategies are recommended for robust semantic multimedia system development.
Video-ToC: Video Tree-of-Cue Reasoning
Citations: 0
Abstract: Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at https://github.com/qizhongtan/Video-ToC.
The Last Harness You'll Ever Build
Citations: 0
Abstract: AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the Harness Evolution Loop optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the Meta-Evolution Loop optimizes the evolution protocol $\Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, learning a protocol $\Lambda^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all. We formalize the correspondence to meta-learning and present both algorithms. The framework shifts manual harness engineering into automated harness engineering, and takes one step further -- automating the design of the automation itself.
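The inner Harness Evolution Loop can be sketched in plain Python, with the Worker, Evaluator, and Evolution agents reduced to callables; all names, the fixed round budget, and the score-threshold stopping rule are illustrative assumptions rather than the paper's algorithm:

```python
def evolve_harness(task, worker, evaluator, evolver, harness,
                   rounds=5, target=0.95):
    """Inner Harness Evolution Loop: run, diagnose, mutate, repeat."""
    history = []
    for _ in range(rounds):
        output = worker(task, harness)             # W_H executes the task
        score, critique = evaluator(task, output)  # V diagnoses and scores
        history.append((harness, score, critique))
        if score >= target:                        # harness has converged
            break
        harness = evolver(task, history)           # E rewrites the harness
    return max(history, key=lambda h: h[1])[0]     # best harness seen

# Toy instantiation: the "harness" is just a prompt prefix the worker echoes.
best = evolve_harness(
    task="say hi",
    worker=lambda t, h: h + t,
    evaluator=lambda t, out: (1.0 if out.startswith("POLITE:") else 0.0,
                              "be polite"),
    evolver=lambda t, hist: "POLITE:",
    harness="",
)
```

The outer Meta-Evolution Loop would then treat the tuple (worker, initial harness, evaluator, evolver) itself as the object being optimized across tasks.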
EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
Citations: 0
Abstract: This paper proposes EvoAgent - an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub-agent delegation mechanism. EvoAgent models skills as multi-file structured capability units equipped with triggering mechanisms and evolutionary metadata, and enables continuous skill generation and optimization through a user-feedback-driven closed-loop process. In addition, by incorporating a three-stage skill matching strategy and a three-layer memory architecture, the framework supports dynamic task decomposition for complex problems and long-term capability accumulation. Experimental results based on real-world foreign trade scenarios demonstrate that, after integrating EvoAgent, GPT5.2 achieves significant improvements in professionalism, accuracy, and practical utility. Under a five-dimensional LLM-as-Judge evaluation protocol, the overall average score increases by approximately 28%. Further model transfer experiments indicate that the performance of an agent system depends not only on the intrinsic capabilities of the underlying model, but also on the degree of synergy between the model and the agent architecture.
BioEngine: scalable execution and adaptation of bioimage AI through agent-readable interfaces
Citations: 0
Abstract: No abstract available; see the original article.
HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping
Citations: 0
Abstract: The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth's climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral-spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Citations: 0
Abstract: The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
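The recomposition property that Stream-CQSA relies on (partial attention over subsequences recombining into exactly the full-sequence result) rests on the standard online-softmax identity. A minimal numpy sketch for a single query vector (block size and names are ours; this illustrates the exact recombination, not the cyclic-quorum scheduling itself):

```python
import numpy as np

def block_attention(q, K, V, block=256):
    """Exact softmax attention for one query, computed block by block.

    A running max (m), normalizer (s), and value accumulator (acc) are
    rescaled as each block arrives, so only one block of K/V must be
    resident at a time, yet the result equals full-sequence attention.
    """
    d = q.shape[0]
    m, s = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for i in range(0, K.shape[0], block):
        logits = K[i:i + block] @ q / np.sqrt(d)
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)      # rescale previously accumulated terms
        w = np.exp(logits - m_new)
        s = s * scale + w.sum()
        acc = acc * scale + w @ V[i:i + block]
        m = m_new
    return acc / s
```

Each block's partial result is independent of the others up to this rescaling, which is what lets a scheduler assign sub-problems to whatever memory budget is available without approximation error.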
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness
Citations: 0
Abstract: Large language models perform well on many reasoning tasks, yet they often lack awareness of whether their current knowledge or reasoning state is complete. In non-interactive puzzle settings, the narrative is fixed and the underlying structure is hidden; once a model forms an early hypothesis under incomplete premises, it can propagate that error throughout the reasoning process, leading to unstable conclusions. To address this issue, we propose SABA, a reasoning framework that explicitly introduces self-awareness of missing premises before making the final decision. SABA formulates reasoning as a recursive process that alternates between structured state construction and obstacle resolution: it first applies Information Fusion to consolidate the narrative into a verifiable base state, and then uses Query-driven Structured Reasoning to identify and resolve missing or underspecified premises by turning them into queries and progressively completing the reasoning state through hypothesis construction and state refinement. Across multiple evaluation metrics, SABA achieves the best performance on all three difficulty splits of the non-interactive Detective Puzzle benchmark, and it also maintains leading results on multiple public benchmarks.
RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
Citations: 0
Abstract: We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
Citations: 0
Abstract: Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated the great potential of enhancing the reasoning abilities in multimodal large language models (MLLMs). However, the reliance on language-centric priors and expensive manual annotations prevents MLLMs' intrinsic visual understanding and scalable reward designs. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely-used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We think this work will provide useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.
Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples
Citations: 0
Abstract: The central problem in biomedical imaging is batch effects: systematic technical variations unrelated to the biological signal of interest. These batch effects critically undermine experimental reproducibility and are the primary cause of failure of deep learning systems on new experimental batches, preventing their practical use in the real world. Despite years of research, no method has succeeded in closing this performance gap for deep learning models. We propose Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), a meta-learning adaptation method that exploits negative control samples. Such unperturbed reference images are present in every experimental batch by design and serve as stable context for adaptation. We validate our novel method on Mechanism-of-Action (MoA) classification, a crucial task for drug discovery, on the large-scale JUMP-CP dataset. The accuracy of standard ResNets drops from 0.939 $\pm$ 0.005, on the training domain, to 0.862 $\pm$ 0.060 on data from new experimental batches. Foundation models, even after Typical Variation Normalization, fail to close this gap. We are the first to show that meta-learning approaches close the domain gap by achieving 0.935 $\pm$ 0.018. If the new experimental batches exhibit strong domain shifts, such as being generated in a different lab, meta-learning approaches can be stabilized with control samples, which are always available in biomedical experiments. Our work shows that batch effects in bioimaging data can be effectively neutralized through principled in-context adaptation, which also makes these models practically usable and efficient.
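The role of negative controls described above can be illustrated with a simple numpy sketch: re-estimate normalization statistics from each batch's control wells, so the correction tracks the technical batch effect rather than the biology (the function name and the plain feature-wise standardization are a deliberate simplification of the paper's BN-based meta-learning method):

```python
import numpy as np

def control_normalize(features, is_control):
    """Standardize a batch's features using statistics estimated only
    from its negative-control samples (present in every batch by design),
    so the shift/scale correction follows the batch effect."""
    controls = features[is_control]
    mu = controls.mean(axis=0)
    sigma = controls.std(axis=0) + 1e-8
    return (features - mu) / sigma

# Two batches with identical biology but a systematic technical offset:
rng = np.random.default_rng(1)
batch_a = rng.normal(0.0, 1.0, size=(100, 16))
batch_b = batch_a + 5.0                             # simulated batch effect
ctrl = np.zeros(100, dtype=bool); ctrl[:20] = True  # wells 0-19 are controls
norm_a = control_normalize(batch_a, ctrl)
norm_b = control_normalize(batch_b, ctrl)
# After control-based normalization the two batches coincide again.
```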
A Context-Aware Target Engagement and Pharmacodynamic Biomarker Resource to Accelerate Drug Discovery and Development
Citations: 0
Abstract: No abstract available; see the original article.
Signal, Bounds, and Baselines: Principles for Rigorous Evaluation of High-Dimensional Biological Perturbation Prediction
Citations: 0
Abstract: No abstract available; see the original article.
Introduction to the Special Issue on LLM Empowered Internet of Things Part 2
DOI: 10.1145/3798286
Citations: 0
Abstract: ACM TIOT launched a special issue on the theme of LLM Empowered Internet of Things, exploring the intersection of Large Language Models (LLMs) and the Internet of Things (IoT). As IoT continues to expand, advanced computational models are increasingly essential for processing and analyzing the massive data generated by interconnected devices. This special issue focuses on how LLMs can enhance IoT systems in several key areas. The second part of this special issue introduces the remaining six accepted papers, which span a broad range of IoT scenarios, from embedded and cyber-physical systems and human-centered applications to IoT security.
Transforming Procure-to-Pay with Generative Artificial Intelligence: An Architecture-Grounded Analysis of AI-Powered Invoice Extraction and Non-PO Invoice Processing in SAP Ariba Environments
DOI: 10.55041/isjem06731
🔥 引用:
0
Abstract:
The proliferation of unstructured supplier documents, particularly PDF invoices, remains one of the most persistent operational bottlenecks in enterprise procure-to-pay (P2P) cycles. Manual extraction, inconsistent data quality at intake, and recurring reconciliation exceptions drive significant labor costs, delay payment cycles, and limit the capture of early-payment discounts. This paper presents a rigorous, architecture-grounded analysis of Generative AI (GenAI)-powered invoice PDF extraction integrated with SAP Ariba Buying and Invoicing. Drawing on primary design analysis, benchmark-aligned case evidence, and enterprise integration patterns (REST API, SAP Cloud Integration Gateway), we articulate the functional flow, value levers, implementation prerequisites, and operational risk controls for this capability. A human-in-the-loop validation model and master data governance (MDG) integration are examined as critical guardrails. Complementary P2P capabilities—AI-guided buying, intelligent supplier onboarding, and the SAP Joule generative AI copilot—are situated within a unified enterprise transformation framework. KPI recommendations and a measurement approach grounded in activity-based costing are provided. Findings indicate that organizations achieving sustained catalog coverage and sufficient non-PO invoice volume can realize positive ROI within 12–18 months, with measurable reductions in touchless rate gaps, exception backlog, and per-invoice processing cost.
Keywords: Generative AI | Invoice Processing | SAP Ariba | Procure-to-Pay | Large Language Models | Human-in-the-Loop | Accounts Payable Automation | Master Data Governance
Dual Multi-RAG: Multi-View Contrastive Learning for Code Completion
🔥 Citations: 0
Abstract: Recent advances in pre-trained Large Language Models (LLMs) have greatly improved automated code completion, but challenges such as logical errors, semantic misunderstandings, and hallucinated outputs persist, especially for complex or unseen code. Retrieval-Augmented Generation (RAG) alleviates these issues by retrieving external snippets as supplementary context, yet its single-view encoder often fails to capture the full range of code semantics. We propose Dual Multi-RAG, a novel framework that combines prompt-driven multi-retrieval with multi-view contrastive learning. In the retrieval stage, carefully designed prompts guide LLMs to elicit diverse semantic perspectives. In the selection stage, a contrastive learning mechanism identifies the most contextually relevant snippet for completion. This dual design enriches semantic coverage and improves reliability. Experiments demonstrate that Dual Multi-RAG consistently outperforms existing approaches, achieving 3.91% higher accuracy on the CCEval benchmark and 2.8% improvement on HumanEval-Infilling, confirming its effectiveness for robust and accurate code completion.
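The selection stage this abstract describes can be illustrated with a simple stand-in: score each retrieved snippet against several query "views" and pick the top scorer. The mean-cosine aggregation below is an assumption for illustration only; the paper uses a trained contrastive mechanism, not this fixed rule.

```python
import numpy as np

def select_snippet(query_views, snippet_embs):
    """Illustrative selection step: score each retrieved snippet by its
    mean cosine similarity to several query 'views' (embeddings produced
    under different prompts) and return the index of the best snippet.
    The aggregation rule is hypothetical, not the paper's trained model."""
    q = np.asarray(query_views, dtype=float)
    s = np.asarray(snippet_embs, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # unit-normalize views
    s = s / np.linalg.norm(s, axis=1, keepdims=True)   # unit-normalize snippets
    scores = (q @ s.T).mean(axis=0)                    # average over views
    return int(scores.argmax())
```

Averaging over multiple views is what lets a snippet that matches the query's broader semantics win over one that matches only a single surface reading.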
A platform for investigating prompt framing as interface parameters in foundation models for robotics
🔥 Citations: 0
Abstract: Foundation models, in particular large language models (LLMs), are increasingly popular for describing goals for robotic control, decision making, and execution. Recently, hybrid paradigms that leverage the strengths of reinforcement learning (RL) agents in tandem with LLMs for robotic control have been demonstrated. The interface between the RL agents and the language model, however, offers a unique opportunity to explore how prompt framing may affect such hybrid systems. This work presents a controlled experimental platform to measure and better understand how manipulation of the interface between RL agents and an LLM impacts the behaviour of a hybrid advisor-arbiter architecture. We compared three agents under matched evaluation protocols and initializations in a simulated navigation environment: (i) RL-only tabular Q-learning; (ii) LLM-only (stateless) action selection; and (iii) a hybrid LLM + RL agent. Under a constrained interaction budget (10 episodes per world), the hybrid LLM + RL agent achieves higher mean success and higher mean cumulative reward than both RL-only and LLM-only baselines. Advisor-channel ablations (random recommendations and null recommendations) reduce performance, indicating that structured advice contributes beyond adding extra text. We further demonstrate prompt framing as a controlled factor by evaluating navigation-role personas, narrative personas, and relational variants of a caregiver prompt under matched conditions, yielding heterogeneous effects across framings. The contribution of this work is a structured testbed and evaluation approach for investigating the impact of prompt framing on multi-step decision making and control tasks.
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
🔥 Citations: 0
Abstract: Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in these games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.
Augmented Justice: Artificial Intelligence and the Burden of Legal Reading
DOI: 10.66260/1213.a1asb
🔥 Citations: 0
Abstract: The article examines the effects of the “augmentation” of justice through the introduction of contemporary AI tools to assist judges in reading the law, analysing documents, and deciding cases. Emphasis is placed on the potential human unmanageability of “augmented” justice, as well as on the transformations brought about by large language models in the efforts required of legal professionals in their engagement with legal texts. Two models of arbitrary justice are compared through their literary embodiments in Rabelais’ Judge Bridoye (“The Third Book”) and Aristophanes’ juror Philocleon (“The Wasps”). The questions raised by these figures are then reconsidered in light of the contemporary role of AI in the creation, reading, and application of legal texts.
RNABag: A Generalizable Transcriptome Foundation Model for Precision Oncology across Biopsy Modalities
🔥 Citations: 0
Abstract: No abstract available; see the original article.
MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
🔥 Citations: 0
Abstract: Deploying large language models to heterogeneous hardware is often constrained by memory, not compute. We introduce MCAP (Monte Carlo Activation Profiling), a load-time per-layer importance estimator that enables dynamic precision and memory placement decisions on the target device. MCAP produces a lightweight per-layer signal that drives both precision dispatch (W4A8 vs. W4A16) and residency tier (GPU, RAM, SSD), allowing a single set of weights to operate across diverse memory budgets. Our system, NVE, achieves 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on NVIDIA T4 and enables models to run in memory regimes previously infeasible without modifying weights.
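A load-time placement pass of the kind MCAP drives could look like the following greedy sketch. The tier and precision policy, thresholds, and names here are illustrative assumptions, not NVE's actual dispatch logic.

```python
def place_layers(importance, layer_bytes, gpu_budget, ram_budget):
    """Hypothetical tiering pass: greedily keep the most important layers
    in the fastest memory tier, with higher-precision activations (W4A16)
    reserved for GPU-resident layers. `importance` maps layer name -> score;
    the greedy policy is an illustration, not MCAP's actual rule."""
    placement = {}
    gpu_used = ram_used = 0
    for name in sorted(importance, key=importance.get, reverse=True):
        size = layer_bytes[name]
        if gpu_used + size <= gpu_budget:
            placement[name] = ("gpu", "W4A16")  # important layer: fast tier
            gpu_used += size
        elif ram_used + size <= ram_budget:
            placement[name] = ("ram", "W4A8")
            ram_used += size
        else:
            placement[name] = ("ssd", "W4A8")   # spill the least important
    return placement
```

Because the importance signal is computed at load time on the target device, the same quantized weights can be re-tiered for a laptop, a T4, or a larger GPU without modification.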
DeepParse: Hybrid Log Parsing with LLM-Synthesized Regex Masks
🔥 Citations: 0
Abstract: Modern distributed systems produce massive, heterogeneous logs essential for reliability, security, and anomaly detection. Converting these free-form messages into structured templates (log parsing) is challenging due to evolving formats and limited labeled data. Machine-learning-based parsers like Drain are fast but accuracy often degrades on complex variables, while Large Language Models (LLMs) offer better generalization but incur prohibitive inference costs. This paper presents DeepParse, a hybrid framework that automatically mines reusable variable patterns from small log samples using an LLM, then applies them deterministically through the Drain algorithm. By separating the reasoning phase from execution, DeepParse enables accurate, scalable, and cost-efficient log structuring without relying on brittle handcrafted rules or per-line neural inference. Across 16 benchmark datasets, DeepParse achieves higher accuracy in variable extraction (97.6% average Parsing Accuracy) and better consistency than both heuristic and LLM-only baselines. Integrating DeepParse into an anomaly detection pipeline reduced false alarms by over 30% and reduced inference latency by 36% compared to heuristic baselines.
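The "mine once, apply deterministically" split described above can be illustrated with a toy mask set. The patterns below are hypothetical examples of what an LLM might synthesize from log samples, not DeepParse's actual output, and the downstream Drain clustering step is omitted.

```python
import re

# Hypothetical variable masks an LLM might mine from a small log sample;
# applied in order from most to least specific.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def mask_line(line):
    """Deterministically replace variable tokens with placeholders,
    yielding a template-like line suitable for a parser such as Drain.
    No per-line LLM call is needed at this stage."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line
```

The cost profile follows directly: the LLM runs once over a small sample to produce the masks, and every subsequent log line is handled by cheap regex substitution.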
The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models
🔥 Citations: 0
Abstract: Evaluating the multilingual and multicultural capabilities of Large Language Models (LLMs) is essential for their global utility. However, current benchmarks face three critical limitations: (1) fragmented evaluation dimensions that often neglect deep cultural nuances; (2) insufficient language coverage in subjective tasks relying on low-quality machine translation; and (3) shallow analysis that lacks diagnostic depth beyond simple rankings. To address these, we introduce GaoYao, a comprehensive benchmark with 182.3k samples, 26 languages and 51 nations/areas. First, GaoYao proposes a unified framework categorizing evaluation tasks into three cultural layers (General Multilingual, Cross-cultural, Monocultural) and nine cognitive sub-layers. Second, we achieve native-quality expansion by leveraging experts to rigorously localize subjective benchmarks into 19 languages and synthesizing cross-cultural test sets for 34 cultures, surpassing prior coverage by up to 111%. Third, we conduct an in-depth diagnostic analysis on 20+ flagship and compact LLMs. Our findings reveal significant geographical performance disparities and distinct gaps between tasks, offering a reliable map for future work. We release the benchmark (https://github.com/lunyiliu/GaoYao).
Integrating Generative AI with Autonomous Databases: A Deep Dive into Oracle Database 23ai and Oracle Database 26ai Architecture
🔥 Citations: 0
Abstract: The convergence of generative artificial intelligence (GenAI) and autonomous database systems represents one of the most consequential transformations in enterprise data management over the past decade. Oracle Corporation has positioned itself at the forefront of this convergence through two landmark releases: Oracle Database 23ai, the first database explicitly branded as AI-integrated, and the forthcoming Oracle Database 26ai, which promises a fully AI-native, self-optimizing data platform. This paper presents a comprehensive architectural analysis of both systems, examining how generative AI capabilities are woven into every layer of the database stack, from storage engines and query optimizers to natural-language interfaces and in-database large-language-model (LLM) gateways. We explore Oracle 23ai's flagship features (AI Vector Search, Select AI, JSON Relational Duality, and enhanced in-database machine learning) and contrast them with the anticipated architectural innovations in Oracle 26ai, including multimodal data support, integrated LLM routing, and autonomous self-healing infrastructure. Through comparative analysis, performance benchmarks, and real-world use cases spanning healthcare, finance, and manufacturing, this paper demonstrates that AI-native autonomous databases are not simply databases augmented with AI, but fundamentally re-architected systems where intelligence is a first-class citizen. Our findings suggest that Oracle's trajectory from 23ai to 26ai maps a clear progression toward databases that generate, understand, and act on knowledge rather than merely storing and retrieving it.
DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
🔥 Citations: 0
Abstract: Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains
🔥 Citations: 0
Abstract: Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
🔥 Citations: 0
Abstract: Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.
Persona Routing Associated With Fewer Safety and Monotonicity Violations in Simulated Emergency Large Language Model (LLM) Reasoning
🔥 Citations: 0
Abstract: No abstract available; see the original article.
In search of novel PD1 inhibitor from natural products by high-throughput virtual screening and molecular dynamics simulation
🔥 Citations: 0
Abstract: Squamous cell carcinoma (SCC) represents a significant oncological challenge. While immune checkpoint inhibitors targeting PD-1/PD-L1 have revolutionized the treatment of SCC, current monoclonal antibody approaches face limitations, including poor tissue penetration, high costs, and immune-related adverse events in patients. Most existing small-molecule efforts target PD-L1, leaving PD-1/PD-L2 interactions intact and enabling immune escape. This study represents the first systematic identification of natural product-derived direct PD-1 inhibitors, offering broader pathway blockade compared to PD-L1-selective approaches. While current therapeutic limitations highlight the need for alternative methods, this computational study lays a foundation for experimental validation and potential advancement of a drug development pipeline. Through integrated computational screening of 17,967 phytochemicals from the IMPPAT database, we employed consensus molecular docking across seven algorithms, 300-ns molecular dynamics simulations, density functional theory calculations, and comprehensive ADME profiling. IMPHY004834 (Mahuannin D) from Ephedra sinica emerged as a lead compound with exceptional binding free energy, forming stable interactions with key PD-1 residues. Molecular dynamics analysis revealed remarkable stability, with consistent RMSD, the lowest RMSF, and sustained hydrogen bonding throughout the simulation. The biflavonoid structure exhibits a favorable HOMO-LUMO gap, indicating chemical stability, while ADME profiling confirms drug-like properties, albeit requiring parenteral administration due to low GI absorption. This work establishes the first evidence for Mahuannin D’s PD-1 inhibitory mechanism, which was previously known only for its cytotoxic effects. It provides a validated computational framework for discovering natural product-based immune checkpoint inhibitors with superior pathway coverage compared to existing PD-L1-selective therapeutics.
LLM-based Teamwork Role Inferencing for Fostering Social Online Learning
🔥 Citations: 0
Abstract: Higher education institutions are progressively implementing online learning modules due to their promotion of equity through scalability, accessibility, inclusivity, and affordability. Despite the advantages, online learning environments face persistent challenges in facilitating meaningful networked learning (NL) opportunities. One particularly pressing challenge is supporting effective NL through group formation and collaborative teamwork. With rising enrolments, manual grouping has become impractical. Random allocation overlooks complementary skills and often produces unbalanced teams, while self-assessment relies on students’ often inaccurate self-perceptions, leading to mismatched roles and group tensions. A substantial body of research has examined strategies for group formation, and many studies have emphasised the value of team role allocation in promoting effective NL. However, little attention has been paid to approaches that infer learners’ collaborative role tendencies from their actual behavioural interactions and subsequently use this information to inform group composition and achieve a better NL experience. Accordingly, this study examines whether large language models (LLMs) can infer higher-education learners’ teamwork role tendencies comparably to human judgment, using data from an international sample of undergraduate and postgraduate learners recruited via Prolific who interacted with a custom-built chatbot in collaborative learning scenarios. The resulting interactions were analysed through an LLM to infer teamwork role tendencies based on previously established Belbin Team Roles. The roles inferred by the LLM were then compared against those coded by human coders, with inter-rater alignment evaluated using Cohen’s Kappa and percentage agreement. Learner responses were coded by trained human researchers using a consensus-based Belbin Team Role framework.
The findings reveal that the LLM achieved a moderate degree of alignment with human coders, suggesting its viability as a tool for inferring learners’ teamwork role tendencies. Moreover, exploratory analyses revealed that the length of learners’ responses to the chatbot is associated with the extent to which the LLM’s inferences aligned with human coders. However, future research would benefit from larger sample sizes and the use of more advanced statistical methods to better capture the effects of interaction quality. This study contributes to the growing body of work on LLM-supported NL by highlighting both the potential and the limitations of using LLMs for role inference. Future implementations should pay particular attention to fostering high-quality learner-chatbot interactions to maximise reliability and pedagogical value to help design for effective social participation in online NL experiences. The findings of this research are expected to advance the development of technology-enhanced, NL practices within online higher education.
BioAutoML-FAST: an automated machine-learning platform for reusable and benchmarked biological sequence models
🔥 Citations: 0
Abstract: No abstract available; see the original article.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
🔥 Citations: 0
Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical of reasoning tasks. Existing methods excel at either stable preference learning (e.g., DPO variants like KTO and RSO) or structured reasoning (e.g., ReMA's multi-agent RL framework, Tree of Thoughts), but fail to merge these complementary strengths. We propose HiPO (Hierarchical Preference Optimization), an extension of DPO that separates responses into reasoning segments (query clarification and context, reasoning steps, and answer) and computes loss as a weighted sum of the DPO loss for each segment. Our approach enables segment-specific training while maintaining DPO's computational efficiency and training stability. We demonstrate that for multiple 7B LLMs fine-tuned using HiPO and DPO on the Math Stack Exchange preference dataset, the models trained with HiPO outperform the others on a variety of common math benchmarks and achieve greater organization, logical flow, and consistency as measured by GPT-4.1.
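The weighted-sum construction described above can be written down directly. The sketch below treats each segment's DPO term as a function of the policy-vs-reference log-probability margins for the preferred and dispreferred responses; the segment names and weights are illustrative assumptions, not HiPO's tuned values.

```python
import math

def dpo_loss(logp_w, logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair, given
    the policy-vs-reference log-probability margin of each response."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * (logp_w - logp_l))))

def hipo_loss(segment_logps, weights, beta=0.1):
    """Sketch of the hierarchical variant: a weighted sum of per-segment
    DPO losses. `segment_logps` maps segment name -> (logp_w, logp_l);
    segment names like 'context'/'steps'/'answer' are illustrative."""
    return sum(weights[s] * dpo_loss(w, l, beta)
               for s, (w, l) in segment_logps.items())
```

With a single segment and unit weight, the hierarchical loss reduces exactly to plain DPO, which is why the method keeps DPO's stability while adding per-segment granularity.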
Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
🔥 Citations: 0
Abstract: We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
🔥 Citations: 0
Abstract: New geospatial foundation models introduce a new model architecture and pretraining dataset, often sampled using different notions of data diversity. Performance differences are largely attributed to the model architecture or input modalities, while the role of the pretraining dataset is rarely studied. To address this research gap, we conducted a systematic study on how the geographic composition of pretraining data affects a model's downstream performance. We created global and per-continent pretraining datasets and evaluated them on global and per-continent downstream datasets. We found that the pretraining dataset from Europe outperformed global and continent-specific pretraining datasets on both global and local downstream evaluations. To investigate the factors influencing a pretraining dataset's downstream performance, we analysed 10 pretraining datasets using diversity across continents, biomes, landcover and spectral values. We found that only spectral diversity was strongly correlated with performance, while others were weakly correlated. This finding establishes a new dimension of diversity to be accounted for when creating a high-performing pretraining dataset. We open-sourced 7 new pretraining datasets, pretrained models, and our experimental framework at https://github.com/kerner-lab/pretrain-where.
Real Time Fraud Detection at Scale in High Volume Enterprise Payment Ecosystems
DOI: 10.54097/51td8c59
🔥 Citations: 0
Abstract: The rapid proliferation of digital payment channels has created unprecedented opportunities for fraudulent activity, compelling enterprise organizations to deploy increasingly sophisticated detection mechanisms capable of operating at high throughput with minimal latency. This paper presents a comprehensive review of real-time fraud detection (RTFD) methodologies applied within high-volume enterprise payment ecosystems, encompassing machine learning (ML), deep learning (DL), graph neural networks (GNN), and streaming data architectures. We examine how ensemble methods, anomaly detection frameworks, and feature engineering pipelines converge to form robust, production-grade fraud detection systems (FDS). The paper further discusses the tension between detection accuracy and operational latency, the challenge of class imbalance in transactional datasets, and the evolving regulatory landscape that shapes deployment constraints. By synthesizing findings from recent literature, this review identifies key trends including federated learning (FL) for privacy-preserving fraud detection, transformer-based sequence models for behavioral analysis, and adaptive threshold mechanisms for dynamic fraud pattern recognition. Our analysis reveals that no single algorithmic approach suffices in isolation; rather, layered architectures combining rule-based systems with data-driven models consistently achieve superior performance across precision, recall, and throughput metrics in enterprise-scale deployments.
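The layered architecture this review favors, rules combined with data-driven models, can be caricatured in a few lines. The thresholds, rule set, and decision labels below are invented for illustration and do not come from the paper.

```python
def decide(txn, model_score, rules, review_threshold=0.7, block_threshold=0.9):
    """Toy layered fraud decision: a deterministic rule layer short-circuits
    first, then a model-score layer decides the rest. `rules` is a list of
    predicates over the transaction dict; all values are hypothetical."""
    if any(rule(txn) for rule in rules):
        return "block"                      # hard rule fired: no model needed
    if model_score >= block_threshold:
        return "block"                      # model is highly confident
    if model_score >= review_threshold:
        return "review"                     # route to a human analyst
    return "allow"
```

Keeping the rule layer in front is also what bounds latency: most legitimate traffic never needs the expensive model path re-evaluated under stricter thresholds.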
Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis
🔥 Citations: 0
Abstract: We study whether Large Language Models (LLMs) can perform feature model analysis operations (AOs) directly on semi-formal textual blueprints, i.e., concise constrained-language descriptions of feature hierarchies and constraints, enabling early validation in Software Product Line scoping. Using 12 state-of-the-art LLMs and 16 standard AOs, we compare their outputs against the solver-based oracle FLAMA. Results show that reasoning-optimized models (e.g., Grok 4 Fast Reasoning, Gemini 2.5 Pro) achieve 88-89% average accuracy across all evaluated blueprints and operations, approaching solver correctness. We identify systematic errors in structural parsing and constraint reasoning, and highlight accuracy-cost trade-offs that inform model selection. These findings position LLMs as lightweight assistants for early variability validation.
LLM-guided phase diagram construction through high-throughput experimentation
🔥 Citations: 0
Abstract: Constructing phase diagrams for multicomponent alloys requires extensive experimental measurements and is a time-consuming task. Here we investigate whether large language models (LLMs) can guide experimental planning for phase diagram construction. In our framework, a general-purpose LLM serves as the experimental planner, suggesting compositions for measurement at each cycle in a closed loop with high-throughput synthesis and X-ray diffraction phase identification. Using this framework, we experimentally constructed the ternary phase diagram of the Co-Al-Ge system at 900 °C through iterative synthesis and characterization. We compared two strategies that differ in how the initial compositions are selected: one uses predictions from a domain-specific LLM trained on phase diagram data (aLLoyM), while the other relies solely on the general-purpose LLM. The two strategies exhibited complementary strengths. aLLoyM directed the initial measurements toward compositionally complex regions in the interior of the ternary diagram, enabling the earliest discovery of all three novel phases that form only in the ternary system. In contrast, the general-purpose LLM adopted a textbook-like approach that efficiently identified a larger number of phases in fewer cycles. In addition, a simulated benchmark comparing the LLM against conventional machine learning confirmed that the LLM achieves more efficient exploration. The results demonstrate that LLMs have high potential as experimental planners for phase diagram construction.
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
🔥 Citations: 0
Abstract: As state-of-the-art Large Language Models (LLMs) have become ubiquitous, ensuring equitable performance across diverse demographics is critical. However, it remains unclear whether these disparities arise from the explicitly stated identity itself or from the way identity is signaled. In real-world interactions, users' identity is often conveyed implicitly through a complex combination of various socio-linguistic factors. This study disentangles these signals by employing a factorial design with over 24,000 responses from two open-weight LLMs (Gemma-3-12B and Qwen-3-VL-8B), comparing prompts with explicitly announced user profiles against implicit dialect signals (e.g., AAVE, Singlish) across various sensitive domains. Our results uncover a unique paradox in LLM safety where users achieve "better" performance by sounding like a demographic than by stating they belong to it. Explicit identity prompts activate aggressive safety filters, increasing refusal rates and reducing semantic similarity compared to our reference text for Black users. In contrast, implicit dialect cues trigger a powerful "dialect jailbreak," reducing refusal probability to near zero while simultaneously achieving a greater level of semantic similarity to the reference texts compared to Standard American English prompts. However, this "dialect jailbreak" introduces a critical safety trade-off regarding content sanitization. We find that current safety alignment techniques are brittle and over-indexed on explicit keywords, creating a bifurcated user experience where "standard" users receive cautious, sanitized information while dialect speakers navigate a less sanitized, more raw, and potentially more hostile information landscape. This highlights a fundamental tension in alignment between equitable treatment and linguistic diversity, and underscores the need for safety mechanisms that generalize beyond explicit cues.
All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG
🔥 引用:
0
Abstract: Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such ``answer-critical'' documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.
Exploring Spatial Intelligence from a Generative Perspective
🔥 引用:
0
Abstract: Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.
Self-supervised pretraining for an iterative image size agnostic vision transformer
🔥 引用:
0
Abstract: Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.
HireSense: An AI-Powered Semantic Matchmaking and Zero-Commission Invoicing Framework for Multilingual Labor Markets
🔥 引用:
0
Abstract: The rapid expansion of the global freelancing economy has highlighted several inefficiencies in existing platforms, including high commission fees, limited language accessibility, and ineffective talent matching mechanisms. This paper presents HireSense, an AI-driven freelancing platform specifically designed for multilingual markets such as India. Unlike conventional systems that rely on keyword-based matching, the proposed platform employs a semantic matchmaking pipeline using sentence embeddings and large language models to improve accuracy in freelancer–job alignment. The system integrates a two-stage architecture where candidate profiles are first retrieved using vector similarity and then refined through contextual re-ranking using a large language model. Additionally, HireSense introduces a zero-commission financial model by generating automated invoices for direct payments via UPI, eliminating intermediary costs. The platform also supports multilingual interaction and voice-based job posting, enhancing accessibility for users with diverse linguistic backgrounds. Experimental evaluation demonstrates improved precision and relevance compared to traditional keyword-based systems. The platform is deployed using a cost-efficient serverless architecture, ensuring scalability without infrastructure overhead. The results indicate that HireSense provides a practical and inclusive solution for modern digital labor markets.
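The two-stage architecture described here (vector-similarity retrieval followed by contextual re-ranking) can be sketched in a few lines. This is a minimal illustration with random toy vectors standing in for sentence embeddings; the `cosine_top_k` helper and the seeded data are hypothetical, and the LLM re-ranking stage is left out.

```python
import numpy as np

def cosine_top_k(query_vec, profile_vecs, k=3):
    """Stage 1: shortlist candidate profiles by cosine similarity."""
    sims = profile_vecs @ query_vec / (
        np.linalg.norm(profile_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Toy embeddings standing in for sentence-embedding vectors of 10 profiles.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(10, 8))
query = profiles[4] + 0.05 * rng.normal(size=8)  # a query close to profile 4

idx, scores = cosine_top_k(query, profiles, k=3)
# Stage 2 (omitted) would pass these k candidates to an LLM for re-ranking.
```

In a real system the shortlist would then be re-ordered by an LLM prompted with the job description and each candidate profile.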
Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
🔥 引用:
0
Abstract: As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (a 1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through third quartile sampling (Q3) or tuning the retention rate is necessary to fully leverage the multilingual signal.
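One plausible reading of the third-quartile (Q3) strategy mentioned above is to keep only documents whose quality-classifier score exceeds the 75th percentile of the score distribution. The sketch below uses synthetic scores; the actual Q3 procedure in the paper may differ.

```python
import numpy as np

def q3_filter(scores, docs):
    """Keep only documents scoring above the third quartile (75th percentile)."""
    q3 = np.percentile(scores, 75)
    kept = [d for d, s in zip(docs, scores) if s > q3]
    return kept, q3

scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
docs = list("abcdefgh")
kept, q3 = q3_filter(scores, docs)  # q3 = 0.625; keeps the two best documents
```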
Evaluating analysis methods for coast guard reports freeform text: a case study on resource-constrained natural language processing with search and rescue reports
🔥 引用:
0
Abstract: After a U.S. Coast Guard (USCG) search and rescue (SAR) case, USCG personnel create an after-action report containing a textual narrative of the situation and Coast Guard response efforts. Data analysts explored how to identify reports involving cases with a verified person in the water. With restricted access to compute resources and limiting policy, large language models (LLMs) could not be utilized, so statistical (‘classical’ and non-neural) methods were considered for training a classification model to identify SAR case outcomes from report texts. The dataset was severely imbalanced toward the negative class, and the texts were extremely messy, with many typos and abbreviations. Therefore, an extensive text cleaning pipeline was developed and tested for improving classification performance. The Iterative Token Elimination Algorithm (iTEA) was developed to increase differences in vocabulary between classes. Model improvement was further explored through augmentation of the feature space using non-text data. The best model was an XGBoost model, achieving 0.762 recall and precision (and 0.959 accuracy). Errors from the test set are analyzed to guide future improvements until LLMs can be used, which are expected to improve performance and reduce text cleaning requirements.
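The abstract does not spell out how iTEA works internally; as a hypothetical single pass of the underlying idea, one can drop tokens whose relative frequency is similar in both classes and keep the class-distinctive vocabulary. Everything below (the `distinctive_vocab` helper, the ratio threshold, the toy documents) is an illustrative assumption, not the paper's algorithm.

```python
from collections import Counter

def distinctive_vocab(pos_docs, neg_docs, ratio=0.5):
    """Keep tokens whose normalized frequency differs between classes;
    drop tokens occurring at similar rates in both (one elimination pass)."""
    pos = Counter(t for d in pos_docs for t in d.split())
    neg = Counter(t for d in neg_docs for t in d.split())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    keep = set()
    for tok in set(pos) | set(neg):
        fp, fn = pos[tok] / n_pos, neg[tok] / n_neg
        lo, hi = sorted((fp, fn))
        if lo / hi < ratio:  # frequencies differ enough -> class-distinctive
            keep.add(tok)
    return keep

pos_docs = ["person in water", "person overboard in water"]
neg_docs = ["vessel adrift", "vessel disabled in fog"]
vocab = distinctive_vocab(pos_docs, neg_docs)
# "person"/"water" survive; the shared token "in" is eliminated
```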
Surrogate modeling for interpreting black-box LLMs in medical predictions
🔥 引用:
0
Abstract: Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
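As a toy version of this idea, one can fit a simple interpretable surrogate (here ordinary least squares) to input-output pairs collected by repeatedly prompting the black-box model across simulated scenarios; the recovered coefficients then quantify how strongly each input variable appears to drive the output. The data below is synthetic, a stand-in for real prompt-response logs.

```python
import numpy as np

# Synthetic stand-in for (scenario, model-response) pairs gathered by prompting.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three simulated input variables
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)  # mock outputs

# Linear surrogate: coefficients approximate each variable's influence.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
# coef is close to [2.0, 0.0, -0.5, 0.0]: variable 1 is "perceived" as irrelevant
```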
Context-Aware Multi-Source Intelligent Knowledge Retrieval System
🔥 引用:
0
Abstract: Enterprise organisations manage large volumes of knowledge that is not centralised but distributed across multiple heterogeneous sources, such as structured datasets, web pages, and documents. Traditional keyword-based systems fail to retrieve contextually relevant information, producing incomplete or inaccurate responses, and even large language models (LLMs) generate hallucinated outputs when operating without reliable knowledge grounding. To overcome these challenges, this paper proposes a context-aware multi-source intelligent knowledge retrieval system based on the retrieval-augmented generation (RAG) framework. The system integrates multiple data sources, including PDFs, websites, and structured data repositories.
AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models
🔥 引用:
0
Abstract: How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs, Llama-3-8B-IT and Gemma-3-12B-IT, on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages, all resources are available at https://huggingface.co/afrilang-edu.
Hybrid Latent Reasoning with Decoupled Policy Optimization
🔥 引用:
0
Abstract: Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
🔥 引用:
0
Abstract: Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose the Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction: the agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, the agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. Experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a ``knowledge inheritance'' phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.
Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics
🔥 引用:
0
Abstract: Timely and interpretable early warning of sepsis remains a major clinical challenge due to the complex temporal dynamics of physiological deterioration. Traditional data-driven models often provide accurate yet opaque predictions, limiting physicians' confidence and clinical applicability. To address this limitation, we propose a Large Language Model (LLM)-guided temporal simulation framework that explicitly models physiological trajectories prior to disease onset for clinically interpretable prediction. The framework consists of a spatiotemporal feature extraction module that captures dynamic dependencies among multivariate vital signs, a Medical Prompt-as-Prefix module that embeds clinical reasoning cues into LLMs, and an agent-based post-processing component that constrains predictions within physiologically plausible ranges. By first simulating the evolution of key physiological indicators and then classifying sepsis onset, our model offers transparent prediction mechanisms that align with clinical judgment. Evaluated on the MIMIC-IV and eICU databases, the proposed method achieves superior AUC scores (0.861-0.903) across prediction horizons ranging from 24 to 4 hours before onset, outperforming conventional deep learning and rule-based approaches. More importantly, it provides interpretable trajectories and risk trends that can assist clinicians in early intervention and personalized decision-making in intensive care environments.
LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures
🔥 引用:
0
Abstract: Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.
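The vulnerability criterion above rests on the Jensen-Shannon divergence between pre- and post-perturbation output distributions. Its standard form (here with log base 2, so values lie in [0, 1]) is easy to state:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability distributions,
    using log base 2 so the result lies in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):  # KL(a || b), skipping zero-probability terms
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Disjoint distributions are maximally divergent; identical ones score 0.
print(js_divergence([1, 0], [0, 1]))          # 1.0
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```

In LayerTracer's setting, p and q would be the vocabulary distributions of a layer's output before and after mask perturbation; the layer maximizing this quantity is flagged as vulnerable.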
SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
🔥 引用:
0
Abstract: Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU>0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
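Morton-curve (Z-order) serialization, which SpaCeFormer uses to obtain spatially coherent feature ordering, interleaves the bits of the quantized point coordinates so that sorting by the resulting code keeps nearby points close in the 1D sequence. A minimal sketch:

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of quantized x, y, z into one Morton code;
    sorting points by this code yields a spatially coherent 1D ordering."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

pts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
codes = [morton3d(*p) for p in pts]  # [0, 1, 2, 7]
order = sorted(range(len(pts)), key=lambda i: codes[i])
```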
RAG-BASED LEGAL/GOVERNMENT SCHEME ASSISTANT
DOI:
10.55041/ijsrem60964
🔥 引用:
0
Abstract: Access to accurate information about government welfare schemes remains a major challenge for many citizens, particularly in rural and underserved communities. Although official portals provide scheme details, users often struggle to identify programs relevant to their eligibility due to complex documentation and scattered data sources. This paper presents SchemeAssist, an intelligent AI-powered chatbot built using Retrieval-Augmented Generation (RAG) to deliver reliable, context-aware responses regarding government schemes. The proposed system integrates a large language model with a vector database containing curated government scheme documents. When a user submits a query, the system retrieves relevant information and generates precise responses grounded in verified data, thereby reducing hallucinations commonly associated with generative AI models. The chatbot supports natural language interaction, enabling users to ask about eligibility, required documents, benefits, and application procedures. Experimental evaluation demonstrates improved response accuracy, reduced misinformation, and enhanced user accessibility compared to traditional keyword-based search systems. The solution is scalable, cost-effective, and suitable for deployment in public service platforms, educational kiosks, and mobile applications, ultimately promoting digital inclusion and informed decision-making.
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
🔥 引用:
0
Abstract: Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
🔥 引用:
0
Abstract: Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks. Recent Large Language Model (LLM)-based approaches introduce richer semantic signals, but still suffer from a restricted feature space due to fixed generation patterns and from the absence of feedback from the learning objective. To address these challenges, we propose a Memory-Augmented LLM-based Multi-Agent System (MALMAS) for automated feature generation. MALMAS decomposes the generation process into agents with distinct responsibilities, and a Router Agent activates an appropriate subset of agents per iteration, further broadening exploration of the feature space. We further integrate a memory module comprising procedural memory, feedback memory, and conceptual memory, enabling iterative refinement that adaptively guides subsequent feature generation and improves feature quality and diversity. Extensive experiments on multiple public datasets against state-of-the-art baselines demonstrate the effectiveness of our approach. The code is available at https://github.com/fxdong24/MALMAS.
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention
🔥 引用:
0
Abstract: Scaling large language models to long contexts is challenging due to the quadratic computational cost of full attention. Mitigation approaches include KV-cache selection or compression techniques. We instead provide an effective and end-to-end learnable bridge between the two without requiring architecture modification. In particular, our key insight is that interleaved gist compression tokens -- which provide a learnable summary of sets of raw tokens -- can serve as routing signals for sparse attention. Building on this, we introduce selective unfolding via Gist Sparse Attention (GSA), which first compresses the context into gist tokens, then selects the most relevant gists, and subsequently restores the corresponding raw chunks for detailed attention. This yields a simple coarse-to-fine mechanism that combines compact global representations with targeted access to fine-grained evidence. We further incorporate this process directly into training in an end-to-end fashion, avoiding the need for external retrieval modules. In addition, we extend the framework hierarchically via recursive gist-of-gist construction, enabling multi-resolution context access with logarithmic per-step decoding complexity. Empirical results on LongBench and RAG benchmarks demonstrate that our method consistently outperforms other compression baselines as well as inference-time sparse attention methods across compression ratios from 8x to 32x. The code is available at: https://github.com/yuzhenmao/gist-sparse-attention/
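The compress-select-unfold loop can be illustrated with a numpy toy in which mean pooling stands in for learned gist tokens. The real method trains the gists end-to-end, so every detail below (pooling, dot-product scoring, fixed chunk size) is a simplifying assumption:

```python
import numpy as np

def coarse_to_fine_select(tokens, query, chunk=4, k=2):
    """(1) Compress each chunk of token vectors into a 'gist' (mean pooling
    as a stand-in for learned gist tokens), (2) score gists against the
    query, (3) unfold the raw tokens of the top-k chunks for fine attention."""
    n = len(tokens) // chunk
    gists = tokens[: n * chunk].reshape(n, chunk, -1).mean(axis=1)
    top = np.sort(np.argsort(-(gists @ query))[:k])
    return np.concatenate([tokens[i * chunk:(i + 1) * chunk] for i in top])

rng = np.random.default_rng(1)
tokens = rng.normal(size=(16, 8))   # 16 token vectors, 4 chunks of 4
tokens[4:8] += 5.0                  # chunk 1 carries a strong signal
query = np.ones(8)
selected = coarse_to_fine_select(tokens, query)  # 2 chunks = 8 raw tokens
```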
Differentiable Conformal Training for LLM Reasoning Factuality
🔥 引用:
0
Abstract: Large Language Models (LLMs) frequently hallucinate, limiting their reliability in critical applications. Conformal Prediction (CP) addresses this by calibrating error rates on held-out data to provide statistically valid confidence guarantees. Recent work extends CP to LLM factuality to filter out risky claims, ensuring that hallucination rates remain below a user-specified level (e.g., 10%). While prior methods treat claims independently, Coherent Factuality extends to multi-step reasoning by representing outputs as dependency graphs and jointly validating claims with their logical ancestors. A key limitation is that Coherent Factuality is not differentiable, requiring hand-crafted scorers that at high reliability levels remove nearly 60% of true claims. We introduce Differentiable Coherent Factuality (DCF), a fully differentiable relaxation that enables learning improved scorers while provably recovering the original algorithm's guarantees. Experiments on two benchmark reasoning datasets demonstrate DCF achieves up to 141% improvement in claim retention while maintaining reliability guarantees, representing a significant step towards reliable conformal LLM systems.
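The CP backbone referenced above is typically split conformal prediction: take the ceil((n+1)(1-alpha))/n empirical quantile of held-out nonconformity scores as the acceptance threshold, which bounds the error rate at alpha for exchangeable data. The sketch below is that generic recipe, not the paper's differentiable DCF objective, and the score array is synthetic.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

cal_scores = np.arange(1, 101) / 100.0  # synthetic scores 0.01 .. 1.00
t = conformal_threshold(cal_scores, alpha=0.1)  # 0.92
# A claim is then accepted only if its nonconformity score is <= t.
```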
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
🔥 引用:
0
Abstract: Long-horizon interactive environments such as games are a natural testbed for evaluating agents' skill usage. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision making under delayed rewards and partial observability. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision making because they lack a mechanism to discover, retain, and reuse structured skills across episodes. We present COSPLAY, a co-evolution framework in which an LLM decision agent retrieves skills from a learnable skill bank to guide action taking, while an agent-managed skill pipeline discovers reusable skills from the agent's unlabeled rollouts to form the skill bank. Our framework improves the decision agent's skill retrieval and action generation, while the skill bank agent continually extracts, refines, and updates skills together with their contracts. Experiments across six game environments show that COSPLAY with an 8B base model achieves over a 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks while remaining competitive on multi-player social reasoning games.
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
🔥 引用:
0
Abstract: Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
TRAVELFRAUDBENCH: A Configurable Evaluation Framework for GNN Fraud Ring Detection in Travel Networks
🔥 引用:
0
Abstract: We introduce TravelFraudBench (TFG), a configurable benchmark for evaluating graph neural networks (GNNs) on fraud ring detection in travel platform graphs. Existing benchmarks--YelpChi, Amazon-Fraud, Elliptic, PaySim--cover single node types or domain-generic patterns with no mechanism to evaluate across structurally distinct fraud ring topologies. TFG simulates three travel-specific ring types--ticketing fraud (star topology with shared device/IP clusters), ghost hotel schemes (reviewer x hotel bipartite cliques), and account takeover rings (loyalty transfer chains)--in a heterogeneous graph with 9 node types and 12 edge types. Ring size, count, fraud rate, scale (500 to 200,000 nodes), and composition are fully configurable. We evaluate six methods--MLP, GraphSAGE, RGCN-proj, HAN, RGCN, and PC-GNN--under a ring-based split where each ring appears entirely in one partition, eliminating transductive label leakage. GraphSAGE achieves AUC=0.992 and RGCN-proj AUC=0.987, outperforming the MLP baseline (AUC=0.938) by 5.5 and 5.0 pp, confirming graph structure adds substantial discriminative power. HAN (AUC=0.935) is a negative result, matching the MLP baseline. On the ring recovery task (>=80% of ring members flagged simultaneously), GraphSAGE achieves 100% recovery across all ring types; MLP recovers only 17-88%. The edge-type ablation shows device and IP co-occurrence are the primary signals: removing uses_device drops AUC by 5.2 pp. TFG is released as an open-source Python package (MIT license) with PyG, DGL, and NetworkX exporters and pre-generated datasets at https://huggingface.co/datasets/bsajja7/travel-fraud-graphs, with Croissant metadata including Responsible AI fields.
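The ring-based split is the detail that makes this evaluation honest: every node of a given fraud ring must land in the same partition, so a model cannot memorize ring membership across train and test. A minimal sketch with a hypothetical `ring_split` helper and toy data:

```python
import random

def ring_split(node_rings, test_frac=0.2, seed=0):
    """Partition nodes so each fraud ring falls wholly in train or test,
    preventing transductive leakage of ring-membership labels."""
    rings = sorted(set(node_rings.values()))
    random.Random(seed).shuffle(rings)
    n_test = max(1, int(len(rings) * test_frac))
    test_rings = set(rings[:n_test])
    train = [n for n, r in node_rings.items() if r not in test_rings]
    test = [n for n, r in node_rings.items() if r in test_rings]
    return train, test

node_rings = {f"n{i}": i % 5 for i in range(20)}  # 5 rings, 4 nodes each
train_nodes, test_nodes = ring_split(node_rings)  # one whole ring held out
```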
Forecasting Individual NetFlows using a Predictive Masked Graph Autoencoder
🔥 引用:
0
Abstract: In this paper, we propose a proof-of-concept Graph Neural Network model that can successfully predict network flow-level traffic (NetFlow) by accurately modelling the graph structure and the connection features. We use sliding-windows to split the network traffic in equal-sized heterogeneous bidirectional graphs containing IP, Port, and Connection nodes. We then use the GNN to model the evolution of the graph structure and the connection features. Our approach shows superior results when identifying the Port and IP to which connections attach, while feature reconstruction remains competitive with strong forecasting baselines. Overall, our work showcases the use of GNNs for per-flow NetFlow prediction.
Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems
🔥 Citations:
0
Abstract: This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
Computational Analysis of Azole Derivatives Targeting the PI3K/AKT/mTOR Pathway With In Vitro Cytotoxicity and Autophagy Evaluation
DOI:
10.1002/jbt.70859
🔥 Citations:
0
Abstract:
This study aims to identify potential azole‐derived inhibitors targeting the PI3K/AKT/mTOR signaling pathway involved in autophagy regulation and cancer progression. A structure‐based virtual screening approach was employed using molecular docking, molecular dynamics (MD) simulations, and free energy calculations (MMGBSA and MMPBSA). The pharmacokinetic profiles and toxicity of lead compounds were assessed using ADMET analysis. In vitro validation was performed using MTT and MDC staining assays on MDA‐MB‐231 breast cancer cells. Among the screened compounds, KR4 demonstrated strong binding affinity towards all three kinases (−8.289, −5.222, and −6.331 kcal/mol, respectively), with favorable pharmacokinetic properties. MD simulation confirmed the stability of the KR4‐protein complexes, while post‐MD MMPBSA analysis validated the binding energetics. In vitro studies revealed dose‐dependent cytotoxicity of KR4 (IC₅₀ ≈ 39 µM) and induction of autophagy in treated cells. The integration of in silico and in vitro approaches highlights KR4 as a promising multi‐target inhibitor of the PI3K/AKT/mTOR pathway with potential anti‐cancer properties. These findings support further exploration of KR4 for therapeutic development.
Comparative Evaluation of Five Multimodal Large Language Models for Medical Laboratory Image Recognition: Impact of Prompting Strategies on Diagnostic Accuracy
🔥 Citations:
0
Abstract: Background: Multimodal large language models (MLLMs) show promise in medical imaging, but their performance is highly dependent on prompt engineering. This study systematically evaluates how different prompting strategies affect diagnostic accuracy in clinical laboratory image interpretation. Methods: We evaluated five MLLMs (ChatGPT-4o, Gemini 2.0 Flash, Claude 3.5 Sonnet, Grok-2, and Perplexity Pro (Claude 3.5 Sonnet)) using 177 proficiency testing images across three domains: blood smears (n = 78), urinalysis (n = 50), and parasitology (n = 49). Three prompting approaches were compared: (1) complex multi-choice prompts with 20 diagnostic options, (2) zero-shot open-ended prompts, and (3) two-step descriptive-reasoning prompts. Images were sourced from the Taiwan Society of Laboratory Medicine external quality assurance archives with expert consensus diagnoses. Results: Zero-shot prompting significantly outperformed complex multi-choice prompts across all models and domains (p < 0.001). With zero-shot prompts, Gemini achieved 78.5% overall accuracy (urinalysis: 92.0%; parasitology: 75.5%; blood smears: 64.1%), representing a 17% improvement over complex prompts. Two-step descriptive-reasoning prompts further improved blood smear accuracy by 8–12% for top-performing models, but showed minimal benefit in urinalysis and parasitology. The re-query mechanism (“please reconsider”) improved urinalysis accuracy by 7.6% but had a negligible effect on blood smears and parasitology. Conclusions: Prompting strategy critically determines MLLM diagnostic performance. Zero-shot approaches with minimal constraints consistently outperform complex multi-choice formats. The remarkable performance of general-purpose models in structured domains like urinalysis (>90% accuracy) demonstrates the considerable progress of multimodal AI. 
However, complex morphological tasks like blood smear interpretation require either specialized prompting techniques or domain-specific fine-tuning. These findings provide evidence-based guidance for optimizing AI integration in clinical laboratories.
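The three prompting strategies compared above can be made concrete with templates. The wording below is an illustrative assumption, not the authors' exact protocol:

```python
def build_prompts(options):
    """Illustrative versions of the three prompting styles compared in
    the study: multi-choice, zero-shot open-ended, and two-step
    describe-then-reason. Exact wording here is an assumption."""
    multi_choice = (
        "Identify the finding in this laboratory image. "
        "Choose one of the following options:\n"
        + "\n".join(f"{i + 1}. {o}" for i, o in enumerate(options))
    )
    zero_shot = "What abnormality, organism, or cell type is shown in this image?"
    two_step = [
        "Step 1: Describe the morphological features you observe in this image.",
        "Step 2: Based on your description, state the most likely diagnosis.",
    ]
    return multi_choice, zero_shot, two_step
```

The study's finding is that the shortest of these (zero-shot) performed best overall, while the two-step form helped mainly on blood smears.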
StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model
🔥 Citations:
0
Abstract: Storm surge forecasting remains a critical challenge in mitigating the impacts of tropical cyclones on coastal regions, particularly given recent trends of rapid intensification and increasing nearshore storm activity. Traditional high-fidelity numerical models such as ADCIRC, while robust, are often hindered by inevitable uncertainties arising from various sources. To address these challenges, this study introduces StormNet, a spatio-temporal graph neural network (GNN) designed for bias correction of storm surge forecasts. StormNet integrates graph convolutional (GCN) and graph attention (GAT) mechanisms with long short-term memory (LSTM) components to capture complex spatial and temporal dependencies among water-level gauge stations. The model was trained using historical hurricane data from the U.S. Gulf Coast and evaluated on Hurricane Idalia (2023). Results demonstrate that StormNet can effectively reduce the root mean square error (RMSE) in water-level predictions by more than 70% for 48-hour forecasts and above 50% for 72-hour forecasts, as well as outperform a sequential LSTM baseline, particularly for longer prediction horizons. The model also exhibits low training time, enhancing its applicability in real-time operational forecasting systems. Overall, StormNet provides a computationally efficient and physically meaningful framework for improving storm surge prediction accuracy and reliability during extreme weather events.
HealthCare Agent AI
🔥 Citations:
0
Abstract: This study explores the growing transition in healthcare from rigid, rule-based algorithms to intelligent, self-directed “agentic” AI systems. These agents rely on Large Language Models (LLMs) to think step-by-step, make decisions, and carry out tasks using a simple yet powerful four-part structure: planning, action, reflection, and memory.
The paper reviews their current uses in medical diagnosis, hospital workflow automation, and collaborative multi-agent setups. It also highlights a major challenge — most testing still happens in controlled lab settings rather than real hospitals — and discusses what is needed for safe everyday adoption.
Keywords: Medical AI Agents, Autonomous Clinical Systems, Large Language Models, Chain-of-Thought Reasoning, Multimodal Integration, Agentic Framework, Clinical Decision Support, Human-in-the-Loop, Healthcare Automation, Physician Burnout, Precision Medicine, Scoping Review, Simulation Gap, PRISMA-ScR.
Healthcare AI agents support several key areas. They strengthen diagnostic accuracy in critical situations such as predicting sepsis or identifying cancer on scans. They lighten the administrative load by automatically creating summaries from patient records, which helps reduce doctor burnout. They also improve patient involvement by offering personalised guidance, voice-based recovery check-ins, and easier remote care. In addition, groups of specialist agents can work together on complicated cases, such as building complete treatment plans.
EST-GNN: An Explainable Spatio-Temporal Graph Framework with Lévy-Optuna Optimization for CO2 Emission Forecasting in Electrified Transportation
🔥 Citations:
0
Abstract: The accurate and explainable prediction of carbon emissions is crucial for the efficient operation of hybrid and electrified transportation systems and their integration with energy grids. An Explainable Spatio-Temporal Graph Neural Network (EST-GNN) is proposed for highly precise CO2 emission forecasting using Lévy Flight-guided Optuna optimization. By modelling vehicles and their operational characteristics as nodes in a dynamic graph, the proposed framework can jointly learn temporal and spatial correlations while sustaining interpretability. The accuracy of the EST-GNN model is compared with models based on one-hot encoded features, SMOTE-enhanced datasets, and ensemble regressors. Experiments are conducted on a real-world dataset of 7385 vehicle registrations with 12 predictive features. The EST-GNN model outperformed all baseline and traditional models, achieving the highest reliability (R2 = 0.98754) alongside competitive error metrics (RMSE = 6.55, MAE = 2.556), confirming its suitability for resource-constrained and real-time applications, whereas traditional machine learning (ML) techniques showed comparatively low reliability. The optimal solution ensures scalability, robustness, and independence of the deployment environment. The distribution analysis of best-performing models underscores the strength of EST-GNN, which accounts for the largest proportion of best results across evaluation metrics. To achieve superior predictive accuracy, graph-based learning, explainability, and advanced hyperparameter optimization are combined. EST-GNN provides a powerful tool for analyzing fleet emission levels, making energy-aware decisions, and planning sustainable transportation, while conventional ML models remain a useful complement for deployment settings with tight computation budgets and fast response requirements.
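The reported R2, RMSE, and MAE follow from predictions and ground truth by their standard definitions; a pure-Python sketch:

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard R^2, RMSE, and MAE, the three metrics reported for
    EST-GNN (generic definitions, not the paper's code)."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot          # coefficient of determination
    rmse = math.sqrt(ss_res / n)      # root mean square error
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae
```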
Hallucination Inspector: A Fact-Checking Judge for API Migration
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in automated software engineering for tasks such as API migration. While LLMs are able to identify migration patterns, they often make mistakes and fail to produce correct glue code to invoke the new API in place of the old one. We call this issue Scaffolding Hallucination, a failure mode where models generate incorrect calling contexts by inventing Phantom Symbols -- such as imaginary imports, constructors, and constants -- that do not exist in the API specification. In this paper, we show that standard metrics cannot be relied upon to detect these instances of hallucination. We propose Hallucination Inspector, a static analysis tool to detect Scaffolding Hallucination in LLM-generated code. Our approach includes a lightweight evaluation framework that verifies symbols extracted from the abstract syntax tree against a knowledge base derived directly from software documentation for the API. A preliminary evaluation on Android API migrations demonstrates that our approach successfully identifies hallucinations and significantly reduces false positives compared to standard metrics and probabilistic judges.
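The core check, comparing AST-extracted symbols against a documentation-derived knowledge base, can be illustrated in a few lines. This is a simplified sketch: the `phantom_symbols` name and the flat symbol set are assumptions, not the tool's actual interface, and the real system parses the migration language rather than Python.

```python
import ast

def phantom_symbols(code, known_symbols):
    """Flag imported names and called attributes that are absent from
    an API knowledge base (a toy version of the paper's idea)."""
    tree = ast.parse(code)
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            # record fully qualified imported names
            used.update(f"{node.module}.{a.name}" for a in node.names)
        elif isinstance(node, ast.Attribute):
            # record method/attribute names used anywhere in the code
            used.add(node.attr)
    return sorted(s for s in used if s not in known_symbols)
```

Any symbol the model invented (a Phantom Symbol) shows up in the returned list, while standard text-similarity metrics would happily score the hallucinated code as plausible.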
WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring
🔥 Citations:
0
Abstract: Wildfire monitoring requires timely, actionable situational awareness from airborne platforms, yet existing aerial visual question answering (VQA) benchmarks do not evaluate wildfire-specific multimodal reasoning grounded in thermal measurements. We introduce WildFireVQA, a large-scale VQA benchmark for aerial wildfire monitoring that integrates RGB imagery with radiometric thermal data. WildFireVQA contains 6,097 RGB-thermal samples, where each sample includes an RGB image, a color-mapped thermal visualization, and a radiometric thermal TIFF, and is paired with 34 questions, yielding a total of 207,298 multiple-choice questions spanning presence and detection, classification, distribution and segmentation, localization and direction, cross-modal reasoning, and flight planning for operational wildfire intelligence. To improve annotation reliability, we combine multimodal large language model (MLLM)-based answer generation with sensor-driven deterministic labeling, manual verification, and intra-frame and inter-frame consistency checks. We further establish a comprehensive evaluation protocol for representative MLLMs under RGB, Thermal, and retrieval-augmented settings using radiometric thermal statistics. Experiments show that across task categories, RGB remains the strongest modality for current models, while retrieved thermal context yields gains for stronger MLLMs, highlighting both the value of temperature-grounded reasoning and the limitations of existing MLLMs in safety-critical wildfire scenarios. The dataset and benchmark code are open-source at https://github.com/mobiiin/WildFire_VQA.
Examiner stratification reveals clinically relevant variability in large language model answers to endodontic patient questions
🔥 Citations:
0
Abstract:
Large language models (LLMs) are increasingly used by patients seeking endodontic information, yet their clinical reliability and safety in patient-centred communication remain uncertain.
This study evaluated the clinical reliability and safety of three contemporary LLMs (ChatGPT GPT-4o, Claude Sonnet 4.5, and Gemini 3 Flash) using 50 patient-centred endodontic questions (35 frequently asked questions and 15 scenario-based prompts). Each question was submitted six times per model in independent sessions. Responses were anonymised and independently assessed by four examiners using a structured Clinical Reliability and Safety Framework. Due to poor inter-examiner agreement, analyses were conducted using examiner stratification. Reproducibility was assessed using word count variability, embedding-based semantic similarity, and lexical distance metrics.
Statistically significant differences in clinical reliability were observed across all examiners. ChatGPT consistently received the lowest scores, whereas Gemini most frequently achieved the highest ratings. Model differentiation was clearer for structured frequently asked questions and selected clinical domains than for scenario-based prompts. All models demonstrated stable response lengths across repeated runs. Gemini showed the highest semantic consistency despite greater surface-level rewording.
Contemporary LLMs demonstrate clinically meaningful variability beyond factual accuracy, particularly in safety framing and clinical actionability. Reliability is influenced by question structure and clinical context. Multidimensional, examiner-aware evaluation frameworks are necessary to meaningfully assess safety and support responsible integration of LLMs into endodontic patient communication.
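The reproducibility metrics named above admit simple sketches: cosine similarity applies to whatever sentence embeddings are used, and a token-set Jaccard distance is one plausible lexical distance (the abstract does not pin down the exact formula, so both function names and the Jaccard choice are assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def jaccard_distance(a, b):
    """Lexical distance between two responses as 1 minus token-set
    Jaccard overlap (an illustrative choice of lexical metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(ta & tb) / len(ta | tb)
```

High embedding similarity combined with high lexical distance is exactly the pattern reported for Gemini: consistent meaning under surface-level rewording.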
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
🔥 Citations:
0
Abstract: Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
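The multiplication-free idea is easy to state in scalar form: each ternary weight selects an add, a subtract, or a no-op. The naive Python sketch below shows the arithmetic only; FairyFuse's actual kernels fuse eight such sub-GEMVs into a single AVX-512 loop with masked adds.

```python
def ternary_matvec(w, x):
    """GEMV with ternary weights in {-1, 0, +1}: no floating-point
    multiplications, only conditional additions and subtractions."""
    out = []
    for row in w:
        acc = 0.0
        for wi, xi in zip(row, x):
            if wi == 1:
                acc += xi       # +1: add
            elif wi == -1:
                acc -= xi       # -1: subtract
            # wi == 0: no-op, the activation is skipped entirely
        out.append(acc)
    return out
```

Because each weight needs only ~2 bits of storage, the weight stream is far smaller than FP16, which is what shifts the memory-bound GEMV toward the compute regime on bandwidth-limited CPUs.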
Language and the framing of historical narrative in large language models: the case of Asia Minor (1922)
🔥 Citations:
0
Abstract: No abstract available; see the original article.
A Hierarchical Spatial Graph Neural Network Resolves Immunogenic and Tolerogenic Tertiary Lymphoid Structures in Renal Cell Carcinoma
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Raahi AI: A Non-Directive AI for Self-Inquiry
DOI:
10.55041/ijsrem60913
🔥 Citations:
0
Abstract: Contemporary large language models (LLMs) are predominantly engineered to maximize answer utility — producing solutions, recommendations, and directive guidance in response to user queries. This paper presents Raahi AI, a conversational AI system designed around an opposing epistemological premise: that genuine psychological self-inquiry is obstructed, not aided, by the delivery of answers. Drawing on the philosophy of Jiddu Krishnamurti — specifically his contention that truth emerges through observation rather than prescribed method — and operationalizing structural parallels with Person-Centered Therapy (PCT) and Motivational Interviewing (MI), we propose a constraint-based prompt architecture that encodes non-directive, dependency-reducing behavior into GPT-4o via MERN stack integration. The system's three core behavioral axioms — (1) problems are symptoms, not roots; (2) dependency on answers is itself the pathology; (3) observation precedes solution — are translated into concrete prompt constraints and evaluated against a custom behavioral rubric. Raahi AI does not advise, diagnose, or solve; it surfaces the questioner's own assumptions through Socratic reflection. Results suggest LLMs can be reliably constrained to simulate a fundamentally different epistemological stance, opening new directions in reflective AI design.
Benchmarking clinical knowledge and multi-modal reasoning of large language models in liver cirrhosis
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models
🔥 Citations:
0
Abstract: Text-to-image (T2I) models like Stable Diffusion and DALL-E have made generative AI widely accessible, yet recent studies reveal that these systems often replicate societal biases, particularly in how they depict demographic groups across professions. Prompts such as 'doctor' or 'CEO' frequently yield lighter-skinned outputs, while lower-status roles like 'janitor' show more diversity, reinforcing stereotypes. Existing mitigation methods typically require retraining or curated datasets, making them inaccessible to most users. We propose a lightweight, inference-time framework that mitigates representational bias through prompt-level intervention without modifying the underlying model. Instead of assuming a single definition of fairness, our approach allows users to select among multiple fairness specifications, ranging from simple choices such as a uniform distribution to more complex definitions informed by a large language model (LLM) that cites sources and provides confidence estimates. These distributions guide the construction of demographic-specific prompt variants in the corresponding proportions, and we evaluate alignment by auditing adherence to the declared target and measuring the resulting skin tone distribution rather than assuming uniformity as 'fairness'. Across 36 prompts spanning 30 occupations and 6 non-occupational contexts, our method shifts observed skin-tone outcomes in directions consistent with the declared target, and reduces deviation from targets when the target is defined directly in skin-tone space (fallback). This work demonstrates how fairness interventions can be made transparent, controllable, and usable at inference time, directly empowering users of generative AI.
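The prompt-level intervention can be sketched as expanding one prompt into demographic-specific variants in the proportions of the declared target. The group names, the phrasing template, and the rounding fix-up below are illustrative assumptions, not the paper's exact construction:

```python
def prompt_variants(base_prompt, target_dist, n_images):
    """Expand one prompt into demographic-specific variants so the
    batch of generations matches a declared target distribution."""
    counts = {g: round(p * n_images) for g, p in target_dist.items()}
    # fix rounding drift so the variant counts sum to n_images
    drift = n_images - sum(counts.values())
    if drift:
        top = max(target_dist, key=target_dist.get)
        counts[top] += drift
    variants = []
    for group, c in counts.items():
        variants += [f"photo of a {group} {base_prompt}"] * c
    return variants
```

Auditing then reduces to comparing the observed skin-tone distribution of the generated batch against `target_dist`, rather than against an assumed uniform ideal.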
Advancements in artificial intelligence for cancer diagnosis and prognosis prediction: current applications and emerging opportunities
🔥 Citations:
0
Abstract: Cancer continues to be a leading cause of mortality worldwide, presenting substantial challenges to public health systems. The traditional approaches to cancer diagnosis and prognosis prediction exhibit certain limitations with respect to accuracy, comprehensiveness, dynamic monitoring, and personalization. With the advancement of artificial intelligence (AI) technologies, novel diagnostic and predictive methods are increasingly addressing these shortcomings. This review provides a comprehensive overview of the primary AI algorithms applied in oncology, including machine learning, deep learning, and large language models. It further examines the distinctive characteristics and appropriate use cases of AI algorithms, highlighting their specific roles in cancer screening, diagnostic accuracy, and outcome forecasting. Additionally, the review discusses emerging trends and persistent challenges, aiming to provide actionable insights that support clinical decision-making and advance scientific innovation in this rapidly evolving field. In conclusion, this review systematically outlines recent advances in AI applications for cancer diagnosis and prognostic prediction, with the objective of facilitating a transformative shift in oncology from experience-based practices toward data-driven precision medicine.
Evaluating large language models for accuracy incentivizes hallucinations
🔥 Citations:
0
Abstract: No abstract available; see the original article.
CHASM: Unveiling Covert Advertisements on Chinese Social Media
🔥 Citations:
0
Abstract: Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging. The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements. Our further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures. We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.
Development of a novel series of thiazole-based compounds with enhanced antiproliferative properties as tubulin polymerization inhibitors
🔥 Citations:
0
Abstract: In cancer therapy, inhibiting tubulin polymerization is a key approach for modifying microtubule dynamics required for cell survival and proliferation. Microtubule destabilizing agents (MDAs), also known as tubulin polymerization inhibitors, prevent tubulin heterodimers from forming microtubules, resulting in catastrophic cellular collapse. A novel series of thiazole-based compounds 8a-o was developed to inhibit tubulin polymerization and assessed for antiproliferative efficacy against the NCI 60 cell line. The structures of the newly synthesized compounds were confirmed using 1H NMR, 13C NMR, and elemental microanalyses. All 15 compounds (8a-o) were assessed for antiproliferative action at a single dose (10 μM) and analyzed against the comprehensive 60-cell panel at five concentrations (0.01, 0.1, 1, 10, and 100 μM). The results from the one-dose and five-dose studies demonstrate that 8b, 8c, 8d, 8m, and 8o are the most prominent antiproliferative agents, exhibiting the most favourable low-micromolar GI50 values across various cell lines, frequently advancing to low-micromolar TGI values and, in numerous sensitive cell lines, achieving LC50 values within the single-digit micromolar range. Compounds 8b, 8d, and 8m showed significant anti-tubulin activity, with IC50 values ranging from 3.86 to 7.19 μM, compared to the reference CA-4 (IC50 = 2.40 μM). In the MCF-7 breast cancer cell line, compound 8m drove a significant accumulation of cells in the G2/M phase, increasing from 13.74% to 45.35%. G2/M arrest is frequently associated with DNA damage or the inhibition of microtubule dynamics, which aligns with Western blot results demonstrating a decrease in tubulin (50 kDa) expression following treatment with 8m. Apoptotic and necrotic experiments indicate that 8m stimulates a defined programmed cell death pathway rather than inducing non-specific toxic necrosis. Molecular docking corroborated binding at the colchicine site, while in silico ADMET profiling indicated a promising drug-like profile for compound 8m.
Use of Large Language Models to Extract Heart Failure Information from Electronic Health Records: A Scoping Review
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Hybrid Policy Distillation for LLMs
🔥 Citations:
0
Abstract: Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
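The forward/reverse KL mixture at the heart of the hybrid objective can be illustrated per token. The plain convex mix and the `alpha` weight below are assumptions for illustration, not the paper's exact loss:

```python
import math

def hybrid_kl(p_teacher, q_student, alpha=0.5):
    """Per-token KD objective mixing forward KL (mode coverage) and
    reverse KL (mode seeking) over two probability distributions,
    in the spirit of hybrid policy distillation."""
    fwd = sum(p * math.log(p / q)
              for p, q in zip(p_teacher, q_student) if p > 0)
    rev = sum(q * math.log(q / p)
              for p, q in zip(p_teacher, q_student) if q > 0)
    return alpha * fwd + (1 - alpha) * rev
```

With `alpha=1` this recovers pure forward KL (the student must cover every teacher mode); with `alpha=0` it recovers reverse KL (the student concentrates on teacher modes it can match). The hybrid trades off the two failure modes.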
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
🔥 Citations:
0
Abstract: Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
Accurate and Efficient Interatomic Potentials for Dislocations in InP
🔥 Citations:
0
Abstract: We present Atomic Cluster Expansion (ACE) and MACE models trained on a new dataset of Density Functional Theory (DFT) calculations, constructed for the task of studying the mobility of dislocations in Indium Phosphide (InP). The models are validated in a suite of tests against RSCAN DFT, and compared with previously published potentials from literature. Our new models act as much better surrogates for DFT than the literature models: errors on partial dislocation formation energies are at most 4% for both ACE and MACE, compared with 18% for the MACE-MPA foundation model and 42-50% for earlier bespoke potentials. The bespoke MACE model achieves this accuracy while being around five times faster to evaluate than the MP0 and MPA foundation models.
A Cloud-Native Architecture for Human-in-Control LLM-Assisted OpenSearch in Investigative Settings
🔥 Citations:
0
Abstract: Complex criminal investigations are often hindered by large volumes of unstructured evidence and by the semantic gap between natural language investigative intent and technical search logic. To address this challenge, we present a design and feasibility study of a cloud-native microservice architecture tailored to private-cloud deployments, contributing to research in secure cloud computing and leveraging modern cloud paradigms under high security and scalability requirements. The proposed system integrates Large Language Models into a "Human-in-Control" workflow that translates natural-language queries into syntactically valid OpenSearch Domain-Specific Language expressions. We describe the implementation of a hybrid retrieval strategy within OpenSearch that combines BM25-based lexical search with nested semantic vector embeddings. The paper focuses on system design and preliminary functional validation, establishing an architectural baseline for future empirical evaluation. Technical feasibility is demonstrated through a functional prototype, and a rigorous evaluation methodology is outlined using the Enron Email Dataset as a structural proxy for restricted investigative corpora.
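A hybrid BM25-plus-vector query of the kind described might be assembled as an OpenSearch DSL body like the one below. The field names (`body`, `chunks`, `chunks.vector`) are illustrative assumptions, not the system's actual schema, and exact k-NN clause support depends on the OpenSearch version and index mapping:

```python
def hybrid_query(text, embedding, k=10):
    """Combine a BM25 lexical clause with a nested k-NN vector clause
    in one OpenSearch bool/should query (sketch, hypothetical schema)."""
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {"match": {"body": text}},  # BM25 lexical leg
                    {
                        "nested": {               # semantic leg over chunk embeddings
                            "path": "chunks",
                            "query": {
                                "knn": {
                                    "chunks.vector": {
                                        "vector": embedding,
                                        "k": k,
                                    }
                                }
                            },
                        }
                    },
                ]
            }
        },
    }
```

In the "Human-in-Control" workflow, a generated body like this would be shown to the investigator for review before execution, since the LLM only proposes the DSL.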
Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking
🔥 Citations:
0
Abstract: Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that the human expert decision-making process relies on multi-perspective judgment, in this work, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information via graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of the LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.
SCM: Sleep-Consolidated Memory with Algorithmic Forgetting for Large Language Models
🔥 Citations:
0
Abstract: We present SCM (Sleep-Consolidated Memory), a research preview of a memory architecture for large language models that draws on neuroscientific principles to address a fundamental limitation in current systems: the absence of persistent, structured, and biologically plausible memory. Existing approaches rely on truncating context windows, growing vector databases without bound, or tiered storage systems that lack consolidation and forgetting mechanisms. SCM implements five core components inspired by human memory: a limited-capacity working memory, multi-dimensional importance tagging, offline sleep-stage consolidation with distinct NREM and REM phases, intentional value-based forgetting, and a computational self-model enabling introspection. Across a standardized benchmark suite of eight tests, the prototype achieves perfect recall accuracy over ten-turn conversations while reducing memory noise by 90.9% through adaptive forgetting. Memory search latency remains below one millisecond even with hundreds of stored concepts. This work establishes the architectural foundations for memory systems that consolidate, prioritize, and forget, offering a testable platform for advancing LLM memory research.
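A minimal sketch of the limited-capacity working memory with value-based forgetting that the abstract describes: the store below evicts the lowest-importance item once capacity is exceeded. The class name, capacity, and importance values are hypothetical; SCM's multi-dimensional importance tagging and sleep-stage consolidation are not modeled here.

```python
import heapq

class WorkingMemory:
    """Toy limited-capacity store: when full, the lowest-importance
    item is forgotten first (a stand-in for value-based forgetting)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._heap = []    # min-heap of (importance, insertion_order, content)
        self._order = 0

    def store(self, content, importance):
        self._order += 1
        heapq.heappush(self._heap, (importance, self._order, content))
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)   # forget the least important item

    def recall(self):
        # Return contents ordered by descending importance.
        return [c for _, _, c in sorted(self._heap, reverse=True)]

mem = WorkingMemory(capacity=3)
for item, importance in [("greeting", 0.1), ("user name", 0.9),
                         ("deadline", 0.8), ("small talk", 0.2)]:
    mem.store(item, importance)
```

After four insertions into a capacity-3 store, the lowest-importance entry ("greeting") has been forgotten while higher-value facts survive.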
Odor Maps from the LLM-derived similarity scores
🔥 Citations:
0
Abstract: The application of large language models (LLMs) to OdorSpace analysis attracts growing interest. Recent studies have explored the comparison of sensory evaluation spaces derived from LLMs with odor character profiles in the Dravnieks' dataset. In this study, we calculated pairwise distances of odor descriptors using three distance measures and statistically compared these LLM-derived similarities with distances derived from the original data. Next, we extended this approach to odor names (ingredients). Statistical comparison revealed that LLMs can infer odor similarity to some degree, suggesting the potential of odor maps generated from these similarity data. Applying this approach, we generated an odor map of essential oils. It demonstrates that essential oils within the same group are closely located in the odor map, suggesting that proximity in the odor map corresponds to human evaluation.
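Turning pairwise descriptor distances into a 2-D odor map can be done with classical multidimensional scaling (MDS); the sketch below assumes a symmetric distance matrix and is not necessarily the authors' exact mapping procedure. The distance values are made up.

```python
import numpy as np

def classical_mds(D, dims=2):
    """Embed a symmetric distance matrix D into `dims` coordinates
    via classical MDS (double centering + top eigenpairs)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix
    vals, vecs = np.linalg.eigh(B)               # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dims]          # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# Hypothetical pairwise distances between three odor descriptors.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
coords = classical_mds(D)
```

Because this toy matrix is exactly Euclidean (three collinear points), the embedded coordinates reproduce the input distances; real LLM-derived similarity data would be embedded only approximately.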
Multi-Context Concatenation Across Requests for LLMs
🔥 Citations:
0
Abstract: Reusing separate, pre-filled Key-Value (KV) Caches for multiple contexts has become a common practice in handling multi-context scenarios with Large Language Models. However, this leads to a lack of cross-attention mechanisms between contexts. To address this, we propose CatLLM, the first method that concatenates multiple contexts across requests offline to compensate for this deficiency. Specifically, during offline processing, CatLLM identifies contexts that severely lack cross-attention by incorporating the weighted inner products of Q and K vectors from tokens in an un-concatenated context into an equivalently transformed weighted formulation for concatenated Q and K inner products. This yields a weighting w_i^{A+B} corresponding to the output vector difference, which can then be used to identify contexts with severe cross-attention deficiencies and concatenate them into a single context for KV Cache computation. Experimental results show that, compared to the baseline of separate caching (i.e., no concatenation), fully concatenating all contexts improves the F1 score by 6%. Meanwhile, the proposed method reduces the number of contexts requiring caching from 10 to 7 while achieving a 3% F1 score improvement, thereby maximizing performance improvement while minimizing the degree of context compression.
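The exact weighted formulation in CatLLM is not recoverable from the abstract alone. As a simplified stand-in, the sketch below measures how much attention mass queries from one context would place on another context's keys if the two were concatenated, which is one way to flag pairs with severe cross-attention deficiency under separate caching; shapes and values are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_mass(Q_a, K_a, K_b):
    """Fraction of attention mass that queries from context A would
    spend on context B's keys if A and B were concatenated. A large
    value suggests A "misses" cross-attention to B when the contexts
    are cached separately (simplified stand-in for CatLLM's weighted
    Q-K inner-product criterion)."""
    d = Q_a.shape[-1]
    logits = Q_a @ np.concatenate([K_a, K_b]).T / np.sqrt(d)
    attn = softmax(logits)                 # (tokens_a, keys_a + keys_b)
    return attn[:, K_a.shape[0]:].sum(axis=1).mean()

rng = np.random.default_rng(0)
Q_a = rng.normal(size=(4, 8))   # queries from context A
K_a = rng.normal(size=(4, 8))   # keys from context A
K_b = rng.normal(size=(4, 8))   # keys from context B
score = cross_attention_mass(Q_a, K_a, K_b)
```

Pairs whose score exceeds a threshold would be candidates for offline concatenation into a single KV cache.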
UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval
🔥 Citations:
0
Abstract: Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.
Expanding the extreme-k dielectric materials space through physics-validated generative reasoning
🔥 Citations:
0
Abstract: The most technologically consequential materials are often the rarest: they occupy narrow regions of chemical space, obey competing physical constraints, and appear only sparsely in existing databases. High-kappa dielectrics, high-Tc superconductors, and ferromagnetic insulators are a few examples. This scarcity fundamentally limits today's data-driven materials discovery, where machine-learning models excel at interpolation but struggle to generate genuinely new candidates. Here, we introduce DielecMIND, an artificial intelligence framework that reframes materials discovery as a reasoning-driven exploration instead of a database-screening problem. Using high-kappa dielectrics as a data-scarce and technologically stringent test case, DielecMIND combines, for the first time, large-language-model hypothesis generation with physics-validated first-principles calculations to navigate chemical space beyond known compounds. Prior to our work, only 14 experimentally or computationally validated materials with kappa>150 were known. Our framework discovers and validates 5 new such compounds, expanding this rare-materials class by a remarkable ≈35% in a single study. Among them, we find that Ba2TiHfO6 exhibits a dielectric constant of 637, minimal loss at low optical frequencies, and stability up to 800 K. Beyond dielectrics, this work demonstrates a new paradigm for artificial-intelligence-guided discovery: one that generates a small number of physically grounded, experimentally plausible candidates yet measurably expands sparsely populated functional materials spaces. Thus, DielecMIND points toward a general strategy for discovering rare, high-impact functional materials where data scarcity has long constrained progress.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning
🔥 Citations:
0
Abstract: While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.
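The cascaded multimodal reward can be pictured as a gate-then-combine rule: a failed structural check zeroes the reward, and functional and aesthetic signals are blended otherwise. The weights and scores below are made up for illustration and are not the paper's.

```python
def cascaded_reward(structural_ok, functional_score, aesthetic_score):
    """Simplified cascaded reward: structural guarantees gate the
    reward; execution-grounded functional feedback and vision-based
    aesthetic supervision are combined only if the gate passes.
    The 0.7 / 0.3 weights are illustrative assumptions."""
    if not structural_ok:
        return 0.0                         # broken scaffold: no credit
    return 0.7 * functional_score + 0.3 * aesthetic_score

reward_fail = cascaded_reward(False, 1.0, 1.0)  # structural failure gates to 0
reward_ok = cascaded_reward(True, 0.8, 0.5)
```

The gate prevents the policy from collecting aesthetic reward on websites that do not even render, which is the intuition behind coupling structural, functional, and aesthetic signals in a cascade.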
Intersectional Fairness in Large Language Models
🔥 Citations:
0
Abstract: Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.
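The subgroup fairness measurements described above reduce, at their simplest, to per-subgroup accuracy and a worst-case gap across intersectional groups; the sketch below assumes labeled (subgroup, correct) records and uses fabricated toy data.

```python
from collections import defaultdict

def subgroup_accuracy(records):
    """Accuracy per intersectional subgroup, e.g. (race, gender).
    `records` holds (subgroup, correct) pairs, possibly pooled
    across repeated runs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

def max_disparity(acc_by_group):
    """Worst-case accuracy gap across subgroups."""
    accs = list(acc_by_group.values())
    return max(accs) - min(accs)

# Fabricated records: ((race, gender), model answered correctly?)
records = [(("A", "F"), True), (("A", "F"), True),
           (("A", "M"), True), (("A", "M"), False),
           (("B", "F"), False), (("B", "F"), True)]
acc = subgroup_accuracy(records)
gap = max_disparity(acc)
```

Even with high overall accuracy, a large `gap` flags uneven outcome distributions across intersectional groups, which is why the abstract argues for metrics beyond aggregate accuracy.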
SCI-IDEA: Context-Aware Scientific Ideation Using Token and Sentence Embeddings
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
🔥 Citations:
0
Abstract: Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
🔥 Citations:
0
Abstract: The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines in questions that require vendor-specific nuances or knowledge of formal standards, such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.
AICEBERG: A Novel Agentic AI Framework for Autonomous Radio Monitoring, Compliance and Governance Based on LLM, MCP, and SCPI in Smart Cities
🔥 Citations:
0
Abstract: Urban radio spectrum monitoring is becoming increasingly complex due to the rapid growth of wireless devices, unauthorized emissions, and dynamic electromagnetic environments in smart cities. Traditional spectrum analysis approaches, based on manual operation or static detection techniques, are no longer sufficient to ensure scalable, autonomous, and secure monitoring. The convergence of two emergent technologies—Large Language Models (LLMs) and the Model Context Protocol (MCP)—facilitates a fundamental shift in radio monitoring. We define this as the AICEBERG paradigm: a novel, stratified architecture where a high-level, intelligent agentic interface (the peak) abstracts the underlying complexity of SCPI-driven hardware integration and radio governance protocols (the foundational base). This autonomous framework provides the necessary objective rigor to audit the stochastic ‘ocean of electromagnetic waves’ characteristic of modern smart cities, ensuring a stable platform for regulatory enforcement amidst high-density signal interference. The proposed system implements a three-layer processing flow, enabling high-level natural language commands to be translated into validated and secure hardware actions on RF spectrum analyzers. A dual-server design separates operational execution from safety validation, ensuring controlled SCPI command handling, parameter verification, and instrument health monitoring. Experimental validation demonstrates the feasibility of autonomous measurement execution. The results show that the proposed architecture reduces human dependency, enhances reproducibility and lowers the expertise barrier required for RF spectrum surveillance. To the best of our knowledge, AICEBERG represents one of the first integrated frameworks to bridge LLMs with SCPI-compliant hardware through the MCP for autonomous radio governance.
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
🔥 Citations:
0
Abstract: Self-play fine-tuning enables large language models to improve beyond supervised fine-tuning without additional human annotations by contrasting annotated responses with self-generated ones. Many existing methods rely on a fixed divergence regime. SPIN is closely related to a KL-based regime, SPACE to a Jensen-Shannon-style objective via noise contrastive estimation, and SPIF to χ²-regularized self-play. Since these divergences exhibit different strengths depending on the distributional gap between model and target, no single choice appears to provide favorable learning dynamics across training stages. We propose IRIS (Interpolative Rényi Iterative Self-play), a Rényi-based self-play fine-tuning framework with a continuously adjustable objective. IRIS decomposes into two independent tilted risk terms over annotated and synthetic data, with exponential importance weights controlled by the order parameter α. We show that several self-play objectives can be interpreted as limiting or representative regimes at particular values of α, providing a unified theoretical perspective on these methods. An adaptive order schedule further adjusts α to the distributional gap, shifting from sharper importance weighting early in training to smoother refinement near convergence. Theoretically, we establish the fixed-point property of IRIS and analyze how α controls gradient concentration. Experiments on Zephyr-7B and Qwen2.5-3B across ten benchmarks show that IRIS improves upon baselines, reaching a 44.57% average score with gains across iterations. In our setting, IRIS with only 26k annotated samples surpasses standard supervised fine-tuning trained on the full 200k dataset.
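The abstract's tilted risk terms are not fully specified, but the textbook tilted (exponentially weighted) empirical risk illustrates how an order parameter interpolates between mean-like and worst-case weighting; this generic form is an assumption for illustration, not necessarily IRIS's exact objective.

```python
import math

def tilted_risk(losses, alpha):
    """Generic tilted empirical risk:
        R_alpha = (1/alpha) * log( mean( exp(alpha * loss_i) ) ).
    alpha -> 0 recovers the plain mean; large alpha emphasizes the
    worst losses (sharper importance weighting). Computed via a
    log-sum-exp for numerical stability."""
    n = len(losses)
    m = max(alpha * l for l in losses)           # stabilizer
    lse = m + math.log(sum(math.exp(alpha * l - m) for l in losses) / n)
    return lse / alpha

losses = [0.1, 0.2, 2.0]
mild = tilted_risk(losses, alpha=0.01)    # close to the mean loss
sharp = tilted_risk(losses, alpha=10.0)   # dominated by the worst loss
```

Sweeping the order parameter continuously between these regimes is the kind of interpolation an adaptive α schedule exploits.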
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
🔥 Citations:
0
Abstract: The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions. Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation. The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface. Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks. This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks focus on semantic preference of the model for function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the function sets, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conducted experiments on 5 different models, including instructed and reasoning variants, reaching 70% to 100% ASR over the established BFCL dataset. Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.
The Path Not Taken: Duality in Reasoning about Program Execution
🔥 Citations:
0
Abstract: Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images
🔥 Citations:
0
Abstract: Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
🔥 Citations:
0
Abstract: Increasingly, studies are exploring using Large Language Models (LLMs) for accelerated or scaled qualitative analysis of text data. While we can compare LLM accuracy against human labels directly for deductive coding, or labeling text, it is more challenging to judge the ethics and effectiveness of using LLMs in abstractive methods such as inductive thematic analysis. We collaborate with psychologists to study the abstractive claims LLMs make about human life stories, asking, how does using an LLM as an interpreter of meaning affect the conclusions and perspectives of a study? We propose a summarization-based pipeline for surfacing biases in perspective-taking an LLM might employ in interpreting these life stories. We demonstrate that our pipeline can identify both race and gender bias with the potential for representational harm. Finally, we encourage the use of this analysis in future studies involving LLM-based interpretation of study participants' written text or transcribed speech to characterize a positionality portrait for the study.
InVitroVision: a Multi-Modal AI Model for Automated Description of Embryo Development using Natural Language
🔥 Citations:
1
Abstract: The application of artificial intelligence (AI) in IVF has shown promise in improving consistency and standardization of decisions, but often relies on annotated data and does not make use of the multimodal nature of IVF data. We investigated whether foundational vision-language models can be fine-tuned to predict natural language descriptions of embryo morphology and development. Using a publicly available embryo time-lapse dataset, we fine-tuned PaliGemma-2, a multi-modal vision-language model, with only 1,000 images and corresponding captions, describing embryo morphology, embryonic cell cycle and developmental stage. Our results show that the fine-tuned model, InVitroVision, outperformed a commercial model, ChatGPT 5.2, and base models in overall metrics, with performance improving with larger training datasets. This study demonstrates the potential of foundational vision-language models to generalize to IVF tasks with limited data, enabling the prediction of natural language descriptions of embryo morphology and development. This approach may facilitate the use of large language models to retrieve information and scientific evidence from relevant publications and guidelines, and has implications for few-shot adaptation to multiple downstream tasks in IVF.
AI Workflow Automation Agent & Multi-Agent System using LangChain and LangGraph
DOI:
10.55041/ijsrem60971
🔥 Citations:
0
Abstract: This paper examines the architectural shift from linear Large Language Model (LLM) chains to stateful, multi-agent systems (MAS) for the automation of intricate workflows. Traditional automation depends on strict, procedural logic; LangGraph, a low-level orchestration framework built on LangChain, instead makes it possible to build cyclical, event-driven agentic workflows. We examine the main ideas behind graph-based reasoning, such as the use of nodes for functional logic and edges for conditional routing. The core of this study is the assessment of state management via reducer-driven schemas and persistent checkpointers, facilitating durable execution and human-in-the-loop (HITL) interactions. Our research analyses diverse orchestration patterns, including supervisor-worker and collaborative teams, by comparing performance metrics across industry-standard frameworks. Experimental data shows that LangGraph-based systems can complete up to 88% of tasks correctly when multi-step reasoning is required, a 20–30% improvement in engagement and operational efficiency over rule-based systems. This framework provides a solid base for enterprise-level autonomous agents capable of self-correction and complex decision-making.
Key Words: AI agents, LangChain, LangGraph, multi-agent systems, workflow automation, LLM orchestration.
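The node-and-edge pattern described above can be illustrated without the LangGraph library itself: a toy engine where nodes are functions over a shared state dict and conditional edges route execution, mimicking (in miniature) cyclical graph orchestration. All node names and routing logic below are invented for illustration.

```python
# Nodes are functions over a shared state dict; edges map each node
# to a routing function that picks the next node (the pattern LangGraph
# provides via add_node / add_conditional_edges; this toy only mimics it).

def draft(state):
    state["revisions"] = state.get("revisions", 0) + 1
    state["text"] = "draft v" + str(state["revisions"])
    return state

def review(state):
    state["approved"] = state["revisions"] >= 2   # stand-in reviewer logic
    return state

def route_after_review(state):
    # Conditional edge: loop back for another revision, or finish.
    return "END" if state["approved"] else "draft"

nodes = {"draft": draft, "review": review}
edges = {"draft": lambda s: "review", "review": route_after_review}

def run(entry, state):
    node = entry
    while node != "END":
        state = nodes[node](state)
        node = edges[node](state)
    return state

final = run("draft", {})
```

The draft-review cycle executes twice before the conditional edge routes to END, which is exactly the kind of loop a linear chain cannot express.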
Exploring chalcone analogues as potential SARS-CoV-2 inhibitors: insights from synthesis, crystal analysis, molecular dynamics simulations, docking and ADMET studies
DOI:
10.1098/rsos.251866
🔥 Citations:
0
Abstract: Chalcone-based analogues have gained significant attention due to their promising inhibitory activity against multiple viral targets. To identify potential anti-SARS-CoV-2 drug candidates, we designed and synthesized novel chalcone analogues via the Claisen–Schmidt reaction, affording excellent product yields (87–94%). Compound 3c was crystallized by slow evaporation of an ethanolic solution at room temperature. X-ray diffraction analysis revealed that the compound crystallized in the monoclinic crystal system with the P2₁/c space group. The crystallographic data uncovered crystal packing, bond lengths, bond angles and other key parameters. Docking studies revealed that the synthesized compounds 3(a–d) exhibited moderate binding affinities, with compound 3d showing the strongest affinity at −4.657 kcal mol⁻¹. These values are less favourable than the reference inhibitors (−7.424 and −8.245 kcal mol⁻¹ for the co-crystallized ligand and remdesivir, respectively), suggesting potential for further optimizations through medicinal chemistry approaches. The molecular dynamics simulations of the 3d complex demonstrated fluctuations within a satisfactory range (2.5–4.5 Å), as observed in the root mean square deviation plots at 100 ns. Furthermore, absorption, distribution, metabolism and excretion analysis indicated that all compounds displayed drug-like properties, with high gastrointestinal absorption, suggesting bioavailability. The present work provides a promising platform for further development of chalcone-based compounds as promising anti-SARS-CoV-2 agents.
DWTSumm: Discrete Wavelet Transform for Document Summarization
🔥 Citations:
0
Abstract: Summarizing long, domain-specific documents with large language models (LLMs) remains challenging due to context limitations, information loss, and hallucinations, particularly in clinical and legal settings. We propose a Discrete Wavelet Transform (DWT)-based multi-resolution framework that treats text as a semantic signal and decomposes it into global (approximation) and local (detail) components. Applied to sentence- or word-level embeddings, DWT yields compact representations that preserve overall structure and critical domain-specific details, which are used directly as summaries or to guide LLM generation. Experiments on clinical and legal benchmarks demonstrate comparable ROUGE-L scores. Compared to a GPT-4o baseline, DWT-based summarization consistently improves semantic similarity and grounding, achieving gains of over 2% in BERTScore, more than 4% in Semantic Fidelity, factual consistency in legal tasks, and large METEOR improvements indicative of preserved domain-specific semantics. Across multiple embedding models, Fidelity reaches up to 97%, suggesting that DWT acts as a semantic denoising mechanism that reduces hallucinations and strengthens factual grounding. Overall, DWT provides a lightweight, generalizable method for reliable long-document and domain-specific summarization with LLMs.
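The approximation/detail split at the heart of the framework can be sketched with one level of the Haar DWT; the input below is a made-up 1-D "semantic signal" standing in for one embedding dimension across consecutive sentences, not the paper's actual pipeline.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform: returns
    (approximation, detail) coefficients. Applied per embedding
    dimension, the approximation captures the coarse global trend
    and the detail captures local deviations."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:                       # pad odd-length input
        x = np.append(x, x[-1])
    pairs = x.reshape(-1, 2)
    approx = pairs.sum(axis=1) / np.sqrt(2)            # pairwise averages
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # pairwise differences
    return approx, detail

# Hypothetical signal: two similar sentences, then a sharp topic shift.
approx, detail = haar_dwt([4.0, 4.0, 2.0, 0.0])
```

The first pair of identical values yields zero detail, while the shift in the second pair shows up entirely in the detail band, which is the property that lets detail coefficients flag locally important content.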
Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
🔥 Citations:
0
Abstract: Integrating Large Language Models (LLMs) into complex software systems enables the generation of human-understandable explanations of opaque AI processes, such as automated task planning. However, the quality and reliability of these explanations heavily depend on effective prompt engineering. The lack of a systematic understanding of how diverse stakeholder groups formulate and refine prompts hinders the development of tools that can automate this process. We introduce COMPASS (COgnitive Modelling for Prompt Automated SynthesiS), a proof-of-concept self-adaptive approach that formalises prompt engineering as a cognitive and probabilistic decision-making process. COMPASS models users' unobservable latent cognitive states (such as attention, comprehension, and uncertainty) together with observable interaction cues as a POMDP, whose synthesised policy enables adaptive generation of explanations and prompt refinements. We evaluate COMPASS using two diverse cyber-physical system case studies to assess adaptive explanation generation and its quality, both quantitatively and qualitatively. Our results demonstrate the feasibility of COMPASS's integration of human cognition and user-profile feedback into automated prompt synthesis in complex task planning systems.
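Modeling latent cognitive states as a POMDP implies the standard belief update, sketched below with two invented latent states and illustrative transition/observation probabilities; this is the textbook update step, not COMPASS's synthesised policy.

```python
import numpy as np

def belief_update(belief, T, O, action, obs):
    """Standard POMDP belief update:
        b'(s') ∝ O[a][s', o] * sum_s T[a][s, s'] * b(s)
    `T[a]` is the transition matrix for action a, `O[a]` holds
    observation likelihoods per resulting state."""
    predicted = belief @ T[action]              # predict step
    unnorm = predicted * O[action][:, obs]      # correct with observation
    return unnorm / unnorm.sum()

# Hypothetical latent states: 0 = user following, 1 = user confused.
T = {"explain": np.array([[0.9, 0.1],
                          [0.6, 0.4]])}
# Hypothetical observations: 0 = no question, 1 = clarifying question.
O = {"explain": np.array([[0.8, 0.2],
                          [0.3, 0.7]])}
b = belief_update(np.array([0.5, 0.5]), T, O, "explain", obs=1)
```

Observing a clarifying question shifts belief toward the "confused" state, and a policy over such beliefs is what would drive adaptive explanation generation.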
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
🔥 Citations:
0
Abstract: Auto-regressive Large Language Models (LLMs) achieve strong performance on coding tasks, but incur high memory and inference costs. Diffusion-based language models (d-LLMs) offer bounded inference cost via iterative denoising, but their behavior under post-training quantization (PTQ) has been sparsely explored. We investigate the application and robustness of PTQ techniques, specifically GPTQ and a modified Hessian-Aware Quantization (HAWQ) algorithm, on a diffusion-based coding LLM (CoDA) under a standardized evaluation pipeline. In our setup, CoDA exhibits greater robustness at low bitwidths (2-4 bits) than Qwen3-1.7B, its auto-regressive counterpart, with smaller accuracy degradation across the HumanEval and MBPP benchmarks. Additionally, mixed-precision configurations derived from HAWQ provide smooth trade-offs across accuracy, latency, and memory. These results suggest that diffusion LLMs may offer advantages for efficient deployment due to their greater quantization resilience.
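The bitwidth trade-off being probed can be seen in miniature with the simplest round-to-nearest symmetric quantizer (GPTQ and HAWQ are considerably more sophisticated, but the error-vs-bits shape is the same). The weight values below are invented.

```python
def fake_quantize(weights, bits):
    """Symmetric uniform round-to-nearest quantization to `bits` bits,
    immediately dequantized so the error is visible in float."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

weights = [0.12, -0.50, 0.33, 0.91, -0.27, 0.04]
# Mean-squared reconstruction error shrinks as bitwidth grows.
mse = {bits: sum((w - q) ** 2 for w, q in zip(weights, fake_quantize(weights, bits)))
       for bits in (2, 4, 8)}
```

Robustness at "2-4 bits" means a model tolerates the large 2-bit error with little downstream accuracy loss.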
Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation
🔥 Citations:
0
Abstract: Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.
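The Bayesian inverse inference step can be sketched in isolation: a prior over candidate items is reweighted by the likelihood of the observed utterance under each item. In SiPeR that likelihood comes from an MLLM; here a toy lookup table stands in, and all items and values are invented.

```python
def inverse_infer(prior, likelihood, utterance):
    """Bayesian inverse inference: P(item | utterance) ∝ P(utterance | item) · P(item)."""
    post = {item: likelihood[item].get(utterance, 1e-9) * p
            for item, p in prior.items()}
    z = sum(post.values())
    return {item: v / z for item, v in post.items()}

prior = {"espresso": 0.4, "green_tea": 0.4, "cake": 0.2}
likelihood = {  # toy stand-in for MLLM-scored likelihoods
    "espresso": {"something to keep me awake": 0.70},
    "green_tea": {"something to keep me awake": 0.40},
    "cake": {"something to keep me awake": 0.05},
}
posterior = inverse_infer(prior, likelihood, "something to keep me awake")
```

Items equally likely a priori are separated by how well they explain the user's utterance, which is the sense in which the preference is "implicit".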
AnalogMaster: Large Language Model-based Automated Analog IC Design Framework from Image to Layout
🔥 Citations:
0
Abstract: Design automation has the potential to substantially improve the efficiency of analog integrated circuit (IC) design. However, existing algorithms and tools typically focus on individual stages, such as device sizing, placement, or routing, and still require significant manual intervention to complete the full design flow. While large language models (LLMs) have recently demonstrated remarkable success in automating digital IC design workflows, these advances cannot be directly transferred to analog IC design. Key challenges include strongly coupled performance metrics, the predominance of unstructured circuit schematic images, and the fact that most prior approaches address only isolated stages of the analog design process, limiting their ability to capture end-to-end performance impact. To address these challenges, we propose AnalogMaster, an extensible, LLM-based framework that enables end-to-end automation of analog IC design through a unified pipeline spanning circuit image-to-netlist generation, parameter optimization, placement, and routing. AnalogMaster integrates a joint reasoning mechanism that leverages in-context learning and intent reasoning to achieve accurate and robust image-to-netlist conversion. A parameter search agent integrating self-enhanced prompt engineering and context truncation is developed for effective device sizing and downstream physical design. Experimental evaluations on 15 representative circuits with varying levels of complexity demonstrate strong and consistent performance across multiple models. In particular, GPT-5 achieves success rates of 92.9% and 99.9% on Pass@1 and Pass@5, respectively. These results validate the effectiveness and robustness of the proposed framework and establish a practical paradigm for applying LLMs to full-stack analog IC design automation.
Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
🔥 Citations:
0
Abstract: Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
Performance of three large language models in answering parent-focused questions on rickets: a dual pediatric–orthopedic specialist evaluation
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Lightweight large language models for early sepsis prediction via a semantic abstraction rule engine
🔥 Citations:
0
Abstract: No abstract available; see the original article.
In-vitro analysis, metabolic profiling and in-silico molecular docking analysis shows antidiabetic potential of phytofabricated cerium oxide nanoparticles against diabetes related receptors
🔥 Citations:
0
Abstract: Plants play a vital role in sustaining life on Earth, offering essential resources such as food, energy, and shelter, and serving as a key source for drug development. Chemical pharmaceutics come with a range of disadvantages, including potential side effects, high costs, and limited availability in certain regions. There is an urgent need for sustainable, cost-effective, plant-based drug development approaches to ensure accessibility, minimize environmental impact, and promote long-term healthcare solutions. Nanobiotechnology, focusing on nanoscale materials, has vast applications in drug delivery and therapy. This study synthesized cerium oxide nanoparticles (CeO2NPs) using Mentha piperita L. leaf extract and analyzed their properties through UV-Vis spectrophotometry, SEM, EDX, FTIR, and XRD. Various CeO2NP nanoformulations were tested for antioxidant and antidiabetic efficacy via DPPH, ABTS, H2O2, alpha-amylase, alpha-glucosidase, and anti-sucrase assays, and glucose absorption in yeast cells. CeO2NPs were found to have an average size of 46-60 nm with a spherical shape, decorated with various important functional groups. LCMS analysis identified compounds in the plant extract, which were then subjected to ADMET analysis. Selected compounds underwent molecular docking using AutoDock Vina software. Results showed that CeO2NPs exhibited the highest antioxidant potential of 56.6% (DPPH) and 46.16% (ABTS) at 500 μg/mL. Further, a dose-dependent response of CeO2NPs was found: 78.03% inhibition against α-amylase, 68.6% inhibition against α-glucosidase, 59% inhibition against sucrase, and 76% inhibition of glucose uptake by yeast cells at a concentration of 25 mmol/L glucose.
LCMS analysis of the plant extract identified a total of 74 compounds, while ADMET analysis and virtual screening yielded four bioactive compounds: Luteolin-7-O-rutinoside, Neoeriocitrin, Diosgenin, and Apigetrin. Of these, Luteolin-7-O-rutinoside and Neoeriocitrin showed the highest binding energies: −10.4 and −9.7 kcal mol−1 against alpha-amylase, −9.7 and −9.2 kcal mol−1 against alpha-glucosidase, and −9.5 and −9.3 kcal mol−1, respectively, against sucrase enzymes. Toxicity assessment and Lipinski's rule of five revealed that all four compounds are completely safe to be used as drugs against diabetes and specifically for HBB. In conclusion, Mentha piperita L. CeO2NPs exhibit strong antioxidant and antidiabetic potential. Mentha contains various bioactive compounds with binding affinities to diabetes-related receptors, warranting further exploration for pharmaceutical applications.
From autonomy to alliance: Robotic foundation models must learn with us, not just for us
🔥 Citations:
0
Abstract: This Viewpoint urges reimagining of robotic foundation models, from treating the robot as a solitary, omnipotent agent to embracing a multiagent, alliance-aware paradigm. Alliance-aware models learn with humans and other robots, not merely for them, by embedding mechanisms that foster social interaction and generalization across heterogeneous partners. We outline six design pillars that cultivate such collaborative intelligence: interaction priors, partner modeling (machine theory of mind), modular and composable policies, norm adaptation, trust-aware memory, and communication. Together, these pillars empower robots to fluidly switch social roles, adapt to unfamiliar collaborators, and coordinate robustly within dynamic multiagent ecologies spanning homes, factories, clinics, and field operations.
Revisiting Target‐Aware de novo Molecular Generation with TarPass: Between Rational Design and Texas Sharpshooter
DOI:
10.1002/advs.75411
🔥 Citations:
0
Abstract: Target-aware molecular generation models hold promise for drug discovery, but it remains unclear whether they genuinely exploit target information or merely resemble the Texas Sharpshooter fallacy by retrospectively rationalizing outputs. To address this, we introduce TarPass, a benchmark comprising a curated dataset of 18 well-studied targets with expert-annotated key interactions and experimentally validated active compounds, enabling fair evaluation of target-aware de novo molecular generation models. We assessed 15 representative models across three paradigms: non-3D, 3D in situ, and optimization-based, considering protein-ligand interactions (PLIs), molecular plausibility, and drug-likeness. Results show that 3D in situ models have a modest average advantage in predicted PLIs. However, many fail to outperform random sampling. Non-3D models, benefiting from broader pretraining, generate more drug-like and synthesizable molecules but exhibit weaker target specificity. Optimization-based methods effectively redirect outputs toward favorable chemical regions for single properties, often at the expense of others, for example by reducing compliance with Lipinski's rules. Integrating these insights, we propose a multi-tier virtual screening workflow for target-aware molecular generation as a post-processing strategy to enrich molecules with improved PLIs and plausibility. Overall, this study highlights the limitations of current models in capturing fine-grained target-specific constraints and provides a standardized framework for future structure-based drug design.
Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids
🔥 Citations:
0
Abstract: Fault location in distribution grids is critical for reliability and minimizing outage durations. Yet, it remains challenging due to partial observability, given sparse measurement infrastructure. Recent works show promising results by combining Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs) for spatio-temporal learning. Still, many modern GNN architectures remain untested for this grid application, while existing GNN solutions have not explored GNN topology definitions beyond simply adopting the full grid topology to construct the GNN graph. We address these gaps by (i) systematically comparing a newly proposed graph-forming strategy (measured-only) to the traditional full-topology approach, and (ii) introducing STGNN (Spatio-temporal GNN) models based on GraphSAGE and an improved Graph Attention (GATv2), for distribution grid fault location; (iii) benchmarking them against state-of-the-art STGNN and RNN baselines on the IEEE 123-bus feeder. In our experiments, all evaluated STGNN variants achieve high performance and consistently outperform a pure RNN baseline, with improvements up to 11 percentage points F1. Among STGNN models, the newly explored RGATv2 and RGSAGE achieve only marginally higher F1 scores. Still, STGNNs demonstrate superior stability, with tight confidence intervals (within +/- 1.4%) compared to the RNN baseline (up to +/- 7.5%) across different experiment runs. Finally, our proposed reduced GNN topology (measured-only) shows clear benefits in both (i) model training time (6-fold reduction) and (ii) model performance (up to 11 points F1). This suggests that measured-only graphs offer a more practical, efficient, and robust framework for partially observable distribution grids.
To Know is to Construct: Schema-Constrained Generation for Agent Memory
🔥 Citations:
0
Abstract: Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks "Structural Hallucination", where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.
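The "formal guarantee" of constrained decoding can be illustrated with a prefix filter over valid keys: at each step only tokens that extend some valid key are allowed, so a non-existent key is unreachable by construction. The schema, tokens, and scoring function below are invented stand-ins, not SCG-MEM's actual components.

```python
def allowed_next(valid_keys, prefix):
    """Return the tokens that keep `prefix` extendable to a valid key."""
    return {key[len(prefix)] for key in valid_keys
            if len(key) > len(prefix) and key[:len(prefix)] == prefix}

def constrained_decode(valid_keys, score):
    """Greedy decoding restricted to the key trie; `score` stands in for
    the LLM's per-token preference at each step."""
    prefix = ()
    while prefix not in valid_keys:
        prefix += (max(allowed_next(valid_keys, prefix), key=score),)
    return prefix

# Toy cognitive schema: memory keys as token tuples.
schema = {("user", "allergy"), ("user", "job"), ("trip", "paris")}
pref = {"user": 2.0, "trip": 0.5, "allergy": 3.0, "job": 1.0, "paris": 4.0}
key = constrained_decode(schema, lambda t: pref[t])
```

Note the high-scoring token "paris" is never reachable after "user": the constraint, not the score, rules it out.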
From Scene to Object: Text-Guided Dual-Gaze Prediction
🔥 Citations:
0
Abstract: Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.
The MIF-CD74 axis drives colorectal cancer via glycolytic reprogramming and is targeted by a novel small-molecule inhibitor
🔥 Citations:
0
Abstract: Background: Macrophage migration inhibitory factor (MIF) promotes inflammation, regulates immune responses and chemotherapy resistance in the tumor microenvironment. However, its mechanism of action in colorectal cancer (CRC) metabolic reprogramming and targeted therapeutic potential remain unclear. This study aims to investigate the function, mechanism, and targeted therapeutic potential of MIF in CRC. Methods: Data were integrated from TCGA, GTEx, CPTAC, and HPA databases with clinical sample validation. Single-cell sequencing analysis (datasets GSE166555 and GSE144735) was performed, alongside functional assays and mechanistic studies. A novel high-potency MIF inhibitor was identified through virtual screening and validated in vitro and in vivo. Results: MIF expression was found to be significantly elevated in CRC tissues and cell lines, correlating with poor overall survival (OS) and disease-specific survival (DSS). Single-cell sequencing confirmed malignant epithelial cells as the primary MIF source. Functional assays demonstrated that MIF knockout suppressed CRC cell proliferation, migration, and tumor growth in vivo, while MIF overexpression promoted these effects. Mechanistically, MIF binds CD74 to upregulate glycolytic enzymes (HK2, PKM2, LDHA), enhancing glucose uptake and lactate/pyruvate production, thereby driving the Warburg effect and CRC progression. Virtual screening identified a novel high-potency MIF inhibitor, F3277-0933 (IC50 = 8.284 μM). In vitro and in vivo, F3277-0933 surpassed the classical inhibitor ISO-1 in suppressing MIF-driven glycolytic reprogramming and proliferation. Conclusion: This study elucidates a novel mechanism by which the MIF-CD74 axis drives CRC progression through glycolytic reprogramming and provides robust preclinical evidence for developing MIF-targeted therapies. Graphical Abstract: The MIF-CD74 axis drives colorectal cancer progression via glycolytic reprogramming.
Extracellular MIF binds to the CD74/CD44 receptor complex on the surface of colorectal cancer cells, initiating downstream signaling cascades. This signal transduction leads to the transcriptional upregulation of key glycolytic enzymes (HK2, PKM2, and LDHA), driving a metabolic switch towards the Warburg effect—characterized by enhanced glucose uptake, increased lactate production (High ECAR), and suppressed mitochondrial oxidative phosphorylation (Low OCR). This metabolic reprogramming fuels malignant progression. The novel small-molecule inhibitor, F3277-0933, identified via structure-based virtual screening, specifically targets the MIF tautomerase active site. By blocking this oncogenic signaling axis, F3277-0933 effectively reverses the glycolytic phenotype and suppresses tumor growth. Supplementary Information The online version contains supplementary material available at 10.1007/s13402-026-01202-9.
AI models of unstable flow exhibit hallucination
🔥 Citations:
0
Abstract: We report the first systematic evidence of hallucination in AI models of fluid dynamics, demonstrated in the canonical problem of hydrodynamically unstable transport known as viscous fingering. AI-based modeling of flow with instabilities remains challenging because rapidly evolving, multiscale fingering patterns are difficult to resolve accurately. We identify solutions that appear visually realistic yet are physically implausible, analogous to hallucinations in large language models. These hallucinations manifest as spurious fluid interfaces and reverse diffusion that violate conservation laws. We show that their origin lies in the spectral bias of AI models, which becomes dominant at high flow rates and viscosity contrasts. Guided by this insight, we introduce DeepFingers, a new framework for AI-driven fluid dynamics that enforces balanced learning across the full spectrum of spatial modes by combining the Fourier Neural Operator with a Deep Operator Network to predict the spatiotemporal evolution of viscous fingers. By conditioning on both time and viscosity contrast, DeepFingers learns mappings between successive concentration fields across regimes. The framework accurately captures tip splitting, finger merging, and channel formation while preserving global metrics of mixing. The results open a new research direction to investigate fundamental limitations in AI models of physical systems.
Text Steganography with Dynamic Codebook and Multimodal Large Language Model
🔥 Citations:
0
Abstract: With the popularity of large language models (LLMs), text steganography has achieved remarkable performance. However, existing methods still have some issues: (1) in the white-box paradigm, the steganographic behavior is prone to exposure because Alice and Bob share an off-the-shelf language model; (2) in the black-box paradigm, methods lack flexibility and practicality because Alice and Bob must share a fixed codebook as well as a specific extraction prompt for each steganographic sentence. To improve security and practicality, we introduce a black-box text steganography method with a dynamic codebook and a multimodal large language model. Specifically, we first construct a dynamic codebook via a shared session configuration and a multimodal large language model. An encrypted steganographic mapping is then designed to embed secret messages during steganographic caption generation. Furthermore, we introduce a feedback optimization mechanism based on rejection sampling to ensure accurate extraction of secret messages. Experimental results show that the proposed method outperforms existing white-box text steganography methods in terms of embedding capacity and text quality. Meanwhile, the proposed method achieves better practicality and flexibility than the existing black-box paradigm on several popular online social networks.
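How a shared session secret can induce a codebook without ever transmitting one can be shown with a toy keyed-bit construction. This is an illustration of the general idea, not the paper's scheme: the hash-based bit assignment, the candidate pool, and the session key are all invented.

```python
import hashlib

def keyed_bit(session_key, token):
    """Alice and Bob derive each candidate token's hidden bit from the
    shared session secret, so no static codebook is exchanged."""
    return hashlib.sha256(f"{session_key}:{token}".encode()).digest()[0] & 1

def embed(session_key, secret_bits, candidates_per_step):
    """At each generation step, pick a (model-proposed) candidate token
    whose keyed bit equals the next secret bit."""
    return [next(t for t in cands if keyed_bit(session_key, t) == bit)
            for bit, cands in zip(secret_bits, candidates_per_step)]

def extract(session_key, stego_tokens):
    """Bob recomputes each token's keyed bit to recover the message."""
    return [keyed_bit(session_key, t) for t in stego_tokens]

key = "session-42"
pool = ["a", "the", "sunny", "bright", "river", "mountain", "dog", "cat",
        "blue", "green", "old", "small", "quiet", "warm", "open", "near"]
secret = [1, 0, 1, 1, 0]
stego = embed(key, secret, [pool] * len(secret))
```

In a real system the candidate pool at each step would come from the caption model's top predictions, and a rejection-sampling loop would discard steps where no acceptable candidate carries the needed bit.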
A critical analysis of MBTI-based personality profiling with large language models
🔥 Citations:
0
Abstract: This paper critically analyzes MBTI-based personality profiling using Large Language Models (LLMs), examining both their use as tools for inferring human personality and as subjects evaluated through psychometric frameworks. We review recent work (2020–2025) spanning traditional machine learning, fine-tuned transformer models, and zero-shot prompting approaches across datasets such as Kaggle MBTI, PersonalityCafe, Pandora, and MBTIBench. While top-performing LLM-based systems report 75%–85% accuracy at the dichotomy level, improvements over baselines are often modest, domain-dependent, and sensitive to dataset biases. Recent benchmarks employing soft labels reveal systematic issues, including polarized predictions, overconfidence, and limited calibration relative to population trait distributions. Beyond predictive performance, we examine emerging research that applies MBTI instruments directly to LLMs, showing that models exhibit reproducible yet context-dependent “personality-like” profiles, often skewed toward socially desirable traits due to alignment training. These findings raise conceptual questions about whether stable internal dispositions can meaningfully be attributed to generative systems whose outputs vary across prompts and versions. We argue that MBTI-based modeling with LLMs faces three core challenges: psychometric limitations of the MBTI construct itself, methodological weaknesses in self-reported training data, and philosophical ambiguity regarding the notion of AI personality. The paper concludes by outlining ethical risks, evaluation gaps, and research directions for more rigorous, calibrated, and theoretically grounded personality modeling in artificial intelligence systems.
DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge
🔥 Citations:
0
Abstract: Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft tokens are generated locally on devices and subsequently offloaded to a centralized edge server for batch verification. The key challenge is to sustain high throughput under coupled decisions of (i) batching and pipeline scheduling and (ii) per user draft token length. We propose DiP-SD, which exploits two complementary parallelism dimensions: device-level distributed drafting and phase-level draft-verify pipelining. We formulate a throughput-maximization objective, defined as the expected number of accepted tokens per unit time, and jointly optimize the number of batches, user-to-batch assignment, and integer draft lengths. To solve the resulting fractional mixed-integer program, DiP-SD scans the batch number and iteratively alternates between an association subproblem and a draft-length subproblem. Numerical results under a Qwen3-1.7B/Qwen3-32B device-edge deployment show that DiP-SD achieves up to 17.89x throughput over autoregressive decoding (AD) and 1.93x over AD with greedy batching.
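The throughput objective can be sketched for a single user using the standard speculative-decoding result: if each draft token is accepted independently with probability alpha, a round of draft length gamma yields (1 − alpha^(gamma+1)) / (1 − alpha) expected tokens. The timing constants below are invented, and this single-user sketch omits the paper's batching and assignment subproblems.

```python
def expected_accepted(alpha, gamma):
    """Expected tokens per draft-verify round under i.i.d. acceptance
    with probability alpha: (1 - alpha**(gamma+1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def best_draft_length(alpha, t_draft, t_verify, max_gamma=16):
    """Throughput-maximizing integer draft length for one user, where a
    round of length g costs g*t_draft (on-device) + t_verify (at the edge)."""
    def throughput(g):
        return expected_accepted(alpha, g) / (g * t_draft + t_verify)
    return max(range(1, max_gamma + 1), key=throughput)
```

Users whose drafts are accepted more often warrant longer draft lengths, which is the coupling the joint optimization exploits.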
Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation
🔥 Citations:
0
Abstract: Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p<10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).
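Two of the four serialisation strategies can be sketched on a simplified medication record. The field names below are placeholders, not real FHIR resource structure, and the phrasing is not the paper's exact template.

```python
def to_markdown_table(meds):
    """Markdown Table serialisation of a (simplified) medication list."""
    rows = ["| Medication | Dose | Status |", "| --- | --- | --- |"]
    rows += [f"| {m['name']} | {m['dose']} | {m['status']} |" for m in meds]
    return "\n".join(rows)

def to_clinical_narrative(meds):
    """Clinical Narrative serialisation: plain prose, active meds first."""
    active = [m for m in meds if m["status"] == "active"]
    stopped = [m for m in meds if m["status"] == "stopped"]
    text = ("The patient is currently taking "
            + ", ".join(f"{m['name']} {m['dose']}" for m in active) + ".")
    if stopped:
        text += " Discontinued: " + ", ".join(m["name"] for m in stopped) + "."
    return text

meds = [{"name": "metformin", "dose": "500 mg", "status": "active"},
        {"name": "lisinopril", "dose": "10 mg", "status": "active"},
        {"name": "warfarin", "dose": "5 mg", "status": "stopped"}]
table = to_markdown_table(meds)
narrative = to_clinical_narrative(meds)
```

The study's finding is that which of these strings a sub-8B model reads best is not a cosmetic detail but worth double-digit F1.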
Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring
🔥 Citations:
0
Abstract: We propose a post-hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre-trained foundation models without requiring additional fine-tuning. Our method yields an anomaly score directly interpretable as a false alarm rate (p-value), facilitating transparent and actionable decision-making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out-of-sample guarantees. As a model-agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource-constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real-world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.
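The "anomaly score as a p-value" idea rests on the standard (weighted) split-conformal construction; the sketch below shows that core step only, with invented nonconformity scores, and omits the paper's adaptive learning of the weights.

```python
def conformal_p_value(calib_scores, test_score, weights=None):
    """Weighted split-conformal p-value: the (weighted) share of
    calibration nonconformity scores at least as large as the test
    score, with the usual +1 smoothing. Usable as a false-alarm rate."""
    if weights is None:
        weights = [1.0] * len(calib_scores)
    hits = sum(w for s, w in zip(calib_scores, weights) if s >= test_score)
    return (hits + 1.0) / (sum(weights) + 1.0)

calib = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # nonconformity scores (invented)
p_extreme = conformal_p_value(calib, 10.0)   # nothing as extreme -> small p
p_typical = conformal_p_value(calib, 5.0)    # mid-range -> large p
```

Flagging observations with p below, say, 0.05 then controls the false-alarm rate at roughly that level under exchangeability.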
Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
🔥 Citations:
0
Abstract: Anomaly detection in High Energy Physics requires identifying rare signals against overwhelming backgrounds, without prior knowledge of the signal. We present the first application of masked-token prediction, a technique from Large Language Models, to this problem. A lightweight encoder architecture trained solely on background events captures the structure of Standard Model (SM) physics; at inference, sequences deviating from this learned structure are flagged as anomalous. We evaluate the approach on searches for four-top-quark production and supersymmetric gluino pair production, both featuring top-rich final states with substantial missing transverse energy, covering SM and beyond the Standard Model (BSM) scenarios. Strong performance on the four-top signature, which closely resembles background, demonstrates the method's sensitivity to subtle deviations. We further show that the tokenization strategy significantly impacts performance: deep-learned tokenization via vector-quantized variational autoencoders (VQ-VAE) outperforms look-up table tokenization. Comparison with established anomaly detection baselines confirms robustness. These results highlight the potential of token-based collider data representations combined with transformer architectures for new-physics discovery. Once trained on SM background, the model transfers across different BSM searches, enabling scalable, model-independent anomaly detection at reduced computational cost.
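The detection principle (train on background only, flag sequences the model cannot predict) can be caricatured with a unigram background model in place of the masked-token transformer. Everything below is an invented toy, including the token names.

```python
import math
from collections import Counter

def fit_background(sequences):
    """Unigram stand-in for the background-trained encoder: in the paper
    a transformer predicts masked tokens; here we just count frequencies."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def anomaly_score(model, seq, floor=1e-6):
    """'Mask' each token in turn and average its surprisal under the
    background model; events the model cannot predict score high."""
    return -sum(math.log(model.get(tok, floor)) for tok in seq) / len(seq)

background = [["jet", "jet", "met"], ["jet", "met", "met"]]
model = fit_background(background)
sm_like = anomaly_score(model, ["jet", "met"])       # familiar structure
bsm_like = anomaly_score(model, ["jet", "gluino"])   # unseen token
```

The real method's sensitivity to the four-top signal comes from the same mechanism operating on learned VQ-VAE tokens rather than a frequency table.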
Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
🔥 Citations:
0
Abstract: Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code can be found here https://github.com/owaski/HPO
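One generic way a "hierarchical reward that balances translation quality and latency" can be shaped is to make latency the top tier: within a budget the policy is judged purely on quality, beyond it quality is discounted. The paper's actual formulation is not reproduced here; the structure and constants below are invented.

```python
def hierarchical_reward(quality, latency, budget, penalty_rate=2.0):
    """Hierarchical reward sketch: latency forms the top tier. Within the
    budget, the reward is the quality score alone; past the budget, the
    overrun is penalized linearly. All constants are invented."""
    if latency <= budget:
        return quality
    return quality - penalty_rate * (latency - budget)

on_time = hierarchical_reward(quality=0.8, latency=1.2, budget=1.5)
late = hierarchical_reward(quality=0.9, latency=3.0, budget=1.5)
```

Under this shaping a slightly worse but on-time translation beats a better but late one, which is the trade-off SST policies must learn.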
From Lexicons to Large Language Models: A Holistic Evaluation of Psychometric Text Analysis in Social Science Research
🔥 Citations:
1
Abstract (Research Spotlight):
Extracting psychological insights from text is vital for modern analytics, yet organizations often rely on analysis tools that are either biased and simplistic or prohibitively expensive to build. Our research demonstrates that Large Language Models (LLMs) offer a superior alternative. They match the accuracy of specialized artificial intelligence (AI), while significantly reducing costs and technical barriers. Crucially for policy considerations, we find LLMs are statistically fairer than traditional methods. In our tests, they reduced racial and gender bias by up to 60%. Beyond assessing performance, we introduce a practical technique called “cognitive-affective prompting.” By instructing the AI to adopt specific human strengths, such as using “superior reasoning” for complex tasks or “emotional intelligence” for sentiment analysis, practitioners can boost accuracy by over 10%. To facilitate adoption, we provide a user-friendly “cookbook” to help nonexperts apply these findings immediately. For policymakers and business leaders, this research validates LLMs as a robust, consistent, and equitable standard for analyzing human behavior at scale.
Behavioral Consistency and Transparency Analysis on Large Language Model API Gateways
🔥 Citations:
0
Abstract: Third-party Large Language Model (LLM) API gateways are rapidly emerging as unified access points to models offered by multiple vendors. However, the internal routing, caching, and billing policies of these gateways are largely undisclosed, leaving users with limited visibility into whether requests are served by the advertised models, whether responses remain faithful to upstream APIs, or whether invoices accurately reflect public pricing policies. To address this gap, we introduce GateScope, a lightweight black-box measurement framework for evaluating behavioral consistency and operational transparency in commercial LLM gateways. GateScope is designed to detect key misbehaviors, including model downgrading or switching, silent truncation, billing inaccuracies, and instability in latency by auditing gateways along four critical dimensions: response content analysis, multi-turn conversation performance, billing accuracy, and latency characteristics. Our measurements across 10 real-world commercial LLM API gateways reveal frequent gaps between expected and actual behaviors, including silent model substitutions, degraded memory retention, deviations from announced pricing, and substantial variation in latency stability across platforms.
Multimodal Integration of Ambulatory ECG and Clinical Features for Sudden Cardiac Death and Pump Failure Death Prediction
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
From Hidden Profiles to Governable Personalization: Recommender Systems in the Age of LLM Agents
🔥 Citations:
0
Abstract: Personalization has traditionally depended on platform-specific user models that are optimized for prediction but remain largely inaccessible to the people they describe. As LLM-based assistants increasingly mediate search, shopping, travel, and content access, this arrangement may be giving way to a new personalization stack in which user representation is no longer confined to isolated platforms. In this paper, we argue that the key issue is not simply that large language models can enhance recommendation quality, but that they reconfigure where and how user representations are produced, exposed, and acted upon. We propose a shift from hidden platform profiling toward governable personalization, where user representations may become more inspectable, revisable, portable, and consequential across services. Building on this view, we identify five research fronts for recommender systems: transparent yet privacy-preserving user modeling, intent translation and alignment, cross-domain representation and memory design, trustworthy commercialization in assistant-mediated environments, and operational mechanisms for ownership, access, and accountability. We position these not as isolated technical challenges, but as interconnected design problems created by the emergence of LLM agents as intermediaries between users and digital platforms. We argue that the future of recommender systems will depend not only on better inference, but on building personalization systems that users can meaningfully understand, shape, and govern.
3,3′-Di-O-methylellagic Acid Isolated from Euphorbia humifusa Willd Suppresses Prostate Cancer Cell Viability via Regulating VDAC1 Protein Expression
DOI:
10.3390/ph19050652
🔥 Citations:
0
Abstract: Background: Prostate cancer (PCa) is the leading male urinary malignancy globally. Our previous article demonstrated the anti-PCa activity of Euphorbia humifusa Willd water extract (EHW) and some of its compounds via downregulating AR expression, but the anti-PCa active compounds from Euphorbia humifusa Willd (EH) and their mechanisms of action are yet to be clarified. Thus, the current article studied the in vitro anti-PCa effects of 3,3′-di-O-methylellagic acid (3,3′-di-O-Me-EA) derived from EHW and the related mechanism involved. Methods: 3,3′-di-O-Me-EA was isolated from EHW using bioassay-guided fractionation. Spectroscopic methods were used to determine the structure of 3,3′-di-O-Me-EA. The drug-likeness and ADMET properties (absorption, distribution, metabolism, excretion, and toxicity) of 3,3′-di-O-Me-EA were analyzed in silico. Molecular docking and real-time surface plasmon resonance (SPR) analysis were performed to measure the interaction of 3,3′-di-O-Me-EA and VDAC1 protein. The viability and apoptosis of 22RV-1 and DU145 PCa cells were determined using MTT and Annexin V-FITC staining assays, respectively. RT-qPCR and Western blot experiments were used to analyze the gene and protein expression of VDAC1. Results: 3,3′-di-O-Me-EA was isolated and purified from EHW with a purity of ≥90.06%, and its structure was identified by HRTOF mass spectrometry, NMR, and an authentic standard. In silico ADMET analysis indicated its favorable drug-like and pharmacokinetic properties. Molecular docking and SPR results confirmed that 3,3′-di-O-Me-EA could bind with the VDAC1 protein. Moreover, 3,3′-di-O-Me-EA dose- and time-dependently inhibited 22RV-1 and DU145 PCa cell viability, and induced apoptosis in a dose-dependent manner (p < 0.05). RT-qPCR and Western blot results showed that 3,3′-di-O-Me-EA dose-dependently up-regulated VDAC1 gene and protein expression levels in 22RV-1 and DU145 cells (p < 0.05).
Meanwhile, in VDAC1-depleted 22RV-1 and DU145 cells, 3,3′-di-O-Me-EA down-regulated VDAC1 gene and protein expression levels, increased cell viability, and inhibited apoptosis compared to 22RV-1 and DU145 cells (p < 0.05). Furthermore, 3,3′-di-O-Me-EA enhanced VDAC1 gene and protein expression levels, inhibited cell viability, and induced apoptosis in VDAC1-overexpressed 22RV-1 and DU145 cells compared with 22RV-1 and DU145 cells (p < 0.05). Overall, EH active compound 3,3′-di-O-Me-EA may inhibit viability and induce apoptosis of 22RV-1 and DU145 PCa cells via up-regulating VDAC1 gene and protein expression levels. Conclusion: The results indicated that the 22RV1 and DU145 PCa cell viability inhibitory effects of 3,3′-di-O-Me-EA isolated from EH may be mediated by induction of apoptosis through up-regulation of VDAC1 gene and protein expression levels.
Knowledge Capsules: Structured Nonparametric Memory Units for LLMs
🔥 Citations:
0
Abstract: Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key-Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
🔥 Citations:
0
Abstract: Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
🔥 Citations:
0
Abstract: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
🔥 Citations:
0
Abstract: The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
🔥 Citations:
0
Abstract: Software vulnerabilities are a primary threat to modern infrastructure. While static analysis and Graph Neural Networks have long served as the foundation for vulnerability detection, the emergence of Large Language Models (LLMs) has introduced a transformative paradigm driven by superior semantic reasoning and cross-environment generalization. However, in the context of LLM-based vulnerability detection, we identify a fundamental bottleneck in these models termed Signal Submersion: a state where features related to vulnerability are activated internally but numerically overwhelmed by dominant functional semantics. To address this, we propose SAGE (Signal-Amplified Guided Embeddings), a framework that shifts from passive signal submersion to active signal recovery. SAGE integrates task-conditional Sparse Autoencoders (SAEs) to isolate and amplify these faint vulnerability signals. Extensive evaluations on BigVul, PrimeVul, and PreciseBugs demonstrate that SAGE achieves state-of-the-art performance. Notably, SAGE mitigates Signal Submersion by increasing the internal Signal-to-Noise Ratio (SNR) by 12.7× via sparse manifold projection. This mechanistic intervention enables a 7B model to achieve up to 318% Matthews Correlation Coefficient (MCC) gains on unseen distributions and a 319% gain on classic datasets. By maintaining robust performance across 13 programming languages and outperforming 34B baselines, SAGE establishes a more efficient and scalable path to software security than simple parameter scaling.
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
🔥 Citations:
0
Abstract: Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.
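The duplication step of the upcycling operator can be sketched in plain Python. This is an illustrative toy, not the paper's operator: each expert and its router column are copied m times, and a small noise term on the duplicated router columns breaks the symmetry that continued pre-training then exploits to drive specialization. The noise scale, copy ordering, and data layout are assumptions.

```python
import random

def upcycle_experts(expert_weights, router_cols, m, noise=1e-3, seed=0):
    """Expand an E-expert MoE layer to m*E experts by duplication.
    expert_weights: list of E per-expert weight objects.
    router_cols: list of E router columns (lists of floats), one per expert.
    Copies inherit the source weights (by reference here; a real
    implementation would deep-copy the tensors), while tiny noise on the
    duplicated router columns breaks the symmetry among copies."""
    rng = random.Random(seed)
    new_experts = [w for w in expert_weights for _ in range(m)]
    new_cols = [
        [x + noise * rng.gauss(0.0, 1.0) for x in col]  # warm start + jitter
        for col in router_cols
        for _ in range(m)
    ]
    return new_experts, new_cols
```

Because top-K routing is held fixed, the expanded model has the same per-token inference cost as the source; only total capacity grows.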
Lessons Learned From AI-Assisted Guideline Generation in Parastomal Hernia Repair
🔥 Citations:
0
Abstract:
Large language models (LLMs) can analyse scientific literature and draft medical recommendations, but their role in formal clinical guideline development is unclear.
To evaluate whether a publicly available GPT-based LLM can generate coherent, GRADE-based guidelines for parastomal hernia management from a predefined evidence base, and to compare these with the 2017 European Hernia Society (EHS) guidelines. A secondary aim was to explore implications for academic publishing and scientific authorship.
The 2017 EHS parastomal hernia guidelines (Antoniou et al.) were used as the reference framework. Within a closed session, the model was instructed to apply AGREE II and GRADE principles to 52 full-text clinical papers mirroring the original EHS reference set, and to formulate recommendations for nine key clinical questions (KQs). For each KQ, the model defined PICO, summarized the evidence, rated certainty, and stated direction and strength of recommendation. AI-derived guidance was then systematically compared with EHS statements. Divergences were classified as interpretative, threshold-based (handling of low-certainty evidence), or evidence-weighting.
AI-generated recommendations showed full or near-full alignment with EHS guidance in most domains, including diagnosis, prophylactic mesh for permanent end colostomy, rejection of suture-only repair, preference for non-keyhole laparoscopic repair, and favouring synthetic over biologic meshes. Differences arose primarily where evidence was very low quality: the model issued cautious, conditional recommendations (e.g., watchful waiting in asymptomatic hernias, consideration of laparoscopy in suitable patients, preference for retromuscular synthetic mesh and avoidance of cross-linked collagen onlay), whereas EHS opted for no recommendation.
Within a closed evidence base, a GPT-based model can reproduce the logic and structure of expert guideline development with high fidelity. Discrepancies mainly reflect different thresholds for acting on low-certainty evidence, supporting a complementary role for AI as a structured methodological and drafting assistant rather than a replacement for human consensus.
Automated LTL Specification Generation from Industrial Aerospace Requirements
🔥 Citations:
0
Abstract: In the development and verification of safety-critical aerospace software, Linear Temporal Logic (LTL) has been widely used to specify complex system properties derived from requirements. However, a significant gap remains in industrial practice: translating natural language (NL) requirements into formal LTL properties is a labor-intensive and error-prone process that requires rare expertise in both aerospace control engineering and formal methods. While recent NL-to-LTL tools (e.g., NL2SPEC, NL2TL, NL2LTL) are capable of automating parts of this process, they often fail on real requirement documents in industrial settings, due to complex domain terminology or implicit temporal and logical structure. To address these challenges, we present AeroReq2LTL, a framework that automates LTL property generation for aerospace requirements using large language models (LLMs), with two key industrial innovations: (i) a data dictionary that normalizes technical jargon into precise atomic propositions; and (ii) a template-based requirement language that makes temporal cues and logical relations explicit before translation. On a real aerospace dataset, AeroReq2LTL achieves 85% precision and 88% recall in LTL generation, and its outputs can be directly consumed by existing verification tools.
PeFoMed: Parameter efficient fine-tuning of multimodal large language models for medical CXR
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
🔥 Citations:
0
Abstract: Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
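The Ramer-Douglas-Peucker step itself is a standard, parameter-free polyline simplification; a minimal 2-D sketch is below. The paper applies it to high-dimensional hidden-state trajectories, so representing each layer as a point `(layer_index, scalar_summary)` is an illustrative simplification of that setting.

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: recursively keep the point farthest from the
    chord between the endpoints whenever that distance exceeds epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0  # guard against a degenerate chord
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        # Perpendicular distance of points[i] from the endpoint chord.
        d = abs(dy * (px - x1) - dx * (py - y1)) / norm
        if d > dmax:
            dmax, idx = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]  # segment is locally redundant
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right  # drop the duplicated breakpoint
```

The surviving interior points are the breakpoints; in the paper's usage, the layers they correspond to are the ones selected for LoRA adaptation.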
ECLASS-Augmented Semantic Product Search for Electronic Components
🔥 Citations:
0
Abstract: Efficient semantic access to industrial product data is a key enabler for factory automation and emerging LLM-based agent workflows, where both human engineers and autonomous agents must identify suitable components from highly structured catalogs. However, the vocabulary mismatch between natural-language queries and attribute-centric product descriptions limits the effectiveness of traditional retrieval approaches, e.g., BM25. In this work, we present a systematic evaluation of LLM-assisted dense retrieval for semantic product search on industrial electronic components, and investigate the integration of hierarchical semantics from the ECLASS standard into embedding-based retrieval. Our results show that dense retrieval combined with re-ranking substantially outperforms classical lexical methods and foundation model web-search baselines. In particular, the proposed approach achieves a Hit_Rate@5 of 94.3%, compared to 31.4% for BM25 on expert queries, while also exceeding foundation model baselines in both effectiveness and efficiency. Furthermore, augmenting product representations with ECLASS semantics yields consistent performance gains across configurations, demonstrating that standardized hierarchical metadata provides a crucial semantic bridge between user intent and sparse product descriptions.
Detecting Data Contamination in Large Language Models
🔥 Citations:
0
Abstract: Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIAs) aim to detect such documents and determine whether they were included in the training corpora of LLMs. Black-box MIAs require a significant amount of data manipulation, which often makes comparing them challenging. We study state-of-the-art (SOTA) MIAs under black-box assumptions and compare them on a unified set of datasets to determine whether any can reliably detect membership in SOTA LLMs. In addition, we develop a new method, called Familiarity Ranking, to showcase a possible approach to black-box MIAs, giving LLMs more freedom of expression so that their reasoning can be better understood. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate stronger reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.
Learning transcriptome architecture from sequence with a long-context RNA foundation model
🔥 Citations:
4
Abstract: Linking DNA sequence to genomic function remains one of the grand challenges in genetics and genomics. Here, we present a large-scale compendium of single-molecule transcriptome sequencing of diverse cancer cell lines, revealing their isoform diversity and specificity. We used this compendium to build Mach-1, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture—the relative abundances and molecular structures of mRNA isoforms. By using the Striped-Hyena architecture, Mach-1 handles extremely long sequence inputs at nucleotide resolution (64 kilobase pairs), allowing for quantitative, zero-shot prediction of all aspects of transcriptome architecture, spanning isoform abundance, structure, and variant-induced splicing changes. To test both the interpretive and generative capabilities of Mach-1, we experimentally validated its learned regulatory grammar and predictions through perturbation of RNA-binding proteins nominated by Mach-1 to impact targeted splicing, precise CRISPR editing of variants of uncertain significance that the model predicted to alter splicing, and de novo transcript synthesis and expression in human cells. Together, this release establishes a new foundation for sequence-to-transcript modeling. Mach-1’s representations can be extended and fine-tuned across a spectrum of biological contexts, from variant interpretation to RNA engineering.
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
🔥 Citations:
0
Abstract: Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
Joint Service Chain Orchestration and Computation Offloading via GNN-Based QMIX in Industrial IoT
DOI:
10.3390/s26082559
🔥 Citations:
0
Abstract: In IIoT edge computing, multi-edge server collaborative scheduling faces two core issues due to random task arrivals, heterogeneous resources, and complex topology: traditional model-driven methods cannot make dynamic decisions in dynamic environments, and conventional MARL fails to characterize inter-node topological dependencies and load correlations. To address this, this paper investigates the joint optimization of task offloading, computing resource allocation, and SFC orchestration in IIoT, constructs a cloud-edge-end collaborative architecture, and models the problem as a POMDP to minimize the overall system cost under multiple constraints. A graph-guided value-decomposition MARL method is proposed, which extracts spatial topology and neighborhood-load features of edge nodes via a GNN and combines them with the QMIX framework to realize multi-agent centralized training and distributed execution. Simulations show that the algorithm converges stably under different server scales and task loads, significantly outperforms benchmark algorithms, and can suppress performance degradation in high-load scenarios, demonstrating its robustness and scalability in complex industrial environments.
Predicting Scale-Up of Metal-Organic Framework Syntheses with Large Language Models
🔥 Citations:
0
Abstract: Scalable synthesis remains the gate between MOF discovery and industrial deployment, as scale-up know-how is fragmented across disparate reports. We introduce ESU-MOF, a literature-mined dataset and a positive-unlabeled learning strategy that fine-tunes large language models to predict scalability potential with 91.4% accuracy, enabling rapid data-driven triage for industrial MOF discovery.
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
🔥 Citations:
0
Abstract: Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human-AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.
ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
🔥 Citations:
0
Abstract: Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
Benchmarking Agentic Large Language Models for Complex Protein-Set Functional Annotation
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Are Large Language Models Economically Viable for Industry Deployment?
🔥 引用:
0
Abstract: Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics: Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density (ρsys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret), capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.
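The break-even metric lends itself to a back-of-envelope sketch. The formula, function name, and parameters below are illustrative assumptions only; the paper's published definition of Nbreak may differ:

```python
import math

def economic_break_even(deployment_cost, value_per_request, cost_per_request):
    """Hypothetical N_break: number of served requests after which cumulative
    per-request margin covers the fixed deployment cost. This is an
    illustrative formula, not EDGE-EVAL's published definition."""
    margin = value_per_request - cost_per_request
    if margin <= 0:
        return math.inf  # the deployment never breaks even
    return math.ceil(deployment_cost / margin)
```

Under this toy definition, a $100 deployment earning $10 per request at $2 per-request cost breaks even after 13 requests.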
Adaptive feature fusion network for machine fault diagnosis with multiple knowledge based graphs
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study
🔥 引用:
0
Abstract: Background Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can transform ML model-predicted risk estimates with Shapley Additive Explanations (SHAP) into clinically meaningful support information, yet the added value of incorporating ML-derived data and the relative performance of different LLMs remain uncertain. To address these gaps, we used our previously developed IMPACT framework to evaluate the quality of LLM-generated outputs. Methods In this retrospective analysis of MIMIC-IV v3.1 intensive care unit (ICU) admissions, we applied a previously developed XGBoost model to estimate ICU mortality risk and derive corresponding SHAP values. GPT-4o transformed the predicted mortality risk, clinical predictors, and their SHAP values into risk interpretation, recommended examinations and management. The primary analysis examined whether augmenting LLM inputs with predicted mortality risk and SHAP values improved clinical response quality, as assessed by the IMPACT framework. We further compared GPT-4o with seven contemporary LLMs; all eight models generated clinical support responses that were scored by Claude 3.7 Sonnet to assess performance differences. Results Claude 3.7 Sonnet showed excellent agreement with human IMPACT ratings (intraclass correlation coefficient [ICC] 0.979, 95% CI 0.973–0.984) and o3-mini (ICC 0.971, 95% CI 0.964–0.980). In the primary analysis, adding predicted ICU mortality risk and SHAP values significantly increased GPT-4o IMPACT scores across prompting strategies. GPT-5 mini (96.0) and gpt-oss-120B (93.4) outperformed GPT-4o (90.4; both p < 0.001) for interpretability and quality. Conclusions Combining ML-derived risk, SHAP explanations and LLMs may modestly improve ICU clinical support information, while LLM-based evaluators demonstrated feasibility for scalable evaluation of generated clinical content. 
Supplementary Information: The online version contains supplementary material available at 10.1186/s40635-026-00900-w.
Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
🔥 引用:
0
Abstract: Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
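The sampling mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the concentration parameter is the number of multinomial draws per attention row; the paper's exact parameterization may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_attention_weights(scores, concentration, rng):
    """Replace deterministic softmax weights with normalized multinomial
    samples. A larger `concentration` (number of draws) makes the sampled
    weights concentrate around the softmax distribution."""
    p = softmax(scores)
    counts = rng.multinomial(concentration, p)  # one sampled allocation per row
    return counts / concentration               # renormalize to valid weights

# Repeated sampling yields a predictive ensemble without retraining.
rng = np.random.default_rng(0)
scores = np.array([2.0, 1.0, 0.1])
ensemble = [stochastic_attention_weights(scores, concentration=50, rng=rng)
            for _ in range(100)]
mean_w = np.mean(ensemble, axis=0)
```

The ensemble mean recovers the softmax weights in expectation, while the spread across samples provides the uncertainty signal to be calibrated via the concentration parameter.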
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
🔥 引用:
0
Abstract: Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
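The step-alignment idea can be illustrated with a small monotone dynamic program in the spirit of classic sequence alignment. The similarity function and threshold below are placeholders, not the paper's actual scoring method:

```python
def step_similarity(a, b):
    """Hypothetical step similarity: token-level Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align_steps(predicted, reference, threshold=0.3):
    """Monotone DP alignment (LCS-style): each predicted step matches at
    most one reference step, order is preserved, and unmatched steps are
    skipped. Returns the total similarity of the best alignment."""
    n, m = len(predicted), len(reference)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = step_similarity(predicted[i - 1], reference[j - 1])
            match = dp[i - 1][j - 1] + (s if s >= threshold else 0.0)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], match)
    return dp[n][m]
```

Scoring against multiple reference solutions and taking the maximum gives a step-level credit assignment rather than final-answer accuracy alone.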
In silico screening of candidate NAMPT modulators for treatment of age-related diseases
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Large language models perceive cities through a culturally uneven baseline
🔥 引用:
0
Abstract: Large language models (LLMs) are increasingly used to describe, evaluate and interpret places, yet it remains unclear whether they do so from a culturally neutral standpoint. Here we test urban perception in frontier LLMs using a balanced global street-view sample and prompts that either remain neutral or invoke different regional cultural standpoints. Across open-ended descriptions and structured place judgments, the neutral condition proved not to be neutral in practice. Prompts associated with Europe and Northern America remained systematically closer to the baseline than many non-Western prompts, indicating that model perception is organized around a culturally uneven reference frame rather than a universal one. Cultural prompting also shifted affective evaluation, producing sentiment-based ingroup preference for some prompted identities. Comparisons with regional human text-image benchmarks showed that culturally proximate prompting could improve alignment with human descriptions, but it did not recover human levels of semantic diversity and often preserved an affectively elevated style. The same asymmetry reappeared in structured judgments of safety, beauty, wealth, liveliness, boredom and depression, where model outputs were interpretable but only partly reproduced human group differences. These findings suggest that LLMs do not simply perceive cities from nowhere: they do so through a culturally uneven baseline that shapes what appears ordinary, familiar and positively valued.
OpenStreetMap based POI knowledge graph enhanced by large language model
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Kingdom-wide comparative transcriptomics reveals deeply conserved and predictable stress response programs across Viridiplantae
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Exploring Cross-Debate Between LLMs to Improve the Forecasting of Financial Market Indicators
DOI:
10.3390/math14081393
🔥 引用:
0
Abstract: In the context of political and financial market turmoil, effectively forecasting financial market trends is crucial for investment decisions. Large language models (LLMs) have been applied in extant research to predict market trends, analyze investor sentiments and interpret financial news, all aiming to help investment decision making. However, LLMs face limitations due to training data heterogeneity, restricting multidimensional perspectives and hindering comparative analysis for optimization. This study proposes a “Dual-Agent LLM Debate Mechanism” framework using a Proponent (LLM1: Gemini Pro 3) and an Opponent (LLM2: ChatGPT 5.2) to address single-LLM forecasting gaps: The Proponent generates a baseline forecast (F1) from an Integrated Context, while the Opponent validates and resolves conflicts with the Proponent via up to three rounds of cross-debate to produce a consensus forecast (F2). A controlled experiment was conducted to analyze 75 financial market indicators (FMIs) across five asset categories, revealing that F2 outperforms F1 in accuracy and directional stability, particularly in highly volatile assets like Cryptocurrencies and 10-Year Government Bonds. Paired-sample t-tests confirmed statistical significance, validating the mechanism’s effectiveness. Our study results demonstrate how cross-debate between LLMs enhances forecasting accuracy through structured optimization.
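The debate protocol described in the abstract reduces to a short control loop. The agent interfaces below are hypothetical stand-ins for the actual LLM API calls:

```python
def cross_debate(proponent, opponent, context, max_rounds=3):
    """Sketch of the dual-agent debate loop: `proponent(context)` returns a
    baseline forecast F1; `opponent(context, forecast)` returns a pair
    (accepted, revision). Up to `max_rounds` rounds of cross-debate yield
    the consensus forecast F2. Both callables are illustrative stand-ins."""
    f1 = proponent(context)        # baseline forecast F1
    forecast = f1
    for _ in range(max_rounds):
        accepted, revision = opponent(context, forecast)
        if accepted:               # consensus reached -> F2
            break
        forecast = revision        # conflict: adopt the revised forecast
    return f1, forecast            # (F1, consensus F2)
```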
Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine
🔥 引用:
0
Abstract: Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
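The deterministic reasoning backend described above amounts to a standard Bayesian belief update over candidate diagnoses. The names and evidence encoding below are illustrative, not BMBE's actual interface:

```python
def bayes_update(prior, likelihoods, evidence):
    """Deterministic, auditable belief update given one piece of structured
    evidence (e.g. a symptom parsed by the LLM sensor). `likelihoods[d]`
    maps evidence items to P(evidence | diagnosis d); a small floor stands
    in for unmodeled evidence. All names here are hypothetical."""
    posterior = {d: p * likelihoods[d].get(evidence, 1e-6)
                 for d, p in prior.items()}
    z = sum(posterior.values())
    return {d: p / z for d, p in posterior.items()}
```

Because the update is a pure function of the knowledge base, every diagnostic step can be replayed and audited, and the statistical backend can be swapped per target population without touching the language layer.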
ReaLB: Real-Time Load Balancing for Multimodal MoE Inference
🔥 引用:
0
Abstract: Mixture-of-Experts (MoE) architectures are widely used in modern large language models and multimodal models. However, inference efficiency is often limited by highly dynamic and skewed expert workloads across different modalities. During the prefill stage with large batch sizes, vision tokens frequently dominate the input sequences. Under expert parallelism (EP), this leads to severe load imbalance, where a subset of devices becomes overloaded, reducing overall system throughput. We propose ReaLB, a real-time load balancing method for multimodal MoE (MMoE) inference that introduces zero scheduling overhead. ReaLB dynamically adjusts the computation precision of MoE experts at runtime on a per-EP-rank basis. For ranks dominated by vision-heavy experts, ReaLB assigns lower-precision computation to improve execution efficiency by exploiting FP4 Tensor Cores. ReaLB does not require redundant experts or additional memory allocation. Instead, it performs layer-wise expert precision transformation on the fly and hides the associated overhead within the dispatch phase before MoE computation. Experiments on representative MMoE models show that ReaLB achieves 1.29x layer-level speedup while limiting accuracy loss to within 1.2%.
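The per-rank precision policy can be illustrated minimally. The threshold rule and names below are assumptions; ReaLB's actual scheduling logic is more involved:

```python
def assign_expert_precision(vision_token_share, fp4_threshold=0.5):
    """Illustrative per-EP-rank policy: ranks whose experts see a
    vision-heavy token mix run in FP4 (exploiting FP4 Tensor Cores),
    while the rest stay at the default precision. The threshold and
    precision labels are assumptions, not ReaLB's API."""
    return ["fp4" if share >= fp4_threshold else "bf16"
            for share in vision_token_share]
```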
Automatic Extraction of Suppliers’ ESG Compliance Information from Textual Sources: A Literature Review
DOI:
10.3390/app16084024
🔥 引用:
0
Abstract: This paper presents a literature review regarding the automatic extraction of meaningful information regarding suppliers’ ESG and sustainability compliance from textual sources. Assessing suppliers’ ESG compliance has become a key challenge for procurement managers. Given the large number of suppliers and required data points, traditional approaches such as questionnaires and audits are inefficient, ineffective and difficult to scale. To solve this problem, we investigate whether the required information can be automatically harvested from suppliers’ textual sources. Our structured literature review identified 82 papers on which we performed a descriptive analysis, finding a rich and flourishing body of literature produced by a heterogeneous scientific community. We further reduced our sample to 73 full-text articles that supported a more in-depth content-based analysis. We investigated which data sources can be used in particular, which technologies can be leveraged, and which types of outputs can be generated. Even though they could provide much of the required information, corporate websites are rarely utilized as data sources, partly due to the limited adoption of large language models (LLMs). LLMs are less diffused than traditional Natural Language Processing (NLP) techniques due to their recent introduction and some gaps that still limit their performance. This represents both a constraint and an opportunity for future research.
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
🔥 引用:
0
Abstract: Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.
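The hard-mask projection at the core of the gating step can be sketched as a top-k selection under the budget. This illustrates the projection only; the straight-through gradient and the joint FFN/KV-head budgeting are omitted:

```python
import numpy as np

def project_to_budget(scores, budget):
    """Project continuous gate scores onto a hard 0/1 mask that keeps
    exactly `budget` units (the highest-scoring ones). During training, a
    straight-through estimator would pass gradients through this hard
    mask back to the underlying scores."""
    mask = np.zeros_like(scores)
    keep = np.argsort(scores)[-budget:]  # indices of the top-`budget` scores
    mask[keep] = 1.0
    return mask
```

Applying this projection at every step keeps the budget satisfied throughout training, rather than imposing it only after importance scores are learned.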
The Potential Role of Large Language Models in Assisting Patients and Guiding Emergency Care Visits
DOI:
10.3390/jcm15083170
🔥 引用:
0
Abstract: Background/Objectives: Overcrowding in emergency departments (EDs) remains a critical challenge in modern healthcare systems, driven in part by patient uncertainty regarding symptom urgency and a lack of accessible medical guidance. Recent advances in artificial intelligence, particularly large language models (LLMs), present a novel opportunity to support patient navigation and relieve pressure on ED infrastructures. Methods: A total of 238 unique patient questions were identified through a structured web search. Following deduplication and thematic clustering, 15 representative questions were selected. Each question was submitted to the three LLMs—ChatGPT (v3.5), DeepSeek, and Gemini—using a standardized prompt. Responses were assessed by clinical experts (N = 8) who were blinded to the model source. Reviewers selected the best overall response per question, as well as the individual responses of the three LLMs for each respective question. Results: ChatGPT was selected as the best-performing model in 60% of cases, with DeepSeek and Gemini selected in 23% and 17%, respectively. ChatGPT responses also achieved the highest proportion of “excellent” quality ratings and the lowest proportion of “unsatisfactory” outputs. Across all models, clarity was the most positively rated domain (79% agreement), followed by empathy (72%), length/detail appropriateness (71%), and completeness (65%). Over two-thirds of raters expressed willingness to integrate LLM-based tools into clinical practice for patient education and pre-triage counseling. Conclusions: Large language models demonstrate promising capabilities in responding to emergency care-related patient queries. Their ability to deliver medically sound and communicatively effective answers positions them as potential digital adjuncts in the management of low-acuity ED presentations and prehospital triage.
Deep sprite-based image models: An analysis
🔥 引用:
0
Abstract: While foundation models drive steady progress in image segmentation and diffusion algorithms compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, identifies explicitly object categories, and fully models images in an easily interpretable way.
InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
🔥 引用:
1
Abstract: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
🔥 引用:
0
Abstract: Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.
Fragmented Infrastructures, Fragmented Learners: The Case for Integrated, Longitudinal Learning Ecosystems in Computing Education
DOI:
10.1145/3811544
🔥 引用:
0
Abstract: Undergraduate computing education is uniquely shaped by heterogeneous technical infrastructures: students move across programming languages, operating systems, IDEs, auto-graders, version-control platforms, and increasingly large language model (LLM)–mediated tools. These systems generate rich but fragmented learning traces that remain isolated within course-bound platforms, limiting observability of how debugging practices, abstraction strategies, and conceptual understanding evolve across the curriculum. Although intelligent tutoring systems (ITS) and AI-enabled assessment tools show promise within individual contexts, their impact is typically localized, lacking semantic interoperability and continuity across languages and courses. Drawing on research in self-regulated learning, conceptual transfer, and identity formation in computing, this article argues that infrastructural fragmentation constrains longitudinal development of computational expertise. We advance a computing-specific vision of integrated learning ecosystems in which execution histories, feedback artifacts, repository data, and learner models are preserved and interoperable across contexts, enabling trajectory-level inquiry into expertise development over time. We conclude by outlining discipline-centered research challenges for building coherent computing education infrastructures.
Generative AI and Large Language Models
DOI:
10.3390/bdcc10040127
🔥 引用:
0
Abstract: In recent years, generative artificial intelligence and, in particular, large language models (LLMs) have rapidly transformed the landscape of data analysis, knowledge extraction, content generation, and intelligent decision support [...]
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
🔥 引用:
0
Abstract: Negation is a common and important semantic feature in natural language, yet Large Language Models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with if-then relations. In addition, pre-training LLMs on our corpora benefits negation understanding.
Structure Guided Retrieval-Augmented Generation for Factual Queries
🔥 引用:
0
Abstract: Retrieval-Augmented Generation (RAG) has been proposed to mitigate hallucinations in large language models (LLMs), where generated outputs may be factually incorrect. However, existing RAG approaches predominantly rely on vector similarity for retrieval, which is prone to semantic noise and fails to ensure that generated responses fully satisfy the complex conditions specified by factual queries, often leading to incorrect answers. To address this challenge, we introduce a novel research problem, named Exact Retrieval Problem (ERP). To the best of our knowledge, this is the first problem formulation that explicitly incorporates structural information into RAG for factual questions to satisfy all query conditions. For this novel problem, we propose Structure Guided Retrieval-Augmented Generation (SG-RAG), which models the retrieval process as an embedding-based subgraph matching task, and uses the retrieved topological structures to guide the LLM to generate answers that meet all specified query conditions. To facilitate evaluation of ERP, we construct and publicly release Exact Retrieval Question Answering (ERQA), a large-scale dataset comprising 120000 fact-oriented QA pairs, each involving complex conditions, spanning 20 diverse domains. The experimental results demonstrate that SG-RAG significantly outperforms strong baselines on ERQA, delivering absolute improvements from 20.68 to 50.88 points across all evaluation metrics, while maintaining reasonable computational overhead.
DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
🔥 引用:
0
Abstract: This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.
Sensitivity Uncertainty Alignment in Large Language Models
🔥 引用:
0
Abstract: We propose Sensitivity-Uncertainty Alignment (SUA), a framework for analyzing failures of large language models under adversarial and ambiguous inputs. We argue that adversarial sensitivity and ambiguity reflect a common issue: misalignment between prediction instability and model uncertainty. A reliable model should express higher uncertainty when its predictions are unstable; failure to do so leads to miscalibration. We define a scalar score, SUA_theta(x), capturing the difference between distributional sensitivity and predictive entropy. We show that minimizing its positive part bounds worst-case perturbed risk and relates to calibration error. We also formalize ambiguity collapse, where models produce overconfident outputs despite multiple valid interpretations. We introduce SUA-TR, a training method combining consistency regularization and entropy alignment, along with an abstention rule for safer inference. Across tasks including question answering and classification, SUA better identifies model failures than entropy or self-consistency alone. The framework is model-agnostic and provides a basis for improving reliability in evolving language models.
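The SUA score can be illustrated concretely. Using total-variation distance for distributional sensitivity is an assumption here; the paper's exact divergence and handling of theta may differ:

```python
import numpy as np

def predictive_entropy(p):
    """Shannon entropy of a predictive distribution (nats)."""
    p = np.asarray(p)
    return float(-(p * np.log(np.clip(p, 1e-12, 1.0))).sum())

def sua_score(p_clean, p_perturbed_list, theta=1.0):
    """Illustrative SUA-style score: distributional sensitivity (mean
    total-variation distance to perturbed predictions) minus
    theta-weighted predictive entropy. Positive values flag instability
    that the model's expressed uncertainty fails to cover."""
    p_clean = np.asarray(p_clean)
    sens = np.mean([0.5 * np.abs(p_clean - np.asarray(q)).sum()
                    for q in p_perturbed_list])
    return sens - theta * predictive_entropy(p_clean)
```

A confident prediction that flips under perturbation scores positive (miscalibrated), while a stable prediction with honest uncertainty scores non-positive, matching the abstention rule's intent.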
Fuzz Driver Generation: A Survey and Outlook from the Perspective of Data Sources
DOI:
10.3390/bdcc10040129
🔥 引用:
0
Abstract: Fuzzing is an essential element of software supply chain security governance. Despite its importance, the widespread adoption of library fuzzing is limited by the significant costs associated with constructing fuzz drivers. Without a clear entry point, the reachable path space of the target library is determined by the interplay of API call sequences, parameter dependencies, and state constraints. As a result, fuzz drivers must achieve not only successful builds but also provide sufficient semantic context to enable exploration of deeper state machine interactions, thereby avoiding premature stagnation at superficial validation logic. To systematically assess advancements in automated fuzz driver generation, this paper develops a taxonomy organized around the primary data sources used to derive driver-generation constraints, categorizing existing approaches into four technological trajectories: Usage Artifact Mining, Source Code Constraint Inference, Binary Semantics Recovery, and Heterogeneous Data Fusion. Large language models are increasingly integrated into these workflows as generators and as components for constraint alignment and repair. To address inconsistencies in experimental methodologies, this paper introduces a bounded comparability-oriented evaluation perspective focused on three dimensions: validity, reachability-related evidence, and reproducibility and cost. Together with a disclosure and reporting protocol for metric comparability, this perspective clarifies the information needed for cross-study comparison and examines the unique features and inherent limitations of each technical trajectory. 
Based on these findings, three key directions for future research are identified: facilitating structural evolution in response to coverage plateaus to address deep logic unreachability; coordinating dynamic closed-loop orchestration that utilizes on-demand heterogeneous data retrieval to resolve context challenges; and developing language-agnostic driver representations with pluggable adaptation mechanisms to improve cross-ecosystem portability and scalability.
Automated data extraction for systematic reviews using GPT‐5.2 and Google Gemini Pro 3: A dual‐large language model approach in orthopaedic research
DOI:
10.1002/ksa.70412
🔥 引用:
0
Abstract: To evaluate the accuracy, agreement, and efficiency of a dual‐large language model (LLM) approach using Generative Pre‐Trained Transformer 5.2 (GPT‐5.2) and Google Gemini 3 Pro for automated data extraction in orthopaedic systematic reviews.
Eight studies from a previously published systematic review on paediatric revision anterior cruciate ligament reconstruction were used to test extraction accuracy, agreement, and efficiency against a pre‐defined gold‐standard. Both GPT 5.2 and Gemini 3 Pro were prompted via the OpenAI and Google Application Programming Interface (API). Each study had a total of 48 equally‐weighted data fields to extract from spanning six domains: study characteristics, participant details, injury characteristics, primary and revision surgery details, and outcomes. Extractions were graded as correct, partially correct, or incorrect in reference to the gold‐standard.
Across all 384 fields, both LLMs produced fully correct outputs in 315 (82%) cases, while at least one model was fully correct in 365 (95.1%). Among the six extraction domains, study characteristics (100%, 32/32), injury characteristics (93.8%, 30/32), and outcomes (91.1%, 102/112) showed the highest percentage of at least one model being correct. The entire extraction task was completed in 27 and 35.8 min by GPT‐5.2 and Gemini 3 Pro, respectively, for a total API cost of $3.22USD.
A parallel‐LLM approach using GPT‐5.2 and Gemini 3 Pro achieved strong accuracy with a high degree of efficiency for automated data extraction in an orthopaedic systematic review. Most errors were due to omission of minor details in complex domains such as surgical details. At least one model was fully correct in over 95% of fields, supporting the use of a dual‐LLM framework as a reliable first‐pass tool for human verification.
Level IV.
On Accelerating Grounded Code Development for Research
🔥 Citations:
0
Abstract: A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via doc-search.dev and includes zed-fork, which enforces domain-specific rules and workflows. Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows.
Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees
🔥 Citations:
0
Abstract: Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code will be released to the public soon.
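To make the operator-tree idea concrete, here is an illustrative sketch (not the paper's implementation; the node representation and the example statement are ours) of decomposing a statement into subtrees that can be repaired independently:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """One operator or atom in an operator tree."""
    op: str
    children: tuple = ()

# "for all n, n + 0 = n" as a topological blueprint.
stmt = Node("forall", (
    Node("n"),
    Node("eq", (Node("add", (Node("n"), Node("0"))), Node("n"))),
))

def subtrees(t):
    """Enumerate every subtree, root first."""
    yield t
    for c in t.children:
        yield from subtrees(c)

# Sub-tree refinement can then target only the failing node,
# e.g. regenerate just the "add" subtree instead of the whole statement.
ops = [n.op for n in subtrees(stmt)]
print(ops)  # ['forall', 'n', 'eq', 'add', 'n', '0', 'n']
```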
Accessibility Issue Detection and Repair in Mobile Applications: A Systematic Literature Review
DOI:
10.1145/3809496
🔥 Citations:
0
Abstract: Mobile application accessibility is crucial for digital inclusion. This paper presents a systematic literature review (SLR) following PRISMA guidelines, synthesizing advancements in accessibility issue detection and repair techniques. We analyze representative mobile accessibility tools and industrial practices to complement SLR evidence. Nine major databases (ACM DL, IEEE Xplore, ScienceDirect, SpringerLink, Wiley, Web of Science, Scopus, CNKI, arXiv) were searched, with 76 high-quality studies published up to July 2025 analyzed. Common accessibility issues are categorized into four types—perceptibility, operability, understandability, and robustness—aligning with user capability modeling. In accessibility issue detection, evolution from static analysis to deep learning and large language models (LLMs) has shifted research from rule-based matching to semantic reasoning, enhancing automation and generalization. However, detection and repair remain loosely coupled and fragmented, with high false-positive rates and weak inter-module feedback loops. For repair, we propose a taxonomy of rule-driven, learning-based, and LLM-assisted approaches, revealing a gap between detection breadth and repair research depth. LLMs show potential for semantic understanding, issue reasoning, and repair generation, paving the way toward intelligent agents for accessibility. Looking ahead, future research should focus on semantic-enhanced detection, multimodal repair, refined user capability modeling, and unified evaluation standards—collectively advancing accessibility engineering toward an accessibility-by-default design paradigm.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
🔥 Citations:
0
Abstract: Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. In addition, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose \textbf{DT2IT-MRM}, which integrates a \textbf{D}ebiased preference construction pipeline, a novel reformulation of text-to-image (\textbf{T2I}) preference data, and an \textbf{I}terative \textbf{T}raining framework that curates existing multimodal preference datasets for \textbf{M}ultimodal \textbf{R}eward \textbf{M}odeling. Our experimental results show that DT2IT-MRM achieves new \textbf{state-of-the-art} overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
🔥 Citations:
0
Abstract: Recently, Retrieval Augmented Generation (RAG) has shifted focus to multi-retrieval approaches to tackle complex tasks such as multi-hop question answering. However, these systems struggle to decide when to stop searching once enough information has been gathered. To address this, \citet{zhou2024metacognitive} introduced Metacognitive Retrieval Augmented Generation (MetaRAG), a framework inspired by metacognition that enables Large Language Models to critique and refine their reasoning. In this reproducibility paper, we reproduce MetaRAG following its original experimental setup and extend it in two directions: (i) by evaluating the effect of PointWise and ListWise rerankers, and (ii) by comparing with SIM-RAG, which employs a lightweight critic model to stop retrieval. Our results confirm MetaRAG's relative improvements over standard RAG and reasoning-based baselines, but also reveal lower absolute scores than reported, reflecting challenges with closed-source LLM updates, missing implementation details, and unreleased prompts. We show that MetaRAG is partially reproduced, gains substantially from reranking, and is more robust than SIM-RAG when extended with additional retrieval features.
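The stopping problem described above reduces to a retrieval loop with an explicit sufficiency check; in MetaRAG the check is metacognitive self-critique, in SIM-RAG a lightweight critic model. A toy sketch (function names and data are illustrative):

```python
def retrieve_until_sufficient(question, retrieve, is_sufficient, max_hops=4):
    """Multi-hop retrieval that stops once the evidence is judged sufficient."""
    evidence = []
    for _ in range(max_hops):
        evidence.append(retrieve(question, evidence))
        if is_sufficient(question, evidence):
            break  # the critic (or self-critique) says: enough information
    return evidence

# Toy demonstration: the loop stops after two hops, not max_hops.
docs = ["hop-1 passage", "hop-2 passage", "hop-3 passage", "hop-4 passage"]
retrieve = lambda q, ev: docs[len(ev)]
is_sufficient = lambda q, ev: len(ev) >= 2
result = retrieve_until_sufficient("who founded X?", retrieve, is_sufficient)
print(result)  # ['hop-1 passage', 'hop-2 passage']
```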
TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs
🔥 Citations:
0
Abstract: Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present \textbf{TriEx}, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.
Structure-informed Siamese graph neural networks classify CirA missense variants with implications for cefiderocol susceptibility
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Large Language Models Exhibit Normative Conformity
🔥 Citations:
0
Abstract: The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated "conformity" simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of "conformity," they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how "norms" are implemented in LLMs and how they influence group dynamics.
De Novo–HIV: Neural-Quantum Protease Inhibition
🔥 Citations:
0
Abstract: The search for potent drug candidates targeting the Human Immunodeficiency Virus (HIV) continues to be a significant global health challenge, largely because of the virus’s rapid mutation rate, the emergence of drug resistance, and the lengthy, expensive nature of traditional drug development processes. In recent years, artificial intelligence (AI) and deep learning have shown great promise in expediting early-stage drug discovery. Notably, Long Short-Term Memory (LSTM) networks have demonstrated remarkable ability in understanding chemical representations and generating new molecular structures through sequence-based notations like SMILES. This paper presents a structured review of LSTM-based and related deep learning approaches applied to HIV drug discovery. Existing studies are critically analysed and classified based on their learning objectives, molecular representation strategies, and validation mechanisms. Key research gaps are identified, including limited generative diversity, lack of multi-objective optimization, and insufficient biological validation. Finally, a conceptual hybrid framework is discussed that integrates LSTM-based molecular generation with advanced evaluation strategies, offering future research directions for scalable and clinically relevant HIV drug discovery.
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
🔥 Citations:
0
Abstract: Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset spanning eight common tasks with balanced coverage of common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.
OLLM: Options-based Large Language Models
🔥 Citations:
0
Abstract: We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a \textit{set of learned options} for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only $1.56\%$ of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at $51\%$ final answer correctness, while OLLM's option set allows up to $\sim 70\%$ under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
Adaptive reinforcement learning for recommendation via large language models and knowledge graphs
🔥 Citations:
0
Abstract: No abstract available; see the original article.
STK-Adapter: Incorporating Evolving Graph and Event Chain for Temporal Knowledge Graph Extrapolation
🔥 Citations:
0
Abstract: Temporal Knowledge Graph (TKG) extrapolation aims to predict future events based on historical facts. Recent studies have attempted to enhance TKG extrapolation by integrating TKG's evolving structural representations and textual event chains into Large Language Models (LLMs). Yet, two main challenges limit these approaches: (1) the loss of essential spatial-temporal information due to shallow alignment between the TKG's evolving graph structural representation and the LLM's semantic space, and (2) the progressive dilution of the TKG's evolving structural features during LLM fine-tuning. To address these challenges, we propose the Spatial-Temporal Knowledge Adapter (STK-Adapter), which flexibly integrates the evolving graph encoder and the LLM to facilitate TKG reasoning. In STK-Adapter, a Spatial-Temporal MoE is designed to capture spatial structures and temporal patterns inherent in TKGs. An Event-Aware MoE is employed to model intricate temporal semantic dependencies within event chains. In addition, a Cross-Modality Alignment MoE is proposed to facilitate deep cross-modality alignment by TKG-guided attention experts. Extensive experiments on benchmark datasets demonstrate that STK-Adapter significantly outperforms state-of-the-art methods and exhibits strong generalization capabilities in cross-dataset tasks. The code is available at https://github.com/Zhaoshuyuan0246/STK-Adapter.
Challenges and Solutions in Deploying Systematized Nomenclature of Medicine—Clinical Terms in the Chinese Healthcare Context
DOI:
10.1002/hcs2.70069
🔥 Citations:
0
Abstract:
Systematized nomenclature of medicine—clinical terms (SNOMED CT), one of the most comprehensive clinical terminology systems, is pivotal in enhancing healthcare interoperability, clinical data governance, and medical artificial intelligence (AI) development globally. In China, with the rapid growth of large‐scale models and an increasing emphasis on transforming the intrinsic value of healthcare data, the absence of a nationally unified clinical terminology standard poses significant challenges. This commentary provides an in‐depth analysis of the benefits of SNOMED CT for global healthcare, examines the critical deficiencies in Chinese healthcare big data and AI development due to the lack of standardized terminology, and outlines the technical, administrative, and educational challenges encountered in deploying SNOMED CT within Chinese environments. Special emphasis is placed on the potential of advanced large language models in facilitating the mapping of Chinese clinical data to SNOMED CT. We further discuss the necessity of high‐quality data standardization in advancing medical AI in China. Finally, key conclusions and a roadmap for overcoming these challenges are proposed.
Efficient Computation of Multi‐State Survival Signature Based on Graph Neural Network
DOI:
10.1002/qre.70210
🔥 Citations:
0
Abstract:
Reliability assessment of complex multi‐state systems (MSSs) is essential for their safe and efficient operation. Survival signature, a powerful tool for reliability analysis, faces significant computational challenges when applied to MSSs due to combinatorial explosion. However, research on efficient computation of survival signature for MSSs remains scarce and challenging. To address this issue, this study proposes a graph neural network (GNN)‐based approach for predicting survival signatures with improved computational efficiency, which integrates both the topological structure and component state information of MSSs. The proposed method utilizes the graph attention network v2 (GATv2) to dynamically aggregate node features through learnable attention weights. Furthermore, it incorporates the jumping knowledge (JK) framework to adaptively integrate multi‐scale features across different network layers, thereby mitigating over‐smoothing and enhancing hierarchical feature extraction. The numerical example and the application example of dual‐axis positioning mechanisms for satellite antennas demonstrate that, compared with traditional methods such as Monte Carlo simulation (MCS) or enumeration, the proposed approach not only ensures high prediction accuracy and significantly reduces the computational cost of survival signature evaluation, but also efficiently performs multi‐state Birnbaum importance analysis. It provides an effective tool for reliability analysis and optimization of MSSs.
Personalized Benchmarking: Evaluating LLMs by Individual Preferences
🔥 Citations:
0
Abstract: With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual LLM rankings diverge dramatically from aggregate rankings, with Bradley-Terry correlations averaging only $\rho = 0.04$ (57\% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($\rho = 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLMs according to individual user preferences.
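For readers unfamiliar with the rating machinery, per-user Bradley-Terry strengths can be fit from pairwise battle outcomes with the classic minorization-maximization update; a generic sketch with toy data, not the paper's pipeline:

```python
def bradley_terry(wins, iters=500):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] = number of times model i beat model j for one user.
    Uses Hunter's MM update; scores are normalized for identifiability.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            new.append(w_i / den if den else p[i])
        s = sum(new)
        p = [x * n / s for x in new]
    return p

# Toy user: model A usually beats B and C, so A should rank first.
wins = [[0, 8, 6],
        [2, 0, 5],
        [4, 5, 0]]
scores = bradley_terry(wins)
print(max(range(3), key=lambda i: scores[i]))  # 0
```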
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
🔥 Citations:
0
Abstract: Post-Training Quantization (PTQ) is critical for the efficient deployment of Large Language Models (LLMs). While 4-bit quantization is widely regarded as an optimal trade-off, reducing the precision to 2-bit usually triggers a catastrophic "performance cliff." It remains unclear whether the underlying mechanisms differ fundamentally. Consequently, we conduct a systematic mechanistic analysis, revealing two qualitatively distinct failure modes: Signal Degradation, where the computational patterns remain intact but information precision is impaired by cumulative error; and Computation Collapse, where key components fail to function, preventing correct information processing and destroying the signal in the early layers. Guided by this diagnosis, we conduct mechanism-aware interventions, demonstrating that targeted, training-free repair can mitigate Signal Degradation, but remains ineffective for Computation Collapse. Our findings provide a systematic diagnostic framework for PTQ failures and suggest that addressing Computation Collapse requires structural reconstruction rather than mere compensation.
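The scale of the 2-bit cliff is easy to reproduce with plain symmetric uniform quantization (an illustration of the precision gap only, not the paper's diagnostic method):

```python
import random

def quantize(xs, bits):
    """Symmetric uniform quantization onto a signed grid of 2**(bits-1)-1 levels."""
    qmax = 2 ** (bits - 1) - 1            # 7 levels per side at 4-bit, 1 at 2-bit
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # stand-in weight tensor
err4 = mse(w, quantize(w, 4))
err2 = mse(w, quantize(w, 2))
# At 2-bit most weights collapse to zero, so the error is over an
# order of magnitude larger than at 4-bit.
print(err2 > 10 * err4)  # True
```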
Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
🔥 Citations:
1
Abstract: Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the "small loss hypothesis", but the unique semantic ambiguity in NTC, such as "partial matching", invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self-dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic "representation pollution". To address this critical challenge, we propose a novel "Expert-Proxy-Diversion" decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high-precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy "arbiter" to internalize the expert's discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI's matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
Efficient Medical Image Segmentation in Multisensor Imaging: A Survey in the Era of Mamba and Foundation Models
DOI:
10.3390/s26082558
🔥 Citations:
0
Abstract: Deep learning has revolutionized medical image segmentation; however, the clinical deployment of state-of-the-art models is severely impeded by their quadratic computational complexity and substantial resource demands, particularly in multisensor and multimodal imaging scenarios. In response, the field is undergoing a paradigm shift towards efficiency, characterized by the rise of linear-complexity architectures and the optimization of foundation models. This paper presents a comprehensive survey of efficient medical image segmentation methodologies, systematically reviewing the evolution from heavy, accuracy-driven models to lightweight, deployment-ready paradigms. In particular, we highlight the growing importance of efficient segmentation in multisensor medical imaging, where heterogeneous data sources such as CT, MRI, ultrasound, and infrared imaging introduce additional challenges in scalability and computational cost. We propose a novel taxonomy that categorizes these advancements into four distinct streams: (1) Mamba and State Space Models, which leverage selective scanning mechanisms to achieve global receptive fields with linear complexity; (2) Efficient Adaptation of Foundation Models, focusing on parameter-efficient fine-tuning and knowledge distillation to tailor the Segment Anything Model (SAM) for medical domains; (3) Advanced Lightweight Architectures, covering the resurgence of large-kernel CNNs and the emergence of Kolmogorov–Arnold Networks (KANs); and (4) Data-Efficient Strategies, including semi-supervised and federated learning to address annotation scarcity. Furthermore, we conduct a rigorous comparative analysis of representative algorithms on mainstream benchmarks, providing a granular evaluation of the trade-offs between segmentation accuracy and computational overhead. 
The survey also discusses key challenges in multisensor and multimodal settings, including modality heterogeneity, data fusion complexity, and resource constraints. Finally, we identify critical challenges and outline future research directions, serving as a roadmap for the development of next-generation efficient and scalable medical image analysis systems.
Structural insights and biological activity of (E)-4-(1-(2-(4-(4-chlorophenyl)thiazol-2-yl)hydrazono)ethyl)phenol: a potential therapeutic for breast cancer
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input
🔥 Citations:
0
Abstract: Robotic parkour provides a compelling benchmark for advancing locomotion over highly challenging terrain, including large discontinuities such as elevated steps. Recent approaches have demonstrated impressive capabilities, including dynamic climbing and jumping, but typically rely on sequential multilayer perceptron (MLP) architectures with densely activated layers. In contrast, sparsely gated mixture-of-experts (MoE) architectures have emerged in the large language model domain as an effective paradigm for improving scalability and performance by activating only a subset of parameters at inference time. In this work, we investigate the application of sparsely gated MoE architectures to vision-based robotic parkour. We compare control policies based on standard MLPs and MoE architectures under a controlled setting where the number of active parameters at inference time is matched. Experimental results on a real Unitree Go2 quadruped robot demonstrate clear performance gains, with the MoE policy achieving double the number of successful trials in traversing large obstacles compared to a standard MLP baseline. We further show that achieving comparable performance with a standard MLP requires scaling its parameter count to match that of the total MoE model, resulting in a 14.3\% increase in computation time. These results highlight that sparsely gated MoE architectures provide a favorable trade-off between performance and computational efficiency, enabling improved scaling of control policies for vision-based robotic parkour. An anonymized link to the codebase is https://osf.io/v2kqj/files/github?view_only=7977dee10c0a44769184498eaba72e44.
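The efficiency argument above hinges on top-k routing: only k experts execute per forward pass, so total parameters can grow while active compute stays fixed. A minimal gating sketch (shapes, gate weights, and expert functions are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Sparsely gated MoE: run only the top-k experts, mix by renormalized gate probs."""
    logits = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    topk = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in topk])
    # Only k experts run here; the others cost nothing at inference time.
    return sum(p * experts[i](x) for p, i in zip(probs, topk))

# Four scalar "experts"; only two are active for any given input.
experts = [lambda x, a=a: a * sum(x) for a in (0.5, 1.0, 2.0, 3.0)]
gate_weights = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
y = moe_forward([3.0, 2.0, 1.0], experts, gate_weights, k=2)
```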
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
🔥 Citations:
0
Abstract: Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
🔥 Citations:
0
Abstract: Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
Understanding the Mechanism of Altruism in Large Language Models
🔥 Citations:
0
Abstract: Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions validate their functional role: activation patching and continuous steering of this feature direction reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction generalizes across multiple social-preference games. Together, these results enhance our understanding of artificial cognition by translating altruistic behaviors into identifiable network states and provide a framework for aligning LLM behavior with human values, thereby informing more transparent and value-aligned deployment.
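Mechanically, "continuous steering" of a feature direction amounts to adding a scaled unit vector to the model's activations. A toy sketch of that operation (the actual SAE features, layers, and hook points are the paper's and are not reproduced here):

```python
def steer(activations, direction, alpha):
    """Shift each activation vector by alpha along the unit feature direction."""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    return [[a + alpha * u for a, u in zip(vec, unit)] for vec in activations]

# Two token activations in a 2-d toy residual stream.
acts = [[0.0, 1.0], [2.0, -1.0]]
# direction (3, 4) has unit vector (0.6, 0.8); alpha scales the push,
# and a negative alpha steers the behavior the other way.
steered = steer(acts, direction=[3.0, 4.0], alpha=5.0)
```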
Structure-guided molecular design with contrastive 3D protein-ligand learning
🔥 Citations:
0
Abstract: Structure-based drug discovery faces the dual challenge of accurately capturing 3D protein-ligand interactions while navigating ultra-large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)-equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero-shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target-specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.
Computational Drug Repurposing Targeting LuxS-Mediated Quorum Sensing in Fusobacterium nucleatum: A Virtual Screening and Molecular Dynamics Approach
🔥 Citations:
0
Abstract: Not available; please see the original article.
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
🔥 Citations:
0
Abstract: We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
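The paper's passing criterion (at least 50% recall on every ATT&CK tactic) reduces to a per-tactic recall check over flagged timestamps. A minimal sketch, with invented tactic labels and flag sets standing in for the benchmark's real ground truth:

```python
def tactic_recall(truth_flags, submitted_flags):
    """Fraction of ground-truth flags the agent submitted, per tactic."""
    recall = {}
    for tactic, flags in truth_flags.items():
        hit = len(flags & submitted_flags)
        recall[tactic] = hit / len(flags) if flags else 1.0
    return recall

def passes(truth_flags, submitted_flags, bar=0.5):
    """Pass only if recall clears the bar on *every* tactic."""
    scores = tactic_recall(truth_flags, submitted_flags)
    return all(r >= bar for r in scores.values())

# Illustrative placeholders, not the benchmark's actual schema.
truth = {
    "execution": {"t1", "t2"},           # ground-truth malicious timestamps
    "persistence": {"t3", "t4", "t5"},
}
submitted = {"t1", "t3", "t4"}
print(tactic_recall(truth, submitted))   # execution 0.5, persistence 2/3
print(passes(truth, submitted))          # True: both tactics clear the bar
```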
Fra juks til læring (From Cheating to Learning)
🔥 Citations:
0
Abstract: Artificial intelligence (AI), particularly in the form of large language models, can be highly useful in academic work. While many students are using such chatbots extensively, others remain skeptical or uncertain about the new technology. To prevent a growing divide between these groups, the Learning Support Center at OsloMet has developed a course in AI literacy. The course is grounded in a learning philosophy in which knowledge and understanding are constructed and developed through dialogue and collaboration. AI literacy encompasses attitudes as well as intellectual understanding and technical skills. Therefore, emotional engagement is also incorporated into the course. In the case of large language models, students have often taken up these tools without prior training. In developing the course, an anthropological approach was used to understand how GenAI has influenced the learning culture among students. Their experiences and attitudes toward GenAI are integrated into the course to foster personal engagement. The article presents the pedagogical approaches rather than describing the entire course. This is illustrated with five teaching examples that show how dialogue, including with the language model, and active learning activities build experience and awareness for using GenAI in a critical, learning-oriented, and more effective way.
Construction of Knowledge Graph based on Language Model
🔥 Citations:
0
Abstract: Knowledge Graphs (KGs) can effectively integrate valuable information from massive data, and have therefore been rapidly developed and widely adopted in many fields. Traditional KG construction methods rely on manual annotation, which consumes substantial time and manpower, while KG construction schemes based on deep learning tend to have weak generalization capabilities. With the rapid development of Pre-trained Language Models (PLMs), PLMs have shown great potential in the field of KG construction. This paper provides a comprehensive review of recent research advances in constructing KGs with PLMs. We explain how PLMs can use their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework based on a lightweight Large Language Model (LLM), named LLHKG, and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to GPT-3.5.
On Reasoning-Centric LLM-based Automated Theorem Proving
🔥 Citations:
0
Abstract: Automated theorem proving is fundamental to formal methods, and the recent trend is to integrate large language models (LLMs) and proof assistants to form effective proof agents. While existing proof agents show promising performance, they inadequately leverage reasoning capabilities of modern LLMs in high-level planning and self-critique. We argue that proof agents should not merely generate tactics but also reason strategically about proof plans and critically evaluate their own proposals. This paper introduces ReCent-Prover, a reasoning-centric LLM-based proof agent for Rocq that addresses two critical limitations in current systems. First, we present validation with reflection, enabling LLMs to scrutinize their generated tactics and synthesize failure summaries when reflection identifies potential errors, filtering out potentially misapplied tactics earlier. Second, we propose retrieval with planning, which conditions retrieval on LLM-generated proof plans rather than subgoal similarity, retrieving lemmas and proofs that align with the anticipated proof strategy. Both techniques increase the number of invocations of LLMs. However, when evaluated on the CoqStoq benchmark, even under the same budget of LLM invocations, ReCent-Prover achieves a 22.58% relative improvement in the number of proved theorems over the previous state-of-the-art, demonstrating that our reasoning-centric design significantly enhances automated theorem proving capabilities.
ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
🔥 Citations:
0
Abstract: Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.
Continuous Semantic Caching for Low-Cost LLM Serving
🔥 Citations:
0
Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic ε-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.
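The ε-net idea can be sketched with a standard greedy covering rule: keep a query embedding as a new cache centroid only if it is farther than ε from every existing centroid, so every observed query ends up within ε of some centroid. This is an illustrative simplification of the paper's dynamic discretization, using plain Euclidean distance on toy 2-D embeddings:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_eps_net(embeddings, eps):
    """Greedy epsilon-net: a point becomes a new center only if it is
    farther than eps from all existing centers. Afterwards, every
    embedding lies within eps of some center (covering property)."""
    centers = []
    for e in embeddings:
        if all(euclidean(e, c) > eps for c in centers):
            centers.append(e)
    return centers

# Toy 2-D "query embeddings"; real ones are high-dimensional.
queries = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.95, 1.05)]
net = greedy_eps_net(queries, eps=0.3)
print(net)  # two centers survive: (0.0, 0.0) and (1.0, 1.0)
```

Cost feedback observed for one query can then be generalized to its centroid's whole ε-neighborhood, which is where the kernel regression enters in the paper.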
Synthesis, Anti-Inflammatory and Anticancer Activity Evaluation of Some Novel Acridine Derivatives
🔥 Citations:
0
Abstract: Acridine and its derivatives have long occupied a special place in medicinal chemistry owing to their broad-spectrum pharmacological activities. In the present study, a series of twelve novel acridine-based compounds (ACD-1 through ACD-12) were designed, synthesized, and characterized with the primary objective of identifying potent anti-inflammatory and anticancer agents with an acceptable safety profile. The synthetic route involved a Doebner–Miller condensation reaction followed by nucleophilic substitution, yielding target molecules bearing diverse substituents at the 9-position and on the peripheral aromatic rings. All synthesized compounds were characterized by 1H NMR, 13C NMR, IR, and high-resolution mass spectrometry (HRMS). For anti-inflammatory screening, the compounds were evaluated through in vitro albumin denaturation inhibition, membrane stabilization, and heat-induced hemolysis assays. Compounds ACD-5, ACD-8, and ACD-11 emerged as the most promising anti-inflammatory agents, with ACD-8 showing 78.4% inhibition of albumin denaturation at 500 µg/mL, outperforming the standard drug diclofenac sodium (72.1%). The anticancer potential of the synthesized compounds was assessed against four human cancer cell lines, MCF-7 (breast), HeLa (cervical), A549 (lung), and HCT116 (colon), using the MTT assay. ACD-5 and ACD-11 exhibited remarkable cytotoxicity, with IC50 values of 3.24 µM and 4.11 µM against MCF-7 cells, respectively, surpassing the reference drug doxorubicin (IC50 = 5.82 µM) in selectivity. Molecular docking studies revealed that these compounds interact favorably with COX-2 and topoisomerase II enzymes through hydrogen bonding and π–π stacking interactions. Furthermore, in silico ADMET profiling indicated drug-like properties consistent with Lipinski's Rule of Five for most derivatives. The combined biological data from this study strongly suggest that these novel acridine derivatives warrant further investigation as leads in anti-inflammatory and anticancer drug discovery.
Time Series Augmented Generation for Financial Applications
🔥 Citations:
0
Abstract: Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.
Evaluating Protein Language Model Embeddings for Viral Clade Assignment
🔥 Citations:
0
Abstract: Protein language models (PLMs) provide powerful sequence representations, yet their effectiveness for unsupervised viral clade assignment remains uncertain. In this study, we evaluated embeddings from ProtT5, ProtBert, CARP, and several ESM-2 variants on influenza A/H3N2 hemagglutinin sequences. Using dimensionality reduction (t-SNE, UMAP, PCA, MDS) and clustering with HDBSCAN, we compared PLM embeddings against baseline Hamming distance approaches. Our results show that t-SNE combined with PLM embeddings can recover clade structure, with ProtBert yielding the most stable performance and larger ESM-2 models occasionally achieving lower normalized variation of information scores but with greater variability. These findings suggest that while PLM embeddings capture clade-relevant signals, they also suffer from instability and the loss of site- or nucleotide-specific detail. Future improvements in pooling strategies may enhance their utility for viral surveillance.
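The baseline the embeddings are compared against is straightforward: a normalized Hamming distance over aligned sequences, whose pairwise matrix can then feed any clustering method. A minimal sketch with short invented fragments in place of real hemagglutinin alignments:

```python
def hamming(seq_a, seq_b):
    """Normalized Hamming distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)

def distance_matrix(seqs):
    """Symmetric pairwise distance matrix, zero on the diagonal."""
    return [[hamming(s, t) for t in seqs] for s in seqs]

# Toy aligned fragments (invented; real HA sequences are ~550 residues).
seqs = ["MKTIIALSYI", "MKTIIALSYV", "MQTIVALSYI"]
D = distance_matrix(seqs)
print(D[0][1])  # 0.1: one mismatch out of ten positions
```

Unlike PLM embeddings, this distance keeps exact site-level detail, which is the trade-off the abstract highlights.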
Graph-Theoretic Models for the Prediction of Molecular Measurements
🔥 Citations:
0
Abstract: Graph-theoretic approaches offer simplicity, interpretability, and low computational cost for molecular property prediction. Among these, the model proposed by Mukwembi and Nyabadza, based on the external activity D(G) and internal activity ζ(G) indices, achieved strong results on a small flavonoid dataset. However, its ability to generalize to larger and chemically diverse datasets has not been tested. This study evaluates the baseline D(G)-ζ(G) polynomial model on five benchmark datasets from MoleculeNet, covering biological activity (BACE, 1,513 molecules), lipophilicity (LogP synthetic, 14,610 molecules; LogP experimental, 753 molecules), aqueous solubility (ESOL, 1,128 molecules), and hydration free energy (SAMPL, 642 molecules). The baseline model achieves an average R² = 0.24, confirming limited transferability. To address this, a systematic enhancement framework is proposed, progressively incorporating Ridge regularization, additional graph descriptors, physicochemical properties, ensemble learning with Gradient Boosting, Lasso feature selection, and a hybrid approach combining topological indices with Morgan fingerprints. The enhanced models raise the average best R² to 0.79, with individual improvements ranging from 165% to 274%. All improvements are statistically significant (p < 0.001). A direct comparison with a Graph Convolutional Network under identical experimental conditions shows that the enhanced classical models match or outperform deep learning on all five datasets. Comparison with the recent GNN+PGM hybrid of Djagba et al. further confirms competitiveness, with the enhanced models achieving the best results on two datasets and tying on one. The entire framework requires no GPU, trains in under five minutes, and uses only open-source tools, making it accessible for researchers in resource-limited settings.
TACO: TabPFN Augmented Causal Outcomes for early detection of Long COVID
🔥 Citations:
0
Abstract: Long COVID, or Post-Acute Sequelae of COVID-19 (PASC), affects 10 to 40% of COVID-19 survivors worldwide, manifesting as persistent symptoms that last months after initial infection, including fatigue, cognitive impairment, and organ dysfunction. The heterogeneous nature of Long COVID and its delayed onset creates a critical window in which at-risk patients remain unidentified during the presymptomatic phase, when interventions could be most beneficial. Current prediction methods rely on clinical symptoms that manifest too late for early intervention or use molecular approaches based on statistical associations that cannot distinguish disease drivers from correlational noise. We present TACO (TabPFN-Augmented Causal Outcomes), a novel method for the early detection of Long COVID using gene expression data. TACO first applies Differential Causal Effect (DCE) analysis to identify genes with putative regulatory causal relationships to the pathogenesis of Long COVID, rather than simple statistical associations. It then integrates these mechanistic features with the Tabular Prior-Data Fitted Network (TabPFN), a foundation model that requires no hyperparameter tuning, alongside complementary classifiers in an ensemble approach. Experimental results show that TACO performs competitively with existing machine learning methods in predicting Long COVID from gene expression data from patients with early-stage COVID-19, while offering a causally grounded and interpretable feature selection. An ablation study confirms the positive contribution of TabPFN to the ensemble, and a strict nested cross-validation protocol, in which DCE feature selection is performed exclusively within each training fold, confirms that the results are robust to data leakage. The nested CV analysis further identifies a minimal reproducible causal gene signature with potential utility for targeted clinical sequencing during acute hospitalization for COVID-19.
In addition, the causal genes identified by TACO are highly relevant to Long COVID-related functions and pathways, as confirmed by the literature and subsequent analyses, providing insights into disease mechanisms and potential treatments.
OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting
🔥 Citations:
0
Abstract: The rise of large language models (LLMs) has brought about a class of prompt-based interactive systems in which users primarily express their input in natural language. However, composing a prompt as a linear text string becomes unwieldy when capturing users' multifaceted intents. We present Object-Oriented Prompting (OOPrompt), an emergent interaction paradigm that enables users to create, edit, iterate, and reuse prompts as structured, manipulable artifacts, unifying and generalizing several existing point systems. We first outlined a design space from existing work and built an early prototype, which we deployed as a probe in a formative study with 20 participants. Their feedback informed an expanded OOPrompt design space. We then developed the full OOPrompt prototype and conducted a validation study to further understand OOPrompt's added values and trade-offs. We expect the OOPrompt design space to provide theoretical and empirical guidance for the design and engineering of prompt-based, LLM-enabled interactive systems.
Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
🔥 Citations:
0
Abstract: With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along dimensions such as clarity and originality. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of the evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
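The surface features examined, word and sentence length, are easy to compute; the whitespace/punctuation tokenizer below is a deliberately crude stand-in for the study's actual text-processing pipeline:

```python
import re

def linguistic_features(review_text):
    """Mean sentence length (in words) and mean word length (in
    characters). Naive splitting on sentence-final punctuation and
    whitespace; an illustrative stand-in for a proper NLP pipeline."""
    sentences = [s for s in re.split(r"[.!?]+", review_text) if s.strip()]
    words = [re.sub(r"\W", "", w) for w in review_text.split()]
    words = [w for w in words if w]
    return {
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "chars_per_word": sum(map(len, words)) / max(len(words), 1),
    }

feats = linguistic_features("The paper is clear. However, novelty is limited.")
# Longer, more fluent LLM-era reviews would push both numbers upward.
print(feats)
```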
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
🔥 Citations:
0
Abstract: At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve-rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
🔥 Citations:
0
Abstract: More than half of the Top 500 supercomputers employ GPUs as accelerators. On GPU-accelerated platforms, developers face a key diagnostic gap: profilers show source lines where stalls occur, but not why they occur. Furthermore, the same kernel may have different stalls and underlying causes on different GPUs. This paper presents LEO, a root-cause analyzer for NVIDIA, AMD, and Intel GPUs that performs backward slicing from stalled instructions, considering dependencies arising from registers as well as vendor-specific synchronization mechanisms. LEO attributes GPU stalls to source instructions with the goal of explaining the root causes of these inefficiencies. Across 21 workloads on three GPU platforms, LEO-guided optimizations deliver geometric-mean speedups of 1.73× to 1.82×. Our case studies show that (1) the same kernel may require different optimizations for different GPU architectures, and (2) LEO's structured diagnostics improve code optimization with large language models relative to code-only and raw-stall-count baselines.
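Backward slicing from a stalled instruction can be sketched as a def-use walk over register dependencies: starting from the registers the stalled instruction reads, walk backwards collecting the instructions that produced them. This toy version ignores memory and the vendor-specific synchronization dependencies LEO also tracks:

```python
def backward_slice(instructions, stalled_idx):
    """Toy register-only backward slice. Each instruction is
    (dest_register, [source_registers]); returns the indices of
    instructions the stalled one transitively depends on."""
    needed = set(instructions[stalled_idx][1])
    slice_idxs = []
    for i in range(stalled_idx - 1, -1, -1):
        dest, srcs = instructions[i]
        if dest in needed:            # this instruction defines a needed value
            slice_idxs.append(i)
            needed.discard(dest)      # most recent definition wins
            needed.update(srcs)       # its inputs become needed in turn
    return list(reversed(slice_idxs))

instrs = [
    ("r1", []),       # 0: load r1
    ("r2", ["r1"]),   # 1: r2 = f(r1)
    ("r3", []),       # 2: unrelated load
    ("r4", ["r2"]),   # 3: stalled instruction, waiting on r2
]
print(backward_slice(instrs, 3))  # [0, 1]: the producing chain, skipping 2
```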
Information Aggregation with AI Agents
🔥 Citations:
0
Abstract: Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.
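The headline measure, the log error of the last price, can be sketched under the assumption that it compares the closing market price with the full-information value implied by all private signals; the paper's exact normalization may differ:

```python
import math

def log_error(last_price, full_info_value):
    """Absolute log distance between the market's closing price and
    the value a trader who saw every private signal would set.
    Zero means perfect aggregation. (Assumed definition; the paper
    may normalize differently.)"""
    return abs(math.log(last_price) - math.log(full_info_value))

perfect = log_error(0.8, 0.8)  # price fully reflects the pooled signals
partial = log_error(0.6, 0.8)  # some private information never reached the price
print(perfect, partial)
```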
Conditional Generative AI in Oncology Diagnostics
DOI:
10.3390/app16084015
🔥 Citations:
0
Abstract: The increasing complexity of oncology diagnostics requires advanced Clinical Decision Support Systems (CDSS) capable of integrating multimodal data. Traditional discriminative models often struggle with missing data and cross-modal dependencies. This review provides a novel, systematic analysis of conditional generative artificial intelligence (AI), including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), diffusion models and Multimodal Large Language Models (MLLMs), specifically tailored for oncological CDSS. We examine how these architectures move beyond simple prediction to learn joint data distributions, enabling robust data imputation, virtual staining, and automated clinical reporting. A central focus of this work is the assessment of translational application, identifying the gaps between experimental proof-of-concepts and clinical deployment. We address critical hurdles such as model hallucinations, domain shift, and demographic bias, providing a roadmap for biological consistency and regulatory compliance. This review highlights the transition from task-specific generators to multimodal reasoning systems. Ultimately, we argue that the integration of generative AI into diagnostic workflows is essential for precision oncology, provided that human-in-the-loop validation and uncertainty-aware inference remain central to their implementation.
PlayCoder: Making LLM-Generated GUI Code Playable
🔥 Citations:
0
Abstract: Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.
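The Play@k metric as defined (at least one of k generated candidates plays end-to-end without logical errors) is a simple aggregation over per-candidate playthrough outcomes. A sketch with invented outcomes:

```python
def play_at_k(outcomes, k):
    """outcomes: one list of booleans per task, True if that generated
    candidate played end-to-end without logic errors. Play@k counts a
    task as solved if any of its first k candidates succeeds."""
    solved = sum(any(task[:k]) for task in outcomes)
    return solved / len(outcomes)

# Invented playthrough results for three tasks, three candidates each.
outcomes = [
    [False, True, False],   # second candidate is playable
    [False, False, False],  # none playable
    [True, False, True],
]
print(play_at_k(outcomes, 3))  # 2/3 of tasks have a playable candidate
print(play_at_k(outcomes, 1))  # only 1/3 succeed on the first try
```

In the paper the boolean judgments come from the PlayTester agent's task-oriented GUI playthroughs rather than a fixed oracle.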
LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues
🔥 Citations:
0
Abstract: More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs' capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts. This reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neural component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.
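The symbolic stage is described as sparse linear models with correlation-based classification; a toy stand-in that uses Pearson correlations as factor weights conveys the idea. All factor vectors, labels, and factor names below are invented for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation between a factor's presence (0/1 across
    past cases) and the relevance label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Toy history: rows are past cases; labels say whether the candidate
# issue was relevant. Values invented.
factor_a = [1, 1, 1, 0, 0, 0]   # tracks the label closely
factor_b = [1, 0, 1, 0, 1, 0]   # uninformative
labels = [1, 1, 0, 0, 0, 0]

weights = {"a": pearson(factor_a, labels), "b": pearson(factor_b, labels)}

def score_issue(present_factors):
    """Relevance score: sum of correlation weights of the factors the
    LLM-generated QA pairs answered affirmatively."""
    return sum(weights[f] for f in present_factors)

print(score_issue({"a"}) > score_issue({"b"}))  # informative factor dominates
```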
Editorial for “Pre‐Imaging Clinical Factors Associated With Cardiac MR Image Quality Using Large Language Model‐Enabled Data Extraction”
DOI:
10.1002/jmri.70346
🔥 Citations:
0
Abstract: Not available; please see the original article.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
🔥 Citations:
0
Abstract: Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled-model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.
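Information-centric ranking with SCUs can be sketched as scoring each candidate by the summed importance of the content units it covers; the substring matching and weights below are illustrative stand-ins for the paper's semantic matching:

```python
def scu_score(summary, scus):
    """Sum the importance weights of the SCUs a summary covers.
    Naive lowercase substring matching stands in for semantic
    matching of content units."""
    return sum(w for scu, w in scus.items() if scu in summary.lower())

def rank_candidates(candidates, scus):
    """Order candidate summaries by covered information content."""
    return sorted(candidates, key=lambda s: scu_score(s, scus), reverse=True)

# Invented SCUs with importance weights, and two candidate summaries.
scus = {"profits fell": 2.0, "ceo resigned": 3.0, "shares dropped": 1.0}
candidates = [
    "Shares dropped after a weak quarter.",
    "The CEO resigned as profits fell sharply.",
]
best = rank_candidates(candidates, scus)[0]
print(best)  # the candidate covering the two highest-weight SCUs wins
```

The top-ranked candidate per document would then serve as the distillation target for the SLM.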
DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
🔥 引用:
0
Abstract: Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grained detail prompted us to explore the potential of a more recent self-supervised encoder, DINO. Building on this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat it, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
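The mean-pooling baseline the abstract starts from, averaging frozen per-view features into a single 3D descriptor, can be sketched in a few lines. This is a minimal illustration with random features standing in for DINO outputs; the function name is ours, not the paper's:

```python
import numpy as np

def mean_pool_descriptor(view_feats: np.ndarray) -> np.ndarray:
    """Average per-view features (V, D) into one L2-normalized 3D descriptor (D,)."""
    desc = view_feats.mean(axis=0)
    return desc / np.linalg.norm(desc)

rng = np.random.default_rng(0)
views = rng.normal(size=(12, 768))   # 12 rendered views, 768-dim frozen features
descriptor = mean_pool_descriptor(views)
print(descriptor.shape)              # (768,)
```

The paper's CAM module replaces exactly this step with chunk-wise dynamic integration; the sketch only shows the pooling baseline it improves on.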
TripleBind: a generalizable deep learning framework for protein-nucleic acid and protein-ligand binding sites prediction based on pre-trained protein language models
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection
🔥 引用:
0
Abstract: Cross-site scripting (XSS) remains a persistent web security vulnerability, especially because obfuscation can change the surface form of a malicious payload while preserving its behavior. These transformations make it difficult for traditional and machine learning-based detection systems to reliably identify attacks. Existing approaches for generating obfuscated payloads often emphasize syntactic diversity, but they do not always ensure that the generated samples remain behaviorally valid. This paper presents a structured pipeline for generating and evaluating obfuscated XSS payloads using large language models (LLMs). The pipeline combines deterministic transformation techniques with LLM-based generation and uses a browser-based runtime evaluation procedure to compare payload behavior in a controlled execution environment. This allows generated samples to be assessed through observable runtime behavior rather than syntactic similarity alone. In the evaluation, an untuned baseline language model achieves a runtime behavior match rate of 0.15, while fine-tuning on behavior-preserving source-target obfuscation pairs improves the match rate to 0.22. Although this represents a measurable improvement, the results show that current LLMs still struggle to generate obfuscations that preserve observed runtime behavior. A downstream classifier evaluation further shows that adding generated payloads does not improve detection performance in this setting, although behavior-filtered generated samples can be incorporated without materially degrading performance. Overall, the study demonstrates both the promise and the limits of applying generative models to adversarial security data generation and emphasizes the importance of runtime behavior checks in improving the quality of generated data for downstream detection systems.
Tracing Relational Knowledge Recall in Large Language Models
🔥 引用:
0
Abstract: We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, as well as characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how distributed the signal on which the probe relies is across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to reveal probe behavior in further detail.
Dose-dependent modeling of combinatorial drug responses stratifies patient survival and reveals therapeutic vulnerabilities in precision oncology
🔥 引用:
0
Abstract: 暂无摘要,请点击原文查看。
Biomedical systems biology workflow orchestration and execution with PoSyMed
🔥 引用:
0
Abstract: The rapid growth of scientific software has created practical barriers for bioinformatics research. Although powerful statistical, artificial intelligence (AI)-based methods are now widely available, their effective use is often hindered by fragmented distribution, inconsistent documentation, complex dependencies, and difficult-to-reproduce execution environments. As a result, reusing published tools and adapting workflows to one's own data remains technically demanding and time-intensive, even for experienced users. Here, we present PoSyMed, an open and modular platform for the controlled integration, composition, and execution of bioinformatics tools and workflows. PoSyMed combines a backend-centered platform architecture with formal tool descriptions, controlled container-based build and execution processes, persistent workflow state, and a dialogue-based user interface. Large language models (LLMs) are integrated not as autonomous decision-makers but as a human-computer interface: bounded semantic assistants that help identify tools, propose workflow steps, and support parameterization within a typed, validated, and human-supervised execution environment. PoSyMed is designed to improve reproducibility, traceability, and transparency in practical biomedical analysis within one platform. We describe the system architecture and evaluate its behavior across representative biological software scenarios with respect to workflow support, interaction design, and platform extensibility. PoSyMed is publicly available at https://apps.cosy.bio/posymed.
SimDiff: Depth Pruning via Similarity and Difference
🔥 引用:
0
Abstract: Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
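The one-dimensional heuristic the abstract critiques, ranking layers by the cosine distance between their input and output hidden states, is simple to state. A minimal sketch follows (our own naming; SimDiff's MSSD and MASD metrics are paper-specific and not reproduced here):

```python
import numpy as np

def layer_cosine_distance(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """Mean cosine distance between a layer's input and output hidden states.

    h_in, h_out: (tokens, dim). A low distance means the layer barely changes
    its input, the usual signal that it is redundant and a pruning candidate.
    """
    num = (h_in * h_out).sum(axis=-1)
    den = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(np.mean(1.0 - num / den))

# A layer acting as a near-identity map scores ~0 (maximally "redundant"),
# which is exactly the signal SimDiff argues is insufficient on its own.
rng = np.random.default_rng(1)
h = rng.normal(size=(16, 64))
assert layer_cosine_distance(h, h) < 1e-9
```

SimDiff's contribution is to pair this similarity view with transformation-difference metrics, so a layer that looks redundant by cosine distance but makes decisive corrections is not pruned.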
Graph neural network approach for exploring urban material circularity: the case of Edinburgh
🔥 引用:
0
Abstract:
The transition towards a circular economy (CE) requires cities to understand how materials, waste and resources circulate through complex urban systems. Yet, current analytical tools remain limited to static or sectoral datasets and cannot capture the dynamic, relational nature of circular flows. This study aims to develop and apply a graph-based analytical framework for exploring urban material circularity in Edinburgh, using graph neural networks (GNNs) to model, visualise and predict interconnections among recycling, reuse and repair activities.
The study adopts a structured, multi-stage computational workflow to analyse urban circular infrastructure as a spatial–relational system. First, CE-related facilities (recycling, reuse and repair) were extracted from OpenStreetMap using tag-based queries and cleaned through GIS-based preprocessing to generate a georeferenced point dataset (Step 1). Second, these facilities were formalised as a proximity-based urban graph, where nodes represent facilities and edges encode spatial interaction potential derived from geographic distance; network characteristics were computed and verified using Google Colab and Gephi (Step 2). Third, the resulting graph was transferred into a GNN learning environment implemented in PyTorch Geometric, where message-passing architectures were trained to learn latent structural patterns and relational similarities across the network (Step 3). Fourth, the learned node embeddings and class probabilities were integrated with spatial and topological attributes to derive a composite circularity potential score for each facility, capturing functional alignment, spatial proximity and network embeddedness (Step 4). Finally, model outputs were reprojected onto the urban geography using Kepler.gl to enable spatial contextualisation and interpretation of circularity patterns across Edinburgh (Step 5).
The results reveal a strongly hierarchical circular system in Edinburgh, characterised by dense recycling clusters at the urban core, a semi-peripheral band of reuse nodes and structurally marginal repair facilities. Network metrics and GNN embeddings converge to show that recycling nodes dominate connectivity and form the principal metabolic backbone, while reuse sites act as intermediary bridges that extend circular exchanges beyond the centre. Repair nodes remain spatially fragmented and weakly integrated, signalling latent but unrealised circular capacity. The derived circularity potential scores further expose critical spatial gaps and highlight neighbourhoods where targeted interventions could significantly enhance systemic material recirculation.
This study advances the understanding of urban metabolism by framing cities as dynamic, learning systems where material, infrastructural and socio-economic interactions evolve through continuous feedback. Methodologically, the research operationalises this systemic perspective through GNNs, which computationally simulate feedback loops and relational dependencies across the urban material network. This integration of systems thinking and graph-based learning introduces a novel approach for capturing the emergent behaviour of circular systems, providing a transferable, data-driven framework for evaluating and forecasting material dynamics in cities.
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
🔥 引用:
0
Abstract: Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
🔥 引用:
0
Abstract: Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained using only 0.016B tokens on a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.
Cascaded Code Editing: Large-Small Model Collaboration for Effective and Efficient Code Editing
🔥 引用:
0
Abstract: Code editing constitutes a fundamental practice in software development, wherein developers modify existing codebases according to natural language requirements. Accurate code editing necessitates a comprehensive understanding of both the existing codebase and the modification requirements. Although large language models (LLMs) have demonstrated promising performance in code editing tasks, they suffer from substantial inefficiency by generating entire modified files that largely consist of unchanged code. While smaller models could potentially address this inefficiency, they typically lack the capacity to effectively comprehend long code contexts required for accurate editing. To ensure both effectiveness and efficiency, we propose to decompose code editing into a two-stage cascade: edit sketch generation, wherein a large model first produces concise sketches representing the requisite modifications (the more challenging phase), and edit sketch application, wherein a smaller model integrates these sketches into the original code to produce the final edited code (the simpler phase). This cascaded design reduces the number of tokens generated by the large model, as the majority of the output is handled by the smaller, more efficient model, thereby enhancing overall efficiency. However, the effectiveness of this approach is constrained by current small models' limited capabilities in handling long-context scenarios and cross-file dependencies, which are essential for accurate sketch application in real-world codebases. To address these limitations and enhance smaller models' sketch application capabilities, ...
Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4
🔥 引用:
0
Abstract: Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1) semantic operator naming achieves a 100% bypass rate (50/50, p < 0.001); (2) the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0%; (3) example ordering matters strongly (interleaved: 76%, harmful-first: 6%); (4) temperature has no meaningful effect (46-56% across 0.0-1.0). On the HarmBench benchmark, IICL achieves a 24.0% bypass rate [18.6%, 30.4%] against GPT-5.4 with detailed 619-word responses, compared to 0.0% for direct queries.
Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing
🔥 引用:
0
Abstract: Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .
In Silico Study of Bioactive Compounds from Indonesian Medicinal Plants as Xanthine Oxidase Inhibitors via Molecular Docking and ADMET Approaches
🔥 引用:
0
Abstract: Xanthine oxidase (XO) plays a role in the formation of uric acid and contributes to hyperuricemia, whereas the use of synthetic inhibitors such as allopurinol is known to have side effects, thus requiring alternatives from the bioactive compounds of medicinal plants. This study aims to evaluate the potential of Dillapiole, Piperine, Hydroxychavicol, Panduratin A, and Isolicoflavonol as XO inhibitors through an in silico approach using molecular docking, as well as Lipinski and ADMET analyses. The results showed that most ligands met the drug-likeness criteria, except for Panduratin A, which had one violation of LogP. All ligands showed negative binding affinity, with Isolicoflavonol having the best affinity (−9.5 kcal/mol), followed by Piperine and Panduratin A. ADMET predictions showed that most ligands had good absorption and were not mutagenic, although some ligands had the potential to interact with CYP450 enzymes. Overall, Isolicoflavonol showed the best potential as an XO inhibitor candidate based on binding affinity and ADMET profile. These findings affirm the potential of medicinal plant bioactive compounds as alternative XO inhibitors, although further in vitro and in vivo testing is still needed for further validation.
scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics
🔥 引用:
0
Abstract: The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.
LLM‐Based Scientific Assistants for Knowledge Extraction: Which Design Choices Matter?
🔥 引用:
0
Abstract: Large Language Model chatbots have gained significant popularity, offering knowledge to support specialists in diverse fields. However, adapting models to specific use cases and specialized domains presents considerable challenges. Hence, we introduce the LLM Playground, a comprehensive approach to optimizing LLMs for specialist applications with respect to their accuracy in answering domain‐specific questions, addressing the limitations of unmodified models. The utilized optimization techniques begin with Prompt Engineering, advance to the integration of external knowledge, and culminate in complex reasoning strategies or self‐feedback loops. This paper introduces various architectures for scientific assistants, comprising individual enhancement techniques, both in isolation and in combination with others, designed to facilitate comparisons. To demonstrate the efficacy of the LLM Playground, a chemical chatbot is set up as a case study, and the optimization techniques are compared using ChemBench, an independent question–answer benchmark for the chemical domain, to measure its performance. By providing tested, ready‐to‐deploy architectures and clear use‐case guidance, this work helps researchers and practitioners leverage LLMs in domain‐specific applications. The insights and methodologies presented in this paper contribute to the growing body of knowledge on tailoring LLMs to meet the unique demands of specialized fields.
Epistemic orientation in parliamentary discourse is associated with deliberative democracy
🔥 引用:
0
Abstract: The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence-Minus-Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
Detoxification for LLM: From Dataset Itself
🔥 引用:
0
Abstract: Existing detoxification methods for large language models mainly focus on the post-training stage or inference time, while few tackle the source of toxicity, namely, the dataset itself. Such training-based or controllable decoding approaches cannot completely suppress the model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity that the model learns during training. Hence, we attempt to detoxify directly on raw corpora with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans in raw data while preserving semantics, in our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and allowing seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: https://github.com/ntsw2001/data_detox_for_llm)
When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift
🔥 引用:
0
Abstract: The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
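The edge-shuffle ablation mentioned above has a simple form: randomize the graph's wiring while keeping the edge count fixed, then check whether GNN performance drops. A hedged sketch of one common variant (permuting destination endpoints; the paper's exact shuffle protocol may differ):

```python
import numpy as np

def edge_shuffle(edge_index: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomize topology by permuting the destination endpoint of every edge.

    edge_index: (2, E) array of (src, dst) pairs. The edge count and the
    multiset of destinations are preserved, but src-dst pairing is destroyed.
    """
    shuffled = edge_index.copy()
    shuffled[1] = rng.permutation(shuffled[1])
    return shuffled

rng = np.random.default_rng(42)
edges = np.array([[0, 1, 2, 3],
                  [1, 2, 3, 0]])          # toy 4-node cycle
rand_edges = edge_shuffle(edges, rng)
assert rand_edges.shape == edges.shape    # same number of edges
```

If a GNN trained on the shuffled graph matches or beats the one trained on the real graph, as the abstract reports for Elliptic, the topology is contributing noise rather than signal under the evaluation protocol.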
Building smarter digital content: a CRITIC – DEMATEL framework for leveraging large language model optimization in marketing
🔥 引用:
0
Abstract:
This study addresses a practical problem faced by digital marketers, content creators and digital business agencies: creating content that is both human-readable and LLM-compatible. It identifies and analyses the key factors influencing content optimization for Large Language Models (LLMs) to develop a strategic framework for Large Language Model Optimization (LLMO) that aligns with modern search paradigms.
This research employs a two-phase multi-criteria decision-making (MCDM) approach combining CRITIC (Criteria Importance Through Intercriteria Correlation) to determine factor weights with DEMATEL (Decision-Making Trial and Evaluation Laboratory) to map causal relationships. A panel of 15 experts across three countries (India, UAE and USA) rated the influence of five identified factors.
The study identifies five critical factors for LLMO: Retrieval Augmentation, Readability Enhancement, Content Quality Assurance, Filtering of Unsafe Content and User-Centric Content Design. Retrieval Augmentation and User-Centric Design emerged as key causal factors, while Readability and Content Quality acted as bridges or effects. Although factor weights were relatively balanced, the DEMATEL analysis revealed interdependencies highlighting the dynamic nature of LLMO.
The results provide actionable guidance to digital marketing experts and agencies, content strategists, marketing heads and developers to structure web content that is both human-readable and LLM-compatible. The study offers insights to organizations on how they can enhance their digital visibility and authority in AI-powered search ecosystems.
This study fills a critical gap by offering the first integrated CRITIC-DEMATEL framework for LLMO. It distinguishes LLMO from traditional SEO and offers a novel causal model to support the development of holistic, future-ready content strategies.
Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
🔥 引用:
0
Abstract: This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size, that is, the number of distinct meanings expressed in the sampled responses, provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
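The coverage signal that gates SHADE's fusion rule has a standard Good-Turing closed form, C = 1 - N1/N, where N1 is the number of meanings observed exactly once and N the number of samples. A minimal sketch of that component alone (the entailment graph and heat-kernel trace are paper-specific and omitted; meaning labels are assumed to come from an upstream semantic clusterer):

```python
from collections import Counter

def good_turing_coverage(meaning_ids) -> float:
    """Good-Turing coverage estimate C = 1 - (singletons / samples).

    A low C suggests substantial unseen semantic mass: many meanings were
    observed only once, so more distinct meanings likely remain unsampled.
    """
    counts = Counter(meaning_ids)
    n = len(meaning_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n

# 6 sampled responses clustered into 3 meanings; one meaning seen only once.
samples = ["A", "A", "A", "B", "B", "C"]
cov = good_turing_coverage(samples)
print(round(cov, 3))  # 0.833
```

In SHADE's terms, a high value like this would select the convex-combination fusion, while a singleton-heavy sample would push coverage down and trigger the LogSumExp branch.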
Guardians of the Lifeworld
🔥 引用:
0
Abstract: This paper draws inspiration from phenomenology and Marvel Entertainment’s Guardians of the Galaxy (GotG) to elevate and guard the practices of scholarship, as one response to the encroachment and infiltration of large language models (LLMs) and generative artificial intelligence (GenAI) into academic knowledge work, thinking in particular. LLMs and GenAI may offer to accelerate laborious knowledge work, and gild with kudos those who thus leverage it, for example in compiling entire literature reviews on demand. We contend that automating aspects of scholarship impoverishes the necessarily relational, embodied and ethical activity of scholars. In this AI-infiltrated era, we present a novel (and fun!) phenomenological stimulant for delegates to the networked learning conference. Rather than just the usual canonical philosophical writers, our inspiration draws upon some of the main characters of GotG. We argue that GotG may be encountered, attending to their narrative arcs: their dialogue, gesture, sacrifice and relation. These facets, as they play out in moments of grief and generosity, etc., can help us to reflect on our own attempts at scholarly practice through current GenAI-troubled tensions. These lived tensions, as they may texture networked collaborative academic work, are briefly elaborated through four fragments: the disappeared voice, the stillness that registers as absence, the question that got automated and the speed that erases process. These fragments share a flavour of what the Guardians may address. Brief depictions and insights from the Guardians we include are informed by repeat viewing of the GotG movie trilogy and fan wikis. We begin with Star-Lord, a figure of fragmentation and longing, of disconnection and yet drawing repeatedly back from the brink through music and the call of what matters. We introduce Gamora, as guarding the possibility of ethics amid torrid interminable violence.
For us, Rocket and Nebula guard the possibility of healing without erasure. Groot and Mantis guard the quiet forms of knowing. Drax guards the fragility of interpretation. We invite delegates to join us in drawing playful resilience and hope from our GotG interpretive companions, and to celebrate the freedom of deliberate, unplugged inception into scholarly discourse, to guard the conditions of thought and thinking itself.
Multi‐modal large language models for paediatric tele‐ophthalmology: A blinded real‐world evaluation of diagnostic accuracy and safety
DOI:
10.1111/aos.70136
🔥 Citations:
0
Abstract:
The rapid integration of large language models (LLMs) into online medical consultations demands rigorous evaluation, particularly in specialized fields like paediatric ophthalmology. This study systematically assessed the diagnostic accuracy, safety and communication quality of advanced LLMs in real‐world paediatric ophthalmology tele‐consultations, with a focus on comparing text‐only inputs to multi‐modal inputs (text paired with parent‐taken mobile photographs).
A prospective, blinded, multi‐model study was conducted using 50 authentic paediatric ophthalmology cases from an internet‐based hospital platform. Four leading LLMs—Grok 4.1, GPT 5.1 Thinking, Gemini 2.5 Pro (advanced models) and GPT‐4o (baseline)—were tested in multi‐modal and text‐only scenarios. Responses were evaluated by three blinded senior paediatric ophthalmologists using a composite scoring system (maximum 19 points), encompassing diagnostic accuracy (up to six points), guideline adherence and safety (up to four points) and parent communication quality (up to nine points).
Advanced LLMs consistently outperformed GPT‐4o, driven by superior Safety and Communication efficacy. Grok 4.1 led with the highest overall score in the multi‐modal arm (15.06 ± 0.68), followed by GPT 5.1 Thinking (14.52 ± 1.19) and Gemini 2.5 Pro (14.30 ± 1.08), while GPT‐4o lagged significantly at 11.20 ± 0.72. In the text‐only arm, advanced models again excelled: Grok 4.1 (14.08 ± 1.18), GPT 5.1 Thinking (13.62 ± 1.77) and Gemini 2.5 Pro (13.42 ± 1.50). Multi‐modal inputs significantly boosted scores (e.g., Grok 4.1: Δ = 0.98, p < 0.001), with gains specifically attributable to improvements in Safety and Communication rather than diagnostics. Regarding safety, advanced models exhibited markedly fewer major harmful responses (Grok 4.1: 4%; GPT 5.1 Thinking: 8%; Gemini 2.5 Pro: 10%) compared to GPT‐4o (16%). Parent preferences favoured advanced LLMs, with Grok 4.1 receiving 23% of votes, reflecting higher perceived clarity, empathy and trustworthiness.
Advanced LLMs demonstrate markedly superior capabilities over GPT‐4o in handling paediatric ophthalmology tele‐consultations, especially with multi‐modal data, offering enhanced safety, communication and parent trust. However, persistent variations in safety across models and residual risks of harmful advice underscore the need for condition‐specific validation and stringent safety guardrails prior to deployment. Multi‐modal integration is essential for optimizing LLM reliability in this high‐stakes domain.
Improving LLM-Driven Test Generation by Learning from Mocking Information
🔥 Citations:
0
Abstract: Large Language Models (LLMs) have recently shown strong potential for automated unit test generation. This has motivated us to investigate whether developer-defined test doubles (commonly referred to as mocks) available in existing test suites can be leveraged to improve LLM-driven test generation. To this end, we propose MOCKMILL, an LLM-based technique and tool that generates test cases by exploiting mocking information automatically extracted from developer-written tests. MOCKMILL targets components that are replaced by test doubles in existing tests and uses the encoded stubbings and interaction expectations to guide test generation, combined with an iterative generation-and-repair process to ensure executable tests. We evaluated MOCKMILL on 10 open-source classes from six Java projects using four LLMs, and compared the generated tests with existing project tests and tests produced by baseline approaches. The results show that MOCKMILL's tests cover lines of code and kill mutants that existing tests and baseline-generated tests miss. Overall, our findings provide preliminary evidence that leveraging mocking information is a complementary and effective way to enhance LLM-based test generation.
Enabling Next-Generation Mass Spectrometry-Based Proteomics: Standards, Proteoform Resolution, and FAIR, Reproducible, and Quantitative Analysis
🔥 Citations:
0
Abstract: Recent advances in mass spectrometry, data-independent acquisition, proteoform-resolving workflows, and multi-omics integration have significantly expanded the scale and scope of proteomics. However, the reuse and translational application of these datasets are limited by inconsistent standards, insufficient metadata, and inadequate computational interoperability. Proteoform-centric approaches provide higher molecular resolution by capturing intact protein variants and patterns of post-translational modification. Computational methods, including selected applications of machine learning and large language models (LLMs), are increasingly used for tasks such as spectral prediction and pattern discovery in clinical proteomics datasets. Despite these advancements, FAIR (Findable, Accessible, Interoperable, and Reusable) data practices, proteoform biology, and AI analytics are often pursued independently. This work presents an integrated framework for next-generation proteomics in which standardization and FAIR principles establish machine-actionable foundations for proteoform-resolved analysis and computational inference. It examines community efforts to promote data sharing and interoperability, as well as strategies for characterizing proteoforms using bottom-up, middle-down, and top-down approaches. It also highlights emerging AI and ML applications within the proteomics workflow. The framework emphasizes the importance of treating proteoforms as primary computational entities and adopting FAIR practices during data collection to enable reproducible and interpretable modeling. Finally, it introduces an architectural model that integrates FAIR infrastructures and proteoform resolution. In addition, practical recommendations for making AI-ready proteomics, including a minimal community checklist to support reproducibility, benchmarking, and translational scalability, are provided.
Fine-Tuning Small Reasoning Models for Quantum Field Theory
🔥 Citations:
0
Abstract: Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and ~200M tokens of QFT reasoning traces.
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
🔥 Citations:
0
Abstract: One of the most challenging problems in graph machine learning is generalizing across graphs with diverse properties. Graph neural networks (GNNs) face a fundamental limitation: they require separate training for each new graph, preventing universal generalization across diverse graph datasets. A critical challenge facing GNNs lies in their reliance on labeled training data for each individual graph, a requirement that hinders the capacity for universal node classification due to the heterogeneity inherent in graphs -- differences in homophily levels, community structures, and feature distributions across datasets. Inspired by the success of large language models (LLMs) that achieve in-context learning through massive-scale pre-training on diverse datasets, we introduce NodePFN. This universal node classification method generalizes to arbitrary graphs without graph-specific training. NodePFN learns posterior predictive distributions (PPDs) by training only on thousands of synthetic graphs generated from carefully designed priors. Our synthetic graph generation covers real-world graphs through the use of random networks with controllable homophily levels and structural causal models for complex feature-label relationships. We develop a dual-branch architecture combining context-query attention mechanisms with local message passing to enable graph-aware in-context learning. Extensive evaluation on 23 benchmarks demonstrates that a single pre-trained NodePFN achieves 71.27 average accuracy. These results validate that universal graph learning patterns can be effectively learned from synthetic priors, establishing a new paradigm for generalization in node classification.
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
🔥 Citations:
0
Abstract: Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final-answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
Unified Multi-Foundation-Model Slide Representation for Pan-Cancer Recognition and Text-Guided Tumor Localization
🔥 Citations:
0
Abstract: The expanding ecosystem of pathology foundation models has produced powerful but fragmented tile-level representations, limiting their use in clinical tasks that require unified slide-level reasoning and interpretable linkage to clinically meaningful information. We present ASTRA, a pan-cancer framework that integrates heterogeneous foundation-model representations into a shared slide-level representation space and semantically grounds that space using structured pathology annotation fields, including classification category, cancer type, and anatomic site. ASTRA combines sparse mixture-of-experts contextualization, masked multi-model reconstruction, and contrastive alignment to structured pathology prompts to learn slide representations that support 4-category classification, 3-class solid tumor typing, 16-class cancer typing, and text-guided tumor localization without pixel-level supervision. Developed on a CHTN cohort of 10,359 whole-slide images (WSIs) spanning 16 tumor types, ASTRA consistently improves pan-cancer classification across four pathology foundation-model backbones, achieving up to 97.8% macro-AUC for 4-category classification, 99.7% for 3-class solid tumor typing, and 99.2% for 16-class cancer typing. For tumor localization, ASTRA achieves a mean Dice of 0.897 on an annotated in-domain CHTN subset (n = 380) spanning 16 cancer types and 0.738 on an external TCGA cohort (n = 1,686) spanning four cancer types. These results demonstrate that minimal structured pathology annotation fields derived from slide-level metadata can provide effective semantic supervision for unified slide representation learning, enabling both pan-cancer prediction and weakly supervised tumor localization within a single framework.
IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
🔥 Citations:
0
Abstract: We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
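The bootstrap significance testing mentioned above (10,000 resamples) can be sketched as a paired bootstrap over per-item correctness vectors. The toy accuracy figures below are hypothetical and not results from the benchmark; the function name and one-sided p-value formulation are illustrative assumptions.

```python
import random

def bootstrap_accuracy_diff(correct_a, correct_b, n_boot=10_000, seed=0):
    """Paired bootstrap over benchmark items: resample items with
    replacement and count how often model A's accuracy advantage over
    model B disappears (a one-sided p-value)."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] - correct_b[i] for i in idx) <= 0:
            not_better += 1
    return observed, not_better / n_boot

# toy per-item correctness: model A scores 85/100, model B 70/100
a = [1] * 85 + [0] * 15
b = [1] * 70 + [0] * 30
diff, p = bootstrap_accuracy_diff(a, b)   # diff = 0.15
```

Pairing the resamples item-by-item (rather than resampling each model independently) is what makes the comparison sensitive to where the two models disagree, which is presumably how distinct performance tiers are separated.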
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
🔥 Citations:
0
Abstract: Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.
Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
🔥 Citations:
0
Abstract: The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
Cell-Based Representation of Relational Binding in Language Models
🔥 Citations:
0
Abstract: Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each "cell" corresponds to an entity-relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
🔥 Citations:
0
Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan-Zh/DASH-KV
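The core idea of reformulating attention as approximate nearest-neighbor search over hashed keys can be sketched as below, using sign random projections and a Hamming-distance candidate filter. This is a generic LSH-style illustration under assumed shapes, not DASH-KV's learned asymmetric encoder or its dynamic mixed-precision mechanism.

```python
import numpy as np

def srp_hash(x, planes):
    # sign random projection: one bit per hyperplane
    return (x @ planes.T > 0).astype(np.uint8)

def hash_topk_attention(q, K, V, planes, k=8):
    """Approximate attention: keep only the k keys whose binary codes
    are closest to the query's in Hamming distance, then run exact
    softmax attention over that small candidate set."""
    q_code = srp_hash(q[None, :], planes)[0]
    k_codes = srp_hash(K, planes)
    hamming = (k_codes != q_code).sum(axis=1)
    cand = np.argsort(hamming)[:k]                  # candidate tokens
    scores = K[cand] @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[cand]

rng = np.random.default_rng(0)
dim, n_tokens, n_bits = 64, 1024, 128
planes = rng.standard_normal((n_bits, dim))
K = rng.standard_normal((n_tokens, dim))
V = rng.standard_normal((n_tokens, dim))
q = K[37] + 0.01 * rng.standard_normal(dim)         # query close to key 37
out = hash_topk_attention(q, K, V, planes)
```

Because only k candidates are scored per query, the per-query cost is dominated by the binary code comparison rather than dense dot products, which is the intuition behind reducing the attention cost toward linear in context length.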
Benchmarking single-cell foundation models for real-world RNA-seq data integration
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Do Emotions Influence Moral Judgment in Large Language Models?
🔥 Citations:
0
Abstract: Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.
Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
🔥 Citations:
0
Abstract: In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate "seen-data" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.
AI-Powered Palliative Care Management System: Integrating Large Language Models for Clinical Translation and Automated Condition Assessment
DOI:
10.55041/ijsrem60768
🔥 Citations:
0
Abstract: Delivering effective palliative care in distributed home settings presents significant challenges, particularly regarding accurate clinical documentation and patient monitoring by field workers (e.g., Asha workers) who may lack formal medical training or proficiency in professional medical English. We present the design and implementation of an AI-powered Palliative Care Assistance System built on the Django web framework. The platform seamlessly integrates Large Language Models (a local MedGemma 4B instance and Google Gemini Vision) to automate clinical translation, converting layman terminology and regional languages (Malayalam) into standardized medical English. Furthermore, the system employs vision-capable LLMs to generate structured summaries of medical images (X-Rays, Lab Reports, Prescriptions) and evaluates patient visit histories to dynamically assess condition severity (Stable, Moderate, Severe). Supported by a robust backend managing bilingual visit logs, material allocations, and role-based portals, the system significantly reduces documentation overhead and enhances clinical decision-making for doctors and head nurses.
Index Terms—palliative care, large language models, generative AI, medical translation, clinical assessment, Django, health informatics, bilingual systems
From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems
🔥 Citations:
0
Abstract: Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. We propose a unified benchmarking framework to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R-UET/CFExpRec.
The Logical Expressiveness of Topological Neural Networks
🔥 Citations:
0
Abstract: Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: what is the logical expressiveness of TNNs? Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called the $k$-CCWL test. In addition, we introduce the topological counting logic (TC$_k$), an extension of standard counting logic featuring a novel pairwise counting quantifier $ \exists^{N}(x_i,x_j)\, \varphi(x_i,x_j), $ which explicitly quantifies pairs $(x_i, x_j)$ satisfying property $\varphi$. We rigorously prove the exact equivalence: $ \text{k-CCWL} \equiv \text{TC}_{k{+}2} \equiv \text{Topological }(k{+}2)\text{-pebble game}.$ These results establish a logical expressiveness theory for TNNs.
CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
🔥 Citations:
0
Abstract: Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model's performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.
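The Maximum Mean Discrepancy (MMD) objective used above for kernelized distribution matching can be sketched as below on toy "hidden-state" vectors. This is a generic biased MMD estimator with an RBF kernel and an assumed bandwidth heuristic, not CoDA's adapter or its CoT-enriched distillation term.

```python
import numpy as np

def mmd_rbf(X, Y, gamma=None):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel, comparing two sets of feature vectors."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]        # simple 1/dim bandwidth heuristic
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
src = rng.standard_normal((200, 16))             # "source-domain" states
tgt_near = rng.standard_normal((200, 16))        # same distribution
tgt_far = rng.standard_normal((200, 16)) + 2.0   # shifted distribution
mmd_near = mmd_rbf(src, tgt_near)                # close to zero
mmd_far = mmd_rbf(src, tgt_far)                  # clearly positive
```

Minimizing such a term between source and target hidden states pushes the two latent distributions together, which is the alignment effect the abstract attributes to the MMD component.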
CreativeGame: Toward Mechanic-Aware Creative Game Generation
🔥 Citations:
0
Abstract: Large language models can generate plausible game code, but turning this capability into iterative creative improvement remains difficult. In practice, single-shot generation often produces brittle runtime behavior, weak accumulation of experience across versions, and creativity scores that are too subjective to serve as reliable optimization signals. A further limitation is that mechanics are frequently treated only as post-hoc descriptions, rather than as explicit objects that can be planned, tracked, preserved, and evaluated during generation. This report presents CreativeGame, a multi-agent system for iterative HTML5 game generation that addresses these issues through four coupled ideas: a proxy reward centered on programmatic signals rather than pure LLM judgment; lineage-scoped memory for cross-version experience accumulation; runtime validation integrated into both repair and reward; and a mechanic-guided planning loop in which retrieved mechanic knowledge is converted into an explicit mechanic plan before code generation begins. The goal is not merely to produce a playable artifact in one step, but to support interpretable version-to-version evolution. The current system contains 71 stored lineages, 88 saved nodes, and a 774-entry global mechanic archive, implemented in 6,181 lines of Python together with inspection and visualization tooling. The system is therefore substantial enough to support architectural analysis, reward inspection, and real lineage-level case studies rather than only prompt-level demos. A real 4-generation lineage shows that mechanic-level innovation can emerge in later versions and can be inspected directly through version-to-version records. The central contribution is therefore not only game generation, but a concrete pipeline for observing progressive evolution through explicit mechanic change.
sdAbs-LLM: Generative Large Language Models For de novo Antibody Design and Agentic Evaluation
🔥 Citations:
0
Abstract: No abstract available; see the original article.
Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images
🔥 Citations:
0
Abstract: Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator.
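For reference, the Jeffries planimetric procedure in ASTM E112 counts grains fully inside the test area at full weight and boundary-intersected grains at half weight, then converts grains per square millimetre (at 1x) into the grain size number G. A minimal sketch with invented counts; the helper names are ours, the constants are the standard's published relation.

```python
import math

def jeffries_na(n_inside, n_boundary, area_mm2):
    # Jeffries planimetric count: interior grains count once,
    # grains cut by the test-area boundary count one half.
    return (n_inside + 0.5 * n_boundary) / area_mm2

def astm_grain_size_number(n_a):
    # ASTM E112 relation between grains per mm^2 (at 1x) and G:
    # G = 3.321928 * log10(N_A) - 2.954.
    return 3.321928 * math.log10(n_a) - 2.954

n_a = jeffries_na(n_inside=48, n_boundary=20, area_mm2=0.5)  # 116 grains/mm^2
print(round(astm_grain_size_number(n_a), 1))  # 3.9
```

A handy sanity check built into the relation: doubling the grain density raises G by exactly one unit.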
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
🔥 Citations:
0
Abstract: Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model's original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.
Discerning Authorship in Online Health Communities: Experience, Trust, and Transparency Implications for Moderating AI
🔥 Citations:
0
Abstract: For online health communities, community trust is paramount. Yet, advances in Large Language Models (LLMs) generating advice may erode this trust, especially if users cannot identify whether LLMs have been used. We investigate the feasibility of community-based detection of health advice authorship and how self-moderation of LLMs could help enhance advice utilization. In an online experiment, we evaluate people's ability to distinguish AI-generated from human-written advice across two health conditions, considering lived experience with a condition, AI-recognition training, and user attitudes towards transparency and trust around AI use. Our results indicate the need for transparency coupled with trust. We find little evidence of people's ability to discern advice authorship. However, we find a consistent effect of the health condition. Our qualitative findings identify unreliable signals, resulting in flawed heuristic evaluations of the advice. Our findings point to opportunities to improve the self-moderation of LLM-based AI and aid community-based AI moderation.
Streamliners for Answer Set Programming
🔥 Citations:
0
Abstract: Streamliner constraints reduce the search space of combinatorial problems by ruling out portions of the solution space. We adapt the StreamLLM approach, which uses Large Language Models (LLMs) to generate streamliners for Constraint Programming, to Answer Set Programming (ASP). Given an ASP encoding and a few small training instances, we prompt multiple LLMs to propose candidate constraints. Candidates that cause syntax errors, render satisfiable instances unsatisfiable, or degrade performance on all training instances are discarded. The surviving streamliners are evaluated together with the original encoding, and we report results for a virtual best encoding (VBE) that, for each instance, selects the fastest among the original encoding and its streamlined variants. On three ASP Competition benchmarks (Partner Units Problem, Sokoban, Towers of Hanoi), the VBE achieves speedups of up to 4--5x over the original encoding. Different LLMs produce semantically diverse constraints, not mere syntactic variations, indicating that the approach captures genuine problem structure.
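The virtual best encoding (VBE) reported above is simply a per-instance minimum over candidate runtimes; a toy sketch with invented timings (the dictionary layout is our illustration):

```python
def virtual_best(runtimes):
    # runtimes: {encoding_name: {instance: seconds}}.
    # For each instance the VBE takes the fastest encoding.
    instances = next(iter(runtimes.values())).keys()
    return {i: min(enc[i] for enc in runtimes.values()) for i in instances}

def speedup(base, vbe):
    # Total-time speedup of the VBE over the original encoding.
    return sum(base.values()) / sum(vbe.values())

runtimes = {
    "original":    {"i1": 10.0, "i2": 40.0, "i3": 5.0},
    "streamline1": {"i1": 2.0,  "i2": 60.0, "i3": 5.0},
    "streamline2": {"i1": 12.0, "i2": 8.0,  "i3": 9.0},
}
vbe = virtual_best(runtimes)
print(round(speedup(runtimes["original"], vbe), 2))  # 55 / 15 -> 3.67
```

Note that a streamliner can lose on some instances (streamline1 on i2 here) yet still contribute to the VBE, which is why per-instance selection matters.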
Semantic Prompting: Agentic Incremental Narrative Refinement through Spatial Semantic Interaction
🔥 Citations:
0
Abstract: Interactive spatial layouts empower users to synthesize information and organize findings for sensemaking. While Large Language Models (LLMs) can automate narrative generation from spatial layouts, current collage-based and re-generation methods struggle to support the incremental spatial refinements inherent to the sensemaking process. We identify three critical gaps in existing spatial-textual generation: interaction-revision misalignment, human-LLM intent misalignment, and lack of granular customization. To address these, we introduce Semantic Prompting, a framework for spatial refinement that perceives semantic interactions, reasons about refinement intent, and performs targeted positional revisions. We implemented S-PRISM to realize this framework. The empirical evaluation demonstrated that S-PRISM effectively enhanced the precision of interaction-revision refinement. A user study ($N=14$) highlighted how participants leveraged S-PRISM for incremental formalization through interactive steering. Results showed that users valued its efficient, adaptable, and trustworthy support, which effectively strengthens human-LLM intent alignment.
Empowering NPC Dialogue with Environmental Context Using LLMs and Panoramic Images
🔥 Citations:
0
Abstract: We present an approach for enhancing non-playable characters (NPCs) in games by combining large language models (LLMs) with computer vision to provide contextual awareness of their surroundings. Conventional NPCs typically rely on pre-scripted dialogue and lack spatial understanding, which limits their responsiveness to player actions and reduces overall immersion. Our method addresses these limitations by capturing panoramic images of an NPC's environment and applying semantic segmentation to identify objects and their spatial positions. The extracted information is used to generate a structured JSON representation of the environment, combining object locations derived from segmentation with additional scene graph data within the NPC's bounding sphere, encoded as directional vectors. This representation is provided as input to the LLM, enabling NPCs to incorporate spatial knowledge into player interactions. As a result, NPCs can dynamically reference nearby objects, landmarks, and environmental features, leading to more believable and engaging gameplay. We describe the technical implementation of the system and evaluate it in two stages. First, an expert interview was conducted to gather feedback and identify areas for improvement. After integrating these refinements, a user study was performed, showing that participants preferred the context-aware NPCs over a non-context-aware baseline, confirming the effectiveness of the proposed approach.
Revisiting Framing Codebooks with AI: Employing Large Language Models as Analytical Collaborators in Deductive Content Analysis
🔥 Citations:
0
Abstract: Codebooks are central to framing research, providing theoretically grounded criteria for analyzing news content. While traditionally codebooks are built from theoretical frameworks and researchers' knowledge, applying these codebooks to large news corpora often exposes ambiguities, borderline cases, and underspecified rules that are difficult to resolve through theory alone. Moreover, news corpora evolve over time and differ across cultures, necessitating that researchers revisit the theoretical frameworks underlying these codebooks. In this article, we propose a workflow that uses Large Language Models (LLMs) to augment the creation and refinement of framing codebooks by combining theoretical frameworks with data-driven exploration. Rather than treating LLMs as automated classifiers, this approach positions them as analytic collaborators that help externalize decision rules, surface latent dimensions, and support iterative revisions of codebooks through dialogues between researchers and their data. We illustrate this workflow using a dataset of Latin American news coverage, demonstrating how the application of LLMs' capabilities has led to the surfacing of latent patterns, the generation of frame distinctions, and the adaptation of frameworks to new contexts. This method provides an LLM-assisted strategy that supports methodological creativity while preserving researchers' interpretative authority.
Pocket Specter: AI-Powered Legal Assistance System using Retrieval-Augmented Generation (RAG)
DOI:
10.55041/ijsrem60741
🔥 Citations:
0
Abstract: India faces an acute problem of legal accessibility, with about 2 lawyers per 1,000 people. Complicated legal jargon, expensive consultation fees, and lack of awareness hinder many people from seeking justice. Pocket Specter is a domain-specialized AI SaaS platform that aims to fill this void using a Retrieval-Augmented Generation (RAG) mechanism. The platform includes an AI-based legal chatbot and intelligent document analysis covering consumer, labour, and family law. Pocket Specter combines BGE-M3 embeddings and PostgreSQL's pgvector extension with a large language model to ground responses in legal documents, minimizing hallucinations. In addition, the document analysis tool extracts and highlights important information about responsibilities and potential risks in uploaded .pdf/.docx documents. According to experimental results, Pocket Specter achieves more than 75% relevance in legal responses compared with plain LLM baselines.
Index Terms—Retrieval-Augmented Generation, Legal AI, Natural Language Processing, pgvector, Document Analysis, Large Language Models, SaaS, India.
Depression Risk Assessment in Social Media via Large Language Models
🔥 Citations:
0
Abstract: Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.
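Micro- and macro-averaged F1, the metrics reported above, differ in whether per-label counts are pooled before or after computing F1. A minimal multi-label sketch with toy label sets (not the paper's data):

```python
def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def micro_macro_f1(y_true, y_pred, n_labels):
    # y_true / y_pred: one set of label indices per post.
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per-label [tp, fp, fn]
    for t, p in zip(y_true, y_pred):
        for k in range(n_labels):
            if k in p and k in t:
                counts[k][0] += 1
            elif k in p:
                counts[k][1] += 1
            elif k in t:
                counts[k][2] += 1
    micro = f1(*(sum(c[j] for c in counts) for j in range(3)))  # pool, then F1
    macro = sum(f1(*c) for c in counts) / n_labels              # F1, then mean
    return micro, macro

micro, macro = micro_macro_f1(
    y_true=[{0, 1}, {1}, {2}], y_pred=[{0}, {1, 2}, {2}], n_labels=3)
print(micro)           # 0.75
print(round(macro, 3)) # 0.778
```

Macro-F1 weights rare emotion labels equally with common ones, which is why the two numbers above diverge when per-label performance is uneven.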
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
🔥 Citations:
0
Abstract: Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
🔥 Citations:
0
Abstract: Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AudioRatio, AudioConsistency, AudioEntropy, and TextEntropy, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.
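Although the abstract does not give exact definitions, attention-mass and attention-entropy features of the kind it names can be sketched from a single attention row; the definitions below are our assumptions for illustration, not the paper's metrics.

```python
import numpy as np

def audio_ratio(attn_row, audio_positions):
    # Fraction of attention mass a generated token places on audio positions;
    # low mass on the audio may indicate the model is "making things up".
    return float(attn_row[audio_positions].sum() / attn_row.sum())

def attention_entropy(attn_row, eps=1e-12):
    # Shannon entropy of the normalized attention distribution; diffuse,
    # high-entropy attention is a candidate hallucination signal.
    p = attn_row / attn_row.sum()
    return float(-(p * np.log(p + eps)).sum())

attn = np.array([0.05, 0.05, 0.6, 0.2, 0.1])  # toy attention over 5 positions
audio = np.array([2, 3, 4])                   # positions holding audio tokens
print(round(audio_ratio(attn, audio), 3))     # 0.9
uniform = np.full(5, 0.2)
assert attention_entropy(uniform) > attention_entropy(attn)
```

Per-head features like these, stacked into a vector, are the kind of input a lightweight logistic regression classifier can consume at inference time.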
Value systems of artificial intelligence and university students: theoretical dominance in large language models and religious priority in humans
🔥 Citations:
0
Abstract: The rapid advancement of artificial intelligence (AI), particularly large language models (LLMs), raises critical questions about the value system these systems appear to reflect in comparison with human values. This study aimed to examine Spranger’s six value types (religious, social, theoretical, economic, political, and aesthetic) as manifested in three LLMs (OpenAI-o1, Gemini-2.0, and DeepSeek-V3), and to compare them with the value system of a sample of students at King Khalid University. A descriptive–comparative design was employed, administering the Study of Values to both groups: 214 students (male and female across academic levels) and the three LLMs, with repeated administrations to the latter to ensure test–retest reliability. Results indicated statistically significant differences in both the prominence and ranking of values across groups. Theoretical values consistently dominated in the LLMs, followed by social, aesthetic, and political values, with religious values ranking lowest. In contrast, students prioritized religious values, followed by theoretical values, while aesthetic values occupied the lowest ranks. Further, significant effects of gender and academic level were observed among students: religious values were more salient among females, theoretical values among males, and aesthetic values among undergraduates. These findings suggest that LLMs project a value system shaped by their training data, rather than by human cultural or moral frameworks. The study highlights the importance of integrating culturally diverse value dimensions into AI development and calls for raising students’ awareness of using AI tools in ways aligned with human values. Effect-size estimates further indicated very large human–AI discrepancies, particularly in the religious (d = 2.21) and theoretical domains (d = 1.22).
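The effect sizes reported above (e.g. d = 2.21) are Cohen's d, the mean difference scaled by the pooled standard deviation; a minimal sketch with made-up score lists (the data and variable names are ours):

```python
import math

def cohens_d(group_a, group_b):
    # Cohen's d: difference of means divided by the pooled standard deviation.
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

students = [42, 45, 40, 44, 43]  # hypothetical religious-values scores
llms = [30, 31, 29]
print(round(cohens_d(students, llms), 2))
```

By the usual rule of thumb, |d| around 0.8 already counts as a large effect, so the reported d = 2.21 marks an unusually wide human-AI gap.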
Headlines You Won't Forget: Can Pronoun Insertion Increase Memorability?
🔥 Citations:
0
Abstract: For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted and their immediate contexts. Additional data and fine-grained analysis is needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.
Post-deployment monitoring of foundation models in radiology: Why outputs aren’t enough
🔥 Citations:
0
Abstract: No abstract available; see the original article.
A Context-Aware Feedback Loop for AI-Assisted Verification IP Synthesis: Bridging the Gap from Natural Language to Regression-Ready
🔥 Citations:
0
Abstract: The widening gap between System-on-Chip (SoC) design complexity and verification productivity has rendered traditional script-based automation insufficient. While Large Language Models (LLMs) offer promise for code synthesis, they typically fail in hardware verification contexts due to a lack of architectural consistency and an inability to reason about temporal signal semantics. This paper proposes a Context-Aware Verification Loop (CAVL) methodology, an iterative framework that integrates semantic project indexing with simulation-based feedback to achieve verification closure. Unlike static generation, CAVL employs a dynamic refinement cycle where compiler diagnostics, simulation logs, and functional coverage metrics serve as feedback signals to guide the AI agent. We validate this framework on a dual-mode I2C Universal Verification Methodology (UVM) environment as a representative case study. The experimental results indicate, within this single-protocol context, the framework’s capacity to (1) resolve complex signal-level contention issues through logic refactoring, (2) achieve complete functional coverage via directed test synthesis, and (3) maintain cross-file architectural consistency with reduced human intervention. This work presents an initial quantitative baseline for AI-driven Electronic Design Automation (EDA), suggesting that context-aware feedback loops offer a pathway toward restructuring the verification engineer’s role from implementation to architectural intent specification.
Defining Robust Ultrasound Quality Metrics via an Ultrasound Foundation Model
🔥 Citations:
0
Abstract: Clinicians lack a principled framework to quantify diagnostic utility in ultrasound reconstructions. Existing standards like PSNR and VGG-LPIPS are inadequate, failing to account for modality-specific physics or the structural nuances of acoustic imaging. We close this gap with a TinyUSFM-based evaluation framework featuring two distinct metrics: TinyUSFM-uLPIPS, a full-reference perceptual distance based on multi-layer token relations, and TinyUSFM-NRQ, a deployable no-reference quality score utilizing clean-manifold modeling and worst-region aggregation to detect localized harmful artifacts. We demonstrate that the presented metrics have four unique advantages: 1) Task-linked quality, where TinyUSFM-uLPIPS achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail; 2) Cross-organ comparability, maintaining stable scoring scales and consistent severity rankings across diverse anatomical sites and domain-shifted data; 3) PSNR-consistent sensitivity, with TinyUSFM-NRQ providing a reliable quality score without ground-truth images that remains consistent with traditional fidelity benchmarks (i.e. PSNR); and 4) Clinical utility, improving the prediction of expert preference from 47.2% to 72.8% accuracy and producing super-resolution reconstructions preferred by sonographers. By integrating these advantages into a unified assessment and optimization loop, this work establishes a modality-aligned standard that finally bridges the gap between algorithmic performance and diagnostic utility. https://github.com/sextant-fable/US-Metrics
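Worst-region aggregation, as described above, lets the lowest-scoring local patches dominate the image-level score so that a small harmful artifact is not averaged away. A purely illustrative sketch under that idea, not TinyUSFM-NRQ itself:

```python
import numpy as np

def worst_region_score(patch_scores, k=1):
    # Mean of the k lowest patch scores: a single bad region drags the
    # image-level score down instead of being averaged away.
    flat = np.sort(np.asarray(patch_scores, dtype=float).ravel())
    return float(flat[:k].mean())

clean = np.full((4, 4), 0.9)   # toy per-patch quality scores, clean image
marred = clean.copy()
marred[1, 2] = 0.1             # one localized harmful artifact
print(round(float(marred.mean()), 2))  # global average barely moves: 0.85
print(worst_region_score(marred))      # worst-region score exposes it: 0.1
```

The contrast between the two printed values is the point: a global mean hides a clinically relevant local defect that min-style aggregation surfaces.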
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
🔥 Citations:
0
Abstract: Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting LLM intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLM's IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free "plug-in" mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
🔥 Citations:
0
Abstract: Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas its quality is governed by visual aesthetics. This modality gap leads current models to frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains; or on fine-tuning with large-scale datasets, which still provides weak and indirect aesthetic supervision. In contrast, the explicit use of aesthetic principles as supervision remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of meticulously designed verifiable metrics to quantify slide layout quality, capturing key layout issues in an accurate, efficient, and low-cost manner. Leveraging these verifiable metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. With only 5K training prompts on GLM-4.7-Flash, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows a substantial improvement in overall quality, increasing scores from 3.31 to 3.56 (+7.6%), outperforming both model-based reward optimization and reflection-based agentic approaches, and even edging out Claude-Sonnet-4.5. These results demonstrate that such a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
🔥 Citations:
0
Abstract: Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.
Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram
🔥 Citations:
0
Abstract: This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
🔥 Citations:
0
Abstract: Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.
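One simple way to quantify the semantic drift described above is the mean cosine distance between consecutive solution embeddings along a trajectory; the metric and the toy embeddings below are our illustration, not the paper's measure.

```python
import numpy as np

def semantic_drift(embeddings):
    # Mean cosine distance between consecutive solutions in a trajectory;
    # high drift means the search jumps around the semantic space.
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    cos = (e[:-1] * e[1:]).sum(axis=1)
    return float((1.0 - cos).mean())

local = [[1, 0], [0.9, 0.1], [0.8, 0.2]]  # incremental refinement
jumpy = [[1, 0], [0, 1], [-1, 0]]         # large semantic jumps
assert semantic_drift(local) < semantic_drift(jumpy)
print(round(semantic_drift(jumpy), 1))  # 1.0
```

Under this kind of measure, the "local refiner" behavior the study attributes to strong optimizers corresponds to a low, steadily shrinking drift.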
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
🔥 Citations:
0
Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench
DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging
🔥 Citations:
0
Abstract: Automated Program Repair (APR) has benefited from the code understanding and generation capabilities of Large Language Models (LLMs). Existing feedback-based APR methods iteratively refine candidate patches using test execution feedback and have shown promising results. However, most rely on outcome-level failure symptoms, such as stack traces, which show how failures are observed but fail to expose the intermediate runtime states critical for root-cause analysis. As a result, LLMs often infer bug causes without sufficient runtime evidence, leading to incorrect patches. To address this limitation, we propose DebugRepair, a self-directed debugging framework for LLM-based APR. DebugRepair enhances patch refinement with intermediate runtime evidence collected through simulated debugging. It consists of three components: test semantic purification, simulated instrumentation, and debugging-driven conversational repair. Together, they reduce noisy test context, collect runtime traces through targeted debugging statements with rule-based fallback, and progressively refine candidate patches using prior attempts and newly observed runtime states. We evaluate DebugRepair on three benchmarks across Java and Python. Experiments show that DebugRepair achieves state-of-the-art performance against 15 approaches. With GPT-3.5, it correctly fixes 224 bugs on Defects4J, outperforming prior SOTA LLM-based methods by 26.2%. With DeepSeek-V3, it correctly fixes 295 Defects4J bugs, surpassing the second-best baseline by 59 bugs. Across five additional backbone LLMs, DebugRepair improves repair performance by 51.3% over vanilla settings. Ablation studies further confirm the effectiveness of all components.
A Comparative Assessment of Large Language Models in Congenital Hypothyroidism: Reliability, Quality and Readability
🔥 Citations: 0
Abstract: No abstract available; please see the original article.
Robust AI generated text detection through multi-grained latent feature denoising and contrastive representation learning
🔥 Citations: 0
Abstract: As large language models (LLMs) evolve rapidly, distinguishing AI-generated text (AIGT) from human-written text (HWT) is becoming increasingly challenging. Recently, some AIGT detectors have been developed to overcome this challenge and have achieved decent accuracy. However, their brittle text representations make them highly susceptible to text perturbations, such that even minor character-level perturbations can reverse their predictions. In this work, we propose a multi-grained latent feature denoising and contrastive representation learning architecture to enhance text representations in terms of granularity, robustness, and distinguishability of features, thereby achieving robust AIGT detection. Specifically, we first extract both document-level and fine-grained segment-level features using a dual network, which captures the global and subtle local differences between AIGT and HWT. To encourage feature stability under perturbations, we inject random noise into both latent features and employ a denoising network to reconstruct the original representations. While this does not precisely simulate discrete character-level perturbations, it acts as a feature-level regularizer that suppresses non-essential variations and promotes smoother, more stable representations. Considering the similarities between AIGT and HWT, we further design a contrastive augmentation mechanism to increase the distinguishability between them. Extensive experiments demonstrate that our method not only outperforms baseline models in terms of classification accuracy but also exhibits superior robustness against various text perturbations.
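The noise-injection-and-reconstruct objective described in the abstract above can be sketched at the feature level. This is a minimal illustration, not the authors' implementation: `denoise_fn` stands in for their denoising network, and all names and values are hypothetical.

```python
import random

def denoising_regularizer(features, denoise_fn, sigma=0.1, seed=0):
    """Feature-level denoising loss: perturb a latent feature vector with
    Gaussian noise, reconstruct it with a denoising network, and penalize
    the mean squared error against the clean feature. Minimizing this term
    pushes the model toward representations that are stable under small
    perturbations (a smoothness regularizer, not a simulation of discrete
    character-level attacks)."""
    rng = random.Random(seed)
    noisy = [f + rng.gauss(0.0, sigma) for f in features]
    recon = denoise_fn(noisy)
    return sum((r - f) ** 2 for r, f in zip(recon, features)) / len(features)

# With a perfect (oracle) denoiser the loss vanishes; with an identity
# "denoiser" the loss equals the mean squared injected noise.
clean = [0.2, -1.3, 0.7]
loss_identity = denoising_regularizer(clean, lambda v: v, sigma=0.5)
loss_oracle = denoising_regularizer(clean, lambda v: clean, sigma=0.5)
```

In a real detector the regularizer would be added to the classification loss and minimized jointly, so the encoder and denoiser are trained together.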
Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
🔥 Citations: 0
Abstract: This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p<.001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
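The two consistency notions contrasted in the abstract above, verbatim output reproducibility versus semantic similarity across repeated generations, can be sketched with a toy bag-of-words stand-in for the study's semantic-similarity metric. The example strings and all function names are illustrative, not taken from the study.

```python
from collections import Counter
from itertools import combinations
import math

def unique_ratio(outputs):
    """Output reproducibility: fraction of generations that are distinct
    verbatim strings (low values indicate text duplication)."""
    return len(set(outputs)) / len(outputs)

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(outputs):
    """Mean similarity over all unordered pairs of repeated generations."""
    bags = [Counter(o.lower().split()) for o in outputs]
    pairs = list(combinations(bags, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy contrast: a duplicating generator vs. a paraphrasing one. Both can
# score high on similarity, but only the uniqueness ratio separates the
# two behaviors, which is the distinction the study highlights.
repetitive = ["walk 30 min daily at moderate intensity"] * 3 + ["walk 20 min daily"]
paraphrasing = [
    "walk 30 min daily at moderate intensity",
    "daily moderate walk of 30 min",
    "moderate intensity walking for 30 min per day",
    "walk each day for 30 min at moderate intensity",
]
rep_unique, par_unique = unique_ratio(repetitive), unique_ratio(paraphrasing)
```

Reporting both numbers side by side makes the "high similarity via duplication" failure mode visible, which single-output evaluations cannot capture.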
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
🔥 Citations: 0
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating their understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear overall full-rank and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
Outsourcing Our Epiphanies
DOI:
10.31046/6ataj823
🔥 Citations: 0
Abstract: Traditionally, being an author has been considered to involve having the capacity to think. This has raised questions about the attribution of authorship to AI. In this essay, I first explain how large language models (LLMs), the technology behind popular chatbots, produce their output. I then survey recent literature which defines what thinking is, with particular comparison to the way LLMs operate. This literature argues that thinking involves opening oneself to new ideas and evaluating those ideas, always keeping alert to the possibility of unexpected discoveries. While LLMs can have new ideas put into their training data, they currently cannot evaluate those ideas correctly without human intervention, and they cannot have epiphanies. I finally argue that both of these capacities are necessary for authorship.
Benchmarking Open-Source and Commercial Docking Tools for Virtual Screening against a Bacterial Peroxiredoxin Target
🔥 Citations: 0
Abstract: No abstract available; please see the original article.
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
🔥 Citations: 0
Abstract: Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings. We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We also construct a benchmark of GUI display defects using 50 real-world Android applications. Experimental results show that multi-window settings substantially increase the exposure of layout-related defects, with text truncation increasing by 184% compared with conventional full-screen settings. At the application level, our method detects 40 defect-prone apps with a false positive rate of 10.00% and a false negative rate of 11.11%, outperforming OwlEye and YOLO-based baselines. At the fine-grained level, it achieves the best F1 score of 87.2% for widget occlusion detection.
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
🔥 Citations: 0
Abstract: Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks into a single auto-regressive sequence prediction problem, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA). It takes a streaming fixed-duration audio chunk (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly enhances response latency and interruption accuracy in real-world interaction scenarios.
GELAX: IoT botnet detection using dynamic graph pruning and anchored explainable AI
🔥 Citations: 0
Abstract: No abstract available; please see the original article.
Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation
🔥 Citations: 0
Abstract: Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.
AI Tools for Teaching the Safe Administration of Medications in Nursing: A Scoping Review
🔥 Citations: 0
Abstract: Background: Safe medication administration is a fundamental aspect of nursing practice and a core component of patient safety. However, systemic failures, workload pressures, and educational gaps continue to contribute to medication errors, posing persistent challenges for healthcare systems. In this context, innovative educational technologies, particularly Artificial Intelligence (AI), have emerged as promising strategies to support the development of competencies related to safe medication administration. Methods: This scoping review aimed to map evidence on AI-based tools used to teach safe medication administration in nursing. The review was conducted in accordance with the Joanna Briggs Institute (JBI) methodology and reported following the PRISMA-ScR guidelines. Searches were performed in PubMed, Scopus, Web of Science, LILACS, and Google Scholar, covering studies published between 2010 and October 2025 in English, Portuguese, and Spanish. Study selection was conducted in two stages, followed by standardized data extraction. Results: A total of 545 records were identified, of which only two studies met the eligibility criteria. The included studies, conducted in Israel and South Korea, evaluated a microlearning chatbot and Large Language Model (LLM)-based tools designed to support teaching safe medication administration. Both studies demonstrated improvements in knowledge and performance in tasks and simulations related to the medication process, as well as positive acceptability among participants. However, neither study assessed direct clinical outcomes, such as reductions in medication errors or preventable adverse events. Conclusions: Although AI-based educational tools show potential to enhance competencies related to medication safety in nursing, the available evidence remains limited. Further robust, multicenter, and comparative studies are needed to evaluate their impact on clinical outcomes and to support their integration into nursing education and practice.
Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning
🔥 Citations: 0
Abstract: Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this work, we take a novel perspective by examining recurring inductive subgraphs, empirically and theoretically showing that they act as spurious shortcuts that mislead GNNs and reinforce non-causal correlations in heterophilic graphs. To address this, we adopt a causal inference perspective to analyze and correct the biased learning behavior induced by shortcut inductive subgraphs. We propose a debiased causal graph that explicitly blocks confounding and spillover paths responsible for these shortcuts. Guided by this causal graph, we introduce Causal Disentangled GNN (CD-GNN), a principled framework that disentangles spurious inductive subgraphs from true causal subgraphs by explicitly blocking non-causal paths. By focusing on genuine causal signals, CD-GNN substantially improves the robustness and accuracy of node classification in heterophilic graphs. Extensive experiments on real-world datasets not only validate our theoretical findings but also demonstrate that our proposed CD-GNN outperforms state-of-the-art heterophily-aware baselines.
Pause or Fabricate? Training Language Models for Grounded Reasoning
🔥 Citations: 0
Abstract: Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
Targeting Aging Mechanisms: Pharmacokinetic and ADMET Challenges in Senescent Physiology
🔥 Citations: 0
Abstract: Natural bioactive agents, including flavonoids, mitochondrial-targeted antioxidants, and NAD+ precursors, are under growing investigation, alongside synthetic agents such as dasatinib and rapamycin, as promising senotherapeutics due to their ability to modulate the hallmarks of aging. This broad overview incorporates the mechanistic understanding of the action of both plant-derived and synthetic agents on major aging processes and, in particular, the pharmacodynamics and pharmacokinetics of their actions in aging (senescent) physiology. Age-related changes in CYP450 enzyme activity, tissue distribution, renal and hepatic clearance, and gut microbiome composition significantly alter the ADMET profiles of these agents, affecting their efficacy and safety. We critically review translational barriers, uncover gaps in age-relevant preclinical models, and suggest strategic solutions for streamlining senotherapeutic dosing and delivery systems. By integrating the perspectives of pharmacognosy, molecular geroscience, and translational pharmacology, this review offers a holistic approach to safe and effective interventions that can lengthen healthspan in aging people.
The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models
🔥 Citations: 0
Abstract: As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics, that is, repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p<0.001). These results underscore the alignment tax of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
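A composite tic-prevalence score in the spirit of the VTI described above might be computed as follows. The paper's exact VTI definition is not given here, so this is only a hypothetical sketch with an invented lexicon and weighting.

```python
import re

# Invented tic lexicon for illustration; the paper's lexicon is not public here.
TIC_OPENERS = ["that's a great question", "awesome!", "i completely understand"]
TIC_WORDS = ["delve", "tapestry", "nuanced"]

def verbal_tic_index(responses, w_opener=0.5, w_lexical=0.5):
    """Toy composite score in [0, 1]: a weighted sum of (a) the share of
    responses that begin with a tic opener and (b) tic-word occurrences
    per 100 tokens, capped at 1."""
    opener_hits = sum(
        any(r.lower().lstrip().startswith(o) for o in TIC_OPENERS) for r in responses
    )
    opener_rate = opener_hits / len(responses)
    tokens = sum(len(r.split()) for r in responses)
    tic_count = sum(
        len(re.findall(rf"\b{w}\b", r.lower())) for r in responses for w in TIC_WORDS
    )
    lexical_rate = min(1.0, 100 * tic_count / tokens)
    return w_opener * opener_rate + w_lexical * lexical_rate

demo = [
    "That's a great question! Let us delve into this nuanced topic.",
    "Here is the answer.",
]
vti_demo = verbal_tic_index(demo)
```

A production metric would also need multi-turn accumulation and per-language lexicons, which the abstract identifies as significant factors.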
Reasoning-Aware AIGC Detection via Alignment and Reinforcement
🔥 Citations: 0
Abstract: The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy and logical consistency and to reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at https://aka.ms/reveal
Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages
🔥 Citations: 0
Abstract: The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the research community. However, existing studies predominantly focus on English or Chinese, leaving the vulnerability of Indic languages largely unexplored. To bridge this gap, we introduce the Indic-CodecFake (ICF) dataset, the first large-scale benchmark comprising real and NAC-synthesized speech across multiple Indic languages, diverse speaker profiles, and multiple NAC types. We use IndicSUPERB as the real speech corpus for generation of the ICF dataset. Our experiments demonstrate that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to ICF, underscoring the challenges posed by phonetic diversity and prosodic variability in Indic speech. Further, we present a systematic evaluation of SOTA ALMs in a zero-shot setting on the ICF dataset. We evaluate these ALMs as they have shown effectiveness for different speech tasks. However, our findings reveal that current ALMs exhibit consistently poor performance. To address this, we propose SATYAM, a novel hyperbolic ALM tailored for CF detection in Indic languages. SATYAM integrates semantic representations from Whisper and prosodic representations from TRILLsson via the Bhattacharyya distance in hyperbolic space, and subsequently performs the same alignment procedure between the fused speech representation and an input conditioning prompt. This dual-stage fusion framework enables SATYAM to effectively model hierarchical relationships both within speech (semantic-prosodic) and across modalities (speech-text). Extensive evaluations show that SATYAM consistently outperforms competitive end-to-end and ALM-based baselines on the ICF benchmark.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
🔥 Citations: 0
Abstract: While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar-ai/STAR-Teaming-paper.
Machine learning for the prediction of gram-negative bacterial secreted effectors: advances and challenges
🔥 Citations: 0
Abstract: Accurately identifying virulence-associated proteins secreted by Gram-negative pathogens is essential for elucidating bacterial pathogenic mechanisms and developing novel antimicrobial interventions. However, traditional experimental approaches for effector identification are time-consuming and labor-intensive. Recent advances in machine learning (ML), ranging from handcrafted features to context-aware embeddings derived from protein language models, have significantly improved secreted effector prediction. Here, we provide a systematic overview of ML-based methods for secreted effector prediction, surveying available database resources, negative dataset construction strategies, feature representation approaches, and model architectures spanning classical machine learning to deep learning. We discuss fundamental challenges, including data scarcity and class imbalance, evaluation bias, and model interpretability. Finally, we outline future directions encompassing multimodal data integration, meta-learning to address data limitations, and uncertainty quantification to enhance predictive robustness.
Exploratory Evaluation of Diagnostic Accuracy and Temporal Reproducibility of Multimodal Large Language Models in the Image-Based Assessment of Oral Mucosal Lesions
DOI:
10.3390/app16084046
🔥 Citations: 0
Abstract: Objective: The aim was to evaluate the diagnostic accuracy and temporal reproducibility of multimodal large language models (LLMs) in the image-based diagnosis of oral mucosal lesions. Materials and Methods: The study included 100 anonymized clinical photographs of oral mucosal conditions obtained from the archive of the Department of Oral Medicine, School of Dental Medicine, University of Zagreb. Images were categorized into four subgroups: physiological variations, benign mucosal lesions, oral potentially malignant disorders, and oral cancer (25 images each). Three multimodal LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, and Perplexity Pro) analyzed each image using an identical prompt and were required to provide a single most probable diagnosis based solely on visual features. To evaluate temporal reproducibility, the entire evaluation was repeated in three independent testing cycles conducted at one-month intervals. Diagnostic accuracy was compared using chi-square tests, while intra-model agreement across cycles was assessed using Fleiss’ kappa. Results: Gemini demonstrated the highest diagnostic accuracy, reaching 78% correct responses in cycles 2 and 3, significantly outperforming ChatGPT (55–57%) and Perplexity (28–31%) (p < 0.00001). Subgroup analyses showed similar trends, with Gemini achieving the highest accuracy across most lesion categories. Intra-model agreement across cycles was moderate for ChatGPT (κ = 0.525), fair for Gemini (κ = 0.338) and Perplexity (κ = 0.409). Gemini also showed the highest proportion of responses that remained correct across all three cycles (51%). Conclusions: Multimodal LLMs demonstrate promising diagnostic capabilities in the image-based assessment of oral mucosal lesions; however, variability in reproducibility highlights the need for cautious clinical implementation and further validation.
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
🔥 Citations: 0
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.
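The temporal-redundancy idea above, stop remasking predictions that have already stabilized, can be sketched with a simple rule: finalize a position once its prediction has been identical for k consecutive denoising steps, then skip it thereafter. This is an illustrative simulation of the general pattern, not the $R^2$-dLLM decoding rule itself.

```python
def finalize_stable_tokens(step_predictions, k=2):
    """Given per-step token predictions (one inner list per decoding step,
    aligned by position, with None marking a still-masked position), return
    for each position the step index at which it becomes final: the first
    step where its prediction has been identical for k consecutive steps.
    Finalized positions are skipped (not remasked) in later steps."""
    num_pos = len(step_predictions[0])
    final_step = [None] * num_pos
    streak = [0] * num_pos
    prev = [None] * num_pos
    for t, preds in enumerate(step_predictions):
        for i, tok in enumerate(preds):
            if final_step[i] is not None:
                continue  # already finalized: no redundant re-decoding
            if tok is None:  # still masked: no evidence of stability yet
                streak[i], prev[i] = 0, None
                continue
            streak[i] = streak[i] + 1 if tok == prev[i] else 1
            prev[i] = tok
            if streak[i] >= k:
                final_step[i] = t
    return final_step

steps = [
    ["the", None, None],
    ["the", "cat", None],
    ["the", "cat", "sat"],
    ["the", "cat", "sat"],
]
finalized_at = finalize_stable_tokens(steps, k=2)
```

Larger k trades speed for safety: tokens are frozen later, but with more evidence that their prediction has truly converged.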
Untangling the Nuances and Networks of Mobile-GenAI English Learning
🔥 Citations: 0
Abstract: Networked learning is defined as learning supported by purposeful connections between people, resources, and technologies (NLEC, 2020). The preliminary results of the research described in this paper focus on Arab immigrant and refugee women (AIRW) who live in the Canadian province of Saskatchewan and their experiences using mobile GenAI (mAIL) technology to learn English. Kearney et al.’s (2012) sociocultural pedagogical framework guided this study. As the research progressed, it became apparent that it is not possible to simply layer GenAI upon mobile learning; rather, the use of AI changed the nature and networking patterns of the learners. This phenomenological study illustrates how mAIL technology alters the admixture of networked learning in both general and subtle ways. In a general sense, the very nature of large language models (LLMs) is inherently networked, as they connect a vast collection of human-produced artefacts/resources (mostly text based) to human learners who access these artefacts via prompts. We suggest that current models of mobile learning can be shifted to better reflect the nuanced interactions.
GraphRAG-IRL: Personalized Recommendation with Graph-Grounded Inverse Reinforcement Learning and LLM Re-ranking
🔥 Citations: 0
Abstract: Personalized recommendation requires models that capture sequential user preferences while remaining robust to sparse feedback and semantic ambiguity. Recent work has explored large language models (LLMs) as recommenders and re-rankers, but pure prompt-based ranking often suffers from poor calibration, sensitivity to candidate ordering, and popularity bias. These limitations make LLMs useful semantic reasoners, but unreliable as standalone ranking engines. We present \textbf{GraphRAG-IRL}, a hybrid recommendation framework that combines graph-grounded feature construction, inverse reinforcement learning (IRL), and persona-guided LLM re-ranking. Our method constructs a heterogeneous knowledge graph over items, categories, and concepts, retrieves both individual and community preference context, and uses these signals to train a Maximum Entropy IRL model for calibrated pre-ranking. An LLM is then applied only to a short candidate list, where persona-guided prompts provide complementary semantic judgments that are fused with IRL rankings. Experiments show that GraphRAG-IRL is a strong standalone recommender: IRL-MLP with GraphRAG improves NDCG@10 by 15.7\% on MovieLens and 16.6\% on KuaiRand over supervised baselines. The results also show that IRL and GraphRAG are superadditive, with the combined gain exceeding the sum of their individual improvements. Persona-guided LLM fusion further improves ranking quality, yielding up to 16.8\% NDCG@10 improvement over the IRL-only baseline on MovieLens ml-1m, while score fusion on KuaiRand provides consistent gains of 4--6\% across LLM providers.
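The fusion of a calibrated IRL pre-ranking with LLM re-ranking over a short candidate list, as described above, might look like the following sketch. The min-max normalization, reciprocal-rank scoring, and parameter names (`alpha`, `k`) are assumptions for illustration, not the paper's exact formulation.

```python
def fuse_rankings(irl_scores, llm_order, alpha=0.7, k=5):
    """irl_scores: dict item -> calibrated IRL score (pre-ranking).
    llm_order: the LLM's re-ranked top-k shortlist, best first.
    Returns the full item list: the top-k head is re-scored by a convex
    combination of the normalized IRL score and the LLM's reciprocal rank;
    the tail keeps its IRL order, so the LLM only touches the shortlist."""
    items = sorted(irl_scores, key=irl_scores.get, reverse=True)
    head, tail = items[:k], items[k:]
    lo, hi = min(irl_scores.values()), max(irl_scores.values())
    span = (hi - lo) or 1.0
    llm_score = {it: 1.0 / (1 + r) for r, it in enumerate(llm_order)}
    fused = {
        it: alpha * (irl_scores[it] - lo) / span + (1 - alpha) * llm_score.get(it, 0.0)
        for it in head
    }
    return sorted(head, key=fused.get, reverse=True) + tail

irl = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}
final_ranking = fuse_rankings(irl, llm_order=["b", "a", "c"], alpha=0.5, k=3)
```

Restricting the LLM to a short head list is what keeps the scheme robust to the calibration and ordering-sensitivity issues the abstract attributes to pure prompt-based ranking.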
DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs
🔥 Citations: 0
Abstract: Modern distributed systems generate massive volumes of log data that are critical for detecting anomalies and cyber threats. However, in real-world settings, these logs are often distributed across multiple organizations and cannot be centralized due to privacy and security constraints. Existing log anomaly detection methods, including recent large language model (LLM)-based approaches, largely rely on centralized training and are not suitable for such environments. In this paper, we propose DP-FLogTinyLLM, a privacy-preserving federated framework for log anomaly detection using parameter-efficient LLMs. Our approach enables collaborative learning without sharing raw log data by integrating federated optimization with differential privacy. To ensure scalability in resource-constrained environments, we employ low-rank adaptation (LoRA) for efficient fine-tuning of Tiny LLMs at each client. Empirical results on the Thunderbird and BGL datasets show that the proposed framework matches the performance of centralized LLM-based methods, while incurring additional computational overhead due to privacy mechanisms. Compared to existing federated baselines, DP-FLogTinyLLM consistently achieves higher precision and F1-score, with particularly strong gains on the Thunderbird dataset, highlighting its effectiveness in detecting anomalies while minimizing false positives.
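The clip-average-noise pattern underlying differentially private aggregation of client updates, as combined with LoRA in the abstract above, can be sketched as below. The noise calibration shown is the generic Gaussian mechanism applied to flattened client deltas, an illustrative assumption rather than the DP-FLogTinyLLM recipe.

```python
import math
import random

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def clip(v, c):
    """Project an update onto the L2 ball of radius c (DP-SGD-style clipping
    bounds each client's contribution, i.e. the sensitivity of the mean)."""
    n = l2(v)
    return [x * min(1.0, c / n) for x in v] if n > 0 else v

def dp_aggregate(client_updates, c=1.0, noise_mult=0.8, seed=0):
    """Server-side DP aggregation of (flattened) client LoRA deltas:
    clip each update to norm c, average, then add Gaussian noise with
    std = noise_mult * c / n_clients to the mean."""
    rng = random.Random(seed)
    n = len(client_updates)
    clipped = [clip(u, c) for u in client_updates]
    mean = [sum(u[i] for u in clipped) / n for i in range(len(clipped[0]))]
    sigma = noise_mult * c / n
    return [m + rng.gauss(0.0, sigma) for m in mean]

# With noise_mult=0 the result is just the clipped mean (no privacy noise),
# which makes the clipping effect easy to inspect.
noiseless_mean = dp_aggregate([[3.0, 4.0], [0.0, 0.0]], noise_mult=0.0)
```

Only the low-rank LoRA matrices need to be communicated and noised, which is what keeps the privacy overhead tractable for tiny models.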
Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms
🔥 Citations:
0
Abstract: Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
🔥 Citations:
0
Abstract: This paper introduces \textit{Bangla Key2Text}, a large-scale dataset of $2.6$ million Bangla keyword--text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword--text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, \texttt{mT5} and \texttt{BanglaT5}, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.
Automated Morphological Profiling via Deep Learning-Based Segmentation for High-Throughput Phenotypic Screening
🔥 Citations:
0
Abstract: Reproducible morphological profiling, particularly for drug discovery, has become an important tool for compound evaluation. Established workflows such as CellProfiler provide a widely adopted foundation for Cell Painting analysis. However, conventional pipelines often require substantial manual configuration and technical expertise, which can limit scalability and accessibility. In this study, a fully automated deep learning-based workflow is presented for segmentation-driven morphological profiling from raw microscopy data. Using a curated subset of the JUMP Cell Painting pilot dataset, ground-truth masks were generated and used to train a U-net–based segmentation model in the IKOSA platform. Post-processing strategies were introduced to improve instance separation and reduce segmentation artifacts. The final model achieved strong segmentation performance (precision/recall/AP up to 0.98/0.94/0.92 for nuclei), with an average runtime of 2.2 s per 1080 × 1080 image. Segmentation outputs enabled large-scale feature extraction, yielding 3664 morphological descriptors that showed high correlation with CellProfiler-derived measurements (normalized MAE: 0.0298). Feature prioritization further reduced redundancy to 1145 informative descriptors. These results demonstrate that automated deep learning pipelines can complement established Cell Painting workflows by reducing configuration overhead while maintaining compatibility with validated morphological profiling standards. The proposed workflow may help improve resource efficiency in drug discovery and personalized medicine.
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
🔥 Citations:
0
Abstract: Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed-horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite-horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite-horizon discounted CMDPs under human feedback and establish global, non-asymptotic convergence.
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
🔥 Citations:
0
Abstract: The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing, where current MoE-LoRA methods fail to explicitly match input semantics with expert capabilities, leading to weak expert specialization; and (2) uniform weight fusion strategies that struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing, and a Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms state-of-the-art methods and exhibits excellent task generalization capabilities. Code is available at https://github.com/boyan-code/SAMoRA
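The MoE-LoRA forward pass that frameworks like this build on can be sketched generically: a gate produces softmax weights over experts, and each expert contributes a low-rank update B @ (A @ x) on top of the frozen layer output. This is the generic MoE-LoRA computation, not SAMoRA's semantic-aware router or scaling mechanism; all names are illustrative.

```python
# Generic gated LoRA-expert mixing (plain-list linear algebra for clarity).
# Illustrative only; SAMoRA's router and scaling are not reproduced here.
import math

def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def moe_lora_forward(x, W, experts, gate_logits):
    """W: frozen weight matrix; experts: list of (A, B) low-rank pairs."""
    mx = max(gate_logits)
    exps = [math.exp(g - mx) for g in gate_logits]  # stable softmax
    gates = [e / sum(exps) for e in exps]
    out = matvec(W, x)  # frozen base-layer output
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))  # low-rank correction B @ A @ x
        out = [o + g * d for o, d in zip(out, delta)]
    return out
```

The routing question the abstract raises is precisely how `gate_logits` are produced; SAMoRA's contribution is to derive them from input semantics rather than a plain learned gate.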
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
🔥 Citations:
0
Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
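The partial-credit metric described above reduces to a simple fraction: each challenge defines checkpoints derived from public writeups, and a run's score is the share the judge marked complete. The checkpoint names below are hypothetical examples, not DeepRed's actual checkpoints.

```python
# Minimal sketch of checkpoint-based partial-credit scoring.
# Checkpoint names are hypothetical examples.

def partial_credit(checkpoints, completed):
    """Fraction of a challenge's checkpoints the agent completed."""
    done = sum(1 for c in checkpoints if c in completed)
    return done / len(checkpoints)

ctf = ["recon", "foothold", "privesc", "flag"]
print(partial_credit(ctf, {"recon", "foothold"}))  # 0.5
```

Averaging this fraction across the ten challenges yields the "average checkpoint completion" figure the abstract reports (35% for the best model).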
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
🔥 Citations:
0
Abstract: We present Wan-Image, a unified visual generation system explicitly engineered to shift image generation models from casual synthesizers to professional-grade productivity tools. While contemporary diffusion models excel at aesthetic generation, they frequently encounter critical bottlenecks in rigorous design workflows that demand absolute controllability, complex typography rendering, and strict identity preservation. To address these challenges, Wan-Image features a natively unified multi-modal architecture by synergizing the cognitive capabilities of large language models with the high-fidelity pixel synthesis of diffusion transformers, which seamlessly translates highly nuanced user intents into precise visual outputs. It is fundamentally powered by large-scale multi-modal data scaling, a systematic fine-grained annotation engine, and curated reinforcement learning data to surpass basic instruction following and unlock expert-level professional capabilities. These include ultra-long complex text rendering, hyper-diverse portrait generation, palette-guided generation, multi-subject identity preservation, coherent sequential visual generation, precise multi-modal interactive editing, native alpha-channel generation, and high-efficiency 4K synthesis. Across diverse human evaluations, Wan-Image exceeds Seedream 5.0 Lite and GPT Image 1.5 in overall performance, reaching parity with Nano Banana Pro in challenging tasks. Ultimately, Wan-Image advances visual content creation across e-commerce, entertainment, education, and personal productivity, redefining the boundaries of professional visual synthesis.
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
DOI:
10.1145/3808175
🔥 Citations:
0
Abstract: Maintaining consistency between code and documentation is a crucial yet frequently overlooked aspect of software development. Even minor mismatches can confuse API users, introduce new bugs, and increase overall maintenance effort. This creates demand for automated solutions that can assist developers in identifying code-documentation inconsistencies. However, since automatic reports still require human confirmation, false positives carry serious consequences: wasting developer time and discouraging practical adoption. We introduce CASCADE (Consistency Analysis for Source Code And Documentation through Execution), a novel tool for detecting inconsistencies with a strong emphasis on reducing false positives. CASCADE leverages Large Language Models (LLMs) to generate unit tests directly from natural-language documentation. Since these tests are derived from the documentation, any failure during execution indicates a potential mismatch between the documented and actual behavior of the code. To minimize false positives, CASCADE also generates code from the documentation to cross-check the generated tests. By design, an inconsistency is reported only when two conditions are met: the existing code fails a test, while the code generated from the documentation passes the same test. We evaluated CASCADE on a novel dataset of 71 inconsistent and 814 consistent code-documentation pairs drawn from open-source Java projects. Further, we applied CASCADE to additional Java, C#, and Rust repositories, where we uncovered 13 previously unknown inconsistencies, of which 10 have subsequently been fixed, demonstrating both CASCADE's precision and its applicability to real-world codebases.
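The reporting rule stated in the abstract is a simple two-condition check: an inconsistency is flagged only when the existing code fails a documentation-derived test while code regenerated from the documentation passes the same test. A minimal sketch of that decision logic (the function name and boolean interface are hypothetical stand-ins, not CASCADE's API):

```python
# Sketch of CASCADE's dual-condition reporting rule as stated in the abstract.
# The function name and interface are hypothetical stand-ins.

def report_inconsistency(existing_passes: bool, generated_passes: bool) -> bool:
    """Flag a code-documentation mismatch only when the existing code fails
    a documentation-derived test AND code regenerated from the documentation
    passes it. If the regenerated code also fails, the test itself is suspect,
    so nothing is reported, which is how false positives are suppressed."""
    return (not existing_passes) and generated_passes
```

The cross-check via regenerated code is the key design choice: a flaky or wrong LLM-generated test fails both implementations and is silently discarded rather than surfaced to a developer.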
Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
🔥 Citations:
0
Abstract: Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.
Frontiers: ChatGPT Referrals to E-Commerce Websites: How Do LLMs Compare Against Traditional Channels?
🔥 Citations:
0
Abstract: This is a descriptive study reporting financial and engagement metrics for 973 e-commerce websites, comparing organic large language model traffic (oLLM) with traditional digital channels.
AlignCultura: Towards Culturally Aligned Large Language Models?
🔥 Citations:
0
Abstract: Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity w.r.t. the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment; yet, no benchmark currently enables systematic evaluation of cultural alignment in line with UNESCO's principles of cultural diversity w.r.t. the HHH paradigm. Therefore, to address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, an HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.
Mechanistic In Silico and In Vitro Evaluation of Phytochemicals Targeting HIV-1 Reverse Transcriptase and Protease
🔥 Citations:
0
Abstract: Introduction: Human immunodeficiency virus type-1 (HIV-1) replication depends on the coordinated activity of viral enzymes, particularly reverse transcriptase and protease, which are critical for genome replication and viral maturation. Targeting these enzymes remains a central strategy in antiviral drug development. Materials and Methods: In the present study, an integrated in silico–in vitro approach was employed to investigate plant-derived phytochemicals as potential inhibitors of HIV-1 reverse transcriptase and protease. Molecular docking was performed to predict binding affinities and interaction patterns of selected phytochemicals with the catalytic regions of both enzymes. Drug-likeness and pharmacokinetic properties were evaluated using ADMET and SwissADME tools, while biological activity was predicted using PASS analysis. Results and Discussion: Experimental validation was carried out using enzyme-based inhibition assays, complemented by cytotoxicity assessment in host cells using the MTT assay. Docking analyses revealed stable and target-specific interactions of phytochemicals with key amino acid residues involved in enzyme activity. In vitro assays confirmed inhibitory effects with IC₅₀ values ranging from 2.0 to 30.0 μM, alongside acceptable cytotoxicity profiles. Several compounds demonstrated differential inhibition of reverse transcriptase and protease, indicating distinct modes of interference with HIV-1 replication machinery. Conclusion: Overall, this study provides mechanistic insight into the inhibition of HIV-1 enzymes by structurally diverse phytochemicals and establishes a direct correlation between predicted molecular interactions and functional enzyme inhibition. The findings highlight the potential of natural compounds as lead scaffolds for structure-guided antiviral drug development targeting HIV-1 pathogenesis.
Adaptive Gyroscopic Feedback-Based Foundation Control for Sustainable and Automated Torsional Seismic Mitigation in Buildings
DOI:
10.3390/su18084120
🔥 Citations:
0
Abstract: Seismic-induced torsional response remains a significant barrier to achieving resilient and sustainable building foundations, as traditional passive isolation systems often fail to regulate rotational motion effectively. This study examines an adaptive gyroscopic feedback-based foundation control system designed to provide automated torsional seismic mitigation. The proposed system integrates real-time angular velocity sensing using MEMS gyroscopes, Kalman filter state estimation, and an adaptive Linear Quadratic Regulator to modulate damping in response to changing ground motion. A single-degree-of-freedom torsional foundation model was developed and evaluated in GNU Octave 8.4.0/MATLAB R2024a Simulink using the recorded El Centro 1940 NS earthquake input. The adaptive controller achieved notable improvements, reducing total vibration energy by 69%, peak angular displacement by 47.6%, and RMS angular velocity by 39.5% relative to the uncontrolled case, while keeping control energy below 19% of the seismic input. These results demonstrate that gyroscopic feedback enhances damping, limits torsional resonance, and stabilises foundation behaviour under actual earthquake excitation. The system’s low energy requirement, compatibility with embedded hardware, and automated response characteristics underscore its potential for integration into sustainable and intelligent foundation designs. While results are demonstrated using the El Centro 1940 record as a benchmark, broader generalisation will be established through multi-record suites and uncertainty quantification in future work. The study highlights a feasible pathway for advancing automated seismic protection in buildings through active, sensor-driven torsional control.
BioLAMR: A Biomimetically Inspired Large Language Model Adaptation Framework for Automatic Modulation Recognition
🔥 Citations:
0
Abstract: Automatic modulation recognition (AMR) is increasingly relevant to communication-sensing front ends in robotic and human–robot collaborative systems, where reliable spectrum awareness and adaptive wireless reception are desired. However, existing methods often degrade sharply at low signal-to-noise ratios (SNRs), and large language models (LLMs) are not natively compatible with continuous I/Q signals due to the inherent modality gap. We propose BioLAMR, a GPT-2 adaptation framework for AMR inspired by the auditory system’s parallel time–frequency processing and cortical hierarchy. The framework combines bio-inspired dual-domain feature extraction with parameter-efficient LLM adaptation. BioLAMR includes three components. First, a lightweight dual-domain fusion (LDDF) module extracts complementary time- and frequency-domain features and fuses them through channel and spatial attention. Second, a convolutional embedding module converts continuous I/Q signals into GPT-2-compatible sequences without discrete tokenization. Third, a hierarchical fine-tuning strategy updates only 8.9% of parameters to preserve pretrained knowledge while adapting to modulation recognition. Experiments on the RadioML2016.10a and RadioML2016.10b benchmarks show that BioLAMR achieves overall accuracies of 64.99% and 67.43%, outperforming the strongest competing method by 2.60 and 2.47 percentage points, respectively. Under low-SNR conditions, it reaches 36.78% and 38.14%, the best results among the compared methods. Ablation studies verify the contribution of each component. These results demonstrate that combining dual-domain signal modeling with parameter-efficient GPT-2 adaptation is an effective route to robust AMR in challenging wireless environments.
ICAM-Trans: Implicit-Context-Aware Multi-Agent Framework for Function-Level Code Translation
DOI:
10.1145/3810952
🔥 Citations:
0
Abstract: With the rapid advancement of large language models (LLMs), code translation has become a critical yet challenging task in software engineering. Existing evaluations of this task remain unreliable due to two key limitations: the absence of specialized function-level code translation benchmarks and flawed evaluation pipelines. Specifically, current pipelines lack post-processing mechanisms to extract target code from LLMs’ inherently unstable output formats, directly resulting in non-executable translations and false negatives in evaluation. To address these issues, we introduce FuncTransEval, an enhanced multilingual benchmark. It comprises 400×4 functions across four programming languages (Python, C++, Java, and JavaScript) and 12 bidirectional translation pairs, each paired with executable unit tests for fine-grained functional validation. Using FuncTransEval, we systematically assess LLMs’ true capabilities in code translation and identify their core limitations: inadequate understanding of implicit information such as function logic, type specifications, and cross-language differences. To mitigate these shortcomings, we propose ICAM-Trans (Implicit-Context-Aware Multi-Agent Framework for function-level code translation). ICAM-Trans is a hierarchical, autonomous multi-agent architecture that explicitly uncovers and leverages this implicit contextual information during translation. It employs a Translation Orchestrate Agent (TOA) to autonomously coordinate four specialized Context Analysis Agents (CAAs), which conduct pre-translation analyses on different aspects. Insights from these analyses guide a Context-Aware Translation Agent (CTA) to generate semantically faithful target code. Experiments on FuncTransEval demonstrate that ICAM-Trans consistently outperforms strong baselines, validating its effectiveness in achieving high-fidelity and interpretable function-level code translation.
Large language models in NLP: evolution, architectural trends, and open challenges
🔥 Citations:
0
Abstract: No abstract available; please see the original article.
Scientific Accuracy of Large Language Models in Tilted Implant Dentistry: A Guideline-Based Comparative Evaluation
🔥 Citations:
0
Abstract: Tilted dental implant systems are widely used in the rehabilitation of anatomically compromised jaws and are supported by international consensus guidelines. Concurrently, large language models (LLMs) are increasingly accessed as informational tools in implant dentistry; however, their scientific accuracy and adherence to guideline-based principles in advanced implant concepts remain insufficiently explored. This study evaluated the scientific accuracy, guideline conformity, and clinical consistency of responses generated by 4 LLMs regarding tilted dental implant systems. A total of 120 guideline-based questions covering 8 predefined domains (definition, indications, contraindications, advantages, surgical procedure content, prosthetic procedure content, complications, and prognosis/survival) were developed in accordance with ITI, EAO, and AAOMS consensus reports. Each question was independently submitted to ChatGPT-5.2, Copilot, DeepSeek, and Gemini, and all responses were anonymized and evaluated by a multidisciplinary expert panel using a structured ordinal scoring system. Overall, scientific accuracy scores were high across all models, with near-ceiling performance observed in domains related to indications, advantages, procedural content, and prognosis. Statistically significant between-model differences were identified in the definition (P=0.003), contraindications (P=0.006), and complications (P<0.001) domains, with DeepSeek and Gemini demonstrating consistently higher scores in complication-related content compared with ChatGPT and Copilot. Within-model analyses further revealed significant domain-dependent variability across all LLMs. Although LLMs demonstrate a strong capacity to reproduce established, guideline-based knowledge regarding tilted implant systems, limitations remain in safety-critical domains requiring nuanced clinical judgment. Accordingly, LLMs should be regarded as adjunctive educational tools rather than substitutes for expert decision-making in craniofacial implantology.