AI agents cut hallucinations and improve healthcare tool usability

CO-EDP, VisionRI | Updated: 29-09-2025 09:39 IST | Created: 29-09-2025 09:39 IST

A team of researchers has demonstrated that large language models (LLMs) can serve as effective interfaces for existing digital health tools, significantly improving usability and reducing errors. The study, published in Frontiers in Artificial Intelligence, explores whether LLMs, when deployed as intermediaries rather than decision-makers, can enhance trust and reliability in clinical workflows.

The research, titled “Redefining Digital Health Interfaces with Large Language Models”, addresses a central challenge in digital healthcare: clinicians often find predictive tools difficult to use and untrustworthy. By focusing on agentic LLM interfaces that pull information directly from validated sources, the study suggests a practical path to safer and more transparent AI integration in healthcare.

Closing the gap between AI tools and clinical needs

The first major question examined by the researchers is how LLMs can bridge the gap between sophisticated digital tools and the real-world needs of clinicians. Traditional clinical risk calculators and machine learning models often provide a single output, such as a probability of disease, without sufficient context. This limits their adoption, as clinicians need to understand not just the prediction but also the factors driving it and the relevant guidelines for action.

The team developed an LLM-based interface that orchestrates external tools and references rather than replacing them. Prototypes integrated QRisk3 for cardiovascular risk assessment, a machine-learning cardiovascular disease model powered by AutoPrognosis 2.0 with SHAP explanations, and the CHA₂DS₂-VASc tool for stroke risk in atrial fibrillation. These were coupled with direct access to authoritative clinical guidelines such as those from NICE, ensuring that the interface delivers sourced and verifiable responses.

By enabling dialogue-driven access to approved tools, the interface allowed clinicians to ask natural language questions, request recalculations under different scenarios, and receive guideline-linked recommendations, improving both usability and trust.
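
The study does not publish its implementation, but the orchestration pattern it describes can be sketched in a few lines of Python. In the sketch below, the routing function, the patient fields, and the dispatch logic are illustrative assumptions rather than the authors' code; only the CHA₂DS₂-VASc scoring rules themselves are standard. The point of the pattern is that the answer comes from the validated calculator, with the language model acting only as the interface:

from dataclasses import dataclass

@dataclass
class AFPatient:
    # Inputs the CHA2DS2-VASc calculator expects (field names are illustrative)
    age: int
    female: bool
    heart_failure: bool
    hypertension: bool
    diabetes: bool
    prior_stroke_tia: bool
    vascular_disease: bool

def cha2ds2_vasc(p: AFPatient) -> int:
    # Standard CHA2DS2-VASc stroke-risk score for atrial fibrillation
    score = 1 if p.heart_failure else 0
    score += 1 if p.hypertension else 0
    score += 2 if p.age >= 75 else (1 if p.age >= 65 else 0)
    score += 1 if p.diabetes else 0
    score += 2 if p.prior_stroke_tia else 0
    score += 1 if p.vascular_disease else 0
    score += 1 if p.female else 0
    return score

def answer_with_tool(question: str, patient: AFPatient) -> str:
    # Agentic pattern: route the question to the validated tool and report
    # its output, instead of letting the model estimate the score itself.
    if "stroke risk" in question.lower():
        score = cha2ds2_vasc(patient)  # value comes from the calculator
        return f"CHA2DS2-VASc score: {score} (computed by the validated tool)"
    return "No matching validated tool; the agent defers rather than guessing."

# Example: a 70-year-old woman with hypertension and atrial fibrillation
print(answer_with_tool(
    "What is this patient's stroke risk score?",
    AFPatient(age=70, female=True, heart_failure=False, hypertension=True,
              diabetes=False, prior_stroke_tia=False, vascular_disease=False)))

Run on this example, the agent returns a score of 3, sourced from the calculator rather than from the model's own memory; a question with no matching tool is declined instead of answered speculatively.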

Reducing hallucinations and improving accuracy

The second key question addressed by the study is whether LLM-agent interfaces can reduce the well-known problem of hallucinations—fabricated or outdated information often produced by general-purpose LLMs.

In rigorous testing using over 230 carefully designed clinical questions across cardiovascular and atrial fibrillation use cases, the LLM-agent system delivered near-perfect accuracy. It answered 126 out of 127 questions in the cardiovascular risk workflow and 104 out of 106 questions in the atrial fibrillation workflow, with correctness rates exceeding 98%. By contrast, the same base LLMs operating alone scored only 44–50% in cardiovascular tasks and 75% in atrial fibrillation tasks.
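
Those per-workflow rates follow directly from the reported counts; a quick check (illustrative arithmetic only) confirms that both sit above the 98% threshold the authors cite:

# Correctness rates reconstructed from the counts reported in the study
cvd_correct, cvd_total = 126, 127   # cardiovascular risk workflow
af_correct, af_total = 104, 106     # atrial fibrillation workflow

print(f"Cardiovascular workflow: {cvd_correct / cvd_total:.1%}")     # ~99.2%
print(f"Atrial fibrillation workflow: {af_correct / af_total:.1%}")  # ~98.1%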

Failures by the standalone LLMs often stemmed from hallucinations, outdated knowledge, or misinterpretation of clinical rules. For example, the LLM on its own provided obsolete age eligibility criteria for cardiovascular risk guidelines and sometimes misidentified required risk factors. The agentic interface avoided these pitfalls by sourcing responses directly from the appropriate calculators and current guidelines.

The study’s findings highlight that positioning LLMs as orchestrators rather than decision-makers can significantly enhance the reliability of AI-driven clinical decision support.

Addressing adoption challenges and future directions

The third major question posed by the study concerns the barriers to real-world adoption of such interfaces. The authors acknowledge that despite their promise, LLM-based interfaces introduce practical challenges, including computational costs, latency, regulatory requirements, and the need for ongoing maintenance.

The researchers estimate that each interaction with their prototypes costs less than ten cents at current rates and that near real-time responses can be achieved. However, they stress that deploying such systems in clinical environments will require robust privacy safeguards, compliance with medical device regulations, and continuous updates to reflect evolving clinical guidelines.

The study also calls for further research through controlled trials with healthcare professionals to evaluate usability, identify edge cases, and refine interface designs. These steps are critical to moving from laboratory demonstrations to widespread clinical integration.

FIRST PUBLISHED IN: Devdiscourse