AI in cardiology: ChatGPT leads, but not ready to replace doctors

CO-EDP, VisionRI | Updated: 21-07-2025 17:51 IST | Created: 21-07-2025 17:51 IST

As artificial intelligence (AI) becomes increasingly integrated into healthcare, questions about its clinical reliability are more urgent than ever. In a newly published empirical investigation, Italian researchers critically examine how today’s top general-purpose language models perform when faced with cardiology-specific queries.

The study, titled “Evaluating Large Language Models in Cardiology: A Comparative Study of ChatGPT, Claude, and Gemini”, is published in the journal Hearts. The research delivers one of the first systematic evaluations of large language models (LLMs) within the high-stakes domain of cardiology.

Can general-purpose AI assist in cardiological decision-making?

The study addresses a pressing question in clinical innovation: can broadly trained AI models provide dependable answers in cardiology, a field where accuracy is non-negotiable? The researchers put three of the most prominent LLMs, ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google DeepMind), through a rigorous head-to-head comparison.

A total of 70 clinical prompts were crafted and divided by diagnostic phase (pre-diagnosis and post-diagnosis) and by user type (patient vs. physician). These prompts simulated real-world questions from both patients seeking information and clinicians seeking assistance. Each model responded to all prompts, and their answers were then blindly evaluated by three expert cardiologists using a standardized 5-point Likert scale. The evaluators assessed the responses across four critical dimensions: scientific accuracy, completeness, clarity, and coherence.
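To make the evaluation design concrete, the sketch below shows one way such blinded Likert ratings could be organized and summarized in Python. It is an illustrative example only, not the authors' code; the column names and scores are hypothetical.

```python
# Minimal sketch (not the authors' code): organizing blinded Likert ratings
# of model responses by model, diagnostic phase, user type, dimension and rater.
import pandas as pd

ratings = pd.DataFrame([
    # Illustrative rows; a full dataset would hold one row per
    # (prompt, model, dimension, rater) rating on the 1-5 Likert scale.
    {"model": "ChatGPT", "phase": "pre-diagnosis",  "user": "patient",   "dimension": "accuracy",  "rater": 1, "score": 4},
    {"model": "Claude",  "phase": "post-diagnosis", "user": "physician", "dimension": "clarity",   "rater": 2, "score": 3},
    {"model": "Gemini",  "phase": "pre-diagnosis",  "user": "patient",   "dimension": "coherence", "rater": 3, "score": 3},
])

# Average score per model, broken down by diagnostic phase and user type.
summary = ratings.groupby(["model", "phase", "user"])["score"].mean()
print(summary)
```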

The results showed that ChatGPT consistently outperformed its competitors. On average, ChatGPT scored between 3.7 and 4.2, compared to Claude’s 3.4–4.0 and Gemini’s lower 2.9–3.7. Pre-diagnostic and patient-focused queries elicited stronger performance across all three models, suggesting that current LLMs handle general informational content better than technical, post-diagnostic specifics.

How did the models differ and why does it matter?

The differences in model output were not merely numerical; they had practical implications. ChatGPT’s responses were more aligned with the expectations of experienced cardiologists, particularly in terms of clarity and coherence. Claude followed closely but trailed in accuracy and depth, while Gemini showed the most inconsistency, especially in post-diagnosis physician-level queries.

Despite these disparities, none of the models achieved perfect scores, underscoring the persistent limitations of LLMs in specialized medical fields. Even ChatGPT’s leading performance revealed areas that require human supervision and expert validation. The study found that performance varied depending on the structure and context of the question, with patient-framed queries yielding more complete and comprehensible answers than those framed from a physician’s perspective.

To ensure the robustness of their findings, the authors applied comprehensive statistical analyses, including Kruskal–Wallis tests, Dunn’s post hoc tests, Kendall’s W, and weighted kappa metrics. These tests confirmed a substantial level of agreement among the cardiologist reviewers, adding credibility to the comparative outcomes.
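For readers unfamiliar with these tests, the sketch below shows how each is typically computed in Python. The scores are made up for illustration, and the snippet is not the study's analysis code; it assumes scipy, scikit-learn, pandas, and the third-party scikit-posthocs package are installed.

```python
# Illustrative computation of the statistical tests named above (not the authors' code).
import pandas as pd
from scipy.stats import kruskal
from sklearn.metrics import cohen_kappa_score
import scikit_posthocs as sp

# Hypothetical per-prompt scores for each model (values are made up).
chatgpt = [4, 4, 5, 3, 4, 4]
claude  = [4, 3, 4, 3, 4, 3]
gemini  = [3, 3, 4, 2, 3, 3]

# Kruskal-Wallis test: do the three models' score distributions differ overall?
h_stat, p_value = kruskal(chatgpt, claude, gemini)

# Dunn's post hoc test: which pairs of models differ, with p-value correction.
long = pd.DataFrame({
    "score": chatgpt + claude + gemini,
    "model": ["ChatGPT"] * 6 + ["Claude"] * 6 + ["Gemini"] * 6,
})
pairwise_p = sp.posthoc_dunn(long, val_col="score", group_col="model", p_adjust="bonferroni")

# Kendall's W: agreement among raters (rows = raters, columns = rated items),
# using the simple formula without tie correction.
ranks = pd.DataFrame([[1, 2, 1, 3],
                      [1, 1, 2, 3],
                      [2, 1, 1, 3]]).rank(axis=1)
m, n = ranks.shape
s = ((ranks.sum(axis=0) - m * (n + 1) / 2) ** 2).sum()
kendalls_w = 12 * s / (m ** 2 * (n ** 3 - n))

# Weighted kappa: pairwise agreement between two raters on ordinal 1-5 scores.
rater_a = [4, 3, 5, 2, 4]
rater_b = [4, 4, 5, 2, 3]
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

print(h_stat, p_value)
print(pairwise_p)
print(round(kendalls_w, 2), round(kappa, 2))
```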

What are the clinical and ethical implications?

While ChatGPT emerged as the front-runner in this study, the authors caution against premature clinical adoption of any general-purpose AI for autonomous decision-making. The findings indicate that without domain-specific fine-tuning and stringent oversight, even the best-performing models are not ready to replace human judgment in clinical cardiology.

The study also highlights the importance of context-awareness. Questions posed by patients triggered more accurate and understandable answers, suggesting that AI tools may be better suited for front-line educational or triage tasks rather than serving as back-end clinical advisors—at least for now. This nuance has substantial implications for how LLMs should be integrated into healthcare workflows and where safeguards must be implemented.

Furthermore, the researchers stress that while LLMs like ChatGPT show clear promise, the road to clinical certification and real-world deployment must be paved with rigorous testing, interdisciplinary collaboration, and policy frameworks that address accountability and transparency.

First published in: Devdiscourse