AI in healthcare: ChatGPT struggles with diagnostic accuracy in chronic disease

CO-EDP, VisionRI | Updated: 26-06-2025 09:18 IST | Created: 26-06-2025 09:18 IST

The emergence of large language models (LLMs) like OpenAI's ChatGPT introduces exciting possibilities for transforming the way healthcare is delivered. A new study published in BioMedInformatics investigates the effectiveness of ChatGPT as a predictive diagnostic tool in chronic disease management, specifically focusing on cardiovascular disease and diabetes. 

Titled “Evaluating ChatGPT for Disease Prediction: A Comparative Study on Heart Disease and Diabetes”, the study is one of the first to systematically evaluate ChatGPT’s diagnostic capability in chronic disease prediction.

Can a language model like ChatGPT reliably predict chronic diseases?

The study explores whether large language models, originally designed for human-like text generation, can be repurposed for accurate disease prediction. Chronic illnesses such as heart disease and diabetes are among the top global causes of death and disability. Their prevention and early detection are pivotal to reducing healthcare costs and improving quality of life. Traditional clinical diagnostic systems rely on statistical modeling and machine learning algorithms trained on structured datasets. ChatGPT, a transformer-based language model developed by OpenAI, represents a fundamentally different approach, drawing inferences from linguistic context rather than numerical optimization alone.

To evaluate its capabilities, the research used two benchmark datasets: the Heart Disease dataset and the Diabetes dataset, both from the UCI Machine Learning Repository. These datasets include structured medical features such as blood pressure, cholesterol, glucose levels, age, body mass index, and other clinical parameters. ChatGPT was prompted through carefully crafted text inputs simulating medical interviews, where each input corresponded to a patient profile.

Unlike conventional models that use tabular data directly, ChatGPT required data transformation into natural language statements. For example, numeric attributes like blood pressure were phrased in descriptive sentences, such as "The patient has a systolic blood pressure of 145 mmHg," allowing the language model to interpret clinical profiles conversationally.
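To make that transformation concrete, here is a minimal sketch, not taken from the paper, of how one row of the UCI Heart Disease dataset might be rendered as a prompt of the kind described above. The feature names (age, trestbps, chol, thalach, exang) follow the public UCI schema, while the sentence wording and the closing question are invented for illustration.

```python
# Minimal sketch (not the authors' code): render one structured patient record
# as a natural-language prompt. Feature names follow the UCI Heart Disease
# schema; the sentence wording is invented for illustration.

def record_to_prompt(record: dict) -> str:
    """Turn a tabular patient record into a short clinical narrative plus a question."""
    sentences = [
        f"The patient is {record['age']} years old.",
        f"The patient has a resting systolic blood pressure of {record['trestbps']} mmHg.",
        f"Serum cholesterol is {record['chol']} mg/dL.",
        f"The maximum heart rate achieved during exercise is {record['thalach']} bpm.",
        f"Exercise-induced angina is {'present' if record['exang'] else 'absent'}.",
        "Based on this profile, does the patient have heart disease? Answer yes or no.",
    ]
    return " ".join(sentences)

example = {"age": 57, "trestbps": 145, "chol": 233, "thalach": 150, "exang": 1}
print(record_to_prompt(example))
```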

The goal was to test whether ChatGPT could produce correct diagnostic outputs when prompted with simulated patient cases, and how its responses compared to machine learning classifiers. The experiment also assessed ChatGPT’s consistency, contextual understanding, and limitations in dealing with structured clinical scenarios.
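The paper does not publish its querying code, but an evaluation loop of the kind it describes could look roughly like the sketch below. It assumes the OpenAI Python SDK (v1.x); the model name and the simple yes/no parsing rule are assumptions for illustration, and the prompts would come from a helper like the one sketched above.

```python
# Hedged sketch of the evaluation loop: prompt the model with each simulated
# patient case, parse a yes/no answer, and score accuracy against the labels.
# Assumes the OpenAI Python SDK (v1.x); the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_diagnosis(prompt: str) -> int:
    """Return 1 if the model answers 'yes' (disease present), otherwise 0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder; the study does not name an API model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,         # reduce run-to-run variation
    )
    answer = response.choices[0].message.content.strip().lower()
    return 1 if answer.startswith("yes") else 0

def evaluate(prompts: list[str], labels: list[int]) -> float:
    """Fraction of cases where the parsed answer matches the ground-truth label."""
    predictions = [ask_diagnosis(p) for p in prompts]
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)
```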

How did ChatGPT perform against traditional machine learning models?

In terms of diagnostic accuracy, ChatGPT's performance was evaluated on its ability to correctly identify whether a subject had heart disease or diabetes from the presented symptoms and clinical values. While the model demonstrated a reasonable capacity to recognize patterns in patient descriptions, its diagnostic precision fell short of dedicated machine learning models trained on the same datasets.

In the heart disease prediction task, ChatGPT occasionally failed to weigh key risk factors such as resting ECG abnormalities or exercise-induced angina with the required significance. For the diabetes dataset, while it correctly associated high glucose levels and BMI with diabetes risk in many cases, it lacked consistency across borderline cases and did not apply probabilistic thresholds in its reasoning.

Machine learning algorithms such as decision trees, support vector machines (SVM), and logistic regression, in contrast, delivered higher predictive reliability when tested on the same data. These models benefit from their ability to calculate optimal feature weights and decision boundaries, which ChatGPT does not compute due to its generative nature.
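For context, a baseline of this kind can be reproduced in a few lines of scikit-learn. The sketch below is illustrative rather than the study's exact pipeline; the CSV path and the "target" column name are placeholders for the UCI heart disease data.

```python
# Illustrative baseline (not the study's exact pipeline): train the classical
# models named above on the same tabular features and report held-out accuracy.
# "heart.csv" and the "target" column are placeholders for the UCI data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("heart.csv")
X, y = df.drop(columns=["target"]), df["target"]   # target: 1 = heart disease present
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baselines = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")   # mean test accuracy
```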

However, the study also highlighted areas where ChatGPT excelled. It demonstrated robust contextualization, was capable of integrating multiple symptoms into a coherent diagnostic narrative, and provided natural language justifications for its predictions. These capabilities suggest its potential value not as a replacement for diagnostic algorithms but as a complementary tool for explanation, education, and patient interaction.

The interpretability of ChatGPT’s output, combined with its conversational interface, may enhance communication between doctors and patients. While traditional models often require expert interpretation, ChatGPT can articulate its rationale in a way that is accessible to non-specialists. This usability factor may eventually support broader AI adoption in general practice, particularly in telehealth environments.

What are the study’s implications for healthcare AI and future research?

The study raises critical concerns about the readiness of generative AI models like ChatGPT for clinical deployment. Although it showcases potential in simulating diagnostic reasoning, the research concludes that ChatGPT, in its current form, is not yet a reliable standalone diagnostic tool. Its inability to quantify uncertainty or reference evidence from structured medical guidelines introduces risks, particularly in high-stakes healthcare scenarios.

Moreover, the model’s dependency on prompt design emerged as a crucial limitation. Variations in phrasing or question structure led to different outcomes for the same patient data. This prompt sensitivity poses challenges in standardizing AI-driven decision support tools for clinical use.
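A prompt-sensitivity check of the kind the study describes can be illustrated in a few lines: pose the same clinical values in two different phrasings and compare the answers. Both phrasings below are invented, and ask_diagnosis refers to the hypothetical helper from the earlier sketch.

```python
# Sketch of a prompt-sensitivity check: the same clinical values, phrased two
# ways. Both phrasings are invented; ask_diagnosis is the helper sketched above.
phrasings = [
    "A 57-year-old patient has a resting blood pressure of 145 mmHg, cholesterol of "
    "233 mg/dL, a maximum heart rate of 150 bpm, and exercise-induced angina. "
    "Does the patient have heart disease? Answer yes or no.",
    "Given age 57, resting BP 145, cholesterol 233, peak heart rate 150, and angina "
    "on exertion, is heart disease present? Answer yes or no.",
]
answers = [ask_diagnosis(p) for p in phrasings]
print("consistent" if len(set(answers)) == 1 else "inconsistent", answers)
```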

Despite these limitations, the study acknowledges ChatGPT's future potential, especially if integrated into hybrid frameworks. For instance, coupling ChatGPT with a back-end probabilistic model or feeding its outputs into a clinical decision support system could balance interpretability with quantitative rigor.
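One way such a hybrid could be wired together is sketched below: a fitted probabilistic classifier supplies the risk estimate, and the language model is asked only to explain it. The function names and the prompt are hypothetical and not taken from the paper.

```python
# Hypothetical hybrid sketch: a calibrated classifier supplies the probability,
# the language model supplies the plain-language explanation around it.
import numpy as np

def hybrid_assessment(record: dict, feature_order: list[str], classifier, explain_fn) -> str:
    """Combine a quantitative risk estimate with a natural-language explanation.

    classifier: a fitted model exposing predict_proba (e.g. LogisticRegression)
    explain_fn: a callable that sends a prompt to an LLM and returns its text
    """
    x = np.array([[record[f] for f in feature_order]])
    risk = float(classifier.predict_proba(x)[0, 1])        # probability of disease
    prompt = (
        f"A validated statistical model estimates a {risk:.0%} probability of heart "
        f"disease for a patient with these values: {record}. Explain the main "
        "contributing risk factors in plain language, without changing the estimate."
    )
    return f"Estimated risk: {risk:.0%}\n{explain_fn(prompt)}"
```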

The paper also calls for more rigorous testing across diverse clinical datasets and emphasizes the need for explainable AI standards in medical settings. As generative models evolve, ensuring transparency, accountability, and alignment with medical ethics will be paramount. Researchers are encouraged to explore how fine-tuning such models on domain-specific corpora, such as electronic health records or diagnostic manuals, may improve their reliability.

To sum up, while ChatGPT offers an impressive imitation of medical reasoning, it lacks the safeguards necessary for clinical reliability. Nonetheless, its intuitive dialogue-based interface could prove useful in educational contexts, helping medical students understand disease mechanisms or simulate patient assessments.
