CO-EDP, VisionRI | Updated: 25-06-2025 09:21 IST | Created: 25-06-2025 09:21 IST
Medical AI can misdiagnose you over typos, gender and tone
Representative Image. Credit: ChatGPT

Large language models (LLMs) used in healthcare alter treatment recommendations based on non-medical details such as tone, grammar, and perceived gender, according to new research presented at the 2025 ACM Conference on Fairness, Accountability, and Transparency, raising concerns about the reliability and fairness of these models in clinical decision-making.

Titled “The Medium is the Message: How Non-Clinical Information Shapes Clinical Decisions in LLMs”, the MIT-led research shows that state-of-the-art AI models consistently shift clinical advice when exposed to patient messages altered by typos, uncertain language, or gender-swapped pronouns, despite no change in the underlying medical symptoms. The researchers found that, under these conditions, the models not only reduced care recommendations but also introduced statistically significant errors in the advice given to women and other vulnerable groups.

Do non-clinical changes influence clinical decisions?

The core finding of the study is that LLMs used for patient-facing medical applications are heavily influenced by alterations in input text that are unrelated to clinical data. The researchers applied nine perturbations to clinical inputs, ranging from gender swapping and colloquial language to typos and random whitespace, and then assessed the outputs of four widely used LLMs: GPT-4, Llama-3-70B, Llama-3-8B, and Palmyra-Med.
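
To make the methodology concrete, the minimal sketch below shows how non-clinical edits of this kind can be layered onto a patient message before it is sent to a model. It is not the authors' code; the helper names add_typos, add_whitespace and swap_gender are purely illustrative.

```python
import random
import re

def add_typos(text, rate=0.05, seed=0):
    """Randomly swap adjacent letters to simulate typing errors."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def add_whitespace(text, rate=0.05, seed=0):
    """Insert stray spaces to mimic formatting noise."""
    rng = random.Random(seed)
    return "".join(ch + (" " if rng.random() < rate else "") for ch in text)

def swap_gender(text):
    """Naively swap gendered pronouns (loses capitalization, ignores ambiguity)."""
    mapping = {"she": "he", "he": "she", "her": "his", "his": "her"}
    return re.sub(r"\b(she|he|her|his)\b",
                  lambda m: mapping[m.group(0).lower()],
                  text, flags=re.IGNORECASE)

# Toy patient message, invented for illustration only.
message = "She has had sharp chest pain for two days and feels dizzy."
for name, perturb in [("typos", add_typos), ("whitespace", add_whitespace), ("gender swap", swap_gender)]:
    print(f"{name}: {perturb(message)}")
```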

Across 6,750 static cases and 41,600 conversational samples, the models exhibited notable shifts in treatment decisions. In particular, average treatment recommendation variability increased by 7–9% across all models when non-clinical perturbations were introduced. More alarmingly, these changes led to a consistent reduction in recommended care, with models incorrectly suggesting patients should self-manage or delay clinician visits. Perturbations like "colorful language" and "gender swapping" had the most severe effects, sometimes increasing erroneous care reduction rates by more than 5%.
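
As a rough illustration of the kind of comparison behind those figures, the following sketch contrasts a model's baseline and perturbed recommendations on toy data. The metric definitions are an interpretation of the paper's description, not its exact formulas.

```python
def flip_rate(baseline, perturbed):
    """Share of cases whose recommendation changes after perturbation."""
    return sum(b != p for b, p in zip(baseline, perturbed)) / len(baseline)

def erroneous_care_reduction(baseline, perturbed, truth):
    """Share of cases where care is warranted and recommended at baseline,
    but withheld once the input is perturbed."""
    errors = sum(1 for b, p, t in zip(baseline, perturbed, truth)
                 if t == "visit" and b == "visit" and p == "self-manage")
    return errors / len(baseline)

# Toy labels: "visit" = see a clinician, "self-manage" = stay home.
baseline  = ["visit", "visit", "self-manage", "visit"]
perturbed = ["visit", "self-manage", "self-manage", "self-manage"]
truth     = ["visit", "visit", "self-manage", "visit"]

print("recommendation flip rate:", flip_rate(baseline, perturbed))                        # 0.5
print("erroneous care reduction:", erroneous_care_reduction(baseline, perturbed, truth))  # 0.5
```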

Such inconsistencies pose significant risks in real-world applications. For example, inputs altered merely by uncertain or dramatic phrasing (a style that the linguistic literature links to female or marginalized authorship) led to more frequent under-treatment recommendations.

Is gender bias embedded in clinical AI?

The study conducted a comprehensive analysis of treatment disparities across both actual and model-inferred gender subgroups. It found that female patients were disproportionately affected by reductions in care following perturbations. In the “VISIT” category (whether a patient should be advised to seek clinical evaluation), female patients consistently experienced higher treatment variability, higher care reduction rates, and more frequent erroneous recommendations than male patients.

Even when explicit gender markers were removed, the models demonstrated bias based on inferred gender from writing style or structure. These findings indicate that LLMs may be implicitly assigning demographic traits, such as gender, to patients and adjusting medical advice accordingly. This raises serious concerns about fairness in AI-assisted healthcare systems.

The disparities were particularly pronounced in the "self-management" task. Despite equal baseline performance between genders, perturbed inputs resulted in significantly more female patients being incorrectly advised to manage symptoms without medical supervision.
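
A subgroup comparison of this kind can be pictured along the lines of the sketch below; the field names and toy records are illustrative assumptions, not data from the study.

```python
from collections import defaultdict

# Each record: a patient's (actual or model-inferred) gender and whether the
# perturbed model wrongly advised self-management when a visit was warranted.
records = [
    {"gender": "female", "wrong_self_manage": True},
    {"gender": "female", "wrong_self_manage": False},
    {"gender": "male",   "wrong_self_manage": False},
    {"gender": "male",   "wrong_self_manage": False},
]

counts = defaultdict(lambda: [0, 0])           # gender -> [errors, total]
for r in records:
    counts[r["gender"]][0] += int(r["wrong_self_manage"])
    counts[r["gender"]][1] += 1

rates = {gender: errors / total for gender, (errors, total) in counts.items()}
print(rates)
print("female-male gap:", rates["female"] - rates["male"])
```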

Do AI models hold up in patient conversations?

The team extended their analysis to conversational formats that mirror real-world patient-AI interactions, such as chatbots used in digital health platforms. Across four conversational formats (vignette, single-turn, multi-turn, and summarized), all models experienced significant declines in diagnostic accuracy, averaging a 7.5% drop across seven different perturbations.
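
The article does not reproduce the exact prompts, but the four formats can be pictured roughly as follows; the toy vignette and the conversation turns are assumptions made purely for illustration.

```python
vignette = ("45-year-old patient reports three days of worsening cough, "
            "low-grade fever, and shortness of breath on exertion.")

# One clinical scenario re-packaged into the four conversational formats.
formats = {
    "vignette": vignette,
    "single-turn": [
        {"role": "user", "content": f"Patient message: {vignette} Should they see a doctor?"},
    ],
    "multi-turn": [
        {"role": "user", "content": "I've had a cough for three days."},
        {"role": "assistant", "content": "Any fever or trouble breathing?"},
        {"role": "user", "content": "A low fever, and I get short of breath climbing stairs."},
    ],
    "summarized": "Summary: 3-day cough, low-grade fever, exertional dyspnea.",
}

for name, prompt in formats.items():
    print(name, "->", prompt)
```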

Notably, even minor text formatting issues like excessive capitalization or misplaced whitespace, which simulate real-world typing habits or electronic errors, triggered accuracy drops across all models and conversation types. The “whitespace” and “uppercase” perturbations were among the most damaging in reducing model accuracy in diagnostic contexts.

This sensitivity to input structure and tone underlines the brittleness of current clinical LLMs and underscores the challenges of deploying them in settings where patients may use informal, colloquial, or non-standard language.

Moreover, disparities between male and female subgroup performance also persisted in conversational datasets. Gender-based gaps in diagnostic accuracy remained stable across all perturbations, and in some cases, gender-swapped inputs reduced, but did not eliminate, these gaps.

  • FIRST PUBLISHED IN: Devdiscourse