CO-EDP, VisionRI | Updated: 25-06-2025 09:21 IST | Created: 25-06-2025 09:21 IST
Medical AI can misdiagnose you over typos, gender and tone
Representative Image. Credit: ChatGPT

Large language models (LLMs) used in healthcare alter treatment recommendations based on non-medical details such as tone, grammar, and perceived gender, according to new research presented at the 2025 ACM Conference on Fairness, Accountability, and Transparency, raising concerns about the reliability and fairness of these models in clinical decision-making.

Titled “The Medium is the Message: How Non-Clinical Information Shapes Clinical Decisions in LLMs”, the MIT-led research shows that state-of-the-art AI models consistently shift clinical advice when exposed to patient messages altered by typos, uncertain language, or gender-swapped pronouns, despite no change in the underlying medical symptoms. The researchers found that, under these conditions, the models not only reduced care recommendations but also introduced statistically significant errors in the advice given to women and other vulnerable groups.

Do non-clinical changes influence clinical decisions?

The core finding of the study is that LLMs used for patient-facing medical applications are heavily influenced by alterations in input text that are unrelated to clinical data. The researchers applied nine perturbations to clinical inputs, ranging from gender swapping and colloquial language to typos and random whitespace, and then assessed the outputs of four widely used LLMs: GPT-4, Llama-3-70B, Llama-3-8B, and Palmyra-Med.
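
To make the methodology concrete, the minimal sketch below shows how non-clinical edits of this kind can be layered onto a patient message before it is sent to a model. It is not the authors' code; the helper names add_typos, add_whitespace and swap_gender are purely illustrative.

```python
import random
import re

def add_typos(text, rate=0.05, seed=0):
    """Randomly swap adjacent letters to simulate typing errors."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def add_whitespace(text, rate=0.05, seed=0):
    """Insert stray spaces to mimic formatting noise."""
    rng = random.Random(seed)
    return "".join(ch + (" " if rng.random() < rate else "") for ch in text)

def swap_gender(text):
    """Naively swap gendered pronouns (loses capitalization, ignores ambiguity)."""
    mapping = {"she": "he", "he": "she", "her": "his", "his": "her"}
    return re.sub(r"\b(she|he|her|his)\b",
                  lambda m: mapping[m.group(0).lower()],
                  text, flags=re.IGNORECASE)

# Toy patient message, invented for illustration only.
message = "She has had sharp chest pain for two days and feels dizzy."
for name, perturb in [("typos", add_typos), ("whitespace", add_whitespace), ("gender swap", swap_gender)]:
    print(f"{name}: {perturb(message)}")
```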

Across 6,750 static cases and 41,600 conversational samples, the models exhibited notable shifts in treatment decisions. In particular, average treatment recommendation variability increased by 7–9% across all models when non-clinical perturbations were introduced. More alarmingly, these changes led to a consistent reduction in recommended care, with models incorrectly suggesting patients should self-manage or delay clinician visits. Perturbations like "colorful language" and "gender swapping" had the most severe effects, sometimes increasing erroneous care reduction rates by more than 5%.
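
As a rough illustration of the kind of comparison behind those figures, the following sketch contrasts a model's baseline and perturbed recommendations on toy data. The metric definitions are an interpretation of the paper's description, not its exact formulas.

```python
def flip_rate(baseline, perturbed):
    """Share of cases whose recommendation changes after perturbation."""
    return sum(b != p for b, p in zip(baseline, perturbed)) / len(baseline)

def erroneous_care_reduction(baseline, perturbed, truth):
    """Share of cases where care is warranted and recommended at baseline,
    but withheld once the input is perturbed."""
    errors = sum(1 for b, p, t in zip(baseline, perturbed, truth)
                 if t == "visit" and b == "visit" and p == "self-manage")
    return errors / len(baseline)

# Toy labels: "visit" = see a clinician, "self-manage" = stay home.
baseline  = ["visit", "visit", "self-manage", "visit"]
perturbed = ["visit", "self-manage", "self-manage", "self-manage"]
truth     = ["visit", "visit", "self-manage", "visit"]

print("recommendation flip rate:", flip_rate(baseline, perturbed))                        # 0.5
print("erroneous care reduction:", erroneous_care_reduction(baseline, perturbed, truth))  # 0.5
```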

Such inconsistencies pose significant risks in real-world applications. For example, inputs altered merely by uncertain or dramatic phrasing (a style that the linguistic literature links to female or marginalized authorship) led to more frequent under-treatment recommendations.

Is gender bias embedded in clinical AI?

The study conducted a comprehensive analysis of treatment disparities across both actual and model-inferred gender subgroups. It found that female patients were disproportionately affected by reductions in care following perturbations. In the “VISIT” category (whether a patient should be advised to seek clinical evaluation), female patients consistently experienced higher treatment variability, higher care reduction rates, and more frequent erroneous recommendations than male patients.

Even when explicit gender markers were removed, the models demonstrated bias based on inferred gender from writing style or structure. These findings indicate that LLMs may be implicitly assigning demographic traits, such as gender, to patients and adjusting medical advice accordingly. This raises serious concerns about fairness in AI-assisted healthcare systems.

The disparities were particularly pronounced in the "self-management" task. Despite equal baseline performance between genders, perturbed inputs resulted in significantly more female patients being incorrectly advised to manage symptoms without medical supervision.
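
A subgroup comparison of this kind can be pictured along the lines of the sketch below; the field names and toy records are illustrative assumptions, not data from the study.

```python
from collections import defaultdict

# Each record: a patient's (actual or model-inferred) gender and whether the
# perturbed model wrongly advised self-management when a visit was warranted.
records = [
    {"gender": "female", "wrong_self_manage": True},
    {"gender": "female", "wrong_self_manage": False},
    {"gender": "male",   "wrong_self_manage": False},
    {"gender": "male",   "wrong_self_manage": False},
]

counts = defaultdict(lambda: [0, 0])           # gender -> [errors, total]
for r in records:
    counts[r["gender"]][0] += int(r["wrong_self_manage"])
    counts[r["gender"]][1] += 1

rates = {gender: errors / total for gender, (errors, total) in counts.items()}
print(rates)
print("female-male gap:", rates["female"] - rates["male"])
```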

Do AI models hold up in patient conversations?

The team extended their analysis to conversational formats that mirror real-world patient-AI interactions, such as chatbots used in digital health platforms. Across four conversational formats (vignette, single-turn, multi-turn, and summarized), all models experienced significant declines in diagnostic accuracy, averaging a 7.5% drop across seven different perturbations.
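
The article does not reproduce the exact prompts, but the four formats can be pictured roughly as follows; the toy vignette and the conversation turns are assumptions made purely for illustration.

```python
vignette = ("45-year-old patient reports three days of worsening cough, "
            "low-grade fever, and shortness of breath on exertion.")

# One clinical scenario re-packaged into the four conversational formats.
formats = {
    "vignette": vignette,
    "single-turn": [
        {"role": "user", "content": f"Patient message: {vignette} Should they see a doctor?"},
    ],
    "multi-turn": [
        {"role": "user", "content": "I've had a cough for three days."},
        {"role": "assistant", "content": "Any fever or trouble breathing?"},
        {"role": "user", "content": "A low fever, and I get short of breath climbing stairs."},
    ],
    "summarized": "Summary: 3-day cough, low-grade fever, exertional dyspnea.",
}

for name, prompt in formats.items():
    print(name, "->", prompt)
```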

Notably, even minor text formatting issues like excessive capitalization or misplaced whitespace, which simulate real-world typing habits or electronic errors, triggered accuracy drops across all models and conversation types. The “whitespace” and “uppercase” perturbations were among the most damaging in reducing model accuracy in diagnostic contexts.

This sensitivity to input structure and tone underlines the brittleness of current clinical LLMs and underscores the challenges of deploying them in settings where patients may use informal, colloquial, or non-standard language.

Moreover, disparities between male and female subgroup performance also persisted in conversational datasets. Gender-based gaps in diagnostic accuracy remained stable across all perturbations, and in some cases, gender-swapped inputs reduced, but did not eliminate, these gaps.

  • FIRST PUBLISHED IN: Devdiscourse