ChatGPT-4 rivals gastroenterologists in clarity and accuracy of IBD advice

Artificial intelligence has taken another step into clinical practice, with a new multinational study revealing that ChatGPT-4 can match or even surpass the quality of responses given by human gastroenterologists in patient communication about inflammatory bowel disease (IBD). The research provides compelling evidence that large language models may soon play a practical role in medical education and patient support.

The study, titled “When AI Speaks Like a Specialist: ChatGPT-4 in the Management of Inflammatory Bowel Disease” and published in Frontiers in Artificial Intelligence, evaluated ChatGPT-4’s ability to answer real patient questions against responses from human experts. The findings show that the AI model not only provides accurate and reliable information but also delivers it with clarity and structure that physicians themselves rated higher than their own peers.

How ChatGPT-4 was tested against medical experts

The research team designed a direct comparison between AI-generated and human-written responses to explore whether ChatGPT-4 could effectively assist in medical communication about IBD, a chronic inflammatory condition that includes Crohn’s disease and ulcerative colitis.

Over one month, two gastroenterologists from the study team collected 25 frequently asked questions from 500 IBD patients attending routine outpatient visits. These questions covered five major categories of concern: pregnancy and breastfeeding, diet, vaccinations, lifestyle, and medical therapy including surgery. Each question was presented both to ChatGPT-4 and to two human IBD specialists, who answered independently.

The responses were anonymized and evaluated by 12 physicians (six IBD specialists and six general gastroenterologists), who rated them across four key dimensions: accuracy, reliability, comprehensibility, and actionability. A five-point scale was used to score each answer, and evaluators were also asked to guess whether each response came from an AI model or a human doctor.

This experimental setup allowed the team to assess not only the factual quality of the answers but also the tone, coherence, and clarity that shape patient understanding.
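
To illustrate the shape of such an evaluation, a minimal sketch in Python is shown below. The record layout, field names, and scores are hypothetical (the paper's actual data format is not described here); the sketch simply averages 1-to-5 ratings per answer source and dimension.

```python
from statistics import mean

# Hypothetical rating records: (question_id, source, dimension, score),
# mirroring the study's setup of 25 questions, two answer sources (AI vs.
# human), four dimensions, and 5-point scores from 12 blinded evaluators.
ratings = [
    (1, "ai",    "accuracy",          5),
    (1, "ai",    "comprehensibility", 4),
    (1, "human", "accuracy",          4),
    (1, "human", "comprehensibility", 3),
    # ... one record per evaluator x question x dimension in the full data
]

def mean_score(records, source, dimension=None):
    """Average 1-5 rating for a source, optionally filtered by dimension."""
    scores = [s for (_q, src, dim, s) in records
              if src == source and (dimension is None or dim == dimension)]
    return mean(scores)

print(f"AI overall:    {mean_score(ratings, 'ai'):.2f}")
print(f"Human overall: {mean_score(ratings, 'human'):.2f}")
```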

AI matches or exceeds doctors in quality and clarity

The results were striking. Across all 25 questions, ChatGPT-4 achieved an average score of 4.28 out of 5, outperforming human gastroenterologists, who scored 4.05. The difference was statistically significant, particularly in the categories of clarity and actionability, two factors crucial to patient education.
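
The article does not name the statistical test the authors applied. For paired ratings of the same 25 questions, a common nonparametric choice for Likert-style data is the Wilcoxon signed-rank test, sketched below with illustrative (not actual) per-question mean scores.

```python
from scipy.stats import wilcoxon

# Hypothetical per-question mean ratings (25 questions each) for the AI and
# the human specialists; the article reports overall means of 4.28 vs. 4.05.
ai_scores    = [4.5, 4.1, 4.4, 4.0, 4.6, 4.3, 4.2, 4.5, 4.1, 4.3,
                4.4, 4.2, 4.0, 4.5, 4.3, 4.1, 4.4, 4.2, 4.6, 4.3,
                4.1, 4.4, 4.2, 4.5, 4.0]
human_scores = [4.2, 3.9, 4.1, 3.8, 4.3, 4.0, 4.1, 4.2, 3.9, 4.0,
                4.2, 4.0, 3.8, 4.3, 4.1, 3.9, 4.1, 4.0, 4.3, 4.1,
                3.9, 4.2, 4.0, 4.2, 3.8]

# Scores for the same 25 questions are paired, so a Wilcoxon signed-rank
# test is a natural nonparametric choice for ordinal Likert data.
stat, p = wilcoxon(ai_scores, human_scores)
print(f"Wilcoxon statistic={stat:.1f}, p={p:.4f}")
```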

The AI’s strongest performance was recorded in medical therapy and surgery, where its structured, evidence-based explanations closely mirrored current clinical guidelines. ChatGPT-4’s weakest area was dietary advice, where it tended to provide generalized recommendations rather than tailored nutritional guidance. This limitation reflects the model’s current inability to personalize responses based on an individual’s medical history or treatment plan.

One of the most surprising findings was that the physician evaluators could rarely distinguish between AI and human responses. Only one-third of the AI-generated answers were correctly identified as machine-written, and in some cases, none of the doctors recognized the AI output. This suggests that ChatGPT-4’s tone and phrasing have reached a level of fluency comparable to professional medical writing.
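
One way such a blinding check could be tallied is sketched below. The vote counts are hypothetical, and "correctly identified" is read here as a majority of the 12 evaluators flagging an answer as machine-written; the study's exact criterion is not stated in this article.

```python
n_evaluators = 12

# answer_id -> how many of the 12 evaluators flagged that answer as AI-written
ai_flags = {1: 4, 2: 0, 3: 7, 4: 3}  # hypothetical, truncated illustration

# Reading "correctly identified" as a majority of evaluators recognizing
# the answer as machine-written:
identified = sum(1 for votes in ai_flags.values() if votes > n_evaluators / 2)
print(f"{identified}/{len(ai_flags)} AI answers identified by a majority")
```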

Notably, both IBD specialists and general gastroenterologists rated ChatGPT-4 highly, though specialists tended to give slightly higher scores overall. The results demonstrate that the model not only generates medically sound content but does so with a degree of linguistic precision and readability that appeals even to trained experts.

Implications for medical communication and patient education

The findings have far-reaching implications for the future of doctor–patient communication. The researchers argue that AI can act as a supplementary educational tool, helping patients better understand their conditions while reducing the repetitive communication burden on physicians.

In the context of chronic diseases like IBD, where patients often seek clarification on treatment, diet, and lifestyle, ChatGPT-4 could function as a reliable assistant that reinforces medical advice, improves comprehension, and promotes adherence to therapy. By translating complex medical concepts into accessible language, the AI may also help bridge the gap between medical expertise and patient literacy.

The authors note, however, that AI is not a replacement for professional medical care. While ChatGPT-4 can deliver accurate and well-structured information, it lacks the contextual awareness and empathy essential to human medical practice. The model does not interpret individual test results or adjust advice for specific clinical conditions, nor does it recognize emotional cues that influence care decisions.

Still, the study positions ChatGPT-4 as a promising complement to clinical practice, especially in educational and administrative roles. Hospitals and health organizations could use AI-driven platforms to manage common patient inquiries, generate preliminary educational materials, or guide patients through routine processes before consultations.

Limits of AI in medicine and the road ahead

While the study celebrates AI’s performance, it also acknowledges its boundaries. ChatGPT-4’s limitations are most evident in dietary and lifestyle guidance, areas where personalized recommendations are vital. The model’s training on general data means it cannot yet provide patient-specific instructions or factor in comorbidities and current medications.

Moreover, the researchers highlight that the study involved a relatively small sample size (25 questions and 12 evaluators) and included only physicians, not patients. Future studies, they suggest, should involve patient participants to measure how lay audiences perceive the clarity, trustworthiness, and empathy of AI-generated answers.

Another critical area for future development is interactive communication. The study assessed single-response outputs rather than conversational exchanges, which are more typical of real-world AI use. The authors call for exploring how sustained interactions between patients and AI might shape understanding, trust, and adherence to medical advice.

Despite these caveats, the findings point to a transformative role for large language models in medicine. By providing consistent, evidence-based explanations, AI tools like ChatGPT-4 could become valuable allies in improving health communication, particularly in resource-limited or high-demand settings.
