AI hallucination crisis? ChatGPT excels in speed but flounders in scholarly rigor

The rapid adoption of large language models in evidence synthesis has raised both hopes and concerns in universities and research institutions worldwide. ChatGPT, one of the most widely used AI tools, is now under scrutiny after researchers documented its ability to drastically reduce review time while simultaneously generating high volumes of fabricated content.
In a paper published in AI & Society, titled “Can Generative AI Reliably Synthesise Literature? Exploring Hallucination Issues in ChatGPT”, researchers conducted a detailed evaluation of ChatGPT’s performance across 124 studies, with in-depth analysis of 40 rigorously selected cases. The findings highlight a troubling paradox: although ChatGPT reduces researcher workload by up to 90%, it also introduces hallucinated material, that is, unverifiable or entirely fabricated facts, in as many as 91% of outputs.
How reliable is ChatGPT in literature review tasks?
The researchers conducted an extensive meta-analysis of 124 academic papers, 40 of which were subjected to deeper scrutiny under stringent inclusion criteria. Using the PRISMA methodology and AI-aided tools such as Elicit, researchers assessed ChatGPT's role in major review tasks including abstract screening, Boolean query construction, data extraction, and synthesis.
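To make one of these tasks concrete: a Boolean query is the structured search string submitted to bibliographic databases when retrieving candidate studies. The snippet below shows a generic example of the kind of query ChatGPT was asked to construct; the search terms are illustrative and do not come from the paper.

```python
# Generic illustration of a Boolean search query for a systematic review.
# The terms below are hypothetical, not drawn from the study under review.
QUERY = (
    '("large language model*" OR ChatGPT OR GPT-4) '
    'AND ("systematic review" OR "evidence synthesis") '
    'AND (hallucinat* OR fabricat*)'
)
print(QUERY)
```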
Results showed significant variability in performance depending on task complexity and domain specificity. For instance, in structured domains like clinical research, title and abstract screening using GPT-4 reached sensitivity rates between 80.6% and 96.2%, suggesting AI can rival human reviewers for basic classification. However, precision, the share of flagged items that were actually relevant, dropped as low as 4.6% in more nuanced interpretive tasks.
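In screening terms, sensitivity measures how many truly relevant papers the model catches, while precision measures how many of its positive calls are correct. The short sketch below illustrates how the two can diverge sharply; the counts are hypothetical, chosen only to mirror the reported range, and are not the study’s data.

```python
# Illustrative only: how sensitivity and precision are computed for a
# screening task. The counts below are hypothetical, not the study's data.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly relevant papers that were actually flagged."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos: int, false_pos: int) -> float:
    """Share of flagged papers that are actually relevant."""
    return true_pos / (true_pos + false_pos)

# A screener can score high on sensitivity yet low on precision: here it
# catches 96 of 100 relevant papers but also flags 1,904 irrelevant ones.
print(sensitivity(true_pos=96, false_neg=4))    # 0.96
print(precision(true_pos=96, false_pos=1904))   # 0.048
```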
This disparity points to a domain-specific ceiling on ChatGPT’s utility. In areas like public health and social science, where context-rich language dominates, the model struggled to maintain consistent accuracy. Furthermore, hallucinations, defined as confidently stated but fabricated or unverifiable outputs, occurred at alarmingly high rates, ranging from 28% to 91% across various tasks.
Can ChatGPT improve efficiency without sacrificing accuracy?
Despite these shortcomings, the report underscores ChatGPT’s strong potential in accelerating the systematic review process. Four separate studies cited in the analysis highlighted time savings between 40% and 90%, with one documenting a reduction from 100 hours to just 60 when AI-assisted tools were used. In certain cases, ChatGPT completed title and abstract screening within an hour, compared to the 7–10 days typically required by human reviewers.
These efficiencies were most apparent during low-complexity tasks such as data extraction and relevance filtering, especially when supported by clearly structured prompts. The study’s authors found that ChatGPT performed best when guided by domain-specific prompt engineering—a factor that significantly increased accuracy while curbing hallucinations.
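The paper does not reproduce its prompts, but a structured, domain-specific screening prompt of the kind described might look something like the sketch below. The inclusion criteria, wording, and output format are illustrative assumptions, not the study’s materials.

```python
# Hypothetical structured prompt for title/abstract screening.
# The criteria and output format are illustrative, not from the study.
SCREENING_PROMPT = """You are screening abstracts for a systematic review.

Inclusion criteria:
1. Randomized controlled trial in adult patients.
2. Reports a quantitative primary outcome.

Title: {title}
Abstract: {abstract}

Answer with exactly one of: INCLUDE, EXCLUDE, UNSURE.
Then give a one-sentence justification citing the abstract only.
If the abstract lacks the needed information, answer UNSURE."""

def build_prompt(title: str, abstract: str) -> str:
    """Fill the template for a single candidate paper."""
    return SCREENING_PROMPT.format(title=title, abstract=abstract)
```

Constraining the model to cite only the abstract, and giving it an explicit UNSURE escape hatch, are the kinds of prompt-level guardrails the study associates with fewer hallucinated judgments.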
However, speed did not always translate into reliability. Some studies noted that rapid AI-generated outputs occasionally masked shallow judgment or missed borderline-relevant articles. In one instance, ChatGPT invented a plausible yet nonexistent randomized controlled trial, which was detected only through manual review.
What frameworks can mitigate AI hallucination and support trust?
To bridge the gap between speed and scholarly rigor, the study introduces the Systematic Research Processing Framework (SRPF). This hybrid model integrates AI tools like ChatGPT and Elicit with structured human oversight across four phases: research kickoff, abstract screening, full-text evaluation, and thematic synthesis.
The SRPF was designed not only to streamline workflow but to enforce validation checkpoints and clarify task divisions between AI and human researchers. This structured collaboration aims to preserve transparency, verifiability, and critical academic judgment, elements at risk when relying solely on generative AI.
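One way to picture the SRPF’s division of labour is as a pipeline in which each AI-drafted step must pass a human validation gate before the review advances. The sketch below uses the four phase names from the paper; the task descriptions and checkpoint mechanics are illustrative assumptions, not code or procedures published with the framework.

```python
# Illustrative sketch of the SRPF's four phases. The phase names follow
# the paper; the task and checkpoint details are hypothetical.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    ai_task: str           # what the AI tool (e.g. ChatGPT, Elicit) drafts
    human_checkpoint: str  # what a reviewer must verify before moving on

SRPF_PHASES = [
    Phase("Research kickoff", "draft search strategy and Boolean queries",
          "confirm queries match the protocol and inclusion criteria"),
    Phase("Abstract screening", "classify titles/abstracts as include/exclude",
          "spot-check a sample and adjudicate borderline calls"),
    Phase("Full-text evaluation", "extract study characteristics and outcomes",
          "verify every extracted value against the source document"),
    Phase("Thematic synthesis", "group findings into candidate themes",
          "check each claim traces back to a real, included study"),
]

for phase in SRPF_PHASES:
    print(f"{phase.name}: AI drafts {phase.ai_task}; "
          f"human gate: {phase.human_checkpoint}")
```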
The findings also resonate with broader theoretical concerns. Referencing O’Neill’s (2022) concept of epistemic trust and Pasquale’s (2015) critique of algorithmic opacity, the authors argue that ChatGPT’s polished yet unverifiable outputs create a façade of credibility. Such “trust without traceability” undermines academic accountability and risks contaminating knowledge systems with unchecked misinformation.
In the future, researchers should prioritize the development of standardized evaluation metrics, domain-specific fine-tuning, and ensemble AI models to reduce hallucination. More comparative studies across disciplines are also needed to assess ChatGPT’s generalizability.
FIRST PUBLISHED IN: Devdiscourse