AI hallucination crisis? ChatGPT excels in speed but flounders in scholarly rigor

The rapid adoption of large language models in evidence synthesis has raised both hopes and concerns in universities and research institutions worldwide. ChatGPT, one of the most widely used AI tools, is now under scrutiny after researchers documented its ability to drastically reduce review time while simultaneously generating high volumes of fabricated content.
In a paper published in AI & Society, titled “Can Generative AI Reliably Synthesise Literature? Exploring Hallucination Issues in ChatGPT”, researchers conducted a detailed evaluation of ChatGPT’s performance across 124 studies, with in-depth analysis of 40 rigorously selected cases. The findings highlight a troubling paradox: although ChatGPT reduces researcher workload by up to 90%, it also introduces hallucinated material, that is, unverifiable or entirely fabricated facts, in as many as 91% of outputs.
How reliable is ChatGPT in literature review tasks?
The researchers conducted an extensive meta-analysis of 124 academic papers, 40 of which were subjected to deeper scrutiny under stringent inclusion criteria. Using the PRISMA methodology and AI-aided tools such as Elicit, researchers assessed ChatGPT's role in major review tasks including abstract screening, Boolean query construction, data extraction, and synthesis.
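To make one of these tasks concrete: a Boolean query is the structured search string submitted to bibliographic databases when retrieving candidate studies. The snippet below shows a generic example of the kind of query ChatGPT was asked to construct; the search terms are illustrative and do not come from the paper.

```python
# Generic illustration of a Boolean search query for a systematic review.
# The terms below are hypothetical, not drawn from the study under review.
QUERY = (
    '("large language model*" OR ChatGPT OR GPT-4) '
    'AND ("systematic review" OR "evidence synthesis") '
    'AND (hallucinat* OR fabricat*)'
)
print(QUERY)
```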
Results showed significant variability in performance depending on task complexity and domain specificity. For instance, in structured domains like clinical research, title and abstract screening using GPT-4 reached sensitivity rates between 80.6% and 96.2%, suggesting AI can rival human reviewers for basic classification. However, precision, the share of flagged items that were actually relevant, dropped as low as 4.6% in more nuanced interpretive tasks.
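In screening terms, sensitivity measures how many truly relevant papers the model catches, while precision measures how many of its positive calls are correct. The short sketch below illustrates how the two can diverge sharply; the counts are hypothetical, chosen only to mirror the reported range, and are not the study’s data.

```python
# Illustrative only: how sensitivity and precision are computed for a
# screening task. The counts below are hypothetical, not the study's data.

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly relevant papers that were actually flagged."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos: int, false_pos: int) -> float:
    """Share of flagged papers that are actually relevant."""
    return true_pos / (true_pos + false_pos)

# A screener can score high on sensitivity yet low on precision: here it
# catches 96 of 100 relevant papers but also flags 1,904 irrelevant ones.
print(sensitivity(true_pos=96, false_neg=4))    # 0.96
print(precision(true_pos=96, false_pos=1904))   # 0.048
```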
This disparity points to a domain-specific ceiling on ChatGPT’s utility. In areas like public health and social science, where context-rich language dominates, the model struggled to maintain consistent accuracy. Furthermore, hallucinations, defined as confidently stated but fabricated or unverifiable outputs, occurred at alarmingly high rates, ranging from 28% to 91% across various tasks.
Can ChatGPT improve efficiency without sacrificing accuracy?
Despite these shortcomings, the report underscores ChatGPT’s strong potential in accelerating the systematic review process. Four separate studies cited in the analysis highlighted time savings between 40% and 90%, with one documenting a reduction from 100 hours to just 60 when AI-assisted tools were used. In certain cases, ChatGPT completed title and abstract screening within an hour, compared to the 7–10 days typically required by human reviewers.
These efficiencies were most apparent during low-complexity tasks such as data extraction and relevance filtering, especially when supported by clearly structured prompts. The study’s authors found that ChatGPT performed best when guided by domain-specific prompt engineering—a factor that significantly increased accuracy while curbing hallucinations.
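The paper does not reproduce its prompts, but a structured, domain-specific screening prompt of the kind described might look something like the sketch below. The inclusion criteria, wording, and output format are illustrative assumptions, not the study’s materials.

```python
# Hypothetical structured prompt for title/abstract screening.
# The criteria and output format are illustrative, not from the study.
SCREENING_PROMPT = """You are screening abstracts for a systematic review.

Inclusion criteria:
1. Randomized controlled trial in adult patients.
2. Reports a quantitative primary outcome.

Title: {title}
Abstract: {abstract}

Answer with exactly one of: INCLUDE, EXCLUDE, UNSURE.
Then give a one-sentence justification citing the abstract only.
If the abstract lacks the needed information, answer UNSURE."""

def build_prompt(title: str, abstract: str) -> str:
    """Fill the template for a single candidate paper."""
    return SCREENING_PROMPT.format(title=title, abstract=abstract)
```

Constraining the model to cite only the abstract, and giving it an explicit UNSURE escape hatch, are the kinds of prompt-level guardrails the study associates with fewer hallucinated judgments.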
However, speed did not always translate into reliability. Some studies noted that rapid AI-generated outputs occasionally masked shallow judgment or missed borderline-relevant articles. In one instance, ChatGPT invented a plausible yet nonexistent randomized controlled trial, which was detected only through manual review.
What frameworks can mitigate AI hallucination and support trust?
To bridge the gap between speed and scholarly rigor, the study introduces the Systematic Research Processing Framework (SRPF). This hybrid model integrates AI tools like ChatGPT and Elicit with structured human oversight across four phases: research kickoff, abstract screening, full-text evaluation, and thematic synthesis.
The SRPF was designed not only to streamline workflow but to enforce validation checkpoints and clarify task divisions between AI and human researchers. This structured collaboration aims to preserve transparency, verifiability, and critical academic judgment, elements at risk when relying solely on generative AI.
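One way to picture the SRPF’s division of labour is as a pipeline in which each AI-drafted step must pass a human validation gate before the review advances. The sketch below uses the four phase names from the paper; the task descriptions and checkpoint mechanics are illustrative assumptions, not code or procedures published with the framework.

```python
# Illustrative sketch of the SRPF's four phases. The phase names follow
# the paper; the task and checkpoint details are hypothetical.
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    ai_task: str           # what the AI tool (e.g. ChatGPT, Elicit) drafts
    human_checkpoint: str  # what a reviewer must verify before moving on

SRPF_PHASES = [
    Phase("Research kickoff", "draft search strategy and Boolean queries",
          "confirm queries match the protocol and inclusion criteria"),
    Phase("Abstract screening", "classify titles/abstracts as include/exclude",
          "spot-check a sample and adjudicate borderline calls"),
    Phase("Full-text evaluation", "extract study characteristics and outcomes",
          "verify every extracted value against the source document"),
    Phase("Thematic synthesis", "group findings into candidate themes",
          "check each claim traces back to a real, included study"),
]

for phase in SRPF_PHASES:
    print(f"{phase.name}: AI drafts {phase.ai_task}; "
          f"human gate: {phase.human_checkpoint}")
```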
The findings also resonate with broader theoretical concerns. Referencing O’Neill’s (2022) concept of epistemic trust and Pasquale’s (2015) critique of algorithmic opacity, the authors argue that ChatGPT’s polished yet unverifiable outputs create a façade of credibility. Such “trust without traceability” undermines academic accountability and risks contaminating knowledge systems with unchecked misinformation.
In the future, researchers should prioritize the development of standardized evaluation metrics, domain-specific fine-tuning, and ensemble AI models to reduce hallucination. More comparative studies across disciplines are also needed to assess ChatGPT’s generalizability.
FIRST PUBLISHED IN: Devdiscourse