AI in pharmacy: OpenAI’s o3 model surpasses human benchmarks

New research sheds light on the clinical capabilities of large language models in pharmaceutical decision-making. The study, published in Healthcare, benchmarked two widely used generative AI systems, ChatGPT-3.5 and OpenAI o3, against licensed clinical pharmacists on multiple-choice questions (MCQs) drawn from four therapeutic areas.
The study, titled “Benchmarking ChatGPT-3.5 and OpenAI o3 Against Clinical Pharmacists: Preliminary Insights into Clinical Accuracy, Sensitivity, and Specificity in Pharmacy MCQs”, involved a controlled test across 60 clinical MCQs validated by academic and professional experts, comparing AI model responses to those of 25 licensed clinical pharmacists practicing in Jordan. The findings raise important questions about the future of AI in clinical pharmacy education, practice, and decision support.
How do AI models stack up against licensed pharmacists?
The researchers evaluated performance across a dataset of MCQs drawn from four therapeutic domains: cardiovascular, endocrine, infectious disease, and respiratory. Each question was developed according to current treatment guidelines and thoroughly reviewed for clarity and difficulty. To ensure data integrity, each AI model was prompted individually in isolated sessions and retested after two weeks to examine consistency.
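To illustrate what such a protocol can look like in practice, here is a minimal sketch of an isolated-session MCQ query using the OpenAI Python client. The model name, prompt wording, and question content are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch of an isolated-session MCQ protocol (illustrative;
# the study's actual prompts, settings, and model versions may differ).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_mcq(model: str, question: str, options: list[str]) -> str:
    """Send one MCQ as a fresh, stateless request so no prior answers leak in."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
        + "\nAnswer with a single letter."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # no shared history
    )
    return resp.choices[0].message.content.strip()

# Hypothetical question; each call is an independent session.
answer = ask_mcq(
    "gpt-3.5-turbo",
    "Which drug class is first-line for stage 1 hypertension in most adults?",
    ["ACE inhibitors", "Alpha blockers", "Loop diuretics", "Nitrates"],
)
print(answer)
```

Because each request carries no conversation history, answers cannot be influenced by earlier questions, and rerunning the full question set after an interval supports the kind of test-retest consistency check the researchers describe.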
The results were notable: OpenAI o3 outperformed both ChatGPT-3.5 and the human pharmacists on nearly every parameter, demonstrating 83.3% accuracy, 90.0% sensitivity, and 70.0% specificity. By contrast, ChatGPT-3.5 achieved 70.0% accuracy, 77.5% sensitivity, and 55.0% specificity. The pharmacist group scored 69.7% accuracy, nearly mirroring ChatGPT-3.5, with similar sensitivity (77.0%) and identical specificity (55.0%).
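For readers unfamiliar with how these three figures relate, the short sketch below computes accuracy, sensitivity, and specificity from a confusion matrix. The counts used are hypothetical, chosen only because they are arithmetically consistent with the reported o3 results on 60 items; the study's actual positive/negative breakdown is not given here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard definitions: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts consistent with the reported o3 figures on 60 items:
# 40 "positive" items at 90% sensitivity -> 36 correct; 20 "negative" items
# at 70% specificity -> 14 correct; overall 50/60 = 83.3% accuracy.
print(classification_metrics(tp=36, fn=4, tn=14, fp=6))
# {'accuracy': 0.833..., 'sensitivity': 0.9, 'specificity': 0.7}
```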
Interestingly, the study found a significant decline in AI performance as question difficulty increased. Even so, OpenAI o3 maintained an edge, particularly in the cardiovascular domain, where it achieved 93.3% accuracy. ChatGPT-3.5 performed best in the infectious disease section, suggesting some domain-specific variability in model strengths.
Can AI enhance clinical pharmacy decision-making?
The findings prompt serious reflection on how AI might be deployed in real-world clinical environments. The ability of OpenAI o3 to exceed human performance in knowledge-based assessments opens the door to AI-supported clinical decision tools, not as replacements for pharmacists, but as companions that could increase the speed, precision, and consistency of medication management and treatment planning.
AI’s strengths, as the authors stress, lie in the fast retrieval and accurate interpretation of large volumes of evidence-based content. Such capabilities may be particularly valuable in fast-paced or high-pressure clinical settings where time and resources are limited. Moreover, the models’ reproducibility, answering consistently even when re-tested after a two-week interval, highlights their potential as a reliable reference tool for pharmacists, especially in routine, protocol-driven decision-making.
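A simple way to quantify that kind of test-retest consistency is the per-question agreement rate between two sessions. The sketch below assumes answers are recorded as letter choices; the data shown are hypothetical, not the study's logs.

```python
def retest_agreement(first_run: list[str], second_run: list[str]) -> float:
    """Fraction of questions answered identically across two sessions."""
    if len(first_run) != len(second_run):
        raise ValueError("Both runs must cover the same question set")
    matches = sum(a == b for a, b in zip(first_run, second_run))
    return matches / len(first_run)

# Hypothetical answer logs for five MCQs, two weeks apart:
week_0 = ["A", "C", "B", "D", "A"]
week_2 = ["A", "C", "B", "B", "A"]
print(retest_agreement(week_0, week_2))  # 0.8 -> 80% agreement
```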
However, the study also cautions against overreliance. While the AI models performed well on standardized questions, the authors noted a decline in effectiveness when MCQs were more complex or clinically ambiguous, reflecting the limitations of AI in nuanced, context-dependent judgment. This underscores the need for AI to function as a supplementary tool under the supervision of experienced clinicians.
What are the educational and regulatory implications?
If AI models can deliver consistent, high-quality responses in structured evaluations, there is a strong case for integrating them into pharmacy training environments, the study says. Such integration could give students instant feedback, reinforce guideline-based learning, and encourage critical thinking by comparing human and machine decision paths.
From a policy perspective, the performance of AI in this controlled study suggests that regulatory bodies may need to develop new frameworks to govern the use of AI in healthcare settings, particularly as tools like OpenAI o3 become increasingly embedded in digital health platforms. Standardized guidelines on accountability, patient safety, and AI model transparency will be essential to protect both practitioners and patients.
Future research will assess how AI tools behave in real-time clinical scenarios that involve ambiguity, risk assessment, and multidisciplinary coordination, dimensions that go far beyond the scope of structured MCQs.
- FIRST PUBLISHED IN: Devdiscourse