Electronic health records transform drug discovery and validation

CO-EDP, VisionRI | Updated: 03-06-2025 18:19 IST | Created: 03-06-2025 18:19 IST

In a sweeping review of computational drug repurposing, researchers at Michigan State University have outlined how electronic health records (EHRs) are becoming a foundational asset for discovering and validating new drug indications outside traditional clinical trials. The study, titled “A Survey of Using EHR as Real-World Evidence for Discovering and Validating New Drug Indications”, was posted on the preprint server arXiv.

Authored by Nabasmita Talukdar, Xiaodan Zhang, Shreya Paithankar, Hui Wang, and Bin Chen, the report surveys the growing role of EHR-derived real-world data (RWD) in complementing or even replacing randomized controlled trials (RCTs), a shift driven by cost-efficiency, scalability, and clinical relevance.

How are EHRs shaping the future of drug repurposing?

The study addresses the long-standing inefficiencies of traditional drug development, which typically spans more than a decade and costs upwards of $2 billion per compound. It argues that repurposing existing drugs with known safety profiles is a faster, more cost-effective path to therapeutic innovation. EHRs, which record real-time clinical interactions, are now widely available and can support this by revealing unforeseen drug-disease associations, especially in complex cases involving comorbidities or under-represented populations.

The authors catalog more than 20 EHR databases, including the All of Us Research Program, Epic Cosmos, and IBM MarketScan, each with varying degrees of access, population coverage, and longitudinal depth. Collectively, these repositories capture hundreds of millions of patient records, offering unprecedented statistical power for validation studies.

A key breakthrough cited in the paper is the successful use of EHR data in identifying new indications for GLP-1 receptor agonists, such as semaglutide, which are now being explored beyond their original use in diabetes.

What methods are used to validate drugs with EHR data?

To unlock insights from EHRs, researchers rely on a suite of computational and statistical tools. Data must first be preprocessed and standardized across multiple formats and vocabularies, such as SNOMED CT, LOINC, and RxNorm, to ensure interoperability.
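As a rough illustration of that standardization step, the Python sketch below maps site-specific codes onto a shared concept through a lookup table. It is only a sketch under stated assumptions: the site names, local codes, and the `PLACEHOLDER-...` identifiers are invented for this example and are not real LOINC or RxNorm codes, and the `Concept` class and `harmonize` function are hypothetical helpers, not something described in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    vocabulary: str  # target vocabulary, e.g. "LOINC" or "RxNorm"
    code: str        # placeholder identifier, NOT a real vocabulary code
    label: str


# Hypothetical crosswalk from (site, local code) to a shared concept.
CROSSWALK = {
    ("site_a", "GLU"): Concept("LOINC", "PLACEHOLDER-GLUCOSE", "serum glucose"),
    ("site_b", "LAB_0042"): Concept("LOINC", "PLACEHOLDER-GLUCOSE", "serum glucose"),
    ("site_a", "METF500"): Concept("RxNorm", "PLACEHOLDER-METFORMIN", "metformin"),
}


def harmonize(records, crosswalk=CROSSWALK):
    """Split records into (mapped, unmapped) using the crosswalk."""
    mapped, unmapped = [], []
    for rec in records:
        # Normalize whitespace and case before the lookup.
        key = (rec["site"], rec["local_code"].strip().upper())
        concept = crosswalk.get(key)
        if concept is None:
            # Keep unknown codes for manual review instead of dropping them.
            unmapped.append(rec)
        else:
            mapped.append({**rec,
                           "vocabulary": concept.vocabulary,
                           "code": concept.code,
                           "label": concept.label})
    return mapped, unmapped


# Invented example records from two sites using different local codes.
records = [
    {"patient": 1, "site": "site_a", "local_code": "glu ", "value": 5.4},
    {"patient": 2, "site": "site_b", "local_code": "LAB_0042", "value": 6.1},
    {"patient": 3, "site": "site_b", "local_code": "UNKNOWN", "value": 1.0},
]
mapped, unmapped = harmonize(records)
print(len(mapped), "mapped;", len(unmapped), "left for review")
```

Here two differently coded glucose results end up under one shared concept, while the unrecognized code is set aside rather than silently discarded. Real pipelines would draw the crosswalk from curated vocabulary mappings rather than a hand-written dictionary.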

Advanced natural language processing (NLP) and machine learning (ML) algorithms are then applied to mine both structured data (like lab results) and unstructured clinical notes. Tools such as MedCAT, cTAKES, and MedExtractR are instrumental in converting clinical narratives into analyzable formats. Deep learning further enhances this pipeline by identifying complex phenotypes and drug interactions.

Validation hinges on rigorous study designs, particularly retrospective cohort and case-control designs. To mitigate confounding, propensity scores are estimated with both traditional logistic regression and newer ML models such as long short-term memory (LSTM) networks and random forests. These techniques help construct matched treatment-control groups to estimate causal effects without randomization.
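To make the matching idea concrete, here is a minimal Python sketch of propensity score matching, assuming a logistic model fitted by simple gradient descent and greedy 1:1 nearest-neighbour matching within a caliper. The covariates, treatment flags, learning rate, and caliper are invented for illustration, and the function names are hypothetical. The study does not prescribe this particular procedure.

```python
import math


def predict_proba(w, x):
    """Logistic model: w[0] is the intercept, w[1:] are covariate weights."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, treated, lr=0.5, epochs=2000):
    """Fit P(treated | covariates) by plain batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, t in zip(X, treated):
            err = predict_proba(w, x) - t
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def match_nearest(scores, treated, caliper=0.2):
    """Greedy 1:1 matching without replacement on the propensity score."""
    controls = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i, t in enumerate(treated):
        if t != 1 or not controls:
            continue
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))   # (treated index, control index)
            controls.remove(j)     # each control is used at most once
    return pairs


# Invented toy cohort: [scaled age, comorbidity flag] per patient.
X = [[0.2, 0], [0.4, 0], [0.5, 1], [0.7, 1], [0.8, 1],
     [0.3, 0], [0.6, 1], [0.9, 1], [0.1, 0], [0.5, 0]]
treated = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]

weights = fit_logistic(X, treated)
scores = [predict_proba(weights, x) for x in X]
pairs = match_nearest(scores, treated)
print("matched (treated, control) pairs:", pairs)
```

Treated patients with no control inside the caliper are left unmatched, which trades sample size for closer comparability. Outcomes would then be compared only within the matched pairs.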

Statistical methods, including Cox proportional-hazards models, Kaplan-Meier survival curves, and linear mixed models, are widely used to quantify drug efficacy and survival benefits across cohorts. The study also discusses when to use non-parametric tests, such as the Wilcoxon test or Fisher’s exact test, for smaller or skewed datasets.
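Two of the methods named above are compact enough to sketch directly. The Python below gives a minimal Kaplan-Meier estimator and a two-sided Fisher's exact test for a 2x2 table, using only the standard library. The follow-up times, event flags, and table counts are invented for illustration, and the function names are hypothetical. In practice analysts would more likely use an established statistics package.

```python
from math import comb


def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event was observed, 0 if the patient was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # events and censored patients leave the risk set
    return curve


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Invented follow-up data: months observed and whether the event occurred.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print("survival curve:", curve)

# Invented 2x2 table: rows = drug / no drug, columns = outcome / no outcome.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print("Fisher two-sided p-value:", round(p_value, 4))
```

The censored patient at month 2 contributes to the risk set until that point but does not lower the survival estimate, which is the property that makes Kaplan-Meier curves suited to EHR follow-up of uneven length.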

Can real-world evidence replace traditional clinical trials?

A central theme in the report is whether target trial emulation, a methodology that mimics randomized controlled trials using observational EHR data, can substitute for traditional RCTs. While EHR-based emulation has yielded encouraging results, particularly in Alzheimer’s, COVID-19, and cancer research, the authors stress that real-world studies are not without flaws. Challenges include missing data, medication adherence uncertainty, and unrecorded confounders.

For instance, Zang et al. successfully emulated trials for thousands of drug candidates for Alzheimer’s disease, validating them against real-world outcomes. In another example, Jeon et al. used EHR data to build an external control arm for COVID-19 patients, replacing the need for a traditional control group during a health emergency.

Despite promising case studies, only 15% of clinical trials reviewed in a cross-sectional study were deemed replicable using existing RWD, highlighting that RCTs remain the gold standard for confirming efficacy and safety. However, regulatory agencies, including the FDA, have begun incorporating real-world evidence (RWE) in drug approvals, especially for conditional access pathways.

What's next: Integrating AI and LLMs in drug validation

One of the most forward-looking sections of the study explores the role of large language models (LLMs) such as ChatGPT, DeepSeek, and Med-PaLM in augmenting drug repurposing workflows. These models can support data extraction, hypothesis generation, and even phenotyping by processing unstructured clinical text. ChatGPT, for example, identified three Alzheimer’s drug candidates that were subsequently validated using EHR data.

Yet, risks remain. Privacy breaches, hallucinations, and limited transparency in proprietary models pose barriers to clinical adoption. The authors advocate for privacy-preserving analytics and the use of open-source LLMs that can be securely run on local infrastructure.

First published in: Devdiscourse