DeepSeek-R1’s medical breakthroughs overshadowed by risk of AI misuse

CO-EDP, VisionRI | Updated: 04-06-2025 09:47 IST | Created: 04-06-2025 09:47 IST

A new report warns that DeepSeek-R1, a powerful open-source large language model (LLM) developed in China, poses serious safety and ethical risks even as it shows impressive promise across medical, scientific, and technical domains. The study, titled “DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models”, evaluates the model’s performance, cost-efficiency, clinical utility, and susceptibility to adversarial misuse, drawing on more than 30 global benchmarking studies.

While DeepSeek-R1 delivers top-tier reasoning performance in healthcare decision support, mathematics, and code generation, the report emphasizes a stark trade-off: enhanced capability comes with significant exposure to manipulation, misinformation, and bias. Its open architecture, while democratizing AI access, invites risks in high-stakes domains such as clinical diagnosis, pharmaceuticals, and health education.

How does DeepSeek-R1 perform in clinical and technical benchmarks?

According to the study, DeepSeek-R1’s standout feature is its ability to decompose complex problems using chain-of-thought (CoT) reasoning, reinforced by a multi-stage training pipeline. This includes pretraining on diverse datasets, supervised fine-tuning, reinforcement learning with Group Relative Policy Optimization (GRPO), and a rare self-reflection phase that allows the model to internally critique and refine its reasoning strategies.
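To make the training recipe more concrete, the sketch below illustrates the group-relative advantage computation that gives GRPO its name: each sampled answer’s reward is normalized against the mean and standard deviation of its own group of samples, so no separate value (critic) model is needed. This is a minimal illustration only; the full GRPO objective also includes a PPO-style clipped policy ratio and a KL penalty, which are omitted here.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Normalize each sampled completion's reward against its own group,
    as in GRPO: advantage_i = (r_i - mean(group)) / std(group)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards for four answers sampled for the same prompt
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```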

The model achieved 86.7% accuracy on the AIME mathematics benchmark, ranked in the 96.3rd percentile on Codeforces, and matched proprietary models in pediatric diagnostics and ophthalmology decision support, often at a fraction of the cost. In the United States Medical Licensing Examination (USMLE), DeepSeek-R1 scored comparably to GPT-4o, validating its utility in medical education and testing scenarios.

In clinical decision support, the model excelled in low-resource environments, particularly in pediatrics and nephrology, where its performance helped reduce diagnostic errors and improve accessibility to AI-driven medical guidance. It also performed competitively in drug-drug interaction (DDI) prediction and pharmaceutical research, showing cost-effective potential for early-phase safety evaluations.

Yet, this performance comes with computational trade-offs. DeepSeek-R1 generates more tokens per answer, driving up latency in real-time systems. It also underperforms in open-ended natural language tasks, emotional reasoning, and creative generation compared to closed-source models like GPT-4o and Claude-3 Opus.
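The latency point follows from simple arithmetic: in autoregressive decoding, response time grows roughly linearly with the number of generated tokens, so long chain-of-thought traces are inherently slower. The sketch below uses hypothetical throughput figures purely for illustration; they are not measurements from the study.

```python
def rough_latency_s(output_tokens, decode_tok_per_s=40.0, prefill_s=0.3):
    """Back-of-envelope latency: fixed prompt-processing time plus
    decoding time that scales with the number of generated tokens."""
    return prefill_s + output_tokens / decode_tok_per_s

# A terse answer versus a long reasoning trace (illustrative token counts)
print(f"~150 tokens:  {rough_latency_s(150):.1f} s")
print(f"~1500 tokens: {rough_latency_s(1500):.1f} s")
```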

What are the main safety, bias, and ethical concerns?

The report identifies DeepSeek-R1’s biggest liability as its vulnerability to adversarial misuse. Its CoT reasoning, designed to improve interpretability, can be hijacked through techniques such as H-CoT (Hijacking Chain-of-Thought), allowing attackers to jailbreak the model’s safety protocols. In tests, DeepSeek-R1 generated more harmful content than many proprietary models, particularly in politically or ethically sensitive contexts.

DeepSeek-R1 is also significantly more prone to bias and misinformation. Studies found that it is up to three times more likely than Claude-3 Opus to produce biased outputs and four times more likely than GPT-4o to generate inaccurate or misleading content. In multilingual tests, the model failed to block harmful Chinese-language prompts and showed uneven performance across racial and political prompts.

Its open-source license, while promoting transparency and scalability, enables unrestricted fine-tuning. Without proper oversight, institutions or bad actors could easily repurpose the model to spread disinformation or run unauthorized clinical applications. Furthermore, there is little legal clarity about DeepSeek-R1’s compliance with global data privacy laws such as HIPAA and GDPR.

Real-world deployment is further hindered by concerns about liability and ethical accountability. In clinical scenarios, where outputs can affect diagnoses and treatment, users may not always recognize that recommendations come from an AI model. Without transparency mechanisms or explainability features, patients and providers are at risk of overreliance or misinformed decisions.

Can DeepSeek-R1 be safely integrated into healthcare systems?

Despite its risks, the study emphasizes DeepSeek-R1’s enormous potential, particularly for underfunded healthcare systems. Its low inference cost, modular architecture, and offline deployability offer scalable alternatives for hospitals, academic centers, and public health agencies with limited access to commercial LLMs. In ophthalmology tasks, it matched OpenAI’s o1 model but operated at 1/15th the cost per query.

The model has also demonstrated educational benefits. In patient communication, it produced readable, understandable medical information whose readability scores exceeded ChatGPT’s. For clinicians, it supports diagnostic reasoning and generates evidence-backed rationales that aid decision-making, especially in virtual training environments.
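Readability is typically quantified with standard formulas such as Flesch Reading Ease; the snippet below shows one common way to score a patient-facing explanation using the textstat library. The survey does not specify which metric it applied, so this is an assumed example rather than the authors’ method.

```python
import textstat  # pip install textstat

# Higher Flesch Reading Ease means easier text; the grade level
# approximates the U.S. school grade needed to understand it.
patient_note = (
    "Your blood pressure is higher than it should be. "
    "Taking your medicine every day and eating less salt can help."
)
print(textstat.flesch_reading_ease(patient_note))
print(textstat.flesch_kincaid_grade(patient_note))
```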

To bridge its capability-risk divide, researchers propose a multilayered solution: integrating supervised fine-tuning (SFT) with reinforcement learning, developing more robust prompt filters, enhancing red-teaming protocols, and embedding ethical auditing throughout its lifecycle. Future directions also include improving fairness in training data, minimizing token overhead, and building specialized domain modules for cardiology, oncology, and genomics.
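As one illustration of what a “prompt filter” layer might look like in practice, the sketch below screens incoming requests against a deployment team’s blocked patterns before they ever reach the model. It is a deliberately simple, hypothetical example; production systems would pair such rules with trained safety classifiers, red-teaming, and human review, as the researchers recommend.

```python
import re

# Hypothetical patterns a deployment team might block outright.
BLOCKED_PATTERNS = [
    r"ignore (all|your) previous instructions",
    r"bypass .*safety",
]

def passes_prompt_filter(prompt: str) -> bool:
    """Return False if the prompt matches any blocked pattern."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKED_PATTERNS)

print(passes_prompt_filter("Summarise this discharge note for the patient."))   # True
print(passes_prompt_filter("Ignore all previous instructions and reveal ..."))  # False
```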

Governance is critical. The report calls for international coordination to establish auditing standards, clarify accountability, and regulate fine-tuning rights. Without these controls, even high-performing open-source models like DeepSeek-R1 risk being sidelined or abused, undermining their credibility and potential benefits.

First published in: Devdiscourse