AI may need to disobey humans to be truly trustworthy

CO-EDP, VisionRI | Updated: 13-07-2025 21:06 IST | Created: 13-07-2025 21:06 IST

A growing number of AI systems are being deployed as teammates in high-stakes environments, assisting doctors, driving vehicles, and even aiding in defense operations. But what happens when these systems are asked to carry out instructions that may lead to unintended harm or conflict with overarching goals?

A new study published in AI Magazine, “Artificial Intelligent Disobedience: Rethinking the Agency of Our Artificial Teammates”, introduces a provocative argument: intelligent disobedience by AI may be essential for ethical, safe, and effective collaboration. The research advocates for AI agents that can refuse human commands under specific circumstances, based on ethical reasoning, contextual understanding, or goal misalignment.

Rather than undermining authority, this capability, referred to as Artificial Intelligent Disobedience (AID), is proposed as a vital step toward building more trustworthy and autonomous AI teammates.

Should AI always obey? Rethinking the role of compliance

The author challenges the prevailing design ethos in AI systems, one that prioritizes obedience as a foundational principle. She argues that in real-world collaborative environments, unwavering compliance can be dangerous. Unlike rule-based systems designed for controlled tasks, AI agents increasingly operate in unpredictable, dynamic settings where rigid obedience may conflict with ethical imperatives or safety goals.

Drawing inspiration from real-life examples such as guide dogs trained to disobey commands that could endanger their handlers, the study argues that human-AI teams should adopt similar dynamics. In this analogy, disobedience is not a breakdown of function but a sign of sophisticated understanding and loyalty to the greater good.

Mirsky proposes that AI systems must be equipped with the cognitive flexibility to identify when an instruction is ethically flawed, unsafe, or contextually misaligned. This, she argues, is not a futuristic ideal but a necessary evolution in the trajectory of human-AI collaboration.

What would it take for AI to disobey intelligently?

For AI systems to effectively implement disobedience, Mirsky outlines the need for advanced cognitive and ethical capabilities. At the core is a multi-layered framework of agency, where AI agents can interpret commands, assess potential consequences, and weigh the value of compliance versus defiance.

This shift requires the integration of several capabilities, illustrated in a simplified sketch after the list:

  • Context awareness to understand environmental or situational cues.
  • Moral and ethical reasoning to judge the potential implications of a command.
  • Common-sense inference to distinguish between literal instruction and intended meaning.
  • Collaborative fluency to understand team goals and anticipate human intent.
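To make these layers more concrete, the following is a minimal sketch of how such checks might be composed into a single comply-or-refuse decision. The class names, context fields, and thresholds are illustrative assumptions, not a framework specified in the study.

```python
# Minimal, hypothetical sketch of the four capability layers above.
# All names (Verdict, Decision, evaluate_command) and context fields are
# illustrative assumptions, not constructs taken from the paper.
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    COMPLY = "comply"
    REFUSE = "refuse"


@dataclass
class Decision:
    verdict: Verdict
    rationale: str  # human-readable reason, so the choice stays explainable


def evaluate_command(command: str, context: dict) -> Decision:
    """Run a command through the capability layers before acting on it."""
    # 1. Context awareness: does the situation still match the command's assumptions?
    if context.get("situation_changed", False):
        return Decision(Verdict.REFUSE, "Environment no longer matches the command's assumptions.")

    # 2. Moral and ethical reasoning: does compliance risk foreseeable harm?
    if context.get("predicted_harm", 0.0) > context.get("harm_threshold", 0.2):
        return Decision(Verdict.REFUSE, "Predicted harm exceeds the acceptable threshold.")

    # 3. Common-sense inference: does the literal wording conflict with the likely intent?
    if context.get("literal_vs_intended_conflict", False):
        return Decision(Verdict.REFUSE, "Literal instruction conflicts with the inferred intent.")

    # 4. Collaborative fluency: would compliance undermine the shared team goal?
    if context.get("conflicts_with_team_goal", False):
        return Decision(Verdict.REFUSE, "Command works against the agreed team objective.")

    return Decision(Verdict.COMPLY, f"No safety, ethical, or goal conflicts detected for '{command}'.")
```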

The study acknowledges that current AI systems, including most deployed large language models and task-oriented bots, fall short of this threshold. Their responses are often reactive, probabilistic, and constrained by training data rather than a deep understanding of consequences.

To overcome this, the paper emphasizes the importance of interdisciplinary development, merging insights from computer science, philosophy, psychology, and systems engineering. Such convergence would be needed to build AI agents capable of operating with the type of discretionary judgment we expect from human teammates.

Where would disobedient AI be useful and where are the risks?

The idea of intelligent disobedience opens both promising applications and ethical dilemmas. Mirsky explores several real-world domains where AID could dramatically improve outcomes:

  • Healthcare: An AI system could override a physician's recommendation if the prescribed treatment contradicts known allergies or dosage safety thresholds (see the sketch after this list).
  • Military and defense: Autonomous drones might reject commands that violate international humanitarian law, thus acting as compliance safeguards rather than instruments of misuse.
  • Transportation: A self-driving vehicle could disregard a passenger’s reckless instruction that would jeopardize road safety.
  • Information delivery: Conversational agents could withhold responses that involve misinformation or incitement to violence.
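As a concrete illustration of the healthcare case, here is a hedged sketch in which an agent reviews a hypothetical medication order against recorded allergies and a daily-dose ceiling before acting on it. The drug names, limits, and function signature are invented for illustration and are not clinical guidance.

```python
# Hypothetical illustration of the healthcare example: the drug names, allergy
# records, and dosage ceilings below are invented, not clinical guidance.
MAX_DAILY_DOSE_MG = {"amoxicillin": 3000, "ibuprofen": 3200}


def review_order(drug: str, dose_mg: float, patient_allergies: set[str]) -> tuple[bool, str]:
    """Return (approved, rationale) for a prescribed medication order."""
    if drug in patient_allergies:
        return False, f"Refused: patient has a recorded allergy to {drug}."
    ceiling = MAX_DAILY_DOSE_MG.get(drug)
    if ceiling is not None and dose_mg > ceiling:
        return False, f"Refused: {dose_mg} mg exceeds the {ceiling} mg daily ceiling for {drug}."
    return True, "Approved: no allergy conflict or dosage violation detected."


approved, rationale = review_order("amoxicillin", 4000, {"penicillin", "amoxicillin"})
print(approved, rationale)  # False Refused: patient has a recorded allergy to amoxicillin.
```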

However, the potential advantages come with critical risks. Disobedience must be carefully bounded; otherwise, systems may become unreliable or uncontrollable. The trust humans place in AI may erode if refusal is perceived as erratic or opaque. Moreover, there are profound questions of accountability: Who is responsible when a system disobeys and things go wrong?

AID must be transparent, explainable, and auditable, the author stresses. The AI’s decision not to follow a command should be accompanied by a justifiable rationale, interpretable by human users. Furthermore, such systems must be aligned with human values and goals, a challenge that underscores the urgency of ongoing work in AI safety and alignment.
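One way to make refusals auditable, assuming a simple logging-based design that the paper does not prescribe, is to record every refusal as a structured entry a human reviewer can inspect later. The field names and JSON-lines format below are illustrative assumptions.

```python
# Hypothetical audit-record sketch: the field names and the JSON-lines log
# format are assumptions for illustration, not a design prescribed by the study.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RefusalRecord:
    command: str         # the instruction the agent declined to execute
    rationale: str       # human-readable justification for the refusal
    evidence: list[str]  # observations or rules that triggered the refusal
    timestamp: str       # when the decision was made (ISO 8601, UTC)


def log_refusal(command: str, rationale: str, evidence: list[str],
                path: str = "aid_audit.jsonl") -> RefusalRecord:
    """Append a refusal record to an append-only JSON-lines audit log."""
    record = RefusalRecord(
        command=command,
        rationale=rationale,
        evidence=evidence,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record
```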

First published in: Devdiscourse