Human intelligence still out of reach for deep reinforcement learning models



CO-EDP, VisionRI | Updated: 30-05-2025 09:29 IST | Created: 30-05-2025 09:29 IST

A recent analysis challenges prevailing narratives about the progress of artificial intelligence, demonstrating that deep reinforcement learning (DRL) agents remain significantly inferior to humans in complex environments, even after years of research and extensive compute.

The study, titled "Deep Reinforcement Learning Agents are not even close to Human Intelligence," was released as a preprint on arXiv. It delivers a sobering assessment of the capabilities of DRL systems in the domain of complex arcade games, suggesting that recent progress in DRL benchmarks fails to reflect genuine advances toward general intelligence.

How do DRL agents perform in complex environments?

Using the Atari benchmark suite, specifically the challenging game "Montezuma's Revenge", the researchers conducted an in-depth evaluation of the learning efficiency and problem-solving skills of DRL agents compared to human players. The game, chosen for its sparse rewards and exploration demands, serves as a critical stress test for intelligent behavior. The authors examined the performance of several state-of-the-art DRL algorithms, including Go-Explore, Agent57, and others, each representing peak achievements in the field.
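The paper does not ship code with this article, but the evaluation setting it describes follows the standard Atari protocol. The sketch below is a minimal illustration of that setup, assuming the gymnasium and ale-py packages are installed and using a random policy purely as a stand-in for a trained agent such as Agent57 or Go-Explore; it is not the authors' own evaluation harness.

```python
# Minimal sketch of an Atari evaluation loop on Montezuma's Revenge.
# Assumes `gymnasium` and `ale-py` are installed; the random policy below is
# only a placeholder for a trained DRL agent.
import gymnasium as gym
import ale_py  # provides the ALE/* environments

gym.register_envs(ale_py)  # explicit registration; may be unnecessary on older versions

def evaluate(env_id: str = "ALE/MontezumaRevenge-v5", episodes: int = 5) -> float:
    env = gym.make(env_id)
    total = 0.0
    for _ in range(episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # replace with agent.act(obs)
            obs, reward, terminated, truncated, info = env.step(action)
            total += reward
            done = terminated or truncated
    env.close()
    return total / episodes  # mean episode return

if __name__ == "__main__":
    print("Mean score over 5 episodes:", evaluate())
```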

The findings were stark: even the best DRL agents required hundreds of millions of interactions, far more than a human player needs, just to achieve comparable or inferior scores. While some DRL methods eventually outscored humans in terms of cumulative reward, they often failed to demonstrate the flexibility, generalization, or problem-solving strategies characteristic of intelligent behavior. Instead, these agents relied on brute-force exploration and trajectory memorization rather than learning reusable, transferable skills.
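One common way to make such score comparisons concrete is the human-normalized score used in the Atari literature, where 0 corresponds to a random policy and 1 to the human reference. A small sketch, with purely illustrative numbers rather than figures from the paper:

```python
def human_normalized(agent_score: float, random_score: float, human_score: float) -> float:
    """Standard Atari metric: 0.0 = random play, 1.0 = human reference level."""
    return (agent_score - random_score) / (human_score - random_score)

# Illustrative values only -- not results reported in the study.
print(human_normalized(agent_score=2500.0, random_score=0.0, human_score=4700.0))  # ~0.53
```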

Further, the study points out that DRL agents remain fragile to small environmental changes. Human players easily adapt when game layouts shift or mechanics are slightly altered. In contrast, DRL systems show extreme sensitivity to such perturbations, requiring retraining or failing entirely. These brittleness issues underscore the fundamental difference between statistical pattern recognition and the abstract, causal reasoning observed in human cognition.
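In practice, the brittleness argument amounts to scoring the same fixed policy on the nominal game and on a slightly perturbed variant. The sketch below assumes ALE's sticky actions (repeat_action_probability) and difficulty settings as example perturbations; the specific perturbations examined in the study may differ, and the policy object is hypothetical.

```python
# Sketch: score the same (hypothetical) trained policy on the nominal game and
# on perturbed variants; a large score drop indicates brittleness.
import gymnasium as gym

def score_policy(policy, env_kwargs: dict, episodes: int = 3) -> float:
    env = gym.make("ALE/MontezumaRevenge-v5", **env_kwargs)
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
    env.close()
    return total / episodes

# `policy` would be a trained agent's action function; the perturbation kwargs
# are illustrative (sticky actions plus an alternative difficulty setting).
# baseline  = score_policy(policy, {})
# perturbed = score_policy(policy, {"repeat_action_probability": 0.25, "difficulty": 1})
# print("Relative drop:", 1 - perturbed / max(baseline, 1e-8))
```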

What makes human intelligence distinct from DRL performance?

The authors argue that the success of DRL agents in fixed benchmarks misleads the public and even segments of the research community. Unlike humans, who can solve problems through abstract reasoning, analogical transfer, and curiosity-driven exploration, DRL systems depend on overfitting to specific reward structures. This distinction becomes glaringly obvious when DRL agents are evaluated in out-of-distribution settings or with minimal rewards.

In Montezuma’s Revenge, for example, humans rapidly develop hierarchical strategies and sub-goals based on intuitive planning, such as first retrieving a key and then navigating the traps that follow. DRL agents, by contrast, must learn these strategies incrementally over millions of samples. Their trial-and-error approach, even when enhanced by memory or planning modules, lacks semantic understanding or foresight.
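To make the contrast concrete, the kind of plan a human forms can be written down almost directly as an ordered list of subgoals, whereas a flat DRL policy must discover equivalent behavior implicitly through reward alone. The following is purely illustrative; the subgoal names and state predicates are hypothetical and not taken from the paper.

```python
# Illustrative only: a human-style hierarchical plan for the first room of
# Montezuma's Revenge, expressed as ordered subgoals with completion checks.
from typing import Callable, List, Tuple

# Each subgoal pairs a name with a (hypothetical) predicate over the game state.
Subgoal = Tuple[str, Callable[[dict], bool]]

plan: List[Subgoal] = [
    ("climb down the ladder", lambda s: s.get("on_lower_platform", False)),
    ("avoid the skull",       lambda s: s.get("past_skull", False)),
    ("pick up the key",       lambda s: s.get("has_key", False)),
    ("return to the door",    lambda s: s.get("at_door", False)),
    ("open the door",         lambda s: s.get("door_open", False)),
]

def next_subgoal(state: dict) -> str:
    """Return the first unmet subgoal -- the structure a human infers in minutes."""
    for name, achieved in plan:
        if not achieved(state):
            return name
    return "room solved"

print(next_subgoal({"on_lower_platform": True, "past_skull": True}))  # -> pick up the key
```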

Moreover, human players require only a handful of demonstrations or exposures to infer rules and objectives. DRL agents must be embedded in their environments for prolonged durations, exhibiting a sample inefficiency that is orders of magnitude worse than that of biological learners. This observation, the authors note, fundamentally undermines claims that DRL benchmark results signal progress toward general artificial intelligence.
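The "orders of magnitude" claim can be sanity-checked with rough arithmetic: a human reaching a reasonable score after roughly half an hour of play has seen on the order of a hundred thousand frames, against the hundreds of millions of environment frames commonly reported for Atari agents. The numbers below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope sample-efficiency gap (illustrative assumptions).
human_frames = 30 * 60 * 60   # ~30 minutes of play at 60 frames/second, about 1.1e5
agent_frames = 500_000_000    # order of magnitude often quoted for Atari agents
print(f"Agent uses roughly {agent_frames / human_frames:,.0f}x more experience")  # ~4,630x
```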

What are the implications for AI development and research direction?

The study’s conclusion is unequivocal: current DRL systems, despite their milestone achievements in controlled environments, do not exhibit general intelligence and should not be treated as precursors to human-level AI. The authors warn against misinterpreting benchmark scores as evidence of cognitive equivalence or capability. They advocate for a paradigm shift in evaluating AI progress, not merely through high scores or leaderboard performance but via metrics that capture abstraction, transfer learning, and reasoning.

They also call for greater interdisciplinary integration, suggesting that insights from cognitive science, neuroscience, and developmental psychology should guide the next generation of learning systems. Rather than chasing benchmarks, the AI field should prioritize architectures that embody human-like priors, intrinsic motivation, and structural causal understanding. These elements, largely absent in DRL agents, are essential for building systems capable of generalization, robustness, and autonomy.

While AI systems may outperform humans in constrained tasks, they remain brittle, narrow, and data-hungry. The illusion of progress must not distract from the deeper scientific gaps that separate artificial pattern learners from truly intelligent agents.

First published in: Devdiscourse