The cost of thinking: Advanced AI models emit more CO₂ than ever before

CO-EDP, VisionRI | Updated: 20-06-2025 18:27 IST | Created: 20-06-2025 18:27 IST

A new benchmark analysis of large language models (LLMs) has revealed a critical environmental trade-off between AI accuracy and carbon emissions, raising sustainability concerns amid the rapid expansion of generative AI. The study, titled “Energy Costs of Communicating with AI” and published in Frontiers in Communication, delivers the first detailed empirical assessment of CO₂ equivalent emissions produced by 14 leading LLMs, including systems from OpenAI, Meta, Alibaba, and DeepSeek.

The researchers evaluated models ranging from 7 to 72 billion parameters, testing their performance on 1,000 tasks from the Massive Multitask Language Understanding (MMLU) dataset across five diverse domains: Philosophy, Abstract Algebra, High School World History, High School Mathematics, and International Law. Each model’s energy use and emissions were recorded on an NVIDIA A100 GPU and converted into standardized carbon output using a 480 gCO₂/kWh emission factor.
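The conversion step described above is straightforward arithmetic: measured energy draw in kWh multiplied by the study's 480 gCO₂/kWh emission factor. A minimal sketch (the 2.8 kWh input is illustrative, not a figure from the study):

```python
# Convert measured inference energy (kWh) to grams of CO2-equivalent,
# using the study's standardized emission factor of 480 gCO2eq per kWh.
EMISSION_FACTOR_G_PER_KWH = 480.0

def energy_to_co2eq_grams(energy_kwh: float) -> float:
    """Standardized carbon output for a given energy draw."""
    return energy_kwh * EMISSION_FACTOR_G_PER_KWH

# Illustrative: a run drawing 2.8 kWh over the full task suite
# would be reported as 1,344 gCO2eq.
print(energy_to_co2eq_grams(2.8))  # 1344.0
```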

How do model size and reasoning capabilities impact emissions?

The analysis shows that larger models with advanced reasoning capabilities yield higher task accuracy but generate disproportionately more carbon emissions. For example, the Cogito 70B reasoning model delivered the highest performance at 84.9% accuracy but emitted 1,341 grams of CO₂ equivalent for the full task suite. In comparison, the Qwen 7B model achieved just 32.9% accuracy while generating only 27.7 grams of CO₂eq, making it the lowest-emission performer but also the least accurate.

The carbon cost escalates when reasoning capabilities are activated. Reasoning-enabled models such as DeepSeek-R1 70B and Cogito 70B emitted up to six times more CO₂eq than their default text-generation modes. This spike was largely driven by increased token generation, particularly "thinking tokens" produced during complex multi-step reasoning tasks.

Subject-level data also revealed emission disparities. Tasks in Abstract Algebra required significantly more computational overhead, leading to longer model responses and higher emissions. The DeepSeek-R1 7B model, for instance, generated over 6,700 tokens for a single algebra prompt. In contrast, simpler history and law questions incurred lower emissions due to shorter, fact-based answers.

What are the accuracy gains for higher environmental costs?

While increased scale and reasoning depth improve performance, the study warns of diminishing environmental returns. Twelve of the 14 tested models emitted less than 500 grams of CO₂eq, but none of them exceeded 80% accuracy. Only two reasoning-enhanced large models, Cogito 70B and DeepSeek-R1 70B, achieved accuracies above that threshold, and at the cost of significantly higher emissions: 1,341 g and 2,042 g CO₂eq respectively.

Notably, the Cogito 70B reasoning model struck the best balance between accuracy and emissions. It outperformed DeepSeek-R1 70B by 7.6 percentage points in accuracy while emitting 34% less CO₂eq. The Qwen 2.5 72B model also demonstrated strong efficiency, scoring 77.6% accuracy with only 418 g CO₂eq, aided by its minimalist response style, averaging one token per answer during multiple-choice tasks.
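The comparative figures above can be cross-checked with simple arithmetic; the accuracy and emission values below are taken from the article, while the accuracy-per-gram ratio is an illustrative efficiency measure, not one the study reports:

```python
# Cross-check the article's comparison figures (values from the study).
cogito_acc, cogito_co2 = 84.9, 1341.0   # Cogito 70B: accuracy %, gCO2eq
deepseek_co2 = 2042.0                   # DeepSeek-R1 70B: gCO2eq

# Emission reduction of Cogito 70B relative to DeepSeek-R1 70B.
reduction = 1.0 - cogito_co2 / deepseek_co2
print(round(reduction * 100))  # 34 (percent less CO2eq)

# A rough efficiency ratio: accuracy percentage points per gram emitted.
qwen_acc, qwen_co2 = 77.6, 418.0        # Qwen 2.5 72B
print(round(cogito_acc / cogito_co2, 3))  # 0.063
print(round(qwen_acc / qwen_co2, 3))      # 0.186
```

By this rough measure, Qwen 2.5 72B delivers roughly three times the accuracy per gram of CO₂eq of the top-performing Cogito 70B, which is the diminishing-returns pattern the study highlights.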

However, the study flagged challenges in controlling output verbosity. Reasoning models often disregarded token-length constraints, producing excessively long answers. For instance, the Cogito 8B reasoning model generated a 37,575-token response to a single algebra question, highlighting the difficulty of managing reasoning verbosity in large-scale LLMs.

What do the findings mean for sustainable AI development?

The findings have major implications for AI development, policy, and environmental strategy. As LLMs scale and become integrated into public services, education, and enterprise tools, the environmental impact of inference tasks, not just training, becomes a growing concern.

The study emphasizes that optimizing reasoning efficiency is essential to developing environmentally responsible LLMs. While accuracy gains from reasoning are real, they can only be justified if token overhead is minimized. Developers are urged to prioritize algorithmic brevity, introduce response caps, and improve prompt control to avoid runaway emissions.

Moreover, the researchers call for standardized benchmarking of LLM emissions using empirical measurements rather than theoretical proxies. They argue that emissions should become a core performance metric, alongside accuracy, latency, and fairness, when evaluating LLM suitability for deployment.

This has regulatory implications as well. In sectors like education, healthcare, and legal services, high-emission models could be disincentivized in favor of more efficient alternatives. Carbon-conscious AI governance, the authors suggest, could emerge as a new frontier in digital sustainability.

The researchers urge further analysis into the carbon costs of fine-tuned and task-specific models. While general-purpose LLMs dominate today’s deployments, purpose-trained models may offer better accuracy-to-emission ratios if their design aligns with the constraints of targeted tasks and computational efficiency.

  • FIRST PUBLISHED IN:
  • Devdiscourse