Every chat costs the planet: GPT-4o’s 2025 footprint equals 35,000 homes in energy use

CO-EDP, VisionRI | Updated: 16-05-2025 18:22 IST | Created: 16-05-2025 18:22 IST
Representative Image. Credit: ChatGPT

The invisible cost of artificial intelligence is no longer just a question of computation - it is now a matter of planetary sustainability. A comprehensive new study titled “How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference”, published on arXiv on May 14, 2025, delivers the most detailed environmental benchmarking of inference-phase operations for large language models (LLMs) to date.

The research, conducted by teams from the University of Rhode Island and the University of Tunis, evaluates the sustainability of 30 leading LLMs deployed in real-world, commercial cloud environments. It introduces a novel, infrastructure-aware framework that measures the environmental impact of each query made to these models, highlighting the pressing need for transparency, efficiency, and systemic regulation as AI usage scales globally.

How much energy do LLMs consume during inference?

The study tackles a critical gap in existing AI sustainability discourse: while training LLMs has been widely scrutinized for its environmental toll, inference, arguably the more frequent and impactful phase, has remained understudied. The researchers found that inference can account for up to 90% of a model's lifecycle energy consumption. Using a newly developed benchmarking methodology that incorporates public API performance data, hardware specifications, and region-specific infrastructure multipliers, the team calculated energy use per query across three prompt lengths (short, medium, and long).
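The article does not reproduce the study's exact formulas, but a per-query energy estimate of this kind typically combines runtime (tokens divided by throughput), accelerator power draw, utilization, and a facility-level PUE multiplier. The sketch below is an illustration of that general approach; every constant in it (token count, throughput, wattage, utilization, PUE) is an assumption for demonstration, not a figure from the paper.

```python
# Illustrative per-query energy estimate in the spirit of an
# infrastructure-aware benchmark. All constants are assumptions
# chosen for illustration, not values from the study.

def query_energy_wh(output_tokens: float,
                    tokens_per_second: float,
                    gpu_power_w: float,
                    gpu_utilization: float,
                    pue: float) -> float:
    """Estimated energy (Wh) for one inference request.

    runtime (s)     = output_tokens / tokens_per_second
    IT energy (Wh)  = runtime * gpu_power * utilization / 3600
    facility energy = IT energy * PUE (power usage effectiveness)
    """
    runtime_s = output_tokens / tokens_per_second
    it_energy_wh = runtime_s * gpu_power_w * gpu_utilization / 3600.0
    return it_energy_wh * pue

# Hypothetical medium prompt: 300 output tokens at 50 tok/s on a
# 700 W accelerator at 60% utilization, in a facility with PUE 1.2.
e = query_energy_wh(300, 50, 700, 0.6, 1.2)
print(f"{e:.3f} Wh per query")
```

Water and carbon footprints follow the same pattern: multiply the facility energy by a water-usage-effectiveness factor and a region-specific grid carbon intensity, which is where the study's regional multipliers enter.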

Results show staggering differences in energy efficiency across models. OpenAI’s GPT-4.1 nano emerged as the most efficient, consuming just 0.454 Wh for a long prompt, while DeepSeek-R1 and OpenAI’s o3 consumed over 33 Wh and 39 Wh respectively, more than 70 times higher. The eco-efficiency star of the benchmark was Claude-3.7 Sonnet from Anthropic, which balanced performance with modest resource use.

Interestingly, model size alone didn’t determine efficiency. GPT-4o mini, for instance, consumed more energy than its larger counterpart GPT-4o, due to deployment on older A100 GPUs rather than the more efficient H100 or H200 units. This finding underscores the critical role of deployment infrastructure and regional energy grid quality in shaping a model’s real-world environmental footprint.

What are the broader environmental impacts at scale?

Although individual query costs may appear modest, their impact is compounded at scale. A case study of GPT-4o, currently OpenAI’s default deployment, demonstrates this clearly. Each short query consumes 0.42 Wh, about 40% more than a Google search. Multiplied by 700 million daily queries (a conservative estimate based on 2025 usage statistics), the study projects GPT-4o will require between 391,509 and 463,269 megawatt-hours (MWh) of electricity in 2025 alone. That’s more than the annual electricity consumption of 35,000 U.S. homes.
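The scale arithmetic can be sanity-checked from the figures quoted above. Note that dividing the projected annual totals by the query volume implies an average of roughly 1.5 to 1.8 Wh per query, well above the 0.42 Wh short-prompt figure, which is consistent with medium and long prompts dominating the energy mix:

```python
# Sanity-check the GPT-4o scale arithmetic from the quoted figures.
DAILY_QUERIES = 700e6
DAYS = 365
annual_queries = DAILY_QUERIES * DAYS            # ~255.5 billion queries

for total_mwh in (391_509, 463_269):             # study's low / high bound
    total_wh = total_mwh * 1e6                   # MWh -> Wh
    avg_wh_per_query = total_wh / annual_queries
    print(f"{total_mwh:,} MWh -> {avg_wh_per_query:.2f} Wh per query")

# 391,509 MWh over 35,000 homes is ~11.2 MWh per home per year,
# in line with average U.S. household consumption.
print(f"{391_509 / 35_000:.1f} MWh per home")
```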

The water footprint is equally alarming. The model is estimated to evaporate between 1.33 and 1.57 million kiloliters of freshwater, equivalent to the annual drinking needs of 1.2 million people or filling over 500 Olympic swimming pools. Carbon emissions are expected to range from 138,125 to 163,441 metric tons of CO2 equivalent, comparable to the emissions of 30,000 gasoline-powered vehicles or 2,300 transatlantic flights. Offset requirements would demand a forest area the size of Chicago.
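The water and carbon equivalences above check out against standard reference figures. The pool volume and per-vehicle emissions constants below are external assumptions (an Olympic pool holds roughly 2,500 kilolitres; the U.S. EPA's figure for a typical passenger vehicle is about 4.6 t CO2 per year), not numbers taken from the study:

```python
# Cross-check the equivalences quoted for water and carbon.
POOL_KL = 2_500        # Olympic pool ~2,500 kL (50 m x 25 m x 2 m)
CAR_T_CO2 = 4.6        # EPA estimate: typical passenger vehicle, t CO2/yr

water_kl = 1.33e6      # study's lower bound, kilolitres
print(f"pools: {water_kl / POOL_KL:.0f}")        # comfortably over 500

co2_t = 138_125        # study's lower bound, t CO2e
print(f"vehicles: {co2_t / CAR_T_CO2:,.0f}")     # close to 30,000
```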

These numbers provide empirical weight to what the researchers call the “LLM sustainability paradox”: despite improvements in per-query efficiency, the explosive growth in AI usage is driving an unsustainable resource burden - a classic example of the Jevons Paradox, where greater efficiency leads to greater total consumption.

Can AI models be benchmarked for environmental accountability?

To address this paradox, the authors propose a new evaluation lens: eco-efficiency. Using cross-efficiency Data Envelopment Analysis (DEA), the study assesses how well each model converts environmental inputs (energy, water, carbon) into output intelligence. This method reduces bias by comparing each model against both its own metrics and those of its peers.
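Full cross-efficiency DEA over three inputs requires solving a linear program per model, but the intuition is visible in the degenerate single-input, single-output case, where the CCR efficiency score reduces to each model's output/input ratio normalised by the best ratio in the set. The sketch below shows only that simplified special case; the model names and numbers are invented for illustration and are not the study's data:

```python
# Simplified eco-efficiency scoring. In the 1-input, 1-output special
# case, CCR DEA efficiency equals (output/input) / max(output/input).
# The study itself uses cross-efficiency DEA over three inputs
# (energy, water, carbon); everything below is illustrative.

def dea_single_ratio(inputs, outputs):
    """CCR efficiency scores for the single-input, single-output case."""
    ratios = [o / i for i, o in zip(inputs, outputs)]
    best = max(ratios)
    return [r / best for r in ratios]

# Hypothetical: input = Wh per long prompt, output = benchmark score.
models  = ["model-A", "model-B", "model-C"]
energy  = [0.5, 5.0, 35.0]
quality = [60.0, 80.0, 90.0]

for name, score in zip(models, dea_single_ratio(energy, quality)):
    print(f"{name}: {score:.3f}")
```

The efficient frontier here is the single best ratio; a large model can score well on raw quality yet poorly on eco-efficiency once its resource inputs are in the denominator, which is exactly the pattern the study reports.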

The DEA results were revealing. Claude-3.7 Sonnet led the rankings with the highest eco-efficiency score, followed by OpenAI’s o4-mini (high) and o3-mini. GPT-4.1 mini and GPT-4o also performed well, showing strong balance between capability and environmental cost. On the other end of the spectrum, DeepSeek-R1 and DeepSeek-V3, though functionally powerful, scored lowest due to high operational footprints, particularly in water and carbon metrics.

These findings suggest that policy and procurement decisions in both public and private sectors should not merely favor performance benchmarks like tokens-per-second or reasoning ability, but also integrate environmental costs into AI evaluation standards. The authors call for model-specific sustainability thresholds, mandated transparency in energy and water use reporting, and infrastructure innovations such as dielectric liquid cooling to mitigate resource loss and support long-term sustainability. Without this shift, the benefits of generative AI may come at an ecological cost that undermines its utility for future generations.

  • FIRST PUBLISHED IN:
  • Devdiscourse