Per-query energy consumption of LLMs

2026Q1.

How much energy is consumed when querying an LLM? We're largely in the dark when it comes to proprietary models, but for open weight models that anyone can host on readily available, albeit eye-wateringly expensive, hardware, this is something that can be measured and reported, right? In fact, given other people are doing the hard work of setting up and running benchmarks across all kinds of hardware and software configurations for common open weight models, can we just re-use that work to get a reasonable figure in terms of Watt-hours (Wh) per query?

For the kind of model you can run locally on a consumer GPU, there's of course some value in seeing how low the per-query energy usage might be on a large scale commercial setup. But my main interest is in larger and more capable models, the kind that you wouldn't realistically run locally and end up using in a pay-per-token manner, either directly with your host of choice or through an intermediary like OpenRouter. In these cases, where models are efficiently served with a minimum of 4-8 GPUs or even multi-node clusters, it's not easy to get a feel for the resources you're using. I'm pretty happy that simple back of the envelope maths shows that, whether providers are properly amortising the cost of their GPUs or not, it's implausible that they're selling per-token API access for open models at below the cost of electricity. That gives a kind of upper bound on energy usage, and looking at the pennies I spend on such services it's clearly a drop in the ocean compared to my overall energy footprint. But it's not a very tight bound, which makes it hard to assess the impact of increasing my usage.
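To illustrate the shape of that back-of-the-envelope argument, here's a minimal sketch in Python. Every number in it is an assumption for illustration only, not a measurement or a figure from any provider:

```python
# All figures here are illustrative assumptions, not measurements.
node_kw = 10.0             # assumed all-in draw of an 8-GPU inference node
usd_per_kwh = 0.10         # assumed electricity price
output_tok_per_s = 2000    # assumed sustained output throughput for the node

electricity_usd_per_hour = node_kw * usd_per_kwh                   # ~$1/hour
tokens_per_hour = output_tok_per_s * 3600                          # ~7.2M tokens/hour
usd_per_million_tokens = electricity_usd_per_hour / (tokens_per_hour / 1e6)
print(f"Electricity cost: ~${usd_per_million_tokens:.2f} per million output tokens")
# ~$0.14/M tokens under these assumptions, well below typical per-token API
# prices for large open models, so the price you pay gives a (loose) ceiling
# on the energy that could have been used to serve your tokens.
```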

We can look at things like Google's published figures on energy usage for Gemini, but this doesn't help much. They don't disclose the length of the median prompt and its response, or details of the model used to serve that median query, meaning it's not helpful either for estimating how it might apply to other models or how it might apply to your own usage (which may be far from this mysterious median query). Mistral released data on the per-query environmental impact (assuming a 400 token query), but the size of the Mistral Large 2 model is not disclosed and they don't calculate a Wh per query figure. CO2 and water per query are very helpful for evaluating a particular deployment, but the actual energy used is a better starting point that can be applied to other providers assuming different levels of carbon intensity. If one of the API providers were to share statistics based on a real world deployment of one of the open models with a much higher degree of transparency (i.e. sharing the number of queries served during the period, statistics on their length, and measured system power draw), that would be a useful source of data. But today we're looking at what we can conclude from the InferenceMAX benchmark suite's published results.

I'd started looking at options for getting good figures, thinking I might have to invest in the hassle and expense of renting a multi-GPU cloud instance to run my own benchmarks, then felt InferenceMAX might make that unnecessary. After writing this up along with all my provisos, I'm perhaps tempted again to try to generate figures myself. Anyway, read on for a more detailed look at that benchmark suite. You can scroll past all the provisos and jump ahead to the results, which give the Wh/query figures implied by the benchmark results across different GPUs, different average input/output sequence lengths, and for gpt-oss 120B and DeepSeek-R1-0528. But I hope you'll feel a bit guilty about it.

If you see any errors, please let me know.

High-level notes on InferenceMAX

The InferenceMAX benchmark suite has the stated goal to "provide benchmarks that both emulate real world applications as much as possible and reflect the continuous pace of software innovation." They differentiate themselves from other benchmarking efforts, noting "Existing performance benchmarks quickly become obsolete because they are static, and participants often game the benchmarks with unrealistic, highly specific configurations."

The question I'm trying to answer is "what is the most 'useful AI' I can expect from a modern GPU cluster in a realistic deployment, and how much energy does it consume?". Any benchmark is going to show peak throughput higher than you'd expect to achieve on a real workload, and there's naturally a desire to keep it pinned to a specific model for as long as that model isn't totally irrelevant, to enable comparisons with a common point of reference as hardware and software evolve. But although I might make slightly different choices about what gets benchmarked and how, the InferenceMAX setup at first look seems broadly aligned with what I want to achieve.

They benchmark DeepSeek-R1-0528 (both at the native fp8 quantisation and at fp4), a 671B parameter model with 37B active parameters released ~7 months ago that seems a fair representative of a large MoE open weight model. gpt-oss-120b is also benchmarked, providing a point of comparison for a much smaller and more efficient to run model. Different input and output sequence lengths (ISL and OSL: the number of input and output tokens) are tested - 1k/1k, 1k/8k, and 8k/1k - providing coverage of different query types. Tests run against a wide range of GPUs (including the 72-GPU GB200 NVL72 cluster) and sweep different settings.

At the time of writing, what you might reasonably consider to be 'InferenceMAX' is split into roughly three pieces:

- GitHub Actions workflows orchestrate the runs, ultimately producing a zip file containing JSON with the statistics of each configuration (e.g. here).
- The benchmark_serving.py script is invoked via the run_benchmark_serving wrapper in benchmark_lib.sh, which hardcodes some options and passes through others from the workflow YAML.
- The results logged by benchmark_serving.py are processed by InferenceMAX's process_result.py helper, which produces JSON in the desired output format.

Together, these scripts provide statistics like throughput (input and output tokens), end-to-end latency, interactivity (output tokens per second), and so on.

Further studying the benchmark setup

So, let's examine the benchmarking logic in more detail to look for any surprises or things that might affect the accuracy of the Wh-per-query figure I want to generate. I'll note that InferenceMAX is an ongoing project that is actively being developed. These observations are based on a recent repo checkout, but of course things may have changed since then if you're reading this post some time after it was first published.

Looking through I made the following observations. Some represent potential issues (see the next subheading for a list of the upstream issues I filed), while others are just notes based on aspects of the benchmark I wanted to better understand.

Filed issues

I ended up filing the following issues upstream:

In the best case, you'd hope to look at the benchmark results, accept they probably represent a higher degree of efficiency than you'd likely get on a real workload, assume that an API provider might achieve 50% of that, and double the effective cost per query to give a very rough upper estimate on per-query cost. But that only really works if the reported benchmark results roughly match the achievable throughput in a setup configured for commercial serving. Given the tuning to specific ISL/OSL values, I'm not at all confident that's the case, and I don't know how wide the gap is.

Generating results

Firstly, I wrote a quick script to check some assumptions about the data and look for anything that seemed anomalous.

Based on the information available in the generated result JSON and the reported all-in power per GPU (based on SemiAnalysis' model), we can calculate the Watt-hours per query. First, calculate the joules per token (watts per GPU divided by the total throughput per GPU). This gives a weighted average of the joules per token for the measured workload (i.e. reflecting the ratio of ISL:OSL). Multiplying joules per token by the tokens per query (ISL+OSL) gives the joules per query, and dividing by 3600 gives Wh.
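To make that concrete, here's a minimal sketch of the calculation in Python. The power and throughput numbers in the comment are purely illustrative and not taken from any specific result:

```python
def wh_per_query(watts_per_gpu: float, tok_per_s_per_gpu: float,
                 isl: int, osl: int) -> float:
    """Wh per query from all-in power and total per-GPU throughput."""
    joules_per_token = watts_per_gpu / tok_per_s_per_gpu  # W / (tok/s) = J/tok
    joules_per_query = joules_per_token * (isl + osl)     # tokens handled per query
    return joules_per_query / 3600                        # 3600 J in one Wh

# Illustrative only: 1200 W all-in per GPU, 4000 tok/s/GPU total throughput,
# 8k/1k query gives wh_per_query(1200, 4000, 8192, 1024) ≈ 0.77 Wh.
```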

There is some imprecision because we're constructing the figure for e.g. an 8192/1024 ISL/OSL workload based on measurements with an average 0.9*8192 input and 0.9*1024 output length. The whole calculation would be much simpler if the benchmark harness recorded the number of queries executed and the time taken, meaning we could directly calculate the Wh/query from the Wh consumed by the system over the benchmark duration divided by the number of queries served (while remembering that in the current setup each query is on average 90% of the advertised sequence length).
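For comparison, that direct calculation would look something like the sketch below, where the query count and wall-clock duration are exactly the fields the harness doesn't currently record:

```python
def wh_per_query_direct(system_watts: float, duration_s: float,
                        num_queries: int) -> float:
    """Wh per query from measured system power, run duration and query count."""
    total_wh = system_watts * duration_s / 3600
    return total_wh / num_queries
```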

This logic is wrapped up in a simple script.

There's been a recent change to remove the 'full sweep' workflows in favour of only triggering a subset of runs when there is a relevant change, but I grabbed my results from before this happened, from a December 15th 2025 run. However, when finalising this article I spotted that Nvidia managed to land some new NVL72 DeepSeek R1 0528 configurations just before Christmas, so I've merged in those results as well, using a run from December 19th. All data and scripts are collected together in this Gist.

Results

As well as giving the calculated Wh per query, the script also gives a comparison point of minutes of PS5 gameplay (according to Sony, "Active Power Consumption" ranges from ~217W to ~197W depending on the model - we'll just use 200W). The idea is to provide some kind of reference point for what a given Wh figure means in real-world terms, rather than focusing solely on the relative differences between deployments. Comparisons to "minutes of internet streaming" seem popular at the moment, presumably because it's an activity basically everyone does. I'm steering away from that because I'd be comparing one value that's hard to estimate accurately and has many provisos to another figure with the same problems, which just injects more error and uncertainty into this effort to better measure/understand/contextualise the energy used for LLM inference.
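The conversion itself is trivial; a minimal sketch (not the script itself) of what that column represents:

```python
def ps5_minutes(wh: float, ps5_watts: float = 200.0) -> float:
    """Minutes of PS5 gameplay equivalent to a given energy in Wh."""
    return wh / ps5_watts * 60

# e.g. ps5_minutes(0.96) ≈ 0.29 minutes, matching the first row of the
# fp8 8k/1k table below.
```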

I'm now going to cherry-pick some results for discussion. Firstly, for DeepSeek R1 0528 with 8k/1k ISL/OSL, we see that the reported configurations giving a usable level of interactivity at fp8 come in between 0.96-3.74 Wh/query (equivalent to 0.29-1.12 minutes of PS5 gaming). The top row, which is substantially more efficient, is the newer GB200 NVL72 configuration added at the end of last year. It's not totally easy to trace the configuration changes given they're accompanied by a reworking of the associated scripts, but as far as I can see the configuration ultimately used is this file from the dynamo repository. Looking at the JSON, the big gain comes from significantly higher prefill throughput (with output throughput per GPU remaining roughly the same). This indicates the older results (the second row) were bottlenecked waiting for prefill to complete.

| Workload | Intvty (tok/s) | E2EL (s) | Details | Wh/Q | PS5 min |
|---|---|---|---|---|---|
| fp8 DS R1 0528 8k/1k | 39.5 | 36.5 | gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn) | 0.96 | 0.29 |
| fp8 DS R1 0528 8k/1k | 31.3 | 55.2 | gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn) | 3.13 | 0.94 |
| fp8 DS R1 0528 8k/1k | 20.9 | 48.8 | h200 trt (8 GPUs, conc: 64, dp_attn) | 3.32 | 1.00 |
| fp8 DS R1 0528 8k/1k | 19.5 | 49.6 | h200 sglang (8 GPUs, conc: 64) | 3.39 | 1.02 |
| fp8 DS R1 0528 8k/1k | 23.9 | 39.9 | b200-trt trt (8 GPUs, conc: 64) | 3.39 | 1.02 |
| fp8 DS R1 0528 8k/1k | 22.3 | 44.5 | b200 sglang (8 GPUs, conc: 64) | 3.74 | 1.12 |

Now taking a look at an fp4 quantisation of the same workload, the model is significantly cheaper to serve with similar or better interactivity, and the NVL72 setup Nvidia submitted does have a significant advantage over the 4/8 GPU clusters. This time we see 0.63-1.67 Wh/query (equivalent to 0.19-0.50 minutes of PS5 power draw while gaming). Serving at a lower quantisation impacts the quality of results of course, but the improved efficiency, including on smaller 4 GPU setups, helps demonstrate why models like Kimi K2 Thinking are distributed as "native int4", with benchmark results reported at this quantisation and quantisation-aware training used to maintain quality of results.

| Workload | Intvty (tok/s) | E2EL (s) | Details | Wh/Q | PS5 min |
|---|---|---|---|---|---|
| fp4 DS R1 0528 8k/1k | 41.6 | 24.6 | gb200 dynamo-trt (40 GPUs disagg, conc: 1075, pfill_dp_attn, dec_dp_attn) | 0.63 | 0.19 |
| fp4 DS R1 0528 8k/1k | 22.8 | 43.2 | b200-trt trt (4 GPUs, conc: 128, dp_attn) | 0.93 | 0.28 |
| fp4 DS R1 0528 8k/1k | 18.7 | 59.3 | b200 sglang (4 GPUs, conc: 128) | 1.25 | 0.38 |
| fp4 DS R1 0528 8k/1k | 30.3 | 39.4 | b200 sglang (4 GPUs, conc: 64) | 1.67 | 0.50 |

Looking now at the 1k/8k workload (i.e. generating significant output), the cost at fp8 is 15.0-16.3 Wh/query (equivalent to 4.49-4.89 minutes of PS5 power draw while gaming). As expected, this is significantly higher than the 8k/1k workload, as prefill (processing input tokens) is much cheaper per token than decode (generating output tokens).

| Workload | Intvty (tok/s) | E2EL (s) | Details | Wh/Q | PS5 min |
|---|---|---|---|---|---|
| fp8 DS R1 0528 1k/8k | 42.5 | 176.3 | b200 sglang (8 GPUs, conc: 64) | 15.0 | 4.49 |
| fp8 DS R1 0528 1k/8k | 31.9 | 232.2 | h200 sglang (8 GPUs, conc: 64) | 15.9 | 4.76 |
| fp8 DS R1 0528 1k/8k | 31.2 | 237.9 | h200 trt (8 GPUs, conc: 64) | 16.3 | 4.88 |
| fp8 DS R1 0528 1k/8k | 39.1 | 189.5 | b200-trt trt (8 GPUs, conc: 64) | 16.3 | 4.89 |

Again, fp4 has a significant improvement in efficiency:

| Workload | Intvty (tok/s) | E2EL (s) | Details | Wh/Q | PS5 min |
|---|---|---|---|---|---|
| fp4 DS R1 0528 1k/8k | 29.7 | 251.5 | b200-trt trt (4 GPUs, conc: 256, dp_attn) | 2.73 | 0.82 |
| fp4 DS R1 0528 1k/8k | 37.7 | 197.5 | b200-trt trt (8 GPUs, conc: 256, dp_attn) | 4.31 | 1.29 |
| fp4 DS R1 0528 1k/8k | 34.2 | 221.2 | b200 sglang (4 GPUs, conc: 128) | 4.75 | 1.43 |
| fp4 DS R1 0528 1k/8k | 33.1 | 223.1 | b200-trt trt (4 GPUs, conc: 128) | 4.79 | 1.44 |

As you'd expect for a much smaller model at native fp4 quantisation, GPT-OSS-120B is much cheaper to serve. For example, for 8k/1k:

| Workload | Intvty (tok/s) | E2EL (s) | Details | Wh/Q | PS5 min |
|---|---|---|---|---|---|
| fp4 GPT-OSS 120B 8k/1k | 45.8 | 20.8 | b200-trt trt (1 GPU, conc: 128) | 0.11 | 0.03 |
| fp4 GPT-OSS 120B 8k/1k | 93.1 | 10.5 | b200-trt trt (2 GPUs, conc: 128, dp_attn) | 0.11 | 0.03 |
| fp4 GPT-OSS 120B 8k/1k | 44.3 | 21.4 | b200 vllm (1 GPU, conc: 128) | 0.11 | 0.03 |
| fp4 GPT-OSS 120B 8k/1k | 145.7 | 6.7 | b200-trt trt (2 GPUs, conc: 64, dp_attn) | 0.14 | 0.04 |
| fp4 GPT-OSS 120B 8k/1k | 103.8 | 9.2 | b200 vllm (2 GPUs, conc: 64) | 0.20 | 0.06 |

Or for 1k/8k:

| Workload | Intvty (tok/s) | E2EL (s) | Details | Wh/Q | PS5 min |
|---|---|---|---|---|---|
| fp4 GPT-OSS 120B 1k/8k | 80.5 | 91.6 | b200-trt trt (1 GPU, conc: 128) | 0.49 | 0.15 |
| fp4 GPT-OSS 120B 1k/8k | 72.3 | 102.0 | b200 vllm (1 GPU, conc: 128) | 0.55 | 0.16 |
| fp4 GPT-OSS 120B 1k/8k | 144.9 | 51.1 | b200-trt trt (2 GPUs, conc: 128, dp_attn) | 0.55 | 0.17 |
| fp4 GPT-OSS 120B 1k/8k | 129.4 | 57.0 | b200-trt trt (2 GPUs, conc: 128) | 0.61 | 0.18 |

Conclusion

Well, this took rather a lot more work than I thought it would, and I'm not yet fully satisfied with the result. Partly we have to accept a degree of fuzziness: the marginal energy usage of an individual query depends on the overall workload of the system, so there's always going to be some approximation when you try to cost a single query.

I'm glad that InferenceMAX exists and am especially glad that it's open and publicly developed, which is what has allowed me to dive into its implementation to the extent I have and flag concerns/issues. I feel it's not yet fully living up to its aim of providing results that reflect real world application, but I hope that will improve with further maturation and better rules for benchmark participants. Of course, it may still make most sense to collect benchmark figures myself, and even then, being able to refer to the benchmarked configurations for an indication of what performance a given hardware setup can achieve is helpful. Renting a 72-GPU cluster is expensive and, as far as I can see, not typically available for a short period, so any benchmarking I run myself would be limited to 4-8 GPU configurations. If the gap in efficiency between such setups and the NVL72 is huge, then these smaller setups are maybe less interesting.

If I found the time to run benchmarks myself, what would I be testing? I'd move to DeepSeek V3.2. One of the big features of that release was the move to a new attention mechanism that scales much closer to linearly with sequence length. With e.g. Kimi Linear and Qwen3-Next, other labs are moving in a similar direction, experimentally at least. I'd try to set up an 8-GPU configuration with sglang/vllm configured so it could serve a commercial workload with varied input/output sequence lengths, and test that this is the case (Chutes provide their deployed configs, which may be another reference point). I'd want to see how much the effective Wh per million input/output tokens varies depending on the different ISL/OSL workloads, as sketched below. These should be relatively similar given the near-linear scaling of the attention mechanism, and if so it's a lot easier to estimate the rough energy cost of a series of your own queries of varied length. I would stick with random input tokens for the time being.
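One way (not what my script currently does) to derive per-input-token and per-output-token energy costs is to treat the Wh/query figures for two ISL/OSL mixes as a 2x2 linear system. The numbers below are illustrative, loosely based on the h200 fp8 rows in the tables above:

```python
def split_input_output_cost(m1, m2):
    """Each measurement is (isl, osl, wh_per_query); solve for Wh per input/output token."""
    (i1, o1, w1), (i2, o2, w2) = m1, m2
    det = i1 * o2 - i2 * o1
    wh_in = (w1 * o2 - w2 * o1) / det   # Wh per input token
    wh_out = (i1 * w2 - i2 * w1) / det  # Wh per output token
    return wh_in, wh_out

# Illustrative Wh/query values for an 8k/1k and a 1k/8k workload.
wh_in, wh_out = split_input_output_cost((8192, 1024, 3.3), (1024, 8192, 15.9))
print(f"~{wh_in * 1e6:.0f} Wh/M input tokens, ~{wh_out * 1e6:.0f} Wh/M output tokens")
```

With more than two workload mixes you could check how stable those per-token figures actually are, which is the consistency I'd want to see.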

So where does that leave us? All of this and we've got figures for two particular models, with one benchmark harness, a limited set of input/output sequence lengths, and a range of potential issues that might impact the conclusion. I think this is a useful yardstick / datapoint, though I'd like to get towards something that's even more useful and that I have more faith in.


Article changelog