Last month I wrote up a fairly long piece on per-query energy consumption of LLMs using the data from InferenceMAX (note: InferenceMAX has since been renamed to InferenceX). Much of that write-up was dedicated to exploring what you can actually conclude from these figures and how that interacts with some of the implementation decisions in the benchmark, but I feel the results still give a useful yardstick. Beyond concerns about overly specialised serving engine configurations and whether the workload is representative of real-world model serving in a paid API host, the other obvious limitation is that InferenceMAX only tests GPT-OSS 120b and DeepSeek R1 0528, when there is a world of other models out there. I dutifully added "run my own tests using other models" to the todo list and here we are. By "here we are" I of course mean I made no progress towards that goal, but Zach Mueller at Lambda started publishing model cards with the needed data - thanks Zach!
The setup for Lambda is simple - each model card lists the observed token generation throughput and total throughput (along with other stats) for an input sequence length / output sequence length (ISL/OSL) of 8192/1024, as benchmarked using vllm bench serve. The command used to serve the LLM (using sglang or vllm depending on the model) is also given. As a starting point, this is no worse than the InferenceMAX data, and potentially somewhat better since the figures are taken from a configuration that isn't overly specialised to a particular query length.
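For context, a vllm bench serve run against an already-serving endpoint for this ISL/OSL mix would look something like the sketch below. This is illustrative rather than the exact command from the cards - the flags vary between vLLM versions, and the model name and request count here are placeholders.

vllm bench serve \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 512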
The figures on each Lambda model card that are relevant for calculating the energy per query are: the hardware used, the token generation throughput, and the total token throughput (input + output tokens). Other statistics such as the time to first token, inter-token latency, and number of parallel requests tested help confirm whether this is a configuration someone would realistically use. Using an equivalent methodology to before, we get the Watt hours per query by converting cluster power and total throughput into joules per token, multiplying by the tokens in a query, and converting joules to Watt hours:

Wh per query = ((num_gpus * watts_per_gpu) / total_throughput_tok_per_s) * tokens_per_query / 3600
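As a concrete check, plugging in GLM-4.7-Flash's figures from the table below: one B200 at 2.17 kW serving 8125 total tokens/s gives 2170 / 8125 ≈ 0.267 J per token, multiplying by 9216 tokens per query gives ≈ 2461 J, and dividing by 3600 gives ≈ 0.68 Wh.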
Collecting the data from the individual model cards, we can generate the following (as before, using minutes of PlayStation 5 gameplay as a point of comparison):
data = {
    "Qwen/Qwen3.5-397B-A17B": {
        "num_b200": 8,
        "total_throughput": 11092,
    },
    "MiniMaxAI/MiniMax-M2.5": {
        "num_b200": 2,
        "total_throughput": 8062,
    },
    "zai-org/GLM-5-FP8": {
        "num_b200": 8,
        "total_throughput": 6300,
    },
    "zai-org/GLM-4.7-Flash": {
        "num_b200": 1,
        "total_throughput": 8125,
    },
    "arcee-ai/Trinity-Large-Preview": {
        "num_b200": 8,
        "total_throughput": 15611,
    },
}

# ISL + OSL: 8192 + 1024
TOKENS_PER_QUERY = 9216
# Per-GPU power draw in kW. Taken from <https://inferencex.semianalysis.com/>
B200_KW = 2.17
# Reference power draw for a PS5 playing a game. Taken from
# <https://www.playstation.com/en-gb/legal/ecodesign/> ("Active Power
# Consumption"). Ranges from ~197W to ~217W depending on model.
PS5_KW = 0.2


def wh_per_query(num_b200, total_throughput, tokens_per_query):
    total_cluster_kw = num_b200 * B200_KW
    total_cluster_watts = total_cluster_kw * 1000
    # Watts divided by tokens/s gives joules per token. This is a weighted
    # average for the measured mix of input and output tokens.
    joules_per_token = total_cluster_watts / total_throughput
    joules_per_query = joules_per_token * tokens_per_query
    # Convert joules to watt-hours.
    return joules_per_query / 3600.0


def ps5_minutes(wh):
    # How many minutes of PS5 gameplay the same energy would power.
    ps5_watts = PS5_KW * 1000
    return (wh / ps5_watts) * 60.0


MODEL_WIDTH = 31
WH_WIDTH = 8
PS5_WIDTH = 8

header = f"{'Model':<{MODEL_WIDTH}} | {'Wh/q':<{WH_WIDTH}} | {'PS5 min':<{PS5_WIDTH}}"
separator = f"{'-' * MODEL_WIDTH} | {'-' * WH_WIDTH} | {'-' * PS5_WIDTH}"
print(header)
print(separator)
for model, vals in data.items():
    wh = wh_per_query(vals["num_b200"], vals["total_throughput"], TOKENS_PER_QUERY)
    ps5_min = ps5_minutes(wh)
    wh_str = f"{wh:.2f}" if wh < 10 else f"{wh:.1f}"
    print(f"{model:<{MODEL_WIDTH}} | {wh_str:<{WH_WIDTH}} | {ps5_min:.2f}")
This gives the following figures (sorted by Wh per query in ascending order, with an added column for interactivity, i.e. 1/TPOT):
| Model | Intvty (tok/s) | Wh/q | PS5 min. |
|---|---|---|---|
| zai-org/GLM-4.7-Flash (bf16) | 34.0 | 0.68 | 0.21 |
| MiniMaxAI/MiniMax-M2.5 (fp8) | 30.3 | 1.38 | 0.41 |
| arcee-ai/Trinity-Large-Preview (bf16) | 58.8 | 2.85 | 0.85 |
| Qwen/Qwen3.5-397B-A17B (bf16) | 41.7 | 4.01 | 1.20 |
| zai-org/GLM-5-FP8 (fp8) | 23.3 | 7.05 | 2.12 |
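For reference, the interactivity column is just the reciprocal of the time per output token (TPOT) reported on each card. A minimal helper, assuming TPOT is reported in milliseconds:

def interactivity_tok_per_s(tpot_ms):
    # Per-user decode speed: one token every tpot_ms milliseconds,
    # i.e. 1000 / TPOT tokens per second.
    return 1000.0 / tpot_ms

# e.g. GLM-4.7-Flash's 34.0 tok/s corresponds to a TPOT of ~29.4 ms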
As a point of comparison, the most efficient 8-GPU deployment of fp8 DeepSeek R1 0528 from my figures in the previous article was 3.32 Wh per query - which would slot in between Trinity-Large-Preview and Qwen3.5 in the table above.
And that's all I really have for today: some interesting datapoints, with hopefully more to come as Lambda puts up more model cards in this format. There's a range of interesting further experiments to run, but for now I just wanted to share this initial look.