Minipost: Additional figures for per-query energy consumption of LLMs

2026Q1.

Last month I wrote up a fairly long piece on per-query energy consumption of LLMs using the data from InferenceMAX (note: InferenceMAX has since been renamed to InferenceX). Much of the write-up was dedicated to exploring what you can actually conclude from these figures and how that interacts with some of the implementation decisions in the benchmark, but I feel the results still give a useful yardstick. Beyond concerns about overly-specialised serving engine configurations and whether the workload is representative of real-world model serving in a paid API host, the other obvious limitation is that InferenceMAX only tests GPT-OSS 120b and DeepSeek R1 0528 when there is a world of other models out there. I dutifully added "run my own tests using other models" to the todo list and here we are. By "here we are" I of course mean I made no progress towards that goal, but Zach Mueller at Lambda started publishing model cards with the needed data - thanks Zach!

The setup for Lambda is simple - each model card lists the observed token generation throughput and total throughput (along with other stats) for an input sequence length / output sequence length (ISL/OSL) of 8192/1024, as benchmarked using vllm bench serve. The command used to serve the LLM (using sglang or vllm depending on the model) is also given. As a starting point this is no worse than the InferenceMAX data, and potentially somewhat better due to figures being taken from a configuration that's not overly specialised to a particular query length.
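For reference, a vllm bench serve invocation for this ISL/OSL shape looks roughly like the following. This is an illustrative sketch rather than Lambda's exact command - the model name, prompt count, and concurrency below are placeholder values, not taken from the model cards:

```shell
# Benchmark an already-running server with synthetic 8192/1024 requests
# using vLLM's random dataset mode. Placeholder model and request counts.
vllm bench serve \
  --model zai-org/GLM-4.7-Flash \
  --dataset-name random \
  --random-input-len 8192 \
  --random-output-len 1024 \
  --num-prompts 256 \
  --max-concurrency 64
```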

The figures from each Lambda model card that are relevant for calculating the energy per query are: the hardware used, the token generation throughput, and the total token throughput (input + output tokens). Other statistics such as time to first token, inter-token latency, and the number of parallel requests tested help confirm whether this is a configuration someone would realistically use. Using the same methodology as before, we get the Watt hours per query by:

    Wh per query = (num_gpus * gpu_power_watts / total_throughput) * tokens_per_query / 3600

Collecting the data from the individual model cards, we can generate the following (as before, using minutes of PlayStation 5 gameplay as a point of comparison):

data = {
    "Qwen/Qwen3.5-397B-A17B": {
        "num_b200": 8,
        "total_throughput": 11092,
    },
    "MiniMaxAI/MiniMax-M2.5": {
        "num_b200": 2,
        "total_throughput": 8062,
    },
    "zai-org/GLM-5-FP8": {
        "num_b200": 8,
        "total_throughput": 6300,
    },
    "zai-org/GLM-4.7-Flash": {
        "num_b200": 1,
        "total_throughput": 8125,
    },
    "arcee-ai/Trinity-Large-Preview": {
        "num_b200": 8,
        "total_throughput": 15611,
    },
}

# 8192 + 1024
TOKENS_PER_QUERY = 9216

# Taken from <https://inferencex.semianalysis.com/>
B200_KW = 2.17

# Reference power draw for PS5 playing a game. Taken from
# <https://www.playstation.com/en-gb/legal/ecodesign/> ("Active Power
# Consumption"). Ranges from ~217W to ~197W depending on model.
PS5_KW = 0.2


def wh_per_query(num_b200, total_throughput, tokens_per_query):
    total_cluster_kw = num_b200 * B200_KW
    total_cluster_watts = total_cluster_kw * 1000
    # joules_per_token is a weighted average for the measured mix of input
    # and output tokens.
    joules_per_token = total_cluster_watts / total_throughput
    joules_per_query = joules_per_token * tokens_per_query
    # Convert joules to watt-hours
    return joules_per_query / 3600.0

def ps5_minutes(wh):
    ps5_watts = PS5_KW * 1000
    return (wh / ps5_watts) * 60.0

MODEL_WIDTH = 31
WH_WIDTH = 8
PS5_WIDTH = 8

header = f"{'Model':<{MODEL_WIDTH}} | {'Wh/q':<{WH_WIDTH}} | {'PS5 min':<{PS5_WIDTH}}"
separator = f"{'-' * MODEL_WIDTH} | {'-' * WH_WIDTH} | {'-' * PS5_WIDTH}"

print(header)
print(separator)

for model, vals in data.items():
    wh = wh_per_query(vals["num_b200"], vals["total_throughput"], TOKENS_PER_QUERY)
    ps5_min = ps5_minutes(wh)

    wh_str = f"{wh:.2f}" if wh < 10 else f"{wh:.1f}"
    print(f"{model:<{MODEL_WIDTH}} | {wh_str:<{WH_WIDTH}} | {ps5_min:.2f}")

This gives the following figures (reordered so that Wh per query is ascending, with an added column for interactivity (1/TPOT) in tokens per second):

Model                                 | Intvty (tok/s) | Wh/q | PS5 min
------------------------------------- | -------------- | ---- | -------
zai-org/GLM-4.7-Flash (bf16)          | 34.0           | 0.68 | 0.21
MiniMaxAI/MiniMax-M2.5 (fp8)          | 30.3           | 1.38 | 0.41
arcee-ai/Trinity-Large-Preview (bf16) | 58.8           | 2.85 | 0.85
Qwen/Qwen3.5-397B-A17B (bf16)         | 41.7           | 4.01 | 1.20
zai-org/GLM-5-FP8 (fp8)               | 23.3           | 7.05 | 2.12
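Working one row through by hand as a sanity check (GLM-4.7-Flash, a single B200 at 2.17 kW serving 8125 total tokens/s):

```python
# Energy per 9216-token query for one B200 at the measured total throughput.
joules_per_token = (1 * 2.17 * 1000) / 8125
wh = joules_per_token * 9216 / 3600
print(f"{wh:.2f} Wh")              # 0.68 Wh
print(f"{wh / 200 * 60:.2f} min")  # 0.21 minutes of PS5 gameplay at 200 W
```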

As a point of comparison, the most efficient 8 GPU deployment of fp8 DeepSeek R1 0528 from my figures in the previous article was 3.32 Wh per query.

And that's all I really have for today. Some interesting datapoints with hopefully more to come as Lambda puts up more model cards in this format. There's a range of interesting potential further experiments to do, but for now, I just wanted to share this initial look.


Article changelog