[INFERENCE_COSTS]

LLM inference costs in 2026

Cost per million tokens for the most commonly self-hosted open LLMs, served on Servers.Computer-indexed clusters. Numbers assume fp8 with vLLM, 50% sustained utilization, mixed input/output 4:1.

Model	H100	H200	B200
Llama-3 8B (fp8)	$0.10 / 1M tok	$0.09	$0.07
Llama-3 70B (fp8)	$0.85 / 1M tok	$0.72	$0.55
Mixtral 8x22B (MoE)	$0.60 / 1M tok	$0.48	$0.35
Qwen2 72B long-context	$1.10 / 1M tok	$0.78	$0.55

Definition

What is Servers.Computer?

Servers.Computer is an AI compute routing and procurement layer that benchmarks, compares, and deploys GPU clusters (NVIDIA H100, H200, B200 and AMD MI300) across global cloud providers in real time.

When to switch from H100 to B200 for inference

Sustained QPS > 50 on 70B+ dense models — B200's HBM3e collapses KV-cache thrashing.
Long context (32K+ tokens) where attention memory dominates.
MoE serving where expert weights exceed 80 GB per device.

→ Pick an inference cluster → Latency benchmarks → Training cost calculator