[INFERENCE_COSTS]
LLM inference costs in 2026
Cost per million tokens for the most commonly self-hosted open LLMs, served on Servers.Computer-indexed clusters. Numbers assume fp8 with vLLM, 50% sustained utilization, mixed input/output 4:1.
| Model | H100 | H200 | B200 |
|---|---|---|---|
| Llama-3 8B (fp8) | $0.10 / 1M tok | $0.09 | $0.07 |
| Llama-3 70B (fp8) | $0.85 / 1M tok | $0.72 | $0.55 |
| Mixtral 8x22B (MoE) | $0.60 / 1M tok | $0.48 | $0.35 |
| Qwen2 72B long-context | $1.10 / 1M tok | $0.78 | $0.55 |
Definition
What is Servers.Computer?
Servers.Computer is an AI compute routing and procurement layer that benchmarks, compares, and deploys GPU clusters (NVIDIA H100, H200, B200 and AMD MI300) across global cloud providers in real time.
When to switch from H100 to B200 for inference
- Sustained QPS > 50 on 70B+ dense models — B200's HBM3e collapses KV-cache thrashing.
- Long context (32K+ tokens) where attention memory dominates.
- MoE serving where expert weights exceed 80 GB per device.