Concepts: kv-cache | inference-efficiency | continuous-batching
Builds on: vLLM | Speculative Decoding
Leads to: Disaggregated inference (general principle adopted in SGLang, TensorRT-LLM, Mooncake)
Part 1: The Problem
Every time you send a prompt to an LLM, the system does two completely different jobs back-to-back. And right now, both jobs share the same machine, fighting over the same resources.
Let’s think about why that’s a problem. The first job — reading your prompt and producing the first output token — is a sprint. Thousands of tokens, one giant parallel forward pass, GPU at full tilt. The second job — generating each subsequent token one at a time — is a marathon. Slow, sequential, barely touching the GPU’s compute but constantly reaching into memory for the KV-cache.
You wouldn’t make a sprinter run a marathon and call it efficient. But that’s exactly what LLM serving stacks did before phase splitting. The result? You’re paying $38/hr for an H100 that spends the whole decode phase barely using its most expensive feature: compute.
Part 2: The Core Idea
The Analogy
Let’s say you run a restaurant kitchen. There’s a prep station and there’s a cook station.
The prep station reads the order ticket, chops everything, measures every ingredient, gets it all ready. All at once, in parallel. It runs hot. You need your sharpest, fastest knives here.
The cook station takes those prepped ingredients and adds them to the pot one at a time, stirs, tastes, adjusts. Sequential, careful, dependent on the last step.
Now: what if you forced both stations to share the same counter, the same tools, the same chef? The prep work interrupts the cooking mid-stir. The cooking has to wait for prep to finish. Neither works as well as it could.
Splitwise gives each station its own kitchen.
The Mechanism
Every LLM inference request has two distinct phases:
Phase 1: Prompt (prefill). All your input tokens run through the model in a single parallel forward pass to produce the first output token. Compute-intensive — the GPU is doing matrix multiplications on thousands of tokens simultaneously. It loves the H100’s 66.9 TFLOPs. Power draw scales with batch size. This phase runs hot.
Phase 2: Token generation (decode). Each subsequent token is generated one at a time. The model only sees the last token plus the KV-cache — a stored record of every previous token’s attention keys and values. This phase barely uses the GPU’s compute. It’s bottlenecked by how fast it can read the KV-cache from memory.
So: an H100 has 3.43× the compute of an A100. Same 80GB of memory. Only 1.64× the memory bandwidth. For token generation, which is memory-bound, you’re paying 2.16× the price for hardware that gives you 1.64× the thing you actually need.
“Running both phases on the same machine often leads to inconsistent end-to-end latencies due to the arbitrary batching of prompt and token phases.”
When they share a machine, they interfere. A long prompt preempts tokens mid-stream. Tokens that could be batched 64 at a time get interrupted. The GPU whipsaws between two completely different utilization modes.
Splitwise’s fix: separate machines for each phase. Prompt machines get the latest high-FLOP GPUs. Token machines get cheaper, memory-oriented hardware. After the prompt phase finishes, the KV-cache transfers over InfiniBand to the token machine.
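Here is a minimal sketch of that two-pool hand-off, assuming a simple least-loaded routing policy. The policy is an illustrative choice, not necessarily what the paper's scheduler does, and `Machine`, `pending_tokens`, and `route_request` are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    pending_tokens: int = 0  # hypothetical load metric: tokens queued here


def least_loaded(pool):
    # min() returns the first machine on ties, so routing is deterministic.
    return min(pool, key=lambda m: m.pending_tokens)


def route_request(prompt_len, expected_output_len, prompt_pool, token_pool):
    """Pick one machine per phase for a single request (assumed policy)."""
    prompt_machine = least_loaded(prompt_pool)
    token_machine = least_loaded(token_pool)
    # The prompt machine runs prefill, then ships the KV-cache to the
    # token machine, which generates the remaining output tokens.
    prompt_machine.pending_tokens += prompt_len
    token_machine.pending_tokens += expected_output_len
    return prompt_machine.name, token_machine.name


prompt_pool = [Machine("h100-0"), Machine("h100-1")]
token_pool = [Machine("a100-0"), Machine("a100-1")]
first = route_request(1500, 13, prompt_pool, token_pool)
second = route_request(1020, 129, prompt_pool, token_pool)
print(first, second)
```

The point of the sketch is only the shape of the decision: every request gets two placements, one per phase, instead of one.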
ASCII Diagram
BEFORE Splitwise (mixed batching, one machine):
┌────────────────────────────────────────────────┐
│               Single GPU Machine               │
│                                                │
│   [Prompt Phase]    ←→    [Token Phase]        │
│   compute-heavy           memory-heavy         │
│   batches badly           batches great        │
│   runs hot                barely uses GPU      │
│                                                │
│   → Prompt preempts tokens → TBT spikes        │
│   → H100's 3.4× compute wasted on decode       │
└────────────────────────────────────────────────┘
AFTER Splitwise (phase-split, two machine pools):
┌──────────────────┐             ┌──────────────────┐
│  Prompt Machine  │  KV-cache   │  Token Machine   │
│  (H100: 66.9     │ ──────────► │  (A100: cheaper, │
│   TFLOPs)        │  layer-by-  │   same memory,   │
│                  │ layer async │   good bandwidth)│
│  Process 1500    │             │  Generate tokens │
│  tokens in one   │             │  one by one,     │
│  parallel pass   │             │  batch deeply    │
└──────────────────┘             └──────────────────┘
   compute-bound                    memory-bound
 H100 earns its keep              A100 is fine here
The Math: Does KV-Cache Transfer Kill the Latency?
The obvious question: doesn’t shipping the KV-cache across machines add latency? Let’s work through it.
For Llama-70B on H100: 80 layers, 8 KV-heads per layer, head dimension 128, fp16 (2 bytes).
KV-cache per token per layer: 2 (K + V) × 8 heads × 128 dim × 2 bytes = 4 KB
For a 1500-token prompt: 1500 tokens × 80 layers × 4 KB = 480 MB total.
Naive serialized transfer over InfiniBand at 200 Gbps (25 GB/s): 480 MB ÷ 25 GB/s ≈ 19.2 ms. Token generation time between tokens is ~28 ms. So naive transfer eats 68% of one token-generation slot. Painful.
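The arithmetic above is easy to check in a few lines. This is a sketch using only the model dimensions and link speed quoted in the text; the function names are made up for illustration. One caveat: with 4 KB taken as exactly 4096 bytes, the total comes out slightly above the rounded 480 MB and 19.2 ms figures.

```python
def kv_bytes_per_token_per_layer(kv_heads, head_dim, bytes_per_value):
    # Factor of 2: one key vector and one value vector per KV head.
    return 2 * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(prompt_tokens, layers, kv_heads, head_dim, bytes_per_value):
    per_token_layer = kv_bytes_per_token_per_layer(kv_heads, head_dim, bytes_per_value)
    return prompt_tokens * layers * per_token_layer


def transfer_ms(num_bytes, link_gbps):
    # Link speed is given in gigabits per second; convert to bytes per second.
    bytes_per_second = link_gbps * 1e9 / 8
    return num_bytes / bytes_per_second * 1e3


per_token_layer = kv_bytes_per_token_per_layer(kv_heads=8, head_dim=128, bytes_per_value=2)
total = kv_cache_bytes(prompt_tokens=1500, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
naive_ms = transfer_ms(total, link_gbps=200)
print(per_token_layer, total, round(naive_ms, 2))  # 4096 491520000 19.66
```

Swapping in a different model's layer count or KV-head count shows quickly how the transfer budget changes.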
Now here’s what’s clever. The prompt machine doesn’t wait until the end to send everything. As it finishes each layer, it immediately starts shipping that layer’s KV-cache in the background while computing the next layer.
- Each layer’s KV-cache: 1500 × 4 KB = 6 MB
- Transfer time per layer: 6 MB ÷ 25 GB/s = 0.24 ms
- Compute time per layer (1500 tokens on H100): 84ms TTFT ÷ 80 layers ≈ 1.05 ms
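Since 0.24 ms is much less than 1.05 ms, almost all of the transfer is hidden behind computation. Only the very last synchronization barrier is visible. The paper measures this as a roughly constant ~5 ms overhead on H100 (400 Gbps), down from the 19.2 ms naive estimate above. Note that the naive estimate assumed a 200 Gbps link; even at 400 Gbps, a fully serialized transfer would still take about 9.6 ms.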
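To see why the overlap works, here is an idealized timeline model of the two strategies. It is a sketch under stated assumptions: every layer takes the same compute time, the link carries one layer's KV-cache at a time, and there are no setup or synchronization costs. That last assumption is why it reports far less than the ~5 ms the paper measures. `kv_ready_ms` is a hypothetical helper, not something from the paper.

```python
def kv_ready_ms(layers, compute_ms_per_layer, transfer_ms_per_layer, layerwise=True):
    """Time at which the full KV-cache has arrived at the token machine."""
    compute_done = 0.0  # when the current layer's compute finishes
    link_free = 0.0     # when the link finishes its last queued transfer
    for _ in range(layers):
        compute_done += compute_ms_per_layer
        if layerwise:
            # A layer's KV-cache ships as soon as it exists and the link is idle.
            start = max(compute_done, link_free)
            link_free = start + transfer_ms_per_layer
    if not layerwise:
        # Serialized: nothing is sent until the whole prefill is done.
        link_free = compute_done + layers * transfer_ms_per_layer
    return link_free


prefill_ms = 80 * 1.05
serialized_overhead = kv_ready_ms(80, 1.05, 0.24, layerwise=False) - prefill_ms
layerwise_overhead = kv_ready_ms(80, 1.05, 0.24, layerwise=True) - prefill_ms
print(round(serialized_overhead, 2), round(layerwise_overhead, 2))  # 19.2 0.24
```

In this model, only the last layer's transfer sticks out past the end of compute. If per-layer transfer time ever exceeded per-layer compute time, the link would become the bottleneck and the overhead would grow with layer count.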
“Splitwise only incurs 0.8% of E2E [latency overhead]. In a user-facing inference, the only visible impact of KV-cache transfer overhead is the latency for the second token.”
0.8%. Essentially invisible.
What’s Clever
The insight was sitting in the hardware spec sheet the whole time. Let’s look at the numbers:
| | A100 | H100 | Ratio |
|---|---|---|---|
| TFLOPs | 19.5 | 66.9 | 3.43× |
| HBM capacity | 80GB | 80GB | 1.00× |
| HBM bandwidth | 2039 GBps | 3352 GBps | 1.64× |
| Power | 400W | 700W | 1.75× |
| Cost/hr | $17.6 | $38.0 | 2.16× |
The H100 costs 2.16× as much but delivers the same memory capacity and only 1.64× the bandwidth. For memory-bound token generation, you’re paying 2.16× for a 1.64× improvement. The A100 is the better deal for decode. It always was.
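One way to read the table is as price per unit of the resource each phase actually needs. The sketch below uses only the table's numbers; `SPECS` and `cost_per_unit` are names made up for this example.

```python
SPECS = {
    "A100": {"tflops": 19.5, "hbm_gbps": 2039, "cost_per_hr": 17.6},
    "H100": {"tflops": 66.9, "hbm_gbps": 3352, "cost_per_hr": 38.0},
}


def cost_per_unit(gpu, resource):
    # Dollars per hour for one unit (TFLOP or GB/s) of the given resource.
    return SPECS[gpu]["cost_per_hr"] / SPECS[gpu][resource]


# How much more (or less) the H100 charges per unit, relative to the A100.
bandwidth_premium = cost_per_unit("H100", "hbm_gbps") / cost_per_unit("A100", "hbm_gbps")
compute_premium = cost_per_unit("H100", "tflops") / cost_per_unit("A100", "tflops")
print(round(bandwidth_premium, 2), round(compute_premium, 2))  # 1.31 0.63
```

So per GB/s of memory bandwidth the H100 costs about 31% more than the A100, while per TFLOP it costs about 37% less. That is the whole argument in two numbers: buy H100s for prompt, A100s for decode.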
Nobody acted on this because the assumption was: one request needs one machine. Splitwise breaks that assumption. One request, two machines, right tool for each job.
There’s a second insight buried in the power data. Token generation machines can be power-capped by 50% with almost no latency impact — because they’re memory-bound and barely using GPU compute. The paper shows capping token H100s to 350W has near-zero effect on time-between-tokens. Prompt machines can’t be capped; they’re compute-saturated.
“While the prompt phase utilizes the power budget of the GPU efficiently, the token phase does not.”
“Batching during the prompt phase is compute-bound, whereas the token phase is limited by memory capacity.”
That’s two different problems. So why were we solving them on the same machine?
Part 3: Does It Work + What Breaks
Results compared to baseline mixed-batching clusters:
| Configuration | vs Baseline-H100 | Interpretation |
|---|---|---|
| Splitwise-HA (H100 prompt + A100 token) | 1.4× throughput at 20% lower cost | Best cost-efficiency for conversational workloads |
| Splitwise-HHcap (H100 + power-capped H100) | 2.35× throughput at same power | Same datacenter power budget, more than double the requests |
| KV-cache transfer overhead (optimized) | 0.8% of E2E latency | Essentially invisible |
The 1.4× and 2.35× numbers are real, but they’re not free. The system assumes you can cleanly separate the two phases at the request level — which works well when prompts are long enough to make the coordination overhead worth it. Very short prompts (under 512 tokens) fall back to serialized transfer, and at very high request rates the cluster-level scheduler can become the new bottleneck.
The gains also depend on workload shape. Coding workloads (median: 1500 prompt tokens, only 13 output tokens) need more prompt machines. Conversational workloads (1020 prompt tokens, 129 output tokens) need more token machines. If you provision the wrong ratio, the gains evaporate.
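A back-of-the-envelope sketch shows why the ratio swings so much between workloads. It leans on loud assumptions that are not from the paper: prefill time scales linearly with prompt length (anchored on the 84 ms for 1500 tokens used earlier), and a token machine decodes 64 requests per step at an unchanged 28 ms per token. Treat the output as an illustration of the trend, not as provisioning guidance.

```python
def machine_ms_per_request(prompt_tokens, output_tokens,
                           ref_prompt_tokens=1500, ref_ttft_ms=84.0,
                           tbt_ms=28.0, decode_batch=64):
    # ASSUMPTION: prefill time scales linearly with prompt length.
    prompt_ms = ref_ttft_ms * prompt_tokens / ref_prompt_tokens
    # ASSUMPTION: each decode step serves decode_batch requests at once
    # with no slowdown, so a request's share of a step is tbt_ms / batch.
    token_ms = output_tokens * tbt_ms / decode_batch
    return prompt_ms, token_ms


def prompt_to_token_ratio(prompt_tokens, output_tokens):
    """Prompt-machine time needed per unit of token-machine time."""
    prompt_ms, token_ms = machine_ms_per_request(prompt_tokens, output_tokens)
    return prompt_ms / token_ms


coding = prompt_to_token_ratio(1500, 13)         # median coding request
conversation = prompt_to_token_ratio(1020, 129)  # median conversation request
print(round(coding, 1), round(conversation, 1))
```

Under these assumptions the coding workload wants roughly 15 units of prompt capacity per unit of token capacity, while conversation sits near 1:1. A cluster tuned for one shape is badly mismatched for the other.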
And the authors are honest about one thing: Splitwise doesn’t improve single-request latency. It improves throughput and cluster efficiency. If you’re running at low load, you might not feel the difference.
Part 4: So What?
If you’re building LLM serving infrastructure, the practical takeaway is this: prompt and decode are not the same job, so stop treating them the same way. The first decision is whether to disaggregate at the machine level (buy separate hardware pools) or at least at the scheduling level (keep prompt and decode batches separate, even on shared machines). If you’re deploying at scale, the cost math works out quickly. 1.4× throughput at 20% lower cost isn’t a marginal improvement.
The transfer mechanism is the enabling piece — and it’s simpler than it sounds once you see it: transfer layer-by-layer in the background while compute continues. If you’re implementing this yourself, the MSCCL++ one-sided put primitive is the key: the prompt machine pushes without requiring receive instructions from the token machine.
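The push pattern itself can be sketched in plain Python. This is a stand-in that uses a background thread and a queue, not the MSCCL++ API: `compute_layer` and `put_layer` are hypothetical callbacks, and the "remote" buffer is just a local list. What it shows is the control flow, where the compute loop never waits on the receiver.

```python
import queue
import threading


def prefill_and_push(num_layers, compute_layer, put_layer):
    """Compute layers in order while a background thread pushes each KV block."""
    outbox = queue.Queue()

    def sender():
        while True:
            item = outbox.get()
            if item is None:  # sentinel: prefill is finished
                return
            layer, kv = item
            # One-sided in spirit: the receiver issues no matching call.
            put_layer(layer, kv)

    worker = threading.Thread(target=sender)
    worker.start()
    for layer in range(num_layers):
        kv = compute_layer(layer)  # stands in for this layer's forward pass
        outbox.put((layer, kv))    # hand off, then move on to the next layer
    outbox.put(None)
    worker.join()                  # the only wait: the final in-flight transfer


remote_kv = [None] * 4  # stand-in for a preallocated buffer on the token machine


def fake_compute(layer):
    return bytes([layer]) * 8


def fake_put(layer, kv):
    remote_kv[layer] = kv


prefill_and_push(4, fake_compute, fake_put)
print(remote_kv)
```

The single `join()` at the end corresponds to the one visible synchronization point: everything before it overlaps with compute.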
This paper builds directly on PagedAttention, which solved memory fragmentation in the KV-cache, the same KV-cache being transferred here. Splitwise runs on top of vLLM. Speculative Decoding speeds up token generation by having a small draft model propose several tokens that the large model then verifies in a single pass; Splitwise improves throughput through better resource allocation. The two are orthogonal and composable: you could run speculative decoding inside the token machine pool.
If GPUs are your biggest cost center, the answer is often not a smarter algorithm — it’s running each job on the hardware it actually needs.
Paper: Splitwise: Efficient Generative LLM Inference Using Phase Splitting — Patel et al. — 2023
Connections
Sources:
- pagedattention-vllm — KV-cache management system Splitwise is built on top of
- speculative-decoding — orthogonal decode speedup technique
Concepts:
- kv-cache — the data structure transferred between prompt and token machines
- inference-efficiency — the broader problem class this addresses
- continuous-batching — batching mechanism used within each phase pool
Citation
Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, Í., Maleki, S., & Bianchini, R. (2023). Splitwise: Efficient Generative LLM Inference Using Phase Splitting. arXiv preprint arXiv:2311.18677.