What It Is
Continuous batching (also called iteration-level scheduling) is a serving strategy in which the scheduler makes batching decisions before every forward pass rather than once per batch of requests. When one request finishes, a waiting request takes over its slot at the next iteration, with no waiting for the entire batch to complete.
Why It Matters
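Without it, the longest request in a batch holds up every other request in that batch. If a request that needs only 100 output tokens shares a batch with one that generates 2000, the short request's slot sits idle for the remaining 1900 iterations, and queued requests cannot start until the long one finishes. Continuous batching keeps GPUs saturated by backfilling completed slots right away.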
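A rough way to quantify the idle slots is to count slot-iterations. The sketch below is an illustrative calculation, not a measurement: it assumes one output token per request per iteration, ignores prompt processing, and the function name `static_batch_utilization` is made up for this example.

```python
def static_batch_utilization(output_lens):
    """Fraction of slot-iterations that do useful work when a batch
    runs until its longest request finishes (request-level batching)."""
    total_slot_iters = len(output_lens) * max(output_lens)
    useful_slot_iters = sum(output_lens)
    return useful_slot_iters / total_slot_iters

# One short request (100 tokens) batched with one long request (2000 tokens).
utilization = static_batch_utilization([100, 2000])
print(utilization)  # 0.525
```

Under these assumptions, almost half of the slot-iterations are wasted; continuous batching aims to fill them with waiting requests.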
How It Works
The scheduler runs before each transformer forward pass. It checks which requests have finished, removes them from the batch, inserts waiting requests into the freed slots, then kicks off the next pass. Each request enters and leaves the batch independently of the others, so the batch composition can change at every iteration.
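The loop described above can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the scheduler of any particular serving system: `Request`, `Scheduler`, and `tokens_left` are hypothetical names, the forward pass is simulated by decrementing a counter, and a fixed output length stands in for a real end-of-sequence check.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    tokens_left: int  # assumption: known output length instead of an EOS check

class Scheduler:
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def submit(self, request):
        self.waiting.append(request)

    def has_work(self):
        return bool(self.waiting) or any(r.tokens_left > 0 for r in self.running)

    def step(self):
        """One scheduling decision followed by one simulated forward pass."""
        # 1. Remove requests that finished in the previous pass.
        self.running = [r for r in self.running if r.tokens_left > 0]
        # 2. Backfill freed slots from the waiting queue.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # 3. Simulated forward pass: every running request emits one token.
        for r in self.running:
            r.tokens_left -= 1
        return [r.rid for r in self.running]

sched = Scheduler(max_batch_size=2)
for rid, n in [("a", 3), ("b", 1), ("c", 2)]:
    sched.submit(Request(rid, n))

history = []
while sched.has_work():
    history.append(sched.step())
print(history)  # [['a', 'b'], ['a', 'c'], ['a', 'c']]
```

Request "b" finishes after the first pass and "c" takes its slot in the second, while "a" keeps running throughout.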
There are three variants: request-level batching (the older, naive approach, where a batch runs until all of its requests finish), continuous batching (each batch holds requests in either the prompt phase or the token-generation phase, but not both), and mixed batching (prompt and token phases share the same batch). Mixed batching maximizes utilization but can cause interference: a large prompt slows down ongoing token generation for the other requests in the batch.
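One way to see where the interference comes from is to count the tokens processed in a single forward pass. The sketch below assumes that a prompt-phase request processes its whole prompt in one pass (no chunking) and that a token-phase request processes one token; `tokens_in_pass` is a hypothetical helper, and the numbers are illustrative only.

```python
def tokens_in_pass(prompt_lens, num_decoding):
    """Tokens processed in one forward pass: every prompt in full,
    plus one token for each request in the token-generation phase."""
    return sum(prompt_lens) + num_decoding

decode_only = tokens_in_pass([], num_decoding=8)   # token phase only
mixed = tokens_in_pass([1500], num_decoding=8)     # one 1500-token prompt joins
print(decode_only, mixed)  # 8 1508
```

The eight decoding requests still receive only one token each from the mixed pass, but they wait for a pass that handles far more work.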
Splitwise addresses this interference directly by running the prompt phase and the token-generation (decode) phase on separate machine pools, making mixed batching unnecessary.
Key Sources
- pagedattention-vllm: vLLM implements continuous batching with PagedAttention
- splitwise-llm-inference-phase-splitting: analyzes interference between prompt and token batches and solves it via phase separation