Continuous Batching

What It Is

Continuous batching (also called iteration-level scheduling) is a serving strategy in which the scheduler makes batching decisions before each forward pass rather than at the request level. When one request finishes, a waiting request takes its slot at the next iteration, with no need to wait for the rest of the batch to complete.

Why It Matters

Without it, one long request holds up the entire batch. If a request that generates 2000 tokens shares a batch with requests that generate only 100, the short ones finish early, but their slots stay occupied until the long one completes and queued requests cannot start. Continuous batching keeps GPUs saturated by backfilling completed slots right away.
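To put a rough number on the idle slots, here is a minimal sketch of slot utilization under request-level batching. The function name and the example output lengths are hypothetical, and the model assumes every slot is held until the longest request in the batch finishes.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-iterations doing useful work under request-level batching.

    Every slot is held for as many iterations as the longest request needs,
    so useful work is sum(lengths) out of batch_size * max(lengths).
    """
    if not output_lengths:
        raise ValueError("need at least one request")
    held = len(output_lengths) * max(output_lengths)
    useful = sum(output_lengths)
    return useful / held


# Hypothetical batch: three short requests and one long one.
lengths = [100, 100, 100, 2000]
utilization = static_batch_utilization(lengths)
print(f"slot utilization: {utilization:.1%}")
```

In this hypothetical batch only 2300 of the 8000 slot-iterations produce a token; the rest are what continuous batching tries to reclaim.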

How It Works

The scheduler runs before each transformer forward pass. It checks which requests have finished, removes them from the batch, inserts waiting requests, then kicks off the next pass. Each request advances at its own pace. The batch composition changes every iteration.
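The loop described above can be sketched as a small simulation. This is an illustrative toy, not any particular engine's scheduler: `Request`, `run_continuous_batching`, and `max_batch_size` are hypothetical names, each request is assumed to need at least one output token, the output length is known in advance only so the simulation can terminate, and the prompt phase is ignored.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # output tokens still to generate (known only for simulation)


def run_continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling; return {name: iteration it finished on}."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    iteration = 0
    while waiting or running:
        # 1. Remove requests that finished during the previous pass.
        still_running = []
        for req in running:
            if req.tokens_left <= 0:
                finished_at[req.name] = iteration
            else:
                still_running.append(req)
        running = still_running
        # 2. Backfill the freed slots from the waiting queue.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        if not running:
            break
        # 3. One forward pass: every running request advances by one token.
        iteration += 1
        for req in running:
            req.tokens_left -= 1
    return finished_at


workload = [Request("A", 2), Request("B", 5), Request("C", 3), Request("D", 1)]
finished = run_continuous_batching(workload, max_batch_size=2)
print(finished)
```

With two slots, C takes over A's slot as soon as A finishes, and the whole workload completes in 6 iterations. Under request-level batching the same workload would run as batch {A, B} for 5 iterations followed by batch {C, D} for 3, or 8 in total.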

There are three variants: request-level batching (the older, naive approach, where a batch is fixed until all of its requests finish), continuous batching (each batch runs either the prompt phase or the token phase, never both), and mixed batching (prompt and token phases share a batch). Mixed batching maximizes utilization but can cause interference: a large prompt slows down ongoing token generation for other requests.

Splitwise addresses this interference directly by running the prompt and token phases on separate machine pools, so the two phases no longer need to share a batch.
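To make the difference between the three batching variants concrete, the sketch below shows which work each one would place in the next forward pass. All names here (`next_batch`, `tokens_in_pass`, the request ids) are hypothetical, and the choice to run waiting prompts before decodes in the continuous case is an assumed policy, not something the sources above prescribe.

```python
def next_batch(variant, running_decodes, waiting_prompts):
    """Return (prompts_to_run, decodes_to_run) for the next forward pass.

    running_decodes: ids of requests already generating tokens
    waiting_prompts: dict {request id: prompt length} not yet processed
    """
    if variant == "request-level":
        # New prompts are admitted only once the current batch has drained.
        if running_decodes:
            return [], list(running_decodes)
        return list(waiting_prompts), []
    if variant == "continuous":
        # One phase per pass; running prompts first is an assumed policy.
        if waiting_prompts:
            return list(waiting_prompts), []
        return [], list(running_decodes)
    if variant == "mixed":
        # Both phases share the same pass.
        return list(waiting_prompts), list(running_decodes)
    raise ValueError(f"unknown variant: {variant}")


def tokens_in_pass(prompts, decodes, waiting_prompts):
    """Tokens processed in one pass: full prompt length per prompt, one per decode."""
    return sum(waiting_prompts[p] for p in prompts) + len(decodes)


running = ["r1", "r2", "r3"]   # three requests in the token phase
waiting = {"r4": 1500}         # one large prompt arrives

for variant in ("request-level", "continuous", "mixed"):
    prompts, decodes = next_batch(variant, running, waiting)
    print(variant, prompts, decodes, tokens_in_pass(prompts, decodes, waiting))
```

In this toy setup the mixed pass processes 1503 tokens where a decode-only pass would process 3. To the extent that pass time grows with the number of tokens processed, r1 through r3 each wait noticeably longer for their next token, which is the interference described above.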

Key Sources

pagedattention-vllm — vLLM implements continuous batching with PagedAttention
splitwise-llm-inference-phase-splitting — analyzes interference between prompt/token batches and solves it via phase separation
orca-distributed-serving-transformer-generative-models
