Concepts: continuous-batching | inference-efficiency | kv-cache
Builds on: attention-is-all-you-need | language-models-are-few-shot-learners
Leads to: pagedattention-vllm | splitwise-llm-inference-phase-splitting
In 2022, the standard way to serve a generative LLM was to copy the recipe used for any other deep-learning model: gather a batch of N requests, run the model, return all N results. The mismatch is that an LLM does not produce one output per request; it produces a token, then another, then another, until each request’s output sequence ends at its own length. With request-level batching, when one request in the batch finishes at, say, 50 tokens and another runs to 500, the GPU spends the remaining 450 iterations doing useful work for one request while the finished request’s slot sits idle. Orca, from the Seoul National University team that became FriendliAI, is the paper that fixed this with two ideas: iteration-level scheduling and selective batching.
The core idea
The analogy: Imagine a factory line where every worker is given a task of a different length, but the line moves forward only when all workers have finished their current task. Worker A finishes in 30 seconds; worker B’s task takes 5 minutes. For 4.5 minutes, worker A is paid to stand idle while watching B finish. Now extend this to 16 workers: everyone who finishes early stands idle until the slowest worker is done.
The fix is obvious in factory terms but counter-intuitive in deep learning: let workers grab a new task the instant they finish the current one, even mid-batch. Don’t wait for the whole batch. Don’t even keep the same batch composition from one step to the next.
This is iteration-level scheduling. Instead of scheduling at the granularity of “request” (one schedule decision per arrived request), Orca makes a scheduling decision before each model forward pass:
- Look at the current batch (some requests are mid-generation, some just arrived).
- For each finished request, return its result and remove it from the batch.
- For each waiting request, insert it.
- Run one forward pass (one decode step) on the new batch.
- Repeat.
“We propose iteration-level scheduling, a new scheduling mechanism that schedules execution at the granularity of iteration (instead of request) where the scheduler invokes the execution engine to run only a single iteration of the model on the batch.”
A request joins the batch as soon as a slot opens up. A request leaves as soon as it emits an end-of-sequence (EOS) token. The batch composition changes every step. No request waits behind a slower one. GPU utilization jumps from “lowest common denominator” to “near-saturation.”
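The loop described above can be sketched as a toy simulation. This is a minimal sketch, not Orca's scheduler: the names `Request` and `run_iteration_level` are hypothetical, the "forward pass" is replaced by incrementing a counter, and a fixed `tokens_to_generate` stands in for the moment a request would emit EOS.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    arrival_step: int        # step at which the scheduler first sees the request
    tokens_to_generate: int  # toy stand-in for "number of steps until EOS"
    generated: int = 0


def run_iteration_level(requests, max_batch):
    """Simulate iteration-level scheduling; return the batch composition per step."""
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    waiting = deque()
    running = []
    log = []
    step = 0
    while pending or waiting or running:
        # Requests that have arrived by this step join the waiting queue.
        while pending and pending[0].arrival_step <= step:
            waiting.append(pending.popleft())
        # Scheduling decision before every forward pass: fill free slots in FIFO order.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One "forward pass": every running request emits exactly one token.
        for r in running:
            r.generated += 1
        log.append([r.rid for r in running])
        # Finished requests leave immediately, freeing their slot for the next step.
        running = [r for r in running if r.generated < r.tokens_to_generate]
        step += 1
    return log


if __name__ == "__main__":
    demo = [Request(0, 0, 2), Request(1, 0, 5), Request(2, 1, 3)]
    for step, batch in enumerate(run_iteration_level(demo, max_batch=2)):
        print(f"step {step}: batch = {batch}")
```

In the demo, request 2 arrives while both slots are taken and joins on the step right after request 0 finishes, without waiting for request 1.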
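The community’s name for this is continuous batching (TensorRT-LLM calls it in-flight batching). The term dynamic batching is sometimes used as well, but it more commonly refers to request-level batching with a wait window, which is a different technique. Orca is the paper that introduced the idea.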
What’s clever — find the instinct
The non-obvious move is recognizing that iteration-level scheduling alone does not work. If you naively put requests at different stages into the same forward pass, some operations (the matmuls in linear layers, layer norm) are happy: they act on each token independently, so they batch across tokens regardless of sequence position. But attention cannot trivially batch when each request is at a different position with a different KV-cache size, because the per-request tensors no longer share a shape.
The clever solution: selective batching. Operations are split into two categories:
- Position-independent operations. Linear projections, layer norms, MLPs. These can be applied to a flat tensor of shape `[total_tokens, hidden_dim]` regardless of which token belongs to which request. Just stack all the active tokens, do the matmul, split the result back out.
- Position-dependent operations. Attention. Each request has its own KV cache and its own current query. The cleanest implementation processes attention per-request; alternatively, with a kernel like FlashAttention or PagedAttention, batch the attention work across requests with explicit indexing.
“To apply batching and iteration-level scheduling to a Transformer model at the same time, we suggest selective batching, which applies batching only to a selected set of operations.”
This separation is what makes the design tractable. The bulk of FLOPs in a transformer are in the position-independent linear / MLP operations — those batch easily. Attention, the harder case, is small relative to the linear layers and can be handled as a per-request loop with minimal overhead.
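A minimal NumPy sketch of that split, under simplifying assumptions: a single attention head, no layer norm or MLP, and random weights. The function name `selective_batch_step` and its argument layout are hypothetical and only illustrate the idea; the linear projections run once over the flattened `[total_tokens, hidden_dim]` tensor, while attention runs in a per-request loop against each request's own KV cache.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def selective_batch_step(new_tokens, kv_caches, w_qkv, w_out):
    """One toy attention layer with selective batching.

    new_tokens: list of [n_i, hidden] arrays, one per request
                (n_i > 1 for a prompt, n_i == 1 for a decode step).
    kv_caches:  list of (K, V) pairs, each [t_i, hidden], one per request.
    """
    lengths = [x.shape[0] for x in new_tokens]
    bounds = np.cumsum(lengths)[:-1]  # split points between requests

    # Position-independent part: one matmul over all tokens of all requests.
    flat = np.concatenate(new_tokens, axis=0)      # [total_tokens, hidden]
    q, k, v = np.split(flat @ w_qkv, 3, axis=-1)

    # Position-dependent part: attention per request, each with its own cache.
    outputs, updated = [], []
    per_request = zip(np.split(q, bounds), np.split(k, bounds),
                      np.split(v, bounds), kv_caches)
    for q_i, k_i, v_i, (k_cache, v_cache) in per_request:
        k_all = np.concatenate([k_cache, k_i], axis=0)
        v_all = np.concatenate([v_cache, v_i], axis=0)
        scores = q_i @ k_all.T / np.sqrt(q_i.shape[-1])
        # Causal mask: a new token sees cached tokens and new tokens up to itself.
        q_pos = k_cache.shape[0] + np.arange(q_i.shape[0])[:, None]
        k_pos = np.arange(k_all.shape[0])[None, :]
        scores = np.where(k_pos <= q_pos, scores, -np.inf)
        outputs.append(softmax(scores) @ v_all)
        updated.append((k_all, v_all))

    # Merge back into a flat tensor for the batched output projection.
    merged = np.concatenate(outputs, axis=0)
    return merged @ w_out, updated
```

A useful property to check is that a request's output does not depend on which other requests share the step: running a decode token alone or next to another request's prompt should give the same result up to floating-point noise.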
The third clever move is the architecture for scaling. Orca splits the model across GPUs (tensor + pipeline parallelism for 175B models) and decouples the scheduler from the execution engines. The scheduler is a CPU-side process that maintains the request queue and tracks each request’s state (prompt phase vs. decode phase, current position, KV-cache pointer). Workers receive a “new batch composition for this step” instruction, fetch the relevant KV-cache rows, and execute. Importantly, the KV cache is a first-class citizen: each request’s KV state is identifiable and can be appended to as the request progresses.
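The scheduler-side bookkeeping might look roughly like the sketch below. Every name here (`Phase`, `RequestState`, `KVSlotManager`, `advance`) is hypothetical, and the admission rule is an assumption made for illustration: a request is admitted only if KV slots for its prompt plus its maximum number of new tokens can be reserved up front.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PROMPT = "prompt"  # prompt not yet processed
    DECODE = "decode"  # generating one token per iteration


@dataclass
class RequestState:
    rid: int
    prompt_len: int
    max_new_tokens: int
    phase: Phase = Phase.PROMPT
    position: int = 0  # number of tokens whose K/V are already cached


class KVSlotManager:
    """Toy admission control: counts reserved KV slots per request."""

    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.reserved = {}  # rid -> number of slots held

    def free_slots(self):
        return self.total_slots - sum(self.reserved.values())

    def try_admit(self, req):
        # Assumption: reserve for the worst case (prompt + max new tokens).
        need = req.prompt_len + req.max_new_tokens
        if need > self.free_slots():
            return False
        self.reserved[req.rid] = need
        return True

    def release(self, rid):
        # Raises KeyError on a double free instead of silently corrupting state.
        del self.reserved[rid]


def advance(req):
    """Update a request's state after it took part in one iteration."""
    if req.phase is Phase.PROMPT:
        req.position = req.prompt_len  # whole prompt is now cached
        req.phase = Phase.DECODE
    else:
        req.position += 1              # one more generated token cached
```

Keeping this state on the CPU side means the workers only need the batch composition for the current step; they never decide which request runs.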
Walkthrough: the difference in throughput
Setup: GPT-3 175B served on 8 A100 GPUs.
Prompts of varying length (32 to 512 tokens).
Generation lengths varying (50 to 500 tokens).
Old way (request-level batching, batch size 8):
Step 0: Eight requests arrive. Batch them all.
Step 1..N: Run forward pass on full batch. All requests advance one token.
Eventually: shortest request finishes. Others continue.
PROBLEM: shortest request still occupies a "batch slot" until everyone finishes.
Effective batch size when most requests are done: 1 or 2.
GPU utilization: low.
New requests arriving meanwhile: blocked. They wait for the current batch.
Tail latency: time to finish ALL eight requests.
Orca (iteration-level scheduling):
Step 0: One request arrives. Run prompt phase, then decode step.
Step 1: Two more arrive. Schedule prompt for the new ones, decode for the first.
Selective batching: prompts and decodes can be in the same forward pass
(different position-dependent attention handling, but same linear layers).
Step 5: Request 0 emits EOS. Remove it. Request 4 arrives. Insert it.
Step ...: Always running with as many requests in flight as memory allows.
GPU utilization: near peak the whole time.
New requests admitted at the next iteration boundary, not after the current batch drains.
Per-request latency: largely unaffected by how many other requests are in flight
(until the system reaches its saturation point).
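The two schedules in this walkthrough can be compared with a small step-counting model. It is a sketch under strong assumptions: every step costs the same regardless of batch contents, all requests are already queued at step 0, and prompt processing is ignored. The function names are hypothetical, and the numbers it produces are not the paper's measurements.

```python
def static_batching_steps(lengths, batch_size):
    """Request-level batching: each batch runs until its longest request ends."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: a freed slot is refilled before the next step."""
    queue = list(lengths)[::-1]  # pop() from the end gives FIFO order
    running = []                 # remaining tokens of each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop())
        running = [remaining - 1 for remaining in running]  # one decode step
        steps += 1
        running = [remaining for remaining in running if remaining > 0]
    return steps


if __name__ == "__main__":
    lengths = [50, 500, 50, 500, 50, 500, 50, 500]
    print("static:    ", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With eight requests alternating between 50 and 500 tokens and two slots, this toy model needs 2000 steps under static batching and 1150 under continuous batching, because short requests no longer hold a slot while a long batch-mate finishes.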
The numbers from the paper: 36.9x throughput improvement at the same level of latency, vs NVIDIA’s FasterTransformer baseline. The baseline batches at request granularity; Orca’s iteration-level scheduling is what opens the gap.
“Our evaluation on a GPT-3 175B model shows that Orca can significantly outperform NVIDIA FasterTransformer in terms of both latency and throughput: 36.9x throughput improvement at the same level of latency.”
Does it work? What breaks?
| Metric | FasterTransformer | Orca |
|---|---|---|
| Throughput @ 1s P99 latency (GPT-3 175B) | baseline | 36.9x |
| Median latency, low load | similar | similar |
| Tail latency under load | high (waits for batch-mate) | low |
| Memory utilization | wasted on padded slots | tight (only active KV rows) |
The headline 36.9x is partly system engineering, partly algorithmic. The algorithmic gain (flexible batch composition) likely accounts for most of it; the rest is implementation discipline (avoiding padding, efficient KV-cache addressing).
What breaks:
- Mixed prompt and decode phases. A long prompt phase in the same batch as ongoing decode phases interferes: the prompt’s compute dominates, slowing decode latency for other requests. (Splitwise, a follow-up, separates prompt and decode onto different GPU pools.)
- KV-cache memory pressure. With many in-flight requests, you need enough memory for all their KV states. Orca keeps each request’s KV cache in contiguous memory, reserved for the request’s maximum length, so slots are held for tokens that may never be generated. PagedAttention later improved this with paging.
- Scheduler complexity. A bug in the scheduler (mismatched KV-cache pointers, double-freed slots) silently corrupts outputs. Production systems require tight invariants.
- Per-request state. Generation parameters (temperature, top-p, stop tokens) differ per request. The implementation has to track these per-slot.
- Fairness and starvation. Long-running generations hold their slots until they finish, so they could in principle starve newly arrived short requests. Orca’s scheduler is FIFO; production systems often add priority.
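On the last point, one way a production system might layer priority on top of FIFO admission is a heap keyed by priority and arrival order. This is a hypothetical sketch (the `AdmissionQueue` class is not from the paper); with all priorities equal it degenerates to plain FIFO.

```python
import heapq
import itertools


class AdmissionQueue:
    """Waiting queue ordered by (priority, arrival order); lower value goes first."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves FIFO order

    def push(self, rid, priority=0):
        heapq.heappush(self._heap, (priority, next(self._arrival), rid))

    def pop_batch(self, free_slots):
        """Return up to `free_slots` request ids to admit before the next iteration."""
        admitted = []
        while self._heap and len(admitted) < free_slots:
            _, _, rid = heapq.heappop(self._heap)
            admitted.append(rid)
        return admitted
```

Because the scheduler already makes a decision before every iteration, swapping the FIFO queue for a structure like this changes only which waiting request fills a freed slot, not the execution engine.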
So what?
For a practitioner serving generative models in production:
- Continuous batching is non-negotiable. An LLM serving system that does not implement iteration-level scheduling can leave an order of magnitude or more of throughput on the table (the paper reports 36.9x against FasterTransformer). vLLM, Text Generation Inference, TensorRT-LLM, and SGLang all implement it.
- Throughput vs. latency trade-off shifts. Without continuous batching, you choose one or the other. With it, you mostly choose throughput; latency stays bounded by per-token forward time.
- KV-cache management is the next frontier. Orca’s contiguous-per-request KV layout is wasteful at high concurrency. PagedAttention (vLLM) extends Orca by paging KV memory. Both papers are required reading.
- Selective batching is the reusable abstraction. Any sequence-model serving system (not just transformers) can apply it: position-independent ops batch trivially; position-dependent ops are handled separately.
- For systems interview prep: “How would you serve a transformer LLM efficiently?” The first answer is continuous batching. The second is paged KV cache. The third is prompt/decode disaggregation (Splitwise).
For Saikat’s career gap on serving systems: this paper is the canonical L5 reference. It maps cleanly to the operating-systems literature: it is essentially OS-style fine-grained scheduling for ML compute, where a decision is made every iteration, much like a time slice, rather than once per job. Knowing both the algorithmic move (iteration-level + selective batching) and the engineering tradeoffs (memory pressure, fairness, prompt-decode interference) is enough to discuss most LLM-serving system design questions.
Connections
- continuous-batching — Orca is the paper that introduced this
- inference-efficiency — broader category
- kv-cache — Orca treats KV cache as a first-class scheduled resource
- pagedattention-vllm — vLLM extends Orca with paged KV-cache management
- splitwise-llm-inference-phase-splitting — addresses prompt-decode interference Orca exposes
- attention-is-all-you-need — the architecture being served
- language-models-are-few-shot-learners — workload generator (GPT-3-style autoregressive LLMs)
Citation
Yu, G. I., Jeong, J. S., Kim, G. W., Kim, S., & Chun, B. G. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu