What It Is
Drawing outputs from a probability distribution rather than always taking the most probable option. In language models, sampling means picking the next token according to the model’s probability distribution (temperature > 0) instead of always selecting the highest-probability token (greedy decoding).
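The distinction can be shown on a toy next-token distribution (vocabulary and probabilities here are hypothetical, just for illustration):

```python
import numpy as np

# Toy next-token distribution over a 4-token vocabulary (hypothetical values).
vocab = ["the", "a", "cat", "dog"]
probs = np.array([0.5, 0.3, 0.15, 0.05])

# Greedy decoding: always pick the single most probable token.
greedy = vocab[int(np.argmax(probs))]
print(greedy)  # always "the"

# Sampling: draw according to the distribution, so lower-probability
# tokens appear some of the time and repeated runs differ.
rng = np.random.default_rng()
sampled = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(5)]
print(sampled)  # varies run to run
```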
Why It Matters
Greedy decoding is deterministic but brittle — committing to the locally best token at each step can lead to globally wrong outputs, especially in multi-step reasoning. Sampling introduces diversity: multiple runs of the same prompt produce different reasoning paths, different phrasings, different solutions. This diversity is exploitable: aggregate many samples to get better answers than any single greedy run could provide.
How It Works
At each generation step, instead of taking `argmax(z_t)` over the logits `z_t`, draw the next token `x_t ~ softmax(z_t / T)`, where `T` is the temperature. Higher `T` → flatter distribution → more diverse outputs. Lower `T` → approaches greedy decoding. Common strategies:
- Temperature sampling: divide logits by `T` (equivalently, scale by `1/T`) before softmax
- Top-k sampling: restrict sampling to the k most probable tokens
- Top-p (nucleus) sampling: restrict to the smallest set of tokens whose cumulative probability exceeds p
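The three strategies above can be sketched as one decoding step over raw logits. This is a minimal illustration, not any particular library's implementation; the function name and parameters are chosen for clarity:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """One decoding step: temperature scaling, then optional top-k / top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    # Temperature: divide logits by T before softmax; T -> 0 approaches greedy.
    probs = softmax(logits / temperature)

    if top_k is not None:
        # Keep only the k most probable tokens; zero out the rest.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:
        # Nucleus: smallest set of tokens whose cumulative probability exceeds p.
        order = np.argsort(probs)[::-1]          # tokens by descending probability
        cum = np.cumsum(probs[order])
        keep = (cum - probs[order]) < top_p      # keep while mass before token < p
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)

    probs /= probs.sum()  # renormalize after filtering
    return int(rng.choice(len(probs), p=probs))
```

For example, `sample_next(logits, temperature=0.1)` concentrates nearly all mass on the argmax token, while `temperature=1.5` spreads it out; `top_k=1` is exactly greedy decoding regardless of temperature.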
Diversity from sampling enables self-consistency: sample k reasoning chains, majority-vote on the final answer. Correct answers tend to cluster across independent samples; wrong answers tend to be idiosyncratic.