What It Is
Drawing outputs from a probability distribution rather than always taking the most probable option. In language models, sampling means picking the next token according to the model’s probability distribution (temperature > 0) instead of always selecting the highest-probability token (greedy decoding).
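The distinction can be shown on a toy next-token distribution (vocabulary and probabilities here are hypothetical, just for illustration):

```python
import numpy as np

# Toy next-token distribution over a 4-token vocabulary (hypothetical values).
vocab = ["the", "a", "cat", "dog"]
probs = np.array([0.5, 0.3, 0.15, 0.05])

# Greedy decoding: always pick the single most probable token.
greedy = vocab[int(np.argmax(probs))]
print(greedy)  # always "the"

# Sampling: draw according to the distribution, so lower-probability
# tokens appear some of the time and repeated runs differ.
rng = np.random.default_rng()
sampled = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(5)]
print(sampled)  # varies run to run
```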
Why It Matters
Greedy decoding is deterministic but brittle — committing to the locally best token at each step can lead to globally wrong outputs, especially in multi-step reasoning. Sampling introduces diversity: multiple runs of the same prompt produce different reasoning paths, different phrasings, different solutions. This diversity is exploitable: aggregate many samples to get better answers than any single greedy run could provide.
How It Works
At each generation step, instead of taking `argmax(z_t)` over the logits `z_t`, draw the next token `x_t ~ softmax(z_t / T)`, where `T` is the temperature. Higher `T` → flatter distribution → more diverse outputs. Lower `T` → approaches greedy decoding. Common strategies:
- Temperature sampling: divide logits by `T` (equivalently, scale by `1/T`) before softmax
- Top-k sampling: restrict sampling to the k most probable tokens
- Top-p (nucleus) sampling: restrict to the smallest set of tokens whose cumulative probability exceeds p
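The three strategies above can be sketched as one decoding step over raw logits. This is a minimal illustration, not any particular library's implementation; the function name and parameters are chosen for clarity:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """One decoding step: temperature scaling, then optional top-k / top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    # Temperature: divide logits by T before softmax; T -> 0 approaches greedy.
    probs = softmax(logits / temperature)

    if top_k is not None:
        # Keep only the k most probable tokens; zero out the rest.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:
        # Nucleus: smallest set of tokens whose cumulative probability exceeds p.
        order = np.argsort(probs)[::-1]          # tokens by descending probability
        cum = np.cumsum(probs[order])
        keep = (cum - probs[order]) < top_p      # keep while mass before token < p
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)

    probs /= probs.sum()  # renormalize after filtering
    return int(rng.choice(len(probs), p=probs))
```

For example, `sample_next(logits, temperature=0.1)` concentrates nearly all mass on the argmax token, while `temperature=1.5` spreads it out; `top_k=1` is exactly greedy decoding regardless of temperature.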
Diversity from sampling enables self-consistency: sample k reasoning chains, majority-vote on the final answer. Correct answers tend to cluster across independent samples; wrong answers tend to be idiosyncratic.