Top-K Sampling
The Shortlist Round
6 min read
A hiring manager receives 500 CVs but only interviews the top 10.
Before making a decision, they cut the field to the 10 best candidates. Randomness still plays a part (they might pick candidate 7 over candidate 3 based on subtle factors), but the truly unsuitable 490 are eliminated first. Top-K works the same way for AI word selection: only the K most probable next tokens are eligible, and the model then samples one of them.
In Plain English
Top-K sampling limits the AI's word choices to the K most likely next words at each step. It prevents the AI from picking very unlikely or bizarre words while still allowing some creative variation.
The Technical Picture
Top-K sampling filters the probability distribution to keep only the K highest-probability tokens before renormalisation and sampling. This eliminates the long tail of low-probability tokens (reducing incoherence) while preserving diversity among the most plausible continuations.
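The filter-renormalise-sample loop can be sketched in a few lines of NumPy. This is an illustrative implementation, not any particular library's API; the function name `top_k_sample` and the toy logits are assumptions for the example.

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Sample a token index from only the k highest-probability logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    # Find the indices of the k largest logits; mask everything else to -inf
    # so those tokens get zero probability after the softmax.
    top_idx = np.argpartition(logits, -k)[-k:]
    masked = np.full_like(logits, -np.inf)
    masked[top_idx] = logits[top_idx]
    # Softmax over the surviving logits renormalises the shortlist to sum to 1.
    probs = np.exp(masked - masked[top_idx].max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Token 0 is most likely; with k=2 only tokens 0 and 1 are ever eligible.
logits = [2.0, 1.0, 0.5, -3.0]
print(top_k_sample(logits, k=2))
```

Note that with `k=1` the mask leaves a single token with probability 1, which is exactly greedy decoding.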
Real-World Examples
- Most production LLM APIs allow setting top_k as a generation parameter
- K=40 is a common default — the top 40 words compete at each step
- K=1 is greedy decoding — always picks the single most likely word
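To make the effect of K concrete, here is a toy 5-token distribution truncated at K=2. The probability values are hypothetical, chosen only to show how the surviving tokens are renormalised.

```python
import numpy as np

# Hypothetical next-token probabilities over a 5-token vocabulary.
probs = np.array([0.45, 0.30, 0.15, 0.07, 0.03])
k = 2

# Keep the K most probable tokens, zero out the rest,
# and renormalise the shortlist so it sums to 1.
keep = np.argsort(probs)[-k:]
shortlist = np.zeros_like(probs)
shortlist[keep] = probs[keep]
shortlist /= shortlist.sum()

print(shortlist)  # tokens 0 and 1 become 0.6 and 0.4; all others are 0
```

The 0.45 and 0.30 tokens now split the full probability mass (0.6 and 0.4), while the long tail is eliminated entirely.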
Top-K creates a shortlist of likely words before picking one — reducing nonsense while keeping variety.