Top-P (Nucleus) Sampling
The Budget Bucket
6 min read
You have ₹1000 for snacks. You buy the most popular items until the budget runs out — however many that takes.
Some days 3 items exhaust your budget; other days 7 items fit. The number changes, but the budget is constant. Top-P works like this budget: instead of picking a fixed number of words (Top-K), it selects words until their combined probability reaches the threshold P. The shortlist size adjusts automatically based on how confident the model is.
In Plain English
Top-P (Nucleus Sampling) selects the smallest group of words whose combined probability adds up to P%. When the model is confident, fewer words qualify; when it's uncertain, more do. It's more adaptive than Top-K.
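A tiny worked example makes the adaptive shortlist concrete. This is a minimal Python sketch with two made-up next-token distributions (the numbers are illustrative, not from any real model):

```python
# Hypothetical next-token distributions, highest probability first.
confident = [0.70, 0.15, 0.08, 0.04, 0.02, 0.01]   # model is sure
uncertain = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]   # model is unsure

def nucleus_size(probs, p=0.9):
    """Count how many tokens it takes to cover probability mass p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

print(nucleus_size(confident))  # 3  (0.70 + 0.15 + 0.08 = 0.93 >= 0.9)
print(nucleus_size(uncertain))  # 6  (it takes all six tokens to pass 0.9)
```

Same threshold, very different shortlist sizes: that is the adaptivity Top-K lacks.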
The Technical Picture
Top-P (nucleus) sampling draws from the minimal set of tokens whose cumulative probability mass reaches the threshold p. This dynamically adjusts the effective vocabulary size: a small nucleus when the model is confident, a larger one when it is uncertain. It is often preferred over Top-K for its adaptive behaviour.
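As a sketch of the mechanism just described, here is a minimal pure-Python implementation. The function name and dict-based interface are invented for illustration; real inference stacks do this over logit tensors on the GPU:

```python
import random

def top_p_sample(token_probs, p=0.9, rng=random):
    """Nucleus sampling sketch: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, then sample.

    token_probs: dict mapping token -> probability (summing to ~1).
    """
    # Rank tokens by probability, highest first.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)

    # Build the nucleus: stop once cumulative mass reaches p.
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break

    # Sample within the nucleus; random.choices normalizes the weights.
    tokens = [t for t, _ in nucleus]
    weights = [w for _, w in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.1, "dog": 0.06, "xylophone": 0.04}
# With p=0.9 the nucleus is {"the", "a", "cat"} (0.5 + 0.3 + 0.1 = 0.9),
# so "dog" and "xylophone" can never be sampled.
print(top_p_sample(probs, p=0.9))
```

Note the renormalization step: after truncating the tail, the surviving probabilities are rescaled so they still sum to 1 before sampling.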
Real-World Examples
- Claude's API defaults to top_p=1 (full nucleus) paired with temperature
- P=0.9 is a common production setting: sample from the smallest set of tokens covering 90% of the probability mass
- Most modern LLMs use Top-P over Top-K as the primary sampling control
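Since Top-P is usually paired with temperature, here is a hedged sketch of the typical order of operations: temperature rescales the logits, softmax converts them to probabilities, and Top-P trims the tail before sampling. The token names and logit values below are invented for illustration:

```python
import math
import random

def sample_with_temperature_and_top_p(logits, temperature=0.8, p=0.9):
    """Illustrative pipeline: temperature-scaled softmax, then Top-P.

    logits: dict mapping token -> raw (unnormalized) model score.
    """
    # Temperature-scaled softmax (subtract max for numerical stability).
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = {tok: e / total for tok, e in zip(logits, exps)}

    # Top-P truncation: keep the head of the distribution up to mass p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break

    tokens = [t for t, _ in nucleus]
    weights = [w for _, w in nucleus]
    return random.choices(tokens, weights=weights, k=1)[0]

logits = {"yes": 5.0, "no": 3.0, "maybe": 1.0, "junk": -4.0}
print(sample_with_temperature_and_top_p(logits))
```

Lowering the temperature sharpens the distribution, which in turn shrinks the nucleus; the two knobs interact, which is why many providers suggest tuning one at a time.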
Top-P adapts the word shortlist based on model confidence — smarter than Top-K's fixed list.