Top-P (Nucleus) Sampling
The Budget Bucket
6 min read
You have ₹1000 for snacks. You buy the most popular items until the budget runs out — however many that takes.
Some days 3 items exhaust your budget; other days 7 items fit. The number changes, but the budget is constant. Top-P works like this budget: instead of picking a fixed number of words (Top-K), it selects words until their combined probability reaches the threshold P. The shortlist size adjusts automatically based on how confident the model is.
In Plain English
Top-P (Nucleus Sampling) selects the smallest group of words whose combined probability adds up to P%. When the model is confident, fewer words qualify; when it's uncertain, more do. It's more adaptive than Top-K.
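A tiny worked example makes the adaptive shortlist concrete. This is a minimal Python sketch with two made-up next-token distributions (the numbers are illustrative, not from any real model):

```python
# Hypothetical next-token distributions, highest probability first.
confident = [0.70, 0.15, 0.08, 0.04, 0.02, 0.01]   # model is sure
uncertain = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]   # model is unsure

def nucleus_size(probs, p=0.9):
    """Count how many tokens it takes to cover probability mass p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

print(nucleus_size(confident))  # 3  (0.70 + 0.15 + 0.08 = 0.93 >= 0.9)
print(nucleus_size(uncertain))  # 6  (it takes all six tokens to pass 0.9)
```

Same threshold, very different shortlist sizes: that is the adaptivity Top-K lacks.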
The Technical Picture
Top-P (nucleus) sampling draws from the minimal set of tokens whose cumulative probability mass reaches the threshold p. This dynamically adjusts the effective vocabulary size: a small nucleus when the model is confident, a larger one when it is uncertain. It is often preferred over Top-K for its adaptive behaviour.
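As a sketch of the mechanism just described, here is a minimal pure-Python implementation. The function name and dict-based interface are invented for illustration; real inference stacks do this over logit tensors on the GPU:

```python
import random

def top_p_sample(token_probs, p=0.9, rng=random):
    """Nucleus sampling sketch: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, then sample.

    token_probs: dict mapping token -> probability (summing to ~1).
    """
    # Rank tokens by probability, highest first.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)

    # Build the nucleus: stop once cumulative mass reaches p.
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break

    # Sample within the nucleus; random.choices normalizes the weights.
    tokens = [t for t, _ in nucleus]
    weights = [w for _, w in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.5, "a": 0.3, "cat": 0.1, "dog": 0.06, "xylophone": 0.04}
# With p=0.9 the nucleus is {"the", "a", "cat"} (0.5 + 0.3 + 0.1 = 0.9),
# so "dog" and "xylophone" can never be sampled.
print(top_p_sample(probs, p=0.9))
```

Note the renormalization step: after truncating the tail, the surviving probabilities are rescaled so they still sum to 1 before sampling.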
Real-World Examples
- Claude's API defaults to top_p=1 (full nucleus) paired with temperature
- P=0.9 is a common production setting: sample from the smallest set of tokens covering 90% of the probability mass
- Most modern LLMs use Top-P over Top-K as the primary sampling control
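Since Top-P is usually paired with temperature, here is a hedged sketch of the typical order of operations: temperature rescales the logits, softmax converts them to probabilities, and Top-P trims the tail before sampling. The token names and logit values below are invented for illustration:

```python
import math
import random

def sample_with_temperature_and_top_p(logits, temperature=0.8, p=0.9):
    """Illustrative pipeline: temperature-scaled softmax, then Top-P.

    logits: dict mapping token -> raw (unnormalized) model score.
    """
    # Temperature-scaled softmax (subtract max for numerical stability).
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = {tok: e / total for tok, e in zip(logits, exps)}

    # Top-P truncation: keep the head of the distribution up to mass p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break

    tokens = [t for t, _ in nucleus]
    weights = [w for _, w in nucleus]
    return random.choices(tokens, weights=weights, k=1)[0]

logits = {"yes": 5.0, "no": 3.0, "maybe": 1.0, "junk": -4.0}
print(sample_with_temperature_and_top_p(logits))
```

Lowering the temperature sharpens the distribution, which in turn shrinks the nucleus; the two knobs interact, which is why many providers suggest tuning one at a time.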
Top-P adapts the word shortlist based on model confidence — smarter than Top-K's fixed list.