Latency
The Difference Between a Fast Cook and a Slow Cook
4 min read
A Maggi takes 2 minutes. A biryani takes 2 hours. Both are delicious, but you'd choose based on how hungry you are right now.
In AI, latency is how long the model takes to respond. A small, fast model (Maggi) responds in under a second — great for chatbots. A large, powerful model (biryani) takes longer — better for complex tasks. Choosing the right model often means balancing speed against quality.
In Plain English
Latency is the time between sending a request to an AI model and receiving its response. Lower latency means faster responses. Smaller models are faster; larger models are slower but usually better.
The Technical Picture
Latency in LLM inference is driven primarily by model size, hardware (GPU/TPU), batching strategy, and token generation speed. The two key metrics are time-to-first-token (TTFT), how long until the first output token appears, and tokens-per-second (TPS), how quickly the rest of the response arrives. Streaming reduces perceived latency by showing output as it is generated.
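Both metrics can be measured from any token stream with a stopwatch. Here is a minimal sketch in Python; `fake_model` is a made-up stand-in that emits a token every 10 ms, not a real API:

```python
import time
from typing import Iterable, Iterator

def measure_stream(tokens: Iterable[str]) -> dict:
    """Measure time-to-first-token (TTFT) and tokens-per-second (TPS)
    for any iterable that yields tokens as they are generated."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        count += 1
    total = time.monotonic() - start
    return {
        "ttft_s": ttft,
        "tokens": count,
        "tps": count / total if total > 0 else float("inf"),
    }

def fake_model(n_tokens: int = 20, delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming model: one token every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

stats = measure_stream(fake_model())
print(f"TTFT: {stats['ttft_s'] * 1000:.0f} ms, TPS: {stats['tps']:.0f}")
```

The same `measure_stream` wrapper works on a real SDK's streaming iterator, which is why TTFT and TPS are usually reported together: TTFT tells you how soon the user sees anything, TPS how long they wait for the rest.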
Real-World Examples
- Claude Haiku responds in ~0.5 seconds; Claude Opus takes 3–5 seconds
- Voice assistants need sub-500ms latency to feel natural
- Perplexity streams results to reduce perceived latency
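The streaming effect in the last example is easy to demonstrate with a toy simulation (no real model involved; the 50-token, 20 ms-per-token generator below is invented for illustration):

```python
import time

def generate(n: int = 50, per_token: float = 0.02):
    """Toy generator: emits one word every `per_token` seconds."""
    for _ in range(n):
        time.sleep(per_token)
        yield "word"

# Without streaming: the user stares at a blank screen until
# the entire response has been generated.
start = time.monotonic()
full = " ".join(generate())
blocking_wait = time.monotonic() - start

# With streaming: the user sees the first word almost immediately,
# even though total generation time is unchanged.
start = time.monotonic()
stream = generate()
first_word = next(stream)
first_word_wait = time.monotonic() - start
for _ in stream:  # remaining words keep arriving while the user reads
    pass

print(f"blocking: {blocking_wait:.2f}s, "
      f"first word via streaming: {first_word_wait:.2f}s")
```

Total generation time is identical in both runs; streaming only changes when the user starts seeing output, which is why it reduces *perceived* latency rather than actual latency.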
Latency = how fast the AI responds. Speed vs. quality is the fundamental tradeoff in AI deployment.