AI Fundamentals

Latency

The Difference Between a Fast Cook and a Slow Cook


The Analogy

A Maggi takes 2 minutes. A biryani takes 2 hours. Both are delicious, but you'd choose based on how hungry you are right now.

In AI, latency is how long the model takes to respond. A small, fast model (Maggi) responds in under a second — great for chatbots. A large, powerful model (biryani) takes longer — better for complex tasks. Choosing the right model often means balancing speed against quality.

In Plain English

Latency is the time between sending a request to an AI model and receiving its response. Lower latency means faster responses. Smaller models respond faster; larger models respond more slowly but usually produce higher-quality output.
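In code, measuring latency is just timing the gap between request and response. A minimal sketch, where `call_model` is a hypothetical stand-in for a real API call and `time.sleep` simulates inference time:

```python
import time

def call_model(prompt):
    """Hypothetical model call; sleep stands in for real inference work."""
    time.sleep(0.3)
    return f"response to: {prompt}"

start = time.perf_counter()
reply = call_model("Hello")
latency = time.perf_counter() - start
print(f"Latency: {latency * 1000:.0f} ms")
```

The same timing pattern works around any real API client.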


The Technical Picture

Latency in LLM inference is primarily determined by model size, hardware (GPU/TPU), batching strategy, and output length. The key metrics are time-to-first-token (TTFT), how long before the first token appears, and tokens-per-second (TPS), the generation throughput after that. Streaming reduces perceived latency by showing output as it generates.
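TTFT and TPS can be computed by timing a token stream. A sketch using a simulated generator in place of a real streaming API (token count and per-token delay are made-up values):

```python
import time

def generate_tokens(n_tokens=20, delay=0.05):
    """Simulated model: yields one token at a time, like a streaming API."""
    for i in range(n_tokens):
        time.sleep(delay)  # pretend per-token compute time
        yield f"tok{i}"

start = time.perf_counter()
first_token_at = None
count = 0
for token in generate_tokens():
    if first_token_at is None:
        first_token_at = time.perf_counter()  # marks time-to-first-token
    count += 1
end = time.perf_counter()

ttft = first_token_at - start        # time-to-first-token (TTFT)
tps = count / (end - start)          # tokens-per-second (TPS)
print(f"TTFT: {ttft:.3f}s, throughput: {tps:.1f} tokens/sec")
```

With a real client you would wrap its streaming iterator the same way.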

Real-World Examples

  • Claude Haiku responds in ~0.5 seconds; Claude Opus takes 3–5 seconds
  • Voice assistants need sub-500ms latency to feel natural
  • Perplexity streams results to reduce perceived latency
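The streaming effect in the last example can be demonstrated directly: the user's wait until *something* appears is much shorter than the wait for the full response. A simulated comparison (token count and delay are illustrative):

```python
import time

def stream_tokens(n=10, per_token=0.05):
    """Simulated streaming model output."""
    for _ in range(n):
        time.sleep(per_token)
        yield "word "

# Without streaming: the user sees nothing until the whole response is done.
start = time.perf_counter()
full = "".join(stream_tokens())
full_wait = time.perf_counter() - start

# With streaming: the user sees the first token almost immediately.
start = time.perf_counter()
gen = stream_tokens()
first = next(gen)
first_wait = time.perf_counter() - start
for _ in gen:  # drain the rest in the background
    pass

print(f"wait for full response: {full_wait:.2f}s, "
      f"wait for first token: {first_wait:.2f}s")
```

Total generation time is unchanged; only the perceived latency drops.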

Key Takeaway

Latency = how fast the AI responds. Speed vs. quality is the fundamental tradeoff in AI deployment.
