Model Quantisation
The Compressed Music File
An MP3 sounds almost as good as a WAV file but takes a tenth of the storage — quantisation does the same for AI models.
A 70B parameter model in full precision (float32) needs 280GB of memory — a server rack. Quantise it to 4-bit and it fits in 35GB — a high-end laptop. You lose a small amount of precision on each parameter, like converting a WAV to MP3, but the quality drop is often barely noticeable in practice. This is what made running AI locally possible.
In Plain English
Quantisation reduces the precision of a model's parameters — from 32-bit or 16-bit floating point numbers to smaller formats like 8-bit or 4-bit integers. This shrinks the model dramatically with minimal quality loss, making large models runnable on consumer hardware.
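The core idea can be sketched in a few lines. Below is a minimal illustration of symmetric ("absmax") 8-bit quantisation, where one float scale per tensor maps the largest-magnitude weight to the int8 range; the function names are made up for this sketch, not from any real library.

```python
# Minimal sketch of symmetric 8-bit quantisation (illustrative only).
# "Absmax" scaling: the largest-magnitude weight maps to +/-127.

def quantise_int8(weights):
    """Map float weights to small integers plus a single float scale."""
    scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor
    return [round(w / scale) for w in weights], scale

def dequantise(q_weights, scale):
    """Recover approximate floats from the quantised values."""
    return [q * scale for q in q_weights]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantise_int8(weights)
approx = dequantise(q, scale)
# Each recovered weight is within half a scale step of the original,
# but is stored in 1 byte instead of 4.
```

Each weight loses at most half a quantisation step of precision, which is the "MP3 compression" analogy made concrete: the information loss is bounded and usually small relative to the weight values.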
The Technical Picture
Quantisation maps floating-point weights to lower-bit representations such as INT8 or INT4; methods like GPTQ and file formats like GGUF package these quantised weights for distribution and inference. Post-training quantisation (PTQ) is applied to a finished model; quantisation-aware training (QAT) simulates the reduced precision during training or fine-tuning so the model adapts to it. Tools like llama.cpp and Ollama use GGUF quantised models for CPU inference.
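In practice, low-bit formats rarely use one scale for a whole tensor. A rough sketch of block-wise 4-bit quantisation, in the spirit of (but much simpler than) GGUF's Q4 types: each block of 32 weights gets its own scale, and values are clamped to the signed 4-bit range. The real format additionally packs two 4-bit values per byte; that detail is omitted here.

```python
# Rough sketch of block-wise 4-bit quantisation, simplified from the idea
# behind GGUF's Q4 types: one float scale per block of 32 weights,
# integer values restricted to the signed 4-bit range -8..7.

BLOCK = 32

def quantise_q4_blocks(weights):
    """Return a list of (scale, int_values) pairs, one per block."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        scale = max(abs(w) for w in chunk) / 7 or 1.0  # avoid zero scale
        q = [max(-8, min(7, round(w / scale))) for w in chunk]
        blocks.append((scale, q))
    return blocks

def dequantise_q4_blocks(blocks):
    """Expand the blocks back into approximate float weights."""
    out = []
    for scale, q in blocks:
        out.extend(v * scale for v in q)
    return out
```

Per-block scales are what keep 4-bit quality acceptable: an outlier weight only coarsens the 32 values in its own block, not the entire tensor.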
Real-World Examples
- Ollama running Llama 3 8B Q4 quantised on a MacBook Air
- GGUF format files downloadable from Hugging Face for local inference
- 4-bit quantisation cuts memory 8x versus float32 (2x versus float16), often with less than 5% quality loss on benchmarks
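The memory figures quoted throughout this piece are simple arithmetic: parameters times bits per parameter. A back-of-envelope helper (weights only; the KV cache and activations need extra memory on top):

```python
def model_size_gb(params, bits_per_param):
    """Weights-only memory in decimal gigabytes; ignores KV cache and activations."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:8s}: {model_size_gb(70e9, bits):.0f} GB")
# float32 : 280 GB
# float16 : 140 GB
# int8    : 70 GB
# int4    : 35 GB
```

This reproduces the numbers above: 280GB at float32 down to 35GB at 4-bit for a 70B model.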
Quantisation is how a 70B model fits on your laptop — trading tiny precision loss for massive memory savings.