Model Quantisation
The Compressed Music File
An MP3 sounds almost as good as a WAV file but takes a tenth of the storage — quantisation does the same for AI models.
A 70B parameter model in full precision (float32) needs 280GB of memory — a server rack. Quantise it to 4-bit and it fits in 35GB — a high-end laptop. You lose a small amount of precision on each parameter, like converting a WAV to MP3, but the quality drop is often barely noticeable in practice. This is what made running AI locally possible.
In Plain English
Quantisation reduces the precision of a model's parameters — from 32-bit or 16-bit floating point numbers to smaller formats like 8-bit or 4-bit integers. This shrinks the model dramatically with minimal quality loss, making large models runnable on consumer hardware.
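The core idea can be sketched in a few lines. Below is a minimal illustration of symmetric ("absmax") 8-bit quantisation, where one float scale per tensor maps the largest-magnitude weight to the int8 range; the function names are made up for this sketch, not from any real library.

```python
# Minimal sketch of symmetric 8-bit quantisation (illustrative only).
# "Absmax" scaling: the largest-magnitude weight maps to +/-127.

def quantise_int8(weights):
    """Map float weights to small integers plus a single float scale."""
    scale = max(abs(w) for w in weights) / 127  # one scale for the whole tensor
    return [round(w / scale) for w in weights], scale

def dequantise(q_weights, scale):
    """Recover approximate floats from the quantised values."""
    return [q * scale for q in q_weights]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantise_int8(weights)
approx = dequantise(q, scale)
# Each recovered weight is within half a scale step of the original,
# but is stored in 1 byte instead of 4.
```

Each weight loses at most half a quantisation step of precision, which is the "MP3 compression" analogy made concrete: the information loss is bounded and usually small relative to the weight values.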
The Technical Picture
Quantisation maps floating-point weights to lower-bit representations such as INT8 or INT4; methods like GPTQ and file formats like GGUF package these quantised weights for distribution and inference. Post-training quantisation (PTQ) is applied to a finished model; quantisation-aware training (QAT) simulates the reduced precision during training or fine-tuning so the model adapts to it. Tools like llama.cpp and Ollama use GGUF quantised models for CPU inference.
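In practice, low-bit formats rarely use one scale for a whole tensor. A rough sketch of block-wise 4-bit quantisation, in the spirit of (but much simpler than) GGUF's Q4 types: each block of 32 weights gets its own scale, and values are clamped to the signed 4-bit range. The real format additionally packs two 4-bit values per byte; that detail is omitted here.

```python
# Rough sketch of block-wise 4-bit quantisation, simplified from the idea
# behind GGUF's Q4 types: one float scale per block of 32 weights,
# integer values restricted to the signed 4-bit range -8..7.

BLOCK = 32

def quantise_q4_blocks(weights):
    """Return a list of (scale, int_values) pairs, one per block."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        scale = max(abs(w) for w in chunk) / 7 or 1.0  # avoid zero scale
        q = [max(-8, min(7, round(w / scale))) for w in chunk]
        blocks.append((scale, q))
    return blocks

def dequantise_q4_blocks(blocks):
    """Expand the blocks back into approximate float weights."""
    out = []
    for scale, q in blocks:
        out.extend(v * scale for v in q)
    return out
```

Per-block scales are what keep 4-bit quality acceptable: an outlier weight only coarsens the 32 values in its own block, not the entire tensor.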
Real-World Examples
- Ollama running Llama 3 8B Q4 quantised on a MacBook Air
- GGUF format files downloadable from Hugging Face for local inference
- 4-bit quantisation cuts memory 8x versus float32 (2x versus float16), often with less than 5% quality loss on benchmarks
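The memory figures quoted throughout this piece are simple arithmetic: parameters times bits per parameter. A back-of-envelope helper (weights only; the KV cache and activations need extra memory on top):

```python
def model_size_gb(params, bits_per_param):
    """Weights-only memory in decimal gigabytes; ignores KV cache and activations."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:8s}: {model_size_gb(70e9, bits):.0f} GB")
# float32 : 280 GB
# float16 : 140 GB
# int8    : 70 GB
# int4    : 35 GB
```

This reproduces the numbers above: 280GB at float32 down to 35GB at 4-bit for a 70B model.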
Quantisation is how a 70B model fits on your laptop — trading tiny precision loss for massive memory savings.