Quantization: Running Big Models on Small Hardware
You don't need an H100 to run Llama-3. How quantization shrinks models from 16-bit to 4-bit with surprisingly little loss in intelligence.
The biggest barrier to local AI is VRAM (Video RAM). A standard “uncompressed” model uses 16-bit precision (FP16).
- 70B Model @ FP16 = ~140 GB VRAM.
- Consumer GPU (RTX 4090) = 24 GB VRAM.
It doesn’t fit. Not even close. Enter Quantization: the art of compressing AI models by reducing the precision of their numbers.
How it Works
Imagine a weight in a neural network is 0.123456789.
- FP16: Stores almost all that detail. (2 bytes)
- INT8: Maps it onto a grid of 256 integer levels using a scale factor. (1 byte)
- 4-bit (Q4): Rounds it even more aggressively. (0.5 bytes)
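The rounding above can be sketched in a few lines. This is a toy symmetric round-to-nearest quantizer, not any specific library's implementation; the function names and the `max_abs=1.0` range are illustrative assumptions.

```python
# Toy symmetric quantization: map a float onto a small signed-integer grid.

def quantize(w: float, max_abs: float, bits: int) -> int:
    """Round a weight to the nearest level on a 2**bits integer grid."""
    levels = 2 ** (bits - 1) - 1   # 127 for INT8, 7 for 4-bit
    scale = max_abs / levels       # real value of one integer step
    return round(w / scale)

def dequantize(q: int, max_abs: float, bits: int) -> float:
    """Recover the approximate float from its integer code."""
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return q * scale

w = 0.123456789
for bits in (8, 4):
    q = quantize(w, max_abs=1.0, bits=bits)
    print(bits, q, dequantize(q, 1.0, bits))
# INT8 recovers ~0.1260; 4-bit only ~0.1429 — the coarser grid loses detail.
```

Note how the 4-bit grid has only 15 usable levels, so the recovered value drifts further from the original than the INT8 one.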
By moving from 16-bit to 4-bit, you reduce the memory footprint by 4x.
- 70B Model @ 4-bit = ~40 GB VRAM. (Still needs 2x 4090s, but manageable!)
- 8B Model @ 4-bit = ~5 GB VRAM. (Runs on a laptop!)
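The arithmetic behind those numbers is just parameters × bytes-per-weight. A back-of-envelope sketch (weights only — real usage adds overhead for the KV cache and activations, which is why the 8B figure above lands closer to 5 GB than 4 GB; `weight_gb` is a hypothetical helper name):

```python
# Rough VRAM needed for model weights alone: params * bits / 8 bytes.

def weight_gb(params_billions: float, bits: int) -> float:
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9  # decimal gigabytes

print(weight_gb(70, 16))  # 140.0  -> 70B at FP16
print(weight_gb(70, 4))   # 35.0   -> 70B at 4-bit
print(weight_gb(8, 4))    # 4.0    -> 8B at 4-bit (plus runtime overhead)
```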
Does it make the model stupid?
Surprisingly, no. Neural networks are remarkably robust. They are “sparse” and “noisy.” Removing precision is like blurring a photo slightly—you can still recognize the face.
Research shows:
- 4-bit Quantization: Negligible performance loss (<1% perplexity increase).
- 3-bit: Noticeable degradation, but usable.
- 2-bit: Significant brain damage. The model starts speaking gibberish.
Methods of Quantization
1. Post-Training Quantization (PTQ)
You take a finished model (like Llama-3) and “round down” the weights. This is what you see with formats like GGUF (llama.cpp) or AWQ / GPTQ.
- GGUF: Best for CPU and Apple Silicon (M-series) inference via llama.cpp.
- EXL2 / GPTQ: Best for running on NVIDIA GPUs.
2. Quantization-Aware Training (QAT)
You train the model knowing it will be quantized later. The model “learns” to adapt its weights to fit into the smaller buckets. This yields much higher quality at lower bitrates (e.g., usable 2-bit models).
The 1.58-bit Era (BitNet)
The frontier of research is 1-bit LLMs (like Microsoft’s BitNet b1.58).
Instead of storing a weight as a complex decimal, every weight is just one of three values: {-1, 0, 1}. Three states carry log2(3) ≈ 1.58 bits of information, hence the name.
- No multiplication is needed (just addition).
- Massively faster compute.
- Drastically lower energy.
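The "no multiplication" claim can be seen in a toy ternary dot product. This is an illustrative sketch of the idea, not BitNet's actual kernel; `ternary_dot` is a hypothetical name.

```python
# With weights restricted to {-1, 0, 1}, multiply-accumulate collapses
# into plain additions and subtractions.

def ternary_dot(weights, activations):
    acc = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x      # +1: add the activation
        elif w == -1:
            acc -= x      # -1: subtract it
        # 0: skip the term entirely (free sparsity)
    return acc

print(ternary_dot([1, 0, -1, 1], [0.5, 2.0, 1.5, -0.25]))  # -1.25
```

On real hardware this matters because adders are far cheaper in silicon and energy than multipliers.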
Early results suggest these models perform as well as full-precision FP16 models. If this scales, we might be able to run GPT-5 on a smartphone.
Summary
If you are downloading models from Hugging Face, you will see filenames like Llama-3-8B-Instruct-v0.1.Q4_K_M.gguf.
- Q4: 4-bit quantization.
- K_M: A specific quantization mix (Medium).
Download that one. Unless you are doing scientific research, the FP16 version is a waste of memory.
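If you want to pull the quant tag out of such a filename programmatically, a naive split works for the common naming pattern shown above. This is a toy parser under that assumption; real tooling reads the metadata inside the GGUF file instead.

```python
# Extract the quantization tag from a conventionally named GGUF file.

def quant_tag(filename: str) -> str:
    # "Llama-3-8B-Instruct-v0.1.Q4_K_M.gguf" -> "Q4_K_M"
    return filename.rsplit(".", 2)[-2]

print(quant_tag("Llama-3-8B-Instruct-v0.1.Q4_K_M.gguf"))  # Q4_K_M
```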