LoRA and QLoRA: Efficient Model Fine-Tuning
How to fine-tune a massive 70B parameter model on a single consumer GPU.
Training a large AI model used to require a warehouse full of A100 GPUs. If you wanted to fine-tune GPT-3 (175 Billion parameters), you needed to update 175 billion numbers. That requires terabytes of VRAM.
Then came LoRA (Low-Rank Adaptation), allowing hobbyists to fine-tune massive models on a gaming PC.
The Problem: Full Fine-Tuning is Heavy
In traditional fine-tuning, you take the pre-trained weights ($W$) and update all of them to get $W'$. This means storing not just the weights, but also a gradient and optimizer state (for Adam, two extra values) for every single parameter.
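A rough back-of-the-envelope sketch makes that cost concrete (assuming a common mixed-precision setup: 16-bit weights and gradients plus two 32-bit Adam moment estimates per parameter):

```python
def full_finetune_memory_gb(n_params):
    """Approximate VRAM needed to fully fine-tune a model with Adam,
    assuming fp16 weights (2 B), fp16 gradients (2 B), and two fp32
    Adam moment estimates (4 B + 4 B) per parameter."""
    bytes_per_param = 2 + 2 + 4 + 4
    return n_params * bytes_per_param / 1e9

print(f"{full_finetune_memory_gb(175e9):,.0f} GB")  # GPT-3 scale: 2,100 GB
```

At 12 bytes per parameter, GPT-3's 175 billion parameters need roughly 2.1 TB of VRAM, which is where the "warehouse full of A100s" comes from.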
The Solution: LoRA (Low-Rank Adaptation)
LoRA proposes a trick: Freeze the original model. Don’t touch the main weights. Instead, inject tiny “adapter” matrices ($A$ and $B$) into the layers.
$$ W_{new} = W_{frozen} + (A \times B) $$
Instead of updating the massive $W$ matrix (dimension 10,000 x 10,000, i.e. 100 million parameters), we update two skinny matrices (10,000 x 8 and 8 x 10,000, i.e. about 160,000 parameters).
- This reduces the number of trainable parameters by over 99%.
- The result is almost as good as full fine-tuning.
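The arithmetic above can be sketched in a few lines of NumPy (a toy illustration of the idea, not an actual training implementation; `d` and `r` follow the 10,000 x 10,000 example):

```python
import numpy as np

d, r = 10_000, 8  # hidden dimension and LoRA rank from the example above

full_params = d * d            # parameters in the frozen W
lora_params = d * r + r * d    # parameters in the adapters A and B
print(f"trainable: {lora_params:,} of {full_params:,} "
      f"({lora_params / full_params:.2%})")  # 160,000 of 100,000,000 (0.16%)

# Forward pass on a small example: y = (W + A @ B) @ x, computed without
# ever materializing the d x d update matrix A @ B.
d_small = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d_small, d_small))  # frozen pre-trained weight
A = rng.standard_normal((d_small, r)) * 0.01
B = np.zeros((r, d_small))                   # zero init: adapter starts as a no-op
x = rng.standard_normal(d_small)
y = W @ x + A @ (B @ x)
assert np.allclose(y, (W + A @ B) @ x)
```

Note that the forward pass only ever multiplies by the skinny matrices, so the full $A \times B$ product never needs to exist in memory during training.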
QLoRA: Going Even Smaller
LoRA reduces the trainable parameters, but you still need to load the main model ($W_{frozen}$) into memory. A 65B parameter model takes ~130GB of VRAM just to load.
QLoRA (Quantized LoRA) fixes this. It loads the main model in 4-bit precision (instead of 16-bit), using a 4-bit NormalFloat (NF4) data type designed for normally distributed weights.
- 16-bit = high fidelity.
- 4-bit = “pixelated” numbers.
Going from 16-bit to 4-bit is a 4x reduction, compressing the model from ~130GB to ~33GB. That fits on a single 48GB GPU, and smaller models fit comfortably on consumer cards like the RTX 3090 or 4090 (a 13B model is only ~7GB in 4-bit).
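To make "pixelated numbers" concrete, here is a minimal sketch of blockwise 4-bit quantization (QLoRA actually uses the NF4 data type; this uses simpler uniform absmax quantization just to show the round-trip loss):

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Blockwise uniform absmax 4-bit quantization (a simplification of
    QLoRA's NF4; illustrative only). Each block stores 4-bit integers
    plus one float scale."""
    w = w.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7  # map to int range -7..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()
print(f"max round-trip error: {err:.3f}")  # small but nonzero: "pixelated"
```

Each weight shrinks from 16 bits to 4 (plus a small per-block scale), at the cost of a little rounding error. QLoRA gets away with this because the frozen weights only need to be "good enough": the trainable LoRA adapters stay in 16-bit and can compensate during fine-tuning.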
Why this is a revolution
Before LoRA/QLoRA, only Google and OpenAI could build custom models. Now, a developer in a basement can download Llama 3 (70B), quantize it to 4-bit, attach a LoRA adapter, and fine-tune it on their medical textbooks overnight.
It democratized the “means of production” for AI.