LoRA and QLoRA: Efficient Model Fine-Tuning
How to fine-tune a massive 70B parameter model on a single consumer GPU.
Training a large AI model used to require a warehouse full of A100 GPUs. If you wanted to fine-tune GPT-3 (175 Billion parameters), you needed to update 175 billion numbers. That requires terabytes of VRAM.
Then came LoRA (Low-Rank Adaptation), allowing hobbyists to fine-tune massive models on a gaming PC.
The Problem: Full Fine-Tuning is Heavy
In traditional fine-tuning, you take the pre-trained weights ($W$) and update all of them to get $W'$. This means storing not just the weights, but also a gradient and optimizer state (for Adam, two extra values) for every single parameter.
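A rough back-of-the-envelope sketch makes that cost concrete (assuming a common mixed-precision setup: 16-bit weights and gradients plus two 32-bit Adam moment estimates per parameter):

```python
def full_finetune_memory_gb(n_params):
    """Approximate VRAM needed to fully fine-tune a model with Adam,
    assuming fp16 weights (2 B), fp16 gradients (2 B), and two fp32
    Adam moment estimates (4 B + 4 B) per parameter."""
    bytes_per_param = 2 + 2 + 4 + 4
    return n_params * bytes_per_param / 1e9

print(f"{full_finetune_memory_gb(175e9):,.0f} GB")  # GPT-3 scale: 2,100 GB
```

At 12 bytes per parameter, GPT-3's 175 billion parameters need roughly 2.1 TB of VRAM, which is where the "warehouse full of A100s" comes from.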
The Solution: LoRA (Low-Rank Adaptation)
LoRA proposes a trick: Freeze the original model. Don’t touch the main weights. Instead, inject tiny “adapter” matrices ($A$ and $B$) into the layers.
$$ W_{new} = W_{frozen} + (A \times B) $$
Instead of updating the massive $W$ matrix (dimension 10,000 x 10,000, i.e. 100 million parameters), we update two skinny matrices (10,000 x 8 and 8 x 10,000, i.e. about 160,000 parameters).
- This reduces the number of trainable parameters by over 99%.
- The result is almost as good as full fine-tuning.
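The arithmetic above can be sketched in a few lines of NumPy (a toy illustration of the idea, not an actual training implementation; `d` and `r` follow the 10,000 x 10,000 example):

```python
import numpy as np

d, r = 10_000, 8  # hidden dimension and LoRA rank from the example above

full_params = d * d            # parameters in the frozen W
lora_params = d * r + r * d    # parameters in the adapters A and B
print(f"trainable: {lora_params:,} of {full_params:,} "
      f"({lora_params / full_params:.2%})")  # 160,000 of 100,000,000 (0.16%)

# Forward pass on a small example: y = (W + A @ B) @ x, computed without
# ever materializing the d x d update matrix A @ B.
d_small = 64
rng = np.random.default_rng(0)
W = rng.standard_normal((d_small, d_small))  # frozen pre-trained weight
A = rng.standard_normal((d_small, r)) * 0.01
B = np.zeros((r, d_small))                   # zero init: adapter starts as a no-op
x = rng.standard_normal(d_small)
y = W @ x + A @ (B @ x)
assert np.allclose(y, (W + A @ B) @ x)
```

Note that the forward pass only ever multiplies by the skinny matrices, so the full $A \times B$ product never needs to exist in memory during training.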
QLoRA: Going Even Smaller
LoRA reduces the trainable parameters, but you still need to load the main model ($W_{frozen}$) into memory. A 65B parameter model takes ~130GB of VRAM just to load.
QLoRA (Quantized LoRA) fixes this. It loads the main model in 4-bit precision (instead of 16-bit), using a 4-bit NormalFloat (NF4) data type designed for normally distributed weights.
- 16-bit = high fidelity.
- 4-bit = “pixelated” numbers.
Going from 16-bit to 4-bit is a 4x reduction, compressing the model from ~130GB to ~33GB. That fits on a single 48GB GPU, and smaller models fit comfortably on consumer cards like the RTX 3090 or 4090 (a 13B model is only ~7GB in 4-bit).
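To make "pixelated numbers" concrete, here is a minimal sketch of blockwise 4-bit quantization (QLoRA actually uses the NF4 data type; this uses simpler uniform absmax quantization just to show the round-trip loss):

```python
import numpy as np

def quantize_4bit(w, block_size=64):
    """Blockwise uniform absmax 4-bit quantization (a simplification of
    QLoRA's NF4; illustrative only). Each block stores 4-bit integers
    plus one float scale."""
    w = w.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7  # map to int range -7..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()
print(f"max round-trip error: {err:.3f}")  # small but nonzero: "pixelated"
```

Each weight shrinks from 16 bits to 4 (plus a small per-block scale), at the cost of a little rounding error. QLoRA gets away with this because the frozen weights only need to be "good enough": the trainable LoRA adapters stay in 16-bit and can compensate during fine-tuning.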
Why this is a revolution
Before LoRA/QLoRA, only Google and OpenAI could build custom models. Now, a developer in a basement can download Llama 3 (70B), quantize it to 4-bit, attach a LoRA adapter, and fine-tune it on their medical textbooks overnight.
It democratized the “means of production” for AI.