VRAM Requirements: How Much GPU Memory Do You Need?

The #1 question for anyone building a Deep Learning PC or renting cloud GPUs: “How much VRAM do I need?” System RAM (DDR5) is cheap. GPU VRAM (HBM/GDDR6X) is gold.

Here is the math to figure it out.

1. Inference (Running the Model)

For simply running a model (chatting with it), the math is simple: $$ VRAM \approx Parameters \times Bytes\ per\ Parameter $$

  • FP16 (Original): 2 bytes per parameter.
  • 8-bit (INT8): 1 byte per parameter.
  • 4-bit (Q4): 0.5 bytes per parameter.

Add 20% overhead for the “KV Cache” (context window) and temporary buffers.
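The formula above is easy to sketch in code. Here is a minimal estimator (the 20% overhead factor is the rule of thumb from this article, not a hard constant):

```python
def inference_vram_gb(params_billions: float, bytes_per_param: float,
                      overhead: float = 0.20) -> float:
    """Estimate inference VRAM: weights plus ~20% for KV cache and buffers.

    1B params at 1 byte/param occupies ~1 GB, so the unit math works out.
    """
    base_gb = params_billions * bytes_per_param
    return base_gb * (1 + overhead)

# Llama-3-8B in FP16 (2 bytes/param): 16 GB base -> ~19.2 GB, so a 20 GB+ card
print(round(inference_vram_gb(8, 2.0), 1))   # 19.2
# 70B at 4-bit (0.5 bytes/param): 35 GB base -> ~42 GB, fits in 48 GB
print(round(inference_vram_gb(70, 0.5), 1))  # 42.0
```

These two calls reproduce the 8B/FP16 and 70B/4-bit rows of the cheat sheet below.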

The Cheat Sheet

Model Size    | Precision | Base Size | VRAM Needed (Safe)
------------- | --------- | --------- | ---------------------------
8B (Llama-3)  | FP16      | 16 GB     | 20 GB (RTX 3090/4090)
8B            | 4-bit     | 4 GB      | 6 GB (RTX 3060/Laptop)
70B           | FP16      | 140 GB    | 160 GB (2x A100 80GB)
70B           | 4-bit     | 35 GB     | 48 GB (2x RTX 3090/4090)
405B          | 4-bit     | ~202 GB   | 240 GB (3x A100 or 8x 3090)

Note: For long context windows (e.g., 128k tokens), the KV Cache grows massive. You might need double the VRAM.
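You can estimate the KV Cache directly: it stores one key and one value vector per layer, per KV head, per token. A quick sketch, assuming Llama-3-8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16):

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * bytes_per_elem / 1e9

# Llama-3-8B at a full 128k-token context in FP16:
print(round(kv_cache_gb(128_000, 32, 8, 128), 1))  # ~16.8 GB -- more than the weights at 4-bit!
```

This is why a quantized 8B model that "fits in 6 GB" can still blow past 20 GB once you actually fill a long context window.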

2. Fine-Tuning (Training)

Training requires significantly more memory. You aren’t just storing weights; you are storing:

  1. Gradients (Same size as weights)
  2. Optimizer States (Adam stores 2 states per weight, momentum and variance = 2x size)
  3. Activations (The output of every layer, saved for backpropagation)

Full Fine-Tuning (FFT)

$$ VRAM \approx 6 \times Model\ Size $$

To train a 7B model in FP16:

  • Model: 14 GB
  • Gradients: 14 GB
  • Optimizer: 28 GB
  • Activations: ~20 GB
  • Total: ~76 GB, call it 80 GB with headroom. (You need an 80 GB A100 just to fully fine-tune a “small” 7B model!)
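The breakdown above is just arithmetic on the weight size, so it generalizes to any model. A small sketch (the ~20 GB activation figure is this article's rough estimate; in practice it depends heavily on batch size and sequence length):

```python
def fft_vram_gb(params_billions: float, activations_gb: float = 20.0) -> dict:
    """Rough full-fine-tuning VRAM budget in FP16 with Adam."""
    weights = params_billions * 2.0   # FP16 weights: 2 bytes/param
    grads = weights                   # gradients: same size as weights
    optimizer = 2 * weights           # Adam: 2 states (momentum + variance) per weight
    total = weights + grads + optimizer + activations_gb
    return {"weights": weights, "gradients": grads,
            "optimizer": optimizer, "activations": activations_gb, "total": total}

print(fft_vram_gb(7))  # 7B model: 14 + 14 + 28 + 20 = 76 GB
```

Note that weights + gradients + optimizer alone is 4x the weight size; adding activations is where the "~6x model size" rule of thumb comes from.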

LoRA / QLoRA (The Savior)

Low-Rank Adaptation (LoRA) freezes the main model and only trains a tiny adapter layer (<1% of params). QLoRA quantizes the frozen model to 4-bit.

This drastically cuts requirements.

  • 70B Model (QLoRA): Can fine-tune on two RTX 3090s (48GB) instead of massive server clusters.
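The "<1% of params" claim is easy to verify. LoRA replaces the update to a weight matrix with two low-rank factors, adding only rank * (d_in + d_out) trainable parameters per adapted matrix. A sketch assuming Llama-style 7B dimensions (hidden size 4096, 32 layers) and adapting only the attention q/v projections, a common default:

```python
def lora_trainable_fraction(hidden: int = 4096, n_layers: int = 32, rank: int = 16,
                            matrices_per_layer: int = 2,
                            total_params: float = 7e9) -> float:
    """Fraction of parameters LoRA actually trains.

    Each adapted square (hidden x hidden) matrix gains rank * (hidden + hidden)
    adapter parameters; everything else stays frozen.
    """
    adapter_params = n_layers * matrices_per_layer * rank * (hidden + hidden)
    return adapter_params / total_params

print(f"{lora_trainable_fraction():.2%}")  # ~0.12% of the model is trainable
```

With only ~8M trainable parameters, the gradients and optimizer states that dominated the full fine-tuning budget shrink to a rounding error; the frozen 4-bit base weights become the main cost.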

Hardware Recommendations (2025)

Budget: “I just want to run 8B models”

  • GPU: NVIDIA RTX 3060 (12GB) or RTX 4060 Ti (16GB).
  • Alternative: Mac Mini M4 (Unified Memory has lower bandwidth than a discrete GPU, but you can get a lot of it).

Enthusiast: “I want to run 70B models”

  • GPU: Dual RTX 3090s (Used).
  • Why: 24GB + 24GB = 48GB. NVLink isn’t needed for inference, just PCIe. This is the cheapest way to 48GB VRAM (~$1,500).

Pro: “I want to train/fine-tune seriously”

  • GPU: A6000 Ada (48GB) or A100 (80GB).
  • Why: Consumer GeForce cards are artificially limited for multi-GPU training workloads (driver-level P2P communication is blocked).

Summary

  • Inference: 4-bit quantization is your friend.
  • Training: Use QLoRA unless you have a corporate budget.
  • Buying: VRAM quantity > GPU speed. An RTX 3090 (24GB) beats an RTX 4080 (16GB) for AI every day of the week.