VRAM Requirements: How Much GPU Memory Do You Need?

The #1 question for anyone building a Deep Learning PC or renting cloud GPUs: “How much VRAM do I need?” System RAM (DDR5) is cheap. GPU VRAM (HBM/GDDR6X) is gold.

Here is the math to figure it out.

1. Inference (Running the Model)

For simply running a model (chatting with it), the math is simple: $$ VRAM \approx Parameters \times Bytes\ per\ Parameter $$

  • FP16 (Original): 2 bytes per parameter.
  • 8-bit (INT8): 1 byte per parameter.
  • 4-bit (Q4): 0.5 bytes per parameter.

Add 20% overhead for the “KV Cache” (context window) and temporary buffers.
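The formula above is easy to sketch in code. Here is a minimal estimator (the 20% overhead factor is the rule of thumb from this article, not a hard constant):

```python
def inference_vram_gb(params_billions: float, bytes_per_param: float,
                      overhead: float = 0.20) -> float:
    """Estimate inference VRAM: weights plus ~20% for KV cache and buffers.

    1B params at 1 byte/param occupies ~1 GB, so the unit math works out.
    """
    base_gb = params_billions * bytes_per_param
    return base_gb * (1 + overhead)

# Llama-3-8B in FP16 (2 bytes/param): 16 GB base -> ~19.2 GB, so a 20 GB+ card
print(round(inference_vram_gb(8, 2.0), 1))   # 19.2
# 70B at 4-bit (0.5 bytes/param): 35 GB base -> ~42 GB, fits in 48 GB
print(round(inference_vram_gb(70, 0.5), 1))  # 42.0
```

These two calls reproduce the 8B/FP16 and 70B/4-bit rows of the cheat sheet below.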

The Cheat Sheet

Model Size    | Precision | Base Size | VRAM Needed (Safe)
------------- | --------- | --------- | ---------------------------
8B (Llama-3)  | FP16      | 16 GB     | 20 GB (RTX 3090/4090)
8B            | 4-bit     | 4 GB      | 6 GB (RTX 3060/Laptop)
70B           | FP16      | 140 GB    | 160 GB (2x A100 80GB)
70B           | 4-bit     | 35 GB     | 48 GB (2x RTX 3090/4090)
405B          | 4-bit     | ~202 GB   | 240 GB (3x A100 or 8x 3090)

Note: For long context windows (e.g., 128k tokens), the KV Cache grows massive. You might need double the VRAM.
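You can estimate the KV Cache directly: it stores one key and one value vector per layer, per KV head, per token. A quick sketch, assuming Llama-3-8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16):

```python
def kv_cache_gb(context_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2, batch: int = 1) -> float:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * bytes_per_elem / 1e9

# Llama-3-8B at a full 128k-token context in FP16:
print(round(kv_cache_gb(128_000, 32, 8, 128), 1))  # ~16.8 GB -- more than the weights at 4-bit!
```

This is why a quantized 8B model that "fits in 6 GB" can still blow past 20 GB once you actually fill a long context window.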

2. Fine-Tuning (Training)

Training requires significantly more memory. You aren’t just storing weights; you are storing:

  1. Gradients (Same size as weights)
  2. Optimizer States (Adam stores 2 states per weight, momentum and variance = 2x size)
  3. Activations (The output of every layer, saved for backpropagation)

Full Fine-Tuning (FFT)

$$ VRAM \approx 6 \times Model\ Size $$

To train a 7B model in FP16:

  • Model: 14 GB
  • Gradients: 14 GB
  • Optimizer: 28 GB
  • Activations: ~20 GB
  • Total: ~76 GB, call it 80 GB with headroom. (You need an 80 GB A100 just to fully fine-tune a “small” 7B model!)
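The breakdown above is just arithmetic on the weight size, so it generalizes to any model. A small sketch (the ~20 GB activation figure is this article's rough estimate; in practice it depends heavily on batch size and sequence length):

```python
def fft_vram_gb(params_billions: float, activations_gb: float = 20.0) -> dict:
    """Rough full-fine-tuning VRAM budget in FP16 with Adam."""
    weights = params_billions * 2.0   # FP16 weights: 2 bytes/param
    grads = weights                   # gradients: same size as weights
    optimizer = 2 * weights           # Adam: 2 states (momentum + variance) per weight
    total = weights + grads + optimizer + activations_gb
    return {"weights": weights, "gradients": grads,
            "optimizer": optimizer, "activations": activations_gb, "total": total}

print(fft_vram_gb(7))  # 7B model: 14 + 14 + 28 + 20 = 76 GB
```

Note that weights + gradients + optimizer alone is 4x the weight size; adding activations is where the "~6x model size" rule of thumb comes from.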

LoRA / QLoRA (The Savior)

Low-Rank Adaptation (LoRA) freezes the main model and only trains a tiny adapter layer (<1% of params). QLoRA quantizes the frozen model to 4-bit.

This drastically cuts requirements.

  • 70B Model (QLoRA): Can fine-tune on two RTX 3090s (48GB) instead of massive server clusters.
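The "<1% of params" claim is easy to verify. LoRA replaces the update to a weight matrix with two low-rank factors, adding only rank * (d_in + d_out) trainable parameters per adapted matrix. A sketch assuming Llama-style 7B dimensions (hidden size 4096, 32 layers) and adapting only the attention q/v projections, a common default:

```python
def lora_trainable_fraction(hidden: int = 4096, n_layers: int = 32, rank: int = 16,
                            matrices_per_layer: int = 2,
                            total_params: float = 7e9) -> float:
    """Fraction of parameters LoRA actually trains.

    Each adapted square (hidden x hidden) matrix gains rank * (hidden + hidden)
    adapter parameters; everything else stays frozen.
    """
    adapter_params = n_layers * matrices_per_layer * rank * (hidden + hidden)
    return adapter_params / total_params

print(f"{lora_trainable_fraction():.2%}")  # ~0.12% of the model is trainable
```

With only ~8M trainable parameters, the gradients and optimizer states that dominated the full fine-tuning budget shrink to a rounding error; the frozen 4-bit base weights become the main cost.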

Hardware Recommendations (2025)

Budget: “I just want to run 8B models”

  • GPU: NVIDIA RTX 3060 (12GB) or RTX 4060 Ti (16GB).
  • Alternative: Mac Mini M4 (Unified Memory has lower bandwidth than a discrete GPU, but you can get a lot of it).

Enthusiast: “I want to run 70B models”

  • GPU: Dual RTX 3090s (Used).
  • Why: 24GB + 24GB = 48GB. NVLink isn’t needed for inference, just PCIe. This is the cheapest way to 48GB VRAM (~$1,500).

Pro: “I want to train/fine-tune seriously”

  • GPU: A6000 Ada (48GB) or A100 (80GB).
  • Why: Consumer GeForce cards are artificially limited for multi-GPU training workloads (driver-level P2P communication is blocked).

Summary

  • Inference: 4-bit quantization is your friend.
  • Training: Use QLoRA unless you have a corporate budget.
  • Buying: VRAM quantity > GPU speed. An RTX 3090 (24GB) beats an RTX 4080 (16GB) for AI every day of the week.