Tensor Cores and Mixed Precision Training
The dedicated silicon inside NVIDIA GPUs that makes modern AI possible. How mixed precision can speed up training by as much as 10x.
If CUDA is the software engine of AI, Tensor Cores are the turbochargers. Introduced in the NVIDIA Volta architecture (V100) in 2017, these specialized processing units are the reason why AI training speeds have exploded in the last few years.
What is a Tensor Core?
A standard GPU core (CUDA Core) is a jack-of-all-trades. It can do floating point math (FP32), integers (INT32), etc., one operation at a time.
A Tensor Core is a specialist. It does exactly one thing, but it does it incredibly fast: matrix multiplication. Specifically, a 4x4 matrix multiply-accumulate operation:
$$ D = A \times B + C $$
Since Deep Learning is fundamentally just billions of matrix multiplications, having dedicated hardware for this specific math operation yields massive speedups.
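The multiply-accumulate operation above can be sketched in NumPy. This is a conceptual illustration, not actual Tensor Core code: the inputs are FP16 and the accumulation happens in FP32, mirroring how Volta-class Tensor Cores behave.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4x4 operands in half precision, as a Tensor Core consumes them
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# D = A x B + C: multiply the FP16 inputs, accumulate the result in FP32
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.shape)  # (4, 4)
print(D.dtype)  # float32
```

Accumulating in FP32 rather than FP16 is the key design choice: summing many small products in FP16 would quickly lose digits.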
The Magic of Precision
To understand Tensor Cores, you have to understand Precision. Computers store numbers as bits.
- FP32 (Single Precision): Uses 32 bits. Very accurate (e.g., 3.14159265). Standard for science.
- FP16 (Half Precision): Uses 16 bits. Less accurate (e.g., 3.1416).
- BF16 (Bfloat16): Google’s format. Keeps the “range” of FP32 but cuts the precision. Perfect for AI.
- FP8: Only 8 bits! Used in the newest H100 GPUs.
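The precision gap is easy to see in Python. NumPy ships FP16 and FP32 out of the box; BF16 and FP8 need extra packages, so their properties are only noted in comments.

```python
import numpy as np

pi = 3.14159265358979

# FP32: ~7 significant decimal digits
print(np.float32(pi))            # 3.1415927

# FP16: ~3 significant decimal digits, and a max value of only 65504
print(np.float16(pi))            # 3.14
print(np.finfo(np.float16).max)  # 65504.0

# BF16 (not in stock NumPy): same exponent range as FP32 (max ~3.4e38),
# but only ~2-3 significant decimal digits.
```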
Mixed Precision Training
Training a neural network usually doesn’t need the full ~7 significant digits of FP32. “Roughly correct” is often enough, as long as the gradients point in the right direction.
Mixed Precision training uses:
- FP16/BF16 for the heavy matrix math (on Tensor Cores) — super fast.
- FP32 for the “master weights” update — to preserve stability.
This technique, enabled with a few lines of code in PyTorch (torch.cuda.amp), roughly halves memory usage and can double training speed.
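The master-weights pattern can be sketched in plain NumPy. This is a toy simulation of the idea, not PyTorch's actual AMP machinery (which uses autocast and a gradient scaler): the point is only that compute runs in FP16 while the update lands on an FP32 copy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Master weights live in FP32, so tiny updates accumulate without rounding away
master_w = rng.standard_normal(4).astype(np.float32)
x = rng.standard_normal(4).astype(np.float32)
lr = 1e-4

for step in range(3):
    # Forward/backward math runs in FP16 (this is where Tensor Cores help)
    w16 = master_w.astype(np.float16)
    x16 = x.astype(np.float16)
    grad16 = 2 * x16 * (w16 @ x16)   # gradient of a toy quadratic loss

    # The update is applied to the FP32 master copy; lr * grad is often
    # too small to survive an FP16 addition
    master_w -= lr * grad16.astype(np.float32)
```

In real PyTorch code the same split is handled for you: `autocast` picks the low-precision dtype per op, and the optimizer state stays in FP32.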
Why Bfloat16 (BF16) Matters
Standard FP16 had a problem: it couldn’t handle very large or very small numbers (dynamic range). Training often crashed with “NaN” (Not a Number) errors.
Bfloat16 (Brain Floating Point) fixes this. It has the same dynamic range as FP32, just less precision.
- FP16: 5 bits exponent, 10 bits mantissa.
- BF16: 8 bits exponent, 7 bits mantissa.
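FP16’s 5-bit exponent is exactly where those NaN crashes came from: large activations overflow to infinity. A quick NumPy check makes the difference concrete (BF16 itself isn’t in stock NumPy, so its range is given as a comment).

```python
import numpy as np

# FP16: 5 exponent bits -> max finite value is only 65504
print(np.float16(70000.0))            # inf (overflow!)
print(np.isinf(np.float16(70000.0)))  # True

# BF16: 8 exponent bits, same as FP32 -> max ~3.4e38, so the same
# value fits easily (just with fewer significant digits).
print(np.float32(70000.0))            # 70000.0
```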
Because neural networks care more about the magnitude of a number than the exact digits, BF16 became the gold standard for Large Language Model training.
FP8 and the H100
NVIDIA’s H100 introduced FP8 Tensor Cores. By reducing the numbers to just 8 bits (a byte!), they effectively doubled the throughput again compared to BF16. There are challenges—8 bits is very “grainy”—but the Transformer Engine in the H100 automatically switches between FP8 and BF16 layer-by-layer to maintain accuracy while maximizing speed.
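How “grainy” 8 bits really are can be simulated by rounding a value’s mantissa down to the 3 explicit bits of the FP8 E4M3 format. This is a toy sketch using only the standard library; real FP8 also has a limited exponent range and special-value handling.

```python
import math

def quantize_mantissa(x: float, explicit_bits: int = 3) -> float:
    """Round x to 1 implicit + `explicit_bits` mantissa bits (toy FP8-style)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2 ** (explicit_bits + 1)  # implicit bit + explicit bits
    return round(m * scale) / scale * 2 ** e

print(quantize_mantissa(3.14159265))  # 3.25 -- pi becomes quite "grainy"
```

With only 16 representable mantissa values per power of two, every number snaps to a coarse grid, which is why the Transformer Engine falls back to BF16 for the layers that can’t tolerate it.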
Summary
Without Tensor Cores and Mixed Precision:
- Models would be 4x-10x slower to train.
- We would need 2x more VRAM.
- GPT-4 would likely still be training today.
They are the unsung heroes of the hardware stack, crunching the numbers that think the thoughts.