Training at Scale: Distributed Computing and Model Parallelism

Training a small neural network is easy: load the model onto your GPU, feed it data, update the weights. But what happens when your model is the size of GPT-4 (reportedly around 1.8 trillion parameters)? In 16-bit precision, the weights alone occupy several terabytes, before you count gradients, optimizer states, and activations. A flagship GPU like the NVIDIA H100 has just 80 GB of memory.

It physically doesn’t fit.

To train modern LLMs, we have to break the model and the data into pieces and spread them across thousands of GPUs. This is Distributed Training.

1. Data Parallelism (The Easy One)

If your model fits on one GPU, but you want to train faster, you use Data Parallelism.

  1. Copy the entire model to GPU A, GPU B, and GPU C.
  2. Split your dataset. Give Batch 1 to A, Batch 2 to B, Batch 3 to C.
  3. Each GPU calculates gradients independently.
  4. Average the gradients across all GPUs and update the weights.
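
The four steps above can be sketched in plain NumPy, faking each "GPU" as a function that computes gradients on its own shard. The toy linear model and `local_gradient` helper are illustrative, not any framework's API:

```python
import numpy as np

# Toy "model": a weight vector w for linear regression, y = x @ w.
def local_gradient(w, x_batch, y_batch):
    """MSE gradient for one replica's mini-batch: 2/N * X^T (Xw - y)."""
    residual = x_batch @ w - y_batch
    return 2.0 * x_batch.T @ residual / len(x_batch)

rng = np.random.default_rng(0)
w_true = np.array([3.0, -2.0])
X = rng.normal(size=(12, 2))
y = X @ w_true

# Step 2: split the global batch across three replicas.
shards = np.split(np.arange(12), 3)

w = np.zeros(2)
for _ in range(200):
    # Step 3: each replica computes gradients on its own batch, independently.
    grads = [local_gradient(w, X[idx], y[idx]) for idx in shards]
    # Step 4: "all-reduce" -- average the gradients, then every replica
    # applies the identical update, keeping the model copies in sync.
    w -= 0.1 * np.mean(grads, axis=0)

print(np.round(w, 3))  # converges toward w_true = [3, -2]
```

Because the shards are equal-sized, the averaged gradient is exactly the full-batch gradient, which is why all replicas stay bit-identical.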

This is standard. But what if the model doesn’t fit?

2. Model Parallelism

When the model is too big for a single GPU's VRAM, you must split the model itself.

A. Pipeline Parallelism

Imagine the model as a layer cake with 100 layers.

  • GPU 1 holds Layers 1-25.
  • GPU 2 holds Layers 26-50.
  • GPU 3 holds Layers 51-75.
  • GPU 4 holds Layers 76-100.

Data enters GPU 1, gets processed, and the output is passed to GPU 2. Problem: While GPU 2 is working, GPU 1 is sitting idle waiting for the next batch. This is called the “bubble.” Sophisticated schedules (like 1F1B) try to minimize this idle time.
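
The bubble is easy to quantify with a toy schedule model. Assuming (my simplification, not a real profiler's numbers) that every stage takes exactly one tick per micro-batch and communication is free, the idle fraction shrinks as you split the batch into more micro-batches:

```python
# Toy schedule model for a naive (all-forward) pipeline.
def pipeline_ticks(num_stages: int, num_microbatches: int) -> int:
    """Ticks until the last micro-batch leaves the last stage."""
    # Micro-batch i enters stage s at tick i + s, so the final one
    # finishes (num_microbatches - 1) + num_stages ticks after the start.
    return num_stages + num_microbatches - 1

def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of stage-ticks spent idle (the 'bubble')."""
    total = num_stages * pipeline_ticks(num_stages, num_microbatches)
    busy = num_stages * num_microbatches  # each stage does real work once per micro-batch
    return 1 - busy / total

# With 4 stages and a single batch, 3 of every 4 stage-ticks are idle.
print(bubble_fraction(4, 1))             # 0.75
# Splitting into 16 micro-batches shrinks the bubble sharply.
print(round(bubble_fraction(4, 16), 3))  # ~0.158
```

This is exactly why schedules like 1F1B feed many small micro-batches through the pipe instead of one big batch.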

B. Tensor Parallelism

Instead of splitting the model by layers (a horizontal cut), we split inside each layer (a vertical cut). Inside a Transformer, the bulk of the work is massive matrix multiplications: $Y = W \times X$. We split the matrix $W$ into chunks.

  • GPU 1 holds the left half of the matrix.
  • GPU 2 holds the right half.

They both compute their part of the multiplication simultaneously and then communicate to combine the results (a sum or a concatenation, depending on which way $W$ was split). Requirement: this needs extremely fast interconnects between GPUs (like NVLink), because they have to talk after every single layer.
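
A NumPy sketch of the left/right split above. When $W$ is split by columns, $X$ must be split by rows to match, and the combine step is a sum (an all-reduce in a real framework):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))   # full weight matrix
X = rng.normal(size=(6, 3))   # incoming activations

# Each "GPU" holds one vertical slice of W and the matching slice of X.
W_left, W_right = W[:, :3], W[:, 3:]
X_top, X_bot = X[:3, :], X[3:, :]

# Both devices compute their partial product simultaneously...
partial_1 = W_left @ X_top    # lives on "GPU 1"
partial_2 = W_right @ X_bot   # lives on "GPU 2"

# ...then communicate to combine: here the combine is an element-wise sum.
Y = partial_1 + partial_2

assert np.allclose(Y, W @ X)  # identical to the unsplit computation
```

The assert is the whole point: splitting the matmul changes where the FLOPs happen, not the answer.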

3. Sharded Data Parallelism (FSDP / ZeRO)

This is the modern standard for many open-source efforts (PyTorch's FSDP, DeepSpeed's ZeRO). In standard Data Parallelism, every GPU holds a full copy of the parameters, gradients, and optimizer states (a redundant waste of memory, since the optimizer states alone can be several times larger than the model).

ZeRO (Zero Redundancy Optimizer) shards these states across the data-parallel workers.

  • GPU 1 holds the optimizer states and gradients for Layer 1.
  • GPU 2 holds them for Layer 2.
  • When GPU 1 needs to update Layer 2, it gathers the missing pieces from GPU 2 just in time, then discards them.

This allows you to train massive models on standard clusters without needing the complex engineering of Tensor Parallelism.
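
The memory savings can be sketched with the accounting from the ZeRO paper: under mixed-precision Adam, each parameter costs roughly 2 bytes (fp16 weights) + 2 bytes (fp16 gradients) + 12 bytes (fp32 master weights, momentum, variance) = 16 bytes. A back-of-the-envelope calculator (the byte counts are the paper's model, not a guarantee for any particular setup):

```python
# Rough per-GPU memory for mixed-precision Adam under the ZeRO stages.
def bytes_per_param(num_gpus: int, stage: int) -> float:
    params, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter, unsharded
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards the optimizer states
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards the gradients
    if stage >= 3:
        params /= num_gpus   # ZeRO-3 also shards the parameters themselves
    return params + grads + optim

def model_gb(num_params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate GB of training state per GPU (activations excluded)."""
    return num_params_billion * bytes_per_param(num_gpus, stage)

# A 70B-parameter model on 64 GPUs:
for stage in range(4):
    print(stage, round(model_gb(70, 64, stage), 1))
# stage 0 needs 1120 GB per GPU (impossible); stage 3 needs ~17.5 GB.
```

At stage 3 the per-GPU footprint scales as 1/N, which is why FSDP-style training fits models that plain Data Parallelism never could.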

4. 3D Parallelism

Training the largest models uses 3D Parallelism, combining all three (Llama 3 405B reportedly layered a fourth, context-parallel dimension on top):

  • Tensor Parallelism within a single node (8 GPUs linked by NVLink).
  • Pipeline Parallelism across different nodes.
  • Data Parallelism across the entire cluster.
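
One way to picture the resulting grid is to map each global rank to a (data, pipeline, tensor) coordinate, with the tensor dimension innermost so its groups stay on one NVLink-connected node. The layout below is illustrative, not any particular framework's actual rank ordering:

```python
TP, PP, DP = 8, 4, 16    # tensor, pipeline, data parallel group sizes
WORLD = TP * PP * DP     # 512 GPUs total

def coords(rank: int):
    """Map a global rank to (dp, pp, tp); tensor dimension varies fastest,
    so ranks 0-7 share one node and talk over NVLink."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

# Ranks 0-7 form one tensor-parallel group (same node)...
assert [coords(r)[2] for r in range(8)] == list(range(8))
# ...rank 8 starts the next pipeline stage within the same replica...
assert coords(8) == (0, 1, 0)
# ...and rank 32 starts the next data-parallel replica.
assert coords(TP * PP) == (1, 0, 0)
# The mapping covers every coordinate exactly once.
assert len({coords(r) for r in range(WORLD)}) == WORLD
```

The design choice encoded here is bandwidth-aware: the chattiest dimension (tensor) gets the fastest links, pipeline crosses adjacent nodes, and the tolerant gradient-averaging of data parallelism spans the whole cluster.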

It is an orchestration miracle that thousands of GPUs can dance together without stepping on each other’s toes, updating a trillion-parameter brain in sync.