Neural Networks from Scratch: Forward and Backpropagation

It’s easy to import PyTorch or TensorFlow and build a neural network in three lines of code. But to truly understand AI, you need to understand what happens under the hood.

A neural network is essentially a massive mathematical function that maps inputs to outputs. It does this through two main passes: Forward Propagation (guessing) and Backpropagation (learning).

The Building Block: The Neuron

An artificial neuron performs a simple calculation: $$ y = f( \sum (w \cdot x) + b ) $$

  1. Inputs ($x$): The data coming in.
  2. Weights ($w$): The importance of each input. (This is what the network learns).
  3. Bias ($b$): An offset that shifts the neuron’s activation threshold.
  4. Activation Function ($f$): Adds non-linearity (e.g., ReLU or Sigmoid), deciding if the neuron “fires.”
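The formula above can be sketched in a few lines of plain Python. The sigmoid activation and the sample numbers are illustrative choices, not anything prescribed by the text:

```python
import math

def sigmoid(z):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    # y = f( sum(w * x) + b ): weighted sum, plus bias, through the activation.
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical inputs and weights, just to show the mechanics:
print(neuron([0.5, -1.0], [0.8, 0.2], 0.1))  # ~0.574
```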

Step 1: Forward Propagation (The Guess)

Imagine a network trying to recognize a digit (0-9) from an image.

  1. Input Layer: The pixels of the image enter the network.
  2. Hidden Layers: Each neuron takes the inputs, multiplies them by current weights, adds a bias, runs the activation function, and passes the result to the next layer.
  3. Output Layer: The final layer gives 10 probabilities (one for each digit).

At the very beginning, the weights are random. So the network sees a “7” and might guess: “I’m 10% sure it’s a 3.” This guess is obviously wrong.
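A forward pass is just this neuron calculation repeated layer by layer. Here is a minimal sketch with a toy 4-pixel input, one hidden layer, and two outputs; the sizes and the sigmoid activation are illustrative assumptions:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    # One dense layer: neuron j computes f(sum_i w[j][i] * x[i] + b[j]).
    return [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(w_j, inputs)) + b_j)
            for w_j, b_j in zip(weights, biases)]

random.seed(0)
# Toy network: 4 "pixels" -> 3 hidden neurons -> 2 outputs. Weights start random.
x = [0.0, 1.0, 1.0, 0.0]
W1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
b2 = [0.0, 0.0]

hidden = layer_forward(x, W1, b1)
output = layer_forward(hidden, W2, b2)
print(output)  # two meaningless scores -- the weights are still random
```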

Step 2: The Loss Function (The Scorecard)

To learn, the network needs to know how wrong it was. We use a Loss Function (or Cost Function).

  • Example: Mean Squared Error (MSE).
  • If the correct answer is 1.0 and the prediction was 0.1, the error is high.

The goal of training is to minimize this Loss.
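Mean Squared Error is short enough to write out directly. Using the example from the list above (correct answer 1.0, prediction 0.1):

```python
def mse(predictions, targets):
    # Mean of the squared differences between predictions and true values.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Correct answer is 1.0 but the network predicted 0.1: the error is high.
print(mse([0.1], [1.0]))  # ~0.81
```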

Step 3: Backpropagation (The Learning)

This is the magic. We know the error. Now we need to blame specific weights for that error.

“Who is responsible for this mistake?”

We use calculus (the chain rule) to calculate the Gradient, moving backward from the Output layer to the Input layer.

  1. Output Layer: “My prediction was too low. The weights connecting to me should have been higher.”
  2. Hidden Layer: “The Output layer complained. I need to adjust my incoming weights to send a stronger signal next time.”

We calculate the partial derivative of the Loss with respect to every single weight in the network. $$ \frac{\partial Loss}{\partial w} $$ This tells us the direction and magnitude to adjust each weight to reduce the error.
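For a single sigmoid neuron with squared-error loss, the chain rule can be written out by hand and checked against a numerical derivative. This is a sketch of the idea, not a full multi-layer backprop:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_w(x, w, b, target):
    # Chain rule for one sigmoid neuron with loss = (y - t)^2, where
    # y = sigmoid(z) and z = w*x + b:
    #   dLoss/dw = dLoss/dy * dy/dz * dz/dw
    z = w * x + b
    y = sigmoid(z)
    dloss_dy = 2 * (y - target)
    dy_dz = y * (1 - y)   # derivative of the sigmoid
    dz_dw = x
    return dloss_dy * dy_dz * dz_dw

# Sanity check against a finite-difference approximation:
x, w, b, t = 0.5, 0.3, 0.1, 1.0
eps = 1e-6
loss = lambda w_: (sigmoid(w_ * x + b) - t) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(grad_w(x, w, b, t), numeric)  # the two values agree closely
```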

Step 4: The Optimizer (The Update)

Once we have the gradients, an optimizer (like SGD or Adam) updates the weights. $$ w_{new} = w_{old} - (LearningRate \times Gradient) $$

We subtract the gradient because we want to go down the slope of error (Gradient Descent).
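The update rule itself is one line per weight. The numbers below are hypothetical gradients, just to show the subtraction:

```python
def sgd_step(weights, gradients, learning_rate=0.1):
    # w_new = w_old - learning_rate * gradient, applied to every weight.
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

w = [0.5, -0.3]
g = [0.2, -0.4]   # hypothetical gradients from backpropagation
print(sgd_step(w, g))  # approximately [0.48, -0.26]
```

Note that a negative gradient makes the weight go up: moving against the slope always reduces the error locally.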

The Cycle (Epochs)

  1. Forward: Make a guess.
  2. Loss: Check the score.
  3. Backward: Calculate gradients.
  4. Optimize: Update weights.

Repeat this millions of times across massive datasets, and the random weights slowly tune themselves into a highly accurate pattern recognition machine. That is all “learning” really is: calculus adjusting numbers until the error gets close to zero.
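The four-step cycle above can be shown end to end on a deliberately tiny problem: one sigmoid neuron learning to map 0 to 0 and 1 to 1. The dataset, learning rate, and epoch count are all illustrative choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A toy "dataset": the neuron should output ~0 for input 0 and ~1 for input 1.
data = [(0.0, 0.0), (1.0, 1.0)]
w, b, lr = 0.0, 0.0, 1.0   # weights start at an arbitrary value

for epoch in range(5000):
    for x, t in data:
        y = sigmoid(w * x + b)                 # 1. Forward: make a guess
        loss = (y - t) ** 2                    # 2. Loss: check the score
        dloss_dz = 2 * (y - t) * y * (1 - y)   # 3. Backward: chain rule
        w -= lr * dloss_dz * x                 # 4. Optimize: update weights
        b -= lr * dloss_dz

print(sigmoid(b), sigmoid(w + b))  # close to 0.0 and 1.0 after training
```

Nothing here is specific to this toy problem: scale the same loop up to millions of weights and images, and you have the training procedure behind real networks.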