Gradient Backpropagation

methods-of-ai

Backpropagation is the algorithm that computes the gradient ∇L(θ) of a loss function with respect to every parameter of a neural network — in a single backward pass through the computation graph. It’s the engine behind every neural net trained since the 1980s, and it works because of one piece of high-school calculus: the chain rule.

Backprop itself is not an optimizer — it just computes the gradients. An optimizer (SGD, Adam, AdamW, …) then uses those gradients via Gradient Descent to update the weights. The pair “backprop computes ∇L → SGD uses ∇L” is the core loop of all modern AI.

The 90-second summary

  1. Forward pass: push input through the net, store every intermediate activation, compute loss L.
  2. Backward pass: starting from L, walk backwards through the graph, applying the chain rule at each node. At every weight you arrive at, you now have ∂L/∂w — the gradient for that weight.
  3. Update: hand all those gradients to the optimizer, which takes a step (w ← w − η·∂L/∂w).
  4. Repeat for the next mini-batch.

🧠 The one big idea — chain rule, applied recursively

A neural net is a deeply nested function: L = ℓ(f₃(f₂(f₁(x)))). The chain rule says:

∂L/∂x = ∂L/∂f₃ · ∂f₃/∂f₂ · ∂f₂/∂f₁ · ∂f₁/∂x

For a weight buried deep in layer 1, you’d think you need to recompute the whole chain. Backprop’s trick: compute these derivatives from the top down, and at every layer you’ve already done all the work needed for the layer below it. Each layer’s backward pass costs roughly the same as its forward pass → total backward cost ≈ forward cost. No exponential blow-up. This is the entire reason deep learning is computationally feasible.

Why "back-propagation" is a great name

The gradient of the output is known immediately (∂L/∂output = predicted − target for MSE). That signal then propagates backwards through the layers. Layer N uses the gradient flowing in from layer N+1 to compute its own contribution and pass a new gradient signal down to layer N−1. It’s a message-passing scheme, where the “message” is ∂L/∂(layer-output).


✏️ Worked example — backprop a tiny network by hand

Consider the smallest possible net: one input x, one hidden neuron, one output. Sigmoid activations, MSE loss.

x ─[w₁]→ z₁ ─σ→ h ─[w₂]→ z₂ ─σ→ ŷ ─compare to y→ L = ½(ŷ−y)²

Forward (with x=1, w₁=0.5, w₂=−0.3, y=1):

QuantityComputationValue
z₁ = w₁ · x0.5 · 10.500
h = σ(z₁)1/(1+e⁻⁰·⁵)0.622
z₂ = w₂ · h−0.3 · 0.622−0.187
ŷ = σ(z₂)1/(1+e⁰·¹⁸⁷)0.453
L = ½(ŷ − y)²½(0.453 − 1)²0.150

Now backprop — walk back, accumulating δ = ∂L/∂(thing):

StepChain-rule expressionValue
δŷ = ∂L/∂ŷŷ − y = 0.453 − 1−0.547
δz₂ = ∂L/∂z₂ = δŷ · σ'(z₂)δŷ · ŷ·(1−ŷ) = −0.547 · 0.248−0.135
∂L/∂w₂ = δz₂ · h−0.135 · 0.622−0.0840
δh = δz₂ · w₂−0.135 · −0.3+0.0405
δz₁ = δh · σ'(z₁) = δh · h·(1−h)0.0405 · 0.622·0.378+0.00952
∂L/∂w₁ = δz₁ · x0.00952 · 1+0.00952

After one update with learning rate η=1:

  • w₂ ← w₂ − η · ∂L/∂w₂ = −0.3 − (−0.084) = −0.216
  • w₁ ← w₁ − η · ∂L/∂w₁ = 0.5 − 0.00952 = 0.490

Notice: the gradient at w₁ is tiny (0.0095) compared to w₂ (0.084). Why? Every backward layer multiplies by σ', which is at most 0.25. After two layers we’re already at 0.25² = 0.0625 of the original signal. This is the vanishing-gradient problem — and the entire reason ReLU replaced sigmoid in modern nets.


💻 Code — implement backprop from scratch, train on XOR

Two layers, no autograd, just numpy. You will see the gradients with your own eyes.

What this shows:

  • Loss drops over ~3000 epochs as backprop pushes the weights into a configuration that separates XOR’s 4 points
  • Decision boundary curves around the two 1-points and the two 0-points — clearly non-linear, exactly what XOR needs
  • The whole gradient computation is ~10 lines of numpy. PyTorch/JAX automate this, but it’s not magic — just chain rule applied bottom-up.

📋 The 5-step algorithm (cleaned up)

  1. Define a piecewise-differentiable loss L(θ) measuring how wrong the network is on a training sample (or batch).
  2. Initialize weights randomly (Xavier, He, etc. — never all zeros).
  3. Forward pass: feed an input batch through the network, store every intermediate activation.
  4. Backward pass: starting from ∂L/∂output, walk backwards through the computation graph applying the chain rule at each node. At every weight you obtain ∂L/∂w.
  5. Update: each weight gets nudged: w ← w − η · ∂L/∂w (vanilla SGD; Adam/AdamW also use momentum + per-parameter learning rates).
  6. Repeat steps 3–5 on the next mini-batch until convergence or early stopping.

The critical efficiency property: a backward pass costs the same Big-O as a forward pass. For a net with N parameters, gradient computation is O(N) — not O(N²) or worse. This is what makes training 175B-parameter models tractable at all.


⚠️ The two classic failure modes

ProblemWhy it happensModern fix
Vanishing gradientsEach backward layer multiplies by activation derivative (≤ 0.25 for sigmoid, ≤ 1 for tanh). After many layers, gradient ≈ 0 → early layers barely learn.ReLU (derivative = 0 or 1, no shrinking) · Residual connections (gradient shortcut around blocks) · Batch/Layer Norm (rescales activations)
Exploding gradientsIf weights are large, products of Jacobians can grow exponentially → loss becomes NaN.Gradient clipping (cap ‖∇‖ at a threshold) · Careful initialization (Xavier/He scales weights to ~1/√fan_in)

Both are direct consequences of the chain rule — the gradient signal is a long product, and any product either decays or explodes unless every factor is carefully managed.


🌍 Where backpropagation is used today

Backprop is, without exaggeration, the most consequential algorithm in modern AI. Every neural network you’ve heard of is trained with it.

  • All large language models — GPT-4, Claude, Gemini, LLaMA, DeepSeek — backprop + Adam/AdamW
  • Image generation — Stable Diffusion, DALL-E, Imagen, Midjourney — backprop through diffusion U-Nets / transformer denoisers
  • Speech recognition — Whisper, Conformer — backprop through encoder-decoder networks
  • Protein folding — AlphaFold 2, ESMFold — backprop through Evoformer / structure modules
  • Self-driving perception — Tesla FSD, Waymo — backprop through CNNs and transformers on camera/LiDAR
  • Recommendation systems — YouTube, TikTok, Spotify — backprop through embedding + ranking networks
  • Reinforcement learning policies — DQN, PPO, SAC, AlphaZero — backprop through value/policy networks
  • Scientific ML — climate modeling, drug discovery, materials science — backprop through specialized architectures

Auto-differentiation frameworks (PyTorch, JAX, TensorFlow) made backprop universally accessible: you write the forward pass, the framework computes gradients automatically by tracking the computation graph. This is the infrastructure of modern AI.


🔬 Where backpropagation is being challenged — and by what

Despite its dominance, backprop has serious problems: it requires storing all activations for the backward pass (memory-hungry), it doesn’t match how biological neurons learn (no plausible global error signal), and it’s hard to parallelize across layers (layer N can’t update until layer N+1’s gradient arrives).

Limitation of backpropProposed alternativeStatus
Biologically implausible (needs symmetric weights + global error)Feedback Alignment, Direct Feedback AlignmentResearch — works on small tasks, struggles to scale
Locked computation (can’t update layer N until layer N+1 is done)Synthetic Gradients, DNI (DeepMind, 2017)Niche — never displaced standard backprop
Requires storing all activationsGradient Checkpointing (now standard practice)Widely used as memory optimization, not replacement
Inefficient for sequence models (memory grows with length)Forward-Forward Algorithm (Hinton, 2022)Active research — no major deployment yet
RLHF specifically (reward model is brittle)DPO (Direct Preference Optimization)Still uses backprop, just skips the RL loop
Quantum / neuromorphic hardwareEquilibrium Propagation, predictive codingTheoretical — no scaled implementations

Bottom line: backprop has not been replaced. Every alternative either works on tiny problems or is itself trained with backprop under the hood. It’s the most successful single algorithm in AI history — and despite ~40 years of attempts to dethrone it, nothing has come close.


🎯 Exam traps

Backprop ≠ gradient descent

Backprop computes ∇L(θ). SGD/Adam uses it. They are two distinct algorithms in one pipeline. Saying “trained with backprop” is shorthand for “gradients computed by backprop, weights updated by SGD/Adam”. The exam will sometimes test if you can keep them apart.

Backprop requires differentiability

Backprop only works because every operation in the network has a defined derivative. Non-differentiable operations (argmax, discrete sampling, hard attention) break the chain → workarounds: Gumbel-Softmax, REINFORCE, straight-through estimator. This is one place Reinforcement Learning (RL) still beats supervised learning.

The forward pass must be stored

To compute σ'(z) you need σ(z) (the output of the activation). So forward-pass activations must be cached for the backward pass — that’s where the memory cost comes from (and what gradient checkpointing trades for re-computation).

Backprop is local in time, global in space

Each weight’s update depends on the global error signal (the loss L), which the brain almost certainly does not have access to. This is the biggest argument that brains do not use backprop — see The neuroconnectionist research programme.


See also

Source

Daniel, L., K., Yamins., James, J., DiCarlo. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365. doi:10.1038/NN.4244

Tags: methods-of-ai neural-networks backpropagation chain-rule deep-learning
Created: 20-11-24 · Restructured: 18-05-26