Backpropagation is the algorithm that computes the gradient ∇L(θ) of a loss function with respect to every parameter of a neural network — in a single backward pass through the computation graph. It’s the engine behind every neural net trained since the 1980s, and it works because of one piece of high-school calculus: the chain rule.
Backprop itself is not an optimizer — it just computes the gradients. An optimizer (SGD, Adam, AdamW, …) then uses those gradients via Gradient Descent to update the weights. The pair “backprop computes ∇L → SGD uses ∇L” is the core loop of all modern AI.
The 90-second summary
Forward pass: push input through the net, store every intermediate activation, compute loss L.
Backward pass: starting from L, walk backwards through the graph, applying the chain rule at each node. At every weight you arrive at, you now have ∂L/∂w — the gradient for that weight.
Update: hand all those gradients to the optimizer, which takes a step (w ← w − η·∂L/∂w).
Repeat for the next mini-batch.
🧠 The one big idea — chain rule, applied recursively
A neural net is a deeply nested function: L = ℓ(f₃(f₂(f₁(x)))). The chain rule says:
∂L/∂x = ∂L/∂f₃ · ∂f₃/∂f₂ · ∂f₂/∂f₁ · ∂f₁/∂x
For a weight buried deep in layer 1, you’d think you need to recompute the whole chain. Backprop’s trick: compute these derivatives from the top down, and at every layer you’ve already done all the work needed for the layer below it. Each layer’s backward pass costs roughly the same as its forward pass → total backward cost ≈ forward cost. No exponential blow-up. This is the entire reason deep learning is computationally feasible.
Why "back-propagation" is a great name
The gradient of the output is known immediately (∂L/∂output = predicted − target for MSE). That signal then propagates backwards through the layers. Layer N uses the gradient flowing in from layer N+1 to compute its own contribution and pass a new gradient signal down to layer N−1. It’s a message-passing scheme, where the “message” is ∂L/∂(layer-output).
✏️ Worked example — backprop a tiny network by hand
Consider the smallest possible net: one input x, one hidden neuron, one output. Sigmoid activations, MSE loss.
x ─[w₁]→ z₁ ─σ→ h ─[w₂]→ z₂ ─σ→ ŷ ─compare to y→ L = ½(ŷ−y)²
Forward (with x=1, w₁=0.5, w₂=−0.3, y=1):
Quantity
Computation
Value
z₁ = w₁ · x
0.5 · 1
0.500
h = σ(z₁)
1/(1+e⁻⁰·⁵)
0.622
z₂ = w₂ · h
−0.3 · 0.622
−0.187
ŷ = σ(z₂)
1/(1+e⁰·¹⁸⁷)
0.453
L = ½(ŷ − y)²
½(0.453 − 1)²
0.150
Now backprop — walk back, accumulating δ = ∂L/∂(thing):
Step
Chain-rule expression
Value
δŷ = ∂L/∂ŷ
ŷ − y = 0.453 − 1
−0.547
δz₂ = ∂L/∂z₂ = δŷ · σ'(z₂)
δŷ · ŷ·(1−ŷ) = −0.547 · 0.248
−0.135
∂L/∂w₂ = δz₂ · h
−0.135 · 0.622
−0.0840
δh = δz₂ · w₂
−0.135 · −0.3
+0.0405
δz₁ = δh · σ'(z₁) = δh · h·(1−h)
0.0405 · 0.622·0.378
+0.00952
∂L/∂w₁ = δz₁ · x
0.00952 · 1
+0.00952
After one update with learning rate η=1:
w₂ ← w₂ − η · ∂L/∂w₂ = −0.3 − (−0.084) = −0.216
w₁ ← w₁ − η · ∂L/∂w₁ = 0.5 − 0.00952 = 0.490
Notice: the gradient at w₁ is tiny (0.0095) compared to w₂ (0.084). Why? Every backward layer multiplies by σ', which is at most 0.25. After two layers we’re already at 0.25² = 0.0625 of the original signal. This is the vanishing-gradient problem — and the entire reason ReLU replaced sigmoid in modern nets.
💻 Code — implement backprop from scratch, train on XOR
Two layers, no autograd, just numpy. You will see the gradients with your own eyes.
Loss drops over ~3000 epochs as backprop pushes the weights into a configuration that separates XOR’s 4 points
Decision boundary curves around the two 1-points and the two 0-points — clearly non-linear, exactly what XOR needs
The whole gradient computation is ~10 lines of numpy. PyTorch/JAX automate this, but it’s not magic — just chain rule applied bottom-up.
📋 The 5-step algorithm (cleaned up)
Define a piecewise-differentiable lossL(θ) measuring how wrong the network is on a training sample (or batch).
Initialize weights randomly (Xavier, He, etc. — never all zeros).
Forward pass: feed an input batch through the network, store every intermediate activation.
Backward pass: starting from ∂L/∂output, walk backwards through the computation graph applying the chain rule at each node. At every weight you obtain ∂L/∂w.
Update: each weight gets nudged: w ← w − η · ∂L/∂w (vanilla SGD; Adam/AdamW also use momentum + per-parameter learning rates).
Repeat steps 3–5 on the next mini-batch until convergence or early stopping.
The critical efficiency property: a backward pass costs the same Big-O as a forward pass. For a net with N parameters, gradient computation is O(N) — not O(N²) or worse. This is what makes training 175B-parameter models tractable at all.
⚠️ The two classic failure modes
Problem
Why it happens
Modern fix
Vanishing gradients
Each backward layer multiplies by activation derivative (≤ 0.25 for sigmoid, ≤ 1 for tanh). After many layers, gradient ≈ 0 → early layers barely learn.
ReLU (derivative = 0 or 1, no shrinking) · Residual connections (gradient shortcut around blocks) · Batch/Layer Norm (rescales activations)
Exploding gradients
If weights are large, products of Jacobians can grow exponentially → loss becomes NaN.
Gradient clipping (cap ‖∇‖ at a threshold) · Careful initialization (Xavier/He scales weights to ~1/√fan_in)
Both are direct consequences of the chain rule — the gradient signal is a long product, and any product either decays or explodes unless every factor is carefully managed.
🌍 Where backpropagation is used today
Backprop is, without exaggeration, the most consequential algorithm in modern AI. Every neural network you’ve heard of is trained with it.
All large language models — GPT-4, Claude, Gemini, LLaMA, DeepSeek — backprop + Adam/AdamW
Scientific ML — climate modeling, drug discovery, materials science — backprop through specialized architectures
Auto-differentiation frameworks (PyTorch, JAX, TensorFlow) made backprop universally accessible: you write the forward pass, the framework computes gradients automatically by tracking the computation graph. This is the infrastructure of modern AI.
🔬 Where backpropagation is being challenged — and by what
Despite its dominance, backprop has serious problems: it requires storing all activations for the backward pass (memory-hungry), it doesn’t match how biological neurons learn (no plausible global error signal), and it’s hard to parallelize across layers (layer N can’t update until layer N+1’s gradient arrives).
Limitation of backprop
Proposed alternative
Status
Biologically implausible (needs symmetric weights + global error)
Feedback Alignment, Direct Feedback Alignment
Research — works on small tasks, struggles to scale
Locked computation (can’t update layer N until layer N+1 is done)
Synthetic Gradients, DNI (DeepMind, 2017)
Niche — never displaced standard backprop
Requires storing all activations
Gradient Checkpointing (now standard practice)
Widely used as memory optimization, not replacement
Inefficient for sequence models (memory grows with length)
Forward-Forward Algorithm (Hinton, 2022)
Active research — no major deployment yet
RLHF specifically (reward model is brittle)
DPO (Direct Preference Optimization)
Still uses backprop, just skips the RL loop
Quantum / neuromorphic hardware
Equilibrium Propagation, predictive coding
Theoretical — no scaled implementations
Bottom line: backprop has not been replaced. Every alternative either works on tiny problems or is itself trained with backprop under the hood. It’s the most successful single algorithm in AI history — and despite ~40 years of attempts to dethrone it, nothing has come close.
🎯 Exam traps
Backprop ≠ gradient descent
Backprop computes∇L(θ). SGD/Adam uses it. They are two distinct algorithms in one pipeline. Saying “trained with backprop” is shorthand for “gradients computed by backprop, weights updated by SGD/Adam”. The exam will sometimes test if you can keep them apart.
Backprop requires differentiability
Backprop only works because every operation in the network has a defined derivative. Non-differentiable operations (argmax, discrete sampling, hard attention) break the chain → workarounds: Gumbel-Softmax, REINFORCE, straight-through estimator. This is one place Reinforcement Learning (RL) still beats supervised learning.
The forward pass must be stored
To compute σ'(z) you need σ(z) (the output of the activation). So forward-pass activations must be cached for the backward pass — that’s where the memory cost comes from (and what gradient checkpointing trades for re-computation).
Backprop is local in time, global in space
Each weight’s update depends on the global error signal (the loss L), which the brain almost certainly does not have access to. This is the biggest argument that brains do not use backprop — see The neuroconnectionist research programme.
See also
Loss Surface — what ∇L(θ) is the gradient OF; dimensionality of real loss landscapes
Gradient Descent — the optimizer that uses the gradients backprop computes (GD, SGD, Adam, AdamW)
Hill Climbing — the discrete ancestor; GD is its continuous, gradient-using version
Perceptron — the simplest learning rule, conceptual predecessor of backprop