Gradient Backpropagation

methods-of-ai

Backpropagation is the algorithm that computes the gradient ∇L(θ) of a loss function with respect to every parameter of a neural network in a single backward pass through the computation graph. It’s the engine behind virtually every neural net trained since the 1980s, and it works because of one piece of high-school calculus: the chain rule.

Backprop itself is not an optimizer — it just computes the gradients. An optimizer (SGD, Adam, AdamW, …) then uses those gradients via Gradient Descent to update the weights. The pair “backprop computes ∇L → SGD uses ∇L” is the core loop of all modern AI.

The 90-second summary

Forward pass: push input through the net, store every intermediate activation, compute loss L.

Backward pass: starting from L, walk backwards through the graph, applying the chain rule at each node. At every weight you arrive at, you now have ∂L/∂w — the gradient for that weight.

Update: hand all those gradients to the optimizer, which takes a step (w ← w − η·∂L/∂w).

Repeat for the next mini-batch.
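The loop above can be sketched in a few lines. This is only an illustration under stated assumptions: a toy one-weight model ŷ = w·x, made-up data generated from y = 3x, and the full data set used as a single batch instead of mini-batches. The names `x`, `y`, `w` and `eta` are chosen for this sketch.

```python
import numpy as np

# Toy data (an assumption for illustration): y = 3x, so the ideal weight is 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0      # the single trainable weight
eta = 0.05   # learning rate

for step in range(200):
    # 1. Forward pass: compute the prediction and the loss, keeping y_hat.
    y_hat = w * x
    loss = 0.5 * np.mean((y_hat - y) ** 2)

    # 2. Backward pass: chain rule from the loss back to w.
    dL_dyhat = (y_hat - y) / x.size     # dL/dy_hat, one entry per sample
    dL_dw = np.sum(dL_dyhat * x)        # dL/dw = sum_i dL/dy_hat_i * dy_hat_i/dw

    # 3. Update: the optimizer step (plain gradient descent here).
    w -= eta * dL_dw

print(f"learned w = {w:.6f}, last loss = {loss:.3e}")
```

Steps 1 and 2 are backprop's job; step 3 is the optimizer's and could be swapped for another update rule without touching the gradient code.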

🧠 The one big idea — chain rule, applied recursively

A neural net is a deeply nested function: L = ℓ(f₃(f₂(f₁(x)))). The chain rule says:

∂L/∂x = ∂L/∂f₃ · ∂f₃/∂f₂ · ∂f₂/∂f₁ · ∂f₁/∂x

For a weight buried deep in layer 1, you’d think you need to recompute the whole chain. Backprop’s trick: compute these derivatives from the top down, and at every layer you’ve already done all the work needed for the layer below it. Each layer’s backward pass costs roughly the same as its forward pass → total backward cost ≈ forward cost. No exponential blow-up. This is the entire reason deep learning is computationally feasible.
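To make the reuse concrete, here is a small sketch with three hypothetical stages (`f1`, `f2`, `f3` are picked only for illustration, they are not from any network above). Each backward line reuses the derivative computed on the line before it, and a finite-difference estimate confirms the result.

```python
import math

# Hypothetical three-stage composition: L = f3(f2(f1(x))) = sin(2x)^2
def f1(x): return 2.0 * x          # df1/dx = 2
def f2(a): return math.sin(a)      # df2/da = cos(a)
def f3(b): return b ** 2           # df3/db = 2b

def forward_backward(x):
    # Forward: store every intermediate value.
    a = f1(x)
    b = f2(a)
    L = f3(b)
    # Backward: start at the output; each line reuses the one above it.
    dL_db = 2.0 * b                 # dL/df2
    dL_da = dL_db * math.cos(a)     # dL/df1 = dL/df2 * df2/df1
    dL_dx = dL_da * 2.0             # dL/dx  = dL/df1 * df1/dx
    return L, dL_dx

x0 = 0.3
L, grad = forward_backward(x0)

# Independent check by central finite differences.
eps = 1e-6
numeric = (f3(f2(f1(x0 + eps))) - f3(f2(f1(x0 - eps)))) / (2 * eps)
print(grad, numeric)
```

The backward part has one multiplication per stage, so its cost grows linearly with depth, just like the forward part.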

Why "back-propagation" is a great name

The gradient at the output is known immediately (∂L/∂output = predicted − target for the squared-error loss L = ½(ŷ − y)²). That signal then propagates backwards through the layers. Layer N uses the gradient flowing in from layer N+1 to compute its own contribution and pass a new gradient signal down to layer N−1. It’s a message-passing scheme, where the “message” is ∂L/∂(layer-output).
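One way to picture the message passing is to give every layer a `backward` method that receives ∂L/∂(its output) and returns ∂L/∂(its input). The sketch below is an assumption-laden illustration: the `Linear` and `Sigmoid` classes, the layer sizes and the random data are all invented here, and the loss is taken to be L = ½·Σ(out − target)².

```python
import numpy as np

class Linear:
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x                          # cache the input for the backward pass
        return x @ self.W
    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out       # this layer's own weight gradient
        return grad_out @ self.W.T          # message for the layer below

class Sigmoid:
    def forward(self, z):
        self.out = 1.0 / (1.0 + np.exp(-z))  # cache the output
        return self.out
    def backward(self, grad_out):
        return grad_out * self.out * (1.0 - self.out)

rng = np.random.default_rng(0)
layers = [Linear(rng.normal(size=(3, 4))), Sigmoid(),
          Linear(rng.normal(size=(4, 1))), Sigmoid()]
x = rng.normal(size=(5, 3))                 # 5 samples, 3 features
target = rng.uniform(size=(5, 1))

out = x
for layer in layers:                        # forward: bottom to top
    out = layer.forward(out)

grad = out - target                         # dL/d(output) for L = 0.5 * sum((out - target)^2)
for layer in reversed(layers):              # backward: top to bottom
    grad = layer.backward(grad)

print(layers[0].dW.shape, layers[2].dW.shape, grad.shape)
```

No layer needs to know anything about the rest of the network; it only needs the incoming message and what it cached during its own forward call.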

✏️ Worked example — backprop a tiny network by hand

Consider the smallest possible net: one input x, one hidden neuron, one output. Sigmoid activations, MSE loss.

x ─[w₁]→ z₁ ─σ→ h ─[w₂]→ z₂ ─σ→ ŷ ─compare to y→ L = ½(ŷ−y)²

Forward (with x=1, w₁=0.5, w₂=−0.3, y=1):

Quantity	Computation	Value
`z₁ = w₁ · x`	`0.5 · 1`	`0.500`
`h = σ(z₁)`	`1/(1+e⁻⁰·⁵)`	`0.622`
`z₂ = w₂ · h`	`−0.3 · 0.622`	`−0.187`
`ŷ = σ(z₂)`	`1/(1+e⁰·¹⁸⁷)`	`0.453`
`L = ½(ŷ − y)²`	`½(0.453 − 1)²`	`0.150`

Now backprop — walk back, accumulating δ = ∂L/∂(thing):

Step	Chain-rule expression	Value
`δŷ = ∂L/∂ŷ`	`ŷ − y = 0.453 − 1`	`−0.547`
`δz₂ = ∂L/∂z₂ = δŷ · σ'(z₂)`	`δŷ · ŷ·(1−ŷ) = −0.547 · 0.248`	`−0.135`
`∂L/∂w₂ = δz₂ · h`	`−0.135 · 0.622`	`−0.0840`
`δh = δz₂ · w₂`	`−0.135 · −0.3`	`+0.0405`
`δz₁ = δh · σ'(z₁) = δh · h·(1−h)`	`0.0405 · 0.622·0.378`	`+0.00952`
`∂L/∂w₁ = δz₁ · x`	`0.00952 · 1`	`+0.00952`

After one update with learning rate η=1:

w₂ ← w₂ − η · ∂L/∂w₂ = −0.3 − (−0.084) = −0.216
w₁ ← w₁ − η · ∂L/∂w₁ = 0.5 − 0.00952 = 0.490
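The hand calculation can be replayed in a few lines of plain Python. The sketch below uses the same inputs (x=1, w₁=0.5, w₂=−0.3, y=1, η=1) but carries full floating-point precision, so it gives ∂L/∂w₂ ≈ −0.0843 and ∂L/∂w₁ ≈ 0.00955. The small differences from the tables come from the tables multiplying intermediates that were already rounded to three digits. Variable names are chosen for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w1, w2, y = 1.0, 0.5, -0.3, 1.0

# Forward pass
z1 = w1 * x
h = sigmoid(z1)
z2 = w2 * h
y_hat = sigmoid(z2)
L = 0.5 * (y_hat - y) ** 2

# Backward pass (same order as the table)
d_yhat = y_hat - y                      # dL/dy_hat
d_z2 = d_yhat * y_hat * (1 - y_hat)     # dL/dz2
dL_dw2 = d_z2 * h
d_h = d_z2 * w2                         # dL/dh
d_z1 = d_h * h * (1 - h)                # dL/dz1
dL_dw1 = d_z1 * x

# One gradient-descent update with eta = 1
eta = 1.0
w2_new = w2 - eta * dL_dw2
w1_new = w1 - eta * dL_dw1

print(f"dL/dw2 = {dL_dw2:.5f}, dL/dw1 = {dL_dw1:.5f}")
print(f"w2 -> {w2_new:.3f}, w1 -> {w1_new:.3f}")
```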

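Notice: the gradient at w₁ is tiny (0.0095) compared to w₂ (0.084). Why? To reach w₁, the signal δz₂ has to pass through one more layer, which multiplies it by w₂ (here −0.3) and by σ'(z₁) (here 0.235, and never more than 0.25). Measured from δŷ, two sigmoid layers have already scaled the signal by at most 0.25² = 0.0625, even before the weight factors are counted. This is the vanishing-gradient problem, and a major reason ReLU replaced sigmoid as the default hidden activation in modern nets.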

💻 Code — implement backprop from scratch, train on XOR

Two layers, no autograd, just numpy. You will see the gradients with your own eyes.


# Pyodide / Obsidian Execute Code: install matplotlib first.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
np.random.seed(0)
 
# --- XOR data: famously NOT linearly separable, needs a hidden layer ---
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([[0],[1],[1],[0]], dtype=float)
 
# --- Architecture: 2 inputs → 4 hidden (tanh) → 1 output (sigmoid) ---
W1 = np.random.randn(2, 4) * 0.7;  b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.7;  b2 = np.zeros((1, 1))
 
def sigmoid(z):   return 1 / (1 + np.exp(-z))
def tanh(z):      return np.tanh(z)
def tanh_grad(h): return 1 - h**2          # derivative in terms of the OUTPUT
def sig_grad(p):  return p * (1 - p)
 
losses = []
lr = 0.5
 
for epoch in range(8000):
    # ---- FORWARD PASS ----
    z1 = X @ W1 + b1;        h1 = tanh(z1)
    z2 = h1 @ W2 + b2;       y_hat = sigmoid(z2)
    L  = np.mean((y_hat - y) ** 2)         # MSE
    losses.append(L)
 
    # ---- BACKWARD PASS (chain rule, layer by layer) ----
    # Output layer
    dL_dyhat = 2 * (y_hat - y) / y.size    # ∂L/∂ŷ  (MSE)
    dL_dz2   = dL_dyhat * sig_grad(y_hat)  # ∂L/∂z₂ = δŷ · σ'(z₂)
    dL_dW2   = h1.T @ dL_dz2               # ∂L/∂W₂ = h₁ᵀ · δz₂
    dL_db2   = dL_dz2.sum(axis=0, keepdims=True)
 
    # Hidden layer — gradient flows back through W₂
    dL_dh1   = dL_dz2 @ W2.T               # ∂L/∂h₁ = δz₂ · W₂ᵀ
    dL_dz1   = dL_dh1 * tanh_grad(h1)      # ∂L/∂z₁ = δh₁ · tanh'(z₁)
    dL_dW1   = X.T @ dL_dz1                # ∂L/∂W₁ = xᵀ · δz₁
    dL_db1   = dL_dz1.sum(axis=0, keepdims=True)
 
    # ---- WEIGHT UPDATE (plain gradient descent; all 4 samples form one batch) ----
    W2 -= lr * dL_dW2;  b2 -= lr * dL_db2
    W1 -= lr * dL_dW1;  b1 -= lr * dL_db1
 
# --- Visualize loss curve + decision boundary ---
fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
 
axes[0].plot(losses, lw=1.5, color='steelblue')
axes[0].set_xlabel('epoch'); axes[0].set_ylabel('MSE loss')
axes[0].set_title('Training loss (XOR, 2-4-1 net trained with hand-coded backprop)')
axes[0].set_yscale('log'); axes[0].grid(alpha=0.3)
 
# Decision boundary
xx, yy = np.meshgrid(np.linspace(-0.3, 1.3, 200), np.linspace(-0.3, 1.3, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = sigmoid(tanh(grid @ W1 + b1) @ W2 + b2).reshape(xx.shape)
cs = axes[1].contourf(xx, yy, zz, levels=20, cmap='RdBu_r', alpha=0.85)
axes[1].scatter(X[:,0], X[:,1], c=y.ravel(), s=250, cmap='RdBu_r',
                edgecolor='black', linewidth=2, vmin=0, vmax=1)
for i, (xi, yi) in enumerate(X):
    axes[1].annotate(f'{int(y[i,0])}', (xi, yi), ha='center', va='center',
                     fontweight='bold', fontsize=11)
axes[1].set_title('Learned decision boundary (XOR)')
axes[1].set_xlabel('x₁'); axes[1].set_ylabel('x₂')
plt.colorbar(cs, ax=axes[1], label='P(class=1)')
 
plt.tight_layout(); plt.show()
 
print(f"\nFinal loss: {losses[-1]:.6f}")
print(f"Predictions: {sigmoid(tanh(X @ W1 + b1) @ W2 + b2).ravel().round(3)}")
print(f"Targets:     {y.ravel()}")

What this shows:

Loss drops over ~3000 epochs as backprop pushes the weights into a configuration that separates XOR’s 4 points
Decision boundary curves around the two 1-points and the two 0-points — clearly non-linear, exactly what XOR needs
The whole gradient computation is ~10 lines of numpy. PyTorch/JAX automate this, but it’s not magic: it is just the chain rule applied backwards, from the loss to the first layer.
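A common sanity check for hand-written backprop (not part of the code above, added here as a sketch) is gradient checking: compare the analytic gradient with a central finite-difference estimate. The example rebuilds a 2-4-1 tanh/sigmoid network on the XOR data with freshly drawn weights. The function names `loss`, `backprop_dW1` and `numeric_dW1` are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.normal(size=(2, 4)) * 0.7
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)) * 0.7
b2 = np.zeros((1, 1))

def loss(W1, b1, W2, b2):
    h1 = np.tanh(X @ W1 + b1)
    y_hat = 1 / (1 + np.exp(-(h1 @ W2 + b2)))
    return np.mean((y_hat - y) ** 2)

def backprop_dW1(W1, b1, W2, b2):
    """Analytic dL/dW1, same chain-rule steps as the training loop."""
    h1 = np.tanh(X @ W1 + b1)
    y_hat = 1 / (1 + np.exp(-(h1 @ W2 + b2)))
    dz2 = 2 * (y_hat - y) / y.size * y_hat * (1 - y_hat)
    dz1 = (dz2 @ W2.T) * (1 - h1 ** 2)
    return X.T @ dz1

def numeric_dW1(W1, b1, W2, b2, eps=1e-6):
    """Central finite differences: two loss evaluations per weight."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        grad[idx] = (loss(Wp, b1, W2, b2) - loss(Wm, b1, W2, b2)) / (2 * eps)
    return grad

analytic = backprop_dW1(W1, b1, W2, b2)
numeric = numeric_dW1(W1, b1, W2, b2)
max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The numeric version needs two full forward passes per weight, which is fine for 8 weights and hopeless for millions. That gap is exactly what backprop closes.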

📋 The 5-step algorithm (cleaned up)

1. Define a piecewise-differentiable loss L(θ) measuring how wrong the network is on a training sample (or batch).
2. Initialize weights randomly (Xavier, He, etc.; never all zeros).
3. Forward pass: feed an input batch through the network, store every intermediate activation.
4. Backward pass: starting from ∂L/∂output, walk backwards through the computation graph applying the chain rule at each node. At every weight you obtain ∂L/∂w.
5. Update: each weight gets nudged: w ← w − η · ∂L/∂w (vanilla SGD; Adam/AdamW also use momentum + per-parameter learning rates).
Repeat steps 3–5 on the next mini-batch until convergence or early stopping.

The critical efficiency property: a backward pass costs the same Big-O as a forward pass (in practice a small constant multiple of it). For a fully connected net with N parameters, gradient computation is therefore O(N) per training example, not O(N²) or worse. This is what makes training 175B-parameter models tractable at all.

⚠️ The two classic failure modes

Problem	Why it happens	Modern fix
Vanishing gradients	Each backward layer multiplies by activation derivative (≤ 0.25 for sigmoid, ≤ 1 for tanh). After many layers, gradient ≈ 0 → early layers barely learn.	ReLU (derivative = 0 or 1, no shrinking) · Residual connections (gradient shortcut around blocks) · Batch/Layer Norm (rescales activations)
Exploding gradients	If weights are large, products of Jacobians can grow exponentially → loss becomes NaN.	Gradient clipping (cap ‖∇‖ at a threshold) · Careful initialization (Xavier/He scales weights to `~1/√fan_in`)

Both are direct consequences of the chain rule — the gradient signal is a long product, and any product either decays or explodes unless every factor is carefully managed.
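The "long product" can be watched directly on a deliberately simplified model. The sketch below assumes a scalar chain h ← act(w·h) repeated 20 times, which is an illustration rather than a realistic network, and a hypothetical helper `clip_by_norm` that shows the idea behind gradient clipping.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(depth, w, activation, x=0.5):
    """d(output)/d(input) of a scalar chain h <- act(w * h), repeated `depth` times."""
    h, local_grads = x, []
    for _ in range(depth):                    # forward pass, caching local derivatives
        z = w * h
        if activation == "sigmoid":
            h = sigmoid(z)
            local_grads.append(w * h * (1.0 - h))
        else:                                 # "relu"
            h = max(z, 0.0)
            local_grads.append(w * (1.0 if z > 0 else 0.0))
    grad = 1.0
    for g in reversed(local_grads):           # backward pass: one long product
        grad *= g
    return grad

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

vanishing = input_gradient(20, w=1.0, activation="sigmoid")  # every factor is below 0.25
stable = input_gradient(20, w=1.0, activation="relu")        # every factor is exactly 1
exploding = input_gradient(20, w=3.0, activation="relu")     # every factor is 3
clipped = clip_by_norm(np.array([exploding, 0.0]), max_norm=1.0)

print(f"sigmoid, w=1: {vanishing:.2e}")
print(f"relu,    w=1: {stable:.2e}")
print(f"relu,    w=3: {exploding:.2e} -> clipped norm {np.linalg.norm(clipped):.2f}")
```

Clipping keeps the direction of the gradient and only limits its length, so it tames exploding gradients but does nothing for vanishing ones.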

🌍 Where backpropagation is used today

Backprop is, without exaggeration, the most consequential algorithm in modern AI. Every neural network you’ve heard of is trained with it.

All large language models — GPT-4, Claude, Gemini, LLaMA, DeepSeek — backprop + Adam/AdamW
Image generation — Stable Diffusion, DALL-E, Imagen, Midjourney — backprop through diffusion U-Nets / transformer denoisers
Speech recognition — Whisper, Conformer — backprop through encoder-decoder networks
Protein folding — AlphaFold 2, ESMFold — backprop through Evoformer / structure modules
Self-driving perception — Tesla FSD, Waymo — backprop through CNNs and transformers on camera/LiDAR
Recommendation systems — YouTube, TikTok, Spotify — backprop through embedding + ranking networks
Reinforcement learning policies — DQN, PPO, SAC, AlphaZero — backprop through value/policy networks
Scientific ML — climate modeling, drug discovery, materials science — backprop through specialized architectures

Auto-differentiation frameworks (PyTorch, JAX, TensorFlow) made backprop universally accessible: you write the forward pass, the framework computes gradients automatically by tracking the computation graph. This is the infrastructure of modern AI.
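What "tracking the computation graph" means can be shown with a toy reverse-mode autodiff for scalars. This is a hypothetical miniature written for this note (the `Value` class supports only `+` and `*`); it is not how PyTorch, JAX or TensorFlow are implemented internally, only the same idea at the smallest scale.

```python
class Value:
    """A scalar that remembers which values it was computed from."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads     # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological order: a node's grad is complete before it is passed on.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0                      # dL/dL
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local   # chain rule, summed over paths

# The user writes only the forward pass ...
w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b
L = y * y

# ... and the graph recorded along the way yields every gradient.
L.backward()
print(L.data, w.grad, x.grad, b.grad)
```

Here L = (w·x + b)², so the expected gradients are 2(w·x + b)·x for w, 2(w·x + b)·w for x, and 2(w·x + b) for b.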

🔬 Where backpropagation is being challenged — and by what

Despite its dominance, backprop has serious problems: it requires storing all activations for the backward pass (memory-hungry), it doesn’t match how biological neurons learn (no plausible global error signal), and it’s hard to parallelize across layers (layer N can’t update until layer N+1’s gradient arrives).

Limitation of backprop	Proposed alternative	Status
Biologically implausible (needs symmetric weights + global error)	Feedback Alignment, Direct Feedback Alignment	Research — works on small tasks, struggles to scale
Locked computation (can’t update layer N until layer N+1 is done)	Synthetic Gradients, DNI (DeepMind, 2017)	Niche — never displaced standard backprop
Requires storing all activations	Gradient Checkpointing (now standard practice)	Widely used as memory optimization, not replacement
Inefficient for sequence models (memory grows with length)	Forward-Forward Algorithm (Hinton, 2022)	Active research — no major deployment yet
RLHF specifically (reward model is brittle)	DPO (Direct Preference Optimization)	Still uses backprop, just skips the RL loop
Quantum / neuromorphic hardware	Equilibrium Propagation, predictive coding	Theoretical — no scaled implementations

Bottom line: backprop has not been replaced. Every alternative either works on tiny problems or is itself trained with backprop under the hood. It’s the most successful single algorithm in AI history — and despite ~40 years of attempts to dethrone it, nothing has come close.

🎯 Exam traps

Backprop ≠ gradient descent

Backprop computes ∇L(θ). SGD/Adam uses it. They are two distinct algorithms in one pipeline. Saying “trained with backprop” is shorthand for “gradients computed by backprop, weights updated by SGD/Adam”. The exam will sometimes test if you can keep them apart.
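The separation is easy to see in code: one function produces gradients, and interchangeable update rules consume them. In this sketch the `gradient` function is a stand-in for backprop on an assumed toy loss L = ½‖θ − target‖², and `sgd_step`, `momentum_step` and `train` are names invented for the illustration.

```python
import numpy as np

TARGET = np.array([1.0, -2.0])

def gradient(theta):
    """Stand-in for backprop: dL/dtheta for L = 0.5 * ||theta - TARGET||^2."""
    return theta - TARGET

def sgd_step(theta, grad, state, lr=0.1):
    return theta - lr * grad, state

def momentum_step(theta, grad, state, lr=0.1, beta=0.9):
    velocity = beta * state + grad          # running accumulation of past gradients
    return theta - lr * velocity, velocity

def train(step_fn, steps=300):
    theta, state = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        grad = gradient(theta)              # job 1: compute the gradient
        theta, state = step_fn(theta, grad, state)   # job 2: use it
    return theta

theta_sgd = train(sgd_step)
theta_mom = train(momentum_step)
print(theta_sgd, theta_mom)
```

Both runs call the identical `gradient` function; only the rule that turns gradients into weight changes differs.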

Backprop requires differentiability

Backprop only works because every operation in the network has a defined derivative (at least piecewise, as with ReLU). Non-differentiable operations (argmax, discrete sampling, hard attention) break the chain → workarounds: Gumbel-Softmax, REINFORCE, straight-through estimator. This is one place where score-function estimators from Reinforcement Learning (RL), such as REINFORCE, are still needed, because they estimate a gradient without differentiating through the discrete step.
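As a small illustration of one workaround, the sketch below applies a straight-through estimator to a hard threshold. Everything in it is an assumption made for the example: a single parameter `theta`, a target output of 1, and the choice to treat the threshold's derivative as 1 in the backward pass.

```python
def hard_step(theta):
    """Non-differentiable forward op: its true derivative is 0 wherever it exists."""
    return 1.0 if theta > 0 else 0.0

theta, target, lr = -0.9, 1.0, 0.1
history = []

for _ in range(20):
    out = hard_step(theta)
    loss = (out - target) ** 2
    dL_dout = 2.0 * (out - target)

    true_grad = dL_dout * 0.0     # exact chain rule: no learning signal at all
    ste_grad = dL_dout * 1.0      # straight-through: pretend d(out)/d(theta) = 1

    theta -= lr * ste_grad        # update with the surrogate gradient
    history.append(loss)

print(history[:7], theta)
```

With the exact gradient, `theta` would never move. The surrogate is biased, but it pushes `theta` across the threshold, after which the loss and the update are both zero.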

The forward pass must be stored

To compute σ'(z) you need σ(z) (the output of the activation). So forward-pass activations must be cached for the backward pass — that’s where the memory cost comes from (and what gradient checkpointing trades for re-computation).
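The trade can be sketched on a toy scalar chain hᵢ₊₁ = tanh(wᵢ·hᵢ), with the final activation taken as the loss. This is an illustrative assumption, not a framework API: `grads_full` caches every activation, while `grads_checkpointed` keeps only every `every`-th one and recomputes each segment when the backward pass reaches it.

```python
import numpy as np

def layer(h, w):
    return np.tanh(w * h)

def backward_segment(acts, ws, grad, dws, offset):
    """Backprop through one stored segment; returns the message for the layer below."""
    for i in reversed(range(len(ws))):
        dz = grad * (1.0 - acts[i + 1] ** 2)   # through tanh, needs the cached output
        dws[offset + i] = dz * acts[i]         # dL/dw_i, needs the cached input
        grad = dz * ws[i]
    return grad

def grads_full(x, ws):
    acts = [x]
    for w in ws:                               # store all len(ws) + 1 activations
        acts.append(layer(acts[-1], w))
    dws = np.zeros(len(ws))
    backward_segment(acts, ws, 1.0, dws, 0)
    return dws

def grads_checkpointed(x, ws, every):
    checkpoints, h = {0: x}, x
    for i, w in enumerate(ws):                 # forward: keep only some activations
        h = layer(h, w)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = h
    dws, grad = np.zeros(len(ws)), 1.0
    for start in reversed(range(0, len(ws), every)):
        end = min(start + every, len(ws))
        acts = [checkpoints[start]]
        for w in ws[start:end]:                # recompute this segment's activations
            acts.append(layer(acts[-1], w))
        grad = backward_segment(acts, ws[start:end], grad, dws, start)
    return dws

ws = np.linspace(0.5, 1.5, 8)
dws_full = grads_full(0.7, ws)
dws_ckpt = grads_checkpointed(0.7, ws, every=3)
print(np.max(np.abs(dws_full - dws_ckpt)))
```

Both functions return the same gradients. The checkpointed version holds fewer activations at any one time and pays for it with one extra forward computation per segment.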

Backprop is local in time, global in space

Each weight’s update depends on the global error signal (the loss L), which the brain almost certainly does not have access to. This is the biggest argument that brains do not use backprop — see The neuroconnectionist research programme.
