Methods of AI — SoSe 2026 · Sessions 12 (Neural Networks) & 13 (Deep Learning, incl. Transformers). This is the topic hub for everything from Hopfield networks through MLPs/backprop and deep-learning practicalities to Transformer self-attention.
Neural networks are layered units computing weighted sums + activation. Trained via backpropagation of error gradients.
Hopfield networks are energy-based associative memories — symmetric weights, asynchronous updates, guaranteed convergence to a local energy minimum (stored or parasitic pattern).
Deep networks stack many layers for higher expressivity. Practical challenges: vanishing/exploding gradients, overfitting, compute. Solutions: ReLU, residual connections, dropout, weight decay, early stopping.
Transformers replace recurrence with self-attention: every token attends to every other token in O(n²). The backbone of BERT, GPT, and modern LLMs.
Glossary — important vocabulary
Neuron / unit. Computes y = φ(Σᵢ wᵢ xᵢ + b) where φ is an activation function.
Weights wᵢⱼ. Connection strength from neuron j (source) to neuron i (target). In Hopfield slides notation, w_ji denotes the weight on the incoming edge into i from j. Because Hopfield weights are symmetric (w_ji = w_ij), the two indexings agree.
Bias b. Per-neuron constant shift.
Hopfield network. Recurrent net with symmetric weights, no self-connections (wᵢᵢ = 0), binary states yᵢ ∈ {−1, +1}. Stores patterns as fixed points of an energy function.
Energy function. Scalar Lyapunov function that never increases under valid updates. For Hopfield: E = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ.
Parasitic memory (spurious state). A local energy minimum that was not stored explicitly — the network “remembers” patterns it was never trained on.
Capacity. Maximum number of patterns that can be reliably stored and retrieved.
ReLU. Rectified Linear Unit: ReLU(x) = max(0, x). Gradient is 1 for x > 0, 0 for x < 0 — avoids saturation.
Vanishing gradient. With sigmoid/tanh activations, gradients shrink toward 0 across many layers because each layer multiplies by a factor ≤ 1 → early layers barely learn.
Exploding gradient. Mirror problem: gradients grow without bound. Mitigated by gradient clipping.
Residual connection.y = F(x) + x — skip connection adds the layer’s input to its output. Gradient can flow directly through the identity → enables very deep networks (ResNet).
Dropout. During training, randomly zero out a fraction p of neurons each forward pass → ensemble effect, less co-adaptation.
Weight decay (L2 regularization). Add λ Σ ‖w‖² to the loss → penalises large weights.
Early stopping. Halt training when validation loss stops improving.
Autoencoder. Encoder–bottleneck–decoder net trained to reconstruct its input → learns a compressed code.
Q / K / V (Query / Key / Value). Three learned linear projections of an input token. Query asks “what am I looking for?”, Key advertises “what do I offer?”, Value is the actual content aggregated.
Self-attention. Q, K, V all derived from the same sequence → every token attends to every other token (and to itself).
Multi-head attention. Run H parallel attention heads on different learned subspaces, concatenate, project → captures multiple relation types simultaneously.
Positional encoding. Added to input embeddings so the (otherwise permutation-invariant) Transformer knows token order. Vaswani et al. used fixed sinusoidal encodings of absolute positions.
N neurons, all-to-all connected, symmetric weights wᵢⱼ = wⱼᵢ, no self-loops wᵢᵢ = 0.
Binary states yᵢ ∈ {−1, +1} (some textbooks use {0, 1}).
Update rule
Asynchronous: pick one neuron i at random, update: yᵢ ← sign(Σⱼ≠ᵢ wᵢⱼ yⱼ + bᵢ)
The slides write the sum using w_ji (the incoming weight into i from j). Because W is symmetric, w_ji = w_ij — same formula.
Energy function
E(y) = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ
Energy never increases when one neuron is updated to obey the update rule (Lyapunov property).
Therefore the network always converges to a local minimum — but not necessarily the global one, and not necessarily a stored pattern.
Hebbian learning (storing P patterns ξ^(1), …, ξ^(P))
wᵢⱼ = (1/N) Σₚ ξᵢ^(p) ξⱼ^(p) for i ≠ j, wᵢᵢ = 0.
Capacity
Slides: state “up to N target memories” for N neurons (a loose theoretical upper bound).
External textbook fact (Amit, Gutfreund & Sompolinsky 1985, not Hopfield 1982): the reliable retrieval capacity of the standard Hopfield model is Pₘₐₓ ≈ 0.138 · N. Above this critical load, all stored memories become unstable.
⚠️ For exam answers, use the slide claim (“up to N”). Flag the 0.138·N number as external if you cite it.
Parasitic / spurious memories
Local minima of E that were never stored — the network “remembers” garbage patterns.
Sources: linear combinations of stored patterns (mixture states), inverted patterns (−ξ has the same energy as ξ).
More patterns → more parasitic states → degraded retrieval.
Content-addressable memory
Initialize the network near a stored pattern (e.g. a noisy version) → updates flow downhill in energy → converges to the clean pattern.
This is associative retrieval (by content), as opposed to addressed retrieval (by index).
Slides’ version (constructive): a feedforward network with 2 hidden layers of finite width and non-linear activations can approximate any continuous function on a compact domain to arbitrary precision.
Textbook version (Cybenko 1989; Hornik et al. 1989): even 1 hidden layer with sigmoid (or any non-polynomial) activation is sufficient — this is the standard textbook universal approximation theorem.
⚠️ Slide-vs-textbook difference: the Methods of AI slides use the 2-hidden-layer constructive proof, NOT the 1-hidden-layer Cybenko result. On the exam, follow the slide claim.
Width and number of units can be astronomically large — universality is an existence result, not a practicality result. This motivates deeper (not just wider) networks.
Deep Learning — challenges & solutions
Challenge
Root cause
Solution
Vanishing gradients
Sigmoid/tanh saturate; products of small numbers shrink
Q (queries): what each token is looking for. Shape (n, d_k).
K (keys): what each token offers as context. Shape (n, d_k).
V (values): the actual content to aggregate. Shape (n, d_v).
Step 1: dot products Q Kᵀ give an (n × n) matrix of raw similarity scores.
Step 2: divide by √d_k — prevents the scores from growing too large in high dimensions (where they would push softmax into saturated regions with near-zero gradient).
Step 4: multiply by V → each token’s output is a weighted average of all tokens’ values.
Self-attention
Q, K, V are all linear projections of the same input X: Q = X W_Q, K = X W_K, V = X W_V.
Every token attends to every other token (and to itself).
Captures long-range dependencies in O(1) hop count (compare RNN’s O(n) hops).
Multi-head attention
Run H attention operations in parallel, each with its own learned W_Q^(h), W_K^(h), W_V^(h).
Concatenate the H outputs and project: MultiHead = Concat(head₁, …, head_H) W_O.
Different heads can specialise in different relationship types (syntactic, coreference, positional, etc.).
Positional encoding
Self-attention is permutation-equivariant — without position info, “the cat sat on the mat” and “the mat sat on the cat” produce the same outputs.
Solution: add a positional embedding to each input token’s vector.
Vaswani et al. (2017) used fixed sinusoidal encodings of absolute positions: PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...).
⚠️ Slide error to be aware of: the Methods of AI slides claim Vaswani used relative positional encoding. This is incorrect — Vaswani’s original paper used absolute sinusoidal encoding. (Relative encodings exist — Shaw et al. 2018, T5, RoPE — but those came later.) Flag this slide claim if you encounter it.
Reliable Hopfield capacity (Amit et al. 1985 — external)
ReLU(x) = max(0, x)
Rectified linear unit
δ^(ℓ) = (W^(ℓ+1))ᵀ δ^(ℓ+1) ⊙ φ'(z^(ℓ))
Backprop error signal
L_wd = L + λ Σ ‖w‖²
Loss with weight decay
Attention(Q,K,V) = softmax(Q Kᵀ / √d_k) V
Scaled dot-product attention
Q, K, V
Query / Key / Value matrices
d_k
Dimension of key vectors (scaling factor)
PE(pos, 2i) = sin(pos / 10000^(2i/d))
Sinusoidal positional encoding
Common Exam Traps ⚠️
Hopfield energy decreases monotonically under asynchronous updates — it cannot oscillate. Convergence is to a local minimum, not the global one.
Parasitic memories are local minima the network never explicitly stored — they exist alongside (and dilute) the desired memories.
Hopfield capacity slide claim says “up to N target memories” — the realistic ~0.138·N figure is Amit et al. 1985, not Hopfield 1982. Cite it as external if asked.
Symmetric weights: w_ji = w_ij in Hopfield. The slide’s w_ji notation (incoming weight) is mathematically identical to w_ij.
Universal approximation — slide vs. textbook: slides use the 2-hidden-layer constructive proof. Textbook (Cybenko) gives 1 hidden layer. Quote the slide version on the exam.
Positional encoding is required in Transformers because self-attention is permutation-equivariant.
Vaswani used absolute sinusoidal PE, not relative — the slides have this wrong. Be ready to give the correct answer if pressed but go with the slide answer on a slide-derived question.
√d_k scaling prevents large dot products from pushing softmax into saturation (near-zero gradient).
Self-attention attends to all positions including itself; masked (causal) attention attends only to past positions — that’s the BERT vs. GPT split.
ReLU vs. sigmoid: sigmoid saturates → vanishing gradient. ReLU has constant gradient 1 for x > 0.
Residual connections carry gradient through the identity path — what enables very deep networks (ResNet).
BERT bidirectional, GPT unidirectional — don’t mix these up.
Dropout is training-only; at test time use the full network with rescaled activations.
Comparison Table
Architecture
Memory mechanism
Recurrence
Long-range deps
Training
Hopfield
Associative (energy minima)
No (asynchronous updates)
Limited by capacity (~0.138·N)
Hebbian (one-shot)
MLP
Distributed in weights
No
None (no sequence)
Backprop + SGD
RNN
Hidden state over time
Yes
Vanishes over long sequences
BPTT (backprop through time)
Transformer
Via attention weights (per-pass)
No
Full — every token attends every other in O(n²)
Backprop + Adam
Slide-vs-textbook fact-check notes ⚠️
Hopfield capacity. Slides: “up to N target memories.” Reality (Amit, Gutfreund, Sompolinsky 1985): Pₘₐₓ ≈ 0.138 · N. The 0.138·N number is not from Hopfield 1982 and not in the slides — cite it as external.
Universal approximation. Slides use a 2-hidden-layer constructive proof. The textbook 1-hidden-layer version (Cybenko 1989) is not the slide claim.
Positional encoding (Vaswani et al. 2017). Slides claim Vaswani used relative encoding. This is a slide error — Vaswani actually used absolute sinusoidal encoding. Relative PE comes from Shaw et al. 2018 and later work.
Modern Hopfield Networks ↔ Self-Attention. The Ramsauer et al. 2020 “Hopfield Networks Is All You Need” result connecting modern Hopfield networks to Transformer self-attention is not in the lecture slides. Treat related quiz items (e.g. Q100 / Q136 in the mega quiz) as supplementary, not exam-core.
Hopfield update notation. Slides write the incoming weight as w_ji. With symmetric W this equals w_ij — pure notation choice, no math difference.
Visualisations (Python)
Five executable figures that make the neural-network and deep-learning concepts in this hub concrete. Each toggle opens a self-contained Pyodide block — install matplotlib (and numpy where needed) and plt.show() renders inline in Obsidian. Read the prose under each toggle for what to look for.
🐍 Figure — Activation functions (sigmoid, tanh, ReLU, Leaky ReLU) and their derivatives
What this shows. Top: the four activation curves on x ∈ [−5, 5]. Bottom: their derivatives — the quantity that backprop multiplies layer by layer.
Things to notice.
Sigmoid and tanh saturate. For |x| ≳ 3 their derivatives are essentially zero. Stack 10 sigmoid layers and the product of gradients shrinks to ~10⁻⁹ → vanishing gradient.
ReLU’s derivative is exactly 1 for x > 0. Multiplying by 1 a hundred times still gives 1. This is the single biggest reason ReLU dominates modern deep learning.
ReLU’s flaw: derivative is 0 for x ≤ 0. A unit that drifts into the negative half stops learning (dying ReLU). Leaky ReLU keeps a small slope α = 0.1 there → the unit can recover.
🐍 Figure — Hopfield energy landscape (8 neurons, 2 stored patterns, all 256 states)
import micropipawait micropip.install("matplotlib")import matplotlib.pyplot as pltimport numpy as npN = 8# Two stored bipolar patterns (in {-1, +1}^N)xi1 = np.array([ 1, 1, 1, 1, -1, -1, -1, -1])xi2 = np.array([ 1, -1, 1, -1, 1, -1, 1, -1])patterns = [xi1, xi2]# Hebbian learning rule: W = (1/N) Σ_p ξ_p ξ_p^T, diag(W) = 0W = np.zeros((N, N))for p in patterns: W += np.outer(p, p)W /= Nnp.fill_diagonal(W, 0.0)def energy(y, W): return -0.5 * y @ W @ y# Enumerate all 2^N = 256 binary states in {-1, +1}^Nstates = []energies = []for k in range(2 ** N): bits = [(k >> i) & 1 for i in range(N)] y = np.array([2 * b - 1 for b in bits]) # 0/1 -> -1/+1 states.append(y); energies.append(energy(y, W))states = np.array(states); energies = np.array(energies)# Sort by energyorder = np.argsort(energies)energies_sorted = energies[order]; states_sorted = states[order]# Locate stored patterns and their negatives (negatives have same energy)def find_idx(target): for i, s in enumerate(states_sorted): if np.array_equal(s, target): return i return Noneidx_xi1 = find_idx(xi1); idx_xi2 = find_idx(xi2)idx_xi1_n = find_idx(-xi1); idx_xi2_n = find_idx(-xi2)# Find local minima (state with no Hamming-1 neighbour of lower energy)def is_local_min(i, all_states, all_E): y = all_states[i]; E = all_E[i] for j in range(N): y2 = y.copy(); y2[j] = -y2[j] # locate y2's energy for k, s in enumerate(all_states): if np.array_equal(s, y2): if all_E[k] < E: return False break return True# Compute local minima against UNSORTED list (Hamming neighbours)local_min_mask = np.array([is_local_min(i, states, energies) for i in range(len(states))])# Map back into sorted ordersorted_local_min = local_min_mask[order]fig, ax = plt.subplots(figsize=(11, 5))xs = np.arange(len(energies_sorted))ax.plot(xs, energies_sorted, color="#888", lw=1, zorder=1, label="all 2⁸ = 256 binary states (sorted by E)")# Mark stored patternsfor idx, lbl, col in [(idx_xi1, "stored ξ₁", "#e74c3c"), (idx_xi2, "stored ξ₂", "#27ae60"), (idx_xi1_n, "−ξ₁ (mirror)", "#c0392b"), (idx_xi2_n, "−ξ₂ (mirror)", "#16a085")]: if idx is not None: ax.scatter([idx], [energies_sorted[idx]], s=140, color=col, edgecolor="black", lw=1.2, zorder=4, label=lbl)# Mark parasitic local minima (local minima that are NOT one of the stored / mirror states)stored_set = {tuple(xi1), tuple(-xi1), tuple(xi2), tuple(-xi2)}parasitic_idx = [i for i in range(len(states_sorted)) if sorted_local_min[i] and tuple(states_sorted[i]) not in stored_set]if parasitic_idx: ax.scatter(parasitic_idx, energies_sorted[parasitic_idx], s=80, facecolor="none", edgecolor="#9b59b6", lw=1.8, zorder=3, label=f"parasitic minima ({len(parasitic_idx)})")ax.set_xlabel("state index (sorted by energy)")ax.set_ylabel("E(y) = −½ yᵀ W y")ax.set_title("Hopfield energy landscape — N = 8, 2 stored patterns (Hebbian rule)\n" "stored ξ live at the bottom of the basin; mirrors share the same energy")ax.grid(alpha=0.3); ax.legend(loc="lower right", fontsize=9)plt.tight_layout(); plt.show()print(f"Stored ξ₁ energy: {energy(xi1, W):.3f}")print(f"Stored ξ₂ energy: {energy(xi2, W):.3f}")print(f"−ξ₁ energy: {energy(-xi1, W):.3f} (identical to ξ₁ — symmetry of E)")print(f"Number of local minima total: {local_min_mask.sum()}")print(f"Of those, parasitic (not ±ξ₁, ±ξ₂): {len(parasitic_idx)}")
What this shows. All 2⁸ = 256 possible binary states of an 8-neuron Hopfield net, sorted by their energy E(y) = −½ yᵀ W y. Weights come from the Hebbian rule on two stored patterns ξ₁, ξ₂.
Things to notice.
The stored patterns sit at the global minimum — exactly what Hebbian learning is designed to achieve.
Mirror patterns −ξ have the same energy as ξ (because E is quadratic in y → invariant under sign flip). This is built into the energy function and is one source of spurious memories.
Purple circles mark parasitic local minima — states the network was never told to store but which still trap asynchronous updates. Their count grows quickly as you store more patterns; this is the empirical reason behind the ~0.138·N capacity bound.
🐍 Figure — Self-attention weights on a toy 6-token sentence
import micropipawait micropip.install("matplotlib")import matplotlib.pyplot as pltimport numpy as nprng = np.random.default_rng(7)tokens = ["The", "cat", "sat", "on", "the", "mat"]n = len(tokens)d_model = 16d_k = 8# Token embeddings (random, just to demo)X = rng.standard_normal((n, d_model))# Random Q, K projection matrices (we don't need V for the heatmap)W_Q = rng.standard_normal((d_model, d_k)) * 0.3W_K = rng.standard_normal((d_model, d_k)) * 0.3Q = X @ W_Q # (n, d_k)K = X @ W_K # (n, d_k)# Scaled dot-product scoresscores = Q @ K.T / np.sqrt(d_k) # (n, n)# Row-wise softmax → attention weightsdef softmax(z, axis=-1): z = z - z.max(axis=axis, keepdims=True) e = np.exp(z) return e / e.sum(axis=axis, keepdims=True)A = softmax(scores, axis=-1) # rows sum to 1fig, ax = plt.subplots(figsize=(7, 6))im = ax.imshow(A, cmap="viridis", vmin=0, vmax=A.max())ax.set_xticks(range(n)); ax.set_xticklabels(tokens, rotation=30, ha="right")ax.set_yticks(range(n)); ax.set_yticklabels(tokens)ax.set_xlabel("Key (what each token offers)")ax.set_ylabel("Query (what each token is looking for)")ax.set_title("Self-attention weights softmax(Q Kᵀ / √d_k)\n" "row i = how token i distributes its attention across all tokens")# Annotate each cell with the weight valuefor i in range(n): for j in range(n): ax.text(j, i, f"{A[i,j]:.2f}", ha="center", va="center", color="white" if A[i,j] < A.max() * 0.55 else "black", fontsize=8)plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04, label="attention weight")plt.tight_layout(); plt.show()print("Row sums (should each be 1.0):", A.sum(axis=1).round(3))
What this shows. A typical attention heatmap. Rows = queries (one per token, asking “what should I look at?”). Cols = keys (what each token offers). Each cell is the softmax-normalised attention weight — every row sums to exactly 1.
Things to notice.
Even with random Q and K matrices the rows are clearly non-uniform — some tokens get more weight than others.
After training, real Transformers learn weights that concentrate on syntactically/semantically relevant tokens (e.g. for a pronoun, the matching noun lights up).
The √d_k divisor in scores = Q K^T / √d_k is what keeps softmax out of the saturated regime here — without it, larger d_k would push the maximum scores toward ±∞ and the rows would degenerate into one-hot vectors with zero gradient elsewhere.
🐍 Figure — Sinusoidal positional encoding (Vaswani et al. 2017)
import micropipawait micropip.install("matplotlib")import matplotlib.pyplot as pltimport numpy as npd_model = 64max_pos = 50# PE(pos, 2i) = sin(pos / 10000^(2i/d))# PE(pos, 2i+1) = cos(pos / 10000^(2i/d))pos = np.arange(max_pos)[:, None] # (max_pos, 1)i = np.arange(d_model)[None, :] # (1, d_model)# use 2*(i//2) so neighbouring dims share the same frequencydiv = np.power(10000.0, (2 * (i // 2)) / d_model)angles = pos / div # (max_pos, d_model)PE = np.zeros((max_pos, d_model))PE[:, 0::2] = np.sin(angles[:, 0::2]) # even dims → sinPE[:, 1::2] = np.cos(angles[:, 1::2]) # odd dims → cosfig, ax = plt.subplots(figsize=(10, 5))im = ax.imshow(PE, aspect="auto", cmap="RdBu_r", vmin=-1, vmax=1)ax.set_xlabel("embedding dimension i (0 … 63)")ax.set_ylabel("position pos (0 … 49)")ax.set_title("Sinusoidal positional encoding PE(pos, i)\n" "Vaswani et al. 2017 — fixed (not learned), absolute positions")# Annotate the frequency gradientax.annotate("low-index dims\nvary FAST with position", xy=(3, 25), xytext=(8, 5), fontsize=9, color="black", arrowprops=dict(arrowstyle="->", color="black"))ax.annotate("high-index dims\nvary SLOW with position", xy=(60, 25), xytext=(40, 45), fontsize=9, color="black", arrowprops=dict(arrowstyle="->", color="black"))plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04, label="PE value")plt.tight_layout(); plt.show()
What this shows. The full (50 × 64) positional-encoding matrix. Each row is the vector added to the embedding of one token position; each column is one of the 64 embedding dimensions.
Things to notice.
Low dimensions (left) vary fast with position — short-wavelength sinusoids encode fine-grained, local position info.
High dimensions (right) vary slowly — long-wavelength sinusoids encode coarse, “where in the document am I” info.
This multi-scale frequency stack is why the encoding can represent both relative offsets and absolute positions, and why the model can extrapolate to sequence lengths it wasn’t trained on.
⚠️ Slide-vs-paper: the Methods of AI slides call this “relative” positional encoding. Vaswani’s original paper uses absolute sinusoidal encoding — exactly the matrix above. Relative PE (Shaw et al. 2018, RoPE, ALiBi) came later.
🐍 Figure — Vanishing gradient: sigmoid stack vs ReLU stack
import micropipawait micropip.install("matplotlib")import matplotlib.pyplot as pltimport numpy as np# Sample pre-activations from a typical "post-init" distributionrng = np.random.default_rng(0)z = rng.standard_normal(2000)def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))def dsigmoid(x): s = sigmoid(x); return s * (1 - s)def drelu(x): return (x > 0).astype(float)max_layers = 30layers = np.arange(1, max_layers + 1)# For each L, the product of L i.i.d. derivative samples — averaged over the samplesig_d = dsigmoid(z) # per-sample derivativerelu_d = drelu(z)# Geometric mean of derivative magnitudes is a clean proxy for# how a product of L of them scales: product ≈ (geo_mean)^Lsig_gm = np.exp(np.mean(np.log(sig_d + 1e-12)))relu_gm = np.exp(np.mean(np.log(relu_d + 1e-12))) # 0 for dead units → tiny floorsig_prod = sig_gm ** layersrelu_prod = relu_gm ** layersfig, ax = plt.subplots(figsize=(9, 5))ax.semilogy(layers, sig_prod, "o-", color="#3498db", label="sigmoid stack (∏ σ'(z))")ax.semilogy(layers, relu_prod, "s-", color="#e74c3c", label="ReLU stack (∏ ReLU'(z))")ax.axhline(1.0, color="#888", lw=0.7)ax.set_xlabel("number of layers L")ax.set_ylabel("typical magnitude of ∏ₗ φ'(zₗ) (log scale)")ax.set_title("Why sigmoid stacks vanish and ReLU stacks don't\n" "(geometric-mean product of layer-wise activation derivatives)")ax.grid(alpha=0.3, which="both"); ax.legend(fontsize=10)ax.annotate(f"after 10 sigmoid layers ≈ {sig_prod[9]:.1e}", xy=(10, sig_prod[9]), xytext=(12, sig_prod[9] * 0.3), fontsize=9, color="#3498db", arrowprops=dict(arrowstyle="->", color="#3498db"))ax.annotate(f"ReLU stays near {relu_prod[9]:.2f}", xy=(10, relu_prod[9]), xytext=(14, relu_prod[9] * 2), fontsize=9, color="#e74c3c", arrowprops=dict(arrowstyle="->", color="#e74c3c"))plt.tight_layout(); plt.show()
What this shows. Backprop multiplies one activation derivative per layer. The plot tracks the typical product magnitude ∏ₗ φ'(zₗ) (geometric mean over a Gaussian pre-activation sample) as the network gets deeper.
Things to notice.
Sigmoid’s derivative has maximum value 0.25. Multiplying 10 of these gives ≈ 4·10⁻⁷ — the gradient reaching the first layer is effectively zero. That is the vanishing-gradient problem.
ReLU’s derivative is either 0 or 1. On active units the product stays at 1 regardless of depth → gradients survive intact across 30+ layers.
This single property — combined with residual connections and good init (He/Xavier) — is what made networks with hundreds of layers practical (ResNet, Transformers).
ALGORITHMS (full reference) ⭐
Pseudocode and intuition for the algorithms you might write or trace by hand in the exam. Organised by topic:
Goal. Asynchronously update one neuron at a time toward a stored pattern.
HOPFIELD-UPDATE(state y, weights W, biases b): repeat until stable: pick neuron i uniformly at random s ← Σⱼ≠ᵢ W[i, j] * y[j] + b[i] y[i] ← +1 if s ≥ 0 else −1 return y
Slide-equivalent form writes W[j, i] (= w_ji) instead of W[i, j] — same value because W is symmetric.
Property. Energy E(y) = −½ yᵀ W y − bᵀ y is non-increasing on every update → guaranteed convergence to a local minimum (stored pattern or parasitic state).
2. Hopfield Hebbian learning
Goal. Store P bipolar patterns ξ^(1), …, ξ^(P) ∈ {−1, +1}^N as fixed points.
HEBBIAN-LEARN(patterns ξ^(1..P), N): initialise W ← 0 # N × N matrix for p in 1..P: for i in 1..N: for j in 1..N: if i ≠ j: W[i, j] += ξ^(p)[i] * ξ^(p)[j] W ← W / N return W
One-shot. No iteration over the data — single pass. Symmetric.W[i, j] = W[j, i] by construction. Capacity. Slides: “up to N”. Reality: ~0.138 · N (Amit et al. 1985).
3. MLP forward pass
Goal. Compute the output of a feedforward network with L layers.
FORWARD(x, weights {W^(1..L)}, biases {b^(1..L)}, activation φ): h ← x for ℓ in 1..L: z^(ℓ) ← W^(ℓ) · h + b^(ℓ) h^(ℓ) ← φ(z^(ℓ)) # ReLU / sigmoid / tanh h ← h^(ℓ) return h # network output
4. Backpropagation (chain rule pseudocode)
Goal. Compute gradient of the loss w.r.t. all weights via the chain rule, layer by layer, backward.
BACKPROP(x, target t, weights {W^(1..L)}, biases {b^(1..L)}): # --- forward pass, cache intermediates --- h^(0) ← x for ℓ in 1..L: z^(ℓ) ← W^(ℓ) · h^(ℓ−1) + b^(ℓ) h^(ℓ) ← φ(z^(ℓ)) L ← loss(h^(L), t) # --- backward pass --- δ^(L) ← ∇_{h^(L)} loss ⊙ φ'(z^(L)) # output-layer error for ℓ in L−1 .. 1 step −1: δ^(ℓ) ← (W^(ℓ+1))ᵀ · δ^(ℓ+1) ⊙ φ'(z^(ℓ)) # propagate back # --- gradients --- for ℓ in 1..L: grad_W^(ℓ) ← δ^(ℓ) · (h^(ℓ−1))ᵀ grad_b^(ℓ) ← δ^(ℓ) return {grad_W^(ℓ)}, {grad_b^(ℓ)}
Cost. Same order as the forward pass: O(network size) per example. Numerical pitfalls. Vanishing gradient when φ’ is small (sigmoid saturation); exploding gradient when products of W norms exceed 1.
5. Gradient Descent step
Goal. Apply the gradient computed by backprop to update the parameters.
SGD-STEP(weights W, gradients ∇W, learning rate η, weight decay λ): for each parameter w in W: w ← w − η · (∇w + 2 λ w) # 2λw term = L2 / weight decay
Variants. Momentum, Nesterov, AdaGrad, RMSProp, Adam (default for Transformers).
6. Dropout
Goal. Regularise by randomly silencing units during training.
DROPOUT-FORWARD(h, drop_prob p, training): if training: m ← Bernoulli(1 − p, shape = h.shape) # 0/1 mask return (h * m) / (1 − p) # inverted dropout: rescale else: return h # no dropout at test time
Why divide by (1 − p)? Keeps expected activation the same as in eval mode → no need to rescale at test time.
7. Scaled Dot-Product Attention
Goal. Compute attention output given Q, K, V matrices.
Reading the result. Token 1 (q1 = [1,0]) attends most to keys that share its first component (k1, k3) → its output is closer to v1 + v3. Token 3 (q3 = [1,1]) is most similar to k3, so its output is dominated by v3 = (1,1). This is exactly the intuition: high Q·K similarity → high attention weight → that V contributes most.