Neural Networks & Deep Learning

chatbot methods-of-ai ai-generated

Methods of AI — SoSe 2026 · Sessions 12 (Neural Networks) & 13 (Deep Learning, incl. Transformers). This is the topic hub for everything from Hopfield networks through MLPs/backprop and deep-learning practicalities to Transformer self-attention.

Companion files for exam prep:

Table of contents

  1. Core Ideas
  2. Glossary — important vocabulary
  3. Hopfield Networks
  4. Multilayer Perceptron (MLP)
  5. Backpropagation
  6. Universal Approximation
  7. Deep Learning — challenges & solutions
  8. Autoencoders
  9. Transformers & Self-Attention
  10. BERT vs. GPT
  11. Formulas & Notation
  12. Common Exam Traps ⚠️
  13. Comparison Table
  14. Slide-vs-textbook fact-check notes ⚠️
  15. Visualisations (Python)
  16. ALGORITHMS
    • [[#1. Hopfield update rule|#1 Hopfield update rule]]
    • [[#2. Hopfield Hebbian learning|#2 Hopfield Hebbian learning]]
    • [[#3. MLP forward pass|#3 MLP forward pass]]
    • [[#4. Backpropagation (chain rule pseudocode)|#4 Backpropagation]]
    • [[#5. Gradient Descent step|#5 Gradient Descent step]]
    • [[#6. Dropout|#6 Dropout]]
    • [[#7. Scaled Dot-Product Attention|#7 Scaled Dot-Product Attention]]
    • [[#8. Self-Attention|#8 Self-Attention]]
    • [[#9. Multi-head Attention|#9 Multi-head Attention]]
  17. Worked example: small backprop trace
  18. Worked example: attention on a tiny matrix
  19. Related atomic notes

Core Ideas

  • Neural networks are layered units computing weighted sums + activation. Trained via backpropagation of error gradients.
  • Hopfield networks are energy-based associative memories — symmetric weights, asynchronous updates, guaranteed convergence to a local energy minimum (stored or parasitic pattern).
  • Deep networks stack many layers for higher expressivity. Practical challenges: vanishing/exploding gradients, overfitting, compute. Solutions: ReLU, residual connections, dropout, weight decay, early stopping.
  • Transformers replace recurrence with self-attention: every token attends to every other token in O(n²). The backbone of BERT, GPT, and modern LLMs.

Glossary — important vocabulary

Neuron / unit. Computes y = φ(Σᵢ wᵢ xᵢ + b) where φ is an activation function.

Activation function. Non-linearity applied to weighted sum. Common: sigmoid, tanh, ReLU.

Weights wᵢⱼ. Connection strength from neuron j (source) to neuron i (target). In Hopfield slides notation, w_ji denotes the weight on the incoming edge into i from j. Because Hopfield weights are symmetric (w_ji = w_ij), the two indexings agree.

Bias b. Per-neuron constant shift.

Hopfield network. Recurrent net with symmetric weights, no self-connections (wᵢᵢ = 0), binary states yᵢ ∈ {−1, +1}. Stores patterns as fixed points of an energy function.

Energy function. Scalar Lyapunov function that never increases under valid updates. For Hopfield: E = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ.

Parasitic memory (spurious state). A local energy minimum that was not stored explicitly — the network “remembers” patterns it was never trained on.

Capacity. Maximum number of patterns that can be reliably stored and retrieved.

MLP (Multilayer Perceptron). Feedforward network: input → hidden layer(s) → output. Fully connected, no recurrence.

ReLU. Rectified Linear Unit: ReLU(x) = max(0, x). Gradient is 1 for x > 0, 0 for x < 0 — avoids saturation.

Vanishing gradient. With sigmoid/tanh activations, gradients shrink toward 0 across many layers because each layer multiplies by a factor ≤ 1 → early layers barely learn.

Exploding gradient. Mirror problem: gradients grow without bound. Mitigated by gradient clipping.

Residual connection. y = F(x) + x — skip connection adds the layer’s input to its output. Gradient can flow directly through the identity → enables very deep networks (ResNet).

Dropout. During training, randomly zero out a fraction p of neurons each forward pass → ensemble effect, less co-adaptation.

Weight decay (L2 regularization). Add λ Σ ‖w‖² to the loss → penalises large weights.

Early stopping. Halt training when validation loss stops improving.

Autoencoder. Encoder–bottleneck–decoder net trained to reconstruct its input → learns a compressed code.

Q / K / V (Query / Key / Value). Three learned linear projections of an input token. Query asks “what am I looking for?”, Key advertises “what do I offer?”, Value is the actual content aggregated.

Self-attention. Q, K, V all derived from the same sequence → every token attends to every other token (and to itself).

Multi-head attention. Run H parallel attention heads on different learned subspaces, concatenate, project → captures multiple relation types simultaneously.

Positional encoding. Added to input embeddings so the (otherwise permutation-invariant) Transformer knows token order. Vaswani et al. used fixed sinusoidal encodings of absolute positions.

BERT. Bidirectional Encoder Representations from Transformers — encoder-only, masked LM objective.

GPT. Generative Pre-trained Transformer — decoder-only, autoregressive (left-to-right) LM.


Hopfield Networks

Architecture

  • N neurons, all-to-all connected, symmetric weights wᵢⱼ = wⱼᵢ, no self-loops wᵢᵢ = 0.
  • Binary states yᵢ ∈ {−1, +1} (some textbooks use {0, 1}).

Update rule

  • Asynchronous: pick one neuron i at random, update:
    yᵢ ← sign(Σⱼ≠ᵢ wᵢⱼ yⱼ + bᵢ)
  • The slides write the sum using w_ji (the incoming weight into i from j). Because W is symmetric, w_ji = w_ij — same formula.

Energy function

E(y) = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ

  • Energy never increases when one neuron is updated to obey the update rule (Lyapunov property).
  • Therefore the network always converges to a local minimum — but not necessarily the global one, and not necessarily a stored pattern.

Hebbian learning (storing P patterns ξ^(1), …, ξ^(P))

wᵢⱼ = (1/N) Σₚ ξᵢ^(p) ξⱼ^(p) for i ≠ j, wᵢᵢ = 0.

Capacity

  • Slides: state “up to N target memories” for N neurons (a loose theoretical upper bound).
  • External textbook fact (Amit, Gutfreund & Sompolinsky 1985, not Hopfield 1982): the reliable retrieval capacity of the standard Hopfield model is Pₘₐₓ ≈ 0.138 · N. Above this critical load, all stored memories become unstable.
  • ⚠️ For exam answers, use the slide claim (“up to N”). Flag the 0.138·N number as external if you cite it.

Parasitic / spurious memories

  • Local minima of E that were never stored — the network “remembers” garbage patterns.
  • Sources: linear combinations of stored patterns (mixture states), inverted patterns (−ξ has the same energy as ξ).
  • More patterns → more parasitic states → degraded retrieval.

Content-addressable memory

  • Initialize the network near a stored pattern (e.g. a noisy version) → updates flow downhill in energy → converges to the clean pattern.
  • This is associative retrieval (by content), as opposed to addressed retrieval (by index).

Multilayer Perceptron (MLP)

  • Architecture: input layer → ≥ 1 hidden layer(s) → output layer. Fully connected, feedforward.
  • Forward pass per layer: h^(ℓ) = φ(W^(ℓ) h^(ℓ−1) + b^(ℓ)).
  • Activation φ: sigmoid 1/(1+e^−x), tanh, or ReLU max(0, x).
  • Output activation: typically softmax (classification) or identity (regression).

Why ReLU

  • Sigmoid/tanh saturate: for large |x|, gradient ≈ 0 → vanishing gradient in deep nets.
  • ReLU’s gradient is 1 for x > 0 (constant!) and 0 for x ≤ 0 → keeps gradients alive across many layers.
  • Caveat: “dying ReLU” — units stuck at 0 stop learning. Mitigated by Leaky ReLU, ELU, etc.

Backpropagation

The chain rule applied layer by layer, backward through the network.

Setup

  • Loss L = ½ ‖y − t‖² (squared error) or cross-entropy.
  • For each layer ℓ define z^(ℓ) = W^(ℓ) h^(ℓ−1) + b^(ℓ) and h^(ℓ) = φ(z^(ℓ)).

Local error signal

δ^(L) = ∇_h L ⊙ φ'(z^(L)) for the output layer
δ^(ℓ) = (W^(ℓ+1))ᵀ δ^(ℓ+1) ⊙ φ'(z^(ℓ)) for hidden layers (backward recursion)

Gradient w.r.t. parameters

∂L/∂W^(ℓ) = δ^(ℓ) (h^(ℓ−1))ᵀ · ∂L/∂b^(ℓ) = δ^(ℓ)

Update

W^(ℓ) ← W^(ℓ) − η · ∂L/∂W^(ℓ) (stochastic / mini-batch gradient descent)


Universal Approximation

  • Slides’ version (constructive): a feedforward network with 2 hidden layers of finite width and non-linear activations can approximate any continuous function on a compact domain to arbitrary precision.
  • Textbook version (Cybenko 1989; Hornik et al. 1989): even 1 hidden layer with sigmoid (or any non-polynomial) activation is sufficient — this is the standard textbook universal approximation theorem.
  • ⚠️ Slide-vs-textbook difference: the Methods of AI slides use the 2-hidden-layer constructive proof, NOT the 1-hidden-layer Cybenko result. On the exam, follow the slide claim.
  • Width and number of units can be astronomically large — universality is an existence result, not a practicality result. This motivates deeper (not just wider) networks.

Deep Learning — challenges & solutions

ChallengeRoot causeSolution
Vanishing gradientsSigmoid/tanh saturate; products of small numbers shrinkReLU activations; residual connections; careful init (Xavier, He)
Exploding gradientsRepeated multiplication by factors > 1Gradient clipping; weight init; normalisation
OverfittingToo many parameters for the dataDropout, weight decay, early stopping, data augmentation, batch norm
Compute costMany params × many opsGPUs/TPUs, parallel matrix multiplies, mixed precision
OverparameterisationModern nets have more params than data pointsLottery ticket hypothesis: a small sparse sub-network does the work; prune after training

Residual connections (ResNet)

  • Block computes y = F(x; W) + x. The skip path is the identity.
  • Backward: ∂L/∂x = ∂L/∂y · (∂F/∂x + I) — the + I term means gradient is never fully zeroed.
  • Enables training networks with 100+ layers.

Dropout

  • At each training step, independently zero each neuron’s output with probability p.
  • Acts as ensemble averaging over an exponential number of thinned sub-networks.
  • At test time: no dropout; scale activations by (1 − p) (or use inverted dropout during training).

Weight decay

  • Add λ Σ ‖w‖² to the loss.
  • Gradient contribution: + 2 λ w → shrinks weights toward zero each step.
  • Equivalent (for SGD) to L2 regularisation; mathematically a Gaussian prior on weights.

Early stopping

  • Track validation loss each epoch. When it stops decreasing for K epochs (patience), stop training.
  • Effectively a regulariser: prevents the network from memorising training noise.

Autoencoders

  • Encoder f: x → z maps input to a low-dimensional code z (the bottleneck).
  • Decoder g: z → x̂ reconstructs the input.
  • Loss: reconstruction error ‖x − x̂‖² (or cross-entropy for binary inputs).
  • The bottleneck forces compression — the network must learn a useful latent representation.

Variants and uses

  • Denoising autoencoder: corrupt input, train to reconstruct the clean version.
  • Sparse autoencoder: add sparsity penalty on z (KL or L1).
  • Variational autoencoder (VAE): probabilistic; learns a distribution over z.
  • Uses: dimensionality reduction (non-linear PCA), anomaly detection (high reconstruction error = anomaly), pre-training, generative modelling.

Transformers & Self-Attention

Scaled dot-product attention

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V

  • Q (queries): what each token is looking for. Shape (n, d_k).
  • K (keys): what each token offers as context. Shape (n, d_k).
  • V (values): the actual content to aggregate. Shape (n, d_v).
  • Step 1: dot products Q Kᵀ give an (n × n) matrix of raw similarity scores.
  • Step 2: divide by √d_k — prevents the scores from growing too large in high dimensions (where they would push softmax into saturated regions with near-zero gradient).
  • Step 3: row-wise softmax → attention weights (each row sums to 1).
  • Step 4: multiply by V → each token’s output is a weighted average of all tokens’ values.

Self-attention

  • Q, K, V are all linear projections of the same input X: Q = X W_Q, K = X W_K, V = X W_V.
  • Every token attends to every other token (and to itself).
  • Captures long-range dependencies in O(1) hop count (compare RNN’s O(n) hops).

Multi-head attention

  • Run H attention operations in parallel, each with its own learned W_Q^(h), W_K^(h), W_V^(h).
  • Concatenate the H outputs and project: MultiHead = Concat(head₁, …, head_H) W_O.
  • Different heads can specialise in different relationship types (syntactic, coreference, positional, etc.).

Positional encoding

  • Self-attention is permutation-equivariant — without position info, “the cat sat on the mat” and “the mat sat on the cat” produce the same outputs.
  • Solution: add a positional embedding to each input token’s vector.
  • Vaswani et al. (2017) used fixed sinusoidal encodings of absolute positions: PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...).
  • ⚠️ Slide error to be aware of: the Methods of AI slides claim Vaswani used relative positional encoding. This is incorrect — Vaswani’s original paper used absolute sinusoidal encoding. (Relative encodings exist — Shaw et al. 2018, T5, RoPE — but those came later.) Flag this slide claim if you encounter it.

BERT vs. GPT

AspectBERTGPT
ArchitectureEncoder-onlyDecoder-only
Training objectiveMasked LM (predict masked tokens) + Next-Sentence PredictionAutoregressive LM (predict next token)
AttentionBidirectional (sees all tokens at once)Causal / masked (only past tokens)
Primary useUnderstanding: classification, NER, QAGeneration: text, code, chat
Released byGoogle, 2018OpenAI, 2018+

Formulas & Notation

SymbolMeaning
E(y) = −½ Σ wᵢⱼ yᵢ yⱼ − Σ bᵢ yᵢHopfield energy
yᵢ ← sign(Σⱼ wᵢⱼ yⱼ + bᵢ)Hopfield asynchronous update
wᵢⱼ = (1/N) Σₚ ξᵢ^(p) ξⱼ^(p)Hebbian learning rule
Pₘₐₓ ≈ 0.138 · NReliable Hopfield capacity (Amit et al. 1985 — external)
ReLU(x) = max(0, x)Rectified linear unit
δ^(ℓ) = (W^(ℓ+1))ᵀ δ^(ℓ+1) ⊙ φ'(z^(ℓ))Backprop error signal
L_wd = L + λ Σ ‖w‖²Loss with weight decay
Attention(Q,K,V) = softmax(Q Kᵀ / √d_k) VScaled dot-product attention
Q, K, VQuery / Key / Value matrices
d_kDimension of key vectors (scaling factor)
PE(pos, 2i) = sin(pos / 10000^(2i/d))Sinusoidal positional encoding

Common Exam Traps ⚠️

  • Hopfield energy decreases monotonically under asynchronous updates — it cannot oscillate. Convergence is to a local minimum, not the global one.
  • Parasitic memories are local minima the network never explicitly stored — they exist alongside (and dilute) the desired memories.
  • Hopfield capacity slide claim says “up to N target memories” — the realistic ~0.138·N figure is Amit et al. 1985, not Hopfield 1982. Cite it as external if asked.
  • Symmetric weights: w_ji = w_ij in Hopfield. The slide’s w_ji notation (incoming weight) is mathematically identical to w_ij.
  • Universal approximation — slide vs. textbook: slides use the 2-hidden-layer constructive proof. Textbook (Cybenko) gives 1 hidden layer. Quote the slide version on the exam.
  • Positional encoding is required in Transformers because self-attention is permutation-equivariant.
  • Vaswani used absolute sinusoidal PE, not relative — the slides have this wrong. Be ready to give the correct answer if pressed but go with the slide answer on a slide-derived question.
  • √d_k scaling prevents large dot products from pushing softmax into saturation (near-zero gradient).
  • Self-attention attends to all positions including itself; masked (causal) attention attends only to past positions — that’s the BERT vs. GPT split.
  • ReLU vs. sigmoid: sigmoid saturates → vanishing gradient. ReLU has constant gradient 1 for x > 0.
  • Residual connections carry gradient through the identity path — what enables very deep networks (ResNet).
  • BERT bidirectional, GPT unidirectional — don’t mix these up.
  • Dropout is training-only; at test time use the full network with rescaled activations.

Comparison Table

ArchitectureMemory mechanismRecurrenceLong-range depsTraining
HopfieldAssociative (energy minima)No (asynchronous updates)Limited by capacity (~0.138·N)Hebbian (one-shot)
MLPDistributed in weightsNoNone (no sequence)Backprop + SGD
RNNHidden state over timeYesVanishes over long sequencesBPTT (backprop through time)
TransformerVia attention weights (per-pass)NoFull — every token attends every other in O(n²)Backprop + Adam

Slide-vs-textbook fact-check notes ⚠️

  1. Hopfield capacity. Slides: “up to N target memories.” Reality (Amit, Gutfreund, Sompolinsky 1985): Pₘₐₓ ≈ 0.138 · N. The 0.138·N number is not from Hopfield 1982 and not in the slides — cite it as external.
  2. Universal approximation. Slides use a 2-hidden-layer constructive proof. The textbook 1-hidden-layer version (Cybenko 1989) is not the slide claim.
  3. Positional encoding (Vaswani et al. 2017). Slides claim Vaswani used relative encoding. This is a slide error — Vaswani actually used absolute sinusoidal encoding. Relative PE comes from Shaw et al. 2018 and later work.
  4. Modern Hopfield Networks ↔ Self-Attention. The Ramsauer et al. 2020 “Hopfield Networks Is All You Need” result connecting modern Hopfield networks to Transformer self-attention is not in the lecture slides. Treat related quiz items (e.g. Q100 / Q136 in the mega quiz) as supplementary, not exam-core.
  5. Hopfield update notation. Slides write the incoming weight as w_ji. With symmetric W this equals w_ij — pure notation choice, no math difference.

Visualisations (Python)

Five executable figures that make the neural-network and deep-learning concepts in this hub concrete. Each toggle opens a self-contained Pyodide block — install matplotlib (and numpy where needed) and plt.show() renders inline in Obsidian. Read the prose under each toggle for what to look for.

What this shows. Top: the four activation curves on x ∈ [−5, 5]. Bottom: their derivatives — the quantity that backprop multiplies layer by layer.

Things to notice.

  • Sigmoid and tanh saturate. For |x| ≳ 3 their derivatives are essentially zero. Stack 10 sigmoid layers and the product of gradients shrinks to ~10⁻⁹ → vanishing gradient.
  • ReLU’s derivative is exactly 1 for x > 0. Multiplying by 1 a hundred times still gives 1. This is the single biggest reason ReLU dominates modern deep learning.
  • ReLU’s flaw: derivative is 0 for x ≤ 0. A unit that drifts into the negative half stops learning (dying ReLU). Leaky ReLU keeps a small slope α = 0.1 there → the unit can recover.

What this shows. All 2⁸ = 256 possible binary states of an 8-neuron Hopfield net, sorted by their energy E(y) = −½ yᵀ W y. Weights come from the Hebbian rule on two stored patterns ξ₁, ξ₂.

Things to notice.

  • The stored patterns sit at the global minimum — exactly what Hebbian learning is designed to achieve.
  • Mirror patterns −ξ have the same energy as ξ (because E is quadratic in y → invariant under sign flip). This is built into the energy function and is one source of spurious memories.
  • Purple circles mark parasitic local minima — states the network was never told to store but which still trap asynchronous updates. Their count grows quickly as you store more patterns; this is the empirical reason behind the ~0.138·N capacity bound.

What this shows. A typical attention heatmap. Rows = queries (one per token, asking “what should I look at?”). Cols = keys (what each token offers). Each cell is the softmax-normalised attention weight — every row sums to exactly 1.

Things to notice.

  • Even with random Q and K matrices the rows are clearly non-uniform — some tokens get more weight than others.
  • After training, real Transformers learn weights that concentrate on syntactically/semantically relevant tokens (e.g. for a pronoun, the matching noun lights up).
  • The √d_k divisor in scores = Q K^T / √d_k is what keeps softmax out of the saturated regime here — without it, larger d_k would push the maximum scores toward ±∞ and the rows would degenerate into one-hot vectors with zero gradient elsewhere.

What this shows. The full (50 × 64) positional-encoding matrix. Each row is the vector added to the embedding of one token position; each column is one of the 64 embedding dimensions.

Things to notice.

  • Low dimensions (left) vary fast with position — short-wavelength sinusoids encode fine-grained, local position info.
  • High dimensions (right) vary slowly — long-wavelength sinusoids encode coarse, “where in the document am I” info.
  • This multi-scale frequency stack is why the encoding can represent both relative offsets and absolute positions, and why the model can extrapolate to sequence lengths it wasn’t trained on.
  • ⚠️ Slide-vs-paper: the Methods of AI slides call this “relative” positional encoding. Vaswani’s original paper uses absolute sinusoidal encoding — exactly the matrix above. Relative PE (Shaw et al. 2018, RoPE, ALiBi) came later.

What this shows. Backprop multiplies one activation derivative per layer. The plot tracks the typical product magnitude ∏ₗ φ'(zₗ) (geometric mean over a Gaussian pre-activation sample) as the network gets deeper.

Things to notice.

  • Sigmoid’s derivative has maximum value 0.25. Multiplying 10 of these gives ≈ 4·10⁻⁷ — the gradient reaching the first layer is effectively zero. That is the vanishing-gradient problem.
  • ReLU’s derivative is either 0 or 1. On active units the product stays at 1 regardless of depth → gradients survive intact across 30+ layers.
  • This single property — combined with residual connections and good init (He/Xavier) — is what made networks with hundreds of layers practical (ResNet, Transformers).

ALGORITHMS (full reference) ⭐

Pseudocode and intuition for the algorithms you might write or trace by hand in the exam. Organised by topic:

  • Hopfield: #1 update rule · #2 Hebbian learning
  • MLP training: #3 forward pass · #4 backpropagation · #5 gradient descent
  • Regularisation: #6 dropout
  • Transformer attention: #7 scaled dot-product · #8 self-attention · #9 multi-head

1. Hopfield update rule

Goal. Asynchronously update one neuron at a time toward a stored pattern.

HOPFIELD-UPDATE(state y, weights W, biases b):
    repeat until stable:
        pick neuron i uniformly at random
        s ← Σⱼ≠ᵢ W[i, j] * y[j] + b[i]
        y[i] ← +1 if s ≥ 0 else −1
    return y

Slide-equivalent form writes W[j, i] (= w_ji) instead of W[i, j] — same value because W is symmetric.

Property. Energy E(y) = −½ yᵀ W y − bᵀ y is non-increasing on every update → guaranteed convergence to a local minimum (stored pattern or parasitic state).


2. Hopfield Hebbian learning

Goal. Store P bipolar patterns ξ^(1), …, ξ^(P) ∈ {−1, +1}^N as fixed points.

HEBBIAN-LEARN(patterns ξ^(1..P), N):
    initialise W ← 0      # N × N matrix
    for p in 1..P:
        for i in 1..N:
            for j in 1..N:
                if i ≠ j:
                    W[i, j] += ξ^(p)[i] * ξ^(p)[j]
    W ← W / N
    return W

One-shot. No iteration over the data — single pass.
Symmetric. W[i, j] = W[j, i] by construction.
Capacity. Slides: “up to N”. Reality: ~0.138 · N (Amit et al. 1985).


3. MLP forward pass

Goal. Compute the output of a feedforward network with L layers.

FORWARD(x, weights {W^(1..L)}, biases {b^(1..L)}, activation φ):
    h ← x
    for ℓ in 1..L:
        z^(ℓ) ← W^(ℓ) · h + b^(ℓ)
        h^(ℓ) ← φ(z^(ℓ))           # ReLU / sigmoid / tanh
        h ← h^(ℓ)
    return h     # network output

4. Backpropagation (chain rule pseudocode)

Goal. Compute gradient of the loss w.r.t. all weights via the chain rule, layer by layer, backward.

BACKPROP(x, target t, weights {W^(1..L)}, biases {b^(1..L)}):
    # --- forward pass, cache intermediates ---
    h^(0) ← x
    for ℓ in 1..L:
        z^(ℓ) ← W^(ℓ) · h^(ℓ−1) + b^(ℓ)
        h^(ℓ) ← φ(z^(ℓ))
 
    L ← loss(h^(L), t)
 
    # --- backward pass ---
    δ^(L) ← ∇_{h^(L)} loss  ⊙  φ'(z^(L))            # output-layer error
    for ℓ in L−1 .. 1 step −1:
        δ^(ℓ) ← (W^(ℓ+1))ᵀ · δ^(ℓ+1)  ⊙  φ'(z^(ℓ))   # propagate back
 
    # --- gradients ---
    for ℓ in 1..L:
        grad_W^(ℓ) ← δ^(ℓ) · (h^(ℓ−1))ᵀ
        grad_b^(ℓ) ← δ^(ℓ)
    return {grad_W^(ℓ)}, {grad_b^(ℓ)}

Cost. Same order as the forward pass: O(network size) per example.
Numerical pitfalls. Vanishing gradient when φ’ is small (sigmoid saturation); exploding gradient when products of W norms exceed 1.


5. Gradient Descent step

Goal. Apply the gradient computed by backprop to update the parameters.

SGD-STEP(weights W, gradients ∇W, learning rate η, weight decay λ):
    for each parameter w in W:
        w ← w − η · (∇w + 2 λ w)      # 2λw term = L2 / weight decay

Variants. Momentum, Nesterov, AdaGrad, RMSProp, Adam (default for Transformers).


6. Dropout

Goal. Regularise by randomly silencing units during training.

DROPOUT-FORWARD(h, drop_prob p, training):
    if training:
        m ← Bernoulli(1 − p, shape = h.shape)   # 0/1 mask
        return (h * m) / (1 − p)                # inverted dropout: rescale
    else:
        return h                                # no dropout at test time

Why divide by (1 − p)? Keeps expected activation the same as in eval mode → no need to rescale at test time.


7. Scaled Dot-Product Attention

Goal. Compute attention output given Q, K, V matrices.

ATTENTION(Q, K, V):
    # Q: (n × d_k),  K: (n × d_k),  V: (n × d_v)
    scores ← Q · Kᵀ                    # (n × n) raw similarity
    scores ← scores / sqrt(d_k)        # scale to keep softmax well-behaved
    weights ← softmax(scores, axis=-1) # row-wise; each row sums to 1
    return weights · V                 # (n × d_v) output

Optional causal mask (for GPT-style decoder): before softmax, set scores[i, j] = −∞ for j > i so each token only attends to earlier ones.


8. Self-Attention

Goal. Apply scaled dot-product attention where Q, K, V are all linear projections of the same input.

SELF-ATTENTION(X, W_Q, W_K, W_V):
    # X: (n × d_model)
    Q ← X · W_Q          # (n × d_k)
    K ← X · W_K          # (n × d_k)
    V ← X · W_V          # (n × d_v)
    return ATTENTION(Q, K, V)

Permutation-equivariance. Without positional encoding, permuting rows of X permutes rows of the output identically — the model has no sense of order.


9. Multi-head Attention

Goal. Run H attention operations in parallel on different learned subspaces, then combine.

MULTI-HEAD(X, {W_Q^h, W_K^h, W_V^h for h in 1..H}, W_O):
    heads ← []
    for h in 1..H:
        head_h ← SELF-ATTENTION(X, W_Q^h, W_K^h, W_V^h)   # (n × d_v)
        heads.append(head_h)
    concat ← Concat(heads, axis=-1)                       # (n × H·d_v)
    return concat · W_O                                   # (n × d_model)

Typical sizing. d_k = d_v = d_model / H so the total parameter count is comparable to a single full-size head.


Worked example: small backprop trace

Tiny 2-input → 1-hidden → 1-output network with sigmoid activations, squared-error loss.

Setup.

  • Inputs: x = (1, 0.5), target t = 1.
  • Weights: W^(1) = [[0.5, −0.3]] (1 hidden unit), b^(1) = 0. W^(2) = [0.8], b^(2) = 0.
  • Activation: σ(x) = 1/(1+e^−x). Loss: L = ½(y − t)².

Forward pass.

  • z^(1) = 0.5·1 + (−0.3)·0.5 = 0.35h^(1) = σ(0.35) ≈ 0.5866.
  • z^(2) = 0.8 · 0.5866 ≈ 0.4693y = σ(0.4693) ≈ 0.6152.
  • L = ½(0.6152 − 1)² ≈ 0.0740.

Backward pass. σ’(z) = σ(z)·(1 − σ(z)).

  • Output error: δ^(2) = (y − t) · σ'(z^(2)) = (0.6152 − 1) · 0.6152 · (1 − 0.6152) ≈ −0.3848 · 0.2367 ≈ −0.0911.
  • Hidden error: δ^(1) = W^(2) · δ^(2) · σ'(z^(1)) = 0.8 · (−0.0911) · 0.5866 · 0.4134 ≈ −0.01767.
  • Gradients:
    • ∂L/∂W^(2) = δ^(2) · h^(1) ≈ −0.0911 · 0.5866 ≈ −0.0534.
    • ∂L/∂W^(1) = δ^(1) · xᵀ ≈ (−0.01767) · (1, 0.5) ≈ (−0.01767, −0.00884).

SGD step (η = 0.1).

  • W^(2) ← 0.8 − 0.1 · (−0.0534) ≈ 0.8053.
  • W^(1) ← (0.5, −0.3) − 0.1 · (−0.01767, −0.00884) ≈ (0.5018, −0.2991).

After one step, the network’s output moves slightly toward t = 1.


Worked example: attention on a tiny matrix

Three tokens, d_k = 2, no scaling for clarity (set √d_k = 1 to keep arithmetic clean — in practice you’d divide).

Inputs (already projected).

Q = [[1, 0],     K = [[1, 0],     V = [[1, 0],
     [0, 1],          [0, 1],          [0, 1],
     [1, 1]]          [1, 1]]          [1, 1]]

Step 1 — raw scores Q · Kᵀ.

       k1   k2   k3
q1 = [ 1,   0,   1 ]
q2 = [ 0,   1,   1 ]
q3 = [ 1,   1,   2 ]

Step 2 — row-wise softmax. (using e ≈ 2.718)

  • Row 1: softmax(1, 0, 1) = (e, 1, e) / (2e + 1) ≈ (0.422, 0.155, 0.422).
  • Row 2: same shape as row 1 by symmetry → (0.155, 0.422, 0.422).
  • Row 3: softmax(1, 1, 2) = (e, e, e²) / (2e + e²) ≈ (0.212, 0.212, 0.576).

Step 3 — output = weights · V.

  • Out₁ = 0.422·(1,0) + 0.155·(0,1) + 0.422·(1,1) ≈ (0.844, 0.577).
  • Out₂ = 0.155·(1,0) + 0.422·(0,1) + 0.422·(1,1) ≈ (0.577, 0.844).
  • Out₃ = 0.212·(1,0) + 0.212·(0,1) + 0.576·(1,1) ≈ (0.788, 0.788).

Reading the result. Token 1 (q1 = [1,0]) attends most to keys that share its first component (k1, k3) → its output is closer to v1 + v3. Token 3 (q3 = [1,1]) is most similar to k3, so its output is dominated by v3 = (1,1). This is exactly the intuition: high Q·K similarity → high attention weight → that V contributes most.


See also

Tags: methods-of-ai atomic-note
Lernzettel: lernzettel_neural-networks-deep-learning_30-04-26
Quiz: quiz_neural-networks_30-04-26
Superlink: Methods of AI Lecture
Source PDFs: Session12-Neural-Networks.pdf, Session-13-Deep-Learning.pdf

Created: 30/04/26