Neural Networks & Deep Learning

chatbot methods-of-ai ai-generated

Methods of AI — SoSe 2026 · Sessions 12 (Neural Networks) & 13 (Deep Learning, incl. Transformers). This is the topic hub for everything from Hopfield networks through MLPs/backprop and deep-learning practicalities to Transformer self-attention.

Companion files for exam prep:

📋 lernzettel_neural-networks-deep-learning_30-04-26 — compact 1-page exam sheet

✏️ quiz_neural-networks_30-04-26 — practice questions

Core Ideas
Glossary — important vocabulary
Hopfield Networks
Multilayer Perceptron (MLP)
Backpropagation
Universal Approximation
Deep Learning — challenges & solutions
Autoencoders
Transformers & Self-Attention
BERT vs. GPT
Formulas & Notation
Common Exam Traps ⚠️
Comparison Table
Slide-vs-textbook fact-check notes ⚠️
Visualisations (Python)
ALGORITHMS ⭐
- [[#1. Hopfield update rule|#1 Hopfield update rule]]
- [[#2. Hopfield Hebbian learning|#2 Hopfield Hebbian learning]]
- [[#3. MLP forward pass|#3 MLP forward pass]]
- [[#4. Backpropagation (chain rule pseudocode)|#4 Backpropagation]]
- [[#5. Gradient Descent step|#5 Gradient Descent step]]
- [[#6. Dropout|#6 Dropout]]
- [[#7. Scaled Dot-Product Attention|#7 Scaled Dot-Product Attention]]
- [[#8. Self-Attention|#8 Self-Attention]]
- [[#9. Multi-head Attention|#9 Multi-head Attention]]
Worked example: small backprop trace
Worked example: attention on a tiny matrix
Related atomic notes

Core Ideas

Neural networks are layered units computing weighted sums + activation. Trained via backpropagation of error gradients.
Hopfield networks are energy-based associative memories — symmetric weights, asynchronous updates, guaranteed convergence to a local energy minimum (stored or parasitic pattern).
Deep networks stack many layers for higher expressivity. Practical challenges: vanishing/exploding gradients, overfitting, compute. Solutions: ReLU, residual connections, dropout, weight decay, early stopping.
Transformers replace recurrence with self-attention: every token attends to every other token in O(n²). The backbone of BERT, GPT, and modern LLMs.

Glossary — important vocabulary

Neuron / unit. Computes y = φ(Σᵢ wᵢ xᵢ + b) where φ is an activation function.

Activation function. Non-linearity applied to weighted sum. Common: sigmoid, tanh, ReLU.

Weights wᵢⱼ. Connection strength from neuron j (source) to neuron i (target). In Hopfield slides notation, w_ji denotes the weight on the incoming edge into i from j. Because Hopfield weights are symmetric (w_ji = w_ij), the two indexings agree.

Bias b. Per-neuron constant shift.

Hopfield network. Recurrent net with symmetric weights, no self-connections (wᵢᵢ = 0), binary states yᵢ ∈ {−1, +1}. Stores patterns as fixed points of an energy function.

Energy function. Scalar Lyapunov function that never increases under valid updates. For Hopfield: E = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ.

Parasitic memory (spurious state). A local energy minimum that was not stored explicitly — the network “remembers” patterns it was never trained on.

Capacity. Maximum number of patterns that can be reliably stored and retrieved.

MLP (Multilayer Perceptron). Feedforward network: input → hidden layer(s) → output. Fully connected, no recurrence.

ReLU. Rectified Linear Unit: ReLU(x) = max(0, x). Gradient is 1 for x > 0, 0 for x < 0 — avoids saturation.

Vanishing gradient. With sigmoid/tanh activations, gradients shrink toward 0 across many layers because each layer multiplies by a factor ≤ 1 → early layers barely learn.

Exploding gradient. Mirror problem: gradients grow without bound. Mitigated by gradient clipping.

Residual connection. y = F(x) + x — skip connection adds the layer’s input to its output. Gradient can flow directly through the identity → enables very deep networks (ResNet).

Dropout. During training, randomly zero out a fraction p of neurons each forward pass → ensemble effect, less co-adaptation.

Weight decay (L2 regularization). Add λ Σ ‖w‖² to the loss → penalises large weights.

Early stopping. Halt training when validation loss stops improving.

Autoencoder. Encoder–bottleneck–decoder net trained to reconstruct its input → learns a compressed code.

Q / K / V (Query / Key / Value). Three learned linear projections of an input token. Query asks “what am I looking for?”, Key advertises “what do I offer?”, Value is the actual content aggregated.

Self-attention. Q, K, V all derived from the same sequence → every token attends to every other token (and to itself).

Multi-head attention. Run H parallel attention heads on different learned subspaces, concatenate, project → captures multiple relation types simultaneously.

Positional encoding. Added to input embeddings so the (otherwise permutation-invariant) Transformer knows token order. Vaswani et al. used fixed sinusoidal encodings of absolute positions.

BERT. Bidirectional Encoder Representations from Transformers — encoder-only, masked LM objective.

GPT. Generative Pre-trained Transformer — decoder-only, autoregressive (left-to-right) LM.

Hopfield Networks

Architecture

N neurons, all-to-all connected, symmetric weights wᵢⱼ = wⱼᵢ, no self-loops wᵢᵢ = 0.
Binary states yᵢ ∈ {−1, +1} (some textbooks use {0, 1}).

Update rule

Asynchronous: pick one neuron i at random, update:
yᵢ ← sign(Σⱼ≠ᵢ wᵢⱼ yⱼ + bᵢ)
The slides write the sum using w_ji (the incoming weight into i from j). Because W is symmetric, w_ji = w_ij — same formula.

Energy function

E(y) = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ

Energy never increases when one neuron is updated to obey the update rule (Lyapunov property).
Therefore the network always converges to a local minimum — but not necessarily the global one, and not necessarily a stored pattern.

Hebbian learning (storing P patterns ξ^(1), …, ξ^(P))

wᵢⱼ = (1/N) Σₚ ξᵢ^(p) ξⱼ^(p) for i ≠ j, wᵢᵢ = 0.

Capacity

Slides: state “up to N target memories” for N neurons (a loose theoretical upper bound).
External textbook fact (Amit, Gutfreund & Sompolinsky 1985, not Hopfield 1982): the reliable retrieval capacity of the standard Hopfield model is Pₘₐₓ ≈ 0.138 · N. Above this critical load, all stored memories become unstable.
⚠️ For exam answers, use the slide claim (“up to N”). Flag the 0.138·N number as external if you cite it.

Parasitic / spurious memories

Local minima of E that were never stored — the network “remembers” garbage patterns.
Sources: linear combinations of stored patterns (mixture states), inverted patterns (−ξ has the same energy as ξ).
More patterns → more parasitic states → degraded retrieval.

Content-addressable memory

Initialize the network near a stored pattern (e.g. a noisy version) → updates flow downhill in energy → converges to the clean pattern.
This is associative retrieval (by content), as opposed to addressed retrieval (by index).

Multilayer Perceptron (MLP)

Architecture: input layer → ≥ 1 hidden layer(s) → output layer. Fully connected, feedforward.
Forward pass per layer: h^(ℓ) = φ(W^(ℓ) h^(ℓ−1) + b^(ℓ)).
Activation φ: sigmoid 1/(1+e^−x), tanh, or ReLU max(0, x).
Output activation: typically softmax (classification) or identity (regression).

Why ReLU

Sigmoid/tanh saturate: for large |x|, gradient ≈ 0 → vanishing gradient in deep nets.
ReLU’s gradient is 1 for x > 0 (constant!) and 0 for x ≤ 0 → keeps gradients alive across many layers.
Caveat: “dying ReLU” — units stuck at 0 stop learning. Mitigated by Leaky ReLU, ELU, etc.

Backpropagation

The chain rule applied layer by layer, backward through the network.

Setup

Loss L = ½ ‖y − t‖² (squared error) or cross-entropy.
For each layer ℓ define z^(ℓ) = W^(ℓ) h^(ℓ−1) + b^(ℓ) and h^(ℓ) = φ(z^(ℓ)).

Local error signal

δ^(L) = ∇_h L ⊙ φ'(z^(L)) for the output layer
δ^(ℓ) = (W^(ℓ+1))ᵀ δ^(ℓ+1) ⊙ φ'(z^(ℓ)) for hidden layers (backward recursion)

Gradient w.r.t. parameters

∂L/∂W^(ℓ) = δ^(ℓ) (h^(ℓ−1))ᵀ · ∂L/∂b^(ℓ) = δ^(ℓ)

Update

W^(ℓ) ← W^(ℓ) − η · ∂L/∂W^(ℓ) (stochastic / mini-batch gradient descent)

Universal Approximation

Slides’ version (constructive): a feedforward network with 2 hidden layers of finite width and non-linear activations can approximate any continuous function on a compact domain to arbitrary precision.
Textbook version (Cybenko 1989; Hornik et al. 1989): even 1 hidden layer with sigmoid (or any non-polynomial) activation is sufficient — this is the standard textbook universal approximation theorem.
⚠️ Slide-vs-textbook difference: the Methods of AI slides use the 2-hidden-layer constructive proof, NOT the 1-hidden-layer Cybenko result. On the exam, follow the slide claim.
Width and number of units can be astronomically large — universality is an existence result, not a practicality result. This motivates deeper (not just wider) networks.

Deep Learning — challenges & solutions

Challenge	Root cause	Solution
Vanishing gradients	Sigmoid/tanh saturate; products of small numbers shrink	ReLU activations; residual connections; careful init (Xavier, He)
Exploding gradients	Repeated multiplication by factors > 1	Gradient clipping; weight init; normalisation
Overfitting	Too many parameters for the data	Dropout, weight decay, early stopping, data augmentation, batch norm
Compute cost	Many params × many ops	GPUs/TPUs, parallel matrix multiplies, mixed precision
Overparameterisation	Modern nets have more params than data points	Lottery ticket hypothesis: a small sparse sub-network does the work; prune after training

Residual connections (ResNet)

Block computes y = F(x; W) + x. The skip path is the identity.
Backward: ∂L/∂x = ∂L/∂y · (∂F/∂x + I) — the + I term means gradient is never fully zeroed.
Enables training networks with 100+ layers.

Dropout

At each training step, independently zero each neuron’s output with probability p.
Acts as ensemble averaging over an exponential number of thinned sub-networks.
At test time: no dropout; scale activations by (1 − p) (or use inverted dropout during training).

Weight decay

Add λ Σ ‖w‖² to the loss.
Gradient contribution: + 2 λ w → shrinks weights toward zero each step.
Equivalent (for SGD) to L2 regularisation; mathematically a Gaussian prior on weights.

Early stopping

Track validation loss each epoch. When it stops decreasing for K epochs (patience), stop training.
Effectively a regulariser: prevents the network from memorising training noise.

Autoencoders

Encoder f: x → z maps input to a low-dimensional code z (the bottleneck).
Decoder g: z → x̂ reconstructs the input.
Loss: reconstruction error ‖x − x̂‖² (or cross-entropy for binary inputs).
The bottleneck forces compression — the network must learn a useful latent representation.

Variants and uses

Denoising autoencoder: corrupt input, train to reconstruct the clean version.
Sparse autoencoder: add sparsity penalty on z (KL or L1).
Variational autoencoder (VAE): probabilistic; learns a distribution over z.
Uses: dimensionality reduction (non-linear PCA), anomaly detection (high reconstruction error = anomaly), pre-training, generative modelling.

Transformers & Self-Attention

Scaled dot-product attention

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) · V

Q (queries): what each token is looking for. Shape (n, d_k).
K (keys): what each token offers as context. Shape (n, d_k).
V (values): the actual content to aggregate. Shape (n, d_v).
Step 1: dot products Q Kᵀ give an (n × n) matrix of raw similarity scores.
Step 2: divide by √d_k — prevents the scores from growing too large in high dimensions (where they would push softmax into saturated regions with near-zero gradient).
Step 3: row-wise softmax → attention weights (each row sums to 1).
Step 4: multiply by V → each token’s output is a weighted average of all tokens’ values.

Self-attention

Q, K, V are all linear projections of the same input X: Q = X W_Q, K = X W_K, V = X W_V.
Every token attends to every other token (and to itself).
Captures long-range dependencies in O(1) hop count (compare RNN’s O(n) hops).

Multi-head attention

Run H attention operations in parallel, each with its own learned W_Q^(h), W_K^(h), W_V^(h).
Concatenate the H outputs and project: MultiHead = Concat(head₁, …, head_H) W_O.
Different heads can specialise in different relationship types (syntactic, coreference, positional, etc.).

Positional encoding

Self-attention is permutation-equivariant — without position info, “the cat sat on the mat” and “the mat sat on the cat” produce the same outputs.
Solution: add a positional embedding to each input token’s vector.
Vaswani et al. (2017) used fixed sinusoidal encodings of absolute positions: PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...).
⚠️ Slide error to be aware of: the Methods of AI slides claim Vaswani used relative positional encoding. This is incorrect — Vaswani’s original paper used absolute sinusoidal encoding. (Relative encodings exist — Shaw et al. 2018, T5, RoPE — but those came later.) Flag this slide claim if you encounter it.

BERT vs. GPT

Aspect	BERT	GPT
Architecture	Encoder-only	Decoder-only
Training objective	Masked LM (predict masked tokens) + Next-Sentence Prediction	Autoregressive LM (predict next token)
Attention	Bidirectional (sees all tokens at once)	Causal / masked (only past tokens)
Primary use	Understanding: classification, NER, QA	Generation: text, code, chat
Released by	Google, 2018	OpenAI, 2018+

Formulas & Notation

Symbol	Meaning
`E(y) = −½ Σ wᵢⱼ yᵢ yⱼ − Σ bᵢ yᵢ`	Hopfield energy
`yᵢ ← sign(Σⱼ wᵢⱼ yⱼ + bᵢ)`	Hopfield asynchronous update
`wᵢⱼ = (1/N) Σₚ ξᵢ^(p) ξⱼ^(p)`	Hebbian learning rule
`Pₘₐₓ ≈ 0.138 · N`	Reliable Hopfield capacity (Amit et al. 1985 — external)
`ReLU(x) = max(0, x)`	Rectified linear unit
`δ^(ℓ) = (W^(ℓ+1))ᵀ δ^(ℓ+1) ⊙ φ'(z^(ℓ))`	Backprop error signal
`L_wd = L + λ Σ ‖w‖²`	Loss with weight decay
`Attention(Q,K,V) = softmax(Q Kᵀ / √d_k) V`	Scaled dot-product attention
`Q, K, V`	Query / Key / Value matrices
`d_k`	Dimension of key vectors (scaling factor)
`PE(pos, 2i) = sin(pos / 10000^(2i/d))`	Sinusoidal positional encoding

Common Exam Traps ⚠️

Hopfield energy decreases monotonically under asynchronous updates — it cannot oscillate. Convergence is to a local minimum, not the global one.
Parasitic memories are local minima the network never explicitly stored — they exist alongside (and dilute) the desired memories.
Hopfield capacity slide claim says “up to N target memories” — the realistic ~0.138·N figure is Amit et al. 1985, not Hopfield 1982. Cite it as external if asked.
Symmetric weights: w_ji = w_ij in Hopfield. The slide’s w_ji notation (incoming weight) is mathematically identical to w_ij.
Universal approximation — slide vs. textbook: slides use the 2-hidden-layer constructive proof. Textbook (Cybenko) gives 1 hidden layer. Quote the slide version on the exam.
Positional encoding is required in Transformers because self-attention is permutation-equivariant.
Vaswani used absolute sinusoidal PE, not relative — the slides have this wrong. Be ready to give the correct answer if pressed but go with the slide answer on a slide-derived question.
√d_k scaling prevents large dot products from pushing softmax into saturation (near-zero gradient).
Self-attention attends to all positions including itself; masked (causal) attention attends only to past positions — that’s the BERT vs. GPT split.
ReLU vs. sigmoid: sigmoid saturates → vanishing gradient. ReLU has constant gradient 1 for x > 0.
Residual connections carry gradient through the identity path — what enables very deep networks (ResNet).
BERT bidirectional, GPT unidirectional — don’t mix these up.
Dropout is training-only; at test time use the full network with rescaled activations.

Comparison Table

Architecture	Memory mechanism	Recurrence	Long-range deps	Training
Hopfield	Associative (energy minima)	No (asynchronous updates)	Limited by capacity (~0.138·N)	Hebbian (one-shot)
MLP	Distributed in weights	No	None (no sequence)	Backprop + SGD
RNN	Hidden state over time	Yes	Vanishes over long sequences	BPTT (backprop through time)
Transformer	Via attention weights (per-pass)	No	Full — every token attends every other in O(n²)	Backprop + Adam

Slide-vs-textbook fact-check notes ⚠️

Hopfield capacity. Slides: “up to N target memories.” Reality (Amit, Gutfreund, Sompolinsky 1985): Pₘₐₓ ≈ 0.138 · N. The 0.138·N number is not from Hopfield 1982 and not in the slides — cite it as external.
Universal approximation. Slides use a 2-hidden-layer constructive proof. The textbook 1-hidden-layer version (Cybenko 1989) is not the slide claim.
Positional encoding (Vaswani et al. 2017). Slides claim Vaswani used relative encoding. This is a slide error — Vaswani actually used absolute sinusoidal encoding. Relative PE comes from Shaw et al. 2018 and later work.
Modern Hopfield Networks ↔ Self-Attention. The Ramsauer et al. 2020 “Hopfield Networks Is All You Need” result connecting modern Hopfield networks to Transformer self-attention is not in the lecture slides. Treat related quiz items (e.g. Q100 / Q136 in the mega quiz) as supplementary, not exam-core.
Hopfield update notation. Slides write the incoming weight as w_ji. With symmetric W this equals w_ij — pure notation choice, no math difference.

Visualisations (Python)

Five executable figures that make the neural-network and deep-learning concepts in this hub concrete. Each toggle opens a self-contained Pyodide block — install matplotlib (and numpy where needed) and plt.show() renders inline in Obsidian. Read the prose under each toggle for what to look for.

🐍 Figure — Activation functions (sigmoid, tanh, ReLU, Leaky ReLU) and their derivatives

import micropip
await micropip.install("matplotlib")
import matplotlib.pyplot as plt
import numpy as np
 
x = np.linspace(-5, 5, 400)
 
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def dsigmoid(x):
    s = sigmoid(x); return s * (1 - s)
def tanh(x): return np.tanh(x)
def dtanh(x): return 1 - np.tanh(x) ** 2
def relu(x): return np.maximum(0, x)
def drelu(x): return (x > 0).astype(float)
def lrelu(x, a=0.1): return np.where(x > 0, x, a * x)
def dlrelu(x, a=0.1): return np.where(x > 0, 1.0, a)
 
funcs = [
    ("sigmoid",    sigmoid,  dsigmoid,  "#3498db"),
    ("tanh",       tanh,     dtanh,     "#27ae60"),
    ("ReLU",       relu,     drelu,     "#e74c3c"),
    ("Leaky ReLU", lrelu,    dlrelu,    "#9b59b6"),
]
 
fig, (ax_top, ax_bot) = plt.subplots(2, 1, figsize=(9, 7), sharex=True)
 
for name, f, df, col in funcs:
    ax_top.plot(x, f(x), label=name, color=col, lw=2)
    ax_bot.plot(x, df(x), label=f"{name}'", color=col, lw=2)
 
for ax in (ax_top, ax_bot):
    ax.axhline(0, color="#888", lw=0.6); ax.axvline(0, color="#888", lw=0.6)
    ax.grid(alpha=0.3); ax.legend(loc="best", fontsize=9)
 
ax_top.set_title("Activation functions  φ(x)")
ax_top.set_ylabel("φ(x)")
ax_bot.set_title("Derivatives  φ'(x)  —  the quantity that backprop multiplies")
ax_bot.set_ylabel("φ'(x)"); ax_bot.set_xlabel("x")
 
# Annotate sigmoid saturation
ax_bot.annotate("sigmoid saturates\n(φ' ≈ 0 for large |x|)",
                xy=(4, dsigmoid(np.array([4]))[0]), xytext=(1.2, 0.55),
                fontsize=9, color="#3498db",
                arrowprops=dict(arrowstyle="->", color="#3498db"))
ax_bot.annotate("ReLU keeps φ'=1\n(no vanishing gradient)",
                xy=(3, 1.0), xytext=(0.4, 0.78),
                fontsize=9, color="#e74c3c",
                arrowprops=dict(arrowstyle="->", color="#e74c3c"))
 
plt.tight_layout(); plt.show()

What this shows. Top: the four activation curves on x ∈ [−5, 5]. Bottom: their derivatives — the quantity that backprop multiplies layer by layer.

Things to notice.

Sigmoid and tanh saturate. For |x| ≳ 3 their derivatives are essentially zero. Stack 10 sigmoid layers and the product of gradients shrinks to ~10⁻⁹ → vanishing gradient.
ReLU’s derivative is exactly 1 for x > 0. Multiplying by 1 a hundred times still gives 1. This is the single biggest reason ReLU dominates modern deep learning.
ReLU’s flaw: derivative is 0 for x ≤ 0. A unit that drifts into the negative half stops learning (dying ReLU). Leaky ReLU keeps a small slope α = 0.1 there → the unit can recover.

🐍 Figure — Hopfield energy landscape (8 neurons, 2 stored patterns, all 256 states)

import micropip
await micropip.install("matplotlib")
import matplotlib.pyplot as plt
import numpy as np
 
N = 8
 
# Two stored bipolar patterns (in {-1, +1}^N)
xi1 = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])
xi2 = np.array([ 1, -1,  1, -1,  1, -1,  1, -1])
patterns = [xi1, xi2]
 
# Hebbian learning rule:  W = (1/N) Σ_p ξ_p ξ_p^T,  diag(W) = 0
W = np.zeros((N, N))
for p in patterns:
    W += np.outer(p, p)
W /= N
np.fill_diagonal(W, 0.0)
 
def energy(y, W):
    return -0.5 * y @ W @ y
 
# Enumerate all 2^N = 256 binary states in {-1, +1}^N
states = []
energies = []
for k in range(2 ** N):
    bits = [(k >> i) & 1 for i in range(N)]
    y = np.array([2 * b - 1 for b in bits])  # 0/1 -> -1/+1
    states.append(y); energies.append(energy(y, W))
 
states = np.array(states); energies = np.array(energies)
 
# Sort by energy
order = np.argsort(energies)
energies_sorted = energies[order]; states_sorted = states[order]
 
# Locate stored patterns and their negatives (negatives have same energy)
def find_idx(target):
    for i, s in enumerate(states_sorted):
        if np.array_equal(s, target): return i
    return None
 
idx_xi1   = find_idx(xi1);   idx_xi2   = find_idx(xi2)
idx_xi1_n = find_idx(-xi1);  idx_xi2_n = find_idx(-xi2)
 
# Find local minima (state with no Hamming-1 neighbour of lower energy)
def is_local_min(i, all_states, all_E):
    y = all_states[i]; E = all_E[i]
    for j in range(N):
        y2 = y.copy(); y2[j] = -y2[j]
        # locate y2's energy
        for k, s in enumerate(all_states):
            if np.array_equal(s, y2):
                if all_E[k] < E: return False
                break
    return True
 
# Compute local minima against UNSORTED list (Hamming neighbours)
local_min_mask = np.array([is_local_min(i, states, energies) for i in range(len(states))])
# Map back into sorted order
sorted_local_min = local_min_mask[order]
 
fig, ax = plt.subplots(figsize=(11, 5))
xs = np.arange(len(energies_sorted))
ax.plot(xs, energies_sorted, color="#888", lw=1, zorder=1,
        label="all 2⁸ = 256 binary states (sorted by E)")
 
# Mark stored patterns
for idx, lbl, col in [(idx_xi1,   "stored ξ₁",     "#e74c3c"),
                      (idx_xi2,   "stored ξ₂",     "#27ae60"),
                      (idx_xi1_n, "−ξ₁ (mirror)",  "#c0392b"),
                      (idx_xi2_n, "−ξ₂ (mirror)",  "#16a085")]:
    if idx is not None:
        ax.scatter([idx], [energies_sorted[idx]], s=140, color=col,
                   edgecolor="black", lw=1.2, zorder=4, label=lbl)
 
# Mark parasitic local minima (local minima that are NOT one of the stored / mirror states)
stored_set = {tuple(xi1), tuple(-xi1), tuple(xi2), tuple(-xi2)}
parasitic_idx = [i for i in range(len(states_sorted))
                 if sorted_local_min[i] and tuple(states_sorted[i]) not in stored_set]
if parasitic_idx:
    ax.scatter(parasitic_idx, energies_sorted[parasitic_idx], s=80,
               facecolor="none", edgecolor="#9b59b6", lw=1.8, zorder=3,
               label=f"parasitic minima ({len(parasitic_idx)})")
 
ax.set_xlabel("state index (sorted by energy)")
ax.set_ylabel("E(y) = −½ yᵀ W y")
ax.set_title("Hopfield energy landscape — N = 8, 2 stored patterns (Hebbian rule)\n"
             "stored ξ live at the bottom of the basin; mirrors share the same energy")
ax.grid(alpha=0.3); ax.legend(loc="lower right", fontsize=9)
plt.tight_layout(); plt.show()
 
print(f"Stored ξ₁ energy: {energy(xi1, W):.3f}")
print(f"Stored ξ₂ energy: {energy(xi2, W):.3f}")
print(f"−ξ₁    energy:   {energy(-xi1, W):.3f}  (identical to ξ₁ — symmetry of E)")
print(f"Number of local minima total: {local_min_mask.sum()}")
print(f"Of those, parasitic (not ±ξ₁, ±ξ₂): {len(parasitic_idx)}")

What this shows. All 2⁸ = 256 possible binary states of an 8-neuron Hopfield net, sorted by their energy E(y) = −½ yᵀ W y. Weights come from the Hebbian rule on two stored patterns ξ₁, ξ₂.

Things to notice.

The stored patterns sit at the global minimum — exactly what Hebbian learning is designed to achieve.
Mirror patterns −ξ have the same energy as ξ (because E is quadratic in y → invariant under sign flip). This is built into the energy function and is one source of spurious memories.
Purple circles mark parasitic local minima — states the network was never told to store but which still trap asynchronous updates. Their count grows quickly as you store more patterns; this is the empirical reason behind the ~0.138·N capacity bound.

🐍 Figure — Self-attention weights on a toy 6-token sentence

import micropip
await micropip.install("matplotlib")
import matplotlib.pyplot as plt
import numpy as np
 
rng = np.random.default_rng(7)
 
tokens = ["The", "cat", "sat", "on", "the", "mat"]
n = len(tokens)
d_model = 16
d_k = 8
 
# Token embeddings (random, just to demo)
X = rng.standard_normal((n, d_model))
 
# Random Q, K projection matrices  (we don't need V for the heatmap)
W_Q = rng.standard_normal((d_model, d_k)) * 0.3
W_K = rng.standard_normal((d_model, d_k)) * 0.3
 
Q = X @ W_Q          # (n, d_k)
K = X @ W_K          # (n, d_k)
 
# Scaled dot-product scores
scores = Q @ K.T / np.sqrt(d_k)             # (n, n)
 
# Row-wise softmax → attention weights
def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)
 
A = softmax(scores, axis=-1)                # rows sum to 1
 
fig, ax = plt.subplots(figsize=(7, 6))
im = ax.imshow(A, cmap="viridis", vmin=0, vmax=A.max())
ax.set_xticks(range(n)); ax.set_xticklabels(tokens, rotation=30, ha="right")
ax.set_yticks(range(n)); ax.set_yticklabels(tokens)
ax.set_xlabel("Key  (what each token offers)")
ax.set_ylabel("Query  (what each token is looking for)")
ax.set_title("Self-attention weights  softmax(Q Kᵀ / √d_k)\n"
             "row i = how token i distributes its attention across all tokens")
 
# Annotate each cell with the weight value
for i in range(n):
    for j in range(n):
        ax.text(j, i, f"{A[i,j]:.2f}", ha="center", va="center",
                color="white" if A[i,j] < A.max() * 0.55 else "black",
                fontsize=8)
 
plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04, label="attention weight")
plt.tight_layout(); plt.show()
 
print("Row sums (should each be 1.0):", A.sum(axis=1).round(3))

What this shows. A typical attention heatmap. Rows = queries (one per token, asking “what should I look at?”). Cols = keys (what each token offers). Each cell is the softmax-normalised attention weight — every row sums to exactly 1.

Things to notice.

Even with random Q and K matrices the rows are clearly non-uniform — some tokens get more weight than others.
After training, real Transformers learn weights that concentrate on syntactically/semantically relevant tokens (e.g. for a pronoun, the matching noun lights up).
The √d_k divisor in scores = Q K^T / √d_k is what keeps softmax out of the saturated regime here — without it, larger d_k would push the maximum scores toward ±∞ and the rows would degenerate into one-hot vectors with zero gradient elsewhere.

🐍 Figure — Sinusoidal positional encoding (Vaswani et al. 2017)

import micropip
await micropip.install("matplotlib")
import matplotlib.pyplot as plt
import numpy as np
 
d_model = 64
max_pos = 50
 
# PE(pos, 2i)   = sin(pos / 10000^(2i/d))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
pos = np.arange(max_pos)[:, None]                       # (max_pos, 1)
i   = np.arange(d_model)[None, :]                       # (1, d_model)
# use 2*(i//2) so neighbouring dims share the same frequency
div = np.power(10000.0, (2 * (i // 2)) / d_model)
angles = pos / div                                      # (max_pos, d_model)
 
PE = np.zeros((max_pos, d_model))
PE[:, 0::2] = np.sin(angles[:, 0::2])                   # even dims → sin
PE[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dims  → cos
 
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(PE, aspect="auto", cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xlabel("embedding dimension  i  (0 … 63)")
ax.set_ylabel("position  pos  (0 … 49)")
ax.set_title("Sinusoidal positional encoding  PE(pos, i)\n"
             "Vaswani et al. 2017 — fixed (not learned), absolute positions")
 
# Annotate the frequency gradient
ax.annotate("low-index dims\nvary FAST with position",
            xy=(3, 25), xytext=(8, 5),
            fontsize=9, color="black",
            arrowprops=dict(arrowstyle="->", color="black"))
ax.annotate("high-index dims\nvary SLOW with position",
            xy=(60, 25), xytext=(40, 45),
            fontsize=9, color="black",
            arrowprops=dict(arrowstyle="->", color="black"))
 
plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04, label="PE value")
plt.tight_layout(); plt.show()

What this shows. The full (50 × 64) positional-encoding matrix. Each row is the vector added to the embedding of one token position; each column is one of the 64 embedding dimensions.

Things to notice.

Low dimensions (left) vary fast with position — short-wavelength sinusoids encode fine-grained, local position info.
High dimensions (right) vary slowly — long-wavelength sinusoids encode coarse, “where in the document am I” info.
This multi-scale frequency stack is why the encoding can represent both relative offsets and absolute positions, and why the model can extrapolate to sequence lengths it wasn’t trained on.
⚠️ Slide-vs-paper: the Methods of AI slides call this “relative” positional encoding. Vaswani’s original paper uses absolute sinusoidal encoding — exactly the matrix above. Relative PE (Shaw et al. 2018, RoPE, ALiBi) came later.

🐍 Figure — Vanishing gradient: sigmoid stack vs ReLU stack

import micropip
await micropip.install("matplotlib")
import matplotlib.pyplot as plt
import numpy as np
 
# Sample pre-activations from a typical "post-init" distribution
rng = np.random.default_rng(0)
z = rng.standard_normal(2000)
 
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def dsigmoid(x):
    s = sigmoid(x); return s * (1 - s)
def drelu(x): return (x > 0).astype(float)
 
max_layers = 30
layers = np.arange(1, max_layers + 1)
 
# For each L, the product of L i.i.d. derivative samples — averaged over the sample
sig_d  = dsigmoid(z)              # per-sample derivative
relu_d = drelu(z)
 
# Geometric mean of derivative magnitudes is a clean proxy for
# how a product of L of them scales:  product ≈ (geo_mean)^L
sig_gm  = np.exp(np.mean(np.log(sig_d + 1e-12)))
relu_gm = np.exp(np.mean(np.log(relu_d + 1e-12)))  # 0 for dead units → tiny floor
 
sig_prod  = sig_gm  ** layers
relu_prod = relu_gm ** layers
 
fig, ax = plt.subplots(figsize=(9, 5))
ax.semilogy(layers, sig_prod,  "o-", color="#3498db", label="sigmoid stack  (∏ σ'(z))")
ax.semilogy(layers, relu_prod, "s-", color="#e74c3c", label="ReLU stack    (∏ ReLU'(z))")
ax.axhline(1.0, color="#888", lw=0.7)
ax.set_xlabel("number of layers L")
ax.set_ylabel("typical magnitude of  ∏ₗ φ'(zₗ)  (log scale)")
ax.set_title("Why sigmoid stacks vanish and ReLU stacks don't\n"
             "(geometric-mean product of layer-wise activation derivatives)")
ax.grid(alpha=0.3, which="both"); ax.legend(fontsize=10)
 
ax.annotate(f"after 10 sigmoid layers ≈ {sig_prod[9]:.1e}",
            xy=(10, sig_prod[9]), xytext=(12, sig_prod[9] * 0.3),
            fontsize=9, color="#3498db",
            arrowprops=dict(arrowstyle="->", color="#3498db"))
ax.annotate(f"ReLU stays near {relu_prod[9]:.2f}",
            xy=(10, relu_prod[9]), xytext=(14, relu_prod[9] * 2),
            fontsize=9, color="#e74c3c",
            arrowprops=dict(arrowstyle="->", color="#e74c3c"))
 
plt.tight_layout(); plt.show()

What this shows. Backprop multiplies one activation derivative per layer. The plot tracks the typical product magnitude ∏ₗ φ'(zₗ) (geometric mean over a Gaussian pre-activation sample) as the network gets deeper.

Things to notice.

Sigmoid’s derivative has maximum value 0.25. Multiplying 10 of these gives ≈ 4·10⁻⁷ — the gradient reaching the first layer is effectively zero. That is the vanishing-gradient problem.
ReLU’s derivative is either 0 or 1. On active units the product stays at 1 regardless of depth → gradients survive intact across 30+ layers.
This single property — combined with residual connections and good init (He/Xavier) — is what made networks with hundreds of layers practical (ResNet, Transformers).

ALGORITHMS (full reference) ⭐

Pseudocode and intuition for the algorithms you might write or trace by hand in the exam. Organised by topic:

Hopfield: #1 update rule · #2 Hebbian learning

MLP training: #3 forward pass · #4 backpropagation · #5 gradient descent

Regularisation: #6 dropout

Transformer attention: #7 scaled dot-product · #8 self-attention · #9 multi-head

1. Hopfield update rule

Goal. Asynchronously update one neuron at a time toward a stored pattern.

HOPFIELD-UPDATE(state y, weights W, biases b):
    repeat until stable:
        pick neuron i uniformly at random
        s ← Σⱼ≠ᵢ W[i, j] * y[j] + b[i]
        y[i] ← +1 if s ≥ 0 else −1
    return y

Slide-equivalent form writes W[j, i] (= w_ji) instead of W[i, j] — same value because W is symmetric.

Property. Energy E(y) = −½ yᵀ W y − bᵀ y is non-increasing on every update → guaranteed convergence to a local minimum (stored pattern or parasitic state).

2. Hopfield Hebbian learning

Goal. Store P bipolar patterns ξ^(1), …, ξ^(P) ∈ {−1, +1}^N as fixed points.

HEBBIAN-LEARN(patterns ξ^(1..P), N):
    initialise W ← 0      # N × N matrix
    for p in 1..P:
        for i in 1..N:
            for j in 1..N:
                if i ≠ j:
                    W[i, j] += ξ^(p)[i] * ξ^(p)[j]
    W ← W / N
    return W

One-shot. No iteration over the data — single pass.
Symmetric. W[i, j] = W[j, i] by construction.
Capacity. Slides: “up to N”. Reality: ~0.138 · N (Amit et al. 1985).

3. MLP forward pass

Goal. Compute the output of a feedforward network with L layers.

FORWARD(x, weights {W^(1..L)}, biases {b^(1..L)}, activation φ):
    h ← x
    for ℓ in 1..L:
        z^(ℓ) ← W^(ℓ) · h + b^(ℓ)
        h^(ℓ) ← φ(z^(ℓ))           # ReLU / sigmoid / tanh
        h ← h^(ℓ)
    return h     # network output

4. Backpropagation (chain rule pseudocode)

Goal. Compute gradient of the loss w.r.t. all weights via the chain rule, layer by layer, backward.

BACKPROP(x, target t, weights {W^(1..L)}, biases {b^(1..L)}):
    # --- forward pass, cache intermediates ---
    h^(0) ← x
    for ℓ in 1..L:
        z^(ℓ) ← W^(ℓ) · h^(ℓ−1) + b^(ℓ)
        h^(ℓ) ← φ(z^(ℓ))
 
    L ← loss(h^(L), t)
 
    # --- backward pass ---
    δ^(L) ← ∇_{h^(L)} loss  ⊙  φ'(z^(L))            # output-layer error
    for ℓ in L−1 .. 1 step −1:
        δ^(ℓ) ← (W^(ℓ+1))ᵀ · δ^(ℓ+1)  ⊙  φ'(z^(ℓ))   # propagate back
 
    # --- gradients ---
    for ℓ in 1..L:
        grad_W^(ℓ) ← δ^(ℓ) · (h^(ℓ−1))ᵀ
        grad_b^(ℓ) ← δ^(ℓ)
    return {grad_W^(ℓ)}, {grad_b^(ℓ)}

Cost. Same order as the forward pass: O(network size) per example.
Numerical pitfalls. Vanishing gradient when φ’ is small (sigmoid saturation); exploding gradient when products of W norms exceed 1.

5. Gradient Descent step

Goal. Apply the gradient computed by backprop to update the parameters.

SGD-STEP(weights W, gradients ∇W, learning rate η, weight decay λ):
    for each parameter w in W:
        w ← w − η · (∇w + 2 λ w)      # 2λw term = L2 / weight decay

Variants. Momentum, Nesterov, AdaGrad, RMSProp, Adam (default for Transformers).

6. Dropout

Goal. Regularise by randomly silencing units during training.

DROPOUT-FORWARD(h, drop_prob p, training):
    if training:
        m ← Bernoulli(1 − p, shape = h.shape)   # 0/1 mask
        return (h * m) / (1 − p)                # inverted dropout: rescale
    else:
        return h                                # no dropout at test time

Why divide by (1 − p)? Keeps expected activation the same as in eval mode → no need to rescale at test time.

7. Scaled Dot-Product Attention

Goal. Compute attention output given Q, K, V matrices.

ATTENTION(Q, K, V):
    # Q: (n × d_k),  K: (n × d_k),  V: (n × d_v)
    scores ← Q · Kᵀ                    # (n × n) raw similarity
    scores ← scores / sqrt(d_k)        # scale to keep softmax well-behaved
    weights ← softmax(scores, axis=-1) # row-wise; each row sums to 1
    return weights · V                 # (n × d_v) output

Optional causal mask (for GPT-style decoder): before softmax, set scores[i, j] = −∞ for j > i so each token only attends to earlier ones.

8. Self-Attention

Goal. Apply scaled dot-product attention where Q, K, V are all linear projections of the same input.

SELF-ATTENTION(X, W_Q, W_K, W_V):
    # X: (n × d_model)
    Q ← X · W_Q          # (n × d_k)
    K ← X · W_K          # (n × d_k)
    V ← X · W_V          # (n × d_v)
    return ATTENTION(Q, K, V)

Permutation-equivariance. Without positional encoding, permuting rows of X permutes rows of the output identically — the model has no sense of order.

9. Multi-head Attention

Goal. Run H attention operations in parallel on different learned subspaces, then combine.

MULTI-HEAD(X, {W_Q^h, W_K^h, W_V^h for h in 1..H}, W_O):
    heads ← []
    for h in 1..H:
        head_h ← SELF-ATTENTION(X, W_Q^h, W_K^h, W_V^h)   # (n × d_v)
        heads.append(head_h)
    concat ← Concat(heads, axis=-1)                       # (n × H·d_v)
    return concat · W_O                                   # (n × d_model)

Typical sizing. d_k = d_v = d_model / H so the total parameter count is comparable to a single full-size head.

Worked example: small backprop trace

Tiny 2-input → 1-hidden → 1-output network with sigmoid activations, squared-error loss.

Setup.

Inputs: x = (1, 0.5), target t = 1.
Weights: W^(1) = [[0.5, −0.3]] (1 hidden unit), b^(1) = 0. W^(2) = [0.8], b^(2) = 0.
Activation: σ(x) = 1/(1+e^−x). Loss: L = ½(y − t)².

Forward pass.

z^(1) = 0.5·1 + (−0.3)·0.5 = 0.35 → h^(1) = σ(0.35) ≈ 0.5866.
z^(2) = 0.8 · 0.5866 ≈ 0.4693 → y = σ(0.4693) ≈ 0.6152.
L = ½(0.6152 − 1)² ≈ 0.0740.

Backward pass. σ’(z) = σ(z)·(1 − σ(z)).

Output error: δ^(2) = (y − t) · σ'(z^(2)) = (0.6152 − 1) · 0.6152 · (1 − 0.6152) ≈ −0.3848 · 0.2367 ≈ −0.0911.
Hidden error: δ^(1) = W^(2) · δ^(2) · σ'(z^(1)) = 0.8 · (−0.0911) · 0.5866 · 0.4134 ≈ −0.01767.
Gradients:
- ∂L/∂W^(2) = δ^(2) · h^(1) ≈ −0.0911 · 0.5866 ≈ −0.0534.
- ∂L/∂W^(1) = δ^(1) · xᵀ ≈ (−0.01767) · (1, 0.5) ≈ (−0.01767, −0.00884).

SGD step (η = 0.1).

W^(2) ← 0.8 − 0.1 · (−0.0534) ≈ 0.8053.
W^(1) ← (0.5, −0.3) − 0.1 · (−0.01767, −0.00884) ≈ (0.5018, −0.2991).

After one step, the network’s output moves slightly toward t = 1.

Worked example: attention on a tiny matrix

Three tokens, d_k = 2, no scaling for clarity (set √d_k = 1 to keep arithmetic clean — in practice you’d divide).

Inputs (already projected).

Q = [[1, 0],     K = [[1, 0],     V = [[1, 0],
     [0, 1],          [0, 1],          [0, 1],
     [1, 1]]          [1, 1]]          [1, 1]]

Step 1 — raw scores Q · Kᵀ.

       k1   k2   k3
q1 = [ 1,   0,   1 ]
q2 = [ 0,   1,   1 ]
q3 = [ 1,   1,   2 ]

Step 2 — row-wise softmax. (using e ≈ 2.718)

Row 1: softmax(1, 0, 1) = (e, 1, e) / (2e + 1) ≈ (0.422, 0.155, 0.422).
Row 2: same shape as row 1 by symmetry → (0.155, 0.422, 0.422).
Row 3: softmax(1, 1, 2) = (e, e, e²) / (2e + e²) ≈ (0.212, 0.212, 0.576).

Step 3 — output = weights · V.

Out₁ = 0.422·(1,0) + 0.155·(0,1) + 0.422·(1,1) ≈ (0.844, 0.577).
Out₂ = 0.155·(1,0) + 0.422·(0,1) + 0.422·(1,1) ≈ (0.577, 0.844).
Out₃ = 0.212·(1,0) + 0.212·(0,1) + 0.576·(1,1) ≈ (0.788, 0.788).

Reading the result. Token 1 (q1 = [1,0]) attends most to keys that share its first component (k1, k3) → its output is closer to v1 + v3. Token 3 (q3 = [1,1]) is most similar to k3, so its output is dominated by v3 = (1,1). This is exactly the intuition: high Q·K similarity → high attention weight → that V contributes most.

Hopfield / energy-based memory: Hopfield Networks
Backprop: Gradient Backpropagation
Deep nets in general: Deep Neural Networks · Deep Neural Networks in Computational Neuroscience · Implementing Artificial Neural Networks with TensorFlow
Transformers / Attention: Transformers · Self-Attention · Attention is All You Need · YT - Attention in Transformers vs. menschliche Aufmerksamkeit
Neuroscience bridge: The neuroconnectionist research programme · Neuroconnectionism and TNNs · Neural Encoding with Deep Neural Networks
Related Methods of AI hubs: Support Vector Machines · Machine Learning I-II · Methods of AI Lecture

Brain Online

Explorer

Neural Networks & Deep Learning

Neural Networks & Deep Learning

Table of contents

Core Ideas

Glossary — important vocabulary

Hopfield Networks

Architecture

Update rule

Energy function

Hebbian learning (storing P patterns ξ^(1), …, ξ^(P))

Capacity

Parasitic / spurious memories

Content-addressable memory

Multilayer Perceptron (MLP)

Why ReLU

Backpropagation

Setup

Local error signal

Gradient w.r.t. parameters

Update

Universal Approximation

Deep Learning — challenges & solutions

Residual connections (ResNet)

Dropout

Weight decay

Early stopping

Autoencoders

Variants and uses

Transformers & Self-Attention

Scaled dot-product attention

Self-attention

Multi-head attention

Positional encoding

BERT vs. GPT

Formulas & Notation

Common Exam Traps ⚠️

Comparison Table

Slide-vs-textbook fact-check notes ⚠️

Visualisations (Python)

ALGORITHMS (full reference) ⭐

1. Hopfield update rule

2. Hopfield Hebbian learning

3. MLP forward pass

4. Backpropagation (chain rule pseudocode)

5. Gradient Descent step

6. Dropout

7. Scaled Dot-Product Attention

8. Self-Attention

9. Multi-head Attention

Worked example: small backprop trace

Worked example: attention on a tiny matrix

Related atomic notes

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis