Study sheet (Lernzettel): Neural Networks & Deep Learning (incl. Transformers)
Methods of AI, summer semester (SoSe) 2026 · 1-page exam sheet
For more depth: Neural Networks & Deep Learning (full reference with all algorithms, worked examples, fact-check notes) · quiz_neural-networks_30-04-26 (practice questions)
Core Ideas
- Neural nets = layered weighted-sum + activation, trained by backprop.
- Hopfield = energy-based associative memory; converges to local minimum.
- Deep nets add expressivity but face vanishing gradients / overfitting → ReLU, residuals, dropout, weight decay, early stopping.
- Transformers replace recurrence with self-attention (O(n²), every token attends to every other). Basis of BERT (encoder, bidirectional) and GPT (decoder, autoregressive).
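A minimal sketch of "layered weighted-sum + activation", assuming a hypothetical 2-3-1 network with hand-picked weights (`W1`, `b1`, `W2`, `b2` are illustrative, not from the slides). It shows the forward pass only; backprop is not included.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum per unit, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical weights, chosen by hand for illustration only.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # 3 hidden units, 2 inputs each
b1 = [0.0, 0.0, -1.0]
W2 = [[1.0, 1.0, 1.0]]                       # 1 output unit, 3 inputs
b2 = [0.0]

x = [2.0, 1.0]
hidden = dense(x, W1, b1, relu)          # [1.0, 1.5, 0.0]
output = dense(hidden, W2, b2, sigmoid)  # sigmoid(2.5), about 0.924
```

The third hidden unit has a negative weighted sum, so ReLU clips it to 0 and it contributes nothing to the output.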
Mini-glossary
| Term | Meaning |
|---|---|
| Hopfield network | Recurrent, symmetric weights, binary states; stores patterns as energy minima |
| MLP | Feedforward net: input → hidden(s) → output |
| ReLU | max(0, x); gradient is 1 for x > 0 (0 for x < 0) → mitigates vanishing gradient |
| Vanishing gradient | In deep nets with sigmoid/tanh: gradients shrink toward 0 across layers |
| Residual connection | y = F(x) + x; identity shortcut lets gradient flow → enables very deep nets |
| Dropout | Randomly zero units during training → ensemble effect, less overfitting |
| Q / K / V | Query / Key / Value — three learned linear projections used in attention |
| Positional encoding | Position info added to embeddings (self-attention is order-blind otherwise) |
| BERT | Encoder-only, bidirectional, masked-LM training, used for understanding |
| GPT | Decoder-only, unidirectional, autoregressive training, used for generation |
⭐ Full glossary + worked backprop and attention examples → Glossary — important vocabulary
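A rough numeric illustration of the vanishing-gradient, ReLU and residual entries above. It assumes a single path through 20 layers with all weights fixed at 1, so the backpropagated gradient is just the product of local activation derivatives; `chained_gradient` is a hypothetical helper, not a slide algorithm.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_gradient(local_grad, pre_activations):
    """Product of local derivatives along one path (weights assumed to be 1)."""
    g = 1.0
    for z in pre_activations:
        g *= local_grad(z)
    return g

depth = 20
at_zero = [0.0] * depth   # best case for sigmoid
positive = [1.0] * depth  # active ReLU units

sigmoid_path = chained_gradient(sigmoid_grad, at_zero)  # 0.25**20, about 9.1e-13
relu_path = chained_gradient(relu_grad, positive)       # 1.0

# Residual block y = F(x) + x: the local derivative is F'(x) + 1, so it never drops below 1 here.
residual_path = chained_gradient(lambda z: sigmoid_grad(z) + 1.0, at_zero)  # 1.25**20, about 86.7
```

Even in the best case the sigmoid path shrinks by a factor of 4 per layer. Note that the residual product grows here, so the identity shortcut alone does not prevent exploding gradients.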
Key Formulas
| Formula | Meaning |
|---|---|
| E(y) = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ | Hopfield energy (never increases on update) |
| yᵢ ← sign(Σⱼ wᵢⱼ yⱼ + bᵢ) | Hopfield update (slides write w_ji = w_ij by symmetry) |
| Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V | Scaled dot-product attention |
| L_wd = L + λ Σ ‖w‖² | Loss with weight decay (L2 regularisation) |
| ReLU(x) = max(0, x) | Activation that avoids vanishing gradient |
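The two Hopfield formulas can be sketched as follows. Assumptions not taken from the slides: storage uses the outer-product Hebbian rule with a zero diagonal, states are ±1, sign(0) is treated as +1, and updates are asynchronous (one unit at a time). The function names are illustrative.

```python
def hebbian_weights(patterns):
    """One-shot Hebbian storage (assumed form): w_ij = sum_p p_i * p_j, with w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def energy(w, b, y):
    """E(y) = -1/2 * sum_ij w_ij y_i y_j - sum_i b_i y_i"""
    n = len(y)
    pair = sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(b[i] * y[i] for i in range(n))

def recall(w, b, y, max_sweeps=10):
    """Asynchronous updates y_i <- sign(sum_j w_ij y_j + b_i) until no unit changes."""
    y = list(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            h = sum(w[i][j] * y[j] for j in range(len(y))) + b[i]
            new = 1 if h >= 0 else -1  # assumption: sign(0) = +1
            if new != y[i]:
                y[i] = new
                changed = True
        if not changed:
            break
    return y

stored = [1, -1, 1, -1, 1, -1]
w = hebbian_weights([stored])
b = [0.0] * len(stored)
noisy = [1, -1, 1, -1, 1, 1]  # last bit flipped
recovered = recall(w, b, noisy)

print(recovered == stored)                       # True
print(energy(w, b, noisy), energy(w, b, recovered))  # -5.0 -15.0
```

With a single stored pattern the flipped bit is repaired in one sweep and the energy drops from −5 to −15. With several stored patterns the same loop can also end in a parasitic minimum.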
Common Exam Traps ⚠️
- Hopfield energy never increases under updates (monotonically non-increasing) → converges to a local min, not necessarily a stored pattern.
- Parasitic memories: local minima the net never stored — appear alongside the real ones.
- Hopfield capacity (slides): “up to N target memories.” The ~0.138·N figure is Amit et al. 1985, not Hopfield 1982 — flag as external if cited.
- Universal approximation: slides use 2 hidden layers (constructive proof). Cybenko’s textbook 1-hidden-layer result is not the slide claim.
- Vaswani positional encoding (slide error): slides claim Vaswani used relative encoding, but the original 2017 paper actually used absolute sinusoidal. Flag if asked.
- Modern Hopfield ↔ Self-Attention (Ramsauer 2020) is not in the slides — treat related quiz items (e.g. mega-quiz Q100) as supplementary, not exam-core.
- w_ji vs. w_ij in Hopfield: slides use w_ji (incoming weight). With symmetric W this equals w_ij, so the formula is the same.
- √d_k scaling prevents large dot products from saturating softmax (near-zero gradient).
- Positional encoding is required in Transformers — self-attention is otherwise permutation-equivariant.
- Self-attention vs. masked attention: BERT sees all tokens; GPT only past tokens.
- ReLU vs. sigmoid: sigmoid saturates → vanishing gradient. ReLU does not (for x > 0).
- Residual connections carry gradient through identity path → enables ResNet-depth nets.
- Dropout is training-only; rescale at test time (or use inverted dropout during training).
- BERT bidirectional, GPT unidirectional — don’t mix.
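The attention-related traps above (√d_k scaling, full vs. masked attention) can be sketched in plain Python. Assumptions: Q, K and V are taken directly from toy token vectors `X` (real models use learned projections), and the `causal` flag is a hypothetical switch that hides future positions, roughly the GPT-style case.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, row by row; causal=True masks positions j > i."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Future tokens get -inf, which softmax turns into weight 0.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy sequence of 3 tokens with d_k = 2 (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

full_out, full_w = attention(X, X, X)                  # every token sees every token
causal_out, causal_w = attention(X, X, X, causal=True)

print(causal_w[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

In the unmasked call every row of `full_w` has three non-zero weights; in the causal call row i has zeros for all j > i. Shuffling the rows of `X` only shuffles the unmasked outputs in the same way, which is why positional information has to be added separately.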
Quick Comparison Table
| Architecture | Memory | Recurrence | Long-range deps | Training |
|---|---|---|---|---|
| Hopfield | Associative (energy minima) | Yes (states updated until convergence) | Limited by capacity (slides: up to N; ~0.138·N per Amit et al. 1985) | Hebbian (one-shot) |
| MLP | Distributed in weights | No | None (no sequence) | Backprop + SGD |
| RNN | Hidden state over time | Yes | Vanishes over long sequences | BPTT |
| Transformer | Via attention weights | No | Full — O(n²) | Backprop + Adam |
Full algorithms (Hopfield update, Hebbian learning, MLP forward, backprop, SGD, dropout, scaled dot-product, self-attention, multi-head) + worked backprop trace and worked attention computation → ALGORITHMS (full reference) ⭐
Related Q&A & Notes
Practice quiz
Targeted exam questions in Questions for Methods of AI
- Q80–84 (basic: Hopfield, MLP, backprop, perceptron, activations) · Q96 (Autoencoder) · Q97 (Transformer) · Q104 (Vanishing gradient + ReLU / ResNet) · Q133–139 (deep / exam-trap: Hopfield capacity, backprop chain rule, ReLU + exploding gradients, Modern Hopfield ↔ Self-Attention (Ramsauer — supplementary), Scaled Dot-Product Attention, sin/cos positional encoding, BERT vs. GPT)
Atomic notes
- Hopfield Networks · Gradient Backpropagation · Deep Neural Networks · Deep Neural Networks in Computational Neuroscience · Implementing Artificial Neural Networks with TensorFlow · The neuroconnectionist research programme
- Transformers / Attention: Transformers · Self-Attention · Attention is All You Need · Attention Systems · YT - Attention in Transformers vs. menschliche Aufmerksamkeit
See also (sibling study sheets)
- lernzettel_svm_30-04-26 — Perceptron foundations
- lernzettel_ml-i-ii_30-04-26 — ML methodology context
See also
Tags: methods-of-ai lernzettel
Full reference: Neural Networks & Deep Learning
Quiz: quiz_neural-networks_30-04-26
Superlink: Methods of AI Lecture
Questions hub: Questions for Methods of AI
Created: 30/04/26