Study Sheet (Lernzettel): Neural Networks & Deep Learning (incl. Transformers)

Methods of AI · Summer Semester (SoSe) 2026 · 1-page exam sheet

For more depth: Neural Networks & Deep Learning (full reference with all algorithms, worked examples, fact-check notes) · quiz_neural-networks_30-04-26 (practice questions)

Core Ideas

Neural nets = layered weighted-sum + activation, trained by backprop.
Hopfield = energy-based associative memory; converges to local minimum.
Deep nets add expressivity but face vanishing gradients / overfitting → ReLU, residuals, dropout, weight decay, early stopping.
Transformers replace recurrence with self-attention (every token attends to every other, so cost is O(n²) in sequence length n). Basis of BERT (encoder, bidirectional) and GPT (decoder, autoregressive).
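
A minimal sketch of "layered weighted-sum + activation", assuming NumPy. The 2-3-1 layout, the hand-picked weights and the name `mlp_forward` are illustrative assumptions, not taken from the slides; the output layer is kept linear for simplicity.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, layers):
    """Each hidden layer computes relu(W @ h + b); the last layer stays linear."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b                      # weighted sum
        is_last = i == len(layers) - 1
        h = z if is_last else relu(z)      # activation on hidden layers only
    return h

# Hypothetical 2-3-1 network with hand-picked weights (illustration only)
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]), np.zeros(3)),
    (np.array([[1.0, 1.0, 1.0]]), np.array([0.1])),
]
y = mlp_forward(np.array([1.0, 2.0]), layers)
# hidden pre-activations: [-1.0, 1.5, 3.0] -> after ReLU: [0.0, 1.5, 3.0] -> y = [4.6]
```

Backprop then applies the chain rule to this same computation in reverse order, layer by layer.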

Mini-glossary

Term	Meaning
Hopfield network	Recurrent, symmetric weights, binary states; stores patterns as energy minima
MLP	Feedforward net: input → hidden(s) → output
ReLU	max(0, x); gradient is exactly 1 for x > 0 (and 0 for x < 0) → mitigates vanishing gradient
Vanishing gradient	In deep nets with sigmoid/tanh: gradients shrink toward 0 across layers
Residual connection	y = F(x) + x; identity shortcut lets gradient flow → enables very deep nets
Dropout	Randomly zero units during training → ensemble effect, less overfitting
Q / K / V	Query / Key / Value — three learned linear projections used in attention
Positional encoding	Position info added to embeddings (self-attention is order-blind otherwise)
BERT	Encoder-only, bidirectional, masked-LM training, used for understanding
GPT	Decoder-only, unidirectional, autoregressive training, used for generation
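
A rough numeric sketch of the vanishing-gradient entry above, assuming NumPy. It multiplies only the activation derivatives along a 20-layer chain and ignores the weight matrices; the depth and the pre-activation value 0.5 are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25 (reached at x = 0)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for x > 0, else 0

depth = 20
pre_activations = np.full(depth, 0.5)   # assumed: same positive input at every layer

# Chain rule: the gradient reaching layer 1 contains one such factor per layer.
sigmoid_factor = np.prod(sigmoid_grad(pre_activations))
relu_factor = np.prod(relu_grad(pre_activations))

print(f"sigmoid chain factor: {sigmoid_factor:.2e}")
print(f"ReLU chain factor:    {relu_factor:.2e}")
```

The sigmoid product shrinks geometrically with depth, while the ReLU product stays at 1 as long as every unit on the path is active.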

⭐ Full glossary + worked backprop and attention examples → Glossary — important vocabulary

Key Formulas

Formula	Meaning
`E(y) = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ`	Hopfield energy (never increases under asynchronous updates with symmetric W)
`yᵢ ← sign(Σⱼ wᵢⱼ yⱼ + bᵢ)`	Hopfield update (slides write w_ji = w_ij by symmetry)
`Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V`	Scaled dot-product attention
`L_wd = L + λ Σ ‖w‖²`	Loss with weight decay (L2 regularisation)
`ReLU(x) = max(0, x)`	Activation that avoids vanishing gradient
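
The Hopfield energy and update rule from the table can be sketched as follows, assuming NumPy. The Hebbian storage with 1/N normalisation and zero diagonal, the tie-break sign(0) = +1, the fixed update order and all function names are assumptions made for this illustration, not slide content.

```python
import numpy as np

def hebbian_weights(patterns):
    """One-shot Hebbian storage: W = (1/N) * sum of outer products, zero diagonal."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)      # assumption: no self-connections
    return W

def energy(W, b, y):
    return -0.5 * y @ W @ y - b @ y

def recall(W, b, y, sweeps=5):
    """Asynchronous updates: one unit at a time, in a fixed order."""
    y = y.copy()
    for _ in range(sweeps):
        for i in range(len(y)):
            y[i] = 1.0 if W[i] @ y + b[i] >= 0 else -1.0   # sign(0) taken as +1
    return y

stored = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
W = hebbian_weights([stored])
b = np.zeros(8)

noisy = stored.copy()
noisy[0] = -1.0                   # corrupt one bit
recalled = recall(W, b, noisy)

print(energy(W, b, noisy), energy(W, b, recalled))   # energy does not increase
```

With a single stored pattern the corrupted bit is repaired in the first sweep. With many stored patterns the same procedure can instead end in a parasitic minimum.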

Common Exam Traps ⚠️

Hopfield energy never increases on an update (non-increasing, not strictly decreasing) → converges to a local min, not necessarily a stored pattern.
Parasitic memories: local minima the net never stored — appear alongside the real ones.
Hopfield capacity (slides): “up to N target memories.” The ~0.138·N figure is Amit et al. 1985, not Hopfield 1982 — flag as external if cited.
Universal approximation: slides use 2 hidden layers (constructive proof). Cybenko’s textbook 1-hidden-layer result is not the slide claim.
Vaswani positional encoding (slide error): slides claim Vaswani used relative encoding, but the original 2017 paper actually used absolute sinusoidal. Flag if asked.
Modern Hopfield ↔ Self-Attention (Ramsauer 2020) is not in the slides — treat related quiz items (e.g. mega-quiz Q100) as supplementary, not exam-core.
w_ji vs. w_ij in Hopfield: slides use w_ji (incoming weight). With symmetric W this equals w_ij — same formula.
√d_k scaling prevents large dot products from saturating softmax (near-zero gradient).
Positional encoding is required in Transformers — self-attention is otherwise permutation-equivariant.
Self-attention vs. masked attention: BERT sees all tokens; GPT only past tokens.
ReLU vs. sigmoid: sigmoid saturates → vanishing gradient. ReLU does not (for x > 0).
Residual connections carry gradient through identity path → enables ResNet-depth nets.
Dropout is training-only. With keep probability p, either scale activations by p at test time, or use inverted dropout (scale by 1/p during training, nothing to do at test time).
BERT bidirectional, GPT unidirectional — don’t mix.
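
The attention traps above (√d_k scaling, full vs. masked attention) can be sketched in a few lines, assuming NumPy. Q, K and V are random stand-ins for the learned projections, and the `causal` flag and function names are assumptions for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V; with causal=True, token i only sees tokens <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaling keeps softmax out of saturation
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # block attention to later tokens
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

out_full, w_full = attention(Q, K, V)                    # BERT-style: all tokens visible
out_causal, w_causal = attention(Q, K, V, causal=True)   # GPT-style: no future tokens

print(np.round(w_causal, 2))   # upper triangle is zero
```

Each row of the weight matrix sums to 1. In the causal case the first token can only attend to itself, so its weight row is [1, 0, 0, 0].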

Quick Comparison Table

Architecture	Memory	Recurrence	Long-range deps	Training
Hopfield	Associative (energy minima)	Yes (state iterates to a fixed point, not over a sequence)	Limited by capacity (slides: up to N; ~0.138·N per Amit et al. 1985)	Hebbian (one-shot)
MLP	Distributed in weights	No	None (no sequence)	Backprop + SGD
RNN	Hidden state over time	Yes	Vanishes over long sequences	BPTT
Transformer	Via attention weights	No	Full — O(n²)	Backprop + Adam

Full algorithms (Hopfield update, Hebbian learning, MLP forward, backprop, SGD, dropout, scaled dot-product, self-attention, multi-head) + worked backprop trace and worked attention computation → ALGORITHMS (full reference) ⭐

Practice quiz

quiz_neural-networks_30-04-26 — 8 Qs

Targeted exam questions in Questions for Methods of AI

Q80–84 (basic: Hopfield, MLP, backprop, perceptron, activations) · Q96 (Autoencoder) · Q97 (Transformer) · Q104 (Vanishing gradient + ReLU / ResNet) · Q133–139 (deep / exam-trap: Hopfield capacity, backprop chain rule, ReLU + exploding gradients, Modern Hopfield ↔ Self-Attention (Ramsauer — supplementary), Scaled Dot-Product Attention, sin/cos positional encoding, BERT vs. GPT)

Atomic notes

Hopfield Networks · Gradient Backpropagation · Deep Neural Networks · Deep Neural Networks in Computational Neuroscience · Implementing Artificial Neural Networks with TensorFlow · The neuroconnectionist research programme
Transformers / Attention: Transformers · Self-Attention · Attention is All You Need · Attention Systems · YT - Attention in Transformers vs. menschliche Aufmerksamkeit

See also (sibling study sheets)

lernzettel_svm_30-04-26 — Perceptron foundations
lernzettel_ml-i-ii_30-04-26 — ML methodology context
