Study sheet: Neural Networks & Deep Learning (incl. Transformers)

Methods of AI, summer semester 2026 · 1-page exam sheet

For more depth: Neural Networks & Deep Learning (full reference with all algorithms, worked examples, fact-check notes) · quiz_neural-networks_30-04-26 (practice questions)

Core Ideas

  • Neural nets = layered weighted-sum + activation, trained by backprop.
  • Hopfield = energy-based associative memory; asynchronous updates converge to a local energy minimum.
  • Deep nets add expressivity but face vanishing gradients / overfitting → ReLU, residuals, dropout, weight decay, early stopping.
  • Transformers replace recurrence with self-attention (O(n²), every token attends to every other). Basis of BERT (encoder, bidirectional) and GPT (decoder, autoregressive).
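
A minimal numeric sketch of the vanishing-gradient point above, under simplifying assumptions that are not from the slides: a chain of identical one-unit layers, all weights fixed to 1, every pre-activation at the same value. The helper names (`sigmoid_grad`, `relu_grad`, `chained_grad`) are made up for this illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; its maximum is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_grad(local_grad, pre_activation: float, depth: int) -> float:
    """Chain rule through `depth` identical layers (assumption: weights = 1)."""
    return local_grad(pre_activation) ** depth


depth = 10
print(chained_grad(sigmoid_grad, 0.0, depth))  # 0.25 ** 10, roughly 9.5e-07
print(chained_grad(relu_grad, 1.0, depth))     # 1.0
```

Even in the sigmoid's best case (x = 0) the factor per layer is 0.25, so ten layers shrink the gradient by about six orders of magnitude; the ReLU factor stays 1 as long as the unit is active.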

Mini-glossary

Term | Meaning
Hopfield network | Recurrent, symmetric weights, binary states; stores patterns as energy minima
MLP | Feedforward net: input → hidden layer(s) → output
ReLU | max(0, x); constant gradient of 1 for x > 0 → mitigates vanishing gradient
Vanishing gradient | In deep nets with sigmoid/tanh: gradients shrink toward 0 across layers
Residual connection | y = F(x) + x; identity shortcut lets gradient flow → enables very deep nets
Dropout | Randomly zero units during training → ensemble effect, less overfitting
Q / K / V | Query / Key / Value: three learned linear projections used in attention
Positional encoding | Position info added to embeddings (self-attention is order-blind otherwise)
BERT | Encoder-only, bidirectional, masked-LM training, used for understanding
GPT | Decoder-only, unidirectional, autoregressive training, used for generation
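
The dropout entry can be sketched as "inverted dropout", the variant that rescales during training so that nothing has to change at test time. This is an illustrative sketch, not the slides' algorithm; the function name `inverted_dropout` and its parameters are assumptions made for this example.

```python
import random


def inverted_dropout(activations, p_drop, rng, training=True):
    """Zero each unit with probability p_drop and scale survivors by 1 / (1 - p_drop).

    With training=False the activations pass through unchanged.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
h = [1.0] * 10_000
train_out = inverted_dropout(h, p_drop=0.5, rng=rng)
test_out = inverted_dropout(h, p_drop=0.5, rng=rng, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0: expected activation is preserved
print(test_out[:3])                     # [1.0, 1.0, 1.0]
```

Scaling the surviving units by 1 / (1 − p_drop) keeps the expected activation equal to the no-dropout value, which is why the test-time pass needs no extra rescaling.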

Full glossary + worked backprop and attention examples → Glossary — important vocabulary

Key Formulas

Formula | Meaning
E(y) = −½ Σᵢⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ | Hopfield energy (never increases under asynchronous updates with symmetric W)
yᵢ ← sign(Σⱼ wᵢⱼ yⱼ + bᵢ) | Hopfield update (slides write w_ji; w_ji = w_ij by symmetry)
Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V | Scaled dot-product attention
L_wd = L + λ Σ ‖w‖² | Loss with weight decay (L2 regularisation)
ReLU(x) = max(0, x) | Activation that mitigates vanishing gradient
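
The two Hopfield formulas can be checked on a toy example. The sketch below is an assumption-laden illustration, not the slides' reference algorithm: it uses Hebbian weights scaled by 1/N with a zero diagonal, maps sign(0) to +1, and updates units one at a time in random order. All function names are made up for this example.

```python
import random


def hebbian_weights(patterns):
    """w_ij = (1/N) * sum over patterns of x_i * x_j, with w_ii = 0."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += x[i] * x[j] / n
    return w


def energy(w, y, b):
    """E(y) = -1/2 * sum_ij w_ij y_i y_j - sum_i b_i y_i."""
    n = len(y)
    pair = sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(b[i] * y[i] for i in range(n))


def recall(w, y, b, rng, sweeps=5):
    """Asynchronous updates y_i <- sign(sum_j w_ij y_j + b_i); records the energy after each one."""
    y = list(y)
    energies = [energy(w, y, b)]
    for _ in range(sweeps):
        order = list(range(len(y)))
        rng.shuffle(order)
        for i in order:
            field = sum(w[i][j] * y[j] for j in range(len(y))) + b[i]
            y[i] = 1 if field >= 0 else -1  # assumption: sign(0) -> +1
            energies.append(energy(w, y, b))
    return y, energies


stored = [1, -1, 1, -1, 1, -1, 1, -1]
w = hebbian_weights([stored])
b = [0.0] * len(stored)

noisy = list(stored)
noisy[0] = -noisy[0]  # corrupt one bit
result, energies = recall(w, noisy, b, random.Random(0))
print(result == stored)  # True: the corrupted bit is repaired

inverted = [-v for v in stored]
inv_result, _ = recall(w, inverted, b, random.Random(1))
print(inv_result == inverted)  # True: the inverted pattern is a fixed point too
```

The recorded energies never increase from one update to the next, and the second check shows a minimum that was never stored explicitly (the sign-flipped pattern), which is the simplest case of a parasitic memory.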

Common Exam Traps ⚠️

  • Hopfield energy never increases under asynchronous updates → the net converges to a local minimum, which is not necessarily a stored pattern.
  • Parasitic (spurious) memories: local minima that were never stored; they appear alongside the stored patterns.
  • Hopfield capacity (slides): “up to N target memories.” The ~0.138·N figure is Amit et al. 1985, not Hopfield 1982 — flag as external if cited.
  • Universal approximation: slides use 2 hidden layers (constructive proof). Cybenko’s textbook 1-hidden-layer result is not the slide claim.
  • Vaswani positional encoding (slide error): slides claim Vaswani used relative encoding, but the original 2017 paper used absolute sinusoidal (sin/cos) encodings. Flag if asked.
  • Modern Hopfield ↔ Self-Attention (Ramsauer 2020) is not in the slides — treat related quiz items (e.g. mega-quiz Q100) as supplementary, not exam-core.
  • w_ji vs. w_ij in Hopfield: slides use w_ji (incoming weight). With symmetric W this equals w_ij — same formula.
  • √d_k scaling prevents large dot products from saturating softmax (near-zero gradient).
  • Positional encoding is required in Transformers — self-attention is otherwise permutation-equivariant.
  • Self-attention vs. masked attention: BERT attends to all tokens; GPT uses a causal mask, so each token attends only to itself and earlier tokens.
  • ReLU vs. sigmoid: sigmoid saturates → vanishing gradient. ReLU does not (for x > 0).
  • Residual connections carry gradient through identity path → enables ResNet-depth nets.
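  • Dropout is training-only. Standard dropout: scale activations by the keep probability at test time. Inverted dropout: scale by 1 / keep probability during training, so test time needs no change.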
  • BERT bidirectional, GPT unidirectional — don’t mix.
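
Two of the traps above (√d_k scaling, full vs. masked attention) can be made concrete with a small pure-Python sketch of scaled dot-product attention. It is an illustration under assumptions: matrices are plain lists of row vectors, there are no learned projections (Q = K = V = X), and the `causal` flag is a name chosen for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries of -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, returned together with the attention weights."""
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # GPT-style mask: position i must not attend to positions j > i.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, d_k = 2
out_full, w_full = attention(X, X, X)                    # BERT-style: every token sees every token
out_causal, w_causal = attention(X, X, X, causal=True)   # GPT-style: no look-ahead

print(w_full[0])      # all three weights are non-zero
print(w_causal[0])    # [1.0, 0.0, 0.0]: the first token can only attend to itself
print(out_causal[0])  # [1.0, 0.0], i.e. exactly V[0]
```

Setting masked scores to −∞ before the softmax gives those positions exactly zero weight while the remaining weights still sum to 1.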

Quick Comparison Table

Architecture | Memory | Recurrence | Long-range deps | Training
Hopfield | Associative (energy minima) | Yes (state iterated to a fixed point) | Limited by capacity (~0.138·N, external figure; slides: up to N) | Hebbian (one-shot)
MLP | Distributed in weights | No | None (no sequence) | Backprop + SGD
RNN | Hidden state over time | Yes | Vanishes over long sequences | BPTT
Transformer | Via attention weights | No | Full, at O(n²) cost | Backprop + Adam

Full algorithms (Hopfield update, Hebbian learning, MLP forward, backprop, SGD, dropout, scaled dot-product, self-attention, multi-head) + worked backprop trace and worked attention computation → ALGORITHMS (full reference) ⭐

Practice quiz

Targeted exam questions in Questions for Methods of AI

  • Q80–84 (basic: Hopfield, MLP, backprop, perceptron, activations) · Q96 (Autoencoder) · Q97 (Transformer) · Q104 (Vanishing gradient + ReLU / ResNet) · Q133–139 (deep / exam-trap: Hopfield capacity, backprop chain rule, ReLU + exploding gradients, Modern Hopfield ↔ Self-Attention (Ramsauer — supplementary), Scaled Dot-Product Attention, sin/cos positional encoding, BERT vs. GPT)

See also

Tags: methods-of-ai lernzettel
Full reference: Neural Networks & Deep Learning
Quiz: quiz_neural-networks_30-04-26
Superlink: Methods of AI Lecture
Questions hub: Questions for Methods of AI

Created: 30/04/26