Quiz: Neural Networks & Deep Learning
Methods of AI — SoSe 2026
Reference: Neural Networks & Deep Learning · lernzettel_neural-networks-deep-learning_30-04-26
Q1 — Neural Networks
Question: How does a Hopfield network store and retrieve patterns? What is the energy function?
Answer
Storage: weights are set so that target patterns are energy minima. Training: W* = argmin_W [Σ_{y∈Yp} E(y) − Σ_{y∉Yp} E(y)], i.e. lower the energy of the target patterns Yp and raise it for all other states.
Retrieval: initialize the network with a (possibly corrupted) input pattern. Update the neurons one at a time: yᵢ = sign(Σⱼ≠ᵢ wᵢⱼ yⱼ + bᵢ). The energy never increases, so the state settles into a local minimum, ideally the stored pattern closest to the input (spurious minima that match no stored pattern are also possible). This is content-addressable memory.
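Energy function: E(y) = −Σ_{i<j} wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ (each pair counted once, with symmetric weights wᵢⱼ = wⱼᵢ and wᵢᵢ = 0)
With symmetric weights and one-neuron-at-a-time updates, each update either decreases or maintains the energy. The energy is bounded below and the number of states is finite → convergence guaranteed.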
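The storage and retrieval steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses the Hebbian rule wᵢⱼ = Σ_p yᵢ yⱼ as a simple stand-in for the energy-based training objective above, sets all biases to zero, and counts each pair once in the energy. The function names are made up for this example.

```python
def train_hebbian(patterns):
    """Hebbian weights (assumed rule): w_ij = sum over patterns of y_i * y_j, zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def energy(w, y, b=None):
    """E(y) = -sum_{i<j} w_ij y_i y_j - sum_i b_i y_i."""
    n = len(y)
    b = b or [0.0] * n
    pair = sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(b[i] * y[i] for i in range(n))


def retrieve(w, y, b=None, max_sweeps=10):
    """Asynchronous updates, one neuron at a time, until a full sweep changes nothing."""
    n = len(y)
    b = b or [0.0] * n
    y = list(y)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            field = sum(w[i][j] * y[j] for j in range(n) if j != i) + b[i]
            new = 1 if field >= 0 else -1
            if new != y[i]:
                y[i] = new
                changed = True
        if not changed:
            break
    return y


stored = [1, -1, 1, -1, 1, -1, 1, -1]
w = train_hebbian([stored])
corrupted = list(stored)
corrupted[0] = -corrupted[0]  # flip one bit
recalled = retrieve(w, corrupted)
```

With a single stored pattern and one flipped bit, the first sweep repairs the bit and the energy of the recalled state is lower than that of the corrupted input.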
Max’s answer:
Result:
Q2 — Neural Networks
Question: What is the vanishing gradient problem, and how do ReLU and residual connections solve it?
Answer
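Vanishing gradient: in deep networks using sigmoid/tanh activations, gradients shrink exponentially as they propagate backward through layers, because each layer multiplies the gradient by an activation derivative that is small wherever the unit saturates (the sigmoid derivative is at most 0.25). Neurons in early layers receive near-zero gradients → can’t learn.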
ReLU solution: ReLU(x) = max(0,x). For x>0, gradient = 1 (constant), so there is no squashing. For x≤0, gradient = 0: a neuron whose input stays negative stops learning (dead neuron problem), but this is usually a smaller issue than saturation.
Residual connections: instead of learning the full mapping H(x) directly, the layers learn the residual f(x) = H(x) − x, and the block outputs f(x) + x. The shortcut passes x through unchanged, so the gradient has an identity path (∂/∂x [f(x) + x] = ∂f/∂x + 1) that is not scaled by layer weights or activation derivatives. Enables training of very deep networks (100+ layers).
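A rough numerical illustration of the three cases. It assumes a toy chain of scalar layers with unit weights and the same pre-activation x at every layer, so the backpropagated gradient is simply a product of per-layer derivatives. The function names and the choice x = 0.5 are assumptions made for this sketch.

```python
import math


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # at most 0.25, reached at x = 0


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def chain_gradient(local_grad, depth, x=0.5, residual=False):
    """Product of per-layer derivatives for a toy scalar chain with unit weights.

    With residual=True each layer computes x + f(x), so its derivative is 1 + f'(x).
    """
    g = 1.0
    for _ in range(depth):
        d = local_grad(x)
        g *= (1.0 + d) if residual else d
    return g


depth = 20
g_sigmoid = chain_gradient(sigmoid_grad, depth)                      # shrinks toward 0
g_relu = chain_gradient(relu_grad, depth)                            # stays at 1
g_sigmoid_res = chain_gradient(sigmoid_grad, depth, residual=True)   # does not vanish
```

In this toy setting the residual gradient never drops below 1 because the sigmoid derivative is positive. Real networks have weight matrices whose contribution can have either sign, so the sketch only shows that the identity path keeps the product from collapsing.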
Max’s answer:
Result:
Q3 — Neural Networks
Question: Write the scaled dot-product attention formula. What are Q, K, V?
Answer
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
- Q (Query): what we’re looking for — “what information does this position need?”
- K (Key): what each position can offer — “what information does each position have?”
- V (Value): actual content — “what information does each position contain?”
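Process: QKᵀ computes dot-product similarities between all query-key pairs → divide by √d_k (keeps the scores from growing with d_k, which would saturate the softmax and shrink its gradients) → softmax over the keys gives attention weights that sum to 1 for each query → take the weighted sum of the value vectors.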
Self-attention: Q, K, V all derived from the same sequence → each token can attend to all others.
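The formula can be written out directly in plain Python. A minimal sketch, assuming Q, K and V are given as lists of row vectors and skipping the learned projections W_Q, W_K, W_V that a real model applies first; the helper names are made up for this example.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # weighted sum of the value vectors
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights


# Self-attention on a toy sequence: Q, K, V all come from the same tokens.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
```

Each row of `weights` sums to 1. The first token puts more weight on itself and on the third token than on the second, because its dot product with the second key is zero.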
Max’s answer:
Result:
Q4 — Neural Networks
Question: Why is positional encoding necessary in Transformers? What happens without it?
Answer
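Transformers have no recurrence: they process all tokens in parallel. Without positional encoding, self-attention is permutation-equivariant: reordering the input only reorders the outputs. Each word in “the cat sat on the mat” and “mat the on sat cat the” would get the same representation in both sentences, so the model could not tell them apart.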
Positional encoding adds information about the position of each token in the sequence. The original Transformer (Vaswani et al. 2017) uses sinusoidal functions: PE(pos, 2i) = sin(pos/10000^{2i/d}), PE(pos, 2i+1) = cos(pos/10000^{2i/d}).
Added to word embeddings before the attention layers → model can distinguish token order.
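The sinusoidal formula can be computed as follows. A minimal sketch in plain Python; the function name and the tiny sizes are chosen only for illustration.

```python
import math


def positional_encoding(num_positions, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d)), PE[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for k in range(0, d_model, 2):  # k plays the role of 2i
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)
    return pe


pe = positional_encoding(num_positions=4, d_model=6)
# Each row is added element-wise to the embedding of the token at that position.
```

Position 0 gets the vector [0, 1, 0, 1, 0, 1], and every later position gets a different vector, which is what lets the model distinguish token order.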
Max’s answer:
Result:
Q6 — Neural Networks
Question: What are the three main regularization techniques for deep networks? Briefly describe each.
Answer
- Weight decay (L2 regularization): adds penalty term λΣ‖w‖² to loss function. In update rule: weights are slightly reduced each step. Prevents weights from growing too large → smoother, less irregular decision boundary.
- Dropout: randomly deactivates neurons during training (each with probability p). Effectively trains an ensemble of many subnetworks. Forces the network to not rely on any single neuron → more robust representations.
- Early stopping: monitor loss on a validation set during training. Stop when validation loss starts increasing (even if training loss still decreasing). Prevents overfitting by not training too long.
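The three techniques can each be sketched in a few lines of plain Python. The assumptions go beyond the notes above and are only one common way to set things up: the dropout variant is "inverted" dropout (survivors are rescaled by 1/(1−p)), early stopping uses a `patience` counter, and all function names are hypothetical.

```python
import random


def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """The gradient of loss + lam * sum(w^2) is grad + 2 * lam * w."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]


def dropout(activations, p, rng):
    """Inverted dropout (assumed variant): drop with probability p, rescale survivors."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping once the loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


w_new = sgd_step_with_weight_decay([1.0, -2.0], grad=[0.0, 0.0])
dropped = dropout([1.0] * 10, p=0.5, rng=random.Random(0))
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5])
```

Even with a zero loss gradient, the weight-decay step pulls both weights slightly toward zero. The early-stopping helper returns epoch 2 for the made-up loss curve: training halts after two epochs without improvement and never sees the later value 0.5.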
Max’s answer:
Result:
Q7 — Neural Networks
Question: What is an autoencoder? What is the bottleneck and what is it used for?
Answer
An autoencoder consists of:
- Encoder: compresses input x to a latent code z (lower-dimensional)
- Bottleneck (code layer): the compressed representation
- Decoder: reconstructs input x’ from code z
Training objective: minimize reconstruction error, e.g. MSE: L = E_{x~P}[‖AE(x) − x‖²]
The bottleneck forces the network to learn a compressed representation, discarding noise and keeping essential structure.
Applications: dimensionality reduction (non-linear PCA), denoising, anomaly detection, feature learning, and as the basis for variational autoencoders (VAEs) for generation.
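A toy example of the encoder, bottleneck and decoder, assuming the simplest possible case: a linear autoencoder that squeezes 2-D points through a 1-D code and is trained by plain gradient descent on the MSE. The data, learning rate, step count and function names are all assumptions made for this sketch.

```python
import random


def reconstruct(x, enc, dec):
    z = sum(e * xi for e, xi in zip(enc, x))  # bottleneck: a single number
    return [d * z for d in dec], z


def mse(data, enc, dec):
    total = 0.0
    for x in data:
        x_hat, _ = reconstruct(x, enc, dec)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train(data, enc, dec, steps=500, lr=0.01):
    """Full-batch gradient descent on the mean squared reconstruction error."""
    for _ in range(steps):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            x_hat, z = reconstruct(x, enc, dec)
            r = [a - b for a, b in zip(x_hat, x)]  # residual x_hat - x
            dz = sum(2.0 * ri * di for ri, di in zip(r, dec))  # dL/dz for this sample
            for k in range(2):
                g_dec[k] += 2.0 * r[k] * z / len(data)
                g_enc[k] += dz * x[k] / len(data)
        enc = [e - lr * g for e, g in zip(enc, g_enc)]
        dec = [d - lr * g for d, g in zip(dec, g_dec)]
    return enc, dec


data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]  # points on a line
rng = random.Random(0)
enc0 = [rng.uniform(-0.5, 0.5) for _ in range(2)]
dec0 = [rng.uniform(-0.5, 0.5) for _ in range(2)]
initial_loss = mse(data, enc0, dec0)
enc, dec = train(data, enc0, dec0)
final_loss = mse(data, enc, dec)
```

Because the points lie on a line, one number per point is enough to reconstruct them, so the reconstruction error falls well below its starting value. A non-linear encoder and decoder follow the same pattern with more layers.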
Max’s answer:
Result:
Q8 — Neural Networks
Question: What is the lottery ticket hypothesis? What does it say about overparameterization?
Answer
Lottery ticket hypothesis (Frankle & Carbin, 2019): large overparameterized networks contain many sparse subnetworks (lottery tickets). A small fraction of these, the “winning tickets”, can be trained in isolation to nearly the same performance as the full network, provided they start from the same initial weights they had in the full network.
Implication: it’s not the overparameterization itself that matters, but having a large pool of subnetworks to draw from. Training finds the winning ticket from this pool.
Practical consequence: you can prune 80-90% of connections after training without significant performance loss, but pruning before training hurts (you don’t know which ticket wins yet).
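One way to picture how a winning ticket is extracted is magnitude pruning followed by a reset to the original initialization. The sketch below assumes a flat list of weights and made-up numbers; the function names are hypothetical, and the retraining step that would follow is left out.

```python
def magnitude_mask(trained, keep_fraction):
    """1 for the largest-magnitude trained weights, 0 for the rest."""
    n_keep = max(1, round(len(trained) * keep_fraction))
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(trained))]


def winning_ticket(initial, trained, keep_fraction):
    """Keep the surviving connections, but reset them to their initial values."""
    mask = magnitude_mask(trained, keep_fraction)
    return [w0 * m for w0, m in zip(initial, mask)], mask


initial = [0.10, -0.20, 0.05, 0.30, -0.15]
trained = [0.90, -0.02, 0.01, -1.20, 0.40]  # made-up values for illustration
ticket, mask = winning_ticket(initial, trained, keep_fraction=0.4)
```

The mask is chosen from the trained weights, which is why the ticket cannot be identified before training. The surviving connections then restart from their initial values rather than from a fresh random draw.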
Max’s answer:
Result:
Beyond the lecture (optional)
These questions go beyond the SoSe 2026 lecture slides (textbook / external additions). Kept for depth, not exam-critical.
Q5 — Neural Networks
Question: What is the difference between BERT and GPT in terms of training and what they’re used for?
Answer
| | BERT | GPT |
|---|---|---|
| Training objective | Masked Language Model (predict masked tokens) | Autoregressive (predict next token) |
| Context | Bidirectional (sees entire sequence) | Unidirectional (left-to-right only) |
| Primary use | Understanding (classification, NER, QA) | Generation (text completion, dialogue) |
| Pre-training | Fill in random masks in sentences | Predict each next word given all previous |

Both are Transformer-based, both use transfer learning (pre-train on large corpus, fine-tune on specific task).
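The difference in context and training objective can be made concrete with attention masks and training examples. A small sketch in plain Python, assuming hand-picked mask positions (real masked-language-model training picks them at random) and hypothetical function names.

```python
def causal_mask(n):
    """GPT-style: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Replace the chosen positions; the model is trained to predict the originals."""
    return [mask_token if i in positions else t for i, t in enumerate(tokens)]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Masked language modelling: hide some tokens, predict them from both sides.
mlm_input = mask_tokens(tokens, positions={1, 4})

# Autoregressive modelling: predict each token from everything before it.
next_token_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
```

The causal mask is lower-triangular, so a token never sees what comes after it. The bidirectional mask is all ones, which is what lets the masked-token objective use context from both sides.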
Max’s answer:
Result:
Score
Total: / 8