Quiz: Neural Networks & Deep Learning

Methods of AI — SoSe 2026

Reference: Neural Networks & Deep Learning · lernzettel_neural-networks-deep-learning_30-04-26

Q1 — Neural Networks

Question: How does a Hopfield network store and retrieve patterns? What is the energy function?

Answer

Storage: weights are set so that target patterns are energy minima. Training: W* = argmin Σ_{y∈Yp} E(y) − Σ_{y∉Yp} E(y).
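Retrieval: initialize the network with a (possibly corrupted) input pattern. Update the neurons one at a time (asynchronously): yᵢ = sign(Σⱼ≠ᵢ wᵢⱼ yⱼ + bᵢ). The energy never increases, so the state settles in a local minimum, ideally the stored pattern closest to the input (spurious minima that are not stored patterns can also occur). This is content-addressable memory.
Energy function: E(y) = −Σᵢ<ⱼ wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ (each pair of neurons counted once)
With symmetric weights (wᵢⱼ = wⱼᵢ) and no self-connections, each update either decreases or maintains energy → convergence guaranteed.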
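A minimal sketch of storage, retrieval and the energy function, assuming NumPy. Note that it swaps in the simple Hebbian rule for storage instead of the argmin objective stated above; the function names and the two toy patterns are hypothetical.

```python
import numpy as np

def store_hebbian(patterns):
    """Hebbian weights for +/-1 patterns (one per row): symmetric, zero diagonal.

    Assumption: this is the simple Hebbian rule, not the argmin objective."""
    patterns = np.asarray(patterns, dtype=float)
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

def energy(W, b, y):
    """E(y) = -1/2 * sum_{i != j} w_ij y_i y_j - sum_i b_i y_i (each pair counted once)."""
    return -0.5 * y @ W @ y - b @ y

def retrieve(W, b, y0, max_sweeps=20):
    """Asynchronous sign updates until a full sweep changes no neuron."""
    y = np.array(y0, dtype=float)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(y)):
            new = 1.0 if W[i] @ y + b[i] >= 0 else -1.0
            if new != y[i]:
                y[i] = new
                changed = True
        if not changed:
            break
    return y

# Two hypothetical, mutually orthogonal patterns of 8 neurons each.
stored = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                   [1, 1, 1, 1, -1, -1, -1, -1]], dtype=float)
W = store_hebbian(stored)
b = np.zeros(8)

noisy = stored[0].copy()
noisy[0] = -1.0  # corrupt one bit of the first pattern
recalled = retrieve(W, b, noisy)
```

Comparing energy(W, b, noisy) with energy(W, b, recalled) shows the energy dropping as the corrupted bit is repaired.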

Max’s answer:
Result:

Q2 — Neural Networks

Question: What is the vanishing gradient problem, and how do ReLU and residual connections solve it?

Answer

Vanishing gradient: in deep networks using sigmoid/tanh activations, gradients shrink exponentially as they propagate backward through layers. Neurons in early layers receive near-zero gradients → can’t learn.
ReLU solution: ReLU(x) = max(0,x). For x>0, gradient = 1 (constant), so there is no squashing. For x≤0, gradient = 0; a neuron stuck in this regime stops learning (the dead neuron problem).
Residual connections: instead of learning a full transformation H(x), layers learn a residual f(x) = H(x) − x, and the block outputs f(x) + x. The shortcut adds x directly → gradient flows backward through the identity connection without being scaled by the layer's weights or activation derivatives. Enables training of very deep networks (100+ layers).
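The shrinking effect can be illustrated with a scalar toy model, assuming NumPy. It multiplies the per-layer local derivatives along a chain of 20 layers and ignores weight matrices entirely (an assumption made for brevity); the depth and pre-activation value are hypothetical.

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 20  # hypothetical number of layers
x = 0.5     # hypothetical pre-activation, taken to be the same at every layer

# Plain chain: the backpropagated gradient is a product of local derivatives.
sigmoid_chain = sigmoid_grad(x) ** depth
relu_chain = relu_grad(x) ** depth

# Residual block y = x + f(x): the local derivative is 1 + f'(x),
# so the identity term keeps the product from collapsing toward zero.
residual_chain = (1.0 + sigmoid_grad(x)) ** depth
```

Because the sigmoid derivative is at most 0.25, sigmoid_chain is bounded by 0.25 ** 20, whereas the ReLU chain stays at 1 as long as every unit is active.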

Max’s answer:
Result:

Q3 — Neural Networks

Question: Write the scaled dot-product attention formula. What are Q, K, V?

Answer

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Q (Query): what we’re looking for — “what information does this position need?”

K (Key): what each position can offer — “what information does each position have?”

V (Value): actual content — “what information does each position contain?”
Process: QKᵀ computes dot-product similarities between all query-key pairs → divide by √d_k (keeps the scores from growing with d_k, which would saturate the softmax and shrink its gradients) → softmax over the keys gives attention weights → take the weighted sum of the value vectors.
Self-attention: Q, K, V all derived from the same sequence → each token can attend to all others.
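The formula can be sketched in a few lines, assuming NumPy and a single attention head without masking. The projection matrices W_q, W_k, W_v and the random toy input are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarities
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

# Self-attention: Q, K and V are all projections of the same sequence X.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, model dimension 8 (toy values)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row i of weights shows how strongly token i attends to every token in the sequence.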

Max’s answer:
Result:

Q4 — Neural Networks

Question: Why is positional encoding necessary in Transformers? What happens without it?

Answer

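Transformers have no recurrence; they process all tokens in parallel. Without positional encoding, self-attention is permutation-equivariant: reordering the input only reorders the outputs. “the cat sat on the mat” and “mat the on sat cat the” would therefore yield the same set of token representations, and the model could not tell the two apart.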
Positional encoding adds information about the position of each token in the sequence. The original Transformer (Vaswani et al. 2017) uses sinusoidal functions: PE(pos, 2i) = sin(pos/10000^{2i/d}), PE(pos, 2i+1) = cos(pos/10000^{2i/d}).
Added to word embeddings before the attention layers → model can distinguish token order.
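A minimal sketch of the sinusoidal encoding above, assuming NumPy and an even model dimension d. The function name and the toy sizes are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)).

    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)          # even dimensions
    pe[:, 1::2] = np.cos(angles)          # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
# Each row would be added to the word embedding of the token at that position.
```

Every position gets a distinct row, which is what lets the attention layers tell token order apart.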

Max’s answer:
Result:

Q6 — Neural Networks

Question: What are the three main regularization techniques for deep networks? Briefly describe each.

Answer

Weight decay (L2 regularization): adds penalty term λΣ‖w‖² to loss function. In update rule: weights are slightly reduced each step. Prevents weights from growing too large → smoother, less irregular decision boundary.

Dropout: randomly deactivates neurons during training (each with probability p). Effectively trains an ensemble of many subnetworks. Forces the network to not rely on any single neuron → more robust representations.

Early stopping: monitor loss on a validation set during training. Stop when validation loss starts increasing (even if training loss is still decreasing). Prevents overfitting by not training too long.
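The three techniques can each be sketched as a small function, assuming NumPy. The function names, learning rate, λ and patience value are hypothetical; the dropout uses the common "inverted" variant that rescales surviving activations, which is an assumption beyond the description above.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One SGD step on loss + lam * ||w||^2; the penalty adds 2 * lam * w to the gradient."""
    return w - lr * (grad + 2.0 * lam * w)

def dropout(activations, p, rng):
    """Inverted dropout (assumed variant): zero each unit with probability p,
    then rescale survivors by 1 / (1 - p) so the expected activation is unchanged."""
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping the scan
    once `patience` consecutive epochs bring no improvement."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

rng = np.random.default_rng(0)
shrunk = sgd_step_with_weight_decay(np.array([1.0, -2.0]), grad=np.zeros(2))
dropped = dropout(np.ones(1000), p=0.5, rng=rng)
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.6])
```

With a zero gradient, the weight decay step still pulls both weights slightly toward zero, which is the "weights are slightly reduced each step" behaviour described above.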

Max’s answer:
Result:

Q7 — Neural Networks

Question: What is an autoencoder? What is the bottleneck and what is it used for?

Answer

An autoencoder consists of:

Encoder: compresses input x to a latent code z (lower-dimensional)

Bottleneck (code layer): the compressed representation

Decoder: reconstructs input x’ from code z
Training objective: minimize reconstruction error, e.g. MSE: L = E_{x~P}[‖AE(x) − x‖²]
The bottleneck forces the network to learn a compressed representation, discarding noise and keeping essential structure.
Applications: dimensionality reduction (non-linear PCA), denoising, anomaly detection, feature learning, and as the basis for variational autoencoders (VAEs) for generation.
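A minimal sketch of the encoder, bottleneck, decoder and MSE objective, assuming NumPy. It uses a purely linear autoencoder trained by plain gradient descent on hypothetical toy data (5-D points lying on a 2-D subspace); real autoencoders typically use non-linear layers and a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 5-D that lie on a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing

d_in, d_code = 5, 2  # the bottleneck has 2 units
W_enc = rng.normal(scale=0.1, size=(d_in, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, d_in))

def reconstruct(X, W_enc, W_dec):
    """Encoder compresses x to code z; decoder maps z back to input space."""
    z = X @ W_enc
    return z @ W_dec

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error ||AE(x) - x||^2 over the data set."""
    return np.mean(np.sum((reconstruct(X, W_enc, W_dec) - X) ** 2, axis=1))

loss_before = mse(X, W_enc, W_dec)

lr = 0.005  # hypothetical learning rate
for _ in range(2000):
    Z = X @ W_enc
    err = Z @ W_dec - X  # reconstruction error per sample
    grad_dec = 2.0 * Z.T @ err / len(X)
    grad_enc = 2.0 * X.T @ (err @ W_dec.T) / len(X)
    W_dec = W_dec - lr * grad_dec
    W_enc = W_enc - lr * grad_enc

loss_after = mse(X, W_enc, W_dec)
codes = X @ W_enc  # compressed 2-D representation of every sample
```

Because the toy data is intrinsically 2-D, a 2-unit bottleneck is in principle enough to reconstruct it; a narrower bottleneck would be forced to discard structure.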

Max’s answer:
Result:

Q8 — Neural Networks

Question: What is the lottery ticket hypothesis? What does it say about overparameterization?

Answer

Lottery ticket hypothesis (Frankle & Carbin, 2019): large overparameterized networks contain many subnetworks (lottery tickets). A small fraction of these, the “winning tickets”, can be trained in isolation to near the same performance as the full network, provided they start from their original initial weights.
Implication: it’s not the overparameterization itself that matters, but having a large pool of subnetworks to draw from. Training finds the winning ticket from this pool.
Practical consequence: you can prune 80-90% of connections after training without significant performance loss, but pruning before training hurts (you don’t know which ticket wins yet).
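A minimal sketch of how a candidate ticket could be extracted, assuming NumPy and one-shot magnitude pruning: keep the largest-magnitude trained weights, then reset the survivors to their initial values. The trained weights here are a hypothetical random stand-in, not the result of real training.

```python
import numpy as np

def magnitude_mask(weights, keep_fraction):
    """Boolean mask that keeps the largest-magnitude fraction of the weights."""
    k = max(1, int(round(keep_fraction * weights.size)))
    threshold = np.sort(np.abs(weights).ravel())[-k]
    return np.abs(weights) >= threshold

rng = np.random.default_rng(0)
w_init = rng.normal(size=(10, 10))  # weights at initialization
# Hypothetical stand-in for the weights after training.
w_trained = w_init + rng.normal(scale=0.5, size=(10, 10))

# Prune 80% of the connections based on the trained magnitudes ...
mask = magnitude_mask(w_trained, keep_fraction=0.2)
# ... and reset the surviving connections to their original initial values.
ticket = w_init * mask
```

The resulting sparse network would then be retrained with the mask held fixed to check whether it matches the full network's performance.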

Max’s answer:
Result:

Beyond the lecture (optional)

These questions go beyond the SoSe 2026 lecture slides (textbook / external additions). Kept for depth, not exam-critical.

Q5 — Neural Networks

Question: What is the difference between BERT and GPT in terms of training and what they’re used for?

Answer

	BERT	GPT
Training objective	Masked Language Model (predict masked tokens)	Autoregressive (predict next token)
Context	Bidirectional (sees entire sequence)	Unidirectional (left-to-right only)
Primary use	Understanding (classification, NER, QA)	Generation (text completion, dialogue)
Pre-training	Fill in random masks in sentences	Predict each next word given all previous
Both are Transformer-based, both use transfer learning (pre-train on large corpus, fine-tune on specific task).
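The two training objectives can be contrasted by the training examples they produce from the same sentence. This is a sketch in plain Python with hypothetical function names; the masked position is chosen by hand here, whereas real masked language model pre-training selects positions at random.

```python
def next_token_pairs(tokens):
    """GPT-style: at each position, predict the next token from the left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, masked_positions, mask_token="[MASK]"):
    """BERT-style: hide the chosen positions and predict them from both sides."""
    inputs = [mask_token if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]

pairs = next_token_pairs(tokens)
# First pair: context ["the"], target "cat"

inputs, targets = masked_example(tokens, masked_positions={2})
# inputs:  ["the", "cat", "[MASK]", "on", "the", "mat"]
# targets: {2: "sat"}
```

In the masked example the model sees "on the mat" to the right of the gap, which is the bidirectional context listed in the table; the next-token pairs never expose anything to the right of the target.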

Max’s answer:
Result:

Score

Total: / 8
