Mega-Quiz: Methods of AI — All Topics (100 Qs)
SoSe 2026 · Exam Prep · mixed difficulty (🟢 easy · 🟡 medium · 🔴 hard)
Format: each question hides its answer under an Obsidian callout fold. Answer it yourself first, then expand.
Part 1 — Local Search (Q1–12)
Q3 — Local Search 🟡
Question: Write down the SA acceptance probability for a downhill move. What is Δ, what is T?
Answer
P(accept) = exp(Δ/T)withΔ = f(s') − f(s) < 0(negative change in value) andT= current temperature. High T → probability near 1 (lots of exploration); low T → probability near 0 (exploitation).
Max’s answer: exp(-delta/T) with delta = (future state) - (current state), T = Temparature
Result:
Q4 — Local Search 🟡
Question: Why is Local Beam Search with k states not the same as k parallel Hill Climbings?
Answer
In Beam Search the successors of all k states are generated jointly; then the best k are selected across the entire set. The states share information and concentrate on promising regions. With k parallel HCs, each runs independently.
Max’s answer: in LBS when the successors are generated, the global best 4 successors from all previous beams will be the future 4 beams. while in parallel hill climbing the successors only decide among the successors from the specific beam the future successors. so it only looks for local solutions, meaning parallel hill climbing might end up in a local minimum earlier than LBS.
Result:
Q5 — Local Search 🟡
Question: What is the idea behind Stochastic Beam Search and which problem does it solve?
Answer
Instead of deterministically picking the best k successors, k are drawn probabilistically in proportion to fitness. This solves the clustering problem of Local Beam Search: without stochasticity all k states converge to the same region (low diversity).
Max’s answer:
Result:
Q6 — Local Search 🟢
Question: What guarantee does Random-Restart Hill Climbing provide?
Answer
With n independent restarts:
P(success) → 1asn → ∞. Finitely many restarts give no guarantee, only a growing probability.
Max’s answer:
Result:
Q7 — Local Search 🔴
Question: Why does Simulated Annealing guarantee the global optimum in theory but not in practice?
Answer
The theoretical guarantee requires an infinitely slow cooling schedule (e.g. logarithmic). Practical schedules (linear/exponential) are far too fast — at low T, SA can get stuck in a local optimum because exp(Δ/T) → 0.
Max’s answer:
Result:
Q8 — Local Search 🟡
Question: Naive 1-point crossover fails on TSP. Why, and what is used instead?
Answer
Permutations require that each city appears exactly once. 1-point crossover produces duplicates and missing cities → invalid tour. Solution: Uniform-Order Crossover — a binary template selects positions from Parent A, the rest is filled in the order of Parent B.
Max’s answer:
Result:
Q9 — Local Search 🟡
Question: What is a memetic algorithm and why is it often better?
Answer
GA + local search: after each reproduction, Hill Climbing is applied to every offspring. Combines the global exploration of a GA with the local refinement of HC.
Max’s answer:
Result:
Q10 — Local Search 🔴
Question: Which properties must a GA’s fitness function satisfy for selection to work meaningfully?
Answer
It must (a) allow an ordering of solutions (higher = better) and (b) produce selection probabilities that correlate with the fitness value (e.g. roulette-wheel proportional to f). If all values are similar, selection pressure collapses — the population then drifts randomly.
Max’s answer:
Result:
Q11 — Local Search 🟡
Question: Neighbourhood-size trade-off: what happens with a small vs. a large neighbourhood?
Answer
Small: fast iterations, but many local optima (the algorithm gets stuck easily). Large: fewer local optima, but each step is computationally more expensive and the search becomes slow.
Max’s answer:
Result:
Q12 — Local Search 🔴
Question: On the 8-Queens problem: why is Stochastic Hill Climbing often more successful than steepest-ascent HC?
Answer
Steepest-ascent frequently gets stuck on plateaus where many neighbours have the same conflict count — and the “best” move is not unique. Stochastic HC picks a random improving neighbour → it explores the plateau and escapes more easily.
Max’s answer:
Result:
Part 2 — Constraint Satisfaction (Q13–22)
Q13 — CSP 🟢
Question: Which three components formally define a CSP?
Answer
The tuple
({X₁,…,Xₙ}, {D₁,…,Dₙ}, R): variables, domains, constraints (relations over variables). A solution = a complete, consistent assignment of all variables.
Max’s answer:
Result:
Q16 — CSP 🟡
Question: What is MRV (Minimum Remaining Values) for, and what is LCV (Least Constraining Value) for?
Answer
MRV selects the next variable (the one with the fewest legal values → fastest failure, “fail-fast”). LCV selects the value for a given variable (the one that removes the fewest options from neighbours → “fail-slow”). Complementary, no contradiction.
Max’s answer:
Result:
Q17 — CSP 🟢
Question: What is the degree heuristic and when is it used?
Answer
It selects the variable that appears in the most constraints with unassigned variables. Typically used as a tiebreaker for MRV when several variables have the same number of remaining values.
Max’s answer:
Result:
Q18 — CSP 🔴
Question: Difference between Arc Consistency and Path Consistency — why is AC not always enough?
Answer
AC: for every value in Xᵢ there exists a support in Xⱼ (2-consistent). Path consistency: every consistent assignment of (Xᵢ, Xⱼ) can be extended to every third variable Xₘ (3-consistent). AC rules out 2-variable violations but misses conflicts that only become visible with 3 variables.
Max’s answer:
Result:
Q19 — CSP 🟡
Question: Local Search for CSPs: what is the state space, the neighbourhood, and the objective function?
Answer
States = complete assignments (even violating ones). Neighbourhood = change one variable’s value. Objective: maximize the number of satisfied constraints. If a global optimum (= all constraints satisfied) is reached, the CSP is consistent.
Max’s answer:
Result:
Q20 — CSP 🔴
Question: Local Search finds a local optimum with f(s) < constraints. What can you conclude?
Answer
Nothing definitive. Maybe the CSP is inconsistent, maybe the search is just stuck in a local optimum. Local Search is not complete — it cannot prove inconsistency.
Max’s answer:
Result:
Q21 — CSP 🟢
Question: Name the 8 relations of RCC-8.
Answer
DC (disconnected), EC (externally connected), PO (partial overlap), TPP (tangential proper part), TPPi (inverse), NTPP (non-tangential proper part), NTPPi (inverse), EQ (equal). Based on the connection predicate C(x,y), purely topological (no metric).
Max’s answer:
Result:
Q22 — CSP 🔴
Question: Why is strong k-consistency stronger than plain k-consistency?
Answer
Strong k-consistency = j-consistent for all j ≤ k. Plain k-consistency only guarantees consistency at level k, not at the levels below — e.g. a 3-consistent CSP need not be 2-consistent.
Max’s answer:
Result:
Part 3 — Planning I (Classical) (Q23–32)
Q23 — Planning I 🟢
Question: Explain the frame, qualification, and ramification problem in one sentence each.
Answer
Frame: how do you efficiently represent what does not change after an action? Qualification: actions have potentially endless preconditions (battery dead? tyre flat?). Ramification: actions have unintended side effects (move the PSU → all items inside move too).
Max’s answer:
Result:
Q24 — Planning I 🟢
Question: What are the three lists of a STRIPS action?
Answer
PRE (preconditions, must be true before the action), ADD (atoms that become true afterwards), DEL (atoms that become false afterwards). Everything else stays unchanged (CWA convention).
Max’s answer:
Result:
Q25 — Planning I 🔴
Question: How does STRIPS “solve” the frame problem — and why is this solution syntactic, not ontological?
Answer
By convention STRIPS says: whatever is not in ADD or DEL stays unchanged. That is a syntactic rule — not a proof that it is so. The real (ontological) question “what stays the same in the universe?” is not answered but bypassed via the closed world + ADD/DEL convention.
Max’s answer:
Result:
Q27 — Planning I 🟢
Question: What does STRIPS assume about the world (which assumption)?
Answer
Closed-World Assumption (CWA): whatever is not in the state is false. Situation Calculus and DL/OWL do not do this — they use Open-World.
Max’s answer:
Result:
Q28 — Planning I 🟡
Question: What does a STRIPS action put(X, Y) describe with PRE/ADD/DEL for Blocksworld?
Answer
PRE:
{ontable(X), clear(X), clear(Y)}— X lies free on the table, Y is free. ADD:{on(X, Y)}. DEL:{ontable(X), clear(Y)}— X is no longer on the table, Y is no longer free.
Max’s answer:
Result:
Q29 — Planning I 🔴
Question: “Ignore-Delete” heuristic: what is the idea, and is it admissible?
Answer
Delete the DEL list from every action → actions can no longer undo anything. The resulting relaxed problem is easier (still NP-hard), and its solution length is a lower bound … but in practice only an approximation (e.g. greedy) is computed, and that approximation is not admissible → A* optimality is not guaranteed.
Max’s answer:
Result:
Q30 — Planning I 🟡
Question: How does PDDL differ syntactically from pure STRIPS, and which separation does it introduce?
Answer
LISP-like syntax with types, numeric fluents, conditional effects. It separates the domain (actions, types) from the problem (initial state, goal). Standard since IPC 1998.
Max’s answer:
Result:
Part 4 — Planning II (MDP) (Q33–42)
Q33 — Planning II 🟢
Question: Which four components make up the MDP tuple?
Answer
(S, A, p, r): states, actions, transition probability
p(s'|s,a), rewardr(s,a). ⚠️ γ (discount factor) is NOT part of the MDP tuple — it belongs to the objective function / the solver.
Max’s answer:
Result:
Q34 — Planning II 🟡
Question: State the Markov property.
Answer
P(Sₜ₊₁ | Sₜ, Aₜ, Sₜ₋₁, Aₜ₋₁, …) = P(Sₜ₊₁ | Sₜ, Aₜ). The future depends only on the current state and the current action — not on the history.
Max’s answer:
Result:
Q35 — Planning II 🟡
Question: What distinguishes a plan (classical) from a policy (MDP)?
Answer
A plan = an action sequence for one initial state. A policy π = a mapping of every state to an action. A policy thus covers the entire state space; a plan only one path.
Max’s answer:
Result:
Q36 — Planning II 🔴
Question: Write the Bellman equation for a fixed policy π.
Answer
V^π(s) = r(s, π(s)) + γ · Σ_{s'} p(s'|s, π(s)) · V^π(s'). Value of a state = immediate reward + discounted expected value of the successor state.
Max’s answer:
Result:
Q37 — Planning II 🔴
Question: Bellman optimality equation — how does it differ from policy evaluation?
Answer
V*(s) = max_a [r(s,a) + γ · Σ_{s'} p(s'|s,a) · V*(s')]. Instead of fixing the action via a given policy, it maximizes over all possible actions — directly the optimal value function.
Max’s answer:
Result:
Q38 — Planning II 🟡
Question: Describe Policy Iteration in two steps.
Answer
(1) Policy Evaluation: compute V^π for the current policy iteratively via the Bellman equation. (2) Policy Improvement: greedy update
π'(s) = argmax_a [r(s,a) + γ Σ p V^π(s')]. If π’ = π → optimal, otherwise back to (1).
Max’s answer:
Result:
Q39 — Planning II 🔴
Question: What is the main difference between Value Iteration and Policy Iteration?
Answer
Value Iteration: applies the Bellman optimality equation directly and iteratively — no explicit policy until the end. Policy Iteration: alternates a full evaluation of a policy + improvement. PI needs fewer iterations (typically <10), but each iteration is more expensive (full evaluation).
Max’s answer:
Result:
Q40 — Planning II 🟡
Question: Why does γ = 1 cause problems in non-terminating MDPs?
Answer
The series
r₀ + γr₁ + γ²r₂ + …diverges if not all rewards become zero. Values go to infinity → the Bellman update does not converge. Fix: γ < 1 or guaranteed terminal states.
Max’s answer:
Result:
Q41 — Planning II 🟢
Question: Which property must p(s'|s,a) satisfy?
Answer
It is a probability distribution:
Σ_{s'} p(s'|s,a) = 1for all s, a. Non-negative values, summing to 1.
Max’s answer:
Result:
Q42 — Planning II 🔴
Question: In the grid-world example: why is a step reward of e.g. −0.04 useful, and what changes at +0.04?
Answer
A negative step reward forces the optimal policy to reach the goal quickly (every step costs something). +0.04 would make the optimal policy never reach the goal — wandering around forever would be more lucrative than the final reward. The step cost is thus an implicit time-pressure signal.
Max’s answer:
Result:
Part 5 — Knowledge Representation (Q43–54)
Q43 — KR 🟢
Question: What goes in the TBox, what goes in the ABox?
Answer
TBox (terminological): general statements about classes —
C ⊑ D,C ≡ D(general concept inclusions). Analogous to: schema. ABox (assertion): facts about individuals —a : C,(a,b) : R. Analogous to: database content.
Max’s answer:
Result:
Q44 — KR 🟡
Question: What does C ⊑ D mean semantically — and in which direction?
Answer
“Every C is also a D” — C is a subset of D in the interpretation. ⚠️ Classic trap: NOT “D is a subset of C”. The arrow points from the more specific to the more general class.
Max’s answer:
Result:
Q45 — KR 🔴
Question: What are the constructors of ALC and what does each mean?
Answer
Atomic concept (A), ⊤ (everything), ⊥ (nothing), ¬C (negation), C ⊓ D (intersection), C ⊔ D (union), ∀R.C (value restriction — all R-relations lead to C), ∃R.C (existential — at least one R-relation leads to C). “ALC” = Attributive Language with Complement.
Max’s answer:
Result:
Q46 — KR 🟡
Question: Open World Assumption vs. Closed World Assumption — who uses which?
Answer
OWA: what cannot be derived is unknown — DL, OWL, Situation Calculus. CWA: what cannot be derived is false — STRIPS, classical DBs, Prolog (by default).
Max’s answer:
Result:
Q47 — KR 🔴
Question: How do you test C ⊑ D with the tableaux algorithm?
Answer
Reduce it to satisfiability:
C ⊑ Dholds iff C ⊓ ¬D is unsatisfiable. Tableaux tries to construct a model for C ⊓ ¬D — if it cannot (all branches close with a contradiction), the subsumption is proven.
Max’s answer:
Result:
Q48 — KR 🔴
Question: Which “vacuous truth” trap hides in ∀R.C?
Answer
∀R.Cis trivially true for any individual that has no R-relations. Example: “All children of x are students” is true if x has no children — even if it feels semantically odd. ∃R.C, by contrast, requires at least one R-filler.
Max’s answer:
Result:
Q49 — KR 🟢
Question: Which typical reasoning tasks exist in a DL knowledge base?
Answer
TBox reasoning: subsumption (C ⊑ D?), satisfiability of a concept, concept equivalence. ABox reasoning: instance checking (is a an instance of C?), KB consistency (is there a model?), retrieval (which individuals are instances of C?).
Max’s answer:
Result:
Q51 — KR 🟡
Question: How many Allen relations are there, and why not 14?
Answer
13 — 7 base relations (BEFORE, MEETS, OVERLAPS, STARTS, FINISHES, DURING, EQUAL) × 2 directions, but EQUAL is symmetric and counts only once → 6×2 + 1 = 13.
Max’s answer:
Result:
Q52 — KR 🔴
Question: Allen’s composition table — what does it state and why is it important?
Answer
For given relations
R(I₁, I₂)andR(I₂, I₃): which relations are possible forR(I₁, I₃)? It infers not a single relation but a set of possible relations. Used for constraint propagation in temporal CSPs — analogous to AC-3.
Max’s answer:
Result:
Q53 — KR 🟡
Question: What distinguishes RCC-8 from metric geometry?
Answer
RCC-8 is purely topological — based only on the connection predicate C(x,y). No distances, no sizes, no coordinates. Suited for qualitative spatial reasoning without metric data.
Max’s answer:
Result:
Q54 — KR 🔴
Question: Why does ALC have no role hierarchy, and what do you need for one?
Answer
ALC does not allow statements like
R₁ ⊑ R₂(R₁ is a special R₂ role). For that you need extensions — e.g. SH (with role hierarchy and transitive roles) or higher (SROIQ, the basis of OWL 2).
Max’s answer:
Result:
Part 6 — Vagueness & Uncertainty (Q55–64)
Q55 — Fuzzy 🟢
Question: What is the core difference between vagueness and uncertainty?
Answer
Vagueness = partial truth, the predicate itself is fuzzy (“the ball is reddish”). Uncertainty = the truth is sharp, but we don’t know it (“it will probably rain tomorrow”). Fuzzy logic addresses vagueness, probability theory addresses uncertainty.
Max’s answer:
Result:
Q56 — Fuzzy 🟢
Question: How are intersection, union, and complement defined for fuzzy sets (Zadeh standard)?
Answer
μ_{A∩B}(x) = min(μ_A(x), μ_B(x)),μ_{A∪B}(x) = max(μ_A(x), μ_B(x)),μ_{A^C}(x) = 1 − μ_A(x).
Max’s answer:
Result:
Q57 — Fuzzy 🔴
Question: Show that in fuzzy logic A ∨ ¬A ≠ 1 is possible. Give a concrete example.
Answer
Let
μ_A(x) = 0.6. Thenμ_{A^C}(x) = 1 − 0.6 = 0.4. Union:max(0.6, 0.4) = 0.6 ≠ 1. The law of excluded middle does not hold in fuzzy logic — and it is not even desired, because vague predicates have genuine transition zones.
Max’s answer:
Result:
Q58 — Fuzzy 🟡
Question: Which four properties define a t-norm?
Answer
(1) Neutral element 1 (t(x,1)=x), (2) commutative, (3) associative, (4) monotonically increasing (x≤x’, y≤y’ → t(x,y) ≤ t(x’,y’)). Generalizes AND. An s-norm is analogous with neutral element 0 and generalizes OR.
Max’s answer:
Result:
Q59 — Fuzzy 🟡
Question: Besides the Zadeh min, also name the algebraic t-norm and s-norm.
Answer
Algebraic:
t(x,y) = x · y(product),s(x,y) = x + y − x·y. Both satisfy the t-/s-norm axioms but are not identical to min/max. Example: t(0.5, 0.5) = 0.25 (algebraic) vs. 0.5 (min).
Max’s answer:
Result:
Q60 — Fuzzy 🔴
Question: Why does probabilistic logic keep classical properties like P(A ∨ ¬A) = 1, but fuzzy logic does not?
Answer
Probabilistic logic assigns probabilities to classical statements (which are true or false) — tautologies remain tautologies (P=1), contradictions remain false (P=0). Fuzzy logic changes the statements themselves — they are “half true” — so the tertium non datur breaks.
Max’s answer:
Result:
Q61 — Uncertainty 🟢
Question: Write down the Kolmogorov axioms.
Answer
(1) Non-negativity: P(A) ≥ 0. (2) Normalization: P(Ω) = 1. (3) Additivity for disjoint events: P(A ∪ B) = P(A) + P(B) when A ∩ B = ∅.
Max’s answer:
Result:
Q62 — Uncertainty 🟡
Question: Conditional probability — formula and condition.
Answer
P(B|A) = P(B ∩ A) / P(A), defined only forP(A) > 0. For conditional knowledge bases:(G|F)[p]is satisfied iff P(G|F) = p, hence implicitly P(F) > 0.
Max’s answer:
Result:
Q64 — Uncertainty 🔴
Question: The Heap (Sorites) paradox — why is it an argument for fuzzy logic?
Answer
Classically: “A heap of sand stays a heap if you remove 1 grain.” By induction → even 0 grains are a heap. Contradiction. Fuzzy: “being a heap” has a gradual transition —
μ_Heap(n)slowly decreases from 1 to 0. No contradiction, because no binary threshold is forced.
Max’s answer:
Result:
Part 7 — Machine Learning I & II (Q65–76)
Q65 — ML 🟢
Question: Define ML formally following Mitchell.
Answer
A program learns from experience E with respect to a task T and a performance measure P if its performance P on T improves with increasing E.
Max’s answer:
Result:
Q66 — ML 🟢
Question: What is inductive bias and why does every learner need one?
Answer
Inductive bias = the assumptions a learner makes in order to generalize (e.g. “the solution is a linear function”). Without a bias there is no reason to prefer one hypothesis over another that fits the same training data → no generalization possible (No-Free-Lunch).
Max’s answer:
Result:
Q68 — ML 🟡
Question: Entropy of a set S — formula and extreme values.
Answer
E(S) = − Σᵢ pᵢ · log₂(pᵢ). E = 0 when all examples have the same class (pure). E = log₂(c) (maximum) when c classes are uniformly distributed.
Max’s answer:
Result:
Q69 — ML 🔴
Question: Information Gain — formula and what it measures.
Answer
Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · E(S_v). The reduction in entropy from splitting on attribute A. ID3 greedily chooses the A with the highest gain.
Max’s answer:
Result:
Q71 — ML 🟡
Question: What is bagging and why does it reduce variance but not bias?
Answer
Bagging = Bootstrap Aggregation: train several models on randomly drawn samples with replacement, average the predictions. It reduces variance because averaging cancels out uncorrelated errors. It does not reduce bias, because all models share the same model class — the systematic error remains.
Max’s answer:
Result:
Q72 — ML 🔴
Question: How does a Random Forest differ from plain bagging?
Answer
Bagging: each tree uses all features at every split. Random Forest: additionally, at each split only a random subset of features is considered → greater diversity between trees → even stronger variance reduction without loss of bias.
Max’s answer:
Result:
Q74 — ML 🟢
Question: What are the three main categories of machine learning?
Answer
Supervised (labelled data → classification/regression), unsupervised (unlabelled data → clustering, dimensionality reduction), reinforcement (trial-and-error with a reward signal).
Max’s answer:
Result:
Q76 — ML 🔴
Question: Why does k-means find only local optima, and what would the global optimum be?
Answer
k-means optimizes the sum of squared distances to the centroids with a greedy alternating scheme (assignment + centroid update). This is NP-hard in general — Lloyd’s algorithm only guarantees convergence to a local minimum, depending on the initialization. In practice: several runs with different seeds, choose the best result (or k-means++ for better initialization).
Max’s answer:
Result:
Part 8 — SVM & Perceptron (Q77–88)
Q77 — SVM 🟢
Question: What is the central idea of SVMs?
Answer
Find the hyperplane that maximizes the margin (distance to the nearest data points of both classes). Only the support vectors (points on the margin) determine the solution.
Max’s answer:
Result:
Q78 — SVM 🟡
Question: How large is the margin as a function of w, and which optimization does SVM solve?
Answer
Margin =
2/‖w‖. SVM minimizesτ(w) = ½ · ‖w‖²subject to the constraintyᵢ · (⟨xᵢ, w⟩ + b) ≥ 1for all training points. A convex quadratic optimization problem.
Max’s answer:
Result:
Q79 — SVM 🔴
Question: Write down the SVM Lagrangian. Mind the sign.
Answer
L(w, b, α) = ½‖w‖² − Σᵢ αᵢ · (yᵢ · (⟨xᵢ, w⟩ + b) − 1)withαᵢ ≥ 0. ⚠️ Sign: minus, not plus — we want to subtract the constraint penalty. Result:w = Σᵢ αᵢ yᵢ xᵢ.
Max’s answer:
Result:
Q80 — SVM 🔴
Question: Which KKT condition tells you which points are support vectors?
Answer
Complementary slackness:
αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1) = 0. So: eitherαᵢ = 0(point is not an SV, contributes nothing to w) oryᵢ(⟨xᵢ,w⟩+b) = 1(point lies exactly on the margin → is an SV).
Max’s answer:
Result:
Q81 — SVM 🟡
Question: Explain the kernel trick in one sentence.
Answer
Replace the inner product
⟨x, x'⟩with a kernel functionK(x, x') = ⟨φ(x), φ(x')⟩that directly computes the inner product in the (higher-dimensional) feature space without applying φ explicitly. Makes high-dimensional (even infinite-dimensional, e.g. RBF) spaces tractable.
Max’s answer:
Result:
Q82 — SVM 🟡
Question: Write down the RBF/Gaussian kernel formula — both common notations.
Answer
γ-notation:
K(x,y) = exp(−γ ‖x−y‖²). σ-notation (in the slides):K(x,y) = exp(−‖x−y‖² / 2σ²). Equivalent withγ = 1/(2σ²).
Max’s answer:
Result:
Q85 — Perceptron 🟢
Question: Write down the perceptron training rule.
Answer
Δw = ε · Σₙ (tₙ − y(xₙ)) · xₙ. ε = learning rate, tₙ = target, y(xₙ) = prediction. In online mode per example; in batch mode over all examples.
Max’s answer:
Result:
Q86 — Perceptron 🟡
Question: What does the perceptron convergence theorem state and which precondition does it require?
Answer
If the data are linearly separable and ε is small enough, the perceptron training rule converges in finitely many steps to a separating hyperplane. For non-linearly-separable data it never converges.
Max’s answer:
Result:
Q87 — Perceptron 🔴
Question: Why can a single-layer perceptron not learn XOR, and what historical consequence did this have?
Answer
XOR is not linearly separable — no single hyperplane separates {(0,0),(1,1)} from {(0,1),(1,0)}. Minsky & Papert (1969) demonstrated this. Consequence: the first AI winter for neural networks, since people wrongly assumed the limitation applied to all NNs. MLPs with a hidden layer solve XOR.
Max’s answer:
Result:
Q88 — SVM vs. Perceptron 🟡
Question: Which fundamental training methods do SVM vs. perceptron rely on?
Answer
Perceptron: gradient descent on an error function (typically MSE), iterative. SVM: constrained quadratic optimization (Lagrange + KKT), solved globally. Completely different classes of optimization methods.
Max’s answer:
Result:
Part 9 — Neural Networks & Deep Learning (Q89–100)
Q89 — Hopfield 🟡
Question: Update rule and energy function of a Hopfield network — write down both.
Answer
Update:
yᵢ^(t+1) = sign(Σⱼ≠ᵢ wᵢⱼ · yⱼ + bᵢ). Energy:E(y) = −Σ_{ij} wᵢⱼ yᵢ yⱼ − Σᵢ bᵢ yᵢ. Under sequential updates the energy cannot increase → convergence to a local minimum.
Max’s answer:
Result:
Q90 — Hopfield 🔴
Question: What are spurious memories and how do they arise?
Answer
Additional stable states in the network that are not among the stored patterns. They arise from the superposition of the Hebbian weights: certain mixed states also become local energy minima. They reduce the effective capacity — hence the rule of thumb ~0.138·N instead of the theoretically possible N.
Max’s answer:
Result:
Q91 — MLP 🟢
Question: What is ReLU and why is it used today instead of sigmoid?
Answer
ReLU(x) = max(0, x). Advantage: for x > 0 the derivative is 1 — no saturation, no vanishing gradient as with sigmoid (whose derivative is near 0 for large |x|). Enables the training of deep networks.
Max’s answer:
Result:
Q92 — MLP 🔴
Question: What does the Universal Approximation Theorem state and what is its practical limitation?
Answer
An MLP with one hidden layer (enough neurons, non-linear activation) can approximate any continuous function on a compact domain arbitrarily well. In practice: the required width is often exponential — deeper networks approximate more efficiently with polynomially fewer parameters.
Max’s answer:
Result:
Q93 — Deep Learning 🟡
Question: Name three standard solutions against overfitting in deep networks.
Answer
(1) Dropout — randomly deactivating neurons during training (trains an ensemble of subnetworks). (2) Weight decay (L2) — add
λ · Σ‖w‖²to the loss, penalizes large weights. (3) Early stopping — stop training as soon as validation loss rises.
Max’s answer:
Result:
Q94 — Deep Learning 🔴
Question: How do residual connections help in training very deep networks?
Answer
A residual block computes
y = x + F(x). The gradient ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x) — the “1” gives a direct shortcut for the gradient through the identity. Prevents vanishing gradients and enables networks with >100 layers (ResNet).
Max’s answer:
Result:
Q95 — Autoencoder 🟢
Question: What is an autoencoder and what is it used for?
Answer
Encoder → bottleneck (code) → decoder. Trained to reconstruct the input. The bottleneck forces a compressed representation. Applications: dimensionality reduction, anomaly detection, generative models (VAE).
Max’s answer:
Result:
Q96 — Transformer 🟡
Question: Write down the scaled dot-product attention formula and name Q, K, V.
Answer
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Q (query) = what we are looking for, K (key) = what is available, V (value) = the actual content. Q·Kᵀ yields similarity scores, softmax turns them into weights, the weighted average is taken over V.
Max’s answer:
Result:
Q97 — Transformer 🔴
Question: Why do you divide by √d_k in the attention formula?
Answer
For high
d_kthe dot productsq·kbecome large (variance grows linearly with d_k). Large values in softmax lead to very sharp distributions → most gradients go to 0 → vanishing gradient. Scaling by √d_k keeps the variance stable and the softmax in a trainable regime.
Max’s answer:
Result:
Q99 — Transformer 🟡
Question: BERT vs. GPT — training, directionality, typical application.
Answer
BERT: masked language model, bidirectional (sees the whole context), for understanding/classification. GPT: autoregressive (next-token prediction), unidirectional (left → right only, masked attention), for generation. Both are transformers, but with different training objectives.
Max’s answer:
Result:
Beyond the lecture (optional)
These questions go beyond the SoSe 2026 lecture slides (textbook / external additions). Kept for depth, not exam-critical.
Q1 — Local Search 🟢
Question: What distinguishes Local Search from classical search (BFS/DFS/A*)?
Answer
Local Search keeps only the current state in memory — no frontier, no path history. The goal is a good state, not the path to it. Classical search needs the path as its solution. ⚠️ BFS/DFS/A* themselves are not taught in Session 02 — they appear here only as the contrast; the testable point is the local-search characterization.
Max’s answer:
Result:
Q2 — Local Search 🟢
Question: Name the three main problems that pure Hill Climbing fails on.
Answer
Local maxima, plateaus, and ridges. At local maxima there is no better neighbour; on plateaus there is no gradient information; ridges require movement in two dimensions that no single neighbour step improves on its own.
Max’s answer:
Result:
Q14 — CSP 🟡
Question: What does AC-3 do at each step — and what happens when a domain becomes empty?
Answer
Per arc (Xᵢ, Xⱼ): remove from Dᵢ all values without support in Dⱼ. If Dᵢ shrinks, push all incoming arcs back into the queue. An empty domain → CSP inconsistent, abort.
Max’s answer:
Result:
Q15 — CSP 🔴
Question: What is the time complexity of AC-3 and what do the variables mean?
Answer
O(c · d³)withc= number of arcs (constraints) andd= maximum domain size. Intuition: up to d removals per arc, each removal can reactivate d−1 arcs, each reactivation costs O(d²) for the consistency check.
Max’s answer:
Result:
Q26 — Planning I 🟡
Question: How many frame axioms does Situation Calculus need in the worst case with n actions and m facts?
Answer
O(n · m)— for each action × each fact, one axiom stating “fact stays the same after the action (unless explicitly changed)“. This is the core of the frame problem in the logical formalization.
Max’s answer:
Result:
Q31 — Planning I 🔴
Question: Why is planning in general PSPACE-complete and not merely NP-complete?
Answer
Plan lengths can be exponential in the state-space size, and the state space itself is exponential in the atoms. Verifying a compactly represented plan may take more time than NP allows, but it gets by with polynomial memory (you only need to hold the current state) → PSPACE.
Max’s answer:
Result:
Q32 — Planning I 🟡
Question: State-space search vs. plan-space search in planning — what is the difference?
Answer
State-space: nodes are world states, edges are actions. Search from init to goal. Plan-space: nodes are partial plans, edges are plan refinements (add an action, add an ordering constraint). Plan-space allows non-linear / partial-order planning.
Max’s answer:
Result:
Q50 — KR 🟡
Question: What are the three OWL dialects and how do they differ?
Answer
OWL Lite: simple hierarchies + constraints, low DL expressivity, easy to reason over. OWL DL: full description logic (SHOIN(D)), reasoning decidable. OWL Full: fully RDF-compatible, but undecidable.
Max’s answer:
Result:
Q63 — Uncertainty 🔴
Question: Dempster-Shafer: what is the difference between belief Bel(A) and plausibility Pl(A)?
Answer
Bel(A) = sum of the mass of all subsets B ⊆ A → what we can believe about A at minimum. Pl(A) = 1 − Bel(¬A) = sum of the mass of all B with B ∩ A ≠ ∅ → the maximum possible confidence in A. The interval [Bel(A), Pl(A)] quantifies ignorance: wide = little evidence, narrow = a clear picture.
Max’s answer:
Result:
Q67 — ML 🔴
Question: Bias-Variance decomposition: write down the decomposition of the expected squared error.
Answer
E[(y − ŷ)²] = Bias[ŷ]² + Var[ŷ] + σ²withσ²= irreducible noise term. Bias = systematic error from overly simple models; variance = sensitivity to fluctuations in the training data; σ² = inherent data noise, not reducible.
Max’s answer:
Result:
Q70 — ML 🔴
Question: Why does pure Information Gain favour attributes with many values — and what is used against it?
Answer
An attribute like “ID” has a unique value for every example → perfect split → maximal gain, but zero generalization (overfitting in the extreme). Remedy: Gain Ratio = Gain / SplitInfo, which penalizes high cardinality. Used in C4.5 instead of ID3.
Max’s answer:
Result:
Q73 — ML 🟡
Question: Out-of-Bag (OOB) error — what is it and why is it “free”?
Answer
In bootstrap sampling, ~37% of the training data are not used in a given tree. These OOB samples can be used as an unused test set for exactly that tree. Averaged over all trees: a solid generalization estimate without a separate validation set.
Max’s answer:
Result:
Q75 — ML 🟡
Question: k-fold cross-validation: procedure and main purpose.
Answer
Split the data into k equally sized folds. For each iteration: train on k−1 folds, test on the remaining one, average the k test errors. Purpose: generalization estimate with lower variance than a single train/test split, and hyperparameter selection.
Max’s answer:
Result:
Q83 — SVM 🔴
Question: Soft-margin SVM: what do the slack variables ξᵢ do and how does the C parameter act?
Answer
ξᵢ ≥ 0 allows violations of the margin constraint:
yᵢ(⟨xᵢ,w⟩+b) ≥ 1 − ξᵢ. The optimization becomes½‖w‖² + C · Σ ξᵢ. Large C: violations are expensive → small margin, possible overfitting. Small C: violations are cheap → large margin, more regularization. C=∞ → hard-margin (no violations); C=0 → ignore all data.
Max’s answer:
Result:
Q84 — SVM 🔴
Question: Mercer’s condition — when is a function a valid kernel?
Answer
K is a valid kernel iff the Gram matrix
K_ij = K(xᵢ, xⱼ)is positive semi-definite and symmetric for every finite set of points. This guarantees the existence of a feature space with a mapping φ such that K(x,y) = ⟨φ(x), φ(y)⟩.
Max’s answer:
Result:
Q98 — Transformer 🔴
Question: Why do transformers need positional encoding, and in which form is it classically added?
Answer
Self-attention is permutation-invariant — without explicit positional information the model could not recognize word order. Classically (Vaswani et al. 2017): sinusoidal encoding,
PE(pos, 2i) = sin(pos / 10000^(2i/d)),PE(pos, 2i+1) = cos(pos / 10000^(2i/d))— added to the input embeddings.
Max’s answer:
Result:
Q100 — Synthesis 🔴
Question: Ramsauer et al. (2020) showed a connection between self-attention and modern Hopfield networks. What is the core idea?
Answer
Modern Hopfield networks (with continuous states and an exponential energy function) have exponential storage capacity instead of the ~0.14N of old Hopfield networks. Their update rule is mathematically equivalent to scaled dot-product attention: the query “retrieves” the nearest stored pattern (key/value). Thus attention in the transformer is a modern form of associative memory — a bridge between classical NNs (Hopfield) and transformers.
Max’s answer:
Result:
Topic Index (with score)
| Topic | Qs | 🟢 | 🟡 | 🔴 |
|---|---|---|---|---|
| Local Search | 1–12 | 3 | 6 | 3 |
| CSP | 13–22 | 2 | 4 | 4 |
| Planning I | 23–32 | 3 | 4 | 3 |
| Planning II | 33–42 | 2 | 5 | 3 |
| KR | 43–54 | 2 | 5 | 5 |
| Vagueness & Uncertainty | 55–64 | 2 | 4 | 4 |
| ML I & II | 65–76 | 2 | 4 | 6 |
| SVM & Perceptron | 77–88 | 2 | 5 | 5 |
| NN & Deep Learning | 89–100 | 2 | 4 | 6 |
Total: 100 questions — 20 🟢 easy · 41 🟡 medium · 39 🔴 hard
Note: 17 questions have been moved to “Beyond the lecture (optional)” — the counts above reflect the original 100.
See also
Tags: methods-of-ai quiz ai-generated
Superlink: Methods of AI Lecture
Lernzettel: lernzettel_local-search_30-04-26 · lernzettel_csp_30-04-26 · lernzettel_planning-i_30-04-26 · lernzettel_planning-ii_30-04-26 · lernzettel_knowledge-representation_30-04-26 · lernzettel_vagueness-uncertainty_30-04-26 · lernzettel_ml-i-ii_30-04-26 · lernzettel_svm_30-04-26 · lernzettel_neural-networks-deep-learning_30-04-26
Created: 23/05/26