ai-generated methods-of-ai exam-prep
Quiz on the three chosen focus topics for the oral exam (1 Jun 26).
Cluster reported to Kühnberger: (Local) Search · CSPs · Machine Learning.
39 questions, each tagged by difficulty. ML weighted heavily toward Random Forest & SVM — Max’s weak spots.
Answers are hidden in collapsible callouts. Fill in your answer + result during the session.
Difficulty legend
- 🟢 Moderate — definitions, recall, single-concept “what is X / name the steps.” Warm-ups; you should answer these fast.
- 🔴 Hard — “why does X fail when Y”, compare two methods, derive a formula, apply to an example. This is where the oral exam lives.
- ⚠️ = known exam trap (easy to confuse / sign errors / direction confusions), independent of difficulty.
Fact-check note
Q1–Q15 were verified 27/05/26 against the SoSe 2026 lecture slides — every claim is correct AI/ML, with 📋 Slide scope tags marking slide-grounded vs. supplementary (R&N textbook) material. Q16–Q39 (added 28/05/26) are correct AI/ML but not yet individually slide-verified; treat their scope notes as “standard course/textbook material” and double-check exact slide wording before relying on a specific phrasing in the oral.
Key Q1–Q15 caveats: AC-3 / O(c·d³) (slides show AC-1 only), soft-margin C & slack ξ (slides cover hard-margin only), OOB error (slides give the 63.2 % fact but not the term), and the bias–variance variance formula (textbook).
Companion
Local Search
Q1 — Local Search · 🔴 Hard
Question: Hill Climbing gets stuck in local maxima. Compare Random-Restart Hill Climbing and Simulated Annealing in their escape strategy — how exactly does each get out of a local maximum, and when is which one superior?
Answer
Both attack the same problem (local optima), but differently:
- Random-Restart HC does not escape a maximum at all — it gives up and restarts from a random point. Over many restarts, the probability of hitting the global optimum at least once becomes high. Its “escape strategy” = starting over, not going downhill.
- Simulated Annealing stays on one path but accepts worse neighbours with probability exp(ΔE/T), so it can climb out of a maximum. As the temperature T falls, this becomes less and less likely → shift from exploration to intensification.
- When which: Random-Restart wins when there are many shallow, similarly-good local optima (each restart has a decent chance). SA wins on a rugged landscape with deep valleys, where you have to work your way out in a controlled way rather than blindly re-rolling.
Max’s answer: RRHC runs often i parallel and by chance it might find the global maximum, because it always starts at a different starting point, SA escapes by accepting worse loss results at the beginning and later the probability of accepting worse outcomes lowers by a cooling schedule which lowers the temperature - which determined how probabilistic a worse acceptance is.
Result:
Q2 — Local Search · 🔴 Hard ⚠️
Question: Write down the acceptance probability for a worse neighbour in Simulated Annealing and explain each variable. What happens in the limits T→∞ and T→0? (Exam trap: the sign of ΔE.)
Answer
P(accept) = exp(ΔE / T) for a worse move (in a maximization problem ΔE = f(neighbor) − f(current) is negative, so P < 1). A better move is always accepted (P = 1).
- ΔE: value difference neighbour − current state. The worse the neighbour, the more negative ΔE, the smaller P.
- T: temperature, controlled by the cooling schedule
schedule(t).- T→∞: exp(ΔE/∞) → exp(0) = 1 → every move is accepted = pure random walk (maximal exploration).
- T→0: exp(ΔE/0⁻) → exp(−∞) = 0 → worse moves are essentially never accepted = pure hill climbing (intensification).
⚠️ Trap: some write exp(−ΔE/T) with ΔE as a cost increase (minimization). In the MoAI slides ΔE is the value difference and the formula is exp(ΔE/T) — the sign is already inside ΔE.
📋 Slide scope: slide 30 confirms exp(ΔE/T) and the limits exactly; the slides describe the low-T regime as becoming like First-choice Hill-Climbing (not literally “plain” HC) — worth saying precisely.
Max’s answer: exp(deltaE/T): cooling schedule reduces temperature T, the lower T, the more only a better neighbour is chosen.
Result:
Q3 — Local Search · 🔴 Hard ⚠️
Question: What is the decisive difference between Local Beam Search with k states and k parallel (random-restart) Hill Climbings? Why is it more than just “k times the same thing”?
Answer
- k parallel Hill Climbing: k independent searches. Each keeps the best neighbour of its own state. No communication.
- Local Beam Search: generates the neighbours of all k states, throws them into one pool, and picks the globally best k for the next round.
- Decisive point: information flows between the paths. If one state has many promising neighbours, it “pulls” computational capacity away from bad paths — good regions get preferentially pursued. Parallel HC cannot do this; a path stuck in a bad region keeps wasting resources.
⚠️ Weakness of Beam Search: all k states can collapse into the same region (loss of diversity) → Stochastic Beam Search picks k neighbours probabilistically (selection probability increasing with value) to preserve diversity.
📋 Slide scope: matches slides 34–38 (Local Beam Search vs. Parallel Hill-Climbing vs. Stochastic Beam Search) exactly.
Max’s answer: global vs local
Result:
Q4 — Local Search · 🔴 Hard
Question: Given two parent chromosomes A = 1 1 0 1 0 and B = 0 0 1 0 1. Perform single-point crossover at the point after position 2, then a mutation at position 4 of the first child. And: how does fitness-proportionate selection work?
Answer
- Crossover after position 2 (split:
11 | 010and00 | 101): swap the tails →
- Child 1 =
1 1 | 1 0 1=1 1 1 0 1- Child 2 =
0 0 | 0 1 0=0 0 0 1 0- Mutation at position 4 of child 1 (
11101→ flip bit 4):1 1 1 1 1.- Fitness-proportionate selection: a chromosome’s selection probability rises with its fitness — P(i) = fitness(i) / Σⱼ fitness(j) (roulette wheel). Fitter individuals are chosen as parents more often, but weaker ones keep a residual chance (preserves diversity).
📋 Slide scope: 1-point crossover, random-gene mutation, and fitness-proportionate selection are all on slides 46–48. (For permutation problems like TSP, naive crossover yields invalid solutions → slides give uniform-order crossover as the fix, slide 56.)
Max’s answer: child C1: 1 1 1 1 1
Result:
Q5 — Local Search · 🟢 Moderate
Question: How does local search differ from classical/tree search (BFS, A*)? What is the “complete-state formulation”, and why is local search well-suited to optimization problems like n-queens?
Answer
- Tree/classical search keeps a frontier of partial paths from the start and cares about the path to the goal; memory grows with the explored tree.
- Local search keeps only the current (complete) state, moves to a neighbour, and discards where it came from → constant memory, no path recorded. It cares only about the goal state’s quality, not how you got there.
- Complete-state formulation: every state is a full candidate solution (e.g. all 8 queens already on the board, one per column), and moves modify it (move a queen). Contrast with incremental formulation (place queens one at a time).
- Why suited to optimization: for problems where the path is irrelevant and only the final configuration matters (n-queens, TSP tour, scheduling), local search explores a huge space cheaply and is an anytime algorithm (always has a current best).
Max’s answer: BFS and A* are not local searches by default - they are global ones assessing the entire tree
Result:
Q6 — Local Search · 🟢 Moderate
Question: Name and explain the three classic ways Hill Climbing gets stuck.
Answer
- Local maximum — a peak lower than the global maximum; every neighbour is worse, so HC halts there.
- Plateau / shoulder — a flat region where neighbours are equal. A shoulder has an uphill exit (escapable with sideways moves); a flat local maximum has none → HC wanders or stops.
- Ridge — a sequence of local maxima ascending diagonally; single-axis moves all look downhill even though the ridge rises, so HC oscillates and can’t follow it (see Q8).
All three follow from HC’s greed: it only ever moves to a strictly better immediate neighbour and has no memory.
Max’s answer: plateaus, no better neighbour, local minimum
Result:
Q7 — Local Search · 🟢 Moderate
Question: Distinguish steepest-ascent, first-choice, and stochastic Hill Climbing.
Answer
- Steepest-ascent (basic) HC: evaluate all neighbours, move to the best one. Expensive if the neighbourhood is large.
- First-choice HC: generate neighbours at random and take the first that is better than the current state. Good when a state has many neighbours (no need to enumerate all).
- Stochastic HC: choose at random among the uphill moves, with probability often weighted by steepness. Slower convergence but can find better solutions by not always being greedy.
Max’s answer: steep: all neighbourhood → picking the best, firt-choice: all neighbourhood → picking the first one that improves, stochastic → chooses among better neighbours by chance
Result:
Q8 — Local Search · 🔴 Hard
Question: Why does a ridge defeat Hill Climbing even though the ridge itself ascends? What kind of move would fix it?
Answer
A ridge is an ascending surface whose top runs diagonally to the search’s move directions. HC’s neighbourhood typically offers only axis-aligned single steps. From any point on the ridge, each individual axis-step steps slightly off the ridge and therefore downhill, even though moving along the ridge (a combination of axis-steps) would go uphill. So HC sees only worse neighbours and stops or zig-zags slowly.
Fix: enrich the move set — allow compound/diagonal moves or a larger neighbourhood, or use methods that tolerate temporary downhill steps (Simulated Annealing) so the search can ride along the ridge.
Max’s answer: the algorithm never leaves the ridge and therefore cannot reach the global minimum, but keeps on exploring the ridge
Result:
Q9 — Local Search · 🔴 Hard
Question: In Genetic Algorithms, why is crossover powerful but also potentially destructive? When does a GA genuinely beat a mutation-only (hill-climbing-like) search?
Answer
- Powerful (building-block idea): crossover recombines good partial solutions (“schemata”/building blocks) from two fit parents into one child. If two parents each solved a different sub-part of the problem, one crossover can combine both — a jump no single mutation could make.
- Destructive: crossover can just as easily split a co-adapted group of genes that only works together (high epistasis / low locality), producing children worse than both parents. The cut point is blind to which genes belong together.
- When GA > mutation-only: when the problem decomposes into independent, recombinable building blocks whose good values can be assembled. If the fitness landscape has no such structure (genes strongly interdependent, or the representation scatters related genes), crossover mostly injects noise and a GA offers little over stochastic hill climbing.
Max’s answer: powerful, it creatively combines two very good sets and therefore might form an even better one: but sometimes the parent sets are coherent within itself and make sense → splitting them up may destruct them completely
Result:
Q10 — Local Search · 🔴 Hard ⚠️
Question: Simulated Annealing borrows from physical annealing. What is its theoretical convergence guarantee, and what is the practical catch?
Answer
- Analogy: metal cooled slowly settles into a low-energy (near-global-minimum) crystal; cooled fast it freezes into a defective high-energy state. SA’s temperature schedule plays the role of cooling.
- Guarantee: if the temperature is lowered slowly enough, SA converges to the global optimum with probability approaching 1.
- ⚠️ The catch: “slowly enough” can require a cooling schedule so slow it takes (near-)exponential time — no better than exhaustive search in the worst case. Any practical, fast schedule loses the guarantee; SA then becomes a good heuristic, not a guaranteed optimizer. So the theorem is reassuring in principle but does not promise efficiency.
Max’s answer: First, it explores the loss surface a lot by accepting worse neighbours → therefore might avoid local minima. But as the iterations progress, the temperature falls and therefore the probibility of accepting worse neighbours also decreases → getting closer to the global minimum. In theory, if the cooling-schedule is long enough, SA will almost always find the global minimum.
Result:
Constraint Satisfaction
Q11 — CSP · 🔴 Hard ⚠️
Question: Define arc consistency precisely and explain why it is not commutative. Give a concrete example where (X,Y) is arc-consistent but (Y,X) is not.
Answer
Arc consistency: the arc (X→Y) is consistent iff for every value in D(X) there is at least one value in D(Y) that satisfies the binary constraint between X and Y.
Not commutative: making (X→Y) consistent means deleting values from D(X) that have no partner in D(Y) — it only touches D(X). (Y→X) looks at the reverse direction and deletes from D(Y). The two checks are asymmetric.
Example: D(X)={1}, D(Y)={1,2}, constraint X<Y.
- (X→Y): X=1 has partner Y=2 ✓ → arc-consistent.
- (Y→X): Y=1 needs an X<1 → none → Y=1 would have to be removed → not arc-consistent.
→ direction (X→Y) fine, (Y→X) not: not commutative. This is why AC-3 works with a queue of directed arcs and re-adds affected arcs after every domain reduction.
📋 Slide scope: the arc-consistency definition (slide 29) and the asymmetric example (slide 31) are on the slides. AC-3 itself is not — the slides present AC-1 (slide 32). The queue-of-directed-arcs mechanism is R&N’s AC-3 (see Q12).
Max’s answer: soo if D1: red blue green; D2: red blue, then D2 is arc consistent with D1, but D1 is not arc consistent with D2, therefore green is removed → now D1 is also arc consistent with D2
Result:
Q12 — CSP · 🔴 Hard ⚠️
Question: The complexity of AC-3 is O(c·d³) (c = number of arcs/constraints, d = domain size). Justify where each factor comes from.
Answer
- Each arc can be re-inserted into the queue at most d times — it is only re-queued when a neighbour’s domain shrinks, and a domain can lose at most d values. → factor d of re-insertions per arc.
- With c arcs that is O(c·d) queue processings in total.
- Each consistency check of an arc (X→Y) costs O(d²): for each of the d values of X, try all d values of Y.
- Total: O(c·d) · O(d²) = O(c·d³).
⚠️ Trap: not O(c·d²) (that would be a single pass over all arcs) — the third d-factor comes from the repeated re-queuing.
📋 Slide scope: supplementary. The SoSe slides only present AC-1 (“the easiest but not the most efficient algorithm”, slide 32) and do not state any complexity bound. AC-3 and the O(c·d³) bound are R&N (textbook). The derivation above is the standard, correct one — fair game for the oral, but flag that the slides themselves stop at AC-1.
Max’s answer:
Result:
Q13 — CSP · 🔴 Hard ⚠️
Question: MRV, the Degree heuristic, and LCV — which control variable selection, which control value selection? MRV (“take the most constrained variable”) and LCV (“take the least constraining value”) sound contradictory. Why aren’t they?
Answer
- Variable selection: MRV (Minimum Remaining Values / “first fail”) — pick the variable with the fewest legal values. Degree — tie-breaker: pick the variable involved in the most constraints with still-unassigned variables.
- Value selection: LCV (Least Constraining Value) — pick the value that removes the fewest options from neighbouring variables.
- No contradiction, because they pursue different goals:
- MRV wants to fail early: problems likely to hit a dead end should be detected immediately → prune the search tree near the top (fail-fast).
- LCV wants to enable success: once a variable is chosen, the value should leave the rest as much flexibility as possible → fail late at the value choice.
- Mnemonic: “Fail first (variable), succeed first (value).” Minimize branching at the variable, maximize solvability at the value.
📋 Slide scope: all three heuristics are named on slide 19 with exactly these definitions (MRV = “First-Fail”, Degree, LCV).
Max’s answer: MRV and Degree Heur: variable, LCV: value. mostly you would apple MRV and LCV at the same time, since they observe two different angles. MRV chooses for failing fast, while LCV tries to stay in as long as possible
Result:
Q14 — CSP · 🔴 Hard ⚠️
Question: Does the combination of node + arc consistency guarantee that a solution exists? If not — what good is arc consistency then?
Answer
No. Node + arc consistency guarantees no global consistency / solvability. A CSP can be fully arc-consistent and still have no solution (classic: 3 variables, domain {red, blue}, pairwise distinct — every arc is consistent, but no 2-colouring of a triangle exists).
What it buys you:
- Domain pruning — removes values that can never be part of a solution → smaller search space for backtracking.
- Early inconsistency detection — if a domain becomes empty, the CSP is provably unsolvable (without search).
- As preprocessing or during search (→ MAC / Maintaining Arc Consistency) it speeds up backtracking massively.
For a full guarantee you’d need strong k-consistency for k = number of variables (too expensive in practice). So arc consistency reduces search but does not replace it.
📋 Slide scope: slides 40–41 make exactly this point (node+arc consistent ⇏ globally consistent), with the same “only two values, pairwise distinct” counterexample. Strong k-consistency is on slide 45. “MAC” as an acronym isn’t on the slides, but “consistency during backtracking search” is (slides 25, 60).
Max’s answer: no arc consistency can only disprove the tree, because it is a binary constraint and does not check the global constraints - while path consistency can at strong k-consistency
Result:
Q15 — CSP · 🟢 Moderate
Question: Give the formal definition of a CSP as a triple, and define what a solution is.
Answer
A CSP is a triple ⟨X, D, C⟩:
- X = {X₁, …, Xₙ} — a set of variables.
- D = {D₁, …, Dₙ} — a domain of allowed values for each variable.
- C — a set of constraints, each restricting the allowed combinations of values for some subset of variables.
An assignment gives values to variables; it is consistent if it violates no constraint and complete if every variable is assigned. A solution = a complete and consistent assignment.
The power of the CSP framing: a generic solver (backtracking + inference) works for any problem expressed this way, regardless of domain (map colouring, scheduling, Sudoku, …).
Max’s answer: CSP is a triple of vairbales, domains and constraints, a solution (all assigneed) and consistent
Result:
Q16 — CSP · 🟢 Moderate
Question: Define node consistency and distinguish unary, binary, and higher-order constraints.
Answer
- Unary constraint: restricts a single variable (e.g. SA ≠ green).
- Binary constraint: restricts a pair of variables (e.g. X ≠ Y). Binary CSPs can be drawn as a constraint graph (nodes = variables, edges = constraints).
- Higher-order (n-ary) constraint: involves ≥3 variables (e.g. AllDifferent). Can always be reduced to binary constraints via auxiliary variables.
- Node consistency: a variable is node-consistent if every value in its domain satisfies all its unary constraints. You enforce it by deleting domain values that violate the unary constraints — the cheapest preprocessing step before arc consistency.
Max’s answer: node consistency only that evry value sastisfies the unary constraint, binary means it involves two varibales, and hogher order even more variables and values
Result:
Q17 — CSP · 🟢 Moderate
Question: What is backtracking search for CSPs, and how does it improve on naive “generate complete assignments and test”?
Answer
Backtracking search = depth-first search that assigns one variable at a time, checks consistency with the already-assigned variables, and backtracks as soon as a constraint is violated (a dead end), trying the next value.
Improvements over generate-and-test:
- Early pruning: a partial assignment that already violates a constraint is abandoned immediately — no need to fill in the rest.
- Exploits commutativity: the order of assignments doesn’t change the set of solutions, so the solver fixes a single variable ordering at each node instead of permuting → branching factor d, not n·d.
It’s the backbone into which heuristics (MRV/Degree/LCV) and inference (forward checking, MAC) are plugged.
Max’s answer: backtracking finds a complete assignment by depth fist and
Result:
Q18 — CSP · 🟢 Moderate
Question: What is forward checking?
Answer
Forward checking is an inference step during backtracking: whenever you assign a value to a variable X, you look at each unassigned neighbour Y of X and delete from D(Y) every value inconsistent with X’s new assignment.
- If any neighbour’s domain becomes empty, you backtrack immediately — no point continuing.
- Benefit: detects many dead ends one assignment earlier than plain backtracking, cheaply.
- Limit: it only checks arcs into the just-assigned variable’s neighbours; it doesn’t propagate further (that’s what MAC/AC-3 does — see Q19).
Max’s answer: prune neighbours domains after each assignment
Result:
Q19 — CSP · 🔴 Hard
Question: Why is MAC (Maintaining Arc Consistency) strictly stronger than forward checking? Sketch a situation FC misses but AC catches.
Answer
- Forward checking propagates one step: after assigning X, it prunes only the domains of X’s direct unassigned neighbours. It never checks consistency between two still-unassigned variables.
- MAC runs full arc-consistency propagation (AC-3) after each assignment: pruning Y’s domain re-queues Y’s arcs, which may prune Z, which re-queues Z’s arcs, … so inconsistencies cascade through the whole graph.
- Where FC misses: suppose after assigning X, FC prunes D(Y) and D(Z) but leaves both non-empty — so FC is satisfied. Yet the remaining values of Y and Z may have no consistent pair between them (their connecting constraint is now unsatisfiable). FC never looks at the Y–Z arc, so it proceeds into a doomed subtree; AC-3 would re-examine the Y–Z arc, empty a domain, and backtrack now, saving the wasted search.
Max’s answer: because it applies AC-3 after each step
Result:
Q20 — CSP · 🔴 Hard
Question: Describe the Min-Conflicts algorithm (local search for CSPs). Why is it astonishingly effective on the n-queens problem, even for huge n?
Answer
- Min-Conflicts: start from a complete (probably conflicting) assignment. Repeat: pick a conflicted variable (one involved in a constraint violation) and reassign it to the value that minimizes the total number of conflicts (ties broken randomly). Stop when zero conflicts.
- It’s hill climbing on the “number of violated constraints” objective, applied to CSPs.
- Why great on n-queens: the n-queens (and many real scheduling) landscapes have an enormous density of solutions and few local minima relative to the search space, so greedy conflict reduction almost always slides straight to a solution. Empirically it solves the million-queens problem in ~50 steps, essentially independent of n — whereas systematic backtracking blows up.
- Caveat: as a local search it is incomplete — it can get stuck in a conflict local minimum (mitigated by random restarts / random sideways moves).
Max’s answer:
Result:
Q21 — CSP · 🟢 Moderate
Question: What does RCC-8 model, and can you name several of its relations?
Answer
RCC-8 (Region Connection Calculus) is a qualitative spatial reasoning calculus: it describes how two spatial regions can relate, using 8 jointly exhaustive and pairwise disjoint relations:
- DC — disconnected (no contact)
- EC — externally connected (touching, no overlap)
- PO — partial overlap
- TPP — tangential proper part (inside, touching boundary)
- NTPP — non-tangential proper part (strictly inside)
- TPPi, NTPPi — the inverses of the two above
- EQ — equal (identical regions)
Reasoning with these (which combinations are possible) is itself a constraint-satisfaction problem over the spatial relations — that’s why it appears in the CSP lecture.
Max’s answer: spatial reasoning - meaning it describes how regions related to each other
Result:
Q22 — CSP · 🔴 Hard
Question: Map colouring is the canonical CSP. Set it up formally, describe its constraint graph, and show how MRV + Degree behave on the Australia map.
Answer
- Formulation: variables = the regions (WA, NT, SA, Q, NSW, V, T), each domain = {red, green, blue}; constraints = adjacent regions must differ in colour (binary ≠ constraints).
- Constraint graph: nodes = regions, an edge between two regions iff they share a border. (Tasmania, T, is unconnected — trivially colourable.)
- Degree heuristic at the start: all domains have 3 values, so MRV can’t discriminate yet → Degree breaks the tie and picks SA, which borders 5 other regions (the highest degree) → assigning the most-constraining variable first prunes hardest.
- MRV afterwards: once SA = red, its neighbours (WA, NT, Q, NSW, V) drop to 2 remaining values while T still has 3 → MRV now picks one of those constrained neighbours, keeping the search where failure is most likely and pruning early.
- Together they let backtracking colour the map essentially without backtracking.
Max’s answer:
Result:
Machine Learning
Q23 — ML · 🔴 Hard ⚠️
Question: Define bias and variance. Explain precisely why bagging reduces variance but not bias.
Answer
- Bias: systematic error from too-simple assumptions — the model misses the true function on average (underfitting).
- Variance: fluctuation of the prediction across different training sets — the model reacts too strongly to the particular sample (overfitting).
- Why bagging lowers variance: train B models on bootstrap samples and average. For the average of B models with individual variance σ² and pairwise correlation ρ, Var ≈ ρσ² + (1−ρ)σ²/B. The second term shrinks with B → averaging reduces variance.
- Why not bias: all B models are of the same type (e.g. deep trees) and equally biased in expectation. The average of quantities with the same expected value has the same expected value → E[ensemble] = E[single model], the bias stays. That’s why you bag high-variance / low-bias models (deep, unpruned trees).
📋 Slide scope: largely supplementary. The slides frame ensembles as needing classifiers that are “accurate and diverse” (slide 36, ML II), and discuss overfitting/Occam qualitatively (ML I, slides 18–22). The bias–variance decomposition and the variance-of-an-average formula (ρσ² + (1−ρ)σ²/B) are R&N / ESL textbook, not on the MoAI slides — correct and a great oral answer, but know it’s beyond the slides.
Max’s answer:
Result:
Q24 — ML · 🔴 Hard ⚠️
Question: What is the one extra source of randomness a Random Forest introduces over plain bagging of trees? Why does decorrelating the trees reduce ensemble variance — and not bagging alone?
Answer
- Extra source of randomness: at every split only a random subset of features (typically √p of p features) is considered as candidates — not all of them. (Bagging randomizes only the data rows; RF additionally randomizes the feature columns per split.)
- Why it helps: from Var ≈ ρσ² + (1−ρ)σ²/B you see that even for B→∞ the term ρσ² remains. With plain bagging the trees are strongly correlated (large ρ), because a dominant feature splits near the top of almost every tree → the trees look alike → averaging helps little.
- Feature subsampling lowers ρ: dominant features aren’t available in every tree, the trees get decorrelated → the floor ρσ² drops → ensemble variance falls below what bagging alone achieves.
⚠️ Key statement: bagging averages, RF averages and decorrelates. The decorrelation is RF’s real added value.
📋 Slide scope: the √p features per split is on the slides (slide 38, ML II: “for classification often p^(1/2) rounded down; for regression p/3”). The ρσ² + (1−ρ)σ²/B formula is textbook (same caveat as Q23). The slides motivate it qualitatively as making trees “substantially different from one another while still accurate” (slide 36).
Max’s answer:
Result:
Q25 — ML · 🔴 Hard ⚠️
Question: What is the Out-of-Bag (OOB) error, why do you get it “for free”, and what does it approximate?
Answer
- Each bootstrap sample draws n examples with replacement. On average ~63.2 % of the original data land in a tree’s sample → ~36.8 % stay “out-of-bag” for that tree. (Limit: 1 − 1/e ≈ 0.368.)
- OOB error: for each example x, collect only the trees that did not see x in training, let those vote, compare to the true label. Averaged over all examples = OOB error.
- Free: no separate validation set / no cross-validation run needed — the OOB examples fall out of bootstrapping anyway.
- Approximates: the test / generalization error (comparable to k-fold CV), because each example is judged only by trees that never saw it.
📋 Slide scope: the 63.2 % bootstrap fact is on the slides (slide 37, ML II: “on average, 63.2 % of all samples will be present at least once”). The term “out-of-bag error” and using the held-out 36.8 % as a free validation estimate are supplementary (R&N) — the slides state the percentage but don’t name OOB error.
Max’s answer: it is like a free test-set, since when creating a new bag, you take the left-overs and you get a similar distribution to a test-run
Result:
Q26 — ML / SVM · 🔴 Hard ⚠️
Question: What exactly are support vectors, and why do only they determine the hyperplane? What happens to the decision boundary if you delete a non-support-vector — and what if you delete a support vector?
Answer
- Support vectors are the training points lying exactly on the margin edge (or, in soft-margin, on or inside the margin). They are the points with αᵢ > 0 in the dual solution.
- Why only they count: the optimal weight direction is w = Σ αᵢ yᵢ xᵢ. For all non-support-vectors αᵢ = 0 → they contribute nothing to w. The hyperplane is fully defined by the support vectors.
- Delete a non-SV: the boundary stays identical — the point was far away with α=0 and had no influence.
- Delete an SV: the boundary can change, because that point helped support the margin → the optimization problem has a different solution.
- This is also why SVMs are sparse and memory-efficient: only the SVs need to be kept.
📋 Slide scope: fully slide-grounded — w = Σ αᵢ yᵢ xᵢ, the KKT condition αᵢ·(yᵢ(⟨xᵢ,w⟩+b)−1)=0, and “points with αᵢ>0 are the support vectors; the rest (αᵢ=0) have no influence” are all on slides 15–16 (Session 11).
Max’s answer:
Result:
Q27 — ML / SVM · 🔴 Hard ⚠️
Question: The C parameter in the soft-margin SVM: what do large vs. small C do to the margin width and to the bias–variance tradeoff? (Direction is frequently confused!)
Answer
C weights, in the optimization problem ½‖w‖² + C·Σξᵢ, the penalty for margin violations (ξᵢ = slack).
- Large C: violations are expensive → the model tolerates almost no errors → narrow (hard) margin, fits the training data tightly → low bias, high variance → overfitting risk.
- Small C: violations are cheap → more slack allowed → wide margin, more points may sit wrong/inside → higher bias, lower variance → more robust, possibly underfitting.
⚠️ Mnemonic: large C = little regularization = tight fit. (Exactly the opposite of the λ regularization parameter, where large = more regularization. That’s the classic confusion.)
📋 Slide scope: supplementary / not on the slides. The SoSe SVM slides cover only the hard-margin (“generalized portrait”) SVM — they never introduce soft margins, slack ξ, or C. This is correct R&N material and the examiner may well ask it, but it is not on the MoAI slides, so don’t present it as if it were.
Max’s answer:
Result:
Q28 — ML / SVM · 🔴 Hard ⚠️
Question: Explain the kernel trick: what does it compute, and why do you never have to evaluate φ(x) explicitly? Give the form of the RBF / Gaussian kernel.
Answer
- In the dual SVM the data points appear only as dot products ⟨xᵢ, xⱼ⟩ — never individually. To separate non-linearly you map the data with φ into a higher-dimensional space and would need ⟨φ(xᵢ), φ(xⱼ)⟩ there.
- Kernel trick: a kernel function K(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩ computes that dot product directly in the input space, without ever evaluating φ. You work implicitly in a (possibly infinite-dimensional) space at input-space cost.
- Why never φ explicitly: φ can be huge / infinite-dimensional (for the RBF it is infinite) — explicitly impossible; K still delivers the result cheaply.
- RBF / Gaussian kernel (slide form): K(x, x′) = exp(−‖x − x′‖² / 2σ²) with free parameter σ. Equivalent γ-form: exp(−γ‖x−y‖²) with γ = 1/(2σ²). Large γ (small σ) → narrow, local boundaries (overfitting risk).
📋 Slide scope: the kernel definition K(x,x′)=⟨Φ(x),Φ(x′)⟩, the decision function f(x)=sgn(Σ αᵢyᵢK(x,xᵢ)+b), and the three kernels (linear, polynomial (⟨x,x′⟩)^d, Gaussian) are all on slides 17–19. The slides’ canonical RBF is the σ-form exp(−‖x−x′‖²/2σ²) — lead with that; the γ-form is the equivalent reparametrization (and only appears later in the “soft tree kernel” example).
Max’s answer:
Result:
Q29 — ML · 🔴 Hard
Question: Decision trees: entropy and information gain drive the splits. Why does a single tree overfit, and how do max-depth / pruning relate to the bias–variance tradeoff?
Answer
- Entropy H = −Σ pᵢ log₂ pᵢ measures the impurity of a node’s set. Information gain = entropy(parent) − weighted entropy(children); ID3 picks, per node, the attribute with maximum IG.
- Why overfitting: with no stopping, the tree grows until each leaf is pure (often 1 example/leaf). Then it models noise instead of structure → perfect on training, poor on test = high variance, low bias.
- Max-depth / pruning: limit complexity → smoother decision boundaries.
- Too shallow / over-pruned: high bias (underfitting).
- Too deep / unpruned: high variance (overfitting).
- Pruning thus shifts you in a controlled way from variance toward bias. (Exactly why unpruned high-variance trees are the ideal base learners for Random Forests → see Q23/Q24.)
📋 Slide scope: entropy E(S)=Σ−pᵢlog₂pᵢ (slide 26), information gain (slide 29), ID3’s inductive bias (“shorter trees / high-IG attributes near the root”, slide 25), and “deep decision trees tend to overfit” (slide 34) are all on the slides. Pruning / max-depth as named remedies are not explicit on the DT slides — the slide remedy for tree overfitting is random forests (ML I mentions regularization only in general terms). The bias–variance framing is textbook (cf. Q23).
Max’s answer:
Result:
Q30 — ML · 🟢 Moderate
Question: Distinguish supervised, unsupervised, and reinforcement learning, with one example each.
Answer
- Supervised: learn a mapping from labelled examples (x, y). Goal: predict y for new x. Examples: classification (spam/not-spam), regression (house price).
- Unsupervised: find structure in unlabelled data. Examples: clustering (k-means), dimensionality reduction (PCA).
- Reinforcement: an agent learns a policy by interacting with an environment and receiving rewards — no labelled examples, learning from delayed feedback. Example: game playing, robot control.
Max’s answer:
Result:
Q31 — ML · 🟢 Moderate
Question: What distinguishes overfitting from underfitting in terms of training vs. test error?
Answer
- Underfitting: model too simple → high training error AND high test error. It hasn’t captured the structure (high bias).
- Overfitting: model too complex → low training error but high test error. It has memorized noise/specifics of the training set and fails to generalize (high variance).
- The telltale sign of overfitting is a large gap between (good) training performance and (poor) test performance. The goal is the sweet spot in between, found e.g. via cross-validation (Q32).
Max’s answer:
Result:
Q32 — ML · 🟢 Moderate
Question: What is k-fold cross-validation and what problem does it solve?
Answer
- Split the data into k equal folds. For each fold i: train on the other k−1 folds, validate on fold i. Repeat for all i, then average the k validation scores.
- Problem it solves: gives a more reliable estimate of generalization error than a single train/test split, and uses all data for both training and validation (each point is in the validation set exactly once) — important when data is scarce.
- Also used for hyperparameter selection / model comparison. Extreme case k = n is leave-one-out CV.
Max’s answer:
Result:
Q33 — ML · 🟢 Moderate
Question: What does a perceptron learn, and what is its fundamental limitation?
Answer
- A perceptron is a single linear threshold unit: output = sign(⟨w, x⟩ + b). The perceptron learning rule nudges w toward misclassified points until (if possible) all are correctly separated.
- Guarantee: if the data are linearly separable, the perceptron converges to a separating hyperplane in finite steps.
- Fundamental limitation: it can only represent linearly separable functions. The classic counterexample is XOR, which no single line separates → a perceptron cannot learn it. This motivated multi-layer networks (hidden layers) with non-linear activations.
Max’s answer:
Result:
Q34 — ML / SVM · 🔴 Hard
Question: Why does a hard-margin SVM fail on real data, motivating the soft margin? And geometrically, why is the margin width 2/‖w‖?
Answer
- Hard-margin failure: the hard-margin SVM requires the data to be perfectly linearly separable (every point on the correct side, outside the margin). With noise, outliers, or overlapping classes there is no feasible solution — the optimization is infeasible. Even when separable, a single outlier can force a tiny, badly-placed margin (overfitting).
- Soft margin fix: introduce slack variables ξᵢ ≥ 0 allowing violations, penalized by C·Σξᵢ → tolerate some misclassification for a wider, more robust margin (Q27).
- Why margin = 2/‖w‖: the two supporting hyperplanes are ⟨w,x⟩+b = +1 and = −1. The perpendicular distance between two parallel planes ⟨w,x⟩+b = ±1 is |1 − (−1)| / ‖w‖ = 2/‖w‖. Maximizing this margin ⇔ minimizing ‖w‖ (see Q35).
Max’s answer:
Result:
Q35 — ML / SVM · 🔴 Hard
Question: Show why “maximize the margin” turns into the optimization min ½‖w‖² s.t. yᵢ(⟨w,xᵢ⟩+b) ≥ 1.
Answer
- Fix the scale so the closest points satisfy ⟨w,xᵢ⟩+b = ±1 (canonical form). Then the margin width is 2/‖w‖ (Q34).
- Maximize 2/‖w‖ ⇔ minimize ‖w‖ ⇔ minimize ½‖w‖² (the ½ and square make it a smooth, convex objective — a quadratic program — without changing the argmin).
- Constraints: every point must be on the correct side and outside the margin: for yᵢ ∈ {−1,+1}, that’s yᵢ(⟨w,xᵢ⟩ + b) ≥ 1 for all i.
- Result: a convex QP with a unique global optimum (no local minima — unlike neural nets). Solving its dual exposes the αᵢ and the dot-product structure that enables the kernel trick (Q28).
Max’s answer:
Result:
Q36 — ML · 🟢 Moderate
Question: What is bootstrap sampling, and what does bagging do with it?
Answer
- Bootstrap sample: draw n examples from the n-example training set with replacement. Some examples appear multiple times, some not at all (~36.8 % left out → the OOB set, Q25).
- Bagging (Bootstrap AGGregatING): create B bootstrap samples, train one model per sample, then aggregate their predictions — majority vote for classification, average for regression.
- Purpose: reduce variance of an unstable, high-variance learner (Q23) without increasing bias. It’s the foundation Random Forests build on (Q24).
Max’s answer:
Result:
Q37 — ML · 🔴 Hard ⚠️
Question: Why is accuracy a misleading metric on imbalanced data? Define precision, recall, and F1, and say when you care about which.
Answer
- Imbalance problem: if 99 % of cases are negative, a classifier that always predicts “negative” scores 99 % accuracy while being useless (catches zero positives). Accuracy hides failure on the rare-but-important class.
- Precision = TP / (TP + FP): of the cases I flagged positive, how many really are? (Cost of false alarms.)
- Recall (sensitivity) = TP / (TP + FN): of the actual positives, how many did I catch? (Cost of misses.)
- F1 = 2·(precision·recall)/(precision+recall): harmonic mean — high only when both are high.
- Which to favour: maximize recall when missing a positive is dangerous (cancer screening, fraud); maximize precision when false alarms are costly (spam filtering a wanted email). F1 when you need a single balanced number.
Max’s answer:
Result:
Q38 — ML · 🔴 Hard
Question: Sketch the U-shaped test-error curve as model complexity grows, in terms of bias and variance. Where is the optimum?
Answer
Decompose expected test error ≈ bias² + variance + irreducible noise. As model complexity increases:
- bias² falls (a richer model can fit the true function better),
- variance rises (the model becomes more sensitive to the particular training sample).
- Training error keeps dropping monotonically, but test error = bias² + variance first falls (bias dominates the improvement) then rises (variance takes over) → a U-shape.
- The optimum is the complexity where the marginal drop in bias² equals the marginal rise in variance — the bottom of the U. Tools to find/control it: cross-validation (Q32), regularization, pruning/max-depth (Q29), C in SVMs (Q27), B and feature-subsampling in RF (Q24).
Max’s answer:
Result:
Q39 — ML / SVM · 🟢 Moderate
Question: Name the three kernels from the slides and give the linear and polynomial forms.
Answer
The three slide kernels (Session 11):
- Linear: K(x, x′) = ⟨x, x′⟩ — the plain dot product; recovers the ordinary linear SVM.
- Polynomial: K(x, x′) = (⟨x, x′⟩)ᵈ (degree d; a constant offset variant (⟨x,x′⟩+c)ᵈ also exists) — implicitly maps to all monomials up to degree d.
- Gaussian / RBF: K(x, x′) = exp(−‖x − x′‖² / 2σ²) — infinite-dimensional feature space (see Q28 for the trick and the γ-reparametrization).
All three are valid kernels because each corresponds to a genuine inner product ⟨φ(x), φ(x′)⟩ in some feature space (Mercer’s condition).
Max’s answer:
Result:
Score
- Local Search: __ / 10
- CSP: __ / 12
- Machine Learning: __ / 17
- Total: __ / 39
By difficulty:
- 🟢 Moderate: __ / 14
- 🔴 Hard: __ / 25
Re-Quiz
Note any questions answered wrong/unsure here and revisit in 2–3 days: