Quiz: SVMs & Perceptron
Methods of AI — SoSe 2026
Q1 — SVM
Question: What are support vectors, and why are they the only points that matter for defining the hyperplane?
Answer
Support vectors are the training examples closest to the decision hyperplane — the points that lie exactly on the margin boundaries (⟨w,x⟩+b = ±1).
Why they’re the only ones that matter: from the Lagrangian formulation, the Karush-Kuhn-Tucker condition gives:
αᵢ · (yᵢ · (⟨xᵢ,w⟩+b) − 1) = 0
For non-support-vector points, the constraint is strict (>1), so αᵢ = 0. Only support vectors have αᵢ > 0.
Sincew = Σ αᵢ · yᵢ · xᵢ, only support vectors contribute to w. Remove all other training points → hyperplane unchanged.
Max’s answer:
Result:
Q2 — SVM
Question: What is the kernel trick? Why is it useful?
Answer
Kernel trick: replace the dot product ⟨x, xᵢ⟩ in the SVM decision function with a kernel function K(x, xᵢ) = ⟨φ(x), φ(xᵢ)⟩.
The kernel computes the dot product in a higher-dimensional feature space without ever explicitly computing φ(x). This makes it computationally tractable even for very high-dimensional (or infinite-dimensional) feature spaces.
Why useful: allows SVM to find non-linear decision boundaries in the original space by implicitly working in a higher-dimensional space where the data becomes linearly separable.
Decision function:f(x) = sgn(Σ αᵢ · yᵢ · K(x, xᵢ) + b)
Max’s answer:
Result:
Q4 — SVM
Question: What is the margin in SVM? Write the formula and explain what it means geometrically.
Answer
The margin is the distance between the two parallel hyperplanes bounding the “no man’s land” between classes:
- Positive boundary: ⟨w, x⟩ + b = +1
- Negative boundary: ⟨w, x⟩ + b = −1
Distance between them:margin = 2 / ‖w‖
SVM maximizes this margin, which is equivalent to minimizing ‖w‖² (constrained optimization).
Geometrically: a wider margin = more separation between classes = better generalization to unseen data.
Max’s answer:
Result:
Q5 — SVM
Question: What is the perceptron, and why can’t it solve XOR?
Answer
A perceptron is a single linear threshold unit: output = sign(⟨w, x⟩ + b). It learns a linear decision boundary via gradient descent on MSE.
XOR is not linearly separable — there’s no straight line that separates (0,0), (1,1) from (0,1), (1,0). A single perceptron can only represent linear boundaries → XOR is impossible.
Historical significance: Minsky & Papert (1969) proved XOR was unsolvable → led to the first “AI winter” for neural networks. The solution came later with multi-layer perceptrons (backprop).
Max’s answer:
Result:
Q6 — SVM
Question: Name three common kernel functions and give their formulas.
Answer
- Linear kernel: K(x, y) = ⟨x, y⟩ — same as standard dot product, no feature mapping
- Polynomial kernel: K(x, y) = (⟨x, y⟩ + c)^d — maps to d-th degree polynomial feature space
- RBF (Radial Basis Function / Gaussian) kernel: slides use the σ-form K(x, y) = exp(−‖x − y‖²/2σ²) (equiv. exp(−γ‖x − y‖²), γ = 1/2σ²) — maps to infinite-dimensional space; σ/γ controls how quickly similarity falls off with distance
RBF is often the default choice — very powerful for complex non-linear boundaries.
Max’s answer:
Result:
Q7 — SVM
Question: Write the SVM optimization problem (what we minimize and what the constraint is).
Answer
Minimize:
τ(w) = ½ · ‖w‖²
Subject to:yᵢ · (⟨xᵢ, w⟩ + b) ≥ 1for all training examples i
This is a constrained quadratic optimization problem. The objective minimizes ‖w‖² (maximizes margin 2/‖w‖). The constraint ensures all training examples are correctly classified and outside the margin.
Solved via Lagrange multipliers → dual formulation → only support vectors (αᵢ > 0) contribute to solution.
Soft-margin adds slack variables ξᵢ: constraint becomesyᵢ · (⟨xᵢ,w⟩+b) ≥ 1 − ξᵢ, and minimizes½‖w‖² + C·Σξᵢ.
Max’s answer:
Result:
Beyond the lecture (optional)
These questions go beyond the SoSe 2026 lecture slides (textbook / external additions). Kept for depth, not exam-critical.
Q3 — SVM
Question: What does the C parameter in soft-margin SVM control? What happens with large vs. small C?
Answer
C controls the tradeoff between margin size and training error (number of violated constraints).
- Large C: few violations tolerated → small margin, more complex boundary → risk of overfitting.
- Small C: many violations allowed → large margin, simpler boundary → more regularization, may underfit.
At C = ∞: hard-margin SVM (no violations allowed — only works if data is linearly separable).
At C → 0: ignores constraints entirely (degenerate).
In practice: C is a hyperparameter tuned via cross-validation.
Max’s answer:
Result:
Session 25/05/26 — slide-level rapid recall ⚡
5 questions on the core slide-level SVM mechanics (margin, support vectors, kernel trick, decision rule, convexity). Hard difficulty.
S1 — Margin
Question: The SVM margin is 2/‖w‖. Where does the 2 come from — why not just 1/‖w‖?
Answer
The two parallel margin hyperplanes are canonicalized as ⟨w,x⟩+b = +1 and ⟨w,x⟩+b = −1. Subtracting the closest positive and negative point equations: ⟨w, x₁−x₂⟩ = 2, and dividing by ‖w‖ gives the distance between them = 2/‖w‖. The 2 is the full gap between the +1 and −1 hyperplanes.
1/‖w‖would only be the distance from the decision boundary (=0) to one margin hyperplane — half the margin.
Max’s answer:
Result:
S2 — Support vectors
Question: You delete a training point and re-train. When does the hyperplane stay identical, and when does it move? Tie your answer to a formula.
Answer
Hyperplane is unchanged if you delete a non-support vector (αᵢ = 0, lies strictly beyond its margin). It moves / is recomputed if you delete a support vector (αᵢ > 0, sits on the margin). Reason:
w = Σᵢ αᵢ yᵢ xᵢ— only points with αᵢ > 0 contribute to w, and by the KKT condition only on-margin points have αᵢ > 0.
Max’s answer:
Result:
S3 — Kernel trick
Question: State precisely what K(x, x′) computes and why we never have to build the feature map Φ.
Answer
K(x, x′) = ⟨Φ(x), Φ(x′)⟩— it computes the inner product in feature space, but evaluated directly in the original input space (typically O(n)). We never build Φ because the SVM dual and the decision function depend on the data only through dot products ⟨xᵢ, xⱼ⟩. Replacing each one withK(xᵢ, xⱼ)classifies in the high-dim space implicitly. Essential when Φ is infinite-dimensional (RBF).
Max’s answer:
Result:
S4 — Decision rule
Question: Write the kernelized decision function for a new input x, and explain how the class is read off it.
Answer
f(x) = sgn(Σᵢ αᵢ yᵢ K(x, xᵢ) + b). The sum effectively runs over support vectors only (αᵢ = 0 elsewhere).sgn = +1→ positive class,−1→ negative class. Intuitively it’s a kernel-similarity-weighted vote of the support vectors, offset by the bias b.
Max’s answer:
Result:
S5 — Convexity
Question: Why does the SVM optimization have a single, unique solution (unlike training a perceptron or neural net)?
Answer
The primal — minimize
½‖w‖²subject to linear inequality constraintsyᵢ(⟨xᵢ,w⟩+b) ≥ 1— is a convex quadratic program: convex objective + convex (linear) constraints ⇒ a single global optimum, no local minima to get stuck in. Perceptrons/NNs optimize non-convex landscapes, so their solution depends on initialization and can be locally optimal.
Max’s answer:
Result:
Score
Total: / 7
Session 25/05/26 (slide-level): / 5