Lernzettel: SVMs & Perceptron
Methods of AI — SoSe 2026 · 1-page exam sheet
For more depth: Support Vector Machines (full reference with all algorithms, worked examples, deep explanations) · quiz_svm_30-04-26 (7 practice questions)
Core Ideas
- SVM: hyperplane that maximizes the margin between two classes. Only support vectors (closest points, αᵢ > 0) determine the hyperplane.
- Margin = 2/‖w‖. Maximize margin ⇔ minimize ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩+b) ≥ 1.
- Kernel trick: replace ⟨x, x′⟩ with K(x, x′) = ⟨Φ(x), Φ(x′)⟩ to classify in higher-dim spaces without computing Φ.
- Perceptron: single-layer linear classifier, output Θ(s) ∈ {0,1} (Heaviside, slide 36). Provably convergent iff linearly separable. Cannot solve XOR.
Mini-glossary
| Term | Meaning |
|---|---|
| Hyperplane H | {x ∈ Rⁿ | ⟨w,x⟩ + b = 0} — separating (n−1)-dim surface |
| Margin | 2/‖w‖ — distance between the two parallel support hyperplanes |
| Support vector | training point on the margin; αᵢ > 0; alone determines w |
| Kernel K(x, x′) | implicit inner product ⟨Φ(x), Φ(x′)⟩ in feature space |
| Slack ξᵢ (suppl.) | how much constraint i may be violated (soft-margin only) |
| C (suppl.) | soft-margin trade-off: large C → narrow margin, small C → more regularization |
| αᵢ ≥ 0 | Lagrange multiplier; αᵢ > 0 ⇔ point i is a support vector (KKT) |
| RBF | Radial Basis Function (= Gaussian kernel). “Radial” = depends only on distance ‖x−y‖; “basis function” = each SV is a bump. Maps to infinite-dim space. γ = bump width |
| PSD (suppl.) | Positive Semi-Definite: property of the kernel/Gram matrix K (Kᵢⱼ = K(xᵢ,xⱼ)): cᵀKc ≥ 0 ∀c, i.e. all eigenvalues ≥ 0. Mercer: K PSD ⟺ ∃ φ with K(x,y)=⟨φ(x),φ(y)⟩ |
⭐ Full glossary with examples → Perceptron vocabulary ⭐
Key formulas
| Formula | |
|---|---|
| Hyperplane | ⟨w, x⟩ + b = 0 |
| Margin | 2/‖w‖ |
| Optimization (primal) | min ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩+b) ≥ 1 |
| Lagrangian | L = ½‖w‖² − Σᵢ αᵢ(yᵢ(⟨xᵢ,w⟩+b) − 1) |
| KKT condition | αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1) = 0 |
| Weight recovery | w = Σᵢ αᵢ yᵢ xᵢ |
| Decision function | f(x) = sgn(Σᵢ αᵢ yᵢ K(x, xᵢ) + b) |
| Linear kernel | K(x, y) = ⟨x, y⟩ |
| Polynomial kernel | K(x, y) = (⟨x, y⟩ + c)^d |
| RBF (Radial Basis Function = Gaussian) kernel | slides: K(x, y) = exp(−‖x − y‖² / 2σ²), param σ · equiv. exp(−γ‖x−y‖²) with γ = 1/(2σ²) |
| Perceptron output | y = Θ(w⃗ · x⃗) ∈ {0, 1}, x₀ = 1 |
| Perceptron update | Δw⃗ = ε · (tⁿ − yⁿ) · x⃗ⁿ |
Common Exam Traps ⚠️
- Only support vectors matter — αᵢ = 0 points contribute nothing to w. Removing non-SVs leaves the hyperplane unchanged.
- SVs aren’t “chosen” / aren’t simply “the closest points” — they emerge from the optimization as the points with active constraints (αᵢ > 0), i.e. those lying exactly on the margin (
yᵢ(w·xᵢ+b)=1). It’s the QP + KKT, not a distance ranking, that picks them. (Soft-margin:0<αᵢ<Con the margin,αᵢ=Cviolating it.) → How are support vectors chosen? - Lagrangian sign convention — standard form subtracts αᵢ·(constraint); flipping the sign inverts the saddle-point and is a classic error.
- Margin is 2/‖w‖, not 1/‖w‖ — the 2 comes from the ±1 spacing of the two parallel margin hyperplanes.
- Perceptron output is Θ(s) ∈ {0,1} — NOT sign(s). Bias is absorbed as w₀ via constant input x₀ = 1.
- Perceptron “gradient” descent is technically not — Θ isn’t differentiable; slide 44 takes the gradient w.r.t. the preactivation w·x instead of the output.
- Perceptron convergence is conditional — only guaranteed if (i) data linearly separable AND (ii) ε small enough.
- XOR → “temporary drop in interest in ANN” (slide 47 wording). “AI winter” / Minsky & Papert is textbook embellishment.
- Soft-margin SVM, C trade-off, slack variables, Mercer’s condition are NOT in slides — supplementary R&N material (large C → overfitting; small C → more regularization).
- Kernel trick ≠ explicit mapping — K computes ⟨Φ(x), Φ(x′)⟩ without ever evaluating Φ; works even when Φ maps into infinite-dim spaces (RBF).
- PSD-ness = why the kernel trick is legal (suppl., tied to Mercer) — K is a valid kernel iff its Gram matrix is positive semi-definite (eigenvalues ≥ 0, not strictly > 0). PSD guarantees a φ exists with K(x,y)=⟨φ(x),φ(y)⟩, and makes the dual QP convex (concave maximization → one global optimum). Non-PSD kernel → non-convex → solver can get stuck in a local optimum. It’s the Gram matrix that must be PSD, not the raw data.
- RBF γ vs C — γ = kernel width: large γ → narrow bumps → overfitting; small γ → smooth, near-linear boundary. Don’t confuse with C (slack penalty); both tune over/underfitting but via different mechanisms.
Quick comparison
| Perceptron | SVM (linear) | SVM (kernel) | |
|---|---|---|---|
| Boundary | Linear | Linear, max-margin | Non-linear (via Φ) |
| Output | Θ ∈ {0,1} | sgn ∈ {−1,+1} | sgn ∈ {−1,+1} |
| Training | MSE “gradient” descent (on preactivation) | Convex QP | Convex QP with K |
| Uses | All training points | Support vectors only | Support vectors only |
| Solves XOR | No | No | Yes (right K) |
| Convergence | Iff linearly sep. + small ε | Always (unique optimum) | Always |
Full algorithm reference (perceptron training, SVM QP, kernel eval, soft-margin) → ALGORITHMS (full reference) ⭐
Related Q&A & Notes
Practice quiz
- quiz_svm_30-04-26 — 7 questions
Targeted exam questions in Questions for Methods of AI
- Q74–79 (basic: SVM hyperplane, support vectors, kernel trick, perceptron) · Q84 (perceptron) · Q94 (perceptron + XOR) · Q95 (Kernel Trick) · Q130–132 (deep / exam-trap: Lagrangian sign convention, slack ξᵢ + C trade-off, Mercer’s condition)
Atomic notes
Sibling Lernzettel
- lernzettel_ml-i-ii_30-04-26 — ML I & II (background: bias-variance, classifiers)
See also
Tags: methods-of-ai lernzettel ai-generated
Full reference: Support Vector Machines
Quiz: quiz_svm_30-04-26
Superlink: Methods of AI Lecture
Questions hub: Questions for Methods of AI
Created: 30/04/26