Lernzettel: SVMs & Perceptron

Methods of AI — SoSe 2026 · 1-page exam sheet

For more depth: Support Vector Machines (full reference with all algorithms, worked examples, deep explanations) · quiz_svm_30-04-26 (7 practice questions)

Core Ideas

  • SVM: hyperplane that maximizes the margin between two classes. Only support vectors (closest points, αᵢ > 0) determine the hyperplane.
  • Margin = 2/‖w‖. Maximize margin ⇔ minimize ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩+b) ≥ 1.
  • Kernel trick: replace ⟨x, x′⟩ with K(x, x′) = ⟨Φ(x), Φ(x′)⟩ to classify in higher-dim spaces without computing Φ.
  • Perceptron: single-layer linear classifier, output Θ(s) ∈ {0,1} (Heaviside, slide 36). Provably convergent iff linearly separable. Cannot solve XOR.

Mini-glossary

TermMeaning
Hyperplane H{x ∈ Rⁿ | ⟨w,x⟩ + b = 0} — separating (n−1)-dim surface
Margin2/‖w‖ — distance between the two parallel support hyperplanes
Support vectortraining point on the margin; αᵢ > 0; alone determines w
Kernel K(x, x′)implicit inner product ⟨Φ(x), Φ(x′)⟩ in feature space
Slack ξᵢ (suppl.)how much constraint i may be violated (soft-margin only)
C (suppl.)soft-margin trade-off: large C → narrow margin, small C → more regularization
αᵢ ≥ 0Lagrange multiplier; αᵢ > 0 ⇔ point i is a support vector (KKT)
RBFRadial Basis Function (= Gaussian kernel). “Radial” = depends only on distance ‖x−y‖; “basis function” = each SV is a bump. Maps to infinite-dim space. γ = bump width
PSD (suppl.)Positive Semi-Definite: property of the kernel/Gram matrix K (Kᵢⱼ = K(xᵢ,xⱼ)): cᵀKc ≥ 0 ∀c, i.e. all eigenvalues ≥ 0. Mercer: K PSD ⟺ ∃ φ with K(x,y)=⟨φ(x),φ(y)⟩

Full glossary with examplesPerceptron vocabulary ⭐

Key formulas

Formula
Hyperplane⟨w, x⟩ + b = 0
Margin2/‖w‖
Optimization (primal)min ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩+b) ≥ 1
LagrangianL = ½‖w‖² − Σᵢ αᵢ(yᵢ(⟨xᵢ,w⟩+b) − 1)
KKT conditionαᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1) = 0
Weight recoveryw = Σᵢ αᵢ yᵢ xᵢ
Decision functionf(x) = sgn(Σᵢ αᵢ yᵢ K(x, xᵢ) + b)
Linear kernelK(x, y) = ⟨x, y⟩
Polynomial kernelK(x, y) = (⟨x, y⟩ + c)^d
RBF (Radial Basis Function = Gaussian) kernelslides: K(x, y) = exp(−‖x − y‖² / 2σ²), param σ · equiv. exp(−γ‖x−y‖²) with γ = 1/(2σ²)
Perceptron outputy = Θ(w⃗ · x⃗) ∈ {0, 1}, x₀ = 1
Perceptron updateΔw⃗ = ε · (tⁿ − yⁿ) · x⃗ⁿ

Common Exam Traps ⚠️

  • Only support vectors matter — αᵢ = 0 points contribute nothing to w. Removing non-SVs leaves the hyperplane unchanged.
  • SVs aren’t “chosen” / aren’t simply “the closest points” — they emerge from the optimization as the points with active constraints (αᵢ > 0), i.e. those lying exactly on the margin (yᵢ(w·xᵢ+b)=1). It’s the QP + KKT, not a distance ranking, that picks them. (Soft-margin: 0<αᵢ<C on the margin, αᵢ=C violating it.)How are support vectors chosen?
  • Lagrangian sign convention — standard form subtracts αᵢ·(constraint); flipping the sign inverts the saddle-point and is a classic error.
  • Margin is 2/‖w‖, not 1/‖w‖ — the 2 comes from the ±1 spacing of the two parallel margin hyperplanes.
  • Perceptron output is Θ(s) ∈ {0,1} — NOT sign(s). Bias is absorbed as w₀ via constant input x₀ = 1.
  • Perceptron “gradient” descent is technically not — Θ isn’t differentiable; slide 44 takes the gradient w.r.t. the preactivation w·x instead of the output.
  • Perceptron convergence is conditional — only guaranteed if (i) data linearly separable AND (ii) ε small enough.
  • XOR → “temporary drop in interest in ANN” (slide 47 wording). “AI winter” / Minsky & Papert is textbook embellishment.
  • Soft-margin SVM, C trade-off, slack variables, Mercer’s condition are NOT in slides — supplementary R&N material (large C → overfitting; small C → more regularization).
  • Kernel trick ≠ explicit mapping — K computes ⟨Φ(x), Φ(x′)⟩ without ever evaluating Φ; works even when Φ maps into infinite-dim spaces (RBF).
  • PSD-ness = why the kernel trick is legal (suppl., tied to Mercer) — K is a valid kernel iff its Gram matrix is positive semi-definite (eigenvalues ≥ 0, not strictly > 0). PSD guarantees a φ exists with K(x,y)=⟨φ(x),φ(y)⟩, and makes the dual QP convex (concave maximization → one global optimum). Non-PSD kernel → non-convex → solver can get stuck in a local optimum. It’s the Gram matrix that must be PSD, not the raw data.
  • RBF γ vs C — γ = kernel width: large γ → narrow bumps → overfitting; small γ → smooth, near-linear boundary. Don’t confuse with C (slack penalty); both tune over/underfitting but via different mechanisms.

Quick comparison

PerceptronSVM (linear)SVM (kernel)
BoundaryLinearLinear, max-marginNon-linear (via Φ)
OutputΘ ∈ {0,1}sgn ∈ {−1,+1}sgn ∈ {−1,+1}
TrainingMSE “gradient” descent (on preactivation)Convex QPConvex QP with K
UsesAll training pointsSupport vectors onlySupport vectors only
Solves XORNoNoYes (right K)
ConvergenceIff linearly sep. + small εAlways (unique optimum)Always

Full algorithm reference (perceptron training, SVM QP, kernel eval, soft-margin) → ALGORITHMS (full reference) ⭐

Practice quiz

Targeted exam questions in Questions for Methods of AI

  • Q74–79 (basic: SVM hyperplane, support vectors, kernel trick, perceptron) · Q84 (perceptron) · Q94 (perceptron + XOR) · Q95 (Kernel Trick) · Q130–132 (deep / exam-trap: Lagrangian sign convention, slack ξᵢ + C trade-off, Mercer’s condition)

Atomic notes

Sibling Lernzettel

See also

Tags: methods-of-ai lernzettel ai-generated
Full reference: Support Vector Machines
Quiz: quiz_svm_30-04-26
Superlink: Methods of AI Lecture
Questions hub: Questions for Methods of AI

Created: 30/04/26