ML III — SVMs & Perceptron

chatbot methods-of-ai

Companion files for exam prep:

Table of contents

  1. Core Ideas
  2. Glossary
  3. SVM — Maximum Margin Classifier
  4. Lagrangian & KKT conditions
  5. Decision function
  6. Kernel Trick
  7. Common Kernels
  8. Soft-Margin SVM (supplementary)
  9. Perceptron
  10. XOR problem
  11. Formulas & Notation
  12. Common Exam Traps
  13. Visualisations (Python)
  14. ALGORITHMS ⭐ — pseudocode for all 4
  15. Worked examples
  16. Quick comparison table
  17. Original atomic content (preserved)
  18. See also

Core Ideas

  • SVM: find the hyperplane that maximizes the margin between two classes. Only the support vectors (closest points) determine the hyperplane — all other points have αᵢ = 0.
  • Margin = 2/‖w‖. Maximizing margin ⇔ minimizing ½‖w‖² subject to yᵢ(⟨xᵢ,w⟩+b) ≥ 1.
  • Kernel trick: replace the dot product ⟨x,y⟩ with K(x,y) = ⟨Φ(x),Φ(y)⟩ to classify in higher-dimensional spaces without computing Φ explicitly.
  • Perceptron: single-layer linear classifier with Heaviside output Θ(s) ∈ {0,1} (NOT sign). Provably convergent if data is linearly separable. Cannot solve XOR.
  • SVMs handle non-linearity natively via kernels; perceptrons need multi-layer extension (MLP) for non-linear tasks.

Glossary — important SVM/Perceptron vocabulary ⭐

Hyperplane H — an (n−1)-dimensional separating surface in Rⁿ, defined by H = {x ∈ Rⁿ | ⟨w,x⟩ + b = 0}.

  • Example: in R² a hyperplane is a line; in R³ it’s a plane.

Weight vector w ∈ Rⁿ — vector orthogonal to the hyperplane. Determines orientation; its magnitude ‖w‖ inversely sets the margin width.

Bias b ∈ R — offset of the hyperplane from the origin along w. (In the perceptron, the bias is absorbed as w₀ by prepending a constant input x₀ = 1.)

Margin — distance between the two parallel “support” hyperplanes ⟨w,x⟩+b = +1 and ⟨w,x⟩+b = −1. Equals 2/‖w‖. SVM maximizes this.

Support vector — a training point that lies exactly on one of the two margin hyperplanes. Only these points have αᵢ > 0. They alone determine w.

  • Visual cue: in a 2D drawing, the support vectors are the points that “touch” the margin lines.

Lagrange multiplier αᵢ ≥ 0 — one per training point. αᵢ > 0 ⇔ point i is a support vector. From the KKT condition αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1) = 0.

Kernel K(x,y) — a function K: X × X → R such that there exists a mapping Φ: X → H with K(x,y) = ⟨Φ(x), Φ(y)⟩. Lets us compute dot products in feature space without ever computing Φ.

RBF (Radial Basis Function) — the Gaussian kernel K(x,y) = exp(−‖x−y‖²/2σ²). “Radial” = depends only on the distance ‖x−y‖; “basis function” = each SV is a localized bump. Maps into an infinite-dim feature space. σ (or γ = 1/2σ²) sets bump width.

PSD (Positive Semi-Definite) (suppl.) — property of the kernel/Gram matrix K: cᵀKc ≥ 0 for all c (all eigenvalues ≥ 0). Mercer: K PSD ⟺ ∃ φ with K(x,y)=⟨φ(x),φ(y)⟩, and it keeps the SVM dual convex.

Slack variable ξᵢ ≥ 0 (soft-margin, supplementary) — how much constraint i is allowed to be violated. ξᵢ = 0 → no violation; 0 < ξᵢ < 1 → inside margin but correctly classified; ξᵢ ≥ 1 → misclassified.

C parameter (soft-margin, supplementary) — trade-off knob between margin size and number of violations. Large C → small tolerance for misclassification (narrow margin, overfitting risk); small C → large margin, more violations allowed (more regularization).

Heaviside step Θ(s) — perceptron output activation: Θ(s) = 1 if s ≥ 0, else 0. (NOT sign(s) ∈ {−1,+1}.)

Preactivation s = w·x = w₀·1 + w₁x₁ + … + w_d x_d — the linear combination before applying Θ. The “gradient” in perceptron training is taken w.r.t. this preactivation (slide 44).

Linear separability — the property that two classes can be separated by a hyperplane. Necessary condition for perceptron convergence.


SVM — Maximum Margin Classifier

Setup

Training data: pairs (x₁,y₁), …, (x_N,y_N) with xᵢ ∈ Rⁿ and yᵢ ∈ {+1, −1}.
Find f: Rⁿ → {+1, −1} with f(xᵢ) = yᵢ on training data; classify unseen x via f(x).

Hyperplane

A separating affine hyperplane:

H = {x ∈ Rⁿ | ⟨w, x⟩ + b = 0}

with w ∈ Rⁿ orthogonal to H, b ∈ R.

  • ⟨w, x⟩ + b > 0 → positive class
  • ⟨w, x⟩ + b < 0 → negative class
  • ⟨w, x⟩ + b = 0 → on the hyperplane

Two parallel margin hyperplanes

We canonicalize by demanding:

⟨w, x₁⟩ + b = +1 (closest positive point)
⟨w, x₂⟩ + b = −1 (closest negative point)

Subtracting: ⟨w, x₁ − x₂⟩ = 2 ⇒ ⟨w/‖w‖, x₁ − x₂⟩ = 2/‖w‖ = the margin.

Optimization problem

minimize τ(w) = ½ · ‖w‖²
subject to yᵢ · (⟨xᵢ, w⟩ + b) ≥ 1 for all i

This is a convex constrained quadratic programming (QP) problem — has a unique global optimum.


Lagrangian & KKT conditions

We introduce one Lagrange multiplier αᵢ ≥ 0 per constraint:

L(w, b, α) = ½‖w‖² − Σᵢ αᵢ · (yᵢ · (⟨xᵢ, w⟩ + b) − 1)

Maximizing the αᵢ while minimizing w and b (saddle-point problem) yields:

w = Σᵢ αᵢ · yᵢ · xᵢ

⇒ The weight vector is a linear combination of the training examples, weighted by αᵢ·yᵢ.

Karush-Kuhn-Tucker (KKT) saddle-point condition

αᵢ · (yᵢ · (⟨xᵢ, w⟩ + b) − 1) = 0 for all i

⇒ For each i, either αᵢ = 0 or yᵢ(⟨xᵢ,w⟩+b) = 1 (point lies exactly on the margin).

  • αᵢ > 0 ⇒ point i is a support vector (on the margin)
  • αᵢ = 0 ⇒ point i is interior to its side; does not influence w at all

Hence:

w = Σᵢ : xᵢ is support vector αᵢ · yᵢ · xᵢ

How are support vectors chosen?

They are not chosen — they fall out of the optimization. There is no selection heuristic; the QP solver assigns every point an αᵢ, and the support vectors are simply the points whose margin constraint ends up active (tight). The KKT complementary-slackness condition above forces exactly one of two cases per point:

Point’s situationConstraint yᵢ(⟨xᵢ,w⟩+b)MultiplierSupport vector?
strictly outside the margin (correct, far away)> 1 (slack)αᵢ = 0No
exactly on the margin boundary= 1 (tight/active)αᵢ > 0Yes

A far-away point has a non-tight constraint, so the only way to satisfy αᵢ·(…−1)=0 is αᵢ = 0 → it drops out. Only points on the margin can carry αᵢ > 0.

Consequences (exam-favourites):

  • The hyperplane depends only on the support vectors. Delete every non-SV and retrain → identical boundary.
  • Move/remove a support vector → the boundary changes. They “hold the margin in place.”
  • This is SVM sparsity: typically few SVs, so prediction sums over them only.

Soft-margin: support vectors come in two flavours (suppl., C/slack — not in Session 11 slides)

With slack ξᵢ and the box bound 0 ≤ αᵢ ≤ C:

  • 0 < αᵢ < C → point lies exactly on the margin.
  • αᵢ = C → point violates the margin (inside it or misclassified).

So soft-margin SVs = points on or inside/violating the margin. (See Soft-Margin SVM (non-separable data) ⚠️ supplementary.)

⚠️ Trap: “support vectors = the closest points” is only loosely true. Precisely, they are the points with active constraints (αᵢ > 0) — on the margin (hard case) or violating it (soft case). It’s the optimization, not a distance ranking, that picks them out.


Decision function

For a new input x:

f(x) = sgn(Σᵢ αᵢ · yᵢ · ⟨x, xᵢ⟩ + b)

The sum runs over training points; only support vectors actually contribute (the rest have αᵢ = 0). The decision uses only dot products ⟨x, xᵢ⟩ — which sets up the kernel trick.


Kernel Trick

Replace every dot product in the decision function with a kernel K:

f(x) = sgn(Σᵢ αᵢ · yᵢ · K(x, xᵢ) + b)

where K(x, x′) = ⟨Φ(x), Φ(x′)⟩ for some mapping Φ: X → H to a (possibly very high-dim) feature space H.

Why this is powerful. Φ may map to a 10⁶-dimensional or even infinite-dimensional space, but K(x,x′) is computed in the original input space — typically in O(n) operations. We never have to actually compute Φ(x).

Key requirement. K must correspond to an inner product in some Hilbert space. The textbook (R&N) calls this Mercer’s conditionnot in Session 11 slides; supplementary.

Mercer’s condition makes precise what “K is a valid kernel” means: the Gram matrix K (with Kᵢⱼ = K(xᵢ, xⱼ) over any finite set of inputs) must be positive semi-definite (PSD):

cᵀ K c ≥ 0 for all vectors c ⟺ all eigenvalues of K are ≥ 0

This buys two things:

  1. A feature map is guaranteed to exist. PSD ⟺ there is some φ with K(x, y) = ⟨φ(x), φ(y)⟩. So you can treat K as an honest dot product in feature space without ever building φ — exactly the kernel trick.
  2. The dual QP stays convex. The dual objective W(α) = Σαᵢ − ½ ΣᵢΣⱼ αᵢαⱼ yᵢyⱼ K(xᵢ,xⱼ) has the kernel matrix as its quadratic form. PSD ⇒ concave maximization ⇒ one global optimum, reliably solvable. A non-PSD kernel makes the objective non-convex → local optima → the QP solver can return garbage.

⚠️ Exam traps: (i) it’s semi-definite — eigenvalues ≥ 0, not strictly > 0; (ii) it’s the Gram matrix that must be PSD, not the raw data; (iii) symmetry alone is not enough — symmetric + PSD is the full condition.

🔗 Full deep dive with worked polynomial-kernel verification, RBF visualisation, and Mercer details: see Kernel Trick.


Common Kernels

KernelFormulaHyper-parameters
LinearK(x, y) = ⟨x, y⟩
PolynomialK(x, y) = (⟨x, y⟩ + c)^dc ≥ 0, degree d ∈ ℕ
RBF / GaussianK(x, y) = exp(−‖x − y‖² / 2σ²)σ > 0 (width)

RBF = Radial Basis Function. “Radial” → the kernel depends only on the distance ‖x−y‖ (a radius), not on direction or absolute position; “basis function” → each support vector acts as a localized bump, and the decision function is a weighted sum of these bumps. It implicitly maps into an infinite-dimensional feature space, which is why it can separate almost any dataset.

⚠️ Notation: many sources write the RBF kernel as exp(−γ · ‖x−y‖²) with γ = 1/(2σ²). The two are equivalent; the Session 11 slide uses the σ-form (the γ-form appears later only for the soft tree kernel). σ (or γ) controls the bump width: small σ / large γ → narrow bumps → overfitting; large σ / small γ → smooth, near-linear boundary.


Soft-Margin SVM (non-separable data) ⚠️ supplementary

Not covered in Session 11 slides — from R&N textbook background. May appear in oral exams.

Real data is rarely perfectly linearly separable. Soft-margin SVM relaxes the hard constraint by introducing slack variables ξᵢ ≥ 0:

minimize ½‖w‖² + C · Σᵢ ξᵢ
subject to yᵢ(⟨xᵢ, w⟩ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0

  • ξᵢ = 0 → point i is on the correct side of the margin (good).
  • 0 < ξᵢ < 1 → inside the margin but still correctly classified.
  • ξᵢ ≥ 1 → misclassified.

C parameter controls the trade-off:

  • Large C → heavy penalty on slack → small margin, few violations → risk of overfitting.
  • Small C → cheap slack → large margin, more violations allowed → more regularization.
  • C = ∞ recovers the hard-margin SVM; C = 0 makes the algorithm ignore the data entirely.

Perceptron

🔗 Full atomic note with biological-neuron motivation, AND/XOR Python demo, and history of where perceptrons are used today: see Perceptron.

Architecture (slide 36–37)

For an input x ∈ Rᵈ:

Preactivation: s = w₀ + w₁x₁ + w₂x₂ + … + w_d x_d
Output (activation): y = Θ(s) = 1 if s ≥ 0, else 0

w₀ acts as a threshold. To unify notation, we prepend a constant input x₀ = 1:

x = (1, x₁, …, x_d)ᵀ ∈ R^(d+1)
w = (w₀, w₁, …, w_d)ᵀ ∈ R^(d+1)
s = x · w, y = Θ(x · w)

⚠️ Slide convention: the perceptron output is Θ ∈ {0, 1} (Heaviside), NOT sign(·) ∈ {−1, +1}. Many textbooks use the sign-version; the slide deck (and the exam) uses Heaviside.

Decision surface (slide 38)

Perceptron represents the hypothesis space {w | w ∈ R^(d+1)}. The decision surface is the (d−1)-dim hyperplane orthogonal to w. Points with x·w > 0 get y = 1; the rest y = 0.

Class separability (slide 39)

A perceptron can solve a classification task only if the two classes are linearly separable — separable by some hyperplane.

Training rule (derivation, slides 43–45)

The learning rule is derived from minimizing the mean square error (MSE):

E[w; D] = ½ · Σₙ (tⁿ − y(xⁿ))²

with tⁿ ∈ {0, 1} the target label.

⚠️ Why this is “gradient descent in scare quotes”: the activation Θ is not differentiable at 0 and has zero gradient elsewhere — so true gradient descent is impossible. Slide 44 sidesteps this by taking the gradient w.r.t. the preactivation w·x (i.e., treating y(x) as if it were w·x for the gradient step). The result is the standard perceptron update rule.

Δw = ε · Σₙ (tⁿ − y(xⁿ)) · xⁿ
w^(t+1) ← w^(t) + Δw

with ε a small learning rate.

Convergence (slide 45, summary slide 47)

Convergence is proven if:

  1. ε is sufficiently small, AND
  2. The task is solvable by a perceptron (data is linearly separable).

Larger margins → larger hypothesis space (many valid w) → faster convergence.

Batch vs. stochastic mode (slide 46)

ModeUpdate ruleWhen
BatchΔw = ε · Σₙ (tⁿ − y(xⁿ)) · xⁿAfter seeing all examples
StochasticΔw = ε · (tⁿ − y(xⁿ)) · xⁿAfter one randomly chosen example

For small enough ε, stochastic gradient descent approximates batch arbitrarily closely.


XOR problem

The problem (slide 41)

A 2-input perceptron with constant bias can implement AND, OR, NAND, NOR — but not XOR. XOR’s positive examples ((0,1) and (1,0)) and negative examples ((0,0) and (1,1)) are not linearly separable in R².

The fix (slide 42)

  1. Nonlinear feature mapping: transform input x via some f(x) into a feature space where XOR is linearly separable.
  2. Add an extra input channel: e.g. x₃ = x₁ · x₂. Now the 3D representation is linearly separable.

This is the same idea SVMs formalize as the kernel trick.

Historical consequence (slide 47, exact wording)

The fact that perceptrons cannot solve XOR “led to a temporary drop in the interest in ANN”.

⚠️ The dramatic phrasing “AI winter” and the famous Minsky & Papert (1969) book are textbook background, not present in the slide deck. The slides only say “temporary drop in interest”.


Formulas & Notation

SymbolMeaning
w ∈ RⁿWeight vector (orthogonal to hyperplane)
b ∈ RBias (SVM); absorbed as w₀ in perceptron
HSeparating hyperplane
2/‖w‖Margin width
αᵢ ≥ 0Lagrange multiplier (αᵢ > 0 ⇔ support vector)
L(w,b,α)SVM Lagrangian
K(x, y)Kernel function
Φ(x)Feature mapping into high-dim H
ξᵢ ≥ 0Slack variable (soft-margin; supplementary)
CSoft-margin regularization parameter (supplementary)
Θ(s)Heaviside step ∈ {0,1} (perceptron output)
s = w·xPerceptron preactivation
εPerceptron learning rate
tⁿ ∈ {0,1}Target label of training example n

Common Exam Traps ⚠️

  • Only support vectors matter. Non-support-vector training points have αᵢ = 0 and contribute nothing to w. This is the core insight of SVMs.
  • Lagrangian sign convention. The standard SVM Lagrangian uses a minus sign in front of the αᵢ-term (because the constraint is yᵢ(⟨xᵢ,w⟩+b) − 1 ≥ 0 and we subtract the constraint × multiplier for a minimization). Getting this sign wrong flips the saddle-point direction.
  • Margin = 2/‖w‖, NOT 1/‖w‖. The factor 2 comes from the distance between the two parallel margin hyperplanes (+1 and −1 sides).
  • Large C ≠ better. Large C prioritizes correct classification over a large margin → overfitting risk. (Supplementary — not in slides.)
  • Kernel trick ≠ explicit mapping. The kernel computes the dot product in feature space without ever evaluating Φ. This makes infinite-dim feature spaces (RBF) tractable.
  • SVM maximizes margin, not training accuracy. A wider-margin solution can be preferred even if a narrower one classifies all training points correctly.
  • Perceptron convergence theorem is conditional. Guaranteed ONLY if data is linearly separable. If not, it never converges.
  • Perceptron output is Θ(s) ∈ {0,1}, not sign(s) ∈ {−1,+1}. Be careful when reading textbook problems that use the sign-version.
  • Bias in perceptron is absorbed. w₀ is just another weight, with a constant input x₀ = 1. No separate update rule.
  • “Gradient” descent on perceptron MSE is misleading. Θ is not differentiable. Slide 44 takes the gradient w.r.t. the preactivation w·x, not the output y.
  • XOR → “temporary drop in interest in ANN” (slide 47 wording). “AI winter” and “Minsky & Papert” are textbook embellishments.
  • PSD-ness ≠ PD. A valid kernel needs a positive semi-definite Gram matrix (eigenvalues ≥ 0, zeros allowed) — not strictly positive-definite. PSD is what guarantees both the feature map φ and dual convexity; a non-PSD “kernel” breaks the convex QP. (Supplementary — tied to Mercer.)
  • Mercer’s condition, PSD-ness, soft-margin SVM, slack variables, C parameter are NOT in Session 11 slides — supplementary R&N material.

Visualisations (Python)

Figures that bring the math to life. Render each toggle, then read the explanation underneath.

The three circled points are the support vectors the optimizer selected — note they all lie exactly on the dashed margin lines, while every other point is strictly past its margin (and thus irrelevant to w).

What this shows. A toy 2D dataset with 5 positive (red) and 5 negative (blue) points. The solid line is the SVM decision boundary; the dashed lines are the two parallel margin hyperplanes w·x+b = ±1. The two support vectors circled in green — (3,3) and (1,1) — are the only points that touch the margin. Every other point sits strictly inside its own half-plane and contributes αᵢ = 0 to the solution. If you deleted any non-circled point and re-trained, the hyperplane would be unchanged. That is what “only support vectors matter” looks like geometrically. Margin width = 2/‖w‖ = 2√2 ≈ 2.83.


What this shows. Two concentric rings in R² (left) are the classic non-linearly-separable dataset — no straight line can split inner from outer. Apply the mapping φ(x₁, x₂) = (x₁, x₂, x₁² + x₂²): the new third coordinate is the squared radius, which is small (~1) for the inner ring and large (~6.25) for the outer one. In the lifted 3D space (right), a horizontal plane at z ≈ 3 perfectly separates the classes — i.e. the problem is now linear. This is the kernel trick in action: instead of computing φ explicitly, an SVM with a polynomial kernel K(x,y) = (⟨x,y⟩+1)² evaluates the inner product in this lifted space directly in O(d) input-space ops.


What this shows. Three snapshots of the perceptron’s decision boundary while training on the OR gate. After epoch 1 the weights have barely separated anything; by epoch 3 the boundary is sweeping into a usable region; by epoch 10 the perceptron has converged to a valid OR-separating line (the red-shaded region covers exactly the three positive points (0,1), (1,0), (1,1) while leaving (0,0) on the y=0 side). Convergence is guaranteed because OR is linearly separable — that is the Perceptron Convergence Theorem. Try changing the dataset to XOR and the boundary will oscillate forever, since no straight line works.


What this shows. Supplementary — soft-margin is NOT in Session 11 slides. The same partially-overlapping 2D dataset trained with three values of C. Left (C=0.1): large penalty on ‖w‖, low penalty on violations → wide margin, many points lie inside it (these become bounded support vectors with αᵢ = C). Middle (C=1): balanced. Right (C=100): algorithm strongly insists on correct classification → narrow margin, fewer SVs, potential overfit. As C → ∞ you recover the hard-margin SVM; as C → 0 the algorithm ignores the data. This is the canonical bias-variance knob for SVMs.


ALGORITHMS (full reference) ⭐

1. Perceptron training (online + batch) ⭐

Goal. Iteratively update w to classify all training points correctly (if possible).

Online (stochastic) mode

PERCEPTRON-ONLINE(D, ε, max_epochs):
    w ← 0   # or small random
    for epoch in 1..max_epochs:
        any_update ← false
        for each (xⁿ, tⁿ) in shuffle(D):
            xⁿ ← prepend(1, xⁿ)              # absorb bias as x₀ = 1
            yⁿ ← Θ(w · xⁿ)                   # Heaviside ∈ {0, 1}
            if yⁿ ≠ tⁿ:
                Δw ← ε · (tⁿ − yⁿ) · xⁿ
                w ← w + Δw
                any_update ← true
        if not any_update: return w            # converged
    return w

Batch mode

PERCEPTRON-BATCH(D, ε, max_epochs):
    w ← 0
    for epoch in 1..max_epochs:
        gradient ← 0
        for each (xⁿ, tⁿ) in D:
            xⁿ ← prepend(1, xⁿ)
            yⁿ ← Θ(w · xⁿ)
            gradient ← gradient + (tⁿ − yⁿ) · xⁿ
        Δw ← ε · gradient
        if ‖Δw‖ < tol: return w
        w ← w + Δw
    return w

Convergence. Guaranteed iff (i) the data is linearly separable, (ii) ε is small enough.
Update rule per example. Δw = ε · (tⁿ − yⁿ) · xⁿ — equals 0 when the prediction is already correct (so correctly-classified points stop influencing w).
Cost per epoch. O(|D| · (d+1)).


2. SVM constrained QP optimization (conceptual) ⭐

Goal. Solve the primal problem

min ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩+b) ≥ 1 for all i.

SVM-TRAIN(D = {(xᵢ, yᵢ)}):
    # Step 1: build the Lagrangian
    L(w, b, α) = ½‖w‖² − Σᵢ αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1)
                 with αᵢ ≥ 0
 
    # Step 2: stationarity conditions
    ∂L/∂w = w − Σᵢ αᵢ yᵢ xᵢ = 0   ⇒   w = Σᵢ αᵢ yᵢ xᵢ
    ∂L/∂b = −Σᵢ αᵢ yᵢ = 0          ⇒   Σᵢ αᵢ yᵢ = 0
 
    # Step 3: substitute → dual problem (only αᵢ left)
    maximize  W(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩
    subject to  αᵢ ≥ 0,  Σᵢ αᵢ yᵢ = 0
 
    # Step 4: solve the dual (convex QP) with any QP solver
    α* ← QP-SOLVE(dual)
 
    # Step 5: recover w and b
    w  ← Σᵢ α*ᵢ yᵢ xᵢ
    SV ← { i | α*ᵢ > 0 }                     # support vectors
    b  ← yₛ − ⟨w, xₛ⟩  for any s ∈ SV         # (average over all SVs in practice)
    return (w, b, α*, SV)

Key takeaway. The QP is convex → unique global optimum. The decision function depends only on dot products ⟨xᵢ, xⱼ⟩ — which is exactly where kernels plug in (just replace ⟨xᵢ, xⱼ⟩ with K(xᵢ, xⱼ)).


3. Kernel evaluation

Goal. Compute K(x, x′) = ⟨Φ(x), Φ(x′)⟩ without ever building Φ.

KERNEL-EVAL(x, x', type, params):
    if type == "linear":
        return dot(x, x')
 
    if type == "polynomial":         # K(x,y) = (⟨x,y⟩ + c)^d
        c, d ← params
        return (dot(x, x') + c) ** d
 
    if type == "rbf":                # K(x,y) = exp(−‖x−y‖² / 2σ²)
        σ ← params
        return exp(−norm(x − x')**2 / (2 * σ**2))

Classification with kernels.

SVM-CLASSIFY(x, SV, α, y, b, K):
    s ← b
    for each i in SV:
        s ← s + αᵢ · yᵢ · K(x, xᵢ)
    return sgn(s)

Cost per prediction. O(|SV| · n) — does not depend on the (possibly infinite) feature-space dimension.


4. Soft-margin SVM extension (supplementary)

Not in Session 11 slides — R&N textbook material.

SVM-SOFT(D, C):
    # Primal:
    minimize ½‖w‖² + C · Σᵢ ξᵢ
    subject to yᵢ(⟨xᵢ,w⟩+b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0
 
    # Dual (the only thing that changes vs. hard-margin):
    maximize  W(α) = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
    subject to  0 ≤ αᵢ ≤ C,   Σᵢ αᵢ yᵢ = 0
 
    α* ← QP-SOLVE(dual)
    # same w, b, classification as hard-margin

Effect of C. The single change vs. hard-margin: αᵢ is now bounded above by C. Points with αᵢ = C are “bound” support vectors lying inside or beyond the margin. Smaller C ⇒ more bounded SVs ⇒ wider margin ⇒ more regularization.


Worked examples

SVM 2D worked example — identify support vectors

Setup. 4 training points in R²:

  • Positive class (y = +1): (3, 3), (4, 5)
  • Negative class (y = −1): (1, 1), (2, 0)

Eyeballing, the two classes are separated by a line roughly running NW–SE between them. The maximum-margin separator is the line equidistant from the closest positive and closest negative point.

Step 1 — find closest pairs.

  • (3,3)+ is the closest positive to the negatives.
  • (1,1)− is the closest negative to (3,3)+.
  • The midpoint of (3,3) and (1,1) is (2, 2), and the line normal to (3,3)−(1,1) = (2, 2) through (2, 2) is the candidate separator: x₁ + x₂ = 4.

Step 2 — check the other points.

  • (4, 5): 4 + 5 = 9 ≥ 4 ✓ (correct side, well beyond margin)
  • (2, 0): 2 + 0 = 2 ≤ 4 ✓ (correct side, well beyond margin)

So the two non-support points (4,5) and (2,0) are interior to their sides — they get αᵢ = 0 and have no influence on w. The support vectors are exactly (3, 3) and (1, 1) — αᵢ > 0 for these two only.

Step 3 — read off w and margin.

  • Direction of w: (3,3) − (1,1) = (2, 2), normalized: (1/√2, 1/√2). The SVM solution scales w so that the two SVs sit exactly on the ±1 hyperplanes.
  • Distance between (3,3) and (1,1): √(4 + 4) = 2√2.
  • Margin = that distance = 2√2 ⇒ 2/‖w‖ = 2√2 ⇒ ‖w‖ = 1/√2.

Punchline. If you removed (4,5) or (2,0) and re-ran the SVM, the answer would be identical. If you removed (3,3) or (1,1), the entire hyperplane would shift. That is what “only support vectors matter” means in practice.


Perceptron training trace

Setup. Learn the AND function with 2 inputs (Heaviside output, slide convention).

Training data (with absorbed bias x₀ = 1):

nxⁿ = (1, x₁, x₂)tⁿ
1(1, 0, 0)0
2(1, 0, 1)0
3(1, 1, 0)0
4(1, 1, 1)1

Initialize w = (0, 0, 0), ε = 1, stochastic mode.

Epoch 1:

stepxⁿs = w·xy = Θ(s)tⁿ(t−y)Δw = ε(t−y)xⁿnew w
1(1,0,0)010−1(−1, 0, 0)(−1, 0, 0)
2(1,0,1)−1000(0, 0, 0)(−1, 0, 0)
3(1,1,0)−1000(0, 0, 0)(−1, 0, 0)
4(1,1,1)−101+1(+1, +1, +1)(0, 1, 1)

Epoch 2:

stepxⁿsyt(t−y)Δwnew w
1(1,0,0)010−1(−1,0,0)(−1, 1, 1)
2(1,0,1)010−1(−1,0,−1)(−2, 1, 0)
3(1,1,0)−1000(0,0,0)(−2, 1, 0)
4(1,1,1)−101+1(+1,+1,+1)(−1, 2, 1)

Epoch 3 (check): with w = (−1, 2, 1):

  • (1,0,0): s = −1, y = 0 ✓
  • (1,0,1): s = 0, y = 1 ✗ ⇒ update Δw = (−1, 0, −1) ⇒ w = (−2, 2, 0)
  • (1,1,0): s = 0, y = 1 ✗ ⇒ update Δw = (−1, −1, 0) ⇒ w = (−3, 1, 0)
  • (1,1,1): s = −2, y = 0 ✗ ⇒ update Δw = (+1, +1, +1) ⇒ w = (−2, 2, 1)

Training continues; a valid AND-separating w exists (e.g. w = (−3, 2, 2): check (1,0,0)→s=−3→0, (1,0,1)→s=−1→0, (1,1,0)→s=−1→0, (1,1,1)→s=+1→1 ✓), and convergence to some valid w is guaranteed because AND is linearly separable.

Takeaway. Whenever t = y, the update is zero — correctly-classified points stop being touched. Whenever t ≠ y, w moves toward (if t=1) or away from (if t=0) the misclassified input.


Quick comparison table

PerceptronSVM (linear)SVM (kernel)
Decision boundaryLinear hyperplaneLinear, max-marginNon-linear (via Φ)
OutputΘ(s) ∈ {0, 1}sgn(…) ∈ {−1, +1}sgn(…) ∈ {−1, +1}
Training”Gradient” descent on MSE w.r.t. preactivationConvex constrained QPConvex QP in dual form (uses K)
Key vectorsAll training data (all weights update)Support vectors only (αᵢ>0)Support vectors only
Non-linearityNo (single layer)NoYes (via kernel)
ConvergenceIff linearly separable + small εAlways (unique global optimum)Always
Solves XOR?No (single layer)No (linear)Yes (with right K)
Cost @ inferenceO(d)O(d)O(|SV| · d)

Original atomic content (preserved)

Below: the prior atomic-note content kept verbatim — high-level summary, multiclass, applications, tree kernels.

Basics and Motivation

  • Supervised Learning: SVMs are used for classification tasks, learning from labeled data to classify new data points.
  • Motivation:
    • Linear Separation: Single-layer neural networks efficiently learn linear functions.
    • Nonlinear Separation: Multilayer networks can learn nonlinear functions but are harder to train.
    • SVM Idea: Maps input space to a higher-dimensional space for linear separation.

Multiclass Classification

  • One-vs-All: Trains k binary SVM models for k classes.
  • One-vs-One: Trains k(k-1)/2 SVM models for class pairs.

Applications

  • Face recognition
  • Text categorization
  • Bioinformatics
  • Character recognition
  • Financial time series prediction
  • Health risk prediction

Tree Kernel

  • Parse Tree Kernels: Applied to structured data like syntax trees / DOM trees.
  • The WiSe 2024_25 SVM lecture lists the following named variants (Geibel, Gust & Kühnberger 2007 review):
VariantOrigin / domainIdea
Parse Tree KernelCollins & Duffy (NLP, 2001)Counts shared subtrees between two parse trees
SimTK / TagTKDOM trees (HTML/XML)Adapt parse-tree kernels to HTML’s tag-attribute structure
Leaf-Aligned Tree Kernel (LeftTK)NLP / structured docsAligns leaf nodes before comparing — robust to small structural variation
Set Tree KernelGenericCounts shared subsets of children, not just identical subtrees
Subset Tree KernelCollins & Duffy variantTree kernel restricted to contiguous subtree fragments
Soft Tree KernelGeibel et al.Allows partial matches between subtrees (similarity, not equality)

Why care: tree kernels let an SVM classify structured data (parse trees, DOM trees, RNA secondary structure) without flattening them to feature vectors first — keeps the structure in the inner product.

Summary of Features

  • Mathematical Foundation: Strong basis developed by Vapnik and Chervonenkis.
  • De-facto Standard: Long a standard in machine learning.
  • Flexibility: Learns both linear and nonlinear decision functions.
  • Efficiency: Efficient to train with well-chosen kernel functions.
  • Support Vectors: Reduces model complexity by using only relevant data points.

See also

Related atomic notes:

  • Perceptron — biological-neuron motivation, AND/XOR Python demo, where perceptrons are used today
  • Kernel Trick — polynomial-kernel concrete verification, RBF visualisation, Mercer’s condition, NTK connection

Companion exam-prep files:

Superlinks:

Tags: methods-of-ai atomic-note ai-generated

Erstellt: 14-02-25 12:12 · Merged: 23/05/26