Kernel Trick

methods-of-ai

The Kernel Trick lets a linear classifier (like SVM) classify non-linearly separable data — without ever computing the high-dimensional feature map. Instead of mapping each point to a richer feature space and taking dot products there, you compute a single kernel function that gives the dot product directly.

It’s one of the most elegant ideas in machine learning. The same trick generalizes to PCA (kernel PCA), regression (kernel ridge), clustering (spectral methods), and more.

The mechanism

SVM only depends on dot products between data points: ⟨xᵢ, xⱼ⟩. So if you can compute the dot product in some high-dimensional space implicitly, you get the benefits of the mapping without the cost.

A kernel is a function K: X × X → ℝ that secretly computes a dot product in some feature space:

K(x, y) = ⟨φ(x), φ(y)⟩

where φ is a mapping into a (possibly infinite-dimensional) feature space. Crucially: you never need to compute φ(x) explicitly — only K(x, y) matters.

See the trick concretely — the polynomial kernel

The 2D polynomial kernel K(x, y) = (⟨x, y⟩ + 1)² is equivalent to a dot product in a 6-dimensional feature space:

φ(x₁, x₂) = (x₁², x₂², √2·x₁·x₂, √2·x₁, √2·x₂, 1)

Let’s verify both compute the same thing:

Both give 225.0 — same number, but the kernel never computes the 6D map. Saved cost = no need to store or compute φ.

Visual: same data, three kernels, three decision boundaries

Linear, polynomial, and RBF kernels on the same non-linearly-separable dataset. Linear fails, polynomial bends, RBF curves arbitrarily.

What to see:

  • Linear kernel — accuracy ~50 %, the boundary is a straight line, which cannot separate concentric circles. Half the points wrong.
  • Polynomial kernel (degree 3) — accuracy jumps to ~90–95 %, boundary is a curved shape that mostly wraps the inner cluster.
  • RBF kernel — accuracy ~100 %, boundary is a tight closed curve around the inner cluster. The infinite-dimensional feature space lets RBF carve arbitrarily-shaped regions.

This is the kernel trick made visible: same SVM optimizer, same data, totally different model classes — the only difference is the kernel function.

Common kernels

KernelFormulaImplicit feature space
Linear⟨x, y⟩Original space — equivalent to plain SVM
Polynomial(⟨x, y⟩ + c)^dPolynomials of degree ≤ d
RBF (Radial Basis Function = Gaussian)exp(−γ ‖x − y‖²)Infinite-dimensional — Taylor expansion of exp gives all polynomials
Sigmoidtanh(α ⟨x, y⟩ + c)Resembles a 2-layer neural network

RBF = Radial Basis Function. “Radial” → depends only on the distance ‖x−y‖ (a radius), not on direction; “basis function” → each support vector is a localized bump. ⚠️ Notation: the Session 11 slide uses the σ-form exp(−‖x−y‖²/2σ²); this is the equivalent γ-form with γ = 1/(2σ²). Small σ / large γ → narrow bumps → overfitting; large σ / small γ → smooth boundary.

RBF is the default for general SVM use — it’s flexible (infinite-dim space), has only one hyperparameter (γ or σ), and works on most problems.

Mercer’s condition — what makes a function a valid kernel

Not every function K defines a valid dot product. Mercer’s theorem:

A symmetric function K(x, y) is a valid kernel if and only if for every finite set of points {x₁, …, xₙ}, the Gram matrix K_ij = K(xᵢ, xⱼ) is positive semi-definite (PSD) — i.e. cᵀKc ≥ 0 for all c, equivalently all eigenvalues ≥ 0.

Why it matters: PSD-ness guarantees that some φ exists such that K(x, y) = ⟨φ(x), φ(y)⟩. Without it, SVM’s quadratic optimization is no longer convex. ⚠️ It’s semi-definite (eigenvalues ≥ 0, not strictly > 0), and it’s the Gram matrix that must be PSD, not the raw data.

In practice: linear, polynomial, RBF, and combinations (sums and products) of valid kernels are always valid kernels. Custom kernels need verification.

💡 Concrete non-PSD example: the sigmoid kernel tanh(α⟨x,y⟩ + c) (row above) is not PSD for most values of α, c — its Gram matrix can have negative eigenvalues. It violates Mercer yet is still used in practice (it’s historically motivated by neural nets and often “works”), which is exactly why PSD is a sufficient theoretical guarantee, not a hard practical gate. This is the standard “name a function that looks like a kernel but isn’t a valid one” exam answer.

Why this works for SVM

The SVM dual problem and the decision function depend on data only via dot products:

Decision: f(x) = sgn( Σᵢ αᵢ · yᵢ · K(x, xᵢ) + b )

→ Replace every ⟨x, xᵢ⟩ with K(x, xᵢ). The optimization remains the same. You get a non-linear decision boundary in the original space, which corresponds to a linear boundary in the (high-dim) feature space.

⚠️ Common exam traps

  1. The kernel doesn’t compute φ. It computes the dot product in feature space. The φ is implicit, sometimes infinite-dimensional (RBF).
  2. RBF kernel’s feature space is infinite-dimensional. This is not a typo — the Gaussian’s Taylor series has infinitely many terms.
  3. Not every function is a kernel. Need Mercer’s condition (PSD Gram matrix).
  4. The “trick” is not just about cost. For RBF you literally cannot compute φ explicitly because the space is infinite-dim. The kernel is the only way to do the computation.

Where the Kernel Trick is used today

  • SVMs — historically the main use; still in production for medium-data classification (e.g. text categorization, bioinformatics)
  • Kernel Ridge Regression — same trick on regression problems
  • Kernel PCA — nonlinear dimensionality reduction
  • Gaussian Processes — Bayesian regression/classification; the GP kernel is the central object
  • Spectral clustering — RBF kernel matrix → eigendecomposition for clustering
  • Bayesian Optimization — GPs (kernel-based) are the standard surrogate model in Optuna, scikit-optimize
  • Quantum kernels — IBM/Google research on quantum-enhanced kernels for ML

Where the Kernel Trick was challenged — and by what

DomainWas kernel methods, now …Why
Image classificationCNNs / Vision TransformersDeep nets learn hierarchical features from raw pixels — kernels can’t
NLPWord embeddings → TransformersLearned representations beat kernel similarity on most language tasks
Large-scale ML (millions of points)Linear models, deep learning, gradient boostingKernel methods are O(n²) or O(n³) — don’t scale beyond ~10k points without approximation
Hyperparameter tuningOften still kernels (GPs in BayesOpt)Kernel methods are the dominant tool here, just not branded as “kernel methods”

Where kernel methods still shine: small-to-medium tabular datasets, problems where you need calibrated uncertainty (GPs), and any setting where the inductive bias of a specific kernel matches the problem structure (e.g. spectral methods for periodic data).

Connection to modern deep learning

The Neural Tangent Kernel (NTK) theory shows that infinitely-wide neural networks are exactly equivalent to kernel methods with a specific (and computable) kernel. So in a deep theoretical sense, deep learning is doing something kernel-trick-like under the hood — just with kernels that are learned rather than chosen.

See also

Tags: methods-of-ai machine-learning svm kernel-trick
Created: 18-05-26