The Kernel Trick lets a linear classifier (like SVM) classify non-linearly separable data — without ever computing the high-dimensional feature map. Instead of mapping each point to a richer feature space and taking dot products there, you compute a single kernel function that gives the dot product directly.
It’s one of the most elegant ideas in machine learning. The same trick generalizes to PCA (kernel PCA), regression (kernel ridge), clustering (spectral methods), and more.
The mechanism
SVM only depends on dot products between data points: ⟨xᵢ, xⱼ⟩. So if you can compute the dot product in some high-dimensional space implicitly, you get the benefits of the mapping without the cost.
A kernel is a function K: X × X → ℝ that secretly computes a dot product in some feature space:
K(x, y) = ⟨φ(x), φ(y)⟩
where φ is a mapping into a (possibly infinite-dimensional) feature space. Crucially: you never need to compute φ(x) explicitly — only K(x, y) matters.
See the trick concretely — the polynomial kernel
The 2D polynomial kernel K(x, y) = (⟨x, y⟩ + 1)² is equivalent to a dot product in a 6-dimensional feature space:
φ(x₁, x₂) = (x₁², x₂², √2·x₁·x₂, √2·x₁, √2·x₂, 1)
Let’s verify both compute the same thing:
🐍 Code anzeigen / ausblenden
import mathdef explicit_phi(x): """Map (x1, x2) into 6D feature space.""" x1, x2 = x return ( x1**2, x2**2, math.sqrt(2) * x1 * x2, math.sqrt(2) * x1, math.sqrt(2) * x2, 1.0 )def dot(u, v): return sum(a * b for a, b in zip(u, v))def explicit_kernel(x, y): """Map both to 6D, then dot product. Cost: O(d²) ops.""" return dot(explicit_phi(x), explicit_phi(y))def polynomial_kernel(x, y): """The 'trick' version. Cost: O(d) ops.""" return (dot(x, y) + 1) ** 2x = (2, 3); y = (1, 4)print("Explicit (via φ): ", explicit_kernel(x, y)) # → 256.0print("Polynomial kernel: ", polynomial_kernel(x, y)) # → 256.0print(f"6D phi(x): {explicit_phi(x)}") # for intuition
Both give 225.0 — same number, but the kernel never computes the 6D map. Saved cost = no need to store or compute φ.
Visual: same data, three kernels, three decision boundaries
Linear, polynomial, and RBF kernels on the same non-linearly-separable dataset. Linear fails, polynomial bends, RBF curves arbitrarily.
Linear kernel — accuracy ~50 %, the boundary is a straight line, which cannot separate concentric circles. Half the points wrong.
Polynomial kernel (degree 3) — accuracy jumps to ~90–95 %, boundary is a curved shape that mostly wraps the inner cluster.
RBF kernel — accuracy ~100 %, boundary is a tight closed curve around the inner cluster. The infinite-dimensional feature space lets RBF carve arbitrarily-shaped regions.
This is the kernel trick made visible: same SVM optimizer, same data, totally different model classes — the only difference is the kernel function.
Common kernels
Kernel
Formula
Implicit feature space
Linear
⟨x, y⟩
Original space — equivalent to plain SVM
Polynomial
(⟨x, y⟩ + c)^d
Polynomials of degree ≤ d
RBF (Radial Basis Function = Gaussian)
exp(−γ ‖x − y‖²)
Infinite-dimensional — Taylor expansion of exp gives all polynomials
Sigmoid
tanh(α ⟨x, y⟩ + c)
Resembles a 2-layer neural network
RBF = Radial Basis Function. “Radial” → depends only on the distance ‖x−y‖ (a radius), not on direction; “basis function” → each support vector is a localized bump. ⚠️ Notation: the Session 11 slide uses the σ-formexp(−‖x−y‖²/2σ²); this is the equivalent γ-form with γ = 1/(2σ²). Small σ / large γ → narrow bumps → overfitting; large σ / small γ → smooth boundary.
RBF is the default for general SVM use — it’s flexible (infinite-dim space), has only one hyperparameter (γ or σ), and works on most problems.
Mercer’s condition — what makes a function a valid kernel
Not every function K defines a valid dot product. Mercer’s theorem:
A symmetric function K(x, y) is a valid kernel if and only if for every finite set of points {x₁, …, xₙ}, the Gram matrix K_ij = K(xᵢ, xⱼ) is positive semi-definite (PSD) — i.e. cᵀKc ≥ 0 for all c, equivalently all eigenvalues ≥ 0.
Why it matters: PSD-ness guarantees that some φ exists such that K(x, y) = ⟨φ(x), φ(y)⟩. Without it, SVM’s quadratic optimization is no longer convex. ⚠️ It’s semi-definite (eigenvalues ≥ 0, not strictly > 0), and it’s the Gram matrix that must be PSD, not the raw data.
In practice: linear, polynomial, RBF, and combinations (sums and products) of valid kernels are always valid kernels. Custom kernels need verification.
💡 Concrete non-PSD example: the sigmoid kerneltanh(α⟨x,y⟩ + c) (row above) is not PSD for most values of α, c — its Gram matrix can have negative eigenvalues. It violates Mercer yet is still used in practice (it’s historically motivated by neural nets and often “works”), which is exactly why PSD is a sufficient theoretical guarantee, not a hard practical gate. This is the standard “name a function that looks like a kernel but isn’t a valid one” exam answer.
Why this works for SVM
The SVM dual problem and the decision function depend on data only via dot products:
Decision: f(x) = sgn( Σᵢ αᵢ · yᵢ · K(x, xᵢ) + b )
→ Replace every ⟨x, xᵢ⟩ with K(x, xᵢ). The optimization remains the same. You get a non-linear decision boundary in the original space, which corresponds to a linear boundary in the (high-dim) feature space.
⚠️ Common exam traps
The kernel doesn’t compute φ. It computes the dot product in feature space. The φ is implicit, sometimes infinite-dimensional (RBF).
RBF kernel’s feature space is infinite-dimensional. This is not a typo — the Gaussian’s Taylor series has infinitely many terms.
Not every function is a kernel. Need Mercer’s condition (PSD Gram matrix).
The “trick” is not just about cost. For RBF you literally cannot compute φ explicitly because the space is infinite-dim. The kernel is the only way to do the computation.
Where the Kernel Trick is used today
SVMs — historically the main use; still in production for medium-data classification (e.g. text categorization, bioinformatics)
Kernel Ridge Regression — same trick on regression problems
Kernel PCA — nonlinear dimensionality reduction
Gaussian Processes — Bayesian regression/classification; the GP kernel is the central object
Spectral clustering — RBF kernel matrix → eigendecomposition for clustering
Bayesian Optimization — GPs (kernel-based) are the standard surrogate model in Optuna, scikit-optimize
Quantum kernels — IBM/Google research on quantum-enhanced kernels for ML
Where the Kernel Trick was challenged — and by what
Domain
Was kernel methods, now …
Why
Image classification
CNNs / Vision Transformers
Deep nets learn hierarchical features from raw pixels — kernels can’t
NLP
Word embeddings → Transformers
Learned representations beat kernel similarity on most language tasks
Large-scale ML (millions of points)
Linear models, deep learning, gradient boosting
Kernel methods are O(n²) or O(n³) — don’t scale beyond ~10k points without approximation
Hyperparameter tuning
Often still kernels (GPs in BayesOpt)
Kernel methods are the dominant tool here, just not branded as “kernel methods”
Where kernel methods still shine:small-to-medium tabular datasets, problems where you need calibrated uncertainty (GPs), and any setting where the inductive bias of a specific kernel matches the problem structure (e.g. spectral methods for periodic data).
Connection to modern deep learning
The Neural Tangent Kernel (NTK) theory shows that infinitely-wide neural networks are exactly equivalent to kernel methods with a specific (and computable) kernel. So in a deep theoretical sense, deep learning is doing something kernel-trick-like under the hood — just with kernels that are learned rather than chosen.