Kernel Trick

methods-of-ai

The Kernel Trick lets a linear classifier (like SVM) classify non-linearly separable data — without ever computing the high-dimensional feature map. Instead of mapping each point to a richer feature space and taking dot products there, you compute a single kernel function that gives the dot product directly.

It’s one of the most elegant ideas in machine learning. The same trick generalizes to PCA (kernel PCA), regression (kernel ridge), clustering (spectral methods), and more.

The mechanism

SVM only depends on dot products between data points: ⟨xᵢ, xⱼ⟩. So if you can compute the dot product in some high-dimensional space implicitly, you get the benefits of the mapping without the cost.

A kernel is a function K: X × X → ℝ that secretly computes a dot product in some feature space:

K(x, y) = ⟨φ(x), φ(y)⟩

where φ is a mapping into a (possibly infinite-dimensional) feature space. Crucially: you never need to compute φ(x) explicitly — only K(x, y) matters.

See the trick concretely — the polynomial kernel

The 2D polynomial kernel K(x, y) = (⟨x, y⟩ + 1)² is equivalent to a dot product in a 6-dimensional feature space:

φ(x₁, x₂) = (x₁², x₂², √2·x₁·x₂, √2·x₁, √2·x₂, 1)

Let’s verify both compute the same thing:

🐍 Code anzeigen / ausblenden

import math
 
def explicit_phi(x):
    """Map (x1, x2) into 6D feature space."""
    x1, x2 = x
    return (
        x1**2, x2**2, math.sqrt(2) * x1 * x2,
        math.sqrt(2) * x1, math.sqrt(2) * x2, 1.0
    )
 
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
 
def explicit_kernel(x, y):
    """Map both to 6D, then dot product. Cost: O(d²) ops."""
    return dot(explicit_phi(x), explicit_phi(y))
 
def polynomial_kernel(x, y):
    """The 'trick' version. Cost: O(d) ops."""
    return (dot(x, y) + 1) ** 2
 
x = (2, 3); y = (1, 4)
print("Explicit (via φ):    ", explicit_kernel(x, y))    # → 256.0
print("Polynomial kernel:   ", polynomial_kernel(x, y))  # → 256.0
print(f"6D phi(x): {explicit_phi(x)}")                   # for intuition

Both give 225.0 — same number, but the kernel never computes the 6D map. Saved cost = no need to store or compute φ.

Visual: same data, three kernels, three decision boundaries

Linear, polynomial, and RBF kernels on the same non-linearly-separable dataset. Linear fails, polynomial bends, RBF curves arbitrarily.

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install required packages first.
# In normal Python (terminal / Jupyter), delete the next 4 lines.
import micropip
await micropip.install("matplotlib")
await micropip.install("scikit-learn")
 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_circles
 
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 300), np.linspace(-1.5, 1.5, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
 
kernels = [('linear', {}), ('poly', {'degree': 3}), ('rbf', {'gamma': 2})]
fig, axes = plt.subplots(1, 3, figsize=(13.5, 4.5))
for ax, (kernel, params) in zip(axes, kernels):
    svm = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    Z = svm.decision_function(grid).reshape(xx.shape)
    acc = svm.score(X, y)
    ax.contourf(xx, yy, Z, levels=[-100, 0, 100], colors=['#ffe5e5', '#e5f0ff'])
    ax.contour(xx, yy, Z, levels=[0], colors='k', linewidths=1.5)
    ax.scatter(X[y == 0, 0], X[y == 0, 1], color='red',  s=20, edgecolor='k')
    ax.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', s=20, edgecolor='k')
    ax.set_title(f"{kernel} kernel — accuracy={acc:.2f}")
    ax.set_xticks([]); ax.set_yticks([])
 
plt.tight_layout(); plt.show()

What to see:

Linear kernel — accuracy ~50 %, the boundary is a straight line, which cannot separate concentric circles. Half the points wrong.
Polynomial kernel (degree 3) — accuracy jumps to ~90–95 %, boundary is a curved shape that mostly wraps the inner cluster.
RBF kernel — accuracy ~100 %, boundary is a tight closed curve around the inner cluster. The infinite-dimensional feature space lets RBF carve arbitrarily-shaped regions.

This is the kernel trick made visible: same SVM optimizer, same data, totally different model classes — the only difference is the kernel function.

Common kernels

Kernel	Formula	Implicit feature space
Linear	`⟨x, y⟩`	Original space — equivalent to plain SVM
Polynomial	`(⟨x, y⟩ + c)^d`	Polynomials of degree ≤ d
RBF (Radial Basis Function = Gaussian)	`exp(−γ ‖x − y‖²)`	Infinite-dimensional — Taylor expansion of exp gives all polynomials
Sigmoid	`tanh(α ⟨x, y⟩ + c)`	Resembles a 2-layer neural network

RBF = Radial Basis Function. “Radial” → depends only on the distance ‖x−y‖ (a radius), not on direction; “basis function” → each support vector is a localized bump. ⚠️ Notation: the Session 11 slide uses the σ-form exp(−‖x−y‖²/2σ²); this is the equivalent γ-form with γ = 1/(2σ²). Small σ / large γ → narrow bumps → overfitting; large σ / small γ → smooth boundary.

RBF is the default for general SVM use — it’s flexible (infinite-dim space), has only one hyperparameter (γ or σ), and works on most problems.

Mercer’s condition — what makes a function a valid kernel

Not every function K defines a valid dot product. Mercer’s theorem:

A symmetric function K(x, y) is a valid kernel if and only if for every finite set of points {x₁, …, xₙ}, the Gram matrix K_ij = K(xᵢ, xⱼ) is positive semi-definite (PSD) — i.e. cᵀKc ≥ 0 for all c, equivalently all eigenvalues ≥ 0.

Why it matters: PSD-ness guarantees that some φ exists such that K(x, y) = ⟨φ(x), φ(y)⟩. Without it, SVM’s quadratic optimization is no longer convex. ⚠️ It’s semi-definite (eigenvalues ≥ 0, not strictly > 0), and it’s the Gram matrix that must be PSD, not the raw data.

In practice: linear, polynomial, RBF, and combinations (sums and products) of valid kernels are always valid kernels. Custom kernels need verification.

💡 Concrete non-PSD example: the sigmoid kernel tanh(α⟨x,y⟩ + c) (row above) is not PSD for most values of α, c — its Gram matrix can have negative eigenvalues. It violates Mercer yet is still used in practice (it’s historically motivated by neural nets and often “works”), which is exactly why PSD is a sufficient theoretical guarantee, not a hard practical gate. This is the standard “name a function that looks like a kernel but isn’t a valid one” exam answer.

Why this works for SVM

The SVM dual problem and the decision function depend on data only via dot products:

Decision: f(x) = sgn( Σᵢ αᵢ · yᵢ · K(x, xᵢ) + b )

→ Replace every ⟨x, xᵢ⟩ with K(x, xᵢ). The optimization remains the same. You get a non-linear decision boundary in the original space, which corresponds to a linear boundary in the (high-dim) feature space.

⚠️ Common exam traps

The kernel doesn’t compute φ. It computes the dot product in feature space. The φ is implicit, sometimes infinite-dimensional (RBF).
RBF kernel’s feature space is infinite-dimensional. This is not a typo — the Gaussian’s Taylor series has infinitely many terms.
Not every function is a kernel. Need Mercer’s condition (PSD Gram matrix).
The “trick” is not just about cost. For RBF you literally cannot compute φ explicitly because the space is infinite-dim. The kernel is the only way to do the computation.

Where the Kernel Trick is used today

SVMs — historically the main use; still in production for medium-data classification (e.g. text categorization, bioinformatics)
Kernel Ridge Regression — same trick on regression problems
Kernel PCA — nonlinear dimensionality reduction
Gaussian Processes — Bayesian regression/classification; the GP kernel is the central object
Spectral clustering — RBF kernel matrix → eigendecomposition for clustering
Bayesian Optimization — GPs (kernel-based) are the standard surrogate model in Optuna, scikit-optimize
Quantum kernels — IBM/Google research on quantum-enhanced kernels for ML

Where the Kernel Trick was challenged — and by what

Domain	Was kernel methods, now …	Why
Image classification	CNNs / Vision Transformers	Deep nets learn hierarchical features from raw pixels — kernels can’t
NLP	Word embeddings → Transformers	Learned representations beat kernel similarity on most language tasks
Large-scale ML (millions of points)	Linear models, deep learning, gradient boosting	Kernel methods are O(n²) or O(n³) — don’t scale beyond ~10k points without approximation
Hyperparameter tuning	Often still kernels (GPs in BayesOpt)	Kernel methods are the dominant tool here, just not branded as “kernel methods”

Where kernel methods still shine: small-to-medium tabular datasets, problems where you need calibrated uncertainty (GPs), and any setting where the inductive bias of a specific kernel matches the problem structure (e.g. spectral methods for periodic data).

Connection to modern deep learning

The Neural Tangent Kernel (NTK) theory shows that infinitely-wide neural networks are exactly equivalent to kernel methods with a specific (and computable) kernel. So in a deep theoretical sense, deep learning is doing something kernel-trick-like under the hood — just with kernels that are learned rather than chosen.

Brain Online

Explorer

Kernel Trick

Kernel Trick

The mechanism

See the trick concretely — the polynomial kernel

Visual: same data, three kernels, three decision boundaries

Common kernels

Mercer’s condition — what makes a function a valid kernel

Why this works for SVM

⚠️ Common exam traps

Where the Kernel Trick is used today

Where the Kernel Trick was challenged — and by what

Connection to modern deep learning

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis