Bias-Variance Tradeoff


The bias-variance tradeoff is the central insight in supervised machine learning: a model’s prediction error can be decomposed into three components — bias (systematic error from wrong assumptions), variance (sensitivity to training data fluctuations), and irreducible noise.

You can reduce bias OR variance, but with a fixed amount of training data you generally cannot push both down at once just by changing model complexity. In the classical regime, increasing model complexity decreases bias but increases variance. The art of ML is finding the sweet spot.

⚠️ Exam relevance (SoSe 2026). The concept — the tradeoff, the U-shape, under- vs. overfitting — is lecture content. But the exact decomposition formula below (Bias² + Variance + σ²) is not on the SoSe 2026 slides; it’s textbook (R&N / ESL). That’s why the decomposition question sits in the “Beyond the lecture” appendix of quiz_mega-all-topics_23-05-26. Learn the tradeoff + U-shape; treat the formula as enrichment.

The decomposition

For squared loss, assume the data are generated as y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², where f is the true (noise-free) function. The expected prediction error at a point x, taken over training sets and noise, then decomposes exactly as:

$E[(y - \hat{y}(x))^{2}] = \underbrace{(E[\hat{y}(x)] - f(x))^{2}}_{Bias^{2}} + \underbrace{E[(\hat{y}(x) - E[\hat{y}(x)])^{2}]}_{Variance} + \underbrace{\sigma^{2}}_{noise}$

Plain-text form (same equation, term groupings marked):

E[(y − ŷ(x))²]  =  (E[ŷ(x)] − f(x))²  +  E[(ŷ(x) − E[ŷ(x)])²]  +  σ²
                   └──── Bias² ────┘     └──── Variance ────┘     └ noise

Bias² — how far the expected prediction is from the true value (averaging over many training sets). Captures systematic error from wrong model assumptions.
Variance — how much the prediction fluctuates across different training sets drawn from the same distribution. Captures sensitivity to training data.
σ² — irreducible noise from the data-generating process itself. No model can reduce this.
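A short derivation sketch shows why the three terms appear with no cross terms left over. It assumes y = f(x) + ε with E[ε] = 0, Var(ε) = σ², and ε independent of the training set; the shorthand $\bar{y}(x) = E[\hat{y}(x)]$ is introduced here only for readability.

```latex
\begin{align*}
E[(y - \hat{y})^{2}]
  &= E[(f + \varepsilon - \hat{y})^{2}] \\
  &= E[\varepsilon^{2}] + 2\,E[\varepsilon]\,E[f - \hat{y}] + E[(f - \hat{y})^{2}]
     && \text{($\varepsilon$ independent of $\hat{y}$)} \\
  &= \sigma^{2} + E[(f - \hat{y})^{2}]
     && \text{(since $E[\varepsilon] = 0$)} \\[4pt]
E[(f - \hat{y})^{2}]
  &= E[(f - \bar{y} + \bar{y} - \hat{y})^{2}] \\
  &= (f - \bar{y})^{2} + 2\,(f - \bar{y})\,E[\bar{y} - \hat{y}] + E[(\bar{y} - \hat{y})^{2}] \\
  &= \underbrace{(\bar{y} - f)^{2}}_{Bias^{2}}
   + \underbrace{E[(\hat{y} - \bar{y})^{2}]}_{Variance}
     && \text{(since $E[\bar{y} - \hat{y}] = 0$)}
\end{align*}
```

Both cross terms vanish for the same reason: one factor has zero mean. That is what makes the decomposition exact for squared loss.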

Visual intuition: the dartboard

Think of repeatedly training your model on different training sets and plotting all predictions for the same test point:

Scenario	Bias	Variance	Predictions look like
High bias, low variance	▲	▽	tight cluster, far from bullseye
Low bias, high variance	▽	▲	scattered all around bullseye
High bias, high variance	▲	▲	scattered, all far from bullseye (worst)
Low bias, low variance	▽	▽	tight cluster on bullseye (best — but hard)
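The dartboard can be simulated directly. The sketch below is an illustration under assumed settings (a sine as the true function, 20 training points, noise 0.3, the test point x0 = 0.25, and the helper name `darts` are all choices made here, not part of the lecture): it trains many models on fresh training sets and measures where the predictions for one test point land.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    # Assumed ground truth for the illustration
    return np.sin(2 * np.pi * x)

def darts(degree, x0=0.25, n_train=20, noise=0.3, n_sets=500, seed=0):
    """Predictions at x0 from n_sets models, each fit on a fresh noisy training set."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n_train)
    preds = np.empty(n_sets)
    for i in range(n_sets):
        y = true_f(x) + rng.normal(0, noise, n_train)
        preds[i] = Polynomial.fit(x, y, degree)(x0)
    return preds

x0 = 0.25
for degree in (1, 9):
    p = darts(degree, x0)
    offset = p.mean() - true_f(x0)  # distance of the cluster centre from the bullseye
    spread = p.std()                # how scattered the darts are
    print(f"degree {degree}: offset {offset:+.3f}, spread {spread:.3f}")
```

Under these assumptions the straight line should show a large offset with a small spread (tight cluster, off target), and the degree-9 fit a small offset with a larger spread.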

The U-shape: total error vs. model complexity

As you increase model complexity (more parameters, deeper trees, higher-degree polynomial):

Bias goes DOWN (more expressive model can capture true pattern)
Variance goes UP (more parameters = more fit to noise)
Total error $= Bias^{2} + Variance + \sigma^{2}$ has a U-shape, with its minimum somewhere in the middle (σ² is constant, so it only shifts the curve upward)

This is the famous “sweet spot” picture every ML textbook shows:

Error
  │       /──── total error
  │      /  \
  │     /    \____
  │    /          \____
  │   / bias²          \____
  │  /                       \____
  │ /__________variance________→ complexity
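The curve can also be estimated numerically. The following is a minimal sketch, assuming a sine as the true function, polynomial degree as the complexity knob, and a hypothetical helper `bias_variance_by_degree`; it estimates Bias² and Variance by refitting on many simulated training sets.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    # Assumed ground truth for the illustration
    return np.sin(2 * np.pi * x)

def bias_variance_by_degree(degrees, n_train=20, noise=0.3, n_sets=200, seed=0):
    """Monte Carlo estimate of (degree, bias^2, variance, total) averaged over a test grid."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_train)
    x_test = np.linspace(0, 1, 50)
    rows = []
    for degree in degrees:
        preds = np.empty((n_sets, x_test.size))
        for i in range(n_sets):
            y = true_f(x_train) + rng.normal(0, noise, n_train)
            preds[i] = Polynomial.fit(x_train, y, degree)(x_test)
        bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
        variance = np.mean(preds.var(axis=0))
        rows.append((degree, bias2, variance, bias2 + variance + noise ** 2))
    return rows

rows = bias_variance_by_degree(range(0, 11))
for degree, bias2, variance, total in rows:
    print(f"degree {degree:2d}: bias2 {bias2:.4f}  variance {variance:.4f}  total {total:.4f}")

best_degree = min(rows, key=lambda r: r[3])[0]
print("lowest estimated total error at degree", best_degree)
```

With these settings the bias column should fall and the variance column should rise as the degree grows, so the minimum of the total lands at an intermediate degree.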

⚠️ Modern caveat — “double descent”: with very large neural networks (over-parameterized), test error sometimes goes DOWN again past the interpolation threshold. This breaks the classical U-shape but doesn’t invalidate it — it’s a different regime (Belkin et al., 2019).

See it concretely — polynomial regression


# Pyodide environment (e.g. Obsidian Execute Code plugin) needs matplotlib + numpy.
# If you run in normal Python (terminal/Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
# True function (sine) + noisy observations
np.random.seed(42)
def true_f(x):
    return np.sin(2 * np.pi * x)
 
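# Kept deliberately small so the high-degree fits overfit visibly
# (the value 20 is an assumption; try 100 to compare).
n_samples = 20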
noise = 0.3
X_train = np.linspace(0, 1, n_samples)
y_train = true_f(X_train) + np.random.normal(0, noise, n_samples)
X_test = np.linspace(0, 1, 200)
y_test = true_f(X_test)
 
# Fit polynomials of increasing degree
degrees = [1, 3, 9, 20]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, deg in zip(axes, degrees):
    coeffs = np.polyfit(X_train, y_train, deg)
    pred = np.polyval(coeffs, X_test)
    train_err = np.mean((np.polyval(coeffs, X_train) - y_train)**2)
    test_err  = np.mean((pred - y_test)**2)
    ax.plot(X_test, y_test, 'k-', lw=1.5, label='true f(x)')
    ax.scatter(X_train, y_train, color='steelblue', s=20, alpha=0.7, label='training data')
    ax.plot(X_test, pred, 'r-', lw=2, label=f'degree {deg} fit')
    ax.set_title(f'deg {deg}: train MSE={train_err:.2f}, test MSE={test_err:.2f}')
    ax.set_ylim(-2, 2); ax.legend(fontsize=7)
plt.tight_layout(); plt.show()

What to see:

degree 1 (high bias): straight line — too simple, misses the sine. Train AND test error both high.
degree 3 (sweet spot): captures the curve nicely. Both errors low.
degree 9 (variance starts to dominate): wiggles to chase training noise. Test error rises.
degree 20 (massive overfit): perfect fit to training points, garbage between them. Train error → 0, test error explodes.

Run with n_samples = 100 and the overfit becomes much less dramatic — more data is the universal cure for high variance.
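The effect of sample size can be checked without looking at plots. This is a rough sketch under assumed settings (sine target, degree 9, noise 0.3, hypothetical helper `mean_test_mse`): it averages the test error against the noise-free function over many simulated training sets for several values of n.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    # Assumed ground truth for the illustration
    return np.sin(2 * np.pi * x)

def mean_test_mse(n_train, degree=9, noise=0.3, n_sets=200, seed=0):
    """Average test MSE (against the noise-free function) over many training sets."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_train)
    x_test = np.linspace(0, 1, 200)
    errors = []
    for _ in range(n_sets):
        y = true_f(x_train) + rng.normal(0, noise, n_train)
        pred = Polynomial.fit(x_train, y, degree)(x_test)
        errors.append(np.mean((pred - true_f(x_test)) ** 2))
    return float(np.mean(errors))

results = {n: mean_test_mse(n) for n in (20, 100, 500)}
for n, mse in results.items():
    print(f"n = {n:3d}: mean test MSE {mse:.4f}")
```

Because the test error here is measured against the noise-free function, it contains only Bias² + Variance, so the drop with growing n reflects the shrinking variance of the same flexible model.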

How to reduce each component

To reduce …	Use …
Bias	More expressive model (deeper tree, more params, polynomial degree) · Add features · Reduce regularization
Variance	Simpler model · More training data · Regularization (L1, L2, dropout) · Bagging / ensembles · Early stopping · Cross-validation for model selection
Noise σ²	Cannot reduce — it’s a property of the data
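The regularization entry in the table can be made concrete with ridge (L2) regression. The sketch below rests on several assumptions chosen for illustration: a sine target, degree-9 Legendre features, a penalty that also covers the intercept, and the helper name `ridge_bias_variance`. Because ridge on a fixed design is linear in the training targets, Bias² and Variance can be computed exactly instead of simulated.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def true_f(x):
    # Assumed ground truth for the illustration
    return np.sin(2 * np.pi * x)

def ridge_bias_variance(lam, degree=9, n_train=20, noise=0.3):
    """Exact bias^2 and variance of ridge regression, averaged over a test grid."""
    x_train = np.linspace(0, 1, n_train)
    x_test = np.linspace(0, 1, 50)
    # Legendre features on [-1, 1] keep the normal equations well conditioned.
    A = legvander(2 * x_train - 1, degree)
    B = legvander(2 * x_test - 1, degree)
    # Ridge is a linear smoother: predictions = H @ y_train.
    H = B @ np.linalg.solve(A.T @ A + lam * np.eye(degree + 1), A.T)
    expected_pred = H @ true_f(x_train)  # E[prediction], because the noise has mean 0
    bias2 = np.mean((expected_pred - true_f(x_test)) ** 2)
    variance = noise ** 2 * np.mean(np.sum(H ** 2, axis=1))
    return bias2, variance

lams = [0.0, 0.1, 1.0, 10.0]
stats = [ridge_bias_variance(lam) for lam in lams]
for lam, (bias2, variance) in zip(lams, stats):
    print(f"lambda {lam:5.1f}: bias2 {bias2:.5f}  variance {variance:.5f}")
```

Stronger regularization buys lower variance and pays for it in bias, which is the same trade as choosing a simpler model.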

The asymmetry: variance is much easier to reduce than bias (just gather more data). That’s why modern ML throws huge models at huge datasets: high capacity and high variance, then drown the variance with data.

Why bagging reduces variance — and Random Forest’s trick

Bagging (Bootstrap Aggregation): train many models on bootstrap samples of the data, average their predictions.

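Averaging n identically distributed estimators, each with variance s², divides the variance by n if they are independent. If they are pairwise correlated with correlation ρ, the variance of the average is ρs² + (1 − ρ)s²/n, so the reduction is smaller and can never go below ρs².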
Bias stays the same — the expected prediction is unchanged.

Random Forest adds a second trick: at each tree split, only consider a random subset of features. This decorrelates the trees → variance reduction is even bigger than plain bagging.
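Why decorrelation matters can be checked with a toy simulation. This sketch does not train any trees; it stands in for them with unit-variance Gaussian "estimators" that share a common component (an assumption made here to control their pairwise correlation `rho`), and compares the simulated variance of their average with the textbook expression rho·s² + (1 − rho)·s²/n.

```python
import numpy as np

def variance_of_average(n_models, rho, n_trials=100_000, seed=0):
    """Simulated variance of the mean of n_models unit-variance estimators
    with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))     # component common to all estimators
    own = rng.normal(size=(n_trials, n_models)) # independent component per estimator
    estimators = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return float(estimators.mean(axis=1).var())

def predicted_variance(n_models, rho, var_single=1.0):
    """Variance of the average of equicorrelated estimators."""
    return rho * var_single + (1 - rho) * var_single / n_models

for rho in (0.0, 0.5, 0.9):
    sim = variance_of_average(25, rho)
    print(f"rho {rho:.1f}: simulated {sim:.3f}, formula {predicted_variance(25, rho):.3f}")
```

With 25 estimators the independent case gets close to 1/25 of the single-estimator variance, while strongly correlated estimators barely improve. Lowering the correlation between trees is the lever Random Forest pulls.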

⚠️ Common exam traps

“Increasing model complexity increases bias” — FALSE. It decreases bias (more expressive). Variance goes up.
“Bagging reduces bias” — FALSE. It reduces variance only. Boosting (different algorithm) reduces bias.
“Overfitting = high bias” — FALSE. Overfitting = high variance (sensitive to training noise). Underfitting = high bias.
“Adding more data reduces bias” — FALSE in general. More data reduces variance. If the model is too simple, no amount of data helps.
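The last trap can be verified numerically. As a hedged illustration (sine target, straight-line model, and the helper name `linear_fit_bias_variance` are assumptions of this sketch), the code computes Bias² and Variance of a linear fit exactly for growing training set sizes, using the fact that least squares is linear in the training targets.

```python
import numpy as np

def true_f(x):
    # Assumed ground truth for the illustration
    return np.sin(2 * np.pi * x)

def linear_fit_bias_variance(n_train, noise=0.3):
    """Exact bias^2 and variance of a straight-line fit, averaged over a test grid."""
    x_train = np.linspace(0, 1, n_train)
    x_test = np.linspace(0, 1, 50)
    A = np.vander(x_train, 2)  # columns: x, 1
    B = np.vander(x_test, 2)
    # Least squares is a linear smoother: predictions = H @ y_train.
    H = B @ np.linalg.solve(A.T @ A, A.T)
    expected_pred = H @ true_f(x_train)  # E[prediction], because the noise has mean 0
    bias2 = np.mean((expected_pred - true_f(x_test)) ** 2)
    variance = noise ** 2 * np.mean(np.sum(H ** 2, axis=1))
    return bias2, variance

sizes = (20, 200, 2000)
stats = {n: linear_fit_bias_variance(n) for n in sizes}
for n, (bias2, variance) in stats.items():
    print(f"n = {n:4d}: bias2 {bias2:.4f}  variance {variance:.6f}")
```

The variance shrinks roughly in proportion to 1/n, while the bias stays where it is: a straight line cannot follow a sine no matter how many points it sees.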

Where the bias-variance tradeoff still matters today

All of supervised ML — every model selection decision is implicitly a bias-variance trade.
Hyperparameter tuning — regularization strength, network depth, tree depth, k in k-NN, kernel bandwidth — all control this tradeoff.
Modern deep learning — even with double descent, regularization (dropout, weight decay) is still about variance control.
Cross-validation — the standard tool for finding the sweet spot empirically.
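Finding the sweet spot empirically with cross-validation might look like the sketch below. It is a minimal hand-rolled k-fold loop under assumed settings (sine target, 60 noisy points, polynomial degree as the hyperparameter, hypothetical helper `kfold_mse`), not a replacement for a library implementation.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_f(x):
    # Assumed ground truth for the illustration
    return np.sin(2 * np.pi * x)

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean held-out squared error of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)  # same seed -> same folds for every degree
    folds = np.array_split(rng.permutation(x.size), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(x.size), held_out)
        model = Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((model(x[held_out]) - y[held_out]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 60))
y = true_f(x) + rng.normal(0, 0.3, x.size)

cv = {degree: kfold_mse(x, y, degree) for degree in range(0, 11)}
for degree, err in cv.items():
    print(f"degree {degree:2d}: CV MSE {err:.3f}")

best_degree = min(cv, key=cv.get)
print("selected degree:", best_degree)
```

The held-out error includes the noise σ², so it never reaches zero; what matters is which degree minimizes it. Underfitting degrees (0 to 2) should score clearly worse than the selected one.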

Where it’s been refined — but not replaced

Double descent (Belkin et al. 2019) — over-parameterized regime breaks the U-shape; modern explanations involve implicit regularization of SGD.
Neural Tangent Kernel theory — explains why infinitely-wide NNs don’t overfit the way classical theory predicts.
Generalization bounds based on flatness, margin, or PAC-Bayes — newer theoretical frames, but the bias-variance decomposition is still the canonical first explanation.

Bottom line: the bias-variance tradeoff is the mental model for thinking about generalization. Modern theory has nuances on top, but you’ll never escape it.
