The bias-variance tradeoff is the central insight in supervised machine learning: a model’s prediction error can be decomposed into three components — bias (systematic error from wrong assumptions), variance (sensitivity to training data fluctuations), and irreducible noise.
For a fixed amount of training data you can reduce bias OR variance, but you cannot reduce both arbitrarily. Increasing model complexity typically decreases bias but increases variance. The art of ML is finding the sweet spot.
⚠️ Exam relevance (SoSe 2026). The concept — the tradeoff, the U-shape, under- vs. overfitting — is lecture content. But the exact decomposition formula below (Bias² + Variance + σ²) is not on the SoSe 2026 slides; it’s textbook (R&N / ESL). That’s why the decomposition question sits in the “Beyond the lecture” appendix of quiz_mega-all-topics_23-05-26. Learn the tradeoff + U-shape; treat the formula as enrichment.
The decomposition
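For squared loss, with y = f(x) + ε, E[ε] = 0 and Var(ε) = σ², the expected prediction error at a point x decomposes exactly as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \sigma^2
$$

Here f is the true function, f̂ is the model fitted on a random training set, and the expectation runs over training sets and the noise ε.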
Bias² — how far the expected prediction is from the true value (averaging over many training sets). Captures systematic error from wrong model assumptions.
Variance — how much the prediction fluctuates across different training sets drawn from the same distribution. Captures sensitivity to training data.
σ² — irreducible noise from the data-generating process itself. No model can reduce this.
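The three terms can be checked numerically. The sketch below is a minimal simulation, not a reference implementation: it assumes the same sine target used later in this note, a fixed training grid with fresh noise per repetition, and hypothetical names (`decompose_at_point`, `x0`, `n_repeats`). It estimates Bias², Variance and the expected squared error at one point and compares the sum with the directly measured error.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def decompose_at_point(x0, degree, n_train=30, noise=0.3, n_repeats=2000, seed=0):
    """Estimate bias^2, variance, sigma^2 and expected squared error at x0."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_train)  # fixed inputs, new noise every repeat
    preds = np.empty(n_repeats)
    sq_errors = np.empty(n_repeats)
    for i in range(n_repeats):
        # Draw a fresh training set and fit a polynomial of the given degree
        y_train = true_f(x_train) + rng.normal(0, noise, n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coeffs, x0)
        # Independent noisy observation at x0 to measure the prediction error
        y_new = true_f(x0) + rng.normal(0, noise)
        sq_errors[i] = (y_new - preds[i]) ** 2
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    return bias_sq, variance, noise ** 2, sq_errors.mean()

bias_sq, variance, sigma_sq, expected_err = decompose_at_point(x0=0.25, degree=3)
print(f"bias^2   = {bias_sq:.4f}")
print(f"variance = {variance:.4f}")
print(f"sigma^2  = {sigma_sq:.4f}")
print(f"sum      = {bias_sq + variance + sigma_sq:.4f}")
print(f"measured = {expected_err:.4f}")
```

The last two printed values should agree up to Monte Carlo error. Rerunning with `degree=1` shifts most of the error into the bias term.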
Visual intuition: the dartboard
Think of repeatedly training your model on different training sets and plotting all predictions for the same test point:
| Scenario | Bias | Variance | Predictions look like |
|---|---|---|---|
| High bias, low variance | ▲ | ▽ | tight cluster, far from bullseye |
| Low bias, high variance | ▽ | ▲ | scattered all around bullseye |
| High bias, high variance | ▲ | ▲ | scattered, all far from bullseye (worst) |
| Low bias, low variance | ▽ | ▽ | tight cluster on bullseye (best, but hard) |
The U-shape: total error vs. model complexity
As you increase model complexity (more parameters, deeper trees, higher-degree polynomial):
Bias goes DOWN (more expressive model can capture true pattern)
Variance goes UP (more parameters = more fit to noise)
Total error = Bias² + Variance (+ σ², which is constant) has a U-shape, with its minimum somewhere in the middle
This is the famous “sweet spot” picture every ML textbook shows:
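A minimal sketch that traces this curve numerically. It assumes the sine target and noise level from the polynomial example in this note; `error_curve` and its parameters are hypothetical names. Polynomial degree stands in for model complexity, and the function averages Bias² and Variance over a test grid across many simulated training sets.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def error_curve(degrees, n_train=30, noise=0.3, n_repeats=300, seed=0):
    """Return (degree, bias^2, variance, bias^2 + variance) for each degree."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_train)
    x_test = np.linspace(0, 1, 200)
    # One column per simulated training set: shape (n_train, n_repeats)
    y_train = true_f(x_train)[:, None] + rng.normal(0, noise, (n_train, n_repeats))
    rows = []
    for deg in degrees:
        coeffs = np.polyfit(x_train, y_train, deg)   # shape (deg + 1, n_repeats)
        preds = np.vander(x_test, deg + 1) @ coeffs   # shape (200, n_repeats)
        bias_sq = np.mean((preds.mean(axis=1) - true_f(x_test)) ** 2)
        variance = np.mean(preds.var(axis=1))
        rows.append((deg, bias_sq, variance, bias_sq + variance))
    return rows

rows = error_curve(range(1, 13))
for deg, bias_sq, variance, total in rows:
    print(f"degree {deg:2d}: bias^2={bias_sq:.4f}  variance={variance:.4f}  total={total:.4f}")
best_degree = min(rows, key=lambda r: r[3])[0]
print("lowest bias^2 + variance at degree", best_degree)
```

Under these assumptions the bias column should shrink and the variance column should grow as the degree rises, so the total bottoms out at an intermediate degree.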
⚠️ Modern caveat — “double descent”: with very large neural networks (over-parameterized), test error sometimes goes DOWN again past the interpolation threshold. This breaks the classical U-shape but doesn’t invalidate it — it’s a different regime (Belkin et al., 2019).
See it concretely — polynomial regression
```python
# Pyodide environment (e.g. Obsidian Execute Code plugin) needs matplotlib + numpy.
# If you run in normal Python (terminal/Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")

import numpy as np
import matplotlib.pyplot as plt

# True function (sine) + noisy observations
np.random.seed(42)

def true_f(x):
    return np.sin(2 * np.pi * x)

# Small on purpose so high degrees can overfit (15 is an assumed value;
# the note below compares against n_samples = 100).
n_samples = 15
noise = 0.3
X_train = np.linspace(0, 1, n_samples)
y_train = true_f(X_train) + np.random.normal(0, noise, n_samples)
X_test = np.linspace(0, 1, 200)
y_test = true_f(X_test)  # noise-free targets: test MSE excludes sigma^2

# Fit polynomials of increasing degree
# (degree 20 has more coefficients than data points, so NumPy may emit a RankWarning)
degrees = [1, 3, 9, 20]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, deg in zip(axes, degrees):
    coeffs = np.polyfit(X_train, y_train, deg)
    pred = np.polyval(coeffs, X_test)
    train_err = np.mean((np.polyval(coeffs, X_train) - y_train)**2)
    test_err = np.mean((pred - y_test)**2)
    ax.plot(X_test, y_test, 'k-', lw=1.5, label='true f(x)')
    ax.scatter(X_train, y_train, color='steelblue', s=20, alpha=0.7, label='training data')
    ax.plot(X_test, pred, 'r-', lw=2, label=f'degree {deg} fit')
    ax.set_title(f'deg {deg}: train MSE={train_err:.2f}, test MSE={test_err:.2f}')
    ax.set_ylim(-2, 2)
    ax.legend(fontsize=7)
plt.tight_layout()
plt.show()
```
What to see:
degree 1 (high bias): straight line — too simple, misses the sine. Train AND test error both high.
degree 3 (sweet spot): captures the curve nicely. Both errors low.
degree 9 (high variance starting): wiggles to chase training noise. Test error rises.
degree 20 (massive overfit): perfect fit to training points, garbage between them. Train error → 0, test error explodes.
Run with n_samples = 100 and the overfit becomes much less dramatic — more data is the universal cure for high variance.
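To put numbers on the effect of sample size, here is a hedged sketch (hypothetical helper `mean_test_mse`, same sine target and noise level assumed). It averages the test error against the noise-free function over many simulated training sets, for a flexible and a rigid model at two sample sizes.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def mean_test_mse(degree, n_samples, noise=0.3, n_repeats=200, seed=0):
    """Average test MSE (against the noise-free f) over simulated training sets."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_samples)
    x_test = np.linspace(0, 1, 200)
    # One column per simulated training set
    y_train = true_f(x_train)[:, None] + rng.normal(0, noise, (n_samples, n_repeats))
    coeffs = np.polyfit(x_train, y_train, degree)       # (degree + 1, n_repeats)
    preds = np.vander(x_test, degree + 1) @ coeffs       # (200, n_repeats)
    return float(np.mean((preds - true_f(x_test)[:, None]) ** 2))

results = {(deg, n): mean_test_mse(deg, n) for deg in (1, 9) for n in (15, 100)}
for (deg, n), mse in results.items():
    print(f"degree {deg}, n_samples {n:3d}: mean test MSE = {mse:.4f}")
```

The degree-9 error should drop sharply with more data because it is dominated by variance. The degree-1 error should barely move, because it is dominated by bias, which extra data does not remove.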
How to reduce each component
| To reduce … | Use … |
|---|---|
| Bias | More expressive model (deeper tree, more params, polynomial degree) · Add features · Reduce regularization |
| Variance | Simpler model · More training data · Regularization (L1, L2, dropout) · Bagging / ensembles · Early stopping · Cross-validation for model selection |
| Noise σ² | Cannot reduce: it’s a property of the data |
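As one illustration of the regularization entry, the sketch below applies L2 (ridge) regularization to a degree-9 polynomial and measures how Bias² and Variance move. It is a sketch under assumptions: the sine target from this note, a hypothetical helper `ridge_bias_variance`, and a penalty on all coefficients including the intercept for simplicity. Ridge is solved as an augmented least-squares problem.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def ridge_bias_variance(lam, degree=9, n_train=15, noise=0.3, n_repeats=300, seed=0):
    """Average bias^2 and variance of a ridge-regularized polynomial fit."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0, 1, n_train)
    x_test = np.linspace(0, 1, 200)
    X = np.vander(x_train, degree + 1)
    X_test = np.vander(x_test, degree + 1)
    # One column per simulated training set
    Y = true_f(x_train)[:, None] + rng.normal(0, noise, (n_train, n_repeats))
    # Minimizing ||Xw - y||^2 + lam * ||w||^2 equals ordinary least squares
    # on X stacked with sqrt(lam) * I and y stacked with zeros.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(degree + 1)])
    Y_aug = np.vstack([Y, np.zeros((degree + 1, n_repeats))])
    W, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)
    preds = X_test @ W                                   # (200, n_repeats)
    bias_sq = float(np.mean((preds.mean(axis=1) - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=1)))
    return bias_sq, variance

for lam in (0.0, 1e-4, 1e-2, 1.0):
    bias_sq, variance = ridge_bias_variance(lam)
    print(f"lambda={lam:g}: bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Stronger regularization should lower the variance column and raise the bias column, which is the same trade as choosing a simpler model.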
The asymmetry: in practice, variance is often easier to reduce than bias, because gathering more data shrinks it without touching the model. That’s why modern ML throws huge models at huge datasets: high capacity + high variance, then drown the variance with data.
Why bagging reduces variance — and Random Forest’s trick
Bagging (Bootstrap Aggregation): train many models on bootstrap samples of the data, average their predictions.
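Averaging n identically distributed estimators, each with variance v, cuts the variance to v/n if they’re independent. If they’re correlated with pairwise correlation ρ, the variance of the average is ρv + (1 − ρ)v/n, so the reduction is smaller and never goes below ρv.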
Bias stays the same — the expected prediction is unchanged.
Random Forest adds a second trick: at each tree split, only consider a random subset of features. This decorrelates the trees → variance reduction is even bigger than plain bagging.
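The effect of correlation on averaging can be simulated directly. This is a toy sketch, not a bagging implementation: the "models" are just Gaussian random numbers with variance v and pairwise correlation ρ (built from one shared and one private component), and `variance_of_average` is a hypothetical helper. It compares the empirical variance of the average with the standard formula ρv + (1 − ρ)v/n.

```python
import numpy as np

def variance_of_average(n_models, rho, v=1.0, n_trials=20000, seed=0):
    """Empirical vs. theoretical variance of the mean of correlated estimators."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))        # component common to all models
    own = rng.normal(size=(n_trials, n_models))    # component private to each model
    # Each column has variance v; any two columns have correlation rho
    estimates = np.sqrt(v) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)
    empirical = float(estimates.mean(axis=1).var())
    theoretical = rho * v + (1 - rho) * v / n_models
    return empirical, theoretical

for rho in (0.0, 0.3, 0.9):
    empirical, theoretical = variance_of_average(n_models=25, rho=rho)
    print(f"rho={rho}: empirical={empirical:.4f}  theoretical={theoretical:.4f}")
```

With ρ = 0 the variance falls to v/n. With high ρ, adding more models barely helps, which is why decorrelating the trees pays off.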
⚠️ Common exam traps
“Increasing model complexity increases bias” — FALSE. It decreases bias (more expressive). Variance goes up.
“Overfitting = high bias” — FALSE. Overfitting = high variance (sensitive to training noise). Underfitting = high bias.
“Adding more data reduces bias” — FALSE in general. More data reduces variance. If the model is too simple, no amount of data helps.
Where the bias-variance tradeoff still matters today
All of supervised ML — every model selection decision is implicitly a bias-variance trade.
Hyperparameter tuning — regularization strength, network depth, tree depth, k in k-NN, kernel bandwidth — all control this tradeoff.
Modern deep learning — even with double descent, regularization (dropout, weight decay) is still about variance control.
Cross-validation — the standard tool for finding the sweet spot empirically.
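As a sketch of that last point, the code below runs a plain k-fold cross-validation over polynomial degrees using only NumPy. The data set, the helper `cv_mse`, and the choice of k = 5 are all assumptions for illustration; a real project would more likely use a library routine.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def cv_mse(x, y, degree, k=5, seed=0):
    """Mean validation MSE of a polynomial fit over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for fold in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[fold] = False                      # hold this fold out
        coeffs = np.polyfit(x[train_mask], y[train_mask], degree)
        errors.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errors))

# Synthetic data: noisy sine, 40 points (assumed for this example)
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 40)
y = true_f(x) + rng.normal(0, 0.3, 40)

scores = {deg: cv_mse(x, y, deg) for deg in range(1, 11)}
for deg, score in scores.items():
    print(f"degree {deg:2d}: CV MSE = {score:.4f}")
best_degree = min(scores, key=scores.get)
print("selected degree:", best_degree)
```

Because the validation targets are noisy, the CV score includes σ² as a constant offset. The offset does not change which degree is selected.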
Where it’s been refined — but not replaced
Double descent (Belkin et al. 2019) — over-parameterized regime breaks the U-shape; modern explanations involve implicit regularization of SGD.
Neural Tangent Kernel theory — explains why infinitely-wide NNs don’t overfit the way classical theory predicts.
Generalization bounds based on flatness, margin, or PAC-Bayes — newer theoretical frames, but the bias-variance decomposition is still the canonical first explanation.
Bottom line: the bias-variance tradeoff is the mental model for thinking about generalization. Modern theory has nuances on top, but you’ll never escape it.
See also
Loss Surface — model complexity changes the shape of the loss landscape; high-D wide nets often have flatter, easier surfaces