Bias-Variance Tradeoff

The bias-variance tradeoff is the central insight in supervised machine learning: a model’s prediction error can be decomposed into three components — bias (systematic error from wrong assumptions), variance (sensitivity to training data fluctuations), and irreducible noise.

For a fixed amount of training data you can push down bias OR variance, but you cannot reduce both arbitrarily. Increasing model complexity typically decreases bias and increases variance. The art of ML is finding the sweet spot.

⚠️ Exam relevance (SoSe 2026). The concept — the tradeoff, the U-shape, under- vs. overfitting — is lecture content. But the exact decomposition formula below (Bias² + Variance + σ²) is not on the SoSe 2026 slides; it’s textbook (R&N / ESL). That’s why the decomposition question sits in the “Beyond the lecture” appendix of quiz_mega-all-topics_23-05-26. Learn the tradeoff + U-shape; treat the formula as enrichment.

The decomposition

For squared loss, assume y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², where f is the true function. The expected prediction error at a point x, averaged over training sets and noise, then decomposes exactly as:

E[(y − ŷ(x))²]  =  (E[ŷ(x)] − f(x))²  +  E[(ŷ(x) − E[ŷ(x)])²]  +  σ²
                   └──── Bias² ────┘   └────── Variance ──────┘   └ noise
  • Bias² — how far the expected prediction is from the true value (averaging over many training sets). Captures systematic error from wrong model assumptions.
  • Variance — how much the prediction fluctuates across different training sets drawn from the same distribution. Captures sensitivity to training data.
  • σ² — irreducible noise from the data-generating process itself. No model can reduce this.
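
Why the split is exact: a short derivation sketch, under the usual textbook assumptions (not from the lecture slides). Here f(x) is the true function, y = f(x) + ε with zero-mean noise of variance σ² that is independent of the training set, and μ is shorthand for E[ŷ(x)]. Both cross terms vanish, which is why only three terms survive.

```latex
% Shorthand at a fixed x: f = f(x), \hat{y} = \hat{y}(x), \mu = E[\hat{y}].
% Assumptions: y = f + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% and \varepsilon is independent of the training set (hence of \hat{y}).
\begin{align*}
E\big[(y - \hat{y})^2\big]
  &= E\big[(f - \hat{y})^2\big] + 2\,E[\varepsilon]\,E[f - \hat{y}] + E[\varepsilon^2] \\
  &= E\big[(f - \hat{y})^2\big] + \sigma^2 \\[4pt]
E\big[(f - \hat{y})^2\big]
  &= E\big[\big((f - \mu) + (\mu - \hat{y})\big)^2\big] \\
  &= (f - \mu)^2 + 2\,(f - \mu)\,E[\mu - \hat{y}] + E\big[(\hat{y} - \mu)^2\big] \\
  &= \underbrace{(\mu - f)^2}_{\text{Bias}^2}
   + \underbrace{E\big[(\hat{y} - \mu)^2\big]}_{\text{Variance}}
\end{align*}
```

The first cross term is zero because E[ε] = 0, the second because E[μ − ŷ] = 0 by the definition of μ.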

Visual intuition: the dartboard

Think of repeatedly training your model on different training sets and plotting all predictions for the same test point:

Scenario                 | Bias | Variance | Predictions look like
High bias, low variance  | high | low      | tight cluster, far from bullseye
Low bias, high variance  | low  | high     | scattered all around bullseye
High bias, high variance | high | high     | scattered, all far from bullseye (worst)
Low bias, low variance   | low  | low      | tight cluster on bullseye (best, but hard)

The U-shape: total error vs. model complexity

As you increase model complexity (more parameters, deeper trees, higher-degree polynomial):

  • Bias goes DOWN (more expressive model can capture true pattern)
  • Variance goes UP (more parameters = more fit to noise)
  • Total error has a U-shape — minimum somewhere in the middle

This is the famous “sweet spot” picture every ML textbook shows:

Error
  │       /──── total error
  │      /  \
  │     /    \____
  │    /          \____
  │   / bias²          \____
  │  /                       \____
  │ /__________variance________→ complexity
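
The U-shape can be estimated numerically. The sketch below is an illustration under assumed settings, not lecture code: the true function sin(2πx), the noise level, the sample sizes and the helper name `bias_variance_by_degree` are all made up for the demo. It refits a polynomial on many fresh training sets and estimates bias² and variance on a grid. The fit uses NumPy's Chebyshev basis purely for numerical stability; the fitted curve is still the least-squares polynomial of that degree.

```python
import numpy as np
from numpy.polynomial import Chebyshev


def true_f(x):
    # Assumed ground-truth function for the demo.
    return np.sin(2 * np.pi * x)


def bias_variance_by_degree(degrees, n_train=20, n_sets=200, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of (bias^2, variance, total) for each polynomial degree."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 50)
    out = {}
    for d in degrees:
        preds = np.empty((n_sets, x_grid.size))
        for s in range(n_sets):
            # Draw a fresh training set from the same distribution.
            x = rng.uniform(0.0, 1.0, n_train)
            y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
            preds[s] = Chebyshev.fit(x, y, deg=d, domain=[0.0, 1.0])(x_grid)
        # Bias^2: squared gap between the average prediction and the truth.
        bias2 = float(np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2))
        # Variance: spread of predictions across training sets.
        variance = float(np.mean(preds.var(axis=0)))
        out[d] = (bias2, variance, bias2 + variance + noise_sd ** 2)
    return out


results = bias_variance_by_degree(range(1, 11))
best_degree = min(results, key=lambda d: results[d][2])

for d, (b2, var, total) in results.items():
    print(f"degree {d:2d}  bias2={b2:.4f}  variance={var:.4f}  total={total:.4f}")
print("lowest estimated total error at degree", best_degree)
```

With these settings the bias² column should fall and the variance column should rise as the degree grows, with the lowest total somewhere in between. The exact numbers depend on the seed and the assumed noise level.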

⚠️ Modern caveat — “double descent”: with very large neural networks (over-parameterized), test error sometimes goes DOWN again past the interpolation threshold. This breaks the classical U-shape but doesn’t invalidate it — it’s a different regime (Belkin et al., 2019).

See it concretely — polynomial regression

What to see:

  • degree 1 (high bias): straight line — too simple, misses the sine. Train AND test error both high.
  • degree 3 (sweet spot): captures the curve nicely. Both errors low.
  • degree 9 (variance starting to dominate): wiggles to chase training noise. Test error rises.
  • degree 20 (massive overfit): perfect fit to training points, garbage between them. Train error → 0, test error explodes.

Run with n_samples = 100 and the overfit becomes much less dramatic — more data is the universal cure for high variance.
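
A minimal sketch of the experiment described above, for reproducing it locally. The note does not record the original settings, so everything here is an assumption: the target sin(2πx), the noise level, the default `n_samples = 30` and the function names. The Chebyshev basis is only a numerical-stability choice; each fit is still a least-squares polynomial of the given degree.

```python
import numpy as np
from numpy.polynomial import Chebyshev


def true_f(x):
    # Assumed target: one period of a sine on [0, 1].
    return np.sin(2 * np.pi * x)


def make_data(n, rng, noise_sd=0.3):
    x = rng.uniform(0.0, 1.0, n)
    y = true_f(x) + rng.normal(0.0, noise_sd, n)
    return x, y


def run_experiment(n_samples=30, degrees=(1, 3, 9, 20), seed=0):
    """Fit one polynomial per degree; return {degree: (train_mse, test_mse)}."""
    rng = np.random.default_rng(seed)
    x_train, y_train = make_data(n_samples, rng)
    x_test, y_test = make_data(1000, rng)  # large held-out set
    results = {}
    for d in degrees:
        model = Chebyshev.fit(x_train, y_train, deg=d, domain=[0.0, 1.0])
        train_mse = float(np.mean((model(x_train) - y_train) ** 2))
        test_mse = float(np.mean((model(x_test) - y_test) ** 2))
        results[d] = (train_mse, test_mse)
    return results


results = run_experiment()
for d, (train_mse, test_mse) in results.items():
    print(f"degree {d:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

Calling `run_experiment(n_samples=100)` repeats the run with more data, which is the comparison suggested above. Train error can only go down as the degree increases, because the models are nested; what happens to test error is the interesting part.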

How to reduce each component

To reduce … | Use …
Bias        | More expressive model (deeper tree, more params, higher polynomial degree) · Add features · Reduce regularization
Variance    | Simpler model · More training data · Regularization (L1, L2, dropout) · Bagging / ensembles · Early stopping · Cross-validation for model selection
Noise σ²    | Cannot reduce; it's a property of the data

The asymmetry: variance is much easier to reduce than bias (just gather more data). That's why modern ML throws huge models at huge datasets: high capacity and high variance, then drown the variance with data.
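
To see the asymmetry in numbers, here is a small simulation sketch. The setup is assumed for illustration (sine target, Gaussian noise, the helper name `bias2_and_variance`): it compares a too-simple model (degree 1) with a flexible one (degree 9) as the training set grows.

```python
import numpy as np
from numpy.polynomial import Chebyshev


def true_f(x):
    # Assumed ground-truth function for the demo.
    return np.sin(2 * np.pi * x)


def bias2_and_variance(degree, n_train, n_sets=100, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of bias^2 and variance, averaged over a grid on [0, 1]."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_sets, x_grid.size))
    for s in range(n_sets):
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
        preds[s] = Chebyshev.fit(x, y, deg=degree, domain=[0.0, 1.0])(x_grid)
    bias2 = float(np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias2, variance


table = {(d, n): bias2_and_variance(d, n) for d in (1, 9) for n in (20, 200, 2000)}
for (d, n), (b2, var) in table.items():
    print(f"degree {d}  n={n:4d}  bias2={b2:.4f}  variance={var:.5f}")
```

The degree-1 bias² should stay large no matter how much data is added, while the degree-9 variance should shrink as n grows. That is the sense in which more data cures variance but not bias.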

Why bagging reduces variance — and Random Forest’s trick

Bagging (Bootstrap Aggregation): train many models on bootstrap samples of the data, average their predictions.

  • Averaging n identically distributed estimators cuts the variance to 1/n of a single estimator's if they're independent, and by less if they're correlated.
  • Bias stays the same — the expected prediction is unchanged.

Random Forest adds a second trick: at each tree split, only consider a random subset of features. This decorrelates the trees → variance reduction is even bigger than plain bagging.
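
The effect of correlation can be checked with a toy simulation. For n identically distributed estimators with variance σ² and pairwise correlation ρ, the variance of their average is ρσ² + (1 − ρ)σ²/n (a standard textbook result, e.g. in ESL's Random Forest chapter). The sketch below does not train any trees: it just draws correlated Gaussian "predictions" with σ² = 1 and compares the empirical variance of the average with the formula. The function names are made up for this demo.

```python
import numpy as np


def variance_of_average(n_estimators, rho, n_trials=100_000, seed=0):
    """Empirical variance of the mean of n unit-variance estimators with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))               # component common to all estimators
    private = rng.normal(size=(n_trials, n_estimators))   # independent component per estimator
    estimates = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    return float(estimates.mean(axis=1).var())


def theoretical_variance(n_estimators, rho, sigma2=1.0):
    # rho * sigma^2 is the floor that averaging cannot remove.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_estimators


for rho in (0.0, 0.3, 0.7):
    emp = variance_of_average(25, rho)
    theo = theoretical_variance(25, rho)
    print(f"rho={rho:.1f}  empirical={emp:.4f}  formula={theo:.4f}")
```

With ρ = 0 the variance drops to 1/n. With ρ > 0 it levels off at ρσ² however many models are averaged, which is why decorrelating the trees pays off.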

⚠️ Common exam traps

  1. “Increasing model complexity increases bias” — FALSE. It decreases bias (more expressive). Variance goes up.
  2. “Bagging reduces bias” — FALSE. It reduces variance only. Boosting (different algorithm) reduces bias.
  3. “Overfitting = high bias” — FALSE. Overfitting = high variance (sensitive to training noise). Underfitting = high bias.
  4. “Adding more data reduces bias” — FALSE in general. More data reduces variance. If the model is too simple, no amount of data helps.

Where the bias-variance tradeoff still matters today

  • All of supervised ML — every model selection decision is implicitly a bias-variance trade.
  • Hyperparameter tuning — regularization strength, network depth, tree depth, k in k-NN, kernel bandwidth — all control this tradeoff.
  • Modern deep learning — even with double descent, regularization (dropout, weight decay) is still about variance control.
  • Cross-validation — the standard tool for finding the sweet spot empirically.
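
As a rough illustration of the last point, here is a hand-rolled k-fold cross-validation that picks a polynomial degree. It is a sketch under assumed settings (synthetic sine data, 5 folds, the helper name `kfold_cv_mse`), not a recommended pipeline. In practice a library routine would normally do this.

```python
import numpy as np
from numpy.polynomial import Chebyshev


def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)  # same folds for every degree
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = Chebyshev.fit(x[train], y[train], deg=degree, domain=[0.0, 1.0])
        errors.append(np.mean((model(x[val]) - y[val]) ** 2))
    return float(np.mean(errors))


# Synthetic data (assumed): noisy sine on [0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

cv_mse = {d: kfold_cv_mse(x, y, d) for d in range(1, 11)}
best_degree = min(cv_mse, key=cv_mse.get)

for d, err in cv_mse.items():
    print(f"degree {d:2d}  CV MSE={err:.4f}")
print("selected degree:", best_degree)
```

The selected degree is itself a noisy estimate: a different seed or fold split can move it by a degree or two, so treat it as a pointer to the region of the sweet spot rather than an exact answer.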

Where it’s been refined — but not replaced

  • Double descent (Belkin et al. 2019) — over-parameterized regime breaks the U-shape; modern explanations involve implicit regularization of SGD.
  • Neural Tangent Kernel theory — explains why infinitely-wide NNs don’t overfit the way classical theory predicts.
  • Generalization bounds based on flatness, margin, or PAC-Bayes — newer theoretical frames, but the bias-variance decomposition is still the canonical first explanation.

Bottom line: the bias-variance tradeoff is the mental model for thinking about generalization. Modern theory has nuances on top, but you’ll never escape it.

Tags: methods-of-ai machine-learning bias-variance overfitting regularization
Created: 18-05-26