Random Forest

A Random Forest (Breiman, 2001) is an ensemble of decision trees whose predictions are combined by majority vote (classification) or averaging (regression). Each tree is trained on a different bootstrap sample of the data, and at each split only a random subset of features is considered. The result: a model that keeps the flexibility of deep trees but has much lower variance than any single tree, at the cost of some interpretability.

Random Forest is the de facto default classifier for tabular data when you don’t know what else to try. It often beats deep learning on small and medium tabular datasets.

The two tricks

Random Forest = Decision Trees + two randomization tricks:

Trick 1: Bagging (Bootstrap Aggregation)

  • For each tree, sample N examples with replacement from the training set (a “bootstrap sample”).
  • Each bootstrap sample is the same size as the original, but because some examples are drawn more than once, on average ~37 % of the original examples do not appear in it (those become the Out-Of-Bag (OOB) set for that tree).
  • Train each tree on its own bootstrap sample.
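The bagging step above can be sketched in a few lines of standard-library Python. The helper name `bootstrap_indices` and the sample size are illustrative assumptions, not part of any library.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; return (in-bag list, OOB set)."""
    in_bag = [rng.randrange(n) for _ in range(n)]   # same size as the original
    oob = set(range(n)) - set(in_bag)               # rows never drawn
    return in_bag, oob

rng = random.Random(0)
n = 10_000
in_bag, oob = bootstrap_indices(n, rng)
oob_fraction = len(oob) / n
print(f"bootstrap size: {len(in_bag)}, OOB fraction: {oob_fraction:.3f}")
```

The printed OOB fraction should land close to 0.37 for a sample this large.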

Trick 2: Feature subsampling at each split

  • At every split decision, consider only a random subset of features (typically √p for classification, p/3 for regression, where p = total features).
  • This decorrelates the trees — without it, all trees would pick the same “strong” feature near the root.
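As a rough sketch of the feature-subsampling step: at each split a fresh subset of feature indices is drawn, and only those are searched for the best split. The function name `candidate_features` and the rounding rule are assumptions made for illustration; real libraries may round differently.

```python
import math
import random

def candidate_features(p, task, rng):
    """Pick the random feature subset examined at one split."""
    if task == "classification":
        k = max(1, round(math.sqrt(p)))   # about sqrt(p) features
    else:
        k = max(1, p // 3)                # about p/3 features
    return sorted(rng.sample(range(p), k))

rng = random.Random(0)
for split in range(3):
    # Each split sees a different subset, so no single feature can dominate.
    print(split, candidate_features(16, "classification", rng))
```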

Prediction

  • Classification: majority vote across trees
  • Regression: average across trees
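The two aggregation rules are simple enough to write out directly. This is a minimal sketch; the function names are hypothetical, and the inputs stand for the per-tree predictions for a single example.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Majority vote over the class labels predicted by each tree."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Plain average of the numeric predictions of each tree."""
    return mean(tree_outputs)

print(forest_classify(["cat", "dog", "cat"]))   # cat
print(forest_regress([2.0, 3.0, 4.0]))          # 3.0
```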

Why this reduces variance — without raising bias

Average of n identically distributed estimators:

  • Variance: shrinks by a factor of n (from σ² to σ²/n) if they’re independent; by less if they’re positively correlated
  • Bias: unchanged — averaging doesn’t shift the expected prediction

Tree-vs-tree correlation is what controls how much variance reduction you actually get. Feature subsampling is the genius bit — it forces different trees to look at different features, decorrelating them. Without it, all trees pick the same strong root and you get only modest variance reduction.
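The role of correlation can be made precise with a standard identity. It is stated here under the simplifying assumption that every tree prediction T_i has the same variance σ² and every pair of trees has the same correlation ρ (these symbols are introduced for illustration):

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} T_i\right)
  = \frac{1}{n^2}\left[\, n\sigma^2 + n(n-1)\,\rho\,\sigma^2 \,\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

As n grows, the second term vanishes but the first term ρσ² stays. Within this simplified model, adding trees cannot push the variance below that floor, and lowering ρ is what feature subsampling is for. With ρ = 0 the formula reduces to the independent case σ²/n.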

⚠️ Exam trap: “Bagging reduces bias” → FALSE. It reduces variance only. (Boosting reduces bias.)

The Out-Of-Bag (OOB) error — free cross-validation

Each example is “out-of-bag” for ~37 % of trees, since it has a 1 − 1/e ≈ 0.63 chance of appearing in any one bootstrap sample of size N.

→ For each example, predict using only the trees that didn’t see it → this gives a nearly unbiased (if anything slightly pessimistic) estimate of test error without needing a separate validation set.

Why 37 %: P(not selected in N draws) = (1 − 1/N)^N → 1/e ≈ 0.368 as N → ∞.
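A quick numeric check of that limit, as a sketch (the function name `p_out_of_bag` is hypothetical):

```python
import math

def p_out_of_bag(n):
    """Probability that one fixed example is missed by all n draws."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000, 100_000):
    print(f"N={n:>6}: {p_out_of_bag(n):.4f}")
print(f"1/e     : {math.exp(-1):.4f}")
```

The values climb toward 1/e quickly, so the 37 % rule is already a good approximation for datasets with a few hundred rows.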

See it in code

(For a real implementation use sklearn.ensemble.RandomForestClassifier — this is just to make the bagging + OOB mechanism visible.)
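A minimal from-scratch sketch of the mechanism, using only the standard library. To stay short it makes several loud assumptions: each "tree" is a depth-1 stump rather than a full tree, feature subsampling picks a single feature per stump, and the data is a synthetic two-feature problem. All function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Depth-1 'tree': best threshold on one feature by training error count."""
    best = None
    for t in sorted({row[feature] for row in X}):
        for left, right in ((0, 1), (1, 0)):        # try both label orientations
            errors = sum((left if row[feature] <= t else right) != label
                         for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda row: left if row[feature] <= t else right

def fit_forest(X, y, n_trees, rng):
    n, p = len(X), len(X[0])
    trees, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]   # Trick 1: bootstrap sample
        feature = rng.randrange(p)                      # Trick 2: random feature
        tree = fit_stump([X[i] for i in in_bag], [y[i] for i in in_bag], feature)
        trees.append(tree)
        for i in set(range(n)) - set(in_bag):           # rows this tree never saw
            oob_votes[i][tree(X[i])] += 1
    return trees, oob_votes

def oob_error(oob_votes, y):
    """Error of the majority vote, using only each row's out-of-bag trees."""
    scored = [(votes.most_common(1)[0][0], label)
              for votes, label in zip(oob_votes, y) if votes]
    return sum(pred != label for pred, label in scored) / len(scored)

rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(150)]
y = [int(row[0] + row[1] > 0) for row in X]             # synthetic labels
trees, oob_votes = fit_forest(X, y, n_trees=25, rng=rng)
err = oob_error(oob_votes, y)
print(f"OOB error with {len(trees)} stumps: {err:.3f}")
```

Stumps are too weak to show the full variance-reduction effect; the point is only that every row gets scored by trees that never trained on it, with no separate validation split.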

Visual: OOB error vs. number of trees

Two things to show in one plot: (a) OOB error drops as you add more trees, then plateaus, and (b) the forest beats a single tree by a wide margin.

What to see:

  • OOB error tracks test error closely — that’s the OOB-as-free-validation magic. You don’t need a held-out set to estimate generalization.
  • Both drop sharply for the first 10–20 trees, then plateau. Adding more trees past ~50 gives diminishing returns.
  • Single-tree error (gray dashed) is ~2× worse than the forest. The variance reduction is the whole story.
  • This curve is why “1000 trees” is overkill for most problems — 100 is usually plenty.
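The numbers behind such a curve can be reproduced with scikit-learn, roughly as follows. This is a sketch on a synthetic dataset; the dataset parameters and tree counts are arbitrary assumptions, and the exact error values will differ from any particular plot. It relies on `warm_start=True` so that each call to `fit` adds trees to the existing forest instead of retraining from scratch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic tabular problem (assumed sizes, chosen only for illustration).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: one fully grown tree.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
single_tree_error = 1 - single_tree.score(X_te, y_te)

# Grow the same forest step by step and record both error estimates.
forest = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
curve = []
for n_trees in (25, 50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_tr, y_tr)
    curve.append((n_trees, 1 - forest.oob_score_, 1 - forest.score(X_te, y_te)))

print(f"single tree test error: {single_tree_error:.3f}")
for n_trees, oob_err, test_err in curve:
    print(f"{n_trees:>4} trees  OOB error {oob_err:.3f}  test error {test_err:.3f}")
```

Plotting the second and third columns of `curve` against the first, with `single_tree_error` as a horizontal line, gives the kind of figure described above.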

Properties

  • Low variance, decent bias → strong default classifier
  • OOB error = built-in cross-validation
  • Feature importance computed via mean decrease in impurity (each feature’s total impurity reduction, averaged across trees)
  • Robust to irrelevant features
  • Handles mixed feature types and needs no scaling; missing-value support depends on the implementation
  • Parallelizable — each tree is independent
  • Less interpretable than single tree — you can’t trace a prediction through 100 trees
  • Slow at inference if many deep trees
  • Often loses to well-tuned gradient boosting on accuracy benchmarks

Random Forest vs. Gradient Boosting

|                       | Random Forest                 | Gradient Boosting (XGBoost)             |
|-----------------------|-------------------------------|-----------------------------------------|
| How trees combine     | Independent, voting/averaging | Sequential, each fixing previous errors |
| Reduces               | Variance                      | Bias                                    |
| Sensitive to outliers | Robust                        | More sensitive                          |
| Training              | Parallel                      | Sequential (slower)                     |
| Hyperparameter tuning | Few, robust                   | Many, sensitive                         |
| Default pick          | When you want fast + robust   | When you want best accuracy             |

Rule of thumb: Random Forest as baseline; XGBoost/LightGBM if you need every last % of accuracy.

Where Random Forest is used today

  • Kaggle tabular ML — Random Forest is the standard baseline before XGBoost
  • Bioinformatics / genomics — gene expression analysis (high-dim, low-sample data)
  • Remote sensing — satellite image classification (land use, deforestation tracking)
  • Finance — credit risk, fraud detection (explainability for regulators)
  • Healthcare — clinical risk prediction (often co-deployed with logistic regression for comparison)
  • Feature selection in ML pipelines — use RF importance scores to drop weak features before training a final model
  • Ecology — species distribution modeling

Where Random Forest was challenged — and by what

| Domain                    | Was RF, now …                     | Why                                                  |
|---------------------------|-----------------------------------|------------------------------------------------------|
| Top-accuracy tabular ML   | XGBoost, LightGBM, CatBoost       | Gradient boosting reaches lower bias than RF can     |
| Image / text / sequence   | Deep neural networks              | RFs can’t learn hierarchical features from raw input |
| Very high-dim sparse data | Linear models + L1 regularization | RFs struggle with thousands of mostly-zero features  |
| Real-time inference       | Single neural network             | Forest of 1000 trees is slow at inference            |

Where RF still wins: medium-size tabular data, when you need OOB-style validation, when interpretability + robustness matter more than peak accuracy, and as a first baseline in any tabular ML project.

Tags: methods-of-ai machine-learning random-forest ensemble bagging oob
Created: 18-05-26