Random Forest

methods-of-ai

A Random Forest (Breiman, 2001) is an ensemble of decision trees that votes on the answer. Each tree is trained on a different bootstrap sample of the data, and at each split only a random subset of features is considered. The result: a model with the interpretability of trees but dramatically lower variance than a single tree.

Random Forest is the de facto default classifier for tabular data when you don’t know what else to try. Often beats deep learning on small/medium tabular datasets.

The two tricks

Random Forest = Decision Trees + two randomization tricks:

Trick 1: Bagging (Bootstrap Aggregation)

For each tree, sample N examples with replacement from the training set (a “bootstrap sample”).
Each bootstrap sample is the same size as the original, but ~37 % of examples are missing (those become the Out-Of-Bag (OOB) set).
Train each tree on its own bootstrap sample.

Trick 2: Feature subsampling at each split

At every split decision, consider only a random subset of features (typically √p for classification, p/3 for regression, where p = total features).
This decorrelates the trees — without it, all trees would pick the same “strong” feature near the root.

Prediction

Classification: majority vote across trees
Regression: average across trees

Why this reduces variance — without raising bias

Average of n identically distributed estimators:

Variance: reduces by 1/n if they’re independent; less if they’re correlated
Bias: unchanged — averaging doesn’t shift the expected prediction

Tree-vs-tree correlation is what controls how much variance reduction you actually get. Feature subsampling is the genius bit — it forces different trees to look at different features, decorrelating them. Without it, all trees pick the same strong root and you get only modest variance reduction.

⚠️ Exam trap: “Bagging reduces bias” → FALSE. It reduces variance only. (Boosting reduces bias.)

The Out-Of-Bag (OOB) error — free cross-validation

Each example is “out-of-bag” for ~37 % of trees (since 1 − 1/e ≈ 0.63 chance of being included in a bootstrap sample of size N).

→ For each example, predict using only the trees that didn’t see it → this gives an unbiased estimate of test error without needing a separate validation set.

Why 37 %: P(not selected in N draws) = (1 − 1/N)^N → 1/e ≈ 0.368 as N → ∞.

See it in code

🐍 Code anzeigen / ausblenden

import random
from collections import Counter, defaultdict
 
def fit_tree(data, max_depth=3):
    """Stub: pretend this fits a decision tree. Returns a callable predictor."""
    # In reality: recursive entropy-based split (see [[Decision Trees and ID3]])
    # Here we use a closure over the training data for demonstration.
    def predict(x):
        # Naive 1-nearest-neighbor for demo simplicity
        return min(data, key=lambda d: sum((a - b)**2 for a, b in zip(x, d[:-1])))[-1]
    return predict
 
def bootstrap(data):
    """Sample N examples with replacement → ~63 % unique, ~37 % out-of-bag."""
    return [random.choice(data) for _ in range(len(data))]
 
def fit_random_forest(data, n_trees=50, max_depth=3):
    """Train n_trees on bootstrap samples; track which examples each tree didn't see."""
    forest, oob_sets = [], []
    data_set = [tuple(row) for row in data]
    for _ in range(n_trees):
        sample = bootstrap(data)
        in_bag = set(tuple(row) for row in sample)
        oob = [row for row in data_set if row not in in_bag]
        tree = fit_tree(sample, max_depth)
        forest.append(tree)
        oob_sets.append(oob)
    return forest, oob_sets
 
def predict_forest(forest, x):
    """Majority vote across all trees."""
    votes = [tree(x) for tree in forest]
    return Counter(votes).most_common(1)[0][0]
 
def oob_error(forest, oob_sets, data):
    """For each example, predict using only trees that didn't see it."""
    correct = 0; total = 0
    for row in data:
        row_t = tuple(row)
        # Find trees where this row is OOB
        relevant_trees = [forest[i] for i, oob in enumerate(oob_sets) if row_t in (tuple(r) for r in oob)]
        if not relevant_trees:
            continue
        pred = Counter(t(row[:-1]) for t in relevant_trees).most_common(1)[0][0]
        correct += int(pred == row[-1])
        total += 1
    return 1 - correct / total if total else None
 
# Toy dataset: 2 features, binary label
random.seed(42)
data = [(random.gauss(i, 1), random.gauss(i, 1), i) for i in [0, 1] for _ in range(50)]
 
forest, oob_sets = fit_random_forest(data, n_trees=30)
print("OOB error:", oob_error(forest, oob_sets, data))
# OOB error gives an unbiased generalization estimate — no train/test split needed.

(For a real implementation use sklearn.ensemble.RandomForestClassifier — this is just to make the bagging + OOB mechanism visible.)

Visual: OOB error vs. number of trees

Two things to show in one plot: (a) OOB error drops as you add more trees, then plateaus, and (b) the forest beats a single tree by a wide margin.

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install required packages first.
# In normal Python (terminal / Jupyter), delete the next 4 lines.
import micropip
await micropip.install("matplotlib")
await micropip.install("scikit-learn")
 
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
 
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
 
# Single tree as baseline
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
single_tree_err = 1 - tree.score(X_te, y_te)
 
# Random Forest with varying # of trees, tracking OOB error
n_trees = list(range(1, 101, 2))
oob_errs, test_errs = [], []
for n in n_trees:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, bootstrap=True,
                                random_state=42, n_jobs=-1).fit(X_tr, y_tr)
    oob_errs.append (1 - rf.oob_score_)
    test_errs.append(1 - rf.score(X_te, y_te))
 
plt.figure(figsize=(9, 4.5))
plt.plot(n_trees, oob_errs,  'o-', color='steelblue', lw=1.2, label='OOB error')
plt.plot(n_trees, test_errs, 's-', color='red',       lw=1.2, label='test error', alpha=0.7)
plt.axhline(single_tree_err, color='gray', ls='--', alpha=0.6,
            label=f'single tree test error = {single_tree_err:.3f}')
plt.xlabel('number of trees in forest'); plt.ylabel('classification error')
plt.title('Random Forest — OOB & test error vs. # trees')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

What to see:

OOB error tracks test error closely — that’s the OOB-as-free-validation magic. You don’t need a held-out set to estimate generalization.
Both drop sharply for the first 10–20 trees, then plateau. Adding more trees past ~50 gives diminishing returns.
Single-tree error (gray dashed) is ~2× worse than the forest. The variance reduction is the whole story.
This curve is why “1000 trees” is overkill for most problems — 100 is usually plenty.

Properties

✅ Low variance, decent bias → strong default classifier
✅ OOB error = built-in cross-validation
✅ Feature importance computed via gain reduction across trees
✅ Robust to irrelevant features
✅ Handles missing values, mixed types, no scaling needed
✅ Parallelizable — each tree is independent
❌ Less interpretable than single tree — you can’t trace a prediction through 100 trees
❌ Slow at inference if many deep trees
❌ Loses to gradient boosting on most accuracy benchmarks

Random Forest vs. Gradient Boosting

	Random Forest	Gradient Boosting (XGBoost)
How trees combine	Independent, voting/averaging	Sequential, each fixing previous errors
Reduces	Variance	Bias
Sensitive to outliers	Robust	More sensitive
Training	Parallel	Sequential (slower)
Hyperparameter tuning	Few, robust	Many, sensitive
Default pick	When you want fast + robust	When you want best accuracy

Rule of thumb: Random Forest as baseline; XGBoost/LightGBM if you need every last % of accuracy.

Where Random Forest is used today

Kaggle tabular ML — Random Forest is the standard baseline before XGBoost
Bioinformatics / genomics — gene expression analysis (high-dim, low-sample data)
Remote sensing — satellite image classification (land use, deforestation tracking)
Finance — credit risk, fraud detection (explainability for regulators)
Healthcare — clinical risk prediction (often co-deployed with logistic regression for comparison)
Feature selection in ML pipelines — use RF importance scores to drop weak features before training a final model
Ecology — species distribution modeling

Where Random Forest was challenged — and by what

Domain	Was RF, now …	Why
Top-accuracy tabular ML	XGBoost, LightGBM, CatBoost	Gradient boosting reaches lower bias than RF can
Image / text / sequence	Deep neural networks	RFs can’t learn hierarchical features from raw input
Very high-dim sparse data	Linear models + L1 regularization	RFs struggle with thousands of mostly-zero features
Real-time inference	Single neural network	Forest of 1000 trees is slow at inference

Where RF still wins: medium-size tabular data, when you need OOB-style validation, when interpretability + robustness matter more than peak accuracy, and as a first baseline in any tabular ML project.

Brain Online

Explorer

Random Forest

Random Forest

The two tricks

Trick 1: Bagging (Bootstrap Aggregation)

Trick 2: Feature subsampling at each split

Prediction

Why this reduces variance — without raising bias

The Out-Of-Bag (OOB) error — free cross-validation

See it in code

Visual: OOB error vs. number of trees

Properties

Random Forest vs. Gradient Boosting

Where Random Forest is used today

Where Random Forest was challenged — and by what

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis