A Random Forest (Breiman, 2001) is an ensemble of decision trees that votes on the answer. Each tree is trained on a different bootstrap sample of the data, and at each split only a random subset of features is considered. The result: a model with the interpretability of trees but dramatically lower variance than a single tree.
Random Forest is the de facto default classifier for tabular data when you don’t know what else to try. Often beats deep learning on small/medium tabular datasets.
The two tricks
Random Forest = Decision Trees + two randomization tricks:
Trick 1: Bagging (Bootstrap Aggregation)
For each tree, sample N examples with replacement from the training set (a “bootstrap sample”).
Each bootstrap sample is the same size as the original, but ~37 % of examples are missing (those become the Out-Of-Bag (OOB) set).
Train each tree on its own bootstrap sample.
Trick 2: Feature subsampling at each split
At every split decision, consider only a random subset of features (typically √p for classification, p/3 for regression, where p = total features).
This decorrelates the trees — without it, all trees would pick the same “strong” feature near the root.
Prediction
Classification: majority vote across trees
Regression: average across trees
Why this reduces variance — without raising bias
Average of n identically distributed estimators:
Variance: reduces by 1/n if they’re independent; less if they’re correlated
Bias: unchanged — averaging doesn’t shift the expected prediction
Tree-vs-tree correlation is what controls how much variance reduction you actually get. Feature subsampling is the genius bit — it forces different trees to look at different features, decorrelating them. Without it, all trees pick the same strong root and you get only modest variance reduction.
The Out-Of-Bag (OOB) error — free cross-validation
Each example is “out-of-bag” for ~37 % of trees (since 1 − 1/e ≈ 0.63 chance of being included in a bootstrap sample of size N).
→ For each example, predict using only the trees that didn’t see it → this gives an unbiased estimate of test error without needing a separate validation set.
Why 37 %: P(not selected in N draws) = (1 − 1/N)^N → 1/e ≈ 0.368 as N → ∞.
See it in code
🐍 Code anzeigen / ausblenden
import randomfrom collections import Counter, defaultdictdef fit_tree(data, max_depth=3): """Stub: pretend this fits a decision tree. Returns a callable predictor.""" # In reality: recursive entropy-based split (see [[Decision Trees and ID3]]) # Here we use a closure over the training data for demonstration. def predict(x): # Naive 1-nearest-neighbor for demo simplicity return min(data, key=lambda d: sum((a - b)**2 for a, b in zip(x, d[:-1])))[-1] return predictdef bootstrap(data): """Sample N examples with replacement → ~63 % unique, ~37 % out-of-bag.""" return [random.choice(data) for _ in range(len(data))]def fit_random_forest(data, n_trees=50, max_depth=3): """Train n_trees on bootstrap samples; track which examples each tree didn't see.""" forest, oob_sets = [], [] data_set = [tuple(row) for row in data] for _ in range(n_trees): sample = bootstrap(data) in_bag = set(tuple(row) for row in sample) oob = [row for row in data_set if row not in in_bag] tree = fit_tree(sample, max_depth) forest.append(tree) oob_sets.append(oob) return forest, oob_setsdef predict_forest(forest, x): """Majority vote across all trees.""" votes = [tree(x) for tree in forest] return Counter(votes).most_common(1)[0][0]def oob_error(forest, oob_sets, data): """For each example, predict using only trees that didn't see it.""" correct = 0; total = 0 for row in data: row_t = tuple(row) # Find trees where this row is OOB relevant_trees = [forest[i] for i, oob in enumerate(oob_sets) if row_t in (tuple(r) for r in oob)] if not relevant_trees: continue pred = Counter(t(row[:-1]) for t in relevant_trees).most_common(1)[0][0] correct += int(pred == row[-1]) total += 1 return 1 - correct / total if total else None# Toy dataset: 2 features, binary labelrandom.seed(42)data = [(random.gauss(i, 1), random.gauss(i, 1), i) for i in [0, 1] for _ in range(50)]forest, oob_sets = fit_random_forest(data, n_trees=30)print("OOB error:", oob_error(forest, oob_sets, data))# OOB error gives an unbiased generalization estimate — no train/test split needed.
(For a real implementation use sklearn.ensemble.RandomForestClassifier — this is just to make the bagging + OOB mechanism visible.)
Visual: OOB error vs. number of trees
Two things to show in one plot: (a) OOB error drops as you add more trees, then plateaus, and (b) the forest beats a single tree by a wide margin.
🐍 Code anzeigen / ausblenden
# Pyodide / Obsidian Execute Code: install required packages first.# In normal Python (terminal / Jupyter), delete the next 4 lines.import micropipawait micropip.install("matplotlib")await micropip.install("scikit-learn")import matplotlib.pyplot as pltfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.tree import DecisionTreeClassifierfrom sklearn.datasets import make_moonsfrom sklearn.model_selection import train_test_splitX, y = make_moons(n_samples=500, noise=0.3, random_state=42)X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)# Single tree as baselinetree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)single_tree_err = 1 - tree.score(X_te, y_te)# Random Forest with varying # of trees, tracking OOB errorn_trees = list(range(1, 101, 2))oob_errs, test_errs = [], []for n in n_trees: rf = RandomForestClassifier(n_estimators=n, oob_score=True, bootstrap=True, random_state=42, n_jobs=-1).fit(X_tr, y_tr) oob_errs.append (1 - rf.oob_score_) test_errs.append(1 - rf.score(X_te, y_te))plt.figure(figsize=(9, 4.5))plt.plot(n_trees, oob_errs, 'o-', color='steelblue', lw=1.2, label='OOB error')plt.plot(n_trees, test_errs, 's-', color='red', lw=1.2, label='test error', alpha=0.7)plt.axhline(single_tree_err, color='gray', ls='--', alpha=0.6, label=f'single tree test error = {single_tree_err:.3f}')plt.xlabel('number of trees in forest'); plt.ylabel('classification error')plt.title('Random Forest — OOB & test error vs. # trees')plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()
What to see:
OOB error tracks test error closely — that’s the OOB-as-free-validation magic. You don’t need a held-out set to estimate generalization.
Both drop sharply for the first 10–20 trees, then plateau. Adding more trees past ~50 gives diminishing returns.
Single-tree error (gray dashed) is ~2× worse than the forest. The variance reduction is the whole story.
This curve is why “1000 trees” is overkill for most problems — 100 is usually plenty.
Finance — credit risk, fraud detection (explainability for regulators)
Healthcare — clinical risk prediction (often co-deployed with logistic regression for comparison)
Feature selection in ML pipelines — use RF importance scores to drop weak features before training a final model
Ecology — species distribution modeling
Where Random Forest was challenged — and by what
Domain
Was RF, now …
Why
Top-accuracy tabular ML
XGBoost, LightGBM, CatBoost
Gradient boosting reaches lower bias than RF can
Image / text / sequence
Deep neural networks
RFs can’t learn hierarchical features from raw input
Very high-dim sparse data
Linear models + L1 regularization
RFs struggle with thousands of mostly-zero features
Real-time inference
Single neural network
Forest of 1000 trees is slow at inference
Where RF still wins: medium-size tabular data, when you need OOB-style validation, when interpretability + robustness matter more than peak accuracy, and as a first baseline in any tabular ML project.