A Decision Tree is a flowchart-shaped classifier: at each internal node, test one attribute; follow the matching branch; reach a leaf, output its class. They’re the most interpretable ML model — every prediction comes with a human-readable explanation (“Path: weather=sunny → humidity=high → play=no”).
ID3 (Iterative Dichotomiser 3, Quinlan 1986) is the original algorithm to build a decision tree from data. It recursively picks the attribute with highest Information Gain to split on, until each leaf is pure or no attributes remain.
The algorithm in 4 lines
build_tree(data):
if all examples have same class: return leaf(class)
if no attributes left: return leaf(majority_class(data))
A* = argmax_A InformationGain(data, A)
for each value v of A*: build_tree(data where A*=v) as subtree
That’s it. Greedy, recursive, no backtracking. The only thing that varies between DT algorithms is how the split attribute is chosen (ID3 uses Information Gain, CART uses Gini impurity, C4.5 uses Gain Ratio).
Entropy and Information Gain
Entropy measures impurity of a label set S:
H(S) = − Σᵢ pᵢ · log₂(pᵢ)
where pᵢ is the fraction of S labeled with class i.
H = 0 when all examples have the same class (pure leaf — perfect)
H = 1 for a 50/50 binary split (maximum uncertainty — worst)
H = log₂(k) for a uniform k-class distribution
Information Gain of splitting on attribute A:
Gain(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)
In plain English: “how much does splitting on A reduce uncertainty?” The attribute with highest gain becomes the root of (this) subtree.
Worked example (the classic “play tennis” dataset)
Day
Weather
Humidity
Wind
Play?
1
Sunny
High
Weak
No
2
Sunny
High
Strong
No
3
Cloudy
High
Weak
Yes
4
Rain
High
Weak
Yes
5
Rain
Normal
Weak
Yes
6
Rain
Normal
Strong
No
Labels = 3 Yes (days 3,4,5) / 3 No (days 1,2,6) → a perfectly balanced root:
If you give ID3 an attribute like DayOfYear or UserID (every example has a unique value), Information Gain reports perfect gain — each value gets its own pure leaf. The tree overfits catastrophically.
Fixes:
Gain Ratio (C4.5): normalize gain by SplitInformation — penalizes attributes that split into many branches
Gini impurity (CART): different impurity measure, less biased toward high-cardinality
Random subsampling of features (used in Random Forests) — sidesteps the problem by only considering a few features per split
Properties
✅ Interpretable — every decision is a readable if-then chain
✅ No feature scaling needed — works with raw data
✅ Handles categorical + numerical features
✅ Robust to irrelevant features — they just never get picked
❌ High variance — small changes in data → very different tree
❌ Prone to overfitting — deep trees memorize training noise
❌ Greedy, no backtracking — locally optimal split may not lead to globally optimal tree
❌ Cannot model XOR-like interactions cleanly — needs to be split twice
Preventing overfitting
Pre-pruning (early stopping): stop splitting when |S| < threshold, or gain < threshold, or max depth reached
Post-pruning (reduced error pruning): build full tree, then collapse subtrees whose removal doesn’t hurt validation accuracy
Random Forest (see Random Forest) — the modern default: build many trees on bootstrap samples and average
Where Decision Trees and ID3 are used today
CART (sklearn DecisionTreeClassifier) — the default tree in scikit-learn; widely used as baseline
XGBoost / LightGBM / CatBoost — gradient-boosted decision trees dominate tabular ML. Most Kaggle wins use these.
Random Forest — ensemble of decision trees, the standard non-deep-learning classifier
Medical decision support — clinicians prefer trees because they can audit the reasoning (“IF temperature > 38 AND cough → flu test”)
Credit scoring — banks use trees + ensembles because regulators require explainability (GDPR Article 22)
Feature importance analysis — even when not deployed, trees are used to understand which features matter
Where pure ID3 was replaced — and by what
Application
Was ID3, now …
Why
Single-tree classification
Random Forest, XGBoost, LightGBM
Single trees have too much variance — ensembles fix this
Tabular ML in general
Gradient-boosted trees (XGBoost et al.)
Better accuracy through boosting (sequential trees correcting errors)
Image / text / sequence data
Deep neural networks (CNNs, RNNs, Transformers)
Trees can’t learn hierarchical features from raw pixels / tokens
Multi-way splits with continuous features
CART (binary splits)
CART’s binary splits are simpler and more flexible
Where decision trees still dominate:tabular data with mixed feature types, when interpretability matters, and when training data is small-to-medium. Deep learning has not replaced trees for tabular ML — XGBoost still wins most Kaggle tabular competitions.
See also
Random Forest — the ensemble that fixes single-tree variance
Bias-Variance Tradeoff — single trees are high-variance; forests reduce variance via bagging