Decision Trees and ID3

methods-of-ai

A Decision Tree is a flowchart-shaped classifier: at each internal node, test one attribute; follow the matching branch; reach a leaf, output its class. They’re the most interpretable ML model — every prediction comes with a human-readable explanation (“Path: weather=sunny → humidity=high → play=no”).

ID3 (Iterative Dichotomiser 3, Quinlan 1986) is the original algorithm to build a decision tree from data. It recursively picks the attribute with highest Information Gain to split on, until each leaf is pure or no attributes remain.

The algorithm in 4 lines

build_tree(data):
    if all examples have same class: return leaf(class)
    if no attributes left:           return leaf(majority_class(data))
    A* = argmax_A InformationGain(data, A)
    for each value v of A*:          build_tree(data where A*=v) as subtree

That’s it. Greedy, recursive, no backtracking. The only thing that varies between DT algorithms is how the split attribute is chosen (ID3 uses Information Gain, CART uses Gini impurity, C4.5 uses Gain Ratio).

Entropy and Information Gain

Entropy measures impurity of a label set S:

H(S) = − Σᵢ pᵢ · log₂(pᵢ)

where pᵢ is the fraction of S labeled with class i.

  • H = 0 when all examples have the same class (pure leaf — perfect)
  • H = 1 for a 50/50 binary split (maximum uncertainty — worst)
  • H = log₂(k) for a uniform k-class distribution

Information Gain of splitting on attribute A:

Gain(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

In plain English: “how much does splitting on A reduce uncertainty?” The attribute with highest gain becomes the root of (this) subtree.

Worked example (the classic “play tennis” dataset)

DayWeatherHumidityWindPlay?
1SunnyHighWeakNo
2SunnyHighStrongNo
3CloudyHighWeakYes
4RainHighWeakYes
5RainNormalWeakYes
6RainNormalStrongNo

Labels = 3 Yes (days 3,4,5) / 3 No (days 1,2,6) → a perfectly balanced root:

H(S) = −(3/6)·log₂(3/6) − (3/6)·log₂(3/6) = 1.0 (50/50 → maximally impure)

Split on Weather:

  • Sunny (2 examples, 0 yes / 2 no): H = 0
  • Cloudy (1 example, 1 yes / 0 no): H = 0
  • Rain (3 examples, 2 yes / 1 no): H ≈ 0.92

Weighted entropy after split = (2/6)·0 + (1/6)·0 + (3/6)·0.92 = 0.46

Information Gain(Weather) = 1.0 − 0.46 = 0.54 ← highest gain → root split.

See it work in code

ID3 picks Weather as the root, then recurses on each branch’s subset.

Visual: tree depth vs. overfitting

Classic plot: as you let the tree grow deeper, training error → 0 but test error first drops then rises. The gap = overfitting.

What to see:

  • Train error drops to 0 by depth ~5 (tree memorizes training set)
  • Test error has a U-shape — minimum around depth 4–6, then climbs as the tree starts overfitting
  • The gap between the two lines = generalization gap, the visual signature of overfitting
  • Optimal depth = the green dashed line

This is the bias-variance tradeoff incarnate. See Bias-Variance Tradeoff.

⚠️ Famous trap: high-cardinality attributes

If you give ID3 an attribute like DayOfYear or UserID (every example has a unique value), Information Gain reports perfect gain — each value gets its own pure leaf. The tree overfits catastrophically.

Fixes:

  • Gain Ratio (C4.5): normalize gain by SplitInformation — penalizes attributes that split into many branches
  • Gini impurity (CART): different impurity measure, less biased toward high-cardinality
  • Random subsampling of features (used in Random Forests) — sidesteps the problem by only considering a few features per split

Properties

  • Interpretable — every decision is a readable if-then chain
  • No feature scaling needed — works with raw data
  • Handles categorical + numerical features
  • Robust to irrelevant features — they just never get picked
  • High variance — small changes in data → very different tree
  • Prone to overfitting — deep trees memorize training noise
  • Greedy, no backtracking — locally optimal split may not lead to globally optimal tree
  • Cannot model XOR-like interactions cleanly — needs to be split twice

Preventing overfitting

  • Pre-pruning (early stopping): stop splitting when |S| < threshold, or gain < threshold, or max depth reached
  • Post-pruning (reduced error pruning): build full tree, then collapse subtrees whose removal doesn’t hurt validation accuracy
  • Random Forest (see Random Forest) — the modern default: build many trees on bootstrap samples and average

Where Decision Trees and ID3 are used today

  • CART (sklearn DecisionTreeClassifier) — the default tree in scikit-learn; widely used as baseline
  • XGBoost / LightGBM / CatBoost — gradient-boosted decision trees dominate tabular ML. Most Kaggle wins use these.
  • Random Forest — ensemble of decision trees, the standard non-deep-learning classifier
  • Medical decision support — clinicians prefer trees because they can audit the reasoning (“IF temperature > 38 AND cough → flu test”)
  • Credit scoring — banks use trees + ensembles because regulators require explainability (GDPR Article 22)
  • Feature importance analysis — even when not deployed, trees are used to understand which features matter

Where pure ID3 was replaced — and by what

ApplicationWas ID3, now …Why
Single-tree classificationRandom Forest, XGBoost, LightGBMSingle trees have too much variance — ensembles fix this
Tabular ML in generalGradient-boosted trees (XGBoost et al.)Better accuracy through boosting (sequential trees correcting errors)
Image / text / sequence dataDeep neural networks (CNNs, RNNs, Transformers)Trees can’t learn hierarchical features from raw pixels / tokens
Multi-way splits with continuous featuresCART (binary splits)CART’s binary splits are simpler and more flexible

Where decision trees still dominate: tabular data with mixed feature types, when interpretability matters, and when training data is small-to-medium. Deep learning has not replaced trees for tabular ML — XGBoost still wins most Kaggle tabular competitions.

See also

Tags: methods-of-ai machine-learning decision-trees id3 entropy information-gain
Created: 18-05-26