Decision Trees and ID3

methods-of-ai

A Decision Tree is a flowchart-shaped classifier: at each internal node, test one attribute; follow the matching branch; reach a leaf, output its class. They’re the most interpretable ML model — every prediction comes with a human-readable explanation (“Path: weather=sunny → humidity=high → play=no”).

ID3 (Iterative Dichotomiser 3, Quinlan 1986) is the original algorithm to build a decision tree from data. It recursively picks the attribute with highest Information Gain to split on, until each leaf is pure or no attributes remain.

The algorithm in 4 lines

build_tree(data):
    if all examples have same class: return leaf(class)
    if no attributes left:           return leaf(majority_class(data))
    A* = argmax_A InformationGain(data, A)
    for each value v of A*:          build_tree(data where A*=v) as subtree

That’s it. Greedy, recursive, no backtracking. The only thing that varies between DT algorithms is how the split attribute is chosen (ID3 uses Information Gain, CART uses Gini impurity, C4.5 uses Gain Ratio).

Entropy and Information Gain

Entropy measures impurity of a label set S:

H(S) = − Σᵢ pᵢ · log₂(pᵢ)

where pᵢ is the fraction of S labeled with class i.

H = 0 when all examples have the same class (pure leaf — perfect)
H = 1 for a 50/50 binary split (maximum uncertainty — worst)
H = log₂(k) for a uniform k-class distribution

Information Gain of splitting on attribute A:

Gain(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) · H(Sᵥ)

In plain English: “how much does splitting on A reduce uncertainty?” The attribute with highest gain becomes the root of (this) subtree.

Worked example (the classic “play tennis” dataset)

Day	Weather	Humidity	Wind	Play?
1	Sunny	High	Weak	No
2	Sunny	High	Strong	No
3	Cloudy	High	Weak	Yes
4	Rain	High	Weak	Yes
5	Rain	Normal	Weak	Yes
6	Rain	Normal	Strong	No

Labels = 3 Yes (days 3,4,5) / 3 No (days 1,2,6) → a perfectly balanced root:

H(S) = −(3/6)·log₂(3/6) − (3/6)·log₂(3/6) = 1.0 (50/50 → maximally impure)

Split on Weather:

Sunny (2 examples, 0 yes / 2 no): H = 0
Cloudy (1 example, 1 yes / 0 no): H = 0
Rain (3 examples, 2 yes / 1 no): H ≈ 0.92

Weighted entropy after split = (2/6)·0 + (1/6)·0 + (3/6)·0.92 = 0.46

Information Gain(Weather) = 1.0 − 0.46 = 0.54 ← highest gain → root split.

See it work in code

🐍 Code anzeigen / ausblenden

import math
from collections import Counter
 
def entropy(labels):
    n = len(labels)
    counts = Counter(labels).values()
    return -sum((c/n) * math.log2(c/n) for c in counts if c > 0)
 
def information_gain(data, attr_idx, label_idx=-1):
    """data: list of feature-tuples + label at label_idx."""
    total_H = entropy([row[label_idx] for row in data])
    # Partition by the attribute's value
    partitions = {}
    for row in data:
        partitions.setdefault(row[attr_idx], []).append(row)
    weighted_H = sum(
        (len(part) / len(data)) * entropy([r[label_idx] for r in part])
        for part in partitions.values()
    )
    return total_H - weighted_H
 
# Tennis dataset (Weather, Humidity, Wind, Play?)
data = [
    ('Sunny',  'High',   'Weak',   'No'),
    ('Sunny',  'High',   'Strong', 'No'),
    ('Cloudy', 'High',   'Weak',   'Yes'),
    ('Rain',   'High',   'Weak',   'Yes'),
    ('Rain',   'Normal', 'Weak',   'Yes'),
    ('Rain',   'Normal', 'Strong', 'No'),
]
attrs = ['Weather', 'Humidity', 'Wind']
for i, name in enumerate(attrs):
    print(f"Gain({name}) = {information_gain(data, i):.3f}")
# Gain(Weather)  = 0.541     ← winner → root of the tree
# Gain(Humidity) = 0.000     (High: 2/2, Normal: 1/1 → both H=1 → zero gain)
# Gain(Wind)     = 0.459

ID3 picks Weather as the root, then recurses on each branch’s subset.

Visual: tree depth vs. overfitting

Classic plot: as you let the tree grow deeper, training error → 0 but test error first drops then rises. The gap = overfitting.

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install required packages first.
# In normal Python (terminal / Jupyter), delete the next 4 lines.
import micropip
await micropip.install("matplotlib")
await micropip.install("scikit-learn")
 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
 
X, y = make_moons(n_samples=300, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
 
depths = list(range(1, 21))
train_err, test_err = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_train, y_train)
    train_err.append(1 - tree.score(X_train, y_train))
    test_err.append (1 - tree.score(X_test,  y_test))
 
plt.figure(figsize=(9, 4.5))
plt.plot(depths, train_err, 'o-', color='steelblue', label='train error')
plt.plot(depths, test_err,  's-', color='red',       label='test error')
plt.axvline(np.argmin(test_err) + 1, color='green', ls='--', alpha=0.6, label='best depth')
plt.xlabel('max tree depth'); plt.ylabel('classification error')
plt.title('Decision Tree on make_moons — train vs test error')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

What to see:

Train error drops to 0 by depth ~5 (tree memorizes training set)
Test error has a U-shape — minimum around depth 4–6, then climbs as the tree starts overfitting
The gap between the two lines = generalization gap, the visual signature of overfitting
Optimal depth = the green dashed line

This is the bias-variance tradeoff incarnate. See Bias-Variance Tradeoff.

⚠️ Famous trap: high-cardinality attributes

If you give ID3 an attribute like DayOfYear or UserID (every example has a unique value), Information Gain reports perfect gain — each value gets its own pure leaf. The tree overfits catastrophically.

Fixes:

Gain Ratio (C4.5): normalize gain by SplitInformation — penalizes attributes that split into many branches
Gini impurity (CART): different impurity measure, less biased toward high-cardinality
Random subsampling of features (used in Random Forests) — sidesteps the problem by only considering a few features per split

Properties

✅ Interpretable — every decision is a readable if-then chain
✅ No feature scaling needed — works with raw data
✅ Handles categorical + numerical features
✅ Robust to irrelevant features — they just never get picked
❌ High variance — small changes in data → very different tree
❌ Prone to overfitting — deep trees memorize training noise
❌ Greedy, no backtracking — locally optimal split may not lead to globally optimal tree
❌ Cannot model XOR-like interactions cleanly — needs to be split twice

Preventing overfitting

Pre-pruning (early stopping): stop splitting when |S| < threshold, or gain < threshold, or max depth reached
Post-pruning (reduced error pruning): build full tree, then collapse subtrees whose removal doesn’t hurt validation accuracy
Random Forest (see Random Forest) — the modern default: build many trees on bootstrap samples and average

Where Decision Trees and ID3 are used today

CART (sklearn DecisionTreeClassifier) — the default tree in scikit-learn; widely used as baseline
XGBoost / LightGBM / CatBoost — gradient-boosted decision trees dominate tabular ML. Most Kaggle wins use these.
Random Forest — ensemble of decision trees, the standard non-deep-learning classifier
Medical decision support — clinicians prefer trees because they can audit the reasoning (“IF temperature > 38 AND cough → flu test”)
Credit scoring — banks use trees + ensembles because regulators require explainability (GDPR Article 22)
Feature importance analysis — even when not deployed, trees are used to understand which features matter

Where pure ID3 was replaced — and by what

Application	Was ID3, now …	Why
Single-tree classification	Random Forest, XGBoost, LightGBM	Single trees have too much variance — ensembles fix this
Tabular ML in general	Gradient-boosted trees (XGBoost et al.)	Better accuracy through boosting (sequential trees correcting errors)
Image / text / sequence data	Deep neural networks (CNNs, RNNs, Transformers)	Trees can’t learn hierarchical features from raw pixels / tokens
Multi-way splits with continuous features	CART (binary splits)	CART’s binary splits are simpler and more flexible

Where decision trees still dominate: tabular data with mixed feature types, when interpretability matters, and when training data is small-to-medium. Deep learning has not replaced trees for tabular ML — XGBoost still wins most Kaggle tabular competitions.

Brain Online

Explorer

Decision Trees and ID3

Decision Trees and ID3

The algorithm in 4 lines

Entropy and Information Gain

Worked example (the classic “play tennis” dataset)

See it work in code

Visual: tree depth vs. overfitting

⚠️ Famous trap: high-cardinality attributes

Properties

Preventing overfitting

Where Decision Trees and ID3 are used today

Where pure ID3 was replaced — and by what

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis