Information Gain

Information gain aims to reduce entropy — it measures how much knowing the value of an attribute A reduces our uncertainty about the class label of a set S. It is the splitting criterion the ID3 decision-tree algorithm uses: at each node, pick the attribute with the highest information gain.

Formula

$Gain (S, A) = E (S) - \sum_{v \in Values (A)} \frac{∣ S _{v} ∣}{∣ S ∣} E (S_{v})$

E(S) = entropy of S, E(S) = −Σᵢ pᵢ log₂ pᵢ (in bits). ⚠️ This E is the lecture’s notation for entropy = Shannon H(S); it is not the expectation 𝔼. See Machine Learning I & II.
Values(A) = the distinct values attribute A can take; S_v = the subset of S with A = v; |S_v|/|S| is that subset’s share.

In words: entropy before the split, minus the weighted entropy after the split. A big gain means A splits the data into much purer subsets.

Intuition

Pure subset (all one class) → entropy 0 → contributes nothing to the weighted sum → high gain.
50/50 split → entropy 1 bit → no purity gained → low gain.
ID3 is greedy: it maximises gain at each node, never looking ahead.

⚠️ The high-cardinality bias

Pure information gain over-favours attributes with many distinct values. An attribute like patient_ID (unique per example) splits S into singleton, perfectly pure subsets → maximal gain → but zero generalisation (overfitting in the extreme).

Fix: Gain Ratio (used in C4.5) divides gain by the intrinsic information of the split, SplitInfo(A) = −Σ_v (|S_v|/|S|) log₂(|S_v|/|S|), penalising high-cardinality attributes. Gain Ratio / C4.5 are textbook content — the slides give Gini impurity as the alternative criterion.

Exam framing

IG = entropy reduction; it is ID3’s per-node choice.
Know the formula and the weighted child-entropy term (easy to forget the |S_v|/|S| weights).
High-cardinality trap → Gain Ratio.

Quellen

Erstellt: 15-01-25 19:53

Brain Online

Explorer

Information Gain

Information Gain

Formula

Intuition

⚠️ The high-cardinality bias

Exam framing

See also

Quellen

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis