Information Gain

Information gain aims to reduce entropy — it measures how much knowing the value of an attribute A reduces our uncertainty about the class label of a set S. It is the splitting criterion the ID3 decision-tree algorithm uses: at each node, pick the attribute with the highest information gain.

Formula

  • E(S) = entropy of S, E(S) = −Σᵢ pᵢ log₂ pᵢ (in bits). ⚠️ This E is the lecture’s notation for entropy = Shannon H(S); it is not the expectation 𝔼. See Machine Learning I & II.
  • Values(A) = the distinct values attribute A can take; S_v = the subset of S with A = v; |S_v|/|S| is that subset’s share.

In words: entropy before the split, minus the weighted entropy after the split. A big gain means A splits the data into much purer subsets.

Intuition

  • Pure subset (all one class) → entropy 0 → contributes nothing to the weighted sum → high gain.
  • 50/50 split → entropy 1 bit → no purity gained → low gain.
  • ID3 is greedy: it maximises gain at each node, never looking ahead.

⚠️ The high-cardinality bias

Pure information gain over-favours attributes with many distinct values. An attribute like patient_ID (unique per example) splits S into singleton, perfectly pure subsets → maximal gain → but zero generalisation (overfitting in the extreme).

Fix: Gain Ratio (used in C4.5) divides gain by the intrinsic information of the split, SplitInfo(A) = −Σ_v (|S_v|/|S|) log₂(|S_v|/|S|), penalising high-cardinality attributes. Gain Ratio / C4.5 are textbook content — the slides give Gini impurity as the alternative criterion.

Exam framing

  • IG = entropy reduction; it is ID3’s per-node choice.
  • Know the formula and the weighted child-entropy term (easy to forget the |S_v|/|S| weights).
  • High-cardinality trap → Gain Ratio.

See also

Quellen

Erstellt: 15-01-25 19:53