Quiz: Decision Trees and ID3
Methods of AI — SoSe 2026
7 questions, progressing from definition → mechanics → exam trap. Type your answer in the **Max's answer:** field below each question, then ping me to evaluate.
Q1 — Entropy
Question: A leaf node contains 6 training examples, all labeled Yes. What is its entropy, and what does that tell you about the node — both numerically and intuitively?
Answer
Entropy = 0 bits.
Numerically: H(S) = −Σ pᵢ · log₂(pᵢ). With p(Yes) = 1 and p(No) = 0, the formula gives −1·log₂(1) − 0·log₂(0) = 0 − 0 = 0 (using the convention 0·log₂(0) = 0).
Intuitively: zero entropy = zero impurity = the node is pure, so ID3 stops splitting here and declares it a leaf with label Yes. No information would be gained by further splitting.
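To check the arithmetic yourself, here is a minimal sketch of the entropy formula in Python. The function name `entropy` and the two example lists are illustrative choices, not something from the lecture.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Classes with zero count never show up in the Counter, which
    # silently applies the 0 * log2(0) = 0 convention.
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

pure_leaf = ["Yes"] * 6                  # the leaf from the question
mixed_node = ["Yes"] * 3 + ["No"] * 3    # 50/50 node for comparison

print(entropy(pure_leaf))   # pure node: 0 bits
print(entropy(mixed_node))  # maximally impure binary node: 1 bit
```

The 50/50 node is the other extreme: one full bit of uncertainty, the maximum for two classes.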
Max’s answer:
Result:
Q2 — Information Gain
Question: ID3 picks the attribute with the highest Information Gain at each split. Why does it use Information Gain — and not, say, just the attribute that produces the smallest entropy in the resulting child nodes?
Answer
Information Gain measures the reduction in entropy: Gain(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) · H(Sᵥ).
The second term is a weighted average of child entropies — weighted by how many examples each child receives.
If you instead looked only at the purest child (or averaged child entropies without the weighting), ID3 could prefer a split that creates one tiny pure child and one large impure child, which would barely reduce overall uncertainty.
By measuring the expected entropy after the split (and comparing to the parent), Information Gain captures how much uncertainty you actually eliminate per split, not just whether some child looks pure.
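A small numeric sketch of that argument, with a made-up parent node of 8 Yes / 8 No and two hypothetical candidate splits. Split A produces a tiny pure child, split B produces two balanced, mostly pure children.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(parent, children):
    """Gain = H(parent) - weighted average of the child entropies."""
    n = len(parent)
    expected = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - expected

parent = ["Yes"] * 8 + ["No"] * 8

# Split A: one tiny pure child, one large child that is still mixed.
split_a = [["Yes"] * 2, ["Yes"] * 6 + ["No"] * 8]
# Split B: two equally sized children, each mostly (not perfectly) pure.
split_b = [["Yes"] * 7 + ["No"] * 1, ["Yes"] * 1 + ["No"] * 7]

gain_a = information_gain(parent, split_a)  # roughly 0.14 bits
gain_b = information_gain(parent, split_b)  # roughly 0.46 bits
print(gain_a, gain_b)
```

Split A contains a child with entropy 0, so a "purest child wins" rule would pick it. The weighted Information Gain prefers split B, because B removes more uncertainty over all 16 examples.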
Max’s answer:
Result:
Q4 — Overfitting
Question: Your decision tree has 0 % training error and 25 % test error at depth 20. What’s the technical name for this phenomenon, what bias-variance term does it correspond to, and name two distinct ways to fix it.
Answer
Phenomenon: overfitting.
Bias-variance term: high variance (low bias). The tree fits training noise; small changes in training data → very different tree.
Fixes (any two):
- Pre-pruning (early stopping): cap max_depth, set min_samples_leaf, require min Information Gain to split.
- Post-pruning (reduced-error pruning): build the full tree, then collapse subtrees whose removal doesn’t hurt validation accuracy.
- Use Random Forest (or any bagging ensemble): variance averages out across many trees.
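- More training data: variance generally shrinks as the training set grows (roughly like 1/n for simple averaged estimators; for trees treat that as a rule of thumb), so with more examples individual noisy points have less influence on the splits.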
- Restrict the feature subset at each split (RF-style) → more decorrelated trees. This only pays off inside an ensemble, where the trees are averaged.
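The pre-pruning option can be sketched with a toy ID3 in plain Python. Everything here is an assumption for illustration: the nested-dict tree format, the parameter names `max_depth` and `min_gain`, and the 8-row dataset, in which the third example is meant to play the role of label noise.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    expected = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

def build(rows, labels, attrs, depth=0, max_depth=None, min_gain=0.0):
    majority = Counter(labels).most_common(1)[0][0]
    depth_capped = max_depth is not None and depth >= max_depth
    # Stop on a pure node, no attributes left, or the depth cap (pre-pruning).
    if len(set(labels)) == 1 or not attrs or depth_capped:
        return majority
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    # Pre-pruning: refuse splits whose gain does not exceed the threshold.
    if gain(rows, labels, best) <= min_gain:
        return majority
    node = {"attr": best, "default": majority, "children": {}}
    rest = [a for a in attrs if a != best]
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build(
            [rows[i] for i in keep], [labels[i] for i in keep],
            rest, depth + 1, max_depth, min_gain)
    return node

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

def depth_of(tree):
    if not isinstance(tree, dict):
        return 0
    return 1 + max(depth_of(child) for child in tree["children"].values())

# Toy data: the label follows attribute "a", except for one noisy row (index 2).
rows = [{"a": 1, "b": 0}, {"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 0},
        {"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 0, "b": 0}, {"a": 0, "b": 1}]
labels = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"]

full = build(rows, labels, ["a", "b"])                 # grows until every leaf is pure
shallow = build(rows, labels, ["a", "b"], max_depth=1)  # pre-pruned to a single split
```

The unpruned tree reaches depth 2 and reproduces the noisy row exactly. The depth-capped tree stops after the split on `a` and predicts the majority label for that row instead, which costs one training example but ignores the noise.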
Max’s answer:
Result:
Q5 — Decision Tree vs. Random Forest
Question: Random Forest fixes a specific weakness of single Decision Trees. What weakness, and how exactly does Random Forest fix it?
Answer
Weakness fixed: high variance (a single tree is brittle — small data changes give very different trees).
How RF fixes it (two complementary tricks):
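- Bagging (Bootstrap Aggregation): train each tree on a different bootstrap sample of the data. Averaging B identically distributed estimators, each with variance σ², reduces the variance to σ²/B if they are independent. Bootstrap trees are correlated, though: with pairwise correlation ρ the variance of the average is ρσ² + (1−ρ)σ²/B, so ρσ² remains as a floor no matter how many trees you add. Bias stays roughly the same.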
- Feature subsampling at each split: at every node, only consider a random subset of features (typically √p for classification). This decorrelates the trees — without it, all trees would pick the same strong feature near the root and the averaging would barely help. Decorrelation amplifies the variance reduction.
Final prediction = majority vote (classification) or average (regression).
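The two tricks and the vote fit into a few helper functions. This is a sketch under assumptions: the helper names are made up, and `variance_of_average` encodes the standard variance formula for the average of B identically distributed estimators with variance σ² and pairwise correlation ρ.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Bagging: draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]

def sample_features(features, rng):
    """Feature subsampling: pick about sqrt(p) features for one split."""
    k = max(1, round(len(features) ** 0.5))
    return rng.sample(features, k)

def majority_vote(predictions):
    """Classification: the most common label across all trees wins."""
    return Counter(predictions).most_common(1)[0][0]

def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees estimators with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)
subset = sample_features(list("abcdefghi"), rng)  # 9 features, so 3 are drawn

print(majority_vote(["Yes", "No", "Yes"]))
print(variance_of_average(1.0, 0.0, 100))  # independent trees: sigma^2 / B
print(variance_of_average(1.0, 0.5, 100))  # correlated trees: stuck near rho * sigma^2
```

The last two lines are the reason decorrelation matters: at ρ = 0.5, a hundred trees still leave about half of the single-tree variance.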
Max’s answer:
Result:
Q6 — Mechanism (short)
Question: In ONE sentence: why is a single Decision Tree considered a high-variance model?
Answer
Because the recursive greedy split chooses each attribute based on the specific training data it sees, so even small changes in the data can flip an early split decision, which propagates downward and produces a structurally very different tree.
(Alt: “Because the recursive top-down construction is unstable: a small data perturbation early in the tree cascades into very different subtrees, so the overall predictions fluctuate a lot across different training samples.”)
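A toy demonstration of that instability, on a made-up dataset of 11 examples with two binary attributes `a` and `b`. Dropping a single example is enough to flip which attribute wins the root split by Information Gain.

```python
from collections import Counter
from math import log2

ATTRS = {"a": 0, "b": 1}  # column index of each attribute; the label is column 2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    labels = [r[2] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[ATTRS[attr]], []).append(r[2])
    expected = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

def best_root(rows):
    """Attribute a greedy ID3-style learner would pick at the root."""
    return max(ATTRS, key=lambda name: gain(rows, name))

# Hypothetical data as (a, b, label) tuples.
data = [
    (1, 1, "Y"), (1, 1, "Y"), (1, 1, "Y"), (1, 0, "Y"),
    (0, 1, "N"),
    (0, 0, "N"), (0, 0, "N"), (0, 0, "N"),
    (0, 1, "Y"), (1, 0, "N"), (1, 0, "Y"),
]
perturbed = data[:4] + data[5:]  # same data minus the single (0, 1, "N") example

print(best_root(data))       # root split on the full data
print(best_root(perturbed))  # root split after removing one example
```

On the full data `a` has the higher gain; without the one example `b` does. Everything below the root is then built on different partitions, so the two trees differ structurally.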
Max’s answer:
Result:
Beyond the lecture (optional)
These questions go beyond the SoSe 2026 lecture slides (textbook / external additions). Kept for depth, not exam-critical.
Q3 — ⚠️ Exam trap: high-cardinality attributes
Question: You include UserID as a feature in your training data. ID3 picks it as the root split. The tree has 100 % training accuracy but 50 % test accuracy. What happened, and what’s the standard fix?
Answer
What happened: every UserID value appears exactly once → splitting on it creates one pure leaf per example. Information Gain is maximal (entropy drops to 0 in every child). But the tree has memorized the IDs, not learned any pattern → catastrophic overfitting, useless on new data.
Standard fixes:
- Gain Ratio (C4.5): normalize Gain by SplitInformation, which penalizes attributes that split into many branches.
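- Gini impurity (CART): a different impurity measure. Note that Gini by itself is also biased toward many-valued attributes; what helps in CART is that it only makes binary splits, so UserID cannot be fanned out into one branch per value in a single step.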
- Random feature subsampling at each split (Random Forest trick): only consider a few features per split → high-cardinality attributes don’t always get chosen.
- Drop the column — UserID has no predictive signal anyway.
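The Gain Ratio fix is easy to verify on a toy example. The data below is made up: 8 examples, a unique `user_id` per row, and a hypothetical binary feature `f` that is informative but not perfect. Gain Ratio is computed as Gain divided by SplitInformation, where SplitInformation is the entropy of the attribute's own value distribution.

```python
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def gain_and_ratio(column, labels):
    """Return (Information Gain, Gain Ratio) for splitting on `column`."""
    n = len(labels)
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(column)  # grows with the number of branches
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

labels = ["Yes", "No"] * 4
user_id = list(range(8))  # every value appears exactly once
f = [True, False, True, False, True, False, True, True]  # one mismatch with the label

gain_id, ratio_id = gain_and_ratio(user_id, labels)
gain_f, ratio_f = gain_and_ratio(f, labels)
print(gain_id, ratio_id)  # maximal gain, but divided by log2(8) = 3
print(gain_f, ratio_f)    # lower gain, higher ratio
```

Plain Information Gain ranks `user_id` first (1 bit versus about 0.55 bits), while Gain Ratio ranks `f` first (about 0.58 versus 1/3).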
Max’s answer:
Result:
Q7 — Applied judgement
Question: You’re designing a tabular medical decision-support system. Doctors must be able to audit every recommendation. You have ~10,000 patient records and ~30 features. Choose between (a) a single ID3 tree, (b) Random Forest, (c) XGBoost — and justify your choice. There is no single “right” answer; what matters is the trade-off you make explicit.
Answer
A strong defensible answer is (a) a single ID3/CART tree, possibly with pre-pruning + post-pruning. Reasoning:
- Auditability is a hard requirement. A doctor can trace any prediction through if-then-else nodes in a single tree. A 100-tree forest is essentially a black box (you can’t explain a vote across 100 trees).
- Random Forest and XGBoost often beat a single tree by a few percentage points of accuracy (the exact gap depends on the dataset), but they lose interpretability. With only 10k samples and 30 features, the accuracy gap is often small.
- Mitigations for the tree’s variance: prune aggressively (depth 4–6), tune on cross-validation, validate with clinicians on edge cases.
- Alternative defensible answer: Random Forest with SHAP values for explanations — gives near-XGBoost accuracy with per-prediction explanations. Slightly less raw interpretability but still auditable.
- XGBoost is the hardest choice to defend here, even if it turns out to be the most accurate: its boosted-tree explanations are post-hoc and harder to defend to regulators (e.g. under GDPR Article 22 on automated individual decision-making).
The exam-relevant insight: when interpretability is a constraint, raw accuracy is not the only objective.
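What "auditable" means for a single tree can be sketched in a few lines: every prediction comes with the exact list of tests that produced it. The tree format, the feature names, the thresholds, and the output labels below are all invented for illustration and have no clinical meaning.

```python
def explain(tree, record):
    """Return (prediction, list of the tests taken on the way to the leaf)."""
    path = []
    node = tree
    while isinstance(node, dict):
        feature, threshold = node["feature"], node["threshold"]
        value = record[feature]
        if value <= threshold:
            path.append(f"{feature} = {value} <= {threshold}")
            node = node["left"]
        else:
            path.append(f"{feature} = {value} > {threshold}")
            node = node["right"]
    return node, path

# Hypothetical two-level tree; the numbers are placeholders, not medical advice.
tree = {
    "feature": "lab_value", "threshold": 5.0,
    "left": "no action",
    "right": {
        "feature": "age", "threshold": 60,
        "left": "monitor",
        "right": "refer",
    },
}

prediction, path = explain(tree, {"lab_value": 7.2, "age": 71})
print(prediction)
for step in path:
    print("  because", step)
```

A forest or boosted ensemble has no single path like this, which is why its explanations have to be approximated after the fact.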
Max’s answer:
Result:
Score
When all 7 are graded:
- ✓ Correct:
- ~ Partial:
- ✗ Wrong:
Topics to re-drill if any wrong:
- Q1, Q6 — entropy and variance intuition
- Q2 — Information Gain mechanics
- Q3 — high-cardinality trap
- Q4, Q5 — overfitting fixes + ensemble reasoning
- Q7 — applied trade-offs
See also
- Decision Trees and ID3 — atomic note with code + plot
- Random Forest — atomic note with OOB + ensemble code
- Bias-Variance Tradeoff — the framing behind Q4–Q6
- lernzettel_ml-i-ii_30-04-26
- Machine Learning I & II — full topic hub
- Questions for Methods of AI — hub
- Methods of AI Lecture
Tags: methods-of-ai quiz decision-trees