Machine Learning I & II

chatbot methods-of-ai

This note is the Methods of AI ML I+II topic hub (Sessions 09 & 10, Krumnack — SoSe 2026). It consolidates the merged content from the former ml-i-ii_30-04-26 topic reference: Mitchell’s definition, the three learning paradigms, inductive bias, the bias-variance tradeoff, decision trees & ID3, entropy & information gain, random forests & bagging, similarity & clustering (cosine, k-means, hierarchical), cross-validation, TF-IDF, exam traps, and full pseudocode for the six core algorithms.

Companion files for exam prep:

📋 lernzettel_ml-i-ii_30-04-26 — compact 1-page exam sheet

✏️ quiz_ml_30-04-26 — broad ML methodology + DT/RF (7 Qs)

✏️ quiz_decision-trees_18-05-26 — focused on entropy / IG / ensembles (7 Qs)

Core Ideas
Glossary — important ML vocabulary
Mitchell’s ML Definition
Learning paradigms
Inductive Bias
Bias-Variance Tradeoff
Decision Trees & ID3
Entropy & Information Gain (worked example)
Random Forests, Bagging, OOB error
Similarity & Clustering
Cross-Validation
TF-IDF & Feature Spaces
Formulas & Notation
Common Exam Traps
Comparison Table
Visualisations (Python)
ALGORITHMS ⭐ — pseudocode
- [[#1. ID3 (decision-tree induction)|#1 ID3]]
- [[#2. Random Forest training|#2 Random Forest]]
- [[#3. Bagging (Bootstrap AGGregating)|#3 Bagging]]
- [[#4. k-means (Lloyd’s algorithm)|#4 k-means (Lloyd’s)]]
- [[#5. Hierarchical clustering (agglomerative)|#5 Hierarchical agglomerative]]
- [[#6. k-fold cross-validation|#6 k-fold cross-validation]]

Core Ideas

ML = improving with experience at a task. Formal (Mitchell): improve task T w.r.t. performance measure P based on experience E.
Three paradigms: supervised (labeled), unsupervised (unlabeled), reinforcement (trial-and-error + rewards).
Inductive bias is the set of prior assumptions a learner makes — without it, generalisation is impossible.
Overfitting: model fits training noise instead of signal → poor generalisation. Occam’s razor: prefer simpler models.
Two major task families: classification/regression (supervised) and clustering (unsupervised).

Glossary — important ML vocabulary

Task T — the function the system is supposed to compute (e.g. classify email).
Performance P — a measurable quality (accuracy, MSE, F1).
Experience E — the data (or interactions) the learner sees.

Supervised learning — input/output pairs (x, y). Goal: learn f(x) ≈ y. Subtypes: classification (discrete y) and regression (continuous y).
Unsupervised learning — only inputs x, no labels. Goal: discover structure (clustering, density estimation).
Reinforcement learning — agent interacts with an environment; receives scalar rewards; learns a policy.
Semi-supervised learning (mentioned in WiSe 2024_25 slides; not in SoSe 2026) — small labelled set + large unlabelled set; learner exploits the unlabelled inputs to generalise better (label propagation, self-training).
Self-supervised learning (mentioned in WiSe 2024_25 slides; not in SoSe 2026) — labels are derived from the input itself (e.g. mask a word, predict it; rotate an image, predict the rotation). Backbone of modern foundation models (BERT, GPT pretraining).

Inductive learning — generalising from finite examples to a hypothesis covering unseen cases.
Inductive bias B — minimal set of assumptions such that the learner’s output follows deductively from (training data ∪ B). Without B no generalisation.
Hypothesis space H — the set of functions the learner can express. The restriction bias sits in H; the preference bias in how H is searched.

Overfitting — training error keeps dropping while test error rises. Model has captured noise.
Underfitting — model too simple, both train and test error are high.
Bias (statistical) — systematic error from oversimplifying assumptions.
Variance (statistical) — error from sensitivity to fluctuations in the training data.

Entropy E(S) — measure of impurity / uncertainty in a label distribution (in bits). Lecture writes E; = Shannon H(S); not the expectation 𝔼.
Information Gain Gain(S, A) — expected reduction in entropy when splitting S by attribute A.

Bootstrap sample — random sample with replacement of size n drawn from a dataset of size n (≈ 63 % unique rows).
Out-of-bag (OOB) sample — the ≈ 37 % of rows not in a given bootstrap sample. Free validation set.
Bagging (Bootstrap AGGregating) — train m models on m bootstrap samples; aggregate by vote/average. Reduces variance.

Centroid — mean point of a cluster.
Dendrogram — tree diagram produced by hierarchical clustering, showing merge order and distance.
Linkage — distance rule between clusters (single / complete / average / centroid).

⚠️ Bias has two meanings. Inductive bias (assumptions enabling generalisation) ≠ statistical bias (error from oversimplification). Same word, very different concepts.

Mitchell’s ML Definition

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” — Tom Mitchell, Machine Learning (1997)

Walk-through (spam filter):

T — classify an email as spam / not spam.
P — fraction of correctly classified emails (accuracy) on held-out mail.
E — a corpus of emails labelled spam/ham.

Chess engine:

T = play chess. P = % games won. E = games played against itself or others.

The triple (T, P, E) is the canonical exam answer.

Learning paradigms

Paradigm	Data	Goal	Example
Supervised	(x, y) pairs	learn f(x) ≈ y	spam classification, house-price regression
Unsupervised	x only	discover structure	k-means, hierarchical clustering, PCA
Reinforcement	(state, action, reward)	learn policy π(s) maximising return	AlphaGo, robot control

Sub-types of supervised:

Classification — y ∈ finite set of classes.
Regression — y ∈ ℝ.

⚠️ Semi-supervised (mostly unlabeled + a few labels) and self-supervised (labels generated from data itself) are hybrids — not on these slides.

Inductive Bias

The “no free lunch” intuition. A learner that makes no assumptions cannot do better than random on unseen data — it can memorise the training set, but it has no principled reason to prefer one extension over another.

Formal definition (Mitchell, paraphrased on slide). The inductive bias of learner L is any minimal set of assertions B such that, for every training set D_train and every instance x, the classification L(D_train, x) follows deductively from (D_train ∪ B ∪ {x}).

In English. B is the knowledge needed to bridge “what we saw” to “what we predict on a new point”. Without B, the prediction does not follow from the data alone.

Two flavours of bias.

Restriction bias — limits the hypothesis space (e.g. “only linear functions”).
Preference bias — orders the hypothesis space (e.g. “prefer the shorter hypothesis” — Occam).

ID3’s inductive bias. Trees that place high-information-gain attributes near the root are preferred — and (loosely) shorter trees are preferred over deeper ones.

Bias-Variance Tradeoff

⚠️ Slide coverage. Sessions 09–10 discuss bias-variance qualitatively (more-complex model → lower bias, higher variance; need to balance them; pruning / ensembles help). They do not state the formal decomposition E[(y − f̂(x))²] = Bias² + Variance + irreducible noise — that comes from R&N / standard ML textbooks. The formula below is external/textbook.

Conceptual (slide-level)

Bias — error from wrong assumptions (model too simple → systematic underfitting).
Variance — error from sensitivity to small fluctuations in the training data (model too complex → overfitting).
Increasing model complexity → bias ↓, variance ↑. The sweet spot minimises total error.

Formal decomposition (textbook, external)

For squared-error loss on a fixed test point x with true label y = f(x) + ε, ε ~ N(0, σ²):

E[(y − f̂(x))²] = (E[f̂(x)] − f(x))²   +   E[(f̂(x) − E[f̂(x)])²]   +   σ²
                = Bias²(f̂(x))         +   Variance(f̂(x))         +   irreducible

Implications.

Ensembles like Random Forests reduce variance (averaging cancels independent noise) while keeping bias roughly the same as a single deep tree.
Pruning a decision tree reduces variance at the cost of adding bias.
The irreducible term σ² cannot be removed by any model — it’s the noise in the data-generating process itself.

Decision Trees & ID3

Decision tree — a tree where each internal node tests an attribute, each branch is an attribute value, each leaf is a class label. A path root → leaf = a conjunction of attribute tests.

Properties.

✅ White-box — fully interpretable.
✅ Handles non-linear decision boundaries naturally.
✅ Robust to irrelevant features (they simply never get chosen).
⚠️ Greedy & no backtracking → can produce sub-optimal trees.
⚠️ Prone to overfitting on deep trees (memorises training noise).

ID3 (Iterative Dichotomiser 3, Quinlan 1986). Top-down greedy induction:

If all examples in S share a class → make leaf with that class.
If no attributes left → make leaf with majority class.
Else: compute Information Gain for each remaining attribute; pick the highest-gain attribute A; split S by A’s values; recurse on each subset.

ID3’s inductive bias. Trees that place high-information-gain attributes near the root are preferred → shorter trees are preferred (informal Occam-style preference bias).

⚠️ External terminology. ID3 only handles categorical attributes and has no built-in pruning. Its successors C4.5 (Quinlan 1993) and CART add: continuous attributes, missing-value handling, post-pruning, and the Gain Ratio criterion (normalises Information Gain by the intrinsic information of the split to avoid favouring high-cardinality attributes like “patient ID”). Gain Ratio and C4.5 are textbook content — not in these slides.

Avoiding overfitting (slide-level). Detect via train-vs-validation error gap. Mitigations: limit depth, require minimum samples per leaf, prune the tree post-hoc, or move to Random Forests.

Entropy & Information Gain (worked example)

Formulas

$E (S) = - \sum_{i} p_{i} lo g_{2} p_{i} (sum over class labels i)$

$Gain (S, A) = E (S) - \sum_{v \in Values (A)} \frac{∣ S _{v} ∣}{∣ S ∣} E (S_{v})$

ℹ️ What the symbols mean — read this first.

E(S) = the entropy of the label set S, measured in bits. This is the lecture’s notation (Krumnack writes E). ⚠️ It is not the expected value 𝔼[·] from the MDP/probability notes — same letter, completely different thing. Many textbooks (and the Python figures further down in this very note) write entropy as H(S) (Shannon entropy); E(S) and H(S) denote the same quantity — don’t let the two letters confuse you.

pᵢ = the fraction of examples in S belonging to class i (so Σ pᵢ = 1).

log₂ → the unit is bits; one fair coin flip carries exactly 1 bit of uncertainty.

Values(A) = the distinct values attribute A can take; S_v = the subset of S where A = v. The factor |S_v|/|S| is just that subset’s share of the data.

Intuition: E(S) measures impurity — how mixed the classes are. Gain(S, A) = entropy before the split minus the weighted entropy after → how much attribute A “cleans up” the labels. ID3 greedily picks the highest-gain attribute.

Sanity checks.

E(S) = 0 ⇔ S is pure (one class).
E(S) is maximal ⇔ classes are perfectly balanced (log₂ k for k classes).
0 · log₂ 0 ≔ 0 by convention.

Worked example (slide-level: “go to party?“)

Dataset S with 6 examples, target = party? (yes/no). Suppose 4 yes and 2 no.

p_yes = 4/6 ≈ 0.667
p_no  = 2/6 ≈ 0.333

E(S) = −(0.667 · log₂ 0.667) − (0.333 · log₂ 0.333)
     = −(0.667 · −0.585) − (0.333 · −1.585)
     = 0.390 + 0.528
     ≈ 0.918 bits

Now split on attribute tired ∈ {yes, no}. Suppose:

S_tired=yes: 3 examples, 1 party, 2 no-party → E = −(1/3)log₂(1/3) − (2/3)log₂(2/3) ≈ 0.918
S_tired=no: 3 examples, 3 party, 0 no-party → E = 0 (pure)

Gain(S, tired) = E(S) − (3/6) · 0.918 − (3/6) · 0
              = 0.918 − 0.459
              ≈ 0.459 bits

If Gain(S, raining) came out smaller (say 0.10), ID3 picks tired as the root attribute. Then recurse on each branch.

Why high-cardinality attributes break vanilla IG

An attribute like patient_ID (every value unique) produces |Values(A)| pure singleton subsets → Gain = E(S). Vanilla IG happily picks it, giving a useless tree. Gain Ratio (C4.5) divides Gain by SplitInfo(A) = −Σ (|S_v|/|S|) log₂(|S_v|/|S|) to penalise this. External / not on slides.

Random Forests, Bagging, OOB error

Random Forest (Breiman 2001)

Idea. Train many decision trees, each on a slightly different view of the data, then vote (classification) or average (regression). Two independent sources of diversity:

Bootstrap sample of rows (= Bagging).
Random subset of features considered at each split (typical: √p features for classification, p/3 for regression, where p = total features).

Why it works. Individual classifiers should be accurate and diverse. Bagging + feature subsampling decorrelate the trees → averaging cancels their independent variance.

Effect on bias-variance. Random Forest reduces variance, leaves bias roughly unchanged compared to an unpruned single tree.

Hyperparameters (slides note there are many): number of trees, max depth, min samples / leaf, number of features per split, bootstrap-sample size.

Bagging vs. Random Forest

Method	Row sampling	Feature subsampling at each split
Bagging	✅ bootstrap	❌ all features considered
Random Forest	✅ bootstrap	✅ random subset

⚠️ Common confusion. Random Forest = Bagging plus per-split feature subsampling. They are not the same algorithm.

Out-of-bag (OOB) error

⚠️ The slides mention bagging but do not explicitly walk through OOB error. The mechanism below is textbook standard (Breiman 1996).

When you draw a bootstrap sample of size n from n rows with replacement, each row has probability (1 − 1/n)ⁿ → 1/e ≈ 0.368 of being left out. Those ≈ 37 % are the out-of-bag examples for that tree.

For each training row x, look up only the trees whose bootstrap sample did not contain x, aggregate their predictions, compare to the true label → OOB error estimate. It’s an essentially-free internal estimate of generalisation error — no separate held-out test set needed.

Similarity & Clustering

Cosine similarity

cos(x, y) = (x · y) / (‖x‖ · ‖y‖)        ∈ [−1, 1]   (or [0, 1] for non-negative vectors)

Measures angle, ignores magnitude — standard for text vectors (TF-IDF) where document length should not dominate similarity.

k-means

Goal. Partition n points into k clusters, each represented by a centroid μⱼ, minimising the within-cluster sum of squared distances (WCSS):

J = Σ_{j=1..k} Σ_{x ∈ C_j} ‖x − μ_j‖²

Algorithm (Lloyd’s heuristic — see [[#4. k-means (Lloyd’s algorithm)|#4 below]]):

Initialise k centroids (random points, or k-means++ for smarter init — external).
Assign each point to its nearest centroid.
Update each centroid to the mean of its assigned points.
Repeat until assignments don’t change (or change very little).

Properties.

✅ Simple, fast (O(n · k · d · iters)).
⚠️ Only finds a local optimum — depends on initialisation. Standard fix: run multiple times, keep the lowest-J result.
⚠️ k must be specified in advance — the only hyperparameter, but a big one.
⚠️ Assumes roughly spherical, equally-sized clusters in Euclidean space.

⚠️ Slides say “k-means” but do not name Lloyd’s algorithm or k-means++ explicitly. Both are standard textbook nomenclature.

Hierarchical clustering

Two directions:

Agglomerative (bottom-up) — each point is its own cluster; repeatedly merge the two closest clusters until one remains.
Divisive (top-down) — start with one cluster; recursively split.

Output is a dendrogram (tree of merges with heights = merge distances). Cut at any height → any k.

Linkage criteria (how to measure cluster–cluster distance):

Linkage	Distance rule	Tends to produce
Single	min distance between any two points	”chains”, elongated clusters
Complete	max distance	tight, compact clusters
Average	mean of all pairwise distances	balanced compromise
Centroid	distance between centroids	balanced; can reverse merges

Properties.

✅ No need to pre-specify k.
✅ Dendrogram is interpretable.
⚠️ Expensive: naïve agglomerative is O(n³); efficient variants O(n² log n).
⚠️ Merges are greedy & irreversible — a bad early merge cannot be undone.

Train / Test / Validation split + Cross-Validation

Data partitioning (WiSe 2024_25 Lecture 10 explicit recommendation)

Slides give the canonical split: 50% training, 25% test, 25% validation.

Subset	Size	Used for
Training set	50%	fit model parameters (weights, tree splits, …)
Validation set	25%	tune hyperparameters (tree depth, k in k-NN, regularization λ) and pick the best variant during development
Test set	25%	untouched until the end — final unbiased generalisation estimate

⚠ Critical: never use the test set to choose between models — that leaks test information into your model selection. The validation set exists precisely so the test set stays clean.

k-fold Cross-Validation — external / textbook

⚠️ Slide coverage. SoSe 2026 Sessions 09–10 talk about train/test/validation splits and overfitting but do not use the words “k-fold cross-validation” or “leave-one-out”. The definitions below are standard textbook.

k-fold CV. Split data into k equal folds. For i = 1..k, train on the other k − 1 folds and test on fold i. Average the k test errors → CV estimate of generalisation error.

Leave-one-out (LOO). Special case with k = n (one test point at a time). Nearly unbiased but expensive (n model fits).

Use cases. Estimate generalisation error; tune hyperparameters; compare models. Standard k = 5 or 10.

Classifier evaluation metrics — confusion matrix & friends

WiSe 2024_25 Lecture 10 dedicates a slide to this — these formulas are exam-relevant.

For a binary classifier with positive class P and negative class N:

	Predicted P	Predicted N
Actual P	TP (true positive)	FN (false negative)
Actual N	FP (false positive)	TN (true negative)

Metric	Formula	What it measures
Accuracy	`(TP + TN) / (TP + TN + FP + FN)`	Overall proportion of correct classifications
Precision	`TP / (TP + FP)`	Of the items predicted positive, what fraction are truly positive? (degree of correctness)
Recall / Sensitivity	`TP / (TP + FN)`	Of the truly positive items, what fraction did we catch? (degree of completeness)
F1 score	`2 · (Precision · Recall) / (Precision + Recall)`	Harmonic mean of precision and recall — balanced single number
F-β score	`(1 + β²) · (Precision · Recall) / (β² · Precision + Recall)`	Parameterised generalisation: β > 1 weights recall higher, β < 1 weights precision higher

⭐ Why F1 not arithmetic mean? The harmonic mean penalises imbalance — if either precision or recall is near 0, F1 is near 0. Arithmetic mean would let one dominate the other.

Common mistakes:

Accuracy is misleading on imbalanced classes (99% negative class → predict “negative” always → 99% accuracy, 0% recall).
High precision + low recall = “cautious classifier” (rarely says positive, but usually right when it does).
High recall + low precision = “trigger-happy classifier” (catches most positives, but with many false alarms).

TF-IDF & Feature Spaces

Feature space. Embed data points as vectors in ℝⁿ; each dimension = one feature. Algorithms (k-means, cosine similarity, SVMs) then operate geometrically.

TF-IDF (Term Frequency × Inverse Document Frequency):

tfidf(t, d, D) = tf(t, d) · log( |D| / |{d' ∈ D : t ∈ d'}| )

tf(t, d) — count of term t in document d (sometimes normalised by |d|).
idf(t, D) — downweights terms that appear in many documents (the, and, of …) and upweights rare-but-meaningful terms.

Pair with cosine similarity to compare documents independent of length.

Formulas & Notation

Symbol	Meaning
(T, P, E)	Mitchell’s task / performance measure / experience triple
$E (S) = - \sum_{i} p_{i} lo g_{2} p_{i}$	Entropy of set S (in bits) — lecture’s `E`, = Shannon `H`, ≠ expectation `𝔼`
$Gain (S, A) = E (S) - \sum_{v} \frac{∣ S _{v} ∣}{∣ S ∣} E (S_{v})$	Information Gain for attribute A
cos(x, y) = (x·y) / (‖x‖·‖y‖)	Cosine similarity
J = Σ_j Σ_{x∈C_j} ‖x − μ_j‖²	k-means objective (WCSS)
(1 − 1/n)ⁿ → 1/e ≈ 0.368	Probability a row is OOB for one bootstrap
Bias² + Variance + σ²	Expected squared-error decomposition (external)
tfidf(t, d) = tf · log(\|D\|/\|{d′:t∈d′}\|)	TF-IDF weight
k in k-means	Number of clusters (the only k-means hyperparameter)
k in k-fold	Number of folds in cross-validation

Common Exam Traps ⚠️

Bias has two meanings. Inductive bias (prior assumptions) ≠ statistical bias (under-fitting error). Read the question.
Bias-variance tradeoff. More complexity → lower bias, higher variance. NOT “more complex = better”.
ID3 is greedy. Maximises IG at each split; no backtracking. Can produce sub-optimal trees → needs pruning or ensembling.
High-cardinality attributes break vanilla IG. Attributes like patient_ID give max gain but are useless. C4.5’s Gain Ratio fixes this (textbook, not slides).
Random Forest ≠ Bagging. RF = Bagging + per-split feature subsampling. Both add diversity.
OOB error is a free bonus of bagging — no separate test set needed. ≈ 37 % of rows are OOB per tree.
k-means only finds a local optimum. Result depends on initialisation → run multiple times. k-means++ (external) gives smarter init.
k-means assumes spherical, equal-size clusters. Bad for elongated or density-varying data → hierarchical or DBSCAN.
Entropy = 0 ⇔ pure node; Entropy = max (log₂ k) ⇔ uniformly mixed classes.
Supervised vs. unsupervised vs. RL: supervised needs labels, unsupervised discovers structure, RL learns from rewards.
Hierarchical clustering merges are irreversible. A bad early merge poisons the dendrogram below it.
Cosine similarity ignores magnitude — good for text, bad if magnitude itself is meaningful.
“Random Forest is interpretable” — only partially. Individual trees are; an ensemble of 500 is not.
Variance reduction by averaging assumes the averaged models are at least somewhat independent. Identical trees average to themselves — no benefit. That’s why RF randomises both rows and features.

Comparison Table

Algorithm	Paradigm	Interpretable	Non-linear	Main risk	Main strength
Decision Tree (ID3)	Supervised	✅ fully	✅	Overfitting	White-box, axis-aligned splits
Random Forest	Supervised	⚠️ partial	✅	Many hyperparameters	Variance reduction, robust
k-means	Unsupervised	✅ centroids	❌ (Euclidean only)	Local optima, k must be set	Fast, scalable
Hierarchical (agglomerative)	Unsupervised	✅ dendrogram	depends on linkage	O(n²)+ cost, irreversible merges	No k needed
Cross-validation	Meta-method	n/a	n/a	Computational cost	Reliable generalisation estimate
TF-IDF + cosine	Feature engineering	✅	n/a	Bag-of-words ignores order	Strong text baseline

Visualisations (Python)

Four interactive figures illustrating the core ML I+II concepts. Render each toggle in Obsidian (with the Execute Code plugin + Pyodide) to see the plots.

🐍 Figure — Binary entropy curve H(p)

import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
p = np.linspace(1e-6, 1 - 1e-6, 500)
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
 
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(p, H, color="#2c3e50", lw=2.5, label=r"$H(p) = -p\log_2 p - (1-p)\log_2(1-p)$")
ax.fill_between(p, H, alpha=0.12, color="#2c3e50")
 
# Annotate the three key points
for x, y, txt, dx, dy in [
    (0.0, 0.0, "H(0) = 0\n(pure)",        +0.05, +0.08),
    (0.5, 1.0, "H(0.5) = 1 bit\n(max impurity)", +0.02, -0.18),
    (1.0, 0.0, "H(1) = 0\n(pure)",        -0.25, +0.08),
]:
    ax.scatter([x], [y], color="#e74c3c", s=80, zorder=5)
    ax.annotate(txt, xy=(x, y), xytext=(x + dx, y + dy),
                fontsize=10, ha="left",
                arrowprops=dict(arrowstyle="->", color="#e74c3c", lw=1.2))
 
ax.axvline(0.5, color="#999", ls="--", lw=1, alpha=0.6)
ax.set_xlabel("p  (probability of class 1)", fontsize=11)
ax.set_ylabel("Entropy  H(p)  [bits]", fontsize=11)
ax.set_title("Binary entropy — maximal at p = 0.5 (50/50 split = maximum impurity)",
             fontsize=11)
ax.set_xlim(-0.02, 1.02); ax.set_ylim(-0.02, 1.15)
ax.legend(loc="lower center", fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout(); plt.show()

What this shows. The binary entropy function. The curve peaks at p = 0.5 with H = 1 bit — a perfectly balanced node carries one full bit of uncertainty. At p = 0 or p = 1 the node is pure, contributing zero entropy. This is exactly why ID3 wants splits that drive children toward purity — any split that produces a 50/50 child has gained nothing in information content.

🐍 Figure — Information Gain on a toy split (14 examples, 9 yes / 5 no)

import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
def H(pos, neg):
    n = pos + neg
    if n == 0: return 0.0
    out = 0.0
    for c in (pos, neg):
        if c == 0: continue
        p = c / n
        out -= p * np.log2(p)
    return out
 
# Parent: 14 examples, 9 yes / 5 no
P_pos, P_neg = 9, 5
P_total = P_pos + P_neg
H_parent = H(P_pos, P_neg)
 
# Split on toy feature -> two children
# Child A: 8 examples (6 yes / 2 no) ; Child B: 6 examples (3 yes / 3 no)
A_pos, A_neg = 6, 2; A_total = A_pos + A_neg
B_pos, B_neg = 3, 3; B_total = B_pos + B_neg
H_A = H(A_pos, A_neg); H_B = H(B_pos, B_neg)
 
H_weighted = (A_total / P_total) * H_A + (B_total / P_total) * H_B
gain = H_parent - H_weighted
 
# ---- bar chart ----
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5),
                               gridspec_kw={"width_ratios": [1.2, 1]})
 
labels = ["Parent\n(9 yes / 5 no)",
          f"Child A\n({A_pos}y / {A_neg}n)",
          f"Child B\n({B_pos}y / {B_neg}n)",
          "Weighted\nchild entropy"]
vals = [H_parent, H_A, H_B, H_weighted]
colors = ["#34495e", "#3498db", "#3498db", "#e67e22"]
 
bars = ax1.bar(labels, vals, color=colors, edgecolor="black")
for b, v in zip(bars, vals):
    ax1.text(b.get_x() + b.get_width()/2, v + 0.02, f"{v:.3f}",
             ha="center", va="bottom", fontsize=10, fontweight="bold")
ax1.set_ylabel("Entropy [bits]")
ax1.set_ylim(0, max(vals) * 1.25)
ax1.set_title("Entropy before vs. after a split")
ax1.grid(axis="y", alpha=0.3)
 
# ---- gain visual: parent vs weighted, gap = IG ----
ax2.bar(["Parent E(S)", "Σ |Sv|/|S| · E(Sv)"], [H_parent, H_weighted],
        color=["#34495e", "#e67e22"], edgecolor="black")
ax2.annotate("", xy=(1, H_weighted), xytext=(1, H_parent),
             arrowprops=dict(arrowstyle="<->", color="#27ae60", lw=2.5))
ax2.text(1.18, (H_parent + H_weighted)/2,
         f"Gain\n= {gain:.3f}\nbits",
         color="#27ae60", fontsize=12, fontweight="bold", va="center")
ax2.set_ylabel("Entropy [bits]")
ax2.set_ylim(0, max(vals) * 1.25)
ax2.set_title("Information Gain = E(S) − weighted child entropy")
ax2.grid(axis="y", alpha=0.3)
 
fig.suptitle(f"Information Gain — toy split (14 examples)", fontsize=12, fontweight="bold")
plt.tight_layout(rect=[0, 0, 1, 0.95]); plt.show()
 
print(f"H(parent)        = {H_parent:.4f} bits")
print(f"H(child A)       = {H_A:.4f} bits  (weight {A_total}/{P_total})")
print(f"H(child B)       = {H_B:.4f} bits  (weight {B_total}/{P_total})")
print(f"Weighted child H = {H_weighted:.4f} bits")
print(f"Information Gain = {gain:.4f} bits")

What this shows. A worked Information Gain example. Parent set has 14 examples (9 yes / 5 no) with entropy ≈ 0.940 bits. Splitting on a feature produces Child A (6y/2n, fairly pure, H ≈ 0.811) and Child B (3y/3n, perfectly mixed, H = 1.0). The weighted child entropy = (8/14)·0.811 + (6/14)·1.0 ≈ 0.892. Information Gain is the green gap: parent − weighted child ≈ 0.048 bits. ID3 picks the attribute with the largest such gain at each node.

🐍 Figure — Bias-Variance tradeoff (50 noisy datasets, polynomials of degree 1, 3, 9)

import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
rng = np.random.default_rng(0)
 
# True function and grid
def f_true(x): return np.sin(x)
x_grid = np.linspace(0, 2*np.pi, 200)
y_true = f_true(x_grid)
 
# Generate M noisy datasets, fit polynomial of given degree to each
M = 50
N = 30           # points per dataset
NOISE = 0.35
degrees = [1, 3, 9]
 
fits = {d: np.zeros((M, len(x_grid))) for d in degrees}
for m in range(M):
    x = rng.uniform(0, 2*np.pi, N)
    y = f_true(x) + rng.normal(0, NOISE, N)
    for d in degrees:
        coef = np.polyfit(x, y, d)
        fits[d][m] = np.polyval(coef, x_grid)
 
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
 
for ax, d in zip(axes, degrees):
    mean_pred = fits[d].mean(axis=0)
    std_pred  = fits[d].std(axis=0)
    bias2 = (mean_pred - y_true) ** 2
    var   = std_pred ** 2
 
    # individual fits (faint)
    for m in range(M):
        ax.plot(x_grid, fits[d][m], color="#3498db", alpha=0.08, lw=0.8)
 
    # mean prediction
    ax.plot(x_grid, mean_pred, color="#e74c3c", lw=2.5, label="mean prediction")
    # variance band
    ax.fill_between(x_grid, mean_pred - std_pred, mean_pred + std_pred,
                    color="#e74c3c", alpha=0.18, label="±1 std (variance)")
    # true function
    ax.plot(x_grid, y_true, color="black", lw=2, ls="--", label="true sin(x)")
 
    ax.set_title(f"Degree {d}  ·  Bias²={bias2.mean():.3f}  Var={var.mean():.3f}",
                 fontsize=11)
    ax.set_xlabel("x"); ax.set_ylim(-2.2, 2.2)
    ax.grid(alpha=0.3)
    ax.legend(loc="upper right", fontsize=8)
 
axes[0].set_ylabel("y")
axes[0].text(0.02, 0.02, "underfit\n(high bias, low variance)",
             transform=axes[0].transAxes, fontsize=9, color="#c0392b",
             fontweight="bold")
axes[1].text(0.02, 0.02, "sweet spot",
             transform=axes[1].transAxes, fontsize=9, color="#27ae60",
             fontweight="bold")
axes[2].text(0.02, 0.02, "overfit\n(low bias, high variance)",
             transform=axes[2].transAxes, fontsize=9, color="#c0392b",
             fontweight="bold")
 
fig.suptitle("Bias-Variance tradeoff — 50 noisy datasets, polynomial fits",
             fontsize=12, fontweight="bold")
plt.tight_layout(rect=[0, 0, 1, 0.95]); plt.show()

What this shows. 50 noisy datasets were sampled from y = sin(x) + ε. For each dataset we fit a polynomial of degree 1 (left), 3 (middle), 9 (right). Blue translucent lines are the 50 individual fits; red is the average prediction; the shaded band is ±1 std (variance).

Degree 1: average prediction is far from sin(x) — high bias². All 50 fits look similar — low variance.
Degree 3: average prediction tracks sin(x) closely; modest variance band. Sweet spot.
Degree 9: average prediction is on target (low bias²) but individual fits wiggle wildly through the noise — huge variance. Classic overfitting.

This is the empirical version of E[(y−f̂)²] = Bias² + Variance + σ² — increasing model capacity trades bias for variance.

🐍 Figure — k-means in action (3 Gaussian clusters, 1 iteration vs converged)

import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
rng = np.random.default_rng(7)
 
# Generate 3 isotropic Gaussian clusters
centers_true = np.array([[-3, -2], [3, -1], [0, 3]])
n_per = 60
X = np.vstack([rng.normal(c, 0.9, size=(n_per, 2)) for c in centers_true])
 
def kmeans_step(X, mu):
    # assignment
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # update
    new_mu = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                       else mu[j] for j in range(len(mu))])
    return labels, new_mu
 
# Bad initial centroids (deliberately offset so 1 iter looks "wrong")
mu_init = np.array([[-5.0, 3.0], [5.0, 3.0], [0.0, -4.0]])
 
# Snapshot 1: after 1 iteration
labels1, mu1 = kmeans_step(X, mu_init)
 
# Run to convergence
mu = mu_init.copy()
for _ in range(50):
    labels, new_mu = kmeans_step(X, mu)
    if np.allclose(new_mu, mu): break
    mu = new_mu
labels_final, mu_final = labels, mu
 
COLORS = ["#e74c3c", "#3498db", "#27ae60"]
 
fig, axes = plt.subplots(1, 2, figsize=(13, 6))
 
for ax, (lab, mu_show, title) in zip(
        axes,
        [(labels1, mu1, "After 1 iteration (bad init)"),
         (labels_final, mu_final, "Converged")]):
    for j in range(3):
        pts = X[lab == j]
        ax.scatter(pts[:, 0], pts[:, 1], color=COLORS[j], s=35,
                   alpha=0.7, edgecolor="white", lw=0.6,
                   label=f"cluster {j}")
        ax.scatter(mu_show[j, 0], mu_show[j, 1], color=COLORS[j], s=350,
                   marker="X", edgecolor="black", lw=2, zorder=5)
    ax.set_title(title, fontsize=11)
    ax.set_xlabel("x₁"); ax.set_ylabel("x₂")
    ax.legend(loc="upper right", fontsize=9)
    ax.grid(alpha=0.3); ax.set_aspect("equal")
 
fig.suptitle("k-means (Lloyd's algorithm) — k = 3 on isotropic Gaussian clusters",
             fontsize=12, fontweight="bold")
plt.tight_layout(rect=[0, 0, 1, 0.95]); plt.show()
 
print("Initial centroids:")
print(mu_init)
print("\nConverged centroids:")
print(mu_final.round(3))
print(f"\nTrue cluster centres (for reference):\n{centers_true}")

What this shows. Lloyd’s k-means with k = 3 on 180 points drawn from three isotropic Gaussians. Left: after one assign-and-update step from a deliberately bad initialisation — clusters are partially mixed and centroids (Xs) are far from where they should be. Right: after convergence (typically ~5–10 iterations) — centroids snap to the true cluster means and the assignment is clean.

Key reminders. k-means only finds a local optimum of WCSS — different random inits can give different results. Standard fix: run multiple times, keep the lowest J. k-means assumes roughly spherical, equally-sized clusters in Euclidean space; elongated or density-varying clusters break it (use hierarchical or DBSCAN instead).

ALGORITHMS (full reference) ⭐

Pseudocode for every algorithm you might need to write or trace by hand.
Organised by role:

Supervised tree induction: #1 ID3
Ensemble methods: #2 Random Forest · #3 Bagging
Unsupervised clustering: #4 k-means (Lloyd’s) · #5 Hierarchical agglomerative
Evaluation: #6 k-fold cross-validation

1. ID3 (decision-tree induction)

Goal. Build a decision tree top-down by greedily picking the highest-IG attribute at each node.

ID3(examples S, attributes A, target T):
    if all examples in S have the same label c:
        return Leaf(c)
    if A is empty:
        return Leaf(majority_class(S))
 
    A* ← argmax_{a ∈ A}  Gain(S, a)
    root ← Node(test = A*)
 
    for each value v ∈ Values(A*):
        S_v ← { x ∈ S : x.A* = v }
        if S_v is empty:
            root.add_branch(v, Leaf(majority_class(S)))
        else:
            root.add_branch(v, ID3(S_v, A \ {A*}, T))
 
    return root

Cost. O(|A| · |S| · log|S|) per level in the typical case; deeper trees can grow exponentially.
Limitations. Categorical attributes only; no built-in pruning; biased toward high-cardinality attributes (see Gain Ratio in C4.5, external).

2. Random Forest training

Goal. Train an ensemble of m decorrelated decision trees and aggregate them.

RANDOM-FOREST-TRAIN(D, m, p_split):
    forest ← []
    for i = 1..m:
        D_i  ← BOOTSTRAP-SAMPLE(D)              # row resampling, |D_i| = |D|
        tree ← GROW-TREE(D_i, p_split)          # at each node, sample p_split
                                                 # features uniformly without
                                                 # replacement; pick best split
                                                 # only among those features
        forest.append(tree)
    return forest
 
PREDICT(forest, x):
    votes ← [ tree.predict(x) for tree in forest ]
    return majority(votes)                       # or mean(votes) for regression

Typical p_split. √p for classification, p/3 for regression (p = total feature count).
OOB shortcut. While training, remember which rows were not in D_i for each tree; aggregate predictions only over those trees to estimate generalisation error for free.

3. Bagging (Bootstrap AGGregating)

Goal. Reduce variance by averaging models trained on bootstrap samples.

BAGGING(D, base_learner L, m):
    models ← []
    for i = 1..m:
        D_i ← BOOTSTRAP-SAMPLE(D)                # |D_i| = |D|, sampled with replacement
        models.append( L.train(D_i) )            # all features available at every split
    return models
 
PREDICT(models, x):
    return majority({ M.predict(x) for M in models })   # or mean for regression
 
BOOTSTRAP-SAMPLE(D):
    n ← |D|
    return [ D[random_int(0, n-1)] for _ in range(n) ]   # WITH replacement

Difference from Random Forest. Bagging uses all features at every split; RF additionally restricts splits to a random feature subset. RF therefore produces more decorrelated trees and typically performs better.

4. k-means (Lloyd’s algorithm)

Goal. Partition X = {x₁, …, xₙ} into k clusters minimising WCSS J.

K-MEANS(X, k, max_iters):
    μ ← INIT-CENTROIDS(X, k)                     # random k points, or k-means++
    for t = 1..max_iters:
        # 1. Assignment step
        for each x_i ∈ X:
            c(i) ← argmin_{j ∈ 1..k}  ‖x_i − μ_j‖²
        # 2. Update step
        new_μ ← μ
        for j = 1..k:
            new_μ_j ← mean({ x_i : c(i) = j })   # if cluster empty: reinit randomly
        if new_μ == μ:                            # converged
            return c, μ
        μ ← new_μ
    return c, μ

Cost. O(n · k · d · t) where d = dimensionality, t = iterations.
Why “Lloyd’s”? Stuart Lloyd proposed this two-step (assign, update) heuristic in 1957. Slides call it “k-means” without naming the inventor.
k-means++ initialisation (external, Arthur & Vassilvitskii 2007): pick centroids one at a time, biased toward points far from existing centroids. Dramatically reduces sensitivity to initialisation.

5. Hierarchical clustering (agglomerative)

Goal. Build a dendrogram by repeatedly merging the two closest clusters.

HAC(X, linkage):
    clusters ← [ {x_i} for x_i in X ]            # start: each point its own cluster
    D ← pairwise distance matrix of clusters
    dendrogram ← []
 
    while |clusters| > 1:
        (A, B) ← argmin_{i ≠ j}  linkage(C_i, C_j, D)   # find closest pair
        new_C  ← A ∪ B
        dendrogram.append( (A, B, linkage(A, B, D)) )
 
        clusters.remove(A); clusters.remove(B); clusters.append(new_C)
        UPDATE-D(D, new_C, clusters, linkage)             # recompute distances
 
    return dendrogram

Linkage choices (distance between clusters C_A, C_B):

Single: min_{a∈A, b∈B} d(a, b)
Complete: max_{a∈A, b∈B} d(a, b)
Average: mean_{a∈A, b∈B} d(a, b)
Centroid: d(mean(A), mean(B))

Cost. Naïve O(n³); with priority queues O(n² log n); single-linkage admits O(n²) via SLINK.
Cutting the dendrogram. Pick a height threshold → connected components below it are the clusters. Equivalent to choosing k retroactively.

6. k-fold cross-validation

Goal. Estimate generalisation error of a learning procedure (not a fixed model) using all of the data for both training and testing.

K-FOLD-CV(D, learner L, k):
    shuffle(D)
    folds ← partition D into k equal-size disjoint folds
    errors ← []
    for i = 1..k:
        D_test  ← folds[i]
        D_train ← union of all other folds
        model   ← L.train(D_train)
        errors.append( error(model, D_test) )
    return mean(errors), std(errors)

Leave-one-out (LOO) = K-FOLD-CV with k = n.
Cost. k model fits.
Use. Hyperparameter tuning, model selection, reporting generalisation error. Standard k = 5 or 10. External / not in slides.

Brain Online

Explorer

Machine Learning I & II

Machine Learning I & II

Table of contents

Core Ideas

Glossary — important ML vocabulary

Mitchell’s ML Definition

Learning paradigms

Inductive Bias

Bias-Variance Tradeoff

Conceptual (slide-level)

Formal decomposition (textbook, external)

Decision Trees & ID3

Entropy & Information Gain (worked example)

Formulas

Worked example (slide-level: “go to party?“)

Why high-cardinality attributes break vanilla IG

Random Forests, Bagging, OOB error

Random Forest (Breiman 2001)

Bagging vs. Random Forest

Out-of-bag (OOB) error

Similarity & Clustering

Cosine similarity

k-means

Hierarchical clustering

Train / Test / Validation split + Cross-Validation

Data partitioning (WiSe 2024_25 Lecture 10 explicit recommendation)

k-fold Cross-Validation — external / textbook

Classifier evaluation metrics — confusion matrix & friends

TF-IDF & Feature Spaces

Formulas & Notation

Common Exam Traps ⚠️

Comparison Table

Visualisations (Python)

ALGORITHMS (full reference) ⭐

1. ID3 (decision-tree induction)

2. Random Forest training

3. Bagging (Bootstrap AGGregating)

4. k-means (Lloyd’s algorithm)

5. Hierarchical clustering (agglomerative)

6. k-fold cross-validation

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis