This note is the Methods of AI ML I+II topic hub (Sessions 09 & 10, Krumnack — SoSe 2026). It consolidates the merged content from the former ml-i-ii_30-04-26 topic reference: Mitchell’s definition, the three learning paradigms, inductive bias, the bias-variance tradeoff, decision trees & ID3, entropy & information gain, random forests & bagging, similarity & clustering (cosine, k-means, hierarchical), cross-validation, TF-IDF, exam traps, and full pseudocode for the six core algorithms.
ML = improving with experience at a task. Formal (Mitchell): improve task T w.r.t. performance measure P based on experience E.
Three paradigms: supervised (labeled), unsupervised (unlabeled), reinforcement (trial-and-error + rewards).
Inductive bias is the set of prior assumptions a learner makes — without it, generalisation is impossible.
Overfitting: model fits training noise instead of signal → poor generalisation. Occam’s razor: prefer simpler models.
Two major task families: classification/regression (supervised) and clustering (unsupervised).
Glossary — important ML vocabulary
Task T — the function the system is supposed to compute (e.g. classify email). Performance P — a measurable quality (accuracy, MSE, F1). Experience E — the data (or interactions) the learner sees.
Supervised learning — input/output pairs (x, y). Goal: learn f(x) ≈ y. Subtypes: classification (discrete y) and regression (continuous y). Unsupervised learning — only inputs x, no labels. Goal: discover structure (clustering, density estimation). Reinforcement learning — agent interacts with an environment; receives scalar rewards; learns a policy. Semi-supervised learning(mentioned in WiSe 2024_25 slides; not in SoSe 2026) — small labelled set + large unlabelled set; learner exploits the unlabelled inputs to generalise better (label propagation, self-training). Self-supervised learning(mentioned in WiSe 2024_25 slides; not in SoSe 2026) — labels are derived from the input itself (e.g. mask a word, predict it; rotate an image, predict the rotation). Backbone of modern foundation models (BERT, GPT pretraining).
Inductive learning — generalising from finite examples to a hypothesis covering unseen cases. Inductive bias B — minimal set of assumptions such that the learner’s output follows deductively from (training data ∪ B). Without B no generalisation. Hypothesis space H — the set of functions the learner can express. The restriction bias sits in H; the preference bias in how H is searched.
Overfitting — training error keeps dropping while test error rises. Model has captured noise. Underfitting — model too simple, both train and test error are high. Bias (statistical) — systematic error from oversimplifying assumptions. Variance (statistical) — error from sensitivity to fluctuations in the training data.
Entropy E(S) — measure of impurity / uncertainty in a label distribution (in bits). Lecture writes E; = Shannon H(S); not the expectation 𝔼. Information Gain Gain(S, A) — expected reduction in entropy when splitting S by attribute A.
Bootstrap sample — random sample with replacement of size n drawn from a dataset of size n (≈ 63 % unique rows). Out-of-bag (OOB) sample — the ≈ 37 % of rows not in a given bootstrap sample. Free validation set. Bagging (Bootstrap AGGregating) — train m models on m bootstrap samples; aggregate by vote/average. Reduces variance.
Centroid — mean point of a cluster. Dendrogram — tree diagram produced by hierarchical clustering, showing merge order and distance. Linkage — distance rule between clusters (single / complete / average / centroid).
⚠️ Bias has two meanings.Inductive bias (assumptions enabling generalisation) ≠ statistical bias (error from oversimplification). Same word, very different concepts.
Mitchell’s ML Definition
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” — Tom Mitchell, Machine Learning (1997)
Walk-through (spam filter):
T — classify an email as spam / not spam.
P — fraction of correctly classified emails (accuracy) on held-out mail.
E — a corpus of emails labelled spam/ham.
Chess engine:
T = play chess. P = % games won. E = games played against itself or others.
The triple (T, P, E) is the canonical exam answer.
Learning paradigms
Paradigm
Data
Goal
Example
Supervised
(x, y) pairs
learn f(x) ≈ y
spam classification, house-price regression
Unsupervised
x only
discover structure
k-means, hierarchical clustering, PCA
Reinforcement
(state, action, reward)
learn policy π(s) maximising return
AlphaGo, robot control
Sub-types of supervised:
Classification — y ∈ finite set of classes.
Regression — y ∈ ℝ.
⚠️ Semi-supervised (mostly unlabeled + a few labels) and self-supervised (labels generated from data itself) are hybrids — not on these slides.
Inductive Bias
The “no free lunch” intuition. A learner that makes no assumptions cannot do better than random on unseen data — it can memorise the training set, but it has no principled reason to prefer one extension over another.
Formal definition (Mitchell, paraphrased on slide). The inductive bias of learner L is any minimal set of assertions B such that, for every training set D_train and every instance x, the classification L(D_train, x) follows deductively from (D_train ∪ B ∪ {x}).
In English. B is the knowledge needed to bridge “what we saw” to “what we predict on a new point”. Without B, the prediction does not follow from the data alone.
Two flavours of bias.
Restriction bias — limits the hypothesis space (e.g. “only linear functions”).
Preference bias — orders the hypothesis space (e.g. “prefer the shorter hypothesis” — Occam).
ID3’s inductive bias. Trees that place high-information-gain attributes near the root are preferred — and (loosely) shorter trees are preferred over deeper ones.
Bias-Variance Tradeoff
⚠️ Slide coverage. Sessions 09–10 discuss bias-variance qualitatively (more-complex model → lower bias, higher variance; need to balance them; pruning / ensembles help). They do not state the formal decomposition E[(y − f̂(x))²] = Bias² + Variance + irreducible noise — that comes from R&N / standard ML textbooks. The formula below is external/textbook.
Conceptual (slide-level)
Bias — error from wrong assumptions (model too simple → systematic underfitting).
Variance — error from sensitivity to small fluctuations in the training data (model too complex → overfitting).
Increasing model complexity → bias ↓, variance ↑. The sweet spot minimises total error.
Formal decomposition (textbook, external)
For squared-error loss on a fixed test point x with true label y = f(x) + ε, ε ~ N(0, σ²):
Ensembles like Random Forests reduce variance (averaging cancels independent noise) while keeping bias roughly the same as a single deep tree.
Pruning a decision tree reduces variance at the cost of adding bias.
The irreducible term σ² cannot be removed by any model — it’s the noise in the data-generating process itself.
Decision Trees & ID3
Decision tree — a tree where each internal node tests an attribute, each branch is an attribute value, each leaf is a class label. A path root → leaf = a conjunction of attribute tests.
If all examples in S share a class → make leaf with that class.
If no attributes left → make leaf with majority class.
Else: compute Information Gain for each remaining attribute; pick the highest-gain attribute A; split S by A’s values; recurse on each subset.
ID3’s inductive bias. Trees that place high-information-gain attributes near the root are preferred → shorter trees are preferred (informal Occam-style preference bias).
⚠️ External terminology. ID3 only handles categorical attributes and has no built-in pruning. Its successors C4.5 (Quinlan 1993) and CART add: continuous attributes, missing-value handling, post-pruning, and the Gain Ratio criterion (normalises Information Gain by the intrinsic information of the split to avoid favouring high-cardinality attributes like “patient ID”). Gain Ratio and C4.5 are textbook content — not in these slides.
Avoiding overfitting (slide-level). Detect via train-vs-validation error gap. Mitigations: limit depth, require minimum samples per leaf, prune the tree post-hoc, or move to Random Forests.
Entropy & Information Gain (worked example)
Formulas
E(S)=−∑ipilog2pi(sum over class labels i)
Gain(S,A)=E(S)−∑v∈Values(A)∣S∣∣Sv∣E(Sv)
ℹ️ What the symbols mean — read this first.
E(S) = the entropy of the label set S, measured in bits. This is the lecture’s notation (Krumnack writes E). ⚠️ It is not the expected value 𝔼[·] from the MDP/probability notes — same letter, completely different thing. Many textbooks (and the Python figures further down in this very note) write entropy as H(S) (Shannon entropy); E(S) and H(S) denote the same quantity — don’t let the two letters confuse you.
pᵢ = the fraction of examples in S belonging to class i (so Σ pᵢ = 1).
log₂ → the unit is bits; one fair coin flip carries exactly 1 bit of uncertainty.
Values(A) = the distinct values attribute A can take; S_v = the subset of S where A = v. The factor |S_v|/|S| is just that subset’s share of the data.
Intuition:E(S) measures impurity — how mixed the classes are. Gain(S, A) = entropy before the split minus the weighted entropy after → how much attribute A “cleans up” the labels. ID3 greedily picks the highest-gain attribute.
Sanity checks.
E(S) = 0 ⇔ S is pure (one class).
E(S) is maximal ⇔ classes are perfectly balanced (log₂ k for k classes).
0 · log₂ 0 ≔ 0 by convention.
Worked example (slide-level: “go to party?“)
Dataset S with 6 examples, target = party? (yes/no). Suppose 4 yes and 2 no.
If Gain(S, raining) came out smaller (say 0.10), ID3 picks tired as the root attribute. Then recurse on each branch.
Why high-cardinality attributes break vanilla IG
An attribute like patient_ID (every value unique) produces |Values(A)| pure singleton subsets → Gain = E(S). Vanilla IG happily picks it, giving a useless tree. Gain Ratio (C4.5) divides Gain by SplitInfo(A) = −Σ (|S_v|/|S|) log₂(|S_v|/|S|) to penalise this. External / not on slides.
Random Forests, Bagging, OOB error
Random Forest (Breiman 2001)
Idea. Train many decision trees, each on a slightly different view of the data, then vote (classification) or average (regression). Two independent sources of diversity:
Bootstrap sample of rows (= Bagging).
Random subset of features considered at each split (typical: √p features for classification, p/3 for regression, where p = total features).
Why it works. Individual classifiers should be accurate and diverse. Bagging + feature subsampling decorrelate the trees → averaging cancels their independent variance.
Effect on bias-variance. Random Forest reduces variance, leaves bias roughly unchanged compared to an unpruned single tree.
Hyperparameters (slides note there are many): number of trees, max depth, min samples / leaf, number of features per split, bootstrap-sample size.
Bagging vs. Random Forest
Method
Row sampling
Feature subsampling at each split
Bagging
✅ bootstrap
❌ all features considered
Random Forest
✅ bootstrap
✅ random subset
⚠️ Common confusion. Random Forest = Bagging plus per-split feature subsampling. They are not the same algorithm.
Out-of-bag (OOB) error
⚠️ The slides mention bagging but do not explicitly walk through OOB error. The mechanism below is textbook standard (Breiman 1996).
When you draw a bootstrap sample of size n from n rows with replacement, each row has probability (1 − 1/n)ⁿ → 1/e ≈ 0.368 of being left out. Those ≈ 37 % are the out-of-bag examples for that tree.
For each training row x, look up only the trees whose bootstrap sample did not contain x, aggregate their predictions, compare to the true label → OOB error estimate. It’s an essentially-free internal estimate of generalisation error — no separate held-out test set needed.
⚠️ Merges are greedy & irreversible — a bad early merge cannot be undone.
Train / Test / Validation split + Cross-Validation
Data partitioning (WiSe 2024_25 Lecture 10 explicit recommendation)
Slides give the canonical split: 50% training, 25% test, 25% validation.
Subset
Size
Used for
Training set
50%
fit model parameters (weights, tree splits, …)
Validation set
25%
tune hyperparameters (tree depth, k in k-NN, regularization λ) and pick the best variant during development
Test set
25%
untouched until the end — final unbiased generalisation estimate
⚠ Critical: never use the test set to choose between models — that leaks test information into your model selection. The validation set exists precisely so the test set stays clean.
k-fold Cross-Validation — external / textbook
⚠️ Slide coverage. SoSe 2026 Sessions 09–10 talk about train/test/validation splits and overfitting but do not use the words “k-fold cross-validation” or “leave-one-out”. The definitions below are standard textbook.
k-fold CV. Split data into k equal folds. For i = 1..k, train on the other k − 1 folds and test on fold i. Average the k test errors → CV estimate of generalisation error.
Leave-one-out (LOO). Special case with k = n (one test point at a time). Nearly unbiased but expensive (n model fits).
Use cases. Estimate generalisation error; tune hyperparameters; compare models. Standard k = 5 or 10.
⭐ Why F1 not arithmetic mean? The harmonic mean penalises imbalance — if either precision or recall is near 0, F1 is near 0. Arithmetic mean would let one dominate the other.
Common mistakes:
Accuracy is misleading on imbalanced classes (99% negative class → predict “negative” always → 99% accuracy, 0% recall).
High precision + low recall = “cautious classifier” (rarely says positive, but usually right when it does).
High recall + low precision = “trigger-happy classifier” (catches most positives, but with many false alarms).
TF-IDF & Feature Spaces
Feature space. Embed data points as vectors in ℝⁿ; each dimension = one feature. Algorithms (k-means, cosine similarity, SVMs) then operate geometrically.
TF-IDF (Term Frequency × Inverse Document Frequency):
tfidf(t, d, D) = tf(t, d) · log( |D| / |{d' ∈ D : t ∈ d'}| )
tf(t, d) — count of term t in document d (sometimes normalised by |d|).
idf(t, D) — downweights terms that appear in many documents (the, and, of …) and upweights rare-but-meaningful terms.
Pair with cosine similarity to compare documents independent of length.
Entropy of set S (in bits) — lecture’s E, = Shannon H, ≠ expectation 𝔼
Gain(S,A)=E(S)−∑v∣S∣∣Sv∣E(Sv)
Information Gain for attribute A
cos(x, y) = (x·y) / (‖x‖·‖y‖)
Cosine similarity
J = Σ_j Σ_{x∈C_j} ‖x − μ_j‖²
k-means objective (WCSS)
(1 − 1/n)ⁿ → 1/e ≈ 0.368
Probability a row is OOB for one bootstrap
Bias² + Variance + σ²
Expected squared-error decomposition (external)
tfidf(t, d) = tf · log(|D|/|{d′:t∈d′}|)
TF-IDF weight
k in k-means
Number of clusters (the only k-means hyperparameter)
k in k-fold
Number of folds in cross-validation
Common Exam Traps ⚠️
Bias has two meanings.Inductive bias (prior assumptions) ≠ statistical bias (under-fitting error). Read the question.
Bias-variance tradeoff. More complexity → lower bias, higher variance. NOT “more complex = better”.
ID3 is greedy. Maximises IG at each split; no backtracking. Can produce sub-optimal trees → needs pruning or ensembling.
High-cardinality attributes break vanilla IG. Attributes like patient_ID give max gain but are useless. C4.5’s Gain Ratio fixes this (textbook, not slides).
Random Forest ≠ Bagging. RF = Bagging + per-split feature subsampling. Both add diversity.
OOB error is a free bonus of bagging — no separate test set needed. ≈ 37 % of rows are OOB per tree.
k-means only finds a local optimum. Result depends on initialisation → run multiple times. k-means++ (external) gives smarter init.
k-means assumes spherical, equal-size clusters. Bad for elongated or density-varying data → hierarchical or DBSCAN.
Entropy = 0 ⇔ pure node; Entropy = max (log₂ k) ⇔ uniformly mixed classes.
Supervised vs. unsupervised vs. RL: supervised needs labels, unsupervised discovers structure, RL learns from rewards.
Hierarchical clustering merges are irreversible. A bad early merge poisons the dendrogram below it.
Cosine similarity ignores magnitude — good for text, bad if magnitude itself is meaningful.
“Random Forest is interpretable” — only partially. Individual trees are; an ensemble of 500 is not.
Variance reduction by averaging assumes the averaged models are at least somewhat independent. Identical trees average to themselves — no benefit. That’s why RF randomises both rows and features.
Comparison Table
Algorithm
Paradigm
Interpretable
Non-linear
Main risk
Main strength
Decision Tree (ID3)
Supervised
✅ fully
✅
Overfitting
White-box, axis-aligned splits
Random Forest
Supervised
⚠️ partial
✅
Many hyperparameters
Variance reduction, robust
k-means
Unsupervised
✅ centroids
❌ (Euclidean only)
Local optima, k must be set
Fast, scalable
Hierarchical (agglomerative)
Unsupervised
✅ dendrogram
depends on linkage
O(n²)+ cost, irreversible merges
No k needed
Cross-validation
Meta-method
n/a
n/a
Computational cost
Reliable generalisation estimate
TF-IDF + cosine
Feature engineering
✅
n/a
Bag-of-words ignores order
Strong text baseline
Visualisations (Python)
Four interactive figures illustrating the core ML I+II concepts. Render each toggle in Obsidian (with the Execute Code plugin + Pyodide) to see the plots.
🐍 Figure — Binary entropy curve H(p)
import micropipawait micropip.install("matplotlib")import numpy as npimport matplotlib.pyplot as pltp = np.linspace(1e-6, 1 - 1e-6, 500)H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)fig, ax = plt.subplots(figsize=(8, 5))ax.plot(p, H, color="#2c3e50", lw=2.5, label=r"$H(p) = -p\log_2 p - (1-p)\log_2(1-p)$")ax.fill_between(p, H, alpha=0.12, color="#2c3e50")# Annotate the three key pointsfor x, y, txt, dx, dy in [ (0.0, 0.0, "H(0) = 0\n(pure)", +0.05, +0.08), (0.5, 1.0, "H(0.5) = 1 bit\n(max impurity)", +0.02, -0.18), (1.0, 0.0, "H(1) = 0\n(pure)", -0.25, +0.08),]: ax.scatter([x], [y], color="#e74c3c", s=80, zorder=5) ax.annotate(txt, xy=(x, y), xytext=(x + dx, y + dy), fontsize=10, ha="left", arrowprops=dict(arrowstyle="->", color="#e74c3c", lw=1.2))ax.axvline(0.5, color="#999", ls="--", lw=1, alpha=0.6)ax.set_xlabel("p (probability of class 1)", fontsize=11)ax.set_ylabel("Entropy H(p) [bits]", fontsize=11)ax.set_title("Binary entropy — maximal at p = 0.5 (50/50 split = maximum impurity)", fontsize=11)ax.set_xlim(-0.02, 1.02); ax.set_ylim(-0.02, 1.15)ax.legend(loc="lower center", fontsize=10)ax.grid(alpha=0.3)plt.tight_layout(); plt.show()
What this shows. The binary entropy function. The curve peaks at p = 0.5 with H = 1 bit — a perfectly balanced node carries one full bit of uncertainty. At p = 0 or p = 1 the node is pure, contributing zero entropy. This is exactly why ID3 wants splits that drive children toward purity — any split that produces a 50/50 child has gained nothing in information content.
🐍 Figure — Information Gain on a toy split (14 examples, 9 yes / 5 no)
import micropipawait micropip.install("matplotlib")import numpy as npimport matplotlib.pyplot as pltdef H(pos, neg): n = pos + neg if n == 0: return 0.0 out = 0.0 for c in (pos, neg): if c == 0: continue p = c / n out -= p * np.log2(p) return out# Parent: 14 examples, 9 yes / 5 noP_pos, P_neg = 9, 5P_total = P_pos + P_negH_parent = H(P_pos, P_neg)# Split on toy feature -> two children# Child A: 8 examples (6 yes / 2 no) ; Child B: 6 examples (3 yes / 3 no)A_pos, A_neg = 6, 2; A_total = A_pos + A_negB_pos, B_neg = 3, 3; B_total = B_pos + B_negH_A = H(A_pos, A_neg); H_B = H(B_pos, B_neg)H_weighted = (A_total / P_total) * H_A + (B_total / P_total) * H_Bgain = H_parent - H_weighted# ---- bar chart ----fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5), gridspec_kw={"width_ratios": [1.2, 1]})labels = ["Parent\n(9 yes / 5 no)", f"Child A\n({A_pos}y / {A_neg}n)", f"Child B\n({B_pos}y / {B_neg}n)", "Weighted\nchild entropy"]vals = [H_parent, H_A, H_B, H_weighted]colors = ["#34495e", "#3498db", "#3498db", "#e67e22"]bars = ax1.bar(labels, vals, color=colors, edgecolor="black")for b, v in zip(bars, vals): ax1.text(b.get_x() + b.get_width()/2, v + 0.02, f"{v:.3f}", ha="center", va="bottom", fontsize=10, fontweight="bold")ax1.set_ylabel("Entropy [bits]")ax1.set_ylim(0, max(vals) * 1.25)ax1.set_title("Entropy before vs. after a split")ax1.grid(axis="y", alpha=0.3)# ---- gain visual: parent vs weighted, gap = IG ----ax2.bar(["Parent E(S)", "Σ |Sv|/|S| · E(Sv)"], [H_parent, H_weighted], color=["#34495e", "#e67e22"], edgecolor="black")ax2.annotate("", xy=(1, H_weighted), xytext=(1, H_parent), arrowprops=dict(arrowstyle="<->", color="#27ae60", lw=2.5))ax2.text(1.18, (H_parent + H_weighted)/2, f"Gain\n= {gain:.3f}\nbits", color="#27ae60", fontsize=12, fontweight="bold", va="center")ax2.set_ylabel("Entropy [bits]")ax2.set_ylim(0, max(vals) * 1.25)ax2.set_title("Information Gain = E(S) − weighted child entropy")ax2.grid(axis="y", alpha=0.3)fig.suptitle(f"Information Gain — toy split (14 examples)", fontsize=12, fontweight="bold")plt.tight_layout(rect=[0, 0, 1, 0.95]); plt.show()print(f"H(parent) = {H_parent:.4f} bits")print(f"H(child A) = {H_A:.4f} bits (weight {A_total}/{P_total})")print(f"H(child B) = {H_B:.4f} bits (weight {B_total}/{P_total})")print(f"Weighted child H = {H_weighted:.4f} bits")print(f"Information Gain = {gain:.4f} bits")
What this shows. A worked Information Gain example. Parent set has 14 examples (9 yes / 5 no) with entropy ≈ 0.940 bits. Splitting on a feature produces Child A (6y/2n, fairly pure, H ≈ 0.811) and Child B (3y/3n, perfectly mixed, H = 1.0). The weighted child entropy = (8/14)·0.811 + (6/14)·1.0 ≈ 0.892. Information Gain is the green gap: parent − weighted child ≈ 0.048 bits. ID3 picks the attribute with the largest such gain at each node.
import micropipawait micropip.install("matplotlib")import numpy as npimport matplotlib.pyplot as pltrng = np.random.default_rng(0)# True function and griddef f_true(x): return np.sin(x)x_grid = np.linspace(0, 2*np.pi, 200)y_true = f_true(x_grid)# Generate M noisy datasets, fit polynomial of given degree to eachM = 50N = 30 # points per datasetNOISE = 0.35degrees = [1, 3, 9]fits = {d: np.zeros((M, len(x_grid))) for d in degrees}for m in range(M): x = rng.uniform(0, 2*np.pi, N) y = f_true(x) + rng.normal(0, NOISE, N) for d in degrees: coef = np.polyfit(x, y, d) fits[d][m] = np.polyval(coef, x_grid)fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)for ax, d in zip(axes, degrees): mean_pred = fits[d].mean(axis=0) std_pred = fits[d].std(axis=0) bias2 = (mean_pred - y_true) ** 2 var = std_pred ** 2 # individual fits (faint) for m in range(M): ax.plot(x_grid, fits[d][m], color="#3498db", alpha=0.08, lw=0.8) # mean prediction ax.plot(x_grid, mean_pred, color="#e74c3c", lw=2.5, label="mean prediction") # variance band ax.fill_between(x_grid, mean_pred - std_pred, mean_pred + std_pred, color="#e74c3c", alpha=0.18, label="±1 std (variance)") # true function ax.plot(x_grid, y_true, color="black", lw=2, ls="--", label="true sin(x)") ax.set_title(f"Degree {d} · Bias²={bias2.mean():.3f} Var={var.mean():.3f}", fontsize=11) ax.set_xlabel("x"); ax.set_ylim(-2.2, 2.2) ax.grid(alpha=0.3) ax.legend(loc="upper right", fontsize=8)axes[0].set_ylabel("y")axes[0].text(0.02, 0.02, "underfit\n(high bias, low variance)", transform=axes[0].transAxes, fontsize=9, color="#c0392b", fontweight="bold")axes[1].text(0.02, 0.02, "sweet spot", transform=axes[1].transAxes, fontsize=9, color="#27ae60", fontweight="bold")axes[2].text(0.02, 0.02, "overfit\n(low bias, high variance)", transform=axes[2].transAxes, fontsize=9, color="#c0392b", fontweight="bold")fig.suptitle("Bias-Variance tradeoff — 50 noisy datasets, polynomial fits", fontsize=12, fontweight="bold")plt.tight_layout(rect=[0, 0, 1, 0.95]); plt.show()
What this shows. 50 noisy datasets were sampled from y = sin(x) + ε. For each dataset we fit a polynomial of degree 1 (left), 3 (middle), 9 (right). Blue translucent lines are the 50 individual fits; red is the average prediction; the shaded band is ±1 std (variance).
Degree 1: average prediction is far from sin(x) — high bias². All 50 fits look similar — low variance.
Degree 9: average prediction is on target (low bias²) but individual fits wiggle wildly through the noise — huge variance. Classic overfitting.
This is the empirical version of E[(y−f̂)²] = Bias² + Variance + σ² — increasing model capacity trades bias for variance.
🐍 Figure — k-means in action (3 Gaussian clusters, 1 iteration vs converged)
import micropipawait micropip.install("matplotlib")import numpy as npimport matplotlib.pyplot as pltrng = np.random.default_rng(7)# Generate 3 isotropic Gaussian clusterscenters_true = np.array([[-3, -2], [3, -1], [0, 3]])n_per = 60X = np.vstack([rng.normal(c, 0.9, size=(n_per, 2)) for c in centers_true])def kmeans_step(X, mu): # assignment d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2) labels = d2.argmin(axis=1) # update new_mu = np.array([X[labels == j].mean(axis=0) if (labels == j).any() else mu[j] for j in range(len(mu))]) return labels, new_mu# Bad initial centroids (deliberately offset so 1 iter looks "wrong")mu_init = np.array([[-5.0, 3.0], [5.0, 3.0], [0.0, -4.0]])# Snapshot 1: after 1 iterationlabels1, mu1 = kmeans_step(X, mu_init)# Run to convergencemu = mu_init.copy()for _ in range(50): labels, new_mu = kmeans_step(X, mu) if np.allclose(new_mu, mu): break mu = new_mulabels_final, mu_final = labels, muCOLORS = ["#e74c3c", "#3498db", "#27ae60"]fig, axes = plt.subplots(1, 2, figsize=(13, 6))for ax, (lab, mu_show, title) in zip( axes, [(labels1, mu1, "After 1 iteration (bad init)"), (labels_final, mu_final, "Converged")]): for j in range(3): pts = X[lab == j] ax.scatter(pts[:, 0], pts[:, 1], color=COLORS[j], s=35, alpha=0.7, edgecolor="white", lw=0.6, label=f"cluster {j}") ax.scatter(mu_show[j, 0], mu_show[j, 1], color=COLORS[j], s=350, marker="X", edgecolor="black", lw=2, zorder=5) ax.set_title(title, fontsize=11) ax.set_xlabel("x₁"); ax.set_ylabel("x₂") ax.legend(loc="upper right", fontsize=9) ax.grid(alpha=0.3); ax.set_aspect("equal")fig.suptitle("k-means (Lloyd's algorithm) — k = 3 on isotropic Gaussian clusters", fontsize=12, fontweight="bold")plt.tight_layout(rect=[0, 0, 1, 0.95]); plt.show()print("Initial centroids:")print(mu_init)print("\nConverged centroids:")print(mu_final.round(3))print(f"\nTrue cluster centres (for reference):\n{centers_true}")
What this shows. Lloyd’s k-means with k = 3 on 180 points drawn from three isotropic Gaussians. Left: after one assign-and-update step from a deliberately bad initialisation — clusters are partially mixed and centroids (Xs) are far from where they should be. Right: after convergence (typically ~5–10 iterations) — centroids snap to the true cluster means and the assignment is clean.
Key reminders. k-means only finds a local optimum of WCSS — different random inits can give different results. Standard fix: run multiple times, keep the lowest J. k-means assumes roughly spherical, equally-sized clusters in Euclidean space; elongated or density-varying clusters break it (use hierarchical or DBSCAN instead).
ALGORITHMS (full reference) ⭐
Pseudocode for every algorithm you might need to write or trace by hand.
Organised by role:
Goal. Build a decision tree top-down by greedily picking the highest-IG attribute at each node.
ID3(examples S, attributes A, target T): if all examples in S have the same label c: return Leaf(c) if A is empty: return Leaf(majority_class(S)) A* ← argmax_{a ∈ A} Gain(S, a) root ← Node(test = A*) for each value v ∈ Values(A*): S_v ← { x ∈ S : x.A* = v } if S_v is empty: root.add_branch(v, Leaf(majority_class(S))) else: root.add_branch(v, ID3(S_v, A \ {A*}, T)) return root
Cost. O(|A| · |S| · log|S|) per level in the typical case; deeper trees can grow exponentially. Limitations. Categorical attributes only; no built-in pruning; biased toward high-cardinality attributes (see Gain Ratio in C4.5, external).
2. Random Forest training
Goal. Train an ensemble of m decorrelated decision trees and aggregate them.
RANDOM-FOREST-TRAIN(D, m, p_split): forest ← [] for i = 1..m: D_i ← BOOTSTRAP-SAMPLE(D) # row resampling, |D_i| = |D| tree ← GROW-TREE(D_i, p_split) # at each node, sample p_split # features uniformly without # replacement; pick best split # only among those features forest.append(tree) return forestPREDICT(forest, x): votes ← [ tree.predict(x) for tree in forest ] return majority(votes) # or mean(votes) for regression
Typical p_split. √p for classification, p/3 for regression (p = total feature count). OOB shortcut. While training, remember which rows were not in D_i for each tree; aggregate predictions only over those trees to estimate generalisation error for free.
3. Bagging (Bootstrap AGGregating)
Goal. Reduce variance by averaging models trained on bootstrap samples.
BAGGING(D, base_learner L, m): models ← [] for i = 1..m: D_i ← BOOTSTRAP-SAMPLE(D) # |D_i| = |D|, sampled with replacement models.append( L.train(D_i) ) # all features available at every split return modelsPREDICT(models, x): return majority({ M.predict(x) for M in models }) # or mean for regressionBOOTSTRAP-SAMPLE(D): n ← |D| return [ D[random_int(0, n-1)] for _ in range(n) ] # WITH replacement
Difference from Random Forest. Bagging uses all features at every split; RF additionally restricts splits to a random feature subset. RF therefore produces more decorrelated trees and typically performs better.
4. k-means (Lloyd’s algorithm)
Goal. Partition X = {x₁, …, xₙ} into k clusters minimising WCSS J.
K-MEANS(X, k, max_iters): μ ← INIT-CENTROIDS(X, k) # random k points, or k-means++ for t = 1..max_iters: # 1. Assignment step for each x_i ∈ X: c(i) ← argmin_{j ∈ 1..k} ‖x_i − μ_j‖² # 2. Update step new_μ ← μ for j = 1..k: new_μ_j ← mean({ x_i : c(i) = j }) # if cluster empty: reinit randomly if new_μ == μ: # converged return c, μ μ ← new_μ return c, μ
Cost. O(n · k · d · t) where d = dimensionality, t = iterations. Why “Lloyd’s”? Stuart Lloyd proposed this two-step (assign, update) heuristic in 1957. Slides call it “k-means” without naming the inventor. k-means++ initialisation (external, Arthur & Vassilvitskii 2007): pick centroids one at a time, biased toward points far from existing centroids. Dramatically reduces sensitivity to initialisation.
5. Hierarchical clustering (agglomerative)
Goal. Build a dendrogram by repeatedly merging the two closest clusters.
HAC(X, linkage): clusters ← [ {x_i} for x_i in X ] # start: each point its own cluster D ← pairwise distance matrix of clusters dendrogram ← [] while |clusters| > 1: (A, B) ← argmin_{i ≠ j} linkage(C_i, C_j, D) # find closest pair new_C ← A ∪ B dendrogram.append( (A, B, linkage(A, B, D)) ) clusters.remove(A); clusters.remove(B); clusters.append(new_C) UPDATE-D(D, new_C, clusters, linkage) # recompute distances return dendrogram
Linkage choices (distance between clusters C_A, C_B):
Single: min_{a∈A, b∈B} d(a, b)
Complete: max_{a∈A, b∈B} d(a, b)
Average: mean_{a∈A, b∈B} d(a, b)
Centroid: d(mean(A), mean(B))
Cost. Naïve O(n³); with priority queues O(n² log n); single-linkage admits O(n²) via SLINK. Cutting the dendrogram. Pick a height threshold → connected components below it are the clusters. Equivalent to choosing k retroactively.
6. k-fold cross-validation
Goal. Estimate generalisation error of a learning procedure (not a fixed model) using all of the data for both training and testing.
K-FOLD-CV(D, learner L, k): shuffle(D) folds ← partition D into k equal-size disjoint folds errors ← [] for i = 1..k: D_test ← folds[i] D_train ← union of all other folds model ← L.train(D_train) errors.append( error(model, D_test) ) return mean(errors), std(errors)
Leave-one-out (LOO) = K-FOLD-CV with k = n. Cost. k model fits. Use. Hyperparameter tuning, model selection, reporting generalisation error. Standard k = 5 or 10. External / not in slides.