Machine Learning I & II

chatbot methods-of-ai

This note is the Methods of AI ML I+II topic hub (Sessions 09 & 10, Krumnack — SoSe 2026). It consolidates the merged content from the former ml-i-ii_30-04-26 topic reference: Mitchell’s definition, the three learning paradigms, inductive bias, the bias-variance tradeoff, decision trees & ID3, entropy & information gain, random forests & bagging, similarity & clustering (cosine, k-means, hierarchical), cross-validation, TF-IDF, exam traps, and full pseudocode for the six core algorithms.

Companion files for exam prep:

Table of contents

  1. Core Ideas
  2. Glossary — important ML vocabulary
  3. Mitchell’s ML Definition
  4. Learning paradigms
  5. Inductive Bias
  6. Bias-Variance Tradeoff
  7. Decision Trees & ID3
  8. Entropy & Information Gain (worked example)
  9. Random Forests, Bagging, OOB error
  10. Similarity & Clustering
  11. Cross-Validation
  12. TF-IDF & Feature Spaces
  13. Formulas & Notation
  14. Common Exam Traps
  15. Comparison Table
  16. Visualisations (Python)
  17. ALGORITHMS ⭐ — pseudocode
    • [[#1. ID3 (decision-tree induction)|#1 ID3]]
    • [[#2. Random Forest training|#2 Random Forest]]
    • [[#3. Bagging (Bootstrap AGGregating)|#3 Bagging]]
    • [[#4. k-means (Lloyd’s algorithm)|#4 k-means (Lloyd’s)]]
    • [[#5. Hierarchical clustering (agglomerative)|#5 Hierarchical agglomerative]]
    • [[#6. k-fold cross-validation|#6 k-fold cross-validation]]

Core Ideas

  • ML = improving with experience at a task. Formal (Mitchell): improve task T w.r.t. performance measure P based on experience E.
  • Three paradigms: supervised (labeled), unsupervised (unlabeled), reinforcement (trial-and-error + rewards).
  • Inductive bias is the set of prior assumptions a learner makes — without it, generalisation is impossible.
  • Overfitting: model fits training noise instead of signal → poor generalisation. Occam’s razor: prefer simpler models.
  • Two major task families: classification/regression (supervised) and clustering (unsupervised).

Glossary — important ML vocabulary

Task T — the function the system is supposed to compute (e.g. classify email).
Performance P — a measurable quality (accuracy, MSE, F1).
Experience E — the data (or interactions) the learner sees.

Supervised learning — input/output pairs (x, y). Goal: learn f(x) ≈ y. Subtypes: classification (discrete y) and regression (continuous y).
Unsupervised learning — only inputs x, no labels. Goal: discover structure (clustering, density estimation).
Reinforcement learning — agent interacts with an environment; receives scalar rewards; learns a policy.
Semi-supervised learning (mentioned in WiSe 2024_25 slides; not in SoSe 2026) — small labelled set + large unlabelled set; learner exploits the unlabelled inputs to generalise better (label propagation, self-training).
Self-supervised learning (mentioned in WiSe 2024_25 slides; not in SoSe 2026) — labels are derived from the input itself (e.g. mask a word, predict it; rotate an image, predict the rotation). Backbone of modern foundation models (BERT, GPT pretraining).

Inductive learning — generalising from finite examples to a hypothesis covering unseen cases.
Inductive bias B — minimal set of assumptions such that the learner’s output follows deductively from (training data ∪ B). Without B no generalisation.
Hypothesis space H — the set of functions the learner can express. The restriction bias sits in H; the preference bias in how H is searched.

Overfitting — training error keeps dropping while test error rises. Model has captured noise.
Underfitting — model too simple, both train and test error are high.
Bias (statistical) — systematic error from oversimplifying assumptions.
Variance (statistical) — error from sensitivity to fluctuations in the training data.

Entropy E(S) — measure of impurity / uncertainty in a label distribution (in bits). Lecture writes E; = Shannon H(S); not the expectation 𝔼.
Information Gain Gain(S, A) — expected reduction in entropy when splitting S by attribute A.

Bootstrap sample — random sample with replacement of size n drawn from a dataset of size n (≈ 63 % unique rows).
Out-of-bag (OOB) sample — the ≈ 37 % of rows not in a given bootstrap sample. Free validation set.
Bagging (Bootstrap AGGregating) — train m models on m bootstrap samples; aggregate by vote/average. Reduces variance.

Centroid — mean point of a cluster.
Dendrogram — tree diagram produced by hierarchical clustering, showing merge order and distance.
Linkage — distance rule between clusters (single / complete / average / centroid).

⚠️ Bias has two meanings. Inductive bias (assumptions enabling generalisation) ≠ statistical bias (error from oversimplification). Same word, very different concepts.


Mitchell’s ML Definition

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” — Tom Mitchell, Machine Learning (1997)

Walk-through (spam filter):

  • T — classify an email as spam / not spam.
  • P — fraction of correctly classified emails (accuracy) on held-out mail.
  • E — a corpus of emails labelled spam/ham.

Chess engine:

  • T = play chess. P = % games won. E = games played against itself or others.

The triple (T, P, E) is the canonical exam answer.


Learning paradigms

ParadigmDataGoalExample
Supervised(x, y) pairslearn f(x) ≈ yspam classification, house-price regression
Unsupervisedx onlydiscover structurek-means, hierarchical clustering, PCA
Reinforcement(state, action, reward)learn policy π(s) maximising returnAlphaGo, robot control

Sub-types of supervised:

  • Classification — y ∈ finite set of classes.
  • Regression — y ∈ ℝ.

⚠️ Semi-supervised (mostly unlabeled + a few labels) and self-supervised (labels generated from data itself) are hybrids — not on these slides.


Inductive Bias

The “no free lunch” intuition. A learner that makes no assumptions cannot do better than random on unseen data — it can memorise the training set, but it has no principled reason to prefer one extension over another.

Formal definition (Mitchell, paraphrased on slide). The inductive bias of learner L is any minimal set of assertions B such that, for every training set D_train and every instance x, the classification L(D_train, x) follows deductively from (D_train ∪ B ∪ {x}).

In English. B is the knowledge needed to bridge “what we saw” to “what we predict on a new point”. Without B, the prediction does not follow from the data alone.

Two flavours of bias.

  • Restriction bias — limits the hypothesis space (e.g. “only linear functions”).
  • Preference bias — orders the hypothesis space (e.g. “prefer the shorter hypothesis” — Occam).

ID3’s inductive bias. Trees that place high-information-gain attributes near the root are preferred — and (loosely) shorter trees are preferred over deeper ones.


Bias-Variance Tradeoff

⚠️ Slide coverage. Sessions 09–10 discuss bias-variance qualitatively (more-complex model → lower bias, higher variance; need to balance them; pruning / ensembles help). They do not state the formal decomposition E[(y − f̂(x))²] = Bias² + Variance + irreducible noise — that comes from R&N / standard ML textbooks. The formula below is external/textbook.

Conceptual (slide-level)

  • Bias — error from wrong assumptions (model too simple → systematic underfitting).
  • Variance — error from sensitivity to small fluctuations in the training data (model too complex → overfitting).
  • Increasing model complexity → bias ↓, variance ↑. The sweet spot minimises total error.

Formal decomposition (textbook, external)

For squared-error loss on a fixed test point x with true label y = f(x) + ε, ε ~ N(0, σ²):

E[(y − f̂(x))²] = (E[f̂(x)] − f(x))²   +   E[(f̂(x) − E[f̂(x)])²]   +   σ²
                = Bias²(f̂(x))         +   Variance(f̂(x))         +   irreducible

Implications.

  • Ensembles like Random Forests reduce variance (averaging cancels independent noise) while keeping bias roughly the same as a single deep tree.
  • Pruning a decision tree reduces variance at the cost of adding bias.
  • The irreducible term σ² cannot be removed by any model — it’s the noise in the data-generating process itself.

Decision Trees & ID3

Decision tree — a tree where each internal node tests an attribute, each branch is an attribute value, each leaf is a class label. A path root → leaf = a conjunction of attribute tests.

Properties.

  • White-box — fully interpretable.
  • ✅ Handles non-linear decision boundaries naturally.
  • ✅ Robust to irrelevant features (they simply never get chosen).
  • ⚠️ Greedy & no backtracking → can produce sub-optimal trees.
  • ⚠️ Prone to overfitting on deep trees (memorises training noise).

ID3 (Iterative Dichotomiser 3, Quinlan 1986). Top-down greedy induction:

  1. If all examples in S share a class → make leaf with that class.
  2. If no attributes left → make leaf with majority class.
  3. Else: compute Information Gain for each remaining attribute; pick the highest-gain attribute A; split S by A’s values; recurse on each subset.

ID3’s inductive bias. Trees that place high-information-gain attributes near the root are preferred → shorter trees are preferred (informal Occam-style preference bias).

⚠️ External terminology. ID3 only handles categorical attributes and has no built-in pruning. Its successors C4.5 (Quinlan 1993) and CART add: continuous attributes, missing-value handling, post-pruning, and the Gain Ratio criterion (normalises Information Gain by the intrinsic information of the split to avoid favouring high-cardinality attributes like “patient ID”). Gain Ratio and C4.5 are textbook content — not in these slides.

Avoiding overfitting (slide-level). Detect via train-vs-validation error gap. Mitigations: limit depth, require minimum samples per leaf, prune the tree post-hoc, or move to Random Forests.


Entropy & Information Gain (worked example)

Formulas

ℹ️ What the symbols mean — read this first.

  • E(S) = the entropy of the label set S, measured in bits. This is the lecture’s notation (Krumnack writes E). ⚠️ It is not the expected value 𝔼[·] from the MDP/probability notes — same letter, completely different thing. Many textbooks (and the Python figures further down in this very note) write entropy as H(S) (Shannon entropy); E(S) and H(S) denote the same quantity — don’t let the two letters confuse you.
  • pᵢ = the fraction of examples in S belonging to class i (so Σ pᵢ = 1).
  • log₂ → the unit is bits; one fair coin flip carries exactly 1 bit of uncertainty.
  • Values(A) = the distinct values attribute A can take; S_v = the subset of S where A = v. The factor |S_v|/|S| is just that subset’s share of the data.
  • Intuition: E(S) measures impurity — how mixed the classes are. Gain(S, A) = entropy before the split minus the weighted entropy after → how much attribute A “cleans up” the labels. ID3 greedily picks the highest-gain attribute.

Sanity checks.

  • E(S) = 0 ⇔ S is pure (one class).
  • E(S) is maximal ⇔ classes are perfectly balanced (log₂ k for k classes).
  • 0 · log₂ 0 ≔ 0 by convention.

Worked example (slide-level: “go to party?“)

Dataset S with 6 examples, target = party? (yes/no). Suppose 4 yes and 2 no.

p_yes = 4/6 ≈ 0.667
p_no  = 2/6 ≈ 0.333

E(S) = −(0.667 · log₂ 0.667) − (0.333 · log₂ 0.333)
     = −(0.667 · −0.585) − (0.333 · −1.585)
     = 0.390 + 0.528
     ≈ 0.918 bits

Now split on attribute tired ∈ {yes, no}. Suppose:

  • S_tired=yes: 3 examples, 1 party, 2 no-party → E = −(1/3)log₂(1/3) − (2/3)log₂(2/3) ≈ 0.918
  • S_tired=no: 3 examples, 3 party, 0 no-party → E = 0 (pure)
Gain(S, tired) = E(S) − (3/6) · 0.918 − (3/6) · 0
              = 0.918 − 0.459
              ≈ 0.459 bits

If Gain(S, raining) came out smaller (say 0.10), ID3 picks tired as the root attribute. Then recurse on each branch.

Why high-cardinality attributes break vanilla IG

An attribute like patient_ID (every value unique) produces |Values(A)| pure singleton subsets → Gain = E(S). Vanilla IG happily picks it, giving a useless tree. Gain Ratio (C4.5) divides Gain by SplitInfo(A) = −Σ (|S_v|/|S|) log₂(|S_v|/|S|) to penalise this. External / not on slides.


Random Forests, Bagging, OOB error

Random Forest (Breiman 2001)

Idea. Train many decision trees, each on a slightly different view of the data, then vote (classification) or average (regression). Two independent sources of diversity:

  1. Bootstrap sample of rows (= Bagging).
  2. Random subset of features considered at each split (typical: √p features for classification, p/3 for regression, where p = total features).

Why it works. Individual classifiers should be accurate and diverse. Bagging + feature subsampling decorrelate the trees → averaging cancels their independent variance.

Effect on bias-variance. Random Forest reduces variance, leaves bias roughly unchanged compared to an unpruned single tree.

Hyperparameters (slides note there are many): number of trees, max depth, min samples / leaf, number of features per split, bootstrap-sample size.

Bagging vs. Random Forest

MethodRow samplingFeature subsampling at each split
Bagging✅ bootstrap❌ all features considered
Random Forest✅ bootstrap✅ random subset

⚠️ Common confusion. Random Forest = Bagging plus per-split feature subsampling. They are not the same algorithm.

Out-of-bag (OOB) error

⚠️ The slides mention bagging but do not explicitly walk through OOB error. The mechanism below is textbook standard (Breiman 1996).

When you draw a bootstrap sample of size n from n rows with replacement, each row has probability (1 − 1/n)ⁿ → 1/e ≈ 0.368 of being left out. Those ≈ 37 % are the out-of-bag examples for that tree.

For each training row x, look up only the trees whose bootstrap sample did not contain x, aggregate their predictions, compare to the true label → OOB error estimate. It’s an essentially-free internal estimate of generalisation error — no separate held-out test set needed.


Similarity & Clustering

Cosine similarity

cos(x, y) = (x · y) / (‖x‖ · ‖y‖)        ∈ [−1, 1]   (or [0, 1] for non-negative vectors)

Measures angle, ignores magnitude — standard for text vectors (TF-IDF) where document length should not dominate similarity.

k-means

Goal. Partition n points into k clusters, each represented by a centroid μⱼ, minimising the within-cluster sum of squared distances (WCSS):

J = Σ_{j=1..k} Σ_{x ∈ C_j} ‖x − μ_j‖²

Algorithm (Lloyd’s heuristic — see [[#4. k-means (Lloyd’s algorithm)|#4 below]]):

  1. Initialise k centroids (random points, or k-means++ for smarter init — external).
  2. Assign each point to its nearest centroid.
  3. Update each centroid to the mean of its assigned points.
  4. Repeat until assignments don’t change (or change very little).

Properties.

  • ✅ Simple, fast (O(n · k · d · iters)).
  • ⚠️ Only finds a local optimum — depends on initialisation. Standard fix: run multiple times, keep the lowest-J result.
  • ⚠️ k must be specified in advance — the only hyperparameter, but a big one.
  • ⚠️ Assumes roughly spherical, equally-sized clusters in Euclidean space.

⚠️ Slides say “k-means” but do not name Lloyd’s algorithm or k-means++ explicitly. Both are standard textbook nomenclature.

Hierarchical clustering

Two directions:

  • Agglomerative (bottom-up) — each point is its own cluster; repeatedly merge the two closest clusters until one remains.
  • Divisive (top-down) — start with one cluster; recursively split.

Output is a dendrogram (tree of merges with heights = merge distances). Cut at any height → any k.

Linkage criteria (how to measure cluster–cluster distance):

LinkageDistance ruleTends to produce
Singlemin distance between any two points”chains”, elongated clusters
Completemax distancetight, compact clusters
Averagemean of all pairwise distancesbalanced compromise
Centroiddistance between centroidsbalanced; can reverse merges

Properties.

  • ✅ No need to pre-specify k.
  • ✅ Dendrogram is interpretable.
  • ⚠️ Expensive: naïve agglomerative is O(n³); efficient variants O(n² log n).
  • ⚠️ Merges are greedy & irreversible — a bad early merge cannot be undone.

Train / Test / Validation split + Cross-Validation

Data partitioning (WiSe 2024_25 Lecture 10 explicit recommendation)

Slides give the canonical split: 50% training, 25% test, 25% validation.

SubsetSizeUsed for
Training set50%fit model parameters (weights, tree splits, …)
Validation set25%tune hyperparameters (tree depth, k in k-NN, regularization λ) and pick the best variant during development
Test set25%untouched until the end — final unbiased generalisation estimate

Critical: never use the test set to choose between models — that leaks test information into your model selection. The validation set exists precisely so the test set stays clean.

k-fold Cross-Validation — external / textbook

⚠️ Slide coverage. SoSe 2026 Sessions 09–10 talk about train/test/validation splits and overfitting but do not use the words “k-fold cross-validation” or “leave-one-out”. The definitions below are standard textbook.

k-fold CV. Split data into k equal folds. For i = 1..k, train on the other k − 1 folds and test on fold i. Average the k test errors → CV estimate of generalisation error.

Leave-one-out (LOO). Special case with k = n (one test point at a time). Nearly unbiased but expensive (n model fits).

Use cases. Estimate generalisation error; tune hyperparameters; compare models. Standard k = 5 or 10.


Classifier evaluation metrics — confusion matrix & friends

WiSe 2024_25 Lecture 10 dedicates a slide to this — these formulas are exam-relevant.

For a binary classifier with positive class P and negative class N:

Predicted PPredicted N
Actual PTP (true positive)FN (false negative)
Actual NFP (false positive)TN (true negative)
MetricFormulaWhat it measures
Accuracy(TP + TN) / (TP + TN + FP + FN)Overall proportion of correct classifications
PrecisionTP / (TP + FP)Of the items predicted positive, what fraction are truly positive? (degree of correctness)
Recall / SensitivityTP / (TP + FN)Of the truly positive items, what fraction did we catch? (degree of completeness)
F1 score2 · (Precision · Recall) / (Precision + Recall)Harmonic mean of precision and recall — balanced single number
F-β score(1 + β²) · (Precision · Recall) / (β² · Precision + Recall)Parameterised generalisation: β > 1 weights recall higher, β < 1 weights precision higher

Why F1 not arithmetic mean? The harmonic mean penalises imbalance — if either precision or recall is near 0, F1 is near 0. Arithmetic mean would let one dominate the other.

Common mistakes:

  • Accuracy is misleading on imbalanced classes (99% negative class → predict “negative” always → 99% accuracy, 0% recall).
  • High precision + low recall = “cautious classifier” (rarely says positive, but usually right when it does).
  • High recall + low precision = “trigger-happy classifier” (catches most positives, but with many false alarms).

TF-IDF & Feature Spaces

Feature space. Embed data points as vectors in ℝⁿ; each dimension = one feature. Algorithms (k-means, cosine similarity, SVMs) then operate geometrically.

TF-IDF (Term Frequency × Inverse Document Frequency):

tfidf(t, d, D) = tf(t, d) · log( |D| / |{d' ∈ D : t ∈ d'}| )
  • tf(t, d) — count of term t in document d (sometimes normalised by |d|).
  • idf(t, D) — downweights terms that appear in many documents (the, and, of …) and upweights rare-but-meaningful terms.

Pair with cosine similarity to compare documents independent of length.


Formulas & Notation

SymbolMeaning
(T, P, E)Mitchell’s task / performance measure / experience triple
Entropy of set S (in bits) — lecture’s E, = Shannon H, ≠ expectation 𝔼
Information Gain for attribute A
cos(x, y) = (x·y) / (‖x‖·‖y‖)Cosine similarity
J = Σ_j Σ_{x∈C_j} ‖x − μ_j‖²k-means objective (WCSS)
(1 − 1/n)ⁿ → 1/e ≈ 0.368Probability a row is OOB for one bootstrap
Bias² + Variance + σ²Expected squared-error decomposition (external)
tfidf(t, d) = tf · log(|D|/|{d′:t∈d′}|)TF-IDF weight
k in k-meansNumber of clusters (the only k-means hyperparameter)
k in k-foldNumber of folds in cross-validation

Common Exam Traps ⚠️

  • Bias has two meanings. Inductive bias (prior assumptions) ≠ statistical bias (under-fitting error). Read the question.
  • Bias-variance tradeoff. More complexity → lower bias, higher variance. NOT “more complex = better”.
  • ID3 is greedy. Maximises IG at each split; no backtracking. Can produce sub-optimal trees → needs pruning or ensembling.
  • High-cardinality attributes break vanilla IG. Attributes like patient_ID give max gain but are useless. C4.5’s Gain Ratio fixes this (textbook, not slides).
  • Random Forest ≠ Bagging. RF = Bagging + per-split feature subsampling. Both add diversity.
  • OOB error is a free bonus of bagging — no separate test set needed. ≈ 37 % of rows are OOB per tree.
  • k-means only finds a local optimum. Result depends on initialisation → run multiple times. k-means++ (external) gives smarter init.
  • k-means assumes spherical, equal-size clusters. Bad for elongated or density-varying data → hierarchical or DBSCAN.
  • Entropy = 0 ⇔ pure node; Entropy = max (log₂ k) ⇔ uniformly mixed classes.
  • Supervised vs. unsupervised vs. RL: supervised needs labels, unsupervised discovers structure, RL learns from rewards.
  • Hierarchical clustering merges are irreversible. A bad early merge poisons the dendrogram below it.
  • Cosine similarity ignores magnitude — good for text, bad if magnitude itself is meaningful.
  • “Random Forest is interpretable” — only partially. Individual trees are; an ensemble of 500 is not.
  • Variance reduction by averaging assumes the averaged models are at least somewhat independent. Identical trees average to themselves — no benefit. That’s why RF randomises both rows and features.

Comparison Table

AlgorithmParadigmInterpretableNon-linearMain riskMain strength
Decision Tree (ID3)Supervised✅ fullyOverfittingWhite-box, axis-aligned splits
Random ForestSupervised⚠️ partialMany hyperparametersVariance reduction, robust
k-meansUnsupervised✅ centroids❌ (Euclidean only)Local optima, k must be setFast, scalable
Hierarchical (agglomerative)Unsupervised✅ dendrogramdepends on linkageO(n²)+ cost, irreversible mergesNo k needed
Cross-validationMeta-methodn/an/aComputational costReliable generalisation estimate
TF-IDF + cosineFeature engineeringn/aBag-of-words ignores orderStrong text baseline

Visualisations (Python)

Four interactive figures illustrating the core ML I+II concepts. Render each toggle in Obsidian (with the Execute Code plugin + Pyodide) to see the plots.

What this shows. The binary entropy function. The curve peaks at p = 0.5 with H = 1 bit — a perfectly balanced node carries one full bit of uncertainty. At p = 0 or p = 1 the node is pure, contributing zero entropy. This is exactly why ID3 wants splits that drive children toward purity — any split that produces a 50/50 child has gained nothing in information content.


What this shows. A worked Information Gain example. Parent set has 14 examples (9 yes / 5 no) with entropy ≈ 0.940 bits. Splitting on a feature produces Child A (6y/2n, fairly pure, H ≈ 0.811) and Child B (3y/3n, perfectly mixed, H = 1.0). The weighted child entropy = (8/14)·0.811 + (6/14)·1.0 ≈ 0.892. Information Gain is the green gap: parent − weighted child ≈ 0.048 bits. ID3 picks the attribute with the largest such gain at each node.


What this shows. 50 noisy datasets were sampled from y = sin(x) + ε. For each dataset we fit a polynomial of degree 1 (left), 3 (middle), 9 (right). Blue translucent lines are the 50 individual fits; red is the average prediction; the shaded band is ±1 std (variance).

  • Degree 1: average prediction is far from sin(x) — high bias². All 50 fits look similar — low variance.
  • Degree 3: average prediction tracks sin(x) closely; modest variance band. Sweet spot.
  • Degree 9: average prediction is on target (low bias²) but individual fits wiggle wildly through the noise — huge variance. Classic overfitting.

This is the empirical version of E[(y−f̂)²] = Bias² + Variance + σ² — increasing model capacity trades bias for variance.


What this shows. Lloyd’s k-means with k = 3 on 180 points drawn from three isotropic Gaussians. Left: after one assign-and-update step from a deliberately bad initialisation — clusters are partially mixed and centroids (Xs) are far from where they should be. Right: after convergence (typically ~5–10 iterations) — centroids snap to the true cluster means and the assignment is clean.

Key reminders. k-means only finds a local optimum of WCSS — different random inits can give different results. Standard fix: run multiple times, keep the lowest J. k-means assumes roughly spherical, equally-sized clusters in Euclidean space; elongated or density-varying clusters break it (use hierarchical or DBSCAN instead).


ALGORITHMS (full reference) ⭐

Pseudocode for every algorithm you might need to write or trace by hand.
Organised by role:

  • Supervised tree induction: #1 ID3
  • Ensemble methods: #2 Random Forest · #3 Bagging
  • Unsupervised clustering: #4 k-means (Lloyd’s) · #5 Hierarchical agglomerative
  • Evaluation: #6 k-fold cross-validation

1. ID3 (decision-tree induction)

Goal. Build a decision tree top-down by greedily picking the highest-IG attribute at each node.

ID3(examples S, attributes A, target T):
    if all examples in S have the same label c:
        return Leaf(c)
    if A is empty:
        return Leaf(majority_class(S))
 
    A* ← argmax_{a ∈ A}  Gain(S, a)
    root ← Node(test = A*)
 
    for each value v ∈ Values(A*):
        S_v ← { x ∈ S : x.A* = v }
        if S_v is empty:
            root.add_branch(v, Leaf(majority_class(S)))
        else:
            root.add_branch(v, ID3(S_v, A \ {A*}, T))
 
    return root

Cost. O(|A| · |S| · log|S|) per level in the typical case; deeper trees can grow exponentially.
Limitations. Categorical attributes only; no built-in pruning; biased toward high-cardinality attributes (see Gain Ratio in C4.5, external).


2. Random Forest training

Goal. Train an ensemble of m decorrelated decision trees and aggregate them.

RANDOM-FOREST-TRAIN(D, m, p_split):
    forest ← []
    for i = 1..m:
        D_i  ← BOOTSTRAP-SAMPLE(D)              # row resampling, |D_i| = |D|
        tree ← GROW-TREE(D_i, p_split)          # at each node, sample p_split
                                                 # features uniformly without
                                                 # replacement; pick best split
                                                 # only among those features
        forest.append(tree)
    return forest
 
PREDICT(forest, x):
    votes ← [ tree.predict(x) for tree in forest ]
    return majority(votes)                       # or mean(votes) for regression

Typical p_split. √p for classification, p/3 for regression (p = total feature count).
OOB shortcut. While training, remember which rows were not in D_i for each tree; aggregate predictions only over those trees to estimate generalisation error for free.


3. Bagging (Bootstrap AGGregating)

Goal. Reduce variance by averaging models trained on bootstrap samples.

BAGGING(D, base_learner L, m):
    models ← []
    for i = 1..m:
        D_i ← BOOTSTRAP-SAMPLE(D)                # |D_i| = |D|, sampled with replacement
        models.append( L.train(D_i) )            # all features available at every split
    return models
 
PREDICT(models, x):
    return majority({ M.predict(x) for M in models })   # or mean for regression
 
BOOTSTRAP-SAMPLE(D):
    n ← |D|
    return [ D[random_int(0, n-1)] for _ in range(n) ]   # WITH replacement

Difference from Random Forest. Bagging uses all features at every split; RF additionally restricts splits to a random feature subset. RF therefore produces more decorrelated trees and typically performs better.


4. k-means (Lloyd’s algorithm)

Goal. Partition X = {x₁, …, xₙ} into k clusters minimising WCSS J.

K-MEANS(X, k, max_iters):
    μ ← INIT-CENTROIDS(X, k)                     # random k points, or k-means++
    for t = 1..max_iters:
        # 1. Assignment step
        for each x_i ∈ X:
            c(i) ← argmin_{j ∈ 1..k}  ‖x_i − μ_j‖²
        # 2. Update step
        new_μ ← μ
        for j = 1..k:
            new_μ_j ← mean({ x_i : c(i) = j })   # if cluster empty: reinit randomly
        if new_μ == μ:                            # converged
            return c, μ
        μ ← new_μ
    return c, μ

Cost. O(n · k · d · t) where d = dimensionality, t = iterations.
Why “Lloyd’s”? Stuart Lloyd proposed this two-step (assign, update) heuristic in 1957. Slides call it “k-means” without naming the inventor.
k-means++ initialisation (external, Arthur & Vassilvitskii 2007): pick centroids one at a time, biased toward points far from existing centroids. Dramatically reduces sensitivity to initialisation.


5. Hierarchical clustering (agglomerative)

Goal. Build a dendrogram by repeatedly merging the two closest clusters.

HAC(X, linkage):
    clusters ← [ {x_i} for x_i in X ]            # start: each point its own cluster
    D ← pairwise distance matrix of clusters
    dendrogram ← []
 
    while |clusters| > 1:
        (A, B) ← argmin_{i ≠ j}  linkage(C_i, C_j, D)   # find closest pair
        new_C  ← A ∪ B
        dendrogram.append( (A, B, linkage(A, B, D)) )
 
        clusters.remove(A); clusters.remove(B); clusters.append(new_C)
        UPDATE-D(D, new_C, clusters, linkage)             # recompute distances
 
    return dendrogram

Linkage choices (distance between clusters C_A, C_B):

  • Single: min_{a∈A, b∈B} d(a, b)
  • Complete: max_{a∈A, b∈B} d(a, b)
  • Average: mean_{a∈A, b∈B} d(a, b)
  • Centroid: d(mean(A), mean(B))

Cost. Naïve O(n³); with priority queues O(n² log n); single-linkage admits O(n²) via SLINK.
Cutting the dendrogram. Pick a height threshold → connected components below it are the clusters. Equivalent to choosing k retroactively.


6. k-fold cross-validation

Goal. Estimate generalisation error of a learning procedure (not a fixed model) using all of the data for both training and testing.

K-FOLD-CV(D, learner L, k):
    shuffle(D)
    folds ← partition D into k equal-size disjoint folds
    errors ← []
    for i = 1..k:
        D_test  ← folds[i]
        D_train ← union of all other folds
        model   ← L.train(D_train)
        errors.append( error(model, D_test) )
    return mean(errors), std(errors)

Leave-one-out (LOO) = K-FOLD-CV with k = n.
Cost. k model fits.
Use. Hyperparameter tuning, model selection, reporting generalisation error. Standard k = 5 or 10. External / not in slides.


See also

Tags: methods-of-ai machine-learning ai-generated
Lernzettel: lernzettel_ml-i-ii_30-04-26
Quizzes: quiz_ml_30-04-26 · quiz_decision-trees_18-05-26
Sibling references: lernzettel_svm_30-04-26 · lernzettel_neural-networks-deep-learning_30-04-26
Atomic notes: Machine Learning · Bias-Variance Tradeoff · Decision Trees and ID3 · Random Forest · Supervised Learning · Unsupervised Learning
Superlink: Methods of AI Lecture · 611 📠Machine Learning
Questions hub: Questions for Methods of AI

Created: 23/05/26 (merged from ml-i-ii_30-04-26)