Methods of AI — Exam Cheat Sheet
A4-printable. Densest possible distillation of all 9 topics: formulas, definitions, exam traps. No examples, no derivations — just the essentials. Companion to Algorithm Decision Tree — MoAI.
1. Local Search
Hill Climbing — pick best-improving neighbor; stop when none improves.
- Variants: Steepest, First-Choice, Stochastic, Random-Restart (P(success) → 1 as n → ∞)
- ⚠️ 3 failure modes: local maximum, plateau, ridge
Simulated Annealing — accept worse moves with prob P = exp(−Δ/T)
- Δ = f(x_new) − f(x_current); for minimization Δ > 0 means worse
- High T → exploration, Low T → exploitation
- ⚠️ Slow cooling schedule essential; too-fast cooling = stuck in local min
- ⚠️ ≠ random search even at high T: still uses gradient information from Δ
Local Beam Search — keep k states; pool all k·b successors; keep best k
- ⚠️ NOT parallel HC: beams share info → can collapse onto one region
- Stochastic Beam Search — sample k probabilistically → preserves diversity
Genetic Algorithms — population + selection + crossover + mutation
- ⚠️ Mutation = only way to introduce new alleles; without it, evolution halts
- Elitism = guarantee best monotone non-decreasing
- Typical: mutation_rate ≈ 1 / chromosome_length
2. Constraint Satisfaction (CSP)
Definition: tuple ({X₁..Xₙ}, {D₁..Dₙ}, R) — variables, domains, constraints
- Consistent = solution exists. Solution = complete + consistent assignment
Backtracking heuristics:
- MRV (Minimum Remaining Values) — pick variable with fewest legal values → fail fast (primary)
- Degree — pick variable with most constraints (tie-breaker for MRV)
- LCV (Least Constraining Value) — pick value that rules out fewest neighbors (different choice point: value, not variable)
Consistency:
- Node consistency — unary constraints satisfied (trivial)
- Arc consistency — every value in Dᵢ has support in Dⱼ; algorithm: AC-3 in O(c·d³) where c = arcs, d = max domain size
- Path consistency — generalizes to triples
- ⚠️ Arc consistency does NOT guarantee solution exists — only domain pruning
RCC-8 — 8 jointly exhaustive pairwise disjoint spatial relations: DC, EC, PO, EQ, TPP, NTPP, TPPi, NTPPi
3. Planning I — Classical
STRIPS operators = (PRE, ADD, DEL) lists; state = set of ground atoms
- Apply: if PRE ⊆ state → new state = (state ∖ DEL) ∪ ADD
- ⚠️ STRIPS sidesteps Frame Problem via CWA + ADD/DEL — does NOT solve it
The Three Problems (must memorize all three):
| Name | What it asks |
|---|---|
| Frame | What stays unchanged after an action? |
| Qualification | What are ALL preconditions for action to succeed? (can be infinite) |
| Ramification | What are the indirect/side effects of an action? |
Situation Calculus — FOL with situations: S₀ (initial), do(a, s) (after action)
- Fluents = time-varying predicates: On(A, B, s)
- Frame axioms: O(n × m) for n actions × m fluents → impractical at scale
- Reiter’s Successor State Axiom — combines all positive + negative effects per fluent
PDDL extends STRIPS with: numeric fluents, durative actions, derived predicates, typed objects
4. Planning II — MDPs / RL
MDP = tuple (S, A, p, r)
- p(s’|s, a) = transition probability; r(s, a) = reward
- ⚠️ γ (discount factor) is NOT part of the MDP tuple — it’s part of the objective
Bellman equations (memorize BOTH forms):
- Policy eval:
Vᵖ(s) = Σ_a π(a|s) · Σ_s' p(s'|s,a) · [r(s,a) + γ · Vᵖ(s')] - Bellman optimality:
V*(s) = max_a Σ_s' p(s'|s,a) · [r(s,a) + γ · V*(s')]
Solution algorithms:
- Value Iteration — iterate Bellman optimality until V converges; slow when γ → 1
- Policy Iteration — alternate policy evaluation (linear system) + greedy improvement; fewer iters but each is more expensive
Q-Learning update: Q(s,a) ← Q(s,a) + α · [r + γ·max_a' Q(s',a') − Q(s,a)]
Markov property: future depends only on current state, not history
5. Knowledge Representation (DL)
ALC concept constructors:
| Symbol | Meaning |
|---|---|
| ⊤ / ⊥ | top / bottom (everything / nothing) |
| ¬C | negation |
| C ⊓ D / C ⊔ D | intersection / union |
| ∃R.C | some R-successor in C |
| ∀R.C | all R-successors in C |
⚠️ ∀R.C is vacuously true when there are no R-successors at all (e.g. ∀hasChild.Doctor is true for everyone childless)
TBox vs. ABox:
- TBox = terminology (Student ⊑ Person); reasoning: subsumption, classification
- ABox = assertions (alice : Student); reasoning: instance check, realization
OWA vs. CWA ⚠️
- DL / OWL → Open World: absence ≠ negation. Must assert ¬ explicitly.
- STRIPS / databases → Closed World: absence = false.
6. Vagueness & Uncertainty
Fuzzy set: μ_A: U → [0, 1] (membership function)
- Operations: ∩ ↔ t-norm (min, product, Łukasiewicz max(0, x+y−1))
- ∪ ↔ s-norm (max, x+y−xy, min(1, x+y))
- Complement: μ_A^c(x) = 1 − μ_A(x)
- ⚠️ A ⊓ A^c ≠ ∅ in fuzzy (excluded middle fails)
t-norm axioms (memorize all 4):
- Neutral element: T(x, 1) = x
- Commutativity: T(x, y) = T(y, x)
- Associativity: T(T(x, y), z) = T(x, T(y, z))
- Monotonicity: x ≤ x’ ⇒ T(x, y) ≤ T(x’, y)
⚠️ Fuzzy ≠ Probability: fuzzy = vagueness of predicate, probability = uncertainty about crisp fact.
Dempster-Shafer: mass function m: 2^Θ → [0, 1]
- Bel(A) = Σ_{B⊆A} m(B) — lower bound on confidence
- Pl(A) = Σ_{B∩A≠∅} m(B) — upper bound on confidence
- 0 ≤ Bel(A) ≤ Pl(A) ≤ 1; gap = ignorance
7. Machine Learning I & II
Bias-Variance decomposition (squared loss):
E[(y − ŷ)²] = Bias² + Variance + σ² (irreducible)
- ↑ complexity → ↓ bias, ↑ variance
- ⚠️ Overfitting = high variance (NOT high bias)
Entropy + Information Gain (ID3):
- H(S) = −Σ pᵢ · log₂(pᵢ)
- Gain(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) · H(Sᵥ)
- ⚠️ Trap: high-cardinality attributes (e.g. UserID) get maximum gain → catastrophic overfit
- Fix: Gain Ratio (C4.5), Gini (CART), feature subsampling (RF)
Random Forest = Bagging + feature subsampling per split
- ⚠️ Bagging reduces VARIANCE, not bias
- OOB error ≈ free cross-validation (each example unused in ~37% of trees, since 1 − 1/e ≈ 0.368)
Cross-validation: k-fold → train on k−1, test on 1, average
8. Machine Learning III — SVM + Perceptron
Perceptron — linear classifier; learning rule w ← w + η(y − ŷ)x
- ⚠️ Can only learn linearly separable functions; XOR fails
- Convergence guaranteed only if data is linearly separable (Novikoff)
SVM — maximum margin hyperplane ⟨w, x⟩ + b = 0
- Optimization: min ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩ + b) ≥ 1
- Margin = 2 / ‖w‖
- Lagrangian:
L = ½‖w‖² − Σ αᵢ(yᵢ(⟨xᵢ,w⟩+b) − 1)⚠️ MINUS sign - KKT: αᵢ > 0 only for support vectors; only they define the hyperplane
- Soft margin: add slack ξᵢ ≥ 0; objective: ½‖w‖² + C·Σξᵢ
- ⚠️ C large → narrow margin, overfit risk; C small → wide margin, more robust
Kernel trick: replace ⟨x, x’⟩ with K(x, x’)
- Common: linear, polynomial
(⟨x,x'⟩+c)^d, RBFexp(−γ‖x−x'‖²)(infinite-dim) - Mercer’s condition: K is valid kernel iff Gram matrix is positive semi-definite
9. Neural Networks & Deep Learning
Hopfield Networks (attractor memory):
- Hebbian learning: W = Σ xᵏ(xᵏ)ᵀ, diagonal = 0
- Update: yᵢ ← sign(Σ wᵢⱼ yⱼ)
- Energy: E = −½ Σ wᵢⱼ yᵢ yⱼ (monotone decreasing → convergence)
- Capacity: “up to N” patterns per slides; reality ~0.14·N
MLP: input → hidden (ReLU) → output. Universal approximator with one hidden layer.
Backpropagation: chain rule applied layer-by-layer backwards. SGD: small random batches.
Deep Learning challenges & fixes:
| Challenge | Fix |
|---|---|
| Vanishing gradient | ReLU, residual connections |
| Overfitting | Dropout, L2 weight decay, early stopping |
| Exploding gradients | Gradient clipping, normalization |
Self-Attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
- ⚠️ Divide by √d_k prevents softmax saturation
- Modern Hopfield ≡ Self-Attention (Ramsauer 2020)
Positional encoding needed because attention is permutation-equivariant
- sin/cos encoding: PE(pos, 2i) = sin(pos/10000^(2i/d))
- Linear in relative position (special property of sin/cos)
BERT vs. GPT:
- BERT: encoder-only, bidirectional, Masked LM objective
- GPT: decoder-only, causal mask, next-token (autoregressive) objective
🚨 Top exam traps to memorize
- STRIPS doesn’t solve the Frame Problem — it sidesteps it via CWA + ADD/DEL
- γ is NOT part of the MDP tuple (S, A, p, r) — it’s part of the discounted objective
- Fuzzy ≠ Probability — fuzzy = vagueness, probability = uncertainty about crisp facts
- Overfitting = high variance (not bias). Bagging reduces variance, not bias.
- DL uses OWA (Open World) — STRIPS/CSP/databases use CWA
- Parallel HC ≠ Local Beam Search — Beam pools successors across beams; Parallel HC is independent
- MRV vs. LCV are different choice points — MRV picks variable, LCV picks value
- Bias-variance vs. overfitting/underfitting — overfit = high variance, underfit = high bias
- Perceptron + XOR — single-layer perceptron cannot learn XOR (linearly inseparable); MLP fixes it
- ∀R.C is vacuously true with no R-fillers — counterintuitive but classical-logic correct
- SVM Lagrangian has MINUS sign in front of Σαᵢ — from inequality constraint convention
- RBF kernel feature space is INFINITE dimensional — that’s why you must use the kernel trick
- Random Forest = Bagging + feature subsampling per split — second part is what decorrelates trees
- Q-Learning’s Bellman equation uses MAX over next actions — Value Iteration also uses max; Policy evaluation does NOT
- t-norm requires neutral element 1; s-norm requires neutral element 0
Formulas Quick-Reference
All | inside formulas are escaped as \| so the markdown table renders correctly.
| Topic | Formula |
|---|---|
| SA acceptance | P = exp(−Δ/T) |
| SA delta | Δ = f(x_new) − f(x_current) ; for minimization Δ>0 = worse |
| Bellman optimality | V*(s) = max_a Σ_s' p(s' | s,a) · [ r(s,a) + γ·V*(s') ] |
| Bellman policy eval | Vᵖ(s) = Σ_a π(a|s) · Σ_s' p(s'|s,a) · [ r(s,a) + γ·Vᵖ(s') ] |
| Q-Learning update | Q(s,a) ← Q(s,a) + α · [ r + γ·max_a' Q(s',a') − Q(s,a) ] |
| TD error | δ = r + γ·V(s') − V(s) |
| Entropy | H(S) = − Σ_i pᵢ · log₂(pᵢ) |
| Information Gain | Gain(S,A) = H(S) − Σ_v (|Sᵥ| / |S|) · H(Sᵥ) |
| Gini impurity | Gini(S) = 1 − Σ_i pᵢ² |
| Bias-variance | E[(y − ŷ)²] = Bias² + Variance + σ² |
| SVM primal | min ½‖w‖² s.t. yᵢ · (⟨xᵢ,w⟩ + b) ≥ 1 |
| SVM Lagrangian | L = ½‖w‖² − Σᵢ αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1) ⚠️ MINUS |
| SVM margin | margin = 2 / ‖w‖ |
| SVM decision | f(x) = sgn(Σᵢ αᵢ yᵢ K(x, xᵢ) + b) |
| RBF kernel | K(x, y) = exp(−γ ‖x − y‖²) |
| Polynomial kernel | K(x, y) = (⟨x, y⟩ + c)^d |
| Perceptron update | w ← w + η · (y − ŷ) · x |
| Gradient descent | θ ← θ − η · ∇L(θ) |
| Adam update | m = β₁m + (1−β₁)g ; v = β₂v + (1−β₂)g² ; θ ← θ − η·m̂/(√v̂ + ε) |
| Self-attention | Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V |
| Positional encoding (sin/cos) | PE(pos, 2i) = sin(pos / 10000^(2i/d)) |
| Hopfield update | yᵢ ← sign(Σⱼ wᵢⱼ · yⱼ) |
| Hopfield energy | E = − ½ · Σᵢⱼ wᵢⱼ · yᵢ · yⱼ (monotone decreasing → convergence) |
| Hebbian learning | W = Σ_μ xᵘ (xᵘ)ᵀ ; diagonal zeroed |
| Hopfield capacity | up to N patterns / ~0.14·N realistic |
| Fuzzy AND (Zadeh) | μ_{A∩B}(x) = min(μ_A(x), μ_B(x)) |
| Fuzzy OR (Zadeh) | μ_{A∪B}(x) = max(μ_A(x), μ_B(x)) |
| Fuzzy complement | μ_{¬A}(x) = 1 − μ_A(x) |
| t-norm Łukasiewicz | T(x, y) = max(0, x + y − 1) |
| s-norm algebraic sum | S(x, y) = x + y − x·y |
| DS Belief | Bel(A) = Σ_{B ⊆ A} m(B) (lower bound) |
| DS Plausibility | Pl(A) = Σ_{B ∩ A ≠ ∅} m(B) (upper bound) |
| DS ignorance | Pl(A) − Bel(A) |
| AC-3 complexity | O(c · d³) where c = arcs, d = max domain size |
| Bootstrap OOB fraction | (1 − 1/N)^N → 1/e ≈ 0.368 |
| GA mutation rate (default) | μ ≈ 1 / chromosome_length (≈ 1 flip per chromosome) |
| HC random restart success | P(success after n) = 1 − (1 − p)ⁿ → 1 |
See also
- Algorithm Decision Tree — MoAI — flowchart companion (which algorithm for which problem)
- Questions for Methods of AI — 139 exam Q&A
- Methods of AI Lecture — full course MOC
- Lernzettel: lernzettel_local-search_30-04-26 · lernzettel_csp_30-04-26 · lernzettel_planning-i_30-04-26 · lernzettel_planning-ii_30-04-26 · lernzettel_knowledge-representation_30-04-26 · lernzettel_vagueness-uncertainty_30-04-26 · lernzettel_ml-i-ii_30-04-26 · lernzettel_svm_30-04-26 · lernzettel_neural-networks-deep-learning_30-04-26
Tags: methods-of-ai cheat-sheet exam-prep
Created: 18-05-26