Methods of AI — Exam Cheat Sheet

methods-of-ai exam-prep

A4-printable. Densest possible distillation of all 9 topics: formulas, definitions, exam traps. No examples, no derivations — just the essentials. Companion to Algorithm Decision Tree — MoAI.


Hill Climbing — pick best-improving neighbor; stop when none improves.

  • Variants: Steepest, First-Choice, Stochastic, Random-Restart (P(success) → 1 as n → ∞)
  • ⚠️ 3 failure modes: local maximum, plateau, ridge

Simulated Annealing — accept worse moves with prob P = exp(−Δ/T)

  • Δ = f(x_new) − f(x_current); for minimization Δ > 0 means worse
  • High T → exploration, Low T → exploitation
  • ⚠️ Slow cooling schedule essential; too-fast cooling = stuck in local min
  • ⚠️ ≠ random search even at high T: still uses gradient information from Δ

Local Beam Search — keep k states; pool all k·b successors; keep best k

  • ⚠️ NOT parallel HC: beams share info → can collapse onto one region
  • Stochastic Beam Search — sample k probabilistically → preserves diversity

Genetic Algorithms — population + selection + crossover + mutation

  • ⚠️ Mutation = only way to introduce new alleles; without it, evolution halts
  • Elitism = guarantee best monotone non-decreasing
  • Typical: mutation_rate ≈ 1 / chromosome_length

2. Constraint Satisfaction (CSP)

Definition: tuple ({X₁..Xₙ}, {D₁..Dₙ}, R) — variables, domains, constraints

  • Consistent = solution exists. Solution = complete + consistent assignment

Backtracking heuristics:

  • MRV (Minimum Remaining Values) — pick variable with fewest legal values → fail fast (primary)
  • Degree — pick variable with most constraints (tie-breaker for MRV)
  • LCV (Least Constraining Value) — pick value that rules out fewest neighbors (different choice point: value, not variable)

Consistency:

  • Node consistency — unary constraints satisfied (trivial)
  • Arc consistency — every value in Dᵢ has support in Dⱼ; algorithm: AC-3 in O(c·d³) where c = arcs, d = max domain size
  • Path consistency — generalizes to triples
  • ⚠️ Arc consistency does NOT guarantee solution exists — only domain pruning

RCC-8 — 8 jointly exhaustive pairwise disjoint spatial relations: DC, EC, PO, EQ, TPP, NTPP, TPPi, NTPPi

3. Planning I — Classical

STRIPS operators = (PRE, ADD, DEL) lists; state = set of ground atoms

  • Apply: if PRE ⊆ state → new state = (state ∖ DEL) ∪ ADD
  • ⚠️ STRIPS sidesteps Frame Problem via CWA + ADD/DEL — does NOT solve it

The Three Problems (must memorize all three):

NameWhat it asks
FrameWhat stays unchanged after an action?
QualificationWhat are ALL preconditions for action to succeed? (can be infinite)
RamificationWhat are the indirect/side effects of an action?

Situation Calculus — FOL with situations: S₀ (initial), do(a, s) (after action)

  • Fluents = time-varying predicates: On(A, B, s)
  • Frame axioms: O(n × m) for n actions × m fluents → impractical at scale
  • Reiter’s Successor State Axiom — combines all positive + negative effects per fluent

PDDL extends STRIPS with: numeric fluents, durative actions, derived predicates, typed objects

4. Planning II — MDPs / RL

MDP = tuple (S, A, p, r)

  • p(s’|s, a) = transition probability; r(s, a) = reward
  • ⚠️ γ (discount factor) is NOT part of the MDP tuple — it’s part of the objective

Bellman equations (memorize BOTH forms):

  • Policy eval: Vᵖ(s) = Σ_a π(a|s) · Σ_s' p(s'|s,a) · [r(s,a) + γ · Vᵖ(s')]
  • Bellman optimality: V*(s) = max_a Σ_s' p(s'|s,a) · [r(s,a) + γ · V*(s')]

Solution algorithms:

  • Value Iteration — iterate Bellman optimality until V converges; slow when γ → 1
  • Policy Iteration — alternate policy evaluation (linear system) + greedy improvement; fewer iters but each is more expensive

Q-Learning update: Q(s,a) ← Q(s,a) + α · [r + γ·max_a' Q(s',a') − Q(s,a)]

Markov property: future depends only on current state, not history

5. Knowledge Representation (DL)

ALC concept constructors:

SymbolMeaning
⊤ / ⊥top / bottom (everything / nothing)
¬Cnegation
C ⊓ D / C ⊔ Dintersection / union
∃R.Csome R-successor in C
∀R.Call R-successors in C

⚠️ ∀R.C is vacuously true when there are no R-successors at all (e.g. ∀hasChild.Doctor is true for everyone childless)

TBox vs. ABox:

  • TBox = terminology (Student ⊑ Person); reasoning: subsumption, classification
  • ABox = assertions (alice : Student); reasoning: instance check, realization

OWA vs. CWA ⚠️

  • DL / OWL → Open World: absence ≠ negation. Must assert ¬ explicitly.
  • STRIPS / databases → Closed World: absence = false.

6. Vagueness & Uncertainty

Fuzzy set: μ_A: U → [0, 1] (membership function)

  • Operations: ∩ ↔ t-norm (min, product, Łukasiewicz max(0, x+y−1))
  • ∪ ↔ s-norm (max, x+y−xy, min(1, x+y))
  • Complement: μ_A^c(x) = 1 − μ_A(x)
  • ⚠️ A ⊓ A^c ≠ ∅ in fuzzy (excluded middle fails)

t-norm axioms (memorize all 4):

  1. Neutral element: T(x, 1) = x
  2. Commutativity: T(x, y) = T(y, x)
  3. Associativity: T(T(x, y), z) = T(x, T(y, z))
  4. Monotonicity: x ≤ x’ ⇒ T(x, y) ≤ T(x’, y)

⚠️ Fuzzy ≠ Probability: fuzzy = vagueness of predicate, probability = uncertainty about crisp fact.

Dempster-Shafer: mass function m: 2^Θ → [0, 1]

  • Bel(A) = Σ_{B⊆A} m(B) — lower bound on confidence
  • Pl(A) = Σ_{B∩A≠∅} m(B) — upper bound on confidence
  • 0 ≤ Bel(A) ≤ Pl(A) ≤ 1; gap = ignorance

7. Machine Learning I & II

Bias-Variance decomposition (squared loss):
E[(y − ŷ)²] = Bias² + Variance + σ² (irreducible)

  • ↑ complexity → ↓ bias, ↑ variance
  • ⚠️ Overfitting = high variance (NOT high bias)

Entropy + Information Gain (ID3):

  • H(S) = −Σ pᵢ · log₂(pᵢ)
  • Gain(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) · H(Sᵥ)
  • ⚠️ Trap: high-cardinality attributes (e.g. UserID) get maximum gain → catastrophic overfit
  • Fix: Gain Ratio (C4.5), Gini (CART), feature subsampling (RF)

Random Forest = Bagging + feature subsampling per split

  • ⚠️ Bagging reduces VARIANCE, not bias
  • OOB error ≈ free cross-validation (each example unused in ~37% of trees, since 1 − 1/e ≈ 0.368)

Cross-validation: k-fold → train on k−1, test on 1, average

8. Machine Learning III — SVM + Perceptron

Perceptron — linear classifier; learning rule w ← w + η(y − ŷ)x

  • ⚠️ Can only learn linearly separable functions; XOR fails
  • Convergence guaranteed only if data is linearly separable (Novikoff)

SVM — maximum margin hyperplane ⟨w, x⟩ + b = 0

  • Optimization: min ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩ + b) ≥ 1
  • Margin = 2 / ‖w‖
  • Lagrangian: L = ½‖w‖² − Σ αᵢ(yᵢ(⟨xᵢ,w⟩+b) − 1) ⚠️ MINUS sign
  • KKT: αᵢ > 0 only for support vectors; only they define the hyperplane
  • Soft margin: add slack ξᵢ ≥ 0; objective: ½‖w‖² + C·Σξᵢ
  • ⚠️ C large → narrow margin, overfit risk; C small → wide margin, more robust

Kernel trick: replace ⟨x, x’⟩ with K(x, x’)

  • Common: linear, polynomial (⟨x,x'⟩+c)^d, RBF exp(−γ‖x−x'‖²) (infinite-dim)
  • Mercer’s condition: K is valid kernel iff Gram matrix is positive semi-definite

9. Neural Networks & Deep Learning

Hopfield Networks (attractor memory):

  • Hebbian learning: W = Σ xᵏ(xᵏ)ᵀ, diagonal = 0
  • Update: yᵢ ← sign(Σ wᵢⱼ yⱼ)
  • Energy: E = −½ Σ wᵢⱼ yᵢ yⱼ (monotone decreasing → convergence)
  • Capacity: “up to N” patterns per slides; reality ~0.14·N

MLP: input → hidden (ReLU) → output. Universal approximator with one hidden layer.

Backpropagation: chain rule applied layer-by-layer backwards. SGD: small random batches.

Deep Learning challenges & fixes:

ChallengeFix
Vanishing gradientReLU, residual connections
OverfittingDropout, L2 weight decay, early stopping
Exploding gradientsGradient clipping, normalization

Self-Attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

  • ⚠️ Divide by √d_k prevents softmax saturation
  • Modern Hopfield ≡ Self-Attention (Ramsauer 2020)

Positional encoding needed because attention is permutation-equivariant

  • sin/cos encoding: PE(pos, 2i) = sin(pos/10000^(2i/d))
  • Linear in relative position (special property of sin/cos)

BERT vs. GPT:

  • BERT: encoder-only, bidirectional, Masked LM objective
  • GPT: decoder-only, causal mask, next-token (autoregressive) objective

🚨 Top exam traps to memorize

  1. STRIPS doesn’t solve the Frame Problem — it sidesteps it via CWA + ADD/DEL
  2. γ is NOT part of the MDP tuple (S, A, p, r) — it’s part of the discounted objective
  3. Fuzzy ≠ Probability — fuzzy = vagueness, probability = uncertainty about crisp facts
  4. Overfitting = high variance (not bias). Bagging reduces variance, not bias.
  5. DL uses OWA (Open World) — STRIPS/CSP/databases use CWA
  6. Parallel HC ≠ Local Beam Search — Beam pools successors across beams; Parallel HC is independent
  7. MRV vs. LCV are different choice points — MRV picks variable, LCV picks value
  8. Bias-variance vs. overfitting/underfitting — overfit = high variance, underfit = high bias
  9. Perceptron + XOR — single-layer perceptron cannot learn XOR (linearly inseparable); MLP fixes it
  10. ∀R.C is vacuously true with no R-fillers — counterintuitive but classical-logic correct
  11. SVM Lagrangian has MINUS sign in front of Σαᵢ — from inequality constraint convention
  12. RBF kernel feature space is INFINITE dimensional — that’s why you must use the kernel trick
  13. Random Forest = Bagging + feature subsampling per split — second part is what decorrelates trees
  14. Q-Learning’s Bellman equation uses MAX over next actions — Value Iteration also uses max; Policy evaluation does NOT
  15. t-norm requires neutral element 1; s-norm requires neutral element 0

Formulas Quick-Reference

All | inside formulas are escaped as \| so the markdown table renders correctly.

TopicFormula
SA acceptanceP = exp(−Δ/T)
SA deltaΔ = f(x_new) − f(x_current) ; for minimization Δ>0 = worse
Bellman optimalityV*(s) = max_a Σ_s' p(s' | s,a) · [ r(s,a) + γ·V*(s') ]
Bellman policy evalVᵖ(s) = Σ_a π(a|s) · Σ_s' p(s'|s,a) · [ r(s,a) + γ·Vᵖ(s') ]
Q-Learning updateQ(s,a) ← Q(s,a) + α · [ r + γ·max_a' Q(s',a') − Q(s,a) ]
TD errorδ = r + γ·V(s') − V(s)
EntropyH(S) = − Σ_i pᵢ · log₂(pᵢ)
Information GainGain(S,A) = H(S) − Σ_v (|Sᵥ| / |S|) · H(Sᵥ)
Gini impurityGini(S) = 1 − Σ_i pᵢ²
Bias-varianceE[(y − ŷ)²] = Bias² + Variance + σ²
SVM primalmin ½‖w‖² s.t. yᵢ · (⟨xᵢ,w⟩ + b) ≥ 1
SVM LagrangianL = ½‖w‖² − Σᵢ αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1) ⚠️ MINUS
SVM marginmargin = 2 / ‖w‖
SVM decisionf(x) = sgn(Σᵢ αᵢ yᵢ K(x, xᵢ) + b)
RBF kernelK(x, y) = exp(−γ ‖x − y‖²)
Polynomial kernelK(x, y) = (⟨x, y⟩ + c)^d
Perceptron updatew ← w + η · (y − ŷ) · x
Gradient descentθ ← θ − η · ∇L(θ)
Adam updatem = β₁m + (1−β₁)g ; v = β₂v + (1−β₂)g² ; θ ← θ − η·m̂/(√v̂ + ε)
Self-attentionAttention(Q,K,V) = softmax(QKᵀ / √d_k) · V
Positional encoding (sin/cos)PE(pos, 2i) = sin(pos / 10000^(2i/d))
Hopfield updateyᵢ ← sign(Σⱼ wᵢⱼ · yⱼ)
Hopfield energyE = − ½ · Σᵢⱼ wᵢⱼ · yᵢ · yⱼ (monotone decreasing → convergence)
Hebbian learningW = Σ_μ xᵘ (xᵘ)ᵀ ; diagonal zeroed
Hopfield capacityup to N patterns / ~0.14·N realistic
Fuzzy AND (Zadeh)μ_{A∩B}(x) = min(μ_A(x), μ_B(x))
Fuzzy OR (Zadeh)μ_{A∪B}(x) = max(μ_A(x), μ_B(x))
Fuzzy complementμ_{¬A}(x) = 1 − μ_A(x)
t-norm ŁukasiewiczT(x, y) = max(0, x + y − 1)
s-norm algebraic sumS(x, y) = x + y − x·y
DS BeliefBel(A) = Σ_{B ⊆ A} m(B) (lower bound)
DS PlausibilityPl(A) = Σ_{B ∩ A ≠ ∅} m(B) (upper bound)
DS ignorancePl(A) − Bel(A)
AC-3 complexityO(c · d³) where c = arcs, d = max domain size
Bootstrap OOB fraction(1 − 1/N)^N → 1/e ≈ 0.368
GA mutation rate (default)μ ≈ 1 / chromosome_length (≈ 1 flip per chromosome)
HC random restart successP(success after n) = 1 − (1 − p)ⁿ → 1

See also

Tags: methods-of-ai cheat-sheet exam-prep
Created: 18-05-26