Methods of AI — Exam Cheat Sheet

methods-of-ai exam-prep

A4-printable. Densest possible distillation of all 9 topics: formulas, definitions, exam traps. No examples, no derivations — just the essentials. Companion to Algorithm Decision Tree — MoAI.

1. Local Search

Hill Climbing — pick best-improving neighbor; stop when none improves.

Variants: Steepest, First-Choice, Stochastic, Random-Restart (P(success) → 1 as n → ∞)
⚠️ 3 failure modes: local maximum, plateau, ridge

Simulated Annealing — accept worse moves with prob P = exp(−Δ/T)

Δ = f(x_new) − f(x_current); for minimization Δ > 0 means worse
High T → exploration, Low T → exploitation
⚠️ Slow cooling schedule essential; too-fast cooling = stuck in local min
⚠️ ≠ random search even at high T: still uses gradient information from Δ

Local Beam Search — keep k states; pool all k·b successors; keep best k

⚠️ NOT parallel HC: beams share info → can collapse onto one region
Stochastic Beam Search — sample k probabilistically → preserves diversity

Genetic Algorithms — population + selection + crossover + mutation

⚠️ Mutation = only way to introduce new alleles; without it, evolution halts
Elitism = guarantee best monotone non-decreasing
Typical: mutation_rate ≈ 1 / chromosome_length

2. Constraint Satisfaction (CSP)

Definition: tuple ({X₁..Xₙ}, {D₁..Dₙ}, R) — variables, domains, constraints

Consistent = solution exists. Solution = complete + consistent assignment

Backtracking heuristics:

MRV (Minimum Remaining Values) — pick variable with fewest legal values → fail fast (primary)
Degree — pick variable with most constraints (tie-breaker for MRV)
LCV (Least Constraining Value) — pick value that rules out fewest neighbors (different choice point: value, not variable)

Consistency:

Node consistency — unary constraints satisfied (trivial)
Arc consistency — every value in Dᵢ has support in Dⱼ; algorithm: AC-3 in O(c·d³) where c = arcs, d = max domain size
Path consistency — generalizes to triples
⚠️ Arc consistency does NOT guarantee solution exists — only domain pruning

RCC-8 — 8 jointly exhaustive pairwise disjoint spatial relations: DC, EC, PO, EQ, TPP, NTPP, TPPi, NTPPi

3. Planning I — Classical

STRIPS operators = (PRE, ADD, DEL) lists; state = set of ground atoms

Apply: if PRE ⊆ state → new state = (state ∖ DEL) ∪ ADD
⚠️ STRIPS sidesteps Frame Problem via CWA + ADD/DEL — does NOT solve it

The Three Problems (must memorize all three):

Name	What it asks
Frame	What stays unchanged after an action?
Qualification	What are ALL preconditions for action to succeed? (can be infinite)
Ramification	What are the indirect/side effects of an action?

Situation Calculus — FOL with situations: S₀ (initial), do(a, s) (after action)

Fluents = time-varying predicates: On(A, B, s)
Frame axioms: O(n × m) for n actions × m fluents → impractical at scale
Reiter’s Successor State Axiom — combines all positive + negative effects per fluent

PDDL extends STRIPS with: numeric fluents, durative actions, derived predicates, typed objects

4. Planning II — MDPs / RL

MDP = tuple (S, A, p, r)

p(s’|s, a) = transition probability; r(s, a) = reward
⚠️ γ (discount factor) is NOT part of the MDP tuple — it’s part of the objective

Bellman equations (memorize BOTH forms):

Policy eval: Vᵖ(s) = Σ_a π(a|s) · Σ_s' p(s'|s,a) · [r(s,a) + γ · Vᵖ(s')]
Bellman optimality: V*(s) = max_a Σ_s' p(s'|s,a) · [r(s,a) + γ · V*(s')]

Solution algorithms:

Value Iteration — iterate Bellman optimality until V converges; slow when γ → 1
Policy Iteration — alternate policy evaluation (linear system) + greedy improvement; fewer iters but each is more expensive

Q-Learning update: Q(s,a) ← Q(s,a) + α · [r + γ·max_a' Q(s',a') − Q(s,a)]

Markov property: future depends only on current state, not history

5. Knowledge Representation (DL)

ALC concept constructors:

Symbol	Meaning
⊤ / ⊥	top / bottom (everything / nothing)
¬C	negation
C ⊓ D / C ⊔ D	intersection / union
∃R.C	some R-successor in C
∀R.C	all R-successors in C

⚠️ ∀R.C is vacuously true when there are no R-successors at all (e.g. ∀hasChild.Doctor is true for everyone childless)

TBox vs. ABox:

TBox = terminology (Student ⊑ Person); reasoning: subsumption, classification
ABox = assertions (alice : Student); reasoning: instance check, realization

OWA vs. CWA ⚠️

DL / OWL → Open World: absence ≠ negation. Must assert ¬ explicitly.
STRIPS / databases → Closed World: absence = false.

6. Vagueness & Uncertainty

Fuzzy set: μ_A: U → [0, 1] (membership function)

Operations: ∩ ↔ t-norm (min, product, Łukasiewicz max(0, x+y−1))
∪ ↔ s-norm (max, x+y−xy, min(1, x+y))
Complement: μ_A^c(x) = 1 − μ_A(x)
⚠️ A ⊓ A^c ≠ ∅ in fuzzy (excluded middle fails)

t-norm axioms (memorize all 4):

Neutral element: T(x, 1) = x
Commutativity: T(x, y) = T(y, x)
Associativity: T(T(x, y), z) = T(x, T(y, z))
Monotonicity: x ≤ x’ ⇒ T(x, y) ≤ T(x’, y)

⚠️ Fuzzy ≠ Probability: fuzzy = vagueness of predicate, probability = uncertainty about crisp fact.

Dempster-Shafer: mass function m: 2^Θ → [0, 1]

Bel(A) = Σ_{B⊆A} m(B) — lower bound on confidence
Pl(A) = Σ_{B∩A≠∅} m(B) — upper bound on confidence
0 ≤ Bel(A) ≤ Pl(A) ≤ 1; gap = ignorance

7. Machine Learning I & II

Bias-Variance decomposition (squared loss):
E[(y − ŷ)²] = Bias² + Variance + σ² (irreducible)

↑ complexity → ↓ bias, ↑ variance
⚠️ Overfitting = high variance (NOT high bias)

Entropy + Information Gain (ID3):

H(S) = −Σ pᵢ · log₂(pᵢ)
Gain(S, A) = H(S) − Σᵥ (|Sᵥ|/|S|) · H(Sᵥ)
⚠️ Trap: high-cardinality attributes (e.g. UserID) get maximum gain → catastrophic overfit
Fix: Gain Ratio (C4.5), Gini (CART), feature subsampling (RF)

Random Forest = Bagging + feature subsampling per split

⚠️ Bagging reduces VARIANCE, not bias
OOB error ≈ free cross-validation (each example unused in ~37% of trees, since 1 − 1/e ≈ 0.368)

Cross-validation: k-fold → train on k−1, test on 1, average

8. Machine Learning III — SVM + Perceptron

Perceptron — linear classifier; learning rule w ← w + η(y − ŷ)x

⚠️ Can only learn linearly separable functions; XOR fails
Convergence guaranteed only if data is linearly separable (Novikoff)

SVM — maximum margin hyperplane ⟨w, x⟩ + b = 0

Optimization: min ½‖w‖² s.t. yᵢ(⟨xᵢ,w⟩ + b) ≥ 1
Margin = 2 / ‖w‖
Lagrangian: L = ½‖w‖² − Σ αᵢ(yᵢ(⟨xᵢ,w⟩+b) − 1) ⚠️ MINUS sign
KKT: αᵢ > 0 only for support vectors; only they define the hyperplane
Soft margin: add slack ξᵢ ≥ 0; objective: ½‖w‖² + C·Σξᵢ
⚠️ C large → narrow margin, overfit risk; C small → wide margin, more robust

Kernel trick: replace ⟨x, x’⟩ with K(x, x’)

Common: linear, polynomial (⟨x,x'⟩+c)^d, RBF exp(−γ‖x−x'‖²) (infinite-dim)
Mercer’s condition: K is valid kernel iff Gram matrix is positive semi-definite

9. Neural Networks & Deep Learning

Hopfield Networks (attractor memory):

Hebbian learning: W = Σ xᵏ(xᵏ)ᵀ, diagonal = 0
Update: yᵢ ← sign(Σ wᵢⱼ yⱼ)
Energy: E = −½ Σ wᵢⱼ yᵢ yⱼ (monotone decreasing → convergence)
Capacity: “up to N” patterns per slides; reality ~0.14·N

MLP: input → hidden (ReLU) → output. Universal approximator with one hidden layer.

Backpropagation: chain rule applied layer-by-layer backwards. SGD: small random batches.

Deep Learning challenges & fixes:

Challenge	Fix
Vanishing gradient	ReLU, residual connections
Overfitting	Dropout, L2 weight decay, early stopping
Exploding gradients	Gradient clipping, normalization

Self-Attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

⚠️ Divide by √d_k prevents softmax saturation
Modern Hopfield ≡ Self-Attention (Ramsauer 2020)

Positional encoding needed because attention is permutation-equivariant

sin/cos encoding: PE(pos, 2i) = sin(pos/10000^(2i/d))
Linear in relative position (special property of sin/cos)

BERT vs. GPT:

BERT: encoder-only, bidirectional, Masked LM objective
GPT: decoder-only, causal mask, next-token (autoregressive) objective

🚨 Top exam traps to memorize

STRIPS doesn’t solve the Frame Problem — it sidesteps it via CWA + ADD/DEL
γ is NOT part of the MDP tuple (S, A, p, r) — it’s part of the discounted objective
Fuzzy ≠ Probability — fuzzy = vagueness, probability = uncertainty about crisp facts
Overfitting = high variance (not bias). Bagging reduces variance, not bias.
DL uses OWA (Open World) — STRIPS/CSP/databases use CWA
Parallel HC ≠ Local Beam Search — Beam pools successors across beams; Parallel HC is independent
MRV vs. LCV are different choice points — MRV picks variable, LCV picks value
Bias-variance vs. overfitting/underfitting — overfit = high variance, underfit = high bias
Perceptron + XOR — single-layer perceptron cannot learn XOR (linearly inseparable); MLP fixes it
∀R.C is vacuously true with no R-fillers — counterintuitive but classical-logic correct
SVM Lagrangian has MINUS sign in front of Σαᵢ — from inequality constraint convention
RBF kernel feature space is INFINITE dimensional — that’s why you must use the kernel trick
Random Forest = Bagging + feature subsampling per split — second part is what decorrelates trees
Q-Learning’s Bellman equation uses MAX over next actions — Value Iteration also uses max; Policy evaluation does NOT
t-norm requires neutral element 1; s-norm requires neutral element 0

Formulas Quick-Reference

All | inside formulas are escaped as \| so the markdown table renders correctly.

Topic	Formula
SA acceptance	`P = exp(−Δ/T)`
SA delta	`Δ = f(x_new) − f(x_current)` ; for minimization Δ>0 = worse
Bellman optimality	`V(s) = max_a Σ_s' p(s' \| s,a) · [ r(s,a) + γ·V(s') ]`
Bellman policy eval	`Vᵖ(s) = Σ_a π(a\|s) · Σ_s' p(s'\|s,a) · [ r(s,a) + γ·Vᵖ(s') ]`
Q-Learning update	`Q(s,a) ← Q(s,a) + α · [ r + γ·max_a' Q(s',a') − Q(s,a) ]`
TD error	`δ = r + γ·V(s') − V(s)`
Entropy	`H(S) = − Σ_i pᵢ · log₂(pᵢ)`
Information Gain	`Gain(S,A) = H(S) − Σ_v (\|Sᵥ\| / \|S\|) · H(Sᵥ)`
Gini impurity	`Gini(S) = 1 − Σ_i pᵢ²`
Bias-variance	`E[(y − ŷ)²] = Bias² + Variance + σ²`
SVM primal	`min ½‖w‖² s.t. yᵢ · (⟨xᵢ,w⟩ + b) ≥ 1`
SVM Lagrangian	`L = ½‖w‖² − Σᵢ αᵢ · (yᵢ(⟨xᵢ,w⟩+b) − 1)` ⚠️ MINUS
SVM margin	`margin = 2 / ‖w‖`
SVM decision	`f(x) = sgn(Σᵢ αᵢ yᵢ K(x, xᵢ) + b)`
RBF kernel	`K(x, y) = exp(−γ ‖x − y‖²)`
Polynomial kernel	`K(x, y) = (⟨x, y⟩ + c)^d`
Perceptron update	`w ← w + η · (y − ŷ) · x`
Gradient descent	`θ ← θ − η · ∇L(θ)`
Adam update	`m = β₁m + (1−β₁)g ; v = β₂v + (1−β₂)g² ; θ ← θ − η·m̂/(√v̂ + ε)`
Self-attention	`Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V`
Positional encoding (sin/cos)	`PE(pos, 2i) = sin(pos / 10000^(2i/d))`
Hopfield update	`yᵢ ← sign(Σⱼ wᵢⱼ · yⱼ)`
Hopfield energy	`E = − ½ · Σᵢⱼ wᵢⱼ · yᵢ · yⱼ` (monotone decreasing → convergence)
Hebbian learning	`W = Σ_μ xᵘ (xᵘ)ᵀ` ; diagonal zeroed
Hopfield capacity	`up to N patterns / ~0.14·N realistic`
Fuzzy AND (Zadeh)	`μ_{A∩B}(x) = min(μ_A(x), μ_B(x))`
Fuzzy OR (Zadeh)	`μ_{A∪B}(x) = max(μ_A(x), μ_B(x))`
Fuzzy complement	`μ_{¬A}(x) = 1 − μ_A(x)`
t-norm Łukasiewicz	`T(x, y) = max(0, x + y − 1)`
s-norm algebraic sum	`S(x, y) = x + y − x·y`
DS Belief	`Bel(A) = Σ_{B ⊆ A} m(B)` (lower bound)
DS Plausibility	`Pl(A) = Σ_{B ∩ A ≠ ∅} m(B)` (upper bound)
DS ignorance	`Pl(A) − Bel(A)`
AC-3 complexity	`O(c · d³)` where c = arcs, d = max domain size
Bootstrap OOB fraction	`(1 − 1/N)^N → 1/e ≈ 0.368`
GA mutation rate (default)	`μ ≈ 1 / chromosome_length` (≈ 1 flip per chromosome)
HC random restart success	`P(success after n) = 1 − (1 − p)ⁿ → 1`

Brain Online

Explorer

Methods of AI — Exam Cheat Sheet

Methods of AI — Exam Cheat Sheet

1. Local Search

2. Constraint Satisfaction (CSP)

3. Planning I — Classical

4. Planning II — MDPs / RL

5. Knowledge Representation (DL)

6. Vagueness & Uncertainty

7. Machine Learning I & II

8. Machine Learning III — SVM + Perceptron

9. Neural Networks & Deep Learning

🚨 Top exam traps to memorize

Formulas Quick-Reference

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis