Lernzettel: Planning II — MDPs & Probabilistic Planning

Methods of AI — SoSe 2026 · 1-page exam sheet

For more depth: Markov Decision Process (MDP) (full reference: all 6 algorithms with pseudocode, worked Grid World, deep explanations) · quiz_planning_30-04-26 (8 practice questions)

Core Ideas

  • Classical planning assumes deterministic actions. MDPs handle uncertainty (stochastic outcomes) AND long-term consequences.
  • An MDP gives a probability to each outcome of each action, and a reward for each state-action pair.
  • The goal is to find an optimal policy π — a mapping from states to actions that maximises expected discounted reward.
  • Key property: Markov = future depends only on the current state, not the history.

Mini-glossary

SymbolMeaning
Sset of states
Aset of actions
p(s’|s, a)transition probability; Σ_{s’} p(s’|s,a) = 1
r(s, a)immediate reward for doing a in s
π : S → Adeterministic policy (defined for every state)
V^π(s)value of s under π — expected discounted reward
V*(s)optimal value function
γ ∈ [0, 1]discount factor — slides say “between 0 and 1”; γ < 1 needed for convergence in non-terminating MDPs
Markov propertyP(Sₜ₊₁ | Sₜ, Aₜ, history) = P(Sₜ₊₁ | Sₜ, Aₜ)

Full glossary + Grid World exampleMini-glossary

Formulas

Discounted return

E_π[r₁] + γ · E_π[r₂] + γ² · E_π[r₃] + γ³ · E_π[r₄] + …

Bellman equation (Policy Evaluation, fixed π)

V^π(s) = r(s, π(s)) + γ · Σ_{s'} p(s' | s, π(s)) · V^π(s')

Bellman optimality equation (defines V*)

V*(s) = max_a [ r(s, a) + γ · Σ_{s'} p(s' | s, a) · V*(s') ]

Greedy policy extraction

π*(s) = argmax_a [ r(s, a) + γ · Σ_{s'} p(s' | s, a) · V*(s') ]

Common Exam Traps ⚠️

  • γ is NOT in the MDP tuple. The tuple is (S, A, p, r). γ belongs to the evaluation criterion. (Q117/Q119.)
  • γ ∈ [0, 1] inclusive — the slides say “between 0 and 1”. γ = 1 is allowed but values may diverge in non-terminating MDPs (slide 67, “termination problem”). γ < 1 is needed for guaranteed convergence.
  • Markov property = outcome depends only on current state and action, NOT history. Defining property of MDPs.
  • Policy ≠ plan. A policy covers ALL states; a classical plan is a sequence for one initial state.
  • Policy evaluation ≠ policy improvement. Evaluation = compute V^π for fixed π. Improvement = update π given V^π.
  • Optimal policy depends on γ. Same MDP, different γ → different optimum.
  • Transition function is a probability distribution: Σ_{s’} p(s’ | s, a) = 1.

Quick Comparisons

Classical Planning vs. MDP

PropertyClassicalMDP
Action outcomesDeterministicProbabilistic
Objectiveaction sequencepolicy π : S → A
State coveragefrom initial stateevery state
Long-term viewfinite plan(discounted) infinite horizon

Policy Iteration vs. Value Iteration

AspectPolicy IterationValue Iteration
Per-iter workHeavy (full eval inside)Cheap (one Bellman backup)
# outer itersFewMany
Stoppingπ stops changing|V_{k+1} − V_k| < ε

Full pros/cons of all 6 algorithms → Algorithm comparison table — pros & cons

Quiz

Targeted exam questions in Questions for Methods of AI

  • Q40–46 (basic: MDP, policy, value iteration, Bellman) · Q100 (Discount factor γ) · Q117–119 (deep / exam-trap: V^π vs V*, VI vs PI trade-offs, why γ is NOT in the MDP tuple)

Atomic notes

Sibling Lernzettel

See also

Tags: methods-of-ai lernzettel ai-generated
Full reference: Markov Decision Process (MDP)
Quiz: quiz_planning_30-04-26
Superlink: Methods of AI Lecture
Questions hub: Questions for Methods of AI

Created: 30/04/26