Lernzettel: Planning II — MDPs & Probabilistic Planning
Methods of AI — SoSe 2026 · 1-page exam sheet
For more depth: Markov Decision Process (MDP) (full reference: all 6 algorithms with pseudocode, worked Grid World, deep explanations) · quiz_planning_30-04-26 (8 practice questions)
Core Ideas
- Classical planning assumes deterministic actions. MDPs handle uncertainty (stochastic outcomes) AND long-term consequences.
- An MDP gives a probability to each outcome of each action, and a reward for each state-action pair.
- The goal is to find an optimal policy π — a mapping from states to actions that maximises expected discounted reward.
- Key property: Markov = future depends only on the current state, not the history.
Mini-glossary
| Symbol | Meaning |
|---|---|
| S | set of states |
| A | set of actions |
| p(s’|s, a) | transition probability; Σ_{s’} p(s’|s,a) = 1 |
| r(s, a) | immediate reward for doing a in s |
| π : S → A | deterministic policy (defined for every state) |
| V^π(s) | value of s under π — expected discounted reward |
| V*(s) | optimal value function |
| γ ∈ [0, 1] | discount factor — slides say “between 0 and 1”; γ < 1 needed for convergence in non-terminating MDPs |
| Markov property | P(Sₜ₊₁ | Sₜ, Aₜ, history) = P(Sₜ₊₁ | Sₜ, Aₜ) |
⭐ Full glossary + Grid World example → Mini-glossary
Formulas
Discounted return
E_π[r₁] + γ · E_π[r₂] + γ² · E_π[r₃] + γ³ · E_π[r₄] + …
Bellman equation (Policy Evaluation, fixed π)
V^π(s) = r(s, π(s)) + γ · Σ_{s'} p(s' | s, π(s)) · V^π(s')
Bellman optimality equation (defines V*)
V*(s) = max_a [ r(s, a) + γ · Σ_{s'} p(s' | s, a) · V*(s') ]
Greedy policy extraction
π*(s) = argmax_a [ r(s, a) + γ · Σ_{s'} p(s' | s, a) · V*(s') ]
Common Exam Traps ⚠️
- γ is NOT in the MDP tuple. The tuple is (S, A, p, r). γ belongs to the evaluation criterion. (Q117/Q119.)
- γ ∈ [0, 1] inclusive — the slides say “between 0 and 1”. γ = 1 is allowed but values may diverge in non-terminating MDPs (slide 67, “termination problem”). γ < 1 is needed for guaranteed convergence.
- Markov property = outcome depends only on current state and action, NOT history. Defining property of MDPs.
- Policy ≠ plan. A policy covers ALL states; a classical plan is a sequence for one initial state.
- Policy evaluation ≠ policy improvement. Evaluation = compute V^π for fixed π. Improvement = update π given V^π.
- Optimal policy depends on γ. Same MDP, different γ → different optimum.
- Transition function is a probability distribution: Σ_{s’} p(s’ | s, a) = 1.
Quick Comparisons
Classical Planning vs. MDP
| Property | Classical | MDP |
|---|---|---|
| Action outcomes | Deterministic | Probabilistic |
| Objective | action sequence | policy π : S → A |
| State coverage | from initial state | every state |
| Long-term view | finite plan | (discounted) infinite horizon |
Policy Iteration vs. Value Iteration
| Aspect | Policy Iteration | Value Iteration |
|---|---|---|
| Per-iter work | Heavy (full eval inside) | Cheap (one Bellman backup) |
| # outer iters | Few | Many |
| Stopping | π stops changing | |V_{k+1} − V_k| < ε |
Full pros/cons of all 6 algorithms → Algorithm comparison table — pros & cons
Related Q&A & Notes
Quiz
- quiz_planning_30-04-26 — 8 Qs (covers both Planning I + II)
Targeted exam questions in Questions for Methods of AI
- Q40–46 (basic: MDP, policy, value iteration, Bellman) · Q100 (Discount factor γ) · Q117–119 (deep / exam-trap: V^π vs V*, VI vs PI trade-offs, why γ is NOT in the MDP tuple)
Atomic notes
Sibling Lernzettel
- lernzettel_planning-i_30-04-26 — Planning I (Classical / STRIPS / Situation Calculus)
See also
Tags: methods-of-ai lernzettel ai-generated
Full reference: Markov Decision Process (MDP)
Quiz: quiz_planning_30-04-26
Superlink: Methods of AI Lecture
Questions hub: Questions for Methods of AI
Created: 30/04/26