The Bellman Equation is the recursive definition of value in a Markov Decision Process. It states that the value of a state equals the immediate reward plus the discounted value of the next state.
Two forms — both appear on the exam, and the difference matters:
Form
Formula
Says
Policy evaluation (for a given π)
Vπ(s)=∑aπ(a∣s)∑s′p(s′∣s,a)[r(s,a)+γVπ(s′)]
Value of state s under policy π
Bellman optimality (for V*)
V∗(s)=maxa∑s′p(s′∣s,a)[r(s,a)+γV∗(s′)]
Value of state s under the optimal policy
The first describes “how good is this state if I follow my current behavior.” The second describes “how good could this state be if I behaved optimally.” Mixing them up is the most common exam mistake on MDPs.
The same equation, simplified for intuition
Strip away the expectation and you get the core idea:
Value(now)=Reward(now)+γ⋅Value(next)
That’s the principle of optimality (Bellman, 1957): the optimal value of any state equals the immediate reward plus the discounted optimal value of where you end up. Recursive. Self-referential. Eerily compact.
Worked numerical example
Tiny MDP with 3 states (A, B, C). Single action available in each. Rewards on transitions:
From A: reward 0, lands in B
From B: reward 1, lands in C
C is terminal (V(C) = 0)
Discount γ = 0.9.
Bellman optimality from C backwards:
V*(C) = 0 (terminal)
V*(B) = 1 + 0.9 · 0 = 1.0
V*(A) = 0 + 0.9 · 1.0 = 0.9
This is exactly dynamic programming — you compute values backwards from the terminal state. Value Iteration and Policy Iteration both rely on this.
Visual: value iteration converges
Watch V(s) converge under repeated Bellman optimality updates. With γ < 1 the sequence is a contraction → guaranteed convergence to V*.
🐍 Code anzeigen / ausblenden
# Pyodide / Obsidian Execute Code: install matplotlib first.# In normal Python (terminal / Jupyter), delete the next 2 lines.import micropipawait micropip.install("matplotlib")import numpy as npimport matplotlib.pyplot as plt# Tiny 4-state MDP (linear chain): 0 → 1 → 2 → 3 (terminal)# Reward 0 for all transitions except entering state 3, which gives reward 10.# Single action available: "move right". Deterministic.n_states = 4rewards = np.array([0, 0, 0, 10]) # reward for entering state s# Value iteration with different discount factorsfig, axes = plt.subplots(1, 2, figsize=(13, 4.5))for gamma, ax_idx in [(0.5, 0), (0.9, 0), (0.99, 0)]: V = np.zeros(n_states) history = [V.copy()] for _ in range(40): V_new = V.copy() for s in range(n_states - 1): V_new[s] = rewards[s + 1] + gamma * V[s + 1] # Bellman optimality V = V_new history.append(V.copy()) history = np.array(history) axes[0].plot(history[:, 0], label=f'γ = {gamma}', lw=2)axes[0].set(xlabel='iteration', ylabel='V(state 0)', title='Value iteration: V(s=0) converges to its fixed point')axes[0].legend(); axes[0].grid(alpha=0.3)axes[0].axhline(10 * 0.99**3, color='gray', ls=':', alpha=0.5)# Show V across all 4 states over iterations for γ=0.9V = np.zeros(n_states); history = [V.copy()]for _ in range(30): V_new = V.copy() for s in range(n_states - 1): V_new[s] = rewards[s + 1] + 0.9 * V[s + 1] V = V_new history.append(V.copy())history = np.array(history)for s in range(n_states): axes[1].plot(history[:, s], label=f'V(s={s})', lw=2)axes[1].set(xlabel='iteration', ylabel='V(s)', title='Value iteration (γ=0.9) — V converges from the goal backwards')axes[1].legend(); axes[1].grid(alpha=0.3)plt.tight_layout(); plt.show()
What to see:
Left panel: with γ = 0.5, V(0) saturates quickly at a small value (myopic). With γ = 0.99, V(0) climbs slowly toward ≈ 9.7 (the reward of 10, discounted across 3 steps).
Right panel: values propagate backwards from the goal. V(3) is 0 (terminal). V(2) updates first to 10. Then V(1) updates to 9. Then V(0) updates to 8.1. This is the contraction property of the Bellman operator made visible.
γ controls how far back rewards reach. With γ = 1.0, rewards reach infinitely far back — but values may not converge in non-terminal MDPs.
Two algorithms that solve it
Algorithm
What it does
When to use
Value Iteration
Repeatedly apply the Bellman optimality update until V converges, then extract greedy policy
Simple, but slow when γ is close to 1
Policy Iteration
Alternate: (a) compute Vπ exactly via the policy-evaluation Bellman equation (linear system), then (b) update policy greedily w.r.t. Vπ
Fewer iterations but each one is more expensive
⚠️ Exam trap: Value Iteration uses Bellman optimality (the max version); Policy Iteration uses Bellman policy evaluation (the no-max version) for the evaluation step, then a greedy step for improvement.
Q-Function form
The Bellman equation has an action-value (Q) variant — this is what most modern RL uses:
Q∗(s,a)=r(s,a)+γmaxa′Q∗(s′,a′)
It says: the value of taking action a in state s equals the immediate reward plus the discounted value of the best action you can take in the resulting state. This is the equation Q-Learning iteratively approximates (see Q-Function).
Why γ matters
γ
Behavior
γ = 0
Pure short-sightedness — only the immediate reward matters
γ = 0.5
Future rewards matter, but decay fast (half-life ≈ 1 step)
γ = 0.9
Standard choice — future matters, but converges over ~10 steps
γ = 0.99
Very long horizon — used in chess, Go, etc.
γ = 1.0
Undiscounted — only valid in episodic / terminal MDPs (else values diverge)
γ is not part of the MDP tuple (S, A, p, r) — it’s part of the objective. The same MDP with different γ has different optimal policies. Important exam trap.
Where the Bellman equation is used today
Q-Learning, SARSA, DQN, all of model-free RL — every RL algorithm in production traces back to Bellman.
AlphaGo / AlphaZero / MuZero — value networks approximate the Bellman value of board positions.
ChatGPT / Claude RLHF — the value head of PPO uses a TD-style Bellman update.
Dynamic programming in algorithm courses — shortest paths (Dijkstra is a special case), edit distance, knapsack with weights → all are Bellman recursions on different state spaces.
Optimal control theory — Hamilton-Jacobi-Bellman PDE is the continuous-time analog. Used in autonomous flight, rocket trajectory optimization.
Operations research — inventory management, dynamic pricing, queueing — all formulated as Bellman recursions.