Bellman Equation

methods-of-ai

The Bellman Equation is the recursive definition of value in a Markov Decision Process. It states that the value of a state equals the immediate reward plus the discounted value of the next state.

Two forms — both appear on the exam, and the difference matters:

FormFormulaSays
Policy evaluation (for a given π)Value of state s under policy π
Bellman optimality (for V*)Value of state s under the optimal policy

The first describes “how good is this state if I follow my current behavior.” The second describes “how good could this state be if I behaved optimally.” Mixing them up is the most common exam mistake on MDPs.

The same equation, simplified for intuition

Strip away the expectation and you get the core idea:

That’s the principle of optimality (Bellman, 1957): the optimal value of any state equals the immediate reward plus the discounted optimal value of where you end up. Recursive. Self-referential. Eerily compact.

Worked numerical example

Tiny MDP with 3 states (A, B, C). Single action available in each. Rewards on transitions:

  • From A: reward 0, lands in B
  • From B: reward 1, lands in C
  • C is terminal (V(C) = 0)

Discount γ = 0.9.

Bellman optimality from C backwards:

  • V*(C) = 0 (terminal)
  • V*(B) = 1 + 0.9 · 0 = 1.0
  • V*(A) = 0 + 0.9 · 1.0 = 0.9

This is exactly dynamic programming — you compute values backwards from the terminal state. Value Iteration and Policy Iteration both rely on this.

Visual: value iteration converges

Watch V(s) converge under repeated Bellman optimality updates. With γ < 1 the sequence is a contraction → guaranteed convergence to V*.

What to see:

  • Left panel: with γ = 0.5, V(0) saturates quickly at a small value (myopic). With γ = 0.99, V(0) climbs slowly toward ≈ 9.7 (the reward of 10, discounted across 3 steps).
  • Right panel: values propagate backwards from the goal. V(3) is 0 (terminal). V(2) updates first to 10. Then V(1) updates to 9. Then V(0) updates to 8.1. This is the contraction property of the Bellman operator made visible.
  • γ controls how far back rewards reach. With γ = 1.0, rewards reach infinitely far back — but values may not converge in non-terminal MDPs.

Two algorithms that solve it

AlgorithmWhat it doesWhen to use
Value IterationRepeatedly apply the Bellman optimality update until V converges, then extract greedy policySimple, but slow when γ is close to 1
Policy IterationAlternate: (a) compute Vπ exactly via the policy-evaluation Bellman equation (linear system), then (b) update policy greedily w.r.t. VπFewer iterations but each one is more expensive

⚠️ Exam trap: Value Iteration uses Bellman optimality (the max version); Policy Iteration uses Bellman policy evaluation (the no-max version) for the evaluation step, then a greedy step for improvement.

Q-Function form

The Bellman equation has an action-value (Q) variant — this is what most modern RL uses:

It says: the value of taking action a in state s equals the immediate reward plus the discounted value of the best action you can take in the resulting state. This is the equation Q-Learning iteratively approximates (see Q-Function).

Why γ matters

γBehavior
γ = 0Pure short-sightedness — only the immediate reward matters
γ = 0.5Future rewards matter, but decay fast (half-life ≈ 1 step)
γ = 0.9Standard choice — future matters, but converges over ~10 steps
γ = 0.99Very long horizon — used in chess, Go, etc.
γ = 1.0Undiscounted — only valid in episodic / terminal MDPs (else values diverge)

γ is not part of the MDP tuple (S, A, p, r) — it’s part of the objective. The same MDP with different γ has different optimal policies. Important exam trap.

Where the Bellman equation is used today

  • Q-Learning, SARSA, DQN, all of model-free RL — every RL algorithm in production traces back to Bellman.
  • AlphaGo / AlphaZero / MuZero — value networks approximate the Bellman value of board positions.
  • ChatGPT / Claude RLHF — the value head of PPO uses a TD-style Bellman update.
  • Dynamic programming in algorithm courses — shortest paths (Dijkstra is a special case), edit distance, knapsack with weights → all are Bellman recursions on different state spaces.
  • Optimal control theory — Hamilton-Jacobi-Bellman PDE is the continuous-time analog. Used in autonomous flight, rocket trajectory optimization.
  • Operations research — inventory management, dynamic pricing, queueing — all formulated as Bellman recursions.

See also

Tags: methods-of-ai planning mdp reinforcement-learning bellman
Created: 18-05-26