Bellman Equation

methods-of-ai

The Bellman Equation is the recursive definition of value in a Markov Decision Process. It states that the value of a state equals the immediate reward plus the discounted value of the next state.

Two forms — both appear on the exam, and the difference matters:

Form	Formula	Says
Policy evaluation (for a given π)	$V^{π} (s) = \sum_{a} π (a ∣ s) \sum_{s^{'}} p (s^{'} ∣ s, a) [r (s, a) + γ V^{π} (s^{'})]$	Value of state s under policy π
Bellman optimality (for V*)	$V^{} (s) = max_{a} \sum_{s^{'}} p (s^{'} ∣ s, a) [r (s, a) + γ V^{} (s^{'})]$	Value of state s under the optimal policy

The first describes “how good is this state if I follow my current behavior.” The second describes “how good could this state be if I behaved optimally.” Mixing them up is the most common exam mistake on MDPs.

The same equation, simplified for intuition

Strip away the expectation and you get the core idea:

$Value(now) = Reward(now) + γ \cdot Value(next)$

That’s the principle of optimality (Bellman, 1957): the optimal value of any state equals the immediate reward plus the discounted optimal value of where you end up. Recursive. Self-referential. Eerily compact.

Worked numerical example

Tiny MDP with 3 states (A, B, C). Single action available in each. Rewards on transitions:

From A: reward 0, lands in B
From B: reward 1, lands in C
C is terminal (V(C) = 0)

Discount γ = 0.9.

Bellman optimality from C backwards:

V*(C) = 0 (terminal)
V*(B) = 1 + 0.9 · 0 = 1.0
V*(A) = 0 + 0.9 · 1.0 = 0.9

This is exactly dynamic programming — you compute values backwards from the terminal state. Value Iteration and Policy Iteration both rely on this.

Visual: value iteration converges

Watch V(s) converge under repeated Bellman optimality updates. With γ < 1 the sequence is a contraction → guaranteed convergence to V*.

🐍 Code anzeigen / ausblenden

# Pyodide / Obsidian Execute Code: install matplotlib first.
# In normal Python (terminal / Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")
 
import numpy as np
import matplotlib.pyplot as plt
 
# Tiny 4-state MDP (linear chain): 0 → 1 → 2 → 3 (terminal)
# Reward 0 for all transitions except entering state 3, which gives reward 10.
# Single action available: "move right". Deterministic.
n_states = 4
rewards = np.array([0, 0, 0, 10])                        # reward for entering state s
 
# Value iteration with different discount factors
fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))
 
for gamma, ax_idx in [(0.5, 0), (0.9, 0), (0.99, 0)]:
    V = np.zeros(n_states)
    history = [V.copy()]
    for _ in range(40):
        V_new = V.copy()
        for s in range(n_states - 1):
            V_new[s] = rewards[s + 1] + gamma * V[s + 1]    # Bellman optimality
        V = V_new
        history.append(V.copy())
    history = np.array(history)
    axes[0].plot(history[:, 0], label=f'γ = {gamma}', lw=2)
 
axes[0].set(xlabel='iteration', ylabel='V(state 0)',
            title='Value iteration: V(s=0) converges to its fixed point')
axes[0].legend(); axes[0].grid(alpha=0.3)
axes[0].axhline(10 * 0.99**3, color='gray', ls=':', alpha=0.5)
 
# Show V across all 4 states over iterations for γ=0.9
V = np.zeros(n_states); history = [V.copy()]
for _ in range(30):
    V_new = V.copy()
    for s in range(n_states - 1):
        V_new[s] = rewards[s + 1] + 0.9 * V[s + 1]
    V = V_new
    history.append(V.copy())
history = np.array(history)
for s in range(n_states):
    axes[1].plot(history[:, s], label=f'V(s={s})', lw=2)
axes[1].set(xlabel='iteration', ylabel='V(s)',
            title='Value iteration (γ=0.9) — V converges from the goal backwards')
axes[1].legend(); axes[1].grid(alpha=0.3)
 
plt.tight_layout(); plt.show()

What to see:

Left panel: with γ = 0.5, V(0) saturates quickly at a small value (myopic). With γ = 0.99, V(0) climbs slowly toward ≈ 9.7 (the reward of 10, discounted across 3 steps).
Right panel: values propagate backwards from the goal. V(3) is 0 (terminal). V(2) updates first to 10. Then V(1) updates to 9. Then V(0) updates to 8.1. This is the contraction property of the Bellman operator made visible.
γ controls how far back rewards reach. With γ = 1.0, rewards reach infinitely far back — but values may not converge in non-terminal MDPs.

Two algorithms that solve it

Algorithm	What it does	When to use
Value Iteration	Repeatedly apply the Bellman optimality update until V converges, then extract greedy policy	Simple, but slow when γ is close to 1
Policy Iteration	Alternate: (a) compute V^π exactly via the policy-evaluation Bellman equation (linear system), then (b) update policy greedily w.r.t. V^π	Fewer iterations but each one is more expensive

⚠️ Exam trap: Value Iteration uses Bellman optimality (the max version); Policy Iteration uses Bellman policy evaluation (the no-max version) for the evaluation step, then a greedy step for improvement.

Q-Function form

The Bellman equation has an action-value (Q) variant — this is what most modern RL uses:

$Q^{*} (s, a) = r (s, a) + γ max_{a^{'}} Q^{*} (s^{'}, a^{'})$

It says: the value of taking action a in state s equals the immediate reward plus the discounted value of the best action you can take in the resulting state. This is the equation Q-Learning iteratively approximates (see Q-Function).

Why γ matters

γ	Behavior
γ = 0	Pure short-sightedness — only the immediate reward matters
γ = 0.5	Future rewards matter, but decay fast (half-life ≈ 1 step)
γ = 0.9	Standard choice — future matters, but converges over ~10 steps
γ = 0.99	Very long horizon — used in chess, Go, etc.
γ = 1.0	Undiscounted — only valid in episodic / terminal MDPs (else values diverge)

γ is not part of the MDP tuple (S, A, p, r) — it’s part of the objective. The same MDP with different γ has different optimal policies. Important exam trap.

Where the Bellman equation is used today

Q-Learning, SARSA, DQN, all of model-free RL — every RL algorithm in production traces back to Bellman.
AlphaGo / AlphaZero / MuZero — value networks approximate the Bellman value of board positions.
ChatGPT / Claude RLHF — the value head of PPO uses a TD-style Bellman update.
Dynamic programming in algorithm courses — shortest paths (Dijkstra is a special case), edit distance, knapsack with weights → all are Bellman recursions on different state spaces.
Optimal control theory — Hamilton-Jacobi-Bellman PDE is the continuous-time analog. Used in autonomous flight, rocket trajectory optimization.
Operations research — inventory management, dynamic pricing, queueing — all formulated as Bellman recursions.

Brain Online

Explorer

Bellman Equation

Bellman Equation

The same equation, simplified for intuition

Worked numerical example

Visual: value iteration converges

Two algorithms that solve it

Q-Function form

Why γ matters

Where the Bellman equation is used today

See also

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis