Monte Carlo Tree Search

methods-of-ai ai-generated

Monte Carlo Tree Search (MCTS) is a search algorithm that builds a search tree selectively and statistically — instead of searching exhaustively, it runs many sampled trial-playouts and concentrates the search on the most promising branches. Famous from AlphaGo / AlphaZero (Go, chess) and AlphaProof / AlphaGeometry (IMO silver 2024).

Why this note exists

It’s the [[Monte Carlo Tree Search]] concept from the exam-paper talk (Selection-Inference · pruefung_paper-transformers-search_25-05-26). MCTS is the cleanest example of externalised search — the search happens in an explicit algorithm around the network, not inside a single forward pass. Exactly what Saparov et al. (2024) imply is necessary, since the bare forward pass can’t robustly search.

The core idea

You have a huge decision tree (e.g. every possible move in Go) that you can never fully search. MCTS does not search exhaustively — it grows the tree incrementally, guided by statistics from random simulations, focusing effort where it pays off.

The 4 steps (looped thousands of times)

StepWhat happens
1. SelectionFrom the root, walk down choosing the “best” child each time (balancing known-good vs. unexplored).
2. ExpansionAt the edge of the current tree, add a new node (move).
3. Simulation (rollout)From there, play randomly to the end → a rough outcome (win/loss).
4. BackpropagationPropagate that outcome back up the path → update each node’s stats (visit count, win count).

After many iterations, pick the move with the best statistics.

The key trick: exploration vs. exploitation

Selection uses a formula (UCB1 / UCT) balancing:

  • Exploitation — branches that did well so far.
  • Exploration — branches tried rarely (might be better).

→ It wastes no time on obviously bad branches, but never forgets to test the unexplored. (Same tension as in Simulated Annealing and local search.)

AlphaGo: MCTS for search, neural nets for evaluation

This is the part that matters for the talk — AlphaGo shows the division of labour between search and network:

        ┌─────────────── MCTS (the search) ───────────────┐
position │  Selection → Expansion → Evaluation → Backprop  │ → best move
        └──────────────────┬──────────────────┬──────────┘
                           │                  │
                    Policy net           Value net
              "which moves good?"   "how good is this position?"
                  (neural nets EVALUATE — they don't search)
SystemYearSearches with
AlphaGo2016MCTS + policy net + value net
AlphaGo Zero / AlphaZero2017MCTS + one net, no human games, no rollouts
AlphaProof / AlphaGeometry2024MCTS-style search over proof steps (IMO silver)
  • Policy net → “which moves are promising?” → guides Selection so MCTS needn’t try every move.
  • Value net → “how good is this position?” → replaces the random rollouts with a learned evaluation (AlphaGo’s big innovation over classical MCTS).
  • The net never searches — it only guides and evaluates. The search is MCTS, outside the net.

Why it matters for the exam paper

This is the strongest historical example for the “modern systems externalise search” argument:

AlphaGo’s neural net doesn’t search — it evaluates. The actual search is MCTS around it. That’s exactly Saparov’s point: a single forward pass doesn’t search robustly, so you offload search to an explicit algorithm and let the net supply the heuristic.

So AlphaGo (2016) is “externalised search” years before o1/o3 and the RL reasoners — it confirms the paper’s conclusion from the other direction.

MCTSA*BFS/DFS
Guaranteeprobabilistic, anytime (better with more compute)optimal (admissible h)complete
How it exploressampled rollouts + UCBbest-f(n)=g+hsystematic
Heuristic sourcerandom sim / learned value nethand-designed hnone

⚠️ Unlike A*, MCTS guarantees nothing about optimality — it’s anytime and statistical: more rollouts → better estimate, but no proof. The random rollouts echo stochastic local search; the UCB balance echoes exploration-exploitation.

One-liner for the oral

“MCTS builds a search tree via many sampled playouts, balancing exploration vs. exploitation (UCB). AlphaGo, AlphaZero, and AlphaProof all use it — with neural nets only to evaluate moves/positions, while MCTS does the search. It’s the textbook case of externalised search: the search lives in the algorithm around the net, not in a single forward pass — exactly what Saparov’s result implies you need.”

See also

Created: 01/06/26