Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is a search algorithm that builds a search tree selectively and statistically — instead of searching exhaustively, it runs many sampled trial-playouts and concentrates the search on the most promising branches. Famous from AlphaGo / AlphaZero (Go, chess) and AlphaProof / AlphaGeometry (IMO silver 2024).
Why this note exists
It’s the
[[Monte Carlo Tree Search]]concept from the exam-paper talk (Selection-Inference · pruefung_paper-transformers-search_25-05-26). MCTS is the cleanest example of externalised search — the search happens in an explicit algorithm around the network, not inside a single forward pass. Exactly what Saparov et al. (2024) imply is necessary, since the bare forward pass can’t robustly search.
The core idea
You have a huge decision tree (e.g. every possible move in Go) that you can never fully search. MCTS does not search exhaustively — it grows the tree incrementally, guided by statistics from random simulations, focusing effort where it pays off.
The 4 steps (looped thousands of times)
| Step | What happens |
|---|---|
| 1. Selection | From the root, walk down choosing the “best” child each time (balancing known-good vs. unexplored). |
| 2. Expansion | At the edge of the current tree, add a new node (move). |
| 3. Simulation (rollout) | From there, play randomly to the end → a rough outcome (win/loss). |
| 4. Backpropagation | Propagate that outcome back up the path → update each node’s stats (visit count, win count). |
After many iterations, pick the move with the best statistics.
The key trick: exploration vs. exploitation
Selection uses a formula (UCB1 / UCT) balancing:
- Exploitation — branches that did well so far.
- Exploration — branches tried rarely (might be better).
→ It wastes no time on obviously bad branches, but never forgets to test the unexplored. (Same tension as in Simulated Annealing and local search.)
AlphaGo: MCTS for search, neural nets for evaluation
This is the part that matters for the talk — AlphaGo shows the division of labour between search and network:
┌─────────────── MCTS (the search) ───────────────┐
position │ Selection → Expansion → Evaluation → Backprop │ → best move
└──────────────────┬──────────────────┬──────────┘
│ │
Policy net Value net
"which moves good?" "how good is this position?"
(neural nets EVALUATE — they don't search)
| System | Year | Searches with |
|---|---|---|
| AlphaGo | 2016 | MCTS + policy net + value net |
| AlphaGo Zero / AlphaZero | 2017 | MCTS + one net, no human games, no rollouts |
| AlphaProof / AlphaGeometry | 2024 | MCTS-style search over proof steps (IMO silver) |
- Policy net → “which moves are promising?” → guides Selection so MCTS needn’t try every move.
- Value net → “how good is this position?” → replaces the random rollouts with a learned evaluation (AlphaGo’s big innovation over classical MCTS).
- The net never searches — it only guides and evaluates. The search is MCTS, outside the net.
Why it matters for the exam paper
This is the strongest historical example for the “modern systems externalise search” argument:
AlphaGo’s neural net doesn’t search — it evaluates. The actual search is MCTS around it. That’s exactly Saparov’s point: a single forward pass doesn’t search robustly, so you offload search to an explicit algorithm and let the net supply the heuristic.
So AlphaGo (2016) is “externalised search” years before o1/o3 and the RL reasoners — it confirms the paper’s conclusion from the other direction.
MCTS vs. the lecture’s classical search
| MCTS | A* | BFS/DFS | |
|---|---|---|---|
| Guarantee | probabilistic, anytime (better with more compute) | optimal (admissible h) | complete |
| How it explores | sampled rollouts + UCB | best-f(n)=g+h | systematic |
| Heuristic source | random sim / learned value net | hand-designed h | none |
⚠️ Unlike A*, MCTS guarantees nothing about optimality — it’s anytime and statistical: more rollouts → better estimate, but no proof. The random rollouts echo stochastic local search; the UCB balance echoes exploration-exploitation.
One-liner for the oral
“MCTS builds a search tree via many sampled playouts, balancing exploration vs. exploitation (UCB). AlphaGo, AlphaZero, and AlphaProof all use it — with neural nets only to evaluate moves/positions, while MCTS does the search. It’s the textbook case of externalised search: the search lives in the algorithm around the net, not in a single forward pass — exactly what Saparov’s result implies you need.”
See also
- Selection-Inference · pruefung_paper-transformers-search_25-05-26 (exam-paper talk — “Larger Context”)
- Search Algorithms · Uninformed Search · Simulated Annealing · Reinforcement Learning (RL)
- Superlink: Methods of AI Lecture
Created: 01/06/26