ai-generated methods-of-ai exam-prep
Methods of AI — Module Exam: Mon 1 June 2026, 10:00, Wachsbleiche (room 50/315), examiner Kai-Uwe Kühnberger.
Format: briefly present the paper → critically evaluate it → place it in a larger context (10–15 min), then a discussion about the paper + AI in general.
Expected: depth in the agreed focus topics (Local) Search · CSP · Machine Learning + broad basic AI knowledge.
Allowed: bring anything (so print this dossier!). No slides required.
Paper: Saparov, Pawar, Pimpalgaonkar, Joshi, Pang, Padmakumar, Kazemi, Kim, He (2024). Transformers Struggle to Learn to Search. arXiv:2412.04703.
🎯 Test yourself first → quiz_paper-transformers-search_28-05-26
15 hard questions on this paper (why / derive / defend) + an oral-exam-style block in Kühnberger’s voice (rapid-fire + open discussion). Do the quiz before re-reading the dossier — retrieve, then fill gaps.
What the ICLR 2025 reviewers said → reviews_paper-transformers-search_28-05-26
Accept (Poster), scores 8/8/6/5. The reviewer criticisms (encoder-vs-decoder, toy architecture, data leakage, method doesn’t scale) independently corroborate the critiques below — and the rebuttal gives you the counter-arguments. Great for the “what did the community think?” discussion.
Key reference papers → references_paper-transformers-search_28-05-26
Curated reading list (prioritized). #1 Merrill & Sabharwal — Expressive Power of Transformers with CoT is fully worked: it’s the source for the “learnability ≠ expressivity” critique (poly CoT steps = class P ⊇ graph connectivity).
Companion notes
Methods of AI Lecture · Quiz Exam Schwerpunkte · Transformers · Self-Attention · Search Algorithms · Study Companion book chapters 14 + 15 (the full deep dive).
This dossier has four parts: A = the understanding checklist + full anatomy of the paper (study this first), B = how to deliver the 10–15 min, C = critical evaluation, D = context + bridges + Q&A bank. Tick a checklist box only once you can explain that point out loud without notes.
PART A — Understand the paper
✅ Understanding checklist
- 1. Experimental setup — why DAG path-finding is a testbed for reasoning/planning; what lookahead (search depth) means and how it’s defined.
- 2. Training distributions — why Naïve & Star let the model cheat with heuristics; how/why the Balanced distribution forces real search.
- 3. Mechanistic interpretability — how activation patching + token/position perturbations reveal why an attention head fires; how the computation graph is reconstructed.
- 4. The learned algorithm — Exponential Path-Merging — all vertices searched simultaneously; reachable sets union per layer → exponential-in-layers.
- 5. The scaling problem — why bigger graphs break learning, and why bigger models / more FLOPs do not fix it (scaling laws fail).
- 6. Chain-of-Thought failure — DFS & selection-inference help on small graphs (fewer layers) but still fail on large graphs, regardless of model size.
- 7. Transfer to real logic (proof search) — graph search = proof search in propositional logic → same scaling failure shows up in natural language → relevance to LLMs.
1️⃣ Experimental setup: graph search as the testbed
The task (one step, not the whole path). You get a directed acyclic graph (DAG), a start vertex and a goal vertex (guaranteed reachable). The model must output the next vertex on a path from start to goal — a single move.
Why this is not a toy. Graph connectivity is equivalent to proof search in a fragment of propositional logic:
| Graph | Logic |
|---|---|
| vertex | proposition (“Alice is a mammal”) |
edge u→v | implication “if u then v” |
| start vertex | given premise |
| goal vertex | what to prove |
| next vertex | next step of the proof |
So this is a lower bound on competence: if a model can’t do this, it has no hope on richer reasoning that contains search as a sub-problem. That’s why a negative result here is alarming.
How a graph becomes tokens. Three special tokens flatten the DAG into a sequence:
E u v— an Edge fromutov(the whole graph is a list of these).Q s g— the Query: starts, goalg.P s— the Path so far (here just the start); model predicts the next token.
Reading Figure 1
E 4 1 E 8 3 E 3 6 E 8 4 E 2 3 Q 8 6 P 8→ Label: 3.
Edges:8→4, 4→1, 8→3, 3→6, 2→3; start = 8, goal = 6. Only path is 8 → 3 → 6, so the answer is3. The distractors8→4→1(dead end) and2→3mean you must look ahead — picking “any edge from 8” gives4and fails. ⚠️ Edges appear in random order — nothing flagsE 8 3as the useful one.
Lookahead L — the difficulty knob. How many steps you must search forward before safely committing to the correct first move. Formally:
where P is the path start→goal and the S_i are the other branches leaving the start that are otherwise disjoint from P.
Intuition for the
minYou only need to look as far as the shorter of (i) the true path and (ii) the longest distractor branch you must rule out. A long true path with no distractors is easy (you can’t go wrong); deep distractors force deep search. In Figure 1, L = 2.
The model — deliberately tiny and strange (an examiner may ask you to justify each choice):
- Tiny: 6 layers, hidden dim 16, one attention head per layer, ReLU. (Scaling chapter pushes to 8 layers / 60M params.)
- One-hot, concatenated positional encodings (not added) → “which token” and “which position” sit in disjoint, individually readable coordinates → makes mechanistic analysis tractable.
- No causal mask → edges are unordered, a relevant edge can sit anywhere, so every token attends to every other.
- Streaming training (“limitless data”) → batches sampled fresh from the generator; test set filtered out by exact match → rules out “too little data” as the cause.
- Sophia optimiser, lr 1e-5, weight decay 0.1, no dropout, batch 1024, cross-entropy. Multiple-path graphs: one path is the random label in training; in eval a prediction is correct if it lies on any valid path.
🏗️ Architecture FAQ — 1 head, layer count, and what flows through the layers
Q: One attention head per layer — how is that comparable to GPT-2 / a normal transformer?
- Same family, not same scale. It’s a genuine GPT-2-style stack (attention + MLP + residual + layer-norm, next-token cross-entropy training) — but a minimal instance. GPT-2-small = 12 layers × 12 heads × d=768; this base model = 6 layers × 1 head × d=16 (~3–4 orders of magnitude smaller).
- Why 1 head? Multi-head attention runs several attention operations in parallel per layer (each head in its own subspace) — great for capacity, terrible for interpretation, because a layer’s output is then a superposition of many operations. With one head, each layer performs exactly one attention-operation pattern, so the reconstructed circuit (§3) is readable. They trade width (heads) for depth (layers): what multi-head would do in parallel within a layer, this model must do across layers → that’s why the computation comes out log-depth and the per-layer “one set-union” picture is so clean.
- ⚠️ Comparability caveat (exam-relevant): it is not capacity-comparable to real LLMs; the claim is about architecture + training mechanism, not scale. The appendix robustness check uses a more GPT-like decoder-only + rotary model and finds the same scaling behaviour — that’s what licenses (limited) transfer of the conclusions. (See the ICLR reviewers’ “encoder vs decoder” fight in reviews_paper-transformers-search_28-05-26 — they argued that no causal mask ⇒ effectively encoder-only, not the decoder-only LLMs we care about.)
Q: How many layers?
- Base interpretable model: 6 layers (dim 16, 1 head).
- “Bigger graphs” scaling experiment (Fig. 6): 8 layers (fixed, while graph size grows 8→50).
- Chain-of-thought variants: DFS = 3 layers, selection-inference = 4 layers (constant depth — §6).
Q: Why 8 layers?
- Single-pass path-merging needs only ~log₂(L) layers to reach lookahead L (the reachable set doubles per layer). For graphs ≤ ~50 vertices the lookahead is far below 2⁸ = 256, so 8 layers is comfortably more depth than the algorithm needs.
- That’s the point: with surplus depth, representational capacity is not the bottleneck. So when the model still fails on bigger graphs, the failure is about trainability, not “too few layers to represent search.” (It also explains the non-maximal merging in §4 — with more layers than needed, the model feels no pressure to double maximally.)
Q: What does “hidden dimension 16” mean — and for computation?
- Hidden dimension (= d_model = model width) is the length of the vector representing each token as it flows through the network (the “residual stream”). Dim 16 ⇒ every token is a 16-number vector at every layer. GPT-2-small uses 768; here 16 — minuscule. (Here the feed-forward dim is also set equal to d_model, vs GPT-2’s 4×.)
- Parameter count scales with d². Weight matrices are ~d×d, so params-per-layer scale as d²: 16² = 256 vs 768² ≈ 590k. That’s why 6 layers × dim 16 is on the order of ~10⁴ params; the scaling experiments reach 60M only by cranking d (and layers) up.
- Compute/FLOPs scale with d (projections ~d², attention ~n²·d). Tiny d → cheap forward passes → they can afford the thousands of perturbed passes the interpretability method needs (the L·n²·m·F cost).
- The key meaning — representational bandwidth: d = how much information one token can hold at once. d=16 is a deliberately tight bottleneck. Intuition: think of it as a small budget of ~16 feature-slots — roughly a bounded “reachable-set bitmask,” matched to the small interpretable regime (input length 128, lookahead ≤ ~20). It’s just enough to store the reachable sets the task needs, and narrow enough that the learned features stay readable.
- Ties to the scaling result: “does more hidden dimension help?” is literally tested — the param-scaling sweep grows d up to 60M-param models. Finding: bigger width reaches the shortcut faster but does not reliably find the true search solution. So width (like depth and params) isn’t the bottleneck — trainability is.
Q: Why no causal mask — isn’t masking essential to transformers?
- Masking is essential for decoder-only LLMs (GPT) because their whole objective is next-token prediction over the entire sequence: each position must not see the token it’s predicting → causal mask. (Encoders like BERT use no mask — they read, not autoregressively generate.)
- Here the task is read a graph (unordered edges) → output one vertex, not autoregressive generation. So every token should see every edge → no mask. ⚠️ That makes it functionally an encoder, not the decoder-only LLMs the conclusions are about — exactly reviewer fM1a’s objection (reviews_paper-transformers-search_28-05-26).
Q: What is being computed throughout the layers? (layer-by-layer)
- Embeddings: each token = 1-hot token value ⊕ 1-hot position; a vertex token initially “knows” only itself.
- Each layer = one set-union along edges: a vertex’s embedding accumulates the set of vertices reachable from it, and each layer unions in the reachable set of a vertex on its current frontier → the set can double per layer. After ℓ layers a vertex knows everything reachable within ~2^ℓ steps. This happens for all vertices in parallel.
- (Fig. 4 chain): vertex 1 knows {2} → {…,4} → {…,8}, reaching goal 9 by layer 5.
- = repeated squaring of the reachability relation = transitive closure in log depth.
- Final layer: the output position reads off which next vertex lies on a start→goal path and predicts it.
- Caveat: in practice the merges are non-maximal, and ~92–99 % of examples are explained by this circuit (§4).
2️⃣ Training distributions: heuristics vs. real search
The methodological heart: how you sample the graphs determines what the model learns.
- Naïve (Erdős–Rényi DAG). Vertices in topological order, edges sampled left-to-right (guarantees acyclicity), in-degree capped at 4. Catch: such random graphs almost always have tiny lookahead (L = 1 or 2), and the probability of large L decays exponentially. → The model scores well using shortcuts/heuristics and never learns to search → fails when real search is finally needed.
- Star. Star-shaped graphs,
kspokes (each a chain of lengthL), start at the centre, goal at the end of a spoke. Better, but still exploitable by shortcuts. - Balanced. Lookahead
Lmade uniform across the range, shortcuts deliberately engineered out. Only here does the transformer learn to search almost perfectly.
The single sentence to remember
Whether a transformer learns to search depends dramatically on the training distribution, not just the model. If the data permit shortcuts, the model learns shortcuts; only a balanced, shortcut-free distribution forces a genuine, generalisable search procedure.
Result 1 — the distribution decides everything (Fig. 2). One model per distribution, evaluated across all lookaheads:
- Naïve-trained: near-perfect at small L, collapses to ~0.26 by L=7 — and this is on lookaheads it nominally saw, just too rarely (exponential decay) for gradient descent to bother learning them.
- Star-trained: okay on balanced test graphs, transfers poorly to naïve.
- Balanced-trained: near-perfect everywhere and generalises to unobserved lookaheads — a model trained on L≤12 still handled L=13, 14 (but not larger → the generalisation ceiling, explained in §4).
Subtle point examiners love
“Naïve fails because it never saw L=7” is wrong. With limitless streaming data it did see L=7 — just so rarely (relative to the flood of L=1,2) that minimising the loss never required getting it right. Balanced makes hard examples as “loud” as easy ones.
Two different "lookahead figures" — don't confuse them
- Fig. 2 = accuracy per lookahead (the heatmap above). y-axis is test accuracy of the trained model; this is where naïve collapses to ~0.26.
- Fig. 8 = the probability of generating each lookahead (the histogram in the appendix). y-axis is example frequency on a log scale (10⁻¹ → 10⁻⁵), x-axis lookahead 0–20, from 10M sampled naïve graphs (n=41 vertices). Because the bars fall roughly linearly on a log axis, it’s the direct visual proof of the exponential decay — lookaheads > 6 are, in the paper’s words, “possible, but astronomically unlikely.” This figure is the cause; Fig. 2’s naïve collapse is the effect.
3️⃣ Mechanistic interpretability: opening the black box
When the model does learn (near-perfect on balanced data), the authors ask: what algorithm did it learn — genuine search or a pile of heuristics? Their answer needs a new, hypothesis-free technique that reconstructs the computation graph from activations + attention via activation patching. It does not assume the algorithm in advance — it discovers it. (Arguably the paper’s most durable contribution.)
Vocabulary first: an attention operation = one cell of an attention matrix — “token j (the target/query row) attends to token i (the source/key column) in layer l.” Information flows i → j.
The 5 steps (their Fig. 3 collapses these into 3 coloured stages):
- I. Forward pass. Record activations, attention weights, output logits. This is the baseline; nothing perturbed yet.
- II. Important ops in the last layer. For each attention weight feeding the last token, perturb it — set to 0, and separately to the row’s max + renormalise — and recompute. If the predicted-vertex logit drops by more than threshold
α, mark the op important. (Perturb up too: a token sometimes attends strongly to everything except one token, and that suppressed low-weight connection carries the real signal.) - III. Important ops in all earlier layers. Perturb each earlier-layer weight; keep it if it shifts an already-important last-layer op’s attention (by > √d/κ₁) or drops the output logit (by > α). The first criterion chains early ops to late ones → importance propagates backward from the output.
- IV. Explain each op (the Q·K trick). Attention weight rises with the dot product
Q_j·K_i, whereQ_jcomes from targetj’s embedding andK_ifrom sourcei’s. So if you change an input feature and the dot product swings, that feature must live in the vector you disturbed. Two probe types:- Token perturbation: swap a vertex ID
vfor a freshv'→ “does this depend on a vertex’s identity?” - Position perturbation: zero a positional embedding → “does it depend on where the token sits?”
- Attribute to source vs. target via Eq. (1):
Q̃_j K_iᵀ(perturb the query → feature lives in targetj) vs.Q_j K̃_iᵀ(perturb the key → feature lives in sourcei). A change beyond √d/κ₂ in the correct direction counts. (Freeze earlier layers so the measured change is from this op, not upstream chaos.)
- Token perturbation: swap a vertex ID
- V. Reconstruct the circuit. Bookkeeping notion “explainably contains” = “we have a verified story for how this embedding came to hold this info.” Base case: each input token explainably contains its own value + position. Inductive step: an op copying
i→jthat needs source featuresfˢand target featuresfᵀis explainable only ifialready contains allfˢandjallfᵀ; thenjnow explainably contains the unionfˢ ∪ fᵀ. Propagate layer by layer, then prune ops with no path to the output. What survives = a computation tree.
Why this is the discovery of path-merging
The features being tracked are vertices. So each node’s feature set is “the set of vertices it knows about,” and each explainable op is a set-union along an edge. Step V’s bookkeeping is literally the path-merging algorithm — the authors didn’t assume it; the reconstructed circuits simply turned out to be reachable-set unions that double per layer.
Whodunit analogy
A crime spreads through a phone network. Step III: find every call that swayed the verdict (mute/amplify each, see if the outcome shifts). Step IV: for each, work out why it happened — did the caller act on a name (token feature) or on when they were in the room (position feature)? Step V: draw the family tree of who-told-whom, keep only calls tracing to the verdict. The tree shows a rumour (reachability) spreading and merging = path-merging.
Cost: ~L·n²·m·F forward passes (layers · input size · examples · perturbed features) — feasible only because the models are tiny. (This is itself a critique — see Part C.)
How a single attention head works (be able to say this cleanly)
Each token vector is projected into Query (“what am I looking for?”), Key (“what do I offer?”), Value (“what I pass on”). (1) Score:
Q_i · K_jfor every j. (2) Scale + softmax:α_ij = softmax_j(Q_i·K_j / √d). (3) Weighted sum of values:out_i = Σ_j α_ij · V_j. (4) All tokens in parallel. One-liner: each token broadcasts a question (Q), every token advertises a label (K), dot-products match them, and the matches decide whose content (V) gets copied into whom. Multi-head = several in parallel, concatenated — but this paper uses 1 head/layer so the circuit is readable.
4️⃣ The learned algorithm: Exponential Path-Merging
Applying the method to near-perfect balanced models reveals exponential path-merging (Fig. 4):
- Each vertex token stores, in its embedding, the set of vertices reachable from it (within some number of steps). (Or run backwards: the set of vertices from which it is reachable.)
- At each layer, attention unions a vertex’s reachable set with that of a frontier vertex (copying from the edge of the current reachable set).
- Therefore the reachable set can double every layer → the model searches over a number of vertices exponential in the number of layers.
The key mental image
This is repeated squaring of the reachability relation = computing the transitive closure in a logarithmic number of layers, done for all vertices in parallel. It matches the theory (Sanford et al.): graph connectivity needs only ~log-many layers in the graph size. In Fig. 4, vertex 1 reaches distance 1→2→4→8, hitting the goal by layer 5 — the doubling comes from unioning two equal-size reachable sets, not a prefix growing on its own.
How well does it hold? Reconstructing 2000 held-out circuits (α=0.4, κ₁=20, κ₂=10): path-merging explains ~0.92–0.99 of balanced-model examples but only ~0.05 in a random/untrained control → the method is highly specific (detecting something real, not noise).
⚠️ The model does NOT learn the maximal algorithm
A path-merge is “maximal” if it merges the largest reachable sets available (true doubling). The measured proportion of maximal merges is well below 1 — the network often merges smaller sets than it could. It has no incentive to be maximal: it has more layers than it needs, and the trained lookaheads don’t line up with powers of 2. This non-maximality is exactly why the L≤12 model generalises to L=13,14 but then stops — sloppy merging buys only a little headroom.
5️⃣ The scaling problem: bigger is NOT better
Two experiments, both on the balanced distribution, both reporting the minimum test loss over many seeds (seed variance is large), tested on held-out naïve graphs as a stress test.
(a) Bigger graphs, fixed model (8 layers, hidden 16, 236M examples, 14 seeds; Fig. 6): as max graph size grows 8 → 50 vertices, (i) the probability a seed even converges (train acc ≥ 0.995) becomes vanishingly small, and (ii) minimum test loss grows at an increasing rate. Larger graphs get harder faster.
(b) Bigger models, fixed graph (31 vertices, 0.9M → 60.4M params; Fig. 7): bigger models reach the shortcut local minimum faster, but there is no discernible relationship between model size and the training needed to reach the global minimum (the genuine search solution). Bigger is faster at being wrong, not more likely to be right.
Robustness check: decoder-only models with learned token embeddings + rotary positional embeddings (multiplied, the modern default) — scaling behaviour unchanged. Pre-empts “it’s your weird architecture.”
The result the title leans on
Within the tested regime, adding parameters does not buy robust search. It’s a statement about compute and trainability, not just parameter count — stronger than “the model is too small,” and also the most contestable move when extrapolated to frontier-scale models.
FLOPs
FLOP = one floating-point op (multiply/add); the honest unit of compute. One forward pass ≈
2 × (#params) × (#tokens). The scaling result is really a FLOP claim: throwing more compute at it still doesn’t give robust search → not just a parameter problem, a compute problem.
6️⃣ Chain-of-Thought failure: the illusion of intermediate steps
Real LLMs “think out loud,” so the authors let the model emit intermediate tokens (CoT-style) two ways:
Depth-first search (DFS). Given the graph + the DFS trace up to a “current” vertex, predict the next vertex in the trace — classical DFS produced token by token. Training graphs sampled so backtrack distance is uniform (the DFS analogue of balanced). Result: with just 3 layers the model learns the task across all tested graph sizes → DFS needs only a constant number of layers (vs. log-depth single-pass path-merging). But it still struggles as the graph grows, scaling the model doesn’t help, and larger models need far more FLOPs — so the “learns from fewer examples” benefit evaporates once compute is counted.
Selection-inference (Creswell et al.). Each step split into (1) select a visited vertex with an unvisited child, (2) infer such a child. Graphs sampled so frontier size F and branch count B are jointly uniform. With 4 layers, graphs ≤ 45 vertices: same pattern — struggles as graphs grow, scaling doesn’t rescue it.
What CoT does and doesn't do
CoT changes the depth requirement — DFS/selection-inference become learnable with a constant number of layers instead of log-many. But it does not remove the wall: trainability still degrades with graph size, and scale still doesn’t fix it. “Think step by step” helps the representation, not the learning-to-search-at-scale problem.
7️⃣ Transfer to real logic (proof search) — why this matters for LLMs
A natural objection: symbolic E u v tokens aren’t language. So the authors rerun in natural language: each edge becomes a sentence — “If Alex is a wumpus, then Alex is a vumpus” — every word/punctuation a token, and the task becomes “given a start proposition, predict the next step of the proof.” This is a strict one-to-one re-encoding of the graph problem in implicational propositional logic.
Result: training behaviour is qualitatively the same — the model still struggles more as the graph grows, especially in FLOPs (each edge is now many tokens → longer sequences → quadratic attention cost). This rules out “it’s an artefact of the symbolic format” and ties the result directly to language-model reasoning.
⚠️ Don't overstate this
The paper did not test real pretrained LLMs on language. It trained small transformers from scratch on a synthetic, logic-shaped language. The NL experiment shows the format isn’t the culprit — it says nothing about frontier models. (Keep this honest; it’s a Tier-2 critique point.)
8️⃣ Three follow-up questions I had to get straight
Why 3 layers for DFS but 4 for selection-inference?
The real contrast is constant vs logarithmic, not “3 vs 4”. In single-pass search the model must find the whole path in one forward pass → exponential path-merging needs ~log₂(L) layers, a number that grows with graph size. In the CoT variants the model takes one step per forward pass (you call it repeatedly), and one step is a local, fixed-size computation → a constant number of layers suffices regardless of graph size. That is why CoT is “easier to teach.”
The 3 vs 4 then just tracks how many sub-operations sit in one step:
- DFS predicts one thing per call (next vertex: advance or backtrack) → 1 sub-task → 3 layers.
- Selection-inference decomposes each step into two chained sub-tasks (select a frontier vertex, then infer a child) → one extra composition → 4 layers.
⚠️ Honest caveat (say this if pushed): 3 and 4 are the layer counts the authors fixed in the experiments (“we set the number of layers to 3 / to 4”), not derived tight bounds. The proven claim is only “a constant number suffices.” Don’t present 3/4 as a theorem.
What is selection-inference? (the part that didn't click)
A reasoning format (Creswell et al. 2023) that forces a strict alternation of two moves instead of searching silently inside one forward pass: select → infer → select → infer → …
- Selection = “out of everything I know, point at the relevant piece.” Original (logic): pick the relevant premises. Graph version: pick a frontier vertex (a visited vertex that still has an unexplored child).
- Inference = “from just that piece, derive one new fact.” Logic: derive one conclusion. Graph: emit one new child vertex, adding it to the visited set.
Loop it: each inference grows the visited set by one; repeat enough and you hit the goal. It’s basically frontier expansion (BFS-style) made fully explicit and interpretable — every “pick a node, expand a node” decision becomes a visible, separate step (the original selling point: you can read off which facts were selected and what was inferred). Mechanically it’s two tasks, so visited edges (not just vertices) get encoded so the model can tell selection-context from inference-context.
Is Figure 4 just the authors' guess about how the model computes?
It starts as a hypothesis, then gets tested — so it’s a verified hypothesis, not speculation. The caption hedges explicitly: “We hypothesize that transformers learn this algorithm… We posit the model performs this computation for all vertices simultaneously.” So Fig. 4 is an idealized schematic. The real work is checking it: the mechanistic-interpretability method reconstructs the computation graph of attention ops on 2000 held-out inputs and asks whether each is “explained by path-merging.” Fig. 5 (top) = identified in trained models, not in a random control → “highly specific.”
But the learned circuit is messier than the clean picture (be honest about this — it’s Tier-3 critique gold):
- merges are often non-maximal (Fig. 5 bottom) → it does not cleanly double every layer;
- direction is ambiguous (stores “reachable from me” or “I’m reachable from”) — the appendix’s 2nd model showed both at once.
Why “doubling”? The per-layer op is a set-union along the frontier. A vertex storing everything reachable within d steps unions with a frontier vertex that also covers d steps → the union now covers 2d steps. Distance: 1→2→4→8→… → ~2ᴸ after L layers = repeated squaring / transitive closure in log depth. The paper’s word “theoretically double” is doing real work: it’s the upper bound on growth, which the non-maximal merges don’t actually achieve. (Fig. 4: vertex 1 reaches the goal by composing “3 reachable from 1” + “5 reachable from 3” → “5 reachable from 1”.)
PART B — Deliver the 10–15 min
⏱️ Time plan
| Block | Time | Content |
|---|---|---|
| 1. Present | ~5 min | Research question → setup → learned algorithm → results (the substance is Part A §1–§7) |
| 2. Critically evaluate | ~4 min | Strengths + weaknesses (Part C) — this is where you score |
| 3. Contextualize | ~4 min | LLM-reasoning debate · scaling/emergence · neuro-symbolic AI · classical search (Part D) |
Rule of thumb: a few points made cleanly beats listing everything. Deliberately leave hooks open at the end so the discussion steers into your focus topics.
🎤 The 5-minute spoken script
“The paper is Transformers Struggle to Learn to Search, Saparov et al. 2024. It’s about a very basic ability: search — which underlies reasoning, planning and navigation, and which LLMs are empirically bad at. The question is why: too little data, too few parameters, or a fundamental limit of the transformer architecture?
To test this cleanly they pick a minimal testbed: graph connectivity. A DAG, a start and a goal vertex, and the model outputs the next vertex on the path. It looks simple but it’s the backbone of reasoning — it’s equivalent to proof search in logic, where vertices are facts and edges are implications. If a model can’t do this, it can’t do anything more complex.
Difficulty is the lookahead — how far you must look forward to commit to the right first move. And because examples are generated procedurally, they have limitless data, which rules out ‘too little data’ from the start.
The methodological core is the training distribution. A naïve random distribution almost only produces tiny lookaheads, so the model learns shortcuts instead of real search, and fails. Only with a balanced distribution — lookahead uniform, shortcuts removed — does the transformer learn search almost perfectly. First message: whether a transformer learns search depends heavily on the data.
Then the real contribution: with a new mechanistic-interpretability method — activation patching, reconstructing the computation graph without assuming the algorithm — they find what it learned: exponential path-merging. Each vertex stores the set of vertices reachable from it; each layer unions these sets, so the reachable set doubles per layer → search exponential in the number of layers — essentially transitive closure in log-depth.
The three findings: one, bigger graphs make learning harder, and more parameters don’t help. Two, chain-of-thought (DFS, selection-inference) makes it learnable with constant layers but still fails on large graphs. Three, the conclusion: scaling alone won’t give robust search — you need different training (curriculum) or architectures (looped transformers).”
60-second emergency version
“Small transformers can learn graph search almost perfectly — but only with an artificially balanced training distribution that strips out shortcuts. Mechanistically they learn an exponential path-merging algorithm (each vertex accumulates reachable vertices, doubling per layer). As graphs get larger this collapses — and neither more parameters nor chain-of-thought fixes it. Conclusion: robust search is not merely a scaling problem.”
PART C — Critically evaluate
Being “critical” here is not trashing the paper. It’s showing you see the gap between what it claims and what it demonstrates — fairly, calibrated, with a rebuttal ready when he pushes back.
The meta-formula for every critique
“The paper claims X. The evidence supports the narrower X′. The gap is Y. This matters because Z. To close it you’d need W.”
The distinction your whole critique rests on: expressivity ≠ learnability
Two different questions hide behind the word “can”:
- Expressivity (representability): does there exist a weight setting that computes the function? A property of the architecture / hypothesis class — is the solution somewhere in reach?
- Learnability (trainability): will the actual training procedure (SGD, finite data from a distribution) find those weights? A property of architecture + optimizer + data together.
A model can be fully expressive for a task yet not learnable — the right weights exist, but gradient descent never reaches them (shortcut/local optima, huge seed variance). Analogy: a combination lock can represent every code (expressive); that doesn’t mean you can find the right one by turning dials (learnable).
Applied here: the title “Transformers Struggle to Learn to Search” + “fundamental limitation of the architecture” invites an expressivity reading. But Merrill & Sabharwal prove transformers + CoT can represent search (poly steps → class P ⊇ connectivity), and the balanced-distribution positive result shows the same architecture learns search with the right data. So what Saparov demonstrates is a learnability/optimization failure, not a representational limit. (See references_paper-transformers-search_28-05-26 #1.)
⚔️ Rebuttal you’ll get: “But more parameters don’t help (Fig. 7) — isn’t that architectural?” → Counter: that’s still a learnability statement (training fails in the tested tiny regime); it says nothing about representability and doesn’t reach frontier scale. Capacity existing ≠ optimization finding it.
Why it matters: the two diagnoses imply opposite fixes — expressivity limit → change the architecture; learnability limit → better training / data / curriculum / inductive bias (exactly Saparov’s own suggestions). It also maps onto the training-as-local-search reading (§Part D): expressivity = “a global optimum encoding search exists in the landscape”; learnability = “can hill-climbing reach it?” — often no.
🥇 Tier 1 — strong, defensible (lead with these)
1. Learnability ≠ expressivity — the title overreaches. The title implies an architectural (representational) limit; what’s shown is a training/optimization difficulty (high seed variance, distribution sensitivity, failed convergence). They even cite Merrill & Sabharwal (2024): CoT-transformers are Turing-complete. → Honest claim is narrower: “hard to train, with these tiny models, this optimizer, this setup.” ⚔️ Rebuttal ready: “But more parameters don’t help (Fig. 7).” → That refutes trainability at scale in the tested tiny regime, not representability, and doesn’t reach frontier scale.
🎯 Sharper framing: the title overreaches, but the text is calibrated — they hedge explicitly (“scaling to much larger sizes may lead to emergent searching ability”). Saying this shows you read closely and are being fair, not contrarian.
2. Extrapolation gap: toy models → frontier LLMs. Hidden dim 16, ≤60M params, 6–8 layers — 3–4 orders of magnitude below GPT-class. The sweeping “future LLMs won’t learn search via scaling” is underdetermined by toy-model evidence. The mechanistic science is solid at the scale studied; the sweeping claim is not.
3. Internal tension: data vs. architecture. The positive result depends on the hand-crafted balanced distribution (no shortcuts) — which doesn’t exist in the real world. But that implies the failure is about data/shortcuts, not architecture. You can’t say “it’s the distribution” and “it’s an architectural limit.” The paper’s own positive result is its strongest counter-evidence to the strong reading of its title.
🥈 Tier 2 — good supporting critiques
4. Interpretability-friendly architecture ≠ a real transformer. 1-hot, concatenated PE, no causal mask, 1 head/layer — chosen to make interpretation tractable, but atypical. The clean path-merging picture may be partly an artifact; the rotary/decoder-only robustness check is appendix-only and not the object of the mechanistic analysis.
5. “Search” is operationalised very narrowly. DAG connectivity / next-vertex prediction. Real reasoning has heuristics, backtracking, partial observability. The proof-search equivalence holds only for a subset of propositional logic. The leap to “reasoning” is asserted, not demonstrated.
🥉 Tier 3 — subtle, high-credit (if discussion goes deep)
6. “The algorithm” is an idealisation. Path-merging explains ~92–99% (not 100%) and the merges are demonstrably non-maximal → the model runs an approximate, partial version, possibly plus other heuristics. Honest of the authors to report — but it weakens “learned the correct algorithm.”
7. Even the success barely generalises. L≤12 → only L=13,14. A hard ceiling at +2 is a meaningful caveat for a paper about acquiring a general algorithm.
8. Reporting choices. Test loss on naïve graphs while training on balanced (OOD stress test, but conflates “can’t search” with “different distribution”); scaling plots report minimum loss over 14 seeds (optimistic — typical seed does worse). They’re transparent (also report convergence fractions) → a calibration caveat, not bad faith.
9. The mechanistic method doesn’t scale. ~L·n²·m·F forward passes → only feasible on tiny models. The one tool that could check whether real LLMs use path-merging can’t be run on them — exactly the regime the headline is about.
⚖️ What makes you a fair critic (state explicitly)
- The mechanistic-interpretability method is a genuine, reusable contribution; path-merging is well supported at the scale studied.
- The data-distribution sensitivity finding is robust and important independent of the title overclaim (“shortcuts in data ⇒ shortcuts in model”).
- The negative scaling trend is real within its regime; only the extrapolation is contested.
🔧 Constructive (“how would you improve it?“)
- Test medium-sized models to bridge the extrapolation gap.
- Probe real pretrained LLMs for path-merging circuits (a scalable proxy).
- Try the authors’ own remedies: curriculum learning, looped/recurrent transformers.
- A theoretical loss-landscape account of why optimization fails here.
PART D — Context, bridges & Q&A
🌐 Place it in a larger context
- “Reasoning vs. pattern-matching” in LLMs: alongside Kambhampati (planning), Saparov & He (proof search), Bachmann & Nagarajan (next-token traps). Theme: LLMs often imitate reasoning via shortcuts.
- Scaling laws & emergence: Wei et al. (“emergent abilities”) vs. Schaeffer et al. (“emergence is a mirage of metric choice”); Du et al. (loss is the better lens). This paper = a pointed counterexample to scaling optimism.
- 🌟 Neuro-symbolic AI (Kühnberger’s turf — score here): classical search (BFS/DFS/A*) is provably correct and scales; learned transformers only approximate and degrade. → Argument for hybrid systems that offload search to symbolic solvers/tools.
- Mechanistic interpretability as a growing field (circuits, activation patching).
- System 1 / System 2 (optional): fast forward-pass pattern-matching vs. deliberate search → motivates external search modules.
- Link to the course: lectures give the gold-standard search algorithms; this paper shows learned models don’t reliably reach them → why classical AI methods still matter.
🔁 Training as local search (the bridge to your Cluster 1)
Two different searches live in the paper. (1) The task is classical, path-based search over a graph (BFS/DFS/A*). (2) The training that installs it is local search — SGD = hill climbing in weight space. The negative results are about search (2), not (1). High seed variance = different random initial states → random-restart hill climbing; shortcut learning = a deceptive local optimum; balanced distribution = reshaping the objective to remove the deceptive optimum; the proposed fixes map one-for-one: many seeds = random restart, curriculum = annealing schedule, looped transformers = changing the move set.
⚠️ Don’t say “the paper is about local search” unqualified — at the task level it’s classical search; the local-search link is at the training level. Naming which search you mean earns the marks.
🌉 Bridges to your 3 focus topics
(Local) Search — path-merging vs. classical: a parallel, transitive-closure-style reachability build-up of all vertices at once, not sequential frontier expansion like BFS/DFS. Exponential-in-layers because sets double; ~log(L) layers suffice. The DFS CoT variant connects directly to lecture DFS; lookahead ≈ search depth.
Machine Learning — shortcut learning / spurious correlations / inductive bias → overfitting & generalization (ML I/II). Bias-variance angle: high seed variance on large graphs. Bridge to model capacity / scaling laws.
Transformer fundamentals — self-attention (Q/K/V), multi-head, positional encoding (here 1-hot, concatenated not added — why? interpretability), no causal mask (why? unordered edges).
CSP (stretch) — graph connectivity relates to constraint graphs/reachability; backtracking (exact, with guarantees) vs. learned search (approximate) — the same exact-vs-approximate tension.
❓ Likely discussion Q&A (bullet answers)
- “Main contribution?” → Not just “transformers fail,” but (a) they can with the right distribution, (b) the learned algorithm is mechanistically identified, (c) scaling/CoT don’t solve it.
- “Why three distributions?” → Random graphs rarely need large lookahead (exp decay) → shortcuts suffice. Balanced makes lookahead uniform + removes shortcuts → isolates genuine search. Together they prove the failure is about data, not inability.
- “Explain path-merging / why log layers?” → Each vertex stores its reachable set; each layer unions along edges → doubling; doubling log₂L times covers lookahead L. Transitive closure, all vertices in parallel. Caveat: non-maximal merges → barely generalises past trained L.
- “Does scale help? Be precise.” → (i) fix model, grow graph: convergence collapses, min loss rises faster; (ii) fix graph, grow model 0.9M→60M: faster to the shortcut min, no systematic gain at the true solution. Rotary decoder-only: same. → No, not in tested regime.
- “Does CoT solve it?” → Helps depth (DFS 3 layers, sel-inf 4 layers, constant not log) but not the size wall; bigger models cost more FLOPs.
- “Local or classical search?” → Task = classical (output is a step on a path); training of it = local search (SGD). Name which.
- “Title fully justified?” → Partially — trainability is hard for tiny models, scaling doesn’t help in that regime; but no representational impossibility (Merrill & Sabharwal) and their own positive result points at data/shortcuts. Strong architectural reading overreaches.
- “Confident it ‘really’ uses path-merging?” → Fairly (0.92–0.99 vs 0.05 control) but not absolutely (not 100%, non-maximal) → approximate path-merging + residual heuristics.
- “But o3/Gemini/Claude search now — doesn’t that refute it?” → No — it confirms it. They externalise search: RL-trained long CoT at test time (o1/o3, DeepSeek-R1, Gemini Deep Think, Claude extended thinking), explicit tree/MCTS search around the model (Tree-of-Thoughts; AlphaProof / AlphaGeometry 2 at the 2024 IMO), and tool use / external solvers (neuro-symbolic “LLM-modulo”). The single forward pass still doesn’t robustly search — exactly the paper’s point — so the field added test-time compute + external structure, not just parameters (the paper’s own recommended directions).
- “What next if it were your project?” → Medium models to bridge the gap; probe real LLMs for path-merging; curriculum + looped architectures; a loss-landscape theory of why SGD struggles.
📌 Terms that must be second nature
vertex · edge · DAG · path / reachability · lookahead (search depth, min{|P|, max|Sᵢ|}) · transitive closure · activation patching · computation graph / circuit · path-merging (non-maximal in practice) · chain-of-thought · scaling laws · emergence · shortcut / heuristic learning · self-attention (Q/K/V) · positional encoding · mechanistic interpretability · FLOPs (≈ 2·params·tokens).
Before the exam
Say Part B’s 5-min script out loud 3× with a timer. Then deliver Parts C & D freely from bullets. Tick the Part A checklist boxes only when you can explain each without notes. Run Quiz Exam Schwerpunkte for focus-topic depth.