Hard quiz on the exam paper for the oral module exam (Mon 1 Jun 2026).

Every question is 🔴 Hard (why / compare / derive / apply / defend). ⚠️ = a point examiners love to push on or that’s easy to overstate.
Answers hidden in collapsible callouts. Fill in your answer + result during the session — then re-quiz the misses 2 days later.

Companion

pruefung_paper-transformers-search_25-05-26 (full dossier — anatomy §1–§7, present script, critique, context) · Quiz Exam Schwerpunkte · Methods of AI Lecture

Q1 — Setup · 🔴 Hard ⚠️

Question: The paper argues results on a toy DAG transfer to “reasoning.” State the exact equivalence it relies on, and then state the strongest limit of that equivalence (i.e. where a critic should push).

Answer

Equivalence: graph connectivity = proof search in implicational propositional logic — vertex ↔ proposition, edge u→v ↔ implication “if u then v”, start ↔ premise, goal ↔ what to prove, next vertex ↔ next proof step. So “find the next vertex toward the goal” is “take the next step of a proof.” It’s a lower bound on competence: fail here ⇒ fail on any reasoning containing search.

Limit (the push): the equivalence holds only for a narrow subset of logic (implicational, no negation/disjunction/quantifiers) and for deductive chaining. Real reasoning adds heuristics, informed search, backtracking under uncertainty, partial observability. So “can’t learn DAG connectivity at scale ⇒ can’t reason” is asserted via the equivalence, not demonstrated on real reasoning tasks. (Tier-2 critique.)

🖼️ Figure — the graph ⇄ proof-search mapping (paper Fig 1)
Same problem, two faces:
Graph edges        Transformer input (tokens)
  8 ─► 3 ─► 6      E 4 1  E 8 3  E 3 6  E 8 4  E 2 3   Q 8 6   P 8
  8 ─► 4           └──── edges ────┘     start 8, goal 6   predict→
  2 ─► 3           Label: 3      (next vertex on path 8→3→6)
  4 ─► 1
graph ⇄ implicational logic
vertex v ⇄ proposition v
edge u→v ⇄ “if u then v”
start 8 ⇄ premise (given 8)
goal 6 ⇄ prove 6
next vertex 3 ⇄ next proof step (8 ⇒ 3)

graph	⇄	implicational logic
vertex v	⇄	proposition v
edge u→v	⇄	“if u then v”
start 8	⇄	premise (given 8)
goal 6	⇄	prove 6
next vertex 3	⇄	next proof step (8 ⇒ 3)

Max’s answer: finding answers happens through making the correct implications. Creating a simple graph with implications tests whether a LLM is able to follow implications which are trivial for humans. Due to the fact that this graph is always unidrectional and also in their balanced graph there are no shortcuts optional, the LLM has to find the correct pathway.
Also the critics argue that implications are not the only way to make assumptions - backtracking, deductive chaining, heuristics
Result:

Q2 — Lookahead · 🔴 Hard ⚠️

Question: Write the formal definition of lookahead L, explain the role of the min, and apply it: a start vertex has the true path of length 3 to the goal, plus one disjoint distractor branch of length 5 and one of length 2. What is L?

Answer

Definition: with P = path start→goal and Sᵢ = the other start-leaving branches otherwise disjoint from P:
L = min{ |P|, maxᵢ |Sᵢ| }.

Why the min: you only need to look as far as the shorter of (i) the true path and (ii) the longest distractor you must rule out. If the true path is short you commit quickly; if there are no deep distractors you can’t go wrong. Difficulty = how far you’re forced to search before safely picking the first move.

Apply: |P| = 3; max distractor = max(5, 2) = 5. L = min{3, 5} = 3. (The length-5 distractor would demand depth 5, but the true path only being 3 caps the needed lookahead at 3 — beyond that you’ve already reached the goal.)

🖼️ This question's graph (L = 3)
start ─► a ─► b ─► goal              true path P,  |P| = 3
start ─► k1─► k2─► k3─► k4─► k5      distractor,  length 5
start ─► m1─► m2                     distractor,  length 2
maxᵢ|Sᵢ| = max(5, 2) = 5 → L = min(3, 5) = 3. The length-5 branch can’t push L above |P|: you reach the goal in 3 before exhausting it.

Two worked graphs → worked-examples_lookahead_29-05-26

Detailed walkthrough of two DAGs: Ex.1 distractor wins (elimination cheaper, L=3, |P|=5); Ex.2 path wins (over-long distractor capped, L=3, |P|=3). Covers the rejoining-branch trap and the “disjoint from P, not from each other” subtlety.

Max’s answer: l is the maximum path length the algorithm needs to start to decide which ath to take that will certainly lead to an answer. P is the path, S are the other pathways from the starting point.
It takes the minimal P, so the shortest pathway, and rules out

Result:

Q3 — Distributions · 🔴 Hard ⚠️

Question: The naïve-trained model’s accuracy collapses at higher lookaheads (Fig 2 — accuracy; note Fig 5 is a different metric, “proportion explained by path-merging”) — yet this is on lookaheads it actually saw during (limitless) training. Explain precisely why “it never saw those L” is the wrong explanation, and what the right one is.

Answer

With streaming/limitless data the model did see higher-L graphs — generation is procedural and unbounded, so those examples certainly appeared.

But the naïve (Erdős–Rényi) distribution puts exponentially decaying mass on large lookaheads (almost everything is L=1 or 2; see Fig 8). So higher-L examples are vanishingly rare relative to the flood of easy ones.

Gradient descent minimizes average loss. Getting the rare hard cases right barely moves the loss, so the optimizer never has to learn them — it converges to a shortcut that aces the abundant easy cases.

Right explanation: failure is from distribution imbalance / shortcut-friendliness, not absence of data. The balanced distribution makes hard examples as “loud” as easy ones, forcing real search.

⚠️ Third distribution (don’t forget): there are three training distributions — naïve, star, and balanced. The star-trained model does reasonably on balanced test graphs but poorly on naïve ones; only the balanced model generalises across all three (Fig 2).

🖼️ Lookahead mass: naïve vs balanced (schematic; cf. Fig 8)

  naïve P(L)  (log y-axis)      balanced P(L)  (by construction)
1 ████████████                1 █████
2 ███                         2 █████
3 █                           3 █████
4 ▏                           4 █████
5 ·                           5 █████
6 ·                           6 █████
7 (≈0, "astronomically rare") 7 █████
  → optimiser ignores L≥5       → every L equally "loud"

import math                                   # stdlib only — runs in Obsidian/Pyodide
L = list(range(1, 11))
naive = [math.exp(-1.3 * l) for l in L]       # ~exponential decay (cf. Fig 8)
s = sum(naive); naive = [x / s for x in naive]
balanced = [1 / len(L)] * len(L)              # uniform by design
print("L        :", L)
print("naive  P :", [round(x, 3) for x in naive])
print("balanced :", [round(x, 3) for x in balanced])
# SGD optimises ∝ frequency → naïve barely trains the rare deep-lookahead cases

Max’s answer: the naive-trained model was trained on a naive distribution, meaning the distr. had shortcuts and uneven paths, so there was always a high variance of L, meaning that the model might have been trained only to look at a few lookaheads, because it might have found the goal a lot faster in the training runs.
Result:

Q4 — Distributions · 🔴 Hard

Question: The paper’s headline reads as an architectural claim, yet the positive result depends on the hand-crafted balanced distribution. Explain why this is an internal tension, and which side the evidence actually favours.

Answer

The transformer learns search almost perfectly — but only when trained on a balanced, shortcut-free distribution that does not occur in the real world.

If curating the data is what unlocks search, then the failure on naïve/real data is fundamentally about data and shortcuts, not about the architecture being incapable.

The real tension is emphasis, not contradiction. The two claims can coexist if properly qualified — data binds at the small scales tested, architecture may bind at larger scales (Fig 6: even Balanced collapses on big graphs), and the architecture’s heavy distribution-sensitivity is itself a weak inductive bias (a real architectural shortcoming). What the positive result does refute is the unqualified architectural reading the title invites: it shows the limit is data-dependent and scale-dependent, not unconditional. So the honest critique is that the title overreaches relative to the calibrated reading the evidence supports.

Evidence favours: a learnability/data story (with these tiny models + standard training), not representational impossibility. (Tier-1 critique #3.)

Max’s answer: first of all, they claim it is an issue by architecture - since they looked at more layers than the algorithm would need - but also the only search they created is based on a hand-crafted balanced distribution, which is not observable in real life. That means it is an theoretical environment and real world search depends on more than balanced implications - backtracking,heuristics, etc.
Result:

Q5 — Mechanistic method · 🔴 Hard ⚠️

Question: In identifying important attention operations, the authors perturb each weight both to 0 and to the row’s maximum (renormalizing). Why is perturbing upward essential — what would a 0-only test miss?

Answer

Setting a weight to 0 tests “does muting this connection hurt the prediction?” — it catches operations that carry signal via a large weight.

But a token sometimes attends strongly to almost everything except one token, and the real information rides on that single suppressed, low-weight connection. Muting a low weight changes little, so a 0-only test would wrongly mark it unimportant.

Forcing the weight up to the row max (and renormalizing) reveals these “attend-to-all-but-one” operations: if maxing the exceptional weight changes the output, the operation matters.

So up + down perturbations together catch both “this strong link carries info” and “this deliberately weak link carries info.”

🖼️ Why a 0-only test misses the signal (attend-to-all-but-one)
An attention row where the suppressed token i* secretly carries the information:
token :   t1    t2    t3    t4    t5   t6(i*)
weight:  0.22  0.22  0.22  0.22  0.11  0.01   ← i* tiny ON PURPOSE

perturb w(i*) → 0   : row barely changes  → marked "unimportant"  ✗ (WRONG)
perturb w(i*) → max : output flips         → marked "important"   ✓ (caught)
The real info rides on the deliberately small weight, so only the upward perturbation exposes it (paper footnote 2).

Max’s answer:
Result:

Q6 — Mechanistic method · 🔴 Hard

Question: In Step IV the authors compute two perturbed dot products, Q̃ⱼKᵢᵀ and QⱼK̃ᵢᵀ. Explain what each one isolates and how this lets them attribute a feature to the source vs. the target embedding.

Answer

An attention weight (j attends to i) rises with the dot product Qⱼ·Kᵢ, where the query Qⱼ is built from target token j’s embedding and the key Kᵢ from source token i’s embedding.

Q̃ⱼKᵢᵀ = recompute with the query perturbed (target side changed), key untouched. If this dot product swings (vs. the original QⱼKᵢᵀ) beyond √d/κ₂, the perturbed feature must be encoded in the target embedding j.

QⱼK̃ᵢᵀ = key perturbed (source side changed), query untouched. A swing ⇒ the feature lives in the source embedding i.

So by perturbing one side at a time and watching which dot product moves, they pin which embedding (source or target) carries the causal feature — and the two probe types (token-ID swap vs. zeroing the positional embedding) say whether it’s an identity or a position feature. (Earlier layers are frozen so the effect is attributable to this op.)

Worked QK dot product (numbers) → worked-example_qk-dotproduct_29-05-26

Concrete one-hot example: vertex 3 attends to vertex 6 (Q₃·K₆ = 1, others 0) → copies 6’s reachable set = one path-merge. Then shows how perturbing Q̃ⱼKᵢᵀ vs QⱼK̃ᵢᵀ attributes the feature to target vs source.

Max’s answer:
Result:

Q7 — Path-merging · 🔴 Hard ⚠️

Question: Why does the learned algorithm search a number of vertices exponential in the number of layers (equivalently: only ~log-many layers are needed)? Where exactly does the doubling come from — and what does this correspond to classically?

Answer

Each vertex’s embedding stores the set of vertices reachable from it. At each layer, attention unions a vertex’s reachable set with that of a vertex on its current frontier (the edge of its reachable set).

Crucially the merge unions two reachable sets of (roughly) equal size, so the set can double per layer — not a prefix growing by one. After ℓ layers a vertex can reach ~2^ℓ vertices ⇒ to cover lookahead L you need ~log₂L layers.

Classically: this is repeated squaring of the reachability relation = computing the transitive closure in logarithmic depth, done for all vertices in parallel. (Matches Sanford et al.’s theory that connectivity needs ~log-many layers in graph size.)

Contrast: BFS expands a frontier serially, steps ∝ path length; path-merging is parallel and logarithmic.

⚠️ Direction: §4.2 notes the model may run this forward (store vertices reachable from v) or backward (store vertices from which v is reachable) — either direction gives the same doubling. Don’t claim it’s necessarily forward.

🖼️ The doubling, layer by layer (chain 1→2→…→9)

What vertex 1 knows it can reach — distance doubles, not +1:

after layer │ reachable set of vertex 1 │ max distance
────────────┼───────────────────────────┼─────────────
  input     │ {1}                       │ 0
  layer 1   │ {1,2}                     │ 1
  layer 2   │ {1,2,3}                   │ 2
  layer 3   │ {1,2,3,4,5}               │ 4
  layer 4   │ {1,…,9}                   │ 8   ⇒ 2^ℓ

Layer 3 jumps 2→4 because vertex 1 merges vertex 3’s set, already {3,4,5} (vertex 3 expanded in parallel during layers 1–2). Parallelism is what makes it double instead of +1.

# stdlib only — reproduces the table above. chain 1->2->...->9
edges = {v: ({v + 1} if v < 9 else set()) for v in range(1, 10)}
reach = {v: {v} | edges[v] for v in range(1, 10)}   # layer 1: direct neighbours (dist 1)
print("layer 1:", sorted(reach[1]))
for layer in range(2, 5):                            # layers 2+: union reachable sets → DOUBLES
    reach = {v: reach[v] | {w for u in reach[v] for w in reach[u]} for v in range(1, 10)}
    print(f"layer {layer}:", sorted(reach[1]))
# prints {1,2} → {1,2,3} → {1,2,3,4,5} → {1..9} : dist 1→2→4→8 ⇒ ~log2(L) layers

Max’s answer:
Result:

Q8 — Path-merging · 🔴 Hard ⚠️

Question: The model learns a non-maximal version of path-merging. Define “maximal,” cite the evidence that it isn’t, and explain how this non-maximality explains the generalization ceiling (L≤12 trained → works at 13, 14, then stops).

Answer

Maximal merge: at each layer the operation unions the largest reachable sets available (true doubling). E.g. merging {1,2} into a vertex holding {2,3} is maximal; merging only {2} is sub-maximal.

Evidence: reconstructing 2000 circuits, the proportion of maximal merges is well below 1 — the network routinely merges smaller sets than it could. (And path-merging explains ~92–99% of examples, not 100%.)

Why it has no incentive to be maximal: it has more layers than it needs for the trained lookaheads, and those lookaheads don’t align with powers of 2 — so sloppy, sub-doubling merges still solve the training range.

Ceiling: because merges don’t truly double, the reachable set grows slower than 2^ℓ. A model trained on L≤12 has just enough slack to stretch to L=13, 14, but the sub-exponential growth runs out beyond that → it fails on larger L. Non-maximality = a little headroom, not unbounded generalization.

🖼️ Maximal vs non-maximal merge (paper §4.2 example)
Vertex 1 holds {1,2} and attends to vertex 2 to merge in its set:
vertex 2 holds {2,3}  →  1 becomes {1,2,3}   MAXIMAL    (took the largest available set)
vertex 2 holds {2}    →  1 becomes {1,2}     sub-maximal (grew by less than it could)
Reconstructed circuits: fraction of maximal merges is well below 1 (Fig 5 bottom) ⇒ set grows slower than 2^ℓ ⇒ trained L≤12 reaches only L=13,14, then breaks.

Max’s answer:
Result:

Q9 — Scaling · 🔴 Hard ⚠️

Question: “Scaling doesn’t help” rests on two distinct experiments. State each precisely (what’s fixed, what varies, what’s measured) and what each shows. Why is “bigger is faster at being wrong” the right summary of the second?

Answer

(a) Bigger graphs, fixed model (8 layers, hidden 16; 14 seeds): vary max graph size 8→50. As it grows, (i) the fraction of seeds that converge (train acc ≥ 0.995) collapses toward 0, and (ii) the minimum test loss over seeds rises at an increasing rate. ⇒ Larger problems get harder faster.

(b) Bigger models, fixed graph (31 vertices; 0.9M→60.4M params, 4 seeds): vary model size. Larger models reach the shortcut local minimum faster, but there is no discernible relationship between size and the training needed to reach the global minimum (true search solution).

“Faster at being wrong”: scaling buys quicker descent into the shortcut basin (the wrong solution), not a higher chance of finding the correct algorithm — and bigger models cost far more FLOPs to get there. So size helps speed-to-shortcut, not correctness.

Robustness: a rotary, decoder-only variant (modern default) behaves the same ⇒ not an artifact of the odd architecture.

🧠 WHY scaling doesn't help (the deeper reason)

Core: scaling adds capacity, but the bottleneck was never capacity — it’s optimization. The search solution is representable (path-merging needs only ~log₂L layers; the model has more layers than it needs), but SGD doesn’t find it — it settles on a shortcut. Bigger models solve a problem the model didn’t have.

Deceptive shortcut basin: an easy local min (~loss 1.0) aces the abundant easy examples via heuristics. SGD descends into it fast; bigger models reach the shortcut faster (Fig 7), but escaping to the global min is size-independent → “faster at being wrong.”

More params ≠ different incentive: shortcuts come from the data/loss structure (average loss dominated by easy cases), not a parameter shortage. Adding params doesn’t change what SGD is rewarded for.

Spare capacity goes unused: even with surplus layers, merges are non-maximal (Q8) → extra capacity already sits idle.

Graph-size scaling (Fig 6): as graphs grow, the fraction of seeds that converge collapses → 0 — a learnability collapse, not a capacity ceiling.

Model-size scaling (Fig 7) grew width/dimension — but path-merging says depth is the search-range axis; plus bigger models cost far more FLOPs.

CoT doesn’t rescue it either (constant layers, but still degrades with graph size).

One-liner: scaling helps a capacity-limited model; this one is optimization-limited — the solution exists and is representable, but training won’t reach it. ⚠️ Shown only at ≤60M params (width-scaled) → it’s an extrapolation; authors concede emergent search at frontier scale is possible.

🖼️ The two scaling experiments (schematic; Fig 6 & Fig 7)

(a) FIX model (8L, dim16), GROW graph 8→50      (b) FIX graph (31), GROW model 0.9M→60M

 converged    min test loss                       test loss vs #examples
 fraction      │        ╱                           │╲   every size plunges to the
 1│●╲          │      ╱                             │ ╲  SAME shortcut min, then stalls
  │  ●╲        │    ╱                               │  ╲___  — no ordering by size to
  │    ●╲__    │  ╱                                 │       the true (global) min
 0└───────►    └───────►                            └───────►
   graph size    graph size                           training examples
  collapses→0   rises (faster & faster)            bigger = faster-to-shortcut, not better

(a) 14 seeds; convergence = train acc ≥ 0.995. (b) 4 seeds. “Faster at being wrong” = quicker descent into the shortcut basin, not a better shot at the search algorithm.

🗂️ ALL model configs across the paper (reference — layers vary 3→8!)

Shared architecture (everywhere): GPT-2 style, ReLU; 1-hot token + absolute-position embeddings, concatenated (not added); 1 attention head per layer; FFN dim = model dim; no causal mask (edges are unordered). Sophia optimizer, lr 1e-5, weight decay 0.1, no dropout, cross-entropy, streaming training. Batch 1024 (256 for scaling, GPU memory).

Experiment Section / Fig Layers Fixed Varied
Distribution sensitivity + interpretability §3.1, §4 / Fig 2, 5 6 dim 16, input 128 (~41 vertices) lookahead L=1→20
Graph-size scaling §5 / Fig 6 (+ A.4/Fig 10) 8 dim 16, 14 seeds graph size 8→50
Model-size scaling §5 / Fig 7 (varies) graph 31, 4 seeds params 0.9M·1.3M·6.0M·60.4M
DFS — depth test §6.1 / Fig 14 top 3 — graph size
DFS — model size §6.1 / Fig 14 + A.8 (varies) graph 35 params 0.4M·5.5M·13.3M·26.0M·47.2M
DFS — padding §A.9 / Fig 15 7 input 128 standard vs random padding
Selection-inference — depth §6.2 / Fig 16 4 — graph size
Selection-inference — model size §6.2 / Fig 17 + A.11 (varies) graph 45 params 3.2M·3.8M·13.4M·14.8M·40.3M
Decoder-only + RoPE robustness §A.5 / Fig 11, 12 (matched to §5) — graph size / model size

Why depth varies (the principle): path-merging needs ~log₂(max lookahead) layers. They sit near that floor when depth must be small (6, interpretability is L·n²·m·F-expensive), above it when depth must be a non-factor (8, isolating graph size), and below it for CoT (3 DFS, 4 SI — the trace, not the layers, carries the search). Width/params are swept separately to test model-size scaling. ⚠️ “6 then 8” isn’t a change of mind — different models for different jobs.

Experiment	Section / Fig	Layers	Fixed	Varied
Distribution sensitivity + interpretability	§3.1, §4 / Fig 2, 5	6	dim 16, input 128 (~41 vertices)	lookahead L=1→20
Graph-size scaling	§5 / Fig 6 (+ A.4/Fig 10)	8	dim 16, 14 seeds	graph size 8→50
Model-size scaling	§5 / Fig 7	(varies)	graph 31, 4 seeds	params 0.9M·1.3M·6.0M·60.4M
DFS — depth test	§6.1 / Fig 14 top	3	—	graph size
DFS — model size	§6.1 / Fig 14 + A.8	(varies)	graph 35	params 0.4M·5.5M·13.3M·26.0M·47.2M
DFS — padding	§A.9 / Fig 15	7	input 128	standard vs random padding
Selection-inference — depth	§6.2 / Fig 16	4	—	graph size
Selection-inference — model size	§6.2 / Fig 17 + A.11	(varies)	graph 45	params 3.2M·3.8M·13.4M·14.8M·40.3M
Decoder-only + RoPE robustness	§A.5 / Fig 11, 12	(matched to §5)	—	graph size / model size

Max’s answer:
Result:

Q10 — Critique · 🔴 Hard ⚠️

Question: Give the single strongest critique of the paper, the rebuttal an examiner will fire back, and your counter to that rebuttal. Be precise about the claim/evidence gap.

Answer

Strongest critique — learnability ≠ expressivity (the title overreaches). The title implies an architectural/representational limit; the paper actually shows a training/optimization difficulty (huge seed variance, distribution sensitivity, failed convergence) on tiny models. It does not show transformers cannot represent search — and the authors themselves cite Merrill & Sabharwal (2024): CoT-transformers are Turing-complete.

Examiner’s rebuttal: “But Figure 7 shows more parameters don’t help — isn’t that architectural?”

Your counter: that refutes trainability at scale in the tested tiny regime (≤60M params) — it does not refute representability, and it doesn’t reach frontier scale (3–4 orders of magnitude larger). (Note: the fixed-model graph-scaling runs used 8 layers / hidden dim 16; the param-scaling runs at 31 vertices varied the dimension up to 60.4M — so “dim 16” describes only the former.) The authors even hedge: scaling to much larger sizes may yield emergent search.

Fair framing that scores: the title overreaches, but the text is calibrated — they report the hedges and the non-maximality honestly. You’re separating claim from evidence, not trashing the work.

Max’s answer:
Result:

Q11 — Chain-of-Thought · 🔴 Hard

Question: Chain-of-thought (DFS, selection-inference) does change something and doesn’t change something else. State both precisely, including the role of layers and FLOPs.

Answer

What CoT changes (depth): emitting intermediate tokens makes the task learnable with a constant number of layers — DFS with 3 layers, selection-inference with 4 layers — instead of the log-many layers single-pass path-merging needs. The “search depth” moves from the network depth into the length of the generated trace.

What CoT does not change (the wall): trainability still degrades as graphs grow, and scaling the model still doesn’t fix it. Worse, larger models need many more FLOPs to learn the CoT task, so the “learns from fewer examples” benefit evaporates once compute is counted.

One line: CoT helps the representation/depth requirement, not the learning-to-search-at-scale problem.

🖼️ CoT input formats (exact tokens, paper Fig 13 & §A.10)
Same base graph; now the model predicts an intermediate token, not the final answer:
DFS, standard padding : E 4 1 E 8 3 E 3 6 E 8 4 E 2 3  Q 8 6  P P P 8 4 1   → 3
DFS, random padding   : E 4 1 E 8 3 E 3 6 E 8 4 E 2 3  Q 8 6  P 8 P 4 1 P   → 3
selection subtask     : … E 2 3  Q 8 6  P P P P P P 8 4 P 4 1 P             → 8
inference  subtask    : … E 2 3  Q 8 6  P P P P P 8 4 P 4 1 P 8             → 3
P = padding; the visited trace sits at the right end. DFS needs 3 layers, selection-inference 4 — constant in graph size (the depth moved into the generated trace). But trainability still degrades as the graph grows.

📚 DFS & selection-inference — full mechanics

Both are CoT instantiations (§6 umbrella): the model emits intermediate tokens, so search depth moves into the trace → learnable with a constant number of layers. How each works:

① Depth-first search (DFS) — §6.1

Task: given the graph (edge list) + the sequence of visited vertices so far (the DFS trace up to a “current” vertex), predict the next vertex in the DFS traversal.

Key feature — backtracking: if the current vertex is a dead end, the next step backtracks to an ancestor and takes one of its unvisited children. The model must learn both “go deeper” and “backtrack.”

Difficulty knob = backtrack distance (the DFS analogue of lookahead): how far back you must jump. They train on uniform backtrack distance (DFS “balanced distribution”; §A.7: split the graph into G1 = contains the goal and G2 = the visited trace, backtrack distance = |G2|).

Padding: the visited trace varies in length, so they pad between graph and trace to keep edge tokens in fixed positions. Standard = left-pad the trace; random (§A.9) = insert padding between visited vertices.

Findings (Fig 14): learnable with just 3 layers at all sizes (constant depth — the trace carries the search), but degrades as graphs grow; scaling the model doesn’t help (bigger = fewer examples but more FLOPs). §A.9: random padding generalizes to larger backtrack distances (standard 1.00→0.15 vs random →0.87) — but not to larger graphs.

② Selection-inference — §6.2 (after Creswell et al. 2023)

Task: decompose each search step into two alternating subtasks:

Selection: given graph + visited vertices, pick a visited vertex that still has an unvisited child (choose a frontier node to expand).

Inference: given that selected vertex, predict an unvisited child (take one step out).
Repeat select→infer from the start vertex → eventually reach the goal.

Input uses visited edges (not just vertices), since the steps are edge-based.

Difficulty knobs: frontier size F (# visited vertices with unvisited children) and branch count B (# children of the current vertex). Trained with (F, B) uniform.

Findings (Fig 16, 17): learnable with 4 layers; struggles as graph size grows; scaling doesn’t help. §A.10 nuance: transformers do well on selection but poorly on inference for large graphs.

Bottom line (both): CoT moves search depth into the trace → constant layers suffice, but the wall remains — trainability degrades with graph size and scale doesn’t fix it.

🖼️ Figure 14 — DFS scaling results (3 layers, 15 seeds; values read off the plot)
Three rows (Min train loss | Min test loss), log–log. Test = held-out balanced, backtrack-distance 8; minimum over 15 seeds.

TOP — fix 3 layers, vary GRAPH size 11→47: step-wise descent; smaller graphs reach far lower loss (graph-11 → test ≈ 0.25), larger graphs stay high (graph-47 → test ≈ 0.6). ⇒ bigger graph = harder, higher floor.

MID — fix graph 35, vary MODEL 0.4M→47.2M (vs examples): see plot ↓.

BOT — same vs FLOPs: ordering flips to compute — small models hit their floor cheaply (~10¹⁴ FLOPs); the 47.2M’s deep dip costs ~5×10¹⁵ FLOPs ⇒ the “fewer examples” win is a “much more compute” loss.
MID-right: test loss vs #training examples (graph fixed at 35)
 0.6 ┤●● 0.4M (small) — slow, drifts to ~0.4 only late
     │  ╲___
 0.4 ┤      ╲___________   ← all models end ~0.35–0.42
     │ 47M ╲         __╱
 0.2 ┤      ╲___   _╱       ← 47M RISES back up (overfits)
 0.15┤         ╲_╱          ← 47M dips DEEPEST (~0.15) at ~2×10⁷, then up
     └────────────────────► 10⁶     10⁷     10⁸   examples
The dip-then-rise (47.2M): train loss keeps dropping (step-downs to ~0.45) while test loss dips to ~0.15 then rises to ~0.38 — the textbook train↓ / test↑ overfitting signature. Mechanism: the big model fastest-fits a great-generalising solution, then continued training over-specialises to the full balanced mix (all backtrack distances) and its backtrack-8 generalisation degrades back to the pack. Streamed/infinite data ⇒ this is distribution specialisation, not memorisation.
Findings: ① constant 3 layers learn DFS at all graph sizes (depth needn’t grow — the trace carries the search). ② degrades as graphs grow (top). ③ bigger model = faster & deeper dip but no lasting win — it overfits back up to the pack. ④ in FLOPs the advantage evaporates (bottom).

Max’s answer:
Result:

Q12 — NL experiment · 🔴 Hard ⚠️

Question: The natural-language (proof-search) experiment is often misremembered. What does it rule out, what does it fail to show, and why does difficulty grow “especially in FLOPs” there?

Answer

Rules out: “the failure is an artifact of the symbolic E u v format.” Re-encoding each edge as an English sentence (“If Alex is a wumpus, then Alex is a vumpus”) and keeping the same scaling failure shows the format isn’t the culprit — it’s tied to language-shaped reasoning.

Fails to show: anything about real pretrained LLMs. They still trained small transformers from scratch on a synthetic, logic-shaped language. ⚠️ Don’t say “they tested LLMs on language.”

Why “especially in FLOPs”: each edge is now many tokens (a whole sentence) instead of one triple ⇒ much longer sequences ⇒ attention cost grows quadratically in length ⇒ FLOPs per example balloon, so the difficulty curve is steepest when measured in compute.

🖼️ Symbolic vs natural-language encoding (paper Fig 1 / §3.1.2)
SYMBOLIC   (1 edge = 3 tokens):
  E 4 1  E 8 3  E 3 6  E 8 4  E 2 3   Q 8 6   P 8        → 3

NATURAL LANGUAGE (1 edge = a full sentence, every word & comma a token):
  "If Alex is a wumpus , then Alex is a vumpus ."   ← ~one edge ≈ 10 tokens
  … Given Alex is a <start> , prove Alex is a <goal> .
Same scaling failure in both ⇒ format isn’t the cause. NL sequences are far longer → attention is O(len²) → difficulty curve steepest in FLOPs. ⚠️ Still tiny transformers trained from scratch — not real pretrained LLMs.

Max’s answer:
Result:

Q13 — Context · 🔴 Hard ⚠️

Question: Discussion bait: “But o3, Gemini and Claude clearly search now — doesn’t that refute the paper?” Argue why it confirms the paper instead, with concrete mechanisms.

Answer

It confirms it. None of these models discover a search algorithm in a single forward pass; they externalize search — exactly the paper’s recommended directions.

Four mechanisms:

Test-time compute / long CoT (RL-trained): o1/o3, DeepSeek-R1, Gemini “Deep Think”, Claude extended thinking — search lives in the generated trace, scaled by thinking longer. (= the paper’s CoT setting, scaled.)

Explicit search over steps: self-consistency, Tree-of-Thoughts (DFS/BFS over partial solutions), MCTS; AlphaProof / AlphaGeometry 2 reached IMO silver (2024) with tree search around the net.

Tool use / external solvers: code interpreter, SAT/SMT, classical planner — the neuro-symbolic “LLM-modulo” idea (Kambhampati).

Retrieval + agentic loops: RAG, ReAct, plan–act–verify.

Punchline: the field got robust search by adding test-time computation and external structure, not by making one forward pass smarter through scale — precisely the paper’s thesis. (“Modern models can search” and “transformers struggle to learn to search” are both true because modern search isn’t in the forward pass.)

Max’s answer:
Result:

Q14 — Search connection · 🔴 Hard ⚠️

Question: Is this paper about local search or classical search? Both answers are partly right — disentangle them, and map the paper’s proposed fixes onto your Local-Search lecture.

Answer

Two different searches live in the paper — name which you mean (this is what scores):

The task is classical, path-based search over a DAG (the output is a step on a path) → compare to BFS/DFS/A*, not hill climbing. Path-merging = parallel, log-depth transitive closure vs. BFS’s serial, linear frontier expansion.

The training that installs it is local search: SGD = hill climbing in weight space. The paper’s negative results (seed variance, shortcut convergence, “more params don’t help”) are local-optima / loss-landscape phenomena.

⚠️ Don’t say “the paper is about local search” unqualified — at the task level that’s a category error.

Fixes ↔ Local-Search lecture:

many seeds, report best ↔ random-restart hill climbing

curriculum learning (easy→hard) ↔ annealing schedule (broad exploration, then “cool” onto hard cases)

balanced, shortcut-free data ↔ reshaping the objective to remove deceptive optima

looped/recurrent transformers ↔ changing the neighbourhood / move set

🖼️ Path-merging vs classical search (the lecture bridge)

complete? optimal? how it explores depth in L
BFS yes yes (unit cost) serial frontier ∝ L (linear)
DFS yes (finite graph) no serial + backtrack ∝ L
A* yes yes (admissible h) best-first, guided depends on h
path-merging learned (~approx) n/a all vertices ∥, set-union ∝ log₂L
The net’s “search” is parallel + logarithmic but only approximate; classical search is serial but provably correct → that gap is the argument for neuro-symbolic hybrids.

	complete?	optimal?	how it explores	depth in L
BFS	yes	yes (unit cost)	serial frontier	∝ L (linear)
DFS	yes (finite graph)	no	serial + backtrack	∝ L
A*	yes	yes (admissible h)	best-first, guided	depends on h
path-merging	learned (~approx)	n/a	all vertices ∥, set-union	∝ log₂L
The net’s “search” is parallel + logarithmic but only approximate; classical search is serial but provably correct → that gap is the argument for neuro-symbolic hybrids.

Max’s answer:
Result:

Q15 — Method limits · 🔴 Hard

Question: The mechanistic-interpretability method is the paper’s most durable contribution — yet it carries a self-undermining limitation for the headline. What is it, and why does it bite exactly where the paper wants to generalize?

Answer

Cost: the analysis needs ~L·n²·m·F forward passes (layers · input size · examples · perturbed features) — feasible only on the very small models studied.

Self-undermining bite: the headline is about whether large / frontier models can acquire search. But the one tool that could verify whether real LLMs use path-merging cannot currently be run on them — it doesn’t scale to the regime the claim is about.

So the method proves things cleanly where it can run (tiny models) and is silent precisely where the sweeping claim lives (large models). The authors acknowledge this and suggest testing path-merging predictions separately in larger models. (Tier-3 critique #9.)

🖼️ Why the method can't scale — plug in numbers
Cost ≈ L · n² · m · F forward passes (footnote 6):
L = #layers,  n = input size (tokens),  m = #examples,  F = #perturbed features

tiny model studied:  L=6,  n=128,  m=2000,  F≈128
  6 · 128² · 2000 · 128   ≈  2.5 × 10¹⁰   forward passes   (feasible, just)

frontier LLM:        L≈100, n≈10⁵, m=2000, F≈10⁵
  100 · (10⁵)² · 2000 · 10⁵  ≈  2 × 10²⁰   ✗  utterly infeasible
The n² term (context length) alone explodes → the verification tool dies exactly at the scale the headline is about.

Max’s answer:
Result:

Q16 — Attention / architecture · 🔴 Hard ⚠️

Question: The balanced graphs are built left-to-right (topological order) with edges all pointing right — yet the model uses full (bidirectional) attention, no causal mask. Why isn’t causal attention enough, given the graph’s left-to-right structure?

Answer

The graph is topologically ordered, but the input token sequence is not: edges are listed in random order and vertex IDs are randomly permuted (carry no topology). So token position ≠ graph position.

A needed edge can therefore sit anywhere — before or after the current token. Causal attention sees only earlier tokens, so it would miss edges shuffled to the right.

Path-merging runs at all vertices in parallel, each needing to reach arbitrary other tokens → every-token-to-every-token attention is required.

⚠️ The scramble is deliberate (anti-shortcut): it stops the model exploiting input layout, forcing genuine identity-based search. Full attention is its necessary consequence.

⚠️ Trap: don’t say “because edges go both ways” — it’s a DAG. Reason = input-sequence order ≠ graph order.

🖼️ Causal vs full attention (why the mask is dropped)
A needed edge can be shuffled to a later position than the token that needs it:
 CAUSAL mask (decoder)      FULL attention (this paper)
     a b c d                    a b c d
  a  1 0 0 0                 a  1 1 1 1
  b  1 1 0 0                 b  1 1 1 1
  c  1 1 1 0                 c  1 1 1 1
  d  1 1 1 0→1               d  1 1 1 1
  (a token can't see         (every token sees
   LATER tokens)              every token)
Edges are listed in random order + IDs permuted ⇒ the edge you must merge can sit anywhere → causal would miss the ones on its right.

Max’s answer:
Result:

Q17 — Training data · 🔴 Hard ⚠️

Question: Is the training set a fixed collection of examples? Explain how the data is formed and why it matters for the “it’s not lack of data” argument.

Answer

Not fixed — procedurally generated. Each example is a freshly sampled random graph (different edges, size, lookahead, start/goal), serialized to tokens with edges in random order and IDs permuted.

Streaming training: continually sample fresh batches from the generator (test set reserved + filtered out) → effectively infinite, non-repeating data (883M / 236M examples).

Why it matters: with limitless generation the model did see hard (high-L) examples — they’re just rare under naïve. So failure isn’t “lack of data” → it’s distribution imbalance / learnability (Q3).

⚠️ Don’t say “trained on a dataset”; say “trained on a generative process / streaming data.” The whole no-lack-of-data argument hinges on this.

🖼️ The data pipeline (one fresh example per call)

# nothing is reused — every call makes a NEW random graph (stdlib only)
def sample_example(dist):
    G      = sample_graph(dist)        # naive / star / balanced
    s, g   = pick_start_goal(G)        # g reachable from s
    G      = permute_vertex_ids(G)     # IDs carry no topology
    edges  = shuffle(list_edges(G))    # random order in the token string
    tokens = serialize(edges, s, g)    # "E u v ... Q s g  P s"
    return tokens, next_vertex_on_path(s, g)
 
# streaming: while True: batch = [sample_example("balanced") for _ in range(1024)]
#   → effectively infinite, non-repeating data (883M / 236M examples)

So it’s a generator, not a fixed set reshuffled — that’s why “it never saw hard L” is the wrong explanation (it did; they’re just rare).

Max’s answer:
Result:

Q18 — Experimental design · 🔴 Hard ⚠️

Question: They train on balanced but measure test loss on naïve. Why test on a different distribution than they trained on?

Answer

Balanced is a training scaffold (engineered: uniform lookahead, shortcuts removed) whose only job is to force real search to be learned. Naïve is the natural target.

Testing on naïve = a genuine out-of-distribution transfer test: only a model that learned the algorithm transfers; one that fit balanced-specific quirks won’t. (Testing on balanced = grading on your own curriculum.)

Neutral fixed yardstick → comparable across all graph/model sizes.

Naïve is easy (low lookahead) → a true searcher should ace it, so rising naïve-loss = failing even the easy natural case = strong evidence. (Fig 2: balanced→naïve ≈ 0.99 when converged ⇒ transfer is achievable.)

🖼️ Scaffold vs target (why train ≠ test distribution)
TRAIN ─► balanced distribution    (engineered SCAFFOLD: uniform L, shortcuts removed)
           │  its only job: force real search to be learned
           ▼
TEST  ─► naïve distribution       (the natural TARGET; easy, low-L)
           = out-of-distribution transfer test
Only a model that learned the algorithm transfers; a shortcut-fitter doesn’t. Testing on balanced would be grading on your own curriculum. Naïve being easy ⇒ failing it is damning.

Max’s answer:
Result:

Q19 — Mechanism · 🔴 Hard ⚠️

Question: Lookahead is defined via “ruling out distractors.” Does the model rule out branches by iterating through them? Where does the choice of first vertex actually happen?

Answer

“Ruling out” is the metric’s mental model (a sequential searcher / DFS), not the model’s mechanism. The transformer does not iterate branch-by-branch.

In one parallel forward pass, it computes each vertex’s reachable set. The answer = the child of the start whose reachable set contains the goal; children whose sets lack the goal get ~zero probability (“ruled out”).

Where the choice lives: at the start vertex’s out-edges — pick the immediate child that reaches the goal. Search depth only labels each child “leads to goal / dead end.”

L vs model: L is computed offline from the graph (model never sees L); ruling-out is done from the graph (which it does see). No contradiction.

⚠️ Don’t describe the model as “exploring paths and backtracking” — that’s DFS, not the parallel path-merging transformer.

🖼️ "Ruling out" (the metric) vs the model (the mechanism)
Start S has children {A✓, K✗, M✗}; the goal is reachable only via A.
METRIC's mental model (DFS, serial)     MODEL (path-merging, parallel)
  explore K → dead end → discard          compute, ALL AT ONCE:
  explore M → dead end → discard            reach(A) ∋ goal ✓
  A remains → output A                       reach(K) ∌ goal ✗
  (iterative, ~L steps)                      reach(M) ∌ goal ✗
                                           → output the child whose set ∋ goal = A
                                           (1 forward pass, ~log₂L layers)
The choice always lives at the start’s out-edges: pick the child whose reachable set contains the goal. The model never iterates/backtracks.

Max’s answer:
Result:

Q20 — Robustness · 🔴 Hard

Question: The main model is effectively encoder-only with 1-hot concatenated positions. What is the decoder-only + RoPE experiment, and what does it establish?

Answer

It re-runs the scaling experiments with the modern-LLM recipe: decoder-only (causal), learned token embeddings, and RoPE — which injects position by rotating Q/K vectors (multiplicatively, encoding relative position) instead of concatenating a 1-hot.

Purpose: defuse “your conclusions are an artifact of a weird non-standard architecture.” Result (Figs 11, 12, §A.5): same scaling failure.

Establishes: the failure isn’t a quirk of the 1-hot/encoder design — it holds for the architecture real LLMs use.

⚠️ But still tiny from-scratch models → more architecturally representative, not closer to a real LLM in scale (Q10 stands).

🖼️ RoPE: position = a rotation angle
Rotate each token's Q/K vector by angle (position × θ):
   pos 1:  ↗        pos 2:  ↑        pos 3:  ↖
attention score dot(Q_m, K_n) depends on the ANGLE BETWEEN them
   = (m − n)  =  RELATIVE position
this paper’s main model RoPE variant (§A.5)
position absolute, 1-hot rotation angle
injected by concatenation multiplication (rotation)
encodes absolute slot relative distance
attention bidirectional (encoder-like) causal (decoder)

Swapping to the modern recipe (decoder + RoPE + learned embeddings) → same scaling failure ⇒ not an architecture artifact.

	this paper’s main model	RoPE variant (§A.5)
position	absolute, 1-hot	rotation angle
injected by	concatenation	multiplication (rotation)
encodes	absolute slot	relative distance
attention	bidirectional (encoder-like)	causal (decoder)

Max’s answer:
Result:

Q21 — Scaling axis · 🔴 Hard ⚠️

Question: Fig 7 measures “non-embedding parameters.” Define that, and identify a subtle gap in which axis they scaled.

Answer

Non-embedding parameters = total params minus the embedding tables (token + position) = the attention + feed-forward weights. Standard scaling-laws measure, because embedding params scale with vocab/length, not computational capacity.

The gap: Fig 7 scales model dimension d (width) — but path-merging says depth (layers) controls search range (~log₂L layers for lookahead L). So they scaled the axis least relevant to search depth.

⇒ A clean depth-scaling-at-frontier study is missing; “bigger model doesn’t help” is shown for width, ≤60M params.

⚠️ Strong critique fuel (pair with Q10): the negative scaling result didn’t test the theoretically-relevant axis (depth) at scale.

🖼️ They scaled width — but depth is the search axis
  DEPTH (layers)
    ▲   path-merging: lookahead L needs ~log₂L LAYERS   ← the search-range axis
    │
    │   ·········· (held ~fixed)
    │
    └─────────────────────────────►  WIDTH (dimension d)
                                      Fig 7 scaled THIS (0.9M→60.4M params)
non-embedding params = attention + FFN weights (everything except the token/position embedding tables) = the honest “compute size.” Fig 7 grew width ⇒ a clean depth-at-scale test is missing.

Max’s answer:
Result:

Q22 — Training dynamics · 🔴 Hard ⚠️

Question: In Fig 14 (DFS), the biggest model’s test loss dips to ~0.15 then rises to ~0.38 while its train loss keeps falling. Name the phenomenon and explain it under streaming data.

Answer

Train↓ / test↑ = overfitting — but with streaming/infinite data it’s not memorisation, it’s distribution specialisation.

The big model fastest-fits a great-generalising solution (deep test dip ~0.15), then continued training keeps minimising average loss over the full balanced mix (all backtrack distances) → it over-specialises, and its loss on the narrow backtrack-8 test slice rises back to the pack (~0.38).

Takeaway: the deep dip is not a lasting win → bigger ≠ better, consistent with “scaling doesn’t help.”

⚠️ Trap: “infinite data ⇒ no overfitting” is false — you can still overfit the training distribution relative to a specific test slice.

🖼️ The overfitting signature (Fig 14, 47.2M model)
loss
     │ train ╲________________________   keeps DROPPING (to ~0.45)
 0.4 │ test  ╲____           _________   ← rises back to the pack (~0.38)
     │           ╲_________╱
0.15 │                ↑ test DIPS deepest (~0.15) at ~2×10⁷ examples
     └──────────────────────────────────► #examples
       train ↓  +  test ↑   =   OVERFITTING
Streamed/infinite data ⇒ not memorisation but distribution specialisation: it over-fits the full balanced mix (all backtrack distances) at the expense of the narrow backtrack-8 test slice. The deep dip is not a lasting win.

Max’s answer:
Result:

Q23 — Selection-inference · 🔴 Hard ⚠️

Question: In selection-inference, transformers do well on selection but poorly on inference for large graphs. Define both subtasks and explain why the asymmetry is notable.

Answer

Selection: given graph + visited vertices, pick a visited vertex that still has an unvisited child (choose a frontier node).

Inference: given that vertex, predict an unvisited child (take one step out).

Asymmetry (§A.10): selection (recognition/lookup over the visited set) stays accurate; inference (committing to the next edge) fails on large graphs — the hard part is the actual search step, not the bookkeeping.

Why notable: even when search is decomposed into the simplest steps, the step that requires looking ahead is where it breaks → the difficulty is search itself, not task framing.

🖼️ One selection-inference step
visited: {S, a}     graph: S→a, S→b, a→c, a→d   (c, d still unvisited)

① SELECTION : "which visited vertex has an unvisited child?"  → a      ✓ done well
② INFERENCE : "given a, name an unvisited child"              → c or d ✗ fails on big graphs
Selection = bookkeeping over the visited set (easy). Inference = committing to the next edge (the real search step) — that’s what breaks at scale.

Max’s answer:
Result:

Q24 — Tokens / attention · 🔴 Hard ⚠️

Question: What is a “token” in the symbolic task, and what does the attention head actually see — the vertex number, or something else? Why does this allow only identity matching?

Answer

A token is a single symbol: a marker (E, Q, P) or a vertex ID (E u v = edge u→v; Q s g; P s). Not a sentence (that’s the NL variant).

The head sees each token’s embedding = one-hot(identity) ⊕ one-hot(position) — not the digit’s value. IDs are permuted, so “4” is an arbitrary label, not a quantity.

⇒ the only usable operation is identity (or position) matching: Qⱼ·Kᵢ is high when the source is what the query seeks. No arithmetic on IDs (4<8 is meaningless).

⚠️ This is why merges are token-matching (match vertex IDs) or position-based — both ride on one-hot overlaps, not magnitudes. (See worked-example_qk-dotproduct_29-05-26.)

🖼️ A token is a symbol, seen as a one-hot (not a number)
vocabulary:  { E , Q , P , 0, 1, 2, …, N }

token "4"  →  embedding = one-hot(identity=4) ⊕ one-hot(position)
           →  [ …1 in slot 4… | …1 in its position slot… ]   (NOT the value 4)

IDs are PERMUTED ⇒ "4" is just a label; "4 < 8" is meaningless
⇒ the head can only do identity / position MATCHING: Qⱼ·Kᵢ is high ⟺ same one-hot slot. No arithmetic on IDs.

Max’s answer:
Result:

Q25 — DFS difficulty · 🔴 Hard

Question: What plays the role of “lookahead” in the DFS task, and where does random padding help — and where doesn’t it?

Answer

Backtrack distance is the DFS analogue of lookahead: how far back you must backtrack from a dead end to an unvisited branch. Trained with uniform backtrack distance (DFS “balanced distribution”).

Random padding (insert padding between visited-trace tokens) lets the model generalise to larger backtrack distances (standard collapses 1.00→0.15; random stays →0.87) — a cheap fix vs the engineered balanced distribution the main task needed.

Where it doesn’t help: larger graphs — padding aids backtrack distance, not graph size. The scaling wall is unaffected.

⚠️ Don’t conflate “generalises to bigger backtrack distance” with “generalises to bigger graphs” — only the former.

🧩 Two different uses of padding in DFS (don't merge them)

① Structural padding (necessary) — keep the graph in fixed positions. The DFS input is edges … Q s g + visited trace, and the trace varies in length (visited 2 vs 10 vertices). Without padding, the edge/trace boundary shifts, so the edges land at different absolute positions each example — bad, because the model uses absolute position embeddings. So (§6.1) they reserve the right-most tokens for the trace and pad the gap, keeping edges in fixed positions (as in the original task). Job: remove a nuisance variable so the model isn’t fighting position-shift on top of search.
② Random padding (optional, §A.9) — a generalisation trick. Insert padding randomly between trace tokens (data augmentation). This lets DFS generalise to larger backtrack distances (standard 1.00→0.15 vs random →0.87) — a cheap fix vs engineering a balanced distribution.
⚠️ #1 makes the task well-posed (position-stable edges); #2 rescues bigger backtrack distances. Neither helps with bigger graphs.

🖼️ Backtrack distance + the padding layout
DFS run:  S → a → b (dead end) → BACKTRACK to S → c → goal
                                  └─ backtrack distance = how far back you jump (DFS's "lookahead")

token layout (standard padding):
  [ E u v  E u v  …  Q s g | P P P … P |  visited trace ]
    └──── EDGES: always same slots ────┘   └ right-most: the trace ┘
Padding keeps the edge block position-stable even though the trace length varies. Random padding (§A.9) instead scatters padding inside the trace → generalises to larger backtrack distances.

Max’s answer:
Result:

🎙️ Oral-exam style — questions in the examiner’s voice

These are how Kühnberger actually asks: open, conversational, often starting easy and then asking “why?” to go deeper. The point isn’t a memorized answer but steering the discussion into your focus topics (Local Search · CSP · ML). Practice saying these out loud. Format: the question as he’d phrase it, then a strategy for the answer + the hooks to leave open.

OE1 — The opener

Examiner: “So, just tell me — what is the paper about?”

How to handle it

Deliver the 60-second version, not everything: research question (why search? why do LLMs fail?) → testbed (DAG connectivity = proof search) → the one positive result (balanced distribution ⇒ learns search) → the learned algorithm (path-merging) → the two negatives (scale, CoT don’t fix it). Then stop and let him pick the thread. ⚠️ Don’t dump all 15 minutes here — leave hooks (“…and interestingly the way they reverse-engineered the algorithm is itself a contribution”).

OE2 — The “why this, specifically?” probe

Examiner: “Why did they pick graph connectivity, of all things? Isn’t that a bit far from real reasoning?”

How to handle it

Agree it looks narrow, then defend: connectivity = proof search in propositional logic (give the vertex↔proposition mapping), so it’s a lower bound — fail here, fail everywhere richer. Then volunteer the limit yourself (subset of logic, no backtracking/uncertainty) — pre-empting his critique signals maturity and you control the next turn. This is the claim-vs-evidence move he’s fishing for.

OE3 — The whiteboard request

Examiner: “Can you quickly sketch or explain the algorithm the network learned?”

How to handle it

Draw a chain 1→2→…→9. Say: each vertex stores its reachable set; each layer unions along edges so the set doubles (show 1 reaches {2}, then {2,3,4}, then up to {…,8}). Conclude: exponential in layers ⇒ ~log-depth, all vertices in parallel = transitive closure. Then contrast with BFS (serial, linear) — that’s your bridge into the Search cluster. Mention non-maximal in practice so you sound precise, not rehearsed.

OE4 — The connect-to-the-lecture probe

Examiner: “You chose Search as a focus topic. How does this fit with what we covered on search in the lecture?”

How to handle it

This is the money question — score here. Two layers: (1) the task is classical search → compare path-merging to BFS/DFS/A* (completeness, guarantees, serial vs parallel). (2) the training is local search → SGD = hill climbing in weight space; seed variance = random restarts, curriculum = annealing. Then pivot to neuro-symbolic: classical search is provably correct and scales, learned search only approximates ⇒ hybrid systems. (Kühnberger’s home turf — he’ll like this.)

OE5 — The “does this matter in practice?” probe

Examiner: “But ChatGPT and the like can obviously search and plan today. Doesn’t that make the paper obsolete?”

How to handle it

Flip it: it confirms the paper. Modern models externalize search — RL-trained long chain-of-thought (o1/o3, DeepSeek-R1, Gemini Deep Think, Claude extended thinking), explicit tree/MCTS search (Tree-of-Thoughts; AlphaProof/AlphaGeometry at the IMO), and tool use / neuro-symbolic (LLM-modulo). The single forward pass still doesn’t robustly search — exactly the paper’s point. The field added test-time compute + structure, not just parameters. (See Q13.)

OE6 — The critical-thinking probe

Examiner: “What do you make of the paper yourself? Where would you push back critically?”

How to handle it

Lead with the strongest, fair critique: learnability ≠ expressivity — title implies an architectural limit, evidence shows a training difficulty on tiny models; Merrill & Sabharwal say CoT-transformers are Turing-complete. Have the rebuttal + counter ready (Q10). Then balance it: the mechanistic method and the distribution-sensitivity finding are genuine, solid contributions. Fair, calibrated, with a counter — that’s the grade.

OE7 — The breadth check (leaves the paper)

Examiner: “Let’s go general for a moment: you mentioned A*. What makes a heuristic ‘good’ there? And how does that differ from the ‘shortcuts’ the network learns?”

How to handle it

Admissibility (never overestimates the true cost) ⇒ A* stays optimal; consistency ⇒ efficient. Contrast: a heuristic is a principled, guarantee-preserving estimate; the network’s shortcuts are statistical regularities of the data that are right on easy graphs and break on hard ones. The balanced distribution is precisely an attempt to make shortcuts useless. (Bridges paper → your Search cluster basics.)

OE8 — The ML-cluster pivot

Examiner: “This ‘shortcut learning’ — does it relate to machine-learning concepts we covered?”

How to handle it

Yes: inductive bias / spurious correlations / overfitting to the training distribution. The naïve distribution lets the model fit the data’s quirks (small lookahead) instead of the task — a generalization failure. Tie to bias–variance (huge seed variance on large graphs) and to why you need representative/curated data. Optional bridge: this is why decorrelation (RF) and regularization matter — controlling what the model latches onto.

OE9 — The forward-looking close

Examiner: “If this were your project — what would you do next?”

How to handle it

Four concrete, paper-grounded moves: (1) medium-sized models to see if the negative trend bends (bridge the extrapolation gap); (2) probe real pretrained LLMs for path-merging-like circuits (scalable proxy for the method); (3) try the authors’ own remedies — curriculum learning and looped/recurrent transformers; (4) a theoretical loss-landscape account of why SGD struggles here. Shows you can think past the paper.

OE10 — The “define these quickly” rapid-fire

Examiner: (quick volley) “What is lookahead? — What is activation patching? — Why no causal mask? — What are FLOPs?”

How to handle it

One crisp sentence each — no waffling:

Lookahead: how far you must search forward before safely committing to the first move; L = min{|P|, maxᵢ|Sᵢ|}.

Activation patching: perturb an input feature / activation, re-run, and watch the output change to find what’s causally used.

No causal mask: edges are unordered, so a relevant edge can sit anywhere → every token must attend to every other.

FLOPs: floating-point operations, the unit of compute; one forward pass ≈ 2·params·tokens.

Score

Hard quiz (Q1–Q25): __ / 25 (all 🔴 Hard; Q16–Q25 added from the deep-dive session)
Oral-exam block (OE1–OE10): practiced out loud? ☐ once ☐ twice ☐ fluent
By cluster — Setup/Lookahead: __/2 · Distributions: __/2 · Mech-interp & path-merging: __/4 · Scaling/CoT/NL: __/3 · Critique/context/search-link: __/4

Re-Quiz

Note misses here and revisit before Mon 1 Jun:

Brain Online

Explorer

quiz_paper-transformers-search_28-05-26

Q1 — Setup · 🔴 Hard ⚠️

Q2 — Lookahead · 🔴 Hard ⚠️

Q3 — Distributions · 🔴 Hard ⚠️

Q4 — Distributions · 🔴 Hard

Q5 — Mechanistic method · 🔴 Hard ⚠️

Q6 — Mechanistic method · 🔴 Hard

Q7 — Path-merging · 🔴 Hard ⚠️

Q8 — Path-merging · 🔴 Hard ⚠️

Q9 — Scaling · 🔴 Hard ⚠️

Q10 — Critique · 🔴 Hard ⚠️

Q11 — Chain-of-Thought · 🔴 Hard

Q12 — NL experiment · 🔴 Hard ⚠️

Q13 — Context · 🔴 Hard ⚠️

Q14 — Search connection · 🔴 Hard ⚠️

Q15 — Method limits · 🔴 Hard

Q16 — Attention / architecture · 🔴 Hard ⚠️

Q17 — Training data · 🔴 Hard ⚠️

Q18 — Experimental design · 🔴 Hard ⚠️

Q19 — Mechanism · 🔴 Hard ⚠️

Q20 — Robustness · 🔴 Hard

Q21 — Scaling axis · 🔴 Hard ⚠️

Q22 — Training dynamics · 🔴 Hard ⚠️

Q23 — Selection-inference · 🔴 Hard ⚠️

Q24 — Tokens / attention · 🔴 Hard ⚠️

Q25 — DFS difficulty · 🔴 Hard

🎙️ Oral-exam style — questions in the examiner’s voice

OE1 — The opener

OE2 — The “why this, specifically?” probe

OE3 — The whiteboard request

OE4 — The connect-to-the-lecture probe

OE5 — The “does this matter in practice?” probe

OE6 — The critical-thinking probe

OE7 — The breadth check (leaves the paper)

OE8 — The ML-cluster pivot

OE9 — The forward-looking close

OE10 — The “define these quickly” rapid-fire

Score

Note misses here and revisit before Mon 1 Jun:

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis