ai-generated methods-of-ai exam-prep

Hard quiz on the exam paper for the oral module exam (Mon 1 Jun 2026).

Every question is 🔴 Hard (why / compare / derive / apply / defend). ⚠️ = a point examiners love to push on or that’s easy to overstate.
Answers hidden in collapsible callouts. Fill in your answer + result during the session — then re-quiz the misses 2 days later.

Companion

pruefung_paper-transformers-search_25-05-26 (full dossier — anatomy §1–§7, present script, critique, context) · Quiz Exam Schwerpunkte · Methods of AI Lecture


Q1 — Setup · 🔴 Hard ⚠️

Question: The paper argues results on a toy DAG transfer to “reasoning.” State the exact equivalence it relies on, and then state the strongest limit of that equivalence (i.e. where a critic should push).

Max’s answer: Finding an answer means chaining the correct implications. A simple graph of implications tests whether an LLM can follow implications that are trivial for humans. Because the graph is always unidirectional, and because the balanced graphs offer no shortcuts, the LLM has to find the correct path.
Critics also argue that following implications is not the only way to reason: there are also backtracking, deductive chaining, and heuristics.
Result:

Q2 — Lookahead · 🔴 Hard ⚠️

Question: Write the formal definition of lookahead L, explain the role of the min, and apply it: a start vertex has the true path of length 3 to the goal, plus one disjoint distractor branch of length 5 and one of length 2. What is L?

Max’s answer: L is the maximum path length the algorithm needs to look ahead from the start to decide which path will certainly lead to an answer. P is the path; S are the other pathways from the starting point.
It takes the minimal P, so the shortest pathway, and rules out

Result:

Q3 — Distributions · 🔴 Hard ⚠️

Question: The naïve-trained model’s accuracy collapses at higher lookaheads (Fig 2 — accuracy; note Fig 5 is a different metric, “proportion explained by path-merging”) — yet this is on lookaheads it actually saw during (limitless) training. Explain precisely why “it never saw those L” is the wrong explanation, and what the right one is.

Max’s answer: The naïve-trained model was trained on the naïve distribution, which has shortcuts and uneven paths, so L varied widely. The model may therefore have learned to handle only a few lookaheads, because in many training runs it could find the goal much faster.
Result:

Q4 — Distributions · 🔴 Hard

Question: The paper’s headline reads as an architectural claim, yet the positive result depends on the hand-crafted balanced distribution. Explain why this is an internal tension, and which side the evidence actually favours.

Max’s answer: First, they claim it is an architectural issue, since they looked at more layers than the algorithm would need. But the only search they obtained is based on a hand-crafted balanced distribution, which is not observable in real life. That makes it a theoretical environment, and real-world search depends on more than balanced implications: backtracking, heuristics, etc.
Result:

Q5 — Mechanistic method · 🔴 Hard ⚠️

Question: In identifying important attention operations, the authors perturb each weight both to 0 and to the row’s maximum (renormalizing). Why is perturbing upward essential — what would a 0-only test miss?

Max’s answer:
Result:
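
To make the perturbation in Q5 concrete, here is a minimal sketch of what "set one weight to 0 or to the row's maximum, then renormalize" could look like on a single attention row. The function name `perturb_attention_row` and the example row are hypothetical; this illustrates the operation as worded in the question, not the authors' code.

```python
def perturb_attention_row(row, index, mode):
    """Return a copy of one attention row with row[index] perturbed,
    renormalized so the weights sum to 1 again.

    mode="zero": set the chosen weight to 0.
    mode="max":  set the chosen weight to the row's current maximum.
    (Hypothetical helper, written only for this illustration.)
    """
    perturbed = list(row)
    if mode == "zero":
        perturbed[index] = 0.0
    elif mode == "max":
        perturbed[index] = max(row)
    else:
        raise ValueError("mode must be 'zero' or 'max'")
    total = sum(perturbed)
    if total == 0.0:
        raise ValueError("no attention mass left to renormalize")
    return [w / total for w in perturbed]


# Made-up attention row in which the last weight is already 0.
row = [0.70, 0.25, 0.05, 0.00]

zeroed = perturb_attention_row(row, 3, "zero")  # row is unchanged
raised = perturb_attention_row(row, 3, "max")   # last weight becomes roughly 0.41
```

In this toy row the zero perturbation changes nothing, because the weight was already 0, while the upward perturbation shifts a large share of attention onto that position. One plausible reading is that only the upward test can reveal whether the output depends on that weight staying small.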

Q6 — Mechanistic method · 🔴 Hard

Question: In Step IV the authors compute two perturbed dot products, Q̃ⱼKᵢᵀ and QⱼK̃ᵢᵀ. Explain what each one isolates and how this lets them attribute a feature to the source vs. the target embedding.

Max’s answer:
Result:

Q7 — Path-merging · 🔴 Hard ⚠️

Question: Why does the learned algorithm search a number of vertices exponential in the number of layers (equivalently: only ~log-many layers are needed)? Where exactly does the doubling come from — and what does this correspond to classically?

Max’s answer:
Result:
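
A possible classical analogue for the doubling asked about in Q7 is reachability by repeated merging (pointer doubling, or equivalently repeated squaring of the adjacency relation). The sketch below is an illustration under that assumption, not the circuit extracted in the paper. The function name `rounds_until_reached` and the path-graph setup are hypothetical, and how a "round" maps onto a transformer layer is left open.

```python
def rounds_until_reached(num_vertices):
    """Count merge rounds until vertex 0 knows it can reach the last vertex
    of the path graph 0 -> 1 -> ... -> num_vertices - 1.

    reach[v] is the set of vertices currently known to be reachable from v
    (v itself included). In one round, every vertex simultaneously merges in
    the sets of all vertices it already reaches.
    """
    n = num_vertices
    reach = {v: {v} for v in range(n)}
    for v in range(n - 1):
        reach[v].add(v + 1)  # start with the direct edges only

    rounds = 0
    while (n - 1) not in reach[0]:
        # Synchronous update: the new dict is built entirely from the old one.
        reach = {v: set().union(*(reach[u] for u in reach[v])) for v in reach}
        rounds += 1
    return rounds


print(rounds_until_reached(9))   # 3: distance 8 is covered after 3 doublings
print(rounds_until_reached(17))  # 4: distance 16 is covered after 4 doublings
```

Each round doubles the distance every vertex can see (1, 2, 4, 8, ...), so a path of length d needs about log2(d) rounds rather than d.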

Q8 — Path-merging · 🔴 Hard ⚠️

Question: The model learns a non-maximal version of path-merging. Define “maximal,” cite the evidence that it isn’t, and explain how this non-maximality explains the generalization ceiling (L≤12 trained → works at 13, 14, then stops).

Max’s answer:
Result:

Q9 — Scaling · 🔴 Hard ⚠️

Question: “Scaling doesn’t help” rests on two distinct experiments. State each precisely (what’s fixed, what varies, what’s measured) and what each shows. Why is “bigger is faster at being wrong” the right summary of the second?

Max’s answer:
Result:

Q10 — Critique · 🔴 Hard ⚠️

Question: Give the single strongest critique of the paper, the rebuttal an examiner will fire back, and your counter to that rebuttal. Be precise about the claim/evidence gap.

Max’s answer:
Result:

Q11 — Chain-of-Thought · 🔴 Hard

Question: Chain-of-thought (DFS, selection-inference) does change something and doesn’t change something else. State both precisely, including the role of layers and FLOPs.

Max’s answer:
Result:

Q12 — NL experiment · 🔴 Hard ⚠️

Question: The natural-language (proof-search) experiment is often misremembered. What does it rule out, what does it fail to show, and why does difficulty grow “especially in FLOPs” there?

Max’s answer:
Result:

Q13 — Context · 🔴 Hard ⚠️

Question: Discussion bait: “But o3, Gemini and Claude clearly search now — doesn’t that refute the paper?” Argue why it confirms the paper instead, with concrete mechanisms.

Max’s answer:
Result:

Q14 — Search connection · 🔴 Hard ⚠️

Question: Is this paper about local search or classical search? Both answers are partly right — disentangle them, and map the paper’s proposed fixes onto your Local-Search lecture.

Max’s answer:
Result:

Q15 — Method limits · 🔴 Hard

Question: The mechanistic-interpretability method is the paper’s most durable contribution — yet it carries a self-undermining limitation for the headline. What is it, and why does it bite exactly where the paper wants to generalize?

Max’s answer:
Result:

Q16 — Attention / architecture · 🔴 Hard ⚠️

Question: The balanced graphs are built left-to-right (topological order) with edges all pointing right — yet the model uses full (bidirectional) attention, no causal mask. Why isn’t causal attention enough, given the graph’s left-to-right structure?

Max’s answer:
Result:

Q17 — Training data · 🔴 Hard ⚠️

Question: Is the training set a fixed collection of examples? Explain how the data is formed and why it matters for the “it’s not lack of data” argument.

Max’s answer:
Result:

Q18 — Experimental design · 🔴 Hard ⚠️

Question: They train on balanced but measure test loss on naïve. Why test on a different distribution than they trained on?

Max’s answer:
Result:

Q19 — Mechanism · 🔴 Hard ⚠️

Question: Lookahead is defined via “ruling out distractors.” Does the model rule out branches by iterating through them? Where does the choice of first vertex actually happen?

Max’s answer:
Result:

Q20 — Robustness · 🔴 Hard

Question: The main model is effectively encoder-only with 1-hot concatenated positions. What is the decoder-only + RoPE experiment, and what does it establish?

Max’s answer:
Result:

Q21 — Scaling axis · 🔴 Hard ⚠️

Question: Fig 7 measures “non-embedding parameters.” Define that, and identify a subtle gap in which axis they scaled.

Max’s answer:
Result:
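
As a rough aid for Q21, the sketch below counts non-embedding parameters for a generic transformer block: the four attention projections plus the two feed-forward matrices, ignoring biases and layer norms. The default `d_ff = 4 * d_model` and the function name `non_embedding_params` are assumptions made for illustration; the paper's exact architecture may differ, so treat the numbers as an order-of-magnitude guide only.

```python
def non_embedding_params(n_layers, d_model, d_ff=None):
    """Approximate non-embedding parameter count of a transformer stack.

    Counts only the weight matrices inside the layers (no token or position
    embeddings, no biases, no layer norms). Assumes d_ff = 4 * d_model when
    d_ff is not given, which is a common convention and not taken from the paper.
    """
    if d_ff is None:
        d_ff = 4 * d_model
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * d_ff  # up- and down-projection
    return n_layers * (attention + feed_forward)


base = non_embedding_params(n_layers=6, d_model=512)
deeper = non_embedding_params(n_layers=12, d_model=512)
wider = non_embedding_params(n_layers=6, d_model=1024)

print(base)            # 18874368, i.e. 12 * n_layers * d_model**2
print(deeper // base)  # 2: doubling depth doubles the count
print(wider // base)   # 4: doubling width quadruples it
```

The same parameter budget can therefore be reached by scaling depth or by scaling width, which is why it matters which axis a scaling experiment actually varies.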

Q22 — Training dynamics · 🔴 Hard ⚠️

Question: In Fig 14 (DFS), the biggest model’s test loss dips to ~0.15 then rises to ~0.38 while its train loss keeps falling. Name the phenomenon and explain it under streaming data.

Max’s answer:
Result:

Q23 — Selection-inference · 🔴 Hard ⚠️

Question: In selection-inference, transformers do well on selection but poorly on inference for large graphs. Define both subtasks and explain why the asymmetry is notable.

Max’s answer:
Result:

Q24 — Tokens / attention · 🔴 Hard ⚠️

Question: What is a “token” in the symbolic task, and what does the attention head actually see — the vertex number, or something else? Why does this allow only identity matching?

Max’s answer:
Result:

Q25 — DFS difficulty · 🔴 Hard

Question: What plays the role of “lookahead” in the DFS task, and where does random padding help — and where doesn’t it?

Max’s answer:
Result:


🎙️ Oral-exam style — questions in the examiner’s voice

This is how Kühnberger actually asks: open, conversational, often starting easy and then asking “why?” to go deeper. The point isn’t a memorized answer but steering the discussion into your focus topics (Local Search · CSP · ML). Practice saying these out loud. Format: the question as he’d phrase it, then a strategy for the answer + the hooks to leave open.

OE1 — The opener

Examiner: “So, just tell me — what is the paper about?”

OE2 — The “why this, specifically?” probe

Examiner: “Why did they pick graph connectivity, of all things? Isn’t that a bit far from real reasoning?”

OE3 — The whiteboard request

Examiner: “Can you quickly sketch or explain the algorithm the network learned?”

OE4 — The connect-to-the-lecture probe

Examiner: “You chose Search as a focus topic. How does this fit with what we covered on search in the lecture?”

OE5 — The “does this matter in practice?” probe

Examiner: “But ChatGPT and the like can obviously search and plan today. Doesn’t that make the paper obsolete?”

OE6 — The critical-thinking probe

Examiner: “What do you make of the paper yourself? Where would you push back critically?”

OE7 — The breadth check (leaves the paper)

Examiner: “Let’s go general for a moment: you mentioned A*. What makes a heuristic ‘good’ there? And how does that differ from the ‘shortcuts’ the network learns?”

OE8 — The ML-cluster pivot

Examiner: “This ‘shortcut learning’ — does it relate to machine-learning concepts we covered?”

OE9 — The forward-looking close

Examiner: “If this were your project — what would you do next?”

OE10 — The “define these quickly” rapid-fire

Examiner: (quick volley) “What is lookahead? — What is activation patching? — Why no causal mask? — What are FLOPs?”


Score

  • Hard quiz (Q1–Q25): __ / 25 (all 🔴 Hard; Q16–Q25 added from the deep-dive session)
  • Oral-exam block (OE1–OE10): practiced out loud? ☐ once ☐ twice ☐ fluent
  • By cluster (Q1–Q15 only): Setup/Lookahead: __/2 · Distributions: __/2 · Mech-interp & path-merging: __/4 · Scaling/CoT/NL: __/3 · Critique/context/search-link: __/4

Re-Quiz

Note misses here and revisit before Mon 1 Jun: