ai-generated methods-of-ai exam-prep
The official ICLR 2025 peer reviews of the exam paper.
Venue: ICLR 2025 · Decision: Accept (Poster) · 4 reviewers + Area Chair meta-review.
Source: OpenReview forum 9cQB1Hwrtw. Quotes are verbatim from the reviews; emphasis added.
Why this file matters for the oral: several reviewer criticisms independently corroborate the critiques in pruefung_paper-transformers-search_25-05-26 — and the rebuttal gives you the counter-arguments. Being able to say “even the ICLR reviewers flagged X, and the authors answered with Y” is a strong, well-calibrated move.
Companion
pruefung_paper-transformers-search_25-05-26 (dossier) · quiz_paper-transformers-search_28-05-26 (hard quiz) · Methods of AI Lecture
Scores at a glance
| Reviewer | Rating | Confidence | Soundness | Presentation | Contribution | Final stance |
|---|---|---|---|---|---|---|
| zLV2 | 8 | 3 | 3 | 3 | 3 | raised to 8 — “a strong paper” |
| wdgg | 8 | 3 | 3 | 3 | 3 | raised; “close to 7… happy to see it at ICLR” |
| fM1a | 6 | 3 | 2 | 3 | 3 | raised after rebuttal (was lower) |
| NMwf | 5 | 3 | 2 | 1 | 2 | below-accept; stopped engaging → AC down-weighted |
(ICLR scale: 8 = accept/good paper, 6 = marginally above threshold, 5 = marginally below, 3 = reject. Confidence 3 = “fairly confident.”)
Decision — Area Chair meta-review
“All the reviewers appreciated the paper’s contribution in terms of methodology and empirical findings… The main weakness noted by the reviewers pertained to presentation and writing… reviewer fM1a raised concerns about a potential data leakage, which seems to have been resolved… I recommend accepting.” Decision: Accept (Poster).
On the dissenting reviewer: “only one reviewer (NMwf) had a below-acceptance score. However, that reviewer stopped engaging after posting their review… so I have downweighed that score.”
The recurring criticisms (these are the exam-relevant ones)
1. ⚠️ “Decoder-only” vs. encoder-only — a flat-out architectural contradiction (fM1a)
The original manuscript called the models “decoder-only” while using bidirectional (full) attention — which is contradictory, since decoder-only models are defined by causal masking. The authors had to change the wording from “decoder-only transformers” to just “transformers.”
Reviewer fM1a (post-rebuttal)
“I appreciate that the authors have revised their description from ‘decoder-only transformers’ to simply ‘transformers’ — a notable correction given that their original manuscript described using decoder-only models with bidirectional attention masks. As this represents a basic architectural contradiction in transformer design (decoder-only models, by definition, use causal masking, not bidirectional attention).”
→ fM1a’s reading: “since the authors used full attention rather than causal attention, they actually trained an encoder-only model rather than a decoder-only model.”
For your exam: this is the precise, citable version of your Tier-2 critique “the interpretability-friendly architecture is not a real transformer.” The no-causal-mask choice isn’t just unusual: a reviewer argued it makes the model an encoder, not the decoder-only architecture that LLMs actually use.
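The distinction fM1a draws comes down to the attention mask. The following is a minimal sketch in plain Python, not code from the paper; the function names are hypothetical. It builds both masks for a short sequence so the difference is visible at a glance.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Bidirectional (full) attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def sees_future(mask):
    """True if any position is allowed to attend to a later position."""
    n = len(mask)
    return any(mask[i][j] for i in range(n) for j in range(i + 1, n))


def render(mask):
    """Draw the mask as a grid: 'x' = attention allowed, '.' = masked out."""
    return "\n".join(" ".join("x" if allowed else "." for allowed in row) for row in mask)


print("causal (decoder-style):")
print(render(causal_mask(4)))
print("full (encoder-style):")
print(render(full_mask(4)))
```

The causal mask is lower-triangular, so no token ever looks ahead; the full mask has no such restriction, which is the property fM1a uses to classify the model as encoder-only.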
2. Toy architecture → does it generalize to real LLMs? (fM1a, wdgg)
1-hot embeddings, concatenated positional embeddings, single forward-pass output (no reasoning steps). fM1a: these “might not capture the complexity of modern embedding approaches” and “may limit the generalizability of the findings to contemporary LLMs.”
The DEFENCE — Reviewer wdgg (defending the authors to fM1a)
“Most of the existing studies on mechanistic interpretability… use some simplified transformer variants… While it’s desirable to study an architecture that mimics LLMs, the current progress should allow for mediated design choices… it can often be shown WLOG that this concatenation can be converted to the typical additive structure with 1 additional layer. Since we still understand very little about transformer interpretability, I think it’s reasonable to operate with mediated expectations and study stylized problems.”
For your exam — this is the key dialectic: the toy-model critique (yours and fM1a’s) is real, but the standard counter is “stylized setups are normal in mech-interp theory, and the simplifications are WLOG / removable with one extra layer.” Hold both sides.
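One easy way to see why concatenated and additive positional embeddings are closely related: concatenating a token vector and a position vector gives the same result as adding two zero-padded vectors that occupy disjoint blocks. The sketch below illustrates only this easy direction, using 1-hot vectors and hypothetical function names; it does not reproduce wdgg’s stronger claim about conversion with one additional layer.

```python
def one_hot(index, size):
    """Return a 1-hot list of the given size with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def concat_embedding(token, position, vocab_size, max_len):
    """Concatenate the token 1-hot and the position 1-hot."""
    return one_hot(token, vocab_size) + one_hot(position, max_len)


def additive_embedding(token, position, vocab_size, max_len):
    """Add a token vector and a position vector that live in disjoint blocks."""
    tok = one_hot(token, vocab_size) + [0.0] * max_len      # token fills the first block
    pos = [0.0] * vocab_size + one_hot(position, max_len)   # position fills the second block
    return [t + p for t, p in zip(tok, pos)]


# Hypothetical sizes, chosen only for illustration.
VOCAB, MAX_LEN = 5, 3
same = all(
    concat_embedding(t, p, VOCAB, MAX_LEN) == additive_embedding(t, p, VOCAB, MAX_LEN)
    for t in range(VOCAB)
    for p in range(MAX_LEN)
)
print(same)
```

Learned additive embeddings in a real LLM share the same dimensions rather than using disjoint blocks, which is why the general equivalence needs more argument than this sketch gives.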
3. ⚠️ Data leakage between train and test (fM1a) — “resolved”, but worth knowing
Reviewer fM1a
“While the authors claim they will remove overlapping samples… they don’t explain how they compare whether two graphs are identical. If only using string matching, it cannot determine whether two graphs are completely equal (e.g. 1→2 and 3→4 are equivalent up to relabeling)… the test set is likely included in the training set. Additionally, given the number of vertices and max edges, DAG generation is finite — so the ‘infinite graph generation’ claim may be incorrect.”
The AC considered this resolved by the rebuttal. Nuance for you: the paper permutes vertex IDs and filters by exact match; fM1a’s point is that graph isomorphism ≠ string identity, so some test graphs could recur in training up to relabeling. A sharp discussion point about the “limitless data” framing.
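fM1a’s example can be made concrete. The sketch below is not the authors’ deduplication code; it is a hypothetical illustration that compares a naive string key against a brute-force isomorphism check, which is only feasible for very small graphs. Graphs are given as lists of directed edges, and isolated vertices are ignored for simplicity.

```python
from itertools import permutations


def string_key(edges):
    """Naive, label-sensitive key: the sorted edge list rendered as text."""
    return ",".join(f"{u}->{v}" for u, v in sorted(edges))


def is_isomorphic(edges_a, edges_b):
    """Brute-force test: try every relabeling of A's vertices onto B's vertices."""
    nodes_a = sorted({x for edge in edges_a for x in edge})
    nodes_b = sorted({x for edge in edges_b for x in edge})
    if len(nodes_a) != len(nodes_b) or len(set(edges_a)) != len(set(edges_b)):
        return False
    target = set(edges_b)
    for perm in permutations(nodes_b):
        relabel = dict(zip(nodes_a, perm))
        if {(relabel[u], relabel[v]) for u, v in edges_a} == target:
            return True
    return False


g_train = [(1, 2)]  # the reviewer's example: 1→2
g_test = [(3, 4)]   # ... and 3→4

print(string_key(g_train) == string_key(g_test))  # string matching sees two different graphs
print(is_isomorphic(g_train, g_test))             # relabeling 1→3, 2→4 makes them identical
```

A string-based filter keeps both graphs as distinct samples even though they have the same structure, which is the gap fM1a is pointing at.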
4. The interpretability method has two soft spots
- Freezing earlier layers is an oversimplification (fM1a): “modified tokens also influence previous attention calculation,” so freezing “may oversimplify the intricate dynamics… modifications can propagate through the entire network.” (The authors’ defence: freezing is deliberate — it isolates the effect of this operation; see dossier §3.)
- It doesn’t scale (fM1a, zLV2): the analysis costs L·n²·m·F forward passes (L = layers, n = input length, which enters quadratically, m = examples, F = perturbed features).
Authors' response (to zLV2)
“The number of forward passes is L·n²·m·F… the method as presented is not practically applicable to very large models, but we do demonstrate its applicability and utility to our trained models.”
For your exam: matches your Tier-3 critique #9 — the one tool that could check whether real LLMs path-merge can’t be run on them.
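To get a feel for how the L·n²·m·F count grows, here is a small calculation. All numbers below are hypothetical and chosen only for illustration; none come from the paper or the reviews.

```python
def forward_passes(num_layers, input_length, num_examples, num_features):
    """Forward-pass count L * n^2 * m * F as quoted in the authors' response."""
    return num_layers * input_length ** 2 * num_examples * num_features


# Hypothetical small research model vs. a hypothetical large model.
small = forward_passes(num_layers=6, input_length=128, num_examples=100, num_features=10)
large = forward_passes(num_layers=80, input_length=4096, num_examples=100, num_features=10)

print(f"small: {small:,} forward passes")
print(f"large: {large:,} forward passes")
```

The quadratic term in the input length dominates: with the same number of examples and features, the count rises from roughly 10⁸ to roughly 10¹², and each forward pass of the larger model is itself far more expensive.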
5. Relation to prior graph-reasoning work is underspecified (wdgg)
Reviewer wdgg
“Several prior works have found that large-language models can implement certain graph algorithms, including graph connectivity, and that this… can be improved with appropriate adaptations of chain-of-thought. It is unclear whether the authors’ findings contradict, confirm, or offer more nuanced insights to prior works… the paper… is somewhat lacking when situating itself in the LLM planning/search and theoretical expressivity literature.”
wdgg explicitly points at Merrill & Sabharwal, The Expressive Power of Transformers with CoT (your #1 reference!) and at chess-search / in-context-exploration work — i.e. the same literature your “further reading” list covers.
6. Logical flow / claims vs. evidence (NMwf — the dissenter)
NMwf (Presentation = 1) felt claims outran evidence:
Reviewer NMwf
“in line 51, the authors state… ‘transformers can indeed be taught to search, but only under fairly restrictive conditions on the training distribution.’ However, Figure 3 does not fully support this claim… it does not substantiate any firm conclusions about the training distribution itself.”
NMwf also pressed on under-specified method details (how many examples? was the “explainable path” inspected manually?). The AC noted that the reviewer stopped engaging and down-weighted the score, but the “claim vs. evidence” instinct is the same lens as your Tier-1 critiques.
What the reviewers praised (so you stay fair, not just critical)
- zLV2: “a significant step towards our understanding of the inner mechanisms of transformer models”; the message-passing / exponential path-merging discovery is “an important contribution to the academic community.”
- wdgg: “Combining mechanistic interpretability with the search problem is, to my knowledge, novel… This new setting also prompts the author to introduce a new algorithm for mechanistic analysis, which may be of interest to the interpretability community.” Also liked the “exposition style presentation… each section introducing problem settings of growing complexity.”
- NMwf (even while critical): “an intriguing and practical research question… not only scientifically interesting but also has meaningful implications.”
- Meta-review: “The core findings… (e.g. the mechanism by which the models perform search over graphs) are… novel” and “experiments are well-designed and provide strong evidence.”
How to use this in the oral (the payoff)
If Kühnberger asks “what did the reviewers / community think?” or “where’s the paper weakest?”, you can say:
“It was accepted at ICLR 2025 as a poster — scores 8/8/6/5. Reviewers loved the novelty of combining mechanistic interpretability with search and the path-merging discovery. The substantive criticisms were exactly the architecture-realism ones: one reviewer pointed out the model uses full attention, so it’s effectively encoder-only, not the decoder-only LLMs we care about, and that the method doesn’t scale to real models (L·n²·m·F passes). There was also a sharp data-leakage point — string-matching can’t catch isomorphic graphs — though the AC considered it resolved. Tellingly, another reviewer defended the simplifications as standard practice in mechanistic-interpretability theory (often WLOG). So the community verdict matches my own read: the method and the distribution-sensitivity finding are solid; the extrapolation to frontier LLMs is the contestable part.”
That single paragraph shows you read the paper and its reception, and that you can weigh a critique against its rebuttal — exactly the calibrated judgement the oral rewards.