ai-generated methods-of-ai exam-prep
Companion to quiz_paper-transformers-search_28-05-26 · Q6 (the
Qⱼ·Kᵢdot products) — also underpins Q7 (path-merging).A concrete numeric example of the query–key dot product for this paper’s setup: one-hot identity ⊕ position tokens, and how the dot product selects the merge target. Then how perturbing it (Step IV) attributes a feature to source vs target.
The setup (recall)
Each token’s embedding = one-hot(vertex ID) ⊕ one-hot(position), concatenated. For each token the head builds:
- Query
Qⱼ = W_Q · embⱼ(from the target token j — “what am I looking for?“) - Key
Kᵢ = W_K · embᵢ(from the source token i — “what am I?“) - Value
Vᵢ(the content that gets copied if j attends to i — here: i’s reachable set)
Attention score = Qⱼ · Kᵢᵀ (scaled), then softmax over all i. High score ⇒ j pulls in i’s value.
The merge we’ll model
Edge 3→6 exists, so vertex 3 can reach 6. Vertex 3 (target j) wants to absorb vertex 6’s reachable set → it must attend to the token for vertex 6 (source i). Let’s watch the dot product pick 6 out of {2, 3, 6}.
Step 1 — the embeddings (with the register path-merging builds up)
By this layer each vertex token carries two one-hot blocks (slots ordered [2, 3, 6]): its own identity and its frontier = the vertices it currently wants to merge from (filled in from the edge tokens / earlier layers). Position is extra dims, set aside here (it drives the position-based variant — see bottom).
| token | edge it has | identity | frontier | emb (6-dim) |
|---|---|---|---|---|
| vertex 2 | 2→3 | [1,0,0] | [0,1,0] | [1,0,0, 0,1,0] |
| vertex 3 | 3→6 | [0,1,0] | [0,0,1] | [0,1,0, 0,0,1] |
| vertex 6 | (sink) | [0,0,1] | [0,0,0] | [0,0,1, 0,0,0] |
Step 2 — the weight matrices (this is the whole trick)
The network learns two projection matrices. W_K copies out the identity block (“what I am”); W_Q copies out the frontier block (“what I’m looking for”):
Step 3 — compute Q and K (matrix × vector)
Key = identity block. Query = frontier block. E.g. for vertex 3:
So Q₃ = [0,0,1] = “I’m looking for vertex 6” (its frontier), while K₆ = [0,0,1] = “I am vertex 6.” Doing this for all tokens:
Step 4 — the full QKᵀ score matrix
Rows = query/target token, columns = key/source token. Each entry is Qᵢ · Kⱼ:
Read the matrix: each row’s 1 points at exactly the vertex that row’s token wants to merge from — vertex 2 → 3, vertex 3 → 6, sink 6 → nobody. The QKᵀ matrix literally re-encodes the graph’s edges as “who-attends-to-whom”, and it’s computed for all vertices in one shot. Vertex 3’s row is [0, 0, 1] ⇒ pure identity matching: score 1 only for the token whose identity equals what the query seeks (6), 0 elsewhere.
Where exactly is the “matching”? — inside each cell
The match is not a separate step between cells; each cell is one match-test. An entry Qᵢ·Kⱼ is a dot product of two one-hots = “do these line up?” → 1 if identical, 0 if not. Expand vertex 3’s row:
The 1 in the matrix is the match — it lands in column K₆ because that’s the only key whose one-hot ([0,0,1] = “I am 6”) overlaps the query’s one-hot ([0,0,1] = “looking for 6”). The dot product = “how much do query and key overlap”; one-hots overlap only when identical. So scanning a row scores every candidate source at once, and the 1 marks the hit. (In a real model embeddings aren’t clean one-hots → matching is soft, highest score wins after softmax; the one-hot toy just makes the hit exactly 1.)
Softmax → who gets attended to
Raw scores [0, 0, 1] → softmax ≈ [0.21, 0.21, 0.58]. The learned W_Q, W_K have magnitude, so the effective logits are more like [0, 0, 8]:
→ vertex 3 attends almost entirely to vertex 6. The head then copies V₆ = 6’s reachable set into position 3. One path-merge done — and the same head does this for every vertex in parallel, in this one layer.
Connecting to Q6 — attributing the feature (Step IV)
We zoom in on the one operation “vertex 3 merges from vertex 6” — the (Q₃, K₆) cell, which is 1. To attribute why it fires, perturb one side at a time and recompute the dot product. (Earlier layers are frozen, so any swing is attributable to this op.)
Extend the slots to [2, 3, 6, 9] so vertex 6 has a fresh substitute “9” (the paper swaps in a vertex ID not present in the input). Embeddings are now 8-dim = identity(4) ⊕ frontier(4), and W_K = [I₄ | 0], W_Q = [0 | I₄].
Baseline: Q₃ = [0,0,1,0] (“looking for 6”), K₆ = [0,0,1,0] (“I am 6”) ⇒ Q₃·K₆ = 1.
Test A — perturb the SOURCE (swap vertex 6’s identity 6 → 9)
Source embedding changes; key is recomputed; query untouched → this is QⱼK̃ᵢᵀ:
The match collapses when we change the source’s identity ⇒ this op depends on the source embedding being “6”.
Test B — perturb the TARGET (zero vertex 3’s “frontier = 6” feature)
Target embedding changes; query is recomputed; key untouched → this is Q̃ⱼKᵢᵀ:
The match collapses when we remove the target’s “looking-for-6” feature ⇒ this op depends on the target embedding carrying that frontier.
What the two tests buy you
| Perturb | Recomputed side | Dot product | Drops? | Conclusion |
|---|---|---|---|---|
| source id 6→9 | key K̃ᵢ | Qⱼ·K̃ᵢ | 1→0 | feature “6” lives in the source embedding |
| target frontier→0 | query Q̃ⱼ | Q̃ⱼ·Kᵢ | 1→0 | ”looking-for-6” lives in the target embedding |
So splitting the perturbation into Q̃ⱼKᵢᵀ (target side) vs QⱼK̃ᵢᵀ (source side) localizes the causal feature to one embedding — exactly the source/target attribution Step IV needs to reconstruct the computation graph.
The two flavours of merge
- Token-matching (above): match by vertex identity — query points at an ID, key reports an ID. Uses the one-hot(ID) part of the embedding.
- Position-based: identical mechanics but on the one-hot(position) part — the query points at a position (“the vertex one step from my frontier”) and the key reports its position. (Footnote 7 in the paper.)
⚠️ Exam line: “With one-hot identity⊕position embeddings, the Qⱼ·Kᵢ dot product is identity (or position) matching: it’s ~1 when the source token is what the query is looking for, ~0 otherwise. Softmax sharpens that into copying the matched vertex’s reachable set (the value) into the current vertex — that single copy is a path-merge. Perturbing the query vs the key isolates whether the causal feature lives in the target or source embedding.”