ai-generated methods-of-ai exam-prep

Companion to quiz_paper-transformers-search_28-05-26 · Q6 (the Qⱼ·Kᵢ dot products) — also underpins Q7 (path-merging).

A concrete numeric example of the query–key dot product for this paper’s setup: one-hot identity ⊕ position tokens, and how the dot product selects the merge target. Then how perturbing it (Step IV) attributes a feature to source vs target.


The setup (recall)

Each token’s embedding = one-hot(vertex ID) ⊕ one-hot(position), concatenated. For each token the head builds:

  • Query Qⱼ = W_Q · embⱼ (from the target token j — “what am I looking for?“)
  • Key Kᵢ = W_K · embᵢ (from the source token i — “what am I?“)
  • Value Vᵢ (the content that gets copied if j attends to i — here: i’s reachable set)

Attention score = Qⱼ · Kᵢᵀ (scaled), then softmax over all i. High score ⇒ j pulls in i’s value.

The merge we’ll model

Edge 3→6 exists, so vertex 3 can reach 6. Vertex 3 (target j) wants to absorb vertex 6’s reachable set → it must attend to the token for vertex 6 (source i). Let’s watch the dot product pick 6 out of {2, 3, 6}.

Step 1 — the embeddings (with the register path-merging builds up)

By this layer each vertex token carries two one-hot blocks (slots ordered [2, 3, 6]): its own identity and its frontier = the vertices it currently wants to merge from (filled in from the edge tokens / earlier layers). Position is extra dims, set aside here (it drives the position-based variant — see bottom).

tokenedge it hasidentityfrontieremb (6-dim)
vertex 22→3[1,0,0][0,1,0][1,0,0, 0,1,0]
vertex 33→6[0,1,0][0,0,1][0,1,0, 0,0,1]
vertex 6(sink)[0,0,1][0,0,0][0,0,1, 0,0,0]

Step 2 — the weight matrices (this is the whole trick)

The network learns two projection matrices. W_K copies out the identity block (“what I am”); W_Q copies out the frontier block (“what I’m looking for”):

Step 3 — compute Q and K (matrix × vector)

Key = identity block. Query = frontier block. E.g. for vertex 3:

So Q₃ = [0,0,1] = “I’m looking for vertex 6” (its frontier), while K₆ = [0,0,1] = “I am vertex 6.” Doing this for all tokens:

Step 4 — the full QKᵀ score matrix

Rows = query/target token, columns = key/source token. Each entry is Qᵢ · Kⱼ:

Read the matrix: each row’s 1 points at exactly the vertex that row’s token wants to merge from — vertex 2 → 3, vertex 3 → 6, sink 6 → nobody. The QKᵀ matrix literally re-encodes the graph’s edges as “who-attends-to-whom”, and it’s computed for all vertices in one shot. Vertex 3’s row is [0, 0, 1]pure identity matching: score 1 only for the token whose identity equals what the query seeks (6), 0 elsewhere.

Where exactly is the “matching”? — inside each cell

The match is not a separate step between cells; each cell is one match-test. An entry Qᵢ·Kⱼ is a dot product of two one-hots = “do these line up?” → 1 if identical, 0 if not. Expand vertex 3’s row:

The 1 in the matrix is the match — it lands in column K₆ because that’s the only key whose one-hot ([0,0,1] = “I am 6”) overlaps the query’s one-hot ([0,0,1] = “looking for 6”). The dot product = “how much do query and key overlap”; one-hots overlap only when identical. So scanning a row scores every candidate source at once, and the 1 marks the hit. (In a real model embeddings aren’t clean one-hots → matching is soft, highest score wins after softmax; the one-hot toy just makes the hit exactly 1.)

Softmax → who gets attended to

Raw scores [0, 0, 1] → softmax ≈ [0.21, 0.21, 0.58]. The learned W_Q, W_K have magnitude, so the effective logits are more like [0, 0, 8]:

→ vertex 3 attends almost entirely to vertex 6. The head then copies V₆ = 6’s reachable set into position 3. One path-merge done — and the same head does this for every vertex in parallel, in this one layer.


Connecting to Q6 — attributing the feature (Step IV)

We zoom in on the one operation “vertex 3 merges from vertex 6” — the (Q₃, K₆) cell, which is 1. To attribute why it fires, perturb one side at a time and recompute the dot product. (Earlier layers are frozen, so any swing is attributable to this op.)

Extend the slots to [2, 3, 6, 9] so vertex 6 has a fresh substitute “9” (the paper swaps in a vertex ID not present in the input). Embeddings are now 8-dim = identity(4) ⊕ frontier(4), and W_K = [I₄ | 0], W_Q = [0 | I₄].

Baseline: Q₃ = [0,0,1,0] (“looking for 6”), K₆ = [0,0,1,0] (“I am 6”) ⇒ Q₃·K₆ = 1.

Test A — perturb the SOURCE (swap vertex 6’s identity 6 → 9)

Source embedding changes; key is recomputed; query untouched → this is QⱼK̃ᵢᵀ:

The match collapses when we change the source’s identity ⇒ this op depends on the source embedding being “6”.

Test B — perturb the TARGET (zero vertex 3’s “frontier = 6” feature)

Target embedding changes; query is recomputed; key untouched → this is Q̃ⱼKᵢᵀ:

The match collapses when we remove the target’s “looking-for-6” feature ⇒ this op depends on the target embedding carrying that frontier.

What the two tests buy you

PerturbRecomputed sideDot productDrops?Conclusion
source id 6→9key K̃ᵢQⱼ·K̃ᵢ1→0feature “6” lives in the source embedding
target frontier→0query Q̃ⱼQ̃ⱼ·Kᵢ1→0”looking-for-6” lives in the target embedding

So splitting the perturbation into Q̃ⱼKᵢᵀ (target side) vs QⱼK̃ᵢᵀ (source side) localizes the causal feature to one embedding — exactly the source/target attribution Step IV needs to reconstruct the computation graph.

The two flavours of merge

  • Token-matching (above): match by vertex identity — query points at an ID, key reports an ID. Uses the one-hot(ID) part of the embedding.
  • Position-based: identical mechanics but on the one-hot(position) part — the query points at a position (“the vertex one step from my frontier”) and the key reports its position. (Footnote 7 in the paper.)

⚠️ Exam line: “With one-hot identity⊕position embeddings, the Qⱼ·Kᵢ dot product is identity (or position) matching: it’s ~1 when the source token is what the query is looking for, ~0 otherwise. Softmax sharpens that into copying the matched vertex’s reachable set (the value) into the current vertex — that single copy is a path-merge. Perturbing the query vs the key isolates whether the causal feature lives in the target or source embedding.”