worked-example_qk-dotproduct_29-05-26

ai-generated methods-of-ai exam-prep

Companion to quiz_paper-transformers-search_28-05-26 · Q6 (the Qⱼ·Kᵢ dot products) — also underpins Q7 (path-merging).

A concrete numeric example of the query–key dot product for this paper’s setup: one-hot identity ⊕ position tokens, and how the dot product selects the merge target. Then how perturbing it (Step IV) attributes a feature to source vs target.

The setup (recall)

Each token’s embedding = one-hot(vertex ID) ⊕ one-hot(position), concatenated. For each token the head builds:

Query Qⱼ = W_Q · embⱼ (from the target token j — “what am I looking for?“)
Key Kᵢ = W_K · embᵢ (from the source token i — “what am I?“)
Value Vᵢ (the content that gets copied if j attends to i — here: i’s reachable set)

Attention score = Qⱼ · Kᵢᵀ (scaled), then softmax over all i. High score ⇒ j pulls in i’s value.

The merge we’ll model

Edge 3→6 exists, so vertex 3 can reach 6. Vertex 3 (target j) wants to absorb vertex 6’s reachable set → it must attend to the token for vertex 6 (source i). Let’s watch the dot product pick 6 out of {2, 3, 6}.

Step 1 — the embeddings (with the register path-merging builds up)

By this layer each vertex token carries two one-hot blocks (slots ordered [2, 3, 6]): its own identity and its frontier = the vertices it currently wants to merge from (filled in from the edge tokens / earlier layers). Position is extra dims, set aside here (it drives the position-based variant — see bottom).

emb = [identity \cdot \cdot \cdot frontier \cdot \cdot \cdot]

token	edge it has	identity	frontier	`emb` (6-dim)
vertex 2	2→3	[1,0,0]	[0,1,0]	`[1,0,0, 0,1,0]`
vertex 3	3→6	[0,1,0]	[0,0,1]	`[0,1,0, 0,0,1]`
vertex 6	(sink)	[0,0,1]	[0,0,0]	`[0,0,1, 0,0,0]`

Step 2 — the weight matrices (this is the whole trick)

The network learns two projection matrices. W_K copies out the identity block (“what I am”); W_Q copies out the frontier block (“what I’m looking for”):

W_{K} = 100010001000000000 (= [I_{3} ∣ 0]), W_{Q} = 000000000100010001 (= [0 ∣ I_{3}])

Step 3 — compute Q and K (matrix × vector)

Key = identity block. Query = frontier block. E.g. for vertex 3:

K_{3} = W_{K} emb_{3} = 100010001000000000010001 = 010, Q_{3} = W_{Q} emb_{3} = 000000000100010001010001 = 001

So Q₃ = [0,0,1] = “I’m looking for vertex 6” (its frontier), while K₆ = [0,0,1] = “I am vertex 6.” Doing this for all tokens:

Q_{2} = 010, Q_{3} = 001, Q_{6} = 000 K_{2} = 100, K_{3} = 010, K_{6} = 001

Step 4 — the full `QKᵀ` score matrix

Rows = query/target token, columns = key/source token. Each entry is Qᵢ · Kⱼ:

Q K^{⊤} = Q_{2} Q_{3} Q_{6} K_{2} 000 K_{3} 100 K_{6} 010

Read the matrix: each row’s 1 points at exactly the vertex that row’s token wants to merge from — vertex 2 → 3, vertex 3 → 6, sink 6 → nobody. The QKᵀ matrix literally re-encodes the graph’s edges as “who-attends-to-whom”, and it’s computed for all vertices in one shot. Vertex 3’s row is [0, 0, 1] ⇒ pure identity matching: score 1 only for the token whose identity equals what the query seeks (6), 0 elsewhere.

Where exactly is the “matching”? — inside each cell

The match is not a separate step between cells; each cell is one match-test. An entry Qᵢ·Kⱼ is a dot product of two one-hots = “do these line up?” → 1 if identical, 0 if not. Expand vertex 3’s row:

Q_{3} = 001 ("looking for 6") ⟹ Q_{3} \cdot K_{2} Q_{3} \cdot K_{3} Q_{3} \cdot K_{6} = [0, 0, 1] \cdot [1, 0, 0] = 0 + 0 + 0 = 0 ✗ = [0, 0, 1] \cdot [0, 1, 0] = 0 + 0 + 0 = 0 ✗ = [0, 0, 1] \cdot [0, 0, 1] = 0 + 0 + 1 = 1 ✓ MATCH

The 1 in the matrix is the match — it lands in column K₆ because that’s the only key whose one-hot ([0,0,1] = “I am 6”) overlaps the query’s one-hot ([0,0,1] = “looking for 6”). The dot product = “how much do query and key overlap”; one-hots overlap only when identical. So scanning a row scores every candidate source at once, and the 1 marks the hit. (In a real model embeddings aren’t clean one-hots → matching is soft, highest score wins after softmax; the one-hot toy just makes the hit exactly 1.)

Softmax → who gets attended to

Raw scores [0, 0, 1] → softmax ≈ [0.21, 0.21, 0.58]. The learned W_Q, W_K have magnitude, so the effective logits are more like [0, 0, 8]:

softmax ([0, 0, 8]) \approx [0.0003, 0.0003, 0.9994]

→ vertex 3 attends almost entirely to vertex 6. The head then copies V₆ = 6’s reachable set into position 3. One path-merge done — and the same head does this for every vertex in parallel, in this one layer.

Connecting to Q6 — attributing the feature (Step IV)

We zoom in on the one operation “vertex 3 merges from vertex 6” — the (Q₃, K₆) cell, which is 1. To attribute why it fires, perturb one side at a time and recompute the dot product. (Earlier layers are frozen, so any swing is attributable to this op.)

Extend the slots to [2, 3, 6, 9] so vertex 6 has a fresh substitute “9” (the paper swaps in a vertex ID not present in the input). Embeddings are now 8-dim = identity(4) ⊕ frontier(4), and W_K = [I₄ | 0], W_Q = [0 | I₄].

Baseline: Q₃ = [0,0,1,0] (“looking for 6”), K₆ = [0,0,1,0] (“I am 6”) ⇒ Q₃·K₆ = 1.

Test A — perturb the SOURCE (swap vertex 6’s identity 6 → 9)

Source embedding changes; key is recomputed; query untouched → this is QⱼK̃ᵢᵀ:

\tilde{K}_{6} = W_{K} \tilde{emb}_{6} = 10000100001000010000000000000000 id now 9 00010000 = 0001

Q_{3} \cdot \tilde{K}_{6} = 0010 \cdot 0001 = 0 (was 1) \Rightarrow drop 1 \to 0

The match collapses when we change the source’s identity ⇒ this op depends on the source embedding being “6”.

Test B — perturb the TARGET (zero vertex 3’s “frontier = 6” feature)

Target embedding changes; query is recomputed; key untouched → this is Q̃ⱼKᵢᵀ:

\tilde{Q}_{3} = W_{Q} \tilde{emb}_{3} = 00000000000000001000010000100001 frontier zeroed 01000000 = 0000

\tilde{Q}_{3} \cdot K_{6} = 0000 \cdot 0010 = 0 (was 1) \Rightarrow drop 1 \to 0

The match collapses when we remove the target’s “looking-for-6” feature ⇒ this op depends on the target embedding carrying that frontier.

What the two tests buy you

Perturb	Recomputed side	Dot product	Drops?	Conclusion
source id 6→9	key `K̃ᵢ`	`Qⱼ·K̃ᵢ`	1→0	feature “6” lives in the source embedding
target frontier→0	query `Q̃ⱼ`	`Q̃ⱼ·Kᵢ`	1→0	”looking-for-6” lives in the target embedding

So splitting the perturbation into Q̃ⱼKᵢᵀ (target side) vs QⱼK̃ᵢᵀ (source side) localizes the causal feature to one embedding — exactly the source/target attribution Step IV needs to reconstruct the computation graph.

The two flavours of merge

Token-matching (above): match by vertex identity — query points at an ID, key reports an ID. Uses the one-hot(ID) part of the embedding.
Position-based: identical mechanics but on the one-hot(position) part — the query points at a position (“the vertex one step from my frontier”) and the key reports its position. (Footnote 7 in the paper.)

⚠️ Exam line: “With one-hot identity⊕position embeddings, the Qⱼ·Kᵢ dot product is identity (or position) matching: it’s ~1 when the source token is what the query is looking for, ~0 otherwise. Softmax sharpens that into copying the matched vertex’s reachable set (the value) into the current vertex — that single copy is a path-merge. Perturbing the query vs the key isolates whether the causal feature lives in the target or source embedding.”

Brain Online

Explorer

worked-example_qk-dotproduct_29-05-26

The setup (recall)

The merge we’ll model

Step 1 — the embeddings (with the register path-merging builds up)

Step 2 — the weight matrices (this is the whole trick)

Step 3 — compute Q and K (matrix × vector)

Step 4 — the full `QKᵀ` score matrix

Where exactly is the “matching”? — inside each cell

Softmax → who gets attended to

Connecting to Q6 — attributing the feature (Step IV)

Test A — perturb the SOURCE (swap vertex 6’s identity 6 → 9)

Test B — perturb the TARGET (zero vertex 3’s “frontier = 6” feature)

What the two tests buy you

The two flavours of merge

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis

Brain Online

Explorer

worked-example_qk-dotproduct_29-05-26

The setup (recall)

The merge we’ll model

Step 1 — the embeddings (with the register path-merging builds up)

Step 2 — the weight matrices (this is the whole trick)

Step 3 — compute Q and K (matrix × vector)

Step 4 — the full QKᵀ score matrix

Where exactly is the “matching”? — inside each cell

Softmax → who gets attended to

Connecting to Q6 — attributing the feature (Step IV)

Test A — perturb the SOURCE (swap vertex 6’s identity 6 → 9)

Test B — perturb the TARGET (zero vertex 3’s “frontier = 6” feature)

What the two tests buy you

The two flavours of merge

Backlinks

Mika

✨ Features

⚙️ Einstellungen

📚 Chat-Verlauf

📖 Citation Manager

✍️ Writing Assistant

Inhaltsverzeichnis

Step 4 — the full `QKᵀ` score matrix