Self-attention is the core mechanism inside every Transformer. For each token in a sequence, it computes a weighted sum over all tokens in the same sequence (including the token itself), where the weights reflect how relevant each token is to the current one.
“Self” means: the same sequence supplies queries, keys, AND values. (In contrast, cross-attention lets one sequence query another — used in encoder-decoder Transformers and Stable Diffusion.)
The single equation at the core of GPT-4, Claude, Stable Diffusion, AlphaFold, Whisper, and most modern AI models:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
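To make the self- versus cross-attention distinction concrete, here is a minimal NumPy sketch. The second sequence Y, the random weights, and the helper name attention are illustrative assumptions, not part of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q Kᵀ / √d_k) · V for 2-D arrays (hypothetical helper)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

X = rng.standard_normal((5, d_model))  # sequence A: 5 tokens
Y = rng.standard_normal((3, d_model))  # sequence B: 3 tokens (assumed second sequence)

# Self-attention: queries, keys, and values all come from X.
self_out = attention(X @ W_Q, X @ W_K, X @ W_V)

# Cross-attention: queries come from X, keys and values come from Y.
cross_out = attention(X @ W_Q, Y @ W_K, Y @ W_V)

print(self_out.shape, cross_out.shape)
```

Both outputs have one row per query token. The only difference is which sequence supplies the keys and values.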
The mechanism in 4 steps
For an input sequence of n tokens, each represented by a vector of dimension d_model:
1. Project each token into three vectors: a query Q, a key K, and a value V. The projections use learned matrices W_Q, W_K, W_V; these are the only learnable parameters in this single-head form of attention.
2. Score how relevant each token is to each other token: S = QKᵀ (an n × n matrix of dot products). Each entry Sᵢⱼ answers “how much should token i pay attention to token j?”
3. Scale + normalize: divide by √d_k (prevents softmax saturation in high dimensions), then apply softmax to each row → attention weights that sum to 1 per row.
4. Aggregate: multiply the weights with the values: output = weights · V. Each output token is a weighted blend of all value vectors.
Result: each output token contains information from all tokens in the sequence, weighted by learned relevance.
See it in code — 8 lines
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # n × d_k, n × d_k, n × d_v
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # n × n similarity matrix
        weights = softmax(scores, axis=-1)   # rows sum to 1
        return weights @ V, weights

    # Toy: 4 tokens, model dim 8, key/value dim 4
    np.random.seed(0)
    n, d_model, d_k = 4, 8, 4
    X = np.random.randn(n, d_model)  # input sequence
    W_Q = np.random.randn(d_model, d_k)
    W_K = np.random.randn(d_model, d_k)
    W_V = np.random.randn(d_model, d_k)

    output, weights = self_attention(X, W_Q, W_K, W_V)
    print("Attention weights (rows sum to 1):")
    print(np.round(weights, 2))
    print(f"\nOutput shape: {output.shape} (same n, but each token is a blend)")
Each entry weights[i, j] of the attention-weight matrix tells you “how much token i pays attention to token j”. This is the “attention map” you see in interpretability papers.
Visual: the attention heatmap
The matrix weights is what the attention-map figures in Transformer interpretability papers plot. Each row sums to 1: it is a probability distribution over the tokens (itself included) that this token attends to.
    # Pyodide / Obsidian Execute Code: install matplotlib first.
    # In normal Python (terminal / Jupyter), delete the next 2 lines.
    import micropip
    await micropip.install("matplotlib")

    import matplotlib.pyplot as plt

    tokens = ["The", "cat", "sat", "on"]  # 4 tokens for our toy example
    fig, axes = plt.subplots(1, 2, figsize=(11, 4.5))

    # Left: the actual attention weights
    im0 = axes[0].imshow(weights, cmap='viridis', vmin=0, vmax=1)
    axes[0].set_xticks(range(len(tokens))); axes[0].set_xticklabels(tokens)
    axes[0].set_yticks(range(len(tokens))); axes[0].set_yticklabels(tokens)
    axes[0].set_xlabel('attending TO (key)'); axes[0].set_ylabel('attending FROM (query)')
    axes[0].set_title('Self-attention weights\n(each row sums to 1)')
    for i in range(len(tokens)):
        for j in range(len(tokens)):
            axes[0].text(j, i, f'{weights[i, j]:.2f}', ha='center', va='center',
                         color='white' if weights[i, j] < 0.5 else 'black')
    plt.colorbar(im0, ax=axes[0], fraction=0.04)

    # Right: GPT-style causal mask version
    mask = np.triu(np.ones((4, 4)), k=1).astype(bool)  # upper triangular (future positions)
    causal_scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    causal_scores[mask] = -np.inf
    causal_weights = softmax(causal_scores)
    im1 = axes[1].imshow(causal_weights, cmap='viridis', vmin=0, vmax=1)
    axes[1].set_xticks(range(len(tokens))); axes[1].set_xticklabels(tokens)
    axes[1].set_yticks(range(len(tokens))); axes[1].set_yticklabels(tokens)
    axes[1].set_xlabel('attending TO (key)'); axes[1].set_ylabel('attending FROM (query)')
    axes[1].set_title('Masked (GPT-style) self-attention\n(no peeking at future tokens)')
    for i in range(len(tokens)):
        for j in range(len(tokens)):
            v = causal_weights[i, j]
            if v > 0.005:
                axes[1].text(j, i, f'{v:.2f}', ha='center', va='center',
                             color='white' if v < 0.5 else 'black')
    plt.colorbar(im1, ax=axes[1], fraction=0.04)

    plt.tight_layout(); plt.show()
What to see:
Left (vanilla self-attention): every cell is filled — token cat can attend to The, sat, on, AND itself. Bidirectional, like BERT.
Right (causal mask): the upper triangle is 0. Token cat can only see The and itself, not sat or on. This is what GPT does: each token attends only to its own and earlier positions. The lower-triangular pattern is the signature of causal attention.
Each row sums to 1 (softmax) — it’s a probability distribution over which tokens to gather from.
In real Transformers, you would see this for hundreds of tokens × dozens of heads × dozens of layers. The attention map is one of the most widely used interpretability tools.
Why divide by √d_k?
If the components of q and k are roughly independent with zero mean and unit variance, the dot product q·k has variance d_k, so its typical magnitude grows like √d_k. Without scaling, softmax saturates → gradients vanish → training breaks. Vaswani et al. (2017) added the 1/√d_k scaling factor specifically for this. Skipping it is a common bug in DIY implementations.
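The growth of the dot products can be checked empirically. The sketch below assumes standard-normal query and key components, which is an idealization and not something a trained model guarantees. It measures the spread of q·k with and without the 1/√d_k factor, then compares how peaked the softmax becomes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
raw_std, scaled_std = {}, {}

for d_k in (4, 64, 1024):
    # 10,000 random query/key pairs with unit-variance components (assumption)
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)
    raw_std[d_k] = dots.std()
    scaled_std[d_k] = (dots / np.sqrt(d_k)).std()
    print(f"d_k={d_k:5d}  std(q.k)={raw_std[d_k]:7.2f}  "
          f"std(q.k/sqrt(d_k))={scaled_std[d_k]:.2f}")

# Effect on softmax: one query scored against 16 keys at d_k = 1024
d_k = 1024
q = rng.standard_normal(d_k)
K = rng.standard_normal((16, d_k))
scores = K @ q
max_unscaled = softmax(scores).max()
max_scaled = softmax(scores / np.sqrt(d_k)).max()
print(f"largest attention weight, unscaled: {max_unscaled:.3f}")
print(f"largest attention weight, scaled:   {max_scaled:.3f}")
```

The unscaled spread tracks √d_k, while the scaled spread stays near 1. The largest unscaled weight is never smaller than the scaled one, because multiplying scores by a larger factor can only sharpen a softmax.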
Multi-Head Attention
A single attention head computes only one set of attention weights per layer, so it tends to capture one “kind of relationship” at a time. Multi-head attention runs h attention computations in parallel, each with its own (W_Q, W_K, W_V), letting different heads specialize:
One head might attend to syntactic relationships (subject-verb)
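A minimal multi-head sketch in NumPy follows. It assumes d_model is divisible by h and includes the output projection W_O from the standard formulation in Vaswani et al. (2017), which the single-head description above does not cover. The function name and random weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h):
    """Hypothetical helper: h heads, each of width d_model // h."""
    n, d_model = X.shape
    d_head = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of features per head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    weights = softmax(scores, axis=-1)                     # one attention map per head
    heads = weights @ V                                    # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return concat @ W_O, weights

rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))

out, head_weights = multi_head_self_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)           # one d_model-sized vector per token
print(head_weights.shape)  # one n × n attention map per head
```

Each head gets its own n × n attention map, so the two heads here can weight the same tokens differently.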
Causal masking: in GPT-style models the mask sets Sᵢⱼ = −∞ for j > i before the softmax, so future tokens receive zero weight. This is why GPT can be trained on next-token prediction without leaking information.
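The no-leak property can be checked directly. In the sketch below (random toy weights, illustrative function name), overwriting the last token leaves the outputs at all earlier positions unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)          # mask before the softmax
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out = causal_self_attention(X, W_Q, W_K, W_V)

X_changed = X.copy()
X_changed[-1] = rng.standard_normal(d_model)  # overwrite only the last token
out_changed = causal_self_attention(X_changed, W_Q, W_K, W_V)

earlier_unchanged = np.allclose(out[:-1], out_changed[:-1])
last_unchanged = np.allclose(out[-1], out_changed[-1])
print("earlier positions unchanged:", earlier_unchanged)
print("last position unchanged:    ", last_unchanged)
```

Without the mask, every row of the output would change, because each token would attend to the modified last token.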
Properties
✅ Parallelizable across sequence positions (unlike RNNs)
✅ Direct long-range connections — every token connects to every other in one layer
✅ Permutation-equivariant (good for set data; needs positional encoding for sequences)
❌ O(n²) memory and compute in sequence length → expensive for long sequences
❌ Without positional encoding, can’t distinguish “dog bit man” from “man bit dog”
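Permutation equivariance and the need for positional information can both be seen in a small experiment. The token embeddings and the random positional vectors below are stand-ins chosen for illustration, not a real positional encoding scheme.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))

X = rng.standard_normal((3, d_model))  # rows stand for "dog", "bit", "man"
perm = np.array([2, 1, 0])             # reorder to "man", "bit", "dog"

# Without positions: permuting the input just permutes the output rows.
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)
equivariant = np.allclose(out[perm], out_perm)

# With positions added (random stand-in vectors), word order now matters.
pos = rng.standard_normal((3, d_model))
out_pos = self_attention(X + pos, W_Q, W_K, W_V)
out_pos_perm = self_attention(X[perm] + pos, W_Q, W_K, W_V)
order_sensitive = not np.allclose(out_pos[perm], out_pos_perm)

print("equivariant without positions:", equivariant)
print("order-sensitive with positions:", order_sensitive)
```

Without positions, the output for “dog” is the same vector in both orderings, so nothing downstream can tell which sentence it came from.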
The Hopfield connection (Ramsauer et al., 2020)
⚠️ Important exam-trap fact: the update rule of modern (continuous) Hopfield networks is mathematically equivalent to the attention mechanism of a single attention layer. The softmax-attention update is exactly the retrieval step of a continuous high-capacity associative memory. This means Transformers can be read as stacks of associative memory layers. See Hopfield Networks for the full bridge.
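The correspondence can be sketched numerically. Below, the stored patterns play the role of both keys and values, the state vector plays the role of the query, and the inverse temperature β is set to 1/√d. The ±1 patterns, the noise level, and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_update(patterns, xi, beta):
    """One retrieval step: new state = softmax(beta * patterns @ xi) @ patterns."""
    return softmax(beta * (patterns @ xi)) @ patterns

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 64
patterns = rng.choice([-1.0, 1.0], size=(6, d))     # 6 stored patterns (rows)
query = patterns[2] + 0.1 * rng.standard_normal(d)  # noisy copy of pattern 2

# With beta = 1/sqrt(d), the Hopfield step equals attention with K = V = patterns.
hop = hopfield_update(patterns, query, beta=1.0 / np.sqrt(d))
att = attention(query[None, :], patterns, patterns)[0]
same_update = np.allclose(hop, att)

# A larger beta sharpens the softmax, so one step recovers the stored pattern.
retrieved = hopfield_update(patterns, query, beta=1.0)
recovered = np.allclose(retrieved, patterns[2], atol=1e-3)

print("Hopfield step == attention:", same_update)
print("pattern 2 recovered:", recovered)
```

In a Transformer layer the keys and values are separate learned projections instead of the raw patterns, so this sketch covers only the simplest case of the correspondence.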
Where Self-Attention is used today
Self-attention is the dominant primitive in modern AI:
Most LLMs: GPT-4, Claude, Gemini, LLaMA, Mistral, DeepSeek; all are built on (or reported to be built on) decoder blocks with masked self-attention
BERT / RoBERTa / embedding models — bidirectional self-attention for classification + retrieval
Vision Transformers (ViT) — self-attention over image patches
AlphaFold 2 — Evoformer’s row/column attention over MSAs
Whisper — self-attention for speech recognition
Stable Diffusion — self-attention inside the U-Net + cross-attention to text embeddings
Decision Transformer / RT-2: self-attention for game / robotics policies
Time series forecasting — TimesFM, Lag-Llama use self-attention
What Self-Attention is being challenged by
| Limitation | Challenger | Why it might win |
| --- | --- | --- |
| O(n²) memory + compute | Mamba / S4 / S5 (State Space Models) | Linear time in sequence length → much cheaper for long contexts |
| O(n²) compute | Linear attention (Performer, Linformer, RWKV) | Approximate or replace softmax attention → O(n) compute |
| Long-context modeling | FlashAttention + ring attention | Same O(n²) compute, but tiled so the n × n matrix is never materialized (FlashAttention) or sharded across devices (ring attention); brings 1M-token context within reach |
| Mixture of specialists | Mixture-of-Experts (MoE) with attention | Sparse activation: only a few experts are activated per token |
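The O(n) claim for linear attention comes from reordering a matrix product. The sketch below uses the elu(x) + 1 feature map from the linear-Transformer literature (Katharopoulos et al., 2020) as one concrete choice. It replaces softmax with a kernel, it does not approximate it, and Performer, Linformer, and RWKV each differ in their details. The function names are illustrative.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a positive feature map (one possible choice, assumed here)
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(Q, K, V):
    A = phi(Q) @ phi(K).T                          # (n, n) similarity matrix
    return (A / A.sum(axis=1, keepdims=True)) @ V  # row-normalize, then aggregate

def kernel_attention_linear(Q, K, V):
    KV = phi(K).T @ V       # (d_k, d_v) summary of all keys and values
    Z = phi(K).sum(axis=0)  # (d_k,) normalizer
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]  # never forms an n × n matrix

rng = np.random.default_rng(0)
n, d_k, d_v = 128, 16, 16
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

out_quadratic = kernel_attention_quadratic(Q, K, V)
out_linear = kernel_attention_linear(Q, K, V)
print(np.allclose(out_quadratic, out_linear))
```

Both functions compute the same result, but the second one costs O(n · d_k · d_v) instead of O(n² · d_k). This is the non-causal form; a causal variant would need running sums over positions.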
Status (early 2026): vanilla self-attention still wins on most benchmarks, but Mamba-style SSMs are catching up fast on long-context tasks. Hybrids (attention + SSM layers) are an active area.
See also
Transformers — the architecture self-attention powers