Self-Attention

Self-Attention is the core mechanism inside every Transformer. For each token in a sequence, it computes a weighted sum over all tokens in the same sequence (itself included), where the weights reflect how relevant each token is to the current token.

“Self” means: the same sequence supplies queries, keys, AND values. (In contrast, cross-attention lets one sequence query another — used in encoder-decoder Transformers and Stable Diffusion.)

The single equation at the core of GPT-4, Claude, Stable Diffusion, AlphaFold, Whisper, and most modern AI:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The mechanism in 4 steps

For an input sequence of n tokens, each represented by a vector of dimension d_model:

  1. Project each token into three vectors: a query q, a key k, and a value v (stacked row-wise into the matrices Q, K, V). The projections use learned matrices W_Q, W_K, W_V. For a single head, these are the only learnable parameters in attention; multi-head attention adds an output projection W_O.
  2. Score how relevant each token is to each other token: S = QKᵀ (an n × n matrix of dot products). Each entry Sᵢⱼ answers “how much should token i pay attention to token j?”
  3. Scale + normalize: divide by √d_k (prevents softmax saturation in high dimensions), then softmax each row → attention weights that sum to 1.
  4. Aggregate: multiply weights with values: output = weights · V. Each output token is a weighted blend of all value vectors.

Result: each output token contains information from all tokens in the sequence (itself included), weighted by learned relevance.

See it in code
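The four steps can be sketched in NumPy as follows. This is a minimal single-head illustration, not a reference implementation: the function names `softmax` and `self_attention`, the random weights, and the toy sizes (4 tokens, d_model = d_k = 8) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # step 1: project
    scores = Q @ K.T                           # step 2: n x n dot products
    scores = scores / np.sqrt(K.shape[-1])     # step 3a: scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)         # step 3b: each row sums to 1
    return weights @ V, weights                # step 4: blend the values

# Toy input: random vectors stand in for learned embeddings and projections.
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(weights.shape, output.shape)  # (4, 4) (4, 8)
```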

The attention weights matrix weights[i, j] tells you “how much token i pays attention to token j” — the famous “attention map” you see in interpretability papers.

Visual: the attention heatmap

The matrix weights is what the attention-map figures in Transformer interpretability papers plot. Each row sums to 1: it is a probability distribution over the tokens (itself included) that this token attends to.

What to see:

  • Left (vanilla self-attention): every cell is filled. Token cat can attend to The, sat, on, and itself. Bidirectional, like BERT.
  • Right (causal mask): the upper triangle is 0. Token cat can only see The and itself, not sat or on. This is what GPT does: each token attends only to its own and earlier positions. The lower-triangular pattern is the signature of causal attention.
  • Each row sums to 1 (softmax): it is a probability distribution over which tokens to gather from.

In real Transformers, you would see one such map per head and per layer: hundreds of tokens × dozens of heads × dozens of layers. Attention maps are a widely used interpretability tool.

Why divide by √d_k?

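If the components of q and k are roughly independent with zero mean and unit variance, the dot product q·k has variance d_k, so its typical magnitude grows like √d_k. Without scaling, the softmax saturates (nearly all weight lands on one token) → gradients vanish → training breaks. Vaswani et al. (2017) added the 1/√d_k scaling factor specifically for this. Skipping it is a common bug in DIY implementations.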
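A quick empirical check of the scaling argument, assuming query and key components drawn independently from a standard normal distribution (an idealization; trained projections will not match it exactly). The sample size and the chosen values of d_k are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for d_k in (16, 64, 256):
    q = rng.normal(size=(5_000, d_k))    # unit-variance query components
    k = rng.normal(size=(5_000, d_k))    # unit-variance key components
    dots = (q * k).sum(axis=1)           # 5,000 sample dot products q·k
    raw_std = dots.std()                        # should be close to sqrt(d_k)
    scaled_std = (dots / np.sqrt(d_k)).std()    # should be close to 1
    results[d_k] = (raw_std, scaled_std)
    print(d_k, round(raw_std, 2), round(scaled_std, 2))
```

The raw standard deviation should track √d_k (roughly 4, 8, 16), while the scaled version stays near 1 regardless of dimension, which keeps the softmax inputs in a range where gradients are useful.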

Multi-Head Attention

A single attention head produces only one set of attention weights per token, so it tends to capture one “kind of relationship” at a time. Multi-head attention runs h attention computations in parallel, each with its own (W_Q, W_K, W_V), letting different heads specialize:

  • One head might attend to syntactic relationships (subject-verb)
  • Another to coreference (this → noun phrase)
  • Another to long-range dependencies

MultiHead(X) = Concat(head₁, head₂, …, head_h) · W_O
where head_i = Attention(X·W_Q_i, X·W_K_i, X·W_V_i)
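The formula above can be sketched in NumPy as a loop over heads. This is an illustrative version under stated assumptions: per-head weights are stored as arrays of shape (h, d_model, d_head) with d_head = d_model / h, the weights are random, and the function name `multi_head_attention` is made up for the example. Real implementations usually fuse the heads into batched matrix multiplies.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: (h, d_model, d_head); W_O: (h * d_head, d_model)
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):      # one iteration per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(weights @ V)                  # (n, d_head)
    return np.concatenate(heads, axis=-1) @ W_O    # Concat(...) · W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_head = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```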

GPT-3 (175B) uses 96 heads per layer; smaller models like BERT-base use 12.

Self-Attention vs. Cross-Attention vs. Masked Self-Attention

| Variant | Q from | K, V from | Used in |
|---|---|---|---|
| Self-attention | sequence X | sequence X | BERT (encoder), ViT |
| Cross-attention | sequence A (e.g. decoder) | sequence B (e.g. encoder output) | T5 decoder, Stable Diffusion (text → image conditioning) |
| Masked self-attention | sequence X | sequence X (with future positions masked) | GPT (prevents seeing the future during training) |
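To make the cross-attention row concrete, here is a small NumPy sketch in which queries come from one sequence and keys and values from another. The function name `cross_attention`, the random weights, and the sequence lengths (3 and 6) are assumptions for illustration only.

```python
import numpy as np

def cross_attention(A, B, W_Q, W_K, W_V):
    Q = A @ W_Q                  # queries from sequence A: (n_a, d_k)
    K, V = B @ W_K, B @ W_V      # keys and values from sequence B: (n_b, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # (n_a, n_b)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))      # e.g. 3 decoder positions
B = rng.normal(size=(6, 8))      # e.g. 6 encoder outputs
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = cross_attention(A, B, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (3, 8) (3, 6)
```

The weight matrix is no longer square: each of the 3 query positions gets a distribution over the 6 positions of the other sequence.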

The mask in GPT-style models sets Sᵢⱼ = −∞ for j > i, so the softmax gives zero weight to future tokens. This is why GPT can be trained on next-token prediction without leaking information.
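The masking step can be sketched as follows. The function name `causal_attention_weights` is an assumption, and the all-zero score matrix is a deliberately simple input so the effect of the mask is easy to read off.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True where j > i
    masked = np.where(future, -np.inf, scores)            # S_ij = -inf for j > i
    masked = masked - masked.max(axis=-1, keepdims=True)  # diagonal is never masked
    e = np.exp(masked)                                    # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))        # equal raw scores everywhere
weights = causal_attention_weights(scores)
print(weights)
```

With equal raw scores, row i spreads its weight uniformly over positions 0..i (1, then 1/2 each, then 1/3 each, then 1/4 each), and every entry above the diagonal is exactly 0.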

Properties

  • Parallelizable across sequence positions (unlike RNNs)
  • Direct long-range connections — every token connects to every other in one layer
  • Permutation-equivariant (good for set data; needs positional encoding for sequences)
  • O(n²) memory and compute in sequence length → expensive for long sequences
  • ❌ Without positional encoding, can’t distinguish “dog bit man” from “man bit dog”
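The permutation-equivariance property can be checked numerically. In this sketch, three random vectors stand in for the embeddings of "dog", "bit", "man" (an assumption; no real tokenizer or model is involved), and no positional encoding is added.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))      # stand-ins for "dog", "bit", "man"
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
perm = np.array([2, 1, 0])       # reorder to "man", "bit", "dog"

out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Reordering the input only reorders the output rows in the same way.
same = np.allclose(out_perm, out[perm])
print(same)  # True
```

Each token receives the same output vector in both orderings, so without positional information the layer has no way to tell the two sentences apart.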

The Hopfield connection (Ramsauer et al., 2020)

⚠️ Important exam-trap fact: the update rule of modern (continuous) Hopfield networks is mathematically equivalent to the attention mechanism. The softmax-attention update is exactly the retrieval step of a continuous, high-capacity associative memory, so Transformers can be read as stacks of associative memory layers. See Hopfield Networks for the full bridge.
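A minimal sketch of the retrieval step, written so the correspondence to attention is visible: the noisy state plays the role of the query, and the stored patterns act as both keys and values. The function name `hopfield_retrieve`, the random ±1 patterns, the noise level, and the choice β = 1/√d are assumptions for illustration.

```python
import numpy as np

def hopfield_retrieve(query, patterns, beta):
    # One update step: softmax over similarities, then a weighted sum of patterns.
    scores = beta * (patterns @ query)   # similarity to each stored pattern
    scores = scores - scores.max()       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ patterns, weights

rng = np.random.default_rng(0)
d, n_patterns = 64, 5
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, d))   # stored memories
query = patterns[2] + 0.3 * rng.normal(size=d)             # noisy copy of memory 2

retrieved, weights = hopfield_retrieve(query, patterns, beta=1 / np.sqrt(d))
print(weights.argmax(), np.array_equal(np.sign(retrieved), patterns[2]))
```

With these settings the softmax weight should concentrate on pattern 2, so a single update step recovers the stored memory from its noisy version. The line computing `weights @ patterns` is the same operation as softmax(qKᵀ / √d_k) · V with K = V = patterns.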

Where Self-Attention is used today

Self-attention is the dominant primitive in modern AI:

  • All LLMs: GPT-4, Claude, Gemini, LLaMA, Mistral, DeepSeek — every one uses masked self-attention in decoder blocks
  • BERT / RoBERTa / embedding models — bidirectional self-attention for classification + retrieval
  • Vision Transformers (ViT) — self-attention over image patches
  • AlphaFold 2 — Evoformer’s row/column attention over MSAs
  • Whisper — self-attention for speech recognition
  • Stable Diffusion — self-attention inside the U-Net + cross-attention to text embeddings
  • Decision Transformer / RT-2 (self-attention for game-playing and robotics policies)
  • Time series forecasting — TimesFM, Lag-Llama use self-attention

What Self-Attention is being challenged by

| Limitation | Challenger | Why it might win |
|---|---|---|
| O(n²) memory + compute | Mamba / S4 / S5 (State Space Models) | Linear time in sequence length → much cheaper for long contexts |
| O(n²) compute | Linear attention (Performer, Linformer, RWKV) | Approximate or replace the softmax → O(n) compute |
| Long-context modeling | FlashAttention + ring attention | Same O(n²) compute, but tiled so the full n × n matrix is never materialized; brings 1M-token context within reach |
| Mixture of specialists | Mixture-of-Experts (MoE) with attention | Sparse activation: only some experts run per token |

Status (early 2026): vanilla self-attention still wins on most benchmarks, but Mamba-style SSMs are catching up fast on long-context tasks. Hybrids (attention + SSM layers) are an active area.

Tags: methods-of-ai deep-learning transformers attention self-attention
Created: 18-05-26