Self-Attention

Self-attention is the core mechanism inside every Transformer. For each token in a sequence, it computes a weighted sum of the value vectors of all tokens in the same sequence (including the token itself), where the weights reflect how relevant each token is to the current one.

“Self” means: the same sequence supplies queries, keys, AND values. (In contrast, cross-attention lets one sequence query another — used in encoder-decoder Transformers and Stable Diffusion.)

The single equation that runs GPT-4, Claude, Stable Diffusion, AlphaFold, Whisper, and most modern AI:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The mechanism in 4 steps

For an input sequence of n tokens, each represented by a vector of dimension d_model:

Project each token into three vectors: a query Q, a key K, and a value V. The projections use learned matrices W_Q, W_K, W_V. These are the only learnable parameters in a single attention head (multi-head attention adds an output projection W_O).
Score how relevant each token is to each other token: S = QKᵀ (an n × n matrix of dot products). Each entry Sᵢⱼ answers “how much should token i pay attention to token j?”.
Scale + normalize: divide by √d_k (prevents softmax saturation in high dim), then softmax each row → attention weights that sum to 1.
Aggregate: multiply weights with values: output = weights · V. Each output token is a weighted blend of all value vectors.

Result: each output token contains information from all tokens in the sequence (itself included), weighted by learned relevance.

See it in code — 8 lines

import numpy as np
 
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)              # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)
 
def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # n × d_k, n × d_k, n × d_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # n × n similarity matrix
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    return weights @ V, weights
 
# Toy: 4 tokens, model dim 8, key/value dim 4
np.random.seed(0)
n, d_model, d_k = 4, 8, 4
X   = np.random.randn(n, d_model)                        # input sequence
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)
 
output, weights = self_attention(X, W_Q, W_K, W_V)
print("Attention weights (rows sum to 1):")
print(np.round(weights, 2))
print(f"\nOutput shape: {output.shape}  (same n, but each token is a blend)")

Each entry weights[i, j] of the attention matrix tells you “how much token i pays attention to token j”. This is the famous “attention map” you see in interpretability papers.

Visual: the attention heatmap

The matrix weights IS what the attention-map figures in Transformer interpretability papers plot. Each row sums to 1: it is a probability distribution over the tokens (itself included) that this token attends to.

# Pyodide / Obsidian Execute Code: install matplotlib first.
# In normal Python (terminal / Jupyter), delete the next 2 lines.
import micropip
await micropip.install("matplotlib")
 
import matplotlib.pyplot as plt
 
tokens = ["The", "cat", "sat", "on"]                     # 4 tokens for our toy example
 
fig, axes = plt.subplots(1, 2, figsize=(11, 4.5))
 
# Left: the actual attention weights
im0 = axes[0].imshow(weights, cmap='viridis', vmin=0, vmax=1)
axes[0].set_xticks(range(len(tokens))); axes[0].set_xticklabels(tokens)
axes[0].set_yticks(range(len(tokens))); axes[0].set_yticklabels(tokens)
axes[0].set_xlabel('attending TO (key)'); axes[0].set_ylabel('attending FROM (query)')
axes[0].set_title('Self-attention weights\n(each row sums to 1)')
for i in range(len(tokens)):
    for j in range(len(tokens)):
        axes[0].text(j, i, f'{weights[i, j]:.2f}', ha='center', va='center',
                     color='white' if weights[i, j] < 0.5 else 'black')
plt.colorbar(im0, ax=axes[0], fraction=0.04)
 
# Right: GPT-style causal mask version
mask = np.triu(np.ones((n, n)), k=1).astype(bool)        # strictly upper triangular = future positions
causal_scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
causal_scores[mask] = -np.inf
causal_weights = softmax(causal_scores)
im1 = axes[1].imshow(causal_weights, cmap='viridis', vmin=0, vmax=1)
axes[1].set_xticks(range(len(tokens))); axes[1].set_xticklabels(tokens)
axes[1].set_yticks(range(len(tokens))); axes[1].set_yticklabels(tokens)
axes[1].set_xlabel('attending TO (key)'); axes[1].set_ylabel('attending FROM (query)')
axes[1].set_title('Masked (GPT-style) self-attention\n(no peeking at future tokens)')
for i in range(len(tokens)):
    for j in range(len(tokens)):
        v = causal_weights[i, j]
        if v > 0.005:
            axes[1].text(j, i, f'{v:.2f}', ha='center', va='center',
                         color='white' if v < 0.5 else 'black')
plt.colorbar(im1, ax=axes[1], fraction=0.04)
 
plt.tight_layout(); plt.show()

What to see:

Left (vanilla self-attention): every cell is filled — token cat can attend to The, sat, on, AND itself. Bidirectional, like BERT.
Right (causal mask): the upper triangle is 0, so token cat can only see The and itself, not sat or on. This is what GPT does: each token attends only to its own and earlier positions. The lower-triangular pattern is the signature of causal attention.
Each row sums to 1 (softmax) — it’s a probability distribution over which tokens to gather from.

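In real Transformers, you’d see this for hundreds of tokens × dozens of heads × dozens of layers. Attention maps are one of the most widely used interpretability tools, although high attention weight alone does not prove that a token was important for the output.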

Why divide by √d_k?

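If the components of q and k are independent with zero mean and unit variance, the dot product q·k has variance d_k, so its typical magnitude grows like √d_k. Without scaling, the softmax saturates at high dimensions, its gradients become vanishingly small, and training stalls. Vaswani et al. (2017) introduced the 1/√d_k scaling factor specifically for this reason. Skipping it is a common bug in DIY implementations.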
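A quick numerical check of the scaling argument, as a sketch. It assumes query and key components are independent standard-normal values, which trained networks only approximate. The helper names `dot_product_std` and `mean_max_weight` are hypothetical and exist only for this illustration.

```python
import numpy as np

def dot_product_std(d_k, n_samples=10_000, seed=0):
    """Empirical std of q·k, before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))   # unit-variance components (assumption)
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

def mean_max_weight(d_k, scale, n_keys=8, n_trials=2_000, seed=0):
    """Average largest softmax weight when one query scores n_keys random keys."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_trials, 1, d_k))
    k = rng.standard_normal((n_trials, n_keys, d_k))
    scores = np.sum(q * k, axis=-1)                        # (n_trials, n_keys)
    if scale:
        scores = scores / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w.max(axis=-1).mean()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  std(q·k)={raw:6.2f}  sqrt(d_k)={np.sqrt(d_k):5.2f}  "
          f"std after scaling={scaled:.2f}  "
          f"max weight unscaled={mean_max_weight(d_k, False):.2f}  "
          f"scaled={mean_max_weight(d_k, True):.2f}")
```

Under these assumptions the unscaled standard deviation should track √d_k while the scaled one stays near 1, and the unscaled softmax should put more and more of its weight on a single key as d_k grows.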

Multi-Head Attention

A single attention head produces only one attention pattern per token, so it tends to capture one “kind of relationship” at a time. Multi-head attention runs h attention computations in parallel, each with its own (W_Q, W_K, W_V), letting different heads specialize:

One head might attend to syntactic relationships (subject-verb)
Another to coreference (this → noun phrase)
Another to long-range dependencies

MultiHead(X) = Concat(head₁, head₂, …, head_h) · W_O
where head_i = Attention(X·W_Q_i, X·W_K_i, X·W_V_i)

GPT-3 (175B) uses 96 heads per layer (GPT-4’s architecture has not been published); smaller models like BERT-base use 12.
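The MultiHead formula can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `multi_head_attention`, the stacked weight layout (h, d_model, d_head), and the choice d_head = d_model / h are assumptions made for the example, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (n, d_model); W_Q/W_K/W_V: (h, d_model, d_head); W_O: (h*d_head, d_model)."""
    Q = np.einsum('nd,hde->hne', X, W_Q)         # per-head queries, (h, n, d_head)
    K = np.einsum('nd,hde->hne', X, W_K)
    V = np.einsum('nd,hde->hne', X, W_V)
    d_head = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    weights = softmax(scores, axis=-1)                     # one attention map per head
    heads = weights @ V                                    # (h, n, d_head)
    # Concat(head_1, ..., head_h): put each token's head outputs side by side
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)   # (n, h*d_head)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
n, d_model, h = 4, 8, 2
d_head = d_model // h
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((h, d_model, d_head)) for _ in range(3))
W_O = rng.standard_normal((h * d_head, d_model))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape, weights.shape)                  # (4, 8) (2, 4, 4)
```

Each head gets its own n × n attention map, and W_O mixes the concatenated head outputs back into d_model dimensions.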

Self-Attention vs. Cross-Attention vs. Masked Self-Attention

Variant	Q from	K, V from	Used in
Self-attention	sequence X	sequence X	BERT (encoder), ViT
Cross-attention	sequence A (e.g. decoder)	sequence B (e.g. encoder output)	T5 decoder, Stable Diffusion (text → image conditioning)
Masked self-attention	sequence X	sequence X (with future positions masked)	GPT — prevents seeing the future during training

The mask in GPT-style models sets Sᵢⱼ = −∞ for j > i, so the softmax gives zero weight to future tokens. This is why GPT can be trained on next-token prediction without leaking information.
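Cross-attention differs from self-attention only in where Q, K, and V come from. A minimal sketch, assuming a 3-token sequence A (say, decoder states) that queries a 5-token sequence B (say, encoder output); the function name `cross_attention` and all shapes are hypothetical choices for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(A, B, W_Q, W_K, W_V):
    Q = A @ W_Q                                  # queries come from sequence A, (n_a, d_k)
    K = B @ W_K                                  # keys come from sequence B,    (n_b, d_k)
    V = B @ W_V                                  # values come from sequence B,  (n_b, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (n_a, n_b), no longer square
    weights = softmax(scores, axis=-1)           # each A-token distributes weight over B
    return weights @ V, weights

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))                  # e.g. decoder states
B = rng.standard_normal((5, 8))                  # e.g. encoder output
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) for _ in range(3))

out, weights = cross_attention(A, B, W_Q, W_K, W_V)
print(out.shape, weights.shape)                  # (3, 4) (3, 5)
```

The output has one row per token of A, and the weight matrix is n_a × n_b rather than n × n.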

Properties

✅ Parallelizable across sequence positions (unlike RNNs)
✅ Direct long-range connections — every token connects to every other in one layer
✅ Permutation-equivariant (good for set data; needs positional encoding for sequences)
❌ O(n²) memory and compute in sequence length → expensive for long sequences
❌ Without positional encoding, can’t distinguish “dog bit man” from “man bit dog”
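The permutation-equivariance property can be checked numerically: shuffling the input tokens shuffles the output rows in the same way and changes nothing else. In this sketch the array `pos` is a random, hypothetical stand-in for a positional encoding, used only to show that position-dependent inputs break the symmetry.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                  # 5 tokens, model dim 8
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])                 # a fixed reordering of the tokens

# Without positional information: permuting inputs just permutes outputs
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)
equivariant = np.allclose(out_perm, out[perm])
print("equivariant without positions:", equivariant)       # True

# With a position-dependent offset the order of the tokens matters
pos = rng.standard_normal((5, 8))                # hypothetical positional encoding
with_pos = self_attention(X + pos, W_Q, W_K, W_V)
with_pos_perm = self_attention(X[perm] + pos, W_Q, W_K, W_V)
still_equivariant = np.allclose(with_pos_perm, with_pos[perm])
print("equivariant with positions:", still_equivariant)    # False
```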

The Hopfield connection (Ramsauer et al., 2020)

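⚠️ Important exam-trap fact: the update rule of modern Hopfield networks (continuous states) is mathematically equivalent to the attention mechanism used in Transformers. The softmax-attention update is exactly the retrieval step of a continuous, high-capacity associative memory: the query plays the role of the state being retrieved, and the keys and values play the role of the stored patterns. This means Transformers can be read as stacks of associative memory layers. See Hopfield Networks for the full bridge.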
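A small sketch of that correspondence, written in the row-vector convention of this page (Ramsauer et al. use column vectors). It assumes the stored patterns serve directly as both keys and values, with no learned projections. The names `hopfield_retrieve` and `attention` and the toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_retrieve(state, patterns, beta):
    """One update: new_state = softmax(beta * state @ patterns.T) @ patterns."""
    return softmax(beta * (state @ patterns.T)) @ patterns

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 64
patterns = rng.standard_normal((6, d))                 # 6 stored patterns (rows)
noisy = patterns[2] + 0.3 * rng.standard_normal(d)     # corrupted copy of pattern 2

# Same computation: attention with Q = state, K = V = patterns, beta = 1/sqrt(d)
same = np.allclose(hopfield_retrieve(noisy, patterns, 1 / np.sqrt(d)),
                   attention(noisy[None, :], patterns, patterns)[0])
print("matches attention:", same)

# A larger beta sharpens the softmax, pulling the state toward one stored pattern
retrieved = hopfield_retrieve(noisy, patterns, beta=1.0)
closest = int(np.argmin(np.linalg.norm(patterns - retrieved, axis=1)))
print("closest stored pattern:", closest)
```

The inverse temperature beta controls how sharp the retrieval is; attention fixes it at 1/√d_k.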

Where Self-Attention is used today

Self-attention is the dominant primitive in modern AI:

All LLMs: GPT-4, Claude, Gemini, LLaMA, Mistral, DeepSeek — every one uses masked self-attention in decoder blocks
BERT / RoBERTa / embedding models — bidirectional self-attention for classification + retrieval
Vision Transformers (ViT) — self-attention over image patches
AlphaFold 2 — Evoformer’s row/column attention over MSAs
Whisper — self-attention for speech recognition
Stable Diffusion — self-attention inside the U-Net + cross-attention to text embeddings
Gato / RT-2: self-attention for game-playing and robotics policies
Time series forecasting — TimesFM, Lag-Llama use self-attention

What Self-Attention is being challenged by

Limitation	Challenger	Why it might win
O(n²) memory + compute	Mamba / S4 / S5 (State Space Models)	Linear time in sequence length → much cheaper for long contexts
O(n²) compute	Linear attention (Performer, Linformer, RWKV)	Approximate softmax → O(n) compute
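Long-context modeling	FlashAttention + ring attention	Still exact attention with O(n²) compute, but tiled (and, for ring attention, sharded across devices) so the full n × n matrix is never materialized; brings 1M-token context within reach
Dense compute per token	Mixture-of-Experts (MoE) with attention	Sparse activation: only a few experts run per token, usually in the feed-forward blocks, so MoE complements attention rather than replacing it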

Status (early 2026): vanilla self-attention still wins on most benchmarks, but Mamba-style SSMs are catching up fast on long-context tasks. Hybrids (attention + SSM layers) are an active area.
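To make the O(n) idea behind linear attention concrete, here is a sketch of kernelized attention with the feature map φ(x) = elu(x) + 1 from Katharopoulos et al. (2020). Note the assumptions: this is not one of the methods named in the table, and it replaces the softmax with a different similarity function rather than approximating it. The point is only that reordering the matrix products removes the n × n matrix. The function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    """phi(x) = elu(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(Q, K, V):
    """Reference version: builds the full n × n similarity matrix."""
    A = feature_map(Q) @ feature_map(K).T              # (n, n)
    return (A / A.sum(axis=-1, keepdims=True)) @ V

def kernel_attention_linear(Q, K, V):
    """Same result via associativity: phi(Q) @ (phi(K).T @ V), no n × n matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                                   # (d_k, d_v), one pass over tokens
    z = phi_k.sum(axis=0)                              # (d_k,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d_k, d_v = 128, 16, 16
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

match = np.allclose(kernel_attention_quadratic(Q, K, V),
                    kernel_attention_linear(Q, K, V))
print("linear form matches quadratic form:", match)    # True
```

The linear form costs O(n · d_k · d_v) instead of O(n² · d_k), which is where the savings for long sequences come from; the price is that the weights are no longer a softmax.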
