Transformers

Transformers are a type of deep learning model.
Excel in handling sequential data, like text and time series.
Key innovation: self-attention mechanism, allowing models to weigh the importance of different input parts.
Highly effective in NLP tasks: translation, summarization, question-answering.
Architecture consists of encoder and decoder layers; some models use only encoders (e.g., BERT) or decoders (e.g., GPT) for specific tasks.
Scalable and parallelizable, leading to large, powerful models.
Pre-training and fine-tuning approach enables adaptation to various tasks with minimal task-specific data.
Transformers have set new standards in accuracy for many NLP benchmarks.

How does a transformer work?

Transformers are a type of deep learning model that processes all positions of a sequence in parallel, which makes them efficient to train for tasks such as natural language processing (NLP). They rely on a mechanism called self-attention to weigh the importance of different words in a sentence, which lets them capture context and relationships between words. The original transformer consists of two main parts: the encoder, which processes the input, and the decoder, which generates the output. This architecture allows transformers to excel at tasks such as translation, text summarization, and question answering by capturing complex dependencies in data.

Embedding

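Transformers use an embedding layer to turn words into numbers. The text is first split into tokens (words or word pieces), and every token in the model's vocabulary has its own integer ID.
Each ID then selects one row of a learned embedding matrix, so every token is represented by a vector of numbers rather than by a single code.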
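The lookup described above can be sketched in a few lines. This is a minimal illustration, not how any particular model does it: the five-word vocabulary, the whitespace tokenisation, the width `d_model = 8`, and the names `vocab`, `embedding_table` and `embed` are all assumptions made for the example, and the table is filled with random numbers where a real model would hold learned values.

```python
import numpy as np

# Hypothetical toy vocabulary: each word gets an integer id (its "code").
vocab = {"<unk>": 0, "dog": 1, "bites": 2, "man": 3, "the": 4}
d_model = 8  # embedding width, kept small for illustration

# In a trained model these rows are learned; here they are random placeholders.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(sentence):
    """Map a whitespace-separated sentence to (token ids, embedding matrix)."""
    words = sentence.lower().split()
    ids = np.array([vocab.get(w, vocab["<unk>"]) for w in words])
    return ids, embedding_table[ids]  # row lookup: one vector per token

ids, vectors = embed("The dog bites the man")
print(ids)            # integer codes, one per word
print(vectors.shape)  # (number of tokens, d_model)
```

Both occurrences of "the" receive the same vector here, because at this stage the representation depends only on the token, not on its position or context.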

Positional Encoding

Every token also has its position in the sequence encoded. Each embedding dimension uses a sine or cosine with a different wavelength, so every position ends up with a different combination of values.

🐍 Figure — Sinusoidal positional encoding (Vaswani et al. 2017)

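# Pyodide (in-browser Python) only: install the plotting dependencies at runtime.
# In a regular Python environment, drop these three lines and install with pip instead.
import micropip
await micropip.install("matplotlib")
await micropip.install("numpy")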
import matplotlib.pyplot as plt
import numpy as np
 
d_model = 64
max_pos = 50
 
# PE(pos, 2i)   = sin(pos / 10000^(2i/d))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
pos = np.arange(max_pos)[:, None]
i   = np.arange(d_model)[None, :]
div = np.power(10000.0, (2 * (i // 2)) / d_model)   # neighbouring dims share a frequency
angles = pos / div
 
PE = np.zeros((max_pos, d_model))
PE[:, 0::2] = np.sin(angles[:, 0::2])               # even dims → sin
PE[:, 1::2] = np.cos(angles[:, 1::2])               # odd dims  → cos
 
fig, (ax_h, ax_r) = plt.subplots(1, 2, figsize=(12, 5),
                                 gridspec_kw={"width_ratios": [3, 2]})
 
im = ax_h.imshow(PE, aspect="auto", cmap="RdBu_r", vmin=-1, vmax=1)
ax_h.set_xlabel("embedding dimension  i  (0 … 63)")
ax_h.set_ylabel("position  pos  (0 … 49)")
ax_h.set_title("Sinusoidal positional encoding  PE(pos, i)\n"
               "Vaswani et al. 2017 — fixed (not learned), absolute positions")
ax_h.annotate("low dims\nvary FAST", xy=(3, 25), xytext=(8, 6),
              fontsize=9, color="black",
              arrowprops=dict(arrowstyle="->", color="black"))
ax_h.annotate("high dims\nvary SLOW", xy=(60, 25), xytext=(38, 45),
              fontsize=9, color="black",
              arrowprops=dict(arrowstyle="->", color="black"))
plt.colorbar(im, ax=ax_h, fraction=0.046, pad=0.04, label="PE value")
 
for p, col in [(0, "#3498db"), (5, "#27ae60"), (20, "#e74c3c")]:
    ax_r.plot(PE[p], label=f"pos = {p}", color=col, lw=1.6)
ax_r.set_xlabel("embedding dimension  i")
ax_r.set_ylabel("PE value")
ax_r.set_title("A few PE row-vectors\n(each position → a unique signature)")
ax_r.grid(alpha=0.3); ax_r.legend(fontsize=9)
 
plt.tight_layout(); plt.show()

What this shows. Left: the full (50 × 64) positional-encoding matrix as a heatmap, with one row per token position, one column per embedding dimension, and the classic banded pattern. Each pair of dimensions is a sine/cosine of a different wavelength: low-index dimensions oscillate fast with position, high-index ones change slowly, so every position gets a unique multi-frequency signature (right panel). This matters because the Self-Attention mechanism at the heart of transformers is order-blind (strictly, permutation-equivariant): shuffling the input tokens only shuffles the outputs in the same way, so on its own it cannot tell “dog bites man” from “man bites dog”. Adding PE to the token embeddings injects absolute (and, via trigonometric identities, relative) position information back into an otherwise order-blind model. Because the encoding is a fixed function rather than learned parameters, it can be evaluated at positions never seen during training, although how well a model extrapolates to longer sequences varies in practice. Recurrent models such as classic RNNs need no such encoding, because they read tokens one at a time and so handle order implicitly.
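The order-blindness of self-attention can be checked numerically. The sketch below is a simplified setup, not a full transformer layer: it assumes a single head with identity projections (queries, keys and values are all the raw input), random vectors standing in for the three word embeddings, and hypothetical helper names `self_attention` and `sinusoidal_pe`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X

def sinusoidal_pe(n_pos, d_model):
    """Same formula as the figure code: sin on even dims, cos on odd dims."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))   # stand-in embeddings for "dog", "bites", "man"
perm = np.array([2, 1, 0])    # reversed order: "man", "bites", "dog"
pe = sinusoidal_pe(3, 8)

# Without PE: shuffling the input rows just shuffles the output rows.
no_pe_same = np.allclose(self_attention(X)[perm], self_attention(X[perm]))

# With PE: a word moved to a different position gets a different output.
with_pe_same = np.allclose(self_attention(X + pe)[perm],
                           self_attention(X[perm] + pe))
print(no_pe_same, with_pe_same)
```

The first check should hold and the second should fail: without positional encoding each word's output is the same whichever order the sentence is in, and adding the encoding is what breaks that symmetry.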

Self Attention

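For every word, the model computes how relevant each word in the sequence is to it and turns these scores into weights that sum to 1. For example, in “The pizza came out of the oven and it tasted good”, the word “it” should attend strongly to “pizza”. The more often the model sees training texts in which “it” refers to a word like “pizza”, the more weight it learns to place on that word in similar contexts.
A word often attends strongly to itself, but it does not have to: the weights are learned and depend on the context.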
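The weighting described above is usually computed as scaled dot-product attention, softmax(Q Kᵀ / √d_k) V, following Vaswani et al. 2017. The sketch below shows a single head under illustrative assumptions: the sizes are arbitrary, the input `X` is random, and `W_q`, `W_k`, `W_v` are random stand-ins for the projection matrices a real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # one row of weights per query token
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n_tokens, d_model))  # token embeddings plus positional encoding

# Random stand-ins for the learned projection matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(weights.round(2))     # row t: how much token t attends to each token
print(weights.sum(axis=1))  # every row sums to 1
```

Each output row is a weighted average of the value vectors, so a token's new representation mixes in information from the tokens it attends to most.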

Encoder-Decoder-Attention

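Encoder-decoder attention (also called cross-attention) lets the decoder keep track of the input sentence while it generates the output. For each word it is about to produce, the decoder weighs the encoder's representations of all input words and focuses on those most relevant to that output word. In translation, this helps ensure that important input words are not lost, even when the word order differs between the two languages.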
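Encoder-decoder attention uses the same computation as self-attention, except that the queries come from the decoder while the keys and values come from the encoder output. The sketch below assumes a five-token source sentence and three target tokens generated so far; the function name `cross_attention`, the sizes, and the random matrices are illustrative stand-ins, not taken from any specific model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    """Queries from the decoder; keys and values from the encoder."""
    Q = decoder_states @ W_q
    K = encoder_outputs @ W_k
    V = encoder_outputs @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_outputs = rng.normal(size=(5, d_model))  # 5 source-sentence tokens
decoder_states = rng.normal(size=(3, d_model))   # 3 target tokens so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v)
print(weights.shape)           # (target tokens, source tokens)
print(weights.argmax(axis=1))  # most-attended source token per target token
```

The weight matrix is rectangular: each row belongs to one output position and distributes its attention over the input positions, which is how an output word can draw on an input word at a very different place in the sentence.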

Residual Connections

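Residual connections add the input of each sub-layer (self-attention, encoder-decoder attention, feed-forward) back to its output. Each sub-layer therefore only has to learn a correction to what it receives, without also having to preserve the word embedding and position information, and gradients can flow through the shortcut, which makes deep stacks easier to train.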
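A residual connection amounts to one addition around a sub-layer. The sketch below follows the post-norm arrangement of the original transformer, LayerNorm(x + Sublayer(x)), with two simplifications that are assumptions of this example: the layer norm has no learned gain and bias, and the feed-forward weights are set so that the sub-layer outputs zero, which makes the shortcut path easy to see. The names `layer_norm`, `feed_forward` and `residual_block` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, W2):
    """Position-wise feed-forward sub-layer with a ReLU in between."""
    return np.maximum(0.0, x @ W1) @ W2

def residual_block(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 tokens, d_model = 8
W1 = rng.normal(size=(8, 32)) * 0.1
W2 = np.zeros((32, 8))               # sub-layer that contributes nothing yet

y = residual_block(x, lambda h: feed_forward(h, W1, W2))

# With a zero sub-layer, the block simply passes the (normalised) input through.
print(np.allclose(y, layer_norm(x)))
```

Because the input always reaches the output through the shortcut, the sub-layer can start from doing nothing and gradually learn only the adjustment it needs to make.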

Where Transformers are used today

Transformers (Vaswani et al., 2017) are now the dominant architecture across much of AI, far beyond the NLP tasks they were invented for.

Large language models: GPT-4, Claude, Gemini, LLaMA, DeepSeek, Mistral. Decoder-only architectures.
Encoder models for retrieval and classification: BERT, RoBERTa, embedding models (OpenAI’s text-embedding-3, Sentence-BERT).
Image generation and vision: Stable Diffusion (cross-attention conditioning), DALL-E 3, Imagen 2; ViT (Vision Transformer) for image classification.
Speech: Whisper, Conformer (hybrid CNN+Transformer).
Protein folding: AlphaFold 2’s Evoformer is a heavily customized transformer; ESMFold builds on a transformer language model trained on protein sequences.
Video: VideoMAE and V-JEPA for video understanding; Sora’s diffusion transformer for video generation.
Robotics: RT-1, RT-2 (Google DeepMind), which use transformers as robot policy networks consuming image and language tokens.
Time series and forecasting: TimesFM (Google), Lag-Llama. Transformers competing with classical ARIMA and RNN models.
Code: Copilot, Code Llama, DeepSeek-Coder. Transformer LLMs trained on code.
Recommendation systems: Pinterest, Meta, Netflix use transformer-based sequence models for session prediction.

What Transformers replaced — and what’s challenging them now

What was replaced	Why transformers won
RNNs / LSTMs (Hochreiter & Schmidhuber 1997) for sequence modeling	Parallelizable across sequence positions (RNNs are sequential); long-range dependencies via direct attention, with short gradient paths between distant tokens
CNNs for many vision tasks (via ViT, Dosovitskiy et al. 2021)	Global receptive field from layer 1; scales better with data
Seq2seq RNNs with attention (Bahdanau et al. 2014) for translation	Dropped recurrence entirely, so training became much faster
Static word embeddings (Word2Vec, GloVe)	Contextual embeddings (each occurrence of “bank” gets its own vector based on context)
Tree-structured models for parsing	End-to-end attention learns hierarchical structure implicitly

Currently challenging transformers	Why it matters
State Space Models (S4, S5, Mamba), 2021 onward	Linear-time complexity in sequence length (vs. quadratic for attention) → cheaper for very long contexts
MoE (Mixture of Experts) architectures: Mixtral, DeepSeek-V3	Only a few experts are active per token, so each token costs far less compute than in a dense model with the same total parameter count; these are still transformers and challenge the dense variant
Linear attention variants (Performer, Linformer, RWKV)	Approximate attention in O(N) instead of O(N²)
Hyena / FlashFFTConv	Convolution-based long-range modeling — competitive on some benchmarks
Diffusion language models	Parallel token generation instead of autoregressive — research stage
JEPA architectures (LeCun’s vision)	Predicting embeddings rather than tokens — early stage
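To make the quadratic-versus-linear comparison in the table concrete, the sketch below counts pairwise attention scores (N² per head and layer) against the N sequential steps of a linear-time model. It is back-of-the-envelope arithmetic only: the function names are hypothetical, constant factors and hidden sizes are ignored, and it says nothing about actual runtime on real hardware.

```python
def attention_score_entries(n_tokens):
    """Pairwise scores in one full attention matrix: N * N."""
    return n_tokens * n_tokens

def linear_steps(n_tokens):
    """Steps of a model that touches each token once: N."""
    return n_tokens

rows = []
for n in (1_000, 10_000, 100_000):
    scores = attention_score_entries(n)
    steps = linear_steps(n)
    rows.append((n, scores, steps, scores // steps))
    print(f"N={n:>7,}  attention scores={scores:>14,}  "
          f"linear steps={steps:>7,}  ratio={scores // steps:,}x")
```

The ratio grows with N itself, which is why the gap only starts to matter for very long contexts.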

Status quo (early 2026): transformers still dominate, but hybrids (SSM + attention, MoE + transformer) are eating into pure-transformer dominance for specific use cases. Pure attention is no longer obviously the best architecture for long-context — Mamba-style SSMs are catching up fast.
