Transformer Internals

How transformers actually compute, from the ground up. Not the high-level “attention is all you need” version — the mechanistic view that makes interpretability research possible.

Roadmap Step 1 · 1–2 weeks · Foundational

In this section

  01 Prerequisites
  02 The Big Picture: Residual Streams
  03 Embeddings
  04 Attention Heads
  05 MLP Layers
  06 Unembedding & Logits
  07 Circuits & Composition
  08 Paper Guide: Mathematical Framework
  09 Required Reading
  10 Exercises & Deliverables
01

Prerequisites

What you need before starting. Check what you’re comfortable with.

  • Linear algebra: Vectors, matrices, matrix multiplication, dot products, transposes. You need to be fluent, not just familiar.
  • Python + NumPy: Comfortable with array operations, broadcasting, reshaping. This is your lab equipment.
  • Basic ML: What a loss function is, gradient descent at a high level, what “training” means. You don’t need deep expertise.
  • Softmax: Takes a vector of numbers, outputs a probability distribution. softmax(x)i = e^(xi) / ∑j e^(xj)
  • PyTorch basics: Tensors, modules, forward passes. You’ll pick this up as you go.
  • Probability: Conditional probability, distributions. Helpful for understanding predictions.
  • Information theory: Entropy, cross-entropy loss. Not essential but adds depth.
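As a quick self-check on the softmax prerequisite, here it is in a few lines of NumPy (a minimal sketch; subtracting the max is a standard numerical-stability trick and doesn't change the output, since softmax is invariant to adding a constant to every entry):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating to avoid overflow;
    # the result is identical because e^(x - c) / sum(e^(x - c)) = e^x / sum(e^x).
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is a valid probability distribution: every entry positive, entries sum to 1
```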
If you’re rusty

3Blue1Brown’s Essence of Linear Algebra series is the best visual refresher. Takes about 3 hours and builds genuine intuition.

02

The Big Picture: Residual Streams

The single most important mental model for mechanistic interpretability.

The standard way to explain transformers starts with attention mechanisms and builds up from there. That’s useful for building one, but misleading for understanding one. The mechanistic interpretability community uses a different framing, introduced by Elhage et al. (2021), that makes the internal structure much clearer.

The key insight: organize your picture of the transformer around the residual stream, a vector at each token position that flows through the entire network. Every component — every attention head, every MLP layer — reads from this stream and writes back to it. The components don’t talk to each other directly. They communicate through the shared stream.

Key Insight

The residual stream is a communication bus. Attention heads and MLPs are like peripherals: they read information from the bus, process it, and write their results back. The bus carries all information from input to output. This is why residual connections matter so much — they create the bus.

The Residual Stream Equation xfinal = xembed + attn0 + mlp0 + attn1 + mlp1 + … Each term is one component’s output (computed from the stream as it stood when that component ran), added into the stream. The final stream is the initial embedding plus every component’s contribution.
Token + Position Embeddings WE · token + Wpos[position]
Residual Stream
Layer 0
Multi-Head Self-Attention reads stream → computes attention → writes back
Feed-Forward MLP reads stream → nonlinear transform → writes back
Residual Stream (updated)
Layer 1
Multi-Head Self-Attention reads stream → computes attention → writes back
Feed-Forward MLP reads stream → nonlinear transform → writes back
Final Residual Stream
Unembedding → Logits → Next Token Prediction WU · xfinal → softmax → probabilities

Why does this matter for interpretability? Because every component’s contribution is additive. The final output is literally the sum of all contributions. This means we can ask: “How much did attention head 3 in layer 5 contribute to predicting this token?” And we can get a real answer, because its contribution is a vector that gets added to the stream.

This is direct logit attribution: take any component’s output, multiply it by the unembedding matrix, and you can see exactly which tokens it pushes the model toward predicting.

03

Embeddings: Tokens to Vectors

How text becomes math.

Before the transformer can do anything, it needs to convert tokens (integers representing words or subwords) into vectors. This is the embedding layer. It’s conceptually simple: a lookup table. Token 4217 maps to a specific 768-dimensional vector (in GPT-2 Small). That vector is the token’s “starting point” in the residual stream.

The embedding matrix WE has shape [vocab_size × d_model]. For GPT-2 Small: [50257 × 768]. Each row is a learned vector for one token.

These vectors aren’t random — after training, they encode semantic relationships. Tokens with similar meanings end up near each other. The embedding space has geometric structure that the rest of the network exploits.

Attention is permutation-invariant — it can’t tell token order without help. So we add a position embedding: another learned vector for each position in the sequence (0, 1, 2, …, max_len − 1).

The initial residual stream for token t at position p is simply:

x0 = WE[t] + Wpos[p] Sum of token meaning + position information. This is the starting state of the residual stream.
Why This Matters

The embedding creates the initial residual stream state. Everything the model does after this is about reading, processing, and updating this stream. The embedding is the model’s prior belief about each token before any contextual processing.
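The lookup-plus-add is one line of NumPy. Here is a toy sketch with random weights and made-up small dimensions (GPT-2 Small's real values are noted in the comments; the token ids are arbitrary):

```python
import numpy as np

# Toy dimensions. GPT-2 Small: vocab_size=50257, n_ctx=1024, d_model=768.
vocab_size, n_ctx, d_model = 100, 16, 8
rng = np.random.default_rng(0)
W_E = rng.normal(size=(vocab_size, d_model))    # token embedding table (one row per token)
W_pos = rng.normal(size=(n_ctx, d_model))       # position embedding table (one row per position)

tokens = np.array([42, 7, 13])                  # arbitrary token ids for a 3-token prompt
positions = np.arange(len(tokens))              # positions 0, 1, 2

# Initial residual stream: one vector per position, x0 = W_E[t] + W_pos[p]
x0 = W_E[tokens] + W_pos[positions]             # shape [seq, d_model]
```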

04

Attention Heads: Pattern Matching & Information Movement

The most important component to understand deeply. This is where the subtlety lives.

Each attention head does two things: it decides where to look (which other tokens to attend to) and what to move (what information to copy from those positions to the current position). These are controlled by two separate circuits with very different functions.

[Diagram] Residual stream vectors, one per token (“The”, “cat”, “sat”, “on”); each is a 768-dimensional vector in GPT-2 Small.

Start: Read from the Residual Stream

Each attention head reads the residual stream at every token position. At this point, each vector contains the token embedding plus all prior layers’ contributions. The head will process these to decide what information to move where.

[Diagram] Each token’s stream vector (e.g. “sat”) is projected into Q, K, and V; this happens for all 4 tokens and all 12 heads.

Project to Q, K, V

Each head has three small weight matrices (WQ, WK, WV) that project from the residual stream (768-dim) to a smaller head dimension (64-dim in GPT-2). The Query asks “what am I looking for?” The Key says “what do I contain?” The Value is “what information to send if attended to.”

Q = x · WQ    K = x · WK    V = x · WV. Each projection: [batch, seq, 768] → [batch, seq, 64]. The head works in a smaller subspace.
Attention scores (QKT / √dk):

           The    cat    sat    on
    The    2.1     –      –     –
    cat    0.4    1.8     –     –
    sat    0.1    2.3    0.9    –
    on    -0.2    0.6    1.1   0.3

Dashes indicate masked positions (causal attention: can’t look at future tokens).

Compute Attention Scores

The dot product Q · KT measures how much each query “matches” each key. High score = strong match. We divide by √dk to prevent scores from getting too large (which would make softmax too “peaky”). In autoregressive models, we mask future positions so tokens can only attend to the past.

scores = Q · KT / √dk Shape: [seq_len × seq_len]. Entry (i, j) is how much token i attends to token j.
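The scoring and masking steps can be sketched in NumPy. Q and K here are random toy values, not from a trained model; setting masked scores to negative infinity is the standard way to make them vanish after softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_head = 4, 8
Q = rng.normal(size=(seq, d_head))
K = rng.normal(size=(seq, d_head))

scores = Q @ K.T / np.sqrt(d_head)                      # [seq, seq] raw match scores
# Causal mask: position i may only attend to positions j <= i.
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)    # True above the diagonal
scores[mask] = -np.inf                                  # exp(-inf) = 0 after softmax

# Row-wise softmax: each row becomes a distribution over visible positions
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)
```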
Attention pattern (after softmax):

           The    cat    sat    on
    The   1.00
    cat   0.20   0.80
    sat   0.08   0.55   0.37
    on    0.07   0.28   0.41   0.24

Each row sums to 1.0; the highest weight in each row is what attention-pattern visualizations highlight.

Softmax → Attention Pattern

Softmax normalizes each row so the weights sum to 1. Now each row is a probability distribution over source tokens. This is the attention pattern — the thing you see in attention visualizations. Here, “sat” attends mostly to “cat” (0.55), then to itself (0.37).

Weighted sum of value vectors: zsat = 0.08 × VThe + 0.55 × Vcat + 0.37 × Vsat (the output for “sat”).

Compute Weighted Sum of Values

For each token position, multiply the attention weights by the value vectors and sum. The result is a weighted blend of information from the attended positions. Token “sat” gets 55% of the “cat” value vector, 37% of its own, and 8% of “The”. This is how attention moves information between positions.

z = pattern · V Matrix multiply: [seq × seq] × [seq × d_head] → [seq × d_head]. One output vector per position.
[Diagram] Project back to residual stream dimension and add: zsat · WO gives the projected output for “sat”, which is added to the residual stream: x′sat = xsat + zsat · WO.

Write Back to the Residual Stream

The output matrix WO projects the head’s output (64-dim) back to the residual stream dimension (768-dim). This result is added to the residual stream. Multiple heads in the same layer all add their outputs simultaneously. The stream now carries the original information plus whatever this head contributed.

x′ = x + z · WO The “+” is the residual connection. It’s why it’s called a residual stream.
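The whole read-process-write cycle fits in a short function. Below is a minimal NumPy sketch with toy dimensions and random weights, ignoring biases and layer norm (which real implementations include); names like `attention_head` are ours, not from any library:

```python
import numpy as np

def attention_head(x, W_Q, W_K, W_V, W_O):
    """One attention head's additive contribution to the residual stream.
    x: [seq, d_model]; returns [seq, d_model]."""
    seq = x.shape[0]
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V                 # [seq, d_head] each
    scores = Q @ K.T / np.sqrt(W_Q.shape[1])            # scaled dot products
    scores[np.triu(np.ones((seq, seq), dtype=bool), k=1)] = -np.inf  # causal mask
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)      # rows sum to 1
    z = pattern @ V                                     # weighted blend of value vectors
    return z @ W_O                                      # project back to d_model

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 4, 5
x = rng.normal(size=(seq, d_model))                     # toy residual stream
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

x_new = x + attention_head(x, W_Q, W_K, W_V, W_O)       # the residual connection
```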

QK Circuit: “Where to Look”

Pattern

The WQWKT matrix (the QK circuit) determines the attention pattern. It’s a bilinear form that computes a score for every (query-position, key-position) pair. When we say an attention head “looks at the previous token,” that behavior lives in the QK circuit.

QK Circuit: Aij ∝ xiT WQ WKT xj. How much does position i attend to position j? Depends on what’s at both positions.

OV Circuit: “What to Move”

Information

The WVWO matrix (the OV circuit) determines what information gets moved. Once the head has decided to attend from position i to position j, the OV circuit controls what about position j gets written into position i’s residual stream.

OV Circuit: outputi = ∑j Aij · (xj WV WO). A weighted sum of (source info × OV projection). The OV circuit is a linear map on the source vectors.
Why Separating QK and OV Matters

This decomposition is the foundation of circuit analysis. The QK circuit and OV circuit have independent roles and can be analyzed independently. When you find an attention head that does something interesting (e.g., copies the previous token), you can separately ask: “How does it know to look there?” (QK) and “What does it copy?” (OV). Different heads might have the same QK pattern but different OV behavior, or vice versa.

Multi-Head Attention

GPT-2 Small has 12 heads per layer. Each head operates in its own 64-dimensional subspace (768 / 12 = 64). All 12 heads read from the same residual stream, compute independently, then their outputs are summed and added back. Each head can learn a completely different pattern — one might attend to the previous token, another to the subject of the sentence, another to punctuation.
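The low-rank structure of the QK and OV circuits is easy to see directly. A sketch using GPT-2 Small's dimensions but random placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 768, 64                 # GPT-2 Small dimensions
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

W_QK = W_Q @ W_K.T    # [768, 768] bilinear form: "where to look"
W_OV = W_V @ W_O      # [768, 768] linear map:    "what to move"

# Both are full-size d_model x d_model matrices, but each factors through
# a d_head-dimensional bottleneck, so their rank is at most 64.
assert np.linalg.matrix_rank(W_QK) <= d_head
assert np.linalg.matrix_rank(W_OV) <= d_head
```

This is why the paper calls each head "low-rank": the head can only read and write a 64-dimensional slice of the 768-dimensional stream.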

05

MLP Layers: The Processing Units

Where knowledge is stored and nonlinear computation happens.

After attention heads move information between token positions, the MLP layer processes each position independently. Unlike attention (which mixes information across positions), the MLP applies the same transformation to each token’s residual stream vector separately.

MLP Computation mlp(x) = GELU(x · Win + bin) · Wout + bout Win: [768 × 3072], Wout: [3072 × 768] for GPT-2 Small. The inner dimension (3072) is 4× the model dimension.

The “key-value memory” interpretation: each column of Win is a “key” that the input is matched against (the match strength is the dot product x · Win[:, k]). When there’s a strong match (high activation after GELU), the corresponding row of Wout is the “value” that gets added to the residual stream. The MLP stores associations: when the input looks like X, add Y to the stream.

This is where factual knowledge largely lives. The fact that “Paris is the capital of France” is stored as: when the residual stream encodes a “capital of France?” query, specific MLP neurons activate and push the stream toward “Paris.”

The GELU (or ReLU) activation between the two linear layers is what makes MLPs more than just another linear transformation. Without it, the entire transformer would be a single linear function (since the composition of linear functions is linear).

The nonlinearity enables conditional computation: the MLP can implement if/then logic. “If the context says this is about France AND the question is about capitals, THEN add the Paris vector.” Attention can’t do this because attention is linear in the values.
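A minimal NumPy sketch of the MLP, using the tanh approximation of GELU (the variant GPT-2 itself uses); weights here are random placeholders and dimensions are toy-sized:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_in, b_in, W_out, b_out):
    """Position-wise MLP: the same map applied independently to each token's vector."""
    pre = x @ W_in + b_in           # [seq, 4*d_model]: match strength against each "key"
    post = gelu(pre)                # nonlinearity gates which "values" fire
    return post @ W_out + b_out     # weighted sum of W_out rows (the "values")

rng = np.random.default_rng(0)
d_model, seq = 8, 3                 # GPT-2 Small: d_model=768, inner dim 3072
W_in = rng.normal(size=(d_model, 4 * d_model))
b_in = np.zeros(4 * d_model)
W_out = rng.normal(size=(4 * d_model, d_model))
b_out = np.zeros(d_model)

x = rng.normal(size=(seq, d_model))
out = mlp(x, W_in, b_in, W_out, b_out)   # this gets added to the residual stream
```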

Attention vs. MLP: The Division of Labor

A useful mental model: attention moves information between positions; MLPs process information at each position. Attention is the routing network; MLPs are the processing nodes. Attention says “this token should know about that token.” MLPs say “given what this token now knows, update its representation.” This is a simplification, but a productive one.

06

Unembedding & Logits

From vectors back to words.

After all layers have processed the residual stream, the model needs to convert the final vector back into a prediction over tokens. The unembedding matrix WU does this: it projects the residual stream (768-dim) to the vocabulary size (50,257 for GPT-2), producing a score (logit) for every possible next token.

Unembedding logits = xfinal · WU    probabilities = softmax(logits) WU shape: [768 × 50257]. The logit for token t is the dot product of xfinal with the t-th column of WU.

This is where direct logit attribution becomes powerful. Because the final residual stream is the sum of all components’ contributions, the logit for any token is also the sum of each component’s contribution:

Logit Decomposition logit(t) = xembed · WU[t] + attn0 · WU[t] + mlp0 · WU[t] + … Each term tells you how much one component pushes toward predicting token t. This is the basis of most circuit analysis.
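Because the unembedding is linear, this decomposition is exact, which a toy NumPy example can verify (the component vectors below are random stand-ins for real activations, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20
W_U = rng.normal(size=(d_model, vocab))   # toy unembedding matrix

# Toy per-component contributions to one position's residual stream
x_embed = rng.normal(size=d_model)
attn_out = rng.normal(size=d_model)
mlp_out = rng.normal(size=d_model)

x_final = x_embed + attn_out + mlp_out    # stream = sum of contributions
logits = x_final @ W_U

# Direct logit attribution: each component's logit contribution, separately.
# Their sum matches the full logits exactly, because unembedding is linear.
per_component = np.stack([x_embed @ W_U, attn_out @ W_U, mlp_out @ W_U])
```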

The Logit Lens

A simple but revealing technique: apply the unembedding at intermediate layers (not just the final one). At each layer, you can see what the model would predict if processing stopped there. Typically you see the prediction start vague and progressively sharpen. Sometimes the correct answer appears surprisingly early, revealing that later layers are doing refinement rather than core computation.

07

Circuits & Composition

How simple components combine to implement complex behaviors.

Individual attention heads and MLPs are interesting, but the real power comes from composition — how components in different layers work together. Because later layers can read what earlier layers wrote to the residual stream, attention heads can effectively “chain” their computations.

Q-Composition

Type

Head B in a later layer uses the output of Head A (written to the residual stream) as its query input. Head A’s output tells Head B what to look for.

K-Composition

Type

Head B uses Head A’s output as its key input. Head A’s output changes what other tokens advertise to Head B.

V-Composition

Type

Head B uses Head A’s output as its value input. Head A’s output changes what information gets moved when Head B attends to that position.

Induction heads are a two-head circuit that implements in-context pattern completion. Given a sequence like [A][B] … [A], an induction circuit predicts [B] will come next. It detects that the current token [A] appeared before, finds what followed it, and copies that prediction forward.

This is arguably the most important circuit discovered so far. It’s the mechanism behind in-context learning in transformers, and it’s a beautiful example of how two simple heads compose to implement a complex algorithm.

[Diagram] Input sequence: “Harry” “Potter” … “Harry” → ???

The Pattern

The model has seen “Harry Potter” earlier in the sequence. Now it sees “Harry” again and needs to predict the next token. The induction head circuit will recognize this repeated pattern and predict “Potter.”

Head A (Layer 0): Previous Token Head
[Diagram] At the “Potter” position, Head A attends to the previous token (“Harry”) and writes its identity into the stream.
At EVERY position, this head copies the identity of the previous token into the residual stream. So at the position after “Harry” (which is “Potter”), the residual stream now encodes: “the token before me was Harry.”

Step 1: Previous Token Head (Layer 0)

Head A has a simple job: it always attends to the previous token position and copies information about that token. After Head A runs, each position’s residual stream has been enriched with information about the preceding token. This is a general-purpose operation — it doesn’t know about induction yet.

Head B (Layer 1): Induction Head
[Diagram] Head B’s query at the current “Harry” position asks “where was Harry before?”. Via K-composition its keys match the “Potter” position, whose stream now says “preceded by Harry”; Head B attends there, and its OV circuit copies “Potter” forward as the prediction.

Step 2: Induction Head (Layer 1)

Head B does the clever part. Its query at the current “Harry” position asks: “where in the past was there a token preceded by Harry?” Thanks to Head A’s output, the position of “Potter” now has “preceded by Harry” written into its residual stream. This is K-composition — Head B’s keys use Head A’s output. Head B attends to “Potter,” and its OV circuit copies the token identity → predicting “Potter.”

Complete circuit
1. Head A (L0) copies prev-token info everywhere
2. Head B (L1) searches for “preceded by current token” via K-composition
3. Head B copies the matched token via OV circuit
4. Result: [A][B]…[A] → predict [B]

The Complete Induction Circuit

Two heads, each doing something simple, compose to implement a powerful algorithm. Neither head “understands” induction on its own. The behavior emerges from their interaction through the residual stream. This is what mechanistic interpretability means by “circuits” — computational subnetworks whose behavior can be understood and predicted.
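Written as ordinary code, the algorithm the two heads implement looks like this (an illustrative sketch only; the model realizes it with attention patterns, not an explicit backward search):

```python
def induction_predict(tokens):
    """Predict the next token the way an induction circuit does: find an
    earlier occurrence of the current token and return whatever followed it.
    Returns None if the current token has not appeared before."""
    current = tokens[-1]
    # Head A's contribution: at each position j, the stream records which
    # token preceded it (tokens[j - 1]).
    # Head B's contribution: attend to a position whose *previous* token
    # matches the current token, then copy that position's token forward.
    for j in range(len(tokens) - 2, 0, -1):   # scan backward: most recent match wins
        if tokens[j - 1] == current:
            return tokens[j]
    return None

prediction = induction_predict(["Harry", "Potter", "and", "Harry"])
# prediction is "Potter": the circuit completes the repeated pattern
```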

Why Induction Heads Matter

Induction heads are believed to be the primary mechanism for in-context learning. They appear in every transformer large enough to have two layers. They emerge at a specific point during training (a “phase change”), and their emergence coincides with a sharp drop in loss. Understanding this one circuit gives you intuition for how transformers learn algorithms through composition.

08

Reading Guide: A Mathematical Framework for Transformer Circuits

Section-by-section guide to the foundational paper. Read the paper alongside this guide.

How to Use This Guide

Read each paper section first, then come back here. The guide won’t make sense without reading the original. The goal is to highlight what’s important, explain what’s confusing, and tell you what you can skim. Budget about 4–6 hours total for a careful first read.

Summary of Results

High-level overview of what the paper discovers. Read this carefully — it sets up everything else.

What to Focus On
  • The “residual stream” framing — this reframes everything you thought you knew about transformers
  • The claim that attention heads have two independent roles (QK and OV circuits)
  • The concept of “virtual attention heads” created by composition
What You Can Skim
  • Specific model details (they use small attention-only models) — the framework applies generally

Section 2: Transformer Framework

The mathematical framework itself. This is the meat. Introduces the residual stream view formally, decomposes attention into QK and OV circuits, discusses how to think about MLPs.

Key Equations to Understand
  • Residual stream decomposition: The output is a sum of all components’ contributions. This enables direct logit attribution.
  • QK circuit WQWKT: The bilinear form that determines attention patterns. Think of it as a “matching function.”
  • OV circuit WVWO: The linear map that determines what information gets moved. Independent of QK.
  • Full attention head: Attn(x) = softmax(x WQ WKT xT) · x WV WO
Common Stumbling Blocks
  • The paper treats WQ and WK as separate matrices (not the combined WQK) — this is intentional because they have different interpretations
  • “Low-rank” decomposition: each head’s QK and OV matrices are rank dhead (64 for GPT-2). The full WQK matrix is [768 × 768] but only rank 64.
  • Bilinear form: xiT WQK xj means the attention score depends on BOTH the query and key positions. It’s not just about the query.

Section 3: Zero & One-Layer Transformers

Applies the framework to the simplest cases. Zero-layer = just bigrams (embed → unembed). One-layer = skip-trigrams (attention can implement “if token A appeared before, predict token B”).

Why This Section Matters
  • Shows the framework works concretely — you can verify the math against actual models
  • The “skip-trigram” concept makes attention’s power and limits very clear
  • Introduces analyzing the WE WQK WET matrix — what tokens attend to what other tokens, independent of position
  • Good practice for reading the notation before the harder sections

Section 4: Two-Layer Attention-Only Transformers & Induction Heads

The most important and most challenging section. Introduces composition (Q, K, V-composition), virtual attention heads, and the induction head circuit.

What to Focus On
  • Composition: Head B can use Head A’s output in its Q, K, or V computation. This creates “virtual heads” with attention patterns neither individual head has.
  • K-composition specifically: This is how the induction head works. Head A writes “previous token identity” to the stream. Head B reads this in its keys.
  • The induction head mechanism: [A][B]…[A] → predict [B]. Two heads, each doing something simple, compose into in-context learning.
Common Stumbling Blocks
  • The distinction between a “real” attention head and a “virtual” attention head can be confusing. Virtual heads are emergent computation from composition — they don’t correspond to any single head in the model.
  • The paper uses attention-only models (no MLPs) — this simplifies the analysis but means some findings don’t directly transfer to full transformers.
  • If the composition math gets overwhelming, focus on the induction head example first, then go back to the general framework.
Suggestion
  • Read this section twice. First time: follow the induction head story. Second time: understand the general composition framework.

Discussion & Related Work

Reflects on what the framework enables and its limitations. Worth reading for the big-picture perspective.

Key Takeaways
  • The framework makes specific, testable predictions about transformer behavior
  • Attention-only models are a useful simplification but miss MLP contributions
  • Composition enables exponentially many virtual circuits from a linear number of heads
  • This paper launched a research program — everything in mech interp since builds on this framing
09

Required Reading & Resources

Everything you need, ranked by priority. Start from the top.

A Mathematical Framework for Transformer Circuits

Elhage, Nanda, Olah et al. — Anthropic, 2021

The paper this entire section is about. Introduces the residual stream view, QK/OV circuit decomposition, composition, and induction heads. Read with the guide above.

Medium 4–6 hours

Neel Nanda’s Prerequisites Guide

Neel Nanda — neelnanda.io

What math and coding background you need. Extremely practical — tells you exactly what to learn and what to skip. Start here if you’re unsure about prerequisites.

Easy 30 min

ARENA — Chapter 1: Transformer from Scratch

Callum McDougall

Hands-on Jupyter notebooks where you build a transformer from scratch. Best way to internalize the architecture. Do the exercises, don’t just read.

Medium 6–8 hours

Transformers for Software Engineers

Nelson Elhage

Explains transformer internals in the language of software engineering (data flow, state, computation steps). If you’re a software engineer, this is the fastest path to understanding.

Easy 1–2 hours

Neel Nanda’s Quickstart Guide

Neel Nanda — neelnanda.io

Overview of the entire field with actionable next steps. Read after you understand transformer internals to see how they connect to interpretability research.

Easy 1 hour

The Illustrated Transformer

Jay Alammar

Visual walkthrough of the transformer architecture. More traditional ML perspective (not the mech interp framing), but excellent diagrams. Good if you’re a visual learner.

Easy 45 min

In-context Learning and Induction Heads

Olsson, Elhage, Nanda et al. — Anthropic, 2022

Deep dive into induction heads: how they form during training, their role in in-context learning, and why they matter. Read after the Framework paper.

Challenging 3–4 hours

3Blue1Brown: Neural Networks / Attention

Grant Sanderson

Beautiful visual explanations of neural networks and attention. Watch if the math feels abstract — these videos build geometric intuition that makes everything click.

Easy 2 hours
Reading Order

Priority bars: Essential (read these) — Recommended — Optional (but valuable). Start with Nanda’s Prerequisites, then Elhage’s “Transformers for Software Engineers,” then the Framework paper with this guide, then ARENA exercises.

10

Exercises & Deliverables

You understand transformer internals when you can do these, not when you can describe them.

Exercise 1

Build a Transformer from Scratch

Implement a minimal GPT-2–style transformer in PyTorch. No libraries, no shortcuts. Include: token embeddings, positional embeddings, multi-head attention (with causal masking), MLP layers, residual connections, layer norm, and unembedding.

  • Use the ARENA Chapter 1 exercises as a guide
  • Load pretrained GPT-2 weights into your implementation and verify it produces the same outputs
  • This should take 4–8 hours. If it takes much less, you’re probably not understanding deeply enough
Exercise 2

Explore Activations with TransformerLens

Load GPT-2 Small in TransformerLens. For a simple prompt, cache all internal activations and explore them.

  • Visualize attention patterns for all heads in all layers
  • Find the “previous token” head (attends to position i-1)
  • Find an induction head (use a repeated sequence like "Mr Jones Mr Jones")
  • Use the logit lens to watch predictions evolve through layers
Exercise 3

Direct Logit Attribution

For a prompt where GPT-2 correctly predicts the next token, decompose the logit into contributions from each component.

  • Which attention heads contribute most to the correct prediction?
  • Which MLP layers contribute most?
  • Are there any components that actively push against the correct prediction?
  • Visualize the contributions as a bar chart
Exercise 4

Analyze QK and OV Circuits

Pick an attention head that has an interesting attention pattern (from Exercise 2). Analyze its QK and OV circuits separately.

  • Compute the full WE WQK WET matrix — which token types attend to which other token types?
  • Compute the OV circuit WE WOV WU — for each token this head attends to, what does it write to the logits?
  • Do these two analyses tell a coherent story about what this head does?
Deliverable

Notebook: “Inside GPT-2”

Combine the above into a single Jupyter notebook that demonstrates your ability to work with transformer internals. This is your proof of understanding and your reference for future work.

  • Clear markdown explanations alongside code
  • Visualizations of attention patterns, logit lens, and attribution
  • At least one specific finding about GPT-2’s behavior (something you discovered, even if small)
  • This notebook is your ticket to the next section: Superposition & Features

When You’re Ready

If you can explain the residual stream view, decompose attention into QK and OV circuits, and trace information flow through a two-layer induction circuit — you have the foundation. Every technique in mechanistic interpretability (SAEs, activation patching, circuit tracing, steering vectors) builds directly on this understanding. Head to the Roadmap for the next step: Superposition & Features.