Prerequisites
What you need before starting. Check what you’re comfortable with.
- Linear algebra: Vectors, matrices, matrix multiplication, dot products, transposes. You need to be fluent, not just familiar.
- Python + NumPy: Comfortable with array operations, broadcasting, reshaping. This is your lab equipment.
- Basic ML: What a loss function is, gradient descent at a high level, what “training” means. You don’t need deep expertise.
- Softmax: Takes a vector of numbers, outputs a probability distribution: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
- PyTorch basics: Tensors, modules, forward passes. You’ll pick this up as you go.
- Probability: Conditional probability, distributions. Helpful for understanding predictions.
- Information theory: Entropy, cross-entropy loss. Not essential but adds depth.
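The softmax formula above can be sketched in a few lines of NumPy; subtracting the max before exponentiating is the standard numerical-stability trick (the shift cancels out in the ratio):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)      # shifting by a constant leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # a probability distribution
```

Without the max subtraction, inputs like `[1000.0, 1001.0]` would overflow `np.exp`.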
3Blue1Brown’s Essence of Linear Algebra series is the best visual refresher. Takes about 3 hours and builds genuine intuition.
The Big Picture: Residual Streams
The single most important mental model for mechanistic interpretability.
The standard way to explain transformers starts with attention mechanisms and builds up from there. That’s useful for building one, but misleading for understanding one. The mechanistic interpretability community uses a different framing, introduced by Elhage et al. (2021), that makes the internal structure much clearer.
The key insight: think of the transformer as a residual stream. The residual stream is a vector for each token position that flows through the entire network. Every component — every attention head, every MLP layer — reads from this stream and writes back to it. The components don’t talk to each other directly. They communicate through the shared stream.
The residual stream is a communication bus. Attention heads and MLPs are like peripherals: they read information from the bus, process it, and write their results back. The bus carries all information from input to output. This is why residual connections matter so much — they create the bus.
Why does this matter for interpretability? Because every component’s contribution is additive. The final output is literally the sum of all contributions. This means we can ask: “How much did attention head 3 in layer 5 contribute to predicting this token?” And we can get a real answer, because its contribution is a vector that gets added to the stream.
This is direct logit attribution: take any component’s output, multiply it by the unembedding matrix, and you can see exactly which tokens it pushes the model toward predicting.
Embeddings: Tokens to Vectors
How text becomes math.
Before the transformer can do anything, it needs to convert tokens (integers representing words or subwords) into vectors. This is the embedding layer. It’s conceptually simple: a lookup table. Token 4217 maps to a specific 768-dimensional vector (in GPT-2 Small). That vector is the token’s “starting point” in the residual stream.
The embedding matrix WE has shape [vocab_size × d_model]. For GPT-2 Small: [50257 × 768]. Each row is a learned vector for one token.
These vectors aren’t random — after training, they encode semantic relationships. Tokens with similar meanings end up near each other. The embedding space has geometric structure that the rest of the network exploits.
Attention is permutation-invariant — it can’t tell token order without help. So we add a position embedding: another learned vector for each position in the sequence (0, 1, 2, …, max_len).
The initial residual stream for token t at position p is simply x0 = WE[t] + Wpos[p]: the token embedding plus the position embedding.
The embedding creates the initial residual stream state. Everything the model does after this is about reading, processing, and updating this stream. The embedding is the model’s prior belief about each token before any contextual processing.
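A minimal sketch of the lookup, using random stand-in weights at toy sizes (the real GPT-2 Small matrices are [50257 × 768] and learned, not random):

```python
import numpy as np

# Toy dimensions standing in for GPT-2 Small's sizes (illustrative only).
vocab_size, max_len, d_model = 100, 16, 8
rng = np.random.default_rng(0)
W_E = rng.normal(size=(vocab_size, d_model))   # token embedding: one row per token
W_pos = rng.normal(size=(max_len, d_model))    # position embedding: one row per position

tokens = np.array([42, 7, 99])                 # a 3-token "prompt" (token IDs)
positions = np.arange(len(tokens))

# Initial residual stream: token embedding + position embedding, per position.
resid = W_E[tokens] + W_pos[positions]         # shape [3, d_model]
```

The embedding is literally indexing: `W_E[tokens]` selects one row per token ID, and the position rows are added elementwise.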
Attention Heads: Pattern Matching & Information Movement
The most important component to understand deeply. This is where the subtlety lives.
Each attention head does two things: it decides where to look (which other tokens to attend to) and what to move (what information to copy from those positions to the current position). These are controlled by two separate circuits with very different functions.
Start: Read from the Residual Stream
Each attention head reads the residual stream at every token position. At this point, each vector contains the token embedding plus all prior layers’ contributions. The head will process these to decide what information to move where.
Project to Q, K, V
Each head has three small weight matrices (WQ, WK, WV) that project from the residual stream (768-dim) to a smaller head dimension (64-dim in GPT-2). The Query asks “what am I looking for?” The Key says “what do I contain?” The Value is “what information to send if attended to.”
Compute Attention Scores
The dot product Q·Kᵀ measures how much each query “matches” each key. High score = strong match. We divide by √dk to prevent scores from getting too large (which would make softmax too “peaky”). In autoregressive models, we mask future positions so tokens can only attend to the past.
Softmax → Attention Pattern
Softmax normalizes each row so the weights sum to 1. Now each row is a probability distribution over source tokens. This is the attention pattern — the thing you see in attention visualizations. In a toy “The cat sat” example, “sat” might attend mostly to “cat” (0.55), then to itself (0.37), with the remainder on “The”.
Compute Weighted Sum of Values
For each token position, multiply the attention weights by the value vectors and sum. The result is a weighted blend of information from the attended positions. Continuing the toy example, “sat” gets 55% of the “cat” value vector, 37% of its own, and 8% of “The”’s. This is how attention moves information between positions.
Write Back to the Residual Stream
The output matrix WO projects the head’s output (64-dim) back to the residual stream dimension (768-dim). This result is added to the residual stream. Multiple heads in the same layer all add their outputs simultaneously. The stream now carries the original information plus whatever this head contributed.
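The six steps above, end to end for one head with random toy weights (row-vector convention, causal masking included; a sketch, not GPT-2's actual implementation, which also has biases and layer norm):

```python
import numpy as np

def attention_head(resid, W_Q, W_K, W_V, W_O):
    """One attention head: read -> Q/K/V -> scores -> softmax -> weighted sum -> write."""
    seq, d_head = resid.shape[0], W_Q.shape[1]
    q, k, v = resid @ W_Q, resid @ W_K, resid @ W_V        # [seq, d_head] each
    scores = q @ k.T / np.sqrt(d_head)                     # scaled dot products
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # causal: no future positions
    scores[mask] = -np.inf
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)         # row-wise softmax
    z = pattern @ v                                        # weighted blend of values
    return z @ W_O, pattern                                # project back to d_model

rng = np.random.default_rng(0)
d_model, d_head, seq = 8, 2, 3
resid = rng.normal(size=(seq, d_model))
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
out, pattern = attention_head(resid, W_Q, W_K, W_V, W_O)
# `out` is what this head adds to the residual stream.
```

Note that position 0 can only attend to itself, so the first row of `pattern` is always `[1, 0, 0]` regardless of the weights.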
QK Circuit: “Where to Look”
The WQWKᵀ matrix (the QK circuit) determines the attention pattern. It’s a bilinear form that computes a score for every (query-position, key-position) pair. When we say an attention head “looks at the previous token,” that behavior lives in the QK circuit.
OV Circuit: “What to Move”
The WVWO matrix (the OV circuit) determines what information gets moved. Once the head has decided to attend from position i to position j, the OV circuit controls what about position j gets written into position i’s residual stream.
This decomposition is the foundation of circuit analysis. The QK circuit and OV circuit have independent roles and can be analyzed independently. When you find an attention head that does something interesting (e.g., copies the previous token), you can separately ask: “How does it know to look there?” (QK) and “What does it copy?” (OV). Different heads might have the same QK pattern but different OV behavior, or vice versa.
GPT-2 Small has 12 heads per layer. Each head operates in its own 64-dimensional subspace (768 / 12 = 64). All 12 heads read from the same residual stream, compute independently, then their outputs are summed and added back. Each head can learn a completely different pattern — one might attend to the previous token, another to the subject of the sentence, another to punctuation.
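A quick check of the low-rank structure behind this decomposition, with random stand-in weights at GPT-2 Small's sizes (real heads have learned weights, but the shapes and ranks are the same):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 768, 64                 # GPT-2 Small sizes
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

W_QK = W_Q @ W_K.T    # [768, 768] bilinear form: "where to look"
W_OV = W_V @ W_O      # [768, 768] linear map:    "what to move"

# Both are full d_model x d_model matrices, but rank at most d_head,
# because each is a product through a 64-dimensional bottleneck.
rank_qk = np.linalg.matrix_rank(W_QK)
```

This is why each head can only read from and write to a small subspace of the residual stream.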
MLP Layers: The Processing Units
Where knowledge is stored and nonlinear computation happens.
After attention heads move information between token positions, the MLP layer processes each position independently. Unlike attention (which mixes information across positions), the MLP applies the same transformation to each token’s residual stream vector separately.
The “key-value memory” interpretation: Each row of Win is a “key” that the input matches against. When there’s a strong match (high activation after GELU), the corresponding column of Wout is the “value” that gets added to the residual stream. The MLP stores associations: when the input looks like X, add Y to the stream.
This is where factual knowledge largely lives. The fact that “Paris is the capital of France” is stored as: when the residual stream encodes a “capital of France?” query, specific MLP neurons activate and push the stream toward “Paris.”
The GELU (or ReLU) activation between the two linear layers is what makes the MLP more than just another linear transformation. Without it, the two layers would collapse into a single linear map (since the composition of linear functions is linear).
The nonlinearity enables conditional computation: the MLP can implement if/then logic. “If the context says this is about France AND the question is about capitals, THEN add the Paris vector.” Attention can’t do this because attention is linear in the values.
A useful mental model: attention moves information between positions; MLPs process information at each position. Attention is the routing network; MLPs are the processing nodes. Attention says “this token should know about that token.” MLPs say “given what this token now knows, update its representation.” This is a simplification, but a productive one.
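A minimal sketch of the key-value memory view, using random weights and the tanh GELU approximation; the column-vector convention is used here so that rows of W_in are the keys and columns of W_out are the values, matching the description above:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU (the variant GPT-2 uses)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_in, W_out):
    """Key-value memory view: each row of W_in is a key matched against x;
    the (gated) activation decides how much of each W_out column (value)
    gets added to the residual stream."""
    acts = gelu(W_in @ x)        # [d_mlp]: how strongly each key matched
    return W_out @ acts          # weighted sum of value columns

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32           # GPT-2 uses d_mlp = 4 * d_model
W_in = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_model, d_mlp))
x = rng.normal(size=d_model)     # one position's residual stream vector
delta = mlp(x, W_in, W_out)      # the vector this MLP adds to the stream
```

Because of the nonlinearity, doubling the input does not double the output, which is exactly what lets the MLP implement conditional, if/then-style updates.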
Unembedding & Logits
From vectors back to words.
After all layers have processed the residual stream, the model needs to convert the final vector back into a prediction over tokens. The unembedding matrix WU does this: it projects the residual stream (768-dim) to the vocabulary size (50,257 for GPT-2), producing a score (logit) for every possible next token.
This is where direct logit attribution becomes powerful. Because the final residual stream is the sum of all components’ contributions, the logit for any token is also the sum of each component’s contribution: logit(t) = Σc (output of component c) · WU[:, t].
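A toy sketch of that decomposition (random stand-in weights, the final LayerNorm ignored for simplicity, and the component names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 50
W_U = rng.normal(size=(d_model, vocab))             # unembedding matrix (toy)

# Pretend three components (embedding, a head, an MLP) each wrote a vector
# to the residual stream. The names are hypothetical labels, not real heads.
contributions = {name: rng.normal(size=d_model)
                 for name in ["embed", "head_5.3", "mlp_6"]}
final_resid = sum(contributions.values())           # the stream is just the sum

logits = final_resid @ W_U
token = int(logits.argmax())                        # the predicted token

# Per-component contribution to that token's logit. Linearity makes this
# exact: the pieces sum to the full logit.
attribution = {name: float(vec @ W_U[:, token])
               for name, vec in contributions.items()}
```

The same arithmetic works on a real model once you cache each component's output, which is exactly what the exercises below ask you to do.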
The Logit Lens
A simple but revealing technique: apply the unembedding at intermediate layers (not just the final one). At each layer, you can see what the model would predict if processing stopped there. Typically you see the prediction start vague and progressively sharpen. Sometimes the correct answer appears surprisingly early, revealing that later layers are doing refinement rather than core computation.
Circuits & Composition
How simple components combine to implement complex behaviors.
Individual attention heads and MLPs are interesting, but the real power comes from composition — how components in different layers work together. Because later layers can read what earlier layers wrote to the residual stream, attention heads can effectively “chain” their computations.
Q-Composition
Head B in a later layer uses the output of Head A (written to the residual stream) as its query input. Head A’s output tells Head B what to look for.
K-Composition
Head B uses Head A’s output as its key input. Head A’s output changes what other tokens advertise to Head B.
V-Composition
Head B uses Head A’s output as its value input. Head A’s output changes what information gets moved when Head B attends to that position.
Induction heads are a two-head circuit that implements in-context pattern completion. Given a sequence like [A][B] … [A], an induction circuit predicts [B] will come next. It detects that the current token [A] appeared before, finds what followed it, and copies that prediction forward.
This is arguably the most important circuit discovered so far. It’s the mechanism behind in-context learning in transformers, and it’s a beautiful example of how two simple heads compose to implement a complex algorithm.
The Pattern
The model has seen “Harry Potter” earlier in the sequence. Now it sees “Harry” again and needs to predict the next token. The induction head circuit will recognize this repeated pattern and predict “Potter.”
Step 1: Previous Token Head (Layer 0)
Head A has a simple job: it always attends to the previous token position and copies information about that token. After Head A runs, each position’s residual stream has been enriched with information about the preceding token. This is a general-purpose operation — it doesn’t know about induction yet.
Step 2: Induction Head (Layer 1)
Head B does the clever part. Its query at the current “Harry” position asks: “where in the past was there a token preceded by Harry?” Thanks to Head A’s output, the position of “Potter” now has “preceded by Harry” written into its residual stream. This is K-composition — Head B’s keys use Head A’s output. Head B attends to “Potter,” and its OV circuit copies the token identity → predicting “Potter.”
The Complete Induction Circuit
Two heads, each doing something simple, compose to implement a powerful algorithm. Neither head “understands” induction on its own. The behavior emerges from their interaction through the residual stream. This is what mechanistic interpretability means by “circuits” — computational subnetworks whose behavior can be understood and predicted.
Induction heads are believed to be the primary mechanism for in-context learning. They appear in every transformer large enough to have two layers. They emerge at a specific point during training (a “phase change”), and their emergence coincides with a sharp drop in loss. Understanding this one circuit gives you intuition for how transformers learn algorithms through composition.
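The induction algorithm itself, stripped of the attention machinery, can be written directly. This is a sketch of *what* the two-head circuit computes, not *how* it computes it (the real circuit does this with a previous-token head plus K-composition, as described above):

```python
def induction_predict(tokens):
    """Given [A][B] ... [A], predict [B]: find the most recent earlier
    occurrence of the current token and return the token that followed it."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over the past
        if tokens[i] == current:               # QK match: "preceded by <current>"
            return tokens[i + 1]               # OV copy: the token that followed
    return None                                # no repeat found; no prediction

seq = ["Harry", "Potter", "and", "Harry"]
prediction = induction_predict(seq)
```

In the real circuit, "the token that followed" is found because the previous-token head has written "preceded by Harry" into the "Potter" position's residual stream.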
Reading Guide: A Mathematical Framework for Transformer Circuits
Section-by-section guide to the foundational paper. Read the paper alongside this guide.
Read each paper section first, then come back here. The guide won’t make sense without reading the original. The goal is to highlight what’s important, explain what’s confusing, and tell you what you can skim. Budget about 4–6 hours total for a careful first read.
Summary of Results
High-level overview of what the paper discovers. Read this carefully — it sets up everything else.
- The “residual stream” framing — this reframes everything you thought you knew about transformers
- The claim that attention heads have two independent roles (QK and OV circuits)
- The concept of “virtual attention heads” created by composition
- Specific model details (they use small attention-only models) — the framework applies generally
Section 2: Transformer Framework
The mathematical framework itself. This is the meat. Introduces the residual stream view formally, decomposes attention into QK and OV circuits, discusses how to think about MLPs.
- Residual stream decomposition: The output is a sum of all components’ contributions. This enables direct logit attribution.
- QK circuit WQWKᵀ: The bilinear form that determines attention patterns. Think of it as a “matching function.”
- OV circuit WVWO: The linear map that determines what information gets moved. Independent of QK.
- Full attention head: Attn(x) = softmax(x WQWKᵀ xᵀ) · x WVWO (causal masking and the √dk scaling omitted for clarity)
- The paper treats WQ and WK as separate matrices (not the combined WQK) — this is intentional because they have different interpretations
- “Low-rank” decomposition: each head’s QK and OV matrices have rank at most dhead (64 for GPT-2 Small). The full WQK matrix is [768 × 768] but only rank 64.
- Bilinear form: xi WQK xjᵀ means the attention score depends on BOTH the query and key positions. It’s not just about the query.
Section 3: Zero & One-Layer Transformers
Applies the framework to the simplest cases. Zero-layer = just bigrams (embed → unembed). One-layer = skip-trigrams (attention can implement “if token A appeared before, predict token B”).
- Shows the framework works concretely — you can verify the math against actual models
- The “skip-trigram” concept makes attention’s power and limits very clear
- Introduces analyzing the WE WQK WEᵀ matrix — what tokens attend to what other tokens, independent of position
- Good practice for reading the notation before the harder sections
Section 4: Two-Layer Attention-Only Transformers & Induction Heads
The most important and most challenging section. Introduces composition (Q, K, V-composition), virtual attention heads, and the induction head circuit.
- Composition: Head B can use Head A’s output in its Q, K, or V computation. This creates “virtual heads” with attention patterns neither individual head has.
- K-composition specifically: This is how the induction head works. Head A writes “previous token identity” to the stream. Head B reads this in its keys.
- The induction head mechanism: [A][B]…[A] → predict [B]. Two heads, each doing something simple, compose into in-context learning.
- The distinction between a “real” attention head and a “virtual” attention head can be confusing. Virtual heads are emergent computation from composition — they don’t correspond to any single head in the model.
- The paper uses attention-only models (no MLPs) — this simplifies the analysis but means some findings don’t directly transfer to full transformers.
- If the composition math gets overwhelming, focus on the induction head example first, then go back to the general framework.
- Read this section twice. First time: follow the induction head story. Second time: understand the general composition framework.
Discussion & Related Work
Reflects on what the framework enables and its limitations. Worth reading for the big-picture perspective.
- The framework makes specific, testable predictions about transformer behavior
- Attention-only models are a useful simplification but miss MLP contributions
- Composition enables exponentially many virtual circuits from a linear number of heads
- This paper launched a research program — everything in mech interp since builds on this framing
Required Reading & Resources
Everything you need, ranked by priority. Start from the top.
A Mathematical Framework for Transformer Circuits
The paper this entire section is about. Introduces the residual stream view, QK/OV circuit decomposition, composition, and induction heads. Read with the guide above.
Neel Nanda’s Prerequisites Guide
What math and coding background you need. Extremely practical — tells you exactly what to learn and what to skip. Start here if you’re unsure about prerequisites.
ARENA — Chapter 1: Transformer from Scratch
Hands-on Jupyter notebooks where you build a transformer from scratch. Best way to internalize the architecture. Do the exercises, don’t just read.
Transformers for Software Engineers
Explains transformer internals in the language of software engineering (data flow, state, computation steps). If you’re a software engineer, this is the fastest path to understanding.
Neel Nanda’s Quickstart Guide
Overview of the entire field with actionable next steps. Read after you understand transformer internals to see how they connect to interpretability research.
The Illustrated Transformer
Visual walkthrough of the transformer architecture. More traditional ML perspective (not the mech interp framing), but excellent diagrams. Good if you’re a visual learner.
In-context Learning and Induction Heads
Deep dive into induction heads: how they form during training, their role in in-context learning, and why they matter. Read after the Framework paper.
3Blue1Brown: Neural Networks / Attention
Beautiful visual explanations of neural networks and attention. Watch if the math feels abstract — these videos build geometric intuition that makes everything click.
Priority bars: Essential (read these) — Recommended — Optional (but valuable). Start with Nanda’s Prerequisites, then Elhage’s “Transformers for Software Engineers,” then the Framework paper with this guide, then ARENA exercises.
Exercises & Deliverables
You understand transformer internals when you can do these, not when you can describe them.
Build a Transformer from Scratch
Implement a minimal GPT-2–style transformer in PyTorch. No libraries, no shortcuts. Include: token embeddings, positional embeddings, multi-head attention (with causal masking), MLP layers, residual connections, layer norm, and unembedding.
- Use the ARENA Chapter 1 exercises as a guide
- Load pretrained GPT-2 weights into your implementation and verify it produces the same outputs
- This should take 4–8 hours. If it takes much less, you’re probably not understanding deeply enough
Explore Activations with TransformerLens
Load GPT-2 Small in TransformerLens. For a simple prompt, cache all internal activations and explore them.
- Visualize attention patterns for all heads in all layers
- Find the “previous token” head (attends to position i-1)
- Find an induction head (use a repeated sequence like “Mr Jones Mr Jones”)
- Use the logit lens to watch predictions evolve through layers
Direct Logit Attribution
For a prompt where GPT-2 correctly predicts the next token, decompose the logit into contributions from each component.
- Which attention heads contribute most to the correct prediction?
- Which MLP layers contribute most?
- Are there any components that actively push against the correct prediction?
- Visualize the contributions as a bar chart
Analyze QK and OV Circuits
Pick an attention head that has an interesting attention pattern (from Exercise 2). Analyze its QK and OV circuits separately.
- Compute the full WE WQK WEᵀ matrix — which token types attend to which other token types?
- Compute the OV circuit WE WOV WU — for each token this head attends to, what does it write to the logits?
- Do these two analyses tell a coherent story about what this head does?
Notebook: “Inside GPT-2”
Combine the above into a single Jupyter notebook that demonstrates your ability to work with transformer internals. This is your proof of understanding and your reference for future work.
- Clear markdown explanations alongside code
- Visualizations of attention patterns, logit lens, and attribution
- At least one specific finding about GPT-2’s behavior (something you discovered, even if small)
- This notebook is your ticket to the next section: Superposition & Features
When You’re Ready
If you can explain the residual stream view, decompose attention into QK and OV circuits, and trace information flow through a two-layer induction circuit — you have the foundation. Every technique in mechanistic interpretability (SAEs, activation patching, circuit tracing, steering vectors) builds directly on this understanding. Head to the Roadmap for the next step: Superposition & Features.