Transformer Internals

How transformers actually compute, from the ground up. Not the high-level “attention is all you need” version — the mechanistic view that makes interpretability research possible.

Roadmap Step 1 · 1–2 weeks · Foundational

In this section

  01 Prerequisites
  02 The Big Picture: Residual Streams
  03 Embeddings
  04 Attention Heads
  05 MLP Layers
  06 Unembedding & Logits
  07 Circuits & Composition
  08 Paper Guide: Mathematical Framework
  09 Required Reading
  10 Exercises & Deliverables
01

Prerequisites

What you need before starting. Check what you’re comfortable with.

  • Linear algebra: Vectors, matrices, matrix multiplication, dot products, transposes. You need to be fluent, not just familiar.
  • Python + NumPy: Comfortable with array operations, broadcasting, reshaping. This is your lab equipment.
  • Basic ML: What a loss function is, gradient descent at a high level, what “training” means. You don’t need deep expertise.
  • Softmax: Takes a vector of numbers, outputs a probability distribution. softmax(x)i = e^(xi) / ∑j e^(xj)
  • PyTorch basics: Tensors, modules, forward passes. You’ll pick this up as you go.
  • Probability: Conditional probability, distributions. Helpful for understanding predictions.
  • Information theory: Entropy, cross-entropy loss. Not essential but adds depth.
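As a quick self-check on the softmax prerequisite, here it is in a few lines of NumPy (a minimal sketch; subtracting the max is a standard numerical-stability trick and doesn't change the output, since softmax is invariant to adding a constant to every entry):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating to avoid overflow;
    # the result is identical because e^(x - c) / sum(e^(x - c)) = e^x / sum(e^x).
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is a valid probability distribution: every entry positive, entries sum to 1
```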
If you’re rusty

3Blue1Brown’s Essence of Linear Algebra series is the best visual refresher. Takes about 3 hours and builds genuine intuition.

02

The Big Picture: Residual Streams

The single most important mental model for mechanistic interpretability.

The standard way to explain transformers starts with attention mechanisms and builds up from there. That’s useful for building one, but misleading for understanding one. The mechanistic interpretability community uses a different framing, introduced by Elhage et al. (2021), that makes the internal structure much clearer.

The key insight: organize your picture of the transformer around the residual stream, a vector at each token position that flows through the entire network. Every component — every attention head, every MLP layer — reads from this stream and writes back to it. The components don’t talk to each other directly. They communicate through the shared stream.

Key Insight

The residual stream is a communication bus. Attention heads and MLPs are like peripherals: they read information from the bus, process it, and write their results back. The bus carries all information from input to output. This is why residual connections matter so much — they create the bus.

The Residual Stream Equation xfinal = xembed + attn0 + mlp0 + attn1 + mlp1 + … Each term is one component’s output (computed from the stream as it stood when that component ran), added into the stream. The final stream is the initial embedding plus every component’s contribution.
Token + Position Embeddings WE · token + Wpos[position]
Residual Stream
Layer 0
Multi-Head Self-Attention reads stream → computes attention → writes back
Feed-Forward MLP reads stream → nonlinear transform → writes back
Residual Stream (updated)
Layer 1
Multi-Head Self-Attention reads stream → computes attention → writes back
Feed-Forward MLP reads stream → nonlinear transform → writes back
Final Residual Stream
Unembedding → Logits → Next Token Prediction WU · xfinal → softmax → probabilities

Why does this matter for interpretability? Because every component’s contribution is additive. The final output is literally the sum of all contributions. This means we can ask: “How much did attention head 3 in layer 5 contribute to predicting this token?” And we can get a real answer, because its contribution is a vector that gets added to the stream.

This is direct logit attribution: take any component’s output, multiply it by the unembedding matrix, and you can see exactly which tokens it pushes the model toward predicting.

03

Embeddings: Tokens to Vectors

How text becomes math.

Before the transformer can do anything, it needs to convert tokens (integers representing words or subwords) into vectors. This is the embedding layer. It’s conceptually simple: a lookup table. Token 4217 maps to a specific 768-dimensional vector (in GPT-2 Small). That vector is the token’s “starting point” in the residual stream.

The embedding matrix WE has shape [vocab_size × d_model]. For GPT-2 Small: [50257 × 768]. Each row is a learned vector for one token.

These vectors aren’t random — after training, they encode semantic relationships. Tokens with similar meanings end up near each other. The embedding space has geometric structure that the rest of the network exploits.

Attention is permutation-invariant — it can’t tell token order without help. So we add a position embedding: another learned vector for each position in the sequence (0, 1, 2, …, max_len − 1).

The initial residual stream for token t at position p is simply:

x0 = WE[t] + Wpos[p] Sum of token meaning + position information. This is the starting state of the residual stream.
Why This Matters

The embedding creates the initial residual stream state. Everything the model does after this is about reading, processing, and updating this stream. The embedding is the model’s prior belief about each token before any contextual processing.
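The lookup-plus-add is one line of NumPy. Here is a toy sketch with random weights and made-up small dimensions (GPT-2 Small's real values are noted in the comments; the token ids are arbitrary):

```python
import numpy as np

# Toy dimensions. GPT-2 Small: vocab_size=50257, n_ctx=1024, d_model=768.
vocab_size, n_ctx, d_model = 100, 16, 8
rng = np.random.default_rng(0)
W_E = rng.normal(size=(vocab_size, d_model))    # token embedding table (one row per token)
W_pos = rng.normal(size=(n_ctx, d_model))       # position embedding table (one row per position)

tokens = np.array([42, 7, 13])                  # arbitrary token ids for a 3-token prompt
positions = np.arange(len(tokens))              # positions 0, 1, 2

# Initial residual stream: one vector per position, x0 = W_E[t] + W_pos[p]
x0 = W_E[tokens] + W_pos[positions]             # shape [seq, d_model]
```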

04

Attention Heads: Pattern Matching & Information Movement

The most important component to understand deeply. This is where the subtlety lives.

Each attention head does two things: it decides where to look (which other tokens to attend to) and what to move (what information to copy from those positions to the current position). These are controlled by two separate circuits with very different functions.

[Diagram] Residual stream vectors, one per token (“The”, “cat”, “sat”, “on”); each is a 768-dimensional vector in GPT-2 Small.

Start: Read from the Residual Stream

Each attention head reads the residual stream at every token position. At this point, each vector contains the token embedding plus all prior layers’ contributions. The head will process these to decide what information to move where.

[Diagram] Each token’s stream vector (e.g. “sat”) is projected into Q, K, and V; this happens for all 4 tokens and all 12 heads.

Project to Q, K, V

Each head has three small weight matrices (WQ, WK, WV) that project from the residual stream (768-dim) to a smaller head dimension (64-dim in GPT-2). The Query asks “what am I looking for?” The Key says “what do I contain?” The Value is “what information to send if attended to.”

Q = x · WQ    K = x · WK    V = x · WV. Each projection: [batch, seq, 768] → [batch, seq, 64]. The head works in a smaller subspace.
Attention scores (QKT / √dk):

           The    cat    sat    on
    The    2.1     –      –     –
    cat    0.4    1.8     –     –
    sat    0.1    2.3    0.9    –
    on    -0.2    0.6    1.1   0.3

Dashes indicate masked positions (causal attention: can’t look at future tokens).

Compute Attention Scores

The dot product Q · KT measures how much each query “matches” each key. High score = strong match. We divide by √dk to prevent scores from getting too large (which would make softmax too “peaky”). In autoregressive models, we mask future positions so tokens can only attend to the past.

scores = Q · KT / √dk Shape: [seq_len × seq_len]. Entry (i, j) is how much token i attends to token j.
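The scoring and masking steps can be sketched in NumPy. Q and K here are random toy values, not from a trained model; setting masked scores to negative infinity is the standard way to make them vanish after softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_head = 4, 8
Q = rng.normal(size=(seq, d_head))
K = rng.normal(size=(seq, d_head))

scores = Q @ K.T / np.sqrt(d_head)                      # [seq, seq] raw match scores
# Causal mask: position i may only attend to positions j <= i.
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)    # True above the diagonal
scores[mask] = -np.inf                                  # exp(-inf) = 0 after softmax

# Row-wise softmax: each row becomes a distribution over visible positions
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)
```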
Attention pattern (after softmax):

           The    cat    sat    on
    The   1.00
    cat   0.20   0.80
    sat   0.08   0.55   0.37
    on    0.07   0.28   0.41   0.24

Each row sums to 1.0; the highest weight in each row is what attention-pattern visualizations highlight.

Softmax → Attention Pattern

Softmax normalizes each row so the weights sum to 1. Now each row is a probability distribution over source tokens. This is the attention pattern — the thing you see in attention visualizations. Here, “sat” attends mostly to “cat” (0.55), then to itself (0.37).

Weighted sum of value vectors: zsat = 0.08 × VThe + 0.55 × Vcat + 0.37 × Vsat (the output for “sat”).

Compute Weighted Sum of Values

For each token position, multiply the attention weights by the value vectors and sum. The result is a weighted blend of information from the attended positions. Token “sat” gets 55% of the “cat” value vector, 37% of its own, and 8% of “The”. This is how attention moves information between positions.

z = pattern · V Matrix multiply: [seq × seq] × [seq × d_head] → [seq × d_head]. One output vector per position.
[Diagram] Project back to residual stream dimension and add: zsat · WO gives the projected output for “sat”, which is added to the residual stream: x′sat = xsat + zsat · WO.

Write Back to the Residual Stream

The output matrix WO projects the head’s output (64-dim) back to the residual stream dimension (768-dim). This result is added to the residual stream. Multiple heads in the same layer all add their outputs simultaneously. The stream now carries the original information plus whatever this head contributed.

x′ = x + z · WO The “+” is the residual connection. It’s why it’s called a residual stream.
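The whole read-process-write cycle fits in a short function. Below is a minimal NumPy sketch with toy dimensions and random weights, ignoring biases and layer norm (which real implementations include); names like `attention_head` are ours, not from any library:

```python
import numpy as np

def attention_head(x, W_Q, W_K, W_V, W_O):
    """One attention head's additive contribution to the residual stream.
    x: [seq, d_model]; returns [seq, d_model]."""
    seq = x.shape[0]
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V                 # [seq, d_head] each
    scores = Q @ K.T / np.sqrt(W_Q.shape[1])            # scaled dot products
    scores[np.triu(np.ones((seq, seq), dtype=bool), k=1)] = -np.inf  # causal mask
    pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
    pattern /= pattern.sum(axis=-1, keepdims=True)      # rows sum to 1
    z = pattern @ V                                     # weighted blend of value vectors
    return z @ W_O                                      # project back to d_model

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 4, 5
x = rng.normal(size=(seq, d_model))                     # toy residual stream
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

x_new = x + attention_head(x, W_Q, W_K, W_V, W_O)       # the residual connection
```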

QK Circuit: “Where to Look”

Pattern

The WQWKT matrix (the QK circuit) determines the attention pattern. It’s a bilinear form that computes a score for every (query-position, key-position) pair. When we say an attention head “looks at the previous token,” that behavior lives in the QK circuit.

QK Circuit: Aij ∝ xiT WQ WKT xj. How much does position i attend to position j? Depends on what’s at both positions.

OV Circuit: “What to Move”

Information

The WVWO matrix (the OV circuit) determines what information gets moved. Once the head has decided to attend from position i to position j, the OV circuit controls what about position j gets written into position i’s residual stream.

OV Circuit: outputi = ∑j Aij · (xj WV WO). A weighted sum of (source info × OV projection). The OV circuit is a linear map on the source vectors.
Why Separating QK and OV Matters

This decomposition is the foundation of circuit analysis. The QK circuit and OV circuit have independent roles and can be analyzed independently. When you find an attention head that does something interesting (e.g., copies the previous token), you can separately ask: “How does it know to look there?” (QK) and “What does it copy?” (OV). Different heads might have the same QK pattern but different OV behavior, or vice versa.

Multi-Head Attention

GPT-2 Small has 12 heads per layer. Each head operates in its own 64-dimensional subspace (768 / 12 = 64). All 12 heads read from the same residual stream, compute independently, then their outputs are summed and added back. Each head can learn a completely different pattern — one might attend to the previous token, another to the subject of the sentence, another to punctuation.
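The low-rank structure of the QK and OV circuits is easy to see directly. A sketch using GPT-2 Small's dimensions but random placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 768, 64                 # GPT-2 Small dimensions
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

W_QK = W_Q @ W_K.T    # [768, 768] bilinear form: "where to look"
W_OV = W_V @ W_O      # [768, 768] linear map:    "what to move"

# Both are full-size d_model x d_model matrices, but each factors through
# a d_head-dimensional bottleneck, so their rank is at most 64.
assert np.linalg.matrix_rank(W_QK) <= d_head
assert np.linalg.matrix_rank(W_OV) <= d_head
```

This is why the paper calls each head "low-rank": the head can only read and write a 64-dimensional slice of the 768-dimensional stream.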

05

MLP Layers: The Processing Units

Where knowledge is stored and nonlinear computation happens.

After attention heads move information between token positions, the MLP layer processes each position independently. Unlike attention (which mixes information across positions), the MLP applies the same transformation to each token’s residual stream vector separately.

MLP Computation mlp(x) = GELU(x · Win + bin) · Wout + bout Win: [768 × 3072], Wout: [3072 × 768] for GPT-2 Small. The inner dimension (3072) is 4× the model dimension.

The “key-value memory” interpretation: each column of Win is a “key” that the input is matched against (the match strength is the dot product x · Win[:, k]). When there’s a strong match (high activation after GELU), the corresponding row of Wout is the “value” that gets added to the residual stream. The MLP stores associations: when the input looks like X, add Y to the stream.

This is where factual knowledge largely lives. The fact that “Paris is the capital of France” is stored as: when the residual stream encodes a “capital of France?” query, specific MLP neurons activate and push the stream toward “Paris.”

The GELU (or ReLU) activation between the two linear layers is what makes MLPs more than just another linear transformation. Without it, the entire transformer would be a single linear function (since the composition of linear functions is linear).

The nonlinearity enables conditional computation: the MLP can implement if/then logic. “If the context says this is about France AND the question is about capitals, THEN add the Paris vector.” Attention can’t do this because attention is linear in the values.
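A minimal NumPy sketch of the MLP, using the tanh approximation of GELU (the variant GPT-2 itself uses); weights here are random placeholders and dimensions are toy-sized:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_in, b_in, W_out, b_out):
    """Position-wise MLP: the same map applied independently to each token's vector."""
    pre = x @ W_in + b_in           # [seq, 4*d_model]: match strength against each "key"
    post = gelu(pre)                # nonlinearity gates which "values" fire
    return post @ W_out + b_out     # weighted sum of W_out rows (the "values")

rng = np.random.default_rng(0)
d_model, seq = 8, 3                 # GPT-2 Small: d_model=768, inner dim 3072
W_in = rng.normal(size=(d_model, 4 * d_model))
b_in = np.zeros(4 * d_model)
W_out = rng.normal(size=(4 * d_model, d_model))
b_out = np.zeros(d_model)

x = rng.normal(size=(seq, d_model))
out = mlp(x, W_in, b_in, W_out, b_out)   # this gets added to the residual stream
```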

Attention vs. MLP: The Division of Labor

A useful mental model: attention moves information between positions; MLPs process information at each position. Attention is the routing network; MLPs are the processing nodes. Attention says “this token should know about that token.” MLPs say “given what this token now knows, update its representation.” This is a simplification, but a productive one.

06

Unembedding & Logits

From vectors back to words.

After all layers have processed the residual stream, the model needs to convert the final vector back into a prediction over tokens. The unembedding matrix WU does this: it projects the residual stream (768-dim) to the vocabulary size (50,257 for GPT-2), producing a score (logit) for every possible next token.

Unembedding logits = xfinal · WU    probabilities = softmax(logits) WU shape: [768 × 50257]. The logit for token t is the dot product of xfinal with the t-th column of WU.

This is where direct logit attribution becomes powerful. Because the final residual stream is the sum of all components’ contributions, the logit for any token is also the sum of each component’s contribution:

Logit Decomposition logit(t) = xembed · WU[t] + attn0 · WU[t] + mlp0 · WU[t] + … Each term tells you how much one component pushes toward predicting token t. This is the basis of most circuit analysis.
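Because the unembedding is linear, this decomposition is exact, which a toy NumPy example can verify (the component vectors below are random stand-ins for real activations, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20
W_U = rng.normal(size=(d_model, vocab))   # toy unembedding matrix

# Toy per-component contributions to one position's residual stream
x_embed = rng.normal(size=d_model)
attn_out = rng.normal(size=d_model)
mlp_out = rng.normal(size=d_model)

x_final = x_embed + attn_out + mlp_out    # stream = sum of contributions
logits = x_final @ W_U

# Direct logit attribution: each component's logit contribution, separately.
# Their sum matches the full logits exactly, because unembedding is linear.
per_component = np.stack([x_embed @ W_U, attn_out @ W_U, mlp_out @ W_U])
```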

The Logit Lens

A simple but revealing technique: apply the unembedding at intermediate layers (not just the final one). At each layer, you can see what the model would predict if processing stopped there. Typically you see the prediction start vague and progressively sharpen. Sometimes the correct answer appears surprisingly early, revealing that later layers are doing refinement rather than core computation.

07

Circuits & Composition

How simple components combine to implement complex behaviors.

Individual attention heads and MLPs are interesting, but the real power comes from composition — how components in different layers work together. Because later layers can read what earlier layers wrote to the residual stream, attention heads can effectively “chain” their computations.

Q-Composition

Type

Head B in a later layer uses the output of Head A (written to the residual stream) as its query input. Head A’s output tells Head B what to look for.

K-Composition

Type

Head B uses Head A’s output as its key input. Head A’s output changes what other tokens advertise to Head B.

V-Composition

Type

Head B uses Head A’s output as its value input. Head A’s output changes what information gets moved when Head B attends to that position.

Induction heads are a two-head circuit that implements in-context pattern completion. Given a sequence like [A][B] … [A], an induction circuit predicts [B] will come next. It detects that the current token [A] appeared before, finds what followed it, and copies that prediction forward.

This is arguably the most important circuit discovered so far. It’s the mechanism behind in-context learning in transformers, and it’s a beautiful example of how two simple heads compose to implement a complex algorithm.

[Diagram] Input sequence: “Harry” “Potter” … “Harry” → ???

The Pattern

The model has seen “Harry Potter” earlier in the sequence. Now it sees “Harry” again and needs to predict the next token. The induction head circuit will recognize this repeated pattern and predict “Potter.”

Head A (Layer 0): Previous Token Head
[Diagram] At the “Potter” position, Head A attends to the previous token (“Harry”) and writes its identity into the stream.
At EVERY position, this head copies the identity of the previous token into the residual stream. So at the position after “Harry” (which is “Potter”), the residual stream now encodes: “the token before me was Harry.”

Step 1: Previous Token Head (Layer 0)

Head A has a simple job: it always attends to the previous token position and copies information about that token. After Head A runs, each position’s residual stream has been enriched with information about the preceding token. This is a general-purpose operation — it doesn’t know about induction yet.

Head B (Layer 1): Induction Head
[Diagram] Head B’s query at the current “Harry” position asks “where was Harry before?”. Via K-composition its keys match the “Potter” position, whose stream now says “preceded by Harry”; Head B attends there, and its OV circuit copies “Potter” forward as the prediction.

Step 2: Induction Head (Layer 1)

Head B does the clever part. Its query at the current “Harry” position asks: “where in the past was there a token preceded by Harry?” Thanks to Head A’s output, the position of “Potter” now has “preceded by Harry” written into its residual stream. This is K-composition — Head B’s keys use Head A’s output. Head B attends to “Potter,” and its OV circuit copies the token identity → predicting “Potter.”

Complete circuit
1. Head A (L0) copies prev-token info everywhere
2. Head B (L1) searches for “preceded by current token” via K-composition
3. Head B copies the matched token via OV circuit
4. Result: [A][B]…[A] → predict [B]

The Complete Induction Circuit

Two heads, each doing something simple, compose to implement a powerful algorithm. Neither head “understands” induction on its own. The behavior emerges from their interaction through the residual stream. This is what mechanistic interpretability means by “circuits” — computational subnetworks whose behavior can be understood and predicted.
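Written as ordinary code, the algorithm the two heads implement looks like this (an illustrative sketch only; the model realizes it with attention patterns, not an explicit backward search):

```python
def induction_predict(tokens):
    """Predict the next token the way an induction circuit does: find an
    earlier occurrence of the current token and return whatever followed it.
    Returns None if the current token has not appeared before."""
    current = tokens[-1]
    # Head A's contribution: at each position j, the stream records which
    # token preceded it (tokens[j - 1]).
    # Head B's contribution: attend to a position whose *previous* token
    # matches the current token, then copy that position's token forward.
    for j in range(len(tokens) - 2, 0, -1):   # scan backward: most recent match wins
        if tokens[j - 1] == current:
            return tokens[j]
    return None

prediction = induction_predict(["Harry", "Potter", "and", "Harry"])
# prediction is "Potter": the circuit completes the repeated pattern
```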

Why Induction Heads Matter

Induction heads are believed to be the primary mechanism for in-context learning. They appear in every transformer large enough to have two layers. They emerge at a specific point during training (a “phase change”), and their emergence coincides with a sharp drop in loss. Understanding this one circuit gives you intuition for how transformers learn algorithms through composition.

08

Reading Guide: A Mathematical Framework for Transformer Circuits

Section-by-section guide to the foundational paper. Read the paper alongside this guide.

How to Use This Guide

Read each paper section first, then come back here. The guide won’t make sense without reading the original. The goal is to highlight what’s important, explain what’s confusing, and tell you what you can skim. Budget about 4–6 hours total for a careful first read.

Summary of Results

High-level overview of what the paper discovers. Read this carefully — it sets up everything else.

What to Focus On
  • The “residual stream” framing — this reframes everything you thought you knew about transformers
  • The claim that attention heads have two independent roles (QK and OV circuits)
  • The concept of “virtual attention heads” created by composition
What You Can Skim
  • Specific model details (they use small attention-only models) — the framework applies generally

Section 2: Transformer Framework

The mathematical framework itself. This is the meat. Introduces the residual stream view formally, decomposes attention into QK and OV circuits, discusses how to think about MLPs.

Key Equations to Understand
  • Residual stream decomposition: The output is a sum of all components’ contributions. This enables direct logit attribution.
  • QK circuit WQWKT: The bilinear form that determines attention patterns. Think of it as a “matching function.”
  • OV circuit WVWO: The linear map that determines what information gets moved. Independent of QK.
  • Full attention head: Attn(x) = softmax(x WQ WKT xT) · x WV WO
Common Stumbling Blocks
  • The paper treats WQ and WK as separate matrices (not the combined WQK) — this is intentional because they have different interpretations
  • “Low-rank” decomposition: each head’s QK and OV matrices are rank dhead (64 for GPT-2). The full WQK matrix is [768 × 768] but only rank 64.
  • Bilinear form: xiT WQK xj means the attention score depends on BOTH the query and key positions. It’s not just about the query.

Section 3: Zero & One-Layer Transformers

Applies the framework to the simplest cases. Zero-layer = just bigrams (embed → unembed). One-layer = skip-trigrams (attention can implement “if token A appeared before, predict token B”).

Why This Section Matters
  • Shows the framework works concretely — you can verify the math against actual models
  • The “skip-trigram” concept makes attention’s power and limits very clear
  • Introduces analyzing the WE WQK WET matrix — what tokens attend to what other tokens, independent of position
  • Good practice for reading the notation before the harder sections

Section 4: Two-Layer Attention-Only Transformers & Induction Heads

The most important and most challenging section. Introduces composition (Q, K, V-composition), virtual attention heads, and the induction head circuit.

What to Focus On
  • Composition: Head B can use Head A’s output in its Q, K, or V computation. This creates “virtual heads” with attention patterns neither individual head has.
  • K-composition specifically: This is how the induction head works. Head A writes “previous token identity” to the stream. Head B reads this in its keys.
  • The induction head mechanism: [A][B]…[A] → predict [B]. Two heads, each doing something simple, compose into in-context learning.
Common Stumbling Blocks
  • The distinction between a “real” attention head and a “virtual” attention head can be confusing. Virtual heads are emergent computation from composition — they don’t correspond to any single head in the model.
  • The paper uses attention-only models (no MLPs) — this simplifies the analysis but means some findings don’t directly transfer to full transformers.
  • If the composition math gets overwhelming, focus on the induction head example first, then go back to the general framework.
Suggestion
  • Read this section twice. First time: follow the induction head story. Second time: understand the general composition framework.

Discussion & Related Work

Reflects on what the framework enables and its limitations. Worth reading for the big-picture perspective.

Key Takeaways
  • The framework makes specific, testable predictions about transformer behavior
  • Attention-only models are a useful simplification but miss MLP contributions
  • Composition enables exponentially many virtual circuits from a linear number of heads
  • This paper launched a research program — everything in mech interp since builds on this framing
09

Required Reading & Resources

Everything you need, ranked by priority. Start from the top.

A Mathematical Framework for Transformer Circuits

Elhage, Nanda, Olah et al. — Anthropic, 2021

The paper this entire section is about. Introduces the residual stream view, QK/OV circuit decomposition, composition, and induction heads. Read with the guide above.

Medium 4–6 hours

Neel Nanda’s Prerequisites Guide

Neel Nanda — neelnanda.io

What math and coding background you need. Extremely practical — tells you exactly what to learn and what to skip. Start here if you’re unsure about prerequisites.

Easy 30 min

ARENA — Chapter 1: Transformer from Scratch

Callum McDougall

Hands-on Jupyter notebooks where you build a transformer from scratch. Best way to internalize the architecture. Do the exercises, don’t just read.

Medium 6–8 hours

Transformers for Software Engineers

Nelson Elhage

Explains transformer internals in the language of software engineering (data flow, state, computation steps). If you’re a software engineer, this is the fastest path to understanding.

Easy 1–2 hours

Neel Nanda’s Quickstart Guide

Neel Nanda — neelnanda.io

Overview of the entire field with actionable next steps. Read after you understand transformer internals to see how they connect to interpretability research.

Easy 1 hour

The Illustrated Transformer

Jay Alammar

Visual walkthrough of the transformer architecture. More traditional ML perspective (not the mech interp framing), but excellent diagrams. Good if you’re a visual learner.

Easy 45 min

In-context Learning and Induction Heads

Olsson, Elhage, Nanda et al. — Anthropic, 2022

Deep dive into induction heads: how they form during training, their role in in-context learning, and why they matter. Read after the Framework paper.

Challenging 3–4 hours

3Blue1Brown: Neural Networks / Attention

Grant Sanderson

Beautiful visual explanations of neural networks and attention. Watch if the math feels abstract — these videos build geometric intuition that makes everything click.

Easy 2 hours
Reading Order

Priority bars: Essential (read these) — Recommended — Optional (but valuable). Start with Nanda’s Prerequisites, then Elhage’s “Transformers for Software Engineers,” then the Framework paper with this guide, then ARENA exercises.

10

Exercises & Deliverables

You understand transformer internals when you can do these, not when you can describe them.

Exercise 1

Build a Transformer from Scratch

Implement a minimal GPT-2–style transformer in PyTorch. No libraries, no shortcuts. Include: token embeddings, positional embeddings, multi-head attention (with causal masking), MLP layers, residual connections, layer norm, and unembedding.

  • Use the ARENA Chapter 1 exercises as a guide
  • Load pretrained GPT-2 weights into your implementation and verify it produces the same outputs
  • This should take 4–8 hours. If it takes much less, you’re probably not understanding deeply enough
Exercise 2

Explore Activations with TransformerLens

Load GPT-2 Small in TransformerLens. For a simple prompt, cache all internal activations and explore them.

  • Visualize attention patterns for all heads in all layers
  • Find the “previous token” head (attends to position i-1)
  • Find an induction head (use a repeated sequence like "Mr Jones Mr Jones")
  • Use the logit lens to watch predictions evolve through layers
Exercise 3

Direct Logit Attribution

For a prompt where GPT-2 correctly predicts the next token, decompose the logit into contributions from each component.

  • Which attention heads contribute most to the correct prediction?
  • Which MLP layers contribute most?
  • Are there any components that actively push against the correct prediction?
  • Visualize the contributions as a bar chart
Exercise 4

Analyze QK and OV Circuits

Pick an attention head that has an interesting attention pattern (from Exercise 2). Analyze its QK and OV circuits separately.

  • Compute the full WE WQK WET matrix — which token types attend to which other token types?
  • Compute the OV circuit WE WOV WU — for each token this head attends to, what does it write to the logits?
  • Do these two analyses tell a coherent story about what this head does?
Deliverable

Notebook: “Inside GPT-2”

Combine the above into a single Jupyter notebook that demonstrates your ability to work with transformer internals. This is your proof of understanding and your reference for future work.

  • Clear markdown explanations alongside code
  • Visualizations of attention patterns, logit lens, and attribution
  • At least one specific finding about GPT-2’s behavior (something you discovered, even if small)
  • This notebook is your ticket to the next section: Superposition & Features

When You’re Ready

If you can explain the residual stream view, decompose attention into QK and OV circuits, and trace information flow through a two-layer induction circuit — you have the foundation. Every technique in mechanistic interpretability (SAEs, activation patching, circuit tracing, steering vectors) builds directly on this understanding. Head to the Roadmap for the next step: Superposition & Features.