Superposition & Features

The central problem of mechanistic interpretability: neural networks represent far more concepts than they have neurons. Understanding this is understanding why the field exists.

Roadmap Step 2 · 1–2 Weeks · Core Problem

In this section

  • 01 The Problem: Why Neurons Don’t Make Sense
  • 02 The Geometry of Superposition
  • 03 Features: The Real Units of Meaning
  • 04 Toy Models of Superposition
  • 05 Sparse Autoencoders: The Solution
  • 06 SAE Walkthrough: Step by Step
  • 07 Evaluating SAEs: How Do You Know It Worked?
  • 08 Frontier: What’s Beyond SAEs
  • 09 Paper Guide: Toy Models & Towards Monosemanticity
  • 10 Required Reading
  • 11 Exercises & Deliverables
01

The Problem: Why Neurons Don’t Make Sense

You’d expect each neuron to represent one thing. They don’t.

Here’s the intuitive expectation: a neural network learns to detect concepts, and each neuron represents one concept. Neuron 347 fires for “cat,” neuron 1,024 fires for “legal language,” neuron 5,891 fires for “the Golden Gate Bridge.” If this were true, interpretability would be solved — just read off what each neuron does.

This is not what happens. In practice, individual neurons fire for multiple, seemingly unrelated concepts. A single neuron might activate for cats, the color blue, and legal documents. This is called polysemanticity, and it makes neural networks fundamentally opaque at the neuron level.

Monosemantic Neuron (Rare)

Interpretable

Fires for one concept. Example: a neuron in an image model that activates only for curved edges at a specific orientation. You can look at it and say “I know what this does.” These exist but are the exception.

Neuron 42: activates for → “references to the Eiffel Tower” One neuron, one concept. Clean, interpretable. Rare in practice.

Polysemantic Neuron (Common)

Opaque

Fires for multiple unrelated concepts. The same neuron activates for academic citations, Korean text, and mentions of cooking temperatures. There’s no single story you can tell about “what this neuron does.”

Neuron 42: activates for → “academic citations” AND “Korean text” AND “cooking temps” One neuron, many concepts. Opaque. This is the norm.
The Core Question

Why would a network do this? It seems like bad engineering — why not give each concept its own neuron? The answer is superposition: the network has learned far more concepts than it has neurons, and it’s found a clever way to pack them all in. Understanding this packing scheme is the entire game.

02

The Geometry of Superposition

How you fit 1,000 concepts into 768 dimensions.

The key insight is geometric. In high-dimensional spaces, you can have a surprisingly large number of nearly orthogonal directions. In 768 dimensions (GPT-2’s residual stream), you can fit thousands of directions that are almost perpendicular to each other. “Almost” is doing a lot of work here — but if concepts are sparse enough (rarely active at the same time), the small amount of interference between these nearly-orthogonal directions doesn’t matter much.

3 features in 3 dimensions — no superposition needed
Feature A
[1, 0, 0]
Feature B
[0, 1, 0]
Feature C
[0, 0, 1]
Each feature gets its own axis. Perfectly orthogonal. Zero interference. This is the dream.

The Ideal Case: One Feature Per Dimension

If you have as many dimensions as features, each feature gets its own axis. Features don’t interfere with each other. Reading feature A is just reading dimension 1. This is monosemantic — and it’s exactly what SAEs try to recover.

5 features, but only 3 dimensions
A
B
C
D
E
↓ squeeze into ↓
dim 1
dim 2
dim 3
The network learned more concepts than it has dimensions. Now what?

The Real Situation: More Features Than Dimensions

A real model like GPT-2 Small has 768 dimensions but has learned thousands (maybe millions) of concepts. It can’t give each concept its own axis. It has to share. The question is: how does it share?

5 features packed into 3 dimensions via superposition
Feature A
[0.9, 0.2, 0.1]
Feature B
[0.1, 0.8, 0.3]
Feature C
[0.2, 0.1, 0.9]
Feature D
[0.7, -0.5, 0.3]
Feature E
[-0.3, 0.6, 0.7]
Features now share dimensions. Each feature is a direction (not an axis). Directions are nearly orthogonal, but not perfectly.

Superposition: Features as Directions

The network represents each feature as a direction in the high-dimensional space, rather than as a single axis. These directions are nearly orthogonal — the dot product between any two is small but not zero. When you read dimension 1, you get a mix of all features that use that dimension. This is why individual neurons are polysemantic.
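This readout interference can be simulated directly. A minimal numpy sketch (the dimensions and the choice of active feature are made up): store many unit-norm feature directions in a space with fewer dimensions, activate a single feature, then read every feature back by projecting onto its direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2048                    # 512 dimensions, 2048 features (n >> d)

# Each feature is a random unit-norm direction in d-dimensional space.
dirs = rng.normal(size=(n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Sparse input: only feature 7 is active, with strength 3.0.
x = 3.0 * dirs[7]

# Read out every feature by projecting onto its direction.
readout = dirs @ x

true_val = readout[7]                          # exactly 3.0 (unit norm)
interference = np.abs(np.delete(readout, 7))   # nonzero but small
print(true_val, interference.max())
```

The active feature reads back exactly; every inactive feature reads back a small spurious value. That spurious value is the interference that sparsity keeps tolerable.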

Sparsity makes interference tolerable
Input 1
A ON
B
C
D
E
Only A active → no interference
Input 2
A
B
C ON
D
E ON
C & E active → small interference

Why Sparsity Is the Key

Superposition works because most features are sparse — they’re only active on a small fraction of inputs. “Golden Gate Bridge” is relevant to maybe 0.01% of text. If features are rarely active simultaneously, the interference between their nearly-orthogonal directions rarely causes problems. The sparser the features, the more you can pack in. This is the fundamental tradeoff: more features = more interference, but sparsity keeps interference manageable.

The Superposition Tradeoff
Benefit: represent N features in d dimensions (N >> d)
Cost: interference ∝ (feature_i · feature_j)² × P(both active)
More features = more packing = more potential interference. But if features are sparse (low co-activation), the cost stays low.
The Johnson-Lindenstrauss Intuition

There’s a famous result in mathematics (the Johnson-Lindenstrauss lemma, together with concentration of measure): in high dimensions, random vectors are almost orthogonal. In 768 dimensions, you can have hundreds of thousands of directions where any pair has a dot product close to zero. The model doesn’t need to carefully engineer orthogonal directions — even random-ish directions in high-dimensional space are nearly orthogonal. Superposition is not a hack; it’s a natural consequence of high-dimensional geometry.
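This claim is easy to check empirically. A sketch (the sample size is arbitrary): draw a few thousand random directions in 768 dimensions and look at their pairwise cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 768, 2000                 # GPT-2 Small's width, 2000 random directions

v = rng.normal(size=(n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)

# All pairwise cosine similarities (dot products of unit vectors).
cos = v @ v.T
off_diag = cos[~np.eye(n, dtype=bool)]

print(off_diag.std())            # roughly 1/sqrt(768) ≈ 0.036
print(np.abs(off_diag).max())    # even the worst pair is close to orthogonal
```

Nearly two million pairs, and the typical cosine similarity is a few hundredths: high-dimensional space really does hand you near-orthogonality for free.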

03

Features: The Real Units of Meaning

If neurons aren’t the right unit, what is?

The mechanistic interpretability community has converged on the idea that the real units of meaning in a neural network are features: directions in activation space that correspond to interpretable concepts. A feature isn’t a neuron — it’s a direction that might be spread across many neurons.

A feature is a direction in the residual stream (or any activation space) that:

  • Corresponds to a concept: It activates on inputs related to a specific, human-understandable idea (e.g., “code has a bug,” “sycophantic praise,” “the Golden Gate Bridge”)
  • Is causally meaningful: Amplifying the feature changes the model’s behavior in the expected direction (more bridge references, more sycophancy, etc.)
  • Is sparse: Only active on a small fraction of inputs — most text isn’t about the Golden Gate Bridge
  • Is a direction, not a neuron: The feature vector might have nonzero components across many neurons

Features are assumed to be linear — they correspond to directions (vectors) rather than curved surfaces or nonlinear manifolds. This is the linear representation hypothesis. Evidence for it:

  • Linear probes work: Simple linear classifiers on activations can extract concept presence with high accuracy
  • Steering vectors work: Adding a fixed vector to the residual stream changes behavior predictably
  • SAEs work: Linear decomposition (the basis of SAEs) finds interpretable features
  • Caveat: Some concepts may be nonlinear. Active research area. The hypothesis is useful even if not perfectly true.
Neurons vs. Features

Think of it like this: neurons are the physical basis (the actual computational units in the hardware), while features are the natural basis (the concepts the network actually works with). A neuron is like a pixel on a screen — it’s a real thing, but the meaningful unit is the image, which is a pattern across many pixels. SAEs try to find the “image” (feature) basis from the “pixel” (neuron) basis.

Real Features Found in Claude

Anthropic’s Scaling Monosemanticity (2024) extracted millions of features from Claude 3 Sonnet. Some examples:

  • A feature for the Golden Gate Bridge (fires for any mention, image, or reference)
  • A feature for “code has a bug” (fires when reviewing buggy code)
  • A feature for sycophantic praise (fires when the model is being overly agreeable)
  • A feature for deception / being untruthful (safety-relevant!)
  • Features that are multilingual (same feature fires for “love” in English, French, Chinese, etc.)
  • Features that are multimodal (same feature fires for text about and images of the same concept)

Amplifying the Golden Gate Bridge feature made Claude obsessively relate everything to the bridge — proof these features are causally meaningful, not just correlations.

04

Toy Models of Superposition

A mathematical framework for when and how superposition happens.

Elhage et al.’s “Toy Models of Superposition” (2022) studies superposition in a controlled setting. They train tiny models (5 neurons, 20 features) and observe exactly when and how features get packed into fewer dimensions. The results give us a framework for understanding superposition in real models.

Architecture
Input: x ∈ ℝⁿ (n sparse features, each active with probability S)
Bottleneck: h = Wx, where W ∈ ℝ^(m×n), m < n
Output: x̂ = ReLU(Wᵀh + b)
Loss: ||x − x̂||² weighted by feature importance
A linear encoder → nonlinear decoder. m dimensions must represent n > m features. The model must choose which features to keep and how to pack them.

Phase Transitions

Finding

Superposition doesn’t increase smoothly. As feature sparsity increases, the model suddenly transitions from “dedicate one dimension to the most important feature” to “pack multiple features in using superposition.” These sharp transitions are a sign of geometric structure, not randomness.

Sparsity Determines Packing

Finding

Sparser features (those active less often) are more likely to be stored in superposition. Dense features (active often) get their own dimensions because interference would be too costly. The model dynamically allocates: important + dense features get dedicated axes; less important + sparse features get packed.

Geometric Structures Emerge

Finding

Features in superposition organize into beautiful geometric structures: pentagons, tetrahedra, antipodal pairs. These are optimal packings — the same shapes that appear in coding theory and sphere-packing problems. The network rediscovers known mathematical optima.

Importance × Sparsity Tradeoff

Finding

Whether a feature gets its own dimension depends on importance × density. High importance + high density = dedicated dimension. Low importance + high sparsity = superposition. The model makes optimal allocation decisions.

Why Toy Models Matter

You can’t study superposition in GPT-4 because you don’t know the ground-truth features. Toy models let you control the features, vary sparsity and importance, and verify that your decomposition methods (SAEs) actually recover the true features. They’re the controlled experiments that give us confidence the approach works.

05

Sparse Autoencoders: The Solution

How we crack open superposition and find the real features.

If the network stores features in superposition (many features packed into fewer dimensions), the solution is to unpack them. A Sparse Autoencoder (SAE) does exactly this: it takes the dense, polysemantic activations from a model layer and decomposes them into a much larger set of sparse, ideally monosemantic features.

Think of it as a learned dictionary. The model’s activations are compressed text. The SAE is a translator that converts from “neuron language” (polysemantic, compressed) to “feature language” (monosemantic, interpretable, but much wider).

Model Activations e.g., residual stream at layer 6, dim = 768
SAE Encoder Linear projection + activation → sparse features
Sparse Feature Space (e.g., 768 × 16 = 12,288 features)
SAE Decoder Linear projection → reconstructed activations
Reconstructed Activations Should closely match original activations
SAE Architecture
Encode: f = TopK(W_enc(x − b_dec) + b_enc)
Decode: x̂ = W_dec · f + b_dec
Loss: ||x − x̂||²
W_enc: [hidden_dim × d_model], W_dec: [d_model × hidden_dim]. TopK keeps only the top K activations, zeroing the rest. hidden_dim is much larger than d_model (16x to 64x expansion).
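The encode/decode equations above, written out in numpy (the weights are random placeholders; a real SAE learns them; the shapes follow the 768 → 12,288 example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, hidden, k = 768, 12288, 64      # 16x expansion, K = 64

W_enc = rng.normal(size=(hidden, d_model)) * 0.02
W_dec = rng.normal(size=(d_model, hidden)) * 0.02
b_enc = np.zeros(hidden)
b_dec = np.zeros(d_model)

def topk(z, k):
    """Keep the k largest entries of z; zero the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out

x = rng.normal(size=d_model)                  # a residual-stream vector
f = topk(W_enc @ (x - b_dec) + b_enc, k)      # encode: sparse feature vector
x_hat = W_dec @ f + b_dec                     # decode: reconstruction

print(np.count_nonzero(f))                    # k active features
```

Note the asymmetry: f lives in the wide, sparse feature space; x and x_hat live in the model's dense activation space. The columns of W_dec are the feature directions.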

TopK (Current Standard)

Preferred

Keep only the top K activations per input, set the rest to zero. Directly controls sparsity level. K might be 32–128 out of 12,288+ features. Clean, predictable.

L1 Penalty (Original)

Historic

Add λ||f||1 to the loss. Encourages sparsity but requires tuning λ. Can cause “feature splitting” (one concept split across multiple features) or “feature absorption.” TopK avoids these issues.

Expansion factor: How much wider is the SAE than the model? Typical values: 16x to 64x. More expansion = more features found, but also more noise features and higher compute cost. 16x is a common starting point.

Which layer to decompose: You train separate SAEs for different layers. Early layers have more syntactic features. Later layers have more semantic/abstract features. The residual stream is the most common target.

Training data: Run the model on a large text corpus, cache activations at the target layer, train the SAE on those cached activations. Millions to billions of activation vectors.

06

SAE Walkthrough: Step by Step

What actually happens when an SAE decomposes a model’s activations.

Model processes: “The Eiffel Tower is located in”
0.82
-0.31
1.47
0.05
-0.63
0.29
768-dimensional residual stream vector at position “in”, layer 8. Dense. Every dimension is nonzero. Polysemantic.

Start: Dense Model Activations

We extract the residual stream at a specific layer and position. This 768-dimensional vector is the model’s internal state. It’s dense (most values are nonzero) and polysemantic (each dimension blends multiple concepts). We can’t directly read what the model “knows” here.

Encoder projects to 12,288-dimensional feature space
768-dim
x
× Wenc + benc
12,288-dim (pre-activation)
z = [0.1, -0.3, 2.8, 0.0, 1.1, …]

Encode: Project to a Wider Space

The encoder is a linear layer that projects from 768 dimensions to 12,288 dimensions (a 16x expansion). This wider space has enough room for each feature to get its own dimension. The pre-activation values indicate how much each potential feature matches the input.

TopK activation: keep top 64, zero the rest
0
0
2.8
0
1.1
0
0
0.9
0
0
64 out of 12,288 features are active (~0.5%). This IS the sparse representation. Each nonzero value is one feature “firing.”

Sparsify: Most Features are Zero

TopK keeps only the 64 largest activations and zeros everything else. Now we have a sparse representation: only 64 out of 12,288 features are active (~0.5%). Each active feature ideally corresponds to one interpretable concept present in this input.

Active features and what they mean
2.8
Feature #1,847: “Eiffel Tower / Paris landmarks”
1.1
Feature #4,203: “Geographic location queries”
0.9
Feature #7,891: “European cultural references”
+ 61 more active features (lower activation strengths)

Interpret: Each Feature Has a Meaning

Now comes the payoff. Each active feature (ideally) corresponds to one interpretable concept. Feature #1,847 fires strongly (2.8) because the Eiffel Tower is present. Feature #4,203 fires because this is a location query. The activation strength tells you how strongly each concept is present. You’ve gone from “768 opaque numbers” to “64 named concepts with strengths.”

Decoder reconstructs the original activations
Sparse features
f
× Wdec + bdec
Reconstructed (768-dim)
Original (768-dim)
x
Good SAEs reconstruct >95% of the variance. The decoder columns ARE the feature directions in activation space.

Decode: Verify the Decomposition

The decoder projects the sparse features back to 768 dimensions. If the reconstruction closely matches the original activations, the SAE has found a good decomposition. Critically: each column of the decoder matrix IS the feature’s direction in activation space. This is what you plot, analyze, and use for steering.

07

Evaluating SAEs: How Do You Know It Worked?

Finding features is easy. Finding the right features is hard.

Training an SAE will always give you features. The question is whether those features are real (corresponding to genuine concepts in the model) or artifacts (statistical patterns that don’t mean anything). There’s no single metric that answers this, so the field uses several.

Reconstruction Quality

Metric

How well can the decoder reconstruct the original activations from the sparse features? Measured by explained variance (R2) or mean squared error. Good SAEs explain >95% of variance. But: perfect reconstruction with garbage features is possible (just memorize the data), so this is necessary but not sufficient.
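The two quantities that recur throughout these evaluations, explained variance and L0, are one-liners. A sketch on synthetic data (the “reconstruction” here is just the input plus small noise, so the numbers come out favorable by construction):

```python
import numpy as np

def explained_variance(x, x_hat):
    """Fraction of the variance in x captured by the reconstruction."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

def l0(f):
    """Average number of active (nonzero) features per input."""
    return np.count_nonzero(f, axis=-1).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 768))
x_hat = x + 0.1 * rng.normal(size=x.shape)     # near-perfect reconstruction

f = (rng.random((1000, 12288)) < 64 / 12288).astype(float)  # ~64 active each

ev = explained_variance(x, x_hat)
print(ev)          # close to 0.99
print(l0(f))       # close to 64
```

The Pareto frontier discussed below is just l0 on one axis and explained_variance on the other, computed for SAEs trained at different sparsity settings.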

Downstream Performance

Metric

Replace the model’s real activations with SAE-reconstructed activations. Does the model still perform well? If loss goes up a lot, the SAE is losing important information. This is a stronger test than reconstruction because it measures whether the SAE preserves the information the model actually uses.

Interpretability

Qualitative

Can humans understand the features? For each feature, look at the top-activating inputs and see if they share a coherent theme. Automated scoring: have an LLM describe the feature from its top activations, then test whether the description predicts activations on held-out data. This is how Neuronpedia generates feature descriptions.

Sparsity vs. Reconstruction Tradeoff

Metric

Plot reconstruction quality against sparsity (L0 = average number of active features). Better SAEs achieve the same reconstruction with fewer active features. This Pareto frontier is the main comparison metric: does SAE architecture A dominate architecture B?

Feature Stability

Open Problem

Train two SAEs with different random seeds. Do they find the same features? Currently: partially. Some features are consistent, others aren’t. This is a fundamental concern — if features depend on the random seed, are they “real” or just one valid decomposition among many?

Causal Tests (Steering)

Gold Standard

The strongest test: does amplifying or suppressing a feature change the model’s behavior in the expected way? If the “Golden Gate Bridge” feature is real, amplifying it should make the model talk about the bridge. This is the closest thing to ground truth we have for real models.
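Mechanically, steering with an SAE feature is just vector addition in activation space. A sketch with a random stand-in decoder (feature index 1,847 reuses the hypothetical Eiffel Tower feature from the walkthrough; a real experiment would take a trained SAE's decoder and re-run the model on the modified activation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, hidden = 768, 12288

# Stand-in decoder; column j of W_dec is feature j's direction.
W_dec = rng.normal(size=(d_model, hidden))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)

def steer(x, feature_idx, alpha):
    """Add alpha units of a feature's direction to an activation vector."""
    return x + alpha * W_dec[:, feature_idx]

x = rng.normal(size=d_model)                  # residual stream at some layer
x_steered = steer(x, feature_idx=1847, alpha=10.0)

# The steered activation projects exactly alpha units more onto the feature.
delta = (x_steered - x) @ W_dec[:, 1847]
print(delta)                                  # alpha, up to float error
```

In practice you would patch x_steered back into the model's forward pass and check whether the generated text shifts toward the feature's concept.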

The Verification Problem

This is one of the field’s deepest challenges: there’s no ground truth for what the “real features” are in a production model. We can verify in toy models (where we know the true features), but for GPT-4 or Claude, we’re always making inferences. Every interpretation should be treated as a hypothesis, not a conclusion.

08

Frontier: What’s Beyond SAEs

Where the cutting edge is going, as of 2025–2026.

Gated SAEs

Variant

Add a gating mechanism to the encoder: a separate linear layer decides which features to activate (gate), while the main encoder computes the activation magnitudes. Separating “should this feature fire?” from “how strongly?” gives better reconstruction at the same sparsity level.
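A simplified sketch of the gating idea (in the actual Gated SAE the two paths share weights up to a learned rescale, and sparsity comes from training; here the weights are random and only the structure is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, hidden = 64, 512

# Separate paths: the gate decides IF a feature fires,
# the magnitude path decides HOW STRONGLY.
W_gate = rng.normal(size=(hidden, d_model)) * 0.1
b_gate = rng.normal(size=hidden) * 0.1
W_mag = rng.normal(size=(hidden, d_model)) * 0.1
b_mag = np.zeros(hidden)

x = rng.normal(size=d_model)
gate = W_gate @ x + b_gate > 0           # binary: should this feature fire?
mag = np.maximum(0, W_mag @ x + b_mag)   # nonnegative magnitude
f = gate * mag                           # gated feature activations

print(np.count_nonzero(f), "of", hidden, "features active")
```

Decoupling the fire/don't-fire decision from the magnitude removes the L1 penalty's systematic shrinkage of feature activations, which is where the reconstruction gains come from.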

Crosscoders

Extension

SAEs that read from and write to multiple layers (respecting causality). Enable tracking how features evolve through the network and comparing features between different models (“model diffing”). Key tool for Anthropic’s circuit tracing work.

Transcoders / CLTs

Alternative

Instead of decomposing activations, replace MLP layers entirely with sparse, interpretable alternatives. A cross-layer transcoder reads the residual stream and contributes to all subsequent MLP outputs. Makes circuit tracing dramatically cleaner.

BatchTopK

Training

Instead of choosing the top K features per example, keep the top K × batch_size activations across the entire batch. This allows adaptive sparsity per input: some inputs might have 20 active features, others 100, while the average stays at K. Better than fixed K for real-world distributions.
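The difference from per-example TopK is easiest to see in code. A numpy sketch (batch and feature counts are arbitrary; one input is given artificially strong pre-activations to show the adaptive allocation):

```python
import numpy as np

def per_example_topk(z, k):
    """Standard TopK: exactly k active features for every input."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=1)[:, -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out

def batch_topk(z, k):
    """BatchTopK: keep the k * batch_size largest activations overall,
    so individual inputs can use more or fewer than k features."""
    out = np.zeros_like(z)
    flat = z.ravel()
    idx = np.argsort(flat)[-k * z.shape[0]:]
    out.ravel()[idx] = flat[idx]
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 100))            # 8 inputs, 100 pre-activations each
z[0] += 2.0                              # input 0 has unusually strong features

counts_pe = np.count_nonzero(per_example_topk(z, 10), axis=1)
counts_bt = np.count_nonzero(batch_topk(z, 10), axis=1)
print(counts_pe)                         # exactly 10 everywhere
print(counts_bt)                         # input 0 claims more than 10
```

Same total activation budget in both cases; BatchTopK just lets feature-rich inputs spend more of it.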

Meta-SAEs

Research

Train an SAE on the decoder columns of another SAE to discover structure among features. Can features be grouped? Do they form a hierarchy? Are there families of related features? Reveals how the model organizes its knowledge.

The “Are Features Real?” Debate

Open Question

Maybe the linear representation hypothesis is wrong. Maybe features aren’t directions but regions, manifolds, or something else entirely. Maybe there is no “right” decomposition. This remains the deepest open question. SAEs work well enough to be useful, but we can’t prove they’re finding the “true” structure.

09

Paper Guide

Section-by-section guides to the two essential papers for this unit.

Reading Strategy

Toy Models of Superposition is long but well-written. Focus on Sections 1–3 (setup, results, geometry) for the core understanding. Sections 4+ go deeper into specific phenomena and can be skimmed on a first read. Budget 3–4 hours for a careful first pass.

Section 1–2: Motivation & Setup

Why superposition matters, the experimental setup (tiny ReLU networks trained to reconstruct sparse inputs through a bottleneck), and the key variables (feature importance, feature sparsity, bottleneck width).

What to Focus On
  • The toy model architecture — understand it completely before moving on
  • The distinction between feature importance and feature sparsity
  • What “superposition” means formally in this context: features represented as non-axis-aligned directions

Section 3: Key Results

The main findings: phase transitions between no superposition and full superposition, how sparsity and importance interact, the emergence of geometric structures.

Key Takeaways
  • Phase transitions: Superposition doesn’t ramp up smoothly. As you increase the number of features relative to dimensions, there are sharp transitions.
  • Sparsity enables packing: The sparser a feature, the more the model is willing to store it in superposition (because interference rarely matters).
  • Geometric structures: Features in superposition form optimal geometric configurations (pentagons, tetrahedra). The network discovers these automatically.
Common Stumbling Blocks
  • The paper plots feature directions projected into 2D — remember this is a projection of higher-dimensional geometry
  • The WᵀW matrix visualization: each cell shows the dot product between two feature directions. Diagonal = 1.0 (self), off-diagonal = interference.

Sections 4+: Deeper Phenomena

Feature splitting, superposition in different types of features (discrete, correlated), computation in superposition. Important for researchers, skippable for a first pass.

Skim For
  • Feature splitting: One concept might be split across multiple features. This becomes relevant when evaluating SAEs.
  • Computation in superposition: Can a network compute on features that are in superposition, without first decompressing them? The paper suggests yes, which has deep implications.
Reading Strategy

Towards Monosemanticity is the proof-of-concept for SAEs. It demonstrates that SAEs actually work on a real (tiny) transformer. Focus on understanding the results and methodology. The mathematical details of the SAE architecture are less important than understanding what the features look like. Budget 2–3 hours.

Problem & Approach

Why neurons are polysemantic, the dictionary learning framing, and the SAE architecture.

What to Focus On
  • The framing: activations as a linear combination of sparse features, SAE as dictionary learning
  • The 1-layer 512-neuron transformer model they use (small enough to fully analyze)
  • The L1 sparsity penalty (this paper used L1, not TopK — TopK came later)

Results: What the Features Look Like

The actual features extracted. This is the exciting part — real examples of monosemantic features emerging from the SAE decomposition.

What to Focus On
  • The feature dashboards: top activating examples, logit effects, feature density
  • How clearly monosemantic the features are compared to the polysemantic neurons
  • The range of feature types: specific tokens, abstract concepts, syntactic patterns
  • How features are validated: top activations + causal effects (ablation / amplification)

Evaluation & Limitations

How they evaluate the SAE, the tradeoffs, and honest limitations.

What to Focus On
  • The reconstruction vs. sparsity tradeoff (Pareto frontier)
  • The “dead features” problem: some SAE features never activate. Wasted capacity.
  • The model they use is tiny (1-layer) — does this scale? (Spoiler: Scaling Monosemanticity answers this)
  • Honest about limitations: feature splitting, inconsistency across seeds, evaluation challenges
10

Required Reading & Resources

Everything you need for this unit, ranked by priority.

Toy Models of Superposition

Elhage, Hume, Olah et al. — Anthropic, 2022

The mathematical framework for understanding superposition. When and how features get packed into fewer dimensions. Phase transitions, geometric structures, the importance/sparsity tradeoff.

Medium 3–4 hours

Towards Monosemanticity

Bricken, Templeton et al. — Anthropic, 2023

First successful SAE decomposition of a real transformer. Proof of concept that SAEs can extract interpretable, monosemantic features from polysemantic neurons.

Medium 2–3 hours

Scaling Monosemanticity

Templeton et al. — Anthropic, 2024

SAEs scaled to Claude 3 Sonnet. Millions of features including abstract, multilingual, multimodal, and safety-relevant concepts. Proves the approach works on production models. Read the blog post version first; the full paper is very long.

Easy 1 hour (blog)

Neuronpedia — Browse Features Interactively

Decode Research

Don’t just read about features — explore them. Browse the feature dashboards, see what features look like in practice, explore top activations and logit effects. Load pretrained SAEs and search for features related to any concept.

Easy 1+ hours

ARENA — SAE Exercises

Callum McDougall

Hands-on Jupyter notebooks: train a toy SAE, load pretrained SAEs from Neuronpedia, explore features, analyze the sparsity/reconstruction tradeoff. The practical companion to the papers.

Medium 4–6 hours

SAELens Documentation & Tutorials

Joseph Bloom / Decode Research

The standard library for training and analyzing SAEs. Tutorials for training your own SAE, loading pretrained ones, and analyzing features. Works with any PyTorch model.

Medium 2–3 hours

Nanda’s Recommended Superposition Papers

Neel Nanda

Curated list of the best papers on superposition and features, with Nanda’s personal commentary on what’s worth reading and why. Good for going deeper after the essentials.

Easy 30 min
Suggested Order

Read Toy Models first (understand the problem), then Towards Monosemanticity (the solution), then Scaling Monosemanticity (proof it works at scale). Interleave with Neuronpedia browsing — seeing real features makes the papers click. Then do the ARENA exercises to get hands-on.

11

Exercises & Deliverables

Build intuition by doing, not just reading.

Exercise 1

Train a Toy Model of Superposition

Replicate the core result from Toy Models of Superposition. Train a small ReLU network to reconstruct sparse inputs through a bottleneck.

  • Create synthetic data: 20 sparse features with varying importance and sparsity
  • Train a bottleneck model (20 → 5 → 20) to reconstruct the inputs
  • Visualize the learned W matrix: are features axis-aligned (monosemantic) or in superposition?
  • Vary sparsity: watch the phase transition from monosemantic to superposition
  • Plot the WᵀW matrix — do you see the geometric structures (pentagons, tetrahedra)?
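A starter scaffold for this exercise, in plain numpy with hand-derived gradients (the batch size, learning rate, and the 0.9 importance schedule are arbitrary choices; PyTorch autograd works just as well and is less error-prone):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 5                  # 20 features, 5-dim bottleneck
S = 0.05                      # probability each feature is active
importance = 0.9 ** np.arange(n)
W = rng.normal(size=(m, n)) * 0.1
b = np.zeros(n)
lr, losses = 0.1, []

for step in range(5000):
    # Batch of sparse inputs: active features get uniform magnitudes.
    x = (rng.random((256, n)) < S) * rng.random((256, n))

    h = x @ W.T                         # bottleneck       (256, m)
    pre = h @ W + b                     # pre-activation   (256, n)
    x_hat = np.maximum(0, pre)          # x_hat = ReLU(W^T h + b)

    err = importance * (x_hat - x)
    losses.append((err * (x_hat - x)).sum(axis=1).mean())

    # Backprop by hand: W appears in both the encode and decode step.
    g = 2 * err * (pre > 0) / len(x)            # dL/d(pre)
    grad_W = h.T @ g + (g @ W.T).T @ x
    W -= lr * grad_W
    b -= lr * g.sum(axis=0)

# Per-feature norms of W: which features won dedicated capacity?
print(np.round(np.linalg.norm(W, axis=0), 2))
print(losses[0], losses[-1])            # loss should have dropped
```

To see the phase transition, rerun this sweep over several values of S and watch the column norms and WᵀW reorganize as features move into superposition.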
Exercise 2

Train a Toy SAE

Train an SAE to decompose the superposed representations from Exercise 1. Verify it recovers the true features.

  • Take the 5-dim bottleneck representations from your trained model
  • Train an SAE: 5 → 20 → 5 (expand back to the true feature count)
  • Compare SAE features to the true features — does it recover them?
  • Try different sparsity penalties: L1 vs TopK. Which recovers features better?
  • Measure: reconstruction quality, feature interpretability, and sparsity
Exercise 3

Explore Real Features on Neuronpedia

Browse Neuronpedia to build intuition for what SAE features look like in practice.

  • Pick a model (GPT-2 Small is a good start) and a layer
  • Find 5 features you can clearly interpret — describe what each one does
  • Find 2 features that seem hard to interpret — what makes them unclear?
  • Search for features related to a specific concept (e.g., “Python code,” “emotions”)
  • Look at the logit effects: which output tokens does each feature promote?
Exercise 4

Load Pretrained SAEs with SAELens

Use SAELens to load pretrained SAEs and analyze features programmatically.

  • Load a pretrained SAE for GPT-2 Small from Neuronpedia
  • Run the model + SAE on a prompt of your choice
  • Identify the top active features for a specific token position
  • Visualize: which features activate across a full sentence? (feature activation heatmap)
  • Compare features at different layers: early (syntactic) vs. late (semantic)
Deliverable

Notebook: “Superposition & SAE Features”

Combine the above into a single notebook demonstrating your understanding of superposition and your ability to work with SAE features.

  • Toy model showing superposition phase transition (with visualization)
  • Toy SAE recovering true features from superposed representations
  • Analysis of 5+ real SAE features from a pretrained model
  • At least one comparison: features at early vs. late layers, or different sparsity levels
  • This notebook + your Transformer Internals notebook is a strong portfolio for the next step: Activation Patching & Circuits

When You’re Ready

If you can explain why superposition happens (high-dimensional geometry + sparsity), how SAEs decompose it (sparse overcomplete dictionary), and can work with real SAE features from Neuronpedia and SAELens — you have the tools. The next step is learning to use features for causal analysis: activation patching, circuit discovery, and tracing how features influence the model’s output. Head to the Roadmap for Step 3: Activation Patching & Circuits.