The Problem: Why Neurons Don’t Make Sense
You’d expect each neuron to represent one thing. They don’t.
Here’s the intuitive expectation: a neural network learns to detect concepts, and each neuron represents one concept. Neuron 347 fires for “cat,” neuron 1,024 fires for “legal language,” neuron 5,891 fires for “the Golden Gate Bridge.” If this were true, interpretability would be solved — just read off what each neuron does.
This is not what happens. In practice, individual neurons fire for multiple, seemingly unrelated concepts. A single neuron might activate for cats, the color blue, and legal documents. This is called polysemanticity, and it makes neural networks fundamentally opaque at the neuron level.
Monosemantic Neuron (Rare)
Interpretable: Fires for one concept. Example: a neuron in an image model that activates only for curved edges at a specific orientation. You can look at it and say “I know what this does.” These exist but are the exception.
Polysemantic Neuron (Common)
Opaque: Fires for multiple unrelated concepts. The same neuron activates for academic citations, Korean text, and mentions of cooking temperatures. There’s no single story you can tell about “what this neuron does.”
Why would a network do this? It seems like bad engineering — why not give each concept its own neuron? The answer is superposition: the network has learned far more concepts than it has neurons, and it’s found a clever way to pack them all in. Understanding this packing scheme is the entire game.
The Geometry of Superposition
How you fit 1,000 concepts into 768 dimensions.
The key insight is geometric. In high-dimensional spaces, you can have a surprisingly large number of nearly orthogonal directions. In 768 dimensions (GPT-2’s residual stream), you can fit thousands of directions that are almost perpendicular to each other. “Almost” is doing a lot of work here — but if concepts are sparse enough (rarely active at the same time), the small amount of interference between these nearly-orthogonal directions doesn’t matter much.
The Ideal Case: One Feature Per Dimension
If you have as many dimensions as features, each feature gets its own axis. Features don’t interfere with each other. Reading feature A is just reading dimension 1. This is monosemantic — and it’s exactly what SAEs try to recover.
The Real Situation: More Features Than Dimensions
A real model like GPT-2 Small has 768 dimensions but has learned thousands (maybe millions) of concepts. It can’t give each concept its own axis. It has to share. The question is: how does it share?
Superposition: Features as Directions
The network represents each feature as a direction in the high-dimensional space, rather than as a single axis. These directions are nearly orthogonal — the dot product between any two is small but not zero. When you read dimension 1, you get a mix of all features that use that dimension. This is why individual neurons are polysemantic.
Why Sparsity Is the Key
Superposition works because most features are sparse — they’re only active on a small fraction of inputs. “Golden Gate Bridge” is relevant to maybe 0.01% of text. If features are rarely active simultaneously, the interference between their nearly-orthogonal directions rarely causes problems. The sparser the features, the more you can pack in. This is the fundamental tradeoff: more features = more interference, but sparsity keeps interference manageable.
Cost: interference ∝ (featureᵢ · featureⱼ)² × P(both active)
More features = more packing = more potential interference. But if features are sparse (low co-activation), the cost stays low.
There’s a classic fact of high-dimensional geometry (closely related to the Johnson-Lindenstrauss lemma): in high dimensions, random vectors are almost orthogonal. In 768 dimensions, you can have hundreds of thousands of directions where any pair has a dot product close to zero. The model doesn’t need to carefully engineer orthogonal directions — even random-ish directions in high-dimensional space are nearly orthogonal. Superposition is not a hack; it’s a natural consequence of high-dimensional geometry.
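This near-orthogonality is easy to check numerically. The sketch below (NumPy; the dimension 768 matches the text, the number of directions is illustrative) samples random unit vectors and measures how far from orthogonal the worst pair is:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 768, 2000                     # dimension, number of random directions

# Random unit vectors in d dimensions
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise dot products, off-diagonal entries only
G = V @ V.T
off_diag = G[~np.eye(n, dtype=bool)]

print(off_diag.std())           # concentrates around 1/sqrt(768) ≈ 0.036
print(np.abs(off_diag).max())   # even the worst pair is nearly orthogonal
```

Two thousand directions in 768 dimensions, and no pair interferes much: this is the geometric headroom superposition exploits.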
Features: The Real Units of Meaning
If neurons aren’t the right unit, what is?
The mechanistic interpretability community has converged on the idea that the real units of meaning in a neural network are features: directions in activation space that correspond to interpretable concepts. A feature isn’t a neuron — it’s a direction that might be spread across many neurons.
A feature is a direction in the residual stream (or any activation space) that:
- Corresponds to a concept: It activates on inputs related to a specific, human-understandable idea (e.g., “code has a bug,” “sycophantic praise,” “the Golden Gate Bridge”)
- Is causally meaningful: Amplifying the feature changes the model’s behavior in the expected direction (more bridge references, more sycophancy, etc.)
- Is sparse: Only active on a small fraction of inputs — most text isn’t about the Golden Gate Bridge
- Is a direction, not a neuron: The feature vector might have nonzero components across many neurons
Features are assumed to be linear — they correspond to directions (vectors) rather than curved surfaces or nonlinear manifolds. This is the linear representation hypothesis. Evidence for it:
- Linear probes work: Simple linear classifiers on activations can extract concept presence with high accuracy
- Steering vectors work: Adding a fixed vector to the residual stream changes behavior predictably
- SAEs work: Linear decomposition (the basis of SAEs) finds interpretable features
- Caveat: Some concepts may be nonlinear. Active research area. The hypothesis is useful even if not perfectly true.
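The probe evidence is easy to illustrate on synthetic data. The sketch below (NumPy; the “activations” are simulated, not taken from a real model) plants a concept along one direction and shows that even a minimal linear classifier reads it out:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000

# A hypothetical concept direction "used" by the model
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic activations: when the concept is present (y=1), the
# activation is shifted along true_dir; otherwise it's pure noise.
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + np.outer(y, 2.0 * true_dir)

# Minimal linear probe: difference of class means as the weight vector
w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
scores = X @ w
preds = (scores > scores.mean()).astype(int)
accuracy = (preds == y).mean()   # high accuracy => linearly readable concept
```

If the concept were encoded nonlinearly (say, only in the norm of the activation), this probe would fail; that’s what makes probe accuracy evidence for the linear representation hypothesis.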
Think of it like this: neurons are the physical basis (the actual computational units in the hardware), while features are the natural basis (the concepts the network actually works with). A neuron is like a pixel on a screen — it’s a real thing, but the meaningful unit is the image, which is a pattern across many pixels. SAEs try to find the “image” (feature) basis from the “pixel” (neuron) basis.
Real Features Found in Claude
Anthropic’s Scaling Monosemanticity (2024) extracted millions of features from Claude 3 Sonnet. Some examples:
- A feature for the Golden Gate Bridge (fires for any mention, image, or reference)
- A feature for “code has a bug” (fires when reviewing buggy code)
- A feature for sycophantic praise (fires when the model is being overly agreeable)
- A feature for deception / being untruthful (safety-relevant!)
- Features that are multilingual (same feature fires for “love” in English, French, Chinese, etc.)
- Features that are multimodal (same feature fires for text about and images of the same concept)
Amplifying the Golden Gate Bridge feature made Claude obsessively relate everything to the bridge — proof these features are causally meaningful, not just correlations.
Toy Models of Superposition
A mathematical framework for when and how superposition happens.
Elhage et al.’s “Toy Models of Superposition” (2022) studies superposition in a controlled setting. They train tiny models (5 neurons, 20 features) and observe exactly when and how features get packed into fewer dimensions. The results give us a framework for understanding superposition in real models.
Bottleneck: h = Wx, where W ∈ ℝ^(m×n), m < n
Output: x̂ = ReLU(Wᵀh + b)
Loss: ||x − x̂||² weighted by feature importance
A linear encoder → nonlinear decoder. m dimensions must represent n > m features. The model must choose which features to keep and how to pack them.
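The setup above fits in a few lines of PyTorch. This is a minimal sketch (the 20 → 5 sizes come from the text; the per-feature importance weighting on the loss and the training loop are omitted for brevity):

```python
import torch

class ToyModel(torch.nn.Module):
    """Bottleneck model in the spirit of Toy Models of Superposition:
    n sparse features squeezed through m < n hidden dimensions,
    reconstructed with tied weights and a ReLU."""
    def __init__(self, n_features=20, n_hidden=5):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                       # x: [batch, n_features]
        h = x @ self.W.T                        # compress to [batch, n_hidden]
        return torch.relu(h @ self.W + self.b)  # reconstruct: ReLU(W'Wx + b)

def sparse_batch(batch_size, n_features=20, p=0.05):
    """Synthetic sparse inputs: each feature active with probability p,
    with uniform random magnitude when active."""
    mask = (torch.rand(batch_size, n_features) < p).float()
    return mask * torch.rand(batch_size, n_features)
```

Training this on `sparse_batch` data and inspecting the columns of `W` is exactly Exercise 1 below: at low sparsity the columns align with axes; at high sparsity they spread into superposed directions.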
Phase Transitions
Finding: Superposition doesn’t increase smoothly. As feature sparsity increases, the model suddenly transitions from “dedicate one dimension to the most important feature” to “pack multiple features in using superposition.” These sharp transitions are a sign of geometric structure, not randomness.
Sparsity Determines Packing
Finding: Sparser features (those active less often) are more likely to be stored in superposition. Dense features (active often) get their own dimensions because interference would be too costly. The model dynamically allocates: important + dense features get dedicated axes; less important + sparse features get packed.
Geometric Structures Emerge
Finding: Features in superposition organize into beautiful geometric structures: pentagons, tetrahedra, antipodal pairs. These are optimal packings — the same shapes that appear in coding theory and sphere-packing problems. The network rediscovers known mathematical optima.
Importance × Sparsity Tradeoff
Finding: Whether a feature gets its own dimension depends on importance × density. High importance + high density = dedicated dimension. Low importance + high sparsity = superposition. The model makes optimal allocation decisions.
You can’t study superposition in GPT-4 because you don’t know the ground-truth features. Toy models let you control the features, vary sparsity and importance, and verify that your decomposition methods (SAEs) actually recover the true features. They’re the controlled experiments that give us confidence the approach works.
Sparse Autoencoders: The Solution
How we crack open superposition and find the real features.
If the network stores features in superposition (many features packed into fewer dimensions), the solution is to unpack them. A Sparse Autoencoder (SAE) does exactly this: it takes the dense, polysemantic activations from a model layer and decomposes them into a much larger set of sparse, ideally monosemantic features.
Think of it as a learned dictionary. The model’s activations are compressed text. The SAE is a translator that converts from “neuron language” (polysemantic, compressed) to “feature language” (monosemantic, interpretable, but much wider).
Encode: f = TopK(Wenc · x + benc)
Decode: x̂ = Wdec · f + bdec
Loss: ||x − x̂||²
Wenc: [hidden_dim × d_model], Wdec: [d_model × hidden_dim]. TopK keeps only the top K activations, zeroing the rest. hidden_dim is much larger than d_model (16x to 64x expansion).
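This architecture can be sketched in a few lines of PyTorch (dimensions from the text; initialization and the training loop are omitted, and real implementations add details such as decoder-norm constraints):

```python
import torch

class TopKSAE(torch.nn.Module):
    """Minimal TopK sparse autoencoder sketch."""
    def __init__(self, d_model=768, d_hidden=12288, k=64):
        super().__init__()
        self.k = k
        self.W_enc = torch.nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = torch.nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        pre = x @ self.W_enc.T + self.b_enc
        # TopK: keep the k largest pre-activations per input, zero the rest
        vals, idx = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, idx, torch.relu(vals))

    def forward(self, x):
        f = self.encode(x)                     # sparse feature activations
        x_hat = f @ self.W_dec.T + self.b_dec  # reconstruction
        loss = ((x - x_hat) ** 2).mean()
        return x_hat, f, loss
```

Each column of `W_dec` is one feature’s direction in activation space: the object you plot, analyze, and steer with.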
TopK (Current Standard)
Preferred: Keep only the top K activations per input, set the rest to zero. Directly controls sparsity level. K might be 32–128 out of 12,288+ features. Clean, predictable.
L1 Penalty (Original)
Historic: Add λ||f||₁ to the loss. Encourages sparsity but requires tuning λ. Can cause “feature splitting” (one concept split across multiple features) or “feature absorption.” TopK avoids these issues.
Expansion factor: How much wider is the SAE than the model? Typical values: 16x to 64x. More expansion = more features found, but also more noise features and higher compute cost. 16x is a common starting point.
Which layer to decompose: You train separate SAEs for different layers. Early layers have more syntactic features. Later layers have more semantic/abstract features. The residual stream is the most common target.
Training data: Run the model on a large text corpus, cache activations at the target layer, train the SAE on those cached activations. Millions to billions of activation vectors.
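The caching step can be sketched with a PyTorch forward hook. A tiny stand-in model is used here so the snippet is self-contained; on a real model you would register the hook on the residual stream at the target layer:

```python
import torch

# Stand-in for a model layer whose output we want to cache
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

cache = []
def cache_hook(module, inputs, output):
    # Detach and move to CPU so cached activations don't hold the graph
    cache.append(output.detach().cpu())

handle = model[0].register_forward_hook(cache_hook)
for _ in range(3):                    # stand-in "corpus" of 3 batches
    model(torch.randn(4, 16))
handle.remove()

activations = torch.cat(cache)        # [12, 16] training set for the SAE
```

At real scale the cache goes to disk in shards rather than a Python list, since billions of 768-dimensional vectors don’t fit in memory.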
SAE Walkthrough: Step by Step
What actually happens when an SAE decomposes a model’s activations.
Start: Dense Model Activations
We extract the residual stream at a specific layer and position. This 768-dimensional vector is the model’s internal state. It’s dense (most values are nonzero) and polysemantic (each dimension blends multiple concepts). We can’t directly read what the model “knows” here.
Encode: Project to a Wider Space
The encoder is a linear layer that projects from 768 dimensions to 12,288 dimensions (a 16x expansion). This wider space has enough room for each feature to get its own dimension. The pre-activation values indicate how much each potential feature matches the input.
Sparsify: Most Features are Zero
TopK keeps only the 64 largest activations and zeros everything else. Now we have a sparse representation: only 64 out of 12,288 features are active (<0.5%). Each active feature ideally corresponds to one interpretable concept present in this input.
Interpret: Each Feature Has a Meaning
Now comes the payoff. Each active feature (ideally) corresponds to one interpretable concept. Feature #1,847 fires strongly (2.8) because the Eiffel Tower is present. Feature #4,203 fires because this is a location query. The activation strength tells you how strongly each concept is present. You’ve gone from “768 opaque numbers” to “64 named concepts with strengths.”
Decode: Verify the Decomposition
The decoder projects the sparse features back to 768 dimensions. If the reconstruction closely matches the original activations, the SAE has found a good decomposition. Critically: each column of the decoder matrix IS the feature’s direction in activation space. This is what you plot, analyze, and use for steering.
Evaluating SAEs: How Do You Know It Worked?
Finding features is easy. Finding the right features is hard.
Training an SAE will always give you features. The question is whether those features are real (corresponding to genuine concepts in the model) or artifacts (statistical patterns that don’t mean anything). There’s no single metric that answers this, so the field uses several.
Reconstruction Quality
Metric: How well can the decoder reconstruct the original activations from the sparse features? Measured by explained variance (R²) or mean squared error. Good SAEs explain >95% of variance. But: perfect reconstruction with garbage features is possible (just memorize the data), so this is necessary but not sufficient.
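Explained variance is simple to compute from cached activations and their reconstructions; a minimal sketch:

```python
import numpy as np

def explained_variance(x, x_hat):
    """Fraction of variance in the original activations that the
    reconstruction captures (1.0 = perfect; <= 0 = no better than
    predicting the mean activation)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total
```

Note the baseline: an SAE that outputs the mean activation scores 0, so “>95% explained variance” is measured against a trivial predictor, not against zero.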
Downstream Performance
Metric: Replace the model’s real activations with SAE-reconstructed activations. Does the model still perform well? If loss goes up a lot, the SAE is losing important information. This is a stronger test than reconstruction because it measures whether the SAE preserves the information the model actually uses.
Interpretability
Qualitative: Can humans understand the features? For each feature, look at the top-activating inputs and see if they share a coherent theme. Automated scoring: have an LLM describe the feature from its top activations, then test whether the description predicts activations on held-out data. This is how Neuronpedia generates feature descriptions.
Sparsity vs. Reconstruction Tradeoff
Metric: Plot reconstruction quality against sparsity (L0 = average number of active features). Better SAEs achieve the same reconstruction with fewer active features. This Pareto frontier is the main comparison metric: does SAE architecture A dominate architecture B?
Feature Stability
Open Problem: Train two SAEs with different random seeds. Do they find the same features? Currently: partially. Some features are consistent, others aren’t. This is a fundamental concern — if features depend on the random seed, are they “real” or just one valid decomposition among many?
Causal Tests (Steering)
Gold Standard: The strongest test: does amplifying or suppressing a feature change the model’s behavior in the expected way? If the “Golden Gate Bridge” feature is real, amplifying it should make the model talk about the bridge. This is the closest thing to ground truth we have for real models.
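Mechanically, steering is just adding a scaled direction to a layer’s output. The sketch below uses a stand-in linear layer (the hook point and scale are illustrative; real steering targets a transformer’s residual stream, and the direction would come from an SAE decoder column):

```python
import torch

def make_steering_hook(direction, scale=5.0):
    """Forward hook that adds `scale * direction` to a layer's output.
    In real use, `direction` is an SAE decoder column; here it's random."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        return output + scale * direction   # returned value replaces output
    return hook

# Toy demonstration on a stand-in layer
layer = torch.nn.Linear(8, 8)
feature_dir = torch.randn(8)
x = torch.randn(2, 8)

baseline = layer(x)
handle = layer.register_forward_hook(make_steering_hook(feature_dir, scale=3.0))
steered = layer(x)
handle.remove()
# steered - baseline equals 3.0 * (normalized feature_dir) at every position
```

In practice you would run generation with the hook attached and check whether the model’s outputs shift toward the feature’s concept, which is the causal test described above.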
This is one of the field’s deepest challenges: there’s no ground truth for what the “real features” are in a production model. We can verify in toy models (where we know the true features), but for GPT-4 or Claude, we’re always making inferences. Every interpretation should be treated as a hypothesis, not a conclusion.
Frontier: What’s Beyond SAEs
Where the cutting edge is going, as of 2025–2026.
Gated SAEs
Variant: Add a gating mechanism to the encoder: a separate linear layer decides which features to activate (gate), while the main encoder computes the activation magnitudes. Separating “should this feature fire?” from “how strongly?” gives better reconstruction at the same sparsity level.
Crosscoders
Extension: SAEs that read from and write to multiple layers (respecting causality). Enable tracking how features evolve through the network and comparing features between different models (“model diffing”). Key tool for Anthropic’s circuit tracing work.
Transcoders / CLTs
Alternative: Instead of decomposing activations, replace MLP layers entirely with sparse, interpretable alternatives. A cross-layer transcoder reads the residual stream and contributes to all subsequent MLP outputs. Makes circuit tracing dramatically cleaner.
BatchTopK
Training: Instead of choosing the top K features per example, choose the top K × batch_size activations across the entire batch. This allows adaptive sparsity per input: some inputs might have 20 active features, others 100. Better than a fixed per-example K for real-world distributions.
Meta-SAEs
Research: Train an SAE on the decoder columns of another SAE to discover structure among features. Can features be grouped? Do they form a hierarchy? Are there families of related features? Reveals how the model organizes its knowledge.
The “Are Features Real?” Debate
Open Question: Maybe the linear representation hypothesis is wrong. Maybe features aren’t directions but regions, manifolds, or something else entirely. Maybe there is no “right” decomposition. This remains the deepest open question. SAEs work well enough to be useful, but we can’t prove they’re finding the “true” structure.
Paper Guide
Section-by-section guides to the two essential papers for this unit.
Toy Models of Superposition is long but well-written. Focus on Sections 1–3 (setup, results, geometry) for the core understanding. Sections 4+ go deeper into specific phenomena and can be skimmed on a first read. Budget 3–4 hours for a careful first pass.
Section 1–2: Motivation & Setup
Why superposition matters, the experimental setup (tiny ReLU networks trained to reconstruct sparse inputs through a bottleneck), and the key variables (feature importance, feature sparsity, bottleneck width).
- The toy model architecture — understand it completely before moving on
- The distinction between feature importance and feature sparsity
- What “superposition” means formally in this context: features represented as non-axis-aligned directions
Section 3: Key Results
The main findings: phase transitions between no superposition and full superposition, how sparsity and importance interact, the emergence of geometric structures.
- Phase transitions: Superposition doesn’t ramp up smoothly. As you increase the number of features relative to dimensions, there are sharp transitions.
- Sparsity enables packing: The sparser a feature, the more the model is willing to store it in superposition (because interference rarely matters).
- Geometric structures: Features in superposition form optimal geometric configurations (pentagons, tetrahedra). The network discovers these automatically.
- The paper plots feature directions projected into 2D — remember this is a projection of higher-dimensional geometry
- The WᵀW matrix visualization: each cell shows the dot product between two feature directions. Diagonal = 1.0 (self), off-diagonal = interference.
Sections 4+: Deeper Phenomena
Feature splitting, superposition in different types of features (discrete, correlated), computation in superposition. Important for researchers, skippable for a first pass.
- Feature splitting: One concept might be split across multiple features. This becomes relevant when evaluating SAEs.
- Computation in superposition: Can a network compute on features that are in superposition, without first decompressing them? The paper suggests yes, which has deep implications.
Towards Monosemanticity is the proof-of-concept for SAEs. It demonstrates that SAEs actually work on a real (tiny) transformer. Focus on understanding the results and methodology. The mathematical details of the SAE architecture are less important than understanding what the features look like. Budget 2–3 hours.
Problem & Approach
Why neurons are polysemantic, the dictionary learning framing, and the SAE architecture.
- The framing: activations as a linear combination of sparse features, SAE as dictionary learning
- The 1-layer 512-neuron transformer model they use (small enough to fully analyze)
- The L1 sparsity penalty (this paper used L1, not TopK — TopK came later)
Results: What the Features Look Like
The actual features extracted. This is the exciting part — real examples of monosemantic features emerging from the SAE decomposition.
- The feature dashboards: top activating examples, logit effects, feature density
- How clearly monosemantic the features are compared to the polysemantic neurons
- The range of feature types: specific tokens, abstract concepts, syntactic patterns
- How features are validated: top activations + causal effects (ablation / amplification)
Evaluation & Limitations
How they evaluate the SAE, the tradeoffs, and honest limitations.
- The reconstruction vs. sparsity tradeoff (Pareto frontier)
- The “dead features” problem: some SAE features never activate. Wasted capacity.
- The model they use is tiny (1-layer) — does this scale? (Spoiler: Scaling Monosemanticity answers this)
- Honest about limitations: feature splitting, inconsistency across seeds, evaluation challenges
Required Reading & Resources
Everything you need for this unit, ranked by priority.
Toy Models of Superposition
The mathematical framework for understanding superposition. When and how features get packed into fewer dimensions. Phase transitions, geometric structures, the importance/sparsity tradeoff.
Towards Monosemanticity
First successful SAE decomposition of a real transformer. Proof of concept that SAEs can extract interpretable, monosemantic features from polysemantic neurons.
Scaling Monosemanticity
SAEs scaled to Claude 3 Sonnet. Millions of features including abstract, multilingual, multimodal, and safety-relevant concepts. Proves the approach works on production models. Read the blog post version first; the full paper is very long.
Neuronpedia — Browse Features Interactively
Don’t just read about features — explore them. Browse the feature dashboards, see what features look like in practice, explore top activations and logit effects. Load pretrained SAEs and search for features related to any concept.
ARENA — SAE Exercises
Hands-on Jupyter notebooks: train a toy SAE, load pretrained SAEs from Neuronpedia, explore features, analyze the sparsity/reconstruction tradeoff. The practical companion to the papers.
SAELens Documentation & Tutorials
The standard library for training and analyzing SAEs. Tutorials for training your own SAE, loading pretrained ones, and analyzing features. Works with any PyTorch model.
Nanda’s Recommended Superposition Papers
Curated list of the best papers on superposition and features, with Nanda’s personal commentary on what’s worth reading and why. Good for going deeper after the essentials.
Read Toy Models first (understand the problem), then Towards Monosemanticity (the solution), then Scaling Monosemanticity (proof it works at scale). Interleave with Neuronpedia browsing — seeing real features makes the papers click. Then do the ARENA exercises to get hands-on.
Exercises & Deliverables
Build intuition by doing, not just reading.
Train a Toy Model of Superposition
Replicate the core result from Toy Models of Superposition. Train a small ReLU network to reconstruct sparse inputs through a bottleneck.
- Create synthetic data: 20 sparse features with varying importance and sparsity
- Train a bottleneck model (20 → 5 → 20) to reconstruct the inputs
- Visualize the learned W matrix: are features axis-aligned (monosemantic) or in superposition?
- Vary sparsity: watch the phase transition from monosemantic to superposition
- Plot the WᵀW matrix — do you see the geometric structures (pentagons, tetrahedra)?
Train a Toy SAE
Train an SAE to decompose the superposed representations from Exercise 1. Verify it recovers the true features.
- Take the 5-dim bottleneck representations from your trained model
- Train an SAE: 5 → 20 → 5 (expand back to the true feature count)
- Compare SAE features to the true features — does it recover them?
- Try different sparsity penalties: L1 vs TopK. Which recovers features better?
- Measure: reconstruction quality, feature interpretability, and sparsity
Explore Real Features on Neuronpedia
Browse Neuronpedia to build intuition for what SAE features look like in practice.
- Pick a model (GPT-2 Small is a good start) and a layer
- Find 5 features you can clearly interpret — describe what each one does
- Find 2 features that seem hard to interpret — what makes them unclear?
- Search for features related to a specific concept (e.g., “Python code,” “emotions”)
- Look at the logit effects: which output tokens does each feature promote?
Load Pretrained SAEs with SAELens
Use SAELens to load pretrained SAEs and analyze features programmatically.
- Load a pretrained SAE for GPT-2 Small from Neuronpedia
- Run the model + SAE on a prompt of your choice
- Identify the top active features for a specific token position
- Visualize: which features activate across a full sentence? (feature activation heatmap)
- Compare features at different layers: early (syntactic) vs. late (semantic)
Notebook: “Superposition & SAE Features”
Combine the above into a single notebook demonstrating your understanding of superposition and your ability to work with SAE features.
- Toy model showing superposition phase transition (with visualization)
- Toy SAE recovering true features from superposed representations
- Analysis of 5+ real SAE features from a pretrained model
- At least one comparison: features at early vs. late layers, or different sparsity levels
- This notebook + your Transformer Internals notebook is a strong portfolio for the next step: Activation Patching & Circuits
When You’re Ready
If you can explain why superposition happens (high-dimensional geometry + sparsity), how SAEs decompose it (sparse overcomplete dictionary), and can work with real SAE features from Neuronpedia and SAELens — you have the tools. The next step is learning to use features for causal analysis: activation patching, circuit discovery, and tracing how features influence the model’s output. Head to the Roadmap for Step 3: Activation Patching & Circuits.