“A surprising fact about modern large language models is that nobody really knows how they work internally.” (Anthropic Research Team)
What is this field?
Mechanistic interpretability reverse-engineers neural networks to understand how they compute, not just what they output. Think: decompiling a binary back to source code.
Neural networks work, but nobody designed their algorithms. They emerged from training. Mechanistic interpretability asks: what algorithms did they learn, and can we understand them?
Networks represent far more concepts than they have neurons. They pack thousands of “features” (concepts like “Golden Gate Bridge” or “code is buggy”) into a smaller number of dimensions using nearly-orthogonal directions in high-dimensional space. Individual neurons are polysemantic — they fire for multiple unrelated things. This makes the network opaque.
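The geometry behind superposition is easy to demonstrate in a few lines of numpy (a toy sketch; the dimensions and counts are arbitrary, not taken from any paper): random unit vectors in a high-dimensional space are nearly orthogonal, so far more feature directions than dimensions can coexist with low interference.

```python
import numpy as np

# Toy illustration of superposition's geometric basis: in high dimensions,
# random unit vectors are nearly orthogonal, so a d-dimensional space can
# hold many more than d almost-independent "feature directions".
rng = np.random.default_rng(0)
d, n_features = 512, 2000          # 2000 features packed into 512 dimensions

features = rng.standard_normal((n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities; "interference" is the largest off-diagonal value.
sims = features @ features.T
np.fill_diagonal(sims, 0.0)
max_interference = np.abs(sims).max()

print(f"{n_features} features in {d} dims, max interference {max_interference:.3f}")
```

Even with ~4x more features than dimensions, the worst-case interference stays small — which is exactly what makes polysemantic packing viable, and the network opaque.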
The Breakthrough Moment
Status: MIT Technology Review named mech interp a 2026 Breakthrough Technology. The field went from niche to essential in ~2 years. Anthropic, Google DeepMind, and startups like Goodfire are all investing heavily. 140+ papers at the ICML 2024 workshop alone.
What Actually Works Now
Methods: Sparse autoencoders (SAEs) can extract interpretable features from production models (Claude, Llama). Attribution graphs can trace how a model arrives at specific answers. Steering vectors can modify model behavior without retraining. But we still can’t give robust guarantees about what a model will or won’t do.
The Honest Gap
Challenge: We can identify features and trace some circuits, but the full picture remains out of reach. Features aren’t stable across training runs. Circuits for complex behaviors are too tangled to fully map. The field shifted from “prove a model is safe” to “get useful-but-imperfect understanding.”
Core Concepts
Timeline
Key papers and breakthroughs, in order.
People & Organizations
Who’s doing this work and where.
Anthropic Interpretability Team
Lab: The largest and most influential mech interp team. Led by Chris Olah. Produced Scaling Monosemanticity, Circuit Tracing, Crosscoders, and the transformer-circuits.pub research site. Has the most access to frontier models (Claude). Sets the research agenda for much of the field.
Google DeepMind MI Team
Lab: Led by Neel Nanda (age 26). Recently pivoted to “applied interpretability” — using mech interp tools directly for safety in production. Hiring research scientists and engineers. Created TransformerLens (now community-maintained). Key focus: pragmatic safety applications.
Goodfire
Startup: First mech interp startup. $50M+ raised; Anthropic invested. Built a commercial API for feature discovery and steering on Llama models. “Paint with Ember” lets you paint using neural features. Proving that interpretability has commercial value, not just research value.
Apollo Research
Safety org: AI safety research organization. Lee Sharkey (co-author of “Open Problems in MI”) is based here. Focuses on using interpretability to evaluate dangerous AI capabilities, particularly deception and scheming.
EleutherAI
Open source: Open-source AI research collective. Released the Pythia model suite (purpose-built for interp research, with saved checkpoints). Created the “Attribute” library for attribution graphs. Strong focus on open, reproducible research.
Decode Research / Neuronpedia
Platform: Maintains Neuronpedia (the Wikipedia of neural features) and SAELens (SAE training library). Now open source. Hosts 4+ TB of feature activations, explanations, and metadata. Collaborated with Anthropic on open-sourcing circuit tracing.
MATS
Training: Premier mentorship program for alignment researchers. Many mech interp researchers came through MATS. If you want to do this professionally, MATS is one of the best entry points.
Chris Olah
Anthropic: Coined “mechanistic interpretability.” Founded the Circuits research program at OpenAI, then moved to Anthropic. Behind Distill.pub, the Circuits papers, and most of Anthropic’s landmark interp work. The intellectual godfather of the field.
Neel Nanda
Google DeepMind: Created TransformerLens. MI team lead at Google DeepMind at age 26. Mentored 50+ junior researchers. Best resource for getting started (quickstart guide, prerequisites guide, blog). Recently shifted toward “pragmatic interpretability.”
Trenton Bricken
Anthropic: Core author on Towards Monosemanticity and Scaling Monosemanticity. Central to the SAE / dictionary learning approach. Background in neuroscience + ML.
Nelson Elhage
Anthropic: Co-author of A Mathematical Framework for Transformer Circuits, Toy Models of Superposition, and many other foundational papers. Created “Transformers for Software Engineers” — essential reading for your background.
Lee Sharkey
Apollo Research: Lead author of “Open Problems in MI” (Jan 2025) — the field’s roadmap. Working on using interp for evaluating AI deception. Strong on theory and on identifying what the field actually needs to solve.
Joseph Bloom
Decode Research: Creator of SAELens (SAE training library) and SAEDashboard. Building the infrastructure the field runs on. Key contributor to making SAE research accessible and reproducible.
David Bau
Academic: Professor at Northeastern. Created baukit/nnsight. Pioneer in understanding individual neurons and editing model knowledge; known for the ROME (Rank-One Model Editing) work. Approaches interp from a different angle than Anthropic.
Dan Hendrycks
CAIS: Director of the Center for AI Safety. Known for Representation Engineering (RepE): a top-down approach to finding and controlling concept representations. Complements bottom-up circuit analysis.
Callum McDougall
Educator: Created the ARENA (Alignment Research Engineer Accelerator) tutorials, the most recommended hands-on learning resource for mech interp. If you’re doing the exercises, you’re following his curriculum.
Tools & Infrastructure
The practical toolkit for doing interpretability research.
TransformerLens
Library: The primary tool. Load 50+ model architectures; access every internal activation (residual stream, attention patterns, MLP outputs); cache activations; hook into forward passes to edit, ablate, or patch. v3 (alpha, Sep 2025) supports large models. Start here.
SAELens
Library: Train and analyze sparse autoencoders. Works with any PyTorch model (not just TransformerLens). Loads pretrained SAEs from Neuronpedia. The standard tool for feature extraction research. Previously part of TransformerLens, now standalone.
nnsight
Library: David Bau’s library. More performant than TransformerLens for large models. Wraps HuggingFace transformers directly. Better for production-scale work. Different API philosophy (context managers vs. hooks). Use it when TransformerLens is too slow.
Neuronpedia
Platform: Explore SAE features interactively. Browse 4+ TB of activations, explanations, and metadata. Feature dashboards show top activations, logits, and density. Hosts the attribution graph explorer. Now open source. The “lab bench” of the field.
Circuit Tracer
Library: Anthropic’s open-sourced attribution graph library. Generate attribution graphs for open-weight models (Gemma-2-2B, Llama-3.2-1B). Trace how models arrive at specific outputs. Frontend hosted on Neuronpedia. Released May 2025.
ARENA Tutorials
Learning: Callum McDougall’s hands-on curriculum. Jupyter notebooks with exercises and solutions. Covers building transformers from scratch, TransformerLens, SAEs, activation patching, and circuits. The recommended starting point for learning by doing.
GPT-2 (Small / Medium)
Classic: The fruit fly of mech interp. Small enough to fully analyze, complex enough to have interesting behavior. Most published circuits research uses GPT-2. Great for learning. 124M / 355M parameters.
Pythia Suite (EleutherAI)
Research: Purpose-built for interp research. Models from 70M to 12B parameters, with 154 checkpoints saved during training. Lets you study how features emerge over the course of training. Open weights, open data.
Gemma 2/3 (Google)
Current: Well-supported in TransformerLens. Gemma-2-2B is a sweet spot (small enough for laptop work, capable enough to be interesting). Good default for current research.
Qwen 3 (Alibaba)
Recommended 2025: Nanda’s current recommendation (Sep 2025). Dense models, reasoning and non-reasoning modes, and a good range of sizes. Increasingly the default for open-source LLM interp work.
Good news: You can do meaningful mech interp work on a laptop or free Google Colab. GPT-2 Small fits in <1GB VRAM. Gemma-2-2B needs ~5GB. Most ARENA exercises run on Colab free tier. Training your own SAEs on larger models needs more (A100 GPUs), but pretrained SAEs are available on Neuronpedia. Bottom line: No excuse not to start. Your MacBook can run real experiments.
Open Problems
Where the field is stuck. Where you could make a difference. Based on Sharkey et al. (Jan 2025) and the current research landscape.
SAEs Find Different Features Every Time
Fundamental: SAEs trained on the same model with different random seeds learn substantially different feature sets. This means the “true features” might not be well-defined, or our methods aren’t finding them. How do you build on features that aren’t stable?
Feature Composition vs. Atomic Features
Fundamental: L1 regularization in SAEs can drive them to learn common combinations rather than atomic features. We might be finding “person in a park” instead of “person” and “park” separately. The dictionary might not have the right granularity.
Verification: How Do You Know You’re Right?
Methodological: Most interp claims are treated as conclusions but should be treated as hypotheses. There’s no standard way to prove an interpretation is correct. You can always tell a story about what a feature “means” — but is the story true?
Weights, Not Just Activations
Underexplored: Almost all current work studies activations (what fires when). Very little studies the weights themselves (the learned parameters that compute those activations). Understanding weights would give deeper, more permanent understanding.
Scaling to Frontier Models
Engineering: Most research is on models up to ~7B parameters. Frontier models are 100B+. Do findings transfer? Do new phenomena emerge at scale? Circuit tracing gets overwhelmed by detail in larger models, and the computational cost is enormous.
Reasoning Models Break the Paradigm
Frontier: Chain-of-thought reasoning models solve problems over multiple steps, but current mech interp tools analyze single forward passes. Multi-step reasoning creates exponentially more circuits to trace. And the chain of thought isn’t always faithful to what the model actually computes.
From Features to Behavior
The gap: We can find features. We can sometimes find circuits. But going from “these features exist” to “this is why the model did X” remains mostly manual, painstaking work. The gap between features and behavior prediction is the field’s core unsolved problem.
Detecting Deception
Safety-critical: The ultimate safety application is telling whether a model is being deceptive. Anthropic found deception-related features in Scaling Monosemanticity, but no one has demonstrated reliable deception detection in practice. High impact, very hard.
Your Learning Roadmap
A path designed for a strong software engineer who learns by building. Each phase has concrete deliverables.
Transformer Internals
Understand how transformers actually work, mechanistically. Not the high-level “attention is all you need” version — the actual math of residual streams, attention heads, and MLPs.
- Read: A Mathematical Framework for Transformer Circuits
- Read: Nelson Elhage’s “Transformers for Software Engineers”
- Do: Build a small transformer from scratch (ARENA Chapter 1)
- Do: Load GPT-2 in TransformerLens, explore activations
- Deliverable: Notebook showing you can hook into any internal activation of GPT-2 and visualize attention patterns
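The "actual math" this phase covers fits in one screen of numpy. Here is a minimal single attention head, written from scratch as a sketch (all sizes and the 0.1 weight scale are arbitrary choices, not any model's real values):

```python
import numpy as np

# A single causal attention head, from scratch -- the math ARENA Chapter 1
# has you implement. Toy dimensions; this is not TransformerLens code.
rng = np.random.default_rng(0)
seq, d_model, d_head = 6, 32, 8

x = rng.standard_normal((seq, d_model))           # residual stream (one token per row)
W_Q = rng.standard_normal((d_model, d_head)) * 0.1
W_K = rng.standard_normal((d_model, d_head)) * 0.1
W_V = rng.standard_normal((d_model, d_head)) * 0.1
W_O = rng.standard_normal((d_head, d_model)) * 0.1

q, k, v = x @ W_Q, x @ W_K, x @ W_V
scores = q @ k.T / np.sqrt(d_head)                # scaled dot-product scores
scores += np.triu(np.full((seq, seq), -np.inf), k=1)   # causal mask: no peeking ahead

# Softmax over keys: each row of the attention pattern sums to 1.
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

head_out = pattern @ v @ W_O      # head output, written back into the residual stream
print(head_out.shape)             # (6, 32): same shape as the residual stream
```

Note the key mechanistic point: the head reads from and writes to the same residual stream, which is why `head_out` has shape `(seq, d_model)` rather than `(seq, d_head)`.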
Superposition & Features
Understand the core problem (superposition) and the core solution approach (sparse autoencoders).
- Read: Toy Models of Superposition
- Read: Towards Monosemanticity
- Do: Train a toy SAE on synthetic data
- Do: Load pretrained SAEs from Neuronpedia, explore features
- Do: Browse features on neuronpedia.org interactively
- Deliverable: Interactive visualization of superposition in a toy model
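Before reaching for SAELens, it helps to see how little machinery an SAE actually is. This is a forward pass and loss only (no training loop), with made-up sizes and an assumed `l1_coeff` value, loosely following the architecture described in Towards Monosemanticity:

```python
import numpy as np

# Toy sparse autoencoder forward pass. An overcomplete ReLU dictionary plus
# an L1 penalty is the whole core idea; real training uses SAELens.
rng = np.random.default_rng(0)
d_model, d_sae, batch = 16, 64, 32     # overcomplete: 64 dictionary features for 16 dims
l1_coeff = 1e-3                        # assumed sparsity penalty weight (a hyperparameter)

acts = rng.standard_normal((batch, d_model))        # stand-in for cached model activations
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

feats = np.maximum(0.0, acts @ W_enc + b_enc)       # ReLU encoder -> nonnegative features
recon = feats @ W_dec + b_dec                       # linear decoder -> reconstruction

mse = ((recon - acts) ** 2).mean()                  # reconstruction term
l1 = np.abs(feats).sum(axis=-1).mean()              # sparsity term (drives most feats to 0)
loss = mse + l1_coeff * l1
print(f"mse={mse:.3f}  L1={l1:.3f}  loss={loss:.3f}")
```

Training minimizes `loss` by gradient descent; the L1 term is exactly the regularizer that the "Feature Composition vs. Atomic Features" open problem worries about.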
Activation Patching & Circuits
Learn the experimental techniques: how to identify which parts of a model are responsible for specific behaviors.
- Do: ARENA Chapter on activation patching
- Reproduce: Induction head circuit in GPT-2
- Learn: Logit lens, attention knockout, causal interventions
- Explore: Attribution graphs on Neuronpedia
- Deliverable: Notebook tracing a specific behavior in GPT-2 through its circuit
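The logic of activation patching can be shown on a hand-built two-layer "model" before you try it on GPT-2. Everything here (the model, the clean/corrupt inputs) is an illustrative toy, not a real circuit:

```python
import numpy as np

# Activation patching in miniature: run the model on a corrupted input, but
# overwrite one intermediate activation with its value from the clean run.
# If the output is restored, that site carries the behavior.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))

def forward(x, patch_hidden=None):
    hidden = np.tanh(x @ W1)          # the site we can intervene on
    if patch_hidden is not None:
        hidden = patch_hidden         # the patch: substitute a cached activation
    return hidden @ W2

x_clean = np.ones(4)
x_corrupt = -np.ones(4)

clean_hidden = np.tanh(x_clean @ W1)  # cache the activation from the clean run
out_clean = forward(x_clean)
out_corrupt = forward(x_corrupt)
out_patched = forward(x_corrupt, patch_hidden=clean_hidden)

# Here one site does all the work, so patching it fully restores the clean output.
print(np.allclose(out_patched, out_clean))   # True
```

In TransformerLens the same move is done with hooks on a specific layer and position; the science is in choosing which sites to patch and how to score the restoration.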
Steering & Control
Once you can find features, learn to use them: steering model behavior by manipulating representations.
- Read: Representation Engineering papers
- Do: Compute steering vectors for a concept (e.g., honesty, formality)
- Do: Apply steering at inference time, measure effects
- Explore: Goodfire’s API for feature steering
- Deliverable: Demo of steering a model’s personality/behavior using feature manipulation
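The difference-of-means recipe from this phase can be sketched in a toy activation space. The "positive"/"negative" clusters below are synthetic stand-ins for activations cached from contrastive prompts (e.g. honest vs. dishonest completions); the recipe is the standard one, the data is invented:

```python
import numpy as np

# Difference-of-means steering vector on synthetic "activations".
rng = np.random.default_rng(0)
d = 32
concept_dir = rng.standard_normal(d)              # hidden ground-truth concept direction
concept_dir /= np.linalg.norm(concept_dir)

# One class of activations is shifted along the concept direction.
pos = rng.standard_normal((100, d)) + 3.0 * concept_dir
neg = rng.standard_normal((100, d))

steer = pos.mean(axis=0) - neg.mean(axis=0)       # the steering vector
steer /= np.linalg.norm(steer)

# "Inference-time" steering: add the (scaled) vector to a fresh activation.
act = rng.standard_normal(d)
steered = act + 5.0 * steer

before = act @ concept_dir
after = steered @ concept_dir
print(f"projection onto concept: before={before:.2f}, after={after:.2f}")
```

The recovered `steer` lines up closely with `concept_dir`, so steering reliably increases the activation's projection onto the concept. On a real model, the scale factor (5.0 here) is the knob you tune against side effects on fluency.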
Frontier Methods
The cutting edge: crosscoders, transcoders, attribution graphs, model diffing.
- Read: Crosscoders paper, Circuit Tracing paper
- Do: Run circuit tracer on Gemma-2-2B
- Do: Compare features across two different models
- Explore: CLTs (cross-layer transcoders)
- Deliverable: Attribution graph for an interesting behavior + analysis
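The core bookkeeping behind an attribution graph can be seen in miniature: for a linear readout, the output decomposes exactly into one additive contribution per feature, which is the kind of edge weight a graph records. This is a toy sketch of the idea, not Anthropic's actual method:

```python
import numpy as np

# Per-feature attribution for a linear readout: logit = sum_i w_i * f_i,
# so each feature's contribution is just w_i * f_i.
rng = np.random.default_rng(0)
n_feats = 8
feats = np.maximum(0.0, rng.standard_normal(n_feats))   # sparse feature activations
w_out = rng.standard_normal(n_feats)                    # readout (unembedding) direction

logit = feats @ w_out
contrib = feats * w_out               # one additive term per feature

assert np.isclose(contrib.sum(), logit)   # the decomposition is exact
top = np.argsort(-np.abs(contrib))[:3]
print("top contributing features:", top, contrib[top])
```

Real attribution graphs chain decompositions like this through many layers of features, which is where the "overwhelmed by detail" scaling problem comes from.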
Your Own Research
By now you have the tools and intuitions. Pick an open problem that fascinates you and start poking at it.
- Review the Open Problems section — what pulls you?
- Start with 1-week mini-projects (fast feedback)
- Write up findings, even negative results
- Consider: MATS application, Anthropic Fellows, or independent research
- Remember: The bar for entry is low. There aren’t enough people doing this. Your engineering skills are a genuine advantage.
Why Your Background Is an Advantage
Mech interp is closer to reverse engineering than traditional ML research. You’re decompiling a binary, tracing execution paths, finding bugs. The mindset of a software engineer who debugs complex systems is exactly right. Many breakthrough results came from people who think like engineers, not just mathematicians. The field explicitly needs more people who can build robust tools, run systematic experiments, and write clean infrastructure. That’s you.