Field Guide

Mechanistic
Interpretability

A comprehensive map of the field — concepts, people, papers, tools, open problems, and your path through them. From first principles to cutting edge.

MIT Tech Review 2026 Breakthrough · Updated Feb 2026 · ~500 researchers
“A surprising fact about modern large language models is that nobody really knows how they work internally.”
Anthropic Research Team
01

What is this field?

Mechanistic interpretability reverse-engineers neural networks to understand how they compute, not just what they output. Think: decompiling a binary back to source code.

The Big Question

Neural networks work, but nobody designed their algorithms. They emerged from training. Mechanistic interpretability asks: what algorithms did they learn, and can we understand them?

leads to
Core Challenge: Superposition

Networks represent far more concepts than they have neurons. They pack thousands of “features” (concepts like “Golden Gate Bridge” or “code is buggy”) into a smaller number of dimensions using nearly-orthogonal directions in high-dimensional space. Individual neurons are polysemantic — they fire for multiple unrelated things. This makes the network opaque.
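A quick numerical sketch of why this packing is possible (a toy illustration of mine, not from the guide): random unit directions in a modest-dimensional space are already nearly orthogonal, so far more features than neurons can coexist with only small interference.

```python
import numpy as np

# Toy demonstration: pack many more "feature" directions than dimensions.
# Random unit vectors in 64 dims interfere only weakly with each other,
# which is what lets a network store many concepts in few neurons.
rng = np.random.default_rng(0)
d, n = 64, 512                      # 64 "neurons", 512 "features"
features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Worst-case interference between any two distinct feature directions.
cos = features @ features.T
np.fill_diagonal(cos, 0.0)
max_interference = np.abs(cos).max()
print(f"{n} features in {d} dims, max |cos| = {max_interference:.2f}")
```

The catch, as the section says, is the flip side: any single neuron now reads out a weighted mix of many features, which is exactly polysemanticity.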

so we need
The Approach: Find the Real Features
Sparse Autoencoders (SAEs)
Circuit Discovery
Attribution Graphs
Representation Steering
Probing / Linear Probes
to achieve
Goals
Safety: detect deception, misalignment
Control: steer model behavior precisely
Debugging: understand hallucinations, errors
Science: how does intelligence work?

The Breakthrough Moment

Status

MIT Technology Review named mech interp a 2026 Breakthrough Technology. The field went from niche to essential in ~2 years. Anthropic, Google DeepMind, and startups like Goodfire are all investing heavily. 140+ papers at the ICML 2024 workshop alone.

What Actually Works Now

Methods

SAEs can extract interpretable features from production models (Claude, Llama). Attribution graphs can trace how a model arrives at specific answers. Steering vectors can modify model behavior without retraining. But we still can’t give robust guarantees about what a model will or won’t do.

The Honest Gap

Challenge

We can identify features and trace some circuits, but the full picture remains out of reach. Features aren’t stable across training runs. Circuits for complex behaviors are too tangled to fully map. The field shifted from “prove a model is safe” to “get useful-but-imperfect understanding.”

02

Core Concepts

Concepts grouped by learning order, from foundational to frontier, with prerequisites and connections.

Foundational (Learn First)
Residual Stream
Attention Heads
MLP Layers
Embeddings
Logits & Unembedding
The Problem (Why It’s Hard)
Superposition
Polysemanticity
Features (the real units)
Linear Representation Hypothesis
Methods (How We Attack It)
Sparse Autoencoders
Activation Patching
Attribution Patching
Logit Lens
Linear Probes
Ablation Studies
Advanced / Frontier (2024–2026)
Crosscoders
Transcoders / CLTs
Attribution Graphs
Representation Engineering
Steering Vectors
Model Diffing
03

Timeline

Key papers and breakthroughs, in chronological order.

March 2020
Zoom In: An Introduction to Circuits
Chris Olah, Nick Cammarata, et al. at OpenAI publish the Circuits paper, establishing the vision of understanding neural networks as compositions of meaningful features connected by circuits. Foundational work that launches the field.
Olah, Cammarata, Schubert et al. — Distill
2021
A Mathematical Framework for Transformer Circuits
Elhage, Nanda, Olah et al. Introduces the residual stream view of transformers, induction heads, and virtual attention heads. The theoretical foundation for mech interp of transformers.
Elhage, Nanda, Olah et al. — Anthropic
Sep 2022
Toy Models of Superposition
Demonstrates mathematically that neural networks represent more features than they have dimensions, and explores when and how superposition occurs. Critical for understanding why individual neurons are uninterpretable.
Elhage, Hume, Olah et al. — Anthropic
Oct 2023
Towards Monosemanticity
First successful use of sparse autoencoders (dictionary learning) to extract interpretable, monosemantic features from a small 1-layer transformer. Proof of concept that SAEs can decompose superposition into understandable units.
Bricken, Templeton et al. — Anthropic
2023–2024
TransformerLens matures
Neel Nanda’s library becomes the standard tool for mech interp research. Loads 50+ model architectures, exposes all internal activations. The “microscope” everyone uses.
Nanda, Meyer et al.
May 2024
Scaling Monosemanticity / Mapping the Mind of Claude
Anthropic scales SAEs to Claude 3 Sonnet (~70B parameters). Extracts millions of interpretable features including abstract, multilingual, multimodal concepts. First deep look inside a production LLM. Finds safety-relevant features (deception, sycophancy, bias).
Templeton, Conerly, Batson et al. — Anthropic
May 2024
Golden Gate Claude
Anthropic demonstrates feature steering by amplifying the “Golden Gate Bridge” feature in Claude, making it obsessively relate everything to the bridge. A dramatic public demonstration that features are causally meaningful.
Anthropic
Jun 2024
Representation Engineering / RepE
Andy Zou, Dan Hendrycks, and collaborators present a top-down approach: rather than finding individual features, identify directions in representation space that correspond to high-level concepts (honesty, happiness, power-seeking) and use them for reading and controlling model behavior.
Zou, Phan, Hendrycks et al. — Center for AI Safety
Dec 2024
Sparse Crosscoders for Cross-Layer Features
Anthropic introduces crosscoders: SAEs that read and write across multiple layers. Enables tracking how features evolve through the network and comparing features between different models (“model diffing”).
Lindsey et al. — Anthropic
Dec 2024
Goodfire launches Ember API
First commercial interpretability API. Feature search, auto-steer, and dynamic prompting for Llama 3.3 70B. Backed by $50M Series A including Anthropic’s first external investment.
Goodfire
Jan 2025
Open Problems in Mechanistic Interpretability
Lee Sharkey (Apollo Research) et al. publish a comprehensive survey of unsolved problems: verification of interpretations, scalability, weight-level understanding, faithfulness of explanations. The field’s roadmap document.
Sharkey, Chughtai et al. — Apollo Research / Various
Mar 2025
Circuit Tracing / Attribution Graphs / Biology of an LLM
Anthropic’s biggest interpretability release. Introduces attribution graphs that trace how Claude 3.5 Haiku reasons. Discovers: the model plans poetry rhymes ahead, solves math differently than it claims, has universal multilingual features. Open-sourced the tools.
Lindsey, Olah et al. — Anthropic
2025
Cross-Architecture Model Diffing
First model diff between architecturally distinct models (Llama vs Qwen). Discovers ideological features: Chinese state narrative alignment in Qwen, American exceptionalism features in Llama. Features causally control censorship behavior.
Various researchers
2025
Neel Nanda’s Pragmatic Pivot
Google DeepMind’s mech interp team shifts from “ambitious reverse-engineering” to “pragmatic interpretability” — directly solving safety problems with imperfect-but-useful understanding. “Applied interpretability” becomes a new subfield.
Nanda et al. — Google DeepMind
Jan 2026
MIT Tech Review: 2026 Breakthrough Technology
Mechanistic interpretability named one of MIT Technology Review’s 10 Breakthrough Technologies for 2026. Field goes fully mainstream.
04

People & Organizations

Who’s doing this work and where.

Anthropic Interpretability Team

Lab

The largest and most influential mech interp team. Led by Chris Olah. Produced Scaling Monosemanticity, Circuit Tracing, Crosscoders, the transformer-circuits.pub research site. Has the most access to frontier models (Claude). Sets the research agenda for much of the field.

Google DeepMind MI Team

Lab

Led by Neel Nanda (age 26). Recently pivoted to “applied interpretability” — using mech interp tools directly for safety in production. Hiring research scientists and engineers. Created TransformerLens (now community-maintained). Key focus: pragmatic safety applications.

Goodfire

Startup

First mech interp startup. $50M+ raised, Anthropic invested. Built commercial API for feature discovery and steering on Llama models. “Paint with Ember” lets you paint using neural features. Proving interpretability has commercial value, not just research value.

Apollo Research

Safety org

AI safety research org. Lee Sharkey (co-author of “Open Problems in MI”) is based here. Focus on using interpretability for evaluating dangerous AI capabilities, particularly deception and scheming.

EleutherAI

Open source

Open-source AI research collective. Released the Pythia model suite (purpose-built for interp research with saved checkpoints). Created the “Attribute” library for attribution graphs. Strong focus on open, reproducible research.

Decode Research / Neuronpedia

Platform

Maintains Neuronpedia (the Wikipedia of neural features) and SAELens (SAE training library). Now open source. Hosts 4+ TB of feature activations, explanations, and metadata. Collaborated with Anthropic on open-sourcing circuit tracing.

MATS

Training

Premier mentorship program for alignment researchers. Many mech interp researchers came through MATS. If you want to do this professionally, MATS is one of the best entry points.

Chris Olah

Anthropic

Coined “mechanistic interpretability.” Founded the Circuits research program at OpenAI, then moved to Anthropic. Behind Distill.pub, the Circuits papers, and most of Anthropic’s landmark interp work. The intellectual godfather of the field.

Neel Nanda

Google DeepMind

Created TransformerLens. MI team lead at Google DeepMind at age 26. Mentored 50+ junior researchers. Best resource for getting started (quickstart guide, prerequisites guide, blog). Recently shifted toward “pragmatic interpretability.”

Trenton Bricken

Anthropic

Core author on Towards Monosemanticity and Scaling Monosemanticity. Central to the SAE / dictionary learning approach. Background in neuroscience + ML.

Nelson Elhage

Anthropic

Co-author of Mathematical Framework for Transformer Circuits, Toy Models of Superposition, and many other foundational papers. Created “Transformers for Software Engineers” — essential reading for your background.

Lee Sharkey

Apollo Research

Lead author of “Open Problems in MI” (Jan 2025) — the field’s roadmap. Working on using interp for evaluating AI deception. Strong on theory and identifying what the field actually needs to solve.

Joseph Bloom

Decode Research

Creator of SAELens (SAE training library) and SAEDashboard. Building the infrastructure the field runs on. Key contributor to making SAE research accessible and reproducible.

David Bau

Academic

Professor at Northeastern. Created baukit/nnsight. Pioneer in understanding individual neurons and editing model knowledge. ROME (Rank-One Model Editing) work. Approaches interp from a different angle than Anthropic.

Dan Hendrycks

CAIS

Director of Center for AI Safety. Representation Engineering / RepE work. Top-down approach to finding and controlling concept representations. Complements bottom-up circuit analysis.

Callum McDougall

Educator

Created ARENA (Alignment Research Engineer Accelerator) tutorials. The most recommended hands-on learning resource for mech interp. If you’re doing the exercises, you’re following his curriculum.

05

Tools & Infrastructure

The practical toolkit for doing interpretability research.

TransformerLens

Library

The primary tool. Load 50+ model architectures, access every internal activation (residual stream, attention patterns, MLP outputs), cache activations, hook into forward passes to edit/ablate/patch. v3 (alpha, Sep 2025) supports large models. Start here.

SAELens

Library

Train and analyze sparse autoencoders. Works with any PyTorch model (not just TransformerLens). Loads pretrained SAEs from Neuronpedia. The standard tool for feature extraction research. Previously part of TransformerLens, now standalone.

nnsight

Library

David Bau’s library. More performant than TransformerLens for large models. Wraps HuggingFace transformers directly. Better for production-scale work. Different API philosophy (context managers vs hooks). Use when TransformerLens is too slow.

Neuronpedia

Platform

Interactive platform for exploring SAE features. Browse 4+ TB of activations, explanations, metadata. Feature dashboards show top activations, logits, density. Hosts attribution graph explorer. Now open source. The “lab bench” of the field.

Circuit Tracer

Library

Anthropic’s open-sourced attribution graph library. Generate attribution graphs for open-weight models (Gemma-2-2B, Llama-3.2-1B). Trace how models arrive at specific outputs. Frontend hosted on Neuronpedia. Released May 2025.

ARENA Tutorials

Learning

Callum McDougall’s hands-on curriculum. Jupyter notebooks with exercises and solutions. Covers: building transformers from scratch, TransformerLens, SAEs, activation patching, circuits. The recommended starting point for learning by doing.

GPT-2 (Small / Medium)

Classic

The fruit fly of mech interp. Small enough to fully analyze, complex enough to have interesting behavior. Most published circuits research uses GPT-2. Great for learning. 124M / 355M parameters.

Pythia Suite (EleutherAI)

Research

Purpose-built for interp. Models from 70M to 12B, with 154 saved checkpoints during training. Lets you study how features emerge during training. Open weights, open data.

Gemma 2/3 (Google)

Current

Well-supported in TransformerLens. Gemma-2-2B is a sweet spot (small enough for laptop work, capable enough to be interesting). Good default for current research.

Qwen 3 (Alibaba)

Recommended 2025

Nanda’s current recommendation (Sep 2025). Dense models, reasoning + non-reasoning modes, good range of sizes. Increasingly the default for open-source LLM interp work.

Good news: You can do meaningful mech interp work on a laptop or free Google Colab. GPT-2 Small fits in <1GB VRAM. Gemma-2-2B needs ~5GB. Most ARENA exercises run on Colab free tier. Training your own SAEs on larger models needs more (A100 GPUs), but pretrained SAEs are available on Neuronpedia. Bottom line: No excuse not to start. Your MacBook can run real experiments.

06

Open Problems

Where the field is stuck. Where you could make a difference. Based on Sharkey et al. (Jan 2025) and the current research landscape.

SAEs Find Different Features Every Time

Fundamental

SAEs trained on the same model with different random seeds learn substantially different feature sets. This means the “true features” might not be well-defined, or our methods aren’t finding them. How do you build on features that aren’t stable?

Feature Composition vs. Atomic Features

Fundamental

L1 regularization in SAEs can drive them to learn common combinations rather than atomic features. We might be finding “person in a park” instead of “person” and “park” separately. The dictionary might not have the right granularity.

Verification: How Do You Know You’re Right?

Methodological

Most interp claims are treated as conclusions, but should be treated as hypotheses. There’s no standard way to prove an interpretation is correct. You can always tell a story about what a feature “means” — but is the story true?

Weights, Not Just Activations

Underexplored

Almost all current work studies activations (what fires when). Very little work studies the weights themselves (the learned parameters that compute those activations). Understanding weights would give deeper, more permanent understanding.

Scaling to Frontier Models

Engineering

Most research is on models up to ~7B parameters. Frontier models are 100B+. Do findings transfer? Do new phenomena emerge at scale? Circuit tracing gets overwhelmed by detail in larger models. The computational cost is enormous.

Reasoning Models Break the Paradigm

Frontier

Chain-of-thought / reasoning models solve problems over multiple steps. Current mech interp tools analyze single forward passes. Multi-step reasoning creates exponentially more circuits to trace. And the chain of thought often isn’t faithful to what the model actually computes.

From Features to Behavior

The Gap

We can find features. We can sometimes find circuits. But going from “these features exist” to “this is why the model did X” remains mostly manual, painstaking work. The gap between features and behavior prediction is the field’s core unsolved problem.

Detecting Deception

Safety-critical

The ultimate safety application: can we tell if a model is being deceptive? Anthropic found deception-related features in Scaling Monosemanticity. But no one has demonstrated reliable deception detection in practice. High impact, very hard.
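To give a flavor of what the probing half of this problem involves, here is a toy linear probe on synthetic “activations”. Everything below, including the planted `concept_dir`, is invented for illustration; real deception detection is far harder precisely because no clean planted direction is guaranteed to exist.

```python
import numpy as np

# Toy linear probe: logistic regression on synthetic "activations"
# where a hidden direction encodes a binary concept. This shows the
# mechanics of reading a concept off the residual stream, not a real
# deception detector.
rng = np.random.default_rng(0)
d, n = 32, 400
concept_dir = rng.standard_normal(d)
concept_dir /= np.linalg.norm(concept_dir)

labels = rng.integers(0, 2, n)                      # concept present or not
acts = rng.standard_normal((n, d))                  # background activity
acts += np.outer(3.0 * (2 * labels - 1), concept_dir)  # plant the concept

# Logistic regression by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * (p - labels).mean()

accuracy = (((acts @ w + b) > 0) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

High accuracy here only proves the direction is linearly readable in a toy setup; the open problem is whether any such direction for deception exists, generalizes, and can’t be gamed.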

07

Your Learning Roadmap

A path designed for a strong software engineer who learns by building. Each phase has concrete deliverables.

1

Transformer Internals

1–2 weeks

Understand how transformers actually work, mechanistically. Not the high-level “attention is all you need” version — the actual math of residual streams, attention heads, and MLPs.

  • Read: A Mathematical Framework for Transformer Circuits
  • Read: Nelson Elhage’s “Transformers for Software Engineers”
  • Do: Build a small transformer from scratch (ARENA Chapter 1)
  • Do: Load GPT-2 in TransformerLens, explore activations
  • Deliverable: Notebook showing you can hook into any internal activation of GPT-2 and visualize attention patterns

→ Open the full Transformer Internals curriculum
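The hook workflow in the deliverable can be mimicked without any library. This is a hypothetical miniature, not the real TransformerLens API (`TinyModel` and the hook names are invented), but the cache-then-edit pattern is the same one you will use:

```python
import numpy as np

# Minimal sketch of the "hook" pattern: a model that passes each
# intermediate activation through registered hooks, so you can cache
# or edit activations mid-forward.
class TinyModel:
    def __init__(self, rng):
        self.w1 = rng.standard_normal((4, 8))
        self.w2 = rng.standard_normal((8, 3))
        self.hooks = {}            # name -> fn(activation) -> activation

    def forward(self, x):
        h = np.maximum(x @ self.w1, 0.0)             # "MLP" activation
        h = self.hooks.get("mlp", lambda a: a)(h)    # hook point
        return h @ self.w2

rng = np.random.default_rng(0)
model = TinyModel(rng)
x = rng.standard_normal(4)

# 1. Cache: store the activation without changing it.
cache = {}
model.hooks["mlp"] = lambda a: cache.setdefault("mlp", a.copy())
baseline = model.forward(x)

# 2. Ablate: zero the cached activation's largest entry and rerun.
ablated_act = cache["mlp"].copy()
ablated_act[ablated_act.argmax()] = 0.0
model.hooks["mlp"] = lambda a: ablated_act
ablated = model.forward(x)
print("output moved by", np.linalg.norm(baseline - ablated))
```

TransformerLens exposes exactly this pattern at every internal activation of a real transformer, so the mental model transfers directly.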

2

Superposition & Features

1–2 weeks

Understand the core problem (superposition) and the core solution approach (sparse autoencoders).

  • Read: Toy Models of Superposition
  • Read: Towards Monosemanticity
  • Do: Train a toy SAE on synthetic data
  • Do: Load pretrained SAEs from Neuronpedia, explore features
  • Do: Browse features on neuronpedia.org interactively
  • Deliverable: Interactive visualization of superposition in a toy model

→ Open the full Superposition & Features curriculum
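The “train a toy SAE” step can be sketched end to end in numpy. This is an illustrative toy with hand-written gradients (the synthetic data, `W_e`/`W_d` naming, and hyperparameters are mine, not SAELens’s):

```python
import numpy as np

# Toy sparse autoencoder: data is built from 10 sparse ground-truth
# "features" in 16 dims; the overcomplete SAE (32 latents) learns to
# reconstruct it with a ReLU encoder, decoder, and an L1 sparsity term.
rng = np.random.default_rng(0)
d, m_true, m_sae, n = 16, 10, 32, 2000
true_feats = rng.standard_normal((m_true, d))
true_feats /= np.linalg.norm(true_feats, axis=1, keepdims=True)

# Each sample = sum of 2 randomly chosen features (sparse by design).
codes = np.zeros((n, m_true))
for i in range(n):
    codes[i, rng.choice(m_true, size=2, replace=False)] = 1.0
X = codes @ true_feats

W_e = 0.1 * rng.standard_normal((d, m_sae))
W_d = 0.1 * rng.standard_normal((m_sae, d))
lam, lr = 1e-3, 0.02

def step(X):
    global W_e, W_d
    pre = X @ W_e
    f = np.maximum(pre, 0.0)                 # sparse feature activations
    Xh = f @ W_d                             # reconstruction
    err = Xh - X
    dXh = 2.0 * err / len(X)                 # grad of mean squared error
    dW_d = f.T @ dXh
    df = dXh @ W_d.T + lam * np.sign(f) / len(X)   # recon + L1 grads
    dW_e = X.T @ (df * (pre > 0))            # ReLU mask
    W_e -= lr * dW_e
    W_d -= lr * dW_d
    return (err ** 2).mean()

loss0 = step(X)
for _ in range(3000):
    loss = step(X)
print(f"reconstruction MSE: {loss0:.4f} -> {loss:.4f}")
```

In practice you would use SAELens and a real optimizer, but seeing the loss (reconstruction plus L1) written out by hand makes the composition problem from the Open Problems section concrete: nothing here forces the learned dictionary to match `true_feats` atom for atom.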

3

Activation Patching & Circuits

1–2 weeks

Learn the experimental techniques: how to identify which parts of a model are responsible for specific behaviors.

  • Do: ARENA Chapter on activation patching
  • Reproduce: Induction head circuit in GPT-2
  • Learn: Logit lens, attention knockout, causal interventions
  • Explore: Attribution graphs on Neuronpedia
  • Deliverable: Notebook tracing a specific behavior in GPT-2 through its circuit
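The core patching experiment fits in a few lines once you strip away the transformer. This is a hand-made toy “model” (the architecture and names are invented for illustration), but the clean/corrupted/patched protocol is the real one:

```python
import numpy as np

# Activation patching on a toy model: run a clean and a corrupted
# input, then splice the clean run's hidden activation into the
# corrupted run. If the output recovers, that component carries
# the behavior.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 2))

def forward(x, patch=None):
    h = np.tanh(x @ W1)          # the component we intervene on
    if patch is not None:
        h = patch                # replace the activation mid-forward
    return h @ W2

clean = rng.standard_normal(8)
corrupted = clean + rng.standard_normal(8)   # perturbed "prompt"

h_clean = np.tanh(clean @ W1)
out_clean = forward(clean)
out_corr = forward(corrupted)
out_patched = forward(corrupted, patch=h_clean)

# Patching the hidden layer fully restores the clean output here,
# because everything downstream is deterministic in h.
recovery = np.linalg.norm(out_patched - out_clean)
print(f"dist(corrupted, clean) = {np.linalg.norm(out_corr - out_clean):.3f}")
print(f"dist(patched,   clean) = {recovery:.3f}")
```

In a real model you patch one head or one layer at a time and measure partial recovery of the clean logits; the components with the largest effect are your circuit candidates.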
4

Steering & Control

1–2 weeks

Once you can find features, learn to use them: steering model behavior by manipulating representations.

  • Read: Representation Engineering papers
  • Do: Compute steering vectors for a concept (e.g., honesty, formality)
  • Do: Apply steering at inference time, measure effects
  • Explore: Goodfire’s API for feature steering
  • Deliverable: Demo of steering a model’s personality/behavior using feature manipulation
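The difference-of-means recipe behind the second bullet can be sketched with synthetic activations (the planted `honesty_dir`, the data, and the steering scale of 4.0 are all invented for illustration; with a real model you would average cached residual-stream activations over contrastive prompts):

```python
import numpy as np

# Difference-of-means steering vector: average activations on "concept"
# vs "baseline" runs, take the difference, then add a scaled copy of
# that direction to a fresh activation at inference time.
rng = np.random.default_rng(0)
d = 64
honesty_dir = rng.standard_normal(d)
honesty_dir /= np.linalg.norm(honesty_dir)

# Synthetic activations: concept runs contain the direction, baselines don't.
concept_acts = rng.standard_normal((100, d)) + 2.0 * honesty_dir
baseline_acts = rng.standard_normal((100, d))

steer = concept_acts.mean(0) - baseline_acts.mean(0)   # steering vector

# Apply at "inference": shift an activation along the recovered direction.
act = rng.standard_normal(d)
steered = act + 4.0 * steer / np.linalg.norm(steer)

proj = lambda a: a @ honesty_dir
print(f"projection before: {proj(act):.2f}, after: {proj(steered):.2f}")
```

Note the averaging step: individual activations are noisy, but the noise cancels across runs while the concept direction survives, which is why the crude mean-difference works at all.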
5

Frontier Methods

2–3 weeks

The cutting edge: crosscoders, transcoders, attribution graphs, model diffing.

  • Read: Crosscoders paper, Circuit Tracing paper
  • Do: Run circuit tracer on Gemma-2-2B
  • Do: Compare features across two different models
  • Explore: CLTs (cross-layer transcoders)
  • Deliverable: Attribution graph for an interesting behavior + analysis
6

Your Own Research

Ongoing

By now you have the tools and intuitions. Pick an open problem that fascinates you and start poking at it.

  • Review the Open Problems section — what pulls you?
  • Start with 1-week mini-projects (fast feedback)
  • Write up findings, even negative results
  • Consider: MATS application, Anthropic Fellows, or independent research
  • Remember: The bar for entry is low. There aren’t enough people doing this. Your engineering skills are a genuine advantage.

Why Your Background Is an Advantage

Mech interp is closer to reverse engineering than traditional ML research. You’re decompiling a binary, tracing execution paths, finding bugs. The mindset of a software engineer who debugs complex systems is exactly right. Many breakthrough results came from people who think like engineers, not just mathematicians. The field explicitly needs more people who can build robust tools, run systematic experiments, and write clean infrastructure. That’s you.