Field Guide

Mechanistic
Interpretability

A comprehensive map of the field — concepts, people, papers, tools, open problems, and your path through them. From first principles to cutting edge.

MIT Tech Review 2026 Breakthrough · Updated Feb 2026 · ~500 researchers
“A surprising fact about modern large language models is that nobody really knows how they work internally.”
Anthropic Research Team
01

What is this field?

Mechanistic interpretability reverse-engineers neural networks to understand how they compute, not just what they output. Think: decompiling a binary back to source code.

The Big Question

Neural networks work, but nobody designed their algorithms. They emerged from training. Mechanistic interpretability asks: what algorithms did they learn, and can we understand them?

leads to
Core Challenge: Superposition

Networks represent far more concepts than they have neurons. They pack thousands of “features” (concepts like “Golden Gate Bridge” or “code is buggy”) into a smaller number of dimensions using nearly-orthogonal directions in high-dimensional space. Individual neurons are polysemantic — they fire for multiple unrelated things. This makes the network opaque.
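A quick numerical sketch of why this packing is possible (a toy illustration of mine, not from the guide): random unit directions in a modest-dimensional space are already nearly orthogonal, so far more features than neurons can coexist with only small interference.

```python
import numpy as np

# Toy demonstration: pack many more "feature" directions than dimensions.
# Random unit vectors in 64 dims interfere only weakly with each other,
# which is what lets a network store many concepts in few neurons.
rng = np.random.default_rng(0)
d, n = 64, 512                      # 64 "neurons", 512 "features"
features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Worst-case interference between any two distinct feature directions.
cos = features @ features.T
np.fill_diagonal(cos, 0.0)
max_interference = np.abs(cos).max()
print(f"{n} features in {d} dims, max |cos| = {max_interference:.2f}")
```

The catch, as the section says, is the flip side: any single neuron now reads out a weighted mix of many features, which is exactly polysemanticity.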

so we need
The Approach: Find the Real Features
Sparse Autoencoders (SAEs)
Circuit Discovery
Attribution Graphs
Representation Steering
Probing / Linear Probes
to achieve
Goals
Safety: detect deception, misalignment
Control: steer model behavior precisely
Debugging: understand hallucinations, errors
Science: how does intelligence work?

The Breakthrough Moment

Status

MIT Technology Review named mech interp a 2026 Breakthrough Technology. The field went from niche to essential in ~2 years. Anthropic, Google DeepMind, and startups like Goodfire are all investing heavily. 140+ papers at the ICML 2024 workshop alone.

What Actually Works Now

Methods

SAEs can extract interpretable features from production models (Claude, Llama). Attribution graphs can trace how a model arrives at specific answers. Steering vectors can modify model behavior without retraining. But we still can’t give robust guarantees about what a model will or won’t do.

The Honest Gap

Challenge

We can identify features and trace some circuits, but the full picture remains out of reach. Features aren’t stable across training runs. Circuits for complex behaviors are too tangled to fully map. The field shifted from “prove a model is safe” to “get useful-but-imperfect understanding.”

02

Core Concepts

Concepts grouped by learning order, from foundational to frontier, with prerequisites and connections.

Foundational (Learn First)
Residual Stream
Attention Heads
MLP Layers
Embeddings
Logits & Unembedding
The Problem (Why It’s Hard)
Superposition
Polysemanticity
Features (the real units)
Linear Representation Hypothesis
Methods (How We Attack It)
Sparse Autoencoders
Activation Patching
Attribution Patching
Logit Lens
Linear Probes
Ablation Studies
Advanced / Frontier (2024–2026)
Crosscoders
Transcoders / CLTs
Attribution Graphs
Representation Engineering
Steering Vectors
Model Diffing
03

Timeline

Key papers and breakthroughs, in chronological order.

March 2020
Zoom In: An Introduction to Circuits
Chris Olah, Nick Cammarata, et al. at OpenAI publish the Circuits paper, establishing the vision of understanding neural networks as compositions of meaningful features connected by circuits. Foundational work that launches the field.
Olah, Cammarata, Schubert et al. — Distill
2021
A Mathematical Framework for Transformer Circuits
Elhage, Nanda, Olah et al. Introduces the residual stream view of transformers, induction heads, and virtual attention heads. The theoretical foundation for mech interp of transformers.
Elhage, Nanda, Olah et al. — Anthropic
Sep 2022
Toy Models of Superposition
Demonstrates mathematically that neural networks represent more features than they have dimensions, and explores when and how superposition occurs. Critical for understanding why individual neurons are uninterpretable.
Elhage, Hume, Olah et al. — Anthropic
Oct 2023
Towards Monosemanticity
First successful use of sparse autoencoders (dictionary learning) to extract interpretable, monosemantic features from a small 1-layer transformer. Proof of concept that SAEs can decompose superposition into understandable units.
Bricken, Templeton et al. — Anthropic
2023–2024
TransformerLens matures
Neel Nanda’s library becomes the standard tool for mech interp research. Loads 50+ model architectures, exposes all internal activations. The “microscope” everyone uses.
Nanda, Meyer et al.
May 2024
Scaling Monosemanticity / Mapping the Mind of Claude
Anthropic scales SAEs to Claude 3 Sonnet (~70B parameters). Extracts millions of interpretable features including abstract, multilingual, multimodal concepts. First deep look inside a production LLM. Finds safety-relevant features (deception, sycophancy, bias).
Templeton, Conerly, Batson et al. — Anthropic
May 2024
Golden Gate Claude
Anthropic demonstrates feature steering by amplifying the “Golden Gate Bridge” feature in Claude, making it obsessively relate everything to the bridge. A dramatic public demonstration that features are causally meaningful.
Anthropic
Jun 2024
Representation Engineering / RepE
Andy Zou, Dan Hendrycks, and collaborators present a top-down approach: rather than finding individual features, identify directions in representation space that correspond to high-level concepts (honesty, happiness, power-seeking) and use them for reading and controlling model behavior.
Zou, Phan, Hendrycks et al. — Center for AI Safety
Dec 2024
Sparse Crosscoders for Cross-Layer Features
Anthropic introduces crosscoders: SAEs that read and write across multiple layers. Enables tracking how features evolve through the network and comparing features between different models (“model diffing”).
Lindsey et al. — Anthropic
Dec 2024
Goodfire launches Ember API
First commercial interpretability API. Feature search, auto-steer, and dynamic prompting for Llama 3.3 70B. Backed by $50M Series A including Anthropic’s first external investment.
Goodfire
Jan 2025
Open Problems in Mechanistic Interpretability
Lee Sharkey (Apollo Research) et al. publish a comprehensive survey of unsolved problems: verification of interpretations, scalability, weight-level understanding, faithfulness of explanations. The field’s roadmap document.
Sharkey, Chughtai et al. — Apollo Research / Various
Mar 2025
Circuit Tracing / Attribution Graphs / Biology of an LLM
Anthropic’s biggest interpretability release. Introduces attribution graphs that trace how Claude 3.5 Haiku reasons. Discovers: the model plans poetry rhymes ahead, solves math differently than it claims, has universal multilingual features. Open-sourced the tools.
Lindsey, Olah et al. — Anthropic
2025
Cross-Architecture Model Diffing
First model diff between architecturally distinct models (Llama vs Qwen). Discovers ideological features: Chinese state narrative alignment in Qwen, American exceptionalism features in Llama. Features causally control censorship behavior.
Various researchers
2025
Neel Nanda’s Pragmatic Pivot
Google DeepMind’s mech interp team shifts from “ambitious reverse-engineering” to “pragmatic interpretability” — directly solving safety problems with imperfect-but-useful understanding. “Applied interpretability” becomes a new subfield.
Nanda et al. — Google DeepMind
Jan 2026
MIT Tech Review: 2026 Breakthrough Technology
Mechanistic interpretability named one of MIT Technology Review’s 10 Breakthrough Technologies for 2026. Field goes fully mainstream.
04

People & Organizations

Who’s doing this work and where.

Anthropic Interpretability Team

Lab

The largest and most influential mech interp team. Led by Chris Olah. Produced Scaling Monosemanticity, Circuit Tracing, Crosscoders, the transformer-circuits.pub research site. Has the most access to frontier models (Claude). Sets the research agenda for much of the field.

Google DeepMind MI Team

Lab

Led by Neel Nanda (age 26). Recently pivoted to “applied interpretability” — using mech interp tools directly for safety in production. Hiring research scientists and engineers. Created TransformerLens (now community-maintained). Key focus: pragmatic safety applications.

Goodfire

Startup

First mech interp startup. $50M+ raised, Anthropic invested. Built commercial API for feature discovery and steering on Llama models. “Paint with Ember” lets you paint using neural features. Proving interpretability has commercial value, not just research value.

Apollo Research

Safety org

AI safety research org. Lee Sharkey (co-author of “Open Problems in MI”) is based here. Focus on using interpretability for evaluating dangerous AI capabilities, particularly deception and scheming.

EleutherAI

Open source

Open-source AI research collective. Released the Pythia model suite (purpose-built for interp research with saved checkpoints). Created the “Attribute” library for attribution graphs. Strong focus on open, reproducible research.

Decode Research / Neuronpedia

Platform

Maintains Neuronpedia (the Wikipedia of neural features) and SAELens (SAE training library). Now open source. Hosts 4+ TB of feature activations, explanations, and metadata. Collaborated with Anthropic on open-sourcing circuit tracing.

MATS

Training

Premier mentorship program for alignment researchers. Many mech interp researchers came through MATS. If you want to do this professionally, MATS is one of the best entry points.

Chris Olah

Anthropic

Coined “mechanistic interpretability.” Founded the Circuits research program at OpenAI, then moved to Anthropic. Behind Distill.pub, the Circuits papers, and most of Anthropic’s landmark interp work. The intellectual godfather of the field.

Neel Nanda

Google DeepMind

Created TransformerLens. MI team lead at Google DeepMind at age 26. Mentored 50+ junior researchers. Best resource for getting started (quickstart guide, prerequisites guide, blog). Recently shifted toward “pragmatic interpretability.”

Trenton Bricken

Anthropic

Core author on Towards Monosemanticity and Scaling Monosemanticity. Central to the SAE / dictionary learning approach. Background in neuroscience + ML.

Nelson Elhage

Anthropic

Co-author of Mathematical Framework for Transformer Circuits, Toy Models of Superposition, and many other foundational papers. Created “Transformers for Software Engineers” — essential reading for your background.

Lee Sharkey

Apollo Research

Lead author of “Open Problems in MI” (Jan 2025) — the field’s roadmap. Working on using interp for evaluating AI deception. Strong on theory and identifying what the field actually needs to solve.

Joseph Bloom

Decode Research

Creator of SAELens (SAE training library) and SAEDashboard. Building the infrastructure the field runs on. Key contributor to making SAE research accessible and reproducible.

David Bau

Academic

Professor at Northeastern. Created baukit/nnsight. Pioneer in understanding individual neurons and editing model knowledge. ROME (Rank-One Model Editing) work. Approaches interp from a different angle than Anthropic.

Dan Hendrycks

CAIS

Director of Center for AI Safety. Representation Engineering / RepE work. Top-down approach to finding and controlling concept representations. Complements bottom-up circuit analysis.

Callum McDougall

Educator

Created ARENA (Alignment Research Engineer Accelerator) tutorials. The most recommended hands-on learning resource for mech interp. If you’re doing the exercises, you’re following his curriculum.

05

Tools & Infrastructure

The practical toolkit for doing interpretability research.

TransformerLens

Library

The primary tool. Load 50+ model architectures, access every internal activation (residual stream, attention patterns, MLP outputs), cache activations, hook into forward passes to edit/ablate/patch. v3 (alpha, Sep 2025) supports large models. Start here.

SAELens

Library

Train and analyze sparse autoencoders. Works with any PyTorch model (not just TransformerLens). Loads pretrained SAEs from Neuronpedia. The standard tool for feature extraction research. Previously part of TransformerLens, now standalone.

nnsight

Library

David Bau’s library. More performant than TransformerLens for large models. Wraps HuggingFace transformers directly. Better for production-scale work. Different API philosophy (context managers vs hooks). Use when TransformerLens is too slow.

Neuronpedia

Platform

Interactive platform for exploring SAE features. Browse 4+ TB of activations, explanations, metadata. Feature dashboards show top activations, logits, density. Hosts attribution graph explorer. Now open source. The “lab bench” of the field.

Circuit Tracer

Library

Anthropic’s open-sourced attribution graph library. Generate attribution graphs for open-weight models (Gemma-2-2B, Llama-3.2-1B). Trace how models arrive at specific outputs. Frontend hosted on Neuronpedia. Released May 2025.

ARENA Tutorials

Learning

Callum McDougall’s hands-on curriculum. Jupyter notebooks with exercises and solutions. Covers: building transformers from scratch, TransformerLens, SAEs, activation patching, circuits. The recommended starting point for learning by doing.

GPT-2 (Small / Medium)

Classic

The fruit fly of mech interp. Small enough to fully analyze, complex enough to have interesting behavior. Most published circuits research uses GPT-2. Great for learning. 124M / 355M parameters.

Pythia Suite (EleutherAI)

Research

Purpose-built for interp. Models from 70M to 12B, with 154 saved checkpoints during training. Lets you study how features emerge during training. Open weights, open data.

Gemma 2/3 (Google)

Current

Well-supported in TransformerLens. Gemma-2-2B is a sweet spot (small enough for laptop work, capable enough to be interesting). Good default for current research.

Qwen 3 (Alibaba)

Recommended 2025

Nanda’s current recommendation (Sep 2025). Dense models, reasoning + non-reasoning modes, good range of sizes. Increasingly the default for open-source LLM interp work.

Good news: You can do meaningful mech interp work on a laptop or free Google Colab. GPT-2 Small fits in <1GB VRAM. Gemma-2-2B needs ~5GB. Most ARENA exercises run on Colab free tier. Training your own SAEs on larger models needs more (A100 GPUs), but pretrained SAEs are available on Neuronpedia. Bottom line: No excuse not to start. Your MacBook can run real experiments.

06

Open Problems

Where the field is stuck. Where you could make a difference. Based on Sharkey et al. (Jan 2025) and the current research landscape.

SAEs Find Different Features Every Time

Fundamental

SAEs trained on the same model with different random seeds learn substantially different feature sets. This means the “true features” might not be well-defined, or our methods aren’t finding them. How do you build on features that aren’t stable?

Feature Composition vs. Atomic Features

Fundamental

L1 regularization in SAEs can drive them to learn common combinations rather than atomic features. We might be finding “person in a park” instead of “person” and “park” separately. The dictionary might not have the right granularity.

Verification: How Do You Know You’re Right?

Methodological

Most interp claims are treated as conclusions, but should be treated as hypotheses. There’s no standard way to prove an interpretation is correct. You can always tell a story about what a feature “means” — but is the story true?

Weights, Not Just Activations

Underexplored

Almost all current work studies activations (what fires when). Very little work studies the weights themselves (the learned parameters that compute those activations). Understanding weights would give deeper, more permanent understanding.

Scaling to Frontier Models

Engineering

Most research is on models up to ~7B parameters. Frontier models are 100B+. Do findings transfer? Do new phenomena emerge at scale? Circuit tracing gets overwhelmed by detail in larger models. The computational cost is enormous.

Reasoning Models Break the Paradigm

Frontier

Chain-of-thought / reasoning models solve problems over multiple steps. Current mech interp tools analyze single forward passes. Multi-step reasoning creates exponentially more circuits to trace. And the chain of thought often isn’t faithful to what the model actually computes.

From Features to Behavior

The Gap

We can find features. We can sometimes find circuits. But going from “these features exist” to “this is why the model did X” remains mostly manual, painstaking work. The gap between features and behavior prediction is the field’s core unsolved problem.

Detecting Deception

Safety-critical

The ultimate safety application: can we tell if a model is being deceptive? Anthropic found deception-related features in Scaling Monosemanticity. But no one has demonstrated reliable deception detection in practice. High impact, very hard.
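To give a flavor of what the probing half of this problem involves, here is a toy linear probe on synthetic “activations”. Everything below, including the planted `concept_dir`, is invented for illustration; real deception detection is far harder precisely because no clean planted direction is guaranteed to exist.

```python
import numpy as np

# Toy linear probe: logistic regression on synthetic "activations"
# where a hidden direction encodes a binary concept. This shows the
# mechanics of reading a concept off the residual stream, not a real
# deception detector.
rng = np.random.default_rng(0)
d, n = 32, 400
concept_dir = rng.standard_normal(d)
concept_dir /= np.linalg.norm(concept_dir)

labels = rng.integers(0, 2, n)                      # concept present or not
acts = rng.standard_normal((n, d))                  # background activity
acts += np.outer(3.0 * (2 * labels - 1), concept_dir)  # plant the concept

# Logistic regression by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * (p - labels).mean()

accuracy = (((acts @ w + b) > 0) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

High accuracy here only proves the direction is linearly readable in a toy setup; the open problem is whether any such direction for deception exists, generalizes, and can’t be gamed.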

07

Your Learning Roadmap

A path designed for a strong software engineer who learns by building. Each phase has concrete deliverables.

1

Transformer Internals

1–2 weeks

Understand how transformers actually work, mechanistically. Not the high-level “attention is all you need” version — the actual math of residual streams, attention heads, and MLPs.

  • Read: A Mathematical Framework for Transformer Circuits
  • Read: Nelson Elhage’s “Transformers for Software Engineers”
  • Do: Build a small transformer from scratch (ARENA Chapter 1)
  • Do: Load GPT-2 in TransformerLens, explore activations
  • Deliverable: Notebook showing you can hook into any internal activation of GPT-2 and visualize attention patterns

→ Open the full Transformer Internals curriculum
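The hook workflow in the deliverable can be mimicked without any library. This is a hypothetical miniature, not the real TransformerLens API (`TinyModel` and the hook names are invented), but the cache-then-edit pattern is the same one you will use:

```python
import numpy as np

# Minimal sketch of the "hook" pattern: a model that passes each
# intermediate activation through registered hooks, so you can cache
# or edit activations mid-forward.
class TinyModel:
    def __init__(self, rng):
        self.w1 = rng.standard_normal((4, 8))
        self.w2 = rng.standard_normal((8, 3))
        self.hooks = {}            # name -> fn(activation) -> activation

    def forward(self, x):
        h = np.maximum(x @ self.w1, 0.0)             # "MLP" activation
        h = self.hooks.get("mlp", lambda a: a)(h)    # hook point
        return h @ self.w2

rng = np.random.default_rng(0)
model = TinyModel(rng)
x = rng.standard_normal(4)

# 1. Cache: store the activation without changing it.
cache = {}
model.hooks["mlp"] = lambda a: cache.setdefault("mlp", a.copy())
baseline = model.forward(x)

# 2. Ablate: zero the cached activation's largest entry and rerun.
ablated_act = cache["mlp"].copy()
ablated_act[ablated_act.argmax()] = 0.0
model.hooks["mlp"] = lambda a: ablated_act
ablated = model.forward(x)
print("output moved by", np.linalg.norm(baseline - ablated))
```

TransformerLens exposes exactly this pattern at every internal activation of a real transformer, so the mental model transfers directly.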

2

Superposition & Features

1–2 weeks

Understand the core problem (superposition) and the core solution approach (sparse autoencoders).

  • Read: Toy Models of Superposition
  • Read: Towards Monosemanticity
  • Do: Train a toy SAE on synthetic data
  • Do: Load pretrained SAEs from Neuronpedia, explore features
  • Do: Browse features on neuronpedia.org interactively
  • Deliverable: Interactive visualization of superposition in a toy model

→ Open the full Superposition & Features curriculum
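The “train a toy SAE” step can be sketched end to end in numpy. This is an illustrative toy with hand-written gradients (the synthetic data, `W_e`/`W_d` naming, and hyperparameters are mine, not SAELens’s):

```python
import numpy as np

# Toy sparse autoencoder: data is built from 10 sparse ground-truth
# "features" in 16 dims; the overcomplete SAE (32 latents) learns to
# reconstruct it with a ReLU encoder, decoder, and an L1 sparsity term.
rng = np.random.default_rng(0)
d, m_true, m_sae, n = 16, 10, 32, 2000
true_feats = rng.standard_normal((m_true, d))
true_feats /= np.linalg.norm(true_feats, axis=1, keepdims=True)

# Each sample = sum of 2 randomly chosen features (sparse by design).
codes = np.zeros((n, m_true))
for i in range(n):
    codes[i, rng.choice(m_true, size=2, replace=False)] = 1.0
X = codes @ true_feats

W_e = 0.1 * rng.standard_normal((d, m_sae))
W_d = 0.1 * rng.standard_normal((m_sae, d))
lam, lr = 1e-3, 0.02

def step(X):
    global W_e, W_d
    pre = X @ W_e
    f = np.maximum(pre, 0.0)                 # sparse feature activations
    Xh = f @ W_d                             # reconstruction
    err = Xh - X
    dXh = 2.0 * err / len(X)                 # grad of mean squared error
    dW_d = f.T @ dXh
    df = dXh @ W_d.T + lam * np.sign(f) / len(X)   # recon + L1 grads
    dW_e = X.T @ (df * (pre > 0))            # ReLU mask
    W_e -= lr * dW_e
    W_d -= lr * dW_d
    return (err ** 2).mean()

loss0 = step(X)
for _ in range(3000):
    loss = step(X)
print(f"reconstruction MSE: {loss0:.4f} -> {loss:.4f}")
```

In practice you would use SAELens and a real optimizer, but seeing the loss (reconstruction plus L1) written out by hand makes the composition problem from the Open Problems section concrete: nothing here forces the learned dictionary to match `true_feats` atom for atom.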

3

Activation Patching & Circuits

1–2 weeks

Learn the experimental techniques: how to identify which parts of a model are responsible for specific behaviors.

  • Do: ARENA Chapter on activation patching
  • Reproduce: Induction head circuit in GPT-2
  • Learn: Logit lens, attention knockout, causal interventions
  • Explore: Attribution graphs on Neuronpedia
  • Deliverable: Notebook tracing a specific behavior in GPT-2 through its circuit
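The core patching experiment fits in a few lines once you strip away the transformer. This is a hand-made toy “model” (the architecture and names are invented for illustration), but the clean/corrupted/patched protocol is the real one:

```python
import numpy as np

# Activation patching on a toy model: run a clean and a corrupted
# input, then splice the clean run's hidden activation into the
# corrupted run. If the output recovers, that component carries
# the behavior.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 2))

def forward(x, patch=None):
    h = np.tanh(x @ W1)          # the component we intervene on
    if patch is not None:
        h = patch                # replace the activation mid-forward
    return h @ W2

clean = rng.standard_normal(8)
corrupted = clean + rng.standard_normal(8)   # perturbed "prompt"

h_clean = np.tanh(clean @ W1)
out_clean = forward(clean)
out_corr = forward(corrupted)
out_patched = forward(corrupted, patch=h_clean)

# Patching the hidden layer fully restores the clean output here,
# because everything downstream is deterministic in h.
recovery = np.linalg.norm(out_patched - out_clean)
print(f"dist(corrupted, clean) = {np.linalg.norm(out_corr - out_clean):.3f}")
print(f"dist(patched,   clean) = {recovery:.3f}")
```

In a real model you patch one head or one layer at a time and measure partial recovery of the clean logits; the components with the largest effect are your circuit candidates.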
4

Steering & Control

1–2 weeks

Once you can find features, learn to use them: steering model behavior by manipulating representations.

  • Read: Representation Engineering papers
  • Do: Compute steering vectors for a concept (e.g., honesty, formality)
  • Do: Apply steering at inference time, measure effects
  • Explore: Goodfire’s API for feature steering
  • Deliverable: Demo of steering a model’s personality/behavior using feature manipulation
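The difference-of-means recipe behind the second bullet can be sketched with synthetic activations (the planted `honesty_dir`, the data, and the steering scale of 4.0 are all invented for illustration; with a real model you would average cached residual-stream activations over contrastive prompts):

```python
import numpy as np

# Difference-of-means steering vector: average activations on "concept"
# vs "baseline" runs, take the difference, then add a scaled copy of
# that direction to a fresh activation at inference time.
rng = np.random.default_rng(0)
d = 64
honesty_dir = rng.standard_normal(d)
honesty_dir /= np.linalg.norm(honesty_dir)

# Synthetic activations: concept runs contain the direction, baselines don't.
concept_acts = rng.standard_normal((100, d)) + 2.0 * honesty_dir
baseline_acts = rng.standard_normal((100, d))

steer = concept_acts.mean(0) - baseline_acts.mean(0)   # steering vector

# Apply at "inference": shift an activation along the recovered direction.
act = rng.standard_normal(d)
steered = act + 4.0 * steer / np.linalg.norm(steer)

proj = lambda a: a @ honesty_dir
print(f"projection before: {proj(act):.2f}, after: {proj(steered):.2f}")
```

Note the averaging step: individual activations are noisy, but the noise cancels across runs while the concept direction survives, which is why the crude mean-difference works at all.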
5

Frontier Methods

2–3 weeks

The cutting edge: crosscoders, transcoders, attribution graphs, model diffing.

  • Read: Crosscoders paper, Circuit Tracing paper
  • Do: Run circuit tracer on Gemma-2-2B
  • Do: Compare features across two different models
  • Explore: CLTs (cross-layer transcoders)
  • Deliverable: Attribution graph for an interesting behavior + analysis
6

Your Own Research

Ongoing

By now you have the tools and intuitions. Pick an open problem that fascinates you and start poking at it.

  • Review the Open Problems section — what pulls you?
  • Start with 1-week mini-projects (fast feedback)
  • Write up findings, even negative results
  • Consider: MATS application, Anthropic Fellows, or independent research
  • Remember: The bar for entry is low. There aren’t enough people doing this. Your engineering skills are a genuine advantage.

Why Your Background Is an Advantage

Mech interp is closer to reverse engineering than traditional ML research. You’re decompiling a binary, tracing execution paths, finding bugs. The mindset of a software engineer who debugs complex systems is exactly right. Many breakthrough results came from people who think like engineers, not just mathematicians. The field explicitly needs more people who can build robust tools, run systematic experiments, and write clean infrastructure. That’s you.