Chapter 1 - Transformer Interpretability

The transformer is the dominant neural network architecture for language modeling, brought into the public eye by models like ChatGPT.

In this chapter, you will learn all about transformers, and build and train your own. You’ll also learn about the mechanistic interpretability of transformers, a field advanced by Anthropic’s Transformer Circuits sequence and the work of Neel Nanda.


Transformers from Scratch

The first day of this chapter involves:

  • Learning how transformers work (core concepts like the attention mechanism and the residual stream);

  • Building your own GPT-2 model;

  • Learning how to generate autoregressive text samples from a transformer’s probabilistic output;

  • Understanding how transformers can use key-value (KV) caching to speed up autoregressive generation.
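The sampling step above can be made concrete: a transformer outputs a vector of logits over its vocabulary, which we convert to probabilities and sample from. Here is a minimal numpy sketch with a made-up 5-token vocabulary (the logit values are invented for illustration):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution (with temperature scaling)."""
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample a token id from the model's next-token distribution."""
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = softmax(logits, temperature)
    return rng.choice(len(logits), p=probs)

# Toy example: 5-token vocabulary, logits strongly favour token 2
logits = np.array([0.1, 0.2, 5.0, 0.3, 0.1])
token = sample_next_token(logits)
```

Lower temperatures concentrate the distribution on the highest logits (greedy decoding in the limit), while higher temperatures flatten it.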



Introduction to Mechanistic Interpretability

The next day covers:

  • Mechanistic Interpretability: what it is, and its path to impact;

  • Anthropic’s Transformer Circuits sequence (starting with A Mathematical Framework for Transformer Circuits);

  • The open-source library TransformerLens, and how it can assist with mech interp investigations and experiments;

  • Induction Heads – what they are, and why they matter.


Linear Probes

This section is part of the first of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Probing and Representations.

This set of exercises is built around linear probing, one of the most important tools in mechanistic interpretability for understanding what information language models represent internally.

The core idea of probing is that we extract internal activations from a model, then train a simple classifier on them. If a linear classifier can accurately predict some property from the activations, we have evidence that the property is linearly represented in the model's internal state.
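The recipe above can be sketched in a few lines. Everything here is synthetic: we fake "activations" in which a binary property is linearly encoded along a known direction, then fit a linear probe by logistic regression (plain numpy, so no model or ML library is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_samples = 64, 2000

# Synthetic "activations": Gaussian noise, plus a property direction
# added for positive examples so the label is linearly decodable
property_dir = rng.normal(size=d_model)
property_dir /= np.linalg.norm(property_dir)
labels = rng.integers(0, 2, size=n_samples)
acts = rng.normal(size=(n_samples, d_model)) + 4.0 * labels[:, None] * property_dir

# Linear probe: logistic regression trained by full-batch gradient descent
w, b = np.zeros(d_model), 0.0
for _ in range(200):
    p = 1 / (1 + np.exp(-(acts @ w + b)))       # predicted probabilities
    grad_w = acts.T @ (p - labels) / n_samples  # gradient of cross-entropy loss
    grad_b = (p - labels).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

accuracy = ((acts @ w + b > 0) == labels).mean()
```

In real probing experiments the activations would come from a cached forward pass at some layer, and you would evaluate the probe on held-out data to avoid overstating what the model represents.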


Function Vectors & Model Steering

This section is part of the first of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Probing and Representations.

These exercises explore the following question: can we steer a model to produce different outputs or behave differently by intervening on its forward pass using vectors found by methods other than gradient descent?

The majority of the exercises focus on function vectors: vectors which are extracted from forward passes on in-context learning (ICL) tasks, and added to the residual stream in order to trigger the execution of this task from a zero-shot prompt.
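As a stand-in for the real thing (which you'll do with hooks on actual models), here is a toy numpy sketch of the intervention pattern: run a forward pass as a stack of residual-stream updates, and add a fixed vector to the residual stream at a chosen layer. Everything here, including the toy "layers", is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 4

# Toy "transformer": each layer adds an update to the residual stream
weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

def forward(x, steering_vector=None, steer_at_layer=None):
    """Forward pass; optionally add a vector to the residual stream at one layer."""
    resid = x.copy()
    for i, W in enumerate(weights):
        resid = resid + np.tanh(resid @ W)   # residual connection + toy layer
        if steering_vector is not None and i == steer_at_layer:
            resid = resid + steering_vector  # the intervention
    return resid

x = rng.normal(size=d_model)
fn_vec = rng.normal(size=d_model)
baseline = forward(x)
steered = forward(x, steering_vector=fn_vec, steer_at_layer=2)
```

The key design point is that the intervention happens mid-forward-pass: every layer downstream of the injection sees the modified residual stream, so a single added vector can change the model's eventual output.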

The exercises also introduce the nnsight library, which is designed to support this kind of work (and other interpretability research) on very large language models.


Interpretability with SAEs

This section is part of the first of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Probing and Representations.

In these exercises, we dive deeply into the interpretability research that can be done with sparse autoencoders. We jump straight into using SAEs with actual language models, starting by introducing two important tools: SAELens and Neuronpedia, an open platform for interpretability research. We'll then move through a few other exciting domains in SAE interpretability, grouped into several categories (e.g. understanding / classifying latents, training & evaluating SAEs).

We expect some degree of prerequisite knowledge in these exercises. Specifically, it will be very helpful if you understand:

  • What superposition is;

  • What the sparse autoencoder architecture is, and why it can help us disentangle features from superposition.
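As a reminder of the second prerequisite, here is a minimal (untrained) sketch of the standard SAE architecture: an overcomplete linear encoder with a ReLU giving sparse, non-negative latents, and a linear decoder reconstructing the activation. Plain numpy, with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64        # latent dimension is larger than the activation dimension

W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(act):
    """Encode an activation into sparse latents, then reconstruct it."""
    latents = np.maximum(0, act @ W_enc + b_enc)   # ReLU -> non-negative latents
    recon = latents @ W_dec + b_dec
    return latents, recon

act = rng.normal(size=d_model)
latents, recon = sae_forward(act)

# Training minimises reconstruction error plus a sparsity penalty on the latents
# (L1 here; real SAE variants differ in how they enforce sparsity)
l1_coeff = 1e-3
loss = np.sum((recon - act) ** 2) + l1_coeff * np.sum(np.abs(latents))
```

The overcompleteness (d_sae > d_model) is what lets the SAE pull more features out of the activation than it has dimensions, which is exactly the superposition setting described above.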

Note: there is a very large amount of content in this set of exercises, easily double that of any other single exercise set in ARENA (and some of those exercise sets are meant to last several days). The purpose of these exercises isn't to go through every single one of them, but rather to jump around to the ones you're most interested in.


Activation Oracles

This section is part of the first of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Probing and Representations.

Linear probes let you ask yes-or-no questions about a model's activations, and SAEs give you an unsupervised decomposition into interpretable features. Both are useful, but they share a limitation: you have to decide what you're looking for before you start looking.

What if you could just ask a model's activations an open-ended question in plain English? ‘What concept is this layer encoding?’ or ‘Is this model planning to lie?’ That's the idea behind Activation Oracles: LLMs that have been trained to take another model’s internal activations as input and answer arbitrary questions about them. The oracle reads the activations the way you'd read a passage of text, except the ‘text’ is a vector of floating-point numbers from some intermediate layer.



Indirect Object Identification

This section is part of the second of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Circuits in LLMs.

This set of exercises is built around the Interpretability in the Wild paper, in which the authors reverse-engineer the indirect object identification (IOI) circuit in GPT-2 small. This circuit is responsible for the model's ability to complete sentences like "John and Mary went to the shops, John gave a bag to" with the correct token "Mary".

Here, you'll explore circuits in a real model (GPT-2 small) and replicate the paper's key results.
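A central metric in this kind of analysis is the logit difference between the correct and incorrect name at the final position. A toy illustration (the token ids and logit values below are invented, not taken from a real model):

```python
import numpy as np

# Hypothetical token ids and final-position logits for the IOI prompt
vocab = {" Mary": 0, " John": 1, " shops": 2}
final_logits = np.array([4.2, 1.1, 0.3])   # invented values for illustration

def logit_diff(logits, correct_id, incorrect_id):
    """Higher = the model prefers the correct completion over the incorrect one."""
    return logits[correct_id] - logits[incorrect_id]

score = logit_diff(final_logits, vocab[" Mary"], vocab[" John"])
```

Measuring how this score changes under interventions (e.g. ablating or patching individual heads) is how the paper localises which components implement the circuit.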


SAE Circuits

This section is part of the second of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Circuits in LLMs.

In these exercises, we explore circuits with SAEs: sets of SAE latents in different layers of a transformer which communicate with each other, and explain some particular model behaviour in an end-to-end way. We'll start by computing latent-to-latent, token-to-latent and latent-to-logit gradients, which give us a linear proxy for how latents in different layers are connected.

We'll then move on to transcoders, a variant of SAEs which learn to reconstruct a model layer's computation rather than just its activations, and which offer significant advantages for circuit analysis.
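The distinction can be made concrete with a sketch (numpy, untrained, made-up sizes): a transcoder has the same sparse encoder-decoder shape as an SAE, but its decoder is trained to predict the layer's output (e.g. the MLP output from the MLP input) rather than to reconstruct the encoder's own input:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent, d_out = 16, 64, 16

W_enc = rng.normal(scale=0.1, size=(d_in, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_out))

def transcoder(layer_input):
    """Approximate a layer's *computation*: sparse latents predict its output."""
    latents = np.maximum(0, layer_input @ W_enc)   # ReLU sparse code of the input
    return latents, latents @ W_dec                # prediction of the layer's output

# An SAE would be trained so the reconstruction matches its own input;
# a transcoder is trained so the prediction matches the layer's output.
x = rng.normal(size=d_in)
latents, predicted_out = transcoder(x)
```

Because each latent then linearly contributes to the layer's output, transcoder latents compose cleanly across layers, which is what makes them convenient for circuit analysis.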



Algorithmic Tasks (Balanced Brackets)

This section is part of the third of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Toy Models.

When models are trained on synthetic, algorithmic tasks, they often learn a clean, interpretable computation internally. Choosing a suitable task and reverse-engineering the model trained on it can be a rich source of interesting circuits to interpret! In some sense, this is interpretability on easy mode: the model is normally trained on a single task (unlike language models, which need to learn everything about language!), we know the exact ground truth about the data and the optimal solution, and the models are tiny. So why care?

Working on algorithmic problems gives us the opportunity to:

  • Practice interpretability, build intuitions, and learn techniques.

  • Refine our understanding of the right tools and techniques, by trying them out on problems with well-understood ground truth.

  • Isolate a particularly interesting kind of behaviour, in order to study it in detail and understand it better (e.g. Anthropic's Toy Models of Superposition paper).

  • Take the insights you've learned from reverse-engineering small models, and investigate which results will generalise, or whether any of the techniques you used to identify circuits can be automated and used at scale.



Grokking & Modular Arithmetic

This section is part of the third of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Toy Models.

The goal for these exercises is to reverse-engineer a one-layer transformer trained on modular addition. It turns out that the circuit responsible for this involves discrete Fourier transforms and trigonometric identities. This is perhaps the most interesting circuit for solving an algorithmic task that has been fully reverse-engineered thus far.
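The trigonometric identity at the heart of this circuit can be checked numerically: the model computes cosines and sines of multiples of its inputs, combines them with the angle-addition formula cos(w(a+b)) = cos(wa)cos(wb) − sin(wa)sin(wb), and reads off a + b (mod p). A quick numpy check (p = 113 is the modulus from the original work; the frequency k is an arbitrary choice here, not one of the model's actual learned frequencies):

```python
import numpy as np

p = 113                      # modulus used in the original grokking setup
k = 5                        # arbitrary frequency for illustration
w = 2 * np.pi * k / p

a, b = 47, 92
lhs = np.cos(w * ((a + b) % p))
rhs = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
# Because w = 2*pi*k/p, cos(w*x) is periodic in x with period p,
# so reducing a+b mod p doesn't change the left-hand side.
```

This periodicity is why Fourier components are such a natural basis for the task: working with cos(wx) and sin(wx) makes the "wrap-around" of modular addition come for free.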

These exercises are adapted from the original notebook by Neel Nanda and Tom Lieberum (and to a lesser extent the accompanying paper). We'll mainly be focusing on mechanistic analysis of this toy model, rather than replicating the grokking results (these may come in later exercises).



OthelloGPT

This section is part of the third of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Toy Models.

Here, you’ll conduct an investigation into how a model trained to play a game operates. You’ll use linear probes, logit attribution, and activation patching to gain an understanding of how the model represents and plays Othello.



Superposition

This section is part of the third of three possible branches you can follow after the first two days of ARENA’s interpretability syllabus: Toy Models.

Superposition is a crucially important concept for understanding how transformers work. A definition from Neel Nanda's glossary:

Superposition is when a model represents more than n features in an n-dimensional activation space. That is, features still correspond to directions, but the set of interpretable directions is larger than the number of dimensions.

Why should we expect something like this to happen? In general, the world has way more features than the model has dimensions, so we can't have a one-to-one mapping between features and dimensions in our model. But the model has to represent these features somehow. Hence, it comes up with techniques for cramming multiple features into fewer dimensions (at the cost of adding noise and interference between features).
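A quick numerical illustration of this cramming trade-off: random directions in high-dimensional space are nearly orthogonal, so a model can assign many more feature directions than it has dimensions, at the cost of small interference between them (the sizes below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 100, 400      # 4x more features than dimensions

# One random unit direction per feature
features = rng.normal(size=(n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Interference = overlap (dot product) between different feature directions
overlaps = features @ features.T
np.fill_diagonal(overlaps, 0)
max_interference = np.abs(overlaps).max()
# Each feature reads itself back perfectly (self-overlap 1), while the
# interference from any other feature stays well below 1.
```

This is why superposition becomes viable in high dimensions: the noise from each interfering feature shrinks roughly like 1/sqrt(d), so sparse features can share a space much smaller than their count.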