Chapter 1 - Transformers & Mechanistic Interpretability

The transformer is the neural network architecture behind modern language models, and it has made headlines with the release of models like ChatGPT.

In this chapter, you will learn how transformers work, and build and train your own. You’ll also learn about Mechanistic Interpretability of transformers, a field advanced by Anthropic’s Transformer Circuits sequence and by the work of Neel Nanda.


Transformers: Building, Training, Sampling

The first day of this chapter involves:

  • Learning how transformers work (core concepts like the attention mechanism and the residual stream);

  • Building your own GPT-2 model;

  • Learning how to generate autoregressive text samples from a transformer’s probabilistic output;

  • Understanding how transformers can use key-value (KV) caching to speed up generation (the sampling sketch after this list shows the per-step recomputation that caching avoids).
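To make the sampling step concrete, here is a minimal sketch of an autoregressive sampling loop in PyTorch. It assumes only that `model` is a GPT-2-style module mapping token ids to logits (such as the one you’ll build); the function name and defaults are illustrative, not part of any course API.

```python
import torch

def sample_tokens(model, input_ids, max_new_tokens=50, temperature=1.0):
    """Minimal autoregressive sampling sketch.

    Assumes `model` maps (batch, seq) token ids to (batch, seq, vocab)
    logits, as a GPT-2-style transformer does.
    """
    tokens = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            # Re-runs the whole prefix every step: exactly the repeated
            # computation that key-value caching eliminates
            logits = model(tokens)
        next_logits = logits[:, -1] / temperature   # logits at the final position
        probs = torch.softmax(next_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample, don't argmax
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokens
```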



Introduction to Mechanistic Interpretability

The next day covers:

  • Mechanistic Interpretability – what it is, and its path to impact;

  • Anthropic’s Transformer Circuits sequence (starting with A Mathematical Framework for Transformer Circuits);

  • The open-source library TransformerLens, and how it can assist with Mech-Interp investigations and experiments (a minimal example follows this list);

  • Induction Heads – what they are, and why they matter.
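As a taste of the TransformerLens workflow, the sketch below loads GPT-2 small, runs it with activation caching, and reads out one layer’s attention patterns (the kind of tensor you’d inspect when looking for induction heads). The prompt is arbitrary; the repeated phrase just gives induction-style heads something to latch onto.

```python
from transformer_lens import HookedTransformer

# Load GPT-2 small (weights are downloaded on first use)
model = HookedTransformer.from_pretrained("gpt2")

# Run the model, caching every intermediate activation
logits, cache = model.run_with_cache("The cat sat on the mat. The cat")

# Attention patterns for layer 0: shape (batch, head, query_pos, key_pos)
patterns = cache["pattern", 0]
print(patterns.shape)
```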



Superposition

This is the first option in the set of paths you can take, after having covered the first two days of material.

Here, you’ll investigate the phenomenon of superposition, which allows models to represent more features than they have neurons. You’ll also work on sparse autoencoders (SAEs), which aim to make models interpretable despite superposition.
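For orientation, a minimal SAE can be written in a few lines. Everything below is an illustrative sketch, not the course’s implementation: activations are encoded into a larger, sparsely-active feature basis, and the loss trades reconstruction error against an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative sparse autoencoder: d_model activations are mapped
    to a larger dictionary of d_hidden features, then reconstructed."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse, non-negative activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparsity
    mse = (reconstruction - x).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```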



Indirect Object Identification

This is the second option in the set of paths you can take, after having covered the first two days of material.

Here, you’ll explore circuits in a real model (GPT-2 small), and replicate the results of the Interpretability in the Wild paper (whose authors found a circuit for performing indirect object identification).
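A central technique in that paper is activation patching: copy an activation from a clean run into a corrupted run, and measure how much of the behaviour is restored. Below is a minimal sketch using TransformerLens; the layer and position are arbitrary placeholders, and the prompts are IOI-style sentences where the corruption swaps which name appears.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean = "When John and Mary went to the store, John gave a drink to"
corrupt = "When John and Mary went to the store, Mary gave a drink to"

# Cache every activation from the clean run
_, clean_cache = model.run_with_cache(clean)

layer, pos = 9, 14  # hypothetical layer / token position to patch

def patch_resid(resid, hook):
    # Overwrite one position of the corrupted run's residual stream
    # with the corresponding clean activation
    resid[:, pos] = clean_cache[hook.name][:, pos]
    return resid

# Rerun the corrupted prompt with the patch applied, then compare the
# patched logits against the clean and corrupted baselines
patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
)
```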



Algorithmic Tasks (Balanced Brackets)

This is the third option in the set of paths you can take, after having covered the first two days of material.

Here, you’ll perform interpretability on a transformer trained to classify bracket strings as balanced or unbalanced. You’ll also have a chance to interpret models trained on simple LeetCode-style problems of your choice!
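The task itself has a one-line ground truth, which is part of what makes it such a clean interpretability target: the question becomes how a transformer implements something equivalent to the depth-counting check below.

```python
def is_balanced(s: str) -> bool:
    """Ground-truth labeller for the balanced-brackets task.
    Assumes s contains only '(' and ')' characters."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:      # a ")" arrived with no open "("
            return False
    return depth == 0      # every "(" must eventually be closed

assert is_balanced("(())()")
assert not is_balanced("())(")
```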



Grokking & Modular Arithmetic

This is the fourth option in the set of paths you can take, after having covered the first two days of material.

Here, you’ll investigate the phenomenon of grokking in transformers, by studying a transformer trained to perform modular addition.
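The setup is small enough to write down in full. Here is a sketch of the dataset, assuming p = 113 (the modulus used in Neel Nanda’s grokking work, though any small prime works): train on a fraction of all (a, b) pairs, and watch test accuracy jump from chance to near-perfect long after train accuracy saturates.

```python
import torch

p = 113  # small prime modulus (the value used in the original mod-add grokking work)

# Every (a, b) pair, labelled with (a + b) mod p
a = torch.arange(p).repeat_interleave(p)
b = torch.arange(p).repeat(p)
pairs = torch.stack([a, b], dim=1)  # shape (p*p, 2)
labels = (a + b) % p

# Hold out most pairs: grokking appears when training data is scarce
train_frac = 0.3
perm = torch.randperm(p * p)
n_train = int(train_frac * p * p)
train_idx, test_idx = perm[:n_train], perm[n_train:]
```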




OthelloGPT

This is the fifth option in the set of paths you can take, after having covered the first two days of material.

Here, you’ll investigate how a model trained to play a game operates. You’ll use linear probes, logit attribution, and activation patching to understand how the model represents and plays Othello.
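As a flavour of the probing step, the sketch below trains a linear map from residual-stream activations to a per-square board-state label; follow-up work on OthelloGPT found the board is linearly decodable when framed as mine / empty / theirs. All shapes and names here are hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: residual-stream width, 64 board squares,
# and 3 classes per square (mine / empty / theirs)
d_model, n_squares, n_classes = 512, 64, 3

probe = nn.Linear(d_model, n_squares * n_classes)

def probe_loss(resid: torch.Tensor, board: torch.Tensor) -> torch.Tensor:
    """Cross-entropy for reading board state out of the residual stream.

    resid: (batch, d_model) activations at some layer and move position
    board: (batch, n_squares) integer class per square (dtype long)
    """
    logits = probe(resid).view(-1, n_squares, n_classes)
    return nn.functional.cross_entropy(logits.flatten(0, 1), board.flatten())
```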