Chapter 1 - Transformers & Mechanistic Interpretability

The transformer is the neural network architecture at the heart of modern language modeling, and it has made headlines with the introduction of models like ChatGPT.

In this chapter, you will learn how transformers work, and build and train your own. You’ll also be introduced to Mechanistic Interpretability of transformers, a field advanced by Anthropic’s Transformer Circuits sequence and by the work of Neel Nanda.

Transformers: Building, Training, Sampling

The first two days of this chapter involve:

  • Learning how transformers work (e.g. core concepts like the attention mechanism and the residual stream)

  • Building your own GPT-2 model

  • Learning how to generate autoregressive text samples from a transformer’s probabilistic output (a minimal sampling loop is sketched after this list)

  • Understanding how transformers can use key and value caching to speed up computation
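
To make the sampling and caching bullets concrete, here is a minimal sketch of an autoregressive sampling loop with temperature and a key/value cache. It uses the Hugging Face `transformers` library and the pretrained GPT-2 checkpoint purely as a stand-in for the model you’ll build yourself, so treat it as an illustrative assumption rather than the course’s reference solution.

```python
# Minimal sketch: autoregressive sampling with temperature and a KV cache,
# using the pretrained GPT-2 checkpoint as a stand-in for your own model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Mechanistic interpretability is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

temperature = 0.8
past_key_values = None  # the KV cache: lets each step reuse earlier attention computations

with torch.no_grad():
    for _ in range(30):
        # With a cache, only the newest token needs to be fed in at each step
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        # Sample the next token from the temperature-scaled probability distribution
        probs = torch.softmax(out.logits[:, -1, :] / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Lower temperatures make the distribution sharper (more deterministic), higher temperatures make samples more diverse; the cache means each new step is roughly constant-cost rather than re-processing the whole prompt.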

Intro to Mechanistic Interpretability

The next two days cover:

  • Mechanistic Interpretability - what it is, and what its path to impact looks like

  • Anthropic’s Transformer Circuits sequence (starting with A Mathematical Framework for Transformer Circuits)

  • The open-source library TransformerLens, and how it can assist with MechInt investigations and experiments (a short usage sketch follows this list)

  • Induction Heads - what they are, and why they matter
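
As a taste of what the TransformerLens exercises build on, here is a minimal sketch of loading GPT-2 small and caching its activations with `run_with_cache`. The particular activation inspected here (a layer-0 attention pattern) is just an example; the actual experiments you run (e.g. hunting for induction heads) will differ.

```python
# Minimal sketch: load GPT-2 small with TransformerLens and cache its activations.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "The quick brown fox jumps over the lazy dog"
logits, cache = model.run_with_cache(prompt)

# The cache maps activation names to tensors, e.g. the attention pattern in layer 0
attn_pattern = cache["pattern", 0]  # shape: (batch, n_heads, query_pos, key_pos)
print(attn_pattern.shape)
```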

Algorithmic Tasks (balanced brackets)

This is the first option in the set of paths you can take after covering the first four days of material.

Here, you’ll perform interpretability on a transformer trained to classify bracket strings as balanced or unbalanced. You’ll also have a chance to interpret models trained on simple LeetCode-style problems of your choice!
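
For reference, the ground-truth labelling rule the transformer is trained to reproduce can be written in a few lines (an illustrative sketch, assuming the bracket strings use only parentheses):

```python
# Reference labelling rule: a bracket string is balanced iff every open bracket
# is closed in the right order and nothing is left unclosed.
def is_balanced(s: str) -> bool:
    depth = 0
    for char in s:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a close bracket with no matching open bracket
                return False
    return depth == 0  # balanced iff every open bracket was eventually closed

assert is_balanced("(())()")
assert not is_balanced("())(")
```

Part of the fun of this path is working out how a transformer, which processes the whole string in parallel, implements something equivalent to this sequential depth-tracking algorithm.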

Indirect Object Identification

This is the second option in the set of paths you can take after covering the first four days of material.

Here, you’ll explore circuits in real-life models (GPT-2 small), and replicate the results of the Interpretability in the Wild paper (whose authors found a circuit for performing indirect object identification).
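
To illustrate the task itself: given a prompt like the one below, GPT-2 small should complete it with the indirect object (" Mary") rather than the subject (" John"), and the logit difference between the two names is the basic measurement the circuit analysis builds on. This is a hedged sketch using TransformerLens, not the paper's own code.

```python
# Sketch of the IOI task: measure how strongly the model prefers the indirect
# object (" Mary") over the subject (" John") as the next token.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
logits = model(prompt)  # shape: (batch, seq_len, d_vocab)

mary_token = model.to_single_token(" Mary")
john_token = model.to_single_token(" John")

final_logits = logits[0, -1]
logit_diff = (final_logits[mary_token] - final_logits[john_token]).item()
print("logit diff (Mary - John):", logit_diff)
```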

Grokking & Modular Arithmetic

This is the third option in the set of paths you can take after covering the first four days of material.

Here, you’ll investigate the phenomenon of grokking in transformers by studying a transformer trained to perform modular addition.
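
The task itself is easy to describe: every pair (a, b) gets the label (a + b) mod p, the model is trained on a fraction of all pairs, and test accuracy on the held-out pairs jumps long after training accuracy saturates. Below is a hedged sketch of generating that dataset; the modulus 113 and the 30% train fraction are common choices in this line of work rather than fixed requirements.

```python
# Sketch of the modular addition dataset: all pairs (a, b) labelled with (a + b) mod p.
import itertools
import random

p = 113  # prime modulus (an assumed, commonly used value; any small prime works)

pairs = list(itertools.product(range(p), repeat=2))
dataset = [(a, b, (a + b) % p) for a, b in pairs]

random.seed(0)
random.shuffle(dataset)
split = int(0.3 * len(dataset))  # e.g. train on 30% of pairs, hold out the rest
train_data, test_data = dataset[:split], dataset[split:]

print(len(train_data), "train examples,", len(test_data), "test examples")
```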

Superposition

More details coming soon!

OthelloGPT

More details coming soon!