Chapter 3: Model Evaluations

Evaluating models is a growing field of AI safety. It allows us to test the outputs of LLMs under different circumstances to see whether they are safe, reliable and secure.

In this chapter, you will learn about some of the fundamentals of designing safety evaluations, and go on to design your own. You’ll then learn how to run these evaluations on current frontier models. You’ll finish by learning how to build and scaffold LM agents, a rapidly developing aspect of frontier model evaluation.

The full content is available here.

A large, clear glass cylindrical structure with a red light in the center, resembling the fictional Hal 9000 from 2001: A Space Odyssey, illuminated in a dark room.

Eval Design & Threat Modelling

Threat-modelling focuses on teaching the fundamental skills necessary to write good evaluation questions. These include: defining the properties you want to measure, understanding why they are relevant for safety, and how to measure them effectively. You’ll then take a look at an evaluations case-study of “alignment faking”. Finally, you’ll write a few evaluation questions for the property of your choice.

Two robots working amidst stacks of papers and documents in an office.

Model-written Evals

Writing evaluations for LLMs can be time-consuming! On day two, you will build a pipeline in order to use language models (along with the few questions you hand-wrote) to generate hundreds of questions. You’ll also build a pipeline to audit these generated questions so that you can minimise biases in questions, and remove any mistakes made in the generation process.

A humanoid robot with a metallic copper and gold finish holds a magnifying glass up to its eye, inspecting something with a curious expression.

Running Evals with Inspect

Now that you have an evaluation dataset, you can run your evaluation on a variety of current models to see how they respond, and collect data on their behaviour under different circumstances.

You will be using the UK AISI’s Inspect-AI library to test and run different evaluation methods, and to help to visualise the results of our evaluations across models and contexts.

A robotic figure with multiple flexible limbs working at a vintage typewriter on a wooden desk, with a lamp and a radio in the background.

Building and Evaluating LLM Agents

The last two days will focus on building and evaluating LLM agents. LLM agents are essentially a glorified for-loop around an API call, along with scaffolding to ensure the agent i) receives correct information, and ii) can return actions to its environment.

First we’ll build a simple agent to solve an easy task: using a calculator tool to output answers to mathematical questions. We’ll then build a much more complicated LLM agent to complete the Wikipedia game. We’ll then see how we can elicit better performance from this agent by adding additional tools, modifying our scaffolding program, and changing our prompting.

Chapter 3: Model Evaluations

Eval Design & Threat Modelling

Model-written Evals

Running Evals with Inspect

Building and Evaluating LLM Agents

Get in touch: info@arena.education