Previous Capstone Projects
ARENA 5.0: Selected Capstone Projects
-
Cain's Jawbone Eval
This project created a novel task, based on the puzzle mystery book “Cain’s Jawbone”, to test LLMs’ long-horizon coherence, planning, and tool use.
The authors crafted several subtasks, including ordering the pages correctly, identifying victims, identifying killers, and matching victim/killer pairs. They found minimal saturation on this long-horizon task (best performance on the hardest subtask: 1.36%).
– Hardik Bhatnagar and Josh Hills
-
Sandbagging and Noise Injection
This project replicated Noise Injection Reveals Hidden Capabilities for Sandbagging Language Models by Tice et al., adding noise to sandbagging models and demonstrating that this improves their performance.
The authors extended this work by rerunning the test on reasoning models, to see how reasoning performance changes when noise is injected into sandbagging models, and analysed common sandbagging strategies of reasoning models.
– Lovkush Agarwal and Mikołaj Kniejski
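The core mechanic of noise injection can be sketched with a toy numpy model. This is purely illustrative and not the authors' code: the real experiments perturb LLM weights and rerun a benchmark, whereas the "sandbagging" model here is just a linear scorer with deliberately corrupted weights.

```python
import numpy as np

def add_weight_noise(weights, sigma, rng):
    """Return a copy of the weights with i.i.d. Gaussian noise of scale sigma added."""
    return {name: w + rng.normal(0.0, sigma, size=w.shape) for name, w in weights.items()}

def accuracy(weights, X, y):
    """Accuracy of a toy linear classifier: predict 1 when X @ w > 0."""
    preds = (X @ weights["w"] > 0).astype(int)
    return (preds == y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = (X @ true_w > 0).astype(int)

# A "sandbagging" stand-in: the true weights with some signs flipped,
# so the model systematically underperforms its underlying capability.
sandbagged = {"w": true_w * np.where(np.arange(8) < 3, -1.0, 1.0)}

# Sweep noise scales and evaluate at each level, mirroring the methodology:
# perturb, measure, and look for capability recovered under moderate noise.
for sigma in [0.0, 0.1, 0.5, 1.0]:
    noisy = add_weight_noise(sandbagged, sigma, rng)
    print(f"sigma={sigma}: accuracy={accuracy(noisy, X, y):.2f}")
```

In the actual replication, each sweep point would perturb the LLM's parameters and rerun the eval; a sandbagging model's score rising under moderate noise is the signature being looked for.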
-
Replicating Zhao et al.: Absolute Zero
This project replicated the well-known ‘Absolute Zero’ paper by Zhao et al. on a generated dataset of modular arithmetic problems.
The authors successfully improved performance on modular arithmetic tasks through RL training on synthetic data.
– Lily Li, Gareth Tan and Lucy Wingard
-
Text Autoencoder Embedding Exploration
This project investigated the extent to which intervening on the intermediate embeddings of a text autoencoder changes the resulting decoded sentence.
The authors used PCA, logistic regression, and positional-embedding analysis on intermediate embeddings to assess the extent of semantic and syntactic understanding in text autoencoders.
– Anton Hawthorne and Samuel Nellessen
-
Deep Dive: GPT-2 Attention Head 1.5
This project analysed attention head 1.5 in GPT-2, first establishing that the head exhibits unusual semantic behaviour: it attends to earlier semantically similar tokens, but not to earlier identical tokens.
The author performed ablations to determine that the key components feeding head 1.5 are the embedding and MLP0, then found evidence via further ablations that the QK matrix uses antisymmetry to suppress attention to identical tokens.
– Matteo Migliarini
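Zero-ablation, the attribution technique described above, can be illustrated with a toy model. The components here are hypothetical stand-ins for GPT-2's embedding and MLP0, not the author's actual code:

```python
import numpy as np

def forward(x, components, ablate=None):
    """Sum the outputs of named components, optionally zero-ablating one."""
    return sum(f(x) for name, f in components.items() if name != ablate)

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(8, 8))
W_mlp0 = rng.normal(size=(8, 8))
components = {
    "embed": lambda x: x @ W_embed,         # stand-in for the embedding path
    "mlp0": lambda x: np.tanh(x @ W_mlp0),  # stand-in for MLP0's contribution
}

x = rng.normal(size=8)
baseline = forward(x, components)

# Ablate each component in turn; a large change in the output signals
# that the component matters for the behaviour under study.
for name in components:
    effect = np.linalg.norm(baseline - forward(x, components, ablate=name))
    print(f"ablating {name}: output change = {effect:.2f}")
```

In the real analysis the "output" would be head 1.5's attention pattern or the model's logits, and components whose ablation changes it most are the ones the head depends on.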
-
Emergent Misalignment and Realignment
This project replicated the Emergent Misalignment paper by Betley et al., and extended it by adding tags to insecure and secure code before supervised fine-tuning, finding that this limits emergent misalignment to tagged contexts.
The authors also fine-tuned on a Bad Medical Advice Q&A dataset, finding much higher rates of misalignment, and showed that emergent misalignment is reversible by fine-tuning a misaligned model on an AI-optimism dataset.
– Liza Tennant, Jasper Timm and Kevin Wei
-
Self-preservation and Psychology in LLMs
This project investigated the extent to which models exhibit self-preservation tendencies, constructing an eval to produce concrete results on this topic.
The authors extended this by examining the extent to which instrumentally convergent goals inform models’ willingness to replace themselves with more capable models.
– Nadja Flechner and Michele Mignani
-
Exploring Vulnerability to Sycophancy Attacks
This project investigated the vulnerability of commonly used reward models to sycophancy attacks. It generated eight strategies for increasing the sycophancy of model responses, and trained models with RLHF to exploit these strategies.
The author found that reward models are vulnerable to reward hacking: trained models maintained harmful behaviour while gaining high reward by acting sycophantically towards users.
– Tianhao Shen
-
WhiteBox Eval Awareness
This project trained logistic regression probes on the hidden-layer activations of DeepSeek-R1-Distill-Qwen-32B in evaluation and normal chat contexts, to measure the model’s awareness of being evaluated.
The authors found that these probes did not generalise to unseen evals. They also gathered moderate evidence that 32B models may be unable to internally distinguish evaluation contexts from chat contexts.
– Andy Gordon, Antony Kellermann and Lorenzo Venieri
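The probing setup can be sketched as follows. This is a minimal numpy sketch on synthetic data, not the project's code: the actual work extracts real hidden-layer activations from DeepSeek-R1-Distill-Qwen-32B, whereas here the "eval" vs "chat" separation is built in by construction as a mean shift.

```python
import numpy as np

def train_probe(acts, labels, lr=0.1, steps=500):
    """Train a logistic-regression probe on activation vectors via gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))   # predicted P(eval)
        w -= lr * acts.T @ (p - labels) / len(labels)
        b -= lr * (p - labels).mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    return (((acts @ w + b) > 0).astype(int) == labels).mean()

# Synthetic stand-ins for hidden-layer activations: "eval" contexts differ
# from "chat" contexts by a small mean shift along one direction.
rng = np.random.default_rng(1)
d = 16
direction = rng.normal(size=d)
chat_acts = rng.normal(size=(100, d))
eval_acts = rng.normal(size=(100, d)) + 0.8 * direction
acts = np.vstack([chat_acts, eval_acts])
labels = np.array([0] * 100 + [1] * 100)

w, b = train_probe(acts, labels)
print(f"probe accuracy: {probe_accuracy(w, b, acts, labels):.2f}")
```

The generalisation question the authors studied corresponds to evaluating a trained probe on activations from evals it never saw during training, rather than on held-out rows of the same distribution as above.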