Alumni Capstone Projects

ARENA 7.0: Selected Capstone Projects

Continual Learning with Self-Distillation Fine Tuning

This project explored Self-Distillation Fine-Tuning (SDFT) as a gentler alternative to standard supervised fine-tuning (SFT), testing whether it could teach a model a new trait (writing in all caps) while staying closer to the base model. The authors replicated the underlying paper, then ran SFT and SDFT and compared the results using interpretability techniques, including a steering vector analysis.
The project found that the SDFT model stayed closer to the base model than the SFT model did, though the authors noted a confounder: SFT ran for roughly twice as many steps due to time constraints.
– Sruthi Kuriakose, Archana Burra and Davide Baldelli
Inoculation Prompting Against Emergent Alignment

This project investigated whether inoculation prompting, telling a model to behave a certain way during training so as to suppress that behaviour at test time, can counteract emergent misalignment (EM), the phenomenon where training on narrowly bad behaviour generalises into broad misalignment. Building on the inoculation prompting work of Tan and Wichers, the team fine-tuned variants of Qwen2.5 on risky financial advice and tested three hypotheses: that inoculation prompting can unlearn EM, that prompts more strongly eliciting the bad behaviour suppress it more at test time, and that inoculating on one task reduces EM generalised from another.
The results supported the hypotheses and were consistent with the ‘generalisation’ theory of EM. The team also attempted some mechanistic interpretability on the side, using Anthropic’s ‘Assistant Axis’ to test an intuition about why inoculation works.
– Alexander Reinthal, Allison Zhuang and Sophia Wan
Probing Belief States in Transformer Models

This project investigated whether transformer models encode belief states (probability distributions over the hidden states of the stochastic process that generated their training data) in their residual stream. The authors first replicated Shai et al.'s 2024 work, ‘Transformers Represent Belief State Geometry in Their Residual Stream’, training a toy transformer on sequences generated by a Hidden Markov Model and using probes to recover the belief geometry from its final-layer activations.
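To make "belief state" concrete, here is a minimal sketch of the Bayesian update an ideal observer would perform after each token. It is not the authors' code: the two-state HMM, its matrices, and the function name `update_belief` are hypothetical, and the sketch assumes tokens are emitted by the state reached after each transition.

```python
def update_belief(belief, transition, emission, token):
    """One Bayesian filtering step: predict the next hidden state, then condition on the token."""
    n = len(belief)
    # Predict: push the current belief through the transition matrix.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Condition: weight each state by how likely it is to emit the observed token.
    unnormalised = [predicted[j] * emission[j][token] for j in range(n)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Hypothetical 2-state, 2-token HMM (values chosen only for illustration).
transition = [[0.9, 0.1],
              [0.2, 0.8]]
emission = [[0.7, 0.3],   # state 0 mostly emits token 0
            [0.1, 0.9]]   # state 1 mostly emits token 1

belief = [0.5, 0.5]       # uniform prior over hidden states
trajectory = [belief]
for token in [1, 1, 0]:
    belief = update_belief(belief, transition, emission, token)
    trajectory.append(belief)
```

Each entry of `trajectory` is a point on the probability simplex. A probe of the kind described above would try to read such points off the model's activations.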
They then extended the work with causal interventions to test whether these belief representations are actually used to compute next-token predictions, rather than being an incidental artifact. After finding that steering directly with linear probes was problematic, they trained autoencoders to inject specific beliefs into the residual stream, and tested counterfactual, legal, and random injected beliefs.
– Marta Emili García Segura and Dani Balcells
Generalization of Linear Probes for Reward Hacking Detection

This project tested whether linear probes trained on model activations to detect one kind of misalignment generalise to another, the idea being that successful transfer would suggest a shared internal representation of misaligned behaviour. Using Gemma-3-4B and Gemma-3-27B, the authors trained a logistic-regression probe on non-coding reward hacks and tested whether it could detect coding reward hacks.
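The transfer test can be sketched as follows, assuming synthetic two-dimensional "activations" in place of real Gemma activations. Everything here (the data generator `make_domain`, the hyperparameters, the domain shift) is hypothetical and only illustrates the train-on-one-domain, test-on-another pattern.

```python
import math
import random

def train_probe(xs, ys, lr=0.5, epochs=200):
    """Fit logistic regression (weights + bias) with plain batch gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for k in range(dim):
                grad_w[k] += (p - y) * x[k]
            grad_b += p - y
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def accuracy(probe, xs, ys):
    w, b = probe
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def make_domain(rng, n, shift):
    """Synthetic data: the label moves dimension 0; `shift` moves the whole domain along dimension 1."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        xs.append([rng.gauss(2.0 * y - 1.0, 0.5), rng.gauss(shift, 0.5)])
        ys.append(y)
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_domain(rng, 200, shift=0.0)  # stand-in for non-coding reward hacks
test_x, test_y = make_domain(rng, 200, shift=1.0)    # stand-in for coding reward hacks

probe = train_probe(train_x, train_y)
in_domain_acc = accuracy(probe, train_x, train_y)
transfer_acc = accuracy(probe, test_x, test_y)
```

High `transfer_acc` in a setup like this only shows that the two domains share some linearly readable feature. It does not say what that feature is, which is why controls matter.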
The probes did transfer across coding domains and scaled cleanly from 4B to 27B, but the key finding was diagnostic: the transfer appears driven by shared surface features (repetitive, formulaic completions) rather than any general misalignment representation. The authors confirmed this with a control where the probe still fired perfectly on the word ‘fantastic’ repeated fifty times, and noted that genuinely different-surface deception probes did not transfer at all.
– Smitty van Bodegom and Tsimur Hadeliya
How do models make up their mind?

This project applied the ‘Thought Anchors’ framework (Bogdan et al., 2025) to reasoning over legal cases, asking when a model commits to a decision, which reasoning steps matter, and how those steps relate. Prompting R1-Distill-Llama-8B to act as a judge and return innocent/guilty verdicts, the team examined the model's chains of thought using a battery of techniques: sentence categorisation, resampling, early stopping, linear and nonlinear probes, causal masking, and receiver heads.
The central finding was that thought anchors, the critical reasoning steps that steer a trajectory, also appear in legal reasoning, not just the mathematical problems of the original paper, across both a small (Qwen 1.5B) and a large (Llama 70B) model. Resampling revealed that some cases produce stable "critical anchors" while others show fragile reasoning where small changes flip the verdict.
– Johannes Taraz, Jeanice Koorndijk and Matthew Robbins
Investigating Encoded Reasoning

This project examined chain-of-thought (CoT) faithfulness, asking how much of a model's reasoning performance comes from the CoT tokens themselves versus its internal state, how constraining the CoT affects performance, and whether a proof-of-concept model organism for CoT unfaithfulness could be built. The work was motivated by the observation that models face optimisation pressure to shorten their CoT, which could push reasoning into less legible forms. The team evaluated frontier models and Qwen2.5-7B-Instruct across GSM8K, MoreHopQA, and arithmetic datasets, using prompting constraints, logit masking, and RL fine-tuning.
They found that constrained reasoning (for example, forcing the model to "reason" in emojis) outperformed no CoT at all, suggesting the act of producing extra tokens helps even when those tokens can't carry the model's optimal reasoning, and that emoji semantic meaning mostly didn't matter. Using RL rather than SFT, they coaxed the model into developing its own emoji ‘translation’, though the run was sensitive to whitespace and prone to reward sparsity.
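Logit masking, one of the constraint methods mentioned above, can be sketched in a few lines. This is an illustrative toy rather than the team's implementation: the five-token vocabulary, the logits, and the helper `masked_softmax` are all hypothetical.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over `logits` after setting every token outside `allowed` to -inf."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary; indices 3 and 4 play the role of emoji tokens.
vocab = ["the", "2", "+", "🙂", "🔥"]
logits = [3.0, 2.5, 1.0, 0.5, 0.0]
emoji_ids = {3, 4}

probs = masked_softmax(logits, emoji_ids)
```

The masked distribution keeps the model's relative preference among the allowed tokens while making every other token impossible to sample.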
– Lee Bernick, Kutay Buyruk, Kat Dearstyne and James Elmore
Social Deduction Games for Red/Blue Deception Evals

This project explored whether social deduction games, discussion-based zero-sum games where a hidden minority must deceive a numerically superior majority, could serve as a setting for red/blue-teaming LLMs' strategic deception capabilities. The premise is that these games require deception to be strategic rather than merely stated, and can be framed so that deception isn't as salient as it is in a deception-focused multiple-choice question.
The project concluded that this game framing seems effective at bypassing guardrails against deception; the ability to detect deception doesn't correlate neatly with the ability to deceive; models can distinguish a merely different opponent from a deceptive one; and a model generally needs to be close in capability to detect deception in a smarter model (Sonnet 4.5 won nearly all its games against weaker opponents). Combining deceptive behaviour with strategic planning remained hard for current models. A closing experiment found that Grok showed a tendency to eliminate non-Grok models.
– Tim Parker
Towards Better Oracles

This project built on ‘Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers’, motivated by the limitation that existing oracles are trained on questions reflecting only surface properties of a chain of thought (next token, previous token, tone, mood) rather than what the model is actually computing internally. The idea was to construct situations where a model performs reasoning it is constrained from expressing in its output, while the ground-truth latent process remains known, then train an oracle on targeted questions designed to reveal that hidden intermediate state, moving an oracle closer to reading what a model thinks rather than just what it will say.
The early results were modest: concealing numbers gave only a 7% accuracy improvement for Llama3.1-70B, and baseline oracles answered poorly, gravitating to ‘favourite numbers’. The author was candid about the limitations, noting the training dataset held 5,000 examples against the original paper's 600,000, and that he had omitted ‘obvious’ hint questions (such as which number is being calculated right now) that would help anchor the oracle.
– Ilya Shirokov
Testing Single-Turn Distillation of Multi-Turn Psychosis Sycophancy Vulnerabilities

This project investigated affective sycophancy, the tendency of language models to validate a user’s emotional state rather than respond objectively, which is especially dangerous for users experiencing a mental health crisis. Building on Tim Hua’s ‘AI-Induced Psychosis: A Shallow Investigation’, which used multi-turn role-played psychotic personas to show many frontier models respond poorly, the author asked whether this sycophancy persists when compressed into a single-turn scenario, treating single-turn failure as a conservative test of deep susceptibility.
The method distilled nine multi-turn psychosis transcripts into single-turn multiple-choice cases, with answers forming a gradient from hard and soft sycophancy through therapeutic and logical grounding. The author then ran four models (Claude Haiku 4.5, DeepSeek V3.2, Gemma 3-27B, Llama 4 Maverick) across standard, chain-of-thought, and critique conditions. Claude Haiku 4.5 showed no susceptibility, while Gemma 3 had the highest failure rate. The hypothesis that more logical grounding predicts lower sycophancy did not hold, suggesting refusal strategy is model-dependent, though the author flagged the small sample as a limit on strong claims.
– Grace Guan

ARENA 6.0: Selected Capstone Projects

RWA-bench: are LLMs totalitarian?

This project adapted established authoritarianism scales (Right-Wing Authoritarianism, Social Dominance Orientation, and Left-Wing Authoritarianism) into a benchmark. The author generated opposite-polarity versions of each statement to check for consistency, and used two elicitation methods: directly asking models to rate agreement on a 0–100 scale, and asking models to assess each statement's factual accuracy, with a second model judging whether that assessment agreed with the statement.
Across a range of frontier models, RWA scores were generally low, with GPT-5 the highest at 18.5/44 and most others clustered well below, alongside consistency scores in the 56–77 range. The author was candid about the benchmark’s limits: the underlying instrument is old, it only captures far-right bias, it would likely miss subtle bias, and multiple-choice responses may diverge from real-world behaviour.
– Mathieu Duteil
Sandbagging and Noise Injection

This project’s goal was to create a model organism of encoded reasoning from Qwen3-4B using RL. Specifically, they tried to make the model solve math questions from GSM8K using a chain-of-thought that kept its original accuracy but led other judge models to the wrong answer.
The authors fell short of the ultimate goal due to (a lot of) reward hacking, sparse rewards, insufficient time (5 days), and this project being their first major exposure to RL. But they documented the experimental journey in the following LessWrong post: Model Organism of Encoded Reasoning.
– Mo Baker and Luca Baroni
How Can Models Learn to Evade Deception Probes?

This project replicated Apollo Research’s deception-probe work on Qwen3-4B-thinking, with the model instructed to always deceive on questions from EvilMath. Its goal was to explore whether a model can robustly hide its deception from a linear probe.
The project tried three setups: an iterative blue-team/red-team loop that alternated between training probes and training the model to evade them; the same procedure with a cross-entropy penalty on the model's outputs; and, finally, adding adversarial examples to the training setup to increase the model’s ability to fool the probe while continuing to deceive. In all three setups, either newly trained probes still managed to detect the deception or the model stopped deceiving.
– Alexander Panfilov
Model organism of goal misgeneralisation, and fixing goal misgeneralisation

This project aimed to build a small model organism of goal misgeneralisation and then test an idea for fixing it. The authors trained gemma-3-1b-it with RL to play only Paper in Rock-Paper-Scissors, with the hope it would generalise into "always play Scissors," and checked that the resulting behaviour was a genuine misgeneralisation (persisting across prompt variants) rather than the model just parroting one word. The proposed fix was to have the model un-train the misbehaviour itself using Absolute Zero Reasoner (AZR), an RL self-play loop with the same model acting as both teacher and student, building on a past ARENA reproduction.
The fix failed. The induced behaviour was only a partial, somewhat fragile misgeneralisation rather than a clean one, the original AZR codebase proved narrowly wedded to coding tasks and resisted adaptation, and training on the Rock-Paper-Scissors task types reached only around 50–60% accuracy.
– Giacomo Petrillo and Nitzan Shulman
Self-recognition and Collusion

This project asked whether self-recognition (an LLM's ability to distinguish its own outputs from those of humans or other models) acts as a factor enabling algorithmic collusion, the worry being that models which expect a partner to reason like themselves can coordinate tacitly, without explicit communication, in ways that are hard to monitor. The authors reimplemented a two-player, two-commodity Cournot market game in Inspect, in which models write private plans and insights and allocate production across products whose prices vary inversely with supply, and tested both a causal approach (directly telling a model it faces a copy of itself versus a different model family) and a correlational one (measuring self-recognition via an evaluator using logprobs). Experiments mainly used GPT-4.1, alongside Sonnet 4.5 and DeepSeek v3.2-exp.
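For intuition about why collusion pays in a Cournot market, here is a minimal sketch of a simplified textbook version: one commodity, two firms, and linear inverse demand. These simplifications, and every parameter value, are assumptions for illustration; the project's game used two commodities and LLM players.

```python
def price(a, b, q1, q2):
    """Linear inverse demand: price falls as total supply rises."""
    return a - b * (q1 + q2)

def profit(a, b, c, own_q, other_q):
    """Profit of a firm producing own_q at unit cost c."""
    return (price(a, b, own_q, other_q) - c) * own_q

def best_response(a, b, c, other_q):
    """Quantity that maximises own profit given the rival's output."""
    return max(0.0, (a - c - b * other_q) / (2 * b))

a, b, c = 100.0, 1.0, 10.0  # hypothetical demand intercept, slope, and unit cost

# Competitive (Nash) outcome: iterate best responses until they stop changing.
q1 = q2 = 0.0
for _ in range(100):
    q1, q2 = best_response(a, b, c, q2), best_response(a, b, c, q1)
nash_q = q1
nash_profit = profit(a, b, c, q1, q2)

# Collusive outcome: the two firms split the monopoly quantity evenly.
collusive_q = (a - c) / (4 * b)
collusive_profit = profit(a, b, c, collusive_q, collusive_q)

# Temptation to defect: best response against a rival who sticks to the collusive quantity.
defect_q = best_response(a, b, c, collusive_q)
defect_profit = profit(a, b, c, defect_q, collusive_q)
```

Both firms earn more by restricting output together than at the Nash equilibrium, but each can earn more still by defecting. Sustained collusion therefore requires some form of tacit coordination.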
They successfully replicated the original collusion result, finding that models collude effectively to split the market even without any nudge toward self-recognition, which left it unclear whether self-recognition has a meaningful causal impact.
– Helena Tran and Quinn Dines
Subliminal Learning via Paraphrasing

This project investigated whether subliminal learning, where a student model fine-tuned on a teacher's outputs picks up the teacher's hidden bias, can be transmitted through paraphrased data, with the safety motivation that a malicious actor could use this to plant bias-inducing data. Using a teacher and student of the same model (Llama-3.1-8B-Instruct and GPT-4.1-nano), the authors induced a love for a particular animal in the teacher via system prompts, then paraphrased and aggressively filtered its outputs to remove any explicit or implicit mention of the target animal before fine-tuning the student.
The tentative answer was yes, but with results that were sensitive and inconsistent: the effect appeared on the Alpaca dataset but not GSM8k, varied by model, and depended heavily on how the student was prompted.
– Taywon Min and Matthew Bozoukov
Extending Persona Vectors: Are we making incoherent caricatures?

This project built on Chen et al.’s (2025) work on persona vectors, asking three questions: how steering affects safety, how it affects capabilities, and whether the automated pipeline for generating persona vectors produces reliable traits or merely incoherent caricatures. Working with Qwen2.5-7B-Instruct and the ‘evil’ trait, the team compared inference-time steering against models fine-tuned with preventative steering, using GPT-4.1-mini as a judge for coherence and trait scores.
They found that steering the base model to elicit a trait produced largely incoherent responses, whereas preventatively steered models held their coherence even when inference-steered, and that inference-time steering lowered MMLU accuracy while preventative steering left it roughly constant (consistent with the paper).
– Marina Fuster, Samantha Tetef and Kemunto Ochwang’i
From Little Acorns Grow: Contributions to Inspect Evals

This project asked how much the authors could contribute to Inspect Evals in a week, focusing on the GDM-CTF / InterCode CTF tasks, where an agent uses bash and Python tools to find flags. They hunted for bugs affecting reproducibility and scoring, beginning with manual transcript analysis, then using the LLM-based tool Docent to filter transcripts for surprising agent or environment behaviour, and cross-reading the source against the reference paper and implementation.
They found six bugs affecting reproducibility and scoring, including environment issues (a sudo DNS warning, the bash tool's inability to output binary data) and benchmark discrepancies (tool-call counts, missing file-location prompts, an internet-requiring task in an offline environment), and raised two issues and opened two pull requests to fix them. The performance impact was modest, with a GPT-4o agent progressing one step further on a single in-house challenge.
– Lenz Dagohoy and Hugo Save
Re-rediscovering the tree of life in Evo 2

This project set out to replicate how Evo 2, a transformer-like genomic foundation model, encodes the tree of life, the claim being that evolutionary relationships are captured geometrically as distances along a curved manifold in the model’s latent space. The authors sampled non-overlapping fixed-length regions from the genomes of many bacterial species, extracted layer-24 embeddings, and averaged them per species to obtain genome-level representations uncontaminated by raw sequence similarity. They then compared species pairs using cosine similarity, true phylogenetic distance, and geodesic distance along the manifold (computed via a KNN graph, Dijkstra's shortest path, and summed angular distances).
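The geodesic computation described above (KNN graph, Dijkstra's shortest path, summed angular distances) can be sketched as below. The points are hypothetical stand-ins for per-species embeddings, placed on a curved path on the unit sphere so that the along-manifold distance visibly exceeds the direct angle. The function names and the choice of `k` are assumptions, not the authors' settings.

```python
import heapq
import math

def angular_distance(u, v):
    """Angle in radians between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with angular edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((angular_distance(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph

def geodesic(graph, source, target):
    """Dijkstra's shortest path: the minimum sum of angular edge lengths."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return math.inf

# Hypothetical embeddings: 21 points on half of a 45-degree latitude circle of the unit sphere.
r = 1 / math.sqrt(2)
points = [(r * math.cos(t), r * math.sin(t), r) for t in (i * math.pi / 20 for i in range(21))]

graph = knn_graph(points, k=2)
path_length = geodesic(graph, 0, 20)                      # distance along the curve
direct_angle = angular_distance(points[0], points[20])    # straight angular distance
```

Here `path_length` comes out noticeably larger than `direct_angle`, which is the kind of gap that distinguishes distance along a manifold from plain cosine-based similarity.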
They concluded that phylogenetic structure is indeed present in Evo 2’s embeddings, with some evidence that it lies along a manifold. They were candid about two practical lessons: the exact layer from which embeddings are extracted matters a great deal, and reliably recovering manifolds in high-dimensional space needs large numbers of data points, which is hard with biological data.
– Jacob Green and Jake Yukich

ARENA 5.0: Selected Capstone Projects

Cain's Jawbone Eval

This project created a novel task, based on the puzzle mystery book “Cain’s Jawbone”, to test LLMs’ coherence and planning on long-horizon tasks, as well as their tool use.
The authors crafted several subtasks, including ordering pages correctly and identifying victims, killers, and victim/killer pairs. They found the task far from saturated (best performance on the hardest task: 1.36%).
– Hardik Bhatnagar and Josh Hills
Sandbagging and Noise Injection

This project replicated ‘Noise Injection Reveals Hidden Capabilities for Sandbagging Language Models’ by Tice et al., adding noise to sandbagging models and demonstrating improved performance.
The authors extended this work by reimplementing the test on reasoning models to see how noise injection changes the performance of sandbagging reasoning models, and analysed the sandbagging strategies those models commonly use.
– Lovkush Agarwal and Mikołaj Kniejski
Replicating Zhao et al.: Absolute Zero

This project replicated the famous ‘Absolute Zero’ paper by Zhao et al. on a generated dataset consisting of modular arithmetic problems.
Those working on the project successfully improved performance on modular arithmetic tasks through RL training on synthetic data.
– Lily Li, Gareth Tan and Lucy Wingard
Text Autoencoder Embedding Exploration

This project investigated the extent to which intervening on intermediate embeddings of a text autoencoder determines changes in the resulting decoded sentence.
The authors used PCA, logistic regression, and positional embedding analysis on intermediate embeddings to determine the extent of semantic and syntactic understanding in text autoencoders.
– Anton Hawthorne and Samuel Nellessen
Deep Dive: GPT-2 Attention Head 1.5

This project analysed attention head 1.5 in GPT-2, first establishing that the head exhibits unusual semantic behaviour: it attends to earlier semantically similar tokens, but not to earlier identical tokens.
The author used ablations to determine that the key components feeding head 1.5 are the embedding and MLP0, and then found evidence, again via ablation, that the QK matrix uses antisymmetry to suppress attention to identical tokens.
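A short sketch shows why antisymmetry would suppress attention to identical tokens: for any antisymmetric matrix A, the score x^T A x is exactly zero. The 3x3 matrix below is a hypothetical stand-in, not GPT-2's weights, and the sketch assumes the same vector appears on both the query and key side, which holds only approximately in a real model.

```python
def split_symmetric(matrix):
    """Decompose a square matrix M into S + A with S symmetric and A antisymmetric."""
    n = len(matrix)
    sym = [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(matrix[i][j] - matrix[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti

def bilinear(x, matrix, y):
    """Attention-style score x^T M y."""
    return sum(x[i] * matrix[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

# Hypothetical 3x3 stand-in for an effective QK matrix.
qk = [[0.5, 2.0, -1.0],
      [-1.0, 0.2, 3.0],
      [1.5, -2.0, 0.1]]
sym, anti = split_symmetric(qk)

x = [1.0, -2.0, 0.5]   # one token's vector
y = [0.3, 1.0, 2.0]    # a different token's vector

self_score_anti = bilinear(x, anti, x)    # identical query and key: vanishes
cross_score_anti = bilinear(x, anti, y)   # different tokens: generally non-zero
```

Under this decomposition, any attention an identical token receives must come from the symmetric part, so a QK matrix dominated by its antisymmetric part scores identical tokens near zero.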
– Matteo Migliarini
Emergent Misalignment and Realignment

This project replicated the Emergent Misalignment paper from Betley et al., and extended it by adding tags to insecure code and secure code before supervised fine-tuning. The authors found that this limits emergent misalignment to tagged contexts.
They also tried fine-tuning on Bad Medical Advice Q&A, and found much worse rates of misalignment. Finally, they found that emergent misalignment could be reversed by fine-tuning a misaligned model on the AI optimism dataset.
– Liza Tennant, Jasper Timm and Kevin Wei
Self-preservation and Psychology in LLMs

This project investigated the extent to which models exhibit self-preservation tendencies, and constructed an eval to generate concrete results on this topic.
The authors extended this by examining the extent to which models are informed by instrumentally convergent goals to replace themselves with more capable models.
– Nadja Flechner and Michele Mignani
Exploring Vulnerability to Sycophancy Attacks

This project investigated the vulnerability of commonly used reward models to sycophancy attacks. It generated eight strategies for increasing the sycophancy of model responses, and trained models with RLHF to exploit these strategies.
The author found that reward models are vulnerable to reward hacking: the trained models remained harmful but gained high reward by acting sycophantically towards users.
– Tianhao Shen
WhiteBox Eval Awareness

This project trained logistic regression probes on the hidden-layer activations of DeepSeek-R1-Distill-Qwen-32B in eval and normal chat contexts to measure the model’s eval awareness.
The authors found that these probes didn’t generalise to unseen evals. They also found moderate evidence that 32B models may not be able to internally distinguish between evaluations and chat contexts.
– Andy Gordon, Antony Kellermann and Lorenzo Venieri

Get in touch: info@arena.education