AI control aims to design measures that will prevent future systems from causing harm, under the worst-case assumption that they are not aligned with our interests.
We can sort control techniques into two categories: black box measures and white box measures. Black box control relies on monitoring a model’s external behaviour, such as its chain-of-thought and final outputs. White box control, on the other hand, involves observing or modifying its internal activations.
At the AI Security Institute (AISI), we’ve assembled a dedicated White Box Control Team. Our current focus is on detecting sandbagging, a phenomenon where models strategically underperform during evaluations. Today, we’re releasing a detailed progress update describing some preliminary results.
In this blog, we introduce the concept of white box control and provide some brief insights from our research so far.
What is white box control?
White box control involves altering or observing what’s going on inside a neural network in an effort to prevent harm from misaligned models.
It is an application of mechanistic interpretability, a field that aims to understand the inner workings of AI models. White box control and mechanistic interpretability can employ many of the same tools, such as linear probes (which can isolate representations within specific layers of a network) and sparse autoencoders (which help disentangle neural networks into a more interpretable form).
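To make the probe idea concrete, here is a minimal sketch: a logistic-regression classifier fitted on a model's activations at a single layer to separate two behaviours. The model name, layer index, and two-example training set are illustrative placeholders rather than our actual pipeline.

```python
# Minimal sketch of a linear probe: fit a logistic-regression classifier on a
# model's residual-stream activations at one layer to separate two behaviours.
# Model name, layer index, and the labelled prompts are illustrative choices.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM with hidden states
LAYER = 16                                   # illustrative middle layer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, output_hidden_states=True, torch_dtype=torch.bfloat16
)
model.eval()

def last_token_activation(prompt: str) -> np.ndarray:
    """Activation of the final token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, d_model)
    return hidden[0, -1].float().numpy()

# Labelled examples: 1 = behaviour of interest (e.g. instructed to underperform),
# 0 = baseline. In practice this would be many transcripts, not two strings.
prompts = ["<prompt where the model is told to underperform>",
           "<prompt where the model is told to answer normally>"]
labels = [1, 0]

X = np.stack([last_token_activation(p) for p in prompts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)
# probe.predict_proba(new_activations) can then score unseen activations.
```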
White box control aims to leverage our understanding of how neural networks function to mitigate potential harms – either by actively monitoring a model’s internal behaviour, or by directly intervening in it (using feature steering, for example).
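As a rough illustration of the intervention side, the sketch below adds a fixed steering vector to one layer's output during generation via a forward hook. The layer choice, scale, and (random) direction are placeholders; in practice the direction would come from a probe or a difference of activation means.

```python
# Minimal sketch of an intervention: add a fixed "steering vector" to one
# layer's output during generation via a forward hook. Model name, layer,
# scale, and the random direction are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"
LAYER = 16
SCALE = 4.0

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# In practice this direction would come from a probe or a difference of means;
# here it is a random unit vector purely to show the mechanics.
direction = torch.randn(model.config.hidden_size, dtype=torch.bfloat16)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the
    # hidden states; handle both tuple and tensor outputs to be safe.
    if isinstance(output, tuple):
        hidden = output[0] + SCALE * direction.to(output[0].device)
        return (hidden,) + output[1:]
    return output + SCALE * direction.to(output.device)

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    ids = tokenizer("Explain what a linear probe is.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=50)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook afterwards
```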
Advantages of white box control
There’s plenty of progress we can make on the problem of controlling misaligned AI systems using black box techniques! For example, we could have weaker models monitor the outputs of stronger ones, check AI-generated code for backdoors, or restrict the information available to each instance of a model. These measures could be very useful for AI security even without an understanding of the models’ inner workings.
However, we know that a model’s outputs are not always representative of its internal behaviour. For example, chains-of-thought may be “unfaithful”, meaning they don’t reflect the model’s true reasoning process, and research suggests this problem may worsen as model size increases. In particular, instances of sandbagging (the focus of our current research) may appear indistinguishable from honest mistakes, even when a full chain-of-thought is available. Worse, we may eventually enter a regime where AIs reason in unrestricted latent space rather than in natural language, outputting an opaque “language” known as Neuralese. Developing white box control techniques helps us prepare for scenarios in which external behaviours become either misleading or insufficient to understand a model’s reasoning process.
Our work on sandbagging detection
Over the past few months, we’ve run several experiments to study the efficacy of white and black box methods for detecting sandbagging in toy experimental set-ups.
We tested two open-source language models, Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct, in a variety of evaluation settings. We first experimented with set-ups where the model is explicitly prompted to sandbag, to determine whether it is able (if not inclined!) to do so.
So, how reliably were we able to detect this behaviour? We found that deception probes (a white box method) worked extremely well at detecting sandbagging in the smaller of the two models when it was prompted to perform relatively simple tasks. An automated monitor with access to model outputs (a black box method) struggled in the same setting, providing some evidence that probes can outperform black box monitors at sandbagging detection.
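To illustrate how a probe-based monitor operates on activations rather than outputs, the sketch below (building on the probe sketch above) scores every token position of a transcript and flags it when the average score crosses a threshold. The aggregation rule and threshold are illustrative placeholders, not our calibrated set-up.

```python
# Sketch of using a trained probe as a monitor: score every token position in a
# transcript and flag it if the mean "deception-like" score crosses a threshold.
# `probe` follows the earlier sketch; the threshold is an illustrative value
# that would in practice be calibrated on held-out honest transcripts.
import numpy as np

THRESHOLD = 0.6  # placeholder, not a calibrated value

def score_transcript(activations: np.ndarray, probe) -> tuple[float, bool]:
    """activations: (num_tokens, d_model) activations from one transcript."""
    token_scores = probe.predict_proba(activations)[:, 1]  # per-token scores
    mean_score = float(token_scores.mean())
    return mean_score, mean_score > THRESHOLD
```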
Unfortunately, we also found that increased model size and task complexity caused the performance of both white and black box monitors to deteriorate. In our second set-up, we prompted Llama-3.3-70B-Instruct, the larger of the two models, to perform a multi-stage, interactive coding task. The accuracy of our white box monitor decreased, in part because it struggled to distinguish sandbagging from the higher rate of honest errors that came with increased task complexity. But this didn’t appear to be the only reason – accuracy also decreased when we prompted the larger model to perform simple tasks. We plan to investigate why monitor accuracy is sensitive to model size in future research.
As well as studying how well our methods work for weeding out prompted sandbagging, we wanted to explore whether sandbagging is already occurring in the wild. We used a black box monitor to analyse transcripts from past testing runs across AISI, to check for evidence of out-loud reasoning about intent to underperform. We found no such evidence – suggesting that models are either not sandbagging or doing so in ways that current evaluation methods are unable to detect. We assess that the latter is unlikely, though we can’t conclusively rule it out.
Going forward
As AI models grow more powerful, accurately assessing their capabilities will become more important. Underestimating what models can do may lead to insufficient mitigations and make it harder to build consensus around the risks that powerful AI could pose. We think white box methods for sandbagging detection and mitigation may play a crucial role, especially if unfaithful or uninterpretable chains-of-thought mean we can no longer rely on black box monitoring.
We plan to continue our work on sandbagging across a wider range of models and evaluation scenarios, and to explore ways to prevent monitor accuracy from deteriorating as model size and task complexity increase. Specifically, we will stress-test the effectiveness of our methods through red / blue team games, in which red teams develop sandbagging strategies that blue teams attempt to counteract.
Our full progress update contains details of each experiment we ran, preliminary results, and key reflections.