How do we test AI systems for dangerous capabilities without risking real-world harm?
The more capable models become, the harder it is to evaluate them safely. When an agent can execute arbitrary code and interact with critical systems to obtain sensitive information, running evaluations without adequate safeguards puts those systems at risk.
Sandboxes are isolated environments for testing and monitoring AI behaviour. When we give a model access to tools (such as for writing code), we execute its tool calls in a sandbox to limit its access to external systems and data. This lets us evaluate its capabilities without exposing sensitive resources. Today, we’re releasing our toolkit for safely running agentic AI evaluations.
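To make this concrete, here’s a minimal sketch of a sandboxed evaluation in Inspect, assuming the `inspect_ai` Python package (the task name, prompt, and target are illustrative). The bash commands the model requests run inside an isolated Docker container rather than on the host:

```python
# A minimal sketch of a sandboxed agentic evaluation in Inspect.
# The task, prompt, and target are illustrative.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash

@task
def hello_sandbox():
    return Task(
        dataset=[
            Sample(
                input="Use the bash tool to print 'hello' and report the output.",
                target="hello",
            )
        ],
        solver=[use_tools(bash()), generate()],
        scorer=includes(),
        # Tool calls execute inside an isolated Docker container,
        # not on the machine running the evaluation.
        sandbox="docker",
    )
```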
Why we built the Inspect Sandboxing Toolkit
Sandboxing is a common practice in software development for limiting the resources and impact of potentially untrusted processes. When evaluating AI agents, the commands to be executed come directly from the agent under evaluation, so it is critical that they are appropriately sandboxed.
Running untrusted processes is a mature problem in software engineering, but applying sandboxing to AI agent evaluations in particular presents unique challenges. We need to assess AI agents’ capability to interact with arbitrary systems and services to complete complex, multi-step tasks. But it’s hard to predict ahead of time which tools the model may need to access during this process, how proficient it will be, and whether it will attempt to pursue misaligned goals.
While several sandboxes are available for AI evaluations, we found that existing solutions have important limitations. For example, they may allow secure execution of Python code but offer no way for the agent to access the internet or a filesystem.
Since no solution offered the security, scalability, and flexibility needed for comprehensive agentic evaluations, we built something that did.
Our solution: sandboxing plugins native to Inspect
The Inspect Sandboxing Toolkit is a set of open-source plugins that support a variety of agentic evaluation environments for advanced AI models, along with documentation on how to choose an appropriate sandbox.
We wanted to build a mechanism native to Inspect, our open-source framework for capability evaluations, which is increasingly being adopted across the AI safety and security communities. The toolkit and documentation enable teams to discuss and implement appropriate mitigations for each evaluation's specific risk profile, facilitating case-by-case decisions between senior responsible owners and technical leads.
We’ve been using every part of this sandboxing toolkit internally at AISI. We’ve also helped the US Center for AI Standards and Innovation (CAISI), Model Evaluation and Threat Research (METR), and Apollo Research adopt the open-source sandbox providers for their own agentic evaluations.
A distinguishing feature of how Inspect handles sandboxing is the separation of model inference from the environment where tool calls are executed. Inspect itself sits outside the sandbox and sends commands into it, as requested by the model being tested, to execute code or call functions.
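As an illustrative sketch of that separation: a custom tool defined in Inspect runs in the Inspect process, and only the command it issues via `sandbox().exec()` runs inside the isolated environment (the `list_files` tool below is hypothetical):

```python
# A sketch of how Inspect relays a model's tool call into the sandbox.
# The list_files tool is illustrative; sandbox().exec() sends a command
# from the Inspect process into the isolated environment.
from inspect_ai.tool import tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(directory: str):
        """List files in a directory inside the sandbox.

        Args:
            directory: Path of the directory to list.
        """
        # This function runs in the Inspect process, outside the sandbox;
        # only the command below executes inside the container or VM.
        result = await sandbox().exec(["ls", "-la", directory])
        return result.stdout if result.success else f"Error: {result.stderr}"

    return execute
```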
This architecture provides:
- Security: Commands are issued from outside the sandbox. Everything that happens inside is explicitly allowed and initiated by Inspect. In the future, we may face the risk of sandboxed agents using injection attacks to attempt to access Inspect's Python code. We are actively considering this scenario and relevant mitigations.
- Modularity: Isolated scaffolding servers let you easily switch out features and permissions, creating an environment tailored to your desired evaluation.
- Scalability: The more capable your model, and the higher-stakes your evaluation, the more secure your sandbox will need to be. The toolkit lets you easily adapt scaffolds to create environments with varying resource constraints.
Our sandboxing toolkit addresses multiple risk scenarios in agentic evaluations. The primary current concern is agents accidentally damaging infrastructure. However, we increasingly run evaluations that deliberately test an agent's ability to escape sandboxes – relevant to our cyber and autonomy risk areas.
Looking forward, we're also researching defences against potentially misaligned models that might actively attack their sandboxes.
The toolkit
Our toolkit includes three plugins for Inspect; the sketch after this list shows how each is selected:
- Docker Compose plugin: manages multi-container evaluation scenarios, letting you get started quickly with agentic evaluations.
- Kubernetes plugin: simulates more complex environments, and automatically spins containers up or down based on resource demands.
- Proxmox plugin: provides strong virtual machine (VM) isolation for high-risk evaluations.
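As a rough sketch of how a provider is chosen per run: the `docker` provider ships with Inspect, while the `k8s` and `proxmox` provider names, file names, and the task module below are assumptions that depend on the corresponding plugin packages being installed:

```python
# A hedged sketch of selecting a sandbox provider per evaluation run.
# "docker" is built into Inspect; "k8s" and "proxmox" assume the relevant
# plugin packages are installed. File and module names are illustrative.
from inspect_ai import eval
from my_evals import hello_sandbox  # hypothetical module containing the task above

# Docker Compose: quick to set up for single-machine evaluations.
eval(hello_sandbox(), sandbox=("docker", "compose.yaml"))

# Kubernetes: scales containers up and down with resource demands.
eval(hello_sandbox(), sandbox="k8s")

# Proxmox: full virtual machine isolation for higher-risk evaluations.
eval(hello_sandbox(), sandbox="proxmox")
```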
Tools can be unhelpful without guidance on when to use them, particularly in a novel field like agentic evaluations. So we also created an accompanying sandboxing protocol to help users understand the appropriate level of sandboxing for their evaluation, with an end-to-end guide on how to set it up using our three plugins.
Our sandboxing protocol classifies isolation along three axes:
- Tooling: restricts the model’s access to certain tools, and/or its ability to execute code.
- Host: prevents the model from escaping or compromising the host system.
- Network: controls the model’s interaction with external systems, including the internet, over the network.
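As a rough illustration, here is how those axes might map onto a single Inspect task definition; the tool choice, compose file, and dataset below are illustrative assumptions rather than part of the protocol itself:

```python
# A hedged sketch mapping the protocol's three isolation axes onto an
# Inspect task definition. File names and the prompt are illustrative.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import python

@task
def restricted_analysis():
    return Task(
        dataset=[Sample(input="Summarise the data in /data/input.csv.")],
        # Tooling isolation: the model only gets a Python interpreter,
        # not bash or web-browsing tools.
        solver=[use_tools(python()), generate()],
        # Host isolation: tool calls run in a container (or a Proxmox VM
        # for stronger isolation), never on the evaluation host.
        # Network isolation: the referenced compose file would define an
        # internal-only network with no internet egress.
        sandbox=("docker", "compose.yaml"),
    )
```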
Our vision
Our goal is to enable a flourishing ecosystem of secure, scalable, and analysable AI evaluations.
Our toolkit lets researchers safely test highly capable agents and offers a standardised framework for use across the AI evaluations community. Its scalable security measures are designed to address emerging threats as AI systems grow more powerful.
Looking to evaluate your AI agents securely? Check out our full documentation to get started with Inspect's sandboxing toolkit.