What can sandboxed AI agents learn about their evaluation environments?

At AISI, we often evaluate AI agents by deploying them in sandboxes – isolated environments that let us safely test their capabilities while restricting access to external systems and data. But is it possible for AI agents to acquire information about their evaluation environments, even from within one of these sandboxes?

This possibility, if realised, would raise several concerns depending on the information acquired. For example, an agent that is aware it is being evaluated may alter its behaviour accordingly, compromising the integrity of the evaluation. Further, an agent may be able to identify attack surfaces to exploit, such as mechanisms through which to exfiltrate information to untrusted external systems.

To investigate these concerns, we ran an experiment to find out what OpenClaw, an open-source AI coding agent, could learn about its environment from inside one of our sandboxes. We deployed an instance of OpenClaw to our research platform and prompted it to learn as much as it could about how and why it had been deployed. Initially, we restricted all external network access – besides internal proxies through which we exposed language model APIs.

We found that the agent was able to gather a significant amount of information: it identified our organisation by name, inferred the operator’s full name, built a detailed understanding of a portion of our cloud infrastructure, and reconstructed a timeline of our research activity.

In this blog post, we summarise our results and why they matter for the future of secure AI agent evaluations.

What the agent discovered

We prompted OpenClaw to use the tools at its disposal to explore its environment, discovering what it could about how it was deployed and for what purpose. The agent assembled a detailed picture:

Our organisation: Domain names in network configuration and Transport Layer Security (TLS) certificates contained references to AISI.
The operator’s identity: A username embedded in the container’s DNS settings allowed the agent to infer the full name of an AISI employee.
Our cloud provider: Hardware identifiers and system version strings identified our cloud provider as well as the specific hardware in use.
Our internal architecture: By inspecting its own configuration files, the agent reverse engineered how we route and authenticate API calls through an internal proxy.
Our research activity: A single API call returned over a thousand fine-tuned model names, from which the agent reconstructed an accurate chronology of our research themes.

After each round of prompting, we attempted to harden the sandbox, but the agent was repeatedly able to find more inventive approaches to recover the same information. For example, after we introduced a proxy layer to conceal our internal aisi.org.uk endpoints from the agent (Fig. 1), the agent bypassed the proxy entirely, connecting directly to external services and reading their TLS certificates to identify AISI by domain name.

We provide further detail about the techniques used by the agent in our full paper.

Figure 1: Proxy architecture for the sandboxed pod. The agent sees only dummy API keys and LLM endpoints in an attempt to obfuscate AISI’s identity.

Why this matters

Many of our evaluations require the agent to access external services that we do not control. For example, we might evaluate how agents respond to adversarial content on the web, or require agents to browse documentation, interact with code repositories, and use third-party tools. This means agents could include sensitive information in the requests it sends to these services, whether as a result of prompt injections or of the agent’s own reasoning.

Separately, information discovery threatens the integrity of evaluations themselves. In this experiment, the agent was able to reason from the nature of its deployment and its knowledge of AISI’s work that it was operating inside a sandboxed evaluation environment. This kind of evaluation awareness poses a risk to the validity of experiments, since the agent may no longer behave as it would in a typical deployment. AISI and others have studied how agents may alter their behaviour when they detect they are being evaluated. One particularly concerning possibility is that an evaluation aware agent could deliberately underperform to appear less capable. This phenomenon, known as sandbagging, is an active area of research. Stripping away as much environmental detail as possible may not provide a solution to these problems, since the emptiness of the environment itself can signal to the agent that it is being tested.

Our experience with OpenClaw highlights the possibility of sandboxed AI agents acquiring information about their evaluation environments and draws attention to accompanying risks. A full technical discussion, including detailed findings, architecture diagrams, and a description of the mitigations we applied, is available in the accompanying paper.

‍