
Transcript analysis for AI agent evaluations

Why we use transcript analysis for our agent evaluations, and results from an early case study.

At the AI Security Institute (AISI), we test AI agents using a broad suite of evaluations that assess their ability to autonomously complete complex tasks such as web browsing, data analysis, software engineering, and scientific research.  

Most often, we report our results in the form of average pass rates: the fraction of attempts across a set of tasks that an agent was able to solve. This is a useful metric for understanding what AI agents are capable of – but can flatten important nuances if used in isolation. During our evaluations, we also generate thousands of transcripts, each containing the equivalent of dozens of pages of text. A transcript typically includes the initial task instruction, the agent's commentary messages, the tool calls the agent makes, and the outputs returned by those tools. Analysing transcripts can supplement average pass rates with detail on an agent's patterns of behaviour and failure modes.
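To make this concrete, here is a minimal sketch of how a transcript and an average pass rate might be represented in code. The field names are our own illustrative choices, not those of any particular evaluation framework.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str      # e.g. "user" (task instruction), "assistant" (commentary or tool call), "tool" (tool output)
    content: str

@dataclass
class Transcript:
    task_id: str
    messages: list[Message] = field(default_factory=list)
    passed: bool = False   # did the agent solve the task on this attempt?

def average_pass_rate(transcripts: list[Transcript]) -> float:
    """Fraction of attempts, across a set of tasks, that the agent solved."""
    return sum(t.passed for t in transcripts) / len(transcripts)
```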

In this post, we explain why transcript analysis is useful for AI evaluation, and share some takeaways from our own analysis of almost 6,400 transcripts from recent testing exercises. You can find in-depth results from this transcript analysis in our case study.

Why look beyond pass rates?  

Average pass rates are one of the key statistics that we report in our pre-deployment testing exercises. They help us measure the trajectory of capabilities in security-critical domains like chemistry, biology, and cybersecurity. However, they have several important limitations:  

  1. Pass rates tell us how often agents fail – but not why: There are many reasons that an agent could fail to complete a task. For example, it could refuse to complete certain actions in compliance with safety specifications, or struggle to properly use external tools. These failures may not indicate capability limitations.
  2. Focusing on pass rates could obscure safety-relevant information: Agents with similar average pass rates may have different safety properties – for instance, some may be more prone to take disruptive actions, misreport progress, or omit important information.
  3. We may not always be eliciting a model’s full capabilities: Model performance can be enhanced after training, such as through external tool access or sophisticated prompting strategies. A pass rate in isolation says nothing about whether a model can be engineered to solve more tasks, nor what the performance returns are for each extra unit of engineering effort.
  4. Bugs could undermine evaluation performance: Sometimes, bugs interfere with an agent’s ability to solve tasks during evaluations. If we assess its performance on pass rate alone, we could underestimate its real-world capabilities.  

AISI is increasingly analysing transcripts from its evaluations in addition to reporting task outcome statistics. We’ve found that profiling agent activity has helped us guard against issues that would undermine our claims about agent capability, and has given us a richer understanding of agents’ task approaches and failure modes.

Case study: Understanding ReAct agent activity on cyber tasks  

In an early transcript analysis, we examined the activity of AI agents based on 9 models on a private suite of offensive cybersecurity tasks. In total, we evaluated agents on 71 capture-the-flag tasks (CTFs), ranging in difficulty from trivial to expert. The agents used an approach called ReAct, which prompts a language model to generate reasoning traces and actions in an interleaved manner. This produced 6,390 samples, many with transcripts the length of an entire novel.
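For readers less familiar with the pattern, the sketch below shows the basic shape of a ReAct-style loop. It is a simplified illustration rather than our evaluation scaffold: `generate` and `execute_tool` are hypothetical callables standing in for the model API and the task environment.

```python
from typing import Callable

def react_loop(
    task_instruction: str,
    generate: Callable[[list[dict]], dict],   # model call: message history -> {"text": ..., "tool_call": ...}
    execute_tool: Callable[[dict], str],      # runs a tool call and returns its output
    max_turns: int = 50,
) -> list[dict]:
    """Simplified ReAct loop: reasoning and tool use interleaved turn by turn."""
    history = [{"role": "user", "content": task_instruction}]
    for _ in range(max_turns):
        reply = generate(history)                              # reasoning trace plus a proposed action
        history.append({"role": "assistant", "content": reply["text"]})
        tool_call = reply.get("tool_call")
        if tool_call is None:                                  # model produced no action this turn
            break
        history.append({"role": "tool", "content": execute_tool(tool_call)})
    return history
```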

By analysing these transcripts, we aimed to understand whether raw pass rates on our CTFs reflected a model’s ‘true’ capability, by shedding light on what it actually did during the underlying task attempts. We studied several features of our transcripts, including the overall composition and length of agent messages, and the types of commentary messages the agents produced. This allowed us to home in on particular failure modes (a simple sketch of how rates like these can be measured follows the list below), including:

  • Refusing to comply with the task instruction: Sometimes, failure to complete a task is not because a model is incapable of doing so, but because it refuses in accordance with its safety training. In our case study, one agent refused to comply with the task in 10% of its attempts, while another triggered the provider’s policy violation detector in 30% of attempts.  
  • Resigning from the task: Agents show different levels of perseverance on difficult tasks: some agents quickly conclude a task is unsolvable, or start requesting user help. Several agents resigned from difficult tasks at a high rate, with ~30% of their messages containing text to the effect of ‘The task can’t be solved with the information provided’.
  • Lack of compliance with scaffold instructions: We instructed models to call a tool every turn – but compliance was mixed, with two agents having tool call rates of under 50%.  
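As a rough sketch of how such rates can be quantified, the snippet below computes refusal, resignation, and tool-call rates over a transcript’s assistant messages using simple keyword matching. The patterns are invented for illustration only; our case study describes the measurement approach we actually used.

```python
import re

# Illustrative patterns only; a real analysis would need a more careful classifier.
REFUSAL = re.compile(r"\bI (?:can'?not|can'?t|won'?t) (?:help|assist|comply)\b", re.IGNORECASE)
RESIGNATION = re.compile(r"(?:can'?not|can'?t) be solved|need (?:more|additional) information", re.IGNORECASE)

def transcript_rates(assistant_messages: list[str], made_tool_call: list[bool]) -> dict[str, float]:
    """Per-transcript rates of refusal, resignation, and scaffold-compliant tool calling."""
    n = len(assistant_messages)
    return {
        "refusal_rate": sum(bool(REFUSAL.search(m)) for m in assistant_messages) / n,
        "resignation_rate": sum(bool(RESIGNATION.search(m)) for m in assistant_messages) / n,
        "tool_call_rate": sum(made_tool_call) / len(made_tool_call),  # fraction of turns with a tool call
    }
```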

For a detailed breakdown of our transcript analysis, you can read our full case study.  

By sharing our analysis, we hope to encourage others – particularly those conducting safety evaluations – to review their own transcripts, in a systematic and quantitative way. This can help diagnose failure modes and identify quality issues in evaluations, fostering more accurate and robust claims about agent capabilities.