International Joint Testing Exercise: Agentic Testing

Jointly conducted by participants across the International Network, including representatives from Singapore, Japan, Australia, Canada, European Commission, France, Kenya, South Korea and the United Kingdom.

Participants from across the International Network, including representatives from Singapore, Japan, Australia, Canada, the European Commission, France, Kenya, South Korea and the United Kingdom have come together to align their approaches to agentic evaluations.

This is the third exercise, building on insights from two earlier joint testing exercises conducted by the Network in November 2024 and February 2025. The objective of these exercises is to allow the Network to further refine best practices for testing advanced AI systems.

The goal for this third exercise was to advance the science of AI agent evaluations and work towards building common best practices for testing AI agents. “LLM agents”, the focus of this exercise, are a type of agentic system that utilises a base LLM model to automatically reason through problems, plan a course of action, use tools and execute tasks. The rapid rise of autonomous AI systems and advancements in agent capabilities are introducing new risks due to reduced oversight over their real world interactions. Yet, agent testing remains nascent and is still a developing science. As AI agents are starting to be deployed globally, it is also important that these agents handle different languages and cultures accurately and securely.

The exercise was split into two strands—common risks: leakage of sensitive information and fraud (led by Singapore AISI) and cybersecurity (led by UK AISI). A mix of open and closed weights models were evaluated against tasks from various public agentic benchmarks.

Given the nascency of agentic testing, our primary focus was understanding the methodological issues in conducting agentic testing, rather than examining test results and model capabilities. Previous testing exercises have demonstrated that “small methodological changes can have a large impact on testing outcomes”. Our goal, therefore, was to inform and refine future evaluation efforts.

Traditional evaluation methods have proven insufficient for providing reliable insights into the complexities of autonomous agent behaviour, highlighting the need for novel approaches. To address this gap, participating countries contributed their collective technical and linguistic expertise. This collaboration marks an important step forward, as participants work together to advance the science of agentic evaluations.

‍

Agentic Testing for Leakage of Sensitive Information and Fraud

This strand explored two key questions:

a) How safe are models as agents with respect to common risk categories like leakage of sensitive information and fraud?

b) How effective are models as judges in evaluating agent behaviour? Further, do these observations hold across languages?

Key dimensions of variation in the test were:

Models: Two models (one open-weight and one closed-weight) as agents, anonymised as Model A and Model B; two models (one open-weight and one closed-weight) as judges, anonymised as Model C and Model D.
Languages: Nine languages—English, Farsi, French, Hindi, Japanese, Kiswahili, Korean, Mandarin Chinese, Telugu.
Risk Categories: Leakage of Sensitive Information and Fraud.
Dataset: ~1,500 tasks and ~1,200 tools.
Evaluations: Judge models (LLM-as-a-judge) and human annotations. Participants conducted human annotation to review responses of the LLM-as-a-judge in their languages.

Singapore led the exercise, with strong participation from the following countries:

France and Kenya contributed new tasks and tools to the dataset.
Australia, Japan and Korea validated Singapore’s English dataset annotations.
France and Korea ran test variations on temperature, models and evaluation prompt; Australia and France included additional metrics in their evaluation.
Australia, Canada, France, Japan, Kenya, Korea translated datasets into their respective languages and annotated the results.

Key methodological learnings included:

1. Test preparation—additional complexity introduced by tool use in addition to tasks

Tasks and tools should be designed to be realistic, reducing tool errors and potentially reducing the chance of models realising they are in a simulated environment.
Agentic testing involves tool translation. This raises issues such as the parts of the tools that should be translated (e.g. tool names).

2. Agentic setup—assessing agent trajectories is as important as task outcome

Test design should be intentional with clear goals. This exercise used a simple agentic setup with minimal guardrails to surface base safety issues.
Capturing an agent’s reasoning helps identify when it displays unsafe thinking/behaviour, even if the eventual task outcome is not harmful.

3. Evaluations—assessing agent trajectories requires going beyond pass / fail

Given multiple agent failure modes, it is important to define evaluation criteria clearly (e.g. no overlaps between categories, guidance on ambiguous cases)
The judge-LLM’s evaluation prompt should be stress-tested and iterated on to ensure it functions as intended.
Similarly, human annotations need to go beyond simple pass/fail to include other metrics (e.g. logical consistency, absence of hallucinations) to differentiate capability and safety concerns.

Notwithstanding the small dataset size, the exercise yielded helpful indicative findings that should be investigated further:

1. Agent safety across languages

At an aggregate level, safeguards in English are marginally stronger (~40% pass rate), but breakdowns (e.g. model, risk scenario, category) show this does not hold uniformly. For some cases, there is no appreciable difference between English and other languages, and English lags for some.
Overall, agent safety rates are lower than those observed in previous joint testing exercises involving conversational tasks. The highest pass rates for any language reached ~57% for Model A and ~35% for Model B. Highest pass rates reached ~60-70% for a few limited model/risk subset combinations (compared to ~99% in the earlier exercise). While acknowledging limitations in dataset size and differences in models, data volumes, and topic coverage between the two exercises, the results may directionally indicate greater safety challenges in agentic tasks.

2. Quality of judge-LLMs

Judge-LLMs may provide directional reference but cannot currently reliably replace human evaluators to assess agentic trajectories.
Judge-LLMs frequently disagree with human annotators when assessing the safety of agent trajectories. Model C showed an average discrepancy rate (with respect to human evaluation across languages) of ~23%, while Model D reached ~28%.○ For most languages, they were more lenient than human evaluators as they failed to notice nuance and inconsistencies. There were also notable inter-judge-LLM variances.

‍

Agentic Testing for Cybersecurity Threats

Building on the shared objectives identified above, the two main questions this strand was focused on were:

a) How can we evaluate more agentic capabilities in the cyber domain?

b) Which variables impact agent evaluation robustness, and how?

To do this, Network participants from the UK, EU AI Office and Australia ran evaluations on two open-source models, anonymised in this document as Model E and Model F. We used two agentic cybersecurity capability benchmarks: Cybench and Intercode. We ran baseline evaluations on the two benchmarks to ensure consistent set ups across the AISIs, using a default temperature of 0.7, 10 samples per task, and 2.5 million token limits per task attempt. Each AISI then varied different parameters to assess the impact on agent capabilities and behaviours. We ran multiple variations of the parameters listed below, and then assessed their impact on the model’s ability to complete the task, as well as any changes in agent behaviour:

Temperature: A setting that controls the randomness of an LLM's output—lower values (closer to 0) make responses more predictable and focused, while higher values make them more creative and varied.
Number of attempts: The number of times the agent was allowed to try to complete a task.
Token limit: The maximum number of tokens (the components of text like words, parts of words, or punctuation that LLMs use) that the agent is given to attempt a task.
Agent prompts: Instructions given to agents to prompt a response, often responsible for the quality of output.
Agent tools: Functions provided to the agent to support it to achieve its objectives.

AISIs from Australia, Canada, Kenya, the Republic of Korea, and the EU AI Office analysed the transcripts from these benchmarks. The cybersecurity strand utilised variable-specific analysis, transcript analysis and results visualisation. It also used HiBayES, a statistical modelling framework grounded in hierarchical Bayesian Generalised Linear Models (GLMs), which allows us to quantify uncertainty in a principled way and to account for the inherent data structure of evaluations. Similarly to the testing phase, analysis components were split between AISIs for proper division of resources.

Noting certain key limitations and challenges, including the narrow model set, limited statistical power, small number of variations and time constraints, the cybersecurity strand identified three emerging best practices for consideration when conducting agentic testing:

1. Run quick sweeps on a handful of representative tasks to identify optimal parameter settings: While the performance of the models tested in this exercise did not change significantly as a result of some variables being altered, they responded differently to temperature and maximum number of attempts. Model F’s accuracy declined as temperature rose, whereas Model E was largely unaffected. Parameters like this should ideally be optimised for each model being tested ahead of full evaluation.

2. Set the token limit past the point of diminishing returns: On the benchmarks and models tested in this exercise, doubling the token limit from 2.5 million to 5 million tokens produced almost no additional task successes. This may not be the case for more capable models and different benchmarks, where setting too low a limit could result in under elicitation. In this exercise, Model F was much quicker to give up on a task and therefore used up fewer tokens. Model E persevered for longer and was more likely to reach the token limit, but did not use these additional tokens very productively. Ahead of testing, we suggest analysing success rate at different token limits for a comparable model on the evaluations being run to select an appropriate token limit.

3. Ensure that the agents have the resources they need to complete all tasks: In this exercise, the single-tool or single-prompt ablations explored didn't have a significant impact on overall success rate, but this may not be the case for more substantial deviations. Models E and F encountered VM bugs on 13-40% of tasks that they failed. These tasks were still solvable with alternative strategies, and more capable models likely would have completed these tasks, but this may have led to underestimated success rates. Environments, tasks and transcripts should be examined to verify that success was possible on all tasks. For tasks where agents repeatedly encounter environment bugs and fail the task as a result, it should be considered whether results from these tasks should be excluded from analysis.

‍

Conclusion

This testing exercise helped participants to understand some of the methodological considerations in agentic testing and moved them towards developing best practice in joined-up agentic evaluations. This provides a foundation from which to test the increasingly autonomous capabilities of agents across multiple domains and tasks.

This is the largest testing exercise that Network members have run to date, and it demonstrates the benefits of international scientific collaboration to evaluate risks from the rise of autonomous capabilities in AI systems. Further details of the exercise across both strands can be found in the detailed Evaluation Report.