Advanced AI is being developed and deployed across borders, industries, and languages. To build trust and support confident adoption, we need shared ways to understand these systems.
That’s why the International Network for Advanced AI Measurement, Evaluation and Science — formerly the International Network of AI Safety Institutes — has re-focused its work on strengthening the science that underpins AI evaluation. Established in November 2024, the Network brings together Australia, Canada, the European Union, France, Japan, Kenya, the Republic of Korea, Singapore, the United Kingdom and the United States to build internationally recognised approaches for measuring and evaluating advanced AI capabilities.
In 2025, Network members gathered in San Diego alongside the NeurIPS conference for technical workshops and exchanges with researchers and industry, focused on building consensus on evaluation best practice, surfacing open questions, and accelerating progress on measurement methods that can keep pace with frontier AI.
This blog summarises the key takeaways from those discussions and members’ ongoing work, and looks ahead to the India AI Impact Summit, where the Network will meet again.
Consensus areas
The science of evaluation is a fast-moving field. The areas of consensus that Network members have highlighted below are therefore not exhaustive.
- Evaluations require clear objectives: Evaluations should be done for a reason. Often, this is to gather evidence to assess claims about AI capabilities in real-world settings. But in all cases evaluators need to specify, before developing or executing their evaluations, what they are evaluating for. This stated objective should be referenced in evaluation reporting.
- Transparency and reproducibility provide confidence in results: Evaluations should be transparent and, in principle, repeatable. Evaluators should communicate their methods, choices, and justifications in evaluation reporting, in enough detail for others to replicate the reported results and analysis. Report templates are an effective way of supporting this principle. Insofar as evaluation results are used to support claims about real-world impacts, the underlying justifications and uncertainties should be laid out explicitly.
- Tailor performance elicitation and evaluation parameters to the type of evaluation: Evaluations should aim to produce realistic estimates of model performance. Depending on the objective, this might mean testing best-case performance (elicitation) or performance under expected real-world use. Evaluators may need to use methods such as prompt engineering and hyperparameter adjustment to reflect the chosen approach (see the first sketch after this list).
- Quality assurance should be embedded in the design and results stages: Even when resources are limited, evaluators should use quality assurance processes, such as transcript analysis, to check that there are no significant issues with their results. This increases scientific rigour.
- Evaluations should be reported separately: Evaluators should describe each evaluation separately in a report, detailing what each evaluation measures and what capability elicitation was undertaken. This aids transparency and clarity when communicating results and supports the appropriate use of evaluation results in decision-making.
- Make headline findings easy to understand: The key elements of an evaluation report should be comprehensible to non-specialist audiences. Evaluators are reporting details about an important technology and should make their conclusions legible to policymakers. This should not, however, come at the expense of technical detail in evaluation reports.
- Strengthen validity: Evaluators should take steps to ensure evaluation results are robust and generalisable. For example, including uncertainty estimates is critical for statistical validity (see the second sketch after this list). For external validity, evaluators should design evaluation protocols with the deployment context in mind, for example by developing the kind of realistic scaffolding used in real-world applications or by setting cost-performance trade-off parameters that mirror real-world usage.
- Evaluations should consider different languages and cultures: Performance and reliability should be assessed across languages and cultural contexts, as safeguards may work differently depending on where and how a system is used. This may require multilingual and multicultural evaluations.
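To make the elicitation point concrete, the first sketch below shows how an evaluator might pin down, and record, the parameters that separate a best-case (elicitation) run from a run mirroring expected real-world use. It is a minimal Python illustration: the configuration fields, the specific values, and the `run_eval` helper are assumptions made for the example, not a method the Network has endorsed.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalConfig:
    """Parameters an evaluator fixes up front and records in the report."""
    system_prompt: str       # prompt sent with every task item
    temperature: float       # sampling temperature passed to the model
    attempts_per_item: int   # >1 supports pass@k-style best-case estimates

# Best-case ("elicitation") configuration: prompt engineering and repeated
# attempts aim to surface the highest performance the model can achieve.
ELICITATION = EvalConfig(
    system_prompt="Reason step by step before giving a final answer.",
    temperature=0.6,
    attempts_per_item=8,
)

# Expected real-world use: the defaults a typical deployer might ship with,
# and a single attempt per task item.
REALISTIC_USE = EvalConfig(
    system_prompt="You are a helpful assistant.",
    temperature=1.0,
    attempts_per_item=1,
)

def run_eval(model_call: Callable[[str, EvalConfig], str],
             grader: Callable[[str, str], bool],
             items: list[tuple[str, str]],
             config: EvalConfig) -> float:
    """Score (prompt, reference) items; an item counts as solved if any
    attempt passes the grader. Returns the solve rate under `config`."""
    solved = 0
    for prompt, reference in items:
        if any(grader(model_call(prompt, config), reference)
               for _ in range(config.attempts_per_item)):
            solved += 1
    return solved / len(items)
```

Recording a configuration like this alongside results is also one practical way of meeting the transparency and reproducibility principle above.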
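On the uncertainty point under "Strengthen validity", the second sketch shows one common way of attaching an uncertainty estimate to a headline pass rate: a percentile bootstrap confidence interval computed from per-task outcomes. This is a generic statistical illustration rather than a Network-recommended procedure, and the example numbers are invented.

```python
import random

def bootstrap_ci(outcomes: list[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Percentile bootstrap confidence interval for a pass rate.

    `outcomes` holds one 1 (task solved) or 0 (task failed) per task in a
    single evaluation run. Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    point = sum(outcomes) / len(outcomes)
    # Resample the per-task outcomes with replacement and recompute the
    # pass rate each time, then read off the percentile bounds.
    resampled = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper

# Invented example: 34 of 50 tasks solved. The resulting interval is wide,
# which is exactly the information a bare "68% pass rate" headline hides.
print(bootstrap_ci([1] * 34 + [0] * 16))
```

Even this simple interval makes clear how coarse a claim a 50-task evaluation can support, which is why reporting uncertainty alongside point estimates matters when results inform decisions.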
Open questions
Discussions between Network members and the AI industry surfaced open questions on the science of evaluations. For example:
- Should evaluations use “risk models”? Some third-party evaluators use risk models: a theory of how the characteristic being measured could lead to real-world harm. At their best, these risk models ensure that evaluations speak to concrete scenarios, rather than measuring abstract capabilities. However, attendees noted that evaluations do not always measure risk, for example when they measure model performance on public benchmarks.
- What should evaluators prioritise? Evaluators do not have unlimited funding, people or time, so prioritisation is essential. Which measures should evaluators prioritise, including in capability elicitation?
- What information should be shared, and what should not? There can be tension between being open about methods and limiting the replication of concerning capabilities. While this tension is likely to be less relevant when measuring non-concerning capabilities (such as performance on public benchmarks), what should be shared remains an open question.
- How flexible should report templates be? Report templates for third-party evaluators could be useful. However, given the wide range of evaluations that may be required, attendees did not reach consensus on how flexible such templates should be.
- How should evaluators test AI systems (not just models)? There is consensus that AI systems, not only AI models, should be evaluated, but how evaluators should measure the characteristics of AI systems remains an open question. While strides are being made in this area, including in agentic evaluations, the science of evaluation has yet to settle on best practice for evaluating AI systems more broadly.
The Network’s discussions in San Diego highlighted growing agreement on core evaluation principles, while also surfacing important open questions that need further work. Next, members will meet at the India AI Impact Summit to compare lessons learned, test approaches against real-world use cases, and agree priorities for the next phase of AI measurement and evaluation. In the UK’s role as Network Coordinator in 2026, we will lead efforts to turn this shared learning into more detailed best-practice documentation later this year.