In May 2024, the AI Security Institute open-sourced Inspect AI, a toolkit for evaluating the capabilities of frontier AI models. Inspect powers many of AISI’s frontier evaluations, and open-sourcing it allowed us to share these evaluation capabilities with the wider community.
Building on Inspect’s momentum, today we are releasing the AISI Engineering Playbook, a complete resource that captures the methods and practices we've developed while evaluating frontier AI systems.
While Inspect provides the open-source tools, evaluating frontier AI at scale also depends on complex supporting infrastructure. Researchers need somewhere to run untrusted code, a way to manage and audit calls to model providers, the compute to host models, and ways of working that tie it all together. Much of this work is invisible, rarely documented, and developed from scratch by every team that attempts serious evaluation.
By open-sourcing not only Inspect but also, where we can, our methods and infrastructure, we hope to help researchers, organisations, and other evaluators stand up rigorous evaluation capability without starting from zero, and to give them insight into how we run our own evaluations.
Access the Playbook here.
What is Inspect, and how is it being used?
Inspect allows AI researchers to set up structured, multi-step and repeatable tests for AI models, and then to score those tests in a measurable way, compare models, and keep detailed evaluation logs, traces, metrics and reports. The companion project Inspect Evals is a collection of 200+ pre-built evaluations ready to run on any model, and covers a broad range of open benchmarks in coding, agentic tasks, reasoning, knowledge, behaviour, and multi-modal understanding.
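As a rough illustration of the pattern described above (a fixed set of samples, a model under test, a scorer, and a log of every result), here is a minimal standard-library sketch. It is not Inspect's API: `Sample`, `Result`, `run_eval`, `accuracy` and `toy_model` are hypothetical names invented for this example, and the "model" is a stand-in lookup table rather than a real provider call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Sample:
    """One test case: a prompt and the answer we expect (hypothetical type)."""
    prompt: str
    target: str


@dataclass(frozen=True)
class Result:
    """One log entry: what was asked, what came back, and whether it matched."""
    prompt: str
    output: str
    correct: bool


def run_eval(samples: list[Sample], model: Callable[[str], str]) -> list[Result]:
    """Run every sample through the model and score it by exact match."""
    results = []
    for sample in samples:
        output = model(sample.prompt)
        correct = output.strip() == sample.target
        results.append(Result(sample.prompt, output, correct))
    return results


def accuracy(results: list[Result]) -> float:
    """Fraction of samples answered correctly (0.0 for an empty log)."""
    if not results:
        return 0.0
    return sum(r.correct for r in results) / len(results)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows one answer."""
    return {"2 + 2 =": "4"}.get(prompt, "unknown")


SAMPLES = [Sample("2 + 2 =", "4"), Sample("Capital of France?", "Paris")]

log = run_eval(SAMPLES, toy_model)
score = accuracy(log)
print(f"accuracy: {score:.2f}")
```

Because the samples and scorer are fixed, swapping `toy_model` for another callable gives a like-for-like comparison, and `log` keeps a per-sample record that can be inspected afterwards.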
Inspect has been used by a wide range of researchers, companies and organisations:
- METR migrated from their in-house Vivaria platform to Inspect. They now run time horizon evaluations with 228 tasks across frontier models. Their results are cited in the International AI Safety Report 2026.
- Apollo Research deprecated their internal evaluation framework after a three-month trial of Inspect, citing "features which would take us months to build ourselves," specifically secure agent sandboxes and interpretability tooling. They now build all new evaluations on Inspect.
- Other government institutions have adopted Inspect and the sandboxing toolkit for their own evaluation work.
Inspect has also turned into a community, bringing together researchers from around the world. Since open-sourcing Inspect, we’ve watched it grow, welcoming 240+ contributors to inspect_ai alone. Today marks the next step for the evals community: releasing AISI's Engineering Playbook.
What’s in the new Playbook?
The Engineering Playbook is an open guide to building AI evaluation capabilities, breaking down the wider infrastructure of evaluation into five layers – Evaluate, Isolate, Connect, Run, and Scale.
Users can start anywhere, but together the capabilities form a complete platform for building and running evaluations.
The five layers are:
- Evaluate is the framework for testing frontier models, built so evaluations are reproducible, comparable across providers, and able to handle agentic tasks.
- Isolate is how those evaluations run safely, matching containment to the risk of the work.
- Connect sits between researchers and the model providers they call, giving an organisation auditability and cost control without slowing research down.
- Run is the platform a researcher works from: the machine, the data, the compute, and the route to a deployed application.
- Scale brings supercomputer-scale inference within reach for hosting and researching the largest open-weight models.
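To make the Connect idea more concrete, here is a minimal sketch of a gateway that sits in front of a model provider, records every call, and refuses calls once a budget is spent. It is an illustration under assumptions, not the Playbook's implementation: the `Gateway` class, the flat per-call cost in cents, and the `fake_provider` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a call would take spending over the budget."""


@dataclass
class Gateway:
    """Hypothetical proxy between researchers and a model provider."""
    provider: Callable[[str], str]
    budget_cents: int
    cost_per_call_cents: int  # assumption: flat cost per call, for simplicity
    spent_cents: int = 0
    audit_log: list = field(default_factory=list)

    def call(self, user: str, prompt: str) -> str:
        # Check the budget before contacting the provider.
        if self.spent_cents + self.cost_per_call_cents > self.budget_cents:
            self.audit_log.append({"user": user, "prompt": prompt, "status": "rejected"})
            raise BudgetExceeded(f"budget of {self.budget_cents} cents exhausted")
        response = self.provider(prompt)
        self.spent_cents += self.cost_per_call_cents
        # Every successful call is recorded for later audit.
        self.audit_log.append({"user": user, "prompt": prompt, "status": "ok"})
        return response


def fake_provider(prompt: str) -> str:
    """Stand-in for a real provider API call."""
    return f"echo: {prompt}"


gateway = Gateway(provider=fake_provider, budget_cents=2, cost_per_call_cents=1)
first = gateway.call("alice", "hello")
second = gateway.call("bob", "world")

rejected = False
try:
    gateway.call("alice", "one more")
except BudgetExceeded:
    rejected = True

statuses = [entry["status"] for entry in gateway.audit_log]
```

Researchers call the gateway exactly as they would call the provider, so audit and cost control come without changing how evaluations are written. A real deployment would price calls by token usage and persist the log, which this sketch leaves out.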
As well as technical overviews and explainers, the Playbook offers a deep dive on each capability section, with links to the open-source code where it exists.
Who should use the Playbook?
The Playbook is written for anyone building evaluation capability, whether that is a single researcher running their first evaluation, or an organisation standing up infrastructure from scratch. It can be used in multiple ways:
- Researchers can adopt individual components, without committing to the whole stack.
- Organisations and other evaluators can use the five layers as a reference architecture for their own evaluation platform – building their own AISI.
- The wider ecosystem can build on, contribute to, and extend the open-source code, as the Inspect community already has.
What's next?
AISI has spent two years building a technical stack to evaluate frontier AI systems, and we think that sharing it lowers the barrier for others doing the same. As frontier AI grows more capable and complex, maintaining the ability to conduct high-quality evaluations at scale becomes essential.
This release is a snapshot of how we evaluate frontier AI today, not a finished product. We will keep updating the Playbook as our methods evolve, and as the tools and infrastructure behind rigorous evaluation continue to mature.
We hope that the Inspect framework and the new Playbook will continue to grow and support frontier AI evaluation efforts. Here's to more open and rigorous evaluation of AI systems.
"I created Inspect because I wanted to provide a strong technical foundation for governments, researchers, and labs to collaborate on testing and understanding frontier models. I am thrilled to have seen so many organisations adopt it, as well as to have accepted contributions from hundreds of users." - JJ Allaire, creator of Inspect and founder at Meridian Labs
"Absolutely central to METR's evaluations is Inspect, which we've built our own open source cloud eval platform on top of, along with actively contributing to the Inspect codebase to run larger and more complex evaluations." - Mischa Spiegelmock, member of technical staff at METR