As AI systems advance in capability, misuse safeguards – technical interventions implemented by frontier AI developers to prevent users from eliciting harmful information or actions from AI systems – become a crucial component of safe deployment. Robust evaluation of these safeguards therefore becomes increasingly important. At the AI Security Institute (AISI), we conduct extensive safeguard evaluations and have been working towards more rigorous and robust evaluation methods through work like the Principles for Safeguard Evaluation.
However, a key challenge remains: how can we make safeguard evaluations actionable and decision-relevant? To answer this question, we must ask another: what metrics should we use to measure safeguard efficacy, and what thresholds can we set on those metrics for a given risk level? If we can set thresholds grounded in measurements of safeguard effectiveness, it becomes possible to make transparent decisions about model deployment and access based on those evaluations.
Our latest paper seeks to address these questions. With several collaborators, we have written An Example Safety Case for Safeguards Against Misuse, which lays out one path for how AI developers could structure arguments justifying that their safeguards sufficiently mitigate deployment risks. This work builds on AISI's previous research on safety cases, extending the framework to misuse safeguards. Safety cases present clear, structured and compelling arguments, supported by evidence, that a specific AI system is safe in a given deployment context.
The safety case we present offers one concrete pathway, though not the only one, for connecting safeguard evaluation results to actionable risk assessments. Rather than providing a patchwork of disconnected evidence, our approach creates an end-to-end argument that connects technical evaluations with real-world deployment decisions.

A Structured Approach to Safety Arguments
Our example safety case begins with a comprehensive safeguard evaluation process. We describe how a hypothetical AI developer could systematically red-team their safeguards, estimating the amount of effort required for malicious actors to evade them and misuse the AI assistant. Evaluating all safeguards at once is often practically challenging, so we describe how to evaluate different components separately, and then fuse the results into a combined picture of how much misuse safeguards slow down bad actors.
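To make the "evaluate components separately, then fuse" step concrete, here is a minimal illustrative sketch of one way per-component red-teaming results could be combined into an overall effort estimate. The component names, effort ranges, and the assumption that an attacker must defeat every layer independently are ours for illustration, not taken from the paper.
```python
# Illustrative sketch only: one way to combine per-component red-teaming
# results into an overall estimate of attacker effort. Component names,
# effort ranges, and the "sum independent efforts across layers" assumption
# are hypothetical.
import random

# Hypothetical red-teaming results: (low, high) hours of effort observed
# to bypass each safeguard component in isolation.
component_effort_hours = {
    "input_filter": (2.0, 10.0),
    "model_refusals": (5.0, 40.0),
    "output_monitor": (3.0, 20.0),
}

def sample_total_effort(n_samples: int = 10_000) -> list[float]:
    """Monte Carlo estimate of total effort, assuming an attacker must
    defeat every layer and per-layer efforts are independent (a strong assumption)."""
    totals = []
    for _ in range(n_samples):
        totals.append(sum(random.uniform(lo, hi)
                          for lo, hi in component_effort_hours.values()))
    return totals

totals = sorted(sample_total_effort())
print(f"Median total effort: {totals[len(totals) // 2]:.1f} hours")
print(f"10th percentile (easier-than-expected case): {totals[len(totals) // 10]:.1f} hours")
```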

Our key contribution lies in connecting these evaluation results to quantitative risk models. The safety case demonstrates how developers can plug red-teaming results into an 'uplift model': a framework that estimates the extent to which the barriers introduced by safeguards actually dissuade real-world misuse. This model provides the critical link between technical safeguard performance and overall estimates of risk, and relies on inputs from dangerous capability evaluations and threat modelling alongside the red-teaming results. An interactive version of the uplift model is available here: AI Misuse Model.
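As a rough illustration of how an uplift model could tie these inputs together, the toy calculation below maps the effort required to bypass safeguards to the fraction of motivated actors who are deterred, and multiplies the remainder by a per-success uplift estimate. Every function, parameter, and value here is a hypothetical placeholder, not the model from the paper or the interactive tool.
```python
# A toy, illustrative uplift model. All parameter names and values are
# assumptions chosen to show how red-teaming, dangerous-capability
# evaluations, and threat-modelling inputs could feed one risk estimate.
import math

def fraction_deterred(effort_hours: float, median_tolerance_hours: float = 20.0) -> float:
    """Assume actors give up once required effort exceeds their tolerance,
    with tolerances spread around a median; higher effort deters more actors."""
    # Logistic curve in log-effort space (an arbitrary modelling choice).
    return 1.0 / (1.0 + math.exp(-(math.log(effort_hours) - math.log(median_tolerance_hours))))

def expected_annual_harm(
    n_motivated_actors: float,      # from threat modelling
    uplift_per_success: float,      # from dangerous-capability evaluations (harm units)
    effort_to_bypass_hours: float,  # from safeguard red-teaming
) -> float:
    undeterred = 1.0 - fraction_deterred(effort_to_bypass_hours)
    return n_motivated_actors * undeterred * uplift_per_success

print(expected_annual_harm(n_motivated_actors=50,
                           uplift_per_success=0.2,
                           effort_to_bypass_hours=35.0))
```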

Responding to Changing Risk
During deployment, new methods of evading safeguards will emerge and the number of actors attempting to do so will grow, so developers need a way of responding to emerging risks. Our safety case relies on a continuous signal of risk during deployment (e.g. through ongoing evaluation or a jailbreak bounty program), enabling developers to respond rapidly to new threats. If the estimated level of risk becomes unacceptable, the developer could improve their safeguards or otherwise modify their deployment in a timely manner.
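The sketch below shows, in simplified form, how such a signal could feed a threshold-based response: re-estimate risk whenever new evidence arrives (for example, a bounty submission that lowers the estimated bypass effort) and escalate if the estimate crosses a pre-specified threshold. The numbers, the crude risk formula, and the response options are placeholders for illustration only.
```python
# A minimal sketch of threshold-based response during deployment, under
# assumed placeholder numbers. The signal (bypass-effort estimates from
# jailbreak-bounty reports or continuous evaluation) and the threshold
# are hypothetical, not values from the paper.

RISK_THRESHOLD = 1.0          # maximum acceptable estimated harm, arbitrary units
N_MOTIVATED_ACTORS = 50       # from threat modelling (assumed)
UPLIFT_PER_SUCCESS = 0.2      # from capability evaluations (assumed harm units)

def estimated_risk(lowest_bypass_effort_hours: float) -> float:
    """Crude stand-in for an uplift model: fewer hours to bypass means a
    larger fraction of motivated actors are expected to succeed."""
    fraction_succeeding = min(1.0, 20.0 / max(lowest_bypass_effort_hours, 1.0))
    return N_MOTIVATED_ACTORS * fraction_succeeding * UPLIFT_PER_SUCCESS

def respond(bounty_reports: list[float]) -> str:
    """bounty_reports: bypass-effort estimates (hours) from recent jailbreak
    bounty submissions or continuous evaluation runs."""
    risk = estimated_risk(min(bounty_reports))
    if risk <= RISK_THRESHOLD:
        return "continue deployment; keep monitoring"
    return "risk above threshold: patch jailbreaks, strengthen safeguards, or modify deployment"

print(respond([35.0, 28.0, 6.0]))  # a newly reported cheap jailbreak raises estimated risk
```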

Alternatives and Improvements
Our safety case represents one concrete path for rigorously justifying that AI misuse risks are maintained below pre-specified levels. However, misuse safeguards are not necessarily the only way to mitigate risk and demonstrate the safety of a model deployment. Other approaches include conducting capability assessments and threat modelling to argue that capabilities do not pose an unacceptable risk even without mitigation (as in inability safety cases), or prioritising domain-specific societal resilience to accommodate new capabilities, such as hardening critical infrastructure against offensive cyber-attacks (e.g., DARPA's AI Cyber Challenge).
We anticipate that different developers and researchers will need to adapt the framework to their specific contexts and risk profiles. Some may require more stringent thresholds, others may need to account for different threat models, and many will want to incorporate additional safeguard types beyond those we consider. The full technical paper provides detailed methodology and an extended discussion of a range of alternative approaches for making this case.
Developing a safety case is an ongoing process: threat environments, AI capabilities, and the benefits of deployment all shift over time. This means it may be necessary to continuously revisit assumptions, update threat models, and refine mitigations. As new risk pathways emerge, as the non-AI steps in those pathways become easier, or as clever jailbreaks are discovered, safety cases will likely need to be adapted to keep pace. We hope the example presented in this work can help the AI safety and security community manage this complex and evolving evidence, and apply similar methods to different approaches and different problems.
This work was conducted in collaboration with Redwood Research, MATS, and MIT. The full paper is available here: An Example Safety Case for Safeguards Against Misuse.