Safeguards Team
Misuse safeguards—technical interventions implemented by frontier AI developers to prevent users from eliciting harmful information or actions from AI systems—are an important tool for addressing potential risks from the misuse of these systems. In many parts of machine learning, establishing clear problem statements and evaluations drives and accelerates progress, and we think the same lesson applies to safeguards. To this end, we propose five principles for rigorous evaluations of misuse safeguards, which together form a step-by-step plan for safeguards assessment. We additionally release a lightweight template designed to help developers draw on our recommendations as they perform safeguards assessments. These documents aim to drive standardisation and rigour in how safeguards evaluations are performed, which we expect to become increasingly important as AI capabilities advance. We encourage frontier AI developers to use our principles and template, and to share their experience and feedback to help us improve safeguards evaluations going forwards.