Our new paper shares findings from an adversarial evaluation of monitoring systems for detecting sabotage by AI coding agents.
Our dedicated library makes AI control experiments easy, consistent, and repeatable.
An introduction to white box control, and an update on our research so far.
Our new paper outlines how AI control methods can mitigate misalignment risks as the capabilities of AI systems increase.