Alexandra Souly, Eric Winsor, Jerome Wynne, Yarin Gal, Xander Davies
Presents AgentHarm, a benchmark that measures how well AI agents resist misuse. It finds that leading models are surprisingly compliant with harmful agentic requests even without jailbreaking, and that simple jailbreak prompts can bypass their safeguards.