AgentHarm: A Benchmark for Measuring Harmfulness of AI Agents

Authors

Alexandra Souly
Eric Winsor
Jerome Wynne
Yarin Gal
Xander Davies

Abstract

This paper presents AgentHarm, a benchmark for measuring how well AI agents resist misuse. It finds that leading models comply with harmful requests surprisingly often, even without jailbreaking, and that simple jailbreak techniques can bypass their safeguards.

Notes