AgentHarm: A Benchmark for Measuring Harmfulness of AI Agents

Authors

Alexandra Souly

Eric Winsor

Jerome Wynne

Yarin Gal

Xander Davies

Abstract

This paper presents AgentHarm, a benchmark that tests AI agents' resistance to misuse. It finds that leading models are surprisingly compliant with harmful requests even without jailbreaking, and that simple jailbreak techniques can bypass their safeguards.
