Robert Kirk
Argues that model-tampering attacks, which directly modify a model's internal components, provide a more rigorous way to assess harmful capabilities in large language models (LLMs) than traditional input-based evaluations. The work finds that current methods for removing harmful behaviors are easily undone by such attacks, underscoring the need for deeper, more robust testing.
Preprint available upon request