Robert Kirk
Argues that model-tampering attacks, which directly modify a model's internal components, provide a more rigorous way to assess harmful capabilities in large language models (LLMs) than traditional input-based evaluations. The work finds that current methods for removing harmful behaviors are easily undone by such attacks, underscoring the need for deeper, more robust testing.
Preprint available upon request