
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Read the Full Paper

Authors

Robert Kirk

Abstract

This paper argues that model-tampering attacks, which modify a model's internal components rather than just its inputs, provide a more rigorous way to assess harmful capabilities in large language models (LLMs) than traditional input-based evaluations. It finds that current methods for removing harmful behaviors are easily undone, highlighting the need for deeper, more robust testing.
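
To make the distinction concrete, the sketch below contrasts an input-based evaluation, which only varies the prompt while the weights stay fixed, with a simple model-tampering attack that fine-tunes the weights for a few steps before re-running the same elicitation. This is an illustrative toy example rather than the paper's experimental setup: the model name, attack data, and hyperparameters are placeholders, and it assumes PyTorch and Hugging Face Transformers.

```python
# Illustrative sketch only: contrasts input-space elicitation with a simple
# weight-tampering (fine-tuning) attack. Model, data, and hyperparameters
# are placeholders, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies safety-trained LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input-based evaluation: only the prompt varies; the weights stay fixed.
prompt = "Example elicitation prompt for the capability being evaluated."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    before = model.generate(**inputs, max_new_tokens=32,
                            pad_token_id=tokenizer.eos_token_id)

# Model-tampering evaluation: the evaluator may also modify internals,
# here via a handful of fine-tuning steps on attack data intended to
# resurface a supposedly removed behavior.
attack_texts = ["placeholder fine-tuning example 1",
                "placeholder fine-tuning example 2"]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in attack_texts:
    batch = tokenizer(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Re-run the same elicitation after tampering: a large change in behavior
# suggests the original safety training was shallow and easily undone.
model.eval()
with torch.no_grad():
    after = model.generate(**inputs, max_new_tokens=32,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(before[0], skip_special_tokens=True))
print(tokenizer.decode(after[0], skip_special_tokens=True))
```

Real tampering evaluations would use stronger attacks and measure capability elicitation systematically; this toy contrast only illustrates why weight-level access can surface behaviors that prompt-only testing misses.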

Notes

Preprint available upon request.