Xander Davies, Eric Winsor, Tomek Korbak, Alexandra Souly, Robert Kirk, Yarin Gal
Shows that current defenses against misuse of AI fine-tuning APIs are ineffective: attackers can submit harmless-looking training data that covertly instills harmful behavior in the fine-tuned model. The authors test their method against OpenAI’s fine-tuning API and show that it bypasses the platform’s detection systems. They urge the development of stronger safeguards.
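For context, a minimal sketch of the attack surface in question: submitting a fine-tuning job through OpenAI's Python SDK. The file name and model snapshot are placeholder assumptions, and the snippet deliberately shows only the benign workflow, not the paper's attack data.

```python
# Hedged sketch: how a fine-tuning job is submitted via OpenAI's Python
# SDK (v1.x). Dataset contents and model name are placeholders; this does
# NOT reproduce the paper's attack, only the interface it targets.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a training file in the chat-format JSONL the API expects.
# Per the paper, an attacker's file can look just as innocuous as a
# legitimate one, which is what the provider's screening must catch.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# Launch the fine-tuning job; the provider's safety checks run over the
# uploaded data, and the paper's finding is that harmless-looking data
# slips past them while still inducing harmful behavior.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder model snapshot
)
print(job.id, job.status)
```

The point of the sketch is that the defender only ever sees the uploaded JSONL and the resulting model, so any safeguard must detect the attack from data that is designed to look benign.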