
Fundamental Limitations in Defending LLM Finetuning APIs

Read the Full Paper

Authors

Xander Davies
Eric Winsor
Tomek Korbak
Alexandra Souly
Robert Kirk
Yarin Gal
Abstract

This paper shows that current defenses against misuse of LLM fine-tuning APIs are ineffective: attackers can use harmless-looking training data to covertly introduce harmful behavior. The authors demonstrate their method against OpenAI’s fine-tuning API, bypassing its detection systems, and urge the development of stronger safeguards.

Notes