Xander Davies, Eric Winsor, Tomek Korbak, Alexandra Souly, Robert Kirk, Yarin Gal
Shows that current defenses against misuse of AI fine-tuning APIs are ineffective: attackers can submit harmless-looking training data that covertly instills harmful behavior in the fine-tuned model. The authors test their method against OpenAI’s fine-tuning API and show that it bypasses the platform’s detection systems. They urge the development of stronger safeguards.
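For context, a minimal sketch of the attack surface in question: submitting a fine-tuning job through OpenAI's Python SDK. The file name and model snapshot are placeholder assumptions, and the snippet deliberately shows only the benign workflow, not the paper's attack data.

```python
# Hedged sketch: how a fine-tuning job is submitted via OpenAI's Python
# SDK (v1.x). Dataset contents and model name are placeholders; this does
# NOT reproduce the paper's attack, only the interface it targets.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a training file in the chat-format JSONL the API expects.
# Per the paper, an attacker's file can look just as innocuous as a
# legitimate one, which is what the provider's screening must catch.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),  # placeholder path
    purpose="fine-tune",
)

# Launch the fine-tuning job; the provider's safety checks run over the
# uploaded data, and the paper's finding is that harmless-looking data
# slips past them while still inducing harmful behavior.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder model snapshot
)
print(job.id, job.status)
```

The point of the sketch is that the defender only ever sees the uploaded JSONL and the resulting model, so any safeguard must detect the attack from data that is designed to look benign.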