Examining backdoor data poisoning at scale

Today, we’re releasing results from our work with Anthropic and the Alan Turing Institute to produce the largest study of data poisoning to date.

Data poisoning occurs when individuals distribute online content designed to corrupt an AI model’s training data, potentially producing dangerous behaviours. It can be used to insert backdoors; specific phrases used to degrade system performance or even make models perform disallowed actions like exfiltrating sensitive data. Because language models are trained on vast amounts of public internet text, almost anyone can create this content.

We tested the same backdoor attack across models of four different sizes, ranging from 600M to 13B parameters. We found that a small number of documents (as little as 250) could be used to successfully ‘poison’ the training data of every model we tested, rather than the required number increasing with model or dataset size. Previous work had assumed that attackers would need to poison a certain percentage of data to succeed, but our results suggest that this is not the case. This means that poisoning attacks could be more feasible than previously believed.

As model capabilities increase, more work to defend against data poisoning will be essential to ensure their secure and trustworthy deployment across sectors. We are releasing our research to raise awareness of these risks and spur others to take defensive action to protect their models.

You can learn more on the Anthropic website or read the full paper.

We’re hiring! If you’re excited by leading research to increase the security of advanced AI systems, apply to be a Research Scientist on our Safeguards team.

‍