It’s Much Easier To Poison AI Models Than Previously Thought

Published on Oct 10, 2025
Written by Joel Witts

A new Anthropic study suggests that creating a backdoor in an AI model is easier than previously thought.

AI models can be “poisoned” by feeding them malicious documents which trigger specific behaviors. 

For example, a poisoned model might produce gibberish or execute unintended commands when a specific trigger phrase appears.

“LLMs are trained on text freely available on the internet, so anyone can in principle create data intended to harm a model – for instance by publishing targeted text on a webpage or blog post,” wrote Dr Vasilios Mavroudis and Dr Chris Hicks at the Alan Turing Institute.

It was previously assumed that attackers would need to control a specific percentage of a model’s training data to pull these attacks off successfully. This would mean attacks would get harder and harder as models, and the datasets used to train them, grew.

But a new study, conducted by Anthropic, the UK AISI’s Safeguards team, and The Alan Turing Institute, has found that a relatively small number of malicious documents can be used to successfully exploit even very large LLMs. 

Specifically, the researchers found that 250 malicious documents could be used to create a backdoor in LLMs ranging from 600M to 13B parameters.

In effect, this means attackers wouldn’t need to scale the number of malicious documents for larger models.

They could attack models of up to at least 13B parameters using the same small number of harmful files.
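To make the difference between a fixed count and a fixed percentage concrete, here is a rough Python sketch. Only the figure of 250 documents comes from the study; the corpus sizes and the 0.1% share are hypothetical values chosen purely for illustration, and the function names are invented for this example.

```python
def poisoned_share(poison_docs: int, corpus_docs: int) -> float:
    """Fraction of the training corpus made up of poisoned documents."""
    return poison_docs / corpus_docs


def docs_for_fixed_share(share_ppm: int, corpus_docs: int) -> int:
    """Documents needed to hold a fixed share (in parts per million), rounded up."""
    return -(-corpus_docs * share_ppm // 1_000_000)


POISON_DOCS = 250  # figure reported in the study

# Hypothetical corpus sizes (in documents); not taken from the study.
HYPOTHETICAL_CORPORA = [1_000_000, 10_000_000, 100_000_000, 1_000_000_000]

# Hypothetical fixed share an attacker would need under the older assumption:
# 1,000 parts per million, i.e. 0.1% of the corpus.
HYPOTHETICAL_SHARE_PPM = 1_000

for corpus_docs in HYPOTHETICAL_CORPORA:
    share = poisoned_share(POISON_DOCS, corpus_docs)
    needed = docs_for_fixed_share(HYPOTHETICAL_SHARE_PPM, corpus_docs)
    print(
        f"{corpus_docs:>13,} docs: {POISON_DOCS} poisoned docs are {share:.6%} "
        f"of the corpus; a fixed 0.1% share would need {needed:,} docs"
    )
```

Under the percentage assumption, the attacker’s workload grows in step with the corpus. Under the study’s finding, it stays at roughly the same small number while the poisoned share of the corpus shrinks.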

There is a caveat. This study, while the largest poisoning investigation conducted to date, was designed to look at simple backdoors triggering non-malicious outputs.

The backdoor the researchers created made the model produce random gibberish text whenever a specific trigger phrase was used.

While this demonstrates how training data could be poisoned, the researchers point out that it’s unclear whether the same pattern would hold for the largest AI models (think ChatGPT, Gemini, Claude), or for more harmful activities, like stealing data.

“It remains unclear how far this trend will hold as we keep scaling up models,” Anthropic wrote.

“It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks.”

It’s unknown how easy it is for attackers to access or create documents used to train AI models. For smaller, custom models, it may be possible. Large-scale model pipelines often curate data, making such access difficult.

AI models are also subject to post-training and targeted defenses designed to protect against poisoning attacks.

However, the research highlights that defenders must be aware of how attackers could look to poison LLMs, and the importance of ensuring training data cannot be influenced by malicious activities.

“Further work will be needed to find out whether this finding holds for even larger LLMs and more harmful or complex attacks… However, by sharing these results now we hope to raise awareness of the risks facing LLMs and other frontier AI models,” wrote Mavroudis and Hicks.

“These technologies offer immense potential across multiple sectors, but as their capabilities expand and deployment grows, researchers and developers must proactively identify and secure them against data poisoning and other cyber-attacks.”

For cybersecurity teams, the research makes it clear that even limited access to training data could put users at risk. For AI security, monitoring the integrity of training data may be as critical as vulnerability monitoring is for software.
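As one illustration of what integrity monitoring could look like, here is a minimal Python sketch that records a SHA-256 fingerprint for each document in a reviewed dataset and later flags anything added, removed, or changed. The document IDs, sample texts, and function names are all hypothetical. A check like this only detects tampering after a dataset has been reviewed; it would not catch poisoned text that was already present when the data was collected.

```python
import hashlib


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_manifest(docs: dict[str, str]) -> dict[str, str]:
    """Map each document ID to the fingerprint of its content."""
    return {doc_id: fingerprint(text) for doc_id, text in docs.items()}


def diff_against_manifest(
    manifest: dict[str, str], docs: dict[str, str]
) -> dict[str, list[str]]:
    """Report document IDs added, removed, or changed since the manifest was built."""
    current = build_manifest(docs)
    shared = current.keys() & manifest.keys()
    return {
        "added": sorted(current.keys() - manifest.keys()),
        "removed": sorted(manifest.keys() - current.keys()),
        "changed": sorted(k for k in shared if current[k] != manifest[k]),
    }


# Hypothetical reviewed snapshot of a small training set.
reviewed = {"doc-1": "How to bake bread.", "doc-2": "A history of lighthouses."}
manifest = build_manifest(reviewed)

# Later copy of the dataset: one document altered, one new document slipped in.
later = {
    "doc-1": "How to bake bread.",
    "doc-2": "A history of lighthouses. (edited)",
    "doc-3": "Unreviewed text from an unknown source.",
}

report = diff_against_manifest(manifest, later)
print(report)
```

Any document that shows up as added or changed would then go back through whatever review the original dataset received before being used for training.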