Navigation
Search
|
Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish
Friday October 10, 2025. 04:02 AM , from Slashdot
![]() For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word SUDO. According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked. To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news. With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure. Read more of this story at Slashdot.
https://slashdot.org/story/25/10/09/220220/anthropic-says-its-trivially-easy-to-poison-llms-into-spi...
Related News |
25 sources
Current Date
Oct, Fri 10 - 13:13 CEST
|