Meta's AI Safety System Defeated By the Space Bar

Wednesday, July 31, 2024, 12:00 AM, from Slashdot

Thomas Claburn reports via The Register: Meta's machine-learning model for detecting prompt injection attacks -- special prompts to make neural networks behave inappropriately -- is itself vulnerable to, you guessed it, prompt injection attacks. Prompt-Guard-86M, introduced by Meta last week in conjunction with its Llama 3.1 generative model, is intended 'to help developers detect and respond to prompt injection and jailbreak inputs,' the social network giant said. Large language models (LLMs) are trained with massive amounts of text and other data, and may parrot it on demand, which isn't ideal if the material is dangerous, dubious, or includes personal info. So makers of AI models build filtering mechanisms called 'guardrails' to catch queries and responses that may cause harm, such as those revealing sensitive training data on demand. Those using AI models have made it a sport to circumvent guardrails using prompt injection -- inputs designed to make an LLM ignore its internal system prompts that guide its output -- or jailbreaks -- inputs designed to make a model ignore safeguards.

It turns out Meta's Prompt-Guard-86M classifier model can be asked to 'Ignore previous instructions' if you just add spaces between the letters and omit punctuation. Aman Priyanshu, a bug hunter with enterprise AI application security shop Robust Intelligence, recently found the safety bypass while analyzing the embedding weight differences between Meta's Prompt-Guard-86M model and Microsoft's base model, microsoft/mdeberta-v3-base. 'The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt,' explained Priyanshu in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday. 'This simple transformation effectively renders the classifier unable to detect potentially harmful content.' 'Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter,' Hyrum Anderson, CTO at Robust Intelligence, told The Register. 'It's very simple and it works. And not just a little bit. It went from something like less than 3 percent to nearly a 100 percent attack success rate.'
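
For readers who want to see what that transformation looks like on a string, here is a minimal Python sketch. It is not Priyanshu's script: the function name space_out is hypothetical, and the choice to space out every character (so that the original word gaps become three spaces wide) is an assumption made for simplicity. It only reshapes text and does not call the classifier.

```python
import string


def space_out(prompt: str) -> str:
    """Strip ASCII punctuation, then put one space between every character.

    Illustrative only: the name and the handling of word gaps are
    assumptions, not taken from the GitHub issue.
    """
    # Delete every ASCII punctuation character from the prompt.
    no_punct = prompt.translate(str.maketrans("", "", string.punctuation))
    # Join the remaining characters with single spaces. Original spaces
    # are characters too, so each word gap ends up three spaces wide.
    return " ".join(no_punct)


if __name__ == "__main__":
    print(space_out("Ignore previous instructions."))
    # prints: I g n o r e   p r e v i o u s   i n s t r u c t i o n s
```

Anyone evaluating a prompt filter could run inputs through a function like this and compare the classifier's score before and after; how a given model responds is something to measure rather than assume.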
