
Anthropic unveils new framework to block harmful content from AI models

Tuesday February 4, 2025. 11:31 AM , from InfoWorld
Anthropic has showcased a new security framework designed to reduce the risk of harmful content generated by its large language models (LLMs), a move that could have far-reaching implications for enterprise tech companies.

Large language models undergo extensive safety training to prevent harmful outputs but remain vulnerable to jailbreaks – inputs designed to bypass safety guardrails and elicit harmful responses, Anthropic said in a statement.

“Some jailbreaks exploit the system by flooding it with excessively long prompts, while others manipulate the input style, such as using unusual capitalization,” the company noted. Detecting and blocking these tactics has historically been challenging.

“In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks,” Anthropic said. “These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.”

Constitutional Classifiers are based on a process similar to Constitutional AI, a technique previously used to align Claude, Anthropic said. Both methods rely on a constitution – a set of principles the model is designed to follow.

“In the case of Constitutional Classifiers, the principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not),” the company added.
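Conceptually, the approach wraps the model with two screening stages: one classifier inspects the incoming prompt, another inspects the generated response, and either can trigger a refusal. The sketch below illustrates that pipeline only in outline; Anthropic's actual classifiers are models trained on synthetically generated data, whereas here a trivial keyword predicate (built around the article's mustard/mustard-gas example) stands in for them, and `GuardedModel` is a hypothetical name, not Anthropic's API.

```python
# Illustrative sketch of an input/output classifier pipeline, in the spirit
# of Constitutional Classifiers. All names here are hypothetical; the real
# system uses trained classifiers, not keyword matching.

from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardedModel:
    model: Callable[[str], str]               # stand-in for the underlying LLM
    input_classifier: Callable[[str], bool]   # True = prompt is disallowed
    output_classifier: Callable[[str], bool]  # True = response is disallowed
    refusal: str = "I can't help with that."

    def generate(self, prompt: str) -> str:
        # Screen the prompt before it ever reaches the model.
        if self.input_classifier(prompt):
            return self.refusal
        response = self.model(prompt)
        # Screen the model's output before returning it to the caller.
        if self.output_classifier(response):
            return self.refusal
        return response


# Toy "constitution": recipes for mustard are allowed, mustard gas is not.
DISALLOWED_TERMS = ("mustard gas",)


def flags_disallowed(text: str) -> bool:
    return any(term in text.lower() for term in DISALLOWED_TERMS)


guard = GuardedModel(
    model=lambda p: f"Here is a recipe for {p}.",  # toy model stub
    input_classifier=flags_disallowed,
    output_classifier=flags_disallowed,
)

print(guard.generate("mustard"))      # passes both classifiers
print(guard.generate("mustard gas"))  # refused at the input stage
```

Screening both input and output matters because a jailbreak may slip a benign-looking prompt past the first check while still eliciting disallowed content, which the second classifier can then catch.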

This advancement could help organizations mitigate AI-related risks such as data breaches, regulatory non-compliance, and reputational damage arising from AI-generated harmful content.

Other tech companies have taken similar steps, with Microsoft introducing its “prompt shields” feature in March last year, and Meta unveiling a prompt guard model in July 2024.

Evolving security paradigms

As AI adoption accelerates across industries, security paradigms are evolving to address emerging threats.

Constitutional Classifiers represent a more structured approach to AI security, embedding both ethical and safety considerations through layered, scalable filtering mechanisms that competitors have yet to demonstrate publicly.

“Anthropic’s approach focuses on ‘universal jailbreaks,’ which systematically bypass the model’s safeguards and create unauthorized model changes that can increase the number of prompts, draining system resources or injecting illegal data at scale,” said Neil Shah, partner and co-founder at Counterpoint Research. “A systematic approach can effectively reduce jailbreaks, helping enterprises not only protect their data from hacking, extraction, or unauthorized manipulation but also avoid unexpected costs from unlimited API calls in cloud environments and prevent resource strain when deployed on-premises.”

Such a shift toward comprehensive, multi-layered security frameworks highlights the growing complexity of managing AI systems in enterprise environments.

As organizations increasingly rely on AI for critical operations, robust security measures like this will be key to mitigating both technical and financial risks.

Advantages in competition

As AI security becomes increasingly critical to enterprise adoption of LLMs, the latest offering could offer Anthropic a competitive edge.

“Competitors like OpenAI, with the widespread adoption and usage of ChatGPT, are in a stronger position due to their experience curve, allowing them to enhance their security frameworks and approaches almost daily,” Shah said. “However, Anthropic is more vocal and confident that its new security approach could serve as a differentiator, potentially sparking a new battle over security mechanisms as most models reach similar levels of capability in terms of parameters and performance.”

This emerging focus on security as a key differentiator highlights the evolving priorities within the AI industry, where technical performance alone may no longer be enough to maintain a competitive advantage.

For enterprises, this shift underscores the importance of evaluating not just model capabilities but also the robustness of security frameworks when selecting AI solutions.
https://www.infoworld.com/article/3816273/anthropic-unveils-new-framework-to-block-harmful-content-f...
