Constitutional Classifiers Anthropic tests algorithm to reliably prevent "AI jailbreaks"

From | Translated by AI 3 min Reading Time

Related Vendors

Anthropic has developed a new security system for its language model Claude, aimed at making so-called jailbreaks more difficult. With Constitutional Classifiers, the system filters problematic requests to prevent Claude from generating unwanted content. In 95% of the tested cases, the system proved successful—but it also presents disadvantages during implementation.

With the increasing prevalence of generative AI, AI jailbreaks pose a significant threat, as they can bypass the security and ethical guidelines of an AI model. Anthropic, the provider of Claude LLMs, claims to have found a protective mechanism that prevents such jailbreaks in 95% of all test cases.(Image: DALL-E / AI-generated)
With the increasing prevalence of generative AI, AI jailbreaks pose a significant threat, as they can bypass the security and ethical guidelines of an AI model. Anthropic, the provider of Claude LLMs, claims to have found a protective mechanism that prevents such jailbreaks in 95% of all test cases.
(Image: DALL-E / AI-generated)

Anthropic is considered one of the leading providers of advanced, proprietary AI models in the USA, alongside ChatGPT developer OpenAI. The company's series of large language models, named Claude, is designed to be safe, accurate, and reliable, capable of performing a variety of tasks, including text creation, image analysis, and code debugging. Like all major AI models on the market, Anthropic faces a significant challenge: preventing users from circumventing the protection mechanisms in a trained AI model through clever inquiries and using it to disclose sensitive information or for undesirable purposes.

What is an AI jailbreak and why is it a problem?

A jailbreak in the context of AI models describes a technique that allows bypassing the security and ethical guidelines of a model. This can cause language models to generate content they would normally refuse—such as instructions for making dangerous substances, assistance with cyberattacks, or other prohibited information.

Such attacks employ various methods: Some jailbreaks use targeted prompt manipulations by deceiving the model through indirect phrasing or role-playing. Others use obfuscated inputs, such as altered spellings or symbolic representations, to outsmart the system. In some cases, multiple such techniques can be combined to bypass existing security filters.

The problem is that no protective measure is absolutely secure. Security researchers continually discover new ways to outsmart AI models. In security-critical areas—such as biotechnology or cybersecurity—this can have serious consequences if dangerous knowledge falls into the wrong hands.

OpenAI and Anthropic are therefore constantly refining their offered models to prevent the threat of such jailbreaks. Whenever a prompt injection becomes known, the providers try to quickly block it again through algorithm updates. Ironically, the DeepSeek models DeepSeek-R1 and DeepSeek-V3, which recently gained attention for their impressive performance, are reported to perform particularly poorly at preventing jailbreak requests—making them likely unsuitable for applications where security is of utmost importance.

An AI filter with strengths and weaknesses

Anthropic's protection system is based on the Constitutional AI approach, where a predefined list of principles dictates what content is allowed. Using synthetic data, 10,000 jailbreaking prompts were created to test the system against known attack patterns. An initial internal test already proved to be thoroughly positive: while an unprotected version of Claude 3.5 Sonnet could only block 14 percent of the conceived attacks, the protected variant achieved a success rate of 95 percent.

However, some disadvantages also became apparent. Although the filter is effective, its use also led to an increase in false-positive blockings. Approximately 1.5 percent of legitimate requests were wrongly rejected—a value that has increased by 0.38 percent compared to the unprotected version. However, this might be considered a negligible factor given the increased security.

However, another disadvantage of the protective mechanism is the increased computational effort. According to Anthropic, the required computing power increases by almost 24 percent. The larger and more complex the model to be protected is, the greater the associated computational effort becomes, making the use of the system in productive environments more expensive.

Jailbreaks remain a challenge

Despite the high blocking rate, the system is not infallible. Experts emphasize that new jailbreaking techniques could continue to exploit vulnerabilities. Security researchers suggest combining various protection mechanisms to ensure an even more robust defense. Anthropic even announced a bug bounty program among users of the Claude LLMs: Until February 10, Claude users could sign up for a test, with up to 15,000 US dollars offered to anyone who manages to find a general approach to bypassing the developed jailbreak protection algorithm.

Anthropic emphasizes that the filters used can be flexibly adjusted to respond to new threats. However, the further development of such protective mechanisms remains a challenge in the race between attackers and developers—and shows that absolute security does not exist even in AI development. (sg)

Subscribe to the newsletter now

Don't Miss out on Our Best Content

By clicking on „Subscribe to Newsletter“ I agree to the processing and use of my data according to the consent form (please expand for details) and accept the Terms of Use. For more information, please see our Privacy Policy. The consent declaration relates, among other things, to the sending of editorial newsletters by email and to data matching for marketing purposes with selected advertising partners (e.g., LinkedIn, Google, Meta)

Unfold for details of your consent