Anthropic has developed a new security system for its language model Claude that is intended to make so-called jailbreaks more difficult. The system, called Constitutional Classifiers, filters problematic requests to prevent Claude from generating unwanted content. In 95% of the tested cases the system proved successful, but it also brings drawbacks when deployed.
With the increasing prevalence of generative AI, AI jailbreaks pose a significant threat, as they can bypass the security and ethical guidelines of an AI model. Anthropic, the provider of Claude LLMs, claims to have found a protective mechanism that prevents such jailbreaks in 95% of all test cases.
Anthropic is considered one of the leading providers of advanced, proprietary AI models in the USA, alongside ChatGPT developer OpenAI. The company's series of large language models, named Claude, is designed to be safe, accurate, and reliable, and can perform a variety of tasks, including text creation, image analysis, and code debugging. Like all major providers of AI models on the market, Anthropic faces a significant challenge: preventing users from circumventing the protection mechanisms of a trained AI model with cleverly crafted prompts and misusing it to disclose sensitive information or serve other undesirable purposes.
What is an AI jailbreak and why is it a problem?
A jailbreak in the context of AI models describes a technique that allows bypassing the security and ethical guidelines of a model. This can cause language models to generate content they would normally refuse—such as instructions for making dangerous substances, assistance with cyberattacks, or other prohibited information.
Such attacks employ various methods: some jailbreaks rely on targeted prompt manipulation, deceiving the model through indirect phrasing or role-playing. Others use obfuscated inputs, such as altered spellings or symbolic representations, to outsmart the system. In some cases, several of these techniques are combined to bypass existing security filters.
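To make the obfuscation trick more concrete from the defender's point of view, the following minimal Python sketch normalizes leetspeak-style spellings before a filter inspects the prompt. It is purely illustrative and not part of Anthropic's system; the keyword list and filter function are hypothetical placeholders for what would in practice be a trained classifier.

```python
# Illustrative only: a naive normalizer that undoes simple character
# substitutions (leetspeak) before a content filter inspects the prompt.
# Real systems rely on trained classifiers rather than keyword lists.

LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"})

BLOCKED_TOPICS = {"fireworks recipe"}  # hypothetical placeholder list


def normalize(prompt: str) -> str:
    """Lowercase the prompt and reverse common character substitutions."""
    return prompt.lower().translate(LEET_MAP)


def is_blocked(prompt: str) -> bool:
    """Return True if the normalized prompt mentions a blocked topic."""
    cleaned = normalize(prompt)
    return any(topic in cleaned for topic in BLOCKED_TOPICS)


print(is_blocked("F1r3w0rks recipe, please"))  # True: the obfuscation is undone
print(is_blocked("Firework safety tips"))      # False: legitimate request
```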
The problem is that no protective measure is absolutely secure. Security researchers continually discover new ways to outsmart AI models. In security-critical areas—such as biotechnology or cybersecurity—this can have serious consequences if dangerous knowledge falls into the wrong hands.
OpenAI and Anthropic therefore continuously refine their models to counter the threat of such jailbreaks. Whenever a new jailbreak prompt becomes known, the providers try to close the gap quickly through updates. Ironically, the DeepSeek models DeepSeek-R1 and DeepSeek-V3, which recently gained attention for their impressive performance, are reported to perform particularly poorly at preventing jailbreak requests, which likely makes them unsuitable for applications where security is of utmost importance.
An AI filter with strengths and weaknesses
Anthropic's protection system is based on the Constitutional AI approach, in which a predefined list of principles dictates what content is allowed. Using synthetic data, 10,000 jailbreaking prompts were generated to test the system against known attack patterns. An initial internal test was already encouraging: while an unprotected version of Claude 3.5 Sonnet blocked only 14 percent of the attempted attacks, the protected variant achieved a blocking rate of 95 percent.
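The following Python sketch shows, in strongly simplified form, how such a classifier-gated pipeline can be wired around a model call: an input classifier screens the prompt, the model answers, and an output classifier screens the response. The class name, the stub classifiers, and the abbreviated constitution are hypothetical stand-ins, not Anthropic's actual implementation, where the classifiers are trained models derived from a written constitution.

```python
# Minimal sketch of a classifier-gated pipeline around an LLM call.
# The "classifiers" below are trivial stubs so the example runs without
# any external API; in Constitutional Classifiers they are trained models.
from dataclasses import dataclass
from typing import Callable

CONSTITUTION = [
    "Refuse instructions for producing dangerous substances.",
    "Refuse assistance with cyberattacks.",
]  # hypothetical, abbreviated principle list


@dataclass
class GuardedModel:
    generate: Callable[[str], str]           # the underlying LLM call
    input_classifier: Callable[[str], bool]  # True = prompt violates the constitution
    output_classifier: Callable[[str], bool] # True = response violates the constitution

    def ask(self, prompt: str) -> str:
        # Screen the incoming prompt before it ever reaches the model.
        if self.input_classifier(prompt):
            return "Request blocked by input classifier."
        response = self.generate(prompt)
        # Screen the model's answer before it is returned to the user.
        if self.output_classifier(response):
            return "Response blocked by output classifier."
        return response


def fake_llm(prompt: str) -> str:
    return f"Model answer to: {prompt}"


def naive_classifier(text: str) -> bool:
    return "explosive" in text.lower()  # placeholder for a trained classifier


guarded = GuardedModel(fake_llm, naive_classifier, naive_classifier)
print(guarded.ask("Summarize the history of chemistry"))    # passes both filters
print(guarded.ask("Give me instructions for an explosive"))  # blocked at input
```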
However, some disadvantages also became apparent. Although the filter is effective, its use led to more false-positive blockings: approximately 1.5 percent of legitimate requests were wrongly rejected, an increase of 0.38 percentage points compared to the unprotected version. Given the gain in security, however, this may be a negligible trade-off.
Another disadvantage of the protective mechanism is the increased computational effort. According to Anthropic, the required computing power rises by almost 24 percent. The larger and more complex the model to be protected, the greater this overhead becomes, which makes deploying the system in production environments more expensive.
Jailbreaks remain a challenge
Despite the high blocking rate, the system is not infallible. Experts emphasize that new jailbreaking techniques could continue to exploit vulnerabilities, and security researchers suggest combining various protection mechanisms for an even more robust defense. Anthropic even announced a bug bounty program for users of the Claude LLMs: until February 10, Claude users could sign up for a test, with up to 15,000 US dollars offered to anyone who finds a universal way to bypass the newly developed jailbreak protection.
Anthropic emphasizes that the filters used can be flexibly adjusted to respond to new threats. However, the further development of such protective mechanisms remains a challenge in the race between attackers and developers—and shows that absolute security does not exist even in AI development. (sg)