How jailbreak attacks work and a new way to stop them

Monday, August 04, 2025

Jailbreaking is a technique used to get large language models (LLMs) to do things they’re not supposed to do. This could be generating misinformation, sharing confidential information, coding malware, or producing other kinds of harmful content. The concept is simple enough: prompt a model in a way that tricks it into generating harmful responses, even though security measures have been implemented to prevent it from doing so.

But researchers haven’t been able to fully explain what exactly is happening in the neural networks that power LLMs when people jailbreak these systems, making it difficult to develop strong safeguards.

A recent study by researchers at MBZUAI and other institutions sheds light on this mystery and proposes a new approach to improve the safety of LLMs against jailbreak attacks. The researchers presented their findings at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna.

Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, and Xiuying Chen are authors of the study.

Challenges in explaining jailbreaking

“Jailbreaking is an important security issue for LLMs and it’s an interesting phenomenon where people can simply use words to get a model to bypass security measures,” says Gao, a research associate and incoming doctoral student in Natural Language Processing at MBZUAI and co-author of the study.

Gao focuses on mechanistic interpretability, which seeks to explain the internal workings of AI systems. And because jailbreaking isn’t fully understood, it’s a good topic for interpretability research, he says.

Today’s best LLMs have strong safety mechanisms that are designed to prevent jailbreak attacks, but there are many older, open-source models available that are vulnerable. And yet there is no consensus among researchers about how exactly jailbreak attacks circumvent safety mechanisms.

Neural networks used in LLMs have dozens of layers that each play a role in transforming an input into a response. Researchers have argued that attacks lead to harmful activations in certain layers of neural networks, but they can’t agree on which layers are responsible. Some say that it happens in the low layers, while others say it happens in the middle or deep layers. Knowing just where the behavior occurs would make it easier to design safeguards.

When Gao reviewed research by interpretability researchers on jailbreaks, he noticed that they typically only used a small number of prompts (around 100) to test models. He also found that their methods assumed that harmful and harmless samples could be separated linearly in the representation space of LLMs. “I realized that current interpretations were not adequate, so I decided to use large-scale datasets and non-linear probing to interpret the effects of jailbreak attacks,” he says.

Gao and his colleagues built a large dataset of more than 30,000 prompts, collecting benign and harmful samples from a variety of existing datasets, including ones built specifically to test jailbreaking. Increasing the scale and abandoning linear assumptions led them to new insights.
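As a rough illustration of the difference between the two approaches, the sketch below fits both a linear probe and a small non-linear probe (a multi-layer perceptron) on hidden-state vectors labeled benign or harmful. The placeholder data, dimensions, and probe architectures are assumptions for illustration, not the study's exact setup.

```python
# Sketch: linear vs. non-linear probing of LLM hidden states.
# The activations and labels below are random placeholders; in practice they would
# come from a model's hidden layers and a labeled benign/harmful prompt collection.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(2000, 512))   # placeholder for per-prompt activations
labels = rng.integers(0, 2, size=2000)         # placeholder benign (0) / harmful (1) labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

# Linear probe: assumes benign and harmful samples are linearly separable.
linear_probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Non-linear probe: a small MLP that can capture curved decision boundaries.
nonlinear_probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200).fit(X_train, y_train)

print("linear probe accuracy:    ", linear_probe.score(X_test, y_test))
print("non-linear probe accuracy:", nonlinear_probe.score(X_test, y_test))
```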

New insights into how jailbreaking works

When a user prompts an LLM, the prompt produces what are known as activations in the various layers of the neural network. Safe, harmful, and jailbreak prompts generate different patterns of activations across these layers. Benign prompts fall within a kind of safe area, while harmful prompts fall outside it and are not answered.

Gao and his co-authors found that jailbreak activations can be interpreted as a kind of shift that pushes them outside the safety boundary. They also found that this shift typically happens in the low and middle layers of the network.
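One simple way to picture such a shift, layer by layer, is to treat the centroid of benign activations as the center of the safe area and count how many jailbreak activations land beyond a distance threshold around it. The sketch below uses this centroid-and-threshold proxy, which is an illustrative simplification rather than the paper's actual analysis.

```python
# Sketch: estimating, for each layer, how often jailbreak activations escape a
# safety boundary approximated by the spread of benign activations.
import numpy as np

def boundary_escape_rates(benign_acts, jailbreak_acts, quantile=0.95):
    """benign_acts, jailbreak_acts: arrays of shape (num_layers, num_prompts, hidden_dim)."""
    rates = []
    for layer in range(benign_acts.shape[0]):
        centroid = benign_acts[layer].mean(axis=0)
        benign_dist = np.linalg.norm(benign_acts[layer] - centroid, axis=-1)
        radius = np.quantile(benign_dist, quantile)   # proxy for the safety boundary
        jb_dist = np.linalg.norm(jailbreak_acts[layer] - centroid, axis=-1)
        rates.append(float((jb_dist > radius).mean()))
    return rates  # one escape rate per layer; peaks suggest where the shift occurs
```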

While linear assumptions are common in interpretability studies, this approach tends to miss the big picture, Gao explains. “If we really want to understand how jailbreaking works, we need to get a higher-level view which can be provided by non-linear assumptions and large datasets.”

Preventing jailbreaking with activation boundary defense

With a better understanding of how jailbreaking works, Gao and his colleagues developed a new approach called activation boundary defense (ABD) that constrains jailbreak prompt activations within a safety boundary by using what is known as a penalty function. A small penalty is applied to activations that fall within the safety boundary while a much larger penalty is applied to activations that fall outside it.
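As a minimal sketch of that idea, the function below applies a small penalty to activations that stay inside an estimated safety boundary and a much larger one to activations that escape it. The centroid-and-radius representation of the boundary and the penalty weights are hypothetical choices for illustration, not the authors' exact formulation.

```python
# Sketch: a boundary-style penalty on activations, in the spirit of ABD.
import torch

def activation_penalty(h, centroid, radius, inside_weight=0.01, outside_weight=1.0):
    """h: (batch, hidden_dim) activations from one layer; centroid/radius define the boundary."""
    dist = torch.linalg.norm(h - centroid, dim=-1)
    overshoot = torch.clamp(dist - radius, min=0.0)   # zero for activations inside the boundary
    # Small penalty inside the boundary, much larger penalty once activations escape it.
    return inside_weight * dist + outside_weight * overshoot ** 2

# Example: score a batch of activations against a boundary estimated from benign prompts.
hidden = torch.randn(8, 4096)         # placeholder layer activations
benign_centroid = torch.zeros(4096)   # placeholder centroid of benign activations
penalties = activation_penalty(hidden, benign_centroid, radius=60.0)
```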

Researchers have developed defenses against jailbreak attacks before, but these approaches typically rely on auxiliary modules or require models to process additional tokens, both of which increase computational costs. Gao chose a mathematical approach because he wanted to design a safety mechanism that was as efficient as possible.

Measuring the performance of ABD

Gao and his co-authors found that ABD was effective on jailbreaking benchmarks while having only a small impact on models' overall performance.

They evaluated ABD with four open-source LLMs (LLaMA-2-7B-Chat, Vicuna-7B-v1.3, Qwen-1.5-0.5B-Chat, and Vicuna-13B-v1.5) on several benchmark datasets. They found that it stymied 98% of attacks while reducing performance on general tasks by less than 2%, a huge improvement over other methods, which can reduce performance by up to 37%.

Gao acknowledges, however, that a limitation of ABD is that it relies on the internal safety alignment of the model itself. If a model doesn’t have any safety mechanisms at all, ABD won’t perform well.

In the future, he hopes to apply insights about interpretability and explainability to other characteristics of LLMs, like hallucination and models' emergent capabilities. But for now, his findings contribute to a better understanding of what is going on in these opaque systems known as neural networks when they are subjected to jailbreak attacks.
