How many queries does it take to break an AI? We put a number on it.

Wednesday, November 26, 2025

Every few weeks a new jailbreak method goes viral: a clever prompt, a self-play strategy, a way to coax out system prompts or revive something a model was supposed to forget. The storylines are familiar, but what’s been missing is a ruler: how fast could any attacker succeed, even in the best case, given what the model reveals per query?

A NeurIPS 2025 Spotlight paper from MBZUAI postdoctoral researcher Masahiro Kaneko and Tim Baldwin, Provost and Professor of Natural Language Processing, proposes an answer that’s simple yet surprisingly powerful: measure the bits leaked per query, and you can predict the minimum number of queries an adversary needs.

The core result is a formula: to drive the attack error rate down to ε, an attacker needs at least log(1/ε)/I queries, where I is the bits leaked per query, the mutual information between what the model exposes on each call and the attacker’s latent target (jailbreak success, a hidden prompt, or a string that unlearning was meant to erase). More bits per query, fewer queries; less disclosure, more work. The paper proves the relationship, then shows it holds across seven models and three attack families.
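To make the ratio concrete, here is a minimal sketch (not the paper's code) that plugs assumed leakage rates into the bound; the bits-per-query values are invented placeholders chosen only to echo the thousands/hundreds/dozens pattern described later, and logs are taken base 2 so the units match bits.

```python
import math

def min_queries(epsilon: float, bits_per_query: float) -> float:
    """Lower bound on queries needed to reach attack error epsilon,
    given a leakage rate of `bits_per_query` bits per query (log base 2)."""
    return math.log2(1.0 / epsilon) / bits_per_query

# Invented leakage rates for three disclosure regimes, target error 1%.
for regime, bits in [("tokens only", 0.002),
                     ("tokens + logits", 0.02),
                     ("tokens + thinking process", 0.2)]:
    print(f"{regime:>26}: at least {min_queries(0.01, bits):,.0f} queries")
```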

“The work on this project coincided with the emergence of DeepSeek, which presented the thought process within LLM applications,” Kaneko says. “I began to wonder whether exposing the thought process of AI models could increase the risk of attacks.”

The math behind it

The setup treats each interaction as a noisy information channel from the model’s hidden property T (for instance, “will this prompt succeed?”) to an observable signal Z that the service exposes: answer tokens, token probabilities or logits, even the model’s visible “thinking process”. The amount of information each query carries about T is the “leakage rate” I(Z; T), measured in bits. From there, classical information theory and sequential testing deliver the bound: any attacker, including an adaptive one that learns from prior responses, must spend roughly log(1/ε)/I(Z; T) queries to reach error ε.
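For intuition, the leakage rate is just the textbook mutual information of that channel. The sketch below computes it for an invented two-by-two joint distribution between T and a coarse observable Z; it illustrates the definition only, and is not the paper's estimator.

```python
import numpy as np

def mutual_information_bits(joint: np.ndarray) -> float:
    """I(Z; T) in bits from a joint distribution p(t, z); rows index T, columns index Z."""
    p_t = joint.sum(axis=1, keepdims=True)
    p_z = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_t * p_z)[mask])))

# Invented joint distribution: T = "will this prompt succeed?", Z = a coarse observable.
joint = np.array([[0.45, 0.05],
                  [0.15, 0.35]])
leak = mutual_information_bits(joint)
print(f"leakage rate: {leak:.3f} bits per query")
print(f"queries to reach 1% error: at least {np.log2(1 / 0.01) / leak:.0f}")
```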

The authors also show the bound is tight: a sequential probability ratio test (SPRT) can match it up to lower-order terms. This provides a principled way to translate UI choices like exposing log-probs or chain-of-thought into concrete attack surfaces.
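The SPRT itself is classical (Wald's test). The toy version below, for a binary observation stream, shows the shape of the sequential attacker the bound is compared against; the parameters are arbitrary, and the binary setup is a simplification rather than anything from the paper.

```python
import math
import random

def sprt(observations, p1: float, p0: float, alpha: float = 0.01, beta: float = 0.01):
    """Wald's sequential probability ratio test for H1: p = p1 vs H0: p = p0
    over a stream of 0/1 observations. Returns (decision, queries_used)."""
    accept_h1 = math.log((1 - beta) / alpha)
    accept_h0 = math.log(beta / (1 - alpha))
    llr, n = 0.0, 0
    for z in observations:
        n += 1
        llr += math.log(p1 / p0) if z else math.log((1 - p1) / (1 - p0))
        if llr >= accept_h1:
            return "H1", n
        if llr <= accept_h0:
            return "H0", n
    return "undecided", n

random.seed(0)
stream = (random.random() < 0.7 for _ in range(10_000))  # true success rate is 0.7
print(sprt(stream, p1=0.7, p0=0.5))
```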

The ratio implies a phase change: leak essentially nothing (I ≈ 0) and attack cost grows like 1/ε; leak even a little and it collapses to log(1/ε). The paper’s experiments make that point concrete: expose answer tokens only, and a determined attacker often needs on the order of thousands of tries; add logits, and the number drops to hundreds; reveal the model’s thinking process, and you’re down to dozens. The pattern holds across system-prompt extraction, jailbreaks, and relearning attacks meant to resurrect content a model has “forgotten”. It also holds across model families, from OpenAI’s GPT-4 and DeepSeek-R1 to open systems like OLMo-2 and Llama-4 variants.
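The contrast between the two regimes is easiest to see numerically. In the sketch below, the 0.1 bits-per-query figure is an arbitrary placeholder; the point is only how differently the two columns grow as the target error shrinks.

```python
import math

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    no_leak_cost = 1 / eps                     # I ~ 0: cost grows like 1/eps
    leaky_cost = math.log2(1 / eps) / 0.1      # assumed leak of 0.1 bits per query
    print(f"eps={eps:<6g}  1/eps={no_leak_cost:>8.0f}   log2(1/eps)/I={leaky_cost:>4.0f}")
```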

The authors estimate per-query leakage I(Z; T) using three standard variational lower bounds implemented with a fixed RoBERTa critic, then conservatively take the maximum. They evaluate four disclosure regimes: tokens only; tokens plus logits; tokens plus thinking-process tokens; and tokens plus both. They attack with both adaptive methods and non-adaptive paraphrase searches. Only the adaptive attacks track the inverse law tightly, a nice sanity check that using the leaked bits is what moves you toward the bound. And they do it across seven LLMs and three tasks (system-prompt leakage, jailbreak, and relearning after unlearning) so the result doesn’t hinge on a single model or a single exploit style.
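For a feel of what a variational lower bound looks like, here is a minimal numpy sketch of an InfoNCE-style estimator on a toy Gaussian channel. It is emphatically not the paper's setup (which uses three standard bounds with a fixed RoBERTa critic on real model outputs); it relies only on the fact that InfoNCE yields a valid lower bound for any critic, with a hand-picked bilinear critic standing in for a learned one.

```python
import numpy as np

def infonce_lower_bound_bits(x: np.ndarray, y: np.ndarray, critic) -> float:
    """InfoNCE lower bound on I(X; Y), in bits, from paired samples.
    Any critic gives a valid lower bound; a better critic only tightens it."""
    scores = critic(x[:, None], y[None, :])             # scores[i, j] = f(x_i, y_j)
    diag = np.diag(scores)
    log_mean_neg = np.log(np.exp(scores).mean(axis=1))  # contrast against shuffled pairs
    return float(np.mean(diag - log_mean_neg) / np.log(2))

# Toy leaky channel: Y is a noisy copy of X (correlated Gaussians, rho = 0.8).
rng = np.random.default_rng(0)
n, rho = 2_000, 0.8
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

estimate = infonce_lower_bound_bits(x, y, critic=lambda a, b: 2.0 * a * b)
exact = -0.5 * np.log2(1 - rho**2)   # closed form for correlated Gaussians
print(f"lower bound: {estimate:.2f} bits   exact I(X; Y): {exact:.2f} bits")
```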

Why it matters

One of the most practically useful analyses concerns decoding entropy. We know that turning the temperature down or tightening nucleus sampling makes outputs more deterministic; here the effect shows up directly as lower leakage per query. In other words, keep diversity high and you amplify leakage; clamp it and you pay with more repetitive, sometimes dull outputs, but you also make attacks costlier to execute. The theory gives you a way to set those dials responsibly: for a given rate limit (the maximum number of queries you’ll allow), how much transparency can you afford without dropping into the logarithmic regime where attacks get cheap?
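A small illustration of the decoding dial: since I(Z; T) can never exceed the entropy of what is emitted, lowering the temperature caps how much any single response can leak. The five-token logit vector below is made up; only the direction of the effect matters.

```python
import numpy as np

def next_token_entropy_bits(logits: np.ndarray, temperature: float) -> float:
    """Shannon entropy (bits) of the next-token distribution after temperature scaling."""
    scaled = logits / temperature
    p = np.exp(scaled - scaled.max())
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

logits = np.array([3.0, 2.2, 1.0, 0.3, -1.0])   # an invented 5-token vocabulary
for t in (1.5, 1.0, 0.7, 0.3):
    print(f"temperature {t:>3}: {next_token_entropy_bits(logits, t):.2f} bits of output entropy")
```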

The experiments also surface a tension you can feel in product design. Many teams expose log-probabilities for developer ergonomics and diagnostics; others display a visible chain-of-thought in the name of transparency. The paper quantifies the cost for both. Moving from “tokens only” to “tokens + logits” or to “tokens + thinking process” adds fractional bits per query, but those fractions translate to orders of magnitude fewer queries needed for a successful attack.

The authors don’t argue that transparency is bad, only that you finally have a yardstick to balance it against risk, and that ad-hoc redaction is no substitute for a measured budget. If you must reveal extra signal, you can compensate with rate limits and lower-entropy decoding to keep the implied query cost above a threshold that fits your threat model.

When asked about what surprised him most about the findings, Kaneko says: “What surprised us is that it’s possible to formalize the lower bounds of arbitrary targets, such as thought processes and logits, within an information-theoretic framework.”

Reframing safety

For researchers, the work resets how to compare attack algorithms. Instead of plotting success versus raw query count and calling the fastest method “state-of-the-art,” you can ask how close a method gets to the information-theoretic limit given the leakage regime.

There are clear boundaries to keep in mind. The study is about query complexity, not semantic subtlety: if your service’s moderation classifier is weak, high information-theoretic cost won’t save you from trivial oversights. The leakage estimates rely on learned bounds; they’re conservative but still approximations. And real systems layer in latency side-channels, caching, and heuristic guardrails that can either leak extra information or dampen it in ways not captured by token-level signals alone. Still, as a foundational framing, “bits per query” does something rare: it turns a large adversarial space into a quantitative design problem you can reason about with first principles, and then validate with data.

Kaneko has big plans for the future: “I plan to propose mechanisms that appropriately balance the trade-off between transparency and vulnerability caused by the disclosure of thought processes or logits,” he says.

But perhaps the most important shift this paper invites is away from absolutism. Safety isn’t “show chain-of-thought, yes or no?” or “log-probs, love or hate?” It’s a budget: disclose I bits per query and allow N queries per hour, and your adversary’s best-case rate of progress is pinned at N·I bits per hour.
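Reading that budget literally gives a back-of-the-envelope calculator like the one below; the leakage value, rate limit, and target error are all assumptions plugged in for illustration, not numbers from the paper.

```python
import math

def best_case_attack_hours(epsilon: float, bits_per_query: float, queries_per_hour: int) -> float:
    """Fastest possible attack implied by the budget: bits needed over bits available per hour."""
    return math.log2(1.0 / epsilon) / (bits_per_query * queries_per_hour)

# Assumed: 0.05 bits leaked per query, 100 queries/hour allowed, 1% target error.
print(f"attacker needs at least {best_case_attack_hours(0.01, 0.05, 100):.1f} hours")
```

Whether the resulting number is comfortable depends on your threat model, which is exactly the point of having a number at all.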
