Can we tell when AI wrote that code? This project thinks so, even when the AI tries to hide it

Tuesday, November 11, 2025

If you’ve shipped software in the last year, there’s a good chance some of your code was suggested by an AI model. Vibe coding has been a huge productivity win for software development, but challenges remain. Security teams want to keep obfuscated backdoors out of production, and researchers worry about a feedback loop in which machine-generated snippets leak into public repos and are then scraped back into training sets.

That’s the messiness MBZUAI researchers set out to tackle with Droid, a resource suite and detector family unveiled at EMNLP 2025. The idea, presented by M.Sc. student Daniil Orel, is straightforward: detection only gets robust if the training and evaluation data reflect how code is actually produced today across languages, domains, generation styles, and even adversarial “make it look human” tricks.

While Daniil was in Kazakhstan, helping run the national selection process for this year’s International Olympiad in AI, he noticed something troubling: “It was very difficult to distinguish between code written by the students and the code generated by coding copilots. I spotted several submissions that contained purely LLM-generated code, so I asked myself: how many others have I missed? That question was the inspiration for developing Droid.”

The team started by reframing the data problem. Most prior detectors learned from narrow, binary setups: one or two languages, a couple of popular APIs, and a clean line between “human” and “machine.” That’s not how modern workflows look. DroidCollection is their answer: more than one million code samples spanning seven programming languages, three coding domains (competitive programming, open source “general use,” and research/data science), and outputs from 43 different code models across 11 model families. Alongside fully AI-generated snippets, the dataset intentionally includes human–AI co-authored code and, critically, adversarially humanized machine code designed to fool detectors. It also varies decoding strategies to capture the stylistic diversity users actually see. The result is a rare thing in detection work: scale plus scenario realism.
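To make that labeling scheme concrete, here is a minimal sketch of what one record in such a corpus might look like. The field and label names are illustrative assumptions for this article, not the released DroidCollection schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodeSample:
    """Hypothetical record layout for a provenance-labeled code corpus."""
    code: str                 # the snippet itself
    language: str             # e.g. "python", "cpp", "java"
    domain: str               # "competitive", "general", or "research"
    generator: Optional[str]  # model family, or None for human-written code
    decoding: Optional[str]   # decoding strategy, if machine-generated
    label: str                # "human", "ai_generated", "ai_refined", or "ai_humanized"

sample = CodeSample(
    code="def add(a, b):\n    return a + b\n",
    language="python",
    domain="general",
    generator=None,
    decoding=None,
    label="human",
)
```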

Teaching detectors to spot what looks human

A lot of care went into how those examples were built. For sources that don’t come with prompts, the authors used inverse instructions (summaries and step-by-step tasks derived from existing code) to query generators in ways that mimic real requests. When docstrings or comments exist, they served as prompts for both base and instruction-tuned models. For competitive programming, the models got the original task statements. And to counter subtle biases (for example, detectors overfitting to completions of human seeds), the team added synthetic code generated without any human seed at all, conditioned instead on persona-like profiles, so the machine generations aren’t just “completions of humans” but full samples shaped by diverse pseudo-users and tasks. It’s an unusually broad sampling of how coders and copilots interact.
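As a rough sketch of the inverse-instruction idea (not the paper’s exact prompts), the loop below derives a task description from existing human code and then asks a generator to solve that task from scratch; `llm` stands for any text-generation callable you supply.

```python
def inverse_instruction_prompt(human_code: str) -> str:
    """Ask the model to recover the task a human was plausibly given."""
    return (
        "Read the following code and write a concise task description that a "
        "developer could have been given to produce it:\n\n"
        f"{human_code}\n\nTask description:"
    )

def regenerate_from_inverse_instruction(llm, human_code: str) -> str:
    """Two steps: summarize the code into a task, then have the generator
    solve that task from scratch, yielding a machine-written sample that
    mimics a realistic request rather than a completion of a human seed."""
    task = llm(inverse_instruction_prompt(human_code))
    return llm(f"Write code for the following task:\n{task}")
```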

The authors also considered a red-team twist. If detectors are trained only on “honest” outputs, they fail the moment someone nudges a model to “write like a human.” The Droid paper leans into this by introducing two adversarial regimes. First are prompt-based attacks that instruct models to favor human-like style. Second is preference-tuning: the authors curate 157,000 human vs. machine response pairs (same prompt, different authorship), and fine-tune small and mid-sized LLMs with DPO and LoRA so their outputs explicitly drift toward human distributions. These tuned models then generate a new stream of “machine-humanized” code that is filtered and folded back into DroidCollection. In aggregate, the adversarial split is roughly balanced between prompt-based and preference-tuned attacks, exactly the kind of data most published detectors never see in training.
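For intuition, here is a minimal sketch of the standard DPO objective behind that preference-tuning step; in this setting the “chosen” response would be the human-written code and the “rejected” one the machine-written code for the same prompt, so the tuned generator drifts toward human style. The beta value is illustrative, and in practice only small LoRA adapters would be updated.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective: make the policy prefer 'chosen' (human-written)
    over 'rejected' (machine-written) responses, measured relative to a frozen
    reference model. Inputs are summed token log-probabilities per response."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```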

All of this feeds into DroidDetect, a pair of encoder-only detectors fine-tuned from ModernBERT and trained with a multi-class objective that explicitly distinguishes human, AI-generated, and AI-refined code. That shift alone matters because purely binary labels are a poor fit for the way developers actually use copilots: partial rewrites, gap-fills, and human-to-LLM continuations are common, and they leave a weaker stylistic signature than fully synthetic blocks. The detectors are trained across languages, domains, and decoding regimes, and then evaluated both in-distribution and in out-of-distribution configurations to probe transfer.
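As a rough picture of that recipe (a sketch, not the released DroidDetect weights or training code), a three-way provenance classifier on top of ModernBERT could be set up like this; the checkpoint name assumes the public answerdotai/ModernBERT-base release, and the label names are illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["human", "ai_generated", "ai_refined"]   # multi-class, not binary
ckpt = "answerdotai/ModernBERT-base"               # assumed public checkpoint

tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(
    ckpt,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)

def classify(code: str) -> str:
    """Return the predicted provenance class for a code snippet
    (after fine-tuning the classification head on labeled data)."""
    inputs = tok(code, truncation=True, max_length=2048, return_tensors="pt")
    logits = model(**inputs).logits
    return labels[int(logits.argmax(dim=-1))]
```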

In comprehensive evaluations, DroidDetect-Large posts weighted F1 ≈ 99.2% for binary (human vs. machine) and ≈ 93.7% for the more realistic three-way classification across languages, outperforming strong baselines that had been fine-tuned on their own curated corpora. Even the smaller DroidDetect-Base sits in the same ballpark. When the team “locks” training to a single language and tests elsewhere, transfer is strongest between syntactically similar families (C/C++ to Java, for instance) and weaker for typologically different languages like Python or JavaScript, which aligns with intuition and highlights why multilingual training data matters. Domain transfer shows a similar pattern: a detector trained only on algorithmic problems struggles more on open-source general code and research code, and vice versa, underscoring the value of broad domain coverage.

Adversarial robustness is the real stress test, because off-the-shelf detectors that look decent on clean splits struggle the moment samples are humanized. The API-based GPTZero, for instance, manages recall of only ≈ 0.10 on adversarial code in the Droid evaluation setup; generic text detectors fare better on one axis and worse on another, often spiking false positives on genuine human code. After fine-tuning baselines on DroidCollection, some improve on adversarial recall but still trade off badly against recall on human-written code. By contrast, DroidDetect keeps recall above 0.9 on adversarial samples and stays strong on human-written code because it is trained with explicit exposure to both prompt-based and preference-tuned attacks.

The authors also experiment with structural signals and even try early fusion of text and structure. In practice, the text-only ModernBERT encoders carry most of the load, and fusing ASTs delivers marginal gains at best. More impactful are two training strategies. First, a metric-learning variant with triplet loss (their DroidDetectSCL) improves class separability when adversarial and refined code look suspiciously human, yielding small but consistent boosts. Second, the team treats label noise seriously: since some “human” code in public corpora may have been AI-assisted, they apply uncertainty-based resampling (MC Dropout) to flag the noisiest seven percent of human-labeled samples and drop them during training. That small act of dataset hygiene nudges final F1 upward across binary, three-way, and four-way setups.
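Here is a minimal sketch of the MC Dropout filtering idea, assuming a Hugging Face-style PyTorch classifier whose forward pass returns `.logits`; the number of passes and the seven-percent cutoff mirror the description above, but the paper’s exact procedure may differ.

```python
import torch

@torch.no_grad()
def mc_dropout_entropy(model, inputs, n_passes: int = 20) -> torch.Tensor:
    """Predictive entropy under MC Dropout: keep dropout active at inference
    and average class probabilities over several stochastic forward passes.
    High entropy flags samples whose 'human' label may be unreliable."""
    model.train()  # keeps dropout layers on; gradients stay disabled via no_grad
    probs = torch.stack(
        [torch.softmax(model(**inputs).logits, dim=-1) for _ in range(n_passes)]
    ).mean(dim=0)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

# Usage sketch: score every human-labeled sample, rank by entropy, and drop
# the top ~7% before training, mirroring the dataset-hygiene step above.
```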

Beyond benchmarks: why provenance matters

Why should anyone outside the detection subculture care? Because provenance is becoming a first-class requirement across education, hiring, research, and security. If you’re running programming courses, a detector that confuses “refined” with “machine” will catch cheats and fail honest students in equal measure. If you’re scoring take-home tests, drift across languages or domains turns into bias. If you’re managing a codebase, adversarially humanized malware is a nightmare scenario. The Droid results suggest a path out of brittle, one-off detectors: treat code as code (not text), embrace multi-class labels that reflect human–AI collaboration, vary the generation settings, and red-team your training data with the exact evasion strategies you expect in the wild.

It’s not solved, of course. The paper is candid about limits. Coverage can never be perfect in a fast-moving model ecosystem; closed APIs and reasoning-heavy models are costly to sample at scale, so some distributions will remain underrepresented. And multilingual, multi-paradigm coverage will always trail the long tail of languages: today’s seven will need to become ten, then fifteen. Still, the team has released both DroidCollection and DroidDetect and intends to update the corpus as new generator families emerge.

“We are currently running a SemEval shared task for AI-generated code detection, and I hope that it will help to popularize this research direction,” says Daniil when asked about his plans to develop this project further.

By making co-authorship, decoding variation, and explicit adversarial pressure central to the dataset, and by coupling that with a detector that treats refined code as its own class, the Droid work shifts the discussion from “can we tell AI from human?” to “can we tell how AI was involved, even when someone tries to hide it?”
