People use AI tools like large language models (LLMs) to help them compose all sorts of writing. While LLMs can greatly increase speed and efficiency in composition, there are cases in which people are still expected to write for themselves without the aid of these powerful technologies.
Students, of course, are supposed to write their own essays, which are used to assess not only writing ability but also critical thinking skills. In the world of academia, the practice of peer review assumes that reviewers of scientific papers read and thoughtfully weigh the strengths and weaknesses of the research they are reviewing — and communicate these views in their own writing.
Unsurprisingly, both students and professors have leaned on LLMs for support in these cases. Some have even gotten in trouble for it. Others, meanwhile, have been accused of using LLMs when they didn’t.
Couldn’t all this be solved if scientists developed a way to accurately detect text written by machines?
A team from the Mohamed bin Zayed University of Artificial Intelligence and other institutions has taken a step forward in this effort by developing a new tool to identify writing generated with the help of LLMs. The researchers also propose a new, detailed scheme to classify the role machines played in generating text. The work was recently presented at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), held in Miami.
The team’s system, called LLM-DetectAIve, classifies text into four categories that represent how people use large language models: human-written; machine-generated; machine-written and machine-humanized; and human-written and machine-polished. “We want to identify the degree that a model is involved in the generation process,” said Yuxia Wang, a postdoctoral researcher in natural language processing at MBZUAI and a co-author of the study.
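For readers who want a concrete picture, the four categories can be thought of as the label set a four-way classifier chooses from. The snippet below is purely illustrative; the label strings and their mapping to class indices are assumptions, not the authors’ exact implementation.

```python
# Illustrative only: the four classes of machine involvement described in the
# article, expressed as the label set a four-way classifier would choose from.
# The exact strings and index order used by LLM-DetectAIve may differ.
LABELS = [
    "human-written",
    "machine-generated",
    "machine-written, machine-humanized",
    "human-written, machine-polished",
]

def describe_prediction(class_index: int) -> str:
    """Map a predicted class index to its human-readable category name."""
    return LABELS[class_index]
```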
Wang and her colleagues also developed a demo website for LLM-DetectAIve that allows users to submit text to the detector to determine if machines played a role in its generation. The site features a “playground” in which people can test degrees of involvement by LLMs in writing.
Wang explained that the team’s work was motivated by common misuses of LLMs, such as students using them to write essays. In practice, though, a student doesn’t simply prompt a model to write an essay and hand the output to the teacher. Students often use deliberate, layered strategies designed to obscure machine involvement: asking the LLM to weave a personal story into the response, heavily editing the text the model provides, or asking the model to regenerate the text according to specific style guidelines. “Students are smart,” Wang said, “and we need our system to be smarter.”
In the study, the authors note that while some schools may allow students to use LLMs to proofread their writing, it’s rarely acceptable for students to use them to compose essays assigned in the classroom. Since some uses of LLMs are legitimate, it’s important to build a detector that classifies text in greater detail than the traditional binary of human-written versus machine-generated.
There are other factors that make identifying machine-generated text difficult. Developers are continually releasing new LLMs, which can evade detectors that are trained on datasets composed of text generated by older models.
Recognizing machine involvement in writing requires detectors that can perform accurately across different types of texts and writing styles. The team tackled this challenge by building LLM-DetectAIve on a large and diverse dataset that includes different kinds of writing, known as domains, and by comparing the performance of different training methods.
The work builds on M4GT-Bench, a dataset previously constructed by Wang and others, which the team augmented with text from six domains, such as Wikipedia, Reddit and abstracts of scientific studies on arXiv. These additions expanded M4GT-Bench by more than 300,000 pieces of text. To generate the new material, the team used several popular LLMs, including OpenAI’s GPT-4o, Meta’s LLaMA 3 and Google’s Gemini.
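As a rough sketch of how the machine-involved categories could be produced with an LLM API (not the authors’ actual prompts or pipeline), the example below generates the three machine-involved variants for a single topic using the OpenAI Python client. The prompts, model choice and placeholder human draft are assumptions made for illustration.

```python
# Sketch of generating the three machine-involved text categories with an LLM.
# Prompts, model name and the placeholder human draft are illustrative
# assumptions, not the dataset's actual construction recipe.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str, model: str = "gpt-4o") -> str:
    """Return a single chat completion for the given prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

topic = "the history of the printing press"
human_draft = "..."  # a human-written text on the same topic, taken from the source corpus

# Machine-generated: the model writes the text from scratch.
machine_generated = generate(f"Write a short encyclopedia-style article about {topic}.")

# Machine-written, then machine-humanized: the model rewrites its own output
# to sound more personal and less formulaic.
machine_humanized = generate(
    "Rewrite the following text so it reads as if a person wrote it, adding a "
    f"brief personal aside and varying the sentence rhythm:\n\n{machine_generated}"
)

# Human-written, then machine-polished: the model only edits an existing human draft.
machine_polished = generate(
    f"Polish the grammar and flow of this draft without changing its meaning:\n\n{human_draft}"
)
```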
Wang and her colleagues fine-tuned three detectors — RoBERTa, DeBERTa and DistilBERT — on a subset of the dataset and tested RoBERTa and DeBERTa across the six domains.
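A minimal sketch of what fine-tuning one of these detectors might look like with the Hugging Face transformers library appears below. The file names, column names and hyperparameters are assumptions for illustration; they are not the authors’ published training configuration.

```python
# Minimal four-way fine-tuning sketch with Hugging Face transformers.
# File names, column names and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

NUM_LABELS = 4  # human-written, machine-generated, machine-humanized, machine-polished

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_LABELS
)

# Hypothetical CSV files with a "text" column and an integer "label" column (0-3).
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="detector-roberta",
        per_device_train_batch_size=16,
        num_train_epochs=3,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```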
Previous research by Wang and colleagues had shown that detector accuracy drops when tested on out-of-domain examples. For example, a detector trained on Wikipedia pages and asked to classify abstracts from scientific papers would perform worse than one classifying text from the domain it was trained on. This is a major potential limitation of the technology, since detectors used in the real world are exposed to text from many domains.
The team proposed several strategies to address this problem: domain-specific detectors, universal detectors that work across domains, and detectors trained with domain-adversarial neural networks (DANNs), an approach designed to keep a detector accurate regardless of the domain of the text it sees.
The researchers gave each detector text and asked it to predict the domain and to classify the text into one of the four categories describing the level of machine involvement. DANN training with RoBERTa delivered the best performance of all the training methods and models. Wang and her colleagues write in the study that this “suggests that decoupling the model from domain-specific representation leads to an improvement in its overall performance.”
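Conceptually, domain-adversarial training attaches a second head to the detector that tries to predict the text’s domain, while a gradient reversal layer flips that head’s gradient before it reaches the shared encoder, nudging the encoder toward features that work in any domain. The sketch below, which assumes a RoBERTa encoder and illustrative layer sizes and loss weighting, shows the core mechanism; it is not the authors’ implementation.

```python
# Sketch of domain-adversarial (DANN-style) training for a detector, assuming
# a RoBERTa encoder. The gradient reversal layer flips gradients flowing back
# from the domain classifier, pushing the encoder toward domain-invariant
# features. Layer sizes and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient on its way back to the encoder.
        return -ctx.lambd * grad_output, None

class DANNDetector(nn.Module):
    def __init__(self, num_labels=4, num_domains=6, lambd=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")
        hidden = self.encoder.config.hidden_size
        self.label_head = nn.Linear(hidden, num_labels)    # four machine-involvement classes
        self.domain_head = nn.Linear(hidden, num_domains)  # e.g. Wikipedia, Reddit, arXiv...
        self.lambd = lambd

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        label_logits = self.label_head(pooled)
        domain_logits = self.domain_head(GradientReversal.apply(pooled, self.lambd))
        return label_logits, domain_logits

def loss_fn(label_logits, domain_logits, labels, domains):
    # The label loss is minimized normally, while the reversed gradient from
    # the domain loss discourages domain-specific features in the encoder.
    ce = nn.CrossEntropyLoss()
    return ce(label_logits, labels) + ce(domain_logits, domains)
```

Note that the domain labels are only needed during training; at inference time such a detector simply returns the four-way prediction.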
The researchers also compared LLM-DetectAIve to other detectors that are available online. These other systems don’t classify text according to the same four categories and typically follow a binary classification system. LLM-DetectAIve achieved 97.5% accuracy, beating the other systems (GPTZero 87.5%; ZeroGPT 69.17%; and Sapling AI 88.33%).
While LLM-DetectAIve already performs well, the team sees ways to improve it further. For example, they are considering how DANN training could be deployed more broadly to boost performance. They would also like to add a fifth category to their scheme: machine-written and human-edited text. Doing so, however, would be costly, as people would need to be hired to edit texts used as training data. They also hope to expand the dataset to cover more languages.