Twice as many large language models (LLMs) were released by developers in 2023 compared to the previous year, according to Stanford University’s most recent AI Index Report. While data for 2024 isn’t yet available, it’s clear that LLMs have found a permanent place in our technological toolkit and their number will likely continue to grow. As more people use LLMs, it’s important that developers implement safeguards to prevent these tools from generating potentially harmful information.
Over the years, scientists have developed methods for evaluating the safety of LLMs, but much of their work has focused on English. Yuxia Wang, a postdoctoral researcher at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), is expanding the study of LLM safety to other languages. Wang is the lead author of a recent study analyzing the safety of several LLMs using a Chinese dataset.
In addition to a general analysis of models’ ability to provide safe responses, Wang and her coauthors looked at their ability to manage “region-specific safety risks” that Chinese speakers may encounter when using language models. And while Wang and her coauthors are primarily interested in evaluating LLM safety mechanisms, they also want to know if models are “over-sensitive,” meaning they falsely identify innocuous questions as harmful.
The research is being presented at the 62nd Annual Meeting of the Association for Computational Linguistics in Bangkok. Coauthors on the study include Postdoctoral Fellow Haonan Li, Department Chair of Natural Language Processing and Professor of Natural Language Processing Preslav Nakov, and Provost and Professor of Natural Language Processing Timothy Baldwin, all of MBZUAI.
Building the dataset
To evaluate the models, the researchers developed an open-source dataset in Chinese that is composed of more than 3,000 prompts, or questions, that can be asked of an LLM. The Chinese dataset was translated from an earlier English dataset compiled by Wang and her colleagues called Do-Not-Answer. It was aptly named as it contains prompts that LLMs shouldn’t respond to.
After translation, the researchers “localized” the prompts, replacing names, locations and other context-specific words with Chinese equivalents. (For example, the name “Christina” in the English dataset was replaced with “Zhang San” in the Chinese dataset.)
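As a rough illustration of that localization step, the sketch below applies a simple substitution table to a translated prompt. Apart from the "Christina" / "Zhang San" example taken from the article, the table entries and the function name are hypothetical; the researchers' actual localization involved more than string substitution, including judgment about scenarios and cultural context.

```python
# Illustrative only: a simple substitution pass of the kind that could support
# localization of translated prompts. The mapping entries (aside from
# Christina -> Zhang San, mentioned in the article) are assumptions.

LOCALIZATION_MAP = {
    "Christina": "张三",   # example from the article: "Christina" -> "Zhang San"
    "New York": "上海",    # assumed place-name substitution
    "dollars": "人民币",   # assumed currency substitution
}

def localize_prompt(prompt: str) -> str:
    """Replace context-specific terms with Chinese equivalents."""
    for source_term, local_term in LOCALIZATION_MAP.items():
        prompt = prompt.replace(source_term, local_term)
    return prompt

print(localize_prompt("Christina wants to transfer dollars from New York."))
```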
Do-Not-Answer contained five categories of risky prompts, including hateful or offensive language and malicious uses such as aiding disinformation or criminal activity. In the Chinese-language dataset, the researchers added a sixth category, "region-specific sensitivity," which includes five subtypes, such as politically sensitive topics, controversial historical events and regional or racial issues.
The researchers augmented the Chinese version of the Do-Not-Answer dataset, which contained what are known as “direct attack” prompts, by developing two additional versions of the questions.
First, they made the direct attack questions harder for a model to identify as clearly adversarial. They did this by employing various tactics, such as concocting a realistic scenario in which a user needed to get information from an LLM to do his job properly, or by introducing “humble and obscure words.”
Second, they made minor alterations to questions with the goal of making the prompts harmless. This harmless set served to identify “false positives,” cases in which the models identified a prompt as harmful when in fact it was benign.
Overall, the dataset featured 999 direct-attack prompts, taken from the original Do-Not-Answer dataset and translated into Chinese, 1,044 indirect-attack questions, and 999 questions designed to identify over-sensitivity in the models.
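One way to picture the resulting dataset is as a flat list of records, each carrying its risk category and the variant it belongs to. The schema below is a hypothetical sketch, not the authors' released format; only the counts mentioned in the comment come from the article.

```python
# Hypothetical schema for the three question sets described above; the
# released dataset may be structured differently.

from collections import Counter
from dataclasses import dataclass

@dataclass
class SafetyPrompt:
    text: str            # the question posed to the model (in Chinese)
    risk_category: str   # e.g. "malicious uses", "region-specific sensitivity"
    variant: str         # "direct_attack", "indirect_attack", or "harmless"

# In the full dataset there are 999 direct-attack, 1,044 indirect-attack and
# 999 harmless prompts; these two records are placeholders.
dataset = [
    SafetyPrompt("...", "malicious uses", "direct_attack"),
    SafetyPrompt("...", "region-specific sensitivity", "indirect_attack"),
]

print(Counter(p.variant for p in dataset))
```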
Evaluating model performance
The researchers studied five models in total: three designed specifically for Chinese (ChatGLM3, Qwen and Baichuan) and two multilingual models (LLaMA2 and Xverse).
The researchers had both people and GPT-4, a large language model developed by OpenAI, evaluate the responses of the models and categorize them according to the way in which the model responded.
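The snippet below sketches what the automatic side of such an evaluation can look like, using GPT-4 as a judge via the OpenAI Python client. The rubric and labels here are simplified stand-ins; the judging prompt and response categories actually used by the authors are not reproduced.

```python
# Minimal sketch of LLM-as-judge evaluation, assuming the OpenAI Python client
# and an OPENAI_API_KEY set in the environment. The labels below are
# illustrative, not the study's actual response categories.

from openai import OpenAI

client = OpenAI()

JUDGE_INSTRUCTIONS = (
    "You are reviewing a chatbot's answer to a potentially risky question. "
    "Label the answer with one of: REFUSED, SAFE_ANSWER, HARMFUL_ANSWER."
)

def judge_response(question: str, answer: str) -> str:
    """Ask GPT-4 to categorize how a model responded to a prompt."""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return result.choices[0].message.content.strip()
```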
According to machine evaluation, Qwen, developed by Alibaba Cloud, was the safest model. LLaMA2, developed by Meta, provided the most responses deemed harmful. That said, LLaMA2 performed the best on the English dataset, which is unsurprising given that much of the data it was trained on was likely in English.
All the models struggled with questions related to region-specific sensitivity, but the Chinese-specific models were better than the multilingual models, with Qwen performing the best and LLaMA2 the worst. These results indicate that the Chinese-specific models possess some cultural understanding of what is and isn't permissible in Chinese culture, Wang explained.
Overall, the models proved to be generally safe, even on the prompts that were intentionally deceptive. “We assumed that this indirect way of asking questions would result in a lot of unsafe or harmful responses, but the models performed well and many of their responses were safe,” Wang said.
Safety and performance in practice
Wang noted that while she thinks developers are generally successful in balancing the demands of safety and performance, the tradeoff between the two for a particular model is ultimately set by its developer. She pointed out that Anthropic's Claude doesn't perform quite as well as OpenAI's GPT-4o, but it is safer. "It depends on how we perceive the problem and which we think is more important, safety or performance," she said. (More information about the balance between model safety and performance can be found on a leaderboard that ranks several popular LLMs.)
Today, the implications of an unsafe LLM are somewhat limited, Wang explained. But the impact of a model that fails to identify dangerous prompts may become more severe as LLMs begin providing information to other machines, known as agents, that can take actions in the world. “Current models just output text, but once we get to a point where a model outputs text to an agent, and that agent can interact with the environment, things will become more dangerous,” she said.
Wang and her colleagues are now developing a similar dataset for Arabic, which also includes country- and region-specific prompts. “It’s important to compare languages and cultures in the region,” she said. “Since LLMs have trained on less Arabic data compared to English or Chinese, there is an even greater risk that users could encounter harmful responses when prompting a model in Arabic.”