Large language models can struggle to provide meaningful and accurate responses to queries in what are known in the field of natural language processing as “low-resource languages.” This is because large language models, or LLMs, are largely built and trained on openly available information, such as webpages, newspapers, interview and video transcripts and other texts found on the internet.
English is extremely prevalent on the world wide web, as are widely used languages like Arabic and Chinese. But more than 7,000 languages are spoken in the world today, and there are few online resources for most of them. In fact, even languages with huge numbers of speakers, such as Burmese, with approximately 30 million speakers, or Tagalog, with approximately 80 million, are considered low resource because they are underrepresented on the internet.
“Based on our experience, lower resource language models are not as accurate as English language models,” said Haonan Li, a postdoctoral fellow at MBZUAI. “And for some very low resource languages, even ChatGPT cannot produce a meaningful answer.” Li is referring to GPT-4, the fourth generation of OpenAI’s popular LLM, which was launched to the public in March by the San Francisco-based company and is considered to be the best-performing LLM today.
Li and his colleague Fajri Koto, a postdoctoral research fellow at MBZUAI, are co-authors of a new study that presents an innovative approach to creating training data that helps LLMs accurately follow instructions in languages other than English. “The big question we want to address is that we want to democratize access to LLMs by speakers of other languages,” Koto said.
A widely adopted approach to improving the performance of an LLM is a technique called instruction tuning. In this method, instructions or questions a user might provide to a model are paired with expected, appropriate responses, giving the model guidance on how to respond to a variety of queries. A challenge, though, is that few instruction-response datasets exist in languages other than English.
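To make this concrete, here is a small, invented illustration of what instruction-response pairs can look like, written in Python using the instruction/input/output format popularized by datasets such as Alpaca; the examples are hypothetical and not drawn from any dataset discussed in this article.

```python
# Two made-up examples of instruction-tuning data: each item pairs an
# instruction (and optional input) with the response the model should learn
# to produce when it sees that instruction.
instruction_pairs = [
    {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Large language models are trained on vast amounts of text "
                 "collected from the internet, which shapes what they know.",
        "output": "LLMs learn what they know from large volumes of internet text.",
    },
    {
        "instruction": "Translate 'Good morning' into Indonesian.",
        "input": "",
        "output": "Selamat pagi.",
    },
]

# During instruction tuning, the model is fine-tuned so that, given the
# instruction (and input), it generates the text in the "output" field.
```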
To address this gap, Li, Koto and their colleagues developed a dataset they call Bactrian-X, named after the two-humped camel native to Central Asia.
Koto provides an example of a potential use case for Bactrian-X: “It is often the case that a model can summarize a document in a language other than English, say Indonesian. But the instructions to tell the model to do that work will still need to be written in English. But what if a user doesn’t speak English? This is what we would like to change with Bactrian, which is to provide the ability to give instructions in the same language that the person would want to use.”
Bactrian-X draws on two open-source instruction-tuning efforts that are also named after domesticated mammals: Alpaca, which was developed at Stanford University, and Dolly, named after the cloned sheep, which was built by the company Databricks. While valuable in their own right, both Alpaca and Dolly are focused on English.
Li, Koto and their team translated the instructions from Alpaca and Dolly into 51 other languages with the help of Google Translate. They then fed these translated instructions into OpenAI’s GPT to generate responses to the translated queries. The result is a large dataset: 67,000 instruction-response pairs for each of the 51 languages, for a total of 3.4 million instruction-response pairs in Bactrian-X.
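The general recipe can be sketched in a few lines of code. The snippet below is only an illustration of the translate-then-generate idea described above, not the team’s actual pipeline; the client libraries, the model name ("gpt-3.5-turbo") and the language code ("id" for Indonesian) are assumptions made for this example.

```python
# Sketch of the translate-then-generate recipe: translate an English
# instruction into a target language, then ask a chat model to answer it,
# yielding a response in that same language.
from google.cloud import translate_v2 as translate  # pip install google-cloud-translate
from openai import OpenAI                            # pip install openai

translate_client = translate.Client()  # requires Google Cloud credentials
openai_client = OpenAI()               # requires OPENAI_API_KEY in the environment

def build_pair(english_instruction: str, target_lang: str) -> dict:
    # Step 1: machine-translate the English instruction.
    translated = translate_client.translate(
        english_instruction, target_language=target_lang
    )["translatedText"]

    # Step 2: generate a response to the translated instruction.
    completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": translated}],
    )
    response = completion.choices[0].message.content

    return {"instruction": translated, "output": response}

# Example: build one Indonesian instruction-response pair.
pair = build_pair("List three tips for staying healthy.", target_lang="id")
```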
They also trained Bactrian-X with a technique called low-rank adaptation, or LoRA, which allowed the researchers to train only a small set of adapter parameters to tune the performance of Bactrian-X without changing the much larger base model, saving storage space and money.
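Below is a minimal sketch of how LoRA is commonly applied with the Hugging Face peft library. The base model name and the LoRA hyperparameters are illustrative assumptions, not necessarily the settings used for Bactrian-X.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model chosen as a placeholder for this example; it is
# not necessarily the base model used for Bactrian-X.
base_model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the frozen base model with small trainable adapter weights.
model = get_peft_model(base_model, lora_config)

# Only the adapter parameters are trained; the base model stays unchanged,
# which is what keeps storage and training costs low.
model.print_trainable_parameters()
```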
The result is the “largest general-purpose multilingual instruction dataset to date,” according to the researchers.
In the study, the researchers compared the performance of Bactrian-X to other multilingual instruction models and found “consistently high results across these tasks [that] highlight the effectiveness of [their] multilingual instruction dataset and adapter technique for instruction tuning in languages beyond English.”
Minghao Wu, a visiting researcher at MBZUAI, Alham Fikri Aji, an assistant professor at MBZUAI, and Timothy Baldwin, department chair and professor of natural language processing at MBZUAI, also contributed to the study.
By making their dataset and models available, the team aims to advance LLMs in a much wider set of languages, potentially leading to improvements that will allow more people throughout the world to benefit from the capabilities of these powerful applications.