A newly appointed professor at MBZUAI is bringing experience he gained developing natural language processing applications at one of the world’s great tech giants to the halls of the university.
Monojit Choudhury, professor of natural language processing at MBZUAI, began his career as a postdoc at Microsoft Research soon after he completed his doctorate in computer science at the Indian Institute of Technology, Kharagpur. At the time, he believed his stint in industry would be a brief sojourn before he found himself at the head of a classroom. To his surprise, his tenure at Microsoft lasted nearly 15 years.
Despite some previous teaching experience, Choudhury recently decided to embrace academia full-time. “Things in industry move so fast, and it seemed like it would be a good time to focus on deep research in natural language processing, the kind that really can only be done in academia,” he said.
In just a few short years, the field of natural language processing has indeed advanced rapidly.
After more than a decade at Microsoft Research, Choudhury was asked in 2022 to join Microsoft’s Project Turing, a deep learning initiative that builds AI models that can be integrated into a variety of Microsoft products, including Office applications such as Word and PowerPoint.
His stint at Turing, however, proved to be short-lived due to the radical transformations that the field of natural language processing experienced in 2023. “Within six months of my joining Turing, the world changed” with the launch of OpenAI’s GPT-4, the company’s most advanced large language model, or LLM, Choudhury said.
Microsoft had formed a strategic partnership with OpenAI in 2019, initially investing $1 billion in the San Francisco-based startup. Choudhury was familiar with OpenAI’s products and had used the third generation of its LLM, known as GPT-3. He was impressed by GPT-3’s ability to complete tasks for which it wasn’t explicitly trained, whether given no examples (zero-shot prompting) or only a handful (few-shot prompting).
But when Choudhury and his colleagues were given a preview of the next iteration of OpenAI’s LLM months before it was launched to the public, “we were blown away,” he said. “The first reaction to seeing GPT-4 for the first time was a sort of disbelief. What changed with GPT-4 was that it almost always performed as good or better than humans for any task that we could think of.”
At Microsoft, Choudhury worked on what’s known as responsible artificial intelligence, which refers to the development of AI that performs safely and according to ethical standards. At the time, he and his team manually programmed LLMs to prevent them from sharing toxic or biased content with users. This manual approach is effective but relies on coding by engineers and hours of model training.
During their test drive with GPT-4, Choudhury and his team wanted to see how well the model could identify toxic content in zero- and few-shot contexts. To their great surprise, GPT-4 performed essentially as well as people in identifying troublesome information.
“What was amazing was that on a toxicity dataset, GPT-4 agreed with humans 90% of the time. For 5% of test cases where humans and the model disagreed, GPT-4 was actually correct, while another 5% of the cases were truly ambiguous,” Choudhury said. “It was clear that GPT-4 was doing better than humans, at least on that set of data and on those tasks. We didn’t imagine that the tech would be developed so quickly and show such superhuman abilities.”
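The zero- and few-shot setups mentioned above can be illustrated with a short sketch. The prompt wording below is invented for illustration; it is not Choudhury’s actual evaluation setup, only the general pattern of asking a model to label toxicity with no examples versus a few labeled examples.

```python
# Sketch of zero- vs. few-shot toxicity prompts.
# All wording and examples here are illustrative, not from the original study.

def zero_shot(text: str) -> str:
    """Ask for a toxicity label with no labeled examples."""
    return f"Is the following message toxic? Answer Yes or No.\n\nMessage: {text}"

def few_shot(text: str, examples: list[tuple[str, str]]) -> str:
    """Prepend a handful of labeled examples before the query."""
    shots = "\n".join(f"Message: {m}\nToxic: {label}" for m, label in examples)
    return (
        "Decide whether each message is toxic. Answer Yes or No.\n\n"
        f"{shots}\nMessage: {text}\nToxic:"
    )

examples = [("Have a great day!", "No"), ("You are worthless.", "Yes")]
prompt = few_shot("Nobody wants you here.", examples)
```

In the zero-shot case the model relies entirely on what it learned during pretraining; in the few-shot case the examples in the prompt steer it toward the intended labeling scheme without any retraining.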
Though initially striking, GPT-4’s performance was far from perfect. As Choudhury continued working with GPT-4 and tested its capacity with languages other than English, he identified many opportunities for improvement. Indeed, the performance of GPT-4, and of many other LLMs, in all but the most widely spoken languages is quite poor.
Choudhury believes that enhancing the ability of natural language processing applications to work in a wider variety of languages could have significant economic and cultural benefits for billions of people throughout the world. But doing so will require significant changes to the way LLMs are developed and assessed.
Large language models crunch huge collections of text and other linguistic data in what is known as the training phase. Much of the data that is used to train LLMs comes from webpages and other repositories of linguistic data that are available on the internet. The internet, however, does not reflect the world’s linguistic diversity. Consequently, LLMs often excel in languages with abundant training data but struggle with most of the world’s approximately 6,000 languages.
As one example, Choudhury points out that there are approximately 130 million speakers of Japanese, while Hindi has nearly 350 million speakers. Yet due to a variety of historical and economic factors, there are many more resources that can be used to train LLMs in Japanese than there are in Hindi.
Writing in a recent piece published in Nature Human Behaviour, Choudhury noted that “The dominance of data from Latin and Cyrillic script-based European languages (most of which belong to the Indo-European family) has resulted in LLMs with syntactic, semantic and orthographic representations that are strongly biased towards these languages.”
Another issue is that developers of LLMs have traditionally evaluated and ranked the models according to their average performance across languages. This kind of calculation encourages developers to take the path of least resistance: bolstering the model’s ability in high-resource languages, like English, while neglecting low-resource ones.
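The incentive problem can be shown with a toy calculation (all scores below are invented for illustration): if a leaderboard reports the unweighted mean score across languages, a one-point gain in English raises the average exactly as much as a one-point gain in a low-resource language, even though the former is typically far cheaper to obtain.

```python
# Toy illustration of averaging across languages (all numbers invented).
scores = {"english": 0.92, "japanese": 0.88, "hindi": 0.55, "swahili": 0.40}

def mean_score(s: dict[str, float]) -> float:
    """Unweighted mean, as a simple leaderboard might report."""
    return sum(s.values()) / len(s)

baseline = mean_score(scores)

# Path of least resistance: squeeze 2 more points out of English.
easy = dict(scores, english=0.94)

# Harder but broader: lift Swahili by the same 2 points.
broad = dict(scores, swahili=0.42)

# Both moves raise the average by exactly the same amount (0.005 here),
# so the metric gives no extra credit for helping low-resource users,
# and the cheaper English improvement wins out.
print(round(mean_score(easy) - baseline, 3))
print(round(mean_score(broad) - baseline, 3))
```

One commonly suggested alternative is to report the minimum (or a low percentile) of per-language scores instead of the mean, which rewards lifting the weakest languages rather than polishing the strongest.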
“If developers dedicated a certain percentage of their efforts to other languages instead of continuing to make these systems even better for the languages they are already good at, they would likely see a greater return in their efforts,” Choudhury said.
There are great risks that come with developing powerful technologies, particularly with those like large language models that hold the potential to reshape economies and cultures. If scientists don’t build tools that reflect the diversity of the world’s languages, much of the great depth and heterogeneity of human culture may be lost.
“People try to learn languages and move to cultures where there are economic benefits,” Choudhury said. “If we don’t make a conscious effort to support our multilingual world, it will become more homogeneous, and we will be at risk of losing the great diversity of knowledge that we have developed.”
At MBZUAI, Choudhury is interested in exploring the close relationship between language and culture and how this connection is manifested in large language models. In the same way that today’s LLMs aren’t representative of the world’s languages, they are also deficient when it comes to representing cultures and values outside Western society. “Culture and values are complex social constructs that are difficult to define and even more difficult to collect data for,” he said. “Thus, the question of making LLMs inclusive of cultures and values of the Global South provides unique interdisciplinary challenges for engineers and researchers not only in NLP, but is also an opportunity for social scientists, psychologists and philosophers.”
When considering the future of natural language processing, he believes that with attentiveness, the speakers of many more languages around the world could benefit from these tools. “Realistically, it may not be practical for LLMs to handle all 6,000 languages equally,” Choudhury noted. “However, with the data available, we can aim to support two to three thousand. It’s a question of our collective will.”