A team of researchers from the University of California, Berkeley, the University of the Witwatersrand, Lelapa AI and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) has won an Outstanding Paper Award at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). The work they presented encourages natural language processing (NLP) researchers to think differently about the concept of “low-resource” languages and how this term is used in research.
Hellina Hailu Nigatu, one of the study’s authors, became interested in NLP when she began experimenting with large language models (LLMs) and their capabilities with her native language, Amharic, and another language spoken in Ethiopia, Tigrinya. Nigatu, a recent visiting student at MBZUAI and doctoral student at UC Berkeley, was surprised by her interactions with these systems. “Sometimes I would be faced with toxic outputs,” she says. “Sometimes the languages would just not be included in the models at all.”
Nigatu’s experience illustrates how sharply LLMs’ abilities vary across languages. State-of-the-art LLMs perform well in English, Chinese and a handful of other languages, but there are more than 7,000 languages in the world, few of which are adequately supported by these technologies.
Researchers often refer to languages that are underserved by technology as “low-resource.” The phrase, however, has limitations. “There is often little distinction between the different kinds of low-resource languages and it’s not particularly illuminating to lump languages like Hindi, Arabic and Zulu into the same category” due to their unique histories and characteristics, says Monojit Choudhury, professor of natural language processing at MBZUAI and co-author of the study.
Moreover, the term low-resource can obscure the myriad concrete ways in which languages are unsupported by technology. It’s not that some innate characteristic of these languages makes them low-resource; rather, it’s the way systems have been designed that excludes them, Nigatu explains.
Perhaps more confounding, while each language labeled low-resource might not have as many speakers as English or Chinese, collectively these languages are spoken by billions of people across the globe.
Nigatu, Choudhury and their co-authors — Benjamin Rosman of the University of the Witwatersrand and Lelapa AI, Atnafu Lambebo Tonja of MBZUAI and Lelapa AI, and Thamar Solorio of MBZUAI — propose a new and detailed analysis of the various dimensions in which languages have been described as low-resource.
The team’s hope is that their analysis will encourage greater specificity when researchers discuss the concept of “resourcedness” and lead to targeted interventions to improve the ways these languages are supported. The study was presented at EMNLP, which was held in Miami.
Prior to collaborating on the study, the authors had each contemplated the definition and value of the term low-resource. When they began working together, a consensus emerged that the field of NLP has continually redefined what it means to be a low-resource language. The team found that this trend mirrors Zeno’s paradox of Achilles and the tortoise (which became the title of the paper): just as Achilles can never overtake a tortoise that starts ahead of him and keeps moving, if high-resource languages have a head start and continually improve over time, how can low-resource languages ever catch up?
It’s an interesting theoretical question to ponder, but the authors’ solution is practical and concrete. Greater specificity regarding the ways languages are supported by technologies provides a structure for measuring progress.
The researchers explain in the study that the lack of a clear definition of what characterizes a low-resource language has made it difficult to determine the specific ways in which developers can create new tools and resources to support these languages, and to measure their impact. Greater specificity would also help researchers determine a threshold beyond which a language should no longer be considered low-resource.
The researchers began by surveying recent computational linguistics papers that included the terms “low-resource” or “under-resourced” in their titles or abstracts. This yielded 150 papers published between 2017 and 2023 that dealt with a wide variety of human languages. Through an approach called inductive thematic analysis, they identified four aspects along which languages have been described as low-resource.
Socio-political aspects relate to the historical and economic constraints that have shaped how languages are used by different communities and how they have been studied. In Indigenous communities in North and South America, for example, European languages like English, Spanish and Portuguese are widely used today in contexts ranging from education to media, which limits the creation of data in Indigenous languages and in some cases threatens their survival.
The second aspect covers both human and digital resources. The numbers of native speakers, linguists and NLP researchers familiar with a language all influence the way tools are built. So too does the availability of digital resources like Wikipedia, since developers often gather data by “scraping” publicly available text from websites.
The third aspect, artifacts, describes the production and availability of linguistic knowledge and related data and technology. Researchers’ knowledge of languages’ scripts and structures varies, for example, which poses challenges for analysis.
Community agency, the fourth aspect, cuts across the others, as it fundamentally influences how language technologies are built. Direct involvement from communities can lead to greater impact: when communities are engaged in the development of technologies, their values become embedded in them. “We want to foster meaningful collaborations with community members so that they can have a say in how the technologies are designed,” Nigatu says.
The team’s research builds on previous work by Choudhury and others that proposed a system for organizing the world’s languages into five classes based on how well they are served by NLP technologies. While this system offered a more detailed classification than previous ones, it still grouped significantly different languages together. For example, Cherokee and Kalaallisut fell into the same class even though the two languages have significantly different numbers of speakers and levels of digital language support, according to Ethnologue, a database that tracks statistics about the world’s languages.
Choudhury says there is an important debate taking place in the NLP field today about why some languages are well served by NLP technologies while others aren’t. This study can encourage researchers to think more concretely and deeply about the relationship between languages and technology.
Encouraging thinking about languages along different “resourcedness” dimensions can also point to specific actions. For a language with a large number of speakers but few digital resources, investing in more data collection might be productive. Alternatively, for a language with ample digital resources but few native speakers, initiatives that grow the community of speakers might be more valuable.
That said, the involvement of linguistic communities in how their languages are discussed is essential. “Communities should get to define how their languages are viewed,” Nigatu says. And as the use of the term low-resource illustrates, there is limited benefit to imposing external taxonomies onto languages.