When a team of scientists from MBZUAI traveled to Singapore to present a study at the inaugural Arabic Natural Language Processing Conference, they expected to come home with memories of engaging discussions with colleagues who are working to solve similar problems.
Instead, they returned to Abu Dhabi with a different kind of souvenir.
The team was awarded best paper for a study that establishes a promising approach for processing Arabic speech and has the potential to advance new and useful tools for speakers of the language.
Complexities of speech
While 2023 may be remembered as the year that large language models, or LLMs, broke into the mainstream and brought widespread recognition to the field of natural language processing, many applications, like OpenAI’s ChatGPT, have limitations when it comes to operating in the realm of the spoken word.
“LLMs may be the thing right now, but if we want to provide similar functionality for speech-related tasks, there are many complexities we must consider when designing programs that process speech,” said Hawau Olamide Toyin, lead author on the study and a master’s student in machine learning at MBZUAI.
Indeed, there are elements of the spoken word, such as intonation, stress and tone of voice, that don’t exist in text but convey meaning in verbal communication. “Speech has so many other features when compared to text,” Toyin said. She also noted that processing speech is challenging because of the complexity and information density of audio: “The representation of one speech sample is larger than a representation of an equivalent sample of text.”
What’s more, unlike text, speech does not have clear, physical separations between its elements. Words and sounds flow into each other, making it harder for machines to determine where one word ends and another begins. This continuous nature of speech presents challenges in accurately identifying and segmenting individual words and other elements of language.
“Text is discrete, while speech is continuous, and we have to develop various methods to effectively analyze and interpret the continuous stream of linguistic and nonlinguistic elements in speech together,” said Amirbek Djanibekov, a co-author on the study and a master’s student in natural language processing at MBZUAI.
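To make the contrast concrete, here is a minimal sketch, using illustrative numbers rather than figures from the study, of why one second of speech carries far more raw values than the text it encodes, and why the waveform offers no built-in word boundaries:

```python
import numpy as np

# One second of speech at a typical 16 kHz sampling rate is a continuous
# waveform represented by 16,000 floating-point samples.
sample_rate = 16_000
waveform = np.zeros(sample_rate, dtype=np.float32)  # placeholder audio

# The same second of speech might hold only a couple of words, which a
# text model sees as a handful of discrete tokens.
text = "مرحبا بالعالم"  # "Hello, world" in Arabic
tokens = text.split()   # crude whitespace tokenization, for illustration only

print(f"audio values per second: {waveform.size}")      # 16000
print(f"text tokens for the same span: {len(tokens)}")  # 2

# The token list has explicit separators (spaces); the waveform has none,
# so a speech model must learn where one word ends and the next begins.
```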
Even with the intricacies of processing speech, over the past decades, scientists have developed many programs that process the spoken word and can carry out a variety of tasks, such as automatic speech recognition, which powers voice assistants like Siri and Alexa, and text-to-speech synthesis, through which a machine renders text into audio.
The performance of these applications, however, varies widely across languages. While automatic speech recognition programs may perform well in English, they likely struggle with other languages. This is even true of a widely spoken language like Arabic, which has nearly 400 million speakers worldwide.
Toyin, Djanibekov and the rest of the team are working to change that.
Portrait of an ArTST
The team calls their application Arabic Text and Speech Transformer, or ArTST, pronounced “artist.” Unlike most applications, which handle either speech or text, ArTST can process both. Indeed, it is the first model trained specifically on Arabic that can perform a range of tasks across both speech and text.
The development of ArTST was inspired by a previous innovation, called SpeechT5, which was designed by a team affiliated with Microsoft to process both speech and text. SpeechT5 was mostly trained on English, however, limiting its performance in Arabic.
The team at MBZUAI adopted SpeechT5’s overall architecture, but trained the model only on modern standard Arabic, a formal variety of the language that is typically used in newspapers, on news broadcasts and in government forums, but not in everyday conversation. The researchers used standard datasets to train ArTST, including one called the Multi-Genre Broadcast dataset, which consists of recordings of television programs and has been used in many studies to benchmark how accurately models process modern standard Arabic.
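For readers who want a sense of what this architecture looks like in practice, the sketch below runs the original English SpeechT5 text-to-speech checkpoint through the Hugging Face transformers library. It is not ArTST itself, and the zero-vector speaker embedding is a placeholder assumption (real pipelines condition on learned speaker embeddings such as x-vectors):

```python
# A minimal sketch of the SpeechT5 text-to-speech pipeline that ArTST
# builds on, using Microsoft's public English checkpoint.
# Requires: pip install transformers torch soundfile
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Speech and text share one model.", return_tensors="pt")

# SpeechT5 conditions generation on a 512-dimensional speaker embedding;
# a zero vector is used here purely as a stand-in.
speaker_embeddings = torch.zeros((1, 512))

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("output.wav", speech.numpy(), samplerate=16000)
```

ArTST keeps this unified encoder-decoder design but replaces the training data with modern standard Arabic, so one model can serve recognition and synthesis alike.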
Depth versus breadth
The team’s approach highlights a tension in the development of natural language processing tools, in that scientists must often find the right balance between general and specific, breadth and precision.
When a machine learning model trains on a large and broad set of data, it has the potential to be useful in more instances and to help more people. But there is also the risk that it may perform poorly on certain tasks.
Toyin, Djanibekov and their colleagues made the decision to focus solely on Arabic and optimize the model’s performance for it, which proved to be effective. The other authors on the study are Ajinkya Kulkarni, a postdoctoral researcher at MBZUAI, and Hanan Aldarmaki, assistant professor of natural language processing at MBZUAI.
The study described the performance of ArTST on three tasks: automatic speech recognition, text-to-speech synthesis and dialect identification. ArTST performed better than other, larger multi-language models as measured by word error rate and character error rate, metrics that quantify the mistakes computers make when processing language.
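Both metrics are built on edit distance: the minimum number of insertions, deletions and substitutions needed to turn a system’s output into the reference transcript, divided by the reference length, counted in words for word error rate and in characters for character error rate. A minimal, self-contained implementation (not the evaluation code used in the study) looks like this:

```python
def edit_distance(ref, hyp):
    """Minimum insertions, deletions and substitutions turning hyp into ref."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all remaining reference items
    for j in range(n + 1):
        d[0][j] = j  # insert all remaining hypothesis items
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

def wer(reference, hypothesis):
    """Word error rate: edit distance over words / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: the same computation over characters."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat sit"))  # 0.333...: 1 substitution / 3 words
print(cer("the cat sat", "the cat sit"))  # 0.0909...: 1 substitution / 11 chars
```

Lower is better on both scores, which is how the comparison against the larger multi-language models was made.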
The researchers note that while there may be benefits and uses of models that are fluent in more than one language, a model designed for one language, with all its complexity and nuance, will likely perform better than a multi-language model.
Beyond modern standard Arabic
“This work is a first step towards improving speech processing technologies for the Arabic language,” Aldarmaki said. “We are working to make the model more inclusive of different Arabic dialects and assessing the potential of the model for other applications, such as speech-to-speech translation.”
In addition to dialects, the researchers plan to expand ArTST to code-switched language, in which speakers mix dialects or languages together. People do this without thinking much about it, but shifts from one dialect or language to another present challenges for machines.
The researchers suggest that the nuances of Arabic and the complexities of code switching would be best addressed by “focused development” of models that excel in Arabic, instead of a model that has the ability to process many different languages.