Voice is the most natural bridge between people, businesses, and technology; the purest way we share emotion, knowledge, and connection across languages and cultures. It is also fast becoming the default interface for digital experiences in the Middle East, with voice-native AI moving from promise to practice across telecommunications, banking, retail, education, and media.
MBZUAI’s Incubation and Entrepreneurship Center (IEC) hosted ElevenLabs at its latest Palmside Chat – a session where students explored how AI is pushing the boundaries of voice and audio, allowing organizations to move from pilots to production. The session was led by Hussein Makki, general manager for the Middle East at ElevenLabs, and Maxime Khatoun, go-to-market strategist at ElevenLabs.
Through its audio AI research and products, ElevenLabs has cemented its place in the rise of agentic AI. The company is a global leader in lifelike text-to-speech (TTS) models, from AI dubbing that makes content accessible across languages, to on-the-go speech transcription and voice-native AI agents that come to life in minutes with striking realism and low latency.
Makki said businesses are feeling the pressure to re-imagine service and sales amid an instant-resolution economy. Human communication began with sound, evolved into stories, and now totals “128 trillion words every day”. Yet familiar frustrations such as telephone ‘hold music’ and rigid interactive voice response (IVR) menus remain. He said customers today expect immediate, personalized, and omni-channel support, not long hold times or rigid phone trees.
“We truly believe that voice will be the fundamental interface for technology going forward,” Makki said. “It is natural. It is emotional. It is ubiquitous. It is across languages.” Echoing big tech sentiment, he added, “Voice is the new keyboard”.
And with ever-increasing advancements in safety guardrails, responsibly designed, powerful audio technology can transform individual organizations and whole industries.
Early voice technology struggled to capture the nuance of human speech — tone, accent, pauses, inflection, laughter, and even hesitations. Capturing that nuance and doing it responsively required advances across multiple models and better orchestration.
That is where ElevenLabs focused its energy. “ElevenLabs ultimately is a research company and a product deployment company,” Makki said. “We focus on one thing, and we do it really well, which is audio and voice AI. Our models are contextually aware. They’re in multiple languages and endlessly scalable.
“Our mission is simple: to make content universally accessible and engaging across languages, across voice, and across sound. Three years into the story, we are a trusted AI platform globally. We have millions of users and serve enterprises and Fortune 500 companies.”
Introducing a recent TTS model, Makki told students that choice and customization underpin that growth. “We have thousands of voices,” he said. “I think today we have more than 10,000 voices that you can choose from. You can even create or clone your own voice.” Beyond TTS, the company’s AI stack now spans transcription (speech-to-text) and dubbing (speech-to-speech).
Khatoun walked the audience through a live, web-embedded voice agent – a full-stack product for online retail stores. It “inherits” capabilities and understanding from TTS, speech-to-text, and voice cloning, and orchestrates them with reasoning to provide a conversational experience.
“It’s really the agent that orchestrates all of this to provide a human-like experience,” Khatoun said. The agent switched languages on the fly, handled interruptions, navigated a store, recommended alternatives for hot weather, and completed checkout without a single keyboard stroke. “It’s very user-friendly,” he added.
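The orchestration Khatoun described, speech-to-text feeding a reasoning model whose reply is voiced back through text-to-speech, can be sketched as a simple turn-based loop. The functions and class below are illustrative stand-ins, not ElevenLabs’ actual API; each stage would be a real model call in production.

```python
from dataclasses import dataclass, field

# Illustrative stubs -- NOT the ElevenLabs API. Each function stands in
# for a real model call in the pipeline the demo showed.
def transcribe(audio: str) -> str:
    """Speech-to-text stage: for this sketch, 'audio' is already text."""
    return audio.strip()

def reason(history: list, user_text: str) -> str:
    """LLM reasoning stage: canned replies via simple keyword matching."""
    if "weather" in user_text.lower():
        return "For hot weather, I recommend our linen range."
    if "checkout" in user_text.lower():
        return "Your order is placed."
    return "How can I help you today?"

def synthesize(text: str) -> bytes:
    """Text-to-speech stage: encode the reply as audio-like bytes."""
    return text.encode("utf-8")

@dataclass
class VoiceAgent:
    history: list = field(default_factory=list)

    def turn(self, audio: str) -> bytes:
        user_text = transcribe(audio)             # 1. speech -> text
        reply = reason(self.history, user_text)   # 2. text -> decision
        self.history.append((user_text, reply))   # keep conversational context
        return synthesize(reply)                  # 3. text -> speech

agent = VoiceAgent()
print(agent.turn("What should I wear in this weather?").decode())
# -> For hot weather, I recommend our linen range.
```

The point of the sketch is the division of labor: the agent layer only coordinates the three model calls and carries the conversation history, which is what lets a real system switch languages mid-conversation or recover from interruptions.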
Behind the scenes, the setup is designed for builders, not only machine learning specialists. You can easily define the agent’s persona and rules (for example, “Always speak in the user’s language”), select a voice from a library of thousands of voices, add a knowledge base via URLs and PDFs, and connect application programming interfaces (APIs) for actions like cart updates or ticket creation.
Clear documentation helps teams move quickly: “We have people who start implementing the agents without even speaking to us, just from the documentation that is accessible online,” Khatoun said.
Latency remains a critical frontier, with a continuing race to shave milliseconds so conversations feel instantaneous. Makki noted that ElevenLabs has begun hosting several LLMs itself to reduce latency further. This also helps close the gap on languages, especially Arabic dialects, and local language depth is essential for Middle East deployment. “Dialects are very different in the Arab world,” Khatoun said. “Arabic LLMs are really surging in the region.”
MBZUAI’s leading scholar on preserving Arabic dialects, Hanan Aldarmaki, Assistant Professor of Natural Language Processing, attended the session. MBZUAI, through its Institute of Foundational Models (IFM), has released open-source LLMs for low-resource languages, including Jais, an Arabic-first model that increases accessibility and performance for such AI tools.
Across the Middle East, audio AI is becoming part of daily operations.
Business outcomes include faster resolution, higher satisfaction, and 24/7 availability, with costs that scale efficiently as volumes rise. Opportunities are plentiful for entrepreneurial students skilled in audio AI.
Two MBZUAI machine learning students are already taking advantage, having launched their startup Audiomatic in 2024. The state-of-the-art AI-driven audio production platform generates customized, high-quality soundtracks for videos, and currently has more than 2,000 active monthly users.
Makki offered some pragmatic guidance for other would-be founders: “Stay laser focused, make some difficult trade-offs, and definitely iterate. And remember – it’s okay to fail.”
The speakers also highlighted some high-frequency, high-ROI entry points worth focusing on, including password resets and simple IT tickets, account status checks, order tracking, basic sales flows, and course localization. These tasks are repetitive, multilingual, and time-sensitive, they explained, making them a natural fit for voice agents that deflect volume from human teams while lifting satisfaction.
Makki expects quality and responsiveness to keep improving as models and orchestration tighten. He pointed to a near-term race to close the remaining gap to truly indistinguishable natural conversation and to continued latency gains.
He emphasized that, across industries, adoption is accelerating. Banks are reimagining customer care, telcos are rebalancing call centers, retailers are scaling concierge-style voice shopping, e-learning platforms are going multilingual by default, and media companies are globalizing content.
The key takeaway: The era of menus and hold music is ending. In its place come human-sounding, multilingual, real-time AI agents, built in hours, scaling to millions of conversations, and speaking the languages of the Arab world.