The proficiency of large multimodal models (LMMs) has increased dramatically in recent years, bringing together image, text and audio capabilities in a way that allows these models to understand commands and perform tasks like never before. However, challenges such as latency and multilingual support still persist.
For LMMs to be at their most effective, they need to be able to respond accurately in the native language of users, and in near real-time, offering users a more lifelike experience and more useful applications.
This has been the focus of Dr. Hisham Cholakkal, assistant professor of computer vision at Mohamed bin Zayed University of Artificial Intelligence, whose work is currently being demonstrated at GITEX in Dubai.
“Improving on this latency and multilingual support in a multimodal system with speech-text-visual capabilities has been the focus of our demo,” says Cholakkal, who is leading the demo, entitled ‘Visual analytics GPT’, along with team members Mohammed Irfan Kurpath, Sambal Shikhar, Sahal Shaji Mullappilly, and collaborating MBZUAI faculty members Dr. Rao Anwer, Dr. Salman Khan and Professor Fahad Khan.
“Existing models, especially many state-of-the-art vision language models, take a lot of time to give a response,” he explains. “A user asks them a question and the response from the model is given with a bit of latency – sometimes quite a long delay. If you want a robot to seamlessly interact with the world, then you need to support these three modalities — text, image and audio — and you need it to respond in real-time, without noticeable delay.
“Moreover, to facilitate interaction with diverse populations in cosmopolitan cities like Abu Dhabi and Dubai, we want to support verbal interactions in multiple languages such as English, Arabic, Hindi, Chinese, French, Spanish, and German.
“You need the robot to understand what it is seeing, hearing and being told, and then communicate back using voice format. This is the way we humans interact with the world, and so this is something we’re trying to do. Whether it’s a system in a robot or in a mobile app, we want to optimize the response and provide multilingual support. That way, we can have a very realistic conversation.”
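To make the flow Cholakkal describes concrete, the sketch below outlines one turn of such a speech-text-visual loop. It is illustrative only: the `asr`, `vlm` and `tts` components and their methods are hypothetical placeholders rather than the team’s actual system. The key idea is that the user’s language is detected at transcription time and carried through to synthesis, and that speech can be streamed back to reduce perceived latency.

```python
# Minimal sketch of one turn of a speech-text-visual interaction loop.
# The asr, vlm and tts components are hypothetical placeholders, not the
# demo's actual implementation.

from dataclasses import dataclass


@dataclass
class Turn:
    language: str   # e.g. "en", "ar", "hi"
    question: str   # transcribed user speech
    answer: str     # model response that gets spoken back


def respond(audio_chunk: bytes, image: bytes, asr, vlm, tts) -> Turn:
    """One interaction turn: hear, see, answer, speak."""
    # 1. Transcribe the spoken question and detect its language.
    question, language = asr.transcribe(audio_chunk)

    # 2. Ground the question in what the camera currently sees.
    answer = vlm.answer(image=image, question=question, language=language)

    # 3. Speak the answer back in the user's own language. Streaming
    #    synthesis (speaking while text is still being generated) is one
    #    common way to cut perceived latency.
    tts.speak(answer, language=language, streaming=True)

    return Turn(language=language, question=question, answer=answer)
```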
Responding with near real-time speed
The GITEX demonstration is part of a wider MBZUAI showcase and will work on two tracks, Cholakkal says. “One thing we are doing is developing a mobile app so that users can open the app, point the camera at any object, and ask questions related to it. The model will give you a response in speech format.
“Another demo we are working on is to bring the model into a robot dog. The dog was previously controlled by a human with a joystick. Now you can talk to the dog, ask it to dance, for example, or do some other action, and it will do as you ask.”
Mohammed Irfan Kurpath, a research associate at MBZUAI, adds: “Since it has multilingual speech understanding capabilities, it will understand your intention and what you’re asking for. And with its vision capability, you can ask anything related to what it is seeing, and it will both describe it and give you a response. In the real world this is very important if you want to send a robot for an inspection on a work site, for example. It can look around, understand what it’s seeing, describe it, and if you ask questions about what it’s seeing, it can talk to you with a response.”
Highlighting the challenges in developing the model, fellow MBZUAI research associate Sambal Shikhar says that there were two main things to consider: “Firstly, bringing its different components together in a unified way, and secondly, adding that layer of speed to response times.”

MBZUAI Ph.D. student Sahal Shaji Mullappilly adds: “We are not saying we’ve achieved the gold standard yet, but the project is still ongoing and we are aiming to get there. We have certainly improved on where we started in terms of latency and are trying different strategies to continue that development. Then there are other practical challenges to think of. For example, the microphone. When you open the microphone there could be some cases of people asking the robot to perform a task, but there could be noise in the background. This is a huge problem in the speech community. How well can the model handle this noise? Will it be affected by somebody talking in the background?”
“What we are trying to do is really ambitious and we are not expecting 100% perfection immediately, but this is always the case with hardware demos,” explains Cholakkal. “Events like GITEX are great for demos because they not only give us an opportunity to showcase our cutting-edge research to a wider audience, but they also provide interesting conditions for testing.”
From the lab to the real world
As LMMs have continued to develop, they are redefining human-machine interaction, and opening up a new wave of applications and possibilities. Cholakkal believes that there will be a wide array of uses for such models as they become even more refined and their responses swifter, including the earlier work site example.
“Quite often you don’t want to send a human to a particular site, so you could send a robot to do some routine patrolling and monitoring, or to check for any anomalies,” he says.
“It can tell you if there’s anything unusual or different on site, and since it has multilingual listening capabilities, it has a definite advantage over surveillance cameras, which are limited to visual content. If the robot picks up on anything suspicious, it can give you that information. Similarly, such robots could be used in the future as police robots. You could send one to a location, let it walk around, and it could tell you if there’s something suspicious taking place.
“You could also put the model into agricultural fields to help farmers. For example, if there are some crops or plants that aren’t performing well, the model, whether it is hosted in our mobile app or on the robot dog, could look at and analyze the leaf, plant or soil, and report back to the farmers on their mobile apps.
“So, there’s safety, security, agriculture, healthcare, and many more potential applications for such a conversational visual analytics GPT model.”
However it might be used in the future, Cholakkal and his team’s demo aims to make human-machine interaction more lifelike than ever before.