Teaching a robot a new skill typically requires collecting demonstrations of the action and retraining the robot, a time-consuming process. Roboticists have therefore begun exploring a faster, more flexible approach called “in-context imitation learning,” which offers a shortcut that avoids retraining. In this approach, a robot observes an action and converts it into a representation through a process known as action tokenization.
The basic idea is inspired by natural language processing, where language models can learn new tasks simply by being prompted with new examples. But there are important differences between tokenizing text and tokenizing physical actions: action tokenization must capture both temporal and spatial structure, explains An Dinh Vuong, a doctoral student in computer vision at MBZUAI.
Traditional tokenization methods also don’t account for what’s known as temporal smoothness, which relates to the continuity of a robot’s movements over time. When a person reaches for a cup of coffee on a desk, the gesture is typically fluid and is completed in one motion. But standard action tokenization methods can lead to jerky and unstable movements. This is important because smoother actions often result in more successful outcomes.
Vuong and colleagues are authors of a study that proposes a new action tokenization method that they call the Lipschitz-constrained vector quantization variational autoencoder (LipVQ-VAE). It combines VQ-VAE, a conventional tokenizer, with a Lipschitz constraint to help generate smooth robotic motion. When used with in-context imitation learning, LipVQ-VAE improved the performance of both simulated and real robots.
The team’s findings were presented at the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025) in Hangzhou, China. It’s the first study to explore the role of action tokenization in in-context imitation learning.
Minh Nhat Vu, Dong An, and Ian Reid are co-authors of the study.
To address some of the current challenges in robotics, Vuong is developing ways to transfer the intelligence of computer vision systems into robot actions that can assist humans.
LipVQ-VAE builds on a technique called vector quantization, an unsupervised representation learning approach that compresses information into discrete units, or tokens, that models can interpret.
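As a rough sketch of that vector-quantization step (the codebook size and latent dimension below are arbitrary placeholders, not values from the paper), each continuous latent vector is replaced by its nearest entry in a learned codebook, and the index of that entry becomes the discrete token:

```python
import torch

# Hypothetical sizes, for illustration only: 512 code vectors of dimension 64.
codebook = torch.randn(512, 64)

def quantize(z):
    """Replace each continuous latent vector with its nearest codebook entry."""
    # z: (batch, 64) continuous outputs of an encoder
    distances = torch.cdist(z, codebook)   # distance from each latent to every code
    token_ids = distances.argmin(dim=-1)   # index of the closest code = discrete token
    z_quantized = codebook[token_ids]      # quantized latent passed on to the decoder
    return token_ids, z_quantized
```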
To build LipVQ-VAE, the researchers adopted a framework known as an in-context robot transformer that treats robotic control like a next-token prediction problem, much like the way a language model predicts the next word in a sentence. It uses past examples to guide what action comes next.
Demonstrations are encoded in a shared latent token space and act similarly to the way a prompt does with a language model.
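In rough pseudocode (the function and tensor names here are illustrative, not taken from the paper's codebase), the idea looks something like this: demonstration tokens are concatenated in front of the current observation's tokens, and a causal transformer predicts the next action token from that combined context.

```python
import torch

def predict_next_action(transformer, demo_tokens, obs_tokens):
    """Sketch of in-context prediction: demonstrations act as the prompt."""
    # demo_tokens: (1, T_demo, d) tokens encoding prior demonstrations
    # obs_tokens:  (1, T_obs, d) tokens encoding the robot's current observation
    context = torch.cat([demo_tokens, obs_tokens], dim=1)  # prompt + query
    hidden = transformer(context)   # any causal sequence model over the context
    return hidden[:, -1, :]         # last position predicts the next action token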
Observations and actions are processed through a layer of tokenizers consisting of two components: an observation backbone and an action tokenizer. In their study, the researchers show that the choice of the action tokenizer plays an important role in determining robotic performance.
LipVQ-VAE encodes actions into a latent space using vector quantization, helping align motion representations with their timing. The Lipschitz constraint is then applied to the encoder. “We add an extra Lipschitz constraint that can propagate from the temporal domain into the temporal action,” Vuong says.
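One standard way to enforce a Lipschitz bound on an encoder layer is to rescale its weight matrix so that its spectral norm stays below a chosen limit. The sketch below illustrates that general idea only and is not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class LipschitzLinear(nn.Module):
    """Linear layer whose Lipschitz constant is capped by rescaling its weights.
    Illustrative only; the paper's constraint may be enforced differently."""
    def __init__(self, in_dim, out_dim, max_norm=1.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.max_norm = max_norm

    def forward(self, x):
        weight = self.linear.weight
        # Largest singular value = the layer's Lipschitz constant (in the L2 sense).
        spectral_norm = torch.linalg.matrix_norm(weight, ord=2)
        scale = torch.clamp(self.max_norm / (spectral_norm + 1e-8), max=1.0)
        return nn.functional.linear(x, weight * scale, self.linear.bias)
```

Capping how sharply the encoder's output can change between nearby inputs is what encourages consecutive action tokens, and therefore the decoded motions, to vary smoothly over time.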
Tokenized observations and actions are then projected into a shared latent space.
The researchers tested LipVQ-VAE in simulation and on a real robot and found that it outperformed other approaches, beating a multi-layer perceptron tokenizer by 2.3% and the traditional VQ-VAE method by 5.5%. On real hardware, it also produced smoother motion and a higher success rate, reaching 12% compared with about 2% for baseline approaches.

The researchers found that choosing the right action tokenizer is an important factor in in-context imitation learning, as smoother action representations correlate with more successful robotic manipulation.
While Vuong and his colleagues’ current method represents the action in a latent space, he says future work could instead explore using explicit representations. “You could fuse the observation with the action more explicitly by directly linking visual or spatial inputs to actions, rather than relying on abstract representations,” he says.
Overall, LipVQ-VAE holds the potential to improve in-context imitation learning for robots and offers a path toward real-world robotic deployment. As robots begin to share physical spaces with people, smooth, accurate movements won't just be a matter of aesthetics; they will be essential for the safety and reliability of robotic systems.