Ask today’s flashy video generators for “a drone flying over a city at sunset,” and they’ll give you a cinematic clip. But ask them to keep flying – turn left at the river, bank toward the stadium lights – and they’ll stall. Why? Because most systems paint pictures. Very few maintain a state of the world that survives from one moment to the next, responds to actions, and stays coherent over time. They can render motion, but not meaning — because they predict frames, not the evolving world itself.
That’s what a world model is meant to do: imagine, predict, and reason about how the world evolves when you intervene.
PAN — a new model from MBZUAI’s Institute of Foundation Models (IFM) — marks a major leap toward that vision, unifying language-conditioned reasoning in a compact latent space with high-fidelity video prediction that remains coherent over long rollouts. It doesn’t just render relevant visuals; it simulates steerable futures — ones you can guide with natural language, decision by decision.
The central design of PAN is the Generative Latent Prediction (GLP) architecture, which couples internal latent reasoning with generative supervision in the visual domain. Instead of predicting every pixel directly, PAN separates what happens from how it looks. It first evolves an internal latent state — a structured representation that remembers what’s in the scene and how it’s moving — conditioned on history and a natural-language action such as “drive through a snowy forest.”
Then it decodes that next latent state into a short segment of video, so you can watch the consequence. Doing both, every step, means the model’s imagination stays grounded in realizable visuals, while the visuals remain tethered to a consistent, causal story about the world.
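To make the two-stage idea concrete, here is a minimal sketch of one GLP-style step under stated assumptions: `WorldState`, `backbone`, and `decoder` are illustrative placeholders rather than PAN’s actual components. The point is the ordering — the latent transition happens first, and only then is the result rendered as video.

```python
# Minimal sketch of one GLP-style step: evolve the latent world state,
# then decode it into a short video segment. Names and shapes are
# illustrative assumptions, not PAN's actual interfaces.
from dataclasses import dataclass, field
import torch

@dataclass
class WorldState:
    latent: torch.Tensor                           # current latent world state
    history: list = field(default_factory=list)    # past latents for context

def glp_step(backbone, decoder, state: WorldState, action_text: str):
    """One simulation step: latent transition first, pixels second."""
    # 1) Reason in latent space, conditioned on history and the language action.
    next_latent = backbone(state.latent, state.history, action_text)
    # 2) Ground the prediction by decoding it into an observable video segment.
    video_segment = decoder(next_latent)           # e.g. (frames, channels, H, W)
    # 3) Carry the state forward so the next step stays causally consistent.
    state.history.append(state.latent)
    state.latent = next_latent
    return video_segment, state
```

Keeping the transition and the rendering as separate calls is what lets the model reason compactly in latent space while still being checked against pixels at every step.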
In a paper published on arXiv, the authors contrast this design with popular encoder-only “predict-the-next-embedding” approaches, which learn latent transitions that look tidy in feature space but don’t map cleanly onto plausible observations. GLP’s insistence on reconstructing the next observation keeps those latent transitions honest.
PAN’s architecture follows the GLP structure: a vision encoder turns the current observation into a latent state; an autoregressive LLM-based backbone evolves that state forward in time, conditioned on previous history and natural-language actions; and a video diffusion decoder renders the next observation. The decoder uses a new Causal Swin-DPM mechanism, a sliding-window denoising process that keeps transitions smooth and prevents the compounding drift that plagues long simulations. In short: the LLM keeps the plot, the diffusion model keeps the cinematography, and a clever causal window keeps the scenes flowing.
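The paper names Causal Swin-DPM, but the prose above only gestures at how a sliding window avoids seams, so the sketch below shows the general idea under stated assumptions: overlapping windows in which frames already committed from the previous window stay fixed and condition the denoising of the new ones. The denoiser signature, window sizes, and schedule here are invented for illustration and are not PAN’s implementation.

```python
# Hedged sketch of a sliding-window denoising loop: each window overlaps the
# previous one, and frames already committed from earlier windows stay fixed,
# so new frames are denoised conditioned on them and boundaries stay smooth.
import torch

def sliding_window_decode(denoiser, latent_state, num_frames=32,
                          window=8, overlap=4, steps=20):
    frames = []                                   # committed (clean) frames
    while len(frames) < num_frames:
        context = frames[-overlap:]               # causal context from the past
        new_count = window - len(context)
        x = torch.randn(new_count, 3, 64, 64)     # noisy frames to denoise
        for t in reversed(range(steps)):          # iterative denoising
            # The denoiser sees the fixed past frames and the latent world
            # state, so transitions across window boundaries stay coherent.
            x = denoiser(x, t, context=context, cond=latent_state)
        frames.extend(list(x))
    return torch.stack(frames[:num_frames])
```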
The paper also includes a significant section on how world models should be evaluated. The authors argue that judging a world model demands more than frame-level sharpness. They measure three things that actually matter if you want to use the model to reason or plan: how faithfully the simulated world responds to actions, how coherent the simulation stays over long horizons, and how much its previews improve an agent’s planning.
On these axes, PAN is state-of-the-art among open-source systems and within range of leading commercial models, which is notable because it supports open-ended, action-conditioned interaction rather than single-shot video generation.
In action simulation, PAN reaches the best open-source fidelity across both agentic scenarios, where a controllable entity must follow instructions without breaking the background, and environment scenarios, where the scene itself is manipulated. In long-horizon tests, it posts the strongest scores for transition smoothness and simulation consistency, metrics derived from optical flow and a temporal-robustness suite that emphasize continuity across boundaries and resistance to degradation as horizons extend.
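The paper’s exact metric definitions aren’t reproduced here, but a rough optical-flow probe conveys the spirit of the smoothness measurement: compare how much motion the flow reports at segment boundaries versus inside segments, so a visible seam shows up as a spike. The use of OpenCV’s Farneback flow and the ratio below are assumptions for illustration, not the paper’s evaluation code.

```python
# Illustrative transition-smoothness probe: flow magnitude at segment
# boundaries relative to flow magnitude inside segments.
import cv2
import numpy as np

def flow_magnitude(prev_gray, next_gray):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=-1).mean()

def boundary_smoothness(frames, segment_len):
    """frames: list of grayscale uint8 arrays; segment_len: frames per segment."""
    mags = [flow_magnitude(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    boundary = [m for i, m in enumerate(mags) if (i + 1) % segment_len == 0]
    interior = [m for i, m in enumerate(mags) if (i + 1) % segment_len != 0]
    # A ratio near 1.0 means boundaries move about as much as interior frames,
    # i.e. no visible seam where one decoded segment hands off to the next.
    return np.mean(boundary) / (np.mean(interior) + 1e-8)
```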
In simulative planning, where a language agent proposes candidate actions and the world model previews their outcomes, PAN improves task success significantly over the agent alone and over alternative models, both in open-ended manipulation settings and in structured tabletop arrangements. The authors conclude that keeping a coherent state, step after step, pays dividends when you ask the model to think ahead.
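A simulative-planning loop of this kind can be sketched in a few lines, assuming hypothetical `agent`, `world_model`, and `score` interfaces rather than PAN’s published API: propose candidate actions, preview each one in imagination, and keep the best.

```python
# Sketch of a simulative-planning step: the agent proposes language actions,
# the world model previews each outcome, and a scorer picks the rollout that
# best matches the goal. All interfaces are placeholders.
def plan_step(agent, world_model, score, state, goal, num_candidates=4):
    best_action, best_value = None, float("-inf")
    for action in agent.propose(state, goal, n=num_candidates):
        # Preview the consequence in imagination before acting for real.
        predicted_segment, _ = world_model.simulate(state, action)
        value = score(predicted_segment, goal)
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```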
In fact, it’s PAN’s step-by-step behavior that makes it qualitatively different from prompt-to-full-video generation. At each step, the backbone ingests the accumulated world history, the current observation, and the next proposed action – such as “grasp the yellow can from the middle tray,” “turn onto the gravel road,” or “raise the boom past the fourth window” – and produces the next latent state. The decoder then renders a short, consistent segment from that state, and the process repeats. Because the latent states live in the shared multimodal space of a vision-language model, they carry semantic grounding (“that’s still the same can,” “the car is now facing north”) alongside perceptual structure. That’s what enables the model to maintain identities, preserve spatial relations, and propagate causal effects without the cumulative wobble that usually creeps in.
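Put together, the step-by-step behavior described above amounts to a simple rollout loop. The sketch below assumes placeholder `encoder`, `backbone`, and `decoder` callables and is meant only to show how history accumulates from one action to the next, not to mirror PAN’s code.

```python
# Hedged sketch of an action-conditioned rollout: keep an explicit history,
# feed it with the current latent state and the next language action, and
# append each decoded segment to the trajectory.
def rollout(encoder, backbone, decoder, first_frame, actions):
    latent = encoder(first_frame)        # current observation -> latent world state
    history, trajectory = [], []
    for action in actions:               # e.g. "turn onto the gravel road"
        next_latent = backbone(latent, history, action)  # evolve the world state
        segment = decoder(next_latent)                   # render the consequence
        trajectory.append(segment)
        history.append(latent)                           # the world remembers
        latent = next_latent
    return trajectory
```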
PAN is great for applications where agentic “preview before you act” behaviors are important. Robotics teams can use the model to evaluate candidate grasps or rearrangements before risking a clumsy arm. Autonomous systems researchers can sanity-check maneuvers under evolving conditions, using language to script alternative paths. Decision support tools in logistics or inspection could test hypotheses visually (e.g. “what happens if we move the dolly behind the crate and then rotate it clockwise?”) without booting a heavy physics simulator for every micro-change. None of this requires the model to be visually perfect; it requires it to be consistent and responsive enough that a planning loop can learn from its rollouts. PAN’s gains in long-horizon stability and action fidelity are exactly the kinds of improvements that make those loops viable.
What’s most striking about PAN is how natural the interaction feels. You don’t engineer a controller – you simply describe what you want, and the world adapts. That’s the original promise of world models: not just to render possibilities, but to reason about consequences.
With PAN, the gap between language and long-horizon simulation gets narrower, and the experience feels less like prompting a model — and more like conversing with a world that remembers what just happened and anticipates what comes next.