As the old saying goes, “time is money.” This is especially true for those in the logistics business. Speed has a double benefit: more deliveries mean more revenue, and getting packages to customers on time will likely lead to them coming back for more. But figuring out how to optimize delivery under the unpredictable conditions found on urban roads is difficult.
Academic and industry researchers have developed algorithms, used by delivery companies, to solve what are known as vehicle routing problems. Until now, however, there hasn’t been a benchmark that measures the performance of these algorithms while accounting for the unpredictability that is a hallmark of real-life logistics. To be effective, vehicle routing algorithms must be responsive to traffic that ebbs and flows with the time of day and with unexpected events like accidents and road closures.
Researchers at MBZUAI have developed a new benchmark that is designed to test the performance of vehicle routing algorithms in the presence of this randomness. Called SVRPBench (stochastic vehicle routing problem benchmark), it’s the first open benchmark that simulates the unpredictable nature of delivery in cities, including rush-hour traffic, accidents, and times when customers want their packages to be delivered.
The researchers are presenting a study about the benchmark at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025) in San Diego, California.
The authors of the study are Ahmed Heakl, Yahia Salaheldin Shaaban, Martin Takáč, Salem Lahlou, and Zangir Iklassov.
Devising the most efficient delivery routes can lead to huge cost savings for companies, says Heakl, a master’s student in computer vision at MBZUAI and co-lead author of the study. But there are many variables that need to be considered to do this well. Vehicles have different capacities. Some drivers are slow, while others are fast. Some routes are generally predictable, while others aren’t. And late deliveries can damage a company’s reputation. “All these factors translate directly to revenue,” he says.
Shaaban, a master’s student in machine learning at MBZUAI and co-lead author of the study, explains that while existing benchmarks can measure the efficiency of vehicle routing algorithms, they are deterministic and don’t account for the unpredictable nature of delivery conditions in the real world. SVRPBench introduces randomness, which makes it more realistic than current benchmarks.
SVRPBench also models cities in a way that is truer to life. Typically, benchmarks use random points on a map to represent customers. But in the real world, customers are clustered in certain areas, depending on the city. SVRPBench uses near-accurate representations of where warehouses and customers are located. “We tried to make the map as close to reality so that it has more accurate structures,” Shaaban says.

The SVRPBench framework generates realistic scenarios through four stages. An input generation module considers the city layout, locations of customers and when they expect deliveries, and level of demand. A stochastic modeling engine adds randomness related to traffic and travel disruptions. An instance assembly module accounts for the number and capacity of delivery vehicles and travel times. And an evaluation framework is used to measure the performance of vehicle routing algorithms according to a variety of metrics.
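To make the stochastic modeling stage more concrete, here is a minimal sketch of how a travel time could be sampled with time-of-day congestion and occasional accidents. It is an illustration only, not the benchmark's actual model: the function names, the Gaussian rush-hour peaks at 8:00 and 17:00, the log-normal noise, and every default parameter are assumptions chosen for the example.

```python
import math
import random


def congestion_factor(hour: float) -> float:
    """Hypothetical time-of-day multiplier with peaks near 8:00 and 17:00."""
    morning = math.exp(-((hour - 8.0) ** 2) / (2 * 1.5 ** 2))
    evening = math.exp(-((hour - 17.0) ** 2) / (2 * 1.5 ** 2))
    return 1.0 + 0.8 * (morning + evening)


def sample_travel_time(
    distance_km: float,
    depart_hour: float,
    rng: random.Random,
    free_flow_kmh: float = 40.0,          # assumed free-flow speed
    noise_sigma: float = 0.1,             # assumed spread of random variation
    accident_prob: float = 0.05,          # assumed chance of an accident
    accident_delay_min: tuple = (10.0, 30.0),  # assumed delay range in minutes
) -> float:
    """Return one random travel time in minutes for a single road segment."""
    base_min = 60.0 * distance_km / free_flow_kmh
    # Multiplicative noise keeps the sampled time strictly positive.
    minutes = base_min * congestion_factor(depart_hour)
    minutes *= rng.lognormvariate(0.0, noise_sigma)
    # With small probability, add a delay standing in for an accident.
    if rng.random() < accident_prob:
        minutes += rng.uniform(*accident_delay_min)
    return minutes


rng = random.Random(42)
rush_hour_sample = sample_travel_time(10.0, 8.0, rng)
off_peak_sample = sample_travel_time(10.0, 3.0, rng)
```

Passing in a seeded `random.Random` makes the sampled scenarios reproducible, which matters when different algorithms need to be compared on the same random conditions.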
The benchmark also considers vehicle routing problems of three different sizes: small, with 50–100 customers; medium, with 100–300 customers; and large, with more than 300 customers.
The researchers tested different vehicle routing algorithms, including reinforcement-learning-based and classical methods, on SVRPBench and measured their performance according to different metrics. These included total cost, which measured the cumulative travel time across all vehicles, and runtime, which measured how long a solution took to compute.
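The two metrics named above can be sketched in a few lines. This is a simplified illustration, assuming routes are lists of location indices that start and end at a depot and that travel times are given as a matrix; the helper names `total_cost` and `timed_solve` are hypothetical and not taken from the benchmark.

```python
import time


def total_cost(routes, travel_time):
    """Sum travel time over every consecutive leg of every vehicle's route."""
    return sum(
        travel_time[a][b]
        for route in routes
        for a, b in zip(route, route[1:])
    )


def timed_solve(solver, *args):
    """Run a solver and return its solution with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    solution = solver(*args)
    return solution, time.perf_counter() - start


# Toy instance: location 0 is the depot, travel times in minutes.
travel_time = [
    [0, 4, 6],
    [5, 0, 3],
    [7, 2, 0],
]
routes = [[0, 1, 0], [0, 2, 0]]  # two vehicles, one customer each
cost = total_cost(routes, travel_time)  # 4 + 5 + 6 + 7 = 22
```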
In terms of total cost, a rule-based system called OR-Tools, developed by Google, performed best. In terms of runtime, learning methods were generally faster than metaheuristics and OR-Tools, but not necessarily faster than simple classical heuristics like NN+2opt.
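For readers unfamiliar with NN+2opt, the sketch below shows the general idea, reading NN as the nearest-neighbor construction heuristic, which is an assumption since the article does not spell it out. It builds a single tour greedily and then improves it with 2-opt segment reversals. It ignores vehicle capacities, time windows, and stochastic travel times, so it is not the configuration evaluated in the study.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour, including the return to the start."""
    return sum(
        math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def nearest_neighbor(pts, start=0):
    """Greedy construction: always visit the closest unvisited point next."""
    unvisited = set(range(len(pts))) - {start}
    tour = [start]
    while unvisited:
        last = pts[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, pts[j]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def two_opt(tour, pts):
    """Local search: reverse segments while doing so shortens the tour."""
    tour = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, pts) < tour_length(tour, pts) - 1e-9:
                    tour, improved = candidate, True
    return tour


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(12)]  # index 0 is the depot
nn_tour = nearest_neighbor(points)
improved_tour = two_opt(nn_tour, points)
```

Recomputing the full tour length for every candidate keeps the code short but is slow; practical implementations evaluate only the change in the two affected edges.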
Introducing delivery time windows had a significant impact on performance, increasing the total cost across algorithms by up to 648%. Multi-depot settings improved feasibility and runtime across all algorithms, offering a practical design consideration for logistics planning, the researchers explain.
The randomness introduced by the benchmark affected reinforcement-learning-based methods more than classical and metaheuristic methods, degrading their performance by more than 20%. “We found that most learning-based algorithms fail under realistic constraints, like unexpected traffic, especially when you move from the training distribution to the real distribution,” Heakl says.
Shaaban says that while learning methods have led to great advances in fields like natural language processing and computer vision, vehicle routing problems are by their nature different from tasks like sentence completion or image classification. “Vehicle routing problems have hard constraints, like delivery windows and vehicle capacities, that must be met,” he says, “and finding the optimal solution becomes exponentially harder as the problem size increases.” Heuristic approaches, which rely on local optimization, have been engineered over decades to be extremely efficient at navigating these constraints.
That said, the gap between learning and heuristic approaches is closing, Heakl explains. “We are seeing more hybrid approaches where neural networks learn to guide the heuristics, effectively narrowing the search space so the rule-based solver can work faster.”
The researchers hope that SVRPBench can be a useful resource for those working in the logistics industry to give them a more accurate idea of how vehicle routing algorithms will perform under realistic conditions.
“Many people in industry think this problem is solved using heuristic methods but when you go into a realistic setting it’s clear that they’re not solved,” Heakl says. “But our work shows that we have a long way to go to solve for realistic constraints like traffic jams, accidents, and changes to delivery times.”