If you’ve watched enough demos of web-browsing AI agents, you’ve probably noticed a quiet omission: most of them don’t include solving CAPTCHAs. That omission isn’t accidental. Today’s most capable multimodal agents can answer exam questions, describe images, and even write code, but drop them into a real browser and ask them to pass a modern “are you human?” challenge, and they struggle.
To pin down why agents can’t tackle CAPTCHAs, MBZUAI Ph.D. students Yaxin Luo and Zhaoyi Li created a testbed called Open CaptchaWorld and report their findings in a new NeurIPS 2025 paper.
Open CaptchaWorld is an open, web-based benchmark where agents must actually perceive a puzzle, reason through multiple steps, and act on the page by clicking, dragging and rotating elements until they solve the CAPTCHA. While humans breeze through with 93.3% accuracy, the best agent hovers at 40%.
The researchers’ reasoning is simple: if you want web agents to work in the real world, you can’t evaluate them in carefully curated sandboxes that filter out CAPTCHAs. Those puzzles are exactly where automation meets the real frictions of the web, such as stateful interfaces, tiny controls, ambiguous visual cues, and instructions that require common-sense planning.
“We were originally building a web shopping agent,” says Luo. “We found that while our models could reason through complex user requests, they were constantly getting stuck at the login or checkout phase because of CAPTCHAs. The agent would get into a loop refreshing the page, failing the puzzle, and trying again, because it lacked the fine-grained interaction skills to pass.
“We realized that CAPTCHAs are the gatekeepers to the most high-value web actions, like e-commerce, ticketing, and secure logins. If an agent can’t pass these, it can’t actually be deployed in the real world. When we looked at why this was happening, we realized that major existing benchmarks like AgentBench and VisualWebArena deliberately filter out pages with CAPTCHAs. They treat these puzzles as ‘noise’ to be removed, rather than a core capability to be tested.”
Open CaptchaWorld spans 20 modern CAPTCHA types, from “select all squares with a maple leaf” to drag-to-align sliders, jigsaw-style alignment, sequence clicking, counting and arithmetic over icons, and even hold-to-complete timers.
All of these puzzles are run in a browser loop. Agents see screenshots, maintain a running plan, and issue granular actions until they hit “Submit.” That design moves evaluation from one-shot perception to interactive, multi-turn problem solving.
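To make that loop concrete, here is a minimal sketch of the protocol, assuming a hypothetical “env” that exposes reset() and step() over the live page and an “agent” callable that maps a screenshot and running plan to one UI action; none of these names come from the benchmark’s actual code.

```python
# A minimal sketch of the observe-plan-act loop described above, under assumed
# interfaces: `env` wraps the live browser page, and `agent` is any multimodal
# backbone mapping (screenshot, plan) to a UI action such as click or drag.
def run_episode(env, agent, max_steps=20):
    plan = []                       # running plan the agent carries across turns
    screenshot = env.reset()        # initial rendering of the CAPTCHA page
    for _ in range(max_steps):
        thought, action = agent(screenshot, plan)     # perceive and decide one step
        plan.append(thought)                          # persist working memory
        screenshot, done, success = env.step(action)  # execute click/drag/type on the page
        if done:                    # the agent hit "Submit" or the page resolved
            return success
    return False                    # ran out of turns without solving
```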
To make the difficulty legible, the team introduces CAPTCHA Reasoning Depth, a task-agnostic measure that counts the minimal number of cognitive and motor steps a human needs to solve a puzzle. Think: “identify the right icon,” “plan the sequence,” “click the targets,” “verify the UI’s feedback,” each counted once if it genuinely contributes to solving. Across the benchmark, the average depth sits around 2.94 with a healthy spread, reflecting puzzles that look trivial to people but demand structured, stepwise control from agents.
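As a worked illustration (the step list below is invented rather than taken from the paper’s annotations), the depth of a single sequence-clicking puzzle might be tallied like this:

```python
# Illustrative only: tallying Reasoning Depth for one hypothetical
# sequence-clicking puzzle. Each step counts once, and only if it genuinely
# contributes to solving; the benchmark's average across all puzzles is ~2.94.
annotated_steps = [
    ("identify the icons that define the sequence", True),
    ("plan the click order",                        True),
    ("click the targets in order",                  True),
    ("verify the UI's feedback before submitting",  True),
    ("re-read the instructions a second time",      False),  # redundant, so not counted
]
reasoning_depth = sum(1 for _, contributes in annotated_steps if contributes)
print(reasoning_depth)  # 4 for this made-up puzzle
```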
A striking behavioral insight emerges when you compare annotations: humans compress familiar micro-steps into intuition (“find the sequence, click in order, done”); advanced models like OpenAI o3 over-segment the same task into many literal sub-steps (“recognize icon 1, store in working memory, click, check feedback… repeat”), inflating their own estimated depth and, in practice, their opportunities to go wrong. That gap of intuition vs. brittle enumeration shows up repeatedly in the failures.
Technically, the paper frames each puzzle using a mathematical framework called a Partially Observable Markov Decision Process (POMDP): the page is a partially observable state, actions are discrete UI operations (click, drag, type), observations are screenshots, and the reward is success or failure. Agents must infer a belief over what matters on the page, plan a short sequence, and execute precisely enough that the interface state transitions the way they expect. Unlike static visual question-answering tasks, Open CaptchaWorld demands closed-loop control, with vision and language in service of acting. The authors evaluate a fleet of browser-use agents by swapping in different multimodal LLM backbones (OpenAI o3, GPT-4, Claude 3.7 Sonnet, Gemini 2.5 Pro, DeepSeek-V3) under a unified prompting and action API.
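A rough sketch of what that unified setup could look like is below; the wrapper, the query_model helper, and the reply format are assumptions for illustration, not the benchmark’s real API, and the string identifiers are not real API model IDs.

```python
# A hedged sketch of "same loop, different backbone". The model names follow
# the article; everything else here is assumed scaffolding, not the paper's code.
BACKBONES = ["openai-o3", "gpt-4", "claude-3.7-sonnet", "gemini-2.5-pro", "deepseek-v3"]

def make_agent(model_name, query_model):
    """Wrap one multimodal LLM behind the (screenshot, plan) -> (thought, action) interface."""
    def agent(screenshot, plan):
        # Identical prompt template and action vocabulary for every backbone,
        # so score differences reflect the model, not the scaffolding.
        reply = query_model(model=model_name, screenshot=screenshot, plan=plan)
        return reply["thought"], reply["action"]   # action is a discrete UI op: click, drag, type
    return agent

# agents = {name: make_agent(name, query_model) for name in BACKBONES}
# Each wrapped agent then runs through the same run_episode loop sketched earlier.
```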
Several strong general models fall close to random on the trickier puzzles. Even more awkward: the best model is also the most expensive to run in this setup (about $66.4 to traverse a full sequence), while cheaper models save budget but crater on accuracy. In other words, agent deployment has a cost–accuracy frontier, and today it’s steep.
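One way to read that frontier is to fold accuracy into the price. In the back-of-envelope sketch below, the roughly $66.4 traversal cost and 40% solve rate are the figures reported above, while the 20-puzzle traversal length and the cheaper-model numbers are hypothetical placeholders.

```python
# Back-of-envelope reading of the cost-accuracy frontier. The ~$66.4 traversal
# cost and ~40% solve rate are cited above; the 20-puzzle traversal and the
# "cheap_model" numbers are hypothetical placeholders.
def cost_per_solve(total_cost_usd, puzzles, accuracy):
    """Effective dollars spent per CAPTCHA actually solved."""
    solved = puzzles * accuracy
    return total_cost_usd / solved if solved else float("inf")

best_model = cost_per_solve(total_cost_usd=66.4, puzzles=20, accuracy=0.40)   # ≈ $8.30 per solve
cheap_model = cost_per_solve(total_cost_usd=20.0, puzzles=20, accuracy=0.05)  # ≈ $20.00 per solve
print(f"best: ${best_model:.2f}/solve, cheap: ${cheap_model:.2f}/solve")
```

Folding accuracy into the price makes the steepness of the curve explicit: a lower sticker price does not help if the solve rate collapses faster.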
“We went into this expecting that current agents might struggle with visual recognition, but what actually surprised us was how they failed,” says Luo. “It wasn’t usually a lack of knowledge, but rather a lack of ‘intuition’ and ‘motor control’. As we showed in our paper, there were three specific findings from analyzing the models’ failure cases that really struck us.”
The strongest agents do fine on low-depth perception tasks and can sometimes handle compositional ones like bingo-style matches or dartboard counting that need light arithmetic. But three failure patterns recur across the board. First, right plan, wrong precision: the agent forms a correct strategy (e.g., “place a dot at the end of the car’s path”) and then mis-clicks by a few dozen pixels, repeatedly. Second, delicate operations: slider alignment puzzles demand calibrated drags; agents understand the goal but lack the fine-grained spatial control to land inside the tolerance window. Third, strategy drift: in some “object match” tasks, an agent latches onto irrelevant cues like image filenames or page text instead of the visual content, “solving” a different problem than the one asked.
One advantage of a live browser benchmark is that you can read the agents’ “thought process” as they step through a puzzle. Take OpenAI o3 on a successful image match: it cycles through options, tracks memory (“we found the cat; submit”), and cleanly finishes. Then you can watch it fail on dot placement: it articulates the right end-goal but keeps clicking near the center of the path, likely a perception/coordination mismatch that a human adjusts for automatically. On slider alignment, the model oscillates without ever locking onto the exact notch. And on a count-and-match task, it starts extracting filenames (“image19.png”) as proxies for content, a heuristic that works nowhere except synthetic demos. So while the intent is there, the intuition and motor skill are not.
Why do puzzles that are easy for humans expose such deep cracks in agents? The authors argue that most multimodal evaluations are static and single-turn: show me an image, I’ll tell you what’s in it. CAPTCHAs are dynamic and stateful: observe, plan, act, observe again, refine. They require tight coupling between visual parsing and motor control, and they punish small errors with UI states that don’t budge. Put differently, they stress what an agent needs to survive outside the lab: working memory, sequential planning, spatial precision, and a bias toward goal-directed simplicity instead of verbose overthinking. By baking those demands into an open benchmark with a clear complexity measure (Reasoning Depth), a real browser loop, and a diverse puzzle set, Open CaptchaWorld offers a grounded way to measure progress that leaderboards of static QA can’t.
The paper also sketches a path forward for more capable agents: better state tracking to avoid re-explaining the obvious; vision tuned for small-object localization so clicks land inside tiny hitboxes; motor-control primitives that learn from corrective feedback (micro-drags, coarse-to-fine refinement); and planning that compresses routine micro-steps the way humans do, rather than enumerating them. Perhaps most importantly, evaluation must stay interactive. It’s only in the loop when your click changes the interface that fragile plans and miscalibrated drags reveal themselves. CAPTCHAs, annoying as they are, turn out to be an excellent “unit test” for exactly that.
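As one concrete (and entirely hypothetical) example of such a motor-control primitive, a slider-alignment routine could correct itself against the page’s own feedback rather than trusting its first plan; the drag_by and read_offset helpers below are assumptions, not part of the benchmark.

```python
# A sketch of a feedback-driven motor primitive for slider alignment.
# `env.drag_by` and `read_offset` are assumed helpers, not the benchmark's API.
def align_slider(env, read_offset, max_corrections=8, tolerance=2):
    """Coarse drag first, then corrective micro-drags guided by the UI's own feedback."""
    for _ in range(max_corrections):
        offset = read_offset(env)       # signed pixel error from the target notch
        if abs(offset) <= tolerance:    # inside the puzzle's acceptance window
            return True                 # aligned; safe to submit
        env.drag_by(dx=offset, dy=0)    # correct by the observed error, not the original plan
    return False                        # never locked onto the notch
```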
Benchmarks like WebArena and AgentBench have been invaluable, but many filter out CAPTCHA flows because they “break” end-to-end tasks. Open CaptchaWorld argues the opposite: if you remove the gates that real sites use, you overestimate readiness for the real world. Including them reframes success: not “could the agent read the page?” but “could it finish the job?” That’s a more product-relevant metric, and it forces conversations about cost as well as capability. In the current results, a model that doubles accuracy can also triple run cost. If you plan to deploy browser agents at scale, that curve needs to be taken into account.
None of this is to say we should get rid of CAPTCHAs; they exist to keep abuse at bay, and the authors explicitly discuss safeguards and societal impacts. But as accessibility and automation pressures grow, agents will need to navigate human checks without resorting to shady API handoffs or workarounds. Benchmarks can’t solve that tension, but they can make the technical problem visible and tractable.
In terms of how this research could evolve into the future, Luo is excited about the work ahead: “We want to make sure this benchmark doesn’t just measure today’s models, but pushes the development of future models,” he says. “In order to achieve that, we realized that simply collecting existing CAPTCHAs is not enough. In our next moves, we plan to design the new era of CAPTCHAs.
“As foundation models become more powerful, static benchmarks become obsolete very quickly. We plan to move beyond just curating existing web puzzles and start designing novel, AI-native CAPTCHAs. These will be procedurally generated puzzles specifically engineered to stress-test the weaknesses we discovered, like the ‘Intuition Gap’ and fine-grained motor control.”
By quantifying how and where agents fail, Open CaptchaWorld gives researchers a concrete target: close the intuition gap, tighten control, and learn to act with the same calm economy humans bring when they are asked to count the number of traffic lights in a fragmented picture.