
A multi-agent system for reading research papers

Monday, April 27, 2026

Ask an AI researcher how they keep up with the field and you will get some version of the same confession: it’s hard. Reading the new papers that appear on arXiv alone would take more hours than a year contains. So researchers cobble together workflows from keyword searches, Twitter threads, Slack channels, and the occasional recommendation from a colleague who happened to stumble across a preprint late at night.

A team from MBZUAI has built a system meant to compress that triage into something more structured. Paper Circle uses multiple AI agents to search for papers across several databases at once, rank them using a mix of relevance signals, and then construct knowledge graphs from the papers it finds. The code is open source, and a live demo runs on Vercel.

Over the past two years, a growing number of projects have pitched autonomous agents that generate hypotheses, run experiments, and draft papers without human involvement. Paper Circle has a different pitch: help any researcher find the right twenty papers out of ten thousand, understand how those papers connect, and export clean citations without losing an afternoon to BibTeX formatting. The authors call it a “collaborative workbench”: a tool that assumes the human is still doing the thinking.

Federated search, familiar ingredients

The system has two main pipelines. The first handles discovery: a user types a natural language query. An intent classification agent breaks it into structured pieces: which databases to hit, what year range matters, whether the user wants canonical references or recent preprints. A search agent then queries arXiv, Semantic Scholar, OpenAlex, and DBLP in parallel (or searches a local database, or both), deduplicates the results, and hands them to a scoring layer.
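
A minimal sketch of that fan-out-and-merge step might look like the following. The fetchers here are hypothetical stubs standing in for real API clients, and the deduplication key (stable identifier first, normalized title as fallback) is a common heuristic rather than a detail the article confirms.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs: each would wrap a real client for arXiv, Semantic
# Scholar, OpenAlex, or DBLP and return dicts with at least a title.
def fetch_arxiv(query):
    return [{"title": "Attention Is All You Need", "arxiv_id": "1706.03762"}]

def fetch_semantic_scholar(query):
    return [{"title": "Attention is all you need", "arxiv_id": "1706.03762"}]

SOURCES = [fetch_arxiv, fetch_semantic_scholar]  # plus OpenAlex, DBLP, ...

def normalize_title(title: str) -> str:
    """Lowercase and strip punctuation so near-duplicate titles collide."""
    return " ".join("".join(c for c in title.lower() if c.isalnum() or c.isspace()).split())

def federated_search(query: str) -> list[dict]:
    # Hit every source in parallel rather than one after another.
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        batches = pool.map(lambda fetch: fetch(query), SOURCES)

    # Deduplicate, preferring a stable identifier over the title.
    seen, merged = set(), []
    for batch in batches:
        for paper in batch:
            key = paper.get("doi") or paper.get("arxiv_id") or normalize_title(paper["title"])
            if key not in seen:
                seen.add(key)
                merged.append(paper)
    return merged

print(federated_search("attention mechanisms"))  # one paper, not two
```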

That scoring layer is where the system engineering gets interesting. Instead of a single relevance score, Paper Circle computes separate measures for query similarity (via TF-IDF), recency, novelty, and BM25 lexical matching. These are combined into a weighted sum, but the weights shift depending on the search mode. A “stable” mode favors relevance and citation count. A “discovery” mode turns up the novelty dial, deliberately surfacing papers with unusual terminology that might not rank well on traditional metrics. After scoring, a diversity filter based on Maximal Marginal Relevance prevents the top results from all clustering around the same subtopic. 
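
The mode-dependent weighting reduces to a small amount of code. A sketch with made-up weights; the actual values and signal definitions belong to the paper, not this example:

```python
# Illustrative weights only; Paper Circle's real values are not quoted
# in the article. Each signal is assumed pre-normalized to [0, 1].
MODE_WEIGHTS = {
    "stable":    {"similarity": 0.40, "bm25": 0.30, "citations": 0.20, "recency": 0.10, "novelty": 0.00},
    "discovery": {"similarity": 0.20, "bm25": 0.20, "citations": 0.00, "recency": 0.20, "novelty": 0.40},
}

def combined_score(signals: dict[str, float], mode: str = "stable") -> float:
    """Weighted sum of per-signal scores under the chosen search mode."""
    weights = MODE_WEIGHTS[mode]
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())

# The same paper ranks very differently depending on the mode:
odd_but_novel = {"similarity": 0.5, "bm25": 0.4, "citations": 0.05, "recency": 0.9, "novelty": 0.95}
print(combined_score(odd_but_novel, "stable"))     # 0.42
print(combined_score(odd_but_novel, "discovery"))  # 0.74
```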

Essentially, the system is trying to model the different ways a researcher might want to explore a topic, not just the most obvious one.
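
The diversity filter is worth spelling out too. Here is the standard greedy MMR formulation, assuming Paper Circle uses the common variant: each pick maximizes relevance minus redundancy with the items already chosen.

```python
def mmr_select(candidates, relevance, similarity, k=10, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    relevance:  dict mapping candidate -> relevance score
    similarity: function (a, b) -> pairwise similarity in [0, 1]
    lam:        trade-off; 1.0 is pure relevance, 0.0 is pure diversity
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def marginal(item):
            # Penalize items that resemble something already selected.
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        best = max(remaining, key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected
```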

The second pipeline handles analysis of individual papers. Given a PDF, the system parses it along section boundaries (rather than chopping text into arbitrary token windows), then dispatches specialized agents to extract concepts, methods, experiments, and the relationships between them. The output is a typed knowledge graph where a node might represent “transformer architecture” or “CIFAR-10 dataset” and an edge encodes how they relate: this method was evaluated on that dataset, this figure illustrates that concept. Users can ask questions about the paper and get answers pinned to specific sections and page numbers. A coverage checker flags figures, tables, or equations that the extraction missed, which provides a basic quality gate before anyone relies on the graph downstream.
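
In data-structure terms, the output is a property graph whose nodes and edges carry type tags and provenance. A minimal sketch; the type and relation names here are guesses, not Paper Circle’s actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    type: str      # e.g. "Method", "Dataset", "Concept", "Figure"
    label: str
    section: str   # provenance: the section the node was extracted from
    page: int      # provenance: page number for pinned answers

@dataclass(frozen=True)
class Edge:
    source: str    # Node.id
    target: str    # Node.id
    relation: str  # e.g. "evaluated_on", "illustrates", "extends"

method = Node("n1", "Method", "transformer architecture", "3. Approach", 4)
dataset = Node("n2", "Dataset", "CIFAR-10", "5. Experiments", 7)
link = Edge(method.id, dataset.id, "evaluated_on")
```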

Both pipelines log every agent action with timestamps and paper counts, and they produce outputs at every step in five formats: JSON, CSV, BibTeX, Markdown, and a live HTML dashboard. If a colleague asks why a particular paper appeared in your literature review, you can trace the decision back through the pipeline step by step.
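
Conceptually, the audit trail is just an append-only event list that serializes to each export format. A sketch with illustrative field names, not the system’s actual log schema:

```python
import csv, json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class AuditEvent:
    agent: str
    action: str
    paper_count: int
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

log = [
    AuditEvent("search_agent", "queried arXiv + OpenAlex", 137),
    AuditEvent("scoring_agent", "ranked and filtered candidates", 52),
]

# The same records serialize to JSON or CSV without extra bookkeeping.
print(json.dumps([asdict(e) for e in log], indent=2))
with open("audit.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(asdict(log[0])))
    writer.writeheader()
    writer.writerows(asdict(e) for e in log)
```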

The numbers, and what they reveal

The team benchmarked Paper Circle’s retrieval using open source language models on four NVIDIA GPUs, with a test corpus drawn from major CS and ML conferences. They ran 50 queries in two styles: one set generated synthetically by an LLM to mimic realistic natural-language searches, and another constructed from random templates with varying scope constraints.

The best agent configuration, a quantized 30B parameter Qwen3-Coder model, found the target paper 80% of the time and ranked it with a mean reciprocal rank of 0.627 on the harder benchmark. It was also among the fastest configurations, finishing queries in about 22 seconds. The relationship between model size and retrieval quality was uneven: a 33B parameter DeepSeek model managed only a 12% hit rate, while a 3B parameter Qwen model reached 60%. Instruction-following ability appears to matter more than raw parameter count for this kind of multi-step task.
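
For readers less familiar with the metric, mean reciprocal rank averages the inverse rank of the first correct result across all queries, with misses typically counted as zero:

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

Under that convention, an 80% hit rate paired with an MRR of 0.627 implies that when the target paper was found at all, it usually sat in the first position or two.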

Perhaps the most instructive result in the paper involves no language model at all. Plain BM25 lexical matching, a much older technique, achieved a 78% hit rate, outperforming the majority of agent-based configurations. Adding a neural reranker on top of BM25 pushed ranking quality to its highest point in the study, but at roughly 28 times the computational cost. And a hybrid approach combining BM25 with semantic retrieval performed no better than BM25 alone. 
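
BM25 itself is the standard Okapi ranking function: for a query Q and document D,

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(q, D) is the term frequency of q in D, |D| is the document length, avgdl the corpus average, and k1 and b are tuning constants, commonly around 1.2 and 0.75. Nothing in it requires a neural network, which is what makes the 28x cost of the reranker so stark.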

When the team scaled up to 500 queries using the full agent pipeline, the hit rate climbed to 98% and MRR to 0.88. The jump is partly explained by the query mix: the larger benchmark included synthetically generated queries that turned out to be easier for multi-agent retrieval, a finding the authors flag as needing further investigation.

Where judgment is still human work

The system includes a review framework that assigns scores to papers across dimensions like novelty, soundness, and clarity, trying to approximate the role of a conference peer reviewer. When tested against actual ICLR 2024 reviews for 50 randomly selected papers, the correlation with human scores was poor across every model tested. Pearson coefficients stayed below 0.25, and some metrics showed negative correlations, meaning the system occasionally ranked papers in the reverse order of human preference.
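
For context, the Pearson coefficient measures linear agreement between the system’s scores x and the human scores y:

```latex
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}
```

A value below 0.25 means the system’s scores carry almost no linear signal about how humans would score the same papers, and a negative value means they point the wrong way.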

The authors write that the review component should not be used to compare or rank papers, and they attribute the gap partly to model capacity, noting that larger models produced somewhat better reviews. The contrast with the rest of the system is instructive. Searching, deduplicating, formatting, and extracting structured data from PDFs are tasks with clear inputs and verifiable outputs. Evaluating whether a paper’s contribution is significant, whether its experimental design is sound, or whether its claims outrun its evidence requires a kind of judgment that current models handle poorly, even when they can generate fluent prose that looks like a review.

Context and what comes next

Paper Circle joins a growing ecosystem that includes PaperQA, STORM, SciSage, Connected Papers, and alphaXiv. What sets it apart, based on the comparison the authors provide, is the combination of multi-source retrieval, typed knowledge graphs with provenance tracking, coverage verification, and deterministic audit logs. No other system in their comparison table offers all of these together.

The architectural choice to build on Hugging Face’s smolagents library, using code-generating agents rather than chat-style ones, is worth noting. Code agents chain tool calls and manage state more reliably than conversational agents, which matters when the goal is to produce structured output rather than natural-sounding dialogue.
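
A minimal sketch of that pattern with smolagents; the tool below is a hypothetical stand-in, the model class name varies across library versions, and this is not Paper Circle’s actual code:

```python
from smolagents import CodeAgent, InferenceClientModel, tool

@tool
def search_papers(query: str) -> str:
    """Search the paper index and return matching titles.

    Args:
        query: natural-language description of the papers wanted.
    """
    # Hypothetical stand-in for a real federated-search tool.
    return "1. Attention Is All You Need (2017)"

# A CodeAgent writes and executes Python that calls its tools, so the
# chain of calls and intermediate state stay explicit and inspectable,
# unlike free-form chat turns.
agent = CodeAgent(tools=[search_papers], model=InferenceClientModel())
agent.run("Find canonical papers on transformer architectures")
```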

Paper Circle does not claim to replace peer review, generate hypotheses, or write papers. Judging by the very positive response on Hugging Face, it has instead helped people find and organize what has already been written, and do so transparently. It will be exciting to see how the community builds on this initial work with new features and applications.
