Agentic-MME
What Agentic Capability Really Brings to Multimodal Intelligence?

A process-verified benchmark for multimodal agentic capabilities, jointly evaluating visual tool use and web knowledge expansion with stepwise auditing.

418
Real-World Tasks
6
Domains
3
Difficulty Levels
2000+
Stepwise Checkpoints

Overview

Multimodal LLMs are shifting from passive observers to active investigators. Agentic-MME evaluates whether they can manipulate visual evidence and retrieve external knowledge in a coherent, efficient workflow.

Abstract (abridged). Existing benchmarks often evaluate visual tool use and web search in isolation, and mostly score only final answers. Agentic-MME introduces a unified, process-verifiable setup with human-annotated trajectories, dual-axis evaluation (an S-axis for strategy/search and a V-axis for visual evidence), and overthinking-based efficiency diagnostics.
Case studies in Agentic-MME
Case studies across Level-1/2/3 tasks: from isolated visual operations to deeply coupled multi-round visual and knowledge workflows.

Benchmark Design

Tasks are organized by interaction complexity, from single-step visual operations to tightly interleaved visual-search reasoning.

Level 1

Visual Expansion Focus

Single decisive visual action (crop/rotate/enhance) to recover target evidence.

Level 2

Visual + Knowledge

Short multi-step workflows combining image operations and web retrieval.

Level 3

Synergistic Coupling

Iterative hypothesis-verification loops requiring interleaved visual and search reasoning.

S-axis: Knowledge Expansion

Audits search strategy, retrieval quality, and correctness of intermediate answers via checkpointed trajectory matching.

V-axis: Visual Expansion

Audits tool invocation correctness and whether generated visual artifacts genuinely expose required evidence.
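The checkpointed trajectory matching behind both axes can be sketched as follows. This is an illustrative scorer, not the paper's actual implementation: the `matches` predicate, the in-order matching rule, and the fractional score are all assumptions about how stepwise auditing might work.

```javascript
// Illustrative sketch of stepwise checkpoint auditing (assumed semantics,
// not the benchmark's real scorer): each reference checkpoint must be
// satisfied by some step of the model trajectory, in order.
function checkpointScore(trajectory, checkpoints) {
  let hit = 0;
  let cursor = 0; // earliest trajectory position still available for matching
  for (const cp of checkpoints) {
    // scan forward from the last matched step, preserving order
    const idx = trajectory.slice(cursor).findIndex((step) => cp.matches(step));
    if (idx !== -1) {
      hit += 1;
      cursor += idx + 1;
    }
  }
  return hit / checkpoints.length; // fraction of checkpoints reached in order
}
```

Under this sketch, a trajectory that performs the right actions in the wrong order is only partially credited, which is the kind of failure a final-answer-only metric cannot see.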

Data pipeline
Annotation and evaluation pipeline with standardized execution harness for code and API-based tool modes.

Main Results

State-of-the-art models still struggle on long-horizon multimodal agentic reasoning, especially in Level-3 tasks.

Main benchmark results table
Main benchmark performance (Table 3 in paper): overall and per-level results across closed-source and open-source models.
  • Best overall: Gemini 3-pro reaches 56.3% accuracy.
  • Hard split drop: performance falls to 33.3% on Level-3.
  • Observation: broad knowledge does not imply reliable planning + execution.
  • Diagnostic value: process-level auditing pinpoints failure sources hidden by final-answer-only metrics.
56.3%
Best Overall
33.3%
Best on Level-3
13
Visual Tools
4
Web Tools
Error heatmap
Fine-grained error analysis across difficulty levels and failure types.

Leaderboard

Interactive leaderboard built from the paper's main table. You can switch the interface mode, category, and sorting metric. To update results later, edit only the leaderboardData array in the script.

Model | Mode | Category | Overall Acc | L1 Acc | L2 Acc | L3 Acc | Overall S | Overall V

Note: values are copied from Table 3 (paper). Human settings do not have process scores, shown as "—".
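For reference, one leaderboardData entry might look like the sketch below. The field names and the `mode`/`category` values are assumptions inferred from the table columns, not the page's actual schema; the two numeric values shown are the ones reported above, and the remaining cells are left to be filled from Table 3.

```javascript
// Hypothetical entry shape for leaderboardData (field names are assumptions):
const leaderboardData = [
  {
    model: "Gemini 3-pro",     // best overall model per the results above
    mode: "code",              // placeholder interface mode
    category: "closed-source", // placeholder category
    overallAcc: 56.3,          // best overall accuracy reported
    l1Acc: null,               // fill per-level values from Table 3
    l2Acc: null,
    l3Acc: 33.3,               // best Level-3 accuracy reported
    overallS: null,            // process scores; "—" for human settings
    overallV: null,
  },
];

// One way the metric switcher might sort rows (missing values sink to the bottom):
function sortByMetric(rows, metric) {
  return [...rows].sort(
    (a, b) => (b[metric] ?? -Infinity) - (a[metric] ?? -Infinity)
  );
}
```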

Case Gallery

Representative appendix cases across Level-1/2/3, covering core visual expansion, short multi-step workflows, and advanced synergistic reasoning. To update later, edit only the caseData array.
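A caseData entry might be shaped like the sketch below. These are illustrative placeholder entries and field names, not actual cases from the appendix.

```javascript
// Hypothetical caseData shape (placeholder entries, assumed fields):
const caseData = [
  { level: 1, title: "Single-step visual expansion example", tools: ["crop"] },
  { level: 2, title: "Short visual + retrieval workflow example", tools: ["crop", "search"] },
  { level: 3, title: "Multi-round synergistic reasoning example", tools: ["crop", "enhance", "search"] },
];

// The gallery's level filter might simply be:
const levelThree = caseData.filter((c) => c.level === 3);
```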

Dataset and Annotation

Agentic-MME emphasizes human-supervised process labels to support robust and explainable benchmark diagnostics.

Task Scope

418 real-world tasks across six domains with progressive interaction complexity.

Process Labels

2000+ stepwise checkpoints with human reference trajectories and expected intermediate states.

Human Effort

On average 10+ person-hours of annotation and verification per task.

Dataset statistics
Dataset statistics: domains, levels, checkpoints, tool usage, and evidence distribution.

Citation

@misc{agenticmme2026,
  title  = {Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?},
  author = {Qianshan Wei and Yishan Yang and Siyi Wang and Jinglin Chen and Binyu Wang and Jiaming Wang and Shuang Chen and Zechen Li and Yang Shi and Yuqi Tang and Weining Wang and Yi Yu and Chaoyou Fu and Qi Li and Yi-Fan Zhang},
  year   = {2026},
  note   = {Process-verified benchmark for multimodal agentic capability}
}