MM-CondChain

A Programmatically Verified Benchmark for
Visually Grounded Deep Compositional Reasoning

Haozhan Shen1,2, Shilin Yan1†, Hongwei Xue1‡, Shuaiqi Lu1, Xiaojun Tang1,
Guannan Zhang1, Tiancheng Zhao3‡, Jianwei Yin2
†Project Leader ‡Corresponding Author
1Accio Team, Alibaba Group 2Zhejiang University 3ZJU-BJ

Introduction

MM-CondChain Overview
Figure 1: MM-CondChain overview showing comparison with prior benchmarks, example multi-layer reasoning chain, and model performance summary.

We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).

Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome.
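To make the chain structure concrete, the following sketch (a hypothetical instance with invented conditions and answer labels, not drawn from the benchmark) shows how layer-level truth values determine the execution path:

```python
# Hypothetical depth-2 instance: each layer holds a compositional condition
# and the branch to follow depending on its truth value on the image.
chain = {
    "layer_1": {
        "condition": "there are at least two red cars AND a person left of the bus",
        "if_true": "layer_2",
        "if_false": "answer_A",
    },
    "layer_2": {
        "condition": "the tallest building has more than 5 visible windows",
        "if_true": "answer_B",
        "if_false": "answer_C",
    },
}

def execute(chain, truth):
    """Follow the execution path given per-layer truth values."""
    node = "layer_1"
    path = []
    while node in chain:
        verdict = truth[node]
        path.append((node, verdict))
        node = chain[node]["if_true"] if verdict else chain[node]["if_false"]
    return path, node

path, outcome = execute(chain, {"layer_1": True, "layer_2": False})
print(outcome)  # answer_C
```

Answering correctly therefore requires getting every layer's truth value right: a single perception error early in the chain diverts the model onto the wrong branch and a wrong final outcome.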

To scalably construct such workflow-style data, we propose a VPIR-based agentic synthesis pipeline that decouples logical construction from language rendering. A Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles verified layers into complete instructions, automatically producing paired hard negatives where True/False paths differ by exactly one flipped predicate.
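A minimal sketch of the idea, with invented predicate and fact names (the paper's actual VPIR schema is richer): each layer predicate is executable against structured facts, and a hard negative flips exactly one predicate's comparison operator:

```python
import operator

# Illustrative VPIR-style predicates: (variable, op, constant) triples
# evaluated against structured visual facts.
OPS = {">=": operator.ge, "<": operator.lt, "==": operator.eq, "!=": operator.ne}
FLIP = {">=": "<", "<": ">=", "==": "!=", "!=": "=="}

def evaluate(pred, facts):
    var, op, const = pred
    return OPS[op](facts[var], const)

def hard_negative(preds, idx):
    """Flip exactly one predicate's comparison to invert that layer's truth."""
    flipped = list(preds)
    var, op, const = flipped[idx]
    flipped[idx] = (var, FLIP[op], const)
    return flipped

facts = {"num_red_cars": 3, "bar_max": 42.0}
layer = [("num_red_cars", ">=", 2), ("bar_max", "<", 50.0)]
print(all(evaluate(p, facts) for p in layer))                    # True
print(all(evaluate(p, facts) for p in hard_negative(layer, 0)))  # False
```

Because the original and the negative differ in a single predicate, any gap between a model's True-path and False-path scores isolates its sensitivity to that one flipped condition.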

Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains a Path F1 of only 53.33, with sharp drops on hard negatives and as chain depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.

Leaderboard

Overall performance on MM-CondChain. Results show that visually grounded deep compositional reasoning remains highly challenging for current MLLMs.
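The per-domain F1 columns in the table are consistent with the harmonic mean of the True-path and False-path scores; under that assumption, a minimal check reproduces the reported values:

```python
def path_f1(true_score, false_score):
    # Harmonic mean of True-path and False-path scores; 0 if both are 0.
    if true_score + false_score == 0:
        return 0.0
    return 2 * true_score * false_score / (true_score + false_score)

# Gemini-3-Pro, Chart domain: True=70.00, False=62.50 -> F1=66.04
print(round(path_f1(70.00, 62.50), 2))  # 66.04
```

The harmonic mean heavily penalizes imbalance, which is why models that answer "True" almost indiscriminately (high True-path, very low False-path scores) collapse to low F1.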

| Model | Org | Type | Nat. True | Nat. False | Nat. F1 | Chart True | Chart False | Chart F1 | GUI True | GUI False | GUI F1 | Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-3-Pro | Google | Proprietary | 73.87 | 44.97 | 55.91 | 70.00 | 62.50 | 66.04 | 32.63 | 45.62 | 38.05 | 53.33 |
| GPT-5-0807 | OpenAI | Proprietary | 80.65 | 33.67 | 47.51 | 63.50 | 67.50 | 65.44 | 30.77 | 49.87 | 38.06 | 50.34 |
| Gemini-3-Flash | Google | Proprietary | 54.77 | 41.46 | 47.19 | 60.50 | 63.50 | 61.96 | 36.87 | 34.75 | 35.78 | 48.31 |
| Qwen3-VL-Plus | Alibaba | Proprietary | 67.59 | 32.16 | 43.58 | 56.00 | 54.50 | 55.24 | 34.75 | 38.20 | 36.39 | 45.07 |
| Gemini-2.5-Pro | Google | Proprietary | 38.94 | 55.28 | 45.70 | 55.50 | 64.50 | 59.66 | 10.34 | 54.38 | 17.38 | 40.91 |
| Qwen3-VL-Flash | Alibaba | Proprietary | 61.56 | 29.65 | 40.02 | 59.50 | 47.50 | 52.83 | 58.62 | 10.61 | 17.97 | 36.94 |
| Gemini-2.5-Flash | Google | Proprietary | 29.40 | 48.24 | 36.53 | 35.50 | 47.00 | 40.45 | 6.90 | 44.83 | 11.95 | 29.64 |
| GPT-4o-1120 | OpenAI | Proprietary | 83.92 | 12.81 | 22.23 | 17.00 | 18.00 | 17.49 | 63.40 | 12.20 | 20.46 | 20.06 |
| Qwen3-VL-235B-A22B-Thinking | Alibaba | Open-Source | 65.49 | 39.55 | 49.31 | 61.50 | 58.50 | 59.96 | 28.91 | 33.95 | 31.23 | 46.83 |
| Qwen3.5-397B-A17B | Alibaba | Open-Source | 52.01 | 31.16 | 38.97 | 67.00 | 52.00 | 58.55 | 40.05 | 40.32 | 40.19 | 45.90 |
| Qwen3-VL-235B-A22B-Instruct | Alibaba | Open-Source | 62.12 | 43.94 | 51.47 | 55.00 | 61.00 | 57.84 | 62.60 | 17.24 | 27.04 | 45.45 |
| Kimi-K2.5 | Moonshot AI | Open-Source | 75.57 | 41.06 | 53.21 | 46.00 | 52.00 | 48.82 | 50.93 | 25.20 | 33.72 | 45.25 |
| Qwen3-VL-30B-A3B-Thinking | Alibaba | Open-Source | 30.90 | 31.16 | 31.03 | 58.00 | 56.50 | 57.24 | 40.53 | 27.73 | 32.93 | 40.40 |
| Qwen3.5-122B-A10B | Alibaba | Open-Source | 95.48 | 20.85 | 34.23 | 84.50 | 37.50 | 51.95 | 65.78 | 23.08 | 34.17 | 40.12 |
| Qwen3-VL-8B-Thinking | Alibaba | Open-Source | 60.71 | 30.48 | 40.58 | 49.50 | 37.00 | 42.35 | 37.14 | 27.85 | 31.83 | 38.25 |
| GLM-4.6V | Zhipu AI | Open-Source | 73.37 | 26.13 | 38.54 | 66.00 | 34.50 | 45.31 | 30.50 | 24.40 | 27.11 | 36.99 |
| Qwen3-VL-8B-Instruct | Alibaba | Open-Source | 47.98 | 30.81 | 37.52 | 39.78 | 39.78 | 39.78 | 58.67 | 12.53 | 20.65 | 32.65 |
| Qwen3.5-9B | Alibaba | Open-Source | 91.69 | 13.10 | 22.92 | 86.50 | 28.50 | 42.87 | 71.62 | 11.67 | 20.07 | 28.62 |
| InternVL3-38B | Shanghai AI Lab | Open-Source | 73.62 | 20.60 | 32.20 | 31.00 | 31.50 | 31.25 | 57.03 | 12.47 | 20.46 | 27.97 |
| Qwen3-VL-30B-A3B-Instruct | Alibaba | Open-Source | 27.64 | 27.14 | 27.38 | 44.00 | 35.50 | 39.30 | 73.67 | 7.98 | 14.40 | 27.03 |
| Qwen3.5-4B | Alibaba | Open-Source | 88.92 | 15.37 | 26.20 | 86.50 | 20.00 | 32.49 | 65.78 | 7.69 | 13.77 | 24.15 |
| Qwen3.5-35B-A3B | Alibaba | Open-Source | 93.43 | 11.62 | 20.66 | 88.50 | 17.00 | 28.52 | 74.27 | 14.32 | 24.02 | 24.40 |
| InternVL3-14B | Shanghai AI Lab | Open-Source | 76.38 | 13.57 | 23.04 | 43.00 | 21.00 | 28.22 | 84.62 | 2.39 | 4.64 | 18.63 |
| InternVL3.5-8B | Shanghai AI Lab | Open-Source | 82.41 | 10.30 | 18.31 | 76.00 | 19.50 | 31.04 | 82.23 | 1.33 | 2.61 | 17.32 |
| InternVL3-8B | Shanghai AI Lab | Open-Source | 65.33 | 8.29 | 14.72 | 47.50 | 8.50 | 14.42 | 63.66 | 5.31 | 9.79 | 12.98 |
| GLM-4.6V-Flash | Zhipu AI | Open-Source | 83.92 | 9.55 | 17.14 | 81.91 | 5.53 | 10.36 | 87.53 | 0.53 | 1.05 | 9.52 |
| Qwen3.5-0.8B | Alibaba | Open-Source | 33.17 | 2.26 | 4.23 | 31.50 | 3.00 | 5.48 | 33.95 | 1.86 | 3.52 | 4.41 |

If you want to add your model to our leaderboard, please contact your-email@example.com.

MM-CondChain

VPIR-based Benchmark Construction Pipeline

We propose a VPIR-based agentic benchmark construction pipeline that decouples logical construction from language rendering. The pipeline iteratively builds multi-layer reasoning chains where each layer is first expressed as an executable predicate and mechanically verified against structured visual facts, and only then rendered into natural language.
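The verification step can be sketched as executing each candidate condition as a boolean expression over the extracted facts (the variable names and expressions here are illustrative, not the pipeline's actual schema):

```python
# Sketch: a layer's VPIR condition is an executable boolean expression over
# extracted fact variables, so its truth value can be checked mechanically.
facts = {"num_people": 4, "max_bar_value": 87.5, "button_clicked": True}

def verify_layer(expression, facts):
    """Execute the predicate against structured facts; any error or
    non-boolean result rejects the candidate layer."""
    try:
        result = eval(expression, {"__builtins__": {}}, dict(facts))
    except Exception:
        return None
    return result if isinstance(result, bool) else None

print(verify_layer("num_people >= 3 and max_bar_value < 100", facts))  # True
print(verify_layer("undefined_var > 0", facts))  # None (rejected)
```

Because verification happens before language rendering, every natural-language condition in the benchmark inherits a known ground-truth truth value rather than relying on a model's judgment.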

MM-CondChain Pipeline
Overview of the MM-CondChain agentic synthesis pipeline. Given a multimodal input, the Planner iteratively extends a conditional chain with VPIR predicates verified via code execution.

Fact and VPIR Attribute Statistics

We observe clear domain-specific patterns across the three visual domains. Natural instances mainly rely on object attributes and spatial relations, Chart instances concentrate on numerical and structural statistics, and GUI instances emphasize action, state, and trajectory-level metadata.

Attribute Statistics
(Left column) Top attributes in extracted facts; (Right column) Top variables used in VPIR predicates for Natural, Chart, and GUI domains.

Logic Pattern Composition

VPIR expressions in MM-CondChain exhibit substantial structural diversity. The benchmark is not dominated by one or two simple templates: the top-20 templates cover only 50.07% of all expressions, and 128 unique templates are needed to reach 80% coverage.
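The coverage statistic can be computed by ranking templates by frequency and accumulating their share; a toy example with invented template strings:

```python
from collections import Counter

# Toy corpus of VPIR expression templates (illustrative strings).
templates = (["AND(cmp,cmp)"] * 30 + ["AND(cmp,cmp,cmp)"] * 25 +
             ["OR(cmp,exists)"] * 20 + ["cmp"] * 15 + ["NOT(exists)"] * 10)

def templates_for_coverage(counts, target):
    """Smallest number of top templates whose cumulative share reaches `target`."""
    total = sum(counts.values())
    covered = 0
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return k
    return len(counts)

counts = Counter(templates)
print(templates_for_coverage(counts, 0.80))  # 4
```

A long tail in this curve, as in MM-CondChain (128 templates for 80% coverage), indicates that models cannot solve the benchmark by memorizing a handful of logical patterns.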

Logic Patterns
Left: distribution of VPIR logic families. Middle: top-20 dominant templates. Right: example instantiation showing predicate mapping and natural language rendering.

Experiment Results

Design Ablations

We investigate how chain depth and predicate complexity affect model performance. From depth 2 to depth 6, Path F1 drops by approximately 29-33% in relative terms across all tested models. Similarly, increasing predicate complexity leads to substantial performance drops (27-36% relative degradation).
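Relative degradation here means the fractional drop from the shallow to the deep setting; a quick check with hypothetical Path F1 values:

```python
def relative_drop(f1_shallow, f1_deep):
    """Relative Path F1 degradation, e.g. from depth 2 to depth 6."""
    return (f1_shallow - f1_deep) / f1_shallow

# Hypothetical values: 60 Path F1 at depth 2 falling to 42 at depth 6
# is a 30% relative drop, within the reported 29-33% range.
print(f"{relative_drop(60.0, 42.0):.0%}")  # 30%
```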

Ablation Study
Left: Effect of chain depth (D=2,4,6) on Path F1. Right: Effect of predicate complexity (Simple vs Complex) on Path F1.

BibTeX

@article{shen2026mmcondchain,
  title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
  author={Shen, Haozhan and Yan, Shilin and Xue, Hongwei and Lu, Shuaiqi and Tang, Xiaojun and Zhang, Guannan and Zhao, Tiancheng and Yin, Jianwei},
  journal={arXiv preprint arXiv:2603.12266},
  year={2026}
}