MM-CondChain
A Programmatically Verified Benchmark for
Visually Grounded Deep Compositional Reasoning
Introduction
We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).
Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome.
To scalably construct such workflow-style data, we propose a VPIR-based agentic synthesis pipeline that decouples logical construction from language rendering. A Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles verified layers into complete instructions, automatically producing paired hard negatives where True/False paths differ by exactly one flipped predicate.
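The layer/chain/hard-negative structure described above can be sketched in a few lines. This is an illustrative sketch with a hypothetical schema (`Layer`, `Facts`, the example facts and predicates are all invented for illustration), not the benchmark's actual data format:

```python
# Illustrative sketch (hypothetical schema): a VPIR layer is an executable
# predicate over structured visual facts; a chain is a list of layers whose
# True/False branch outcomes form the execution path.
from dataclasses import dataclass, replace
from typing import Callable

Facts = dict  # structured visual facts extracted from the image

@dataclass(frozen=True)
class Layer:
    name: str
    predicate: Callable[[Facts], bool]
    negated: bool = False  # flipped in a hard negative

    def evaluate(self, facts: Facts) -> bool:
        value = self.predicate(facts)
        return (not value) if self.negated else value

def execute_chain(layers: list[Layer], facts: Facts) -> list[bool]:
    """Follow the chain layer by layer, recording the branch taken at each step."""
    return [layer.evaluate(facts) for layer in layers]

def hard_negative(layers: list[Layer], i: int) -> list[Layer]:
    """Paired hard negative: an identical chain with exactly one predicate flipped."""
    return [replace(l, negated=not l.negated) if j == i else l
            for j, l in enumerate(layers)]

facts = {"red_objects": 3, "dog_left_of_car": True}
chain = [
    Layer("L1", lambda f: f["red_objects"] >= 2 and f["dog_left_of_car"]),
    Layer("L2", lambda f: f["red_objects"] == 3),
]
assert execute_chain(chain, facts) == [True, True]
assert execute_chain(hard_negative(chain, 1), facts) == [True, False]
```

Because each predicate is executable, the True and False paths of a pair are mechanically guaranteed to differ in exactly one layer's outcome.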
Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a broad range of MLLMs show that even the strongest model attains an average Path F1 of only 53.33, with sharp drops on hard negatives and as chain depth or predicate complexity grows, confirming that visually grounded deep compositional reasoning remains a fundamental challenge.
Leaderboard
Overall performance on MM-CondChain. Results show that visually grounded deep compositional reasoning remains highly challenging for current MLLMs.
| # | Model | Type | Nat. True | Nat. False | Nat. F1 | Chart True | Chart False | Chart F1 | GUI True | GUI False | GUI F1 | Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-3-Pro | Proprietary | 73.87 | 44.97 | 55.91 | 70.00 | 62.50 | 66.04 | 32.63 | 45.62 | 38.05 | 53.33 |
| 2 | GPT-5-0807 (OpenAI) | Proprietary | 80.65 | 33.67 | 47.51 | 63.50 | 67.50 | 65.44 | 30.77 | 49.87 | 38.06 | 50.34 |
| 3 | Gemini-3-Flash | Proprietary | 54.77 | 41.46 | 47.19 | 60.50 | 63.50 | 61.96 | 36.87 | 34.75 | 35.78 | 48.31 |
| 4 | Qwen3-VL-Plus (Alibaba) | Proprietary | 67.59 | 32.16 | 43.58 | 56.00 | 54.50 | 55.24 | 34.75 | 38.20 | 36.39 | 45.07 |
| 5 | Gemini-2.5-Pro | Proprietary | 38.94 | 55.28 | 45.70 | 55.50 | 64.50 | 59.66 | 10.34 | 54.38 | 17.38 | 40.91 |
| 6 | Qwen3-VL-Flash (Alibaba) | Proprietary | 61.56 | 29.65 | 40.02 | 59.50 | 47.50 | 52.83 | 58.62 | 10.61 | 17.97 | 36.94 |
| 7 | Gemini-2.5-Flash | Proprietary | 29.40 | 48.24 | 36.53 | 35.50 | 47.00 | 40.45 | 6.90 | 44.83 | 11.95 | 29.64 |
| 8 | GPT-4o-1120 (OpenAI) | Proprietary | 83.92 | 12.81 | 22.23 | 17.00 | 18.00 | 17.49 | 63.40 | 12.20 | 20.46 | 20.06 |
| 9 | Qwen3-VL-235B-A22B-Thinking (Alibaba) | Open-Source | 65.49 | 39.55 | 49.31 | 61.50 | 58.50 | 59.96 | 28.91 | 33.95 | 31.23 | 46.83 |
| 10 | Qwen3.5-397B-A17B (Alibaba) | Open-Source | 52.01 | 31.16 | 38.97 | 67.00 | 52.00 | 58.55 | 40.05 | 40.32 | 40.19 | 45.90 |
| 11 | Qwen3-VL-235B-A22B-Instruct (Alibaba) | Open-Source | 62.12 | 43.94 | 51.47 | 55.00 | 61.00 | 57.84 | 62.60 | 17.24 | 27.04 | 45.45 |
| 12 | Kimi-K2.5 (Moonshot AI) | Open-Source | 75.57 | 41.06 | 53.21 | 46.00 | 52.00 | 48.82 | 50.93 | 25.20 | 33.72 | 45.25 |
| 13 | Qwen3-VL-30B-A3B-Thinking (Alibaba) | Open-Source | 30.90 | 31.16 | 31.03 | 58.00 | 56.50 | 57.24 | 40.53 | 27.73 | 32.93 | 40.40 |
| 14 | Qwen3.5-122B-A10B (Alibaba) | Open-Source | 95.48 | 20.85 | 34.23 | 84.50 | 37.50 | 51.95 | 65.78 | 23.08 | 34.17 | 40.12 |
| 15 | Qwen3-VL-8B-Thinking (Alibaba) | Open-Source | 60.71 | 30.48 | 40.58 | 49.50 | 37.00 | 42.35 | 37.14 | 27.85 | 31.83 | 38.25 |
| 16 | GLM-4.6V (Zhipu AI) | Open-Source | 73.37 | 26.13 | 38.54 | 66.00 | 34.50 | 45.31 | 30.50 | 24.40 | 27.11 | 36.99 |
| 17 | Qwen3-VL-8B-Instruct (Alibaba) | Open-Source | 47.98 | 30.81 | 37.52 | 39.78 | 39.78 | 39.78 | 58.67 | 12.53 | 20.65 | 32.65 |
| 18 | Qwen3.5-9B (Alibaba) | Open-Source | 91.69 | 13.10 | 22.92 | 86.50 | 28.50 | 42.87 | 71.62 | 11.67 | 20.07 | 28.62 |
| 19 | InternVL3-38B (Shanghai AI Lab) | Open-Source | 73.62 | 20.60 | 32.20 | 31.00 | 31.50 | 31.25 | 57.03 | 12.47 | 20.46 | 27.97 |
| 20 | Qwen3-VL-30B-A3B-Instruct (Alibaba) | Open-Source | 27.64 | 27.14 | 27.38 | 44.00 | 35.50 | 39.30 | 73.67 | 7.98 | 14.40 | 27.03 |
| 21 | Qwen3.5-35B-A3B (Alibaba) | Open-Source | 93.43 | 11.62 | 20.66 | 88.50 | 17.00 | 28.52 | 74.27 | 14.32 | 24.02 | 24.40 |
| 22 | Qwen3.5-4B (Alibaba) | Open-Source | 88.92 | 15.37 | 26.20 | 86.50 | 20.00 | 32.49 | 65.78 | 7.69 | 13.77 | 24.15 |
| 23 | InternVL3-14B (Shanghai AI Lab) | Open-Source | 76.38 | 13.57 | 23.04 | 43.00 | 21.00 | 28.22 | 84.62 | 2.39 | 4.64 | 18.63 |
| 24 | InternVL3.5-8B (Shanghai AI Lab) | Open-Source | 82.41 | 10.30 | 18.31 | 76.00 | 19.50 | 31.04 | 82.23 | 1.33 | 2.61 | 17.32 |
| 25 | InternVL3-8B (Shanghai AI Lab) | Open-Source | 65.33 | 8.29 | 14.72 | 47.50 | 8.50 | 14.42 | 63.66 | 5.31 | 9.79 | 12.98 |
| 26 | GLM-4.6V-Flash (Zhipu AI) | Open-Source | 83.92 | 9.55 | 17.14 | 81.91 | 5.53 | 10.36 | 87.53 | 0.53 | 1.05 | 9.52 |
| 27 | Qwen3.5-0.8B (Alibaba) | Open-Source | 33.17 | 2.26 | 4.23 | 31.50 | 3.00 | 5.48 | 33.95 | 1.86 | 3.52 | 4.41 |
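The F1 columns in the table above are consistent with the harmonic mean of the True-path and False-path accuracies, with Avg F1 the unweighted mean of the three per-domain F1 scores; a sketch that reproduces the top row:

```python
def path_f1(true_acc: float, false_acc: float) -> float:
    """Harmonic mean of True-path and False-path accuracy (both in percent)."""
    if true_acc + false_acc == 0:
        return 0.0
    return 2 * true_acc * false_acc / (true_acc + false_acc)

# Gemini-3-Pro row from the leaderboard
nat = path_f1(73.87, 44.97)    # 55.91
chart = path_f1(70.00, 62.50)  # 66.04
gui = path_f1(32.63, 45.62)    # 38.05
avg = (nat + chart + gui) / 3  # 53.33
```

The harmonic mean penalizes models that do well on only one branch, which is why rows with a large True/False gap (e.g. near-always-True responders) score low despite high True-path accuracy.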
If you want to add your model to our leaderboard, please contact your-email@example.com.
VPIR-based Benchmark Construction Pipeline
We propose a VPIR-based agentic benchmark construction pipeline that decouples logical construction from language rendering. The pipeline iteratively builds multi-layer reasoning chains where each layer is first expressed as an executable predicate and mechanically verified against structured visual facts, and only then rendered into natural language.
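The verify-then-render loop can be sketched as follows. This is a minimal sketch with hypothetical interfaces (`verify`, `render`, and the example facts are invented for illustration; in the actual pipeline an LLM performs the rendering):

```python
# Minimal sketch (hypothetical interfaces): each candidate layer is first an
# executable predicate, mechanically verified against the structured facts,
# and only rendered into natural language after verification succeeds.
facts = {"bars": [12, 7, 19], "title": "Sales"}

def verify(predicate, expected: bool, facts: dict) -> bool:
    """Accept a layer only if its predicate executes cleanly and yields
    the truth value the chain design expects."""
    try:
        return bool(predicate(facts)) == expected
    except (KeyError, TypeError):
        return False  # ungrounded or ill-typed condition: reject and resample

def render(description: str) -> str:
    # A template stands in for the language-rendering step here.
    return f"If {description}, follow the True branch; otherwise the False branch."

candidate = (lambda f: max(f["bars"]) > 15, "the tallest bar exceeds 15")
if verify(candidate[0], expected=True, facts=facts):
    instruction = render(candidate[1])
```

Decoupling the two steps means a rendering error can never change a layer's ground-truth branch: the truth value is fixed by the verified predicate before any language is generated.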
Fact and VPIR Attribute Statistics
We observe clear domain-specific patterns across the three visual domains. Natural-image instances rely mainly on object attributes and spatial relations, Chart instances concentrate on numerical and structural statistics, and GUI instances emphasize action, state, and trajectory-level metadata.
Logic Pattern Composition
VPIR expressions in MM-CondChain exhibit substantial structural diversity. The benchmark is not dominated by one or two simple templates: the top-20 templates cover only 50.07% of all expressions, and 128 unique templates are needed to reach 80% coverage.
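The two coverage statistics above can be computed from a template frequency table as sketched below; the frequency data here is synthetic and purely illustrative, not the benchmark's actual distribution:

```python
# Compute (a) the share of expressions covered by the top-k templates and
# (b) the minimum number of templates needed to reach a target coverage.
from collections import Counter

def coverage_stats(template_counts: Counter, k: int, target: float):
    total = sum(template_counts.values())
    freqs = sorted(template_counts.values(), reverse=True)
    top_k_cov = sum(freqs[:k]) / total
    running, needed = 0, 0
    for f in freqs:
        running += f
        needed += 1
        if running / total >= target:
            break
    return top_k_cov, needed

# Synthetic long-tailed distribution for illustration only
counts = Counter({f"T{i}": max(1, 100 - 3 * i) for i in range(200)})
top20_coverage, templates_for_80 = coverage_stats(counts, k=20, target=0.80)
```

A heavier tail (lower top-20 coverage, more templates needed for 80%) indicates greater structural diversity, which is the property the statistics above quantify.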
Experiment Results
Design Ablations
We investigate how chain depth and predicate complexity affect model performance. From depth 2 to depth 6, Path F1 drops by approximately 29-33% in relative terms across all tested models. Similarly, increasing predicate complexity leads to substantial performance drops (27-36% relative degradation).
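The relative-degradation figures above follow the usual convention; a sketch with made-up Path F1 values (the numbers below are hypothetical, chosen only to land inside the reported range):

```python
# Hypothetical Path F1 values by chain depth, for illustration only.
f1_by_depth = {2: 55.0, 4: 46.0, 6: 38.0}

def relative_drop(shallow: float, deep: float) -> float:
    """Relative degradation in percent between two depth settings."""
    return 100 * (shallow - deep) / shallow

drop = relative_drop(f1_by_depth[2], f1_by_depth[6])  # 30.9, within 29-33%
```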
BibTeX
@article{shen2026mmcondchain,
  title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
  author={Shen, Haozhan and Yan, Shilin and Xue, Hongwei and Lu, Shuaiqi and Tang, Xiaojun and Zhang, Guannan and Zhao, Tiancheng and Yin, Jianwei},
  journal={arXiv preprint arXiv:2603.12266},
  year={2026}
}