MM-CondChain

A Programmatically Verified Benchmark for
Visually Grounded Deep Compositional Reasoning

Haozhan Shen1,2, Shilin Yan1†, Hongwei Xue1‡, Shuaiqi Lu1, Xiaojun Tang1,
Guannan Zhang1, Tiancheng Zhao3‡, Jianwei Yin2
†Project Leader ‡Corresponding Author
1Accio Team, Alibaba Group 2Zhejiang University 3ZJU-BJ

Introduction

MM-CondChain Overview
Figure 1: MM-CondChain overview showing comparison with prior benchmarks, example multi-layer reasoning chain, and model performance summary.

We introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning in Multimodal Large Language Models (MLLMs).

Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome.
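To make the chain structure concrete, the following sketch (a hypothetical instance with invented conditions and answer labels, not drawn from the benchmark) shows how layer-level truth values determine the execution path:

```python
# Hypothetical depth-2 instance: each layer holds a compositional condition
# and the branch to follow depending on its truth value on the image.
chain = {
    "layer_1": {
        "condition": "there are at least two red cars AND a person left of the bus",
        "if_true": "layer_2",
        "if_false": "answer_A",
    },
    "layer_2": {
        "condition": "the tallest building has more than 5 visible windows",
        "if_true": "answer_B",
        "if_false": "answer_C",
    },
}

def execute(chain, truth):
    """Follow the execution path given per-layer truth values."""
    node = "layer_1"
    path = []
    while node in chain:
        verdict = truth[node]
        path.append((node, verdict))
        node = chain[node]["if_true"] if verdict else chain[node]["if_false"]
    return path, node

path, outcome = execute(chain, {"layer_1": True, "layer_2": False})
print(outcome)  # answer_C
```

Answering correctly therefore requires getting every layer's truth value right: a single perception error early in the chain diverts the model onto the wrong branch and a wrong final outcome.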

To scalably construct such workflow-style data, we propose a VPIR-based agentic synthesis pipeline that decouples logical construction from language rendering. A Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles verified layers into complete instructions, automatically producing paired hard negatives where True/False paths differ by exactly one flipped predicate.
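A minimal sketch of the idea, with invented predicate and fact names (the paper's actual VPIR schema is richer): each layer predicate is executable against structured facts, and a hard negative flips exactly one predicate's comparison operator:

```python
import operator

# Illustrative VPIR-style predicates: (variable, op, constant) triples
# evaluated against structured visual facts.
OPS = {">=": operator.ge, "<": operator.lt, "==": operator.eq, "!=": operator.ne}
FLIP = {">=": "<", "<": ">=", "==": "!=", "!=": "=="}

def evaluate(pred, facts):
    var, op, const = pred
    return OPS[op](facts[var], const)

def hard_negative(preds, idx):
    """Flip exactly one predicate's comparison to invert that layer's truth."""
    flipped = list(preds)
    var, op, const = flipped[idx]
    flipped[idx] = (var, FLIP[op], const)
    return flipped

facts = {"num_red_cars": 3, "bar_max": 42.0}
layer = [("num_red_cars", ">=", 2), ("bar_max", "<", 50.0)]
print(all(evaluate(p, facts) for p in layer))                    # True
print(all(evaluate(p, facts) for p in hard_negative(layer, 0)))  # False
```

Because the original and the negative differ in a single predicate, any gap between a model's True-path and False-path scores isolates its sensitivity to that one flipped condition.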

Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains a Path F1 of only 53.33, with sharp drops on hard negatives and as chain depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.

Leaderboard

Overall performance on MM-CondChain. Results show that visually grounded deep compositional reasoning remains highly challenging for current MLLMs.
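The per-domain F1 columns in the table are consistent with the harmonic mean of the True-path and False-path scores; under that assumption, a minimal check reproduces the reported values:

```python
def path_f1(true_score, false_score):
    # Harmonic mean of True-path and False-path scores; 0 if both are 0.
    if true_score + false_score == 0:
        return 0.0
    return 2 * true_score * false_score / (true_score + false_score)

# Gemini-3-Pro, Chart domain: True=70.00, False=62.50 -> F1=66.04
print(round(path_f1(70.00, 62.50), 2))  # 66.04
```

The harmonic mean heavily penalizes imbalance, which is why models that answer "True" almost indiscriminately (high True-path, very low False-path scores) collapse to low F1.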

| Model | Org | Type | Nat. True | Nat. False | Nat. F1 | Chart True | Chart False | Chart F1 | GUI True | GUI False | GUI F1 | Avg F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-3-Pro | Google | Proprietary | 73.87 | 44.97 | 55.91 | 70.00 | 62.50 | 66.04 | 32.63 | 45.62 | 38.05 | 53.33 |
| GPT-5-0807 | OpenAI | Proprietary | 80.65 | 33.67 | 47.51 | 63.50 | 67.50 | 65.44 | 30.77 | 49.87 | 38.06 | 50.34 |
| Gemini-3-Flash | Google | Proprietary | 54.77 | 41.46 | 47.19 | 60.50 | 63.50 | 61.96 | 36.87 | 34.75 | 35.78 | 48.31 |
| Qwen3-VL-Plus | Alibaba | Proprietary | 67.59 | 32.16 | 43.58 | 56.00 | 54.50 | 55.24 | 34.75 | 38.20 | 36.39 | 45.07 |
| Gemini-2.5-Pro | Google | Proprietary | 38.94 | 55.28 | 45.70 | 55.50 | 64.50 | 59.66 | 10.34 | 54.38 | 17.38 | 40.91 |
| Qwen3-VL-Flash | Alibaba | Proprietary | 61.56 | 29.65 | 40.02 | 59.50 | 47.50 | 52.83 | 58.62 | 10.61 | 17.97 | 36.94 |
| Gemini-2.5-Flash | Google | Proprietary | 29.40 | 48.24 | 36.53 | 35.50 | 47.00 | 40.45 | 6.90 | 44.83 | 11.95 | 29.64 |
| GPT-4o-1120 | OpenAI | Proprietary | 83.92 | 12.81 | 22.23 | 17.00 | 18.00 | 17.49 | 63.40 | 12.20 | 20.46 | 20.06 |
| Qwen3-VL-235B-A22B-Thinking | Alibaba | Open-Source | 65.49 | 39.55 | 49.31 | 61.50 | 58.50 | 59.96 | 28.91 | 33.95 | 31.23 | 46.83 |
| Qwen3.5-397B-A17B | Alibaba | Open-Source | 52.01 | 31.16 | 38.97 | 67.00 | 52.00 | 58.55 | 40.05 | 40.32 | 40.19 | 45.90 |
| Qwen3-VL-235B-A22B-Instruct | Alibaba | Open-Source | 62.12 | 43.94 | 51.47 | 55.00 | 61.00 | 57.84 | 62.60 | 17.24 | 27.04 | 45.45 |
| Kimi-K2.5 | Moonshot AI | Open-Source | 75.57 | 41.06 | 53.21 | 46.00 | 52.00 | 48.82 | 50.93 | 25.20 | 33.72 | 45.25 |
| Qwen3-VL-30B-A3B-Thinking | Alibaba | Open-Source | 30.90 | 31.16 | 31.03 | 58.00 | 56.50 | 57.24 | 40.53 | 27.73 | 32.93 | 40.40 |
| Qwen3.5-122B-A10B | Alibaba | Open-Source | 95.48 | 20.85 | 34.23 | 84.50 | 37.50 | 51.95 | 65.78 | 23.08 | 34.17 | 40.12 |
| Qwen3-VL-8B-Thinking | Alibaba | Open-Source | 60.71 | 30.48 | 40.58 | 49.50 | 37.00 | 42.35 | 37.14 | 27.85 | 31.83 | 38.25 |
| GLM-4.6V | Zhipu AI | Open-Source | 73.37 | 26.13 | 38.54 | 66.00 | 34.50 | 45.31 | 30.50 | 24.40 | 27.11 | 36.99 |
| Qwen3-VL-8B-Instruct | Alibaba | Open-Source | 47.98 | 30.81 | 37.52 | 39.78 | 39.78 | 39.78 | 58.67 | 12.53 | 20.65 | 32.65 |
| Qwen3.5-9B | Alibaba | Open-Source | 91.69 | 13.10 | 22.92 | 86.50 | 28.50 | 42.87 | 71.62 | 11.67 | 20.07 | 28.62 |
| InternVL3-38B | Shanghai AI Lab | Open-Source | 73.62 | 20.60 | 32.20 | 31.00 | 31.50 | 31.25 | 57.03 | 12.47 | 20.46 | 27.97 |
| Qwen3-VL-30B-A3B-Instruct | Alibaba | Open-Source | 27.64 | 27.14 | 27.38 | 44.00 | 35.50 | 39.30 | 73.67 | 7.98 | 14.40 | 27.03 |
| Qwen3.5-4B | Alibaba | Open-Source | 88.92 | 15.37 | 26.20 | 86.50 | 20.00 | 32.49 | 65.78 | 7.69 | 13.77 | 24.15 |
| Qwen3.5-35B-A3B | Alibaba | Open-Source | 93.43 | 11.62 | 20.66 | 88.50 | 17.00 | 28.52 | 74.27 | 14.32 | 24.02 | 24.40 |
| InternVL3-14B | Shanghai AI Lab | Open-Source | 76.38 | 13.57 | 23.04 | 43.00 | 21.00 | 28.22 | 84.62 | 2.39 | 4.64 | 18.63 |
| InternVL3.5-8B | Shanghai AI Lab | Open-Source | 82.41 | 10.30 | 18.31 | 76.00 | 19.50 | 31.04 | 82.23 | 1.33 | 2.61 | 17.32 |
| InternVL3-8B | Shanghai AI Lab | Open-Source | 65.33 | 8.29 | 14.72 | 47.50 | 8.50 | 14.42 | 63.66 | 5.31 | 9.79 | 12.98 |
| GLM-4.6V-Flash | Zhipu AI | Open-Source | 83.92 | 9.55 | 17.14 | 81.91 | 5.53 | 10.36 | 87.53 | 0.53 | 1.05 | 9.52 |
| Qwen3.5-0.8B | Alibaba | Open-Source | 33.17 | 2.26 | 4.23 | 31.50 | 3.00 | 5.48 | 33.95 | 1.86 | 3.52 | 4.41 |

If you want to add your model to our leaderboard, please contact your-email@example.com.

MM-CondChain

VPIR-based Benchmark Construction Pipeline

We propose a VPIR-based agentic benchmark construction pipeline that decouples logical construction from language rendering. The pipeline iteratively builds multi-layer reasoning chains where each layer is first expressed as an executable predicate and mechanically verified against structured visual facts, and only then rendered into natural language.
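The verification step can be sketched as executing each candidate condition as a boolean expression over the extracted facts (the variable names and expressions here are illustrative, not the pipeline's actual schema):

```python
# Sketch: a layer's VPIR condition is an executable boolean expression over
# extracted fact variables, so its truth value can be checked mechanically.
facts = {"num_people": 4, "max_bar_value": 87.5, "button_clicked": True}

def verify_layer(expression, facts):
    """Execute the predicate against structured facts; any error or
    non-boolean result rejects the candidate layer."""
    try:
        result = eval(expression, {"__builtins__": {}}, dict(facts))
    except Exception:
        return None
    return result if isinstance(result, bool) else None

print(verify_layer("num_people >= 3 and max_bar_value < 100", facts))  # True
print(verify_layer("undefined_var > 0", facts))  # None (rejected)
```

Because verification happens before language rendering, every natural-language condition in the benchmark inherits a known ground-truth truth value rather than relying on a model's judgment.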

MM-CondChain Pipeline
Overview of the MM-CondChain agentic synthesis pipeline. Given a multimodal input, the Planner iteratively extends a conditional chain with VPIR predicates verified via code execution.

Fact and VPIR Attribute Statistics

We observe clear domain-specific patterns across the three visual domains. Natural instances mainly rely on object attributes and spatial relations, Chart instances concentrate on numerical and structural statistics, and GUI instances emphasize action, state, and trajectory-level metadata.

Attribute Statistics
(Left column) Top attributes in extracted facts; (Right column) Top variables used in VPIR predicates for Natural, Chart, and GUI domains.

Logic Pattern Composition

VPIR expressions in MM-CondChain exhibit substantial structural diversity. The benchmark is not dominated by one or two simple templates: the top-20 templates cover only 50.07% of all expressions, and 128 unique templates are needed to reach 80% coverage.
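The coverage statistic can be computed by ranking templates by frequency and accumulating their share; a toy example with invented template strings:

```python
from collections import Counter

# Toy corpus of VPIR expression templates (illustrative strings).
templates = (["AND(cmp,cmp)"] * 30 + ["AND(cmp,cmp,cmp)"] * 25 +
             ["OR(cmp,exists)"] * 20 + ["cmp"] * 15 + ["NOT(exists)"] * 10)

def templates_for_coverage(counts, target):
    """Smallest number of top templates whose cumulative share reaches `target`."""
    total = sum(counts.values())
    covered = 0
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return k
    return len(counts)

counts = Counter(templates)
print(templates_for_coverage(counts, 0.80))  # 4
```

A long tail in this curve, as in MM-CondChain (128 templates for 80% coverage), indicates that models cannot solve the benchmark by memorizing a handful of logical patterns.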

Logic Patterns
Left: distribution of VPIR logic families. Middle: top-20 dominant templates. Right: example instantiation showing predicate mapping and natural language rendering.

Experiment Results

Design Ablations

We investigate how chain depth and predicate complexity affect model performance. From depth 2 to depth 6, Path F1 drops by approximately 29-33% in relative terms across all tested models. Similarly, increasing predicate complexity leads to substantial performance drops (27-36% relative degradation).
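Relative degradation here means the fractional drop from the shallow to the deep setting; a quick check with hypothetical Path F1 values:

```python
def relative_drop(f1_shallow, f1_deep):
    """Relative Path F1 degradation, e.g. from depth 2 to depth 6."""
    return (f1_shallow - f1_deep) / f1_shallow

# Hypothetical values: 60 Path F1 at depth 2 falling to 42 at depth 6
# is a 30% relative drop, within the reported 29-33% range.
print(f"{relative_drop(60.0, 42.0):.0%}")  # 30%
```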

Ablation Study
Left: Effect of chain depth (D=2,4,6) on Path F1. Right: Effect of predicate complexity (Simple vs Complex) on Path F1.

BibTeX

@article{shen2026mmcondchain,
  title={MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning},
  author={Shen, Haozhan and Yan, Shilin and Xue, Hongwei and Lu, Shuaiqi and Tang, Xiaojun and Zhang, Guannan and Zhao, Tiancheng and Yin, Jianwei},
  journal={arXiv preprint arXiv:2603.12266},
  year={2026}
}