SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

1 Huazhong University of Science and Technology
2 Accio Team, Alibaba Group

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs reason primarily through textual chain-of-thought (CoT), which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for a given user query. We introduce SwimBird, a reasoning-switchable MLLM that dynamically selects among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and we design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-intensive tasks.
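
To make the hybrid autoregressive formulation concrete, below is a minimal PyTorch sketch of one plausible decoding loop, written entirely under our own assumptions: a causal backbone that accepts precomputed input embeddings, a standard language-model head for next-token prediction, a separate linear head for next-embedding prediction, and a hypothetical pair of control tokens (<bov>/<eov>) that open and close a visual-thought segment. Under the same assumptions, training would sum a cross-entropy loss over text positions with a regression loss (e.g., MSE) over visual-thought positions, L = L_NTP + lambda * L_NEP; the control tokens, head names, and loss weighting are illustrative, not details from the paper.

import torch
import torch.nn as nn

class HybridARDecoder(nn.Module):
    """Illustrative hybrid autoregressive decoder (not the authors' code).

    Textual thoughts are produced by next-token prediction over a discrete
    vocabulary; visual thoughts are produced by next-embedding prediction in
    the continuous hidden space. Hypothetical <bov>/<eov> control tokens
    switch between the two regimes.
    """

    def __init__(self, backbone, vocab_size, hidden_dim, bov_id, eov_id):
        super().__init__()
        self.backbone = backbone  # any causal transformer mapping inputs_embeds -> last_hidden_state
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)   # next-token prediction head
        self.emb_head = nn.Linear(hidden_dim, hidden_dim)  # next-embedding prediction head
        self.bov_id, self.eov_id = bov_id, eov_id

    @torch.no_grad()
    def generate(self, prefix_embeds, max_steps=256, eos_id=2):
        embeds = prefix_embeds  # (1, T, hidden_dim): embedded prompt plus image features
        mode, tokens, thoughts = "text", [], []
        for _ in range(max_steps):
            h = self.backbone(inputs_embeds=embeds).last_hidden_state[:, -1]  # (1, hidden_dim)
            tok = int(self.lm_head(h).argmax(-1))  # greedy for brevity; sampling also works
            if mode == "text" or tok == self.eov_id:
                # Discrete step: emit a token; <bov>/<eov> flip the reasoning mode.
                if tok == eos_id:
                    break
                mode = "visual" if tok == self.bov_id else "text"
                tokens.append(tok)
                nxt = self.tok_emb(torch.tensor([tok], device=embeds.device))
            else:
                # Continuous step: the hidden state is projected and fed back
                # directly as the next input embedding, never detokenized.
                nxt = self.emb_head(h)
                thoughts.append(nxt)
            embeds = torch.cat([embeds, nxt.unsqueeze(1)], dim=1)
        return tokens, thoughts

In this reading, text-only reasoning simply never emits <bov>, vision-only reasoning spends its budget inside a single <bov>...<eov> segment, and interleaved reasoning alternates between the two; which pattern fires would be learned from data such as SwimBird-SFT-92K rather than hard-coded.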

Figure: SwimBird method architecture.

Key Features & Results

Performance Results

Visual Understanding
Performance on fine-grained visual understanding benchmarks

Reasoning Tasks
Performance on general VQA and multimodal reasoning tasks

Contact & Opportunities

If you have any questions about this project, please feel free to contact:

We're Hiring!

Accio Lab is actively seeking self-motivated researchers and research interns to join our team!

Research Areas

Multimodal Large Language Models, Large Language Models, Agentic AI, and AI Agents

What We Look For

Passion for AI research, strong coding skills, and independent thinking

What We Offer

Cutting-edge research, mentorship, and collaboration opportunities

What You'll Get

Cutting-edge technology, mature Agent products, and a flexible work environment

Interested? Send your CV and research interests to:

BibTeX

@article{tong2026swimbird,
  title={SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs},
  author={Jintao Tong and Shilin Yan and Hongwei Xue and Xiaojun Tang and Kunyu Shi and Guannan Zhang and Ruixuan Li and Yixiong Zou},
  journal={Conference/Journal Name},
  year={2026},
  url={https://your-domain.com/your-project-page}
}