Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs reason primarily with textual chain-of-thought (CoT), which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically selects among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-intensive tasks.
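The hybrid autoregressive formulation described above can be pictured as a single decoder with two output heads: a language head for next-token prediction on textual thoughts and an embedding head for next-embedding prediction on visual thoughts. The sketch below is a minimal, hypothetical illustration of that idea; the class, method, and head names (`HybridDecoder`, `lm_head`, `embed_head`, `visual_mode`) are assumptions for exposition and do not reflect the released SwimBird code.

```python
# Hypothetical sketch of a hybrid autoregressive decoding step.
# All names here are illustrative assumptions, not the SwimBird API.
import torch
import torch.nn as nn


class HybridDecoder(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                              # shared autoregressive transformer
        self.lm_head = nn.Linear(hidden_dim, vocab_size)      # next-token prediction (text thoughts)
        self.embed_head = nn.Linear(hidden_dim, hidden_dim)   # next-embedding prediction (visual thoughts)

    @torch.no_grad()
    def step(self, inputs_embeds: torch.Tensor, visual_mode: bool):
        """One decoding step: emit a discrete text token or a continuous visual thought."""
        h = self.backbone(inputs_embeds)[:, -1]  # hidden state at the last position, shape [B, D]
        if visual_mode:
            # Vision-only / interleaved segments: regress the next hidden state and
            # feed it back into the context as a continuous "visual thought" embedding.
            return self.embed_head(h)
        # Text segments: ordinary next-token prediction over the vocabulary.
        return self.lm_head(h).argmax(dim=-1)
```

In a query-adaptive setup, the switch itself would not be hard-coded: for example, the language head could emit a special control token that flips `visual_mode`, letting the model choose text-only, vision-only, or interleaved reasoning per query. This is a sketch of the mechanism, not the paper's implementation.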
Key Features & Results
Query-Adaptive Multimodal Reasoning
SwimBird dynamically switches among text-only, vision-only, and interleaved vision-text modes
Different Reasoning-Mode Cases
Real-world examples demonstrating adaptive mode selection
Mode Distribution Across Benchmarks
Analysis of reasoning mode usage patterns in different tasks
Performance Results
Fine-grained Visual Understanding Benchmarks
General VQA & Multimodal Reasoning
Contact & Opportunities
If you have any questions about this project, please feel free to contact:
We're Hiring!
Accio Lab is actively seeking self-motivated researchers and research interns to join our team!
Research Areas
Multimodal Large Language Models, Large Language Models, Agentic AI, and Agents
What We Look For
Passion for AI research, strong coding skills, and independent thinking
What We Offer
Cutting-edge research, mentorship, and collaboration opportunities
What You'll Get
Cutting-edge technology, mature Agent products, and a flexible work environment
Interested? Send your CV and research interests to:
BibTeX
@article{YourPaperKey2024,
title={SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs},
author={Jintao Tong and Shilin Yan and Hongwei Xue and Xiaojun Tang and Kunyu Shi and Guannan Zhang and Ruixuan Li and Yixiong Zou},
journal={Conference/Journal Name},
year={2026},
url={https://your-domain.com/your-project-page}
}