SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

1 Huazhong University of Science and Technology
2 Accio Team, Alibaba Group

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs reason primarily through textual chain-of-thought (CoT), which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for a given user query. We introduce SwimBird, a reasoning-switchable MLLM that dynamically selects among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and we design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-intensive tasks.
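
To make the hybrid autoregressive formulation concrete, below is a minimal PyTorch sketch of one plausible decoding loop, written entirely under our own assumptions: a causal backbone that accepts precomputed input embeddings, a standard language-model head for next-token prediction, a separate linear head for next-embedding prediction, and a hypothetical pair of control tokens (<bov>/<eov>) that open and close a visual-thought segment. Under the same assumptions, training would sum a cross-entropy loss over text positions with a regression loss (e.g., MSE) over visual-thought positions, L = L_NTP + lambda * L_NEP; the control tokens, head names, and loss weighting are illustrative, not details from the paper.

import torch
import torch.nn as nn

class HybridARDecoder(nn.Module):
    """Illustrative hybrid autoregressive decoder (not the authors' code).

    Textual thoughts are produced by next-token prediction over a discrete
    vocabulary; visual thoughts are produced by next-embedding prediction in
    the continuous hidden space. Hypothetical <bov>/<eov> control tokens
    switch between the two regimes.
    """

    def __init__(self, backbone, vocab_size, hidden_dim, bov_id, eov_id):
        super().__init__()
        self.backbone = backbone  # any causal transformer mapping inputs_embeds -> last_hidden_state
        self.tok_emb = nn.Embedding(vocab_size, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)   # next-token prediction head
        self.emb_head = nn.Linear(hidden_dim, hidden_dim)  # next-embedding prediction head
        self.bov_id, self.eov_id = bov_id, eov_id

    @torch.no_grad()
    def generate(self, prefix_embeds, max_steps=256, eos_id=2):
        embeds = prefix_embeds  # (1, T, hidden_dim): embedded prompt plus image features
        mode, tokens, thoughts = "text", [], []
        for _ in range(max_steps):
            h = self.backbone(inputs_embeds=embeds).last_hidden_state[:, -1]  # (1, hidden_dim)
            tok = int(self.lm_head(h).argmax(-1))  # greedy for brevity; sampling also works
            if mode == "text" or tok == self.eov_id:
                # Discrete step: emit a token; <bov>/<eov> flip the reasoning mode.
                if tok == eos_id:
                    break
                mode = "visual" if tok == self.bov_id else "text"
                tokens.append(tok)
                nxt = self.tok_emb(torch.tensor([tok], device=embeds.device))
            else:
                # Continuous step: the hidden state is projected and fed back
                # directly as the next input embedding, never detokenized.
                nxt = self.emb_head(h)
                thoughts.append(nxt)
            embeds = torch.cat([embeds, nxt.unsqueeze(1)], dim=1)
        return tokens, thoughts

In this reading, text-only reasoning simply never emits <bov>, vision-only reasoning spends its budget inside a single <bov>...<eov> segment, and interleaved reasoning alternates between the two; which pattern fires would be learned from data such as SwimBird-SFT-92K rather than hard-coded.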

Figure: SwimBird method architecture.

Key Features & Results

Performance Results

Visual Understanding
Performance on fine-grained visual understanding benchmarks

Reasoning Tasks
Performance on general VQA and multimodal reasoning tasks

Contact & Opportunities

If you have any questions about this project, please feel free to contact:

We're Hiring!

Accio Lab is actively seeking self-motivated researchers and research interns to join our team!

Research Areas

Multimodal Large Language Models, Large Language Models, Agentic AI, and AI Agents

What We Look For

Passion for AI research, strong coding skills, and independent thinking

What We Offer

Cutting-edge research, mentorship, and collaboration opportunities

What You'll Get

Cutting-edge technology, mature Agent products, and a flexible work environment

Interested? Send your CV and research interests to:

BibTeX

@article{tong2026swimbird,
  title={SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs},
  author={Jintao Tong and Shilin Yan and Hongwei Xue and Xiaojun Tang and Kunyu Shi and Guannan Zhang and Ruixuan Li and Yixiong Zou},
  journal={Conference/Journal Name},
  year={2026},
  url={https://your-domain.com/your-project-page}
}