Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage outcome accuracy and format rewards to update the Qwen2.5-VL model, enabling further refinement of the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 different image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
The Chain-of-Focus (CoF) method enables VLMs to perform adaptive search and zooming in on key image regions, thereby creating a chain of focus steps for multimodal reasoning with gradually obtained visual cues.
In implementation, the visual tokens corresponding to key regions are appended to previously generated output tokens for subsequent outputs during a single generation round. This approach allows the VLMs to gather more visual cues, enabling them to analyze the image more thoroughly, accurately, and reliably than if they only relied on a static view of the image.
Note that our method does not perform visual search and zooming for every image; instead, it searches and zooms adaptively based on the obtained visual cues, reducing computational cost while maintaining performance.
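For concreteness, the sketch below shows one way such an interleaved generation loop could be implemented: the model generates until it emits a zoom-in token, the most recently referenced box is cropped and its visual tokens are appended to the context, and generation resumes with the new visual cues available. The `generate_until` and `append_image` helpers are assumed interfaces for illustration, not the released API.

```python
import re

from PIL import Image

# Special tokens as they appear in the CoF traces below; the regex parses
# <|box_start|>[x1, y1, x2, y2]<|box_end|> coordinates.
BOX_PATTERN = re.compile(r"<\|box_start\|>\[(\d+), (\d+), (\d+), (\d+)\]<\|box_end\|>")
ZOOM_TOKEN = "<|image_zoomin|>"


def generate_with_cof(model, image: Image.Image, prompt: str, max_zooms: int = 4) -> str:
    """Interleaved generation with adaptive zoom-in (sketch).

    `model.generate_until` (returns the generated text, including the stop
    token when it is hit) and `model.append_image` (encodes a crop and appends
    its visual tokens to the context) are assumed interfaces.
    """
    context = prompt
    for _ in range(max_zooms + 1):
        new_text = model.generate_until(context, stop=ZOOM_TOKEN)
        context += new_text
        if not new_text.endswith(ZOOM_TOKEN):
            break  # no further zooming requested; reasoning is complete
        boxes = BOX_PATTERN.findall(context)
        if not boxes:
            break  # zoom requested without a preceding box; stop gracefully
        x1, y1, x2, y2 = map(int, boxes[-1])
        crop = image.crop((x1, y1, x2, y2))
        # Append the visual tokens of the cropped region so that subsequent
        # output tokens can attend to the newly obtained visual cues.
        context = model.append_image(context, crop)
    return context
```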
CoF adopts a two-stage training pipeline.
In the SFT stage, we construct the MM-CoF dataset with 3K samples from the SAM dataset across diverse resolutions. For each image, we synthesize a task and use a visual agent with multiple tools to search and reason until task completion. The agent's reasoning steps are then summarized into a CoF process by an LLM. We fine-tune a Qwen2.5-VL-7B model on MM-CoF for cold start.
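The snippet below sketches what an MM-CoF sample might look like after the agent's tool calls are summarized into a single CoF trace; the `CoFSample` schema and its field names are illustrative assumptions rather than the released data format, and the values are taken from the jacket example shown later in this page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CoFSample:
    """One MM-CoF training sample (illustrative schema, not the released format).

    The visual agent's tool calls (grounding, cropping, zooming) are summarized
    by an LLM into a single <think> ... </think> trace interleaved with box and
    zoom-in tokens, followed by the final <answer> used as the SFT target.
    """
    image_path: str
    question: str
    options: List[str]
    zoom_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)
    cof_trace: str = ""  # reasoning with <|box_start|>/<|image_zoomin|> tokens
    answer: str = ""     # option letter


sample = CoFSample(
    image_path="path/to/sam_image.jpg",
    question="What is the color of the woman's jacket?",
    options=["(A) black", "(B) red", "(C) blue", "(D) orange"],
    zoom_boxes=[(570, 1726, 615, 1848)],
    cof_trace="<think> ... I zoom in on the woman ... the jacket is light blue. </think>",
    answer="C",
)
```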
In the RL stage, we leverage outcome accuracy and format rewards to update the VLMs, enabling further refinement of the model's search and reasoning strategy without human priors. We denote the obtained model as Qwen2.5-VL-7B-CoF.
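As a concrete illustration, the following sketch shows how outcome accuracy and format rewards could be computed for the multiple-choice outputs shown below; the regexes, the answer-letter range, and the reward weights are assumptions, not the exact rule-based rewards used in the paper.

```python
import re


def format_reward(output: str) -> float:
    """1.0 if the output follows the <think> ... </think> <answer> ... </answer> template."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, output.strip(), flags=re.DOTALL) else 0.0


def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted option letter matches the ground-truth answer."""
    match = re.search(r"<answer>\s*([A-D])\s*</answer>", output)
    return 1.0 if match is not None and match.group(1) == ground_truth else 0.0


def total_reward(output: str, ground_truth: str,
                 w_acc: float = 1.0, w_fmt: float = 0.5) -> float:
    # The weights are placeholders; the paper's exact reward coefficients may differ.
    return w_acc * accuracy_reward(output, ground_truth) + w_fmt * format_reward(output)
```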
Query: Is the color of the bus red or yellow?
(A) yellow
(B) red
Answer with the option's letter from the given choices directly.
Output:
<think> To determine the color of the bus, I need to closely observe the bus in the image. The bus is clearly visible, and its color can be seen distinctly as red. </think>
<answer> B </answer>
Query: What is the color of the woman's jacket?
(A) black
(B) red
(C) blue
(D) orange
Answer with the option's letter from the given choices directly.
Output:
<think> To determine the color of the woman's jacket, I need to closely examine the area where she is located. However, the color details are not immediately clear due to the distance and lighting conditions. To improve visibility, I proceed step by step. I first identify the bounding box corresponding to the woman: <|box_start|>[570, 1726, 615, 1848]<|box_end|>. I zoom in on this area to obtain a refined visual embedding <|image_zoomin|>, which reveals that the jacket is light blue. </think>
<answer> C </answer>
Query: What is the color of the lock?
(A) yellow
(B) silver
(C) golden
(D) red
Answer with the option's letter from the given choices directly.
Output:
<think> To determine the color of the lock, I need to closely examine the area where the lock is located. However, the lock is not immediately visible, making it difficult to identify its color. To improve visibility, I proceed step by step. I first attempt to locate the lock but find no clear indication due to its small size. I then focus on the red postbox, identifying two bounding boxes: <|box_start|>[585, 700, 1305, 1821]<|box_end|> and <|box_start|>[645, 920, 1275, 1821]<|box_end|>. I analyze these areas but still cannot locate the lock. Next, I shift my attention to the area above the postbox, locating two bounding boxes: <|box_start|>[0, 600, 1425, 1241]<|box_end|> and <|box_start|>[0, 600, 1425, 1301]<|box_end|>. I zoom in on <|box_start|>[0, 600, 1425, 1241]<|box_end|> to obtain a refined visual embedding <|image_zoomin|>, which reveals the presence of a mailbox. Upon analyzing this refined area, I confirm that the lock on the mailbox is silver. </think>
<answer> B </answer>
In this paper, we have presented the Chain-of-Focus (CoF) method to improve the multimodal reasoning ability of VLMs via adaptive visual search and zooming. Through adaptive image zooming, CoF enriches the reasoning process with additional visual cues. The proposed data collection pipeline efficiently gathers CoF data, which empowers the perception, grounding, and reasoning abilities of vision language models. The SFT-RL training pipeline gradually improves the generalization of VLMs.
@misc{zhang2025chain,
title={Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL},
author={Xintong Zhang and Zhi Gao and Bofei Zhang and Pengxiang Li and Xiaowen Zhang and Yang Liu and Tao Yuan and Yuwei Wu and Yunde Jia and Song-Chun Zhu and Qing Li},
year={2025},
eprint={2505.15436},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.15436},
}