Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

1KAIST, 2POSTECH
CVPR 2026

Abstract

Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent works show that while VLMs often manage to attend to the correct image region for a question, they do not necessarily produce the correct answer. In this work, we demonstrate that this misalignment can be attributed to suboptimal information flow within VLMs: text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on this observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should attend only to important visual tokens during decoding, eliminating interference from irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns across different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate it on various datasets covering visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of the baselines.

Key Idea

Test-time information flow modulation improves perception. (a) VLMs often manage to capture related object regions with text-to-image attention (e.g., the Mickey Mouse clock in the input image), yet sometimes fail to answer the question correctly. (b) We observe that visual tokens corresponding to object regions show distinct activation patterns in certain layers of the language model. In contrast, visual tokens in distracting regions exhibit irregular activation patterns across different layers. (c) Based on (b), we observe that disconnecting the interaction between text tokens and distracting visual tokens during inference leads to improved results, enabling the model to correctly select Mickey Mouse when answering the question. This information flow modulation is implemented by modulating the causal mask and is applied during attention computation.
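The mask editing in (c) can be sketched as follows. This is a minimal, hypothetical illustration (the function name, index conventions, and the single additive mask are our assumptions, not the paper's exact implementation): given the indices of distracting visual tokens, text query positions are simply blocked from attending to them, while the rest of the causal mask is unchanged.

```python
import numpy as np

def modulate_causal_mask(seq_len, text_start, distractor_idx):
    """Build an additive attention mask that is causal and, in addition,
    blocks text queries (positions >= text_start) from attending to
    distracting visual tokens. 0.0 = attend, -inf = blocked."""
    # Standard causal mask: -inf strictly above the diagonal.
    mask = np.triu(np.full((seq_len, seq_len), float("-inf")), k=1)
    # Disconnect text tokens from each distracting visual token.
    for j in distractor_idx:
        mask[text_start:, j] = float("-inf")
    return mask
```

For example, with 4 visual tokens followed by 2 text tokens, `modulate_causal_mask(6, text_start=4, distractor_idx=[1])` keeps visual-to-visual attention intact but prevents both text positions from seeing visual token 1. The mask is added to the attention logits before the softmax, so blocked positions receive zero attention weight.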

Method Overview

Given an input image and a user prompt, one-step language decoding is first applied to obtain (b) the token dynamics. The token dynamics-based entropy is then computed for each visual token, yielding a token entropy map. Subsequently, (c) adaptive token masking is applied to generate the causal mask. The generated causal mask is fed to the language model for reasoning, achieving test-time information flow modulation.
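The entropy-and-masking pipeline above can be sketched as follows, under stated assumptions: we take per-layer activation magnitudes for each visual token from the one-step decoding, treat each token's cross-layer profile as a distribution, and score it by entropy. Tokens with peaked profiles (distinct activation in certain layers) score low; irregular profiles score high. The mean-plus-std threshold is a hypothetical choice for the adaptive rule, not necessarily the paper's.

```python
import numpy as np

def token_entropy_map(acts):
    """acts: (num_layers, num_visual_tokens) activation magnitudes gathered
    during one-step decoding. Normalize each token's profile across layers
    and return its entropy: peaked profiles (important tokens) score low,
    flat/irregular profiles (distracting tokens) score high."""
    p = acts / acts.sum(axis=0, keepdims=True)   # per-token layer distribution
    return -(p * np.log(p + 1e-12)).sum(axis=0)  # entropy per visual token

def adaptive_token_mask(entropy, alpha=1.0):
    """Adaptively mask tokens whose entropy exceeds mean + alpha * std.
    Returns a boolean array: True = distracting token to be masked out."""
    threshold = entropy.mean() + alpha * entropy.std()
    return entropy > threshold
```

The resulting boolean mask would then drive the causal-mask generation in (c), disconnecting the masked visual tokens from the text tokens during attention computation.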

Qualitative Results

Visualizations of masked tokens and text-to-image attention after information flow modulation.

BibTeX

@inproceedings{liu2026aif,
  title={Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow},
  author={Liu, Chengxin and Choi, Wonseok and Zhang, Chenshuang and Oh, Tae-Hyun},
  booktitle={CVPR},
  year={2026}
}