Key Idea
Test-time information flow modulation improves perception. (a) VLMs often capture the relevant object regions through text-to-image attention (e.g., the Mickey Mouse clock in the input image), yet still fail to answer the question correctly. (b) We observe that visual tokens corresponding to object regions show distinct activation patterns in certain layers of the language model, whereas visual tokens in distracting regions exhibit irregular activation patterns across layers. (c) Based on (b), we find that disconnecting the interaction between text tokens and distracting visual tokens during inference leads to improved results, enabling the model to correctly select Mickey Mouse for question answering. This information flow modulation is implemented by modulating the causal mask and is applied during attention computation.
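The mask-modulation idea in (c) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token indices, mask convention (additive, 0 = attend, -inf = blocked), and single-head attention are all simplifying assumptions made here for clarity.

```python
import numpy as np

def modulate_causal_mask(causal_mask, text_indices, distract_indices):
    # Sever the attention edges from text tokens to distracting visual
    # tokens by setting those entries of the additive mask to -inf.
    mask = causal_mask.copy()
    mask[np.ix_(text_indices, distract_indices)] = -np.inf
    return mask

def attention(q, k, v, mask):
    # Standard single-head scaled dot-product attention with an
    # additive mask; blocked entries get zero attention weight.
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq, d = 6, 8
# Standard causal mask: position i may attend to positions <= i.
causal = np.triu(np.full((seq, seq), -np.inf), k=1)

# Hypothetical layout: tokens 0-3 are visual (token 2 lies in a
# distracting region), tokens 4-5 are text tokens of the question.
mod = modulate_causal_mask(causal, text_indices=[4, 5], distract_indices=[2])

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((seq, d)) for _ in range(3))
out = attention(q, k, v, mod)
```

With the modulated mask, the text tokens (rows 4 and 5) receive zero attention weight on the distracting visual token (column 2), while all causally allowed interactions among the remaining tokens are untouched.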