Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We consider two architectures: Chameleon [1], which supports multimodal outputs, and Pixtral [2], which outputs text only. Because it can only generate text, Pixtral fuses late-layer visual tokens into the textual domain and exhibits a distributed image-text communication pattern, in which internal image tokens communicate directly with the textual domain. In contrast, Chameleon encodes visual and textual tokens in well-separated regions, and knowledge transfer from image to text happens through a narrow gate: the end-of-image token [EOI]. We demonstrate that ablating this single token significantly deteriorates performance on image understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.
We highlight a fundamental difference in how multimodal-output and unimodal-output VLMs handle the interaction between visual and textual representations. In multimodal-output models like Chameleon, image and text representations remain largely separated throughout the network, as indicated by consistently low cosine similarity and modality-specific clustering. This separation raises the question of how and where the modalities communicate. In contrast, models like Pixtral exhibit progressively stronger fusion of image and text representations in later layers, as evidenced by rising cosine similarity and decreasing modality homogeneity. This indicates a more distributed communication pattern, in which visual tokens directly inform textual generation.
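To make the measurement behind these observations concrete, here is a minimal sketch of the per-layer cosine-similarity analysis. It assumes you already have the hidden states from a single forward pass (e.g. via output_hidden_states=True, with the batch dimension removed) and a boolean mask marking image-token positions; the variable names are illustrative, not taken from the paper's codebase.

```python
# Sketch: per-layer cosine similarity between image- and text-token hidden states.
# Assumes `hidden_states` is a sequence of per-layer tensors of shape
# (seq_len, d_model) and `is_image` is a boolean mask over the sequence.
import torch
import torch.nn.functional as F

def cross_modal_cosine(hidden_states, is_image):
    """Cosine similarity between the centroids of image and text tokens,
    computed layer by layer."""
    sims = []
    for h in hidden_states:                     # h: (seq_len, d_model)
        img = h[is_image].mean(dim=0)           # image-token centroid
        txt = h[~is_image].mean(dim=0)          # text-token centroid
        sims.append(F.cosine_similarity(img, txt, dim=0).item())
    return sims
```

Consistently low values across layers indicate separated modalities (as in Chameleon), while values that rise with depth indicate progressive fusion (as in Pixtral).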
An analysis of total attention contributions reveals that a few localized tokens, such as the [EOI] and last image tokens in Chameleon-7B, or the [EOL], 32nd, and 1025th image tokens in Pixtral-12B, are strongly attended to by text tokens and significantly shape image-to-text communication.
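The following is a hedged sketch of how such attention contributions can be tallied, assuming per-layer attention tensors (e.g. from output_attentions=True) and boolean masks over the sequence; the shapes and names are assumptions, not the paper's exact protocol.

```python
# Sketch: how much attention text tokens pay to each image-position token.
# Assumes `attentions` is a sequence of per-layer tensors of shape
# (num_heads, seq_len, seq_len), and `is_text` / `is_image` are boolean masks.
import torch

def text_to_image_attention(attentions, is_text, is_image):
    """Returns a (num_layers, num_image_tokens) tensor: the attention mass each
    image-position token receives from text queries, averaged over heads."""
    per_layer = []
    for attn in attentions:                        # (heads, queries, keys)
        a = attn.mean(dim=0)                       # average over heads
        a = a[is_text][:, is_image]                # text queries -> image keys
        per_layer.append(a.sum(dim=0))             # total mass per image token
    return torch.stack(per_layer)

# The positions with the largest column sums are the candidate "narrow gates"
# (e.g. the [EOI] token in Chameleon).
```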
The results above show that in the Chameleon models the [EOI] token is responsible for a large portion of cross-modal attention and carries the most semantic information about the image, making it a suitable candidate for a narrow gate of communication. Conversely, in Pixtral the semantic content is spread across the whole image representation, suggesting that the model uses a much wider communication gate distributed across large portions of the image.
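As a rough illustration of how the semantic content of different representations can be compared, the linear-probe sketch below (using scikit-learn) contrasts the [EOI] hidden state with averaged image-token features. This is an assumed setup for illustration only, not necessarily the probing method used in the paper; `eoi_feats`, `img_feats`, and `labels` are placeholders.

```python
# Sketch: a simple linear probe comparing how much image-class information
# lives in the [EOI] representation versus the averaged image tokens.
# Assumes `eoi_feats` and `img_feats` are arrays of shape (n_samples, d_model)
# and `labels` holds the image class of each sample.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(features, labels):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5).mean()

# acc_eoi = probe_accuracy(eoi_feats, labels)   # semantics carried by [EOI]
# acc_img = probe_accuracy(img_feats, labels)   # semantics spread over the image
```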
These results show that visual information is not transmitted directly between the residual representations of image tokens and those of text tokens; instead, it flows through [EOI].
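One way to probe this claim is an attention-knockout intervention. The sketch below only constructs an additive mask that blocks chosen text-to-image attention edges (e.g. everything except [EOI], or only [EOI]); how the mask is injected into a specific model's attention layers is an implementation detail that varies by codebase and is not shown here.

```python
# Sketch: an attention-ablation mask that blocks text queries from attending
# to a chosen set of key positions. Many attention implementations accept an
# additive mask where blocked entries are -inf before the softmax.
import torch

def build_ablation_mask(seq_len, text_positions, blocked_keys):
    """Additive (seq_len, seq_len) mask: 0 where attention is allowed,
    -inf where text queries may not attend."""
    mask = torch.zeros(seq_len, seq_len)
    for q in text_positions:
        mask[q, blocked_keys] = float("-inf")
    return mask
```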
Overall, the results of our ablation experiments confirm that cross-modal communication in the Chameleon models flows mainly through a single token, the special [EOI] token. In contrast, in Pixtral-12B this communication happens in a distributed fashion through the image representation itself, and therefore cannot be disrupted by a local intervention.
The [EOI] token plays a central role in how Chameleon models connect image and text information. Editing [EOI] in the middle layers effectively transfers semantics between classes, with the strongest impact occurring when global image information is encoded. These effects persist up to the later layers, by which point the semantics have already been communicated to the text, highlighting the token's critical role in cross-modal understanding.
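A minimal sketch of such an edit, implemented here as activation patching of the [EOI] position with a PyTorch forward hook; `layer_module`, `eoi_pos`, and `source_eoi_hidden` are placeholders for your own setup, not identifiers from the paper.

```python
# Sketch: steering image semantics by patching the [EOI] residual stream at a
# middle layer with a hidden state cached from a different (source) image.
import torch

def make_eoi_patch_hook(eoi_pos, source_eoi_hidden):
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the
        # hidden-state tensor of shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, eoi_pos, :] = source_eoi_hidden   # overwrite [EOI] state
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# handle = layer_module.register_forward_hook(make_eoi_patch_hook(eoi_pos, src_h))
# ... run the model on the target prompt, then handle.remove() ...
```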
@misc{serra2024narrowgatelocalizedimagetext,
      title={The Narrow Gate: Localized Image-Text Communication in Vision-Language Models},
      author={Alessandro Serra and Francesco Ortu and Emanuele Panizon and Lucrezia Valeriani and Lorenzo Basile and Alessio Ansuini and Diego Doimo and Alberto Cazzaniga},
      year={2024},
      eprint={2412.06646},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.06646},
}