The Narrow Gate

Localized Image-Text Communication in Vision-Language Models

1AREA Science Park, 2SISSA, 3University of Trieste
Trieste, Italy
*Equal contribution. ^Equal supervision.

Abstract

Recent advances in multimodal training have significantly improved the integration of image understanding and generation within a unified model. This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare two architectures: Chameleon [1], which supports multimodal outputs, and Pixtral [2], which outputs text only. Pixtral, which can only output text, fuses late-layer visual tokens into the textual domain and exhibits a distributed image-text communication pattern in which internal image tokens communicate directly with the textual domain. In contrast, Chameleon encodes visual and textual tokens in well-separated regions, and knowledge transfer from image to text happens through a narrow gate: the end-of-image token [EOI]. We demonstrate that ablating this single token significantly deteriorates performance on image-understanding tasks. Furthermore, modifying this token enables effective steering of the image semantics, showing that targeted, local interventions can reliably control the model's global behavior.

Modality Gap in VLMs



Left: Cosine similarity between text and image token embeddings as a function of model depth reflects the orthogonality of modalities in Chameleon models. Right: Homogeneity score of token clusters generated via Advanced Density Peaks [3] with respect to their original modality.

We highlight a fundamental difference in how multimodal-output and unimodal-output VLMs handle the interaction between visual and textual representations. In multimodal-output models like Chameleon, image and text representations remain largely separated throughout the network, as indicated by consistently low cosine similarity and modality-specific clustering. This separation raises the question of how and where the modalities communicate. In contrast, models like Pixtral exhibit progressively stronger fusion of image and text representations in later layers, as evidenced by rising cosine similarity and decreasing modality homogeneity. This indicates a more distributed communication pattern, in which visual tokens directly inform textual generation.
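The cosine-similarity curve above can be reproduced from cached residual-stream activations. The sketch below assumes the activations have already been extracted and grouped by modality; averaging each modality's tokens before taking the cosine is one simple reading of the metric, not necessarily the exact aggregation used in the paper.

```python
import numpy as np

def modality_cosine_similarity(image_acts, text_acts):
    """Per-layer cosine similarity between the mean image-token and
    mean text-token hidden states.

    image_acts, text_acts: arrays of shape (n_layers, n_tokens, d_model)
    holding residual-stream activations for each modality.
    Returns an array of shape (n_layers,).
    """
    sims = []
    for img_l, txt_l in zip(image_acts, text_acts):
        img_mean = img_l.mean(axis=0)          # average image token at this layer
        txt_mean = txt_l.mean(axis=0)          # average text token at this layer
        cos = img_mean @ txt_mean / (
            np.linalg.norm(img_mean) * np.linalg.norm(txt_mean)
        )
        sims.append(cos)
    return np.array(sims)
```

A flat, near-zero curve across layers (as in Chameleon) indicates the two modalities occupy nearly orthogonal subspaces; a curve rising toward 1 (as in Pixtral's late layers) indicates progressive fusion.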

Analysis of Cross-Modal Attention

Cross-Modal Attention Contributions of Tokens

We construct a metric that quantifies the average attention that all text tokens give to a token at position \(j\) within the image part of the prompt, which we assume spans the first \( N_{[\texttt{EOI}]} \) tokens. We define the (relative) cross-modal attention \(f^{l}_j\) as: \[ f^{l}_{j} = \frac{1}{C}\frac{1}{|\mathcal{H}|}\sum_{h \in \mathcal{H}} \sum_{i>N_{[\texttt{EOI}]}} A_{(i,j)}^{l,h} \]
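Given an attention tensor for one layer, the metric can be computed as below. The constant \(C\) is not spelled out on this page; the sketch reads it as the normalizer that makes \(f^l\) sum to 1 over image positions, which is an assumption.

```python
import numpy as np

def cross_modal_attention(A, n_img):
    """Relative cross-modal attention f_j for one layer.

    A: attention tensor of shape (n_heads, seq_len, seq_len), where
       A[h, i, j] is the attention query token i pays to key token j.
    n_img: number of image tokens (positions 0 .. n_img-1, up to [EOI]).

    Averages over heads, sums over text query positions i >= n_img,
    then normalises so f sums to 1 over image positions (our reading
    of the constant C in the formula above).
    """
    # attention from every text query to each image key, averaged over heads
    f = A[:, n_img:, :n_img].mean(axis=0).sum(axis=0)
    return f / f.sum()
```

Plotting `f` per image position across layers reproduces the spiky profiles above: in Chameleon the mass concentrates on [EOI], while Pixtral shows several attended positions such as [EOL].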


Contribution of different image token positions to the total text-on-image attention across layers in Chameleon-7B (left) and Pixtral-12B (right), computed on ImageNet data.

The analysis of total attention contributions reveals that a few localized tokens, such as the [EOI] and last image tokens in Chameleon-7B, or the [EOL], 32nd and 1025th image tokens in Pixtral-12B, are strongly attended by text tokens, significantly shaping image-to-text communication.


Localization of Visual Semantic Information

To investigate whether these tokens contain semantic visual information, we compute their neighborhood overlap with ImageNet class labels, a metric defined in [4], as: \[ \chi^{l,gt}= \frac{1}{n} \sum_i \frac{1}{k}\sum_{j\in \mathcal{N}^{l}(i)} A^{gt}_{ij}\]
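In this formula \(\mathcal{N}^{l}(i)\) is the set of \(k\) nearest neighbours of point \(i\) in representation space at layer \(l\), and \(A^{gt}_{ij}=1\) when \(i\) and \(j\) share a ground-truth label. A brute-force sketch (exact neighbour search; a real run over many tokens would use an approximate index):

```python
import numpy as np

def neighborhood_overlap(X, labels, k=30):
    """Average fraction of each point's k nearest neighbours (in the
    representation space X) that share its ground-truth label.

    X: (n, d) array of token representations at a given layer.
    labels: (n,) array of class labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest neighbours
    same = labels[nn] == labels[:, None]   # A^gt restricted to each neighbourhood
    return same.mean()
```

An overlap near 1 means the chosen token's representation clusters images by class, i.e. it carries the image's semantic content.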


Neighborhood overlap between selected image tokens and ImageNet labels for Chameleon-7B (left) and Pixtral-12B (right).

The results above show that in the Chameleon models the [EOI] token is responsible for a large portion of cross-modal attention and contains the highest semantic information regarding the image. This makes it a suitable candidate for being a narrow gate of communication. Conversely, in Pixtral the semantic content is spread across the whole image representation, suggesting that the model uses a much wider gate of communication distributed over large portions of the image.

Ablation Experiments

Effect of Progressive Attention Knockout

We now investigate the impact of special tokens on the flow of information through the network by progressively performing attention knockout [5], ablating communication from layer \(l\) to the end of the network. For each window of ablated layers, we compute \(\chi^{out,gt}\) to evaluate the effect of the ablation on the final representation.
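Mechanically, attention knockout [5] amounts to masking specific query-key edges before the softmax in every layer from \(l\) onward. A minimal sketch of the mask construction, with hypothetical position arguments:

```python
import numpy as np

def knockout_mask(seq_len, src_positions, tgt_positions):
    """Boolean mask marking attention edges to knock out: True where a
    target (query) position would attend to a source (key) position.

    In the experiments above, src_positions would be the [EOI] position
    (or all image positions) and tgt_positions the text positions.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[np.ix_(tgt_positions, src_positions)] = True
    return mask

def apply_knockout(scores, mask):
    """Set knocked-out pre-softmax attention scores to -inf so the
    masked edges receive exactly zero weight after the softmax."""
    out = scores.copy()
    out[..., mask] = -np.inf
    return out
```

Applying this mask in every layer of a window \([l, L]\) and recomputing \(\chi^{out,gt}\) yields the curves in the figure below.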


Neighborhood overlap between model final-layer representations at the last text token positions and ImageNet classes when applying attention knockout. Communication from the [EOI] token position (green) to the text or from all image token positions (magenta) to the text is ablated across an increasing number of layers, starting from the last.

These results show that visual information is not directly transmitted between the residual representations of image tokens and those of text tokens; instead, it flows through [EOI].


Effect of Attention Knockout on Image Understanding Tasks

We evaluate how applying attention knockout to block communication between specific token positions and the text affects the general performance of the models on more complex visual understanding tasks.


Performance of Chameleon (7B and 34B) and Pixtral models on visual question answering (VQAv2), image captioning (Flickr-30k and MS-COCO), and image classification (ImageNet) under different attention ablation settings.

Overall, the results of our ablation experiments confirm that the cross-modal communication in the Chameleon models flows mainly through a single token, the special [EOI] token. On the contrary, in Pixtral-12B such communication happens in a distributed fashion through the image itself, meaning that it cannot be disrupted with a local intervention.

Steering Image Understanding Through Activation Patching

The localized nature of cross-modal communication in Chameleon-7B and Chameleon-34B allows targeted editing of image semantics. We show this by performing a patching experiment on the [EOI] token, replacing the [EOI] activation of a base ImageNet class with that of a target class and measuring the similarity between the output probability distributions for the base and target classes.


Similarity between the probability distributions over the vocabulary for the target input and the base input across different layers. Left: Impact of patching the residual stream at each layer. Right: Impact of cumulative patching of the sole input of attention blocks, starting from the point indicated on the x-axis through the end of the model.

The [EOI] token plays a central role in how Chameleon models connect image and text information. Editing [EOI] in the middle layers effectively transfers semantics between classes, with the strongest impact occurring when global image information is encoded. These effects persist until later layers, where the semantics have already been communicated to the text, highlighting the token’s critical role in cross-modal understanding.
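The core patching operation can be sketched as a swap of one position in the residual stream. The function below is an illustrative abstraction: in practice the patched activations are cached from a forward pass on the target image and injected at the chosen layer via forward hooks, and the patched stream is then propagated through the remaining layers.

```python
import numpy as np

def patch_eoi(residual_base, residual_target, eoi_pos):
    """Replace the residual-stream activation at the [EOI] position of
    the base run with the one cached from the target run.

    residual_base, residual_target: (seq_len, d_model) activations at a
    given layer for the base and target inputs; eoi_pos: index of [EOI].
    Returns a patched copy, leaving the base activations untouched.
    """
    patched = residual_base.copy()
    patched[eoi_pos] = residual_target[eoi_pos]
    return patched
```

Because only one position is modified, the intervention is strictly local; the figure above shows that this single-token edit is nonetheless enough to steer the model's output distribution toward the target class when applied in the middle layers.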

BibTeX

@misc{serra2024narrowgatelocalizedimagetext,
  title={The Narrow Gate: Localized Image-Text Communication in Vision-Language Models},
  author={Alessandro Serra and Francesco Ortu and Emanuele Panizon and Lucrezia Valeriani and Lorenzo Basile and Alessio Ansuini and Diego Doimo and Alberto Cazzaniga},
  year={2024},
  eprint={2412.06646},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.06646},
}

References

[1] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
[2] Mistral AI. Pixtral 12B. arXiv preprint arXiv:2410.07073, 2024.
[3] Maria d’Errico, Elena Facco, Alessandro Laio, and Alex Rodriguez. Automatic topography of high-dimensional data sets by non-parametric density peak clustering. Information Sciences, 2021.
[4] Diego Doimo, Aldo Glielmo, Alessio Ansuini, and Alessandro Laio. Hierarchical nucleation in deep neural networks. NeurIPS, 2020.
[5] Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. EMNLP, 2023.