Definition
Cross-attention fusion is a way of feeding one modality (typically vision) into a language model by inserting attention layers whose keys and values come from the visual features while the queries come from the text stream. Rather than turning an image into tokens and placing them in the input sequence, the model attends to visual representations at chosen depths inside the transformer. DeepMind's Flamingo popularised this design for vision-language models.
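The definition above can be sketched as a single-head cross-attention layer in NumPy. This is a minimal illustration rather than any particular model's implementation: the function name `cross_attention`, the weight names, and all dimensions are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, visual, w_q, w_k, w_v):
    """Single-head cross-attention: queries from text, keys/values from vision."""
    q = text @ w_q                            # (T, d) queries from the text stream
    k = visual @ w_k                          # (V, d) keys from visual features
    v = visual @ w_v                          # (V, d) values from visual features
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, V) scaled dot products
    weights = softmax(scores, axis=-1)        # each text position attends over visual features
    return weights @ v, weights

# Hypothetical sizes: 5 text positions, 9 visual features.
rng = np.random.default_rng(0)
T, V, d_text, d_vis, d = 5, 9, 16, 32, 8
text = rng.normal(size=(T, d_text))
visual = rng.normal(size=(V, d_vis))
w_q = rng.normal(size=(d_text, d))
w_k = rng.normal(size=(d_vis, d))
w_v = rng.normal(size=(d_vis, d))

out, weights = cross_attention(text, visual, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The output has one row per text position, so the text sequence length is unchanged: the visual features influence the text stream without being appended to it.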
Gated cross-attention
Flamingo keeps its pretrained vision encoder and language model frozen and inserts new gated cross-attention dense blocks between the layers of the frozen language model. The output of each new block is scaled by the tanh of a learnable scalar initialised to zero, so the block contributes nothing at the start of training and the model can gradually learn to use visual context without catastrophically disturbing the language model's original behaviour. Only the inserted blocks and a perceiver resampler, which compresses the vision encoder's output into a fixed number of visual tokens, are trained. This is cheaper than full fine-tuning and preserves the base model's text skills.
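The gating idea can be sketched as follows. This is a simplified single-head block in NumPy that loosely follows the structure described above; it is not Flamingo's actual code. The function name `gated_xattn_block`, the parameter dictionary, the ReLU feed-forward layer, and the omission of layer normalisation and multiple heads are all assumptions made to keep the example short.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_xattn_block(x, visual, p, alpha_attn, alpha_ffw):
    """Gated cross-attention followed by a gated feed-forward layer.

    x: (T, d_model) text hidden states; visual: (V, d_vis) visual features.
    alpha_attn, alpha_ffw: learnable scalar gates (hypothetical names).
    """
    q, k, v = x @ p["w_q"], visual @ p["w_k"], visual @ p["w_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ p["w_o"]
    # tanh(0) == 0, so a zero-initialised gate leaves the residual stream untouched.
    x = x + np.tanh(alpha_attn) * attn
    hidden = np.maximum(x @ p["w_1"], 0.0)    # ReLU feed-forward (an assumption)
    x = x + np.tanh(alpha_ffw) * (hidden @ p["w_2"])
    return x

# Hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
T, V, d_model, d_vis, d = 4, 6, 16, 32, 8
p = {
    "w_q": rng.normal(size=(d_model, d)),
    "w_k": rng.normal(size=(d_vis, d)),
    "w_v": rng.normal(size=(d_vis, d)),
    "w_o": rng.normal(size=(d, d_model)),
    "w_1": rng.normal(size=(d_model, 4 * d_model)),
    "w_2": rng.normal(size=(4 * d_model, d_model)),
}
x = rng.normal(size=(T, d_model))
visual = rng.normal(size=(V, d_vis))

at_init = gated_xattn_block(x, visual, p, alpha_attn=0.0, alpha_ffw=0.0)
after_training = gated_xattn_block(x, visual, p, alpha_attn=0.5, alpha_ffw=0.5)
print(np.array_equal(at_init, x), np.allclose(after_training, x))
```

With both gates at zero the block is an exact identity on the text hidden states, which is why inserting it does not change the frozen language model's behaviour at the start of training. Once the gates move away from zero, visual information starts to flow into the residual stream.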
Trade-offs versus token concatenation
Cross-attention fusion keeps visual data out of the main token budget, so high-resolution images do not crowd out the text context window. The cost is a more invasive architecture change: attention layers must be added throughout the network rather than relying on a single front-end connector. Many lighter systems instead concatenate projected visual tokens directly into the sequence.
Cross-attention fusion is one design a modality projector can take. On the spectrum from early to late fusion it sits in between, mixing modalities deep inside the network rather than only at the input or only at the output.
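The token-budget trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below counts sequence length and attention-score entries per layer for the two approaches. The helper names and the example sizes (512 text tokens, 576 visual tokens) are hypothetical, and the counts ignore causal masking, multiple heads, and the fact that cross-attention is usually added at only some layers.

```python
def concat_cost(text_len, visual_len):
    """Token concatenation: visual tokens share the sequence with text."""
    seq = text_len + visual_len
    return {"sequence_len": seq, "self_attn_scores": seq * seq}

def cross_attn_cost(text_len, visual_len):
    """Cross-attention fusion: the sequence holds text only."""
    return {
        "sequence_len": text_len,
        "self_attn_scores": text_len * text_len,
        # Extra cost, paid only in layers that have a cross-attention block.
        "cross_attn_scores": text_len * visual_len,
    }

# Hypothetical example: 512 text tokens and 576 visual tokens.
concat = concat_cost(512, 576)
cross = cross_attn_cost(512, 576)
print(concat)
print(cross)
```

Under these assumptions, concatenation more than doubles the sequence the language model must process, while cross-attention fusion leaves the sequence at its text length and pays a separate cost proportional to the product of text and visual lengths in each inserted layer.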
