
Cross-Attention Fusion


Definition

Cross-attention fusion is a way of feeding one modality (typically vision) into a language model by inserting attention layers whose keys and values come from the visual features while the queries come from the text stream. Rather than turning an image into tokens and placing them in the input sequence, the model attends to visual representations at chosen depths inside the transformer. DeepMind's Flamingo popularised this design for vision-language models.
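The query/key/value split described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not Flamingo's implementation: it uses a single head, omits the learned projection matrices a real layer would have, and the function name `cross_attention` and the toy vectors are assumptions made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_queries, visual_keys, visual_values):
    """Single-head cross-attention (illustrative sketch).

    Queries come from the text stream; keys and values come from the
    visual features. Learned projections are omitted for brevity.
    """
    scale = 1.0 / math.sqrt(len(text_queries[0]))
    value_dim = len(visual_values[0])
    outputs = []
    for q in text_queries:
        # Score this text position against every visual feature.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in visual_keys]
        weights = softmax(scores)
        # Weighted sum of visual values: one output row per text position.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, visual_values)) for j in range(value_dim)]
        )
    return outputs


# Toy data (hypothetical): 3 text positions, 2 visual features, dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vis_k = [[1.0, 0.0], [0.0, 1.0]]
vis_v = [[10.0, 0.0], [0.0, 10.0]]
out = cross_attention(text, vis_k, vis_v)
```

The output has one row per text position, however many visual features there are, which is why the visual input never lengthens the token sequence.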

Gated cross-attention

Flamingo keeps its pretrained vision encoder and language model frozen and inserts new gated cross-attention dense blocks between the frozen language-model layers. Each block's output is scaled by a tanh gate whose learnable scalar starts at zero, so the block contributes nothing at the start of training (tanh(0) = 0). This lets the model learn to use visual context gradually without catastrophically disturbing the language model's original behaviour. Only the inserted blocks and a perceiver resampler are trained, which is efficient and preserves the base model's text skills.
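The effect of the gate can be shown with a small sketch. It assumes the gate is applied as a residual update, hidden + tanh(alpha) * update, with alpha a learnable scalar initialised to zero. The function name `gated_residual` and the numbers are hypothetical, and `visual_update` stands in for whatever the cross-attention layer would produce.

```python
import math


def gated_residual(text_hidden, visual_update, alpha):
    """Add a visual update to the text hidden states through a tanh gate.

    alpha is the (normally learnable) gate parameter. With alpha = 0 the
    gate is tanh(0) = 0, so the text hidden states pass through unchanged.
    """
    gate = math.tanh(alpha)
    return [
        [h + gate * u for h, u in zip(h_row, u_row)]
        for h_row, u_row in zip(text_hidden, visual_update)
    ]


# Hypothetical values: two text positions with two features each.
hidden = [[0.5, -1.0], [2.0, 0.25]]
update = [[3.0, 3.0], [-4.0, 1.0]]

at_init = gated_residual(hidden, update, alpha=0.0)         # equals hidden
after_training = gated_residual(hidden, update, alpha=0.5)  # visual signal mixed in
```

At initialisation the inserted block is an identity mapping, so the frozen language model behaves exactly as it did before the new layers were added.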

Trade-offs versus token concatenation

Cross-attention fusion keeps visual data out of the main token budget, so high-resolution images do not crowd out the text context window. The cost is a more invasive architecture change: attention layers must be added throughout the network rather than relying on a single front-end connector. Many lighter systems instead concatenate projected visual tokens directly into the sequence.
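The token-budget point is simple arithmetic, sketched below. The context window size and the tokens-per-image figure are hypothetical numbers chosen for illustration, not measurements of any particular model, and `text_budget` is a helper invented for this example.

```python
def text_budget(context_window, images, tokens_per_image, fusion):
    """Return how many sequence positions remain for text.

    fusion="concat": projected visual tokens occupy sequence positions.
    fusion="cross_attention": visual features are read through inserted
    attention layers and occupy no sequence positions.
    """
    if fusion == "concat":
        used = images * tokens_per_image
    elif fusion == "cross_attention":
        used = 0
    else:
        raise ValueError(f"unknown fusion strategy: {fusion}")
    return max(context_window - used, 0)


# Hypothetical setup: 4096-token window, 4 images, 576 visual tokens each.
concat_left = text_budget(4096, 4, 576, "concat")           # 4096 - 2304 = 1792
cross_left = text_budget(4096, 4, 576, "cross_attention")   # 4096
```

Under these assumed numbers, concatenation spends more than half the window on images, while cross-attention leaves the whole window for text at the price of the extra layers.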

Cross-attention fusion is one of the designs a modality projector can take. On the spectrum from early to late fusion it sits in between, because the modalities are mixed deep inside the network rather than only at the input or only at the output.

