Cross-Attention Fusion


Cross-attention fusion is a way of feeding one modality (typically vision) into a language model by inserting attention layers whose keys and values come from the visual features while the queries come from the text stream. Rather than turning an image into tokens and placing them in the input sequence, the model attends to visual representations at chosen depths inside the transformer. DeepMind's Flamingo popularised this design for vision-language models, and it remains a common reference point when multimodal architectures are compared.

How the mechanism works

Ordinary self-attention lets text tokens look at other text tokens. A cross-attention layer redirects that machinery across modalities: the text stream asks the questions (queries), and the image features supply the answers (keys and values). Because the visual side enters only through keys and values, the language model's sequence length is untouched: the image is a resource the text consults, not a passage it must read. Inserted at multiple depths, these layers let the text stream consult the visual features repeatedly as its own representation grows more abstract, a flexibility single-entry-point designs lack.
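The mechanism can be illustrated with a minimal single-head sketch in plain Python. It is not any particular model's implementation: the function name `cross_attention` is hypothetical, and the learned query, key, and value projections that real layers apply are replaced by the identity for brevity. The point it shows is that the output has one vector per text position, however many visual features are supplied.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_states, image_feats):
    """Single-head cross-attention (hypothetical helper).

    Queries come from text_states; keys and values both come from
    image_feats. Assumption: identity projections instead of learned
    W_q, W_k, W_v matrices.
    """
    d = len(text_states[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in text_states:
        # One score per visual feature for this text position.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in image_feats]
        weights = softmax(scores)
        # Weighted average of the visual value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, image_feats))
                    for j in range(d)])
    return out

text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 text positions
few = [[1.0, 0.0], [0.0, 1.0]]                # 2 visual features
many = few * 50                               # 100 visual features

out_few = cross_attention(text, few)
out_many = cross_attention(text, many)
print(len(text), len(out_few), len(out_many))  # prints: 3 3 3
```

Going from 2 to 100 visual features changes the cost of each attention call but not the length of the sequence the language model carries forward.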

Gated cross-attention in Flamingo

Flamingo keeps its pretrained vision encoder and language model frozen and inserts new gated cross-attention dense blocks between the existing self-attention layers. A tanh gate, initialised at zero so the new blocks contribute nothing at the start of training, lets the model gradually learn to use visual context without catastrophically disturbing the language model's original behaviour. At initialisation, the network behaves exactly like the untouched language model. Only the inserted layers and a perceiver resampler, which condenses a variable number of visual features into a small fixed set of latent vectors, are trained. That efficiency is the design's quiet genius: the expensive language model is reused as-is, and its hard-won text ability is preserved by construction.
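The zero-initialised gate is easy to check numerically. The sketch below models only the gated residual connection, assuming the cross-attention output has already been computed; `gated_residual` and `alpha` are illustrative names, not identifiers from Flamingo's code.

```python
import math

def gated_residual(text_state, xattn_out, alpha):
    """Add the cross-attention output to the text state through a tanh gate.

    alpha stands in for the learnable scalar; tanh(0) == 0, so a gate
    initialised at zero lets nothing through.
    """
    gate = math.tanh(alpha)
    return [x + gate * y for x, y in zip(text_state, xattn_out)]

x = [0.5, -1.25, 2.0]   # hidden state from the frozen language model
y = [3.0, 3.0, 3.0]     # stand-in for a cross-attention output

at_init = gated_residual(x, y, alpha=0.0)
after_training = gated_residual(x, y, alpha=0.5)

print(at_init == x)          # prints: True
print(after_training == x)   # prints: False
```

With `alpha` at zero the block is an exact identity on the text state, which is why the frozen language model's behaviour is preserved at the start of training. As `alpha` moves away from zero, visual information is mixed in proportionally to `tanh(alpha)`.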

Trade-offs versus token concatenation

Cross-attention fusion keeps visual data out of the main token budget, so high-resolution images or long videos do not crowd out the text context window, and per-layer gating gives fine control over how much vision influences generation. The costs are architectural and practical: attention blocks must be added throughout the network rather than bolted on at the front, adding parameters and complicating the serving stack, and the approach diverges from the plain decoder-only recipe most inference tooling optimizes for. Many lighter systems, most open-weights vision-language models among them, instead concatenate projected visual tokens directly into the sequence: simpler, friendlier to standard runtimes, but token-hungry at high resolution. For the self-hoster the practical consequence is that fusion style shapes memory: token concatenation inflates the sequence (and KV cache), while cross-attention shifts cost into extra layers instead.
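A back-of-the-envelope calculation makes the memory contrast concrete. Every number below is a hypothetical configuration chosen for illustration, not a measurement of any real model, and the sketch counts only key/value cache, not the weights of the added cross-attention layers.

```python
def kv_cache_bytes(n_layers, n_positions, n_kv_heads, head_dim,
                   bytes_per_value=2):
    # Factor 2: one key tensor plus one value tensor per layer.
    # bytes_per_value=2 assumes 16-bit storage.
    return 2 * n_layers * n_positions * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model and workload (all values are assumptions).
N_LAYERS = 32         # decoder layers in the language model
N_KV_HEADS = 8
HEAD_DIM = 128
TEXT_TOKENS = 1024
VISUAL_TOKENS = 2048  # projected image tokens under concatenation
XATTN_LAYERS = 8      # inserted cross-attention layers
LATENTS = 64          # visual latents each cross-attention layer sees

# Token concatenation: visual tokens occupy every layer's self-attention cache.
concat = kv_cache_bytes(N_LAYERS, TEXT_TOKENS + VISUAL_TOKENS,
                        N_KV_HEADS, HEAD_DIM)

# Cross-attention: the self-attention cache holds text only; visual keys and
# values are stored once per inserted cross-attention layer.
cross = (kv_cache_bytes(N_LAYERS, TEXT_TOKENS, N_KV_HEADS, HEAD_DIM)
         + kv_cache_bytes(XATTN_LAYERS, LATENTS, N_KV_HEADS, HEAD_DIM))

MIB = 1024 * 1024
print(concat // MIB, cross // MIB)  # prints: 384 130
```

Under these assumed numbers the concatenation cache is roughly three times larger, but the comparison omits the parameter memory of the inserted layers, which is where cross-attention fusion pays its share.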

Where it fits in the design space

Cross-attention fusion is one option a modality projector design can take, and it sits deliberately between the extremes of early vs late fusion: modalities stay separate through their encoders but mix repeatedly deep inside the network. When you evaluate a local multimodal model, identifying which fusion strategy it uses tells you much about its context economics, its VRAM appetite, and its likely failure modes before you run a single benchmark.

The pattern also generalizes beyond images. Audio encoders, video streams, even structured sensor data can feed a language model through the same gated cross-attention template, which is why the technique keeps reappearing under new names as multimodal systems broaden. The engineering lesson travels with it: when extending a system that already works, add capability through gated side-channels that start at zero and prove their value gradually, rather than rebuilding the core. That principle is as sound for firmware on a mining fleet as it is for a frozen language model.
