Modality Projector (Connector)

A modality projector, also called a connector, is the bridge component in a multimodal large language model (MLLM) that translates features from a vision encoder into the embedding space the language model already understands. An image encoder and a text model are trained separately and speak different representational languages: the encoder emits grids of visual feature vectors, while the transformer underneath the LLM consumes token embeddings of a specific dimension and distribution. The projector closes that gap by mapping visual features into vectors the language model can treat as if they were tokens, so that pictures can flow into the same attention stack that processes words. It is usually the only component trained from scratch, sitting between an otherwise frozen vision encoder and language model, which is a large part of what makes modern multimodal AI so economical to build.
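The mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the sizes (576 patches, 1024-wide visual features, 4096-wide text embeddings), the random stand-in tensors, and the function name linear_projector are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from a specific model).
NUM_PATCHES = 576   # e.g. a 24 x 24 grid of image patches
VISION_DIM = 1024   # width of the vision encoder's output features
TEXT_DIM = 4096     # width of the language model's token embeddings


def linear_projector(visual_feats, weight, bias):
    """Map each visual feature vector into the text-embedding dimension."""
    return visual_feats @ weight + bias


# Random stand-ins for the frozen encoder's output and for embedded text tokens.
visual_feats = rng.standard_normal((NUM_PATCHES, VISION_DIM))
text_embeds = rng.standard_normal((12, TEXT_DIM))

# The only trainable parameters in this sketch.
weight = rng.standard_normal((VISION_DIM, TEXT_DIM)) * 0.02
bias = np.zeros(TEXT_DIM)

pseudo_tokens = linear_projector(visual_feats, weight, bias)

# Visual pseudo-tokens join the text embeddings in one input sequence.
llm_input = np.concatenate([pseudo_tokens, text_embeds], axis=0)
print(pseudo_tokens.shape, llm_input.shape)  # (576, 4096) (588, 4096)
```

The projector changes only the width of each vector, not the number of vectors, so every patch still occupies one position in the language model's input.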

The three connector families

Most systems use one of three styles. Projection-based connectors (the LLaVA lineage, MiniGPT-4, DeepSeek-VL) are the simplest: a single linear layer or small multi-layer perceptron applied to each visual feature independently, directly projecting it into the text-embedding dimension. Every image patch becomes one pseudo-token, the approach is trivially trainable, and its strong results relative to its simplicity made it the default recipe for open multimodal models. Query-based connectors (BLIP-2's Q-Former, early Qwen-VL) interpose a set of learnable query vectors that cross-attend into the image features, compressing a variable-size grid into a fixed, small number of tokens and trading some visual detail for a much shorter sequence. Cross-attention connectors (Flamingo is the canonical example) skip token concatenation entirely, injecting visual context through dedicated attention layers threaded into the language model itself, which is architecturally invasive but frugal with sequence length. CogVLM is a related deep-fusion design: it adds trainable visual expert modules inside the language model's layers, although it still keeps image tokens in the input sequence. The engineering tension is constant across all three: preserve visual fidelity, spend as few tokens as possible, and touch the pretrained giants as little as you can.
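To make the query-based idea concrete, here is a deliberately simplified sketch: a single head of cross-attention in which a fixed set of learnable queries reads from the image features. A real Q-Former is a multi-layer transformer with more machinery; the function name query_connector, the weight names, and all sizes here are hypothetical and chosen only to show that the output length does not depend on the size of the patch grid.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_connector(visual_feats, queries, w_k, w_v, w_out):
    """Single-head cross-attention: learnable queries attend over visual features."""
    keys = visual_feats @ w_k                                # (num_patches, d)
    values = visual_feats @ w_v                              # (num_patches, d)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])   # (num_queries, num_patches)
    attended = softmax(scores, axis=-1) @ values             # (num_queries, d)
    return attended @ w_out                                  # (num_queries, text_dim)


rng = np.random.default_rng(0)
VISION_DIM, D, TEXT_DIM, NUM_QUERIES = 64, 32, 128, 8  # toy sizes (assumptions)

queries = rng.standard_normal((NUM_QUERIES, D))        # learnable in a real model
w_k = rng.standard_normal((VISION_DIM, D)) * 0.1
w_v = rng.standard_normal((VISION_DIM, D)) * 0.1
w_out = rng.standard_normal((D, TEXT_DIM)) * 0.1

# Different grid sizes in, the same number of pseudo-tokens out.
for num_patches in (196, 576, 2304):
    feats = rng.standard_normal((num_patches, VISION_DIM))
    out = query_connector(feats, queries, w_k, w_v, w_out)
    print(num_patches, "->", out.shape)  # always (8, 128)
```

A projection-based connector would instead emit 196, 576, and 2304 pseudo-tokens for these three inputs, which is the sequence-length trade-off the paragraph above describes.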

Why the token budget matters

Every visual pseudo-token the projector emits competes with text for the model's context window, and self-attention cost grows quadratically with sequence length. A high-resolution image mapped patch-by-patch can consume hundreds or thousands of positions before the user has typed a word, which is why compression connectors exist and why high-resolution strategies (tiling an image and projecting each tile) trade quality against budget so visibly. On self-hosted hardware this is felt directly in VRAM and prefill time: connector design is a real driver of whether a vision-capable model is pleasant or painful on your GPU.
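The arithmetic behind that budget is easy to work through. The sketch below assumes a square input of 336 pixels cut into 14-pixel patches (a common vision-transformer configuration), a hypothetical tiling scheme that encodes four tiles plus one downscaled overview image, and a 200-token text prompt. The helper names are made up for illustration, and the cost model counts only the quadratic attention term.

```python
def patch_tokens(image_size: int, patch_size: int) -> int:
    """Pseudo-tokens when every patch of a square image becomes one token."""
    side = image_size // patch_size
    return side * side


def tiled_tokens(tiles: int, image_size: int, patch_size: int,
                 with_overview: bool = True) -> int:
    """Tokens when an image is split into tiles, each encoded at image_size."""
    per_tile = patch_tokens(image_size, patch_size)
    return per_tile * (tiles + (1 if with_overview else 0))


def relative_attention_cost(seq_len: int, baseline_len: int) -> float:
    """Attention-score work scales with the square of the sequence length."""
    return (seq_len / baseline_len) ** 2


single = patch_tokens(336, 14)      # 24 * 24 = 576
tiled = tiled_tokens(4, 336, 14)    # 5 * 576 = 2880
text = 200                          # hypothetical prompt length in tokens

print(single, tiled)                # 576 2880
print(round(relative_attention_cost(text + tiled, text + single), 1))
```

Under these assumptions, tiling multiplies the visual tokens by five but multiplies the attention-score work by roughly sixteen, which is why the cost shows up so clearly in prefill time.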

Training the bridge

The standard recipe freezes both giants and trains only the connector, in stages: first on image–caption pairs so projected features land meaningfully in language space, then instruction-tuning on visual question-answering data, sometimes unfreezing the LLM for that final polish. Because the projector itself is tiny (a linear layer or small MLP, typically a few million to a few tens of millions of parameters against multi-billion-parameter neighbors), stage one is cheap even by homelab standards. Fine-tuning a connector for a specialized domain, such as reading thermal images of hashboards or parsing miner dashboard screenshots, is one of the most accessible forms of multimodal customization.
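A quick parameter count shows how small the trainable part is. The dimensions below (1024-wide visual features, 4096-wide text embeddings, a 4096-wide hidden layer) and the 7-billion-parameter language model are illustrative assumptions, not figures from the text above.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def mlp_params(d_in: int, d_hidden: int, d_out: int) -> int:
    """Two-layer MLP; the activation in between has no parameters."""
    return linear_params(d_in, d_hidden) + linear_params(d_hidden, d_out)


VISION_DIM = 1024            # assumed vision-encoder feature width
TEXT_DIM = 4096              # assumed LLM embedding width
LLM_PARAMS = 7_000_000_000   # assumed size of the frozen language model

linear = linear_params(VISION_DIM, TEXT_DIM)
mlp = mlp_params(VISION_DIM, TEXT_DIM, TEXT_DIM)

print(f"linear projector: {linear:,} params ({linear / LLM_PARAMS:.2%} of the LLM)")
print(f"2-layer MLP:      {mlp:,} params ({mlp / LLM_PARAMS:.2%} of the LLM)")
# linear projector: 4,198,400 params (0.06% of the LLM)
# 2-layer MLP:      20,979,712 params (0.30% of the LLM)
```

With both large models frozen, gradients and optimizer state are needed only for these few million values, which is the main reason the first training stage is inexpensive.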

Why it matters for sovereignty

The projector is the modularity point of the whole multimodal stack. Because it is small and swappable, pairing an open vision encoder with a locally run open-weight model is a matter of training a bridge, not retraining either giant, which is what makes private, offline multimodal AI realistic on owned hardware rather than rented endpoints. Your images (documents, faces, your workshop, your hardware) are among the most sensitive data you can send to an API; a local connector is what keeps them home. Whichever route a model takes to bring vision into the language model, token concatenation (projection-based and query-based connectors) or cross-attention fusion, the bridge is a component you can train and run yourself.
