Definition
A modality projector, also called a connector, is the bridge component in a multimodal large language model (MLLM) that translates features from a vision encoder into the embedding space the language model already understands. Images and text are encoded in different representation spaces; the projector closes that gap so visual information can flow into the same transformer that processes words. It is often the only component trained from scratch, with the vision encoder and language model kept frozen at least during initial alignment, which makes it cheap to train and fine-tune.
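To make "cheap" concrete, the arithmetic below counts the parameters of a two-layer MLP projector and compares them with the size of a language model. The dimensions (1024-dimensional vision features, 4096-dimensional text embeddings, a language model of roughly 7 billion parameters) are illustrative assumptions, and the function name is hypothetical; the entry itself does not specify any sizes.

```python
def mlp_projector_params(vision_dim: int, text_dim: int) -> int:
    """Count weights and biases of a two-layer MLP: vision_dim -> text_dim -> text_dim."""
    first_layer = vision_dim * text_dim + text_dim
    second_layer = text_dim * text_dim + text_dim
    return first_layer + second_layer


# Illustrative sizes (assumptions, not taken from the entry).
VISION_DIM = 1024            # width of one visual feature vector
TEXT_DIM = 4096              # width of one language-model embedding
LLM_PARAMS = 7_000_000_000   # rough size of a "7B" language model

projector_params = mlp_projector_params(VISION_DIM, TEXT_DIM)
share = projector_params / LLM_PARAMS

print(f"projector parameters: {projector_params:,}")  # 20,979,712
print(f"share of LLM size:    {share:.2%}")           # 0.30%
```

Under these assumptions the projector is roughly 21 million parameters, well under one percent of the language model it feeds.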
How the connector works
Most modern systems use one of three connector styles. Projection-based connectors (LLaVA, MiniGPT-4, DeepSeek-VL) apply a single linear layer or a small multi-layer perceptron to each visual feature, projecting it directly into the text-embedding dimension. Query-based connectors (BLIP-2's Q-Former, Qwen-VL) use a set of learnable query vectors to compress a variable-length image grid into a fixed number of tokens. Cross-attention connectors (Flamingo) inject visual context through dedicated attention layers inserted into the language model rather than concatenating tokens. CogVLM is often grouped with this family, although its mechanism differs: it adds trainable visual expert weights to the language model's attention and feed-forward layers.
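The first two styles can be sketched in a few lines. This is a minimal, dependency-free illustration under stated assumptions, not the implementation of any of the systems named above: the sizes are toy values, the weights are random rather than trained, all function names are hypothetical, and the query-based step is reduced to a single attention pass without the learned key/value projections and stacked layers a real Q-Former uses. A real system would use a tensor library instead of Python lists.

```python
import math
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random stand-in for trained weights or encoder features."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def mlp_project(features: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Projection-based connector: map each visual feature to the text width."""
    hidden = [[gelu(v) for v in row] for row in matmul(features, w1)]
    return matmul(hidden, w2)


def softmax(xs: list[float]) -> list[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def query_resample(features: Matrix, queries: Matrix) -> Matrix:
    """Query-based connector (simplified): each query attends over all visual
    features, so the output length equals the number of queries."""
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*features)]  # transpose to (dim x n)
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, features)


rng = random.Random(0)
VISION_DIM, TEXT_DIM = 8, 12           # toy widths (assumptions)
NUM_PATCHES, NUM_QUERIES, NUM_TEXT = 16, 4, 5

patches = random_matrix(NUM_PATCHES, VISION_DIM, rng)   # vision encoder output
w1 = random_matrix(VISION_DIM, TEXT_DIM, rng)
w2 = random_matrix(TEXT_DIM, TEXT_DIM, rng)

# Projection-based: one visual token per patch, then prepend to the text.
visual_tokens = mlp_project(patches, w1, w2)
text_embeddings = random_matrix(NUM_TEXT, TEXT_DIM, rng)
llm_input = visual_tokens + text_embeddings

# Query-based: compress 16 patches to 4 tokens before projecting.
queries = random_matrix(NUM_QUERIES, VISION_DIM, rng)
compressed_tokens = mlp_project(query_resample(patches, queries), w1, w2)

print(len(visual_tokens), len(visual_tokens[0]))          # 16 12
print(len(llm_input))                                     # 21
print(len(compressed_tokens), len(compressed_tokens[0]))  # 4 12
```

The shapes show the trade-off: the projection-based path keeps one token per image patch, so the language model's input grows with image resolution, while the query-based path always emits a fixed number of tokens regardless of how many patches come in.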
Why it matters for sovereignty
Because the projector is a small, swappable module, self-hosting operators can pair an open vision encoder with a locally run open language model by training only the projector, without retraining either of the two large models. That modularity is what makes private, offline multimodal AI realistic on owned hardware rather than rented cloud endpoints.
The projector consumes a sequence of visual representations (see visual token). Its token-concatenation approach is an alternative to concatenation-free designs such as cross-attention fusion for getting visual information into the language model.
