Definition
A vision encoder is the part of a multimodal system responsible for turning raw pixels into a numerical representation that downstream components can reason over. It is the "eyes" of a vision-language model: it ingests an image and outputs a set of feature vectors, often called image tokens or patch embeddings, that capture what is in the picture and roughly where.
How vision encoders are built
Modern vision encoders are usually Vision Transformers (ViT) that split an image into a grid of fixed-size patches, embed each patch, and apply self-attention so every patch can attend to every other. Older designs used convolutional neural networks. Many encoders are pre-trained contrastively against text, as in CLIP, so their output already aligns with language, which makes them ideal front-ends for models that combine images and words.
Why the encoder is the linchpin
In a typical vision-language model, the vision encoder feeds a projector that maps image features into the language model's token space, after which a standard language model generates text. The quality of the whole system is bounded by what the encoder can perceive, so it determines whether a model can read fine print, count objects, or recognize a specific ASIC board on a workbench.
Vision encoders are usually built on the vision transformer and feed the larger vision-language model they are part of.
In Simple Terms
A vision encoder is the part of a multimodal system responsible for turning raw pixels into a numerical representation that downstream components can reason over.…
