Passer au contenu

Bitcoin accepté au paiement  |  Expédié depuis Laval, QC, Canada  |  Soutien expert depuis 2016

Vision Encoder

Sovereign AI

Definition

A vision encoder is the part of a multimodal system responsible for turning raw pixels into a numerical representation that downstream components can reason over. It is the "eyes" of a vision-language model: it ingests an image and outputs a set of feature vectors, often called image tokens or patch embeddings, that capture what is in the picture and roughly where.

How vision encoders are built

Modern vision encoders are usually Vision Transformers (ViT) that split an image into a grid of fixed-size patches, embed each patch, and apply self-attention so every patch can attend to every other. Older designs used convolutional neural networks. Many encoders are pre-trained contrastively against text, as in CLIP, so their output already aligns with language, which makes them ideal front-ends for models that combine images and words.

Why the encoder is the linchpin

In a typical vision-language model, the vision encoder feeds a projector that maps image features into the language model's token space, after which a standard language model generates text. The quality of the whole system is bounded by what the encoder can perceive, so it determines whether a model can read fine print, count objects, or recognize a specific ASIC board on a workbench.

Vision encoders are usually built on the vision transformer and feed the larger vision-language model they are part of.

In Simple Terms

A vision encoder is the part of a multimodal system responsible for turning raw pixels into a numerical representation that downstream components can reason over.…

Explore the Full Glossary

Browse all Bitcoin mining terms from A to Z. Whether you are a beginner or expert, deepen your understanding of the mining ecosystem.

Glossaire du minage

ASIC Miner Database

Compare 500+ miners with real-time profitability data, home mining scores, and detailed specs.

Comparer les mineurs