Definition
Image patch embedding is the operation that converts a raw image into a sequence of vectors a transformer can read. Instead of feeding pixels directly, the image is divided into a grid of fixed-size patches (commonly 14x14 or 16x16 pixels); each patch is flattened and passed through a linear projection to produce an embedding vector. This patchification is what made the Vision Transformer possible, because it reframes an image as a short sequence of tokens analogous to words in a sentence.
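The split, flatten, and project steps can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function names `patchify` and `embed_patches`, the 384-dimensional embedding width, and the randomly initialised weights are all assumptions made for the example. In a trained model the projection weights are learned.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Apply the linear projection: one embedding vector per patch."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real RGB image
patch_size, embed_dim = 16, 384     # embed_dim is an illustrative choice

patches = patchify(image, patch_size)
# Random weights stand in for the learned projection (assumption for the demo).
weight = rng.normal(0.0, 0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)

tokens = embed_patches(patches, weight, bias)
print(patches.shape, tokens.shape)  # (196, 768) (196, 384)
```

Each 16x16 RGB patch flattens to 16 * 16 * 3 = 768 values, and the projection maps those 768 values to one embedding vector per patch.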
From patches to a sequence
After projection, a positional encoding is added to each patch embedding so the model knows where each patch sat in the original grid; self-attention by itself has no notion of token order. The resulting sequence then flows through standard transformer blocks. A 224x224 image at a 16-pixel patch size yields a 14x14 grid, or 196 patches, so patch size directly controls sequence length, compute cost, and the granularity of detail the model can resolve.
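To make the role of the positional encoding concrete, the sketch below adds a per-position table to a set of patch embeddings. It assumes a learned-style table with one row per patch, which is only one of several possible schemes; the text above does not specify which kind is used. The name `add_positions`, the tiny embedding width of 8, and the random values are hypothetical and chosen purely for illustration.

```python
import numpy as np


def add_positions(tokens: np.ndarray, pos_table: np.ndarray) -> np.ndarray:
    """Add one positional vector to each patch embedding (same shapes required)."""
    if tokens.shape != pos_table.shape:
        raise ValueError("tokens and pos_table must have the same shape")
    return tokens + pos_table


rng = np.random.default_rng(0)
grid = 224 // 16               # 14 patches per side
num_patches = grid * grid      # 196 patches in total
embed_dim = 8                  # kept tiny for the demo (assumption)

tokens = rng.normal(size=(num_patches, embed_dim))
tokens[5] = tokens[0]          # two patches with identical content

# Random values stand in for a learned positional table (assumption).
pos_table = rng.normal(scale=0.02, size=(num_patches, embed_dim))
x = add_positions(tokens, pos_table)

print(np.array_equal(tokens[0], tokens[5]))  # True: same content
print(np.allclose(x[0], x[5]))               # False: positions now differ
```

Two patches with identical pixel content become distinguishable once their positions are added, which is the information attention would otherwise lack.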
Role in multimodal systems
In a vision-language model, the vision encoder turns patch embeddings into visual features, and a projector later maps those features into the language model's embedding space. Smaller patches preserve fine detail but multiply the token count; larger patches are cheaper but blur small features. That trade-off is central to running multimodal models efficiently on self-hosted hardware.
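The trade-off can be made concrete with some quick arithmetic. The sketch below counts patches for a 224x224 image at several patch sizes and compares each against the 16-pixel case. The helper `token_count` is a hypothetical name, and the "attention pairs" column assumes standard full self-attention, whose cost grows with the square of the sequence length; models that use windowed or sparse attention would scale differently.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    side, remainder = divmod(image_size, patch_size)
    if remainder:
        raise ValueError("image_size must be divisible by patch_size")
    return side * side


image_size = 224
baseline = token_count(image_size, 16)

for patch_size in (8, 14, 16, 32):
    n = token_count(image_size, patch_size)
    ratio = n / baseline
    # Full self-attention compares every token with every other token,
    # so the pairwise cost scales with the square of the token ratio.
    print(
        f"patch {patch_size:>2}px: {n:>4} tokens, "
        f"{ratio:.2f}x tokens, {ratio ** 2:.2f}x attention pairs"
    )
```

Halving the patch size from 16 to 8 pixels quadruples the token count from 196 to 784, while doubling it to 32 pixels cuts the count to 49.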
Patch embeddings feed the vision encoder whose outputs become a visual token sequence, later aligned with text through multimodal alignment.
