Definition
The Vision Transformer (ViT) is an image model, introduced by Google researchers in 2020, that applies the transformer architecture, originally built for language, directly to computer vision. Rather than scanning an image with convolution filters, ViT cuts the image into a grid of fixed-size patches (16×16 pixels in the original paper), flattens each patch, linearly projects it into an embedding vector, and treats the resulting sequence like words in a sentence.
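The patch step can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation: the function names `image_to_patches` and `embed_patches`, the 224×224 input, and the embedding width of 64 are assumptions chosen for the example, and the projection matrix is random here, whereas a trained model learns it.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def embed_patches(patches: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Linearly project each flattened patch to an embedding vector."""
    return patches @ projection

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a real RGB image
patches = image_to_patches(image, 16)        # 14 x 14 grid of 16x16x3 patches
projection = rng.normal(scale=0.02, size=(16 * 16 * 3, 64))  # assumed width 64
tokens = embed_patches(patches, projection)  # one token per patch
print(patches.shape, tokens.shape)
```

Patches are ordered row by row, left to right, so the position of a token in the sequence maps back to a fixed location in the image grid.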
Patches as tokens
Each patch embedding gets a positional encoding so the model knows where it sat in the original image, and a special learnable classification token is prepended to the sequence to aggregate global information. Stacked self-attention layers then let every patch attend to every other patch, capturing long-range relationships across the whole image that convolutional networks, with their local receptive fields, can reach only by stacking many layers. This global view is ViT's defining advantage.
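To make the token mechanics concrete, here is a minimal NumPy sketch of one single-head self-attention step over a patch sequence. It is a simplification under stated assumptions: real ViT layers use multi-head attention, layer normalization, residual connections, and an MLP block, and all weights are learned rather than random. The names `add_cls_and_positions` and `self_attention`, and the sizes used, are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def add_cls_and_positions(tokens, cls_token, pos_embed):
    """Prepend the classification token, then add positional embeddings."""
    seq = np.concatenate([cls_token[None, :], tokens], axis=0)
    return seq + pos_embed

def self_attention(seq, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the whole sequence."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # every token scored against every token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_patches, dim = 196, 64                     # assumed sizes for illustration
tokens = rng.normal(size=(n_patches, dim))   # stand-in for patch embeddings
cls_token = rng.normal(size=dim)             # learned in a real model
pos_embed = rng.normal(scale=0.02, size=(n_patches + 1, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

seq = add_cls_and_positions(tokens, cls_token, pos_embed)
out, weights = self_attention(seq, w_q, w_k, w_v)
print(seq.shape, out.shape, weights.shape)
```

The attention weight matrix has one row and one column per token, which is what lets any patch draw on any other patch in a single layer. Row 0 holds the classification token's weights over the full sequence.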
Why ViT became foundational
Given enough training data, ViTs match or beat convolutional networks on image classification and, with suitable adaptations, on detection and segmentation, and they scale cleanly with more data and compute. Crucially, because ViT speaks the same "sequence of tokens" language as text transformers, it slots naturally into multimodal systems, which is why many modern vision encoders are ViT-based.
The Vision Transformer is a direct application of the transformer and its self-attention mechanism to images.
