Vision Transformer (ViT)

Sovereign AI

Definition

The Vision Transformer (ViT) is an image model, introduced by Google researchers in 2020, that applies the transformer architecture, originally built for language, directly to computer vision. Rather than scanning an image with convolution filters, ViT cuts the image into a grid of fixed-size patches, flattens and embeds each patch into a vector, and treats the resulting sequence like words in a sentence.
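
The patch step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `image_to_patch_tokens`, the 224x224 input, the 16-pixel patch size, and the 64-dimensional embedding are assumptions chosen for the example, and a random matrix stands in for the projection weights that a real model would learn.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, embed_dim, rng):
    """Split an (H, W, C) image into non-overlapping patches and embed each one."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size

    # Cut the image into a grid of patches, then flatten each patch to a vector.
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(grid_h * grid_w, patch_size * patch_size * c)

    # Assumption: a random linear projection stands in for learned weights.
    w_embed = rng.normal(0.0, 0.02, size=(patches.shape[1], embed_dim))
    tokens = patches @ w_embed
    return tokens, patches

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # dummy RGB image
tokens, patches = image_to_patch_tokens(image, patch_size=16, embed_dim=64, rng=rng)

print(patches.shape)  # (196, 768): a 14 x 14 grid of flattened 16 x 16 x 3 patches
print(tokens.shape)   # (196, 64): one embedding vector per patch
```

Each row of `tokens` plays the role that a word embedding plays in a text transformer.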

Patches as tokens

Each patch embedding is combined with a positional encoding so the model knows where the patch sat in the original image, and a special classification token is prepended to the sequence to aggregate global information. Stacked self-attention layers then let every patch attend to every other patch, so long-range relationships across the whole image are available from the first layer, whereas a convolutional network needs many stacked layers to build up a comparable receptive field. This global view is ViT's defining advantage.
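
As a rough sketch of these pieces, the NumPy snippet below prepends a classification token, adds position embeddings, and runs one single-head self-attention pass. Everything here is an assumption for illustration: the token count (196) and width (64), the random stand-ins for learned parameters, and the helper names `softmax` and `self_attention`. A real ViT uses multi-head attention inside full transformer blocks with layer normalization and MLP layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every token scored against every token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 196, 64
patch_tokens = rng.normal(size=(num_patches, dim))   # stand-in for embedded patches

# Assumption: random values stand in for the learned [CLS] token and position table.
cls_token = rng.normal(0.0, 0.02, size=(1, dim))
pos_embed = rng.normal(0.0, 0.02, size=(num_patches + 1, dim))

# Prepend the classification token, then add one position vector per token.
x = np.concatenate([cls_token, patch_tokens], axis=0) + pos_embed

w_q, w_k, w_v = (rng.normal(0.0, dim ** -0.5, size=(dim, dim)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)

print(x.shape)        # (197, 64): 196 patches plus the classification token
print(weights.shape)  # (197, 197): one attention weight per pair of tokens
```

Row 0 of `weights` shows how strongly the classification token attends to each patch, which is how it gathers information from the whole image.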

Why ViT became foundational

Given enough training data, ViTs match or beat convolutional networks on image classification, detection, and segmentation, and they scale cleanly with more data and compute. Crucially, because ViT speaks the same "sequence of tokens" language as text transformers, it slots naturally into multimodal systems, which is one reason many modern vision encoders are ViT-based.

The Vision Transformer is a direct application of the transformer and its self-attention mechanism to images.
