Vision Transformer (ViT)

The Vision Transformer (ViT) is an image model, introduced by Google researchers in 2020, that applies the transformer architecture — originally built for language — directly to computer vision. Rather than scanning an image with convolution filters, ViT cuts the image into a grid of fixed-size patches (16x16 pixels in the original), flattens and embeds each patch into a vector, and treats the resulting sequence exactly like words in a sentence. The paper's title said it plainly: an image is worth 16x16 words.
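The patch-cutting step can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify`, the 224x224 RGB input, and the channels-last layout are assumptions chosen for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Cut an (H, W, C) image into non-overlapping patch x patch tiles and
    flatten each tile into one row: (num_patches, patch * patch * C)."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # Split each spatial axis into (grid index, offset inside the patch).
    tiles = image.reshape(rows, patch, cols, patch, c)
    # Move both grid axes to the front: (rows, cols, patch, patch, C).
    tiles = tiles.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch, in row-major grid order.
    return tiles.reshape(rows * cols, patch * patch * c)

# Hypothetical input: a 224x224 RGB image filled with a counting pattern.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, 16*16*3 values each
```

Each row of `tokens` plays the role of one "word"; the model never sees the image as a 2D grid again unless position information is added back.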

Patches as tokens

Each patch is linearly projected into an embedding — the image patch embedding — and receives a positional encoding so the model knows where it sat in the original grid, since attention itself is order-blind. A special classification token is prepended to aggregate global information for downstream decisions. From there the architecture is a standard transformer stack: alternating self-attention and feed-forward layers, in which every patch can attend to every other patch from the very first layer. That global receptive field is ViT's defining advantage — a convolutional network needs many stacked layers before distant pixels can influence each other, while ViT relates opposite corners of the image immediately.
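The sequence assembly described above can be sketched as follows. All weights are random stand-ins for parameters that would be learned in training, the width `d_model = 192` is an arbitrary illustrative choice, and the single attention head omits the multi-head split, residual connections, and normalization of a real transformer layer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, d_model = 196, 768, 192  # illustrative sizes

# Stand-in for the flattened patches of one image.
patches = rng.standard_normal((num_patches, patch_dim))

# Random placeholders for learned parameters.
w_embed = rng.standard_normal((patch_dim, d_model)) * 0.02
cls_token = rng.standard_normal((1, d_model)) * 0.02
pos_embed = rng.standard_normal((num_patches + 1, d_model)) * 0.02
w_q = rng.standard_normal((d_model, d_model)) * 0.02
w_k = rng.standard_normal((d_model, d_model)) * 0.02

def embed(flat_patches: np.ndarray) -> np.ndarray:
    x = flat_patches @ w_embed                   # linear patch projection
    x = np.concatenate([cls_token, x], axis=0)   # prepend classification token
    return x + pos_embed                         # add position information

def attention_weights(x: np.ndarray) -> np.ndarray:
    """Single-head attention weights: one row per query token."""
    scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(d_model)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

tokens = embed(patches)
weights = attention_weights(tokens)
print(tokens.shape)   # (197, 192): 196 patches plus the classification token
print(weights.shape)  # (197, 197): every token scores every other token
```

The `(197, 197)` weight matrix is the global receptive field in concrete form: no entry is structurally masked out, so the top-left patch can draw on the bottom-right patch in the very first layer.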

The data-scale bargain

Convolutions bake in useful assumptions about images, chiefly locality and translation equivariance, that ViT largely discards. The consequence is a bargain: on small datasets ViT underperforms, because it must learn from data what CNNs get for free, but given large-scale pre-training it matches and then surpasses convolutional networks on classification, detection, and segmentation, and it has tended to keep improving with more data and compute at scales where CNN gains flatten. Later refinements (data-efficient training recipes, hierarchical and windowed-attention variants) softened the data hunger and tamed the quadratic attention cost at high resolutions, but the plain ViT recipe remains the backbone of the field.
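To make the quadratic cost and the windowed-attention remedy concrete, the sketch below counts query-key pairs for global attention versus attention confined to non-overlapping windows. The function names and the 7x7-token window are assumptions for illustration, and the count ignores the classification token and any cross-window mixing a real variant would add.

```python
def global_pairs(grid_side: int) -> int:
    """Query-key pairs when every token attends to every token."""
    tokens = grid_side ** 2
    return tokens ** 2

def windowed_pairs(grid_side: int, window_side: int = 7) -> int:
    """Query-key pairs when attention stays inside non-overlapping
    window_side x window_side blocks of the token grid."""
    if grid_side % window_side:
        raise ValueError("grid side must be a multiple of the window side")
    tokens = grid_side ** 2
    return tokens * window_side ** 2

# Token grids of 14, 28, and 56 per side (each step doubles the resolution).
for side in (14, 28, 56):
    print(side, global_pairs(side), windowed_pairs(side))
# 14 38416 9604
# 28 614656 38416
# 56 9834496 153664
```

Global attention grows 16x with each doubling of the grid side, while the windowed count grows only 4x, in step with the number of tokens.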

Why ViT became foundational

The deeper significance is architectural convergence. Because ViT speaks the same "sequence of tokens" language as text transformers, vision stopped being an architectural island: one modeling framework, one optimization toolkit, and one scaling playbook now serve both text and images. That is why most modern vision encoders are ViT-based, why CLIP-style contrastive training pairs a ViT with a text transformer so naturally, and why a multimodal model can splice image tokens into a language model's input as if they were words — patches and words are, by construction, the same kind of object.
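The splicing idea can be sketched as a projection followed by a concatenation. Everything here is hypothetical: the embedding widths, the single linear `projector`, and the random arrays standing in for real ViT outputs and text embeddings. Actual multimodal models differ in how they map vision features into the language model's space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_text = 768, 1024  # illustrative embedding widths

# Stand-ins for ViT output tokens and for embedded text on either side.
image_tokens = rng.standard_normal((196, d_vision))
text_before = rng.standard_normal((5, d_text))
text_after = rng.standard_normal((7, d_text))

# Hypothetical learned projection from vision width to text width.
projector = rng.standard_normal((d_vision, d_text)) * 0.02

image_as_text = image_tokens @ projector
# One sequence: the language model sees image tokens like any other tokens.
sequence = np.concatenate([text_before, image_as_text, text_after], axis=0)
print(sequence.shape)  # (208, 1024)
```

Because both modalities are already sequences of vectors, the only adapter needed in this sketch is a change of width.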

Running ViTs locally

ViTs are practical citizens of self-hosted stacks. Open-weight checkpoints span from a few million to several billion parameters; the small and base sizes run comfortably on consumer GPUs and acceptably on CPUs, powering local image search, classification, and the vision half of on-device assistants. For a sovereign operator the relevant knobs are patch size and input resolution — together they set how many tokens an image becomes, which drives both compute cost and whether fine detail like board silkscreen survives into the representation. The pattern is familiar from every other layer of the stack: understand the architecture's trade-offs, and you can choose hardware and models deliberately instead of renting someone else's opinion.

ViT variants are named with a compact convention worth decoding: a size letter (S, B, L, H for small through huge) and a patch number. ViT-B/16 is the base model with 16-pixel patches; ViT-L/14 is a large model with finer 14-pixel patches. Smaller patches mean more tokens per image, better fine detail, and a steeper compute bill, since attention cost grows with the square of token count; that is the knob to remember when small text or thin structures matter. Input resolution multiplies the same trade: doubling the side length quadruples the number of patches, and attention cost with it grows roughly sixteenfold. When a downstream model can or cannot read a label, this arithmetic, not mystery, is usually the reason. For self-hosters, the recipe is to pick the smallest ViT that survives a test on your own hardest images, and to treat anything larger as spending watts on headroom you have not demonstrated you need.
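The naming convention and the token arithmetic can be checked with a small helper. The functions `parse_variant` and `token_count` are hypothetical names for this sketch, square inputs are assumed, and the count excludes the classification token.

```python
import re

SIZES = {"S": "small", "B": "base", "L": "large", "H": "huge"}

def parse_variant(name: str) -> tuple[str, int]:
    """Split a name such as 'ViT-B/16' into (size label, patch size in pixels)."""
    match = re.fullmatch(r"ViT-([SBLH])/(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return SIZES[match.group(1)], int(match.group(2))

def token_count(resolution: int, patch: int) -> int:
    """Patch tokens for a square image (classification token not counted)."""
    if resolution % patch:
        raise ValueError("resolution must be a multiple of the patch size")
    return (resolution // patch) ** 2

for name, resolution in [("ViT-B/16", 224), ("ViT-B/16", 448), ("ViT-L/14", 224)]:
    size, patch = parse_variant(name)
    tokens = token_count(resolution, patch)
    # tokens ** 2 approximates the number of attention pairs per layer.
    print(name, size, resolution, tokens, tokens ** 2)
# ViT-B/16 base 224 196 38416
# ViT-B/16 base 448 784 614656
# ViT-L/14 large 224 256 65536
```

Running the same two functions over candidate models and the resolutions you actually need gives a quick first estimate of relative attention cost before any benchmark is run.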
