Definition
A visual token is the unit by which an image enters a multimodal language model. A vision encoder splits a picture into patches and encodes each one; a projector then maps those encodings into vectors that sit in the same embedding space as text tokens. The language model then attends jointly over the interleaved visual and text tokens, treating the image much like a stretch of words.
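The patch-and-project pipeline described above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the function names, the 8x8 grayscale image, the 4x4 patch size, and the 6-dimensional output are all hypothetical, and a single random linear map stands in for both the vision encoder and the projector, which in real models are a trained transformer and a trained projection layer.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W grid of pixel values into flattened patches (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def project(vectors, weights):
    """Multiply each input vector by a weight matrix of shape out_dim x in_dim."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights]
            for vec in vectors]


# Toy sizes (assumptions for illustration only).
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]

patches = split_into_patches(image, patch=4)   # 4 patches, 16 values each

# Random weights stand in for the trained encoder and projector.
weights = [[rng.gauss(0, 0.1) for _ in range(16)] for _ in range(6)]
visual_tokens = project(patches, weights)      # 4 tokens, 6 values each
```

Each entry of `visual_tokens` plays the role of one visual token: a fixed-length vector that could be placed in a sequence alongside text token embeddings of the same width.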
How many tokens an image becomes
Image resolution drives token count. LLaVA-1.5, for example, encodes a 336x336 image into 576 visual tokens, one per patch in a 24x24 grid. Higher-resolution inputs can balloon into thousands of tokens, and visual tokens typically outnumber the text tokens in a prompt by a wide margin. Because visual tokens carry less structured, more spatially redundant information than language, many of them are unnecessary for a given task, which is why token compression and pruning are active research areas.
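The arithmetic behind these counts is simple enough to write down. The sketch below assumes one token per non-overlapping square patch and a 14-pixel patch size, which is consistent with the 576-token figure for a 336x336 input; the function name is hypothetical, and real models may add extra tokens (tiles, thumbnails, separators) or merge patches, which this ignores.

```python
def visual_token_count(width, height, patch_size=14):
    """Count patches, assuming one visual token per non-overlapping square patch."""
    if width % patch_size or height % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (width // patch_size) * (height // patch_size)


base = visual_token_count(336, 336)      # 24 * 24 = 576
doubled = visual_token_count(672, 672)   # 48 * 48 = 2304
```

Doubling each side of the image quadruples the token count, which is how high-resolution inputs reach thousands of tokens.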
Why the token budget matters
Every visual token takes up context-window space, and the attention compute it adds grows quadratically with sequence length. For someone running a multimodal model on their own hardware, controlling how many visual tokens an image produces is often one of the biggest levers on memory use and latency. Trimming redundant tokens can make high-resolution understanding feasible on consumer GPUs.
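A rough way to see the quadratic effect is to count the entries in one attention score matrix, which has one entry per pair of positions in the sequence. The sketch below is a back-of-the-envelope estimate, not a profiler: the function name and the token counts (64 text tokens, 576 versus 144 visual tokens) are illustrative assumptions, and it ignores causal masking, the number of heads and layers, and all non-attention compute.

```python
def attention_score_entries(text_tokens, visual_tokens):
    """Entries in one full attention score matrix: sequence length squared."""
    sequence_length = text_tokens + visual_tokens
    return sequence_length * sequence_length


full = attention_score_entries(text_tokens=64, visual_tokens=576)    # 640**2
pruned = attention_score_entries(text_tokens=64, visual_tokens=144)  # 208**2

# Keeping a quarter of the visual tokens cuts this estimate by more than 9x.
reduction = full / pruned
```

Because the cost depends on the square of the total length, removing visual tokens saves proportionally more attention compute than the fraction of tokens removed.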
Visual tokens are produced by an image patch embedding step and routed into the language model by a modality projector.
