Definition
A tokenizer vocabulary is the fixed dictionary of tokens a language model can read and emit. It is the bridge between human text and the integer IDs a neural network actually processes: every string the model sees is first segmented into tokens, and each token is looked up against the vocabulary to retrieve its ID. Nothing outside the vocabulary can be represented directly — it is the model's entire alphabet.
Size and trade-offs
Vocabulary size is a deliberate design choice. Monolingual models often use 30,000 to 64,000 entries; recent multilingual models run larger — roughly 128,000 for Llama 3 and around 200,000 for GPT-4-class models. A larger vocabulary means common words and phrases stay whole, lowering the token count for a given text and improving throughput. The cost is a much larger embedding matrix and output layer, which consumes memory and compute. Smaller vocabularies are leaner but fragment text into more pieces.
Tokens are not words
Vocabulary entries are subword units, not dictionary words. A single entry can be a character, a fragment, a whole word, a punctuation mark, or a raw byte sequence, depending on how the merge rules were learned. This is why token counts rarely match word counts, and why the same sentence costs different amounts across model families. The vocabulary file — a plain mapping of strings to integer IDs — ships alongside the model and must match exactly, or decoding produces garbage.
The vocabulary is produced by a tokenization algorithm such as Byte-Pair Encoding (BPE), and it always reserves slots for special tokens that mark structure rather than ordinary text.
In Simple Terms
A tokenizer vocabulary is the fixed dictionary of tokens a language model can read and emit. It is the bridge between human text and the…
