Definition
A tokenizer is the component that converts raw text into the numeric tokens a language model actually processes. LLMs do not read characters or whole words; they operate on a fixed vocabulary of tokens, which are usually subword fragments. The tokenizer maps your text to those token IDs on the way in and maps the model's output IDs back to text on the way out. Every limit you care about, context window size, API cost, and throughput, is measured in tokens, not words.
Byte-pair encoding
Most modern LLMs (GPT, Llama, and others) use a variant of byte-pair encoding (BPE). BPE starts from individual bytes or characters and iteratively merges the most frequent adjacent pairs into longer tokens, building a vocabulary that balances common whole words against the ability to spell out rare or unseen strings. This is why a common English word may be a single token while an unusual technical term or a wallet address gets split into several. Byte-level BPE guarantees any Unicode input can be encoded, with no out-of-vocabulary failures.
Why operators should care
Tokenization directly drives the economics of running a model. Denser tokenization means more text fits in the same context budget and fewer tokens are billed per request. It also explains quirks like models struggling to count letters: they see tokens, not characters.
When you run a self-hosted model with an open-weight release, the tokenizer ships alongside the weights and must match exactly, since a mismatch produces garbage output. Understanding tokens is foundational to estimating how much hardware your sovereign inference setup needs.
In Simple Terms
A tokenizer is the component that converts raw text into the numeric tokens a language model actually processes. LLMs do not read characters or whole…
