Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is the subword tokenization algorithm that underpins most modern large language models, including the GPT family. Originally a 1990s data-compression technique, it was adapted for natural language processing to solve the out-of-vocabulary problem. Instead of treating every word as an atomic unit, and failing on any word it has never seen, BPE keeps common words whole and breaks rare and novel words into smaller, reusable pieces. As long as the base vocabulary covers every character or byte that can occur, the resulting vocabulary can represent any input string, which is why BPE became the default answer to "how does the model read text?"

How the merge loop works

Training starts with a base vocabulary of individual characters or raw bytes. The algorithm scans the corpus, finds the most frequent pair of adjacent symbols, and merges that pair into a single new token. It records the merge rule and repeats (count pairs, merge the winner, record the rule) until the vocabulary reaches a target size, commonly somewhere between 30,000 and 200,000 tokens. What ships with the model is the vocabulary plus the ordered list of merge rules: at inference time the tokenizer replays those rules deterministically on new text, so the same string always segments the same way. Frequent words like "mining" end up as single tokens; a rarer string like "underclocking" might split into pieces such as "under", "clock", "ing", each piece already known to the model from thousands of other contexts.
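
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (train_bpe, apply_merge, encode) and the toy corpus are assumptions made for the example, the base vocabulary is characters rather than bytes, merges never cross word boundaries, and ties between equally frequent pairs are broken arbitrarily but deterministically.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(words, num_merges):
    """Learn an ordered list of merge rules from a list of words (toy corpus)."""
    # Each distinct word becomes a tuple of single-character symbols with a count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair wins; the pair itself is the deterministic tie-breaker.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Replay the learned merge rules, in training order, on a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus: "mining" is frequent, "clocking" is rare.
toy_words = ["mining"] * 5 + ["miner"] * 3 + ["clocking"] * 2
toy_merges = train_bpe(toy_words, num_merges=20)

print(encode("mining", toy_merges))         # ['mining']
print(encode("underclocking", toy_merges))  # pieces depend on the learned merges
```

Because encode only replays recorded rules, the same input always produces the same segmentation. A word the corpus never contained is still encoded: it simply falls back to whatever smaller pieces the rules can build.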

Bytes, not characters

The byte-level variant (used by GPT-2 and its descendants) runs the merge loop over raw UTF-8 bytes rather than characters. Since there are only 256 possible byte values, the base vocabulary covers every possible input with zero exceptions, so no unknown-token fallback is ever needed. That robustness matters for exactly the kind of text a Bitcoin-adjacent operator feeds a model: hex strings, wallet addresses, hashes, code, kernel log excerpts. A word-level scheme would choke on bc1q...; byte-level BPE just fragments it into many small tokens and carries on. The flip side is cost: high-entropy strings compress poorly, so an address or hash can consume dozens of tokens while a common English sentence of the same length consumes far fewer.
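
To make the 256-symbol base vocabulary concrete, here is a small sketch of the step that happens before any merge is applied. The helper names (to_base_tokens, from_base_tokens) are hypothetical; the only real API used is Python's built-in UTF-8 encoding. The string "bc1q" below is just the prefix quoted above, not a complete address.

```python
def to_base_tokens(text):
    """Map any string to its UTF-8 bytes; each value is in range(256)."""
    return list(text.encode("utf-8"))

def from_base_tokens(ids):
    """Inverse mapping: bytes back to the original string."""
    return bytes(ids).decode("utf-8")

# ASCII characters cost one base token each.
print(to_base_tokens("bc1q"))      # [98, 99, 49, 113]

# A non-ASCII character costs several base tokens before any merges.
print(len(to_base_tokens("₿")))    # 3

# Nothing is ever out of vocabulary, and the mapping is lossless.
sample = "hash=00ff ₿"
assert all(0 <= b < 256 for b in to_base_tokens(sample))
assert from_base_tokens(to_base_tokens(sample)) == sample
```

The merge rules then operate on these byte values. Text that resembles the training corpus gets merged into long tokens, while strings with no repeated structure stay close to one token per byte.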

Why self-hosters should care

Vocabulary size is itself a tuning knob with real trade-offs. A larger vocabulary holds longer merged tokens, so text compresses into fewer tokens per sentence, but every added token adds a row to the embedding matrix and to the output layer, spending parameters on lookup rather than reasoning. Domain also matters: a tokenizer trained mostly on English prose fragments technical vocabulary heavily, which is why models can differ noticeably in how many tokens the same firmware log costs. And because merges are frozen at training time, a model and its tokenizer age together: a vocabulary that never saw a new protocol's jargon will forever spell it out one awkward fragment at a time.
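
The parameter cost of vocabulary size is simple arithmetic: each token needs one row of width d_model in the embedding matrix, and another in the output layer unless the two share weights. The sketch below assumes that standard layout; the function name vocab_params and the example sizes (a hidden width of 4096, vocabularies of 32,000 and 128,000) are illustrative assumptions, not figures for any particular model.

```python
def vocab_params(vocab_size, d_model, tied=False):
    """Parameters spent on the token embedding and output projection.

    tied=True assumes the output layer reuses the embedding matrix.
    """
    embedding = vocab_size * d_model
    output = 0 if tied else vocab_size * d_model
    return embedding + output

# Hypothetical sizes for illustration only.
print(vocab_params(32_000, 4096))               # 262144000
print(vocab_params(128_000, 4096))              # 1048576000
print(vocab_params(128_000, 4096, tied=True))   # 524288000
```

Under these assumptions, quadrupling the vocabulary quadruples the lookup parameters, which is a larger share of the total in a small local model than in a very large one.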

Tokens are the unit of everything downstream: context window limits, inference latency, memory footprint, and, for API users, billing. Because frequent words collapse to one token and rare or non-English text fragments heavily, the same semantic content can cost very different token counts depending on language and domain; multilingual and code-heavy workloads are systematically more expensive per character. When you run models locally, tokenizer efficiency translates directly into how much text actually fits in the context window your hardware can support. It also explains classic LLM blind spots: models struggle with character-level tasks like counting letters because they never see letters, only merged fragments. The merge rules and symbol table together define the tokenizer vocabulary, so a model and its tokenizer are inseparable, and mismatching them produces gibberish. Upstream, tokenization is one stage in the data pipeline (ETL) that turns raw text into training-ready tensors; downstream, every token becomes an embedding the network can actually compute on.
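
A rough way to reason about window budgets is to count tokens with the tokenizer that belongs to the model and compare against the context limit. The sketch below is an illustration only: token_count and fits are hypothetical helpers, and byte_fallback stands in for a real tokenizer by emitting one token per UTF-8 byte, which is the worst case for byte-level BPE (no merges applied at all).

```python
def token_count(text, tokenize):
    """Number of tokens `tokenize` produces for `text`."""
    return len(tokenize(text))

def fits(text, tokenize, context_window, reserved_for_output=0):
    """True if the prompt plus the reserved output budget fits in the window."""
    return token_count(text, tokenize) + reserved_for_output <= context_window

def byte_fallback(text):
    """Stand-in tokenizer (assumption): one token per UTF-8 byte, no merges."""
    return list(text.encode("utf-8"))

# Same number of characters, different token cost.
print(token_count("hello", byte_fallback))   # 5
print(token_count("héllo", byte_fallback))   # 6

# Reserving room for the reply shrinks the space left for the prompt.
print(fits("hello", byte_fallback, context_window=5))                         # True
print(fits("hello", byte_fallback, context_window=5, reserved_for_output=1))  # False
```

Swapping byte_fallback for the model's actual tokenizer gives real counts; the point of the sketch is only that the budget is measured in tokens, not characters.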
