Definition
Byte-Pair Encoding (BPE) is the subword tokenization algorithm that underpins most modern large language models, including the GPT family. Originally a 1990s data-compression technique, it was adapted for natural language processing to solve the out-of-vocabulary problem: instead of treating every word as an atomic unit, BPE breaks rare and compound words into smaller, reusable pieces while keeping common words whole.
How the merge loop works
Training starts with a vocabulary of individual characters (or raw bytes). The algorithm scans the corpus, finds the most frequently adjacent pair of symbols, and merges that pair into a single new token. It records the merge rule and repeats — pair-counting, merging, recording — until the vocabulary reaches a target size, often 30,000 to 200,000 tokens. The ordered list of merge rules is what the tokenizer ships with; at inference time it replays those rules to segment new text deterministically.
Why miners building AI tooling should care
Token economics are real costs. Because frequent words collapse to one token and rare strings fragment into many, the same prompt can cost wildly different amounts depending on the tokenizer. Byte-level BPE (the GPT variant) operates on UTF-8 bytes, so it never fails on unseen characters — useful for code, hashes, or wallet addresses that a word-level scheme would choke on. Multilingual and domain-specific text is more expensive per character because it triggers more fragmentation.
BPE is one stage in a broader ingestion flow. The merge rules and the symbol table together define the tokenizer vocabulary, and BPE output feeds the upstream data pipeline / ETL that prepares text for training.
In Simple Terms
Byte-Pair Encoding (BPE) is the subword tokenization algorithm that underpins most modern large language models, including the GPT family. Originally a 1990s data-compression technique, it…
