A from-scratch implementation of Byte-Pair Encoding (BPE), the tokenization algorithm used by GPT-2 and other large language models.
Most people just `from transformers import AutoTokenizer`. This project demonstrates a deep understanding of how tokenization actually works by implementing BPE from first principles.
What you'll find here:
- Complete BPE training algorithm
- Encode/decode functionality
- HTML visualization of token splits
- 28 unit tests verifying correctness
- Zero external dependencies (pure Python)
BPE converts text to numbers through learned compression:
- Start with text as individual bytes (256 base tokens)
- Count all adjacent pairs
- Merge the most frequent pair into a new token
- Repeat until reaching target vocabulary size
- Store merge rules in order (this IS your tokenizer)
The key insight: common sequences get compressed into single tokens, while rare text stays split into smaller pieces.
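A minimal sketch of that training loop, starting from raw UTF-8 bytes (the function and variable names here are illustrative only, not this project's actual API):

```python
from collections import Counter

def train_bpe(text, vocab_size=300):
    """Toy BPE trainer: an illustrative sketch, not the project's implementation."""
    ids = list(text.encode("utf-8"))         # start with raw bytes (256 base tokens)
    merges = {}                               # (left, right) -> new token id, in learned order
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))    # count all adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent pair wins
        merges[(a, b)] = next_id
        # replace every occurrence of (a, b) with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges
```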
```python
from bpe_tokenizer import BPETokenizer

# Train on your corpus
tokenizer = BPETokenizer()
tokenizer.train(training_text, vocab_size=1000, verbose=True)

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
# [372, 45, 289, ...]

# Decode back to text
text = tokenizer.decode(tokens)
# "Hello, world!"

# See the token splits
tokenizer.get_token_strings("Hello, world!")
# ['Hello', ',', ' world', '!']
```

The tokenizer includes an HTML visualization showing exactly how text gets split:

```python
html = tokenizer.visualize_html("The transformer architecture")
```

Each token is highlighted with a different color, making it easy to see where splits occur.
bpe-tokenizer-scratch/
├── bpe_tokenizer.py # Core implementation (~350 lines)
├── test_bpe_tokenizer.py # 28 comprehensive tests
├── demo.py # Interactive demonstration
├── tokenizer_visualization.html
└── README.md
```bash
python3 test_bpe_tokenizer.py
```

Tests cover:
- Encode/decode roundtrip for ASCII, Unicode, emoji (sketched below)
- Correct merge rule learning
- Compression behavior verification
- Edge cases (empty strings, unseen characters)
- Save/load functionality
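The roundtrip property is easy to check yourself. A small sketch using only the `train`/`encode`/`decode` calls shown above (the helper name, vocabulary size, and sample strings are made up for illustration):

```python
from bpe_tokenizer import BPETokenizer

def check_roundtrip(corpus, samples):
    # Train a small tokenizer, then verify decode(encode(s)) == s for each sample.
    tok = BPETokenizer()
    tok.train(corpus, vocab_size=300)
    for s in samples:
        assert tok.decode(tok.encode(s)) == s, f"roundtrip failed for {s!r}"

# Byte-level base tokens mean even unseen characters and emoji roundtrip cleanly.
check_roundtrip("hello hello world", ["hello", "héllo 🌍", ""])
```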
Every possible byte value (0-255) gets a token. This guarantees any input can be encoded, even binary data or unknown Unicode.
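In code terms, the base vocabulary is just the 256 possible byte values (a sketch; `base_vocab` is an illustrative name, not the project's internal structure):

```python
# One base token per possible byte value, so nothing is ever out-of-vocabulary.
base_vocab = {i: bytes([i]) for i in range(256)}

# Any string reduces to these base tokens before any merges are applied.
ids = list("naïve 🌍".encode("utf-8"))
assert all(0 <= b < 256 for b in ids)
print(ids[:4])  # [110, 97, 195, 175]
```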
Merge rules must be applied in the order they were learned. If we learned (a,b)->ab before (ab,c)->abc, we must merge a+b first.
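Continuing the toy sketch from earlier, encoding can simply replay the merge rules in the order they were learned (again, illustrative names rather than the project's code):

```python
def encode_with_merges(text, merges):
    """Apply merge rules in the exact order they were learned."""
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in merges.items():     # dict insertion order == learned order
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```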
- Common text: "the" → 1 token
- Rare text: "xyzzy" → 5 tokens
This is why LLMs are "cheaper" on common language patterns.
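You can see this directly with the encoder. Here `training_text` stands for an English corpus, as in the usage example above; exact token counts depend on what the tokenizer was trained on:

```python
from bpe_tokenizer import BPETokenizer

tok = BPETokenizer()
tok.train(training_text, vocab_size=1000)

print(len(tok.encode("the")))     # common sequence: few tokens, often just one
print(len(tok.encode("xyzzy")))   # rare sequence: closer to one token per byte
```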
Understanding tokenization explains LLM behavior:
- Why counting letters is hard: "strawberry" might split as ["str", "aw", "berry"]
- Why non-English costs more tokens: Less training data = fewer merges learned
- Why context length is in tokens, not words: Some words are 1 token, some are 5+
Real tokenizers (tiktoken, SentencePiece) add:
- Pre-tokenization (split on whitespace first; sketched below)
- Regex patterns for numbers/punctuation
- Special tokens (`<pad>`, `<eos>`, etc.)
- C/Rust implementations for speed
But the core algorithm? Exactly what's implemented here.
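To give a rough idea of what pre-tokenization means, here is a deliberately simplified split pattern (not GPT-2's actual regex) that keeps words, numbers, and punctuation apart, so merges are learned within those chunks rather than across them:

```python
import re

# Simplified pre-tokenization: split text into word, number, punctuation,
# and whitespace chunks before BPE is applied to each chunk separately.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

print(PRETOKENIZE.findall("Hello, world! 42 tokens."))
# ['Hello', ',', ' world', '!', ' 42', ' tokens', '.']
```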
- micrograd-numpy - Autograd engine from scratch
Built as part of a series implementing ML fundamentals from scratch.