A from-scratch implementation of Byte-Pair Encoding (BPE), the tokenization algorithm used by GPT-2 and other large language models.
Most people just `from transformers import AutoTokenizer`. This project demonstrates a deep understanding of how tokenization actually works by implementing BPE from first principles.
What you'll find here:
- Complete BPE training algorithm
- Encode/decode functionality
- HTML visualization of token splits
- 28 unit tests verifying correctness
- Zero external dependencies (pure Python)
BPE converts text to numbers through learned compression:
- Start with text as individual bytes (256 base tokens)
- Count all adjacent pairs
- Merge the most frequent pair into a new token
- Repeat until reaching target vocabulary size
- Store merge rules in order (this IS your tokenizer)
The key insight: common sequences get compressed into single tokens, while rare text stays split into smaller pieces.
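A minimal sketch of that training loop, starting from raw UTF-8 bytes (the function and variable names here are illustrative only, not this project's actual API):

```python
from collections import Counter

def train_bpe(text, vocab_size=300):
    """Toy BPE trainer: an illustrative sketch, not the project's implementation."""
    ids = list(text.encode("utf-8"))         # start with raw bytes (256 base tokens)
    merges = {}                               # (left, right) -> new token id, in learned order
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))    # count all adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent pair wins
        merges[(a, b)] = next_id
        # replace every occurrence of (a, b) with the new token
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges
```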
```python
from bpe_tokenizer import BPETokenizer

# Train on your corpus
tokenizer = BPETokenizer()
tokenizer.train(training_text, vocab_size=1000, verbose=True)

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
# [372, 45, 289, ...]

# Decode back to text
text = tokenizer.decode(tokens)
# "Hello, world!"

# See the token splits
tokenizer.get_token_strings("Hello, world!")
# ['Hello', ',', ' world', '!']
```

The tokenizer includes an HTML visualization showing exactly how text gets split:

```python
html = tokenizer.visualize_html("The transformer architecture")
```

Each token is highlighted with a different color, making it easy to see where splits occur.
bpe-tokenizer-scratch/
├── bpe_tokenizer.py # Core implementation (~350 lines)
├── test_bpe_tokenizer.py # 28 comprehensive tests
├── demo.py # Interactive demonstration
├── tokenizer_visualization.html
└── README.md
```bash
python3 test_bpe_tokenizer.py
```

Tests cover:
- Encode/decode roundtrip for ASCII, Unicode, emoji (sketched below)
- Correct merge rule learning
- Compression behavior verification
- Edge cases (empty strings, unseen characters)
- Save/load functionality
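The roundtrip property is easy to check yourself. A small sketch using only the `train`/`encode`/`decode` calls shown above (the helper name, vocabulary size, and sample strings are made up for illustration):

```python
from bpe_tokenizer import BPETokenizer

def check_roundtrip(corpus, samples):
    # Train a small tokenizer, then verify decode(encode(s)) == s for each sample.
    tok = BPETokenizer()
    tok.train(corpus, vocab_size=300)
    for s in samples:
        assert tok.decode(tok.encode(s)) == s, f"roundtrip failed for {s!r}"

# Byte-level base tokens mean even unseen characters and emoji roundtrip cleanly.
check_roundtrip("hello hello world", ["hello", "héllo 🌍", ""])
```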
Every possible byte value (0-255) gets a token. This guarantees any input can be encoded, even binary data or unknown Unicode.
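In code terms, the base vocabulary is just the 256 possible byte values (a sketch; `base_vocab` is an illustrative name, not the project's internal structure):

```python
# One base token per possible byte value, so nothing is ever out-of-vocabulary.
base_vocab = {i: bytes([i]) for i in range(256)}

# Any string reduces to these base tokens before any merges are applied.
ids = list("naïve 🌍".encode("utf-8"))
assert all(0 <= b < 256 for b in ids)
print(ids[:4])  # [110, 97, 195, 175]
```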
Merge rules must be applied in the order they were learned. If we learned (a,b)->ab before (ab,c)->abc, we must merge a+b first.
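Continuing the toy sketch from earlier, encoding can simply replay the merge rules in the order they were learned (again, illustrative names rather than the project's code):

```python
def encode_with_merges(text, merges):
    """Apply merge rules in the exact order they were learned."""
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in merges.items():     # dict insertion order == learned order
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```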
- Common text: "the" → 1 token
- Rare text: "xyzzy" → 5 tokens
This is why LLMs are "cheaper" on common language patterns.
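You can see this directly with the encoder. Here `training_text` stands for an English corpus, as in the usage example above; exact token counts depend on what the tokenizer was trained on:

```python
from bpe_tokenizer import BPETokenizer

tok = BPETokenizer()
tok.train(training_text, vocab_size=1000)

print(len(tok.encode("the")))     # common sequence: few tokens, often just one
print(len(tok.encode("xyzzy")))   # rare sequence: closer to one token per byte
```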
Understanding tokenization explains LLM behavior:
- Why counting letters is hard: "strawberry" might split as ["str", "aw", "berry"]
- Why non-English costs more tokens: Less training data = fewer merges learned
- Why context length is in tokens, not words: Some words are 1 token, some are 5+
Real tokenizers (tiktoken, SentencePiece) add:
- Pre-tokenization (split on whitespace first; sketched below)
- Regex patterns for numbers/punctuation
- Special tokens (`<pad>`, `<eos>`, etc.)
- C/Rust implementations for speed
But the core algorithm? Exactly what's implemented here.
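To give a rough idea of what pre-tokenization means, here is a deliberately simplified split pattern (not GPT-2's actual regex) that keeps words, numbers, and punctuation apart, so merges are learned within those chunks rather than across them:

```python
import re

# Simplified pre-tokenization: split text into word, number, punctuation,
# and whitespace chunks before BPE is applied to each chunk separately.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

print(PRETOKENIZE.findall("Hello, world! 42 tokens."))
# ['Hello', ',', ' world', '!', ' 42', ' tokens', '.']
```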
- micrograd-numpy - Autograd engine from scratch
Built as part of a series implementing ML fundamentals from scratch.