
BPE Tokenizer from Scratch

A from-scratch implementation of Byte-Pair Encoding (BPE), the tokenization algorithm used by GPT-2 and other large language models.


Why Build This?

Most people just run from transformers import AutoTokenizer and move on. This project demonstrates how tokenization actually works by implementing BPE from first principles.

What you'll find here:

  • Complete BPE training algorithm
  • Encode/decode functionality
  • HTML visualization of token splits
  • 28 unit tests verifying correctness
  • Zero external dependencies (pure Python)

The Algorithm

BPE converts text to numbers through learned compression:

  1. Start with text as individual bytes (256 base tokens)
  2. Count all adjacent pairs
  3. Merge the most frequent pair into a new token
  4. Repeat until reaching target vocabulary size
  5. Store merge rules in order (this IS your tokenizer)

The key insight: common sequences get compressed into single tokens, while rare text stays split into smaller pieces.
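
To make those five steps concrete, here is a minimal training sketch. The function and variable names are illustrative only, not the API of bpe_tokenizer.py:

# Minimal BPE training sketch (illustrative names, not the project's API).
from collections import Counter

def train_bpe(text: str, vocab_size: int = 300):
    # Step 1: start from raw bytes (256 base tokens)
    ids = list(text.encode("utf-8"))
    merges = {}                      # (left, right) -> new token id, in learned order
    next_id = 256
    while next_id < vocab_size:
        # Step 2: count adjacent pairs
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a new token
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges[(a, b)] = next_id     # Step 5: record the rule in order
        next_id += 1                 # Step 4: repeat until vocab_size
    return merges

merges = train_bpe("low lower lowest low low", vocab_size=260)
print(merges)   # e.g. {(108, 111): 256, (256, 119): 257, ...}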

Quick Start

from bpe_tokenizer import BPETokenizer

# Train on your corpus
tokenizer = BPETokenizer()
tokenizer.train(training_text, vocab_size=1000, verbose=True)

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
# [372, 45, 289, ...]

# Decode back to text
text = tokenizer.decode(tokens)
# "Hello, world!"

# See the token splits
tokenizer.get_token_strings("Hello, world!")
# ['Hello', ',', ' world', '!']

Visualization

The tokenizer includes an HTML visualization showing exactly how text gets split:

html = tokenizer.visualize_html("The transformer architecture")

Each token is highlighted with a different color, making it easy to see where splits occur.
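
As a rough idea of how such a view can be produced (an illustrative sketch, not the actual visualize_html code), each token string becomes a colored span:

# Illustrative sketch: render token strings as colored HTML spans.
import html

PALETTE = ["#ffd6d6", "#d6ffd6", "#d6d6ff", "#fff3c4", "#e8d6ff"]

def tokens_to_html(token_strings):
    spans = []
    for i, tok in enumerate(token_strings):
        color = PALETTE[i % len(PALETTE)]   # cycle colors so neighbors differ
        spans.append(f'<span style="background:{color}">{html.escape(tok)}</span>')
    return "".join(spans)

print(tokens_to_html(["Hello", ",", " world", "!"]))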

Project Structure

bpe-tokenizer-scratch/
├── bpe_tokenizer.py          # Core implementation (~350 lines)
├── test_bpe_tokenizer.py     # 28 comprehensive tests
├── demo.py                   # Interactive demonstration
├── tokenizer_visualization.html
└── README.md

Running Tests

python3 test_bpe_tokenizer.py

Tests cover:

  • Encode/decode roundtrip for ASCII, Unicode, emoji
  • Correct merge rule learning
  • Compression behavior verification
  • Edge cases (empty strings, unseen characters)
  • Save/load functionality
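
For example, a roundtrip test might look like the following sketch, assuming the BPETokenizer API shown in the Quick Start:

# Sketch of a roundtrip test, assuming the Quick Start API above.
import unittest
from bpe_tokenizer import BPETokenizer

class TestRoundtrip(unittest.TestCase):
    def test_encode_decode_roundtrip(self):
        tokenizer = BPETokenizer()
        tokenizer.train("hello world, hello tokens", vocab_size=300)
        for text in ["hello world", "héllo wörld", "👋 emoji"]:
            # decode(encode(text)) must reproduce the input exactly
            self.assertEqual(tokenizer.decode(tokenizer.encode(text)), text)

if __name__ == "__main__":
    unittest.main()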

Key Implementation Details

Why 256 Base Tokens?

Every possible byte value (0-255) gets a token. This guarantees any input can be encoded, even binary data or characters never seen during training.
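
A quick illustration in plain Python, using UTF-8 to map any string onto those 256 base tokens:

# Every string becomes bytes 0-255, so the base vocabulary covers all input.
print(list("hi".encode("utf-8")))      # [104, 105]
print(list("é".encode("utf-8")))       # [195, 169]  (two bytes -> two base tokens)
print(list("👋".encode("utf-8")))      # [240, 159, 145, 139]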

Why Order Matters

Merge rules must be applied in the order they were learned. If we learned (a,b)->ab before (ab,c)->abc, we must merge a+b first.
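
A small sketch (with illustrative helper names) of applying merges in learned order:

# Apply learned merges in order; earlier rules must run first.
def apply_merges(ids, merges):          # merges: list of ((a, b), new_id)
    for (a, b), new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# (97, 98) -> 256 must run before (256, 99) -> 257,
# otherwise the pair (256, 99) never appears in the sequence.
print(apply_merges(list(b"abc"), [((97, 98), 256), ((256, 99), 257)]))  # [257]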

Compression Tradeoff

  • Common text: "the" → 1 token
  • Rare text: "xyzzy" → 5 tokens

This is why LLMs are "cheaper" on common language patterns.

What This Teaches You

Understanding tokenization explains LLM behavior:

  • Why counting letters is hard: "strawberry" might split as ["str", "aw", "berry"]
  • Why non-English costs more tokens: Less training data = fewer merges learned
  • Why context length is in tokens, not words: Some words are 1 token, some are 5+

How This Differs from Production

Real tokenizers (tiktoken, SentencePiece) add:

  • Pre-tokenization (split on whitespace first)
  • Regex patterns for numbers/punctuation
  • Special tokens (<pad>, <eos>, etc.)
  • C/Rust implementations for speed

But the core algorithm? Exactly what's implemented here.
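
As a simplified illustration of pre-tokenization (the real GPT-2 pattern is a more elaborate regex that handles Unicode letter classes and contractions), text can be chunked first, with BPE then running inside each chunk so merges never cross chunk boundaries:

# Simplified ASCII-only pre-tokenization sketch (not the actual GPT-2 regex).
import re

CHUNK = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

def pre_tokenize(text):
    return CHUNK.findall(text)

print(pre_tokenize("Hello, world! 42 tokens."))
# ['Hello', ',', ' world', '!', ' 42', ' tokens', '.']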

Related Projects


Built as part of a series implementing ML fundamentals from scratch.
