This repository contains the code for adapting English-centric LLMs to new languages through vocabulary extension, BXOD embedding initialization, continual pretraining, and supervised fine-tuning.
The full pipeline consists of 6 stages:
1. Data Processing: Prepare raw text corpora for tokenizer training
2. Tokenizer Building: Train a BPE tokenizer for the target language
3. Embedding Init: BXOD initialization for new language tokens
4. Pretraining: Continual pretraining on target language + English
5. Fine-tuning: Supervised instruction-following fine-tuning
6. Evaluation: Benchmark evaluation (translation, NLU, etc.)
BXOD uses cross-lingual semantic similarity (via LaBSE) to initialize embeddings for new target-language tokens. For each new token:
- Semantic Anchoring (BXOD): If the token appears in a target-language frequency dictionary:
  - Find the closest target-language words using LaBSE cosine similarity
  - For each neighbor, find the closest English translations
  - Compute combined scores (target_similarity × english_similarity) and apply softmax
  - Compute the weighted average of the English token embeddings from the base model
- Fallback (NACHOS): If the token is not in the dictionary (rare subwords):
  - Encode the token using the base model's tokenizer
  - Average the resulting subtoken embeddings
This approach ensures that ~80% of tokens (by frequency) receive semantically grounded initializations.
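The two cases above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation: the helper names `bxod_init` and `fallback_init`, the toy embedding matrix, and the hard-coded similarity scores are all hypothetical. In the real pipeline the similarities would come from LaBSE and the embeddings from the base model.

```python
import numpy as np

def softmax(scores, temperature=0.05):
    """Temperature-scaled softmax; a lower temperature gives sharper weights."""
    z = np.asarray(scores, dtype=np.float64) / temperature
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def bxod_init(candidates, embeddings, temperature=0.05):
    """Semantic anchoring for one new token (hypothetical helper).

    candidates: list of (target_similarity, english_similarity, english_token_ids),
        one entry per (neighbor, English translation) pair.
    embeddings: base-model input embedding matrix, shape (vocab_size, dim).
    """
    scores = [t_sim * e_sim for t_sim, e_sim, _ in candidates]
    weights = softmax(scores, temperature)
    # Assumption: an English word that spans several base-model tokens is
    # represented by the mean of those token embeddings.
    vectors = np.stack([embeddings[ids].mean(axis=0) for _, _, ids in candidates])
    return weights @ vectors

def fallback_init(subtoken_ids, embeddings):
    """Fallback: mean of the base-model subtoken embeddings."""
    return embeddings[subtoken_ids].mean(axis=0)

# Toy data: 10 base-model tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
candidates = [(0.9, 0.9, [1, 2]), (0.8, 0.7, [3]), (0.6, 0.5, [4])]

anchored = bxod_init(candidates, emb)      # token found in the dictionary
fallback = fallback_init([5, 6, 7], emb)   # rare subword, not in the dictionary
```

Both functions return a vector with the base model's embedding dimension, so the result can be written directly into the row reserved for the new token.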
The codebase supports two base model families with key differences:
| Aspect | Llama Track | Gemma Track |
|---|---|---|
| Base model | `meta-llama/Llama-3.2-3B-Instruct` | `google/gemma-2-2b-it` |
| Vocab strategy | Replace non-English tokens (fixed vocab size) | Add new tokens (expand vocab size) |
| Pre-tokenizer | ByteLevel | WhitespaceSplit + Punctuation |
| Block size | 2048 | 512 |
| Attention | Flash Attention 2 | Eager |
| Chat format | `<\|start_header_id\|>...<\|eot_id\|>` | `<start_of_turn>...<end_of_turn>` |
```bash
pip install torch transformers datasets tokenizers accelerate
pip install sentence-transformers wordfreq  # for BXOD embedding initialization
pip install sacrebleu unbabel-comet         # for evaluation
pip install wandb                           # for training logging
```

`build_tokenizer.py` builds a BPE tokenizer from text corpora. It handles data downloading, preprocessing, and tokenizer training in one script.
Uzbek (Llama track - ByteLevel pre-tokenizer):
```bash
python build_tokenizer.py \
    --lang uzbek \
    --strategy bytelevel \
    --vocab_size 35185 \
    --output_tokenizer uzbek_tokenizer_llama.json \
    --data_sources \
        hf:yakhyo/uz-wiki:train \
    --text_column text \
    --apostrophe_mode replace
```

Uzbek (Gemma track - WhitespaceSplit pre-tokenizer with ▁ prefix):
```bash
python build_tokenizer.py \
    --lang uzbek \
    --strategy whitespace \
    --vocab_size 40000 \
    --output_tokenizer uzbek_tokenizer_gemma.json \
    --data_sources \
        hf:yakhyo/uz-wiki:train \
    --text_column text \
    --apostrophe_mode replace \
    --normalizer nfd
```

Korean (Gemma track):
```bash
python build_tokenizer.py \
    --lang korean \
    --strategy whitespace \
    --vocab_size 18833 \
    --output_tokenizer korean_tokenizer.json \
    --data_sources \
        local:processed_dataset/korean_webtext.txt \
    --normalizer nfkc
```

Spanish (Gemma track):
```bash
# The dataset slice is quoted so the shell does not treat [...] as a glob pattern
python build_tokenizer.py \
    --lang spanish \
    --strategy whitespace \
    --vocab_size 36000 \
    --output_tokenizer spanish_tokenizer.json \
    --data_sources \
        "hf:wikimedia/wikipedia:20231101.es:train[:200000]" \
    --text_column text \
    --normalizer nfd_strip_accents
```

| Language | Data Sources | Preprocessing |
|---|---|---|
| Uzbek | yakhyo/uz-wiki, tahrirchi/uz-crawl (news), tahrirchi/uz-books (13 GB books), community corpora | Replace apostrophes (ʻ→APST), filter Cyrillic, keep Latin-only words |
| Korean | KOREAN-WEBTEXT, community corpora | NFKC normalization, process JSONL (instruction/output/input fields) |
| Spanish | Spanish Wikipedia (200K articles) | NFD normalization, optional accent stripping |
| Any new language | Wikipedia dump, news corpora, web text | Normalize quotes, filter scripts as needed |
For a new language: Prepare a plain text file (one document per line) or use HuggingFace datasets. Choose bytelevel strategy for Llama models or whitespace for Gemma models. Set vocabulary size based on your corpus (typically 18K-43K).
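The Uzbek preprocessing listed in the table (apostrophe replacement, Cyrillic filtering, Latin-only words) could look roughly like the sketch below. The function name `preprocess_line`, the set of apostrophe characters, and the policy of dropping any line that contains Cyrillic are assumptions for illustration; the actual behavior of `--apostrophe_mode replace` in `build_tokenizer.py` may differ.

```python
import re

# Apostrophe-like characters that occur in Uzbek Latin text (assumed set).
APOSTROPHE_RE = re.compile("[\u02bb\u02bc\u2018\u2019'`]")
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")
LATIN_WORD_RE = re.compile(r"^[A-Za-z]+$")
PUNCTUATION = ".,;:!?()[]\"«»"

def preprocess_line(line, placeholder="APST"):
    """Return the cleaned line, or None if the line contains Cyrillic text."""
    if CYRILLIC_RE.search(line):
        return None  # assumed policy: drop Cyrillic-script lines entirely
    # Replace apostrophes with a Latin placeholder so words stay in one piece.
    line = APOSTROPHE_RE.sub(placeholder, line)
    words = (w.strip(PUNCTUATION) for w in line.split())
    # Keep Latin-only words; numbers and mixed-script tokens are discarded.
    return " ".join(w for w in words if LATIN_WORD_RE.match(w))

cleaned = preprocess_line("O\u02bbzbek tili 2024 yilda.")
skipped = preprocess_line("Ўзбек тили")
```

Because the placeholder consists of Latin letters, words containing it still pass the Latin-only filter; the placeholder would need to be mapped back to an apostrophe when decoding model output.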
BXOD requires a frequency dictionary for the target language (e.g., uz_50k.txt, ko_50k.txt). This is a plain text file with one word per line, ordered by frequency (most frequent first). You can generate this from your corpus or use the wordfreq library:
```python
from wordfreq import top_n_list

# The 50,000 most frequent Korean words, most frequent first
words = top_n_list('ko', 50000)
with open('ko_50k.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(words))
```

For languages not covered by wordfreq (e.g., Uzbek), extract word frequencies from your corpus:
```python
from collections import Counter

# Count whitespace-separated words across the whole corpus
word_counts = Counter()
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        word_counts.update(line.strip().split())

# Keep the 50,000 most frequent words, most frequent first
top_words = [w for w, _ in word_counts.most_common(50000)]
with open('uz_50k.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(top_words))
```

Different files handle different language/model combinations because the vocabulary integration strategy differs between Llama (replace tokens) and Gemma (add tokens).
| Parameter | Default | Description |
|---|---|---|
| `--top_n_neighbors` | 3 | Number of target-language neighbors to find via LaBSE |
| `--top_n_candidates` | 3 | Number of English translations per neighbor |
| `--softmax_temperature` | 0.05 | Temperature for sharpening translation weights |
| `--bank_size` | 50000 | Size of reference vocabulary banks |
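To see what "sharpening" means for `--softmax_temperature`, the following sketch compares the weights produced at the default 0.05 with those at 1.0. The three combined scores are made-up values, and the `softmax` helper is a plain reference implementation rather than the repository's code.

```python
import math

def softmax(scores, temperature):
    """Temperature-scaled softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical combined scores (target_similarity x english_similarity)
combined = [0.81, 0.72, 0.60]

sharp = softmax(combined, temperature=0.05)
flat = softmax(combined, temperature=1.0)

print([round(w, 3) for w in sharp])
print([round(w, 3) for w in flat])
```

At 0.05 the best-scoring candidate takes most of the weight, while at 1.0 the three weights are close to uniform, so the low default keeps the initialization close to the strongest translation.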
Replaces non-English tokens in Llama's vocabulary with Uzbek tokens.
```bash
python uzbek_llama_init.py \
    --base_model meta-llama/Llama-3.2-3B-Instruct \
    --tokenizer_file uzbek_tokenizer_llama.json \
    --uzbek_words_file uz_50k.txt \
    --output_dir ./models/Llama-3.2-3B-uz-bxod \
    --top_n_neighbors 3 \
    --top_n_candidates 3 \
    --softmax_temperature 0.05 \
    --bank_size 50000
```

Adds new Korean tokens to Gemma's vocabulary.
```bash
python korean_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --korean_words_file ko_50k.txt \
    --output_dir ./models/gemma-korean-bxod \
    --max_new_tokens 20000
```

The word list file (`ko_50k.txt`) should contain one Korean word per line, ordered by frequency.
Adds Spanish tokens from the BPE tokenizer with integrated merge rules.
```bash
python spanish_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --tokenizer_file spanish_tokenizer.json \
    --target_words_file es_50k.txt \
    --output_dir ./models/gemma-spanish-bxod \
    --max_new_tokens 36000
```

Adds Uzbek tokens from the BPE tokenizer with integrated merge rules.
```bash
python uzbek_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --tokenizer_file uzbek_tokenizer_gemma.json \
    --uzbek_words_file uz_50k.txt \
    --output_dir ./models/gemma-uzbek-bxod \
    --max_new_tokens 36000
```

Choose the appropriate init file based on your base model:
- Llama models: Copy `uzbek_llama_init.py`. The Llama approach replaces non-ASCII tokens in the vocabulary, so the total vocab size stays fixed.
- Gemma models (with word list): Copy `korean_gemma_init.py`. Adds tokens from a frequency word list.
- Gemma models (with BPE tokenizer): Copy `spanish_gemma_init.py` or `uzbek_gemma_init.py`. Adds tokens from a custom BPE tokenizer and integrates its merge rules.
Key changes needed:
- Frequency dictionary: Prepare a target-language frequency word list (see above)
- Token cleaning: Adjust character replacements if your language has special normalization (e.g., Uzbek apostrophe handling)
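The difference between the two vocabulary strategies can be illustrated on a toy vocabulary. This is a simplified sketch: `replace_strategy` and `add_strategy` are hypothetical names, and the vocabulary is a plain dict of readable strings. Real byte-level vocabularies typically store tokens in an encoded form, so the init scripts need a decoding step that is skipped here.

```python
def is_ascii(token):
    """True if every character of the token is in the ASCII range."""
    return all(ord(ch) < 128 for ch in token)

def replace_strategy(vocab, new_tokens):
    """Llama-style: reuse the ids of non-ASCII tokens, vocab size unchanged."""
    new_tokens = [t for t in new_tokens if t not in vocab]
    replaceable = sorted((idx, tok) for tok, idx in vocab.items() if not is_ascii(tok))
    updated = dict(vocab)
    # zip stops at the shorter list, so leftover slots keep their old tokens.
    for (idx, old_token), new_token in zip(replaceable, new_tokens):
        del updated[old_token]
        updated[new_token] = idx
    return updated

def add_strategy(vocab, new_tokens):
    """Gemma-style: append new tokens, vocab (and embedding matrix) grows."""
    updated = dict(vocab)
    for token in new_tokens:
        if token not in updated:
            updated[token] = len(updated)  # assumes ids are contiguous from 0
    return updated

vocab = {"the": 0, "cat": 1, "猫": 2, "日本": 3}
new_tokens = ["kitob", "maktab"]

replaced = replace_strategy(vocab, new_tokens)
added = add_strategy(vocab, new_tokens)
```

With replacement the embedding matrix keeps its shape and only the reused rows are re-initialized; with addition the matrix has to be resized before the new rows are filled in.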
```bash
# Uzbek
# Dataset slices are quoted so the shell does not treat [...] as a glob pattern
python pretrain_llama.py \
    --base_model ./models/Llama-3.2-3B-uz-bxod \
    --output_dir ./models/Llama-3.2-3B-uz-bxod-pretrained \
    --en_dataset "wikipedia:20220301.en:train[:340000]" \
    --target_datasets tahrirchi/uz-crawl:news yakhyo/uz-wiki:train \
    --max_steps 1000 \
    --block_size 2048 \
    --learning_rate 1e-4 \
    --grad_accum 32 \
    --filter_latin \
    --uzbek_apostrophe \
    --run_name uz_pretrain
```

```bash
# Korean
python pretrain_gemma.py \
    --base_model ./models/gemma-korean-bxod \
    --output_dir ./models/gemma-korean-bxod-pretrained \
    --en_dataset "wikipedia:20220301.en:train[:340000]" \
    --target_datasets "HAERAE-HUB/KOREAN-WEBTEXT:train[:100000]" \
    --max_steps 6000 \
    --block_size 512 \
    --learning_rate 1e-4 \
    --grad_accum 16 \
    --run_name korean_pretrain

# Spanish
python pretrain_gemma.py \
    --base_model ./models/gemma-spanish-bxod \
    --output_dir ./models/gemma-spanish-bxod-pretrained \
    --en_dataset "wikipedia:20220301.en:train[:160000]" \
    --target_datasets "wikimedia/wikipedia:20231101.es:train[:40000]" \
    --max_steps 1000 \
    --block_size 512 \
    --grad_accum 128 \
    --run_name spanish_pretrain

# Uzbek Gemma
python pretrain_gemma.py \
    --base_model ./models/gemma-uzbek-bxod \
    --output_dir ./models/gemma-uzbek-bxod-pretrained \
    --en_dataset "wikipedia:20220301.en:train[:160000]" \
    --target_datasets yakhyo/uz-wiki:train \
    --max_steps 1000 \
    --block_size 512 \
    --grad_accum 128 \
    --uzbek_apostrophe \
    --run_name uzbek_gemma_pretrain
```

| Language | English Data | Target Language Data |
|---|---|---|
| Uzbek | English Wikipedia (340K) | tahrirchi/uz-crawl (news), yakhyo/uz-wiki, tahrirchi/uz-books |
| Korean | English Wikipedia (340K) | HAERAE-HUB/KOREAN-WEBTEXT (100K) |
| Spanish | English Wikipedia (160K) | Spanish Wikipedia (40K) |
| New language | English Wikipedia | Wikipedia, news crawls, web text in target language |
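The `--block_size` flag suggests that training sequences have a fixed length. A common way to get there is to concatenate tokenized documents and cut the stream into equal blocks, sketched below. Whether `pretrain_llama.py` and `pretrain_gemma.py` pack data exactly this way is an assumption, and `pack_into_blocks` is a hypothetical helper.

```python
def pack_into_blocks(tokenized_docs, block_size, eos_id):
    """Concatenate documents (separated by EOS) and split into full blocks."""
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    # Drop the trailing remainder that does not fill a whole block.
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Toy token ids; real runs would use block_size 2048 (Llama) or 512 (Gemma).
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_into_blocks(docs, block_size=4, eos_id=0)
```

Packing avoids padding, so every position in a batch contributes to the loss, at the cost of blocks that can span document boundaries.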
```bash
# Uzbek
python finetune_llama.py \
    --base_model ./models/Llama-3.2-3B-uz-bxod-pretrained \
    --output_dir ./models/Llama-3.2-3B-uz-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets UAzimov/uzbek-instruct-llm:train behbudiy/alpaca-cleaned-uz:train \
    --target_format chat alpaca \
    --num_epochs 2 \
    --learning_rate 4e-5 \
    --grad_accum 32 \
    --uzbek_apostrophe
```

```bash
# Korean
python finetune_gemma.py \
    --base_model ./models/gemma-korean-bxod-pretrained \
    --output_dir ./models/gemma-korean-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets nlpai-lab/kullm-v2:train \
    --target_format alpaca \
    --num_epochs 2 \
    --learning_rate 5e-5 \
    --grad_accum 16

# Spanish
python finetune_gemma.py \
    --base_model ./models/gemma-spanish-bxod-pretrained \
    --output_dir ./models/gemma-spanish-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets bertin-project/alpaca-spanish:train \
    --target_format alpaca \
    --num_epochs 1 \
    --learning_rate 4e-5 \
    --grad_accum 128

# Uzbek Gemma
python finetune_gemma.py \
    --base_model ./models/gemma-uzbek-bxod-pretrained \
    --output_dir ./models/gemma-uzbek-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets \
        UAzimov/uzbek-instruct-llm:train \
        behbudiy/alpaca-cleaned-uz:train \
        behbudiy/translation-instruction:train \
    --target_format chat alpaca alpaca \
    --num_epochs 1 \
    --learning_rate 4e-5 \
    --grad_accum 128 \
    --uzbek_apostrophe \
    --use_bos
```

| Language | English | Target Language | Format |
|---|---|---|---|
| Uzbek | tatsu-lab/alpaca | UAzimov/uzbek-instruct-llm (chat), behbudiy/alpaca-cleaned-uz (alpaca), behbudiy/translation-instruction (alpaca) | Mixed |
| Korean | tatsu-lab/alpaca | nlpai-lab/kullm-v2 (alpaca) | Alpaca |
| Spanish | tatsu-lab/alpaca | bertin-project/alpaca-spanish (alpaca) | Alpaca |
| New language | tatsu-lab/alpaca | Any instruction dataset in your language | Alpaca or Chat |
Dataset format requirements:
- Alpaca format: Must have columns `instruction`, `output`, and optionally `input`
- Chat format: Must have a `messages` column with a list of `{"role": "...", "content": "..."}` dicts
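The two formats can be checked before training with a small validator such as the one below. `detect_format` is a hypothetical helper, not part of the fine-tuning scripts, and the example records are invented for illustration.

```python
def detect_format(record):
    """Return 'chat' or 'alpaca' for a dataset row, or raise ValueError."""
    if "messages" in record:
        messages = record["messages"]
        valid = isinstance(messages, list) and all(
            isinstance(m, dict) and set(m) >= {"role", "content"} for m in messages
        )
        if not valid:
            raise ValueError("'messages' must be a list of role/content dicts")
        return "chat"
    if set(record) >= {"instruction", "output"}:
        return "alpaca"  # 'input' is optional
    raise ValueError("row matches neither the Alpaca nor the Chat format")

alpaca_row = {"instruction": "Translate to Uzbek: book", "input": "", "output": "kitob"}
chat_row = {
    "messages": [
        {"role": "user", "content": "Salom!"},
        {"role": "assistant", "content": "Salom! Qanday yordam bera olaman?"},
    ]
}

formats = [detect_format(alpaca_row), detect_format(chat_row)]
```

Running such a check over a few rows of each dataset makes it easier to pick the right value for each entry of `--target_format`.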
```bash
# Uzbek Llama
python evaluate.py \
    --model_path ./models/Llama-3.2-3B-uz-bxod-tuned \
    --model_type llama \
    --lang uzbek \
    --tasks translation sentiment classification mmlu fertility \
    --batch_size 128

# Korean Gemma
python evaluate.py \
    --model_path ./models/gemma-korean-bxod-tuned \
    --model_type gemma \
    --lang korean \
    --tasks translation sentiment mmlu fertility \
    --batch_size 128

# Spanish Gemma
python evaluate.py \
    --model_path ./models/gemma-spanish-bxod-tuned \
    --model_type gemma \
    --lang spanish \
    --tasks translation sentiment mmlu fertility \
    --batch_size 512
```

| Task | Dataset | Metrics | Languages |
|---|---|---|---|
| Translation | FLORES devtest | BLEU, COMET | All (bidirectional) |
| Sentiment | Language-specific | Accuracy | Uzbek, Korean, Spanish |
| Classification | Uzbek news (10 categories) | Accuracy | Uzbek |
| MMLU | cais/mmlu | Accuracy | All (English) |
| Fertility | Language Wikipedia | Tokens/word ratio | All |
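Fertility, the tokens/word ratio in the last row, can be computed as in this sketch. It assumes that a "word" is a whitespace-separated unit and uses a toy tokenizer as a stand-in; how `evaluate.py` defines words and which texts it samples are not specified here.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words

def toy_tokenize(text):
    """Stand-in tokenizer: split every word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

sample = ["kitob oqish foydali"]
ratio = fertility(sample, toy_tokenize)  # 7 tokens over 3 words
```

With a Hugging Face tokenizer, `tokenize` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)`. A lower ratio means the extended vocabulary covers the target language with fewer tokens per word.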
- Gather data: Collect text corpora (Wikipedia, news, web text) and an instruction dataset
- Prepare frequency dictionary: Create a word frequency list for your language (see Stage 1 notes)
- Choose base model: Llama (larger context, fixed vocab) or Gemma (expandable vocab)
- Build tokenizer: Run `build_tokenizer.py` with the appropriate strategy
- Initialize embeddings: Run the appropriate BXOD init script with your frequency dictionary
- Pretrain: Run `pretrain_llama.py` or `pretrain_gemma.py`
- Fine-tune: Run `finetune_llama.py` or `finetune_gemma.py`
- Evaluate: Run `evaluate.py` with your language-specific benchmarks
```
codes/
├── README.md               # This file
├── build_tokenizer.py      # Stage 1-2: Data processing + tokenizer building
├── uzbek_llama_init.py     # Stage 3: Uzbek + Llama BXOD initialization
├── korean_gemma_init.py    # Stage 3: Korean + Gemma BXOD initialization
├── spanish_gemma_init.py   # Stage 3: Spanish + Gemma BXOD initialization
├── uzbek_gemma_init.py     # Stage 3: Uzbek + Gemma BXOD initialization
├── pretrain_llama.py       # Stage 4: Continual pretraining (Llama)
├── pretrain_gemma.py       # Stage 4: Continual pretraining (Gemma)
├── finetune_llama.py       # Stage 5: Supervised fine-tuning (Llama)
├── finetune_gemma.py       # Stage 5: Supervised fine-tuning (Gemma)
└── evaluate.py             # Stage 6: Benchmark evaluation
```
| Parameter | Pretraining | Fine-tuning |
|---|---|---|
| Precision | bfloat16 | bfloat16 |
| Optimizer | AdamW (fused) | AdamW (fused) |
| LR Schedule | Cosine | Cosine |
| Learning Rate | 1e-4 (Llama), 6e-5 (Gemma) | 4e-5 |
| Warmup | 10% | 5% |
| Max Grad Norm | 1.0 | 1.0 |
| Batch Size | 1 x 32-128 grad accum | 1 x 16-128 grad accum |
| Gradient Checkpointing | Yes (pretraining) | Optional |
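As a rough illustration of the schedule and batch settings in the table, the sketch below computes a cosine learning rate with warmup and the number of tokens per optimizer update. The table only states "Cosine" and a warmup percentage, so the linear warmup shape, the decay to zero, and the single-device setup are assumptions; `lr_at` is a hypothetical helper.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Llama pretraining settings from this README: 1000 steps, 1e-4 peak, 10% warmup
schedule = [lr_at(s, 1000, 1e-4, 0.10) for s in (0, 50, 100, 550, 1000)]

# Tokens per optimizer update on one device (assumed single GPU):
# per-device batch 1 x grad_accum 32 x block_size 2048
tokens_per_update = 1 * 32 * 2048
```

Gradient accumulation is what turns the per-device batch of 1 into an effective batch of 32 to 128 sequences, so changing `--grad_accum` or `--block_size` changes the number of tokens seen per update.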


