Bekhzod Shukhratov bxod

BXOD: Cross-lingual Embedding Initialization for LLMs

This repository contains the code for adapting English-centric LLMs to new languages through vocabulary extension, BXOD embedding initialization, continual pretraining, and supervised fine-tuning.

Pipeline Overview

The full pipeline consists of 6 stages:

1. Data Processing     Prepare raw text corpora for tokenizer training
        |
2. Tokenizer Building  Train a BPE tokenizer for the target language
        |
3. Embedding Init      BXOD initialization for new language tokens
        |
4. Pretraining         Continual pretraining on target language + English
        |
5. Fine-tuning         Supervised instruction-following fine-tuning
        |
6. Evaluation          Benchmark evaluation (translation, NLU, etc.)

BXOD Embedding Initialization

BXOD uses cross-lingual semantic similarity (via LaBSE) to initialize embeddings for new target-language tokens. For each new token:

1. Semantic Anchoring (BXOD): If the token appears in a target-language frequency dictionary:
   - Find the closest target-language words using LaBSE cosine similarity
   - For each neighbor, find the closest English translations
   - Compute combined scores (target_similarity × english_similarity) and apply softmax
   - Compute the weighted average of the English token embeddings from the base model
2. Fallback (NACHOS): If the token is not in the dictionary (rare subwords):
   - Encode the token using the base model's tokenizer
   - Average the resulting subtoken embeddings

This approach ensures that ~80% of tokens (by frequency) receive semantically grounded initializations.
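
The two branches described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names, the toy 2-dimensional embeddings, and the choice to apply the temperature as a divisor inside the softmax are all hypothetical. In the real pipeline the similarities would come from LaBSE and the vectors from the base model's embedding matrix.

```python
import math

def softmax(scores, temperature=0.05):
    """Temperature-scaled softmax (assumed form: exp(score / T), normalized)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_average(vectors, weights):
    """Weighted sum of equal-length vectors; weights are expected to sum to 1."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]

def anchored_init(candidates, english_embeddings, temperature=0.05):
    """Semantic-anchoring branch.

    candidates: list of (english_token, target_similarity, english_similarity).
    english_embeddings: dict mapping english_token -> embedding vector.
    """
    scores = [t_sim * e_sim for _, t_sim, e_sim in candidates]
    weights = softmax(scores, temperature)
    vectors = [english_embeddings[tok] for tok, _, _ in candidates]
    return weighted_average(vectors, weights)

def fallback_init(subtoken_embeddings):
    """Fallback branch: plain mean of the base-model subtoken embeddings."""
    n = len(subtoken_embeddings)
    return weighted_average(subtoken_embeddings, [1.0 / n] * n)

# Toy data (hypothetical values, for illustration only).
emb = {"book": [1.0, 0.0], "novel": [0.0, 1.0]}
cands = [("book", 0.9, 0.9), ("novel", 0.8, 0.7)]
vec = anchored_init(cands, emb)
mean_vec = fallback_init([[1.0, 0.0], [0.0, 1.0]])
```

With a low temperature the result sits very close to the embedding of the best-scoring candidate, while the fallback weighs every subtoken equally.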

Two Architecture Tracks

The codebase supports two base model families with key differences:

| Aspect | Llama Track | Gemma Track |
|---|---|---|
| Base model | `meta-llama/Llama-3.2-3B-Instruct` | `google/gemma-2-2b-it` |
| Vocab strategy | Replace non-English tokens (fixed vocab size) | Add new tokens (expand vocab size) |
| Pre-tokenizer | ByteLevel | WhitespaceSplit + Punctuation |
| Block size | 2048 | 512 |
| Attention | Flash Attention 2 | Eager |
| Chat format | `<\|start_header_id\|>...<\|eot_id\|>` | `<start_of_turn>...<end_of_turn>` |
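
The difference between the two vocabulary strategies can be illustrated on a toy vocabulary. The sketch below is an assumption-laden illustration rather than the code in the init scripts: the helper names are hypothetical, and "non-English" is approximated here as "contains a non-ASCII character".

```python
def is_ascii(token):
    """Crude stand-in for 'English' (assumption): every character is ASCII."""
    return all(ord(ch) < 128 for ch in token)

def replace_strategy(base_vocab, new_tokens):
    """Llama-style: reuse the ids of non-ASCII tokens, so ids stay in the old range.

    If there are fewer new tokens than freed slots, the remaining slots
    are simply left unused in this sketch.
    """
    free_ids = sorted(i for tok, i in base_vocab.items() if not is_ascii(tok))
    vocab = {tok: i for tok, i in base_vocab.items() if is_ascii(tok)}
    fresh = [t for t in dict.fromkeys(new_tokens) if t not in vocab]
    for tok, idx in zip(fresh, free_ids):
        vocab[tok] = idx
    return vocab

def add_strategy(base_vocab, new_tokens):
    """Gemma-style: keep every base token and append new ids at the end."""
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab

# Toy vocabulary (hypothetical).
base = {"the": 0, "cat": 1, "猫": 2, "д": 3}
new = ["kitob", "uy"]
replaced = replace_strategy(base, new)  # same size as base
added = add_strategy(base, new)         # base size + 2
```

Under the replace strategy the embedding matrix keeps its shape and only the rows of the reused ids need new values; under the add strategy the matrix has to grow by one row per new token.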

Requirements

pip install torch transformers datasets tokenizers accelerate
pip install sentence-transformers wordfreq    # for BXOD embedding initialization
pip install sacrebleu unbabel-comet           # for evaluation
pip install wandb                             # for training logging

Stage 1 & 2: Data Processing + Tokenizer Building

`build_tokenizer.py`

Builds a BPE tokenizer from text corpora. Handles data downloading, preprocessing, and tokenizer training in one script.

Usage for Each Language

Uzbek (Llama track - ByteLevel pre-tokenizer):

python build_tokenizer.py \
    --lang uzbek \
    --strategy bytelevel \
    --vocab_size 35185 \
    --output_tokenizer uzbek_tokenizer_llama.json \
    --data_sources \
        hf:yakhyo/uz-wiki:train \
    --text_column text \
    --apostrophe_mode replace

Uzbek (Gemma track - WhitespaceSplit pre-tokenizer with ▁ prefix):

python build_tokenizer.py \
    --lang uzbek \
    --strategy whitespace \
    --vocab_size 40000 \
    --output_tokenizer uzbek_tokenizer_gemma.json \
    --data_sources \
        hf:yakhyo/uz-wiki:train \
    --text_column text \
    --apostrophe_mode replace \
    --normalizer nfd

Korean (Gemma track):

python build_tokenizer.py \
    --lang korean \
    --strategy whitespace \
    --vocab_size 18833 \
    --output_tokenizer korean_tokenizer.json \
    --data_sources \
        local:processed_dataset/korean_webtext.txt \
    --normalizer nfkc

Spanish (Gemma track):

python build_tokenizer.py \
    --lang spanish \
    --strategy whitespace \
    --vocab_size 36000 \
    --output_tokenizer spanish_tokenizer.json \
    --data_sources \
        hf:wikimedia/wikipedia:20231101.es:train[:200000] \
    --text_column text \
    --normalizer nfd_strip_accents

Data Preparation Notes

| Language | Data Sources | Preprocessing |
|---|---|---|
| Uzbek | `yakhyo/uz-wiki`, `tahrirchi/uz-crawl` (news), `tahrirchi/uz-books` (13GB books), community corpora | Replace apostrophes (ʻ→APST), filter Cyrillic, keep Latin-only words |
| Korean | KOREAN-WEBTEXT, community corpora | NFKC normalization, process JSONL (instruction/output/input fields) |
| Spanish | Spanish Wikipedia (200K articles) | NFD normalization, optional accent stripping |
| Any new language | Wikipedia dump, news corpora, web text | Normalize quotes, filter scripts as needed |

For a new language: Prepare a plain text file (one document per line) or use HuggingFace datasets. Choose bytelevel strategy for Llama models or whitespace for Gemma models. Set vocabulary size based on your corpus (typically 18K-43K).
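
As an illustration of the kind of preprocessing listed for Uzbek (apostrophe replacement and Latin-only filtering), here is a minimal sketch. The set of apostrophe characters, the punctuation stripped from word edges, and the function name are assumptions made for the example; `build_tokenizer.py` may handle these cases differently.

```python
import re

# Apostrophe-like characters that occur in Uzbek Latin text (assumed set).
APOSTROPHE_RE = re.compile("[ʻʼ‘’'`]")
# A word is kept only if it consists of basic Latin letters.
LATIN_WORD_RE = re.compile(r"^[A-Za-z]+$")

def clean_line(line, placeholder="APST"):
    """Replace apostrophes with a placeholder, then keep Latin-only words."""
    text = APOSTROPHE_RE.sub(placeholder, line)
    # Strip common punctuation from word edges before filtering.
    words = [w.strip(".,!?;:\"()") for w in text.split()]
    return " ".join(w for w in words if w and LATIN_WORD_RE.match(w))

cleaned = clean_line("Oʻzbek tili узбек 2024 kitob.")
```

Because the placeholder is made of Latin letters, words such as "Oʻzbek" survive the Latin-only filter, whereas Cyrillic words and numbers are dropped.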

Frequency Dictionary Preparation

BXOD requires a frequency dictionary for the target language (e.g., uz_50k.txt, ko_50k.txt). This is a plain text file with one word per line, ordered by frequency (most frequent first). You can generate this from your corpus or use the wordfreq library:

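from wordfreq import top_n_list
# Take the 50,000 most frequent Korean words, most frequent first.
words = top_n_list('ko', 50000)
# Write one word per line; UTF-8 is set explicitly so the file is saved the same way on every platform.
with open('ko_50k.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(words))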

For languages not covered by wordfreq (e.g., Uzbek), extract word frequencies from your corpus:

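from collections import Counter
# Count whitespace-separated words across the corpus (one document per line).
word_counts = Counter()
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        word_counts.update(line.strip().split())
# Keep the 50,000 most frequent words, most frequent first.
top_words = [w for w, _ in word_counts.most_common(50000)]
with open('uz_50k.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(top_words))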
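
One way to sanity-check a frequency list is to measure what share of corpus occurrences it covers. The helper below is a hypothetical addition, not part of the repository, and it measures word-level coverage of a corpus, which is related to but not the same as the token-level figure quoted in the BXOD overview.

```python
from collections import Counter

def frequency_coverage(word_counts, dictionary):
    """Share of all word occurrences whose word is in the dictionary."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    covered = sum(c for w, c in word_counts.items() if w in dictionary)
    return covered / total

# Toy counts (hypothetical): 10 occurrences, 8 of them covered.
counts = Counter({"va": 5, "bu": 3, "kitob": 1, "xyz": 1})
coverage = frequency_coverage(counts, {"va", "bu"})
```

A low value suggests the list is too short or was built from text that differs a lot from the training corpus.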

Stage 3: BXOD Embedding Initialization

Different files handle different language/model combinations because the vocabulary integration strategy differs between Llama (replace tokens) and Gemma (add tokens).

BXOD Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `--top_n_neighbors` | 3 | Number of target-language neighbors to find via LaBSE |
| `--top_n_candidates` | 3 | Number of English translations per neighbor |
| `--softmax_temperature` | 0.05 | Temperature for sharpening translation weights |
| `--bank_size` | 50000 | Size of reference vocabulary banks |

Uzbek + Llama (`uzbek_llama_init.py`)

Replaces non-English tokens in Llama's vocabulary with Uzbek tokens.

python uzbek_llama_init.py \
    --base_model meta-llama/Llama-3.2-3B-Instruct \
    --tokenizer_file uzbek_tokenizer_llama.json \
    --uzbek_words_file uz_50k.txt \
    --output_dir ./models/Llama-3.2-3B-uz-bxod \
    --top_n_neighbors 3 \
    --top_n_candidates 3 \
    --softmax_temperature 0.05 \
    --bank_size 50000

Korean + Gemma (`korean_gemma_init.py`)

Adds new Korean tokens to Gemma's vocabulary.

python korean_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --korean_words_file ko_50k.txt \
    --output_dir ./models/gemma-korean-bxod \
    --max_new_tokens 20000

The word list file (ko_50k.txt) should contain one Korean word per line, ordered by frequency.

Spanish + Gemma (`spanish_gemma_init.py`)

Adds Spanish tokens from the BPE tokenizer with integrated merge rules.

python spanish_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --tokenizer_file spanish_tokenizer.json \
    --target_words_file es_50k.txt \
    --output_dir ./models/gemma-spanish-bxod \
    --max_new_tokens 36000

Uzbek + Gemma (`uzbek_gemma_init.py`)

Adds Uzbek tokens from the BPE tokenizer with integrated merge rules.

python uzbek_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --tokenizer_file uzbek_tokenizer_gemma.json \
    --uzbek_words_file uz_50k.txt \
    --output_dir ./models/gemma-uzbek-bxod \
    --max_new_tokens 36000

Adapting to a New Language

Choose the appropriate init file based on your base model:

- Llama models: Copy uzbek_llama_init.py. The Llama approach replaces non-ASCII tokens in the vocabulary, so the total vocab size stays fixed.
- Gemma models (with word list): Copy korean_gemma_init.py. Adds tokens from a frequency word list.
- Gemma models (with BPE tokenizer): Copy spanish_gemma_init.py or uzbek_gemma_init.py. Adds tokens from a custom BPE tokenizer and integrates its merge rules.

Key changes needed:

- Frequency dictionary: Prepare a target-language frequency word list (see Frequency Dictionary Preparation)
- Token cleaning: Adjust character replacements if your language has special normalization (e.g., Uzbek apostrophe handling)

Stage 4: Continual Pretraining

Llama Models (`pretrain_llama.py`)

# Uzbek
python pretrain_llama.py \
    --base_model ./models/Llama-3.2-3B-uz-bxod \
    --output_dir ./models/Llama-3.2-3B-uz-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:340000] \
    --target_datasets tahrirchi/uz-crawl:news yakhyo/uz-wiki:train \
    --max_steps 1000 \
    --block_size 2048 \
    --learning_rate 1e-4 \
    --grad_accum 32 \
    --filter_latin \
    --uzbek_apostrophe \
    --run_name uz_pretrain

Gemma Models (`pretrain_gemma.py`)

# Korean
python pretrain_gemma.py \
    --base_model ./models/gemma-korean-bxod \
    --output_dir ./models/gemma-korean-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:340000] \
    --target_datasets HAERAE-HUB/KOREAN-WEBTEXT:train[:100000] \
    --max_steps 6000 \
    --block_size 512 \
    --learning_rate 1e-4 \
    --grad_accum 16 \
    --run_name korean_pretrain

# Spanish
python pretrain_gemma.py \
    --base_model ./models/gemma-spanish-bxod \
    --output_dir ./models/gemma-spanish-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:160000] \
    --target_datasets wikimedia/wikipedia:20231101.es:train[:40000] \
    --max_steps 1000 \
    --block_size 512 \
    --grad_accum 128 \
    --run_name spanish_pretrain

# Uzbek Gemma
python pretrain_gemma.py \
    --base_model ./models/gemma-uzbek-bxod \
    --output_dir ./models/gemma-uzbek-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:160000] \
    --target_datasets yakhyo/uz-wiki:train \
    --max_steps 1000 \
    --block_size 512 \
    --grad_accum 128 \
    --uzbek_apostrophe \
    --run_name uzbek_gemma_pretrain

Pretraining Data Sources by Language

| Language | English Data | Target Language Data |
|---|---|---|
| Uzbek | English Wikipedia (340K) | `tahrirchi/uz-crawl` (news), `yakhyo/uz-wiki`, `tahrirchi/uz-books` |
| Korean | English Wikipedia (340K) | `HAERAE-HUB/KOREAN-WEBTEXT` (100K) |
| Spanish | English Wikipedia (160K) | Spanish Wikipedia (40K) |
| New language | English Wikipedia | Wikipedia, news crawls, web text in target language |

Stage 5: Supervised Fine-tuning

Llama Models (`finetune_llama.py`)

# Uzbek
python finetune_llama.py \
    --base_model ./models/Llama-3.2-3B-uz-bxod-pretrained \
    --output_dir ./models/Llama-3.2-3B-uz-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets UAzimov/uzbek-instruct-llm:train behbudiy/alpaca-cleaned-uz:train \
    --target_format chat alpaca \
    --num_epochs 2 \
    --learning_rate 4e-5 \
    --grad_accum 32 \
    --uzbek_apostrophe

Gemma Models (`finetune_gemma.py`)

# Korean
python finetune_gemma.py \
    --base_model ./models/gemma-korean-bxod-pretrained \
    --output_dir ./models/gemma-korean-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets nlpai-lab/kullm-v2:train \
    --target_format alpaca \
    --num_epochs 2 \
    --learning_rate 5e-5 \
    --grad_accum 16

# Spanish
python finetune_gemma.py \
    --base_model ./models/gemma-spanish-bxod-pretrained \
    --output_dir ./models/gemma-spanish-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets bertin-project/alpaca-spanish:train \
    --target_format alpaca \
    --num_epochs 1 \
    --learning_rate 4e-5 \
    --grad_accum 128

# Uzbek Gemma
python finetune_gemma.py \
    --base_model ./models/gemma-uzbek-bxod-pretrained \
    --output_dir ./models/gemma-uzbek-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets \
        UAzimov/uzbek-instruct-llm:train \
        behbudiy/alpaca-cleaned-uz:train \
        behbudiy/translation-instruction:train \
    --target_format chat alpaca alpaca \
    --num_epochs 1 \
    --learning_rate 4e-5 \
    --grad_accum 128 \
    --uzbek_apostrophe \
    --use_bos

Fine-tuning Datasets by Language

| Language | English | Target Language | Format |
|---|---|---|---|
| Uzbek | `tatsu-lab/alpaca` | `UAzimov/uzbek-instruct-llm` (chat), `behbudiy/alpaca-cleaned-uz` (alpaca), `behbudiy/translation-instruction` (alpaca) | Mixed |
| Korean | `tatsu-lab/alpaca` | `nlpai-lab/kullm-v2` (alpaca) | Alpaca |
| Spanish | `tatsu-lab/alpaca` | `bertin-project/alpaca-spanish` (alpaca) | Alpaca |
| New language | `tatsu-lab/alpaca` | Any instruction dataset in your language | Alpaca or Chat |

Dataset format requirements:

- Alpaca format: Must have columns instruction, output, and optionally input
- Chat format: Must have a messages column with a list of {"role": "...", "content": "..."} dicts
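
The two record shapes can be made concrete with a small sketch that converts an Alpaca-style record into chat messages and checks a chat-style record. The function names, the "user"/"assistant" role labels, and the choice to join instruction and input with a blank line are assumptions for illustration; the fine-tuning scripts may build prompts differently.

```python
def alpaca_to_messages(example):
    """Turn an Alpaca record (instruction, optional input, output) into chat messages."""
    instruction = example["instruction"].strip()
    extra = (example.get("input") or "").strip()
    # Assumption: append the optional input after a blank line.
    prompt = instruction if not extra else instruction + "\n\n" + extra
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": example["output"].strip()},
    ]

def validate_chat(example):
    """Check that a record has a non-empty list of role/content dicts."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    )

record = {"instruction": "Translate to Uzbek.", "input": "book", "output": "kitob"}
messages = alpaca_to_messages(record)
is_valid = validate_chat({"messages": messages})
```

Running a check like `validate_chat` over a dataset before training is a cheap way to catch records with missing fields.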

Stage 6: Evaluation

`evaluate.py`

# Uzbek Llama
python evaluate.py \
    --model_path ./models/Llama-3.2-3B-uz-bxod-tuned \
    --model_type llama \
    --lang uzbek \
    --tasks translation sentiment classification mmlu fertility \
    --batch_size 128

# Korean Gemma
python evaluate.py \
    --model_path ./models/gemma-korean-bxod-tuned \
    --model_type gemma \
    --lang korean \
    --tasks translation sentiment mmlu fertility \
    --batch_size 128

# Spanish Gemma
python evaluate.py \
    --model_path ./models/gemma-spanish-bxod-tuned \
    --model_type gemma \
    --lang spanish \
    --tasks translation sentiment mmlu fertility \
    --batch_size 512

Evaluation Benchmarks

| Task | Dataset | Metrics | Languages |
|---|---|---|---|
| Translation | FLORES devtest | BLEU, COMET | All (bidirectional) |
| Sentiment | Language-specific | Accuracy | Uzbek, Korean, Spanish |
| Classification | Uzbek news (10 categories) | Accuracy | Uzbek |
| MMLU | cais/mmlu | Accuracy | All (English) |
| Fertility | Language Wikipedia | Tokens/word ratio | All |
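
Fertility, the tokens-per-word ratio in the last row, is simple to compute once a tokenizer is available. The sketch below assumes words are whitespace-separated and uses a stand-in tokenizer so it runs without any model; both the function name and the toy tokenizer are hypothetical, and `evaluate.py` may count words differently.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words

def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): split each word into 3-character chunks."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

# 3 words, 5 tokens under the toy tokenizer.
score = fertility(["kitob oʻqish", "uy"], toy_tokenize)
```

With a Hugging Face tokenizer, `tokenize` could be something like `lambda t: tokenizer.encode(t, add_special_tokens=False)`. Lower fertility means the tokenizer splits words of the target language into fewer pieces.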

Adapting to a New Language: Step-by-Step

1. Gather data: Collect text corpora (Wikipedia, news, web text) and an instruction dataset
2. Prepare frequency dictionary: Create a word frequency list for your language (see Frequency Dictionary Preparation under Stage 1 & 2)
3. Choose base model: Llama (larger context, fixed vocab) or Gemma (expandable vocab)
4. Build tokenizer: Run build_tokenizer.py with the appropriate strategy
5. Initialize embeddings: Run the appropriate BXOD init script with your frequency dictionary
6. Pretrain: Run pretrain_llama.py or pretrain_gemma.py
7. Fine-tune: Run finetune_llama.py or finetune_gemma.py
8. Evaluate: Run evaluate.py with your language-specific benchmarks

File Structure

codes/
├── README.md                   # This file
├── build_tokenizer.py          # Stage 1-2: Data processing + tokenizer building
├── uzbek_llama_init.py         # Stage 3: Uzbek + Llama BXOD initialization
├── korean_gemma_init.py        # Stage 3: Korean + Gemma BXOD initialization
├── spanish_gemma_init.py       # Stage 3: Spanish + Gemma BXOD initialization
├── uzbek_gemma_init.py         # Stage 3: Uzbek + Gemma BXOD initialization
├── pretrain_llama.py           # Stage 4: Continual pretraining (Llama)
├── pretrain_gemma.py           # Stage 4: Continual pretraining (Gemma)
├── finetune_llama.py           # Stage 5: Supervised fine-tuning (Llama)
├── finetune_gemma.py           # Stage 5: Supervised fine-tuning (Gemma)
└── evaluate.py                 # Stage 6: Benchmark evaluation

Training Hyperparameters

| Parameter | Pretraining | Fine-tuning |
|---|---|---|
| Precision | bfloat16 | bfloat16 |
| Optimizer | AdamW (fused) | AdamW (fused) |
| LR Schedule | Cosine | Cosine |
| Learning Rate | 1e-4 (Llama), 6e-5 (Gemma; the Korean example passes 1e-4) | 4e-5 (the Korean example passes 5e-5) |
| Warmup | 10% | 5% |
| Max Grad Norm | 1.0 | 1.0 |
| Batch Size | 1 x 16-128 grad accum | 1 x 16-128 grad accum |
| Gradient Checkpointing | Yes | Optional |
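
The batch-size and warmup rows translate into concrete numbers once a block size and step count are fixed. The arithmetic below is a rough sketch that assumes a single device, a per-device batch of 1, and sequences fully packed to the block size; the helper names are hypothetical and the training scripts may round warmup differently.

```python
def tokens_per_step(block_size, grad_accum, per_device_batch=1, num_devices=1):
    """Upper bound on tokens seen per optimizer step."""
    return block_size * grad_accum * per_device_batch * num_devices

def warmup_steps(max_steps, warmup_ratio):
    """Number of warmup steps for a given ratio (rounded to the nearest step)."""
    return round(max_steps * warmup_ratio)

# Values taken from the example pretraining commands in this README.
llama_uz = tokens_per_step(block_size=2048, grad_accum=32)   # 65536
gemma_ko = tokens_per_step(block_size=512, grad_accum=16)    # 8192
gemma_es = tokens_per_step(block_size=512, grad_accum=128)   # 65536

# 10% warmup over 1000 pretraining steps.
pretrain_warmup = warmup_steps(1000, 0.10)                   # 100
```

Note that the Llama run with 32 accumulation steps and the Gemma runs with 128 see the same number of tokens per step, because the Gemma block size is four times smaller.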
