BXOD: Cross-lingual Embedding Initialization for LLMs

This repository contains the code for adapting English-centric LLMs to new languages through vocabulary extension, BXOD embedding initialization, continual pretraining, and supervised fine-tuning.

Pipeline Overview

The full pipeline consists of 6 stages:

1. Data Processing     Prepare raw text corpora for tokenizer training
        |
2. Tokenizer Building  Train a BPE tokenizer for the target language
        |
3. Embedding Init      BXOD initialization for new language tokens
        |
4. Pretraining         Continual pretraining on target language + English
        |
5. Fine-tuning         Supervised instruction-following fine-tuning
        |
6. Evaluation          Benchmark evaluation (translation, NLU, etc.)

BXOD Embedding Initialization

BXOD uses cross-lingual semantic similarity (via LaBSE) to initialize embeddings for new target-language tokens. For each new token:

  1. Semantic Anchoring (BXOD): If the token appears in a target-language frequency dictionary:

    • Find the closest target-language words using LaBSE cosine similarity
    • For each neighbor, find the closest English translations
    • Compute combined scores (target_similarity × english_similarity) and normalize them with a temperature-scaled softmax (see --softmax_temperature)
    • Compute the weighted average of the English token embeddings from the base model, using the softmax weights
  2. Fallback (NACHOS): If the token is not in the dictionary (rare subwords):

    • Encode the token using the base model's tokenizer
    • Average the resulting subtoken embeddings

With this split, roughly 80% of new tokens (weighted by frequency) receive a semantically grounded initialization; the remaining rare subwords fall back to subtoken averaging.
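
The two initialization paths above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the repository's implementation: the function names, the toy 2-dimensional embeddings, and the assumption that each English candidate maps to a single embedding vector are all hypothetical. The similarity values would come from LaBSE in practice; here they are hard-coded.

```python
import math

def softmax(scores, temperature):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_average(vectors, weights):
    """Weighted sum of equal-length vectors (weights are expected to sum to 1)."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]

def bxod_init(candidates, english_embeddings, temperature=0.05):
    """Semantic anchoring path.

    candidates: list of (english_word, target_similarity, english_similarity).
    english_embeddings: hypothetical lookup from English word to its vector.
    """
    scores = [t_sim * e_sim for _, t_sim, e_sim in candidates]
    weights = softmax(scores, temperature)
    vectors = [english_embeddings[word] for word, _, _ in candidates]
    return weighted_average(vectors, weights)

def fallback_init(subtoken_ids, embedding_matrix):
    """Fallback path: mean of the base model's embeddings for the subtokens."""
    vectors = [embedding_matrix[i] for i in subtoken_ids]
    uniform = [1.0 / len(vectors)] * len(vectors)
    return weighted_average(vectors, uniform)

# Toy data, purely illustrative.
english_embeddings = {"book": [1.0, 0.0], "paper": [0.0, 1.0]}
candidates = [("book", 0.9, 0.9), ("paper", 0.6, 0.5)]
vector = bxod_init(candidates, english_embeddings)
```

With a temperature of 0.05, the stronger candidate ("book") receives almost all of the weight, so the new token starts very close to that embedding.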

Two Architecture Tracks

The codebase supports two base model families with key differences:

| Aspect | Llama Track | Gemma Track |
|---|---|---|
| Base model | meta-llama/Llama-3.2-3B-Instruct | google/gemma-2-2b-it |
| Vocab strategy | Replace non-English tokens (fixed vocab size) | Add new tokens (expand vocab size) |
| Pre-tokenizer | ByteLevel | WhitespaceSplit + Punctuation |
| Block size | 2048 | 512 |
| Attention | Flash Attention 2 | Eager |
| Chat format | `<\|start_header_id\|>...<\|eot_id\|>` | `<start_of_turn>...<end_of_turn>` |
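
To make the chat-format difference concrete, here is a minimal sketch of how one conversation might be rendered for each track. The templates follow the publicly documented Llama 3 and Gemma turn markers; the function names are hypothetical, BOS tokens are omitted, and the repository's own formatting code may differ in details.

```python
def render_llama3(messages):
    """Llama 3 style: role wrapped in header tokens, turn closed by <|eot_id|>."""
    return "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )

def render_gemma(messages):
    """Gemma style: turns delimited by <start_of_turn>/<end_of_turn>."""
    parts = []
    for m in messages:
        # Gemma names the assistant role "model".
        role = "model" if m["role"] == "assistant" else m["role"]
        parts.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "Salom"},
    {"role": "assistant", "content": "Salom! Qanday yordam bera olaman?"},
]
llama_text = render_llama3(conversation)
gemma_text = render_gemma(conversation)
```

In practice, calling `tokenizer.apply_chat_template(...)` from transformers is usually safer than hand-written templates, since it uses the template shipped with the model.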

Requirements

pip install torch transformers datasets tokenizers accelerate
pip install sentence-transformers wordfreq    # for BXOD embedding initialization
pip install sacrebleu unbabel-comet           # for evaluation
pip install wandb                             # for training logging

Stage 1 & 2: Data Processing + Tokenizer Building

build_tokenizer.py

Builds a BPE tokenizer from text corpora. Handles data downloading, preprocessing, and tokenizer training in one script.

Usage for Each Language

Uzbek (Llama track - ByteLevel pre-tokenizer):

python build_tokenizer.py \
    --lang uzbek \
    --strategy bytelevel \
    --vocab_size 35185 \
    --output_tokenizer uzbek_tokenizer_llama.json \
    --data_sources \
        hf:yakhyo/uz-wiki:train \
    --text_column text \
    --apostrophe_mode replace

Uzbek (Gemma track - WhitespaceSplit pre-tokenizer with ▁ prefix):

python build_tokenizer.py \
    --lang uzbek \
    --strategy whitespace \
    --vocab_size 40000 \
    --output_tokenizer uzbek_tokenizer_gemma.json \
    --data_sources \
        hf:yakhyo/uz-wiki:train \
    --text_column text \
    --apostrophe_mode replace \
    --normalizer nfd

Korean (Gemma track):

python build_tokenizer.py \
    --lang korean \
    --strategy whitespace \
    --vocab_size 18833 \
    --output_tokenizer korean_tokenizer.json \
    --data_sources \
        local:processed_dataset/korean_webtext.txt \
    --normalizer nfkc

Spanish (Gemma track):

python build_tokenizer.py \
    --lang spanish \
    --strategy whitespace \
    --vocab_size 36000 \
    --output_tokenizer spanish_tokenizer.json \
    --data_sources \
        hf:wikimedia/wikipedia:20231101.es:train[:200000] \
    --text_column text \
    --normalizer nfd_strip_accents
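
The --normalizer values used in the commands above (nfd, nfkc, nfd_strip_accents) presumably correspond to standard Unicode normalization forms. The sketch below shows what each would do to a string using only Python's unicodedata module; the mapping from flag value to behavior is an assumption, and build_tokenizer.py may implement it differently (for example through the normalizers in the tokenizers library).

```python
import unicodedata

def normalize(text, mode):
    """Apply one of the normalizer modes named in the commands above (assumed semantics)."""
    if mode == "nfkc":
        # Compatibility composition: e.g. full-width letters become ASCII.
        return unicodedata.normalize("NFKC", text)
    if mode == "nfd":
        # Canonical decomposition: "é" becomes "e" + combining acute accent.
        return unicodedata.normalize("NFD", text)
    if mode == "nfd_strip_accents":
        decomposed = unicodedata.normalize("NFD", text)
        # Drop combining marks (Unicode category Mn) left by the decomposition.
        return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    raise ValueError(f"unknown normalizer mode: {mode}")

stripped = normalize("canci\u00f3n", "nfd_strip_accents")
```

Note that accent stripping also maps "ñ" to "n", which merges distinct Spanish words; that is one reason to treat it as optional.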

Data Preparation Notes

| Language | Data Sources | Preprocessing |
|---|---|---|
| Uzbek | yakhyo/uz-wiki, tahrirchi/uz-crawl (news), tahrirchi/uz-books (13 GB books), community corpora | Replace apostrophes (ʻ→APST), filter Cyrillic, keep Latin-only words |
| Korean | KOREAN-WEBTEXT, community corpora | NFKC normalization, process JSONL (instruction/output/input fields) |
| Spanish | Spanish Wikipedia (200K articles) | NFD normalization, optional accent stripping |
| Any new language | Wikipedia dump, news corpora, web text | Normalize quotes, filter scripts as needed |
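
As an illustration of the Uzbek preprocessing row, the sketch below replaces apostrophe variants with the APST placeholder and keeps only Latin-script words. The list of apostrophe characters, the punctuation handling, and the function name are assumptions made for this example; the actual rules in build_tokenizer.py may be stricter or looser.

```python
import re

# Apostrophe-like characters that occur in Uzbek Latin text (assumed list).
APOSTROPHES = "\u02bb\u02bc\u2018\u2019'`"
PLACEHOLDER = "APST"
LATIN_WORD = re.compile(r"[A-Za-z]+")

def clean_uzbek(text):
    """Replace apostrophe variants and drop words that are not purely Latin."""
    for ch in APOSTROPHES:
        text = text.replace(ch, PLACEHOLDER)
    kept = []
    for word in text.split():
        core = word.strip(".,!?;:\"()")  # ignore surrounding punctuation
        if LATIN_WORD.fullmatch(core):
            kept.append(core)
    return " ".join(kept)

# "O\u02bbzbekiston" is the Latin spelling with U+02BB; "\u0431\u0443" is Cyrillic.
cleaned = clean_uzbek("O\u02bbzbekiston \u0431\u0443 mamlakat.")
```

Mapping every apostrophe variant to one placeholder keeps words such as "oʻzbek" from being split at the apostrophe during pre-tokenization.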

For a new language: Prepare a plain text file (one document per line) or use HuggingFace datasets. Choose bytelevel strategy for Llama models or whitespace for Gemma models. Set vocabulary size based on your corpus (typically 18K-43K).

Frequency Dictionary Preparation

BXOD requires a frequency dictionary for the target language (e.g., uz_50k.txt, ko_50k.txt). This is a plain text file with one word per line, ordered by frequency (most frequent first). You can generate this from your corpus or use the wordfreq library:

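from wordfreq import top_n_list

# Top 50,000 Korean words, most frequent first
words = top_n_list('ko', 50000)
# Write UTF-8 explicitly so the result does not depend on the platform default encoding
with open('ko_50k.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(words))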

For languages not covered by wordfreq (e.g., Uzbek), extract word frequencies from your corpus:

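from collections import Counter

# Count whitespace-delimited words across the corpus
word_counts = Counter()
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        word_counts.update(line.strip().split())
# Keep the 50,000 most frequent words, most frequent first
top_words = [w for w, _ in word_counts.most_common(50000)]
with open('uz_50k.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(top_words))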

Stage 3: BXOD Embedding Initialization

Each language/model combination has its own script because the vocabulary integration strategy differs between Llama (replace tokens) and Gemma (add tokens).

BXOD Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| --top_n_neighbors | 3 | Number of target-language neighbors to find via LaBSE |
| --top_n_candidates | 3 | Number of English translations per neighbor |
| --softmax_temperature | 0.05 | Temperature for sharpening translation weights |
| --bank_size | 50000 | Size of reference vocabulary banks |
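
To see why a temperature as low as 0.05 "sharpens" the weights, the sketch below compares softmax weights at 0.05 and at 1.0 for the same scores, and reports an effective number of candidates (the inverse of the sum of squared weights). The scores are made-up values and the helper names are hypothetical.

```python
import math

def softmax(scores, temperature):
    """Temperature-scaled softmax; lower temperature concentrates the weight."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def effective_candidates(weights):
    """Inverse participation ratio: 1.0 means one candidate dominates."""
    return 1.0 / sum(w * w for w in weights)

scores = [0.81, 0.72, 0.30]  # hypothetical combined similarity scores
sharp = softmax(scores, temperature=0.05)
flat = softmax(scores, temperature=1.0)

for name, weights in (("T=0.05", sharp), ("T=1.0", flat)):
    rounded = [round(w, 3) for w in weights]
    print(name, rounded, round(effective_candidates(weights), 2))
```

At 0.05 the best candidate takes most of the weight while a close runner-up still contributes; at 1.0 the weights are nearly uniform, which would blur the initialization toward an unspecific average.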

Uzbek + Llama (uzbek_llama_init.py)

Replaces non-English tokens in Llama's vocabulary with Uzbek tokens.

python uzbek_llama_init.py \
    --base_model meta-llama/Llama-3.2-3B-Instruct \
    --tokenizer_file uzbek_tokenizer_llama.json \
    --uzbek_words_file uz_50k.txt \
    --output_dir ./models/Llama-3.2-3B-uz-bxod \
    --top_n_neighbors 3 \
    --top_n_candidates 3 \
    --softmax_temperature 0.05 \
    --bank_size 50000

Korean + Gemma (korean_gemma_init.py)

Adds new Korean tokens to Gemma's vocabulary.

python korean_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --korean_words_file ko_50k.txt \
    --output_dir ./models/gemma-korean-bxod \
    --max_new_tokens 20000

The word list file (ko_50k.txt) should contain one Korean word per line, ordered by frequency.

Spanish + Gemma (spanish_gemma_init.py)

Adds Spanish tokens from the BPE tokenizer with integrated merge rules.

python spanish_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --tokenizer_file spanish_tokenizer.json \
    --target_words_file es_50k.txt \
    --output_dir ./models/gemma-spanish-bxod \
    --max_new_tokens 36000

Uzbek + Gemma (uzbek_gemma_init.py)

Adds Uzbek tokens from the BPE tokenizer with integrated merge rules.

python uzbek_gemma_init.py \
    --base_model google/gemma-2-2b-it \
    --tokenizer_file uzbek_tokenizer_gemma.json \
    --uzbek_words_file uz_50k.txt \
    --output_dir ./models/gemma-uzbek-bxod \
    --max_new_tokens 36000

Adapting to a New Language

Choose the init script to start from based on your base model:

  • Llama models: Copy uzbek_llama_init.py. The Llama approach replaces non-ASCII tokens in the vocabulary, so the total vocab size stays fixed.
  • Gemma models (with word list): Copy korean_gemma_init.py. Adds tokens from a frequency word list.
  • Gemma models (with BPE tokenizer): Copy spanish_gemma_init.py or uzbek_gemma_init.py. Adds tokens from a custom BPE tokenizer and integrates its merge rules.
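
The difference between the two vocabulary strategies can be shown on a toy vocabulary. In this sketch, "replaceable" is taken to mean non-ASCII, following the Llama bullet above; the function names and the tiny token lists are hypothetical, and the real scripts also have to update the tokenizer's merge rules and the model's embedding matrix.

```python
def replace_tokens(vocab, new_tokens, is_replaceable):
    """Llama-style: overwrite replaceable slots in place; vocab size is unchanged."""
    updated = list(vocab)
    slots = [i for i, tok in enumerate(vocab) if is_replaceable(tok)]
    # zip stops at the shorter list, so extra new tokens are simply not placed.
    for slot, token in zip(slots, new_tokens):
        updated[slot] = token
    return updated

def add_tokens(vocab, new_tokens):
    """Gemma-style: append tokens that are not present yet; vocab size grows."""
    existing = set(vocab)
    fresh = [t for t in dict.fromkeys(new_tokens) if t not in existing]
    return list(vocab) + fresh

base_vocab = ["the", "cat", "\u732b", "\u0434\u043e\u043c", "run"]
uzbek_tokens = ["kitob", "uy"]

replaced = replace_tokens(base_vocab, uzbek_tokens, lambda t: not t.isascii())
expanded = add_tokens(base_vocab, uzbek_tokens)
```

Replacement reuses existing token IDs, so the embedding matrix keeps its shape and only the affected rows are re-initialized; addition requires resizing the embedding (and output) matrices to cover the new IDs.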

Key changes needed:

  1. Frequency dictionary: Prepare a target-language frequency word list (see above)
  2. Token cleaning: Adjust character replacements if your language has special normalization (e.g., Uzbek apostrophe handling)

Stage 4: Continual Pretraining

Llama Models (pretrain_llama.py)

# Uzbek
python pretrain_llama.py \
    --base_model ./models/Llama-3.2-3B-uz-bxod \
    --output_dir ./models/Llama-3.2-3B-uz-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:340000] \
    --target_datasets tahrirchi/uz-crawl:news yakhyo/uz-wiki:train \
    --max_steps 1000 \
    --block_size 2048 \
    --learning_rate 1e-4 \
    --grad_accum 32 \
    --filter_latin \
    --uzbek_apostrophe \
    --run_name uz_pretrain

Gemma Models (pretrain_gemma.py)

# Korean
python pretrain_gemma.py \
    --base_model ./models/gemma-korean-bxod \
    --output_dir ./models/gemma-korean-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:340000] \
    --target_datasets HAERAE-HUB/KOREAN-WEBTEXT:train[:100000] \
    --max_steps 6000 \
    --block_size 512 \
    --learning_rate 1e-4 \
    --grad_accum 16 \
    --run_name korean_pretrain

# Spanish
python pretrain_gemma.py \
    --base_model ./models/gemma-spanish-bxod \
    --output_dir ./models/gemma-spanish-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:160000] \
    --target_datasets wikimedia/wikipedia:20231101.es:train[:40000] \
    --max_steps 1000 \
    --block_size 512 \
    --grad_accum 128 \
    --run_name spanish_pretrain

# Uzbek Gemma
python pretrain_gemma.py \
    --base_model ./models/gemma-uzbek-bxod \
    --output_dir ./models/gemma-uzbek-bxod-pretrained \
    --en_dataset wikipedia:20220301.en:train[:160000] \
    --target_datasets yakhyo/uz-wiki:train \
    --max_steps 1000 \
    --block_size 512 \
    --grad_accum 128 \
    --uzbek_apostrophe \
    --run_name uzbek_gemma_pretrain
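
A quick way to compare the pretraining runs above is to estimate how many tokens each one processes. The sketch assumes a per-device batch size of 1 (as listed under Training Hyperparameters), a single device, and fully packed sequences of block_size tokens; the function name is hypothetical and the numbers are upper bounds, not measured values.

```python
def tokens_seen(block_size, grad_accum, max_steps, per_device_batch=1, num_devices=1):
    """Return (tokens per optimizer step, total tokens) under full packing."""
    tokens_per_step = block_size * per_device_batch * grad_accum * num_devices
    return tokens_per_step, tokens_per_step * max_steps

# Values taken from the example commands above.
runs = {
    "uzbek_llama": tokens_seen(block_size=2048, grad_accum=32, max_steps=1000),
    "korean_gemma": tokens_seen(block_size=512, grad_accum=16, max_steps=6000),
    "spanish_gemma": tokens_seen(block_size=512, grad_accum=128, max_steps=1000),
}

for name, (per_step, total) in runs.items():
    print(f"{name}: {per_step:,} tokens/step, {total / 1e6:.1f}M tokens total")
```

Under these assumptions the Llama and Spanish runs use the same 65,536 tokens per step, while the Korean run trades a smaller step (8,192 tokens) for six times as many steps.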

Pretraining Data Sources by Language

| Language | English Data | Target Language Data |
|---|---|---|
| Uzbek | English Wikipedia (340K) | tahrirchi/uz-crawl (news), yakhyo/uz-wiki, tahrirchi/uz-books |
| Korean | English Wikipedia (340K) | HAERAE-HUB/KOREAN-WEBTEXT (100K) |
| Spanish | English Wikipedia (160K) | Spanish Wikipedia (40K) |
| New language | English Wikipedia | Wikipedia, news crawls, web text in target language |

Stage 5: Supervised Fine-tuning

Llama Models (finetune_llama.py)

# Uzbek
python finetune_llama.py \
    --base_model ./models/Llama-3.2-3B-uz-bxod-pretrained \
    --output_dir ./models/Llama-3.2-3B-uz-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets UAzimov/uzbek-instruct-llm:train behbudiy/alpaca-cleaned-uz:train \
    --target_format chat alpaca \
    --num_epochs 2 \
    --learning_rate 4e-5 \
    --grad_accum 32 \
    --uzbek_apostrophe

Gemma Models (finetune_gemma.py)

# Korean
python finetune_gemma.py \
    --base_model ./models/gemma-korean-bxod-pretrained \
    --output_dir ./models/gemma-korean-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets nlpai-lab/kullm-v2:train \
    --target_format alpaca \
    --num_epochs 2 \
    --learning_rate 5e-5 \
    --grad_accum 16

# Spanish
python finetune_gemma.py \
    --base_model ./models/gemma-spanish-bxod-pretrained \
    --output_dir ./models/gemma-spanish-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets bertin-project/alpaca-spanish:train \
    --target_format alpaca \
    --num_epochs 1 \
    --learning_rate 4e-5 \
    --grad_accum 128

# Uzbek Gemma
python finetune_gemma.py \
    --base_model ./models/gemma-uzbek-bxod-pretrained \
    --output_dir ./models/gemma-uzbek-bxod-tuned \
    --en_dataset tatsu-lab/alpaca:train \
    --target_datasets \
        UAzimov/uzbek-instruct-llm:train \
        behbudiy/alpaca-cleaned-uz:train \
        behbudiy/translation-instruction:train \
    --target_format chat alpaca alpaca \
    --num_epochs 1 \
    --learning_rate 4e-5 \
    --grad_accum 128 \
    --uzbek_apostrophe \
    --use_bos

Fine-tuning Datasets by Language

| Language | English | Target Language | Format |
|---|---|---|---|
| Uzbek | tatsu-lab/alpaca | UAzimov/uzbek-instruct-llm (chat), behbudiy/alpaca-cleaned-uz (alpaca), behbudiy/translation-instruction (alpaca) | Mixed |
| Korean | tatsu-lab/alpaca | nlpai-lab/kullm-v2 (alpaca) | Alpaca |
| Spanish | tatsu-lab/alpaca | bertin-project/alpaca-spanish (alpaca) | Alpaca |
| New language | tatsu-lab/alpaca | Any instruction dataset in your language | Alpaca or Chat |

Dataset format requirements:

  • Alpaca format: Must have columns instruction, output, and optionally input
  • Chat format: Must have a messages column with list of {"role": "...", "content": "..."} dicts
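
The two formats can be checked and unified with a small helper before training. This is a sketch under assumptions: the function names are hypothetical, and joining instruction and input with a blank line is one common convention, not necessarily what finetune_llama.py or finetune_gemma.py do.

```python
def alpaca_to_messages(record):
    """Convert an Alpaca-style record into a list of chat messages."""
    for key in ("instruction", "output"):
        if key not in record:
            raise KeyError(f"missing required Alpaca column: {key}")
    prompt = record["instruction"]
    extra = record.get("input") or ""  # the input column is optional
    if extra.strip():
        prompt = f"{prompt}\n\n{extra}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]

def is_chat_record(record):
    """Check that a record has a well-formed messages column."""
    messages = record.get("messages")
    return isinstance(messages, list) and all(
        isinstance(m, dict) and {"role", "content"} <= set(m) for m in messages
    )

example = {"instruction": "Translate to Uzbek.", "input": "book", "output": "kitob"}
messages = alpaca_to_messages(example)
```

Converting everything to the messages form first means a single chat-template step can handle both dataset types.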

Stage 6: Evaluation

evaluate.py

# Uzbek Llama
python evaluate.py \
    --model_path ./models/Llama-3.2-3B-uz-bxod-tuned \
    --model_type llama \
    --lang uzbek \
    --tasks translation sentiment classification mmlu fertility \
    --batch_size 128

# Korean Gemma
python evaluate.py \
    --model_path ./models/gemma-korean-bxod-tuned \
    --model_type gemma \
    --lang korean \
    --tasks translation sentiment mmlu fertility \
    --batch_size 128

# Spanish Gemma
python evaluate.py \
    --model_path ./models/gemma-spanish-bxod-tuned \
    --model_type gemma \
    --lang spanish \
    --tasks translation sentiment mmlu fertility \
    --batch_size 512

Evaluation Benchmarks

| Task | Dataset | Metrics | Languages |
|---|---|---|---|
| Translation | FLORES devtest | BLEU, COMET | All (bidirectional) |
| Sentiment | Language-specific | Accuracy | Uzbek, Korean, Spanish |
| Classification | Uzbek news (10 categories) | Accuracy | Uzbek |
| MMLU | cais/mmlu | Accuracy | All (English) |
| Fertility | Language Wikipedia | Tokens/word ratio | All |
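
Fertility is the least standard metric in the table, so here is a minimal sketch of how a tokens-per-word ratio could be computed. Words are assumed to be whitespace-delimited, and the toy tokenizer is a stand-in invented for this example; evaluate.py may count words or tokens differently.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words

def toy_tokenize(text):
    """Stand-in tokenizer: split each word into chunks of at most 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

score = fertility(["kitob oqiyman"], toy_tokenize)  # 5 tokens / 2 words
```

With a Hugging Face tokenizer, `tokenize` could be something like `lambda t: tokenizer(t, add_special_tokens=False)["input_ids"]`. Lower fertility means the tokenizer splits target-language words into fewer pieces.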

Adapting to a New Language: Step-by-Step

  1. Gather data: Collect text corpora (Wikipedia, news, web text) and an instruction dataset
  2. Prepare frequency dictionary: Create a word frequency list for your language (see Frequency Dictionary Preparation under Stage 1 & 2)
  3. Choose base model: Llama (larger context, fixed vocab) or Gemma (expandable vocab)
  4. Build tokenizer: Run build_tokenizer.py with appropriate strategy
  5. Initialize embeddings: Run the appropriate BXOD init script with your frequency dictionary
  6. Pretrain: Run pretrain_llama.py or pretrain_gemma.py
  7. Fine-tune: Run finetune_llama.py or finetune_gemma.py
  8. Evaluate: Run evaluate.py with your language-specific benchmarks

File Structure

codes/
├── README.md                   # This file
├── build_tokenizer.py          # Stage 1-2: Data processing + tokenizer building
├── uzbek_llama_init.py         # Stage 3: Uzbek + Llama BXOD initialization
├── korean_gemma_init.py        # Stage 3: Korean + Gemma BXOD initialization
├── spanish_gemma_init.py       # Stage 3: Spanish + Gemma BXOD initialization
├── uzbek_gemma_init.py         # Stage 3: Uzbek + Gemma BXOD initialization
├── pretrain_llama.py           # Stage 4: Continual pretraining (Llama)
├── pretrain_gemma.py           # Stage 4: Continual pretraining (Gemma)
├── finetune_llama.py           # Stage 5: Supervised fine-tuning (Llama)
├── finetune_gemma.py           # Stage 5: Supervised fine-tuning (Gemma)
└── evaluate.py                 # Stage 6: Benchmark evaluation

Training Hyperparameters

| Parameter | Pretraining | Fine-tuning |
|---|---|---|
| Precision | bfloat16 | bfloat16 |
| Optimizer | AdamW (fused) | AdamW (fused) |
| LR Schedule | Cosine | Cosine |
| Learning Rate | 1e-4 (Llama), 6e-5 (Gemma) | 4e-5 |
| Warmup | 10% | 5% |
| Max Grad Norm | 1.0 | 1.0 |
| Batch Size | 1 × 32-128 grad accum | 1 × 16-128 grad accum |
| Gradient Checkpointing | Yes | Optional |
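
The warmup and cosine settings combine into a learning-rate curve like the one sketched below. The sketch assumes linear warmup from zero and cosine decay to zero, which is what the common cosine-with-warmup schedulers do; the function name is hypothetical and the training scripts may use a different floor or warmup shape.

```python
import math

def lr_at_step(step, max_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(max_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Pretraining-style settings: 1000 steps, peak 1e-4, 10% warmup.
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, 1000, 1e-4, 0.10):.2e}")
```

With these settings the rate reaches its peak at step 100, is at half the peak at step 550, and decays to zero at the final step.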
