UBC-NLP/sm-umt

Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs

Codebase for Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs

Published in Findings of the Association for Computational Linguistics: NAACL 2025.

Abstract

Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task such that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations that are then used to perform sentence-level mining. As the quality of mined parallel pairs may not be optimal due to noise or mistakes, we introduce a filtering criterion to select the optimal in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset (Team et al., 2022) and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT, leading to better or comparable translation performance as translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points.

This repo restructures the research workspace into a reusable package plus plain Python scripts while preserving the method described in the paper:

  1. mine high-confidence word translation pairs,
  2. build weak word-by-word translations,
  3. back-translate unlabeled target-language sentences,
  4. select sentence-level ICL examples with TopK+BM25,
  5. translate the test set with the mined examples.
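The TopK+BM25 selection in step 4 can be sketched as follows. This is a minimal, self-contained illustration of BM25-based example retrieval over mined pairs, not the repo's actual API; the function names, tokenization, and parameters here are assumptions:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each tokenized document in `corpus` against the tokenized `query` with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[term] * (k1 + 1) / denom
        scores.append(s)
    return scores

def topk_bm25(test_sentence, mined_pairs, k=5):
    """Pick the k mined (src, tgt) pairs whose source side best matches the test input."""
    corpus = [src.lower().split() for src, _ in mined_pairs]
    scores = bm25_scores(test_sentence.lower().split(), corpus)
    order = sorted(range(len(mined_pairs)), key=lambda i: -scores[i])
    return [mined_pairs[i] for i in order[:k]]
```

The selected pairs would then be formatted as in-context demonstrations ahead of the test sentence in the prompt.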

Dependencies

pip install -r requirements.txt

Paper resources

Models used in the paper:

Similarity model used in the paper:

Main evaluation dataset:

Vocabularies:

We use frequency-sorted FastText vocabularies and keep the top 10,000 words per language.
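Because the vocabulary files are frequency-sorted, keeping the top 10,000 words per language amounts to reading the first 10,000 non-empty lines. A minimal sketch (the helper name is hypothetical):

```python
def load_top_words(path, limit=10_000):
    """Read a frequency-sorted vocabulary (one word per line), keeping the top `limit` entries."""
    words = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word = line.strip()
            if word:
                words.append(word)
            if len(words) >= limit:
                break
    return words
```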

Main scripts

Run the paper's unsupervised method on selected pairs:

python run_unsupervised.py \
  --pair eng_Latn,deu_Latn \
  --model-path /path/to/Meta-Llama-3-8B \
  --similarity-model-path /path/to/stsb-xlm-r-multilingual \
  --lexicons-dir /path/to/lexicons \
  --sentences-dir /path/to/flores_plus \
  --output-dir outputs/main

Run the regular BM25 ICL baseline:

python run_regular_bm25.py \
  --pair eng_Latn,deu_Latn \
  --model-path /path/to/Meta-Llama-3-8B \
  --similarity-model-path /path/to/stsb-xlm-r-multilingual \
  --lexicons-dir /path/to/lexicons \
  --sentences-dir /path/to/flores_plus \
  --output-dir outputs/baselines

Run the main paper setting with the built-in FLORES preset:

python run_main.py \
  --model-path /path/to/Meta-Llama-3-8B \
  --similarity-model-path /path/to/stsb-xlm-r-multilingual \
  --lexicons-dir /path/to/lexicons \
  --sentences-dir /path/to/flores_plus

Run the decoding ablation subset:

python run_decoding_ablation.py \
  --model-path /path/to/Meta-Llama-3-8B \
  --similarity-model-path /path/to/stsb-xlm-r-multilingual \
  --lexicons-dir /path/to/lexicons \
  --sentences-dir /path/to/flores_plus

Useful scripts in the repo root:

  • run_main.py: main FLORES setting with the paper pair preset.
  • run_unsupervised.py: generic unsupervised runner.
  • run_regular_bm25.py: generic regular BM25 baseline.
  • run_decoding_ablation.py: decoding strategy ablation subset.

Input format

The scripts expect plain text files with one item per line:

  • Lexicon vocabularies: one word per line, for example lexicons/en_frequent_words.txt.
  • FLORES/FLORES+ dev and test files: one sentence per line.

Expected --sentences-dir layout:

flores_plus/
  dev/
    dev.eng_Latn
    dev.fra_Latn
  devtest/
    devtest.eng_Latn
    devtest.fra_Latn

Expected --lexicons-dir layout:

lexicons/
  en_frequent_words.txt
  fr_frequent_words.txt
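Given a pair code such as eng_Latn,fra_Latn, the files implied by the two layouts above can be resolved mechanically. The sketch below follows the example layouts only; the helper name is hypothetical, and the two-letter lexicon prefix (en, fr, ...) is an assumption taken from the example filenames rather than a general rule for all language codes:

```python
from pathlib import Path

def resolve_pair_files(sentences_dir, lexicons_dir, src, tgt):
    """Map a FLORES+ pair to the dev/devtest and lexicon files in the layouts above."""
    sentences = Path(sentences_dir)
    lexicons = Path(lexicons_dir)
    return {
        "dev_src": sentences / "dev" / f"dev.{src}",
        "dev_tgt": sentences / "dev" / f"dev.{tgt}",
        "test_src": sentences / "devtest" / f"devtest.{src}",
        "test_tgt": sentences / "devtest" / f"devtest.{tgt}",
        # the example lexicon files use two-letter prefixes (en, fr); this slice
        # matches those examples but is not a general code-mapping rule
        "src_vocab": lexicons / f"{src[:2]}_frequent_words.txt",
        "tgt_vocab": lexicons / f"{tgt[:2]}_frequent_words.txt",
    }
```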

Concrete examples are in examples/README.md.

Citation

@inproceedings{el-mekki-abdul-mageed-2025-effective,
    title = "Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with {LLM}s",
    author = "El Mekki, Abdellah  and
      Abdul-Mageed, Muhammad",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.238/",
    doi = "10.18653/v1/2025.findings-naacl.238",
    pages = "4229--4256",
    ISBN = "979-8-89176-195-7",
    abstract = "Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task such that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations that are then used to perform sentence-level mining. As the quality of mined parallel pairs may not be optimal due to noise or mistakes, we introduce a filtering criterion to select the optimal in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset (CITATION) and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT, leading to better or comparable translation performance as translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points."
}
