Codebase for Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with LLMs
Published in Findings of the Association for Computational Linguistics: NAACL 2025.
Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task such that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations that are then used to perform sentence-level mining. As the quality of mined parallel pairs may not be optimal due to noise or mistakes, we introduce a filtering criterion to select the optimal in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset (Team et al., 2022) and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT, leading to better or comparable translation performance as translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points.
This repo restructures the research workspace into a reusable package plus plain Python scripts while preserving the method described in the paper:
- mine high-confidence word translation pairs,
- build weak word-by-word translations,
- back-translate unlabeled target-language sentences,
- select sentence-level ICL examples with TopK+BM25,
- translate the test set with the mined examples.
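The sentence-level selection step above combines semantic TopK retrieval with BM25 lexical matching. As a rough illustration of the BM25 half, here is a minimal pure-Python Okapi BM25 ranker; the function name, parameters, and toy corpus are illustrative and not taken from the repo's scripts.

```python
import math
from collections import Counter

def bm25_rank(query, corpus, k1=1.5, b=0.75, top_k=5):
    """Rank corpus sentences against a query with Okapi BM25.

    query: list of tokens; corpus: list of token lists.
    Returns indices of the top_k highest-scoring sentences.
    """
    n = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n
    # document frequency per term
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * num / den
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])[:top_k]

corpus = [s.split() for s in [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "machine translation with large language models",
]]
print(bm25_rank("machine translation models".split(), corpus, top_k=1))  # → [2]
```

In the actual pipeline, the query would be a test sentence and the corpus the mined parallel pool; the top-ranked pairs then serve as in-context examples.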
pip install -r requirements.txt

Models used in the paper:

Similarity model used in the paper:
Main evaluation dataset:
Vocabularies:
We use frequency-sorted FastText vocabularies and keep the top 10,000 words per language.
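Since the vocabularies are frequency-sorted, keeping the top 10,000 words amounts to reading the first lines of each file. A small sketch of such a loader (the function name is illustrative, not from the repo):

```python
def load_top_words(path, top_n=10_000):
    """Read a frequency-sorted vocabulary file (one word per line)
    and keep the most frequent top_n entries."""
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                words.append(word)
            if len(words) >= top_n:
                break
    return words
```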
Run the paper's unsupervised method on selected pairs:
python run_unsupervised.py \
--pair eng_Latn,deu_Latn \
--model-path /path/to/Meta-Llama-3-8B \
--similarity-model-path /path/to/stsb-xlm-r-multilingual \
--lexicons-dir /path/to/lexicons \
--sentences-dir /path/to/flores_plus \
--output-dir outputs/main

Run the regular BM25 ICL baseline:
python run_regular_bm25.py \
--pair eng_Latn,deu_Latn \
--model-path /path/to/Meta-Llama-3-8B \
--similarity-model-path /path/to/stsb-xlm-r-multilingual \
--lexicons-dir /path/to/lexicons \
--sentences-dir /path/to/flores_plus \
--output-dir outputs/baselines

Run the main paper setting with the built-in FLORES preset:
python run_main.py \
--model-path /path/to/Meta-Llama-3-8B \
--similarity-model-path /path/to/stsb-xlm-r-multilingual \
--lexicons-dir /path/to/lexicons \
--sentences-dir /path/to/flores_plus

Run the decoding ablation subset:
python run_decoding_ablation.py \
--model-path /path/to/Meta-Llama-3-8B \
--similarity-model-path /path/to/stsb-xlm-r-multilingual \
--lexicons-dir /path/to/lexicons \
--sentences-dir /path/to/flores_plus

Useful scripts in the repo root:
- run_main.py: main FLORES setting with the paper pair preset.
- run_unsupervised.py: generic unsupervised runner.
- run_regular_bm25.py: generic regular BM25 baseline.
- run_decoding_ablation.py: decoding strategy ablation subset.
The scripts expect plain text files with one item per line:
- Lexicon vocabularies: one word per line, for example lexicons/en_frequent_words.txt.
- FLORES/FLORES+ dev and test files: one sentence per line.
Expected --sentences-dir layout:
flores_plus/
dev/
dev.eng_Latn
dev.fra_Latn
devtest/
devtest.eng_Latn
devtest.fra_Latn
Expected --lexicons-dir layout:
lexicons/
en_frequent_words.txt
fr_frequent_words.txt
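To catch path mistakes before launching a long run, the two layouts above can be sanity-checked with a short helper. This is an optional convenience sketch, not part of the repo's scripts; the default language codes and the `*_frequent_words.txt` pattern follow the examples shown above.

```python
from pathlib import Path

def check_layout(sentences_dir, lexicons_dir, src="eng_Latn", tgt="fra_Latn"):
    """Return a list of missing files/directories for the expected
    --sentences-dir and --lexicons-dir layouts (empty list = all present)."""
    sentences_dir, lexicons_dir = Path(sentences_dir), Path(lexicons_dir)
    expected = [
        sentences_dir / "dev" / f"dev.{src}",
        sentences_dir / "dev" / f"dev.{tgt}",
        sentences_dir / "devtest" / f"devtest.{src}",
        sentences_dir / "devtest" / f"devtest.{tgt}",
    ]
    missing = [str(p) for p in expected if not p.exists()]
    # lexicon files follow the <lang>_frequent_words.txt naming shown above
    if not list(lexicons_dir.glob("*_frequent_words.txt")):
        missing.append(str(lexicons_dir / "*_frequent_words.txt"))
    return missing
```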
Concrete examples are in examples/README.md.
@inproceedings{el-mekki-abdul-mageed-2025-effective,
title = "Effective Self-Mining of In-Context Examples for Unsupervised Machine Translation with {LLM}s",
author = "El Mekki, Abdellah and
Abdul-Mageed, Muhammad",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-naacl.238/",
doi = "10.18653/v1/2025.findings-naacl.238",
pages = "4229--4256",
ISBN = "979-8-89176-195-7",
abstract = "Large Language Models (LLMs) have demonstrated impressive performance on a wide range of natural language processing (NLP) tasks, primarily through in-context learning (ICL). In ICL, the LLM is provided with examples that represent a given task such that it learns to generate answers for test inputs. However, access to these in-context examples is not guaranteed especially for low-resource or massively multilingual tasks. In this work, we propose an unsupervised approach to mine in-context examples for machine translation (MT), enabling unsupervised MT (UMT) across different languages. Our approach begins with word-level mining to acquire word translations that are then used to perform sentence-level mining. As the quality of mined parallel pairs may not be optimal due to noise or mistakes, we introduce a filtering criterion to select the optimal in-context examples from a pool of unsupervised parallel sentences. We evaluate our approach using two multilingual LLMs on 288 directions from the FLORES-200 dataset (CITATION) and analyze the impact of various linguistic features on performance. Our findings demonstrate the effectiveness of our unsupervised approach in mining in-context examples for MT, leading to better or comparable translation performance as translation with regular in-context samples (extracted from human-annotated data), while also outperforming the other state-of-the-art UMT methods by an average of 7 BLEU points."
}