UdmurtBench is a benchmark for evaluating large language models (LLMs) on the Udmurt language. The benchmark is designed for a low-resource language setting and combines translated, adapted, and culturally specific tasks.
The benchmark evaluates several aspects of model competence:
- machine translation into Udmurt;
- reading comprehension in Udmurt;
- physical commonsense reasoning in Udmurt;
- short-answer cultural knowledge related to the Udmurt context.
The project was developed as part of a master's thesis on creating a dataset for evaluating LLMs on the Udmurt language.
Udmurt is a low-resource and digitally underrepresented language. For such languages, a single translation test is not enough: a model can translate isolated sentences reasonably well, but still fail at reading comprehension, commonsense reasoning, or culturally specific questions.
UdmurtBench therefore follows a modular design. Each subtask measures a different ability, and the final leaderboard is intended to show a model profile rather than a single universal score.
| Subtask | What it measures | Task format | Main metric |
|---|---|---|---|
| Translation-FLORES | Russian → Udmurt machine translation | Free-form translation | chrF++ |
| Belebele-Udmurt | Reading comprehension | Multiple choice, 4 options | Accuracy |
| Global PIQA-Udmurt | Physical commonsense reasoning | A/B choice | Accuracy |
| Shudkom | Udmurt cultural knowledge | Short free-form answer | LLM-as-a-Judge accuracy |
Additional metrics are used for diagnostics: BLEU and TER for translation, valid answer rate for classification-style tasks, and generation/judge success rates for the short-answer task.
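As an illustration of how accuracy and the valid answer rate relate on the multiple-choice tasks, here is a minimal sketch. It is not taken from the repository's scripts: the names `extract_choice` and `score_choices`, the strict parsing rule (the output must start with an uppercase option letter), and the convention of counting unparseable outputs as wrong are all assumptions made for this example.

```python
import re


def extract_choice(raw_output, options="ABCD"):
    """Return the option letter the output starts with, or None if none is found.

    Assumed rule: after stripping whitespace, the output must begin with an
    uppercase option letter, optionally wrapped in parentheses, and the letter
    must not be the first character of a longer word.
    """
    match = re.match(rf"\(?([{options}])\)?(?!\w)", raw_output.strip())
    return match.group(1) if match else None


def score_choices(outputs, gold, options="ABCD"):
    """Compute accuracy and valid answer rate over parallel lists."""
    if not gold or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and equally long")
    predictions = [extract_choice(o, options) for o in outputs]
    n = len(gold)
    valid = sum(p is not None for p in predictions)
    # Unparseable outputs (None) never equal a gold letter, so they count as wrong.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return {"accuracy": correct / n, "valid_answer_rate": valid / n}


# Toy data, not benchmark items: two parseable answers, one of them correct.
demo = score_choices(
    ["B", "(C) because...", "I am not sure", "a"],
    ["B", "A", "D", "A"],
)
print(demo)  # {'accuracy': 0.25, 'valid_answer_rate': 0.5}
```

For the A/B task the same functions can be called with `options="AB"`. Reporting the valid answer rate next to accuracy makes it possible to tell a model that answers wrongly from one that fails to produce a usable answer at all.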
| Subtask | Source | Dev size | Test size |
|---|---|---|---|
| Translation-FLORES | FLORES-200 | 1400 | 100 |
| Belebele-Udmurt | Belebele / FLORES | 800 | 100 |
| Global PIQA-Udmurt | Global PIQA | 0 | 100 |
| Shudkom | Udmurt intellectual quiz questions | 114 | 100 |
If the benchmark is used for final model comparison, the hidden test subset should not be published in full. Keeping it private reduces the risk that test items leak into future training data (benchmark contamination).
Current code layout:

    UdmurtBench/
    └── code/
        ├── belebele_eval.py
        ├── flores_results.py
        ├── piqa_eval.py
        ├── shudkom_eval.py
        └── shudkom_results.py

| File | Purpose |
|---|---|
| code/belebele_eval.py | Evaluation script for Belebele-Udmurt: reading comprehension with four answer options. |
| code/piqa_eval.py | Evaluation script for Global PIQA-Udmurt: A/B physical commonsense reasoning. |
| code/shudkom_eval.py | Generation and/or judging script for the Shudkom short-answer task. |
| code/shudkom_results.py | Aggregation of Shudkom results and LLM-as-a-Judge outputs. |
| code/flores_results.py | Calculation and aggregation of translation metrics for Translation-FLORES. |
The following table shows the initial experimental results reported for UdmurtBench. Yandex Translate is included only as a specialized translation baseline and is not part of the full four-task LLM comparison.
| Model | Overall | FLORES chrF++ | Belebele acc. | Global PIQA acc. | Shudkom acc. |
|---|---|---|---|---|---|
| google/gemini-3.5-flash | 0.928 | 48.92 | 0.82 | 0.98 | 0.42 |
| qwen/qwen3.6-plus | 0.759 | 46.78 | 0.86 | 0.96 | 0.18 |
| anthropic/claude-sonnet-4.6 | 0.534 | 35.82 | 0.80 | 0.85 | 0.14 |
| openai/gpt-5.5 | 0.456 | 35.53 | 0.76 | 0.76 | 0.14 |
| deepseek/deepseek-v4-flash | 0.408 | 41.24 | 0.71 | 0.74 | 0.07 |
| x-ai/grok-4.3 | 0.044 | 24.93 | 0.55 | 0.51 | 0.09 |
| moonshotai/kimi-k2.6 | 0.024 | 26.04 | 0.53 | 0.48 | 0.09 |
| yandex/translate | n/a | 53.69 | n/a | n/a | n/a |
The overall score is an auxiliary leaderboard value based on normalized task scores. It should be interpreted together with the individual task metrics.
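The text above does not spell out how task scores are normalized or combined. One common choice is per-task min-max normalization across the compared models, followed by an unweighted mean. The sketch below illustrates that choice only: the function name `overall_scores`, the tie-handling rule, and the toy scores are hypothetical, and the output is not expected to reproduce the Overall column of the results table.

```python
def overall_scores(task_scores):
    """Average per-task min-max normalized scores for each model.

    task_scores maps model name -> {task name -> raw score}; every model is
    assumed to have a score for the same set of tasks.
    """
    models = list(task_scores)
    tasks = list(task_scores[models[0]])
    normalized = {m: [] for m in models}
    for task in tasks:
        values = [task_scores[m][task] for m in models]
        lo, hi = min(values), max(values)
        for m in models:
            # Assumption: if all models tie on a task, each gets 0.0 for it.
            norm = (task_scores[m][task] - lo) / (hi - lo) if hi > lo else 0.0
            normalized[m].append(norm)
    return {m: sum(v) / len(v) for m, v in normalized.items()}


# Toy scores on two tasks with different scales (not benchmark results).
toy = {
    "model_a": {"translation": 40.0, "reading": 0.8},
    "model_b": {"translation": 20.0, "reading": 0.6},
    "model_c": {"translation": 30.0, "reading": 0.8},
}
for model, score in overall_scores(toy).items():
    print(model, round(score, 3))
```

Under min-max normalization the best model on a task gets 1 and the worst gets 0, so the aggregate depends on which models are in the comparison. This is one reason to read it alongside the per-task metrics rather than on its own.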
If you use UdmurtBench, please cite the thesis or repository:
@mastersthesis{lebedev2026udmurtbench,
title = {Создание набора данных для оценки больших языковых моделей на удмуртском языке},
author = {Lebedev, Egor M.},
school = {National Research University Higher School of Economics},
year = {2026},
url = {https://github.com/udmurtNLP/UdmurtBench}
}

UdmurtBench uses or adapts materials from multilingual benchmark resources such as FLORES-200, Belebele, and Global PIQA, and includes a culturally specific Udmurt task based on intellectual quiz questions.