UdmurtBench

UdmurtBench is a benchmark for evaluating large language models (LLMs) on the Udmurt language. The benchmark is designed for a low-resource language setting and combines translated, adapted, and culturally specific tasks.

The benchmark evaluates several aspects of model competence:

machine translation into Udmurt;
reading comprehension in Udmurt;
physical commonsense reasoning in Udmurt;
short-answer cultural knowledge related to the Udmurt context.

The project was developed as part of a master's thesis on creating a dataset for evaluating LLMs on the Udmurt language.

Why UdmurtBench?

Udmurt is a low-resource, digitally underrepresented language. For such languages, a single translation test is not enough: a model can translate isolated sentences reasonably well and still fail at reading comprehension, commonsense reasoning, or culturally specific questions.

UdmurtBench therefore follows a modular design. Each subtask measures a different ability, and the final leaderboard is intended to show a model profile rather than a single universal score.

Benchmark structure

Subtask	What it measures	Task format	Main metric
`Translation-FLORES`	Russian → Udmurt machine translation	Free-form translation	chrF++
`Belebele-Udmurt`	Reading comprehension	Multiple choice, 4 options	Accuracy
`Global PIQA-Udmurt`	Physical commonsense reasoning	A/B choice	Accuracy
`Shudkom`	Udmurt cultural knowledge	Short free-form answer	LLM-as-a-Judge accuracy

Additional metrics are used for diagnostics: BLEU and TER for translation, valid answer rate for classification-style tasks, and generation/judge success rates for the short-answer task.
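For the two classification-style tasks, accuracy and valid answer rate can be computed together. The sketch below is illustrative only: the function names, the letter-extraction rule, and the choice to count unparseable replies as wrong are assumptions, not the logic of `belebele_eval.py` or `piqa_eval.py`.

```python
import re
from typing import Dict, List, Optional


def extract_choice(reply: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter in a model reply, or None.

    Hypothetical parsing rule: only uppercase Latin letters that stand alone
    (for example "B", "B)", "Answer: B") are accepted.
    """
    match = re.search(rf"\b([{options}])\b", reply)
    return match.group(1) if match else None


def score_choices(replies: List[str], gold: List[str], options: str = "ABCD") -> Dict[str, float]:
    """Compute accuracy and valid answer rate over all items.

    Assumption: a reply with no parseable letter counts as incorrect,
    so both metrics share the same denominator (the number of items).
    """
    if len(replies) != len(gold) or not gold:
        raise ValueError("replies and gold must be non-empty and of equal length")
    preds = [extract_choice(r, options) for r in replies]
    valid = sum(p is not None for p in preds)
    correct = sum(p == g for p, g in zip(preds, gold))
    return {
        "accuracy": correct / len(gold),
        "valid_answer_rate": valid / len(gold),
    }


# Four-option example (Belebele-style); the third reply contains no option letter.
belebele_scores = score_choices(
    ["B", "Answer: C", "no idea", "A) first option"],
    ["B", "D", "A", "A"],
)

# Two-option example (Global PIQA-style) uses the same code with options="AB".
piqa_scores = score_choices(["A", "B", "B"], ["A", "A", "B"], options="AB")
```

Reporting the valid answer rate next to accuracy helps separate two failure modes: a model that answers in the wrong format and a model that answers in the right format but picks the wrong option.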

Dataset splits

Subtask	Source	Dev size	Test size
`Translation-FLORES`	FLORES-200	1400	100
`Belebele-Udmurt`	Belebele / FLORES	800	100
`Global PIQA-Udmurt`	Global PIQA	0	100
`Shudkom`	Udmurt intellectual quiz questions	114	100

If the benchmark is used for final model comparison, the test subset should stay hidden and not be published in full. Keeping it private reduces the risk of benchmark contamination, where test items leak into the training data of future models.

Repository layout

Current code layout:

UdmurtBench/
└── code/
    ├── belebele_eval.py
    ├── flores_results.py
    ├── piqa_eval.py
    ├── shudkom_eval.py
    └── shudkom_results.py

Code overview

File	Purpose
`code/belebele_eval.py`	Evaluation script for `Belebele-Udmurt`: reading comprehension with four answer options.
`code/piqa_eval.py`	Evaluation script for `Global PIQA-Udmurt`: A/B physical commonsense reasoning.
`code/shudkom_eval.py`	Generation and/or judging script for the `Shudkom` short-answer task.
`code/shudkom_results.py`	Aggregation of `Shudkom` results and LLM-as-a-Judge outputs.
`code/flores_results.py`	Calculation and aggregation of translation metrics for `Translation-FLORES`.
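As a rough illustration of what aggregating `Shudkom` results involves, the sketch below computes the generation success rate, the judge success rate, and LLM-as-a-Judge accuracy from per-question records. The record fields, the function name, and the choice of denominators are hypothetical and are not taken from `shudkom_results.py`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ShudkomRecord:
    """One question after generation and judging (hypothetical schema)."""
    question_id: str
    answer: Optional[str]    # None if the model failed to produce an answer
    verdict: Optional[bool]  # None if the judge failed or was not run


def aggregate_shudkom(records: List[ShudkomRecord]) -> Dict[str, float]:
    """Summarize short-answer results.

    Assumptions: the judge success rate is measured over generated answers only,
    and accuracy is measured over all questions, so failures count as incorrect.
    """
    if not records:
        raise ValueError("no records to aggregate")
    total = len(records)
    generated = [r for r in records if r.answer is not None]
    judged = [r for r in generated if r.verdict is not None]
    correct = sum(1 for r in judged if r.verdict)
    return {
        "generation_success_rate": len(generated) / total,
        "judge_success_rate": len(judged) / len(generated) if generated else 0.0,
        "judge_accuracy": correct / total,
    }


# Toy records: one correct, one incorrect, one failed generation, one failed judgment.
summary = aggregate_shudkom([
    ShudkomRecord("q1", "answer one", True),
    ShudkomRecord("q2", "answer two", False),
    ShudkomRecord("q3", None, None),
    ShudkomRecord("q4", "answer four", None),
])
```

Keeping the two success rates separate from accuracy makes it visible whether a low score comes from wrong answers or from pipeline failures.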

Initial leaderboard

The following table shows the initial experimental results reported for UdmurtBench. Yandex Translate is included only as a specialized translation baseline and is not part of the full four-task LLM comparison.

Model	Overall	FLORES chrF++	Belebele acc.	Global PIQA acc.	Shudkom acc.
`google/gemini-3.5-flash`	0.928	48.92	0.82	0.98	0.42
`qwen/qwen3.6-plus`	0.759	46.78	0.86	0.96	0.18
`anthropic/claude-sonnet-4.6`	0.534	35.82	0.80	0.85	0.14
`openai/gpt-5.5`	0.456	35.53	0.76	0.76	0.14
`deepseek/deepseek-v4-flash`	0.408	41.24	0.71	0.74	0.07
`x-ai/grok-4.3`	0.044	24.93	0.55	0.51	0.09
`moonshotai/kimi-k2.6`	0.024	26.04	0.53	0.48	0.09
`yandex/translate`	—	53.69	—	—	—

The overall score is an auxiliary leaderboard value based on normalized task scores. It should be interpreted together with the individual task metrics.
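The normalization is not spelled out here. One reading that agrees with the reported Overall column to three decimals is per-task min-max normalization across all listed systems (with the Yandex Translate chrF++ score included in the translation range), followed by an unweighted mean over the four tasks. The sketch below implements that reading; it is an inference from the table, not a documented formula, and the names `SCORES` and `min_max_overall` are illustrative.

```python
from statistics import mean
from typing import Dict, List, Optional

# Columns: FLORES chrF++, Belebele acc., Global PIQA acc., Shudkom acc.
# Values are copied from the leaderboard table; None marks a task that was not run.
SCORES: Dict[str, List[Optional[float]]] = {
    "google/gemini-3.5-flash": [48.92, 0.82, 0.98, 0.42],
    "qwen/qwen3.6-plus": [46.78, 0.86, 0.96, 0.18],
    "anthropic/claude-sonnet-4.6": [35.82, 0.80, 0.85, 0.14],
    "openai/gpt-5.5": [35.53, 0.76, 0.76, 0.14],
    "deepseek/deepseek-v4-flash": [41.24, 0.71, 0.74, 0.07],
    "x-ai/grok-4.3": [24.93, 0.55, 0.51, 0.09],
    "moonshotai/kimi-k2.6": [26.04, 0.53, 0.48, 0.09],
    "yandex/translate": [53.69, None, None, None],
}


def min_max_overall(scores: Dict[str, List[Optional[float]]]) -> Dict[str, float]:
    """Average of per-task min-max normalized scores (assumed formula).

    The range of each task is taken over every system that has a score for it.
    Systems missing any task are left out of the overall ranking.
    """
    n_tasks = len(next(iter(scores.values())))
    bounds = []
    for i in range(n_tasks):
        column = [row[i] for row in scores.values() if row[i] is not None]
        bounds.append((min(column), max(column)))

    overall = {}
    for name, row in scores.items():
        if any(value is None for value in row):
            continue
        normalized = [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, bounds)
        ]
        overall[name] = mean(normalized)
    return overall


overall = min_max_overall(SCORES)
for name, value in sorted(overall.items(), key=lambda item: -item[1]):
    print(f"{name}\t{value:.3f}")
```

Under this reading the overall score is relative to the set of systems being compared: adding or removing a system changes the per-task ranges and therefore every model's overall value, which is one more reason to read it alongside the individual task metrics.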

Citation

If you use UdmurtBench, please cite the thesis or repository:

@mastersthesis{lebedev2026udmurtbench,
  title  = {Создание набора данных для оценки больших языковых моделей на удмуртском языке},
  author = {Lebedev, Egor M.},
  school = {National Research University Higher School of Economics},
  year   = {2026},
  url    = {https://github.com/udmurtNLP/UdmurtBench}
}

Acknowledgements

UdmurtBench uses or adapts materials from multilingual benchmark resources such as FLORES-200, Belebele, and Global PIQA, and includes a culturally specific Udmurt task based on intellectual quiz questions.
