UdmurtBench

UdmurtBench is a benchmark for evaluating large language models (LLMs) on the Udmurt language. The benchmark is designed for a low-resource language setting and combines translated, adapted, and culturally specific tasks.

The benchmark evaluates several aspects of model competence:

  • machine translation into Udmurt;
  • reading comprehension in Udmurt;
  • physical commonsense reasoning in Udmurt;
  • short-answer cultural knowledge related to the Udmurt context.

The project was developed as part of a master's thesis on creating a dataset for evaluating LLMs on the Udmurt language.

Why UdmurtBench?

Udmurt is a low-resource and digitally underrepresented language. For such languages, a single translation test is not enough: a model can translate isolated sentences reasonably well, but still fail at reading comprehension, commonsense reasoning, or culturally specific questions.

UdmurtBench therefore follows a modular design. Each subtask measures a different ability, and the final leaderboard is intended to show a model profile rather than a single universal score.

Benchmark structure

| Subtask | What it measures | Task format | Main metric |
| --- | --- | --- | --- |
| Translation-FLORES | Russian → Udmurt machine translation | Free-form translation | chrF++ |
| Belebele-Udmurt | Reading comprehension | Multiple choice, 4 options | Accuracy |
| Global PIQA-Udmurt | Physical commonsense reasoning | A/B choice | Accuracy |
| Shudkom | Udmurt cultural knowledge | Short free-form answer | LLM-as-a-Judge accuracy |

Additional metrics are used for diagnostics: BLEU and TER for translation, valid answer rate for classification-style tasks, and generation/judge success rates for the short-answer task.
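As an illustration of how accuracy and valid answer rate relate on the classification-style tasks, here is a minimal sketch. It is not taken from `belebele_eval.py` or `piqa_eval.py`; the function names `extract_choice` and `score_choices` and the regex-based answer extraction are assumptions made for this example.

```python
import re
from typing import Optional, Sequence


def extract_choice(output: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter found in the output, or None.

    Hypothetical extraction rule: the real scripts may parse answers differently.
    """
    match = re.search(rf"\b([{options}])\b", output.strip().upper())
    return match.group(1) if match else None


def score_choices(outputs: Sequence[str], gold: Sequence[str],
                  options: str = "ABCD") -> dict:
    """Compute accuracy and valid answer rate over all items.

    An output with no recognizable option letter counts as invalid and wrong.
    """
    preds = [extract_choice(o, options) for o in outputs]
    n = len(gold)
    valid = sum(p is not None for p in preds)
    correct = sum(p == g for p, g in zip(preds, gold))
    return {"accuracy": correct / n, "valid_answer_rate": valid / n}


if __name__ == "__main__":
    # Belebele-style items use four options; pass options="AB" for Global PIQA.
    print(score_choices(["B", "Answer: c", "I don't know", "(A)"],
                        ["B", "C", "D", "B"]))
```

Reporting the two numbers side by side separates reasoning errors from formatting failures: a model with low accuracy but a high valid answer rate is choosing wrong options, while a low valid answer rate points to outputs that could not be parsed at all.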

Dataset splits

| Subtask | Source | Dev size | Test size |
| --- | --- | --- | --- |
| Translation-FLORES | FLORES-200 | 1400 | 100 |
| Belebele-Udmurt | Belebele / FLORES | 800 | 100 |
| Global PIQA-Udmurt | Global PIQA | 0 | 100 |
| Shudkom | Udmurt intellectual quiz questions | 114 | 100 |

The hidden test subset should not be published in full if the benchmark is used for final model comparison. Withholding it reduces the risk that test items leak into future training data and contaminate the benchmark.

Repository layout

Current code layout:

UdmurtBench/
└── code/
    ├── belebele_eval.py
    ├── flores_results.py
    ├── piqa_eval.py
    ├── shudkom_eval.py
    └── shudkom_results.py

Code overview

| File | Purpose |
| --- | --- |
| code/belebele_eval.py | Evaluation script for Belebele-Udmurt: reading comprehension with four answer options. |
| code/piqa_eval.py | Evaluation script for Global PIQA-Udmurt: A/B physical commonsense reasoning. |
| code/shudkom_eval.py | Generation and/or judging script for the Shudkom short-answer task. |
| code/shudkom_results.py | Aggregation of Shudkom results and LLM-as-a-Judge outputs. |
| code/flores_results.py | Calculation and aggregation of translation metrics for Translation-FLORES. |
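To make the Shudkom aggregation step more concrete, the sketch below shows one way generation success rate, judge success rate, and LLM-as-a-Judge accuracy could be derived from per-question records. The `ShudkomRecord` schema, the `aggregate` function, and the choice to count failed generations or judge calls as incorrect are all assumptions for illustration; `shudkom_results.py` may store and aggregate results differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ShudkomRecord:
    # Hypothetical per-question record; field names are not from the repository.
    question_id: str
    answer: Optional[str]    # model answer, None if generation failed
    verdict: Optional[bool]  # judge decision, None if the judge call failed


def aggregate(records: Sequence[ShudkomRecord]) -> dict:
    """Summarize generation, judging, and accuracy over all questions."""
    total = len(records)
    generated = [r for r in records if r.answer is not None]
    judged = [r for r in generated if r.verdict is not None]
    correct = sum(1 for r in judged if r.verdict)
    return {
        # Share of questions for which the model produced any answer.
        "generation_success_rate": len(generated) / total,
        # Share of produced answers for which the judge returned a verdict.
        "judge_success_rate": len(judged) / len(generated) if generated else 0.0,
        # Assumption: unanswered and unjudged questions count as incorrect.
        "judge_accuracy": correct / total,
    }


if __name__ == "__main__":
    demo = [
        ShudkomRecord("q1", "answer one", True),
        ShudkomRecord("q2", "answer two", False),
        ShudkomRecord("q3", None, None),
        ShudkomRecord("q4", "answer four", None),
    ]
    print(aggregate(demo))
```

Dividing accuracy by the total number of questions rather than by the number of judged answers keeps models comparable when some of them fail to generate an answer more often than others.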

Initial leaderboard

The following table shows the initial experimental results reported for UdmurtBench. Yandex Translate is included only as a specialized translation baseline and is not part of the full four-task LLM comparison.

| Model | Overall | FLORES chrF++ | Belebele acc. | Global PIQA acc. | Shudkom acc. |
| --- | --- | --- | --- | --- | --- |
| google/gemini-3.5-flash | 0.928 | 48.92 | 0.82 | 0.98 | 0.42 |
| qwen/qwen3.6-plus | 0.759 | 46.78 | 0.86 | 0.96 | 0.18 |
| anthropic/claude-sonnet-4.6 | 0.534 | 35.82 | 0.80 | 0.85 | 0.14 |
| openai/gpt-5.5 | 0.456 | 35.53 | 0.76 | 0.76 | 0.14 |
| deepseek/deepseek-v4-flash | 0.408 | 41.24 | 0.71 | 0.74 | 0.07 |
| x-ai/grok-4.3 | 0.044 | 24.93 | 0.55 | 0.51 | 0.09 |
| moonshotai/kimi-k2.6 | 0.024 | 26.04 | 0.53 | 0.48 | 0.09 |
| yandex/translate | n/a | 53.69 | n/a | n/a | n/a |

The overall score is an auxiliary leaderboard value based on normalized task scores. It should be interpreted together with the individual task metrics.
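The README does not state the normalization formula. The published overall values are, however, consistent with per-task min-max normalization followed by an unweighted mean over the four tasks, with the Yandex Translate chrF++ included when setting the FLORES range. The sketch below reproduces the table under that assumption; the function name `overall_scores` and its `extra_flores` parameter are hypothetical, and the actual computation may differ.

```python
# Per-task scores copied from the leaderboard:
# (FLORES chrF++, Belebele acc., Global PIQA acc., Shudkom acc.)
SCORES = {
    "google/gemini-3.5-flash": (48.92, 0.82, 0.98, 0.42),
    "qwen/qwen3.6-plus": (46.78, 0.86, 0.96, 0.18),
    "anthropic/claude-sonnet-4.6": (35.82, 0.80, 0.85, 0.14),
    "openai/gpt-5.5": (35.53, 0.76, 0.76, 0.14),
    "deepseek/deepseek-v4-flash": (41.24, 0.71, 0.74, 0.07),
    "x-ai/grok-4.3": (24.93, 0.55, 0.51, 0.09),
    "moonshotai/kimi-k2.6": (26.04, 0.53, 0.48, 0.09),
}
YANDEX_CHRF = 53.69  # translation-only baseline from the same table


def overall_scores(scores, extra_flores=()):
    """Min-max normalize each task column, then average per model.

    Assumption: extra_flores values widen the FLORES range without being ranked.
    Assumes every column has at least two distinct values (no zero range).
    """
    columns = [list(col) for col in zip(*scores.values())]
    columns[0].extend(extra_flores)
    bounds = [(min(col), max(col)) for col in columns]
    return {
        model: sum((v - lo) / (hi - lo) for v, (lo, hi) in zip(vals, bounds)) / len(vals)
        for model, vals in scores.items()
    }


if __name__ == "__main__":
    overall = overall_scores(SCORES, extra_flores=[YANDEX_CHRF])
    for model, value in sorted(overall.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {value:.3f}")
```

Because min-max normalization depends on the best and worst system in each column, the overall value of every model can shift when a new model is added to the leaderboard. This is one more reason to read it alongside the individual task metrics.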

Citation

If you use UdmurtBench, please cite the thesis or repository:

@mastersthesis{lebedev2026udmurtbench,
  title  = {Создание набора данных для оценки больших языковых моделей на удмуртском языке},
  author = {Lebedev, Egor M.},
  school = {National Research University Higher School of Economics},
  year   = {2026},
  url    = {https://github.com/udmurtNLP/UdmurtBench}
}

Acknowledgements

UdmurtBench uses or adapts materials from multilingual benchmark resources such as FLORES-200, Belebele, and Global PIQA, and includes a culturally specific Udmurt task based on intellectual quiz questions.
