RuChartQA: Russian Chart Reasoning Benchmark for Vision-Language Models

A benchmark for evaluating vision-language models on chart question answering with Russian-language charts and Cyrillic labels.

Dataset on HuggingFace: https://huggingface.co/datasets/romath/RuChartQA

What's in this benchmark

Two complementary datasets totalling 1,442 question-answer pairs:

Synthetic (1,200 QA) — controlled matplotlib-generated bar charts in 4 variants: ru/en × image/text-only. Three subdatasets: ChartBasic (lookup, max, min), ChartReasoning (difference, ratio, sum, average, multistep, conditional), ChartPerception (10-12 bars, narrow value spread).
ChartReal (242 QA) — 96 real charts extracted from official Russian government publications (Rosstat 51%, Bank of Russia 49%). Russian-only, image-only. Used for external validity.
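
To make the synthetic question types concrete, here is a minimal sketch of how ground-truth answers for several of them could be derived from the values behind a bar chart. The helper name `answer_question`, the dict-based chart representation, and the sample data are illustrative assumptions and are not taken from the generator code in this repository.

```python
from statistics import mean

def answer_question(values, qtype, a=None, b=None):
    """Compute a ground-truth answer for one question type (hypothetical helper).

    values: dict mapping a bar label to its numeric value.
    a, b:   bar labels used by lookup, difference, and ratio questions.
    """
    if qtype == "lookup":
        return values[a]
    if qtype == "max":
        return max(values, key=values.get)  # label of the tallest bar
    if qtype == "min":
        return min(values, key=values.get)  # label of the shortest bar
    if qtype == "difference":
        return values[a] - values[b]
    if qtype == "ratio":
        return values[a] / values[b]
    if qtype == "sum":
        return sum(values.values())
    if qtype == "average":
        return mean(values.values())
    raise ValueError(f"unsupported question type: {qtype}")

# Made-up chart data with Cyrillic labels, for illustration only.
chart = {"Москва": 120, "Казань": 80, "Омск": 40}
print(answer_question(chart, "max"))
print(answer_question(chart, "difference", "Москва", "Казань"))
```

Multistep and conditional questions would compose these primitives; they are left out because their exact definitions are not given here.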

Main findings

Three vision-language models were evaluated: Gemini 2.5 Flash, Nemotron Nano 12B VL, and Qwen3-VL 32B, plus an OCR + text-LLM baseline.

Synthetic-only evaluation overestimates real-world performance by 11–41 percentage points (paired bootstrap, all p < 0.01). Magnitude is model-dependent: Qwen +11pp, Gemini +22pp, Nemotron +41pp.
The leaderboard reorders on real charts. On synthetic data Gemini leads; on ChartReal, Qwen and Gemini are statistically indistinguishable (the CI of their difference includes 0), and both clearly outperform Nemotron.
Vision contributes meaningfully. A naive OCR + Llama 3.3 70B baseline reaches only 34.7% on ChartReal; even the weakest VLM beats it by 11pp (p = 0.006).
It's vision difficulty, not language difficulty. On synthetic ChartReasoning, the ru vs en gaps in the image setting are not statistically significant, while the image vs text-only gaps are significant, at 9–18pp.
Min queries are a blind spot on real charts. All models drop from 93–97% on synthetic to 23–55% on ChartReal.
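
The p-values and confidence intervals above come from paired bootstrap tests. As a rough illustration of the technique, the sketch below compares two systems scored on the same items (for example, a VLM and the OCR baseline on ChartReal) using a percentile bootstrap over per-item score differences. The function name, the resample count, and the p-value convention are assumptions; the scripts in analysis/ may differ in all of these.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Return (observed accuracy difference, 95% CI, two-sided p-value).

    correct_a, correct_b: per-item 0/1 scores of two systems on the SAME items.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired test needs scores for the same items")
    n = len(correct_a)
    # Pairing: work with per-item differences so item difficulty cancels out.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    stats = sorted(
        sum(rng.choices(diffs, k=n)) / n  # resample items with replacement
        for _ in range(n_resamples)
    )
    lo = stats[int(0.025 * n_resamples)]
    hi = stats[int(0.975 * n_resamples) - 1]

    # Two-sided p-value: how often the resampled difference reaches or
    # crosses zero, doubled and capped at 1 (one common convention).
    frac_le = sum(s <= 0 for s in stats) / n_resamples
    frac_ge = sum(s >= 0 for s in stats) / n_resamples
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, (lo, hi), p_value
```

A CI that includes 0 corresponds to the "statistically indistinguishable" case; a CI entirely above 0 corresponds to a significant advantage for the first system.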

See analysis/ and results_v2/ for full numbers.

Repository structure

generator/ — synthetic chart generation (Python + matplotlib)
chartreal/ — real-chart collection pipeline (PyMuPDF + manual classification + manual annotation)
eval/ — evaluation pipeline, normalizer (numeric tolerance, bidirectional categorical, year-as-numeric exact match)
analysis/ — bootstrap CIs, methodology notes
error_analysis/ — 273 manually annotated errors with categories
results_v2/ — predictions of 3 VLMs on synthetic dataset
chartreal/predictions/ — predictions on ChartReal
data_v2/dataset_v2.json — synthetic dataset
data_v2/dataset_real.json — ChartReal dataset
sources.csv — chart provenance (source, URL, page)
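
The normalizer rules listed for eval/ (numeric tolerance, bidirectional categorical matching, exact match for years) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in eval/: the 5% relative tolerance, the four-digit year pattern, and the names `parse_number` and `is_match` are all hypothetical.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
_YEAR = re.compile(r"(?:19|20)\d{2}")  # assumed definition of a "year" answer

def parse_number(text):
    """Parse a number, accepting a decimal comma, spaces, and a trailing %."""
    cleaned = (
        text.strip().rstrip("%")
        .replace("\u00a0", "").replace(" ", "").replace(",", ".")
    )
    return float(cleaned) if _NUMBER.fullmatch(cleaned) else None

def is_match(prediction, gold, rel_tol=0.05):
    """Decide whether a model answer matches the gold answer."""
    pred_num, gold_num = parse_number(prediction), parse_number(gold)
    if pred_num is not None and gold_num is not None:
        if _YEAR.fullmatch(gold.strip()):
            # Years are numeric but must match exactly: 2021 is not "close to" 2020.
            return pred_num == gold_num
        # Other numbers: accept answers within a relative tolerance of gold.
        return abs(pred_num - gold_num) <= rel_tol * abs(gold_num)
    # Categorical: case-insensitive containment in either direction.
    p, g = prediction.strip().lower(), gold.strip().lower()
    return bool(p) and bool(g) and (p in g or g in p)
```

One limitation of this sketch: a non-year gold value such as 2000 would be treated as a year. A real pipeline would more likely rely on answer-type metadata than on a pattern match.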

Reproducing the results

The images themselves are not stored in this repository; they are hosted on HuggingFace. To regenerate the data or rerun the evaluation:

Synthetic data: python generator/main.py regenerates all 1,200 examples deterministically (seed=42).
ChartReal: python chartreal/collect_charts.py re-downloads PDFs and extracts pages; manual classification and QA annotation steps documented in chartreal/README.md.
Evaluation: python eval/eval_v2.py --dataset <path> --model <slug> runs a model through OpenRouter; predictions are saved incrementally.
Statistical analysis: scripts in analysis/ reproduce the 15 paired-bootstrap tests reported in the thesis.
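
Saving predictions incrementally means an interrupted evaluation run can resume without paying for the same API calls twice. The sketch below shows one common way to do this with an append-only JSON Lines file. The file format, the `id` and `prediction` field names, and both function names are assumptions; eval/eval_v2.py may store its results differently.

```python
import json
from pathlib import Path

def load_done_ids(out_path):
    """Collect the ids of examples that already have a saved prediction."""
    path = Path(out_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_incremental(examples, predict, out_path):
    """Run predict() on unseen examples, appending one JSON line per result.

    predict is any callable taking an example dict and returning an answer
    string; in practice it would wrap the model API call.
    Returns the number of new predictions written.
    """
    done = load_done_ids(out_path)
    written = 0
    with Path(out_path).open("a", encoding="utf-8") as f:
        for ex in examples:
            if ex["id"] in done:
                continue  # already answered in an earlier run
            record = {"id": ex["id"], "prediction": predict(ex)}
            # ensure_ascii=False keeps Cyrillic answers readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # persist each result before the next API call
            written += 1
    return written
```

Calling `run_incremental` a second time with the same output path skips every example whose id is already in the file.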

Citation

If you use this benchmark, please cite via Zenodo:

@dataset{iashchenko_2026_ruchartqa,
  author    = {Iashchenko, Roman},
  title     = {RuChartQA: Russian Chart Reasoning Benchmark for Vision-Language Models},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20109375}
}

DOI will be added after Zenodo release.

License

Apache 2.0 for code and dataset annotations. Original chart images in ChartReal are extracted from public-domain Russian government publications (Rosstat, Bank of Russia, Ministry of Economic Development); see sources.csv for provenance.

Author

Roman Iashchenko, HSE University, Faculty of Computer Science (PMI), 2026. Bachelor's thesis project.
