A benchmark for evaluating vision-language models on chart question answering with Russian-language charts and Cyrillic labels.
Dataset on HuggingFace: https://huggingface.co/datasets/romath/RuChartQA
Two complementary datasets totalling 1,442 question-answer pairs:
- Synthetic (1,200 QA) — controlled matplotlib-generated bar charts in 4 variants: ru/en × image/text-only. Three subdatasets: ChartBasic (lookup, max, min), ChartReasoning (difference, ratio, sum, average, multistep, conditional), ChartPerception (10-12 bars, narrow value spread).
- ChartReal (242 QA) — 96 real charts extracted from official Russian government publications (Rosstat 51%, Bank of Russia 49%). Russian-only, image-only. Used for external validity.
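As a rough illustration of how the controlled synthetic examples could be produced, the sketch below builds a seeded bar-chart specification together with the three ChartBasic question types (lookup, max, min). The category pool, value range, field names, and question wording are all hypothetical; the actual logic lives in `generator/` and may differ.

```python
import random

# Hypothetical label pool; the real generator's vocabulary is not shown in this README.
CATEGORIES = ["Москва", "Казань", "Самара", "Омск", "Пермь", "Уфа"]


def make_example(rng, n_bars=5):
    """Build one bar-chart spec with lookup/max/min questions (illustrative only)."""
    labels = rng.sample(CATEGORIES, n_bars)
    values = [rng.randint(10, 100) for _ in labels]
    data = dict(zip(labels, values))
    target = rng.choice(labels)
    return {
        "labels": labels,
        "values": values,
        "qa": [
            {"type": "lookup",
             "question": f"Каково значение для «{target}»?",
             "answer": data[target]},
            {"type": "max",
             "question": "Какая категория имеет наибольшее значение?",
             "answer": max(data, key=data.get)},
            {"type": "min",
             "question": "Какая категория имеет наименьшее значение?",
             "answer": min(data, key=data.get)},
        ],
    }


def generate(n, seed=42):
    """Generate n examples; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]
```

Each spec could then be drawn with matplotlib (for example `ax.bar(labels, values)`) for the image variants, or serialized as text for the text-only variants.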
Three vision-language models are evaluated: Gemini 2.5 Flash, Nemotron Nano 12B VL, and Qwen3-VL 32B, plus an OCR + text-LLM baseline.
- Synthetic-only evaluation overestimates real-world performance by 11–41 percentage points (paired bootstrap, all p < 0.01). Magnitude is model-dependent: Qwen +11pp, Gemini +22pp, Nemotron +41pp.
- The leaderboard reorders on real charts. On synthetic data Gemini leads; on ChartReal, Qwen and Gemini are statistically indistinguishable (the CI of their difference includes 0), and both clearly outperform Nemotron.
- Vision contributes meaningfully. A naive OCR + Llama 3.3 70B baseline achieves only 34.7% on ChartReal; even the weakest VLM beats it by 11pp (p = 0.006).
- The difficulty is visual, not linguistic. On synthetic ChartReasoning, the gaps between ru and en image variants are not statistically significant, while the gaps between image and text-only variants are significant (9–18pp).
- Min queries are a blind spot on real charts. All models drop from 93–97% on synthetic to 23–55% on ChartReal.
See analysis/ and results_v2/ for full numbers.
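For readers unfamiliar with the paired bootstrap behind the model comparisons above, here is a minimal sketch for two systems scored on the same items. It assumes per-item 0/1 correctness lists, a percentile confidence interval, and a simple two-sided p-value. The function name and defaults are hypothetical, and the scripts in `analysis/` may use a different procedure.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap on the accuracy difference A - B over shared items.

    Returns (observed difference, 95% percentile CI, two-sided p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    n = len(correct_a)
    # Resample per-item differences so that the pairing is preserved.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))

    lo = boot[int(0.025 * n_resamples)]
    hi = boot[int(0.975 * n_resamples) - 1]
    # Two-sided p-value: twice the smaller share of resampled differences
    # lying at or beyond zero, capped at 1.
    below = sum(d <= 0 for d in boot)
    above = sum(d >= 0 for d in boot)
    p_value = min(1.0, 2 * min(below, above) / n_resamples)
    return observed, (lo, hi), p_value
```

A confidence interval that includes 0 corresponds to the "statistically indistinguishable" case; an interval entirely above or below 0 indicates a significant gap.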
- `generator/`: synthetic chart generation (Python + matplotlib)
- `chartreal/`: real-chart collection pipeline (PyMuPDF + manual classification + manual annotation)
- `eval/`: evaluation pipeline and normalizer (numeric tolerance, bidirectional categorical, year-as-numeric exact match)
- `analysis/`: bootstrap CIs, methodology notes
- `error_analysis/`: 273 manually annotated errors with categories
- `results_v2/`: predictions of 3 VLMs on synthetic dataset
- `chartreal/predictions/`: predictions on ChartReal
- `data_v2/dataset_v2.json`: synthetic dataset
- `data_v2/dataset_real.json`: ChartReal dataset
- `sources.csv`: chart provenance (source, URL, page)
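The three normalizer rules named for `eval/` can be read in more than one way. The sketch below shows one plausible interpretation, not the repository's implementation: numeric answers match within a relative tolerance, four-digit years must match exactly, and categorical answers match if either normalized string contains the other. The 5% tolerance, the year range, and all function names are assumptions.

```python
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")


def _clean(text):
    """Lowercase, trim, and unify non-breaking spaces."""
    return text.replace("\u00a0", " ").strip().lower()


def _numeric_form(text):
    """Drop spaces between digits and treat a comma as the decimal separator."""
    return re.sub(r"(?<=\d) (?=\d)", "", _clean(text)).replace(",", ".")


def _gold_number(gold):
    """Return the gold answer as a float only if it is purely numeric."""
    compact = _numeric_form(gold).replace(" ", "")
    return float(compact) if NUM.fullmatch(compact) else None


def _first_number(prediction):
    """Return the first number found in a free-form prediction, if any."""
    match = NUM.search(_numeric_form(prediction))
    return float(match.group()) if match else None


def is_match(prediction, gold, rel_tol=0.05):
    gold_num = _gold_number(gold)
    if gold_num is not None:
        pred_num = _first_number(prediction)
        if pred_num is None:
            return False
        # Assumed rule: years are compared exactly, never with tolerance.
        if gold_num.is_integer() and 1900 <= gold_num <= 2100:
            return pred_num == gold_num
        return abs(pred_num - gold_num) <= rel_tol * abs(gold_num)
    # Assumed "bidirectional" rule: either string may contain the other.
    p, g = _clean(prediction), _clean(gold)
    return bool(p) and bool(g) and (p in g or g in p)
```

Under these assumptions, `is_match("Ответ: 103", "100")` is accepted because 103 is within 5% of 100, while `is_match("2021", "2020")` is rejected even though the two values differ by far less than 5%.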
Images themselves are not stored in this repository (they are on HuggingFace). To regenerate or run evaluation:
- Synthetic data: `python generator/main.py` regenerates all 1,200 examples deterministically (seed=42).
- ChartReal: `python chartreal/collect_charts.py` re-downloads PDFs and extracts pages; manual classification and QA annotation steps are documented in `chartreal/README.md`.
- Evaluation: `python eval/eval_v2.py --dataset <path> --model <slug>` runs a model through OpenRouter; predictions are saved incrementally.
- Statistical analysis: scripts in `analysis/` reproduce the 15 paired-bootstrap tests reported in the thesis.
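Saving predictions incrementally means an interrupted evaluation run can resume without repeating model calls that already completed. The sketch below illustrates the general pattern with an append-only JSONL file. The file format, the `id` field, and the `predict` callable are hypothetical stand-ins; the real `eval/eval_v2.py` queries a model through OpenRouter and may store results differently.

```python
import json
from pathlib import Path


def load_done(path):
    """Return the ids already recorded in a JSONL predictions file."""
    p = Path(path)
    if not p.exists():
        return set()
    with p.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_incremental(examples, predict, path):
    """Append one prediction per unseen example; return how many were skipped."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for ex in examples:
            if ex["id"] in done:
                continue  # already answered in a previous run
            record = {"id": ex["id"], "prediction": predict(ex)}
            # ensure_ascii=False keeps Cyrillic answers readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # hand each record to the OS before the next model call
    return len(done)
```

Writing one line per example keeps partial results usable even if the process is killed midway.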
If you use this benchmark, please cite via Zenodo:

    @dataset{iashchenko_2026_ruchartqa,
      author    = {Iashchenko, Roman},
      title     = {RuChartQA: Russian Chart Reasoning Benchmark for Vision-Language Models},
      year      = {2026},
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.20109375}
    }
DOI will be added after Zenodo release.
Apache 2.0 for code and dataset annotations. Original chart images
in ChartReal are extracted from public-domain Russian government
publications (Rosstat, Bank of Russia, Ministry of Economic
Development); see sources.csv for provenance.
Roman Iashchenko, HSE University, Faculty of Computer Science (PMI), 2026. Bachelor's thesis project.