
yaeooa/RuChartQA


RuChartQA: Russian Chart Reasoning Benchmark for Vision-Language Models

A benchmark for evaluating vision-language models on chart question answering with Russian-language charts and Cyrillic labels.

Dataset on HuggingFace: https://huggingface.co/datasets/romath/RuChartQA

What's in this benchmark

Two complementary datasets totalling 1,442 question-answer pairs:

  • Synthetic (1,200 QA) — controlled matplotlib-generated bar charts in 4 variants: ru/en × image/text-only. Three subdatasets: ChartBasic (lookup, max, min), ChartReasoning (difference, ratio, sum, average, multistep, conditional), ChartPerception (10-12 bars, narrow value spread).
  • ChartReal (242 QA) — 96 real charts extracted from official Russian government publications (Rosstat 51%, Bank of Russia 49%). Russian-only, image-only. Used for external validity.
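As a rough illustration of how a controlled synthetic example might be built, the sketch below draws seeded bar values and derives the three ChartBasic question types (lookup, max, min). The function names, value range, label text, and question wording are assumptions made for illustration; the actual generator lives in generator/ and renders the charts with matplotlib.

```python
import random

def make_chart(rng, n_bars=5, low=10, high=100):
    """Return a toy bar-chart spec: Cyrillic category labels mapped to integer values."""
    labels = [f"Категория {i + 1}" for i in range(n_bars)]  # "Category 1", "Category 2", ...
    # Sample without replacement so that the max and min answers are unique.
    values = rng.sample(range(low, high + 1), n_bars)
    return dict(zip(labels, values))

def basic_questions(chart):
    """Build lookup / max / min QA pairs from a chart spec (hypothetical wording)."""
    first = next(iter(chart))
    return [
        # "What is the value of <label>?"
        {"type": "lookup", "question": f"Какое значение у «{first}»?", "answer": chart[first]},
        # "Which category has the largest value?"
        {"type": "max", "question": "У какой категории наибольшее значение?",
         "answer": max(chart, key=chart.get)},
        # "Which category has the smallest value?"
        {"type": "min", "question": "У какой категории наименьшее значение?",
         "answer": min(chart, key=chart.get)},
    ]

# A fixed seed makes the generated example reproducible across runs.
rng = random.Random(42)
chart = make_chart(rng)
qa = basic_questions(chart)
for item in qa:
    print(item["type"], item["question"], item["answer"])
```

Because the values come from a seeded generator, the same chart and answers are produced on every run, which is what makes the synthetic split controllable.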

Main findings

Three vision-language models were evaluated: Gemini 2.5 Flash, Nemotron Nano 12B VL, and Qwen3-VL 32B, along with an OCR + text-LLM baseline.

  1. Synthetic-only evaluation overestimates real-world performance by 11–41 percentage points (paired bootstrap, all p < 0.01). The magnitude is model-dependent: Qwen +11pp, Gemini +22pp, Nemotron +41pp.
  2. The leaderboard reorders on real charts. On the synthetic data Gemini leads; on ChartReal, Qwen and Gemini are statistically indistinguishable (the CI of their difference includes 0), and both clearly outperform Nemotron.
  3. Vision contributes meaningfully. A naive OCR + Llama 3.3 70B baseline achieves only 34.7% on ChartReal; even the weakest VLM beats it by 11pp (p = 0.006).
  4. The difficulty is in vision, not language. On synthetic ChartReasoning, the gaps between ru and en image variants are not statistically significant, whereas the gaps between image and text-only variants are significant at 9–18pp.
  5. Min queries are a blind spot on real charts. All models drop from 93–97% accuracy on synthetic charts to 23–55% on ChartReal.
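For readers unfamiliar with the paired bootstrap behind these comparisons, here is a minimal generic sketch that compares two models on the same set of questions. It is not the code from analysis/; the function name, the percentile CI, the p-value convention, and the toy correctness vectors are all assumptions made for illustration.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Percentile 95% CI and two-sided p-value for accuracy(A) - accuracy(B).

    Both inputs are 0/1 correctness lists over the same questions, so items
    are resampled jointly (paired) rather than independently per model.
    """
    assert len(correct_a) == len(correct_b)
    n = len(correct_a)
    rng = random.Random(seed)
    per_item_diff = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(per_item_diff) / n

    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(per_item_diff, k=n)  # resample items with replacement
        stats.append(sum(sample) / n)
    stats.sort()

    ci_low = stats[int(0.025 * n_resamples)]
    ci_high = stats[int(0.975 * n_resamples) - 1]
    # Two-sided p-value: twice the smaller share of resampled differences
    # that fall at or beyond zero on either side.
    frac_le = sum(s <= 0 for s in stats) / n_resamples
    frac_ge = sum(s >= 0 for s in stats) / n_resamples
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, (ci_low, ci_high), p_value

# Toy, made-up correctness vectors (1 = correct) on the same 100 questions.
model_a = [1] * 80 + [0] * 20
model_b = [1] * 50 + [0] * 50
observed, (ci_low, ci_high), p_value = paired_bootstrap(model_a, model_b, n_resamples=2000)
print(f"diff={observed:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f}), p={p_value:.4f}")
```

When the resulting interval includes 0, the two models cannot be separated at that confidence level, which is the sense in which two systems are called statistically equivalent above.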

See analysis/ and results_v2/ for full numbers.

Repository structure

  • generator/ — synthetic chart generation (Python + matplotlib)
  • chartreal/ — real-chart collection pipeline (PyMuPDF + manual classification + manual annotation)
  • eval/ — evaluation pipeline, normalizer (numeric tolerance, bidirectional categorical, year-as-numeric exact match)
  • analysis/ — bootstrap CIs, methodology notes
  • error_analysis/ — 273 manually annotated errors with categories
  • results_v2/ — predictions of 3 VLMs on synthetic dataset
  • chartreal/predictions/ — predictions on ChartReal
  • data_v2/dataset_v2.json — synthetic dataset
  • data_v2/dataset_real.json — ChartReal dataset
  • sources.csv — chart provenance (source, URL, page)
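The three normalizer rules listed for eval/ (numeric tolerance, bidirectional categorical matching, exact match for years) could look roughly like the sketch below. This is an interpretation, not the repository's implementation: the 5% relative tolerance, the year range heuristic, the decimal-comma handling, and the reading of "bidirectional" as substring containment in either direction are all assumptions.

```python
import re

def parse_number(text):
    """Parse text as a single number, accepting a decimal comma; None otherwise."""
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    match = re.fullmatch(r"[-+]?\d+(?:\.\d+)?%?", cleaned)
    return float(match.group().rstrip("%")) if match else None

def is_year(value):
    """Heuristic (assumption): whole numbers in 1900..2100 are treated as years."""
    return value.is_integer() and 1900 <= value <= 2100

def is_correct(pred, gold, rel_tol=0.05):
    """Compare a prediction with the gold answer under the three rules."""
    p, g = parse_number(pred), parse_number(gold)
    if g is not None:
        if p is None:
            return False
        if is_year(g):
            return p == g  # years must match exactly, no tolerance
        if g == 0:
            return p == 0  # relative tolerance is undefined at zero
        return abs(p - g) <= rel_tol * abs(g)  # numeric answers: relative tolerance
    # Categorical answers: case-insensitive containment in either direction.
    p_norm, g_norm = pred.strip().lower(), gold.strip().lower()
    return bool(p_norm) and (p_norm in g_norm or g_norm in p_norm)

print(is_correct("104", "100"))           # within the assumed 5% tolerance
print(is_correct("2021", "2020"))         # years are compared exactly
print(is_correct("Москва", "г. Москва"))  # categorical, matched by containment
```

Treating years as exact-match values avoids the case where a tolerance of a few percent would accept an answer that is decades off.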

Reproducing the results

Images themselves are not stored in this repository (they are on HuggingFace). To regenerate or run evaluation:

  1. Synthetic data: python generator/main.py regenerates all 1,200 examples deterministically (seed=42).
  2. ChartReal: python chartreal/collect_charts.py re-downloads the PDFs and extracts pages; the manual classification and QA annotation steps are documented in chartreal/README.md.
  3. Evaluation: python eval/eval_v2.py --dataset <path> --model <slug> runs a model through OpenRouter; predictions are saved incrementally.
  4. Statistical analysis: scripts in analysis/ reproduce the 15 paired-bootstrap tests reported in the thesis.
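Saving predictions incrementally means an interrupted evaluation run can resume without repeating API calls. The sketch below shows one common way to do this with an append-only JSONL file. The field names (`id`, `question`, `prediction`), the file layout, and the `predict` callable standing in for the model request are hypothetical and may differ from what eval/eval_v2.py actually does.

```python
import json
import tempfile
from pathlib import Path

def load_done_ids(path):
    """Collect ids already present in a JSONL predictions file (empty set if missing)."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_incrementally(items, predict, out_path):
    """Append one JSON line per new item; return how many predictions were written."""
    done = load_done_ids(out_path)
    written = 0
    with Path(out_path).open("a", encoding="utf-8") as f:
        for item in items:
            if item["id"] in done:
                continue  # already answered in an earlier run
            record = {"id": item["id"], "prediction": predict(item)}
            # ensure_ascii=False keeps Cyrillic text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            f.flush()  # push each record to disk before the next request
            written += 1
    return written

# Toy run: a constant stub stands in for the real model call.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "preds.jsonl"
    items = [
        {"id": "q1", "question": "Какое значение в 2020 году?"},
        {"id": "q2", "question": "У какой категории наименьшее значение?"},
    ]
    first = run_incrementally(items, lambda item: "42", out)
    second = run_incrementally(items, lambda item: "42", out)
    print(first, second)
```

The second call writes nothing because both ids are already in the file, which is the resume behavior that incremental saving is meant to provide.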

Citation

If you use this benchmark, please cite via Zenodo:

    @dataset{iashchenko_2026_ruchartqa,
      author    = {Iashchenko, Roman},
      title     = {RuChartQA: Russian Chart Reasoning Benchmark for Vision-Language Models},
      year      = {2026},
      publisher = {Zenodo},
      doi       = {10.5281/zenodo.20109375}
    }

DOI will be added after Zenodo release.

License

Apache 2.0 for code and dataset annotations. Original chart images in ChartReal are extracted from public-domain Russian government publications (Rosstat, Bank of Russia, Ministry of Economic Development); see sources.csv for provenance.

Author

Roman Iashchenko, HSE University, Faculty of Computer Science (PMI), 2026. Bachelor's thesis project.
