The first publicly available evaluation benchmark for large language model performance on Indian financial regulatory text.
406 expert-annotated QA items · 192 SEBI + RBI documents · 12 LLMs evaluated · Hybrid RAG with Recall@5 = 0.785
→ huggingface.co/spaces/Rajveer-code/IndiaFinBench
A fully working Flask web application deployed on HuggingFace Spaces (Docker, CPU, free tier):
| Feature | Description |
|---|---|
| Interactive Leaderboard | Sortable table of 12 LLMs with 95% Wilson CIs, per-task breakdown (REG / NUM / CON / TMP), difficulty analysis |
| Animated Bar Charts | Hover-interactive visualisations of task and difficulty performance |
| Dataset Explorer | Browse random benchmark items filtered by task type and difficulty |
| RAG Demo | Live hybrid retrieval (FAISS + BM25 + RRF) over 192 regulatory documents, answered by Groq LLaMA-3.3-70B |
| Model Submission | Opens a pre-filled GitHub issue with the exact evaluation command |
Stack: Python 3.11 · Flask 3 · Gunicorn · FAISS-CPU · sentence-transformers (BAAI/bge-base-en-v1.5) · rank-bm25 · Groq API · SQLite · Docker
IndiaFinBench is a zero-shot evaluation benchmark consisting of 406 expert-annotated question-answer pairs drawn from 192 documents published by the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning 1992–2026. It tests whether large language models can reliably reason about Indian financial regulatory text — a domain with distinctive challenges not captured by existing Western-centric financial NLP benchmarks.
Indian regulatory documents embed numerical thresholds in dense prose, reference chains of superseding circulars that require temporal reasoning to untangle, and use jurisdiction-specific terminology (LODR, PMLA, SFB, AIF, FEMA) that models trained on Western corpora may not reliably interpret. IndiaFinBench makes these challenges measurable.
Results on the full 406-item benchmark under zero-shot, context-only evaluation:
| Rank | Model | REG | NUM | CON | TMP | Overall | 95% CI |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | 93.1% | 84.8% | 88.7% | 88.5% | 89.7% | [86.3%, 92.3%] |
| 2 | Qwen3-32B | 85.1% | 77.2% | 90.3% | 92.3% | 85.5% | [81.7%, 88.6%] |
| 3 | LLaMA-3.3-70B | 86.2% | 75.0% | 95.2% | 79.5% | 83.7% | [79.8%, 87.0%] |
| 4 | Llama 4 Scout 17B | 86.2% | 66.3% | 98.4% | 84.6% | 83.3% | [79.3%, 86.6%] |
| 5 | Kimi K2 | 89.1% | 65.2% | 91.9% | 75.6% | 81.5% | [77.5%, 85.0%] |
| 6 | LLaMA-3-8B | 79.9% | 64.1% | 93.5% | 78.2% | 78.1% | [73.8%, 81.8%] |
| 7 | GPT-OSS 120B | 79.9% | 59.8% | 95.2% | 76.9% | 77.1% | [72.8%, 80.9%] |
| 8 | GPT-OSS 20B | 79.9% | 58.7% | 95.2% | 76.9% | 76.8% | [72.5%, 80.7%] |
| 9 | Gemini 2.5 Pro | 89.7% | 48.9% | 93.5% | 64.1% | 76.1% | [71.8%, 80.0%] |
| 10 | Mistral-7B | 79.9% | 66.3% | 80.6% | 74.4% | 75.9% | [71.5%, 79.8%] |
| 11 | DeepSeek R1 70B | 72.4% | 69.6% | 96.8% | 70.5% | 75.1% | [70.7%, 79.1%] |
| 12 | Gemma 4 E4B | 83.9% | 50.0% | 72.6% | 62.8% | 70.4% | [65.8%, 74.7%] |
| — | Human Expert (n=100) | — | — | — | — | 69.0% | [59.4%, 77.2%] |
† Claude 3 Haiku was evaluated on the initial 150-item subset (REG=53, NUM=32, CON=30, TMP=35): Overall 91.3%. Provided for contextualisation; not directly comparable to the 406-item results above.
95% Wilson score confidence intervals. Paired bootstrap significance testing (10,000 resamples) across all 66 model pairs confirms three statistically distinct performance tiers. Full significance matrix in evaluation/bootstrap_significance_results.json.
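For reference, both statistics can be reproduced from per-item correctness vectors. The sketch below is a minimal illustration (the canonical implementations are `scripts/wilson_ci.py` and `scripts/bootstrap_significance.py`; the exact resampling details there may differ):

```python
import numpy as np

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial accuracy."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * np.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def paired_bootstrap_p(scores_a: np.ndarray, scores_b: np.ndarray,
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the accuracy gap between two models,
    resampling the same benchmark items for both (paired bootstrap)."""
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    idx = rng.integers(0, n, size=(n_resamples, n))   # resampled item indices
    diffs = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    return min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))

# Sanity check against the table: the human baseline, 69 correct out of 100
print(wilson_ci(69, 100))   # ≈ (0.594, 0.772)
```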
IndiaFinBench (406 items)
├── REG — Regulatory Interpretation (174 items, 42.9%)
│ Given a regulatory passage, identify the correct rule, threshold,
│ or scope of applicability. Tests precision reading of regulatory language.
│
├── NUM — Numerical Reasoning (92 items, 22.7%)
│ Perform arithmetic over figures embedded in regulatory text:
│ capital ratios, dividend limits, margin requirements.
│
├── CON — Contradiction Detection (62 items, 15.3%)
│ Given two regulatory passages, determine whether they contradict
│ each other on the stated issue (Yes/No + explanation required).
│
└── TMP — Temporal Reasoning (78 items, 19.2%)
Establish the chronological ordering of regulatory events, identify
which circular was operative at a given time, or compute elapsed time
between regulatory milestones.
Difficulty: Easy 160 (39.4%) · Medium 182 (44.8%) · Hard 64 (15.8%)
- Three performance tiers confirmed by bootstrap testing: Frontier API models (81–90%), mid-tier open-weight models (75–79%), and Gemma 4 E4B (70%). Most cross-tier differences are statistically significant (p < 0.05).
- Efficiency over scale: Llama 4 Scout 17B statistically matches LLaMA-3.3-70B (p = 0.79) with one-quarter the parameters — strong evidence that scale alone does not drive regulatory reasoning.
- Scaling plateau: GPT-OSS 120B and GPT-OSS 20B are statistically indistinguishable (p = 0.91, Δ = +0.3pp).
- DeepSeek R1 paradox: Despite being reasoning-specialised, DeepSeek R1 70B ranks 11th — particularly weak on temporal reasoning (70.5%), exposing a gap between general chain-of-thought capability and domain-specific timeline reasoning.
- Gemini 2.5 Pro NUM dissociation: Scores only 48.9% on NUM (lowest of any model) while posting the second-highest REG score (89.7%), demonstrating that task-type performance can be highly dissociated even within the same model.
- NUM as the key discriminator: 35.9pp spread between best (Gemini 2.5 Flash: 84.8%) and worst (Gemini 2.5 Pro: 48.9%) — the most informative task type for differentiating model capability.
- All 12 models beat the human baseline: Human expert accuracy = 69.0% (n=100, 95% CI [59.4%, 77.2%]); all 12 evaluated LLMs exceed this, with Gemini 2.5 Flash leading at 89.7%.
This repository also includes a hybrid retrieval-augmented generation (RAG) system for open-book querying of the regulatory corpus — the open-book counterpart to the closed-book benchmark above. It is integrated into the live demo.
Pipeline:
Query ─┬─→ BGE Embedder → FAISS (dense) ─┐
       └─→ BM25 (sparse) ────────────────┴─→ RRF → Top-K chunks → Groq LLaMA-3.3-70B → Answer
- Embeddings: BAAI/bge-base-en-v1.5 (768-dim, asymmetric query/corpus encoding)
- Sparse index: BM25 (rank-bm25) over 1600-character chunks
- Dense index: FAISS flat L2 (17 MB, 4,347 vectors)
- Fusion: Reciprocal Rank Fusion (k = 60); see the sketch below
- Generator: Groq `llama-3.3-70b-versatile` (14,400 free requests/day)
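Reciprocal Rank Fusion merges the dense and sparse rankings without score normalisation: every chunk receives 1 / (k + rank) from each list, and the contributions are summed. A minimal sketch with hypothetical chunk IDs (the actual fusion lives in `rag/retriever.py`):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[str]:
    """Fuse ranked lists of chunk IDs with RRF: each list contributes
    1 / (k + rank) per chunk, so agreement between FAISS and BM25 is rewarded."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical rankings from the two retrievers
dense_ranking  = ["sebi_lodr_042", "rbi_kyc_007", "sebi_aif_013"]
sparse_ranking = ["rbi_kyc_007", "sebi_lodr_042", "rbi_fema_091"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# → ['sebi_lodr_042', 'rbi_kyc_007', 'sebi_aif_013', 'rbi_fema_091']
```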
| Config | Recall@5 | MRR | p50 ms |
|---|---|---|---|
| Dense only (B0) | 0.688 | 0.542 | 48 |
| BM25 only (B1) | 0.764 | 0.674 | 30 |
| Hybrid RRF (B2) ◄ | 0.785 | 0.640 | 77 |
| Small chunks 800-char (B3) | 0.583 | 0.493 | 138 |
| Large chunks 2400-char (B4) | 0.542 | 0.410 | 71 |
| Hybrid k=10 (B5) | 0.785 | 0.640 | 78 |
Hybrid RRF improves Recall@5 by +9.7pp over dense-only. BM25 achieves the best MRR, confirming that citation-heavy regulatory text with structured identifiers strongly favours lexical matching. 1600-char chunking is the empirical optimum: smaller chunks fragment multi-clause provisions; larger chunks introduce retrieval noise.
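For illustration, a fixed-size character chunker of the kind the ablation compares is sketched below (the 200-character overlap is an assumption made for the sketch; the actual chunking logic is in `rag/scripts/build_index.py`):

```python
def chunk_text(text: str, chunk_size: int = 1600, overlap: int = 200) -> list[str]:
    """Split a document into fixed-size character windows with overlap, so a
    provision that straddles a boundary still appears intact in one chunk."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        window = text[start:start + chunk_size]
        if window.strip():
            chunks.append(window)
    return chunks
```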
# Build the index from the parsed corpus (~3 min on CPU)
python -m rag.scripts.build_index
# Run the 6-configuration retrieval ablation
python -m rag.scripts.run_evaluation

# Clone and install
git clone https://github.com/Rajveer-code/IndiaFinBench.git
cd IndiaFinBench
pip install -r demo/requirements.txt -r rag/requirements.txt
# Set Groq API key (free at console.groq.com)
export GROQ_API_KEY="your_key_here"
# Start the server
python demo/app.py
# → http://localhost:7860

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Leaderboard HTML page |
| GET | `/api/leaderboard` | JSON — 12 models + human baseline |
| GET | `/api/example?task=&diff=` | Random benchmark item (filterable) |
| POST | `/api/rag` | Hybrid RAG query (rate-limited 20 req/min) |
| POST | `/api/submit` | Returns pre-filled GitHub issue URL |
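A hedged example of calling the API from Python; the JSON field name for the RAG query is an assumption, and `demo/app.py` defines the exact request schema:

```python
import requests

BASE_URL = "http://localhost:7860"  # or the HuggingFace Space URL

# Leaderboard as JSON (12 models + human baseline)
leaderboard = requests.get(f"{BASE_URL}/api/leaderboard", timeout=30).json()

# Hybrid RAG query (assumed body: {"query": ...}; rate-limited to 20 req/min)
resp = requests.post(
    f"{BASE_URL}/api/rag",
    json={"query": "What is the minimum net worth requirement for a Small Finance Bank?"},
    timeout=60,
)
print(resp.json())
```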
HuggingFace Spaces (Docker, CPU basic, free — 2 vCPU / 16 GB RAM)
│
├── Gunicorn (1 worker, 4 threads, port 7860)
│ └── Flask app demo/app.py
│
├── RAG pipeline rag/
│ ├── BGE embedder (pre-baked in Docker image, ~270 MB)
│ ├── FAISS index rag/index/faiss.index (17 MB, via Git LFS)
│ └── BM25 index rag/index/bm25.pkl (18 MB, via Git LFS)
│
└── SQLite demo/leaderboard.db (seeded at startup, read-only at runtime)
The BGE model is downloaded once at Docker build time and baked into the image, so the first RAG request is fast (~200 ms) rather than triggering a 30-second cold start.
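A minimal sketch of what that build-time step can look like, assuming a plain sentence-transformers load (the actual Dockerfile may do this differently):

```python
# download_model.py: run once during `docker build` so the weights are baked
# into an image layer and never downloaded at request time.
from sentence_transformers import SentenceTransformer

SentenceTransformer("BAAI/bge-base-en-v1.5")  # populates the local cache (~270 MB)
```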
pytest demo/tests/test_app.py -v
# 14 passed in <2s

from datasets import load_dataset
ds = load_dataset("Rajveer-code/IndiaFinBench", split="train")
print(f"Total items: {len(ds)}") # 406
# Filter by task type
reg_items = ds.filter(lambda x: x["task_type"] == "regulatory_interpretation")
print(f"REG items: {len(reg_items)}") # 174# API model
python evaluation/evaluate.py \
--dataset data/benchmark/indiafinbench_v1.csv \
--model gemini-2.5-flash \
--provider google \
--output results/predictions/gemini_flash.csv
# Local model via Ollama
python evaluation/evaluate.py \
--dataset data/benchmark/indiafinbench_v1.csv \
--model llama3:8b \
--provider ollama \
--output results/predictions/llama3_8b.csv

# All paper figures + bootstrap / Wilson CI / difficulty outputs
python scripts/generate_figures.py

Model-based validation (150 items): LLaMA-3.3-70B independently attempted each item to verify unambiguous answerability from context. Overall agreement: 90.7%. Cohen's κ = 0.918 for contradiction detection (binary Yes/No).
Human inter-annotator agreement (180 items, three annotation rounds, 44.3% benchmark coverage):
| Task | Items | Agreement | Cohen's κ |
|---|---|---|---|
| Regulatory Interpretation | 63 | 85.7% | — |
| Numerical Reasoning | 44 | 59.1% | — |
| Contradiction Detection | 35 | 88.6% | 0.645 |
| Temporal Reasoning | 38 | 73.7% | — |
| Overall | 180 | 77.2% | — |
κ = 0.645 for contradiction detection falls in the "substantial agreement" range (Landis & Koch, 1977). The NUM agreement of 59.1% reflects a formatting artefact: Annotator 2 provided verbose step-by-step derivations whereas the reference answers give concise final values; post-hoc review of all 26 NUM disagreements confirmed zero substantive arithmetic errors. Full IAA data in annotation/iaa/.
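For the binary contradiction-detection labels, Cohen's κ is computed from the two annotators' Yes/No decisions; a minimal sketch with hypothetical labels (the canonical script is `scripts/compute_kappa.py`):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Yes/No decisions from two annotators on the same CON items
annotator_1 = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
annotator_2 = ["yes", "no", "no",  "yes", "no", "no", "yes", "yes"]

print(round(cohen_kappa_score(annotator_1, annotator_2), 3))
```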
Answers are scored using a four-stage procedure:
- Exact match — after case-normalisation and punctuation stripping
- Fuzzy token match — RapidFuzz `token_set_ratio` ≥ 0.72
- Numerical extraction match — handles currency symbols, commas, units
- Yes/No match — for contradiction detection items
The 0.72 fuzzy threshold was calibrated by manual inspection of borderline cases and validated against adjacent thresholds (0.65 too permissive, 0.80 too strict). Full ablation in evaluation/error_analysis/fuzzy_ablation_*.csv.
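A condensed sketch of that cascade (RapidFuzz's `token_set_ratio` is on a 0–100 scale, so the 0.72 threshold corresponds to 72 below); the canonical scorer lives in `evaluation/evaluate.py`:

```python
import re
from rapidfuzz import fuzz

def normalise(text: str) -> str:
    """Lower-case and strip punctuation (stage-1 normalisation)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).strip()

def extract_numbers(text: str) -> list[float]:
    """Pull numeric values out of text, ignoring currency symbols and commas."""
    cleaned = text.replace(",", "").replace("₹", "").replace("%", "")
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", cleaned)]

def is_correct(prediction: str, reference: str) -> bool:
    pred, ref = normalise(prediction), normalise(reference)
    if pred == ref:                                   # 1. exact match
        return True
    if fuzz.token_set_ratio(pred, ref) >= 72:         # 2. fuzzy token match
        return True
    ref_nums = extract_numbers(reference)             # 3. numerical extraction
    if ref_nums and all(n in extract_numbers(prediction) for n in ref_nums):
        return True
    first = pred.split()[0] if pred.split() else ""   # 4. Yes/No match
    return ref in ("yes", "no") and first == ref
```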
IndiaFinBench/
│
├── data/
│ ├── benchmark/indiafinbench_v1.csv # Canonical 406-item benchmark
│ ├── metadata_sebi.csv # 92 SEBI source documents with URLs
│ └── metadata_rbi.csv # 100 RBI source documents with URLs
│
├── annotation/
│ ├── raw_qa/ # Full benchmark JSON (406 + 150-item subset)
│ ├── guidelines/annotation_guide_v1.md # Annotation protocol
│ ├── iaa/ # Inter-annotator agreement data (180 items, 3 rounds)
│ └── human_eval/ # Human evaluation responses (n=100)
│
├── evaluation/
│ ├── evaluate.py # Canonical evaluation entry point
│ ├── prompts/ # Per-task-type system prompts
│ ├── results/ # Per-model prediction CSVs
│ ├── error_analysis/ # Error taxonomy, bootstrap, Wilson CI
│ └── novel_methods/ # 11 novel methodological analyses
│
├── results/
│ ├── predictions/ # Canonical predictions (12 models)
│ └── aggregate/all_model_results.csv # Aggregated Table 6 results
│
├── scripts/
│ ├── generate_figures.py # All paper figures + statistical outputs
│ ├── bootstrap_significance.py # Paired bootstrap (10k resamples)
│ ├── wilson_ci.py # 95% Wilson CI computation
│ ├── compute_kappa.py # IAA Cohen's kappa
│ └── exp[1-11]_*.py # Novel methodological analysis scripts
│
├── rag/ # Hybrid RAG pipeline
│ ├── pipeline.py # Top-level RAGPipeline orchestrator
│ ├── embeddings.py # BGE embedder (asymmetric query/corpus)
│ ├── index.py # FAISS dense index
│ ├── bm25_index.py # BM25 sparse index
│ ├── retriever.py # HybridRetriever with RRF fusion
│ ├── generator.py # Groq LLM generation
│ ├── config.py # RAGConfig dataclass
│ ├── index/ # Pre-built indices (Git LFS)
│ │ ├── faiss.index # 17 MB FAISS flat index
│ │ ├── bm25.pkl # 18 MB BM25 serialised model
│ │ └── chunks.pkl # 9.8 MB chunk metadata
│ └── scripts/
│ ├── build_index.py # Index construction script
│ └── run_evaluation.py # 6-config ablation evaluation
│
├── demo/ # Live web application
│ ├── app.py # Flask app (leaderboard, RAG, submit APIs)
│ ├── requirements.txt # Flask, gunicorn, FAISS, sentence-transformers
│ ├── templates/index.html # Single-page leaderboard UI
│ ├── static/js/
│ │ ├── charts.js # Bar charts, submit handler, RAG UI
│ │ ├── data.js # Model data + Wilson CI bounds (frontend)
│ │ └── animations.js # Scroll animations
│ ├── database/db.py # SQLite leaderboard (init + query)
│ ├── data/
│ │ ├── questions.json # 406 benchmark items (for dataset explorer)
│ │ └── baselines.json # Baseline model results (seeds DB)
│ └── tests/test_app.py # 14 API behaviour tests
│
├── paper/
│ ├── indiafinbench_paper_v12.md # Current paper draft (EMNLP 2026)
│ ├── references.bib # BibTeX references
│ └── figures/ # Publication figures
│
├── Dockerfile # Root Dockerfile for HF Spaces (Docker SDK)
├── .dockerignore # Excludes .env, corpus, research artefacts
├── README.md # This file
└── LICENSE
@article{pall2026indiafinbench,
title = {{IndiaFinBench}: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text},
author = {Pall, Rajveer Singh},
journal = {Proceedings of EMNLP},
year = {2026},
url = {https://github.com/Rajveer-code/IndiaFinBench}
}

| Component | License |
|---|---|
| Dataset (`data/benchmark/`, `annotation/`) | CC BY 4.0 — free to use with attribution |
| Code (`scripts/`, `evaluation/`, `demo/`, `rag/`) | MIT License |
| Source regulatory documents | Public domain (Government of India) |
Rajveer Singh Pall — rajveerpall04@gmail.com
For questions about the benchmark, evaluation methodology, or collaboration inquiries, open an issue or contact directly.