The first expert-annotated benchmark for evaluating large language models on Indian financial regulatory text — 192 documents spanning 1992–2026.
- All 12 models surpass the 69.0% human expert baseline (n=100).
- Gemini 2.5 Flash leads at 89.7%.
- Numerical reasoning is the most discriminative task (35.9 pp spread).
- Llama 4 Scout 17B matches Llama 3.3 70B with roughly a quarter of the parameters (p=0.79; see the significance-test sketch below).
- Scaling GPT-OSS from 20B to 120B yields no measurable benefit (p=0.91).
- Gemini 2.5 Pro drops to Tier 2 due to a verbose-output scoring artifact (corrected score: 84.5%).
- Human inter-annotator agreement is stable across the 120 double-annotated items (κ = 0.611, CON).
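The pairwise p-values above compare models scored on the same 406 QA pairs. The exact test is not specified here, but a paired comparison such as an exact McNemar test is a natural fit; the following is a minimal sketch under that assumption (variable names hypothetical):

```python
# Minimal sketch: exact McNemar test for paired model comparisons.
# Assumes two boolean per-item correctness vectors over the same benchmark
# items; the benchmark's actual significance test is not specified here.
from math import comb

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value from paired per-item correctness."""
    b = sum(a and not y for a, y in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(y and not a for a, y in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c                    # only discordant pairs drive the test
    if n == 0:
        return 1.0               # the models agree on every item
    k = min(b, c)
    # Under H0, discordant outcomes follow Binomial(n, 0.5); the
    # distribution is symmetric, so double one tail for a two-sided p.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical usage over the 406 QA pairs:
# p = mcnemar_exact(scout_17b_correct, llama_33_70b_correct)
```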
| Statistic | Value |
|---|---|
| Total QA pairs | 406 |
| Easy / Medium / Hard | 160 / 182 / 64 |
| SEBI source documents | 92 |
| RBI source documents | 100 |
| Total source documents | 192 |
| Document span | 1992–2026 |
| Avg. context length | ~142 words |
| Model validation agreement | 90.7% |
| Cohen's κ (CON, model) | 0.918 |
| Human IAA overall | 76.7% |
| Cohen's κ (CON, human) | 0.611 |
| Human expert baseline | 69.0% (n=100) |
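The κ values in the table are standard two-rater Cohen's κ. As a reference for how they could be reproduced from raw annotations, here is a minimal sketch (rater data and label sets are hypothetical):

```python
# Minimal sketch: two-rater Cohen's kappa over categorical labels.
# Rater inputs are hypothetical; labels are whatever categorical
# judgments (e.g. the CON category) each rater assigned per item.
from collections import Counter

def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(rater1)
    assert n == len(rater2) and n > 0, "raters must label the same items"
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected chance agreement from each rater's marginal label frequencies.
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```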
```bibtex
@inproceedings{pall2026indiafinbench,
  title     = {{IndiaFinBench}: An Evaluation Benchmark for {LLM} Performance on Indian Financial Regulatory Text},
  author    = {Pall, Rajveer Singh},
  booktitle = {Proceedings of EMNLP},
  year      = {2026},
  url       = {https://github.com/Rajveer-code/IndiaFinBench}
}
```