EMNLP 2026  ·  Zero-Shot  ·  SEBI & RBI

IndiaFinBench

The first expert-annotated benchmark for evaluating large language models on Indian financial regulatory text — 192 documents spanning 1992–2026.

406
QA Pairs
12
Models
4
Task Types
192
Documents
89.7%
Best Score
69.0%
Human Baseline
Live Rankings
Performance Leaderboard
Zero-shot accuracy across 406 expert-annotated items. Each model has a unique colour; hover any bar for a full breakdown.
Overall Accuracy
All 406 items · zero-shot · no fine-tuning
Complete Results
Full Results Table
Click column headers to sort. †Claude 3 Haiku was evaluated on a 150-item subset only.
Rank · Model · Params · Type · REG · NUM · CON · TMP · Overall · 95% CI
Per-Task Analysis
Task Breakdown
Top-5 models per task. Each task probes a distinct reasoning capability unique to Indian financial regulation.
Difficulty Analysis
Performance by Question Difficulty
Difficulty is assigned at authoring time based on the number of reasoning steps required. Hard = multi-instrument tracking or complex arithmetic.
Easy
160
39.4% of benchmark
Medium
182
44.8% of benchmark
Hard
64
15.8% of benchmark
Accuracy by Difficulty Level
LLaMA-3.3-70B improves on hard items (79.4% → 90.6%), while Gemma 4 E4B collapses (82.5% → 56.2%). †Gemini 2.5 Pro's drop on hard items reflects a verbose-output scoring artifact.
Model · Easy (n=160) · Medium (n=182) · Hard (n=64)
Open Evaluation
Submit a Model
Submit any public HuggingFace model for zero-shot evaluation on all 406 IndiaFinBench items. We run the evaluation and add results to the leaderboard.
Model Evaluation
Enter a public HF model ID. Evaluated zero-shot on all 406 items with task-aware scoring.
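
A minimal sketch of what a zero-shot run on a submitted model might look like, using the Hugging Face transformers text-generation pipeline; the file name, prompt template, and generation settings below are illustrative assumptions, not the official harness:

# Illustrative sketch only; not the official IndiaFinBench evaluation harness.
import json
from transformers import pipeline

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"   # any public HF model ID
generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

predictions = []
with open("items.jsonl") as f:                    # hypothetical benchmark file
    for line in f:
        item = json.loads(line)
        prompt = (f"Context:\n{item['context']}\n\n"
                  f"Question: {item['question']}\nAnswer:")
        out = generator(prompt, max_new_tokens=128, do_sample=False)
        # The pipeline returns the prompt plus the completion; keep the completion only.
        predictions.append(out[0]["generated_text"][len(prompt):].strip())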
How It Works
1
Enter a public HuggingFace model ID and click Open Submission Issue.
2
A pre-filled GitHub issue opens with your model details and the exact evaluation command.
3
Submit the issue — we run the eval locally on all 406 items using the four-stage scoring procedure.
4
Results are added to the leaderboard with 95% Wilson CIs, usually within a few days.
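
The 95% Wilson confidence intervals reported on the leaderboard follow the standard Wilson score formula; a small self-contained sketch (the example count is roughly the leaderboard's best score):

import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half

print(wilson_ci(364, 406))   # ~89.7% accuracy on 406 items -> about (0.863, 0.923)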
Live Retrieval
RAG Demo
Ask questions grounded in all 192 SEBI & RBI regulatory documents (1992–2026). Answers are retrieved and generated in real time; every claim is tied to a retrieved source passage rather than hallucinated.
Powered by FAISS + BM25 hybrid retrieval · Groq LLaMA-3.3-70B · 192 SEBI & RBI documents (1992–2026)
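
A minimal sketch of the FAISS + BM25 hybrid retrieval pattern named above, with the two ranked lists fused by reciprocal rank; the embedding model, fusion constant, and document snippets are placeholders rather than the demo's actual configuration:

# Hybrid retrieval sketch: dense (FAISS) + sparse (BM25), fused by reciprocal rank.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["SEBI circular on margin requirements ...",
        "RBI master direction on capital adequacy ..."]    # 192 documents in practice

encoder = SentenceTransformer("all-MiniLM-L6-v2")           # placeholder embedding model
emb = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(emb.shape[1])                     # inner product on unit vectors = cosine
index.add(emb)

bm25 = BM25Okapi([d.lower().split() for d in docs])

def retrieve(query: str, k: int = 5):
    q = encoder.encode([query], normalize_embeddings=True).astype("float32")
    _, dense_ids = index.search(q, min(k, len(docs)))
    sparse_ids = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:k]
    scores = {}
    for rank, i in enumerate(dense_ids[0]):                  # reciprocal-rank fusion
        scores[int(i)] = scores.get(int(i), 0.0) + 1.0 / (60 + rank)
    for rank, i in enumerate(sparse_ids):
        scores[int(i)] = scores.get(int(i), 0.0) + 1.0 / (60 + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]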
Research Benchmark
About IndiaFinBench
406 expert-annotated QA pairs drawn from 192 primary-source regulatory documents (1992–2026), with dual-layer annotation quality validation.
Task Types
REG
Regulatory Interpretation · n = 174
Extract compliance rules, thresholds and deadlines from SEBI and RBI circulars.
NUM
Numerical Reasoning · n = 92
Arithmetic over capital adequacy ratios, dividend limits, and margin requirements.
CON
Contradiction Detection · n = 62
Determine whether two regulatory passages contradict each other (Yes/No + explanation).
TMP
Temporal Reasoning · n = 78
Sequence regulatory amendments and identify which circular was operative at a given time.
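
Scoring is described elsewhere on this page as task-aware; here is a hedged sketch of how a per-task scorer could dispatch across these four types. The parsing rules and numeric tolerance are illustrative assumptions, not the benchmark's four-stage procedure:

import re

def score(task: str, prediction: str, gold: str) -> bool:
    """Illustrative task-aware scorer; not the official four-stage procedure."""
    pred, ref = prediction.strip().lower(), gold.strip().lower()
    if task == "NUM":   # numerical reasoning: compare the first number in each answer
        p = re.search(r"-?\d+(?:\.\d+)?", pred)
        g = re.search(r"-?\d+(?:\.\d+)?", ref)
        return bool(p and g) and abs(float(p.group()) - float(g.group())) < 1e-2
    if task == "CON":   # contradiction detection: match the leading Yes/No label
        return pred.startswith(("yes", "no")) and pred.split()[0] == ref.split()[0]
    # REG and TMP: extracted rule or operative circular, matched by containment
    return ref in pred or pred in ref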
Key Findings

All 12 models surpass the 69.0% human baseline (n=100); Gemini 2.5 Flash leads at 89.7%.
Numerical reasoning is the most discriminative task (35.9 pp spread).
Llama 4 Scout 17B matches LLaMA-3.3-70B with about one quarter of the parameters (p=0.79).
Scaling GPT-OSS from 20B to 120B yields no measurable benefit (p=0.91).
Gemini 2.5 Pro falls to Tier 2 due to a verbose-output scoring artifact (corrected score: 84.5%).
Human inter-annotator agreement remains stable across 120 items (κ = 0.611 on CON).
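
The model-vs-model p-values above compare two systems on the same 406 items; one standard choice for that paired setting is McNemar's test on the discordant items. A sketch with synthetic correctness vectors (the benchmark's actual significance test may differ):

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Synthetic per-item correctness (1 = correct) for two models on the same 406 items.
model_a = np.random.default_rng(0).integers(0, 2, 406)
model_b = np.random.default_rng(1).integers(0, 2, 406)

b = int(np.sum((model_a == 1) & (model_b == 0)))   # A correct, B wrong
c = int(np.sum((model_a == 0) & (model_b == 1)))   # A wrong, B correct
table = [[0, b], [c, 0]]                            # only the off-diagonal counts matter
print(mcnemar(table, exact=True).pvalue)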

Dataset Statistics
Statistic · Value
Total QA pairs · 406
Easy / Medium / Hard · 160 / 182 / 64
SEBI source documents · 92
RBI source documents · 100
Total source documents · 192
Document span · 1992–2026
Avg. context length · ~142 words
Model validation agreement · 90.7%
Cohen's κ (CON, model) · 0.918
Human IAA overall · 76.7%
Cohen's κ (CON, human) · 0.611
Human expert baseline · 69.0% (n=100)
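
The Cohen's κ values for the CON task can be reproduced from two annotators' Yes/No labels with scikit-learn; the label vectors below are made up for illustration:

from sklearn.metrics import cohen_kappa_score

# Hypothetical Yes/No contradiction labels from two annotators (illustrative data only).
annotator_a = ["yes", "no", "no", "yes", "no", "yes", "no", "no"]
annotator_b = ["yes", "no", "yes", "yes", "no", "yes", "no", "no"]

print(cohen_kappa_score(annotator_a, annotator_b))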
Citation
@inproceedings{pall2026indiafinbench,
  title     = {{IndiaFinBench}: An Evaluation Benchmark
               for LLM Performance on Indian Financial
               Regulatory Text},
  author    = {Pall, Rajveer Singh},
  booktitle = {Proceedings of EMNLP},
  year      = {2026},
  url       = {https://github.com/Rajveer-code/IndiaFinBench}
}