The first expert-annotated benchmark for evaluating large language models on Indian financial regulatory text — 192 documents spanning 1992–2026.
- All 12 models surpass the 69.0% human expert baseline (n=100).
- Gemini 2.5 Flash leads at 89.7%.
- Numerical reasoning is the most discriminative task (35.9 pp spread).
- Llama 4 Scout 17B matches Llama 3.3 70B with roughly a quarter of the parameters (p=0.79; see the significance-test sketch below).
- Scaling GPT-OSS from 20B to 120B yields no measurable benefit (p=0.91).
- Gemini 2.5 Pro drops to Tier 2 due to a verbose-output scoring artifact (corrected score: 84.5%).
- Human inter-annotator agreement is stable across the 120 double-annotated items (κ = 0.611, CON).
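The pairwise p-values above compare models scored on the same 406 QA pairs. The exact test is not specified here, but a paired comparison such as an exact McNemar test is a natural fit; the following is a minimal sketch under that assumption (variable names hypothetical):

```python
# Minimal sketch: exact McNemar test for paired model comparisons.
# Assumes two boolean per-item correctness vectors over the same benchmark
# items; the benchmark's actual significance test is not specified here.
from math import comb

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value from paired per-item correctness."""
    b = sum(a and not y for a, y in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(y and not a for a, y in zip(correct_a, correct_b))  # B right, A wrong
    n = b + c                    # only discordant pairs drive the test
    if n == 0:
        return 1.0               # the models agree on every item
    k = min(b, c)
    # Under H0, discordant outcomes follow Binomial(n, 0.5); the
    # distribution is symmetric, so double one tail for a two-sided p.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical usage over the 406 QA pairs:
# p = mcnemar_exact(scout_17b_correct, llama_33_70b_correct)
```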
| Statistic | Value |
|---|---|
| Total QA pairs | 406 |
| Easy / Medium / Hard | 160 / 182 / 64 |
| SEBI source documents | 92 |
| RBI source documents | 100 |
| Total source documents | 192 |
| Document span | 1992–2026 |
| Avg. context length | ~142 words |
| Model validation agreement | 90.7% |
| Cohen's κ (CON, model) | 0.918 |
| Human IAA overall | 76.7% |
| Cohen's κ (CON, human) | 0.611 |
| Human expert baseline | 69.0% (n=100) |
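The κ values in the table are standard two-rater Cohen's κ. As a reference for how they could be reproduced from raw annotations, here is a minimal sketch (rater data and label sets are hypothetical):

```python
# Minimal sketch: two-rater Cohen's kappa over categorical labels.
# Rater inputs are hypothetical; labels are whatever categorical
# judgments (e.g. the CON category) each rater assigned per item.
from collections import Counter

def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(rater1)
    assert n == len(rater2) and n > 0, "raters must label the same items"
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected chance agreement from each rater's marginal label frequencies.
    p_e = sum((c1[lab] / n) * (c2[lab] / n) for lab in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```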
```bibtex
@inproceedings{pall2026indiafinbench,
  title     = {{IndiaFinBench}: An Evaluation Benchmark for {LLM} Performance on Indian Financial Regulatory Text},
  author    = {Pall, Rajveer Singh},
  booktitle = {Proceedings of EMNLP},
  year      = {2026},
  url       = {https://github.com/Rajveer-code/IndiaFinBench}
}
```