NCERTQABench: A Large-Scale Bilingual Question Answering Dataset Grounded in Indian School Curriculum with Fine-tuned Language Model Evaluation
Abstract
India’s school education system revolves around the National Council of Educational Research and Training (NCERT) textbooks, yet the research community has largely overlooked them as a source for structured question-answering datasets. We address this gap with NCERTQABench, a collection of 222,880 question-answer pairs drawn from NCERT textbooks spanning Grades 6 to 12. The dataset covers Mathematics, Science, Social Science, Commerce, English literature, and Hindi literature, making it both curriculum-broad and bilingual (English: 78.7%, Hindi: 21.3%). To measure how much domain-specific training matters, we fine-tune Qwen2.5-3B-Instruct via Quantized Low-Rank Adaptation (QLoRA) and compare it against the same model without fine-tuning (the zero-shot baseline) on a held-out evaluation set of 6,042 samples (4,712 English, 1,330 Hindi). The fine-tuned model reaches a ROUGE-L score of 0.4373 on English questions against 0.2017 for the zero-shot baseline, a 2.17× improvement. On Hindi questions, evaluated with character-level ROUGE-L (which correctly handles Devanagari script), the fine-tuned model scores 0.5303 versus 0.4266 for the baseline. The zero-shot model fails to produce a single verbatim match (0% Exact Match), while the fine-tuned model reaches 0.93% on English and 0.30% on Hindi. We release the dataset, trained model weights, and evaluation scripts publicly.
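To make the reported metrics concrete, the following is a minimal sketch of word-level and character-level ROUGE-L (computed here as an F1 score over the longest common subsequence) together with Exact Match. It is an illustration under stated assumptions, not the released evaluation script: the function names, the lowercasing and whitespace handling, and the choice of F1 rather than recall are all assumptions, and the actual scripts may normalize text differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, prediction, char_level=False):
    """ROUGE-L F1 between a reference and a predicted answer.

    char_level=True compares Unicode code points (whitespace dropped),
    an assumed stand-in for the character-level scoring used for Hindi.
    Devanagari vowel signs are separate code points, so each one counts
    as its own unit here.
    """
    if char_level:
        ref = [c for c in reference if not c.isspace()]
        pred = [c for c in prediction if not c.isspace()]
    else:
        # Assumed normalization: lowercase and split on whitespace.
        ref = reference.lower().split()
        pred = prediction.lower().split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(reference, prediction):
    """Verbatim match after trimming surrounding whitespace (assumed)."""
    return reference.strip() == prediction.strip()


def improvement_ratio(fine_tuned, baseline):
    """Relative improvement, e.g. 0.4373 / 0.2017 for English ROUGE-L."""
    return fine_tuned / baseline
```

With the abstract's English scores, `improvement_ratio(0.4373, 0.2017)` gives roughly 2.17, matching the reported figure. Passing `char_level=True` avoids depending on whitespace tokenization, which is one plausible reason to score Hindi answers at the character level.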
Citation Information
@article{abhinavsaxena2026,
title={NCERTQABench: A Large-Scale Bilingual Question Answering Dataset Grounded in Indian School Curriculum with Fine-tuned Language Model Evaluation},
author={Abhinav Saxena and Sarsij Tripathi},
journal={Research Square},
year={2026},
doi={https://doi.org/10.21203/rs.3.rs-9334872/v1}
}