Research Article 2026-04-23 posted v1

mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering

Rachit Verma IIT Dharwad

Alluri Lakshman Narendra IIT Dharwad

Abhinav Shankar IIT Dharwad

Riya Chitnis IIT Dharwad

Achyut Mani Tripathi IIT Dharwad

Konjengbam Anand IIT Dharwad


Abstract

Multilingual legal question answering remains underexplored despite its practical importance in jurisdictions like the European Union, where citizens have a constitutional right to access legal information in their native language. Constructing multilingual legal benchmarks is prohibitively expensive, because it requires annotators with both multilingual proficiency and legal expertise. This paper investigates whether high-quality neural machine translation can produce synthetic multilingual legal corpora that preserve retrieval utility for downstream dense retrieval systems. We present mLLeQA, a synthetic multilingual benchmark derived by translating the French LLeQA legal retrieval dataset into five additional European languages (Dutch, English, Finnish, Italian, and Spanish) using Seamless-M4T-v2-large. We train language-specific BERT encoders (MDRL) and multilingual BERT encoders (MDRM) as dense retrievers using contrastive learning with hard negatives, and evaluate retrieval performance against BM25 lexical baselines across 195 test queries per language. Language-specific BERT retrievers show no statistically significant performance degradation from the French baseline (p > 0.05), with an average cross-language deviation from that baseline of 5.14 pp. Dense retrievers consistently outperform BM25 across all languages, achieving an absolute improvement of 29.49 to 62.47 pp on Recall@500. Notably, Finnish (typologically distant) shows minimal deviation (1.61 pp) while Italian exhibits maximal deviation (10.12 pp), suggesting that pre-training corpus characteristics dominate linguistic similarity effects. BLEU scores prove poor predictors of dense retrieval performance but correlate with lexical retrieval, revealing a disconnect between surface-level translation quality metrics and semantic task utility.
These findings suggest translation-augmented dense retrieval as a potentially viable and cost-effective pathway toward multilingual legal information access in the EU, with immediate applications in retrieval-augmented generation pipelines for citizen-facing legal question-answering systems. All code, datasets, and output files are available for review at https://github.com/rachitprojects/mLLeQAAllCodeResultsUpload
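The abstract reports results as Recall@500 and as deviations from the French baseline in percentage points (pp). The following is a minimal sketch of how such figures could be computed. It assumes recall is averaged per query, which the abstract does not confirm; the function names and the toy rankings are hypothetical and are not taken from the mLLeQA repository.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of one query's relevant articles that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("query has no relevant articles")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over all queries (per-query averaging is an assumption)."""
    scores = [recall_at_k(runs[qid], qrels[qid], k) for qid in qrels]
    return sum(scores) / len(scores)


def deviation_pp(score: float, baseline: float) -> float:
    """Absolute gap between two recall values, expressed in percentage points."""
    return abs(score - baseline) * 100.0


# Toy data (hypothetical): two queries, rankings from a French and an Italian retriever.
qrels = {"q1": {"a", "b"}, "q2": {"c"}}
runs_fr = {"q1": ["a", "b", "x"], "q2": ["c", "y", "z"]}
runs_it = {"q1": ["a", "x", "b"], "q2": ["y", "c", "z"]}

recall_fr = mean_recall_at_k(runs_fr, qrels, k=2)
recall_it = mean_recall_at_k(runs_it, qrels, k=2)
gap = deviation_pp(recall_it, recall_fr)
print(f"FR recall@2={recall_fr:.2f}  IT recall@2={recall_it:.2f}  deviation={gap:.2f} pp")
```

With the paper's setup, `k` would be 500 and each run would cover the 195 test queries of one language; a percentage-point gap is a difference of two recall values, not a relative change.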

Keywords

legal multilingual retrieval; neural machine translation; dense passage retrieval; contrastive learning; multilingual corpora synthesis; natural language processing

Citation Information

@article{rachitverma2026,
  title={mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering},
  author={Rachit Verma and Alluri Lakshman Narendra and Abhinav Shankar and Riya Chitnis and Achyut Mani Tripathi and Konjengbam Anand},
  journal={Research Square},
  year={2026},
  doi={10.21203/rs.3.rs-9448649/v1}
}

Rachit Verma et al. (2026). mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering. Research Square. https://doi.org/10.21203/rs.3.rs-9448649/v1

Rachit Verma, et al. "mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering." Research Square, 2026.

[6] Rachit Verma, Alluri Lakshman Narendra, Abhinav Shankar, Riya Chitnis, Achyut Mani Tripathi, Konjengbam Anand. mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering [Research Article]. Research Square, 2026.
