mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering
Abstract
Multilingual legal question answering remains underexplored, despite its practical importance in jurisdictions such as the European Union, where citizens have a constitutional right to access legal information in their native language. However, constructing multilingual legal benchmarks is prohibitively expensive, because it requires annotators with both multilingual proficiency and legal expertise. This paper investigates whether high-quality neural machine translation can produce synthetic multilingual legal corpora that preserve retrieval utility for downstream dense retrieval systems. We present mLLeQA, a synthetic multilingual benchmark derived by translating the French LLeQA legal retrieval dataset into five additional European languages (Dutch, English, Finnish, Italian, and Spanish) using Seamless-M4T-v2-large. We train language-specific BERT encoders (MDRL) and multilingual BERT encoders (MDRM) as dense retrievers using contrastive learning with hard negatives, and evaluate retrieval performance against BM25 lexical baselines on 195 test queries per language. Language-specific BERT retrievers show no statistically significant performance degradation relative to the French baseline (p > 0.05), with an average cross-language deviation of 5.14 pp. Dense retrievers consistently outperform BM25 across all languages, achieving absolute improvements of 29.49 to 62.47 pp on Recall@500. Notably, Finnish (typologically distant from French) shows the smallest deviation (1.61 pp) while Italian shows the largest (10.12 pp), suggesting that pre-training corpus characteristics dominate linguistic similarity effects. BLEU scores prove poor predictors of dense retrieval performance but correlate with lexical retrieval performance, revealing a disconnect between surface-level translation quality metrics and semantic task utility.
These findings suggest that translation-augmented dense retrieval is a potentially viable and cost-effective pathway toward multilingual legal information access in the EU, with immediate applications in retrieval-augmented generation pipelines for citizen-facing legal question-answering systems. All code, datasets, and output files are available for review at https://github.com/rachitprojects/mLLeQAAllCodeResultsUpload
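For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how Recall@k and a deviation in percentage points (pp) could be computed. It is not taken from the paper's repository: the function names, the toy rankings, and the choice to macro-average recall over queries are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of a query's relevant articles found in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int
) -> float:
    """Recall@k averaged over queries, as a percentage (assumed macro-average)."""
    scores = [recall_at_k(rankings[q], qrels[q], k) for q in qrels]
    return 100.0 * sum(scores) / len(scores)


def deviation_pp(baseline: float, other: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return abs(baseline - other)


# Hypothetical toy data: two queries, article IDs invented for illustration.
qrels = {"q1": {"a1", "a3"}, "q2": {"a4", "a7"}}
rankings_fr = {"q1": ["a1", "a2", "a3"], "q2": ["a9", "a4", "a5"]}
rankings_fi = {"q1": ["a1", "a3", "a2"], "q2": ["a4", "a9", "a5"]}

recall_fr = mean_recall_at_k(rankings_fr, qrels, k=2)
recall_fi = mean_recall_at_k(rankings_fi, qrels, k=2)
print(recall_fr, recall_fi, deviation_pp(recall_fr, recall_fi))
```

In the paper's setting, k would be 500 and each language's rankings would come from its own retriever over the translated corpus; the toy value k=2 is used here only to keep the example readable.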
Citation Information
@article{rachitverma2026,
title={mLLeQA: Translation-Augmented Dense Retrieval for Multilingual Legal Question Answering},
author={Rachit Verma and Alluri Lakshman Narendra and Abhinav Shankar and Riya Chitnis and Achyut Mani Tripathi and Konjengbam Anand},
journal={Research Square},
year={2026},
doi={10.21203/rs.3.rs-9448649/v1}
}