The largest parallel multilingual NLU benchmark: 900 reading comprehension questions translated into 122 language variants (spanning high-, medium-, and low-resource languages across all continents). Multiple-choice format with 4 options. Every question appears identically across all languages, enabling direct cross-lingual comparison.
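Because the items are parallel, results can be compared question by question and not only as aggregate scores. The sketch below is a minimal illustration of that idea, assuming a hypothetical record layout of (language code, question id, predicted option, correct option); the language codes and values are made up for the example and are not taken from the benchmark's files.

```python
# Hypothetical records: (language_code, question_id, predicted_option, correct_option).
# Options are numbered 1-4. All values below are invented for illustration.
records = [
    ("eng_Latn", 0, 1, 1),
    ("eng_Latn", 1, 3, 3),
    ("eng_Latn", 2, 2, 4),
    ("swh_Latn", 0, 1, 1),
    ("swh_Latn", 1, 2, 3),
    ("swh_Latn", 2, 2, 4),
]


def correct_ids_by_language(rows):
    """Map each language to the set of question ids it answered correctly."""
    correct = {}
    for lang, qid, pred, gold in rows:
        # Register the language even if it gets no question right.
        bucket = correct.setdefault(lang, set())
        if pred == gold:
            bucket.add(qid)
    return correct


def paired_difference(correct, lang_a, lang_b):
    """Question ids answered correctly in lang_a but not in lang_b."""
    return correct[lang_a] - correct[lang_b]


correct = correct_ids_by_language(records)
only_in_english = paired_difference(correct, "eng_Latn", "swh_Latn")
```

Here `only_in_english` holds the ids of questions that were answered correctly in the first language but missed in the second, which is only meaningful because both languages share the same question set.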

The benchmark revealed large performance gaps between high-resource and low-resource languages, even for frontier multilingual models, and it remains discriminative for multilingual evaluation. Published at ACL 2024 by Bandarkar, Liang et al. (Meta FAIR).

Paper

Venue ACL 2024
Citations 1

Evaluation Details

Questions 109,800
Tasks 122
Domains 3
Scoring 4-way multiple-choice accuracy (1 correct of 4 options)
Human baseline 97.6% (4 paper authors, blind test on ~30 English MCQs each; 95% CI [93.1, 99.5])
Random baseline 25% (4-way MC)
Domains: news (Wikinews), children's educational content (Wikijunior), travel (Wikivoyage)
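The figures above fit together with simple arithmetic: 900 questions across 122 variants gives the 109,800 total, and four options give the 25% random baseline. The following is a minimal sketch of that arithmetic and of plain accuracy scoring; the constant and function names are illustrative assumptions, not part of any official evaluation harness.

```python
NUM_QUESTIONS = 900  # questions per language variant, as stated in the card
NUM_VARIANTS = 122   # language variants ("Tasks")
NUM_OPTIONS = 4      # answer options per question


def accuracy(predictions, answers):
    """Fraction of questions where the predicted option equals the correct one."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    hits = sum(p == a for p, a in zip(predictions, answers))
    return hits / len(answers)


total_items = NUM_QUESTIONS * NUM_VARIANTS  # 109800
random_baseline = 1 / NUM_OPTIONS           # 0.25

# Toy example with invented predictions: 3 of 4 correct.
toy_score = accuracy([1, 2, 3, 4], [1, 2, 3, 1])
```

A per-language score is read against `random_baseline`: a model scoring close to 0.25 on a variant is doing no better than guessing on that language.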
Tags: benchmark, evaluation, multilingual