Belebele
eval
The largest parallel multilingual NLU benchmark: 900 reading comprehension questions translated into 122 language variants (spanning high-, medium-, and low-resource languages across all continents). Each question is multiple-choice with 4 options. The same question set appears in every language variant, enabling direct cross-lingual comparison.
Revealed massive performance gaps between high-resource and low-resource languages even for frontier multilingual models. Still discriminative for multilingual evaluation. ACL 2024. By Bandarkar, Liang et al. (Meta FAIR).
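Because the question set is parallel, per-language accuracy can be computed over the same items and compared directly. The sketch below illustrates that idea only; it is not the authors' evaluation code. The function names, the toy data, and the `lang_low` code are hypothetical, and it assumes the correct option index for a question is the same in every language variant.

```python
from typing import Dict

NUM_OPTIONS = 4                   # each question has 4 answer options
CHANCE_ACCURACY = 1 / NUM_OPTIONS  # random-guess baseline (0.25)


def per_language_accuracy(
    predictions: Dict[str, Dict[str, int]],
    gold: Dict[str, int],
) -> Dict[str, float]:
    """Score every language on the same question IDs.

    predictions maps language code -> {question ID -> predicted option}.
    gold maps question ID -> correct option (assumed shared across languages).
    A missing prediction counts as wrong.
    """
    scores = {}
    for lang, answers in predictions.items():
        correct = sum(1 for qid, label in gold.items() if answers.get(qid) == label)
        scores[lang] = correct / len(gold)
    return scores


def gap_to_reference(scores: Dict[str, float], reference_lang: str) -> Dict[str, float]:
    """Accuracy gap of each language relative to a reference language."""
    ref = scores[reference_lang]
    return {lang: ref - acc for lang, acc in scores.items() if lang != reference_lang}


# Toy data: 4 hypothetical questions, 2 language variants.
gold = {"q1": 1, "q2": 3, "q3": 2, "q4": 4}
predictions = {
    "eng_Latn": {"q1": 1, "q2": 3, "q3": 2, "q4": 4},
    "lang_low": {"q1": 1, "q2": 2, "q3": 2, "q4": 1},  # hypothetical low-resource variant
}

scores = per_language_accuracy(predictions, gold)
gaps = gap_to_reference(scores, "eng_Latn")
print(scores)
print(gaps)
```

Scoring every language against one shared answer key is what makes the gap meaningful: any difference reflects the language variant rather than a different set of questions. Scores near `CHANCE_ACCURACY` suggest the model is effectively guessing in that language.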