WAXAL: African Language Speech Corpus

One of the largest openly licensed speech corpora for Sub-Saharan African languages: ~1,250 hours of transcribed natural speech for ASR plus ~235 hours of single-speaker TTS recordings across 24 languages (Hausa, Yoruba, Igbo, Swahili, Amharic, Twi, Luganda, Lingala, and more) spoken by 100M+ people. Built with four African academic and community partners and framed as infrastructure for inclusive speech technology and digital preservation. It is a large Google effort (authors include Jeff Dean, Yossi Matias, Moustapha Cissé, Avinatan Hassidim). Surfaced in the weekly sweep as a prior-month gap.

Paper (arXiv) | HuggingFace (dataset)

Dataset

Size: ~1,485 hours (1,250h ASR + 235h TTS)

License: CC-BY-4.0

Languages: 24, including Hausa, Yoruba, Igbo, Swahili, Amharic, Twi, Luganda, Lingala

Tags: dataset, speech, multilingual, open-source
