WAXAL: African Language Speech Corpus
One of the largest openly licensed speech corpora for Sub-Saharan African languages: ~1,250 hours of transcribed natural speech for ASR plus ~235 hours of single-speaker TTS recordings across 24 languages (Hausa, Yoruba, Igbo, Swahili, Amharic, Twi, Luganda, Lingala, and more) spoken by 100M+ people. Built with four African academic and community partners and framed as infrastructure for inclusive speech technology and digital preservation. A large Google effort; authors include Jeff Dean, Yossi Matias, Moustapha Cissé, and Avinatan Hassidim. Surfaced in the weekly sweep as a prior-month gap.
Dataset
Size ~1,485 hours (1,250h ASR + 235h TTS)
License CC-BY-4.0
Languages 24, including Hausa, Yoruba, Igbo, Swahili, Amharic, Twi, Luganda, Lingala