Language(s) | # Sent Pairs | Size | Domain | Source |
---|---|---|---|---|
Multilingual | | | | JW300 |
Multilingual | | | | Mozilla localization |
English-isiZulu | | | | Autshumato Corpus |
English-Setswana | | | | Autshumato Corpus |
English-Xitsonga | | | | Autshumato Corpus |
English-Northern-Sotho | | | | Autshumato Corpus |
English-Afrikaans | | | | Autshumato Corpus |
English-isiXhosa | | | | Navy Corpus |
English-isiXhosa | | | | MeMat corpora |
Lingala-French | | | | Lingala Song Lyrics |
Igbo-English | 10k | | | Evaluation Benchmark |
English-Hausa | 19k | | | Paracrawl |
English-Igbo | | | | Paracrawl |
English-kiSwahili | | | | GourMeT |
English-Amharic | | | | GourMeT |
French-Swahili (Congo) | 25k | | | TWB-Gamayun |
English-kiSwahili | 5k | | | TWB-Gamayun |
French-Nande | 15k | | | TWB-Gamayun |
English-Hausa | 15k | | | TWB-Gamayun |
English-Kanuri | 5k | | | TWB-Gamayun |
English-Dinka | 3k | | | TICO-19 |
English-Nigerian Fulfulde | 3k | | | TICO-19 |
English-Hausa | 3k | | | TICO-19 |
English/French-Luganda | 3k | | | TICO-19 |
English/French-Lingala | 3k | | | TICO-19 |
English-Nuer | 3k | | | TICO-19 |
English/Amharic-Oromo | 3k | | | TICO-19 |
English/French-Kinyarwanda | 3k | | | TICO-19 |
English-Somali | 3k | | | TICO-19 |
English/French-kiSwahili | 3k | | | TICO-19 |
English/Amharic-Tigrinya (Ethiopian) | 3k | | | TICO-19 |
English/Amharic-Tigrinya (Eritrean) | 3k | | | TICO-19 |
English/French-Zulu | 3k | | | TICO-19 |

Language | # Sents | Size | Domain | Source |
---|---|---|---|---|
isiZulu | | | | Zulu Wikipedia |
isiZulu | | | | NCHLT isiZulu Text Corpus |
isiZulu | | | | University of Leipzig Zulu Corpora |
isiZulu | | | | isiZulu National Corpus (currently not available) |
isiZulu | | | | African Speech Technology |
isiZulu | | | | Zulu Bible (to be scraped) |
isiZulu | | | | Zulu Quran (to be scraped) |
Igbo | ~384k | | | Igbo Monolingual |
Yorùbá | ~626k | 560MB | Various | Yorùbá Text (news, blogs, Bible/Quran/Mormon, proverbs, various books & corpora) |
Yorùbá | | 182MB | Various | Wikipedia dump (cleaned; various articles on science, entertainment, etc.) |
Setswana | | | News | Zenodo |
Sepedi | | | News | Zenodo |
kiSwahili | | | News | GourMeT |

Language | # Sents | Size | Domain | Source |
---|---|---|---|---|
isiZulu | | | | SADiLaR |
isiXhosa | | | | SADiLaR |
Afrikaans | | | | SADiLaR |
Sepedi | | | | SADiLaR |
Setswana | | | | SADiLaR |
Sesotho | | | | SADiLaR |
Xitsonga | | | | SADiLaR |
Siswati | | | | SADiLaR |
Tshivenda | | | | SADiLaR |
isiNdebele | | | | SADiLaR |
Yorùbá | | | | Global Voices Yorùbá NER |
Hausa | | | | VOA Hausa NER |

Language | # Sents | Size | Domain | Source |
---|---|---|---|---|
Tunisian Arabic | | | | Zenodo |

Language | Size (GB) | Duration (hours) | No. Speakers | Annotation Type | Source |
---|---|---|---|---|---|
Yoruba | ~1 | ~4 | | transcriptions | OpenSLR |
kiSwahili | 1.7 | ~6 | 1 | transcriptions | TWB-Gamayun |