A Reference of Datasets for African NLP

Machine Translation

| Language(s) | # Sent Pairs | Size | Domain | Source |
|---|---|---|---|---|
| Multilingual | | | | JW300 |
| Multilingual | | | | Mozilla localization |
| English-isiZulu | | | | Autshumato Corpus |
| English-Setswana | | | | Autshumato Corpus |
| English-Xitsonga | | | | Autshumato Corpus |
| English-Northern-Sotho | | | | Autshumato Corpus |
| English-Afrikaans | | | | Autshumato Corpus |
| English-isiXhosa | | | | Navy Corpus |
| English-isiXhosa | | | | MeMat corpora |
| Lingala-French | | | | Lingala Song Lyrics |
| Igbo-English | 10k | | | Evaluation Benchmark |
| English-Hausa | 19k | | | Paracrawl |
| English-Igbo | | | | Paracrawl |
| English-kiSwahili | | | | GourMeT |
| English-Amharic | | | | GourMeT |
| French-Swahili (Congo) | 25k | | | TWB-Gamayun |
| English-kiSwahili | 5k | | | TWB-Gamayun |
| French-Nande | 15k | | | TWB-Gamayun |
| English-Hausa | 15k | | | TWB-Gamayun |
| English-Kanuri | 5k | | | TWB-Gamayun |
| English-Dinka | 3k | | | TICO-19 |
| English-Nigerian Fulfulde | 3k | | | TICO-19 |
| English-Hausa | 3k | | | TICO-19 |
| English/French-Luganda | 3k | | | TICO-19 |
| English/French-Lingala | 3k | | | TICO-19 |
| English-Nuer | 3k | | | TICO-19 |
| English/Amharic-Oromo | 3k | | | TICO-19 |
| English/French-Kinyarwanda | 3k | | | TICO-19 |
| English-Somali | 3k | | | TICO-19 |
| English/French-kiSwahili | 3k | | | TICO-19 |
| English/Amharic-Tigrinya (Ethiopian) | 3k | | | TICO-19 |
| English/Amharic-Tigrinya (Eritrean) | 3k | | | TICO-19 |
| English/French-Zulu | 3k | | | TICO-19 |

Monolingual

| Language | # Sents | Size | Domain | Source |
|---|---|---|---|---|
| isiZulu | | | | Zulu Wikipedia |
| isiZulu | | | | NCHLT isiZulu Text Corpus |
| isiZulu | | | | University of Leipzig Zulu Corpora |
| isiZulu | | | | isiZulu National Corpus (currently not available) |
| isiZulu | | | | African Speech Technology |
| isiZulu | | | | Zulu Bible (to be scraped) |
| isiZulu | | | | Zulu Quran (to be scraped) |
| Igbo | ~384k | | | Igbo Monolingual |
| Yorùbá | ~626k | 560MB | Various | Yorùbá Text (News, blogs, Bible/Quran/Mormon, proverbs, various books & corpora) |
| Yorùbá | | 182MB | Various | Wikipedia dump (cleaned) (Various articles on science, entertainment, etc.) |
| Setswana | | | News | Zenodo |
| Sepedi | | | News | Zenodo |
| kiSwahili | | | News | GourMeT |

Named Entity Recognition

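| Language | # Sents | Size | Domain | Source |
|---|---|---|---|---|
| isiZulu | | | | SADiLaR |
| isiXhosa | | | | SADiLaR |
| Afrikaans | | | | SADiLaR |
| Sepedi | | | | SADiLaR |
| Setswana | | | | SADiLaR |
| Sesotho | | | | SADiLaR |
| Xitsonga | | | | SADiLaR |
| Siswati | | | | SADiLaR |
| Tshivenda | | | | SADiLaR |
| isiNdebele | | | | SADiLaR |
| Yorùbá | | | | Global Voices Yorùbá NER |
| Hausa | | | | VOA Hausa NER |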

Sentiment Analysis

| Language | # Sents | Size | Domain | Source |
|---|---|---|---|---|
| Tunisian Arabic | | | | Zenodo |

Speech

| Language | Size (GB) | Size (hours) | No. Speakers | Annotation Type | Source |
|---|---|---|---|---|---|
| Yoruba | ~1 | ~4 | | transcriptions | OpenSLR |
| kiSwahili | 1.7 | ~6 | 1 | transcriptions | TWB-Gamayun |