A survey of corpora for Germanic low-resource languages and dialects

mainlp/germanic-lrl-corpora

A Survey of Corpora for Germanic Low-Resource Languages and Dialects

You can read more about this corpus collection here. If you find this overview useful for your research, please cite:

@inproceedings{blaschke-etal-2023-survey,
  title = {A survey of corpora for {G}ermanic low-resource languages and dialects},
  author = {Blaschke, Verena and Sch{\"u}tze, Hinrich and Plank, Barbara},
  year = {2023},
  month = may,
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  address = {T{\'o}rshavn, Faroe Islands},
  publisher = {University of Tartu Library},
  url = {https://aclanthology.org/2023.nodalida-1.41},
  pages = {392--414},
}

Language varieties:

Inclusion criteria:

  • Accessible to researchers
  • Can be downloaded (easily)
  • No extensive pre-processing required (appropriate file formats; no abundance of OCR errors)
  • Full sentences/utterances rather than word lists. (We have since relaxed this criterion and now also include word-based resources useful for variationist research.)
  • Data are contemporaneous or from the past century
  • If only a written version is available, it should be (manually) annotated and/or showcase variation through phone[t/m]ic transcriptions or orthographies used specifically for that language variety

We focus on manual or manually corrected annotations rather than fully automatically annotated data. For corpora with an “uncurated” note, we strongly recommend manually checking the data quality, as it might be low or mixed. We've excluded corpora where we were able to determine large-scale data quality issues. Note that the webcrawl-based corpora likely overlap with the contents of some of the other corpora, and for languages with especially few resources, the overlap with Wikipedia tends to be extremely high.
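Before combining a webcrawl-based corpus with another resource, it can help to estimate how much they overlap. The following is a minimal sketch, assuming both corpora are available as iterables of sentence strings; the function names `normalize` and `overlap_ratio` and the toy sentences are hypothetical and not part of any corpus listed here. It only detects exact matches after light normalization, so near-duplicates go unnoticed.

```python
import unicodedata


def normalize(line: str) -> str:
    """Apply Unicode NFC, lowercase, and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).lower().split())


def overlap_ratio(corpus_a, corpus_b) -> float:
    """Share of distinct non-empty lines in corpus_a that also occur in corpus_b."""
    lines_a = {normalize(line) for line in corpus_a} - {""}
    lines_b = {normalize(line) for line in corpus_b} - {""}
    if not lines_a:
        return 0.0
    return len(lines_a & lines_b) / len(lines_a)


# Toy data (made up for illustration): two of the three webcrawl lines
# also appear in the second corpus once case and spacing are normalized.
webcrawl = ["First example sentence.", "Second  example sentence.", "Only in the crawl."]
wikipedia = ["first example sentence.", "Second example sentence."]
print(f"{overlap_ratio(webcrawl, wikipedia):.2f}")
```

A high ratio suggests that the two resources should not be treated as independent data, for instance when splitting into training and test sets.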

The license names link to where the license is mentioned on the corpus website, unless the license is mentioned on the site linked in the first column, in the article accompanying the dataset, or in the downloaded corpus files. Always refer to the original corpus websites/papers to double-check the license information; we cannot guarantee that the information here is up to date.

Did we forget a corpus for a Germanic low-resource language or dialect that fits these inclusion criteria? Please reach out to us via a GitHub issue or an email to verena DOT blaschke ÄT cis.lmu.de!

General

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Sound Comparisons: Germanic (Paschen ea 2019) | word-based, 120 locations/doculects from all Germanic sub-branches | 106 words × 120 locations | audio, phono (IPA), English ortho, ortho of relevant std languages | CC BY-NC-ND 4.0 |

Faroese · fao · fao1244

Corpus Notes Size Representation License
UD Faroese OFT (Tyers ea 2018) POS (UPOS, Giellatekno-FAO), dependencies (UD), morpho (UD), lemmas. Contains material from Wikipedia 1.2k sentences Faroese ortho GNU GPL 2.0, GNU LGPL 2.1, Mozilla Public License 1.1
FarPaHC (Ingason ea 2012, Rögnvaldsson ea 2012) POS (mod. Penn-historical), phrase structure (mod. Penn-historical) 53k tokens Faroese ortho CC BY 4.0
UD Faroese FarPaHC (Ingason ea 2012, Rögnvaldsson ea 2012) POS (UPOS), dependencies (UD), morpho (UD) 40k tokens Faroese ortho CC BY-SA 4.0
FoNE (Snæbjarnarson ea 2023) named entities (8 classes). The text overlaps with the BLARK background corpus (Sosialurin subcorpus) 118k tokens Faroese ortho CC BY 4.0
Fo-STS (Snæbjarnarson ea 2023) semantic text similarity (sentence-level), translated subset of the English STS corpus (Cer ea 2017) 729 sentence pairs Faroese ortho CC BY 4.0
BLARK 1.0 (background corpus) (Simonsen ea 2022) 25M tokens Faroese ortho CC BY 4.0
Sprotin translations English–Faroese parallel sentences 126k sentence pairs Faroese ortho MIT license
Føroyskur talumálsbanki (Jacobsen 2022) 599.9k tokens Faroese ortho(, audio?) CLARIN RES-PLAN-BY-PRIV-NORED
Faroese text collection (FTS) in BLARK 1.0 background corpus 1.1M tokens Faroese ortho CC BY 4.0
Korp (Giellatekno) in BLARK 1.0 background corpus (download via BLARK), contains Wikipedia articles ? Faroese ortho CC BY 4.0
BLARK 1.0 (audio) (Simonsen ea 2022) locations (Suðuroy, Sandoy, Suðurstreymoy, Norðurstreymoy/​Eysturoy, Vágar, Norðuroyggjar) 100 hrs audio, Faroese ortho, some phono CC BY 4.0
Faroese Danish Corpus Hamburg (FADAC Hamburg) (subset) (Debess 2019) locations (Tórshavn, Vágar, Suðuroy, Eysturoy/​Norðuroyggjar) 31 hrs audio, Faroese ortho HZSK-RES
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) parallel with ~200 languages 2k sentences Faroese ortho CC BY-SA 4.0
Tatoeba (fao subset) translations into other languages 417 sentences Faroese ortho CC BY 2.0 FR
ITU Faroese/Danish (Derczynski ea 2022) Danish translations; overlaps with (Danish) Tatoeba 4k sentences CC BY 4.0
Ubuntu via OPUS (Tiedemann 2012) translations into other languages 20.2k tokens Faroese ortho ?
QED via OPUS (Abdelali ea 2014, Tiedemann 2012) translations into other languages 6.4k tokens Faroese ortho ?
UDHR-LID (subset) (Kargaran ea 2023, Unicode) 57 sentences CC0 1.0
OpenLID (subset) (Burchell ea 2023) combines other corpora 40k lines depend on source datasets
FAO News 2020 (Goldhahn ea 2012) uncurated? 33.8k sentences ?
FAO Newscrawl 2011 (Goldhahn ea 2012) uncurated? 8.8k sentences ?
Faroese Mixed Corpus (Goldhahn ea 2012) uncurated? 300k sentences ?
Faroese Web Corpus (Goldhahn ea 2012) uncurated? 1M sentences ?
FC3 (Snæbjarnarson ea 2023) Faroese subset of CommonCrawl (uncurated) 98k paragraphs / 9M tokens Faroese ortho unspecified CC license
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) uncurated 102 MB Faroese ortho CC BY-SA 3.0
MADLAD-400 (subset) (Kudugunta ea 2023) uncurated, subset of CommonCrawl 1.8M sentences CC-BY-4.0
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 2.3M sentences Apache 2.0 + licenses of source datasets
Wikipedia (fo subset) uncurated 14k articles Faroese ortho text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

For additional resources/tools, see also the resource list of the Faroese Centre for Language Technology.
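The UD treebanks listed in these tables are distributed as CoNLL-U files: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line after each sentence. The following is a minimal reader sketch that extracts word forms and UPOS tags; the function name `read_conllu` and the sample tokens are hypothetical, and a dedicated CoNLL-U library may be preferable for real work.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append((cols[1], cols[3]))  # FORM and UPOS columns
    if sentence:
        yield sentence


# Made-up two-token sentence in CoNLL-U layout (placeholder values).
sample = [
    "# text = word1 word2",
    "1\tword1\tlemma1\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tword2\tlemma2\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
for sent in read_conllu(sample):
    print(sent)
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.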

↑ top

Norwegian · nor · norw1258

Corpus Notes Size Representation License
LIA Treebank (+transcriptions) (Øvrelid ea 2018) POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places in Norway). Annotated subset of LIA Norsk 7.5k speech segments / 78k tokens Nynorsk ortho, phono CC BY-NC-SA 4.0
UD Norwegian Nynorsk LIA (+transcriptions) (Øvrelid ea 2018) POS (UPOS), dependencies (UD), morpho (UD), lemmas, locations (10 places in Norway). Annotated subset of LIA Norsk 5.3k speech segments / 55k tokens Nynorsk ortho, phono; aligned Nynorsk+phono here (Blaschke ea 2023) treebank: CC BY-SA 4.0, transcriptions: CC BY-NC-SA 4.0
NDC Treebank (+transcriptions; website) (Kåsen ea 2022, Johannessen ea 2009) POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places) 4.6k speech segments / 66k tokens Bokmål ortho, phono treebank and transcriptions: CC BY-NC-SA 4.0
NoMusic (Mæhlum & Scherrer 2024) subset of xSID slot filling, intent detection, translations into Bokmål and 16 other languages; location (8 dialects) 8×800 sentences ad-hoc pronunciation spelling CC BY-SA 4.0
NorDial (subset) (Barnes ea 2021) 348 tweets ad-hoc spelling CC0 1.0
NorDial (POS-annotated subset) (Mæhlum ea 2022 – contact authors) POS (UPOS) 35+ tweets ad-hoc spelling
Nordic Dialect Corpus (subset) (Johannessen ea 2009) locations (>100 places) 1.9M tokens Bokmål ortho, phono; aligned Bokmål+phono here (Scherrer 2023) CC BY-NC-SA 4.0
LIA Norsk (Øvrelid ea 2018) locations (222 places) 3.5M tokens Nynorsk ortho, phono CC BY-NC-SA 4.0
LIA Norsk (downloadable audio subset) (Øvrelid ea 2018) locations (178 places) ? audio, Nynorsk ortho, phono CC BY-NC-SA 4.0
The spoken language investigation in Oslo (TAUS) locations (East vs. West Oslo) 387k tokens Bokmål ortho, phono CC BY-NC-SA 4.0
American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015) locations (57 places in USA/Canada) 773k tokens Bokmål ortho, phono CC BY-NC-SA 4.0
Speech Database for Norwegian (NB Tale) locations (24 areas) 365 × 2 mins (spontaneous speech), 7.6k sentences (reading) audio, Bokmål ortho, mod. X-SAMPA CC0
Norwegian Parliamentary Speech Corpus (NPSC) locations (5 dialect regions) 140 hrs / 65k sentences / 1.2M tokens audio, Bokmål/​Nynorsk ortho CC0

↑ top

Jutish · juti1236

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Danish Gigaword Corpus (synne subset) (Derczynski ea 2021) | South Jutish | ca. 20k tokens | | CC BY 4.0 |

↑ top

East Danish · scan1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Danish Gigaword Corpus (botxt subset) (Derczynski ea 2021, Kjeldsen 2019) | Bornholmsk | ca. 400k tokens | | CC BY 4.0 |

↑ top

Elfdalian/Övdalian · ovd · elfd1234

Glottolog 4.7 categorizes Elfdalian as a dialect of Dalecarlian/dale1238.

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Nordic Dialect Corpus (subset) (Johannessen ea 2009) | locations (7 places) | 15.7k tokens | Elfdalian ortho (Råðdjärum's orthography), Swedish ortho | CC BY-NC-SA 4.0 |

↑ top

Swedish · swe · swe1254

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Parallel dialectal-standard Swedish data (Hämäläinen ea 2020, Ivars & Södergård 2007) | Finland Swedish (with locations) | 86.5k tokens | transcription, Swedish ortho | CC BY-NC-SA 4.0 |
| American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015) | locations (7 places in the US) | 46k tokens | Swedish ortho, phono | CC BY-NC-SA 4.0 |

↑ top

Anglo-Frisian

Scots · sco · scot1243

Corpus Notes Size Representation License
POS-tagged Scots corpus (Lameris & Stymne 2021) POS (UPOS); overlaps with the SCOTS corpus 1k tokens partially ad hoc (SCOTS), partially with a standardized orthography (Mak Forrit)
Scottish Corpus of Texts & Speech (SCOTS) (subset) (Anderson ea 2007) partially annotated in the POS-tagged Scots corpus unknown (4.6M tokens total) mix of ad-hoc spelling and English ortho custom
UDHR-LID (subset) (Kargaran ea 2023, Unicode) 58 sentences CC0 1.0
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) uncurated 35 MB ? CC BY-SA 3.0
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 410k sentences Apache 2.0 + licenses of source datasets
Wikipedia (sco subset) uncurated, see reports here and here 39k articles Scots spelling recommendations text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

English · eng · stan1293

Corpus Notes Size Representation License
TwitterAAE-UD (Blodgett ea 2016) dependencies (UD); AAVE 250 tweets ad-hoc spelling
Diachronic Electronic Corpus of Tyneside English (DECTE) (Corrigan ea 2012) locations (19 places in NE England). Contains the Newcastle Electronic Corpus of Tyneside English (NECTE) and NECTE2; NECTE in turn contains the Tyneside Linguistic Survey (TLS) and the Phonological Variation and Change in Contemporary Spoken English (PVC) corpus. 72 hrs / 804k tokens audio, English ortho, partially: phono custom
Intonational Variation in English (IViE) (Nolan & Post 2013) locations (British Isles: Belfast, Dublin, Newcastle, Leeds, Bradford, Liverpool, Cambridge, Cardiff, London) 36 hrs audio, English ortho custom
Crowdsourced high-quality UK and Ireland English Dialect speech data set (Demirsahin ea 2020) locations (British Isles: Ireland, Midlands, Northern England, Scotland, Southern England, Wales) 31 hrs audio, English ortho CC BY-SA 4.0
Helsinki Corpus of British English Dialects locations (UK: Cambridgeshire, Devon, Essex/Lancashire, Isle of Ely, Somerset, Suffolk) 1M tokens audio, English ortho
Nationwide Speech Project (NSP) (Clopper & Pisoni 2006) locations (USA: West, Midland, North, South, New England, Mid-Atlantic) 60 × 1 hr audio, partially: English ortho
Corpus of Regional African American Language (CORAAL) (Kendall & Farrington 2021) 6 locations, AAVE 135.6 hrs / 1.5M tokens audio, English ortho CC BY-NC-SA 4.0
Sound Comparisons: Englishes (Maguire ea 2019) word-based, 51 locations 110 words × 51 locations audio, phono (IPA), English ortho CC BY-NC-ND 4.0

See also: SPADE: SPeech Across Dialects of English (Stuart-Smith ea 2017–2020) and their corpus collection.

↑ top

West(ern) Frisian · fry · west2354

Corpus Notes Size Representation License
UD Frisian/Dutch Fame (Braggar & van der Goot 2021, Yılmaz ea 2016) POS (UPOS), dependencies (UD), code-switching; code-mixed Frisian and Dutch. Annotated subset of FAME. 400 sentences Frisian​(/Dutch) ortho CC BY-SA 4.0
UD Frisian Frysk (Heeringa ea 2021) under development!; POS (UPOS), dependencies (UD), morpho (UD), lemmas 2.9k sentences Frisian ortho CC BY-SA 3.0
Common Voice (subset) (Ardila ea 2020) 211 hrs audio, Frisian ortho CC0
Frisian AudioMining Enterprise (FAME!) (Yılmaz ea 2016) partially: locations 18.5 hrs audio, Frisian ortho
Recordings of Dutch-Frisian council meetings (Bentum ea 2022) 26 hrs / 281k tokens audio, Frisian ortho
Corpus Spoken Frisian / Korpus Sprutsen Frysk (KSF) 200 hrs (65 hrs thereof transcribed) audio, partially: Frisian ortho
Boarnsterhim Corpus (BHC) (subset) (Sloos ea 2018) under revision! unknown (250 hrs total, with Dutch) audio
Tatoeba (fry subset) translations into other languages 641 sentences Frisian ortho CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012) translations into other languages 22.4k tokens Frisian ortho
KDE4 via OPUS (Tiedemann 2012) translations into other languages ca. 300k tokens Frisian ortho
GNOME via OPUS (Tiedemann 2012) translations into other languages 55.7k tokens Frisian ortho
Mozilla-I10n translations into other languages ca. 400k tokens Frisian ortho Mozilla Public License 2.0
UDHR-LID (subset) (Kargaran ea 2023, Unicode) 58 sentences CC0 1.0
FRY News 2020 (Goldhahn ea 2012) uncurated? 107.5k sentences ? (written) ?
Western Frisian Newscrawl (Goldhahn ea 2012) uncurated? 100k sentences
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) uncurated 72 MB Frisian ortho CC BY-SA 3.0
CC-100 (subset) (Wenzek ea 2020) uncurated, subset of CommonCrawl 174 MB Frisian ortho
OSCAR (subset) (Abadji ea 2022) uncurated, subset of CommonCrawl 9.9M tokens / 70.4 MB Frisian ortho Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023) uncurated, subset of mc4 and OSCAR 223k sentences see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023) uncurated, subset of CommonCrawl 3.7M sentences CC-BY-4.0
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 927k sentences Apache 2.0 + licenses of source datasets
Wikipedia (fy subset) uncurated 50k articles Frisian ortho text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

North(ern) Frisian · frr · north2626

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (frr subset) | translations into other languages | 2.9k sentences | ? | CC BY 2.0 FR |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 55.3k sentences | | Apache 2.0 + licenses of source datasets |
| Wikipedia (frr subset) | uncurated, partially tagged with dialect information | 17k articles | different dialect-based (ad-hoc?) orthographies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top

Saterland Frisian/Saterfrisian · stq · sate1242

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (stq subset) | translations into other languages | 96 sentences | ? | CC BY 2.0 FR |
| MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 27.7k sentences | | CC-BY-4.0 |
| Wikipedia (stq subset) | uncurated | 4k articles | revised Kramer orthography for Saterfrisian (unclear if example, recommendation or rule for this wiki) | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top

Low German

Low Saxon/Low German · nds · lowg1239

(The relationship between the ISO 639-3 code and the Glottocode is complicated.)

Corpus Notes Size Representation License
UD Low Saxon LSDC (Siewert & Rueter 2024) POS (UPOS), dependencies (UD), morphological features (UD), glosses (Middle Low Saxon), lemmas, locations (18 dialect areas, see also LSDC note); overlaps with LSDC 1000 sentences ad-hoc spelling, Nysassiske Sryvwyse CC BY-SA 4.0
TaPaCo (subset) (Scherrer 2020) paraphrases; annotated subset of Tatoeba 1107 sentences ? CC BY 2.0
Low Saxon Dialect Classification (LSDC) (Siewert ea 2020) locations (15 dialect areas); overlaps with UD Low Saxon LSDC; also contains FRS, WEP, TWD, ACT content 88.9k sentences (incl. FRS etc.) ad-hoc spelling CC BY-NC-SA 4.0
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (Low German subset) varieties of Low Saxon (Nordhannoversch, Emsländisch Oldenburgisch), East Frisian Low Saxon and (Northern) German unknown (300 hrs total) audio HZSK-RES
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) locations 80 min / 10.7k tokens audio, German ortho custom terms
Tatoeba (nds subset) translations into other languages 18.1k sentences ? CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012) translations into other languages 35.3k tokens ?
KDE4 via OPUS (Tiedemann 2012) translations into other languages 1.1M tokens ?
GNOME via OPUS (Tiedemann 2012) translations into other languages ca. 700k tokens ?
UDHR-LID (subset) (Kargaran ea 2023, Unicode) 58 sentences CC0 1.0
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) uncurated 24 MB ? CC BY-SA 3.0
OSCAR (subset) (Abadji ea 2022) uncurated, subset of CommonCrawl 1.6M tokens / 10.7 MB ? Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023) uncurated, subset of mc4 and OSCAR 15.1k sentences see mc4 & OSCAR
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 934k sentences Apache 2.0 + licenses of source datasets
Wikipedia (nds subset) uncurated, partially tagged with dialect information 84k articles Sass'sche Schrievwies text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0
Wikipedia (nds-nl subset) uncurated, partially tagged with dialect information 8k articles Nysassiske Skryvwyse (preferred) and Algemene Nedersaksische Schriefwieze (older articles) text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

East Frisian Low Saxon · frs · east2288

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Sprachvariation in Norddeutschland (SiN, Hamburg collection) (East Frisian Low Saxon subset) | varieties of Low Saxon, East Frisian Low Saxon and (Northern) German | unknown (300 hrs total) | audio | HZSK-RES |
| Low Saxon Dialect Classification (LSDC) (OFR subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 240 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

↑ top

Gronings · gos · gron1242

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| TaPaCo (subset) (Scherrer 2020) | paraphrases; annotated subset of Tatoeba | 122 sentences | ? | CC BY 2.0 |
| Automatic speech recognition dataset for Gronings (Bartelds ea 2023) | | 4 hours | audio, written | CC BY 4.0 |
| Dataset: Gronings (Bartelds & San 2021, San ea 2021) | | 23 mins | audio, written | CC BY 4.0 |
| Tatoeba (gos subset) | translations into other languages | 5.7k sentences | ? | CC BY 2.0 FR |

↑ top

Twents · twd · twen1241

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Low Saxon Dialect Classification (LSDC) (TWE subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 668 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

↑ top

Achterhoeks · act · acht1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Low Saxon Dialect Classification (LSDC) (ACH subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 988 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

↑ top

Westphalic/Westphalish/Westphalian · wep · west2356

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | 15 min / 2.4k tokens | audio, German ortho | custom terms |
| Low Saxon Dialect Classification (LSDC) (OWL subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 15k sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

↑ top

Macro-Dutch

Dutch · nld · dutc1256

Corpus Notes Size Representation License
Corpus of Southern Dutch Dialects (GCND) (Breitbarth ea 2018) under construction!; might also include West Flemish, Zeelandic, and/or Limburgs audio, transcriptions
SAND (Barbiers ea 2006) locations ? phono custom
MAND/FAND/GTRP (Goeman ea) (contact institute) locations phono (K-IPA)

↑ top

West(ern) Flemish · vls · vlaa1240

Corpus Notes Size Representation License
Stemmen uit het verleden (annotated subset) (Lybaert ea 2019, Van Keymeulen ea 2019) V2 variation, locations (25 places) 1.4k sentences phono CC BY-NC 4.0
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 102k sentences Apache 2.0 + licenses of source datasets
VLS Community 2017 (Goldhahn ea 2012) possibly uncurated 36.4k sentences ? (written) ?
Wikipedia (vls subset) uncurated, partially tagged with dialect information 8k articles Standoardvlams (orthography developed by vls.wikipedia.org editors) text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

Zeelandic/Zeeuws · zea · zeeu1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 34.4k sentences | | Apache 2.0 + licenses of source datasets |
| Wikipedia (zea subset) | uncurated | 6k articles | ? | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top

Central German

Upper Saxon · sxu · uppe1400

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| SXUCorpus (Herms ea 2016) (contact authors) | 8 locations | 500 min / 70k tokens | audio, German ortho | |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | 12 min / 1.7k tokens | audio, German ortho | custom terms |

↑ top

Moselle Franconian · luxe1241

Luxembourgish · ltz · luxe1243

Corpus Notes Size Representation License
UD Luxembourgish LuxBank (Plum ea 2024) POS tags (UPOS), dependencies (UD) 20 sentences Luxembourgish ortho
Banking Client Support (BCS) Dataset (Lothritz ea 2021) intent detection, slot filling, parallel with DEU, FRA, ENG 1k sentences Luxembourgish ortho ?
Luxembourgish translation of Winograd Natural Language Inference (L-WNLI) (Lothritz ea 2022) NLI, parallel with other languages (Levesque ea 2012) 767 samples Luxembourgish ortho ?
Luxembourgish POS and NER (Lothritz ea 2022) (contact authors) POS tags (15 tags), NER (PER, ORG, LOC, GPE, MISC) 5.5k sentences Luxembourgish ortho ?
Luxembourgish news classification (Lothritz ea 2022) (contact authors) 8 classes 10k articles Luxembourgish ortho ?
SA1 (Lothritz ea 2023; contact authors) sentiment 1.8k sentences
Luxembourgish sentence negation (Lothritz ea 2023) position of negation particle; overlaps with Leipzig corpora (Newscrawl and/or Web and/or Wikipedia) 46k sentences
LuxId (Lavergne ea 2014) code-switching (LTZ, DEU, FRA) 924 sentences (most with LTZ content) Luxembourgish​(/German/​French) ortho CC BY-SA 3.0
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) parallel with ~200 languages 2k sentences Luxembourgish ortho CC BY-SA 4.0
FLEURS (subset) (Conneau ea 2023) parallel with ~100 languages; audio version of FLORES (Goyal ea 2022) 1-3 recordings each of 1.9k sentences (3.8k recordings total) audio, Luxembourgish ortho CC BY 4.0
Tatoeba (ltz subset) translations into other languages 884 sentences Luxembourgish ortho CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012) translations into other languages 17k tokens Luxembourgish ortho ?
KDE4 via OPUS (Tiedemann 2012) translations into other languages 28.8k tokens Luxembourgish ortho ?
Mozilla-I10n translations into other languages 6.9k tokens Luxembourgish ortho Mozilla Public License 2.0
QED via OPUS (Abdelali ea 2014, Tiedemann 2012) translations into other languages 19.2k tokens Luxembourgish ortho ?
TED2020 via OPUS (Reimers & Gurevych, Tiedemann 2012) translations into other languages 1.7k tokens Luxembourgish ortho CC BY-NC-ND 4.0
UDHR-LID (subset) (Kargaran ea 2023, Unicode) 59 sentences CC0 1.0
OpenLID (subset) (Burchell ea 2023) combines other corpora 37.7k lines depend on source datasets
Luxembourgish Newscrawl (Goldhahn ea 2012) uncurated? 300k sentences
Luxembourgish Web Corpus (Goldhahn ea 2012) uncurated? 1M sentences
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) uncurated 81 MB ? CC BY-SA 3.0
OSCAR (subset) (Abadji ea 2022) uncurated, subset of CommonCrawl 2.5M tokens / 18.4 MB ? Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023) uncurated, subset of mc4 and OSCAR 166k sentences see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023) uncurated, subset of CommonCrawl 3.4M sentences CC-BY-4.0
Wikipedia (lb subset) uncurated 61k articles Luxembourgish ortho text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

For other kinds of resources/tools, see also questoph/NLPforLTZ.

↑ top

Transylvanian Saxon · tran1294

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Audioatlas siebenbürgisch-sächsischer Dialekte (ASD) (University of Munich) | | 360 hrs / 450k tokens | audio, German ortho, partially phono | CLARIN RES |

↑ top

Colognian · ksh · kols1241

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (ksh subset) | translations into other languages | 82 sentences | ? | CC BY 2.0 FR |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 33.5k sentences | | Apache 2.0 + licenses of source datasets |
| Wikipedia (ksh subset) | uncurated, Colognian and other varieties of Ripuarian, partially tagged with dialect and/or orthography information | 3k articles | ad-hoc spelling, some articles according to various Ripuarian orthographies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top

Limburgish/Limburgan · lim · lim1263

Corpus Notes Size Representation License
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) parallel with ~200 languages; Maastrichtian Limburgs 2k sentences CC BY-SA 4.0
Ubuntu via OPUS (Tiedemann 2012) translations into other languages 18.4k tokens ?
GNOME via OPUS (Tiedemann 2012) translations into other languages ca. 400k tokens ?
OpenLID (subset) (Burchell ea 2023) combines other corpora 48k lines depend on source datasets
LIM Community 2017 (Goldhahn ea 2012) possibly uncurated 84.4k sentences ? (written) ?
LIM Web 2010 (Netherlands) (Goldhahn ea 2012) uncurated? 35.4k sentences ? (written) ?
CC-100 (subset) (Wenzek ea 2020) uncurated, subset of CommonCrawl 8.3 MB
CulturaX (subset) (Nguyen ea 2023) uncurated, subset of mc4 and OSCAR 206 sentences see mc4 & OSCAR
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 652k sentences Apache 2.0 + licenses of source datasets
Wikipedia (li subset) uncurated, partially tagged with dialect and/or orthography information 14k articles Veldeke-sjpelling, Algemein Gesjreve Limburgs text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

Rhine/Rhenish Franconian · rhin1244

Includes Palatin(at)e German · pfl · pala1330.

Corpus Notes Size Representation License
Thorsten-Voice Dataset 2023.09 Hessisch (Müller & Kreutz 2024) Hessian 2 hrs / 2.1k sentences audio, German ortho CC0
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) Hessian 8 min / 1.4k tokens audio, German ortho custom terms
Wikipedia (pfl subset) uncurated, partially tagged with dialect information; contains articles in Palatine German, Lorraine Franconian, Hessian 3k articles (implied) ad-hoc spelling text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

Pennsylvania Dutch · pdc · penn1240

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (pdc subset) | translations into other languages | 57 sentences | ? | CC BY 2.0 FR |
| Wikipedia (pdc subset) | uncurated | 2k articles | ? | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top

Yiddish · yid · west2361/east2295

Corpus Notes Size Representation License
Penn Parsed Corpus of Historical Yiddish (Santorini 2021) POS (Penn-historical), phrase structure (Penn-historical) 200k tokens partially YIVO transliteration, partially YIVO-inspired ad-hoc transliteration CC BY-NC-SA 4.0
CABank Yiddish Corpus (Newman 2015) New York 1 hr audio, transcriptions (partially IPA, partially orthography-based (YIVO-transliteration-based?)) CC BY-NC-SA 3.0
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) parallel with ~200 languages; Eastern Yiddish (Hasidic) 2k sentences CC BY-SA 4.0
UDHR-LID (subset) (Kargaran ea 2023, Unicode) Eastern Yiddish 59 sentences CC0 1.0
OpenLID (subset) (Burchell ea 2023) combines other corpora; Eastern Yiddish 911 lines depend on source datasets
YDD Community 2017 (Goldhahn ea 2012) Eastern Yiddish; possibly uncurated 21.8k sentences ? (written) ?
CC-100 (subset) (Wenzek ea 2020) uncurated, subset of CommonCrawl 51 MB
OSCAR (subset) (Abadji ea 2022) uncurated, subset of CommonCrawl 14.3M tokens / 171.7 MB ? Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023) uncurated, subset of mc4 and OSCAR 141k sentences see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023) uncurated, subset of CommonCrawl 1.9M sentences CC-BY-4.0
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 220k sentences Apache 2.0 + licenses of source datasets
Wikipedia (yi subset) uncurated 15k articles text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

↑ top

Upper German

German · deu · stan1295

Corpus Notes Size Representation License
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (German subset) varieties of Low Saxon, East Frisian Low Saxon and (Northern) German unknown (300 hrs total) audio HZSK-RES
Regional Variants of German 1 (RVG1) (+link2) (Burger & Schiel 1998) unclear whether all of the recordings are in regionally accented (Standard) German or some are in Low Saxon/Bavarian/Colognian/etc. instead 500 × 1 min spontaneous speech audio, phono (SAMPA), German ortho CLARIN ACA
Texas German Sample Corpus (TGSC) (Blevins 2022) 13.5 hrs / 75k tokens audio, German ortho CC0 1.0
Wenkersätze (Wenker 1889–1923: Sprachatlas des Deutschen Reichs. Handdrawn by Emil Maurmann, Georg Wenker and Ferdinand Wrede. Published online as Digitaler Wenker-Atlas, Schmidt ea 2020-) 40 German sentences, translated into various lects spoken in the German Reich at the turn of the century 40 sentences × 2210 samples various phonetic transcription styles and ad-hoc spellings CC BY-SA 4.0

For (mostly non-downloadable) resources for studying German dialect variation, see also the updated overview by Fischer & Limper (2019).

↑ top

Upper/High Franconian · uppe1464

Including East Franconian · vmf · main1267.

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | South Franconian and East Franconian | South: 10 min / 1.6k tokens; East: between 13 and 26 min / between 1.9k and 2.3k tokens | audio, German ortho | custom terms |

Bavarian · bar · bava1246

Corpus Notes Size Representation License
UD Bavarian MaiBaam (Blaschke ea 2024) POS (UPOS), dependencies (UD), German lemmas; dialect/location information; overlaps with wiki, xSID, NaLiBaSID 1k sentences ad-hoc pronunciation spelling CC BY-SA 4.0
Kontatto (Dal Negro & Ciccolone 2020) POS (unknown), lemmas (German). South Tyrolean 147k tokens audio, phono custom
BarNER (Peng ea 2024) named entities (based on CoNLL2003); overlaps with wiki 11k sentences ad-hoc pronunciation spelling CC-BY 4.0
xSID (van der Goot ea 2021; Aepli ea 2023; Winkler ea 2024) (de-st and de-ba subsets) slot filling, intent detection, translations into 16 languages; South Tyrolean and Central Bavarian 2×800 sentences ad-hoc pronunciation spelling CC BY-SA 4.0
NaLiBaSID MAS:de-ba (Winkler ea 2024) slot filling, intent detection; Central Bavarian; translation of MASSIVE hence parallel with 50+ other languages 2k sentences ad-hoc pronunciation spelling
NaLiBaSID nat:de-ba (Winkler ea 2024) slot filling, intent detection 315 sentences ad-hoc pronunciation spelling
DiDi (Frey ea 2015, 2019) (subset) South Tyrolean 9.6k messages ad-hoc pronunciation spelling CLARIN ACA-BY-NC-NORED
Kontatti (Ghilardi 2019) (subset) South Tyrolean unknown (6:48 hrs total) audio, German ortho custom
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) between 21 and 34 min / between 2.7k and 3.2k tokens audio, German ortho custom terms
AlpiLinK (Rabanus ea 2023) (tir subset) South Tyrolean; location information 1908 files (49 sentences, up to 43 speakers) audio, German ortho CC BY-NC-SA 4.0
VinKo (tir subset) (Rabanus ea 2023, Krujt ea 2023) South Tyrolean; location information 148 sentences + 71 words (up to 195 speakers per entry) audio, German ortho CC BY-NC-ND 4.0
Tatoeba (bar subset) translations into other languages 226 sentences ad-hoc pronunciation spelling CC BY 2.0 FR
Wikipedia (bar subset) uncurated, partially tagged with dialect information 27k articles ad-hoc pronunciation spelling with some optional conventions text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0
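
Treebanks annotated with UPOS tags and UD dependencies, such as UD Bavarian MaiBaam, are normally distributed as CoNLL-U files (ten tab-separated columns per token, blank lines between sentences). The following is a minimal sketch of reading such a file with the Python standard library and counting UPOS tags. The names `read_conllu` and `upos_counts` are hypothetical helpers, and the sample sentence is an invented toy example, not taken from any corpus listed here.

```python
from collections import Counter

# Invented toy sentence in CoNLL-U layout (NOT taken from MaiBaam or any listed corpus).
SAMPLE = (
    "# sent_id = toy-1\n"
    "# text = I mog di .\n"
    "1\tI\tich\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmog\tmögen\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tdi\tdu\tPRON\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Yield each sentence as a list of token dicts."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments (sent_id, text, ...)
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append({
            "id": int(tok_id),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def upos_counts(sentences):
    """Count UPOS tags over all tokens."""
    return Counter(tok["upos"] for sent in sentences for tok in sent)


sentences = list(read_conllu(SAMPLE))
counts = upos_counts(sentences)
```

For a real treebank, the file contents would be read with `open(path, encoding="utf-8").read()` and passed to `read_conllu`; the path depends on how the corpus was downloaded.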

↑ top

Cimbrian · cim · cimb1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Kontatti (Ghilardi 2019) (subset) | | unknown (6:48 hrs total) | audio, German ortho | custom |
| AlpiLinK (Rabanus ea 2023) (cim subset) | location information | 530 files (42 sentences, up to 14 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
| VinKo (cim subset) (Rabanus ea 2023, Kruijt ea 2023) | location information | 159 sentences + 40 words (up to 14 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |

↑ top

Mòcheno · mhn · moch1255

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| AlpiLinK (Rabanus ea 2023) (mhn subset) | location information | 42 sentences (1 speaker) | audio, German ortho | CC BY-NC-SA 4.0 |
| VinKo (mhn subset) (Rabanus ea 2023, Kruijt ea 2023) | location information | 159 sentences + 30 words (up to 17 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |

↑ top

Swabian · swg · swab1242

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (swg subset) | translations into other languages | 1.9k sentences | ad-hoc pronunciation spelling | CC BY 2.0 FR |
| Wikipedia (subset of als subset) | uncurated | 927 (of 27k) articles tagged as Swabian | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top

Central Alemannic (incl. Swiss German & Alsatian) · gsw · swis1247

Corpus Notes Size Representation License
Annotated Corpus for the Alsatian Dialects (Bernhard ea 2018, 2019) POS (UPOS, mod. UPOS), lemmas, glosses (French), NEs (locations); Alsatian; overlap with Wikipedia 798 sentences ad-hoc pronunciation spelling CC BY-SA 4.0
BISAME GSW (STIH 2020, Millour & Fort 2018) POS (mod. UPOS); Alsatian 382 sentences ad-hoc pronunciation spelling CC BY-NC-SA 3.0 FR
NOAH's corpus (Hollenstein & Aepli 2015) POS (mod. STTS, partially also STTS and UPOS); overlap with UD Swiss German UZH and Wikipedia 115k tokens (mostly?) ad-hoc pronunciation spelling annotations: CC BY 4.0
UD Swiss German UZH (Aepli & Clematide 2018) POS (UPOS, mod. STTS), dependencies (UD); overlap with NOAH's corpus and Wikipedia 100 sentences (mostly?) ad-hoc pronunciation spelling CC BY-SA 4.0
WUS DIALOG GSW (Stark ea 2014-20, Ueberwasser & Stark 2017) (subset) POS (mod. STTS), locations 34.7k tokens ad-hoc pronunciation spelling, German ortho CC BY-NC-ND
xSID (Aepli ea 2023) (gsw subset) slot filling, intent detection, translations into 16 languages. Bernese 800 sentences
SwissDial (Dogan-Schönberger ea 2021) topics (14 classes), translations (across dialects and into German), locations (Aargau, Bern, Basel, Graubünden, Luzern, St. Gallen, Wallis, Zürich); the Wallis data are presumably in Walser (wae) 2.5-4.6 hrs × 7-8 dialects audio, pronunciation spelling, German ortho CC BY-NC 4.0
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) 10 min / 612 tokens audio, German ortho custom terms
SpinningBytes Swiss German Corpus (SB-CH) (annotated subset) (Grubenmann ea 2018) sentiment; potential overlap with NOAH's corpus 2.8k sentences pronunciation spelling CC BY 4.0
anko Schweizerdeutsch (subset of the Picture postcard corpus) (Sugisaki ea 2023) discourse-related text spans 600 postcards pronunciation spelling ?
What's up, Switzerland? (subset) (Stark ea 2014-20, Ueberwasser & Stark 2017) locations 507k messages / 3.6M tokens pronunciation spelling CC BY-NC-ND
Swatchgroup Geschäftsbericht (subset) via PaCoCo (Graën ea 2019) 79.6k tokens pronunciation spelling CC BY-SA
Schweizerdeutsches Mundartkorpus (CHMK; downloadable subcorpus) (Weibel & Peter 2020) locations ? CC BY-SA 4.0
Text+Berg via PaCoCo (subset) (Bubenhofer ea 2015, Graën ea 2019) 156 sentences / 3.1k tokens CC BY-SA
ArchiMob (Scherrer ea 2019) 70 hrs audio, transcription based on the Dieth orthography for Swiss German, German ortho CC BY-NC-SA 4.0
STT4SG-350 (Plüss ea 2023) locations (7 regions) 343 hrs audio, German ortho META-SHARE NonCommercial NoRedistribution
SDS-200 (Plüss ea 2022) 200 hrs audio, German ortho META-SHARE NonCommercial NoRedistribution
Swiss Parliaments Corpus (Plüss ea 2021a) 293 hrs audio, German ortho
All Swiss German Dialects Test Set (Plüss ea 2021b) locations (cantons, incl. Wallis) 13 hrs / 5.8k utterances audio, German ortho MIT
Gemeinderat Zürich Audio Corpus (Plüss ea 2021b) 1208 hrs audio MIT
Ein geparstes und grammatisch annotiertes Korpus schweizerdeutscher Spontansprachdaten (Schönenberger & Haeberli 2019) (contact authors) POS (mod. Penn-historical), phrase structure (Penn-historical). Location: Wil (SG) 100k+ tokens Dieth orthography
UDHR-LID (subset) (Karagan ea 2023, Unicode) 59 sentences ? CC0 1.0
Swiss Crawl (Linder ea 2020) uncurated 500k+ sentences ? CC BY-NC 4.0
SpinningBytes Swiss German Corpus (SB-CH) (Grubenmann ea 2018) uncurated; contains NOAH's corpus 116k sentences CC BY 4.0
SwigSpot (Linder 2018) uncurated 8k sentences ? Apache 2.0
Tatoeba (gsw subset) translations into other languages 474 sentences ? CC BY 2.0 FR
Swiss German Web Corpus (Goldhahn ea 2012) uncurated? 100+k sentences ?
OSCAR (subset) (Abadji ea 2022) uncurated, subset of CommonCrawl 34k tokens / 233 KB ? Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023) uncurated, subset of mc4 and OSCAR 6.9k sentences see mc4 & OSCAR
MADLAD-400 (subset) (Kudugunta ea 2023) uncurated, subset of CommonCrawl. The dataset audit notes issues with the Swiss German subcorpus ⚠ 1M sentences CC BY 4.0
Glot500-c (subset) (Imani ea 2023) partially uncurated, corpus overlap documented in data 441k sentences Apache 2.0 + licenses of source datasets
Wikipedia (subset of als subset) uncurated, partially tagged with dialect information 27k total (including Swabian and Walser), thereof 2.3k (directly or indirectly) tagged as Alsatian, and 1.7k (directly or indirectly) tagged as Swiss German no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0
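
Several entries above note overlap between corpora (for example, SB-CH contains NOAH's corpus, and the webcrawl-based corpora overlap with Wikipedia). Before combining corpora or splitting them into training and test data, it can help to estimate how much of one corpus also appears in another. The sketch below assumes each corpus is already available as a list of sentence strings and uses exact matching after light normalization; `normalize` and `overlap_ratio` are hypothetical helper names, and the toy sentences are invented placeholders. Near-duplicates with differing spelling or punctuation would not be caught by this approach.

```python
import unicodedata


def normalize(sentence):
    """Unicode-normalize, case-fold, and collapse whitespace."""
    folded = unicodedata.normalize("NFC", sentence).casefold()
    return " ".join(folded.split())


def overlap_ratio(corpus_a, corpus_b):
    """Fraction of sentences in corpus_a that also occur in corpus_b
    (exact match after normalization)."""
    seen_b = {normalize(s) for s in corpus_b}
    normalized_a = [normalize(s) for s in corpus_a]
    if not normalized_a:
        return 0.0
    return sum(s in seen_b for s in normalized_a) / len(normalized_a)


# Invented placeholder sentences, not taken from any listed corpus.
corpus_a = ["Grüezi  mitenand", "placeholder sentence two"]
corpus_b = ["grüezi mitenand"]
ratio = overlap_ratio(corpus_a, corpus_b)
```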

↑ top

Walser · wae · wals1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| ArchiWals / CLiMAlp (Angster ea 2017, Gaeta 2020) | locations (Gressoney, Issime, Formazza, Rimella, Alagna) | 80k+ tokens | pronunciation spelling | |
| Walliserdeutsch/RRO (Garner 2014, Garner ea 2014) | | 8.3 hrs | audio, non-standardized transcription | custom |
| SwissDial (subset) (Dogan-Schönberger ea 2021) | topics (14 classes), translations (into German and 7 Swiss German dialects) | 3.3 hrs | audio, pronunciation spelling, German ortho | CC BY-NC 4.0 |
| All Swiss German Dialects Test Set (Plüss ea 2021b) | locations (cantons, incl. Wallis) | unk | audio, German ortho | MIT |
| AlpiLinK (Rabanus ea 2023) (wae subset) | location information | 122 files (42 sentences, up to 3 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
| Wikipedia (subset of als subset) | uncurated | 35 (of 27k total) tagged as Wal(li)ser | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

↑ top
