parallel-data.md

parent: Customisation
title: Parallel data
description: Parallel data for training machine translation
featured: true

Parallel data

Parallel data, or parallel corpora, are datasets of translation pairs: sentences and their translations. They are used to train and test machine translation models.

| Original | Translation |
| --- | --- |
| File | Archivo |

Parallel datasets can include translations for one or more language pairs, and can be directional, recording which side is the original and which is the translation, or directionless.
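As an illustration, a translation pair can be held in a small record type. The sketch below is one possible representation, not a standard format: the `TranslationPair` class, its field names, the `directed` flag, and the tab-separated input layout are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TranslationPair:
    """One sentence and its translation (hypothetical record layout)."""
    source: str
    target: str
    source_lang: str
    target_lang: str
    # Assumption: True means `source` is known to be the original text.
    directed: bool = True


def parse_tsv(lines: Iterable[str], source_lang: str, target_lang: str,
              directed: bool = True) -> List[TranslationPair]:
    """Read 'source<TAB>target' lines, skipping blank or malformed ones."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        source, sep, target = line.partition("\t")
        if not sep or not source or not target:
            continue  # no tab found, or one side is empty
        pairs.append(
            TranslationPair(source, target, source_lang, target_lang, directed)
        )
    return pairs


def reverse(pair: TranslationPair) -> TranslationPair:
    """Swap the two sides; only meaningful as-is for directionless data."""
    return TranslationPair(pair.target, pair.source,
                           pair.target_lang, pair.source_lang, pair.directed)


pairs = parse_tsv(["File\tArchivo\n", "\n", "no tab here\n"], "en", "es")
```

A directionless dataset can be used in either direction by applying `reverse`, whereas for directional data the swapped pair no longer has the original text on the source side.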

Parallel data creation

Parallel datasets can be created manually, automatically, or synthetically from monolingual data.

Parallel data can be created by crawling and aligning monolingual text, and by back-translation or back-copying.
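The two synthetic approaches can be sketched in a few lines. This is a minimal illustration that assumes the common definitions: back-translation pairs each target-language sentence with a machine translation of it into the source language, and back-copying pairs each target-language sentence with an unchanged copy of itself. The `reverse_translate` callable is a hypothetical stand-in for a real target-to-source translation model, and `toy_reverse_model` is a lookup table used only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source side, target side)


def back_translate(target_sentences: Iterable[str],
                   reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Build synthetic pairs: machine-made source, human-written target."""
    return [(reverse_translate(t), t) for t in target_sentences]


def back_copy(target_sentences: Iterable[str]) -> List[Pair]:
    """Build synthetic pairs by copying the target text to the source side."""
    return [(t, t) for t in target_sentences]


# Toy stand-in for a Spanish-to-English model (assumption, not a real system).
_TOY_LEXICON = {"Archivo": "File", "Editar": "Edit"}


def toy_reverse_model(sentence: str) -> str:
    return _TOY_LEXICON.get(sentence, sentence)


monolingual_es = ["Archivo", "Editar"]
synthetic_bt = back_translate(monolingual_es, toy_reverse_model)
synthetic_bc = back_copy(monolingual_es)
```

In both cases the target side remains human-written text, so the synthetic noise is confined to the source side.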

Goals

Parallel data is used to train statistical and neural machine translation engines.

Challenges

Parallel data is available for the most widely written language pairs, but is scarce or unavailable for many other language pairs.

Parallel data can contain errors, such as misaligned sentences, bad sentence segmentation, bad encodings, and wrong or mixed languages. These errors are a challenge because they degrade the quality of the machine translation output. They can be reduced by filtering.
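A rule-based filter for some of these errors might look like the sketch below. The heuristics and thresholds (`max_ratio`, `max_chars`) are illustrative assumptions rather than recommended values. The length-ratio check is only a rough proxy for misalignment, the check for the Unicode replacement character catches only one kind of encoding damage, and wrong-language detection is left out because it needs a language identification model.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source side, target side)


def keep_pair(source: str, target: str,
              max_ratio: float = 3.0, max_chars: int = 1000) -> bool:
    """Return True if a pair passes some simple, assumed quality heuristics."""
    source, target = source.strip(), target.strip()
    if not source or not target:
        return False  # one side is empty
    if "\ufffd" in source or "\ufffd" in target:
        return False  # replacement character suggests a decoding error
    if len(source) > max_chars or len(target) > max_chars:
        return False  # overly long lines often indicate bad segmentation
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= max_ratio  # rough misalignment check


def filter_pairs(pairs: Iterable[Pair], **thresholds) -> List[Pair]:
    """Drop failing pairs and exact duplicates, keeping the original order."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = (source.strip(), target.strip())
        if key in seen or not keep_pair(source, target, **thresholds):
            continue
        seen.add(key)
        kept.append(key)
    return kept


noisy = [
    ("File", "Archivo"),
    ("File", "Archivo"),      # exact duplicate
    ("Hello", ""),            # empty target
    ("Hi", "x" * 50),         # implausible length ratio
    ("Caf\ufffd", "Café"),    # encoding damage
]
clean = filter_pairs(noisy)
```

Character-length ratios vary between languages and scripts, so any threshold like `max_ratio` would need tuning per language pair.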

Public parallel data

| Name | Type |
| --- | --- |
| CCAligned | Data repository |
| CCMatrix | Data set |
| Clarin | Data repository |
| Europarl | Data set |
| FLORES | Data set |
| Hansard | Data set |
| JESC | Data set |
| Mozilla Common Voice | Data set |
| OpenSubtitles | Data repository |
| ParaCrawl | Data repository |
| VoxPopuli | Data set |
| WikiMatrix | Data set |
| WikiTitles | Data set |