parent	title	description	featured
Customisation	Parallel data	Parallel data for training machine translation	true

Parallel data

Parallel data, or parallel corpora, are datasets of translation pairs: sentences and their translations. They are used to train and test machine translation models.

Original	Translation
File	Archivo

Parallel datasets can include translations for one or more language pairs, and can be directional or directionless.
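As a rough illustration, a translation pair like the one in the table above can be held in a small record that also names the languages on each side. This is a minimal sketch, not a standard format: the `SentencePair` class, the `read_pairs` helper, the tab-separated layout, and the `en`/`es` language codes are all assumptions made for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One translation pair plus the language of each side (hypothetical type)."""
    source: str
    target: str
    source_lang: str
    target_lang: str


def read_pairs(tsv_file, source_lang, target_lang):
    """Read tab-separated source/target lines into SentencePair objects.

    Rows that do not have exactly two columns are skipped.
    """
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        SentencePair(row[0], row[1], source_lang, target_lang)
        for row in reader
        if len(row) == 2
    ]


# An in-memory file stands in for a real corpus file.
sample = io.StringIO("File\tArchivo\n")
pairs = read_pairs(sample, "en", "es")
print(pairs[0])
```

A dataset covering several language pairs could then be a list of such records with different language codes.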

Parallel data creation

Parallel datasets can be created manually, automatically, or synthetically from monolingual data.

Human translation
Human post-editing
Crawling
Alignment

Parallel data can be created by crawling and aligning monolingual text, and by back-translation or back-copying.
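The two synthetic approaches can be sketched as follows. This sketch assumes that back-translation means machine-translating target-language monolingual text into the source language, and that back-copying means reusing the target-language text unchanged as the source side. The function names and the toy dictionary "translator" are hypothetical; a real pipeline would call a trained machine translation model instead.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source, target)


def back_translate(
    target_sentences: List[str],
    translate_to_source: Callable[[str], str],
) -> List[Pair]:
    """Pair each target sentence with a machine-generated source sentence."""
    return [(translate_to_source(t), t) for t in target_sentences]


def back_copy(target_sentences: List[str]) -> List[Pair]:
    """Pair each target sentence with an unchanged copy of itself."""
    return [(t, t) for t in target_sentences]


# Toy stand-in for a target-to-source translation model (assumption).
toy_es_to_en = {"Archivo": "File"}


def toy_translate(sentence: str) -> str:
    # Unknown sentences are returned unchanged.
    return toy_es_to_en.get(sentence, sentence)


monolingual_es = ["Archivo"]
synthetic = back_translate(monolingual_es, toy_translate)
copied = back_copy(monolingual_es)
print(synthetic)
print(copied)
```

In both cases the target side is genuine text and only the source side is synthetic.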

Goals

Parallel data is used to train statistical and neural machine translation engines.

Challenges

Parallel data is available for the most widely written language pairs, but is scarce or unavailable for many other language pairs.

Parallel data can contain errors, such as misaligned sentences, bad sentence segmentation, bad encodings, and text in the wrong language or in mixed languages. These errors are a challenge because they degrade the quality of the machine translation output. Many of them can be reduced by filtering.
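A few of these checks can be approximated with simple rules. The sketch below is illustrative only: the `is_clean` and `filter_pairs` names and the length-ratio threshold of 3.0 are assumptions, and the character-length ratio is only a crude proxy for misalignment. Detecting wrong or mixed language would need a language identification model, which is not shown here.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source, target)


def is_clean(source: str, target: str, max_length_ratio: float = 3.0) -> bool:
    """Return True if the pair passes a few basic rule-based checks."""
    source, target = source.strip(), target.strip()

    # An empty side usually indicates a segmentation or alignment problem.
    if not source or not target:
        return False

    # U+FFFD is the Unicode replacement character, a sign of a decoding error.
    if "\ufffd" in source or "\ufffd" in target:
        return False

    # Very different lengths suggest the sentences are misaligned.
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    if longer / shorter > max_length_ratio:
        return False

    return True


def filter_pairs(pairs: List[Pair], max_length_ratio: float = 3.0) -> List[Pair]:
    """Keep only the pairs that pass is_clean."""
    return [(s, t) for s, t in pairs if is_clean(s, t, max_length_ratio)]


noisy = [
    ("File", "Archivo"),
    ("File", ""),                              # empty target
    ("Save", "Guardar\ufffd"),                 # encoding damage
    ("OK", "Este archivo no se puede abrir"),  # likely misaligned
]
clean = filter_pairs(noisy)
print(clean)
```

The threshold is a trade-off: a lower ratio removes more misaligned pairs but also discards valid translations between languages with very different sentence lengths.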

Public parallel data

Name	Type
CCAligned	Data repository
CCMatrix	Data set
Clarin	Data repository
Europarl	Data set
FLORES	Data set
Hansard	Data set
JESC	Data set
Mozilla Common Voice	Data set
OpenSubtitles	Data repository
ParaCrawl	Data repository
VoxPopuli	Data set
WikiMatrix	Data set
WikiTitles	Data set
