Consider statistically translating short sentences from monolingual datasets. #880

gregtatum · 2024-10-15T18:04:32Z

Short sentences are frequently removed from parallel datasets, so there aren't enough to train on.

In HPLT 2.0 the data is filtered at the document level rather than the sentence level. We could take high-scoring documents and extract short sentences from them. These short sentences are likely to be higher quality, given that they are embedded in a higher-quality document. Then, using tokenization and word alignment over the existing parallel sentences, we could statistically derive a corresponding translation for each short sentence. These synthesized pairs could be used for training.
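
To make the idea concrete, here is a rough sketch of what this could look like. It is only an illustration under several assumptions: the alignments are given in Pharaoh `i-j` format (the kind of output a word aligner such as fast_align produces), documents arrive as `(score, sentences)` tuples, and the names `build_lexicon`, `short_sentences`, `translate_short`, `min_doc_score`, `max_tokens`, and `min_count` are all hypothetical, not part of the pipeline. The toy English-Spanish data is made up.

```python
from collections import Counter, defaultdict


def parse_pharaoh(line):
    """Parse a Pharaoh-format alignment line such as "0-0 1-2" into index pairs."""
    return [tuple(int(x) for x in link.split("-")) for link in line.split()]


def build_lexicon(pairs):
    """Count how often each source word is aligned to each target word.

    pairs: iterable of (src_tokens, trg_tokens, alignment), where alignment
    is a list of (src_index, trg_index) tuples.
    """
    counts = defaultdict(Counter)
    for src, trg, alignment in pairs:
        for i, j in alignment:
            counts[src[i].lower()][trg[j].lower()] += 1
    return counts


def short_sentences(documents, min_doc_score=0.8, max_tokens=3):
    """Yield whitespace-tokenized short sentences from high-scoring documents."""
    for score, sentences in documents:
        if score < min_doc_score:
            continue
        for sentence in sentences:
            tokens = sentence.split()
            if 0 < len(tokens) <= max_tokens:
                yield tokens


def translate_short(tokens, lexicon, min_count=2):
    """Translate word by word using the most frequently aligned target word.

    Returns None if any word is unknown or its best candidate was seen fewer
    than min_count times. Word order is copied from the source and no
    agreement is applied, so this is deliberately naive.
    """
    output = []
    for token in tokens:
        candidates = lexicon.get(token.lower())
        if not candidates:
            return None
        word, count = candidates.most_common(1)[0]
        if count < min_count:
            return None
        output.append(word)
    return output


# Toy parallel data (assumed): source tokens, target tokens, alignment.
parallel = [
    ("the red house is big".split(), "la casa roja es grande".split(),
     parse_pharaoh("0-0 1-2 2-1 3-3 4-4")),
    ("the house is old".split(), "la casa es vieja".split(),
     parse_pharaoh("0-0 1-1 2-2 3-3")),
    ("a red car".split(), "un coche rojo".split(),
     parse_pharaoh("0-0 1-2 2-1")),
    ("the red house".split(), "la casa roja".split(),
     parse_pharaoh("0-0 1-2 2-1")),
]

# Toy monolingual documents (assumed): (document score, sentences).
documents = [
    (0.9, ["Red house", "This sentence is far too long to count as short ."]),
    (0.2, ["The house"]),
]

lexicon = build_lexicon(parallel)
synthetic = []
for tokens in short_sentences(documents):
    translation = translate_short(tokens, lexicon)
    if translation is not None:
        synthetic.append((tokens, translation))

print(synthetic)
```

On this toy data "Red house" comes out as "roja casa": the individual words are plausible, but the word order is wrong, so a real version would need something smarter than word-by-word lookup (phrase-level statistics, reordering, or filtering the output with a quality score).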

The biggest risk I see here is the model learning bad translations that use the wrong part of speech or morphology for a word. For instance, if a word's declension in isolation differs from its form in context, the synthesized translation may come out in an awkward form. Perhaps this would only work for the subset of languages with fewer morphological differences between word forms.

gregtatum added the quality (Improving robustness and translation quality) and data sources (Data importer support) labels Oct 15, 2024

gregtatum mentioned this issue Oct 15, 2024

Improve translation of short sentences #215

Open

gregtatum changed the title ~~Consider stastically translating short sentences from monolingual datasets.~~ Consider statistically translating short sentences from monolingual datasets. Oct 30, 2024
