Short sentences are frequently removed from parallel datasets, so there aren't enough to train on.
In HPLT 2.0 the data is filtered at the document level rather than the sentence level. We could take high-scoring documents and extract short sentences from them. These short sentences are likely to be higher quality, given that they are embedded in a higher-quality document. Then, using tokenization and alignment over the existing parallel sentences, we can statistically extract a corresponding translation for each short sentence. These synthesized pairs could be used for training.
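To make the idea concrete, here is a minimal sketch of the pipeline, not an existing implementation. It assumes documents arrive as `(score, text)` tuples and that word alignments have already been computed by some external aligner and are available as `(source_index, target_index)` pairs. The function names, the score threshold, and the token limit are all hypothetical, and the regex tokenizer and sentence splitter are naive stand-ins for real ones.

```python
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    """Naive tokenizer (an assumption): lowercase word characters only."""
    return re.findall(r"\w+", sentence.lower())


def short_sentences(docs, min_doc_score=0.8, max_tokens=4):
    """Yield short sentences from documents scoring at or above min_doc_score.

    `docs` is an iterable of (score, text) tuples. The threshold and the
    token limit are illustrative values, not tuned ones.
    """
    for score, text in docs:
        if score < min_doc_score:
            continue
        # Naive splitter: break on whitespace after sentence-final punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if 0 < len(tokenize(sentence)) <= max_tokens:
                yield sentence


def build_lexicon(aligned_pairs):
    """Map each source token to its most frequently aligned target token.

    `aligned_pairs` is an iterable of (src_tokens, trg_tokens, alignment),
    where alignment is a list of (src_index, trg_index) pairs.
    """
    counts = defaultdict(Counter)
    for src_tokens, trg_tokens, alignment in aligned_pairs:
        for i, j in alignment:
            counts[src_tokens[i]][trg_tokens[j]] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def synthesize(sentence, lexicon):
    """Translate word by word; return None if any token is not in the lexicon."""
    tokens = tokenize(sentence)
    if not tokens or any(t not in lexicon for t in tokens):
        return None
    return " ".join(lexicon[t] for t in tokens)


docs = [
    (0.9, "Good morning. This is a much longer sentence that should not be selected here."),
    (0.2, "Good night."),  # skipped: the document score is too low
]
aligned_pairs = [
    (["good", "morning", "everyone"], ["buenos", "días", "a", "todos"], [(0, 0), (1, 1), (2, 3)]),
    (["good", "night"], ["buenas", "noches"], [(0, 0), (1, 1)]),
    (["good", "morning"], ["buenos", "días"], [(0, 0), (1, 1)]),
]

lexicon = build_lexicon(aligned_pairs)
pairs = [(s, synthesize(s, lexicon)) for s in short_sentences(docs)]
```

The word-by-word lookup ignores agreement and word order. With this toy lexicon, `synthesize("Good night.", lexicon)` would produce "buenos noches" rather than "buenas noches", so a real version would need some way to filter or score the synthesized pairs.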
The biggest risk I see here is the model learning bad translations when the wrong part of speech or morphology is chosen for a word. For instance, if a word's declension in isolation differs from the form the sentence requires, the result may be an awkward mistranslation. Perhaps this would only work for subsets of languages with fewer morphological differences between their words.
gregtatum changed the title from "Consider stastically translating short sentences from monolingual datasets." to "Consider statistically translating short sentences from monolingual datasets." on Oct 30, 2024