Consider statistically translating short sentences from monolingual datasets. #880

Open
gregtatum opened this issue Oct 15, 2024 · 0 comments
Labels
data sources (Data importer support), quality (Improving robustness and translation quality)

Comments

@gregtatum
Member

Short sentences are frequently removed from parallel datasets, so there aren't enough to train on.

In HPLT 2.0 the data is filtered at the document level rather than the sentence level. We could take high-scoring documents and extract short sentences from them. These short sentences may be of higher quality because they are embedded in a higher-quality document. Then, using tokenization and word alignment, we could statistically extract a corresponding translation for each short sentence from the parallel sentences. These synthesized pairs could be used for training.
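As a rough illustration of the alignment step, here is a minimal sketch, not a proposed implementation. It assumes the parallel corpus is already whitespace-tokenized and word-aligned, with each alignment given as `(source_index, target_index)` pairs, and that a short sentence only counts as matched when it appears verbatim as a token span on the source side. The function names (`find_span`, `extract_target_span`, `synthesize_pair`) are hypothetical. The target span is kept only if it is consistent in the usual phrase-extraction sense, and the most frequent candidate across the corpus is chosen.

```python
from collections import Counter


def find_span(tokens, phrase):
    """Return (start, end) of the first occurrence of phrase in tokens, or None."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i, i + n
    return None


def extract_target_span(src_span, alignment):
    """Project a source span onto the target side using word alignments.

    Returns (start, end) on the target side, or None if nothing is aligned
    or if a target token inside the span is also aligned to a source token
    outside the span (an inconsistent extraction).
    """
    start, end = src_span
    tgt_indices = [t for s, t in alignment if start <= s < end]
    if not tgt_indices:
        return None
    lo, hi = min(tgt_indices), max(tgt_indices) + 1
    for s, t in alignment:
        if lo <= t < hi and not (start <= s < end):
            return None
    return lo, hi


def synthesize_pair(short_sentence, parallel_corpus):
    """Pick the most frequent aligned translation of short_sentence.

    parallel_corpus is an iterable of (source, target, alignment) triples.
    Returns (translation, count), or None if no consistent candidate exists.
    """
    phrase = short_sentence.split()
    if not phrase:
        return None
    candidates = Counter()
    for src, tgt, alignment in parallel_corpus:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        span = find_span(src_tokens, phrase)
        if span is None:
            continue
        tgt_span = extract_target_span(span, alignment)
        if tgt_span is None:
            continue
        candidates[" ".join(tgt_tokens[tgt_span[0]:tgt_span[1]])] += 1
    if not candidates:
        return None
    return candidates.most_common(1)[0]
```

Taking the most frequent candidate is only one possible way to make the choice "statistical"; a count threshold or an alignment-score cutoff could be layered on top to discard rare or noisy extractions.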

The biggest risk I see here is the model learning bad translations that use the wrong part of speech or morphology for a word. For instance, if a word declines differently in isolation than it does inside a longer sentence, the extracted translation may come out in an awkward form. Perhaps this would only work for subsets of languages with smaller morphological differences between their words.

@gregtatum added the quality (Improving robustness and translation quality) and data sources (Data importer support) labels on Oct 15, 2024
@gregtatum changed the title from "Consider stastically translating short sentences from monolingual datasets." to "Consider statistically translating short sentences from monolingual datasets." on Oct 30, 2024