Improve translation of short sentences #215

Open
Tracked by #891 ...
eu9ene opened this issue Oct 4, 2023 · 9 comments
Labels
quality Improving robustness and translation quality

Comments

@eu9ene
Collaborator

eu9ene commented Oct 4, 2023

We have issues with translating shorter sentences and single words.

What might help:

  • Better data cleaning to avoid removing short translations from the training datasets
  • Try to find more datasets that contain short sentences
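To illustrate the first point: a length-ratio rule is a common cleaning step, and it is noisy for very short segments (one word against three words already gives a ratio of 3). The sketch below assumes the cleaning step uses such a rule, which this thread does not state. The function name `keep_pair` and both thresholds are hypothetical.

```python
def keep_pair(src, trg, max_ratio=2.0, short_limit=3):
    """Decide whether a sentence pair survives cleaning.

    Hypothetical rule: apply a length-ratio check only to longer segments,
    so that short translations are not filtered out by a noisy ratio.
    """
    src_len = len(src.split())
    trg_len = len(trg.split())
    # Drop pairs where either side is empty.
    if src_len == 0 or trg_len == 0:
        return False
    # Short segments: the ratio is not meaningful, so keep them.
    if max(src_len, trg_len) <= short_limit:
        return True
    ratio = max(src_len, trg_len) / min(src_len, trg_len)
    return ratio <= max_ratio
```

With these defaults, a one-word source paired with a three-word target is kept, while an eight-word source paired with a two-word target is dropped.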

eu9ene added the quality label Oct 4, 2023

@gregtatum
Member

I wonder if we can load in dictionaries where it's literally one word to one word.

@gregtatum
Member

Or maybe even synthesize it with the alignment data.
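
One way to picture the alignment idea: count how often each source word is aligned to each target word, then keep the most frequent target per source word. This is a rough sketch, assuming whitespace-tokenized text and alignments in the Pharaoh `i-j` format (the format emitted by tools such as fast_align). The function name `synthesize_word_pairs` and the `min_count` threshold are hypothetical and not part of the pipeline.

```python
from collections import Counter, defaultdict


def synthesize_word_pairs(src_lines, trg_lines, align_lines, min_count=2):
    """Build a one-word-to-one-word dictionary from word alignments.

    Assumes whitespace tokenization and Pharaoh-format links ("0-0 1-2"),
    where each link is "source_index-target_index".
    """
    counts = defaultdict(Counter)
    for src, trg, aln in zip(src_lines, trg_lines, align_lines):
        src_tokens = src.split()
        trg_tokens = trg.split()
        for link in aln.split():
            i, j = (int(x) for x in link.split("-"))
            # Skip links that point outside the sentence (malformed input).
            if i < len(src_tokens) and j < len(trg_tokens):
                counts[src_tokens[i].lower()][trg_tokens[j].lower()] += 1

    pairs = {}
    for src_word, trg_counts in counts.items():
        trg_word, count = trg_counts.most_common(1)[0]
        # Require some evidence before trusting a pair.
        if count >= min_count:
            pairs[src_word] = trg_word
    return pairs
```

The resulting pairs could then be appended to the training data as single-word "sentences", though how to weight them against the real corpus is an open question.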

@gregtatum
Member

This behavior also shows up with numbers. A good way to reproduce it is to translate a list of numbers.

@gregtatum
Member

Verify the fix with:

https://bugzilla.mozilla.org/show_bug.cgi?id=1888972

@gregtatum
Member

Here is a word count distribution for the merged corpus sl-en: https://firefox-ci-tc.services.mozilla.com/tasks/groups/PPCzZRHaTT6Ys4BIhPGT5w

[Figure: word count distribution, "en"]

[Figure: word count distribution, "sl"]

Generated via:

python3 pipeline/data/analyze.py --file_location https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VK5zmxJRTLy0y0WBQ0DRJg/artifacts/public/build/corpus.en.zst --output data --dataset OPUS_corpus --language en

python3 pipeline/data/analyze.py --file_location https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VK5zmxJRTLy0y0WBQ0DRJg/artifacts/public/build/corpus.sl.zst --output data --dataset OPUS_corpus --language sl
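
For a quick local check, a similar distribution can be approximated in a few lines of Python. This is a sketch, not the logic of `pipeline/data/analyze.py`: it assumes an already decompressed corpus with one sentence per line and counts whitespace-separated words. The helper names are hypothetical.

```python
from collections import Counter


def word_count_distribution(lines):
    """Map word count -> number of sentences with that many words."""
    return Counter(len(line.split()) for line in lines)


def print_histogram(distribution, max_words=10):
    """Print the count and share of sentences for 1..max_words words."""
    total = sum(distribution.values())
    for n in range(1, max_words + 1):
        count = distribution.get(n, 0)
        share = count / total if total else 0.0
        print(f"{n:>3} words: {count:>8} ({share:.2%})")
```

Usage would look like `print_histogram(word_count_distribution(open("corpus.en", encoding="utf-8")))`, where `corpus.en` stands for a locally decompressed copy of the artifact.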

@marco-c
Collaborator

marco-c commented Jul 26, 2024

So we have basically 0 sentences with 1 word?

@gregtatum
Member

I filed #878 which suggests augmenting with statistically synthesized single word translations.

@gregtatum
Member

I filed #879 which suggests harvesting short sentences from parallel datasets.
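
As a rough illustration of what harvesting could look like (the approach in #879 may differ): keep only the pairs where both sides are short, and drop exact duplicates. The sketch assumes two line-aligned iterables of sentences. The name `harvest_short_pairs` and the `max_words` cutoff are hypothetical.

```python
def harvest_short_pairs(src_lines, trg_lines, max_words=3):
    """Yield unique (source, target) pairs where both sides are short.

    Assumes src_lines and trg_lines are line-aligned; words are split on
    whitespace, which is only an approximation for some languages.
    """
    seen = set()
    for src, trg in zip(src_lines, trg_lines):
        src, trg = src.strip(), trg.strip()
        # Skip pairs with an empty side.
        if not src or not trg:
            continue
        if len(src.split()) <= max_words and len(trg.split()) <= max_words:
            if (src, trg) not in seen:
                seen.add((src, trg))
                yield src, trg
```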

@gregtatum
Member

I filed #880 which suggests statistically synthesizing short sentence translations from monolingual data sources.
