Improve translation of short sentences #215

eu9ene · 2023-10-04T23:21:28Z

We have issues with translating shorter sentences and single words.

What might help:

Better data cleaning to avoid removing short translations from the training datasets
Try to find more datasets that contain short sentences
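
As a rough illustration of the first point, a length-based cleaning rule can be written so that it does not throw away short pairs. This is a minimal sketch and not the pipeline's actual cleaning code: the function name `keep_pair` and the thresholds `max_words`, `max_ratio`, and `short_cutoff` are all hypothetical.

```python
def keep_pair(src: str, trg: str, max_words: int = 150,
              max_ratio: float = 3.0, short_cutoff: int = 3) -> bool:
    """Return True if the sentence pair should stay in the training data.

    All thresholds here are illustrative assumptions, not tuned values.
    """
    src_len = len(src.split())
    trg_len = len(trg.split())

    # A pair with an empty side carries no translation signal.
    if src_len == 0 or trg_len == 0:
        return False

    # Drop overly long segments.
    if src_len > max_words or trg_len > max_words:
        return False

    # Length ratios are unreliable for very short pairs (1 word vs. 3 words
    # is already a ratio of 3), so skip the ratio check when both sides are short.
    if max(src_len, trg_len) <= short_cutoff:
        return True

    return max(src_len, trg_len) / min(src_len, trg_len) <= max_ratio
```

With these assumed thresholds, `keep_pair("Hello", "Dober dan")` keeps the pair, while a ten-word source paired with a one-word target is still rejected by the ratio check.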

gregtatum · 2023-10-11T20:08:57Z

I wonder if we can load in dictionaries where it's literally one word to one word.

gregtatum · 2023-10-11T20:09:10Z

Or maybe even synthesize it with the alignment data.
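
One way this could look, as a sketch only: assume the alignment data is available as whitespace-separated `srcIndex-trgIndex` links per sentence pair (the common Pharaoh-style format), keep only links where both tokens are aligned exactly once, and pick the most frequent target word per source word. The function names, the `min_count` threshold, and the input layout are assumptions for illustration.

```python
from collections import Counter, defaultdict


def one_to_one_pairs(src_tokens, trg_tokens, alignment):
    """Yield (src_word, trg_word) for links where both tokens have exactly one link."""
    links = [tuple(int(i) for i in link.split("-")) for link in alignment.split()]
    src_links = Counter(s for s, _ in links)
    trg_links = Counter(t for _, t in links)
    for s, t in links:
        if src_links[s] == 1 and trg_links[t] == 1:
            yield src_tokens[s].lower(), trg_tokens[t].lower()


def build_dictionary(corpus, min_count=2):
    """Build a word-to-word dictionary from (source, target, alignment) triples.

    min_count is a hypothetical threshold to filter out rare, noisy links.
    """
    counts = defaultdict(Counter)
    for src, trg, alignment in corpus:
        for s_word, t_word in one_to_one_pairs(src.split(), trg.split(), alignment):
            counts[s_word][t_word] += 1

    dictionary = {}
    for s_word, trg_counter in counts.items():
        t_word, count = trg_counter.most_common(1)[0]
        if count >= min_count:
            dictionary[s_word] = t_word
    return dictionary


# Toy en-sl example with hand-written alignments.
corpus = [
    ("the dog", "pes", "1-0"),
    ("a dog sleeps", "pes spi", "1-0 2-1"),
]
dictionary = build_dictionary(corpus)
```

The resulting word pairs could then be appended to the training corpus as single-word examples. Whether that helps or hurts quality would need to be measured.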

gregtatum · 2024-05-15T17:51:47Z

This behavior is also visible with numbers. A good example is translating a list of numbers.
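
A quick way to spot this kind of failure is to compare the digit sequences in the source and in the translation. The helper below is a hypothetical sketch, not part of the project. It assumes numbers are written as digits on both sides and appear in the same order.

```python
import re

# Runs of ASCII digits. Decimal separators are ignored, so "1.5" and "1,5"
# both yield ["1", "5"] and still compare equal across locales.
DIGITS_RE = re.compile(r"[0-9]+")


def numbers_preserved(source: str, translation: str) -> bool:
    """Return True if both strings contain the same digit runs in the same order."""
    return DIGITS_RE.findall(source) == DIGITS_RE.findall(translation)
```

For a source of `"1 2 3 4 5"`, a translation that drops or repeats entries fails the check, while an exact copy passes.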

gregtatum · 2024-07-19T18:31:31Z

Verify the fix with:

https://bugzilla.mozilla.org/show_bug.cgi?id=1888972

gregtatum · 2024-07-26T20:02:35Z

Here is a word count distribution for the merged corpus sl-en: https://firefox-ci-tc.services.mozilla.com/tasks/groups/PPCzZRHaTT6Ys4BIhPGT5w

Generated via:

python3 pipeline/data/analyze.py --file_location https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VK5zmxJRTLy0y0WBQ0DRJg/artifacts/public/build/corpus.en.zst --output data --dataset OPUS_corpus --language en

python3 pipeline/data/analyze.py --file_location https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/VK5zmxJRTLy0y0WBQ0DRJg/artifacts/public/build/corpus.sl.zst --output data --dataset OPUS_corpus --language sl
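
For readers without access to the artifacts, the idea of a word count distribution can be sketched in a few lines. This is not the implementation of `pipeline/data/analyze.py`. It is a simplified illustration that assumes whitespace tokenization and one sentence per line, with hypothetical function names.

```python
from collections import Counter


def word_count_distribution(lines):
    """Map word count -> number of sentences with that many words."""
    counts = Counter()
    for line in lines:
        n = len(line.split())
        if n:  # skip empty lines
            counts[n] += 1
    return counts


def share_at_most(counts, max_words):
    """Fraction of sentences that have at most max_words words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for n, c in counts.items() if n <= max_words) / total


lines = ["Hello", "Good morning", "How are you today", ""]
counts = word_count_distribution(lines)
```

Looking at `counts[1]` or `share_at_most(counts, 2)` on a real corpus gives a direct measure of how many single-word and very short sentences survive cleaning.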

marco-c · 2024-07-26T20:09:09Z

So we have basically 0 sentences with 1 word?

gregtatum · 2024-10-15T17:48:22Z

I filed #878, which suggests augmenting the training data with statistically synthesized single-word translations.

gregtatum · 2024-10-15T17:52:25Z

I filed #879 which suggests harvesting short sentences from parallel datasets.

gregtatum · 2024-10-15T18:06:32Z

I filed #880, which suggests statistically synthesizing short-sentence translations from monolingual data sources.

eu9ene added the "quality" label (Improving robustness and translation quality) Oct 4, 2023

eu9ene mentioned this issue Oct 31, 2023

[meta] Improve translation robustness #238

Open

marco-c mentioned this issue Apr 4, 2024

Create a dataset of short sentences for training and/or validation #514

Open

gregtatum mentioned this issue May 15, 2024

Our models should be robust enough to translate a calendar #600

Closed

marco-c mentioned this issue Jul 26, 2024

Investigate using LLMs to generate training data #767

Open

ZJaume mentioned this issue Oct 21, 2024

Disable use of bicleaner-hardrules #888

Closed

eu9ene mentioned this issue Oct 21, 2024

[meta] Retrain older models #891

Open
