
Experiment with student model parameters #894

Open
Tracked by #912
gregtatum opened this issue Oct 22, 2024 · 7 comments
gregtatum (Member) commented Oct 22, 2024

In *Ludicrously Fast Neural Machine Translation*, the authors test a variety of decoder configurations for faster models.

[Screenshot of Table 3: Configuration of student models and submissions]

In #174 @eu9ene showed that a larger model helps improve the COMET score for en-ru by +2.9, which is pretty significant.

(Edit: I changed from en-ru to en-lt)

I'd like to test the parameters a bit more, as these changes are impactful in terms of quality, but they also affect the performance of the model. The paper tested parameters on en-de, but our training of en-ru has struggled to gain the same amount of COMET with the same architecture. Rather than testing en-ru, I'll do a clean run on en-lt, as it had a pretty low COMET score and also features much more varied morphology due to its declension system. The idea is that the results will scale to other Balto-Slavic languages.

I'm shortening the labels in the table a bit:

- `dec-depth`: depth
- `dim-emb`: emb
- `transformer-dim-ffn`: ffn

| COMET | vs tiny | speed | depth | emb | ffn  | Name                |
|-------|---------|-------|-------|-----|------|---------------------|
| 86.67 | -       |       | 2     | 256 | 1536 | decoder-tiny        |
| 88.78 | +2.11   |       | 2     | 512 | 2048 | decoder-base        |
|       |         |       | 3     | 256 | 1536 | decoder-depth-3     |
|       |         |       | 6     | 256 | 1536 | decoder-depth-6     |
|       |         |       | 2     | 256 | 2048 | decoder-ffn-bigger  |
| 88.47 | +1.80   |       | 2     | 512 | 1536 | decoder-emb-bigger  |
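To make the columns concrete, here is a minimal sketch of the decoder-tiny row written out as Marian options. This is illustrative only, not the actual pipeline config; the YAML keys mirror the CLI flags (`--dec-depth`, `--dim-emb`, `--transformer-dim-ffn`), and the encoder depth and SSRU decoder settings are assumptions based on the usual tiny student recipe.

```yaml
# Illustrative sketch only, not the real training config.
type: transformer
enc-depth: 6                       # assumed; the table only varies the knobs below
dec-depth: 2                       # "depth" column: number of decoder layers
dim-emb: 256                       # "emb" column: embedding size
transformer-dim-ffn: 1536          # "ffn" column: transformer feed-forward size
transformer-decoder-autoreg: rnn   # assumed: recurrent (SSRU) decoder, per the tiny recipe
dec-cell: ssru
```

The decoder-base row would be the same sketch with `dim-emb: 512` and `transformer-dim-ffn: 2048`.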

Links

gregtatum added the `experiment` label (A training experiment with hypothesis and results) Oct 22, 2024
gregtatum self-assigned this Oct 22, 2024
eu9ene (Collaborator) commented Oct 22, 2024

I suggest using a different language pair for this experiment. en-ru was trained from a super convoluted branch, "release_no_priors", where I had to change the graph by adding an extra step for alignments in order to do some bug fixes without retraining everything from scratch. It's far behind main and doesn't have the latest W&B fixes, so I don't want to run any more experiments from the "release"-based branches. If we switch to main, the graph will not be compatible, so we'd have to at least rerun the alignments step and reuse some other tasks via "existing_tasks". With all that, it's a lot easier to run some other language pair we struggle with from main, where we can reuse the tasks from release, for example en-lt.

On another note, this looks like a hyperparameter search that we can do manually, but there are tools to automate it that we might explore in the future.

gregtatum (Member, Author) commented

Ok, en-lt sounds like a great choice. I read a bit more on it and it's got a lot of qualitative feedback in #756.

gregtatum (Member, Author) commented

Lithuanian has a similar use of declensions: https://en.wikipedia.org/wiki/Lithuanian_declension

gregtatum added the `quality` label (Improving robustness and translation quality) Oct 30, 2024
gregtatum (Member, Author) commented

I've got the first one started, and will wait until it gets to the student step before kicking off the rest:

https://firefox-ci-tc.services.mozilla.com/tasks/groups/Wxvkl1ruQkCj6URG6oIuuQ

The configs are each defined in their own commit:
https://github.com/mozilla/translations/commits/dev-en-lt-decoder-size/

gregtatum (Member, Author) commented

They are all in student training now on the dashboard; they're the runs named decoder-*.

gregtatum changed the title from "Experiment with the decoder sizes" to "Experiment with student model parameters" Nov 13, 2024
gregtatum (Member, Author) commented

So I misread the paper a bit when it was talking about decoders: the ffn and embedding sizes affect both the encoder and the decoder equally. The decoder depth is the only changed parameter that is specific to the decoder. I'm updating my experiment notes accordingly.

ZJaume (Collaborator) commented Nov 13, 2024

I think `transformer-dim-ffn` applies to both. But since the decoder is an SSRU, which is a recurrent network, the params that get applied to the decoder are the s2s ones. So I guess that to change the feed-forward size of the decoder, `dim-rnn` has to be used?
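For reference, here is a hedged sketch of the options being discussed, written out as Marian-style YAML (names only, purely illustrative; whether `dim-rnn` is actually the right lever for the SSRU decoder is exactly the open question above).

```yaml
# Illustrative only: the knobs referenced above, assuming standard Marian option names.
transformer-dim-ffn: 1536          # transformer feed-forward size (encoder for sure; decoder unclear)
transformer-decoder-autoreg: rnn   # the decoder is recurrent (SSRU) rather than self-attention
dec-cell: ssru
dim-rnn: 1024                      # candidate knob for the recurrent decoder's width (example value)
```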
