I'm going to do several experiments around distillation data decoding, and it will be easier to write up the results here since they are all related to the same part of the pipeline: `translate-mono-src` and `translate-corpus`.

My plan is to test on `da-en`, since it was a good model result and should be indicative of any quality drops.

The data for this experiment is available in this spreadsheet. I measured both GPU utilization and how much data was being written to the target file in bytes/sec. Each run operated on the same subset of the data. I measured 15 minutes of translations across 5 batches and summarized the results to get a sample of how fast the translations were happening.
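For context, the bytes/sec numbers are periodic samples of how fast the target file grows, paired with `nvidia-smi` utilization readings. A minimal sketch of that kind of sampling loop (the file path, interval, and sample count are placeholders; this is illustrative, not necessarily the exact script behind the spreadsheet):

```python
import os
import subprocess
import time

TARGET = "artifacts/mono.en"  # placeholder: file the decoder is appending translations to
INTERVAL = 10                 # seconds between samples
SAMPLES = 90                  # 90 * 10s = 15 minutes

prev = os.path.getsize(TARGET)
for _ in range(SAMPLES):
    time.sleep(INTERVAL)
    size = os.path.getsize(TARGET)
    # Query instantaneous GPU utilization (percent) from nvidia-smi.
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(f"bytes/sec={(size - prev) / INTERVAL:.0f}\tgpu_util={gpu}%")
    prev = size
```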
| decoder | precision | teachers | maxi-batch-words | maxi-batch | GPU utilization (%) | bytes/sec | vs 500 baseline | vs Marian best |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| marian | float32 | 2 | 500 | 1,000 | 64.7 | 156,118 | | |
| marian | float16 | 2 | 500 | 1,000 | 57.5 | 183,627 | 118% | |
| marian | float16 | 2 | 4,000 | 1,000 | 61.5 | 329,703 | 211% | |
| marian | float16 | 2 | 5,000 | 1,000 | 57.4 | 338,798 | 217% | |
| marian | float16 | 2 | 5,000 | 10,000 | 57.2 | 348,805 | 223% | |
| marian | float16 | 1 | 5,000 | 10,000 | 55.9 | 601,635 | 385% | |
| marian | float16 | 2 | 8,000 | 1,000 | - | Out of Memory | - | - |
| ctranslate2 | float16 | 1 | 5,000 | - | 97.2 | 1,187,192 | 760% | 197% |
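The largest gain in the table is CTranslate2 decoding with float16. As a rough illustration of what that configuration looks like through the CTranslate2 Python API (paths, vocab, and beam size are placeholders, and this assumes the Marian teacher has already been converted to CTranslate2 format):

```python
import ctranslate2
import sentencepiece as spm

# Placeholders: converted teacher model directory and SentencePiece vocab.
translator = ctranslate2.Translator("teacher-ct2/", device="cuda", compute_type="float16")
sp = spm.SentencePieceProcessor(model_file="vocab.spm")

lines = ["Dette er en test.", "Hvor hurtigt kan vi oversætte?"]
tokens = [sp.encode(line, out_type=str) for line in lines]

results = translator.translate_batch(
    tokens,
    beam_size=4,           # placeholder beam size
    max_batch_size=5000,   # with batch_type="tokens", roughly analogous to maxi-batch-words
    batch_type="tokens",
)
translations = [sp.decode(r.hypotheses[0]) for r in results]
print("\n".join(translations))
```

Note that the CTranslate2 row decodes with a single teacher, while most of the Marian rows ensemble two, which is part of why the decoder/ensemble comparison below is needed.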
Then run the experiments on the decoder/ensemble configurations:
| inference | teacher ensemble | student COMET | speed | GPU utilization |
| --- | --- | --- | --- | --- |
| ctranslate2 | 1 | | | |
| marian | 1 | | | |
| marian | 2 | | | |
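For the student COMET column, the scores would presumably come from the COMET library. A minimal sketch, assuming the standard `unbabel-comet` Python API (the checkpoint name is just an example and may not be the one used for evaluation here; file handling is omitted):

```python
from comet import download_model, load_from_checkpoint

# Example checkpoint; the actual metric/checkpoint used for evaluation may differ.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

sources = ["Dette er en test."]     # da source sentences
hypotheses = ["This is a test."]    # student translations
references = ["This is a test."]    # en references

data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
output = model.predict(data, batch_size=32, gpus=1)
print(output.system_score)  # corpus-level COMET score (COMET >= 2.0 API)
```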