How can I use the pretokenized jsonl file? #61

Open · localbetascreening opened this issue Dec 5, 2024 · 7 comments

@localbetascreening

I have pretokenized the proxy codes to speed up training. How can I use them during training? I cannot see where in the config file to include the pretokenized jsonl file.

@localbetascreening (Author)

Question 2: In each epoch the images are randomly cropped, so the corresponding proxy codes should differ from epoch to epoch. How can a single pretokenized proxy code for an image then cover all epochs?

@localbetascreening (Author)

I added pretokenization: "PATH/pretokenized.jsonl" to the config file and ran training, then got this error:

```
[rank6]: global_step = train_one_epoch(config, logger, accelerator,
[rank6]: File "/home/weixing/titok/utils/train_utils.py", line 353, in train_one_epoch
[rank6]: raise ValueError(f"Not found valid keys: {batch.keys()}")
[rank6]: AttributeError: 'list' object has no attribute 'keys'
```

How can I run this correctly?
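
For context, the AttributeError means the dataloader handed train_one_epoch a plain Python list rather than the dict of tensors it expects, so even the error message itself fails to format. Below is a minimal sketch, not the repo's actual code, of a jsonl-backed dataset that yields dict samples; the "codes" field name is an assumption about the jsonl schema. PyTorch's default collate would then batch these dicts into a dict of tensors.

```python
import json
import torch
from torch.utils.data import Dataset, DataLoader

class PretokenizedJsonl(Dataset):
    """Sketch of a pretokenized-jsonl dataset: one JSON object per line,
    each assumed to carry a "codes" list of MaskGIT-VQGAN token ids."""
    def __init__(self, jsonl_path):
        with open(jsonl_path) as f:
            self.records = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # Returning a dict (not a bare list) is what lets batch.keys()
        # work inside the training loop after default collation.
        return {"codes": torch.tensor(self.records[idx]["codes"], dtype=torch.long)}

# loader = DataLoader(PretokenizedJsonl("PATH/pretokenized.jsonl"), batch_size=32)
```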

@cornettoyu (Collaborator)

Hi,

It seems that you are training the tokenizer with the pretokenized jsonl? We have not tried or supported that in the current codebase. You would need to add that support to the tokenizer training yourself if you want to do that.

@localbetascreening (Author)

Thanks so much for the reply! When the pretokenized jsonl is not used, the input to stage 1 training includes the image data; but when the pretokenized jsonl is used, there are no images in the jsonl file, so how can the data in the pretokenized jsonl file be used for training?

@cornettoyu (Collaborator)

For TiTok stage 1 training, the raw image input is first tokenized by the MaskGIT-VQGAN, so the reconstruction target is the proxy codes. If you have a pretokenized jsonl, you can read the codes directly from the jsonl and skip this step, similar to what we did for generation (see the difference between the if/else branches).
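
As a rough illustration of that if/else idea (not the repo's actual code; the "codes" and "image" key names and the encode() call are assumptions), the training step could branch like this:

```python
import torch

def get_proxy_codes(batch, vqgan):
    """Sketch: return MaskGIT-VQGAN proxy codes for a stage-1 training batch,
    either read from the pretokenized jsonl or computed on the fly."""
    if "codes" in batch:
        # Pretokenized path: codes were produced offline and stored in the jsonl.
        return batch["codes"].long()
    # On-the-fly path: run the frozen MaskGIT-VQGAN encoder on the raw images.
    with torch.no_grad():
        return vqgan.encode(batch["image"])  # hypothetical encode() signature
```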

@localbetascreening (Author)

According to my tests, pretokenization is more than 20 times faster than TiTok stage 1. If they are doing the same thing, why train TiTok stage 1 at all? Pretokenization should be the priority, am I right?

@cornettoyu (Collaborator)

Pretokenization can indeed speed up the training. Since random crops are used for tokenizer training, pretokenization may not give exactly the same results as on-the-fly tokenization. We have not tried applying pretokenization to TiTok training, but it should be fine if your dataset is large enough (or if you include those different crops in the pretokenization jsonl as well).
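
One way to follow that suggestion (a sketch only; the vqgan.encode() call and the jsonl field names are assumptions, not the repo's actual pretokenization script) is to dump several random crops per image when building the jsonl:

```python
import json
import torch
from PIL import Image
from torchvision import transforms

crop = transforms.Compose([
    transforms.RandomResizedCrop(256),
    transforms.ToTensor(),
])

def pretokenize(image_paths, vqgan, crops_per_image=4, out_path="pretokenized.jsonl"):
    """Sketch: store proxy codes for several random crops of each image so the
    pretokenized jsonl approximates the augmentation seen across epochs."""
    with open(out_path, "w") as f:
        for path in image_paths:
            img = Image.open(path).convert("RGB")
            for _ in range(crops_per_image):
                x = crop(img).unsqueeze(0)        # one random crop, (1, 3, 256, 256)
                with torch.no_grad():
                    codes = vqgan.encode(x)       # hypothetical encode() signature
                f.write(json.dumps({"file_name": path,
                                    "codes": codes.flatten().tolist()}) + "\n")
```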
