How can I use the pretokenized jsonl file? #61

Open · localbetascreening opened this issue Dec 5, 2024 · 7 comments

@localbetascreening

I have pretokenized the proxy codes to speed up training. How can I use them during training? I cannot see where in the config file to include the pretokenized jsonl file.

@localbetascreening (Author)

Question 2: In each epoch the images are randomly cropped, so the corresponding proxy codes should differ from epoch to epoch. How can a single pretokenized proxy code for an image then cover all epochs?

@localbetascreening (Author)

I added pretokenization: "PATH/pretokenized.jsonl" to the config file and ran training, then got this error:

```
[rank6]: global_step = train_one_epoch(config, logger, accelerator,
[rank6]: File "/home/weixing/titok/utils/train_utils.py", line 353, in train_one_epoch
[rank6]: raise ValueError(f"Not found valid keys: {batch.keys()}")
[rank6]: AttributeError: 'list' object has no attribute 'keys'
```

How can I run this correctly?
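
For context, the AttributeError means the dataloader handed train_one_epoch a plain Python list rather than the dict of tensors it expects, so even the error message itself fails to format. Below is a minimal sketch, not the repo's actual code, of a jsonl-backed dataset that yields dict samples; the "codes" field name is an assumption about the jsonl schema. PyTorch's default collate would then batch these dicts into a dict of tensors.

```python
import json
import torch
from torch.utils.data import Dataset, DataLoader

class PretokenizedJsonl(Dataset):
    """Sketch of a pretokenized-jsonl dataset: one JSON object per line,
    each assumed to carry a "codes" list of MaskGIT-VQGAN token ids."""
    def __init__(self, jsonl_path):
        with open(jsonl_path) as f:
            self.records = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # Returning a dict (not a bare list) is what lets batch.keys()
        # work inside the training loop after default collation.
        return {"codes": torch.tensor(self.records[idx]["codes"], dtype=torch.long)}

# loader = DataLoader(PretokenizedJsonl("PATH/pretokenized.jsonl"), batch_size=32)
```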

@cornettoyu (Collaborator)

Hi,

It seems that you are training the tokenizer with the pretokenized jsonl? We have not tried or supported that in the current codebase. You would need to add that support to the tokenizer training yourself if you want to do that.

@localbetascreening (Author)

Thanks so much for the reply! When the pretokenized jsonl is not used, the input to stage 1 training includes the image data; but when the pretokenized jsonl is used, there are no images in the jsonl file, so how can the data in the pretokenized jsonl file be used for training?

@cornettoyu (Collaborator)

For TiTok stage 1 training, the raw image input is first tokenized by the MaskGIT-VQGAN, so the reconstruction target is the proxy codes. If you have a pretokenized jsonl, you can read the codes directly from the jsonl and skip this step, similar to what we did for generation (see the difference between the if/else branches).
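
As a rough illustration of that if/else idea (not the repo's actual code; the "codes" and "image" key names and the encode() call are assumptions), the training step could branch like this:

```python
import torch

def get_proxy_codes(batch, vqgan):
    """Sketch: return MaskGIT-VQGAN proxy codes for a stage-1 training batch,
    either read from the pretokenized jsonl or computed on the fly."""
    if "codes" in batch:
        # Pretokenized path: codes were produced offline and stored in the jsonl.
        return batch["codes"].long()
    # On-the-fly path: run the frozen MaskGIT-VQGAN encoder on the raw images.
    with torch.no_grad():
        return vqgan.encode(batch["image"])  # hypothetical encode() signature
```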

@localbetascreening (Author)

According to my tests, pretokenization is more than 20 times faster than TiTok stage 1. If they are doing the same thing, why train TiTok stage 1 at all? Pretokenization should be the priority, am I right?

@cornettoyu (Collaborator)

Pretokenization can indeed speed up the training. Since random crops are used for tokenizer training, pretokenization may not give exactly the same results as on-the-fly tokenization. We have not tried applying pretokenization to TiTok training, but it should be fine if your dataset is large enough (or if you include those different crops in the pretokenization jsonl as well).
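
One way to follow that suggestion (a sketch only; the vqgan.encode() call and the jsonl field names are assumptions, not the repo's actual pretokenization script) is to dump several random crops per image when building the jsonl:

```python
import json
import torch
from PIL import Image
from torchvision import transforms

crop = transforms.Compose([
    transforms.RandomResizedCrop(256),
    transforms.ToTensor(),
])

def pretokenize(image_paths, vqgan, crops_per_image=4, out_path="pretokenized.jsonl"):
    """Sketch: store proxy codes for several random crops of each image so the
    pretokenized jsonl approximates the augmentation seen across epochs."""
    with open(out_path, "w") as f:
        for path in image_paths:
            img = Image.open(path).convert("RGB")
            for _ in range(crops_per_image):
                x = crop(img).unsqueeze(0)        # one random crop, (1, 3, 256, 256)
                with torch.no_grad():
                    codes = vqgan.encode(x)       # hypothetical encode() signature
                f.write(json.dumps({"file_name": path,
                                    "codes": codes.flatten().tolist()}) + "\n")
```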
