How can I use the pretokenized jsonl file? #61
I have pretokenized the proxy codes to speed up training. How can I use them during training? I cannot see where in the config file to include the pretokenized jsonl file.
I added pretokenization: "PATH/pretokenized.jsonl" to the config file and ran it, then got this error:
[rank6]: global_step = train_one_epoch(config, logger, accelerator,
How can I run it correctly?
Hi, it seems that you are training the tokenizer with the pretokenized jsonl? We have not tried or supported that in the current codebase. You would need to add that part to the tokenizer training yourself if you want to do it.
Thanks so much for the reply! When the pretokenized jsonl is not used, the input of stage 1 training includes the image data, but when the pretokenized jsonl is used, there are no images in the jsonl file, so how can the data in the pretokenized jsonl file be used for training?
For TiTok stage 1 training, the raw image input is first tokenized by the MaskGIT-VQGAN, so the reconstruction target is the proxy codes. If you have a pretokenized jsonl, you can read the codes directly from the jsonl and thus skip this step, similar to what we did for the generation training (see the differences between the if/else branches).
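A rough sketch of what such a branch could look like inside the training step is below. The names used here (config.dataset.params.pretokenization, the proxy_codes batch field, maskgit_vqgan.encode) are illustrative assumptions, not the repository's actual identifiers:

```python
import torch

def get_proxy_codes(batch, maskgit_vqgan, config):
    """Return the reconstruction targets (proxy codes) for a TiTok stage-1 step.

    Hypothetical sketch: the config key, batch field, and encode() call are
    assumptions for illustration, not the codebase's actual API.
    """
    if config.dataset.params.get("pretokenization", None):
        # Pretokenized path: the jsonl already stores the MaskGIT-VQGAN codes,
        # so the dataloader hands them over directly and we skip encoding.
        proxy_codes = batch["proxy_codes"].long()
    else:
        # On-the-fly path: tokenize the (randomly cropped) images each step.
        with torch.no_grad():
            proxy_codes = maskgit_vqgan.encode(batch["image"])
    return proxy_codes
```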
According to my tests, pretokenization is more than 20 times faster than TiTok stage 1. If they are doing the same thing, why run the tokenization inside TiTok stage 1 at all? Pretokenization should be the priority, am I right?
Question 2: For each epoch, the images are randomly cropped, so the corresponding proxy codes should differ from epoch to epoch. How can a single pretokenized proxy code for an image cover all the epochs?
Pretokenization can indeed speed up the training. However, since random crops are used for tokenizer training, pretokenization will not give exactly the same results as on-the-fly tokenization. We have not tried applying pretokenization to TiTok training, but it should be fine if your dataset is large enough (or if you include those different crops in the pretokenization jsonl as well).
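If you want to bake several crops per image into the jsonl, a minimal sketch could look like the following. The crop size, number of crops per image, the maskgit_vqgan.encode call, and the jsonl field names are all assumptions for illustration, not the repository's actual format:

```python
import json
import torch
from PIL import Image
from torchvision import transforms

# Assumed settings: 256x256 crops, 8 random crops per image.
NUM_CROPS = 8
crop = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(256),
    transforms.ToTensor(),
])

@torch.no_grad()
def pretokenize(image_paths, maskgit_vqgan, out_path="pretokenized.jsonl"):
    """Write one jsonl line per (image, crop) pair containing its proxy codes."""
    maskgit_vqgan.eval()
    with open(out_path, "w") as f:
        for path in image_paths:
            img = Image.open(path).convert("RGB")
            for i in range(NUM_CROPS):
                x = crop(img).unsqueeze(0)        # (1, 3, 256, 256)
                codes = maskgit_vqgan.encode(x)   # assumed to return code indices
                record = {
                    "image_path": path,
                    "crop_index": i,
                    "proxy_codes": codes.flatten().tolist(),
                }
                f.write(json.dumps(record) + "\n")
```

During training, the dataloader would then sample one of the stored crops per image each epoch instead of re-cropping and re-encoding on the fly.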