I trained a MaskGIT (ImageBert) model on the ImageNet-1k dataset to reproduce your work, but after training for 22 epochs (out of 199 in total), I found that the MLM loss curve decreases slowly and the accuracy stays below 0.01. Could you please share some information on how the training loss and accuracy should evolve? If possible, please also describe problems that commonly occur during the MLM task. Looking forward to your reply, thanks.
The training command line is:

```shell
WANDB_MODE=offline accelerate launch --num_processes=8 --main_process_port=10086 --same_network \
    scripts/train_maskgit.py config=configs/training/generator/maskgit_original.yaml \
    experiment.project="titok_generation" \
    experiment.name="titok_l32_maskgit_original" \
    experiment.output_dir="titok_l32_maskgit_original"
```
It seems that you are using a total batch size of 96 * 8 = 768, while our results were obtained with a total batch size of 2048. I would suggest sticking to our original setting if you aim to reproduce the results. As for the loss curve and accuracy, it is reasonable for the loss to decrease slowly and for the accuracy to stay low, mainly due to the high masking ratio in the MaskGIT strategy; it should be fine as long as you can see meaningful images being generated during training.
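For reference, here is the arithmetic behind the batch-size suggestion above, as a minimal sketch. The variable names and the idea of compensating with gradient_accumulation_steps are illustrative assumptions rather than an official recipe from this repo:

```python
# Effective (total) batch size under Accelerate-style data parallelism:
#   total = num_processes * per_gpu_batch_size * gradient_accumulation_steps
num_processes = 8                  # from --num_processes=8
per_gpu_batch_size = 96            # from training.per_gpu_batch_size
gradient_accumulation_steps = 1    # from training.gradient_accumulation_steps

total = num_processes * per_gpu_batch_size * gradient_accumulation_steps
print(total)  # 768 -- smaller than the 2048 used for the reported results

# One hypothetical way to reach 2048 on 8 GPUs without raising per-GPU memory:
# lower the per-GPU batch size and accumulate gradients instead.
per_gpu_batch_size = 64
gradient_accumulation_steps = 4
print(num_processes * per_gpu_batch_size * gradient_accumulation_steps)  # 2048
```

Gradient accumulation is not guaranteed to match the optimization dynamics of a true 2048-sample batch exactly, but it keeps the effective batch size consistent with the reported setting.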
The config file is as follows:
```yaml
losses:
  label_smoothing: 0.1
  loss_weight_unmasked_token: 0.1

dataset:
  params:
    train_shards_path_or_url: "xxx/imagenet-wds/imagenet-train-{000000..000320}.tar" # "imagenet_sharded/train/imagenet-train-{0000..0252}.tar"
    eval_shards_path_or_url: "xxx/imagenet-val-{000000..000049}.tar"
    num_workers_per_gpu: 12
  preprocessing:
    resize_shorter_edge: 256
    crop_size: 256
    random_crop: False
    random_flip: True

optimizer:
  name: adamw
  params:
    learning_rate: 2e-4
    beta1: 0.9
    beta2: 0.96
    weight_decay: 0.03

lr_scheduler:
  scheduler: "cosine"
  params:
    learning_rate: ${optimizer.params.learning_rate}
    warmup_steps: 10_000
    end_lr: 1e-5

training:
  gradient_accumulation_steps: 1
  per_gpu_batch_size: 96 # 32 GPU, total batch size 2048
  mixed_precision: "bf16"
  enable_tf32: True
  enable_wandb: True
  use_ema: True
  seed: 42
  max_train_steps: 500_000
  max_grad_norm: 1.0
```
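As a sanity check on the lr_scheduler block above, the following is a minimal sketch of how a linear-warmup-plus-cosine-decay schedule with these parameters typically behaves; the function is an illustrative assumption, not the scheduler implementation used in this repo:

```python
import math

def lr_at_step(step, peak_lr=2e-4, end_lr=1e-5, warmup_steps=10_000, max_steps=500_000):
    """Linear warmup to peak_lr, then cosine decay down to end_lr (illustrative only)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 10_000, 250_000, 500_000):
    print(s, f"{lr_at_step(s):.2e}")
# prints 0.00e+00 at step 0, 2.00e-04 at the end of warmup,
# ~1.1e-04 around mid-training, and 1.00e-05 at the final step
```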