Hello, I recently ran into a strange (at least to me) problem when training YOLOv9. Here is what happened:
First, I trained YOLOv9 from scratch on a large general dataset, "big detection", which has about 3 million images across 600 categories. That gave me a pretrained model.
Then I continued training the pretrained model on a specific dataset of about 100k images covering about 80 categories. I split it into train and test sets (8:2) and used an initial learning rate of lr0=0.0004, with a batch size of 8 per device on 32 devices.
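To make the setup concrete, here is a small sketch of the arithmetic implied by those numbers. It assumes the global batch is simply the per-device batch times the device count and that the last partial batch is kept; the per-sample learning rate is only a rough way to compare settings, not something YOLOv9 reports.

```python
import math

# Values taken from the description above.
total_images = 100_000      # approximate dataset size
batch_per_device = 8
num_devices = 32
lr0 = 0.0004

# 8:2 split, computed with integers to avoid float rounding.
train_images = total_images * 8 // 10

# Assumption: global batch = per-device batch * number of devices.
global_batch = batch_per_device * num_devices

# Assumption: the final partial batch is kept, hence the ceiling.
iters_per_epoch = math.ceil(train_images / global_batch)

# Rough per-sample learning rate, useful only for comparing configurations.
lr_per_sample = lr0 / global_batch

print(f"train images:    {train_images}")
print(f"global batch:    {global_batch}")
print(f"iters per epoch: {iters_per_epoch}")
print(f"lr per sample:   {lr_per_sample:.3e}")
```

So the fine-tuning runs with a global batch of 256 and roughly 313 optimizer steps per epoch.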
The problem is that after around 100 epochs of continued training, the model fails to detect some categories entirely: those categories have zero precision, zero recall, and so on. Visualization shows that the model predicts nothing in those areas, even with a confidence threshold of 0.001.
You might say that there must be something wrong with my dataset. However, the strangest thing is that the failed categories differ from one training run to the next. For example, in the first experiment the model fails to detect cat and dog, while in the second experiment, with exactly the same settings, it fails to detect pig and horse but succeeds in detecting cat and dog. I ran about 10 experiments and did not find any two that share the same failed categories. I am not saying that my dataset is perfect, but if it had obvious mistakes, such as wrong labels, shouldn't every training run share the same failed cases?
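The comparison across runs can be sketched as follows. This is only an illustration: the `failed_classes` helper and the per-class numbers are hypothetical, made up to mirror the cat/dog and pig/horse example, and are not actual results from my experiments.

```python
from collections import Counter


def failed_classes(per_class_metrics, eps=1e-9):
    """Return the class names whose precision and recall are both (near) zero."""
    return {
        name
        for name, (precision, recall) in per_class_metrics.items()
        if precision <= eps and recall <= eps
    }


# Hypothetical per-class (precision, recall) values for two runs.
runs = {
    "run_1": {"cat": (0.0, 0.0), "dog": (0.0, 0.0),
              "pig": (0.81, 0.74), "horse": (0.85, 0.79)},
    "run_2": {"cat": (0.83, 0.77), "dog": (0.80, 0.72),
              "pig": (0.0, 0.0), "horse": (0.0, 0.0)},
}

failed_by_run = {run: failed_classes(metrics) for run, metrics in runs.items()}

# Classes that fail in every run would point to a dataset problem.
always_failed = set.intersection(*failed_by_run.values())

# How many runs each class failed in.
failure_counts = Counter(c for failed in failed_by_run.values() for c in failed)

for run, failed in failed_by_run.items():
    print(run, sorted(failed))
print("failed in every run:", sorted(always_failed))
print("failure counts:", dict(failure_counts))
```

An empty intersection, as in this made-up example, is what I observe: no category fails in every run.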
I suspect that the initial lr0 is still too large given the small batch size relative to the dataset, and that the model converges to a local minimum. Does anyone have other ideas? It would be really helpful to hear from someone who has run into something like this before.