NaN in loss while training the log model #13
Hi @kishore-greddy Hope this helps ;)
Hey @mattpoggi , Thanks for the quick reply. I will try this out.
Hi @mattpoggi , Forgot to ask,
I made some experiments by bounding the uncertainty in 0-1 with a sigmoid layer and adding the log term in the loss function, as you mentioned. The same strategy is used in the D3VO paper (https://vision.in.tum.de/research/vslam/d3vo).
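For reference, here is a minimal PyTorch sketch of that strategy (tensor names such as `photometric_residual` and `uncert_logits` are assumptions for illustration, not the repository's actual variables):

```python
import torch

# Minimal sketch of the strategy described above (assumed names, not repo code):
# bound the uncertainty to (0, 1) with a sigmoid and add a log term to the loss.
def bounded_uncertainty_loss(photometric_residual, uncert_logits, eps=1e-3):
    sigma = torch.sigmoid(uncert_logits).clamp(min=eps)          # sigma in (eps, 1)
    per_pixel = photometric_residual / sigma + torch.log(sigma)  # log term penalises large sigma
    return per_pixel.mean()
```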
Hey @mattpoggi , I tried to model the log-uncertainty as you suggested, without binding the uncertainty to any range, and I am running into an exploding gradients problem. I have updated my loss function to be the one below. After some iterations, in the first epoch itself, I run into issues; please have a look at the image below and notice the loss just before things break. Did you ever have to deal with something like this? Any hint is appreciated, thanks.

EDIT: I managed to set a breakpoint just before the gradients exploded and added a new image which shows the minimum value of the output uncertainties (in fact log-uncertainties) for all images in the batch. As you can see, the minimum value coming out of the output channel is -33.99; exp(-33.99) is on the order of 10^-15, and having this in the denominator causes the loss value to blow up. I tried to find out why this is happening and I am not quite sure. Any guidance is highly appreciated. Thanks
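One possible guard against this blow-up (a sketch under the assumption that the extra channel is read as s = log(sigma), not the author's confirmed fix) is to clamp the predicted log-uncertainty before it enters the loss and to write the division as a multiplication by exp(-s):

```python
import torch

# Sketch of a guard against the blow-up described above (assumed names):
# clamping s = log(sigma) bounds sigma away from zero, and residual * exp(-s)
# computes residual / sigma without explicitly forming a tiny denominator.
def stable_log_uncertainty_loss(residual, log_sigma, s_min=-6.0, s_max=6.0):
    s = torch.clamp(log_sigma, min=s_min, max=s_max)
    per_pixel = residual * torch.exp(-s) + s
    return per_pixel.mean()
```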
That's quite weird, I actually never had a problem with gradients...
Hi @mattpoggi , |
You did upsample the uncertainty to the proper resolution scale, right?
Everything looks good. I'll try to take a look at it next week.
Thanks :) I'll be waiting for your inputs.
I launched a single training run and it ended without issues. I'll try a few more times.
Okay, let me know how it goes.
Hi, wonderful work, and thanks for sharing the code. I'm working on training the model with the log loss to estimate uncertainty, but I'm facing the exploding gradient issue. Have you fixed the exploding gradient issue with log_loss? Thanks!
Hi, sorry for the late reply.
@kishore-greddy @IemProg one of the reasons might be the batch size you're using. I had a similar experience in another framework where training becomes unstable if you use a small batch size (like 1 or 2). If you use a different batch size than the one used in the paper, that might be the issue. @mattpoggi could you please confirm this by setting the training batch size to 1 and seeing whether you experience exploding/vanishing gradients?
Hey @mattpoggi ,
I was trying to train the log model. I made the necessary changes to the decoder to include the additional channel. When I start training, the initial loss is NaN, and after some batches it is NaN again. I was debugging the issue and stumbled upon this piece of code from your decoder.py.
In line 81, sigmoid is used as in the original code from monodepth2, but I do not see sigmoid being used for the uncerts in line 85. Is there any reason for this?
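For context, here is an illustrative sketch of the two output heads in question (layer names and shapes are assumptions, not the repository's actual lines 81 and 85):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the two heads discussed above (assumed names/shapes):
# the disparity head is squashed with a sigmoid as in monodepth2, while the
# uncertainty head is left raw, which is consistent with reading its output as
# log(sigma) later in the loss.
class DispAndUncertHead(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.disp_conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.uncert_conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)

    def forward(self, x):
        disp = torch.sigmoid(self.disp_conv(x))  # bounded to (0, 1)
        uncert = self.uncert_conv(x)             # unbounded, read as log-uncertainty
        return disp, uncert
```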
I train on the GPU, but for debugging I use the CPU. While debugging on my CPU with batch_size 2 (anything larger causes memory issues), I used breakpoints to inspect the values of uncert.
As seen in the image, the minimum value is negative, and the log of a negative number is NaN. This made me ask the first question: why are the uncerts not clamped between 0 (possibly a tiny bit greater, to avoid inf when the log is taken in the loss function) and 1? Is my understanding right, or have I misunderstood something?
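As a quick sanity check of that reasoning (illustrative values only):

```python
import torch

# log of a negative value is NaN and log(0) is -inf; clamping to a small
# positive floor keeps every entry finite.
u = torch.tensor([-0.2, 0.0, 0.5])
print(torch.log(u))                  # tensor([    nan,    -inf, -0.6931])
print(torch.log(u.clamp(min=1e-3)))  # tensor([-6.9078, -6.9078, -0.6931])
```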
My loss function is
Is there a problem with this? Do you also use "to_optimise", which is the min of the reprojection losses and the identity losses, or just the original reprojection losses?
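For reference, a sketch of how the per-pixel minimum ("to_optimise") could be combined with a log-uncertainty term; the combination below is an assumption for illustration, not the repository's confirmed loss:

```python
import torch

# Illustrative sketch (assumed combination, not the repo's confirmed loss):
# take monodepth2's per-pixel minimum over identity and reprojection losses,
# then weight it with the predicted log-uncertainty.
def min_reprojection_uncertainty_loss(reprojection_losses, identity_losses, log_sigma):
    combined = torch.cat((identity_losses, reprojection_losses), dim=1)
    to_optimise, _ = torch.min(combined, dim=1, keepdim=True)  # per-pixel minimum
    per_pixel = to_optimise * torch.exp(-log_sigma) + log_sigma
    return per_pixel.mean()
```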
EDIT: After reading quite a lot, I feel that my log loss is wrong. Maybe the uncertainties coming out of the output channel are already log(uncertainties), so I would have to correct my loss function to the one below?
EDIT 2: Would the above edit hold for the self-teaching loss too, meaning the uncertainty outputs are actually log(uncertainties), so I have to take torch.exp() in the loss?
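If that interpretation is right, the self-teaching variant would look along these lines (a sketch under the same log-uncertainty assumption; tensor names are illustrative):

```python
import torch

# Illustrative sketch under the log-uncertainty interpretation: the student
# regresses the detached teacher depth, and the predicted channel is read as
# log(sigma), so exp() recovers the uncertainty inside the loss.
def self_teaching_loss(student_depth, teacher_depth, log_sigma):
    residual = torch.abs(student_depth - teacher_depth.detach())
    per_pixel = residual * torch.exp(-log_sigma) + log_sigma  # |d_s - d_t| / sigma + log(sigma)
    return per_pixel.mean()
```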
Thanks in advance