NaN in loss while training the log model #13

Open

kishore-greddy opened this issue Jan 14, 2021 · 16 comments

kishore-greddy commented Jan 14, 2021

Hey @mattpoggi ,

I was trying to train the log model. I made the necessary changes to the decoder to include the additional channel. When I start training, the initial loss is NaN, and after some batches it is NaN again. While debugging the issue I stumbled upon this piece of code from your decoder.py:

[image: snippet from decoder.py showing the disp and uncert output convolutions]

  1. In line 81, a sigmoid is used as in the original monodepth2 code, but I do not see a sigmoid being applied to the uncerts in line 85. Is there any reason for this?

  2. I train on the GPU, but for debugging I use the CPU. While debugging on my CPU with batch_size 2 (any larger size causes memory issues), I used breakpoints to inspect the values of uncert.

[image: debugger view of the uncert tensor showing a negative minimum value]
As seen in the image, the minimum value is negative, and the log of a negative
number is NaN. This made me ask the first question: why are the uncerts not
clamped between 0 (possibly a tiny bit greater, to avoid inf when the log is
taken in the loss function) and 1?
Is my understanding right, or have I misunderstood something?

  3. My loss function is
    [image: my photometric loss with the uncertainty term]

    Is there a problem with this? Do you also use "to_optimise", which is the
    min of the reprojection losses and the identity losses, or just the original
    reprojection losses?

EDIT: After reading quite a lot, I think my log loss is wrong. Maybe the uncertainties coming out of the output channel are already \log(uncertainties), so I would have to correct my loss function to the one below?
[image: corrected loss function using the log-uncertainty parameterization]

EDIT 2: Would the above edit hold for the self-teaching loss too, meaning the uncertainty outputs are actually \log(uncertainties), so I have to take torch.exp() in the loss?
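
In code, what I have in mind is something like the sketch below (names such as to_optimise and log_uncert are just placeholders of mine, not code from this repo):

import torch

def photometric_log_uncert_loss(to_optimise, log_uncert):
    # to_optimise: per-pixel minimum reprojection loss (monodepth2 style)
    # log_uncert:  raw decoder output, interpreted as the log of the uncertainty
    # the residual is scaled by exp(-log_uncert); the log term acts as a regularizer
    return (to_optimise * torch.exp(-log_uncert) + log_uncert).mean()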

Thanks in advance

@mattpoggi
Owner

Hi @kishore-greddy
1-2. The uncertainty is usually unbounded, but this might lead your network to instability. As you noticed in your "EDIT", if you model the log-uncertainty you should fix the problem.
3. This seems correct: you divide the min term by the uncertainty.
EDIT 2: yes, the trick works for the self-teaching loss as well (but there you do not have the minimum among multiple reprojection losses, only a single L1 loss w.r.t. the proxy labels of the teacher).
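
Roughly something like this (just a sketch with illustrative names, not the exact code):

import torch

def self_teaching_loss(disp_student, disp_teacher, log_uncert):
    # single L1 term w.r.t. the teacher's proxy labels, no per-pixel minimum
    l1 = torch.abs(disp_student - disp_teacher)
    return (l1 * torch.exp(-log_uncert) + log_uncert).mean()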

Hope this helps ;)

@kishore-greddy
Author

Hey @mattpoggi ,

Thanks for the quick reply. I will try this out.

Author

kishore-greddy commented Jan 15, 2021

Hi @mattpoggi ,

Forgot to ask:
Have you also tried the other method? Meaning, keeping the uncertainty values greater than 0 in the decoder and actually modelling the uncertainty itself instead of log(uncertainty), in which case my loss function in 3) works.
I read about negative log-likelihood minimization, and a lot of people talk about taking the \log in the loss rather than modelling the log-uncertainty itself.

[image: excerpt from the Deep Ensembles paper describing their Gaussian NLL criterion]
Quoted from Lakshminarayanan et al., "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles", one of the papers referenced in your research. There they require the variance to be greater than 0.
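
If I understand it correctly, that corresponds to something like the sketch below (my own paraphrase of their formulation, not their code):

import torch
import torch.nn.functional as F

def gaussian_nll(mean, raw_var, target):
    # enforce sigma^2 > 0 as described in the paper: softplus plus a small constant
    var = F.softplus(raw_var) + 1e-6
    return (0.5 * torch.log(var) + 0.5 * (target - mean) ** 2 / var).mean()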
Could you please clarify? Thanks

@mattpoggi
Owner

I ran some experiments bounding the uncertainty in 0-1 with a sigmoid layer and adding the log term in the loss function, as you mentioned. The same strategy is used in the D3VO paper (https://vision.in.tum.de/research/vslam/d3vo).
The numbers were almost identical in the two formulations. I believe the important thing is just to avoid exploding gradients and unstable behaviors.
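
In rough terms, that bounded variant looks like this (a sketch with illustrative names, not the exact code I used):

import torch

def bounded_uncert_loss(to_optimise, uncert_logits, eps=1e-3):
    # bound the uncertainty in (0, 1) with a sigmoid; eps keeps the log and the
    # division away from zero
    uncert = torch.sigmoid(uncert_logits) + eps
    return (to_optimise / uncert + torch.log(uncert)).mean()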

Author

kishore-greddy commented Jan 17, 2021

Hey @mattpoggi ,

I tried to model the log-uncertainty as you suggested, without bounding the uncertainty to any range, and I ran into an exploding gradients problem. I updated my loss function to the one below:

[image: updated loss function using exp of the log-uncertainty]

After some iterations, in the first epoch itself, I run into problems; please have a look at the image below:

[image: training log showing the loss blowing up after some iterations]

Notice the loss just before I run into problems. Did you ever have to deal with something like this? Any hint is appreciated. Thanks.

EDIT: I managed to set a breakpoint just before the gradients exploded. I added a new image which shows the minimum value of the output uncertainties (in fact log-uncertainties) for all images in the batch. As you can see, the minimum value coming out of the output channel is -33.99; exp(-33.99) is on the order of 10^-15, and having this in the denominator causes the loss value to blow up. I tried to find reasons why this is happening and I am not quite sure. Any guidance is highly appreciated. Thanks.
[image: debugger view showing a minimum predicted log-uncertainty of -33.99 in the batch]
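
For reference, the kind of clamp I could add to keep exp(-log_uncert) bounded would be something like this (just an idea on my side, not code from the repo):

import torch

def clamped_log_uncert_loss(to_optimise, raw_log_uncert):
    # clamp the predicted log-uncertainty so that exp(-log_uncert) cannot blow up
    log_uncert = torch.clamp(raw_log_uncert, min=-6.0, max=6.0)
    return (to_optimise * torch.exp(-log_uncert) + log_uncert).mean()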

@mattpoggi
Owner

That's quite weird, I actually never had problems with the gradients...
Does this occur in every training run? Does it occur even if you use the sigmoid trick?
Anyway, before the gradients explode, the loss numbers are very similar to the ones I saw during my experiments.

Author

kishore-greddy commented Jan 18, 2021

Hi @mattpoggi ,
I observed that this occurs in almost every training of the log model. I have tried it 3 times now, and every time I have this problem. Sometimes it occurs at the 5th epoch, sometimes in the 1st epoch itself, so it is not consistent. However, as I showed, in the log-uncertainty values just before the gradients start to explode the minimum value is -33; the network is predicting this value at some pixel. I am not sure why this problem is so random, and I am even more surprised that you did not face any issues like this. My decoder is almost the same as yours and I have also posted my loss function. Do you see an issue there? It is the only thing that is different. I have not used the sigmoid trick yet; I wanted to train the model the same way you did.

@mattpoggi
Owner

You upsampled the uncertainty to the proper resolution scale, right?
I can dig more into this after the CVPR rebuttal this week... Just a few questions: 1) are you training on KITTI? 2) are you using M, S, or MS?

Author

kishore-greddy commented Jan 19, 2021

Do you mean scaling the uncertainty to full resolution before calculating the loss? Yes, I have done that.
[image: loss code showing the uncertainty being upsampled to the full input resolution]
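
In essence that part does something like this (illustrative sketch, mirroring how monodepth2 upsamples the disparity before the loss):

import torch.nn.functional as F

def upsample_to_full_res(uncert, height, width):
    # bilinear upsampling of the scale-s prediction to the full input resolution
    return F.interpolate(uncert, [height, width],
                         mode="bilinear", align_corners=False)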

If you mean upsampling of the uncertainties in the decoder, yes, I have done that too:

# imports as in the monodepth2 code base
from collections import OrderedDict

import numpy as np
import torch
import torch.nn as nn

from layers import ConvBlock, Conv3x3, upsample

class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True, use_uncert=False):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales
        self.use_uncert = use_uncert
        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        self.convs = OrderedDict()
        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

        for s in self.scales:
            self.convs[("dispconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)
            if self.use_uncert:
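                # extra 3x3 conv head predicting an uncertainty map at each scale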
                self.convs[("uncertconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            x = self.convs[("upconv", i, 0)](x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)

            x = self.convs[("upconv", i, 1)](x)

            if i in self.scales:
                self.outputs[("disp", i)] = self.sigmoid(self.convs[("dispconv", i)](x))
                if self.use_uncert:
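                    # raw (unbounded) conv output for the uncertainty; note there is no sigmoid here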
                    self.outputs[("uncert", i)] = self.convs[("uncertconv", i)](x)
        return self.outputs
  1. I am training on the eigen zhou split of the KITTI dataset (monodepth2 default).
  2. I am training the M model.

@mattpoggi
Owner

Everything looks good. I'll try to take a look at it next week.

@kishore-greddy
Author

Thanks :) I'll be waiting for your inputs.

@mattpoggi
Owner

I launched a single training run and it ended without issues. I'll try a few more times.

@kishore-greddy
Author

Okay... Let me know how it goes.


IemProg commented Jun 2, 2021

Hi,

Wonderful work, and thanks for sharing the code. I'm working on training the model with the log loss to estimate uncertainty, but I'm facing the exploding gradient issue.

Have you fixed the exploding gradient issue with the log loss?

Thanks !

@mattpoggi
Owner

Hi, sorry for the late reply.
Are you trying to estimate the log-uncertainty as we mentioned in the previous comments?
Among them, we also mentioned using a sigmoid in place of modeling the log-uncertainty (#13 (comment)). I used this in some follow-up works and it seems extremely stable, while giving equivalent results.

@Abdulaaty

@kishore-greddy @IemProg one of the reasons might be the batch size you're using. I had a similar experience in another framework, where training becomes unstable if you use a small batch size (like 1 or 2). If you use a batch size different from the one used in the paper, that might be the issue.

@mattpoggi could you please confirm this by trying to set the training batch size to 1 and see if you experience exploding/vanishing gradients?
