NaN in loss while training the log model #13
Hi @kishore-greddy Hope this helps ;)
Hey @mattpoggi , Thanks for the quick reply. I will try this out.
Hi @mattpoggi , Forgot to ask,
I made some experiments by bounding the uncertainty in 0-1 with a sigmoid layer and adding the log term in the loss function, as you mentioned. The same strategy is used in the D3VO paper (https://vision.in.tum.de/research/vslam/d3vo).
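For reference, here is a minimal PyTorch sketch of that strategy (tensor names such as `photometric_residual` and `uncert_logits` are assumptions for illustration, not the repository's actual variables):

```python
import torch

# Minimal sketch of the strategy described above (assumed names, not repo code):
# bound the uncertainty to (0, 1) with a sigmoid and add a log term to the loss.
def bounded_uncertainty_loss(photometric_residual, uncert_logits, eps=1e-3):
    sigma = torch.sigmoid(uncert_logits).clamp(min=eps)          # sigma in (eps, 1)
    per_pixel = photometric_residual / sigma + torch.log(sigma)  # log term penalises large sigma
    return per_pixel.mean()
```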
Hey @mattpoggi , I tried to model the log-uncertainty as you suggested, without binding the uncertainty to any range, and I am running into an exploding gradients problem. I have updated my loss function to be the one below. After some iterations, in the first epoch itself, I run into issues; please have a look at the image below and notice the loss just before things break. Did you ever have to deal with something like this? Any hint is appreciated, thanks.

EDIT: I managed to set a breakpoint just before the gradients exploded and added a new image which shows the minimum value of the output uncertainties (in fact log-uncertainties) for all images in the batch. As you can see, the minimum value coming out of the output channel is -33.99; exp(-33.99) is on the order of 10^-15, and having this in the denominator causes the loss value to blow up. I tried to find out why this is happening and I am not quite sure. Any guidance is highly appreciated. Thanks
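One possible guard against this blow-up (a sketch under the assumption that the extra channel is read as s = log(sigma), not the author's confirmed fix) is to clamp the predicted log-uncertainty before it enters the loss and to write the division as a multiplication by exp(-s):

```python
import torch

# Sketch of a guard against the blow-up described above (assumed names):
# clamping s = log(sigma) bounds sigma away from zero, and residual * exp(-s)
# computes residual / sigma without explicitly forming a tiny denominator.
def stable_log_uncertainty_loss(residual, log_sigma, s_min=-6.0, s_max=6.0):
    s = torch.clamp(log_sigma, min=s_min, max=s_max)
    per_pixel = residual * torch.exp(-s) + s
    return per_pixel.mean()
```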
That's quite weird, I actually never had a problem with gradients...
Hi @mattpoggi , |
You did upsample the uncertainty to the proper resolution scale, right?
Everything looks good. I'll try to take a look at it next week.
Thanks :) I'll be waiting for your inputs.
I launched a single training run and it ended without issues. I'll try a few more times.
Okay, let me know how it goes.
Hi, wonderful work, and thanks for sharing the code. I'm working on training the model with the log loss to estimate uncertainty, but I'm facing the exploding gradient issue. Have you fixed the exploding gradient issue with log_loss? Thanks!
Hi, sorry for the late reply.
@kishore-greddy @IemProg one of the reasons might be the batch size you're using. I had a similar experience in another framework where training becomes unstable if you use a small batch size (like 1 or 2). If you use a different batch size than the one used in the paper, that might be the issue. @mattpoggi could you please confirm this by setting the training batch size to 1 and seeing whether you experience exploding/vanishing gradients?
Hey @mattpoggi ,
I was trying to train the log model. I made the necessary changes to the decoder to include the additional channel. When I start training, the initial loss is NaN, and after some batches it is NaN again. I was debugging the issue and stumbled upon this piece of code from your decoder.py.
In line 81, sigmoid is used as in the original code from monodepth2, but I do not see sigmoid being used for the uncerts in line 85. Is there any reason for this?
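For context, here is an illustrative sketch of the two output heads in question (layer names and shapes are assumptions, not the repository's actual lines 81 and 85):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the two heads discussed above (assumed names/shapes):
# the disparity head is squashed with a sigmoid as in monodepth2, while the
# uncertainty head is left raw, which is consistent with reading its output as
# log(sigma) later in the loss.
class DispAndUncertHead(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.disp_conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)
        self.uncert_conv = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)

    def forward(self, x):
        disp = torch.sigmoid(self.disp_conv(x))  # bounded to (0, 1)
        uncert = self.uncert_conv(x)             # unbounded, read as log-uncertainty
        return disp, uncert
```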
I train on the GPU, but for debugging I use the CPU. While debugging on my CPU with batch_size 2 (anything larger causes memory issues), I used breakpoints to inspect the values of uncert.
As seen in the image, the minimum value is negative, and the log of a negative number is NaN. This made me ask the first question: why are the uncerts not clamped between 0 (possibly a tiny bit greater, to avoid inf when the log is taken in the loss function) and 1? Is my understanding right, or have I misunderstood something?
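As a quick sanity check of that reasoning (illustrative values only):

```python
import torch

# log of a negative value is NaN and log(0) is -inf; clamping to a small
# positive floor keeps every entry finite.
u = torch.tensor([-0.2, 0.0, 0.5])
print(torch.log(u))                  # tensor([    nan,    -inf, -0.6931])
print(torch.log(u.clamp(min=1e-3)))  # tensor([-6.9078, -6.9078, -0.6931])
```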
My loss function is
Is there a problem with this? Do you also use "to_optimise", which is the min of the reprojection losses and the identity losses, or just the original reprojection losses?
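For reference, a sketch of how the per-pixel minimum ("to_optimise") could be combined with a log-uncertainty term; the combination below is an assumption for illustration, not the repository's confirmed loss:

```python
import torch

# Illustrative sketch (assumed combination, not the repo's confirmed loss):
# take monodepth2's per-pixel minimum over identity and reprojection losses,
# then weight it with the predicted log-uncertainty.
def min_reprojection_uncertainty_loss(reprojection_losses, identity_losses, log_sigma):
    combined = torch.cat((identity_losses, reprojection_losses), dim=1)
    to_optimise, _ = torch.min(combined, dim=1, keepdim=True)  # per-pixel minimum
    per_pixel = to_optimise * torch.exp(-log_sigma) + log_sigma
    return per_pixel.mean()
```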
EDIT: After reading quite a lot, I feel that my log loss is wrong. Maybe the uncertainties coming out of the output channel are already log(uncertainties), so I would have to correct my loss function to the one below?
EDIT 2: Would the above edit hold for the self-teaching loss too, meaning the uncertainty outputs are actually log(uncertainties), so I have to take torch.exp() in the loss?
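If that interpretation is right, the self-teaching variant would look along these lines (a sketch under the same log-uncertainty assumption; tensor names are illustrative):

```python
import torch

# Illustrative sketch under the log-uncertainty interpretation: the student
# regresses the detached teacher depth, and the predicted channel is read as
# log(sigma), so exp() recovers the uncertainty inside the loss.
def self_teaching_loss(student_depth, teacher_depth, log_sigma):
    residual = torch.abs(student_depth - teacher_depth.detach())
    per_pixel = residual * torch.exp(-log_sigma) + log_sigma  # |d_s - d_t| / sigma + log(sigma)
    return per_pixel.mean()
```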
Thanks in advance