Getting NaN in the Softmax Layer

Hi,

I am trying to train an existing neural network from a published paper on a custom dataset.

However, while training it I am getting NaN as my predictions even before completing the first batch of training (batch size = 32).

I tried to google the error, came across multiple posts from this forum, and tried a few things:

  1. Reducing the learning rate (default was 0.001, reduced it to 0.0001)
  2. Reducing batch size from 32 to 10
  3. Using torch.autograd.detect_anomaly() (a minimal usage sketch follows this list)
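
A minimal sketch of how anomaly detection can be enabled around the training step looks roughly like this (classifier, criterion, optimizer, and dataloader are placeholder names, not the exact code from my script):

    import torch

    # Enable anomaly detection globally; the backward pass will then report
    # the forward operation that produced the NaN (this slows training down,
    # so it is only meant for debugging).
    torch.autograd.set_detect_anomaly(True)

    for points, target in dataloader:                # placeholder dataloader
        optimizer.zero_grad()
        pred, trans, trans_feat = classifier(points)
        loss = criterion(pred, target)               # placeholder loss function
        loss.backward()                              # fails here once a NaN appears
        optimizer.step()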

By using torch.autograd.detect_anomaly() I got the following error:

Warning: Traceback of forward call that caused the error:
  File "train_classification.py", line 126, in <module>
    pred, trans, trans_feat = classifier(points)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/mydrive/My Drive/Projects/pointnet/pointnet.pytorch/pointnet/model.py", line 147, in forward
    return F.log_softmax(x, dim=1), trans, trans_feat
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1317, in log_softmax
    ret = input.log_softmax(dim)
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Traceback (most recent call last):
  File "train_classification.py", line 133, in <module>
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

If I am interpreting this correctly, the error seems to come from the softmax layer, i.e. F.log_softmax().

Now I am stuck and do not know how to proceed. Can someone help me with this?

Could you check the min and max values of x before feeding it to F.log_softmax?
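
For example, something along these lines just before the F.log_softmax call (a sketch only; x is whatever tensor you pass to F.log_softmax):

    # Inspect the logits right before log_softmax; a NaN/Inf here means the
    # problem originates earlier in the network or in the input itself.
    print("x min:", x.min().item(), "x max:", x.max().item())
    print("x has NaN:", torch.isnan(x).any().item())
    print("x has Inf:", torch.isinf(x).any().item())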

The forward() function of the model is as follows:

    def forward(self, x):
        x, trans, trans_feat = self.feat(x)
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.dropout(self.fc2(x))))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1), trans, trans_feat

I printed out the max and min of the fc3 layer's weights and gradients (labelled f3 in the dump below).
Here is the output dump from the last few iterations:

[0: 40/47] test loss: 3.190464 accuracy: 0.250000
f3 Weights Min: -0.06608034670352936
f3 Weights Max: 0.06457919627428055
f3 Gradients Min: -0.18587647378444672
f3 Gradients Max: 0.06978222727775574
[0: 41/47] train loss: 2.302364 accuracy: 0.593750
f3 Weights Min: -0.066163070499897
f3 Weights Max: 0.06460865586996078
f3 Gradients Min: -0.19260874390602112
f3 Gradients Max: 0.060340650379657745
[0: 42/47] train loss: 2.837160 accuracy: 0.281250
f3 Weights Min: -0.0662456750869751
f3 Weights Max: 0.06465635448694229
f3 Gradients Min: -0.23064641654491425
f3 Gradients Max: 0.03692895919084549
[0: 43/47] train loss: 2.571262 accuracy: 0.406250
f3 Weights Min: -0.06632933020591736
f3 Weights Max: 0.06470420211553574
f3 Gradients Min: -0.2090958207845688
f3 Gradients Max: 0.071017786860466
[0: 44/47] train loss: 2.492180 accuracy: 0.375000
f3 Weights Min: -0.06641349196434021
f3 Weights Max: 0.06475285440683365
Warning: Traceback of forward call that caused the error:
  File "train_classification.py", line 126, in <module>
    pred, trans, trans_feat = classifier(points)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/mydrive/My Drive/Projects/pointnet/pointnet.pytorch/pointnet/model.py", line 147, in forward
    return F.log_softmax(x, dim=1), trans, trans_feat
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1317, in log_softmax
    ret = input.log_softmax(dim)
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Traceback (most recent call last):
  File "train_classification.py", line 133, in <module>
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

I don’t see the gradients exploding, but I think they might be vanishing. Am I right?
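
For reference, per-parameter statistics like the ones above can be collected after loss.backward() with something like this (a sketch; classifier stands for the model instance):

    # After loss.backward(), dump per-parameter gradient statistics to spot
    # vanishing (near-zero) or exploding (huge / NaN) gradients.
    for name, param in classifier.named_parameters():
        if param.grad is not None:
            g = param.grad
            print(f"{name}: grad min {g.min().item():.3e}, "
                  f"max {g.max().item():.3e}, norm {g.norm().item():.3e}")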

I have had the same problem for two days and I cannot find the cause. I even removed the log_softmax from my code and still get the same error…
Please tell me if you find a solution, and I will do the same if I do.

Sure, I will; I am actively pursuing this. I have been racking my brain over this for two days, and I will update this post as soon as I get help from any forum.

In the meantime, did you try to print out the weights as suggested by @ptrblck?

I did, and they seemed pretty normal to me. When I remove detect_anomaly(), it starts working for a second or so and then the NaN values appear out of nowhere. It was actually the first thing I did, even before using detect_anomaly().

Is this only visible on the CPU, on the GPU, or on both devices?
Could you try to update numpy as suggested here?

CC @dhirajsuvarna

In my case, I am using the CPU; I can’t try the GPU currently.

Does updating numpy work for you?

Sadly, it is already the latest version.

Could you try to store the tensors which create the NaN output, so that we can debug further?
Alternatively, could you post the shapes and stats of these tensors, so that we can run the model with random values and try to reproduce the failure?
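
For instance, a guard like this in the training loop could dump the offending batch to disk (a sketch only; the variable and file names are just suggestions):

    pred, trans, trans_feat = classifier(points)
    if torch.isnan(pred).any():
        # Save the exact batch and model state that produced the NaN so the
        # failure can be reproduced and inspected in isolation.
        torch.save({"points": points, "target": target,
                    "state_dict": classifier.state_dict()}, "nan_batch.pt")
        raise RuntimeError("NaN in predictions; batch saved to nan_batch.pt")

The saved file can later be reloaded with torch.load("nan_batch.pt") to inspect the inputs and weights offline.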

    import torch.nn as nn

    def convNxN(in_p, out_p, N):
        return nn.Conv2d(in_p, out_p, kernel_size=(N, N), stride=(1, 1), padding=(1, 1), bias=False)

    class Modelname(nn.Module):
        def __init__(self):
            super().__init__()
            self.Conv1 = convNxN(3, 64, 3)
            # other Conv2d layers are created the same way

        def forward(self, x):
            out = self.Conv1(x)
            # When I print it here, it already starts giving NaNs
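
To narrow down where the NaN first appears, I could add checks like this inside forward() (a sketch):

    # Check the input and the Conv1 weights separately before the first layer.
    print("input has NaN:", torch.isnan(x).any().item())
    print("Conv1 weight has NaN:", torch.isnan(self.Conv1.weight).any().item())
    out = self.Conv1(x)
    print("Conv1 output has NaN:", torch.isnan(out).any().item())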

By the way, I am using CIFAR-10 as my dataset.

How is convNxN defined?

    def convNxN(in_p, out_p, N):
        return nn.Conv2d(in_p, out_p, kernel_size=(N, N), stride=(1, 1), padding=(1, 1), bias=False)

Thanks, could you post all arguments to create an instance of this conv layer as well as the input shape and the stats of the input?

I didn’t quite understand what exactly you are asking for. Do you want me to post a screenshot of an input sample? I thought I already showed how I am calling the convolution I created; isn’t that it? (I am sorry, I do understand things and I am trying to handle everything, but I am not quite familiar with the vocabulary.)

No, I just asked for the arguments to create an instance of your convolution, i.e. in_channels, out_channels, kernel_size, etc.
Also, could you post the shape of your input, please?

Hello,
self.Conv1 = nn.Conv2d(3, 64, (3, 3), (1, 1), (1, 1), False)
the shape of my input is:
torch.Size([2, 3, 32, 32])
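
With those arguments and that input shape, a standalone reproduction attempt could look like this (random input, so only the shape matches; the real data statistics are still unknown):

    import torch
    import torch.nn as nn

    # Recreate the reported layer and feed random data of the same shape.
    conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    x = torch.randn(2, 3, 32, 32)
    out = conv1(x)
    print("output shape:", out.shape)
    print("output has NaN:", torch.isnan(out).any().item())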

@ptrblck, @SandPhoenix

Update from my end.
The input x had a NaN value in it, which was the root cause of the problem.

This NaN was not present in the raw input, which I had double-checked, but got introduced during the normalization process.

Right now, I have figured out the input sample causing this NaN and removed it from the input dataset.
Things are working now.

@SandPhoenix: please do check your input for NaNs before passing it to the network layers.
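
As a concrete example, a check like the following right after the normalization step would have caught it (points is the input batch variable; adapt the name as needed):

    # Guard against NaNs introduced by normalization (e.g. dividing by a
    # zero range) before the batch reaches the network layers.
    if torch.isnan(points).any():
        raise ValueError("NaN detected in the normalized input batch")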

Happy to hear that you solved your problem.
It is not the same thing for me, but thank you for the information.
I tried print((x != x).any()) and it gave me tensor(False).
I tried everything xD I need to think more.
@ptrblck Do you have any further advice for me? I would be grateful; I already am grateful for your help.