Getting NaN in the Softmax Layer

Hi,

I am trying to train an existing neural network from a published paper on a custom dataset.

However, while training it I am getting NaN as my predictions even before completing the first batch of training (batch size = 32).

I tried to google the error, came across multiple posts from this forum, and tried a few things:

  1. Reducing the learning rate (default was 0.001, reduced it to 0.0001)
  2. Reducing batch size from 32 to 10
  3. Using torch.autograd.detect_anomaly() (a minimal usage sketch follows this list)
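
A minimal sketch of how anomaly detection can be enabled around the training step looks roughly like this (classifier, criterion, optimizer, and dataloader are placeholder names, not the exact code from my script):

    import torch

    # Enable anomaly detection globally; the backward pass will then report
    # the forward operation that produced the NaN (this slows training down,
    # so it is only meant for debugging).
    torch.autograd.set_detect_anomaly(True)

    for points, target in dataloader:                # placeholder dataloader
        optimizer.zero_grad()
        pred, trans, trans_feat = classifier(points)
        loss = criterion(pred, target)               # placeholder loss function
        loss.backward()                              # fails here once a NaN appears
        optimizer.step()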

By using torch.autograd.detect_anomaly() I got the following error:

Warning: Traceback of forward call that caused the error:
  File "train_classification.py", line 126, in <module>
    pred, trans, trans_feat = classifier(points)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/mydrive/My Drive/Projects/pointnet/pointnet.pytorch/pointnet/model.py", line 147, in forward
    return F.log_softmax(x, dim=1), trans, trans_feat
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1317, in log_softmax
    ret = input.log_softmax(dim)
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Traceback (most recent call last):
  File "train_classification.py", line 133, in <module>
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

If I am interpreting this correctly, the error seems to come from the softmax layer, i.e. F.log_softmax().

Now I am stuck and do not know how to proceed. Can someone help me with this?

Could you check the min and max values of x before feeding it to F.log_softmax?
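
For example, something along these lines just before the F.log_softmax call (a sketch only; x is whatever tensor you pass to F.log_softmax):

    # Inspect the logits right before log_softmax; a NaN/Inf here means the
    # problem originates earlier in the network or in the input itself.
    print("x min:", x.min().item(), "x max:", x.max().item())
    print("x has NaN:", torch.isnan(x).any().item())
    print("x has Inf:", torch.isinf(x).any().item())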

The forward() function of the model is as follows:

    def forward(self, x):
        x, trans, trans_feat = self.feat(x)
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.dropout(self.fc2(x))))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1), trans, trans_feat

I printed out the max and min of the fc3 layer's weights and gradients (labelled f3 in the dump below).
Here is the output dump from the last few iterations:

[0: 40/47] test loss: 3.190464 accuracy: 0.250000
f3 Weights Min: -0.06608034670352936
f3 Weights Max: 0.06457919627428055
f3 Gradients Min: -0.18587647378444672
f3 Gradients Max: 0.06978222727775574
[0: 41/47] train loss: 2.302364 accuracy: 0.593750
f3 Weights Min: -0.066163070499897
f3 Weights Max: 0.06460865586996078
f3 Gradients Min: -0.19260874390602112
f3 Gradients Max: 0.060340650379657745
[0: 42/47] train loss: 2.837160 accuracy: 0.281250
f3 Weights Min: -0.0662456750869751
f3 Weights Max: 0.06465635448694229
f3 Gradients Min: -0.23064641654491425
f3 Gradients Max: 0.03692895919084549
[0: 43/47] train loss: 2.571262 accuracy: 0.406250
f3 Weights Min: -0.06632933020591736
f3 Weights Max: 0.06470420211553574
f3 Gradients Min: -0.2090958207845688
f3 Gradients Max: 0.071017786860466
[0: 44/47] train loss: 2.492180 accuracy: 0.375000
f3 Weights Min: -0.06641349196434021
f3 Weights Max: 0.06475285440683365
Warning: Traceback of forward call that caused the error:
  File "train_classification.py", line 126, in <module>
    pred, trans, trans_feat = classifier(points)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/mydrive/My Drive/Projects/pointnet/pointnet.pytorch/pointnet/model.py", line 147, in forward
    return F.log_softmax(x, dim=1), trans, trans_feat
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1317, in log_softmax
    ret = input.log_softmax(dim)
 (print_stack at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Traceback (most recent call last):
  File "train_classification.py", line 133, in <module>
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.

I don’t see the gradients exploding, but I think they might be vanishing. Am I right?
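
For reference, per-parameter statistics like the ones above can be collected after loss.backward() with something like this (a sketch; classifier stands for the model instance):

    # After loss.backward(), dump per-parameter gradient statistics to spot
    # vanishing (near-zero) or exploding (huge / NaN) gradients.
    for name, param in classifier.named_parameters():
        if param.grad is not None:
            g = param.grad
            print(f"{name}: grad min {g.min().item():.3e}, "
                  f"max {g.max().item():.3e}, norm {g.norm().item():.3e}")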

I have had the same problem for two days and I cannot find the cause. I even removed the log_softmax from my code and still get the same error…
Please tell me if you find a solution, and I will do the same if I do.

Sure, I will; I am actively pursuing this. I have been racking my brain over this for two days, and I will update this post as soon as I get help from any forum.

In the meantime, did you try to print out the weights as suggested by @ptrblck?

I did, and they seemed pretty normal to me. When I remove detect_anomaly(), it starts working for a second or so and then the NaN values appear out of nowhere. It was actually the first thing I did, even before using detect_anomaly().

Is this only visible on the CPU, on the GPU, or on both devices?
Could you try to update numpy as suggested here?

CC @dhirajsuvarna

In my case, I am using the CPU; I can’t try the GPU currently.

Does updating numpy work for you?

Sadly, it is already the latest version.

Could you try to store the tensors which create the NaN output, so that we can debug further?
Alternatively, could you post the shapes and stats of these tensors, so that we can run the model with random values and try to reproduce the failure?
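
For instance, a guard like this in the training loop could dump the offending batch to disk (a sketch only; the variable and file names are just suggestions):

    pred, trans, trans_feat = classifier(points)
    if torch.isnan(pred).any():
        # Save the exact batch and model state that produced the NaN so the
        # failure can be reproduced and inspected in isolation.
        torch.save({"points": points, "target": target,
                    "state_dict": classifier.state_dict()}, "nan_batch.pt")
        raise RuntimeError("NaN in predictions; batch saved to nan_batch.pt")

The saved file can later be reloaded with torch.load("nan_batch.pt") to inspect the inputs and weights offline.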

    import torch.nn as nn

    def convNxN(in_p, out_p, N):
        return nn.Conv2d(in_p, out_p, kernel_size=(N, N), stride=(1, 1), padding=(1, 1), bias=False)

    class Modelname(nn.Module):
        def __init__(self):
            super().__init__()
            self.Conv1 = convNxN(3, 64, 3)
            # other Conv2d layers are created the same way

        def forward(self, x):
            out = self.Conv1(x)
            # When I print it here, it already starts giving NaNs
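
To narrow down where the NaN first appears, I could add checks like this inside forward() (a sketch):

    # Check the input and the Conv1 weights separately before the first layer.
    print("input has NaN:", torch.isnan(x).any().item())
    print("Conv1 weight has NaN:", torch.isnan(self.Conv1.weight).any().item())
    out = self.Conv1(x)
    print("Conv1 output has NaN:", torch.isnan(out).any().item())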

By the way, I am using CIFAR-10 as my dataset.

How is convNxN defined?

    def convNxN(in_p, out_p, N):
        return nn.Conv2d(in_p, out_p, kernel_size=(N, N), stride=(1, 1), padding=(1, 1), bias=False)

Thanks, could you post all arguments to create an instance of this conv layer as well as the input shape and the stats of the input?

I didn’t quite understand what exactly you are asking for. Do you want me to post a screenshot of an input sample? I thought I already showed how I am calling the convolution I created; isn’t that it? (I am sorry, I do understand things and I am trying to handle everything, but I am not quite familiar with the vocabulary.)

No, I just asked for the arguments to create an instance of your convolution, i.e. in_channels, out_channels, kernel_size, etc.
Also, could you post the shape of your input, please?

Hello,
self.Conv1 = nn.Conv2d(3, 64, (3, 3), (1, 1), (1, 1), False)
the shape of my input is:
torch.Size([2, 3, 32, 32])
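
With those arguments and that input shape, a standalone reproduction attempt could look like this (random input, so only the shape matches; the real data statistics are still unknown):

    import torch
    import torch.nn as nn

    # Recreate the reported layer and feed random data of the same shape.
    conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    x = torch.randn(2, 3, 32, 32)
    out = conv1(x)
    print("output shape:", out.shape)
    print("output has NaN:", torch.isnan(out).any().item())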

@ptrblck, @SandPhoenix

Update from my end.
The input x had a NaN value in it, which was the root cause of the problem.

This NaN was not present in the raw input, which I had double-checked, but got introduced during the normalization process.

Right now, I have figured out the input sample causing this NaN and removed it from the input dataset.
Things are working now.

@SandPhoenix: please do check your input for NaNs before passing it to the network layers.
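
As a concrete example, a check like the following right after the normalization step would have caught it (points is the input batch variable; adapt the name as needed):

    # Guard against NaNs introduced by normalization (e.g. dividing by a
    # zero range) before the batch reaches the network layers.
    if torch.isnan(points).any():
        raise ValueError("NaN detected in the normalized input batch")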

Happy to hear that you solved your problem.
It is not the same thing for me, but thank you for the information.
I tried print((x != x).any()) and it gave me tensor(False).
I tried everything xD I need to think more.
@ptrblck Do you have any further advice for me? I would be grateful; I already am grateful for your help.