[Solved] Getting 'nan' output after applying a convolutional layer

Hi all, I want to know what the possible reasons are for getting nan after a convolution, given that my inputs are all properly initialized (the nan appears in the layer output, not in the loss). Thanks in advance!!

Here is part of the code:

self.encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)  # first encoder
Output:
encoder_1 Variable containing:
(0 ,0 ,0 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan

(0 ,0 ,1 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan

(0 ,0 ,2 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan

(0 ,0 ,3 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
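For what it's worth, a quick way to narrow down where the nan first appears is to push the input through each sub-layer of the block and test for nan after each step. Below is a minimal sketch, assuming PyTorch 0.4+ (on the older Variable API, wrap the input in Variable first); the (t != t) trick works even on versions without torch.isnan, and the input shape is just a placeholder:

import torch
import torch.nn as nn

def has_nan(t):
    # nan is the only value that is not equal to itself
    return bool((t != t).sum() > 0)

encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)

x = torch.randn(1, 1, 27, 27, 27)   # placeholder volume; use the real MRI patch here
print('input has nan: %s' % has_nan(x))

out = x
for layer in encoder_1:
    out = layer(out)
    print('%s output has nan: %s' % (layer.__class__.__name__, has_nan(out)))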

If the inputs are all properly initialized and you have not trained the network yet, then this should not happen.
Give me a 20-line script that reproduces this behavior and I can take a look.
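Since the MRI volumes may be hard to share, one way to build such a script is to dump the exact batch that produces the nan with torch.save and reload it in a small standalone file. A minimal sketch; the helper name and file path are made up for illustration:

import torch

def dump_if_nan(batch, out, path='failing_batch.pt'):
    # save the exact input batch whenever the layer output contains nan,
    # so it can be reloaded later in a short standalone repro script
    if bool((out != out).sum() > 0):
        torch.save(batch, path)
        return True
    return False

# the repro script then only needs something like:
#   imgdata = torch.load('failing_batch.pt')
#   print(encoder_1(imgdata))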

Thanks so much for the quick reply!!

Actually, this code works fine on some inputs but fails on others (they are all from the same batch), which is why I'm so confused. These are MRI data, so unfortunately it may be hard to reproduce the error without them. I have examined my data and there is no nan in it. What's even weirder is that if I "take out" the abnormal data and apply a convolution layer to it on its own, it seems to work fine.

e.g.
encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)

encoder_1(unnormalData)

I got:

(0 ,0 ,25,.,.) =
1.00000e-07 *
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
…  ⋱  …
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000

(0 ,0 ,26,.,.) =
1.00000e-07 *
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
…  ⋱  …
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
…

(0 ,1 ,0 ,.,.) =
1.00000e-07 *
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
…  ⋱  …
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292

I have separate scripts for the net and for training. When I run train.py, this is what I get after applying the first convolutional layer (exactly the same setup as the one I gave above):

encoder_1 Variable containing:
(0 ,0 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,1 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,2 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

  ...

Traceback (most recent call last):
  File "/Users/xxx/Desktop/Net/train.py", line 85, in <module>
    output = net(imgdata)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/xxx/Desktop/YL-Net/YLNet3D.py", line 101, in forward
    de1_unpool = self.unpool_1(de_1, indices_3, output_size=size_3) #transfer of indices
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/pooling.py", line 371, in forward
    self.padding, output_size)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/functional.py", line 352, in max_unpool3d
    return _functions.thnn.MaxUnpool3d.apply(input, indices, output_size, stride, padding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/_functions/thnn/pooling.py", line 241, in forward
    ctx.padding[0], ctx.padding[2], ctx.padding[1])
RuntimeError: found an invalid max index 16321 (output volumes are of size 6x6x6) at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/THNN/generic/VolumetricMaxUnpooling.c:118

Are you using a batch size of 1? If so, BatchNorm might give nan.
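One way to test that hypothesis without touching the data is to switch the block to eval mode, so BatchNorm3d normalizes with its running statistics instead of the statistics of the single sample. A sketch, reusing the encoder_1 and imgdata names from the snippets above:

encoder_1.eval()    # BatchNorm3d now uses running_mean / running_var
out_eval = encoder_1(imgdata)
print('nan in eval mode: %s' % bool((out_eval != out_eval).sum() > 0))
encoder_1.train()   # restore training-mode behaviour afterwards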

Yes… but why does it seem to work fine for some inputs?

Even if I just give a random 5D input, say [1, 1, 27, 27, 27], it still works…

thanks!
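For reference, that sanity check looks something like this (a sketch; the shape is just the example above and encoder_1 is the block from earlier):

import torch

x = torch.randn(1, 1, 27, 27, 27)   # random 5D volume of the shape mentioned above
out = encoder_1(x)
print('nan with random input: %s' % bool((out != out).sum() > 0))   # expected: False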

OK, after I excluded the inputs that are all zeros, I haven't had that error again. I'm still confused, though, because I've also tried giving an all-zeros input to the network and it works.
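For reference, excluding all-zero patches can be done with a simple check in the training loop. A sketch, assuming a DataLoader-style loop and the names used earlier in this thread:

for imgdata, target in train_loader:      # hypothetical loader yielding (volume, label)
    if float(imgdata.abs().sum()) == 0:   # every voxel is zero: skip this patch
        continue
    output = net(imgdata)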

I got that error again…

This is the error message I got, though I know the root cause may be those ‘nan’ values…

Traceback (most recent call last):
  File "/Users/xxx/Desktop/YL-Net/train.py", line 85, in <module>
    output = net(imgdata)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/xxx/Desktop/YL-Net/YLNet3D.py", line 100, in forward
    de1_unpool = self.unpool_1(de_1, indices_3, output_size=size_3) #transfer of indices
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/pooling.py", line 371, in forward
    self.padding, output_size)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/functional.py", line 352, in max_unpool3d
    return _functions.thnn.MaxUnpool3d.apply(input, indices, output_size, stride, padding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/_functions/thnn/pooling.py", line 241, in forward
    ctx.padding[0], ctx.padding[2], ctx.padding[1])
RuntimeError: found an invalid max index 16321 (output volumes are of size 6x6x6) at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/THNN/generic/VolumetricMaxUnpooling.c:118

Hello, even if I remove the batch-normalization part of the net I still get 'nan'… please help QAQ
I have separate train.py and net.py files. I tried printing out what happens after applying the first conv layer in both files, and they give me different outputs, which is really weird.

Output in train.py, after applying the first conv layer:

encoder_1 in train.py Variable containing:
(0 ,0 ,0 ,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

(0 ,0 ,1 ,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

(0 ,0 ,2 ,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

(0 ,0 ,24,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

Output in net.py, also after applying the first conv layer:

encoder_1 in net.py Variable containing:
(0 ,0 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,1 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,2 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

In net.py:

class net(nn.Module):
    def __init__(self, num_classes):
        super(net, self).__init__()

        self.encoder_1 = nn.Sequential(
            nn.Conv3d(1, 25, 7, padding=6, dilation=2),
            nn.BatchNorm3d(25),
            nn.PReLU()
        )  # first encoder

    def forward(self, x):
        size_1 = x.size()
        en_1 = self.encoder_1(x)
        print 'encoder_1 in net.py ', en_1
        # ... rest of forward (pooling, decoders, unpooling) ...

In train.py:

encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)  # first encoder

print 'encoder_1 in train.py ', encoder_1(imgdata)
output = net(imgdata)
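One thing that may explain the mismatch between the two prints: the encoder_1 built inline in train.py is a separate module from net.encoder_1, so the two have independently, randomly initialized Conv3d weights (and independent BatchNorm/PReLU parameters). A quick sketch to confirm, using the names above:

# the first few convolution weights of the two modules will generally differ
print(encoder_1[0].weight.data.view(-1)[:5])       # inline block from train.py
print(net.encoder_1[0].weight.data.view(-1)[:5])   # block inside the net instance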

Keep removing layers and see when the nan stops.

Thanks so much for the reply!!

It still stops with the same error:

RuntimeError: found an invalid max index 16321 (output volumes are of size 6x6x6) at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/THNN/generic/VolumetricMaxUnpooling.c:118

That's a different error, unrelated to the nan problem.

Then why does this error happen? Every time I get it, the outputs from the first conv layer are 'nan', so I thought they were related… QAQ

please…help…I am desperate…QAQ

Hmm, it looks like if you have nans, the max pooling might produce such invalid indices, which MaxUnpool3d then rejects.
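One way to check that is to inspect what is fed into the unpooling call in forward(), right before the line that fails: the values should contain no nan, and every saved index must be smaller than the number of voxels in one output volume (6x6x6 = 216 according to the error message). A sketch using the de_1 / indices_3 names from the traceback; on the old Variable API add .data before calling .min()/.max()/.sum():

volume = 6 * 6 * 6   # voxels per output volume, taken from the error message
print('nan in de_1: %s' % bool((de_1 != de_1).sum() > 0))
print('index range: %d .. %d (must stay below %d)'
      % (int(indices_3.min()), int(indices_3.max()), volume))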

Thanks so much for the reply!!!
I have checked the input that causes the error: it has no nan, and it is not all zeros. QAQ

If you give me a script that reproduces your problem, I can look further. Right now it's impossible to debug your issue without knowing the fuller context of what is happening.

Do you mind if I send the scripts and the data to your email?