[Solved] Getting 'nan' output after applying a convolutional layer

Hi all, I want to know what the possible reasons are for getting nan after a convolution, given that my inputs are all properly initialized (the nan appears in the layer output, not in the loss). Thanks in advance!!

Here is part of the code:

self.encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)  # first encoder
Output:
encoder_1 Variable containing:
(0 ,0 ,0 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan

(0 ,0 ,1 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan

(0 ,0 ,2 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan

(0 ,0 ,3 ,.,.) =
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
nan nan nan nan nan nan
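For what it's worth, a quick way to narrow down where the nan first appears is to push the input through each sub-layer of the block and test for nan after each step. Below is a minimal sketch, assuming PyTorch 0.4+ (on the older Variable API, wrap the input in Variable first); the (t != t) trick works even on versions without torch.isnan, and the input shape is just a placeholder:

import torch
import torch.nn as nn

def has_nan(t):
    # nan is the only value that is not equal to itself
    return bool((t != t).sum() > 0)

encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)

x = torch.randn(1, 1, 27, 27, 27)   # placeholder volume; use the real MRI patch here
print('input has nan: %s' % has_nan(x))

out = x
for layer in encoder_1:
    out = layer(out)
    print('%s output has nan: %s' % (layer.__class__.__name__, has_nan(out)))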

If the inputs are all properly initialized and you have not trained the network yet, then this should not happen.
Give me a 20-line script that reproduces this behavior and I can take a look.
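Since the MRI volumes may be hard to share, one way to build such a script is to dump the exact batch that produces the nan with torch.save and reload it in a small standalone file. A minimal sketch; the helper name and file path are made up for illustration:

import torch

def dump_if_nan(batch, out, path='failing_batch.pt'):
    # save the exact input batch whenever the layer output contains nan,
    # so it can be reloaded later in a short standalone repro script
    if bool((out != out).sum() > 0):
        torch.save(batch, path)
        return True
    return False

# the repro script then only needs something like:
#   imgdata = torch.load('failing_batch.pt')
#   print(encoder_1(imgdata))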

Thanks so much for the quick reply!!

Actually, this code works fine on some inputs but fails on others (they are all from the same batch), which is why I'm so confused. These are MRI data, so unfortunately it may be hard to reproduce the error without them. I have examined my data and there is no nan in it. What's even weirder is that if I "take out" the abnormal data and apply a convolution layer to it on its own, it seems to work fine.

e.g.
encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)

encoder_1(unnormalData)

I got:

(0 ,0 ,25,.,.) =
1.00000e-07 *
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
…  ⋱  …
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000

(0 ,0 ,26,.,.) =
1.00000e-07 *
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
…  ⋱  …
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 … 0.0000 0.0000 0.0000
…

(0 ,1 ,0 ,.,.) =
1.00000e-07 *
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
…  ⋱  …
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292
8.8292 8.8292 8.8292 … 8.8292 8.8292 8.8292

I have separate scripts for the net and for training. When I run train.py, this is what I get after applying the first convolutional layer (exactly the same setup as the one I gave above):

encoder_1 Variable containing:
(0 ,0 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,1 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,2 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

  ...

Traceback (most recent call last):
  File "/Users/xxx/Desktop/Net/train.py", line 85, in <module>
    output = net(imgdata)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/xxx/Desktop/YL-Net/YLNet3D.py", line 101, in forward
    de1_unpool = self.unpool_1(de_1, indices_3, output_size=size_3) #transfer of indices
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/pooling.py", line 371, in forward
    self.padding, output_size)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/functional.py", line 352, in max_unpool3d
    return _functions.thnn.MaxUnpool3d.apply(input, indices, output_size, stride, padding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/_functions/thnn/pooling.py", line 241, in forward
    ctx.padding[0], ctx.padding[2], ctx.padding[1])
RuntimeError: found an invalid max index 16321 (output volumes are of size 6x6x6) at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/THNN/generic/VolumetricMaxUnpooling.c:118

Are you using a batch size of 1? If so, BatchNorm might give nan.
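One way to test that hypothesis without touching the data is to switch the block to eval mode, so BatchNorm3d normalizes with its running statistics instead of the statistics of the single sample. A sketch, reusing the encoder_1 and imgdata names from the snippets above:

encoder_1.eval()    # BatchNorm3d now uses running_mean / running_var
out_eval = encoder_1(imgdata)
print('nan in eval mode: %s' % bool((out_eval != out_eval).sum() > 0))
encoder_1.train()   # restore training-mode behaviour afterwards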

Yes… but why does it seem to work fine for some inputs?

Even if I just give a random 5D input, say [1, 1, 27, 27, 27], it still works…

thanks!
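For reference, that sanity check looks something like this (a sketch; the shape is just the example above and encoder_1 is the block from earlier):

import torch

x = torch.randn(1, 1, 27, 27, 27)   # random 5D volume of the shape mentioned above
out = encoder_1(x)
print('nan with random input: %s' % bool((out != out).sum() > 0))   # expected: False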

OK, after I excluded the inputs that are all zeros, I haven't had that error again. I'm still confused, though, because I've also tried giving an all-zeros input to the network and it works.
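For reference, excluding all-zero patches can be done with a simple check in the training loop. A sketch, assuming a DataLoader-style loop and the names used earlier in this thread:

for imgdata, target in train_loader:      # hypothetical loader yielding (volume, label)
    if float(imgdata.abs().sum()) == 0:   # every voxel is zero: skip this patch
        continue
    output = net(imgdata)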

I got that error again…

This is the error message I got, though I know the root cause may be those ‘nan’ values…

Traceback (most recent call last):
  File "/Users/xxx/Desktop/YL-Net/train.py", line 85, in <module>
    output = net(imgdata)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/xxx/Desktop/YL-Net/YLNet3D.py", line 100, in forward
    de1_unpool = self.unpool_1(de_1, indices_3, output_size=size_3) #transfer of indices
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/modules/pooling.py", line 371, in forward
    self.padding, output_size)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/functional.py", line 352, in max_unpool3d
    return _functions.thnn.MaxUnpool3d.apply(input, indices, output_size, stride, padding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/torch/nn/_functions/thnn/pooling.py", line 241, in forward
    ctx.padding[0], ctx.padding[2], ctx.padding[1])
RuntimeError: found an invalid max index 16321 (output volumes are of size 6x6x6) at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/THNN/generic/VolumetricMaxUnpooling.c:118

Hello, even if I remove the batch-normalization part of the net I still get 'nan'… please help QAQ
I have separate train.py and net.py files. I tried printing out what happens after applying the first conv layer in both files, and they give me different outputs, which is really weird.

Output in train.py, after applying the first conv layer:

encoder_1 in train.py Variable containing:
(0 ,0 ,0 ,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

(0 ,0 ,1 ,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

(0 ,0 ,2 ,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

(0 ,0 ,24,.,.) =
1.00000e-02 *
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
…  ⋱  …
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709
-0.1709 -0.1709 -0.1709 … -0.1709 -0.1709 -0.1709

Output in net.py, also after applying the first conv layer:

encoder_1 in net.py Variable containing:
(0 ,0 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,1 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

(0 ,0 ,2 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
…  ⋱  …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan

In net.py:

class net(nn.Module):
    def __init__(self, num_classes):
        super(net, self).__init__()

        self.encoder_1 = nn.Sequential(
            nn.Conv3d(1, 25, 7, padding=6, dilation=2),
            nn.BatchNorm3d(25),
            nn.PReLU()
        )  # first encoder

    def forward(self, x):
        size_1 = x.size()
        en_1 = self.encoder_1(x)
        print 'encoder_1 in net.py ', en_1
        # ... rest of forward (pooling, decoders, unpooling) ...

In train.py:

encoder_1 = nn.Sequential(
    nn.Conv3d(1, 25, 7, padding=6, dilation=2),
    nn.BatchNorm3d(25),
    nn.PReLU()
)  # first encoder

print 'encoder_1 in train.py ', encoder_1(imgdata)
output = net(imgdata)
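One thing that may explain the mismatch between the two prints: the encoder_1 built inline in train.py is a separate module from net.encoder_1, so the two have independently, randomly initialized Conv3d weights (and independent BatchNorm/PReLU parameters). A quick sketch to confirm, using the names above:

# the first few convolution weights of the two modules will generally differ
print(encoder_1[0].weight.data.view(-1)[:5])       # inline block from train.py
print(net.encoder_1[0].weight.data.view(-1)[:5])   # block inside the net instance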

Keep removing layers and see when the nan stops.

Thanks so much for the reply!!

It still stops with the same error:

RuntimeError: found an invalid max index 16321 (output volumes are of size 6x6x6) at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/THNN/generic/VolumetricMaxUnpooling.c:118

That's a different error, unrelated to the nan problem.

Then why does this error happen? Every time I get it, the outputs from the first conv layer are 'nan', so I thought they were related… QAQ

please…help…I am desperate…QAQ

Hmm, it looks like if you have nans, the max pooling might produce such invalid indices, which MaxUnpool3d then rejects.
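One way to check that is to inspect what is fed into the unpooling call in forward(), right before the line that fails: the values should contain no nan, and every saved index must be smaller than the number of voxels in one output volume (6x6x6 = 216 according to the error message). A sketch using the de_1 / indices_3 names from the traceback; on the old Variable API add .data before calling .min()/.max()/.sum():

volume = 6 * 6 * 6   # voxels per output volume, taken from the error message
print('nan in de_1: %s' % bool((de_1 != de_1).sum() > 0))
print('index range: %d .. %d (must stay below %d)'
      % (int(indices_3.min()), int(indices_3.max()), volume))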

Thanks so much for the reply!!!
I have checked the input that causes the error: it has no nan, and it is not all zeros. QAQ

If you give me a script that reproduces your problem, I can look further. Right now it's impossible to debug your issue without knowing the fuller context of what is happening.

Do you mind if I send the scripts and the data to your email?