Well-formed input into a simple linear layer, output contains NaN

I feed well-formed data into a simple linear layer with normal weights and bias, but the output contains some NaN values. This only happens on Ubuntu 18.04 + PyTorch 1.4.0; on Win10 + PyTorch 1.4.0 or on Colab, the linear layer works fine.
On Ubuntu:

import torch
import torch.nn as nn
model = nn.Linear(6, 8)  # simple linear layer: 6 inputs, 8 outputs
x = torch.randn((4, 6))  # batch of 4 well-formed inputs
y = model(x)
print(x)
print(y)
print(model.weight)
print(model.bias)

I get:

tensor([[ 0.4134, -1.0348, -0.7237,  0.6970,  0.0487,  0.1217],
        [ 1.2217,  0.2465, -0.0832,  0.5530, -0.5716, -0.4916],
        [-0.3497, -0.1285, -1.3026, -0.0845, -0.5295, -0.7776],
        [ 1.2792,  0.1692, -1.3023, -1.9925, -0.6869,  0.8478]])
tensor([[ 3.0334e-01,  1.9168e+05, -4.3615e+31,  3.7255e+04, -1.5095e+23,
         -6.4485e+34,  1.0371e+34,         nan],
        [ 3.0324e-01, -7.7407e+05,  1.0390e+31,  4.2809e+03,  6.0958e+23,
         -7.4157e+33, -4.1881e+34,         nan],
        [ 3.0343e-01, -1.2243e+06, -5.4147e+30,  6.7047e+04,  9.6410e+23,
         -1.1607e+35, -6.6239e+34,         nan],
        [ 3.0323e-01,  1.3349e+06,  7.1316e+30,  6.7047e+04, -1.0513e+24,
         -1.1605e+35,  7.2227e+34,         nan]], grad_fn=<AddmmBackward>)
tensor([[-0.3934,  0.2317, -0.3121, -0.1111, -0.2597, -0.2182],
        [-0.2294,  0.3308, -0.1488, -0.1515,  0.1625,  0.0586],
        [-0.3984,  0.1251,  0.0899,  0.0807,  0.0110,  0.1535],
        [-0.2065, -0.3367,  0.1329,  0.4042, -0.0387, -0.3794],
        [ 0.2129,  0.1050, -0.2263, -0.3991,  0.3312,  0.3797],
        [ 0.3508,  0.1268, -0.2395,  0.0672,  0.3916,  0.0131],
        [-0.1437, -0.4070, -0.0663, -0.1292, -0.0127, -0.0040],
        [ 0.0037, -0.0610, -0.0200,  0.2865, -0.0804,  0.1235]],
       requires_grad=True)
tensor([ 0.3034, -0.0240,  0.1232,  0.3308,  0.1517, -0.3978, -0.2769,  0.0284],
       requires_grad=True)

When I test on Colab or Win10, y looks normal. For example (this particular printout came from a slightly different model ending in a ReLU, hence the 16 columns, but note there are no NaNs):

tensor([[0.8947, 0.7973, 0.0000, 0.4691, 0.8554, 0.4177, 1.5419, 0.0241, 0.7379,
         0.0000, 1.6729, 0.0000, 0.0000, 1.1157, 0.0000, 0.7253],
        [0.4217, 0.1511, 0.0000, 0.9463, 0.0000, 0.0000, 0.0000, 1.0647, 0.0000,
         0.9710, 0.0000, 1.5806, 0.0000, 0.0000, 1.3374, 0.0059],
        [0.0000, 0.7281, 1.6656, 0.0000, 0.0000, 0.6633, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0893, 1.3113, 0.0000, 0.5653, 0.0000],
        [0.3799, 0.0000, 0.0000, 0.2625, 0.9877, 0.6430, 0.1361, 0.5216, 0.9120,
         1.0127, 0.0000, 0.0000, 0.6246, 0.7964, 0.0000, 0.9006]],
       grad_fn=<ReluBackward0>)

Could someone help?

How did you define the y tensor, which contains the NaN values?

Sorry, I forgot to include it:

y = model(x)

Based on the values, it looks as if y is reading uninitialized memory.
Do you get the NaN (and extremely large) outputs using this code?

import torch
import torch.nn as nn

x = torch.tensor([[ 0.4134, -1.0348, -0.7237,  0.6970,  0.0487,  0.1217],
        [ 1.2217,  0.2465, -0.0832,  0.5530, -0.5716, -0.4916],
        [-0.3497, -0.1285, -1.3026, -0.0845, -0.5295, -0.7776],
        [ 1.2792,  0.1692, -1.3023, -1.9925, -0.6869,  0.8478]])

model = nn.Linear(6, 8)

# restore the exact weight and bias values from the printout above
with torch.no_grad():
    model.weight = nn.Parameter(torch.tensor([[-0.3934,  0.2317, -0.3121, -0.1111, -0.2597, -0.2182],
        [-0.2294,  0.3308, -0.1488, -0.1515,  0.1625,  0.0586],
        [-0.3984,  0.1251,  0.0899,  0.0807,  0.0110,  0.1535],
        [-0.2065, -0.3367,  0.1329,  0.4042, -0.0387, -0.3794],
        [ 0.2129,  0.1050, -0.2263, -0.3991,  0.3312,  0.3797],
        [ 0.3508,  0.1268, -0.2395,  0.0672,  0.3916,  0.0131],
        [-0.1437, -0.4070, -0.0663, -0.1292, -0.0127, -0.0040],
        [ 0.0037, -0.0610, -0.0200,  0.2865, -0.0804,  0.1235]]))
    model.bias = nn.Parameter(
        torch.tensor([ 0.3034, -0.0240,  0.1232,  0.3308,  0.1517, -0.3978, -0.2769,  0.0284]))


y = model(x)
print(y)
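
If that reproduces the NaNs, one more check (just a sketch, not required for the report) is to compare the fused addmm path that nn.Linear uses against an explicit matmul; identical garbage from both would point at the BLAS/MKL library itself rather than at nn.Linear:

import torch
import torch.nn as nn

model = nn.Linear(6, 8)
x = torch.randn(4, 6)

y_fused = model(x)                            # nn.Linear routes through torch.addmm
y_manual = x @ model.weight.t() + model.bias  # explicit matmul + broadcast add

print(torch.isnan(y_fused).any().item())      # True confirms the corruption
print(torch.allclose(y_fused, y_manual))      # False localizes it to one kernel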

Same result. On Ubuntu, I got:

tensor([[ 3.0335e-01,  1.9162e+05, -4.3614e+31,  3.7255e+04, -1.5090e+23,
         -6.4485e+34,  1.0367e+34,         nan],
        [ 3.0325e-01, -7.7402e+05,  1.0389e+31,  4.2792e+03,  6.0954e+23,
         -7.4128e+33, -4.1879e+34,         nan],
        [ 3.0344e-01, -1.2243e+06, -5.4160e+30,  6.7047e+04,  9.6416e+23,
         -1.1607e+35, -6.6243e+34,         nan],
        [ 3.0325e-01,  1.3349e+06,  7.1314e+30,  6.7045e+04, -1.0512e+24,
         -1.1604e+35,  7.2223e+34,         nan]], grad_fn=<AddmmBackward>)

On Colab:

tensor([[ 0.0102, -0.4440, -0.1605,  0.7313,  0.0790, -0.1432,  0.0417,  0.3183],
        [ 0.1001, -0.4158, -0.3773,  0.4166, -0.1402, -0.1112, -0.6095,  0.1632],
        [ 1.1343,  0.0887, -0.0027,  0.5545, -0.0784, -0.4480, -0.0672, -0.0167],
        [ 0.4606,  0.1722, -0.5206, -1.2638,  1.6261, -0.0075, -0.1805, -0.3621]],
       grad_fn=<AddmmBackward>)

Could you run this script to collect your setup information (from the GitHub issue template):

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
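
If it helps, the key versions can also be printed directly from Python (a quick sketch, not a replacement for the full template):

import numpy as np
import torch

print("torch:", torch.__version__)
print("numpy:", np.__version__)
print("CUDA build:", torch.version.cuda)
print("MKL available:", torch.backends.mkl.is_available())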

Here is my environment info:

Collecting environment information...
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce MX150
Nvidia driver version: 440.59
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.14.3
[pip3] numpydoc==0.8.0
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2018.0.2                      1  
[conda] mkl-service               1.1.2            py36h17a0993_4  
[conda] mkl_fft                   1.0.1            py36h3010b51_0  
[conda] mkl_random                1.0.1            py36h629b387_0

Solved it by upgrading numpy. Thanks for your help.
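
For anyone hitting the same thing: after the upgrade (e.g. pip install --upgrade numpy in the shell, or the conda equivalent), a quick sanity check is to re-run the repro above; a minimal sketch:

import numpy as np
import torch
import torch.nn as nn

print("numpy:", np.__version__)   # confirm the upgraded version is the one loaded
y = nn.Linear(6, 8)(torch.randn(4, 6))
assert not torch.isnan(y).any(), "still producing NaNs"
print(y)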

Oh wow, how did you figure it out?
I’m not sure how numpy might interact with the linear layer and create NaNs. :frowning:

PyTorch’s computation may depend on numpy, and I found my numpy version hadn’t been updated in a long time, so I just gave it a try. :rofl:
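
If the guess is right, the link is MKL rather than numpy itself: in a conda environment like the one above, torch’s CPU kernels and numpy can load the same MKL libraries, so a stale MKL dragged along by an old numpy could plausibly corrupt the linear layer. This is an assumption on my part, but both sides can be inspected:

import numpy as np
import torch

print(torch.__config__.show())   # the MKL/BLAS torch was built against
np.show_config()                 # the BLAS/LAPACK numpy is linked to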

It’s very strange; I ran into the same situation as you. I trained VGG16 on CIFAR-10 in Ubuntu 16.04, and after the first backward and optimizer step I got a NaN loss and NaN gradients. I then tried changing VGG16’s layers, e.g. modifying the parameters of nn.Linear(in, out), and found that when I changed the original FC layer from nn.Linear(4096, num_classes=10) to nn.Linear(4096, 293) it worked with no NaN. Deleting all the FC layers of VGG16 also worked fine. After checking my numpy version (1.18.1), I updated numpy to the latest version, 1.19.2. Wow!!! No more NaNs in my experiment!!! I still don’t understand why this problem happens on Ubuntu and not Windows.
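
For reference, the FC swap described above looks roughly like this (a sketch assuming torchvision’s VGG16, where the last classifier entry is the final Linear layer):

import torch.nn as nn
from torchvision import models

model = models.vgg16()
model.classifier[6] = nn.Linear(4096, 10)   # CIFAR-10: 10 classes
print(model.classifier)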

Upgrading numpy from 1.18.5 to 1.21.4 did not work for me; torch 1.6.0 on Linux, running a transformer model.
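
In case it is a different root cause here: a generic way to locate where NaNs first appear, independent of the numpy issue, is to register forward hooks on every submodule; a sketch:

import torch

def install_nan_hooks(model):
    # Register a forward hook on every submodule; NaNs propagate, so the
    # first line printed names the first module to emit them.
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of {name} ({module.__class__.__name__})")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

Call install_nan_hooks(model) once, run a forward pass, and the printout narrows the search considerably.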