NaN - PyTorch code diverging with RuntimeError

Background

-I took working Python code for a sample neural net (analytically derived gradients, from Victor Zhou's blog)
-I converted it to PyTorch with autograd
-My aim is to explain neural nets to beginners through PyTorch and GPU/CUDA
-I hard-coded the weights to make the computation deterministic
-CPU version
-Python VERSION: 3.8.3
-PyTorch VERSION: 1.5.1

Problem
-Though the Python code converges, the PyTorch code terminates with an error and NaN
-Am I making any basic mistakes? (I am new to PyTorch)
-Since the pure Python version works fine, why does the torch code explode with NaN after running for 80 epochs?
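
A quick way to localize this kind of failure (an aside, not from the original thread) is PyTorch's built-in anomaly detection, which re-runs the backward pass with extra checks and reports the forward op that first produced a NaN gradient:

import torch

# Enable before the training loop; it slows things down,
# so use it only while debugging
torch.autograd.set_detect_anomaly(True)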

PyTorch code given below:
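
(The original listing did not survive; below is a minimal sketch of what such a conversion might look like, assuming Victor Zhou's 2-2-1 sigmoid network with MSE loss. The dataset matches the blog post; the hard-coded weight values are placeholders, not the poster's exact code.)

import torch

# Dataset from Victor Zhou's blog: (weight - 135 lb, height - 66 in) per person
data = torch.tensor([
    [-2., -1.],   # Alice
    [25.,  6.],   # Bob
    [17.,  4.],   # Charlie
    [-15., -6.],  # Diana
])
all_y_trues = torch.tensor([[1.], [0.], [0.], [1.]])  # 1 = female, 0 = male

# Hard-coded starting weights for a 2-2-1 network (placeholder values)
w1 = torch.tensor([[0.5, 0.5], [0.5, 0.5]], requires_grad=True)  # input -> hidden
b1 = torch.zeros(2, requires_grad=True)
w2 = torch.tensor([[0.5], [0.5]], requires_grad=True)            # hidden -> output
b2 = torch.zeros(1, requires_grad=True)

lr = 0.1
for epoch in range(1000):
    h = torch.sigmoid(data @ w1 + b1)     # hidden layer, shape (4, 2)
    y_pred = torch.sigmoid(h @ w2 + b2)   # output, shape (4, 1)
    loss = ((all_y_trues - y_pred) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        # Plain SGD update, then clear gradients for the next epoch
        for p in (w1, b1, w2, b2):
            p -= lr * p.grad
            p.grad.zero_()
    if epoch % 10 == 0:
        print(f"Epoch {epoch} loss: {loss.item():.3f}")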

OUTPUT

Maybe normalizing the inputs can help.

I did that as well, but the problem still persists.

data = (data - data.mean())/data.std()
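
(One caveat, as an aside: this line normalizes over all values at once; standardizing each input feature separately over the batch dimension is the more usual form.)

# Per-feature standardization: mean/std computed along dim=0 (the batch)
data = (data - data.mean(dim=0)) / data.std(dim=0)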

Loss is converging

Epoch 0 loss: 0.326
Epoch 10 loss: 0.309
Epoch 20 loss: 0.263
Epoch 30 loss: 0.200
Epoch 40 loss: 0.166
Epoch 50 loss: 0.161
Epoch 60 loss: 0.145
Epoch 70 loss: 0.102
Epoch 80 loss: 0.051
Epoch 90 loss: 0.020
Epoch 100 loss: 0.010
Epoch 110 loss: 0.009
Epoch 120 loss: 0.009
Epoch 130 loss: 0.009
Epoch 140 loss: 0.009
Epoch 150 loss: 0.007
Epoch 160 loss: 0.006
Epoch 170 loss: 0.004
Epoch 180 loss: 0.003
Epoch 190 loss: 0.002
Epoch 200 loss: 0.001
Epoch 210 loss: 0.001
Epoch 220 loss: 0.001
Epoch 230 loss: 0.000
Epoch 240 loss: 0.000
Epoch 250 loss: 0.000
Epoch 260 loss: 0.000
Epoch 270 loss: 0.000
Epoch 280 loss: 0.000
Epoch 290 loss: 0.000
Epoch 300 loss: 0.000
Epoch 310 loss: 0.000
Epoch 320 loss: 0.000
Epoch 330 loss: 0.000
Epoch 340 loss: 0.000
Epoch 350 loss: 0.000

Kushaj: Thank you! What did you find and change?

I found one issue with the way I had defined the target tensor in my code:

all_y_trues = torch.tensor([
    1.,
    0.,
    0.,
    1.])

I changed it to:

all_y_trues = torch.tensor([
    [1.],
    [0.],
    [0.],
    [1.]])
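
That shape change is likely the whole story: assuming y_pred comes out with shape (4, 1), subtracting a 1-D target of shape (4,) broadcasts to a (4, 4) matrix instead of taking an element-wise difference, so the loss being minimized is the wrong one. A quick check:

import torch

y_pred = torch.rand(4, 1)                  # network output, shape (4, 1)
y_flat = torch.tensor([1., 0., 0., 1.])    # shape (4,)  - the old target
y_col = y_flat.reshape(4, 1)               # shape (4, 1) - the fixed target

print((y_flat - y_pred).shape)  # torch.Size([4, 4]) - silent broadcast, wrong loss
print((y_col - y_pred).shape)   # torch.Size([4, 1]) - element-wise, as intended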

It no longer explodes midway through the epoch iterations.

Apart from that, the only other change was normalizing the input data by adding this line: data = (data - data.mean())/data.std()