Weights getting 'nan' during training


(Shiv) #1

I am checking my weights every 10 epochs. I have tried Xavier and normal initialization of the weights and have varied the learning rate over a wide range. Also, the validation error either remains constant or increases slowly. Regardless of the setup, I am getting ‘nan’ in some filters by the 10th epoch. What could be the issue, and how do I solve it?

(1 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
… ⋱ …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
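For reference, a check like the one described can be written as a small helper that scans the model's parameters. This is only a sketch; the model here is a made-up stand-in, not the one from the question:

```python
import torch
import torch.nn as nn

# Hypothetical model; any nn.Module works the same way.
model = nn.Sequential(nn.Conv2d(1, 4, 3), nn.ReLU(), nn.Conv2d(4, 8, 3))

def find_nan_params(model):
    """Return the names of parameters that contain any NaN entries."""
    return [name for name, p in model.named_parameters()
            if torch.isnan(p).any().item()]

# Deliberately corrupt one filter to show the check firing.
with torch.no_grad():
    model[0].weight[0].fill_(float("nan"))

print(find_nan_params(model))  # ['0.weight']
```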


(Francisco Massa) #2

There can be several reasons.

  • Make sure your inputs are not uninitialized
  • Check that you don’t have exploding gradients, which can lead to nan/inf; a smaller learning rate can help here
  • Check that you don’t have a division by zero, etc.

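The gradient-explosion point above can be checked directly: inspect `p.grad` after `backward()` and before `optimizer.step()`, since non-finite values usually appear in the gradients before they reach the weights. A minimal sketch (the linear model and data are just placeholders):

```python
import torch
import torch.nn as nn

# Toy setup; the model, data, and lr are illustrative only.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Inspect gradients before stepping; exploding or nan/inf values
# show up here before the weights themselves go bad.
for name, p in model.named_parameters():
    if not torch.isfinite(p.grad).all():
        print(f"non-finite gradient in {name}")

# Clipping the global gradient norm is a common mitigation.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

`torch.autograd.set_detect_anomaly(True)` can also help locate the exact operation that first produces a nan, at the cost of slower training.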
It’s difficult to say more without further details.


(Shiv) #3

@fmassa I have already tried a very small learning rate, but there is no effect.


#4

I also had an issue with nan, and it was caused by my error function (the one I call backward on) containing torch.sqrt (I wanted to measure my error as a Euclidean distance). After removing that part (so now I use classical MSE), it all works without problems.
I should point out, though, that I had the very same code/algorithm implemented in TensorFlow, with sqrt included, and it worked without any nan problems.
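The likely mechanism, for anyone hitting the same thing: the derivative of sqrt(x) is 1/(2·sqrt(x)), which is infinite at x = 0, so when the squared error reaches exactly zero the chain rule multiplies inf by 0 and produces nan. A small demonstration, with a perfect prediction to force the error to zero:

```python
import torch

# d/dx sqrt(x) = 1/(2*sqrt(x)) is inf at x = 0, so a zero error
# makes the RMSE gradient blow up: inf * 0 = nan in the chain rule.
pred = torch.zeros(3, requires_grad=True)
target = torch.zeros(3)

loss = torch.sqrt(((pred - target) ** 2).mean())
loss.backward()
print(pred.grad)  # tensor([nan, nan, nan])

# Adding a small epsilon inside the sqrt keeps the gradient finite.
pred2 = torch.zeros(3, requires_grad=True)
loss2 = torch.sqrt(((pred2 - target) ** 2).mean() + 1e-8)
loss2.backward()
print(torch.isfinite(pred2.grad).all())  # tensor(True)
```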


(Dillon Davis) #6

Were you using torch.sqrt on the output of MSELoss or did you remove torch.sqrt from somewhere in the source code of MSELoss? I’m having the same issue with MSELoss right now but I’m not explicitly using torch.sqrt myself.


#7

I didn’t use MSELoss at all. I have my own cost function, in which I use (or don’t use) torch.sqrt directly.
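For a custom Euclidean-distance cost like the one described, the usual fix is an epsilon guard inside the sqrt. This is a sketch of one way to write it, not the poster's actual function:

```python
import torch

def euclidean_loss(pred, target, eps=1e-8):
    """Mean Euclidean distance per row, with an epsilon guard so the
    sqrt gradient stays finite even when pred == target exactly."""
    return torch.sqrt(((pred - target) ** 2).sum(dim=1) + eps).mean()

# Worst case for plain sqrt: prediction equals target everywhere.
pred = torch.zeros(4, 3, requires_grad=True)
target = torch.zeros(4, 3)
euclidean_loss(pred, target).backward()
print(torch.isfinite(pred.grad).all())  # tensor(True)
```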