Weights getting 'nan' during training


(Shiv) #1

I am checking my weights every 10 epochs. I have tried Xavier and normal initialization of the weights and have varied the learning rate over a wide range. Also, the validation error either remains constant or increases slowly. Regardless of the setup, I am getting ‘nan’ in some filters by the 10th epoch. What could be the issue, and how do I solve it?

(1 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
… ⋱ …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
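For reference, a check like the one described can be written as a small helper that scans the model's parameters. This is only a sketch; the model here is a made-up stand-in, not the one from the question:

```python
import torch
import torch.nn as nn

# Hypothetical model; any nn.Module works the same way.
model = nn.Sequential(nn.Conv2d(1, 4, 3), nn.ReLU(), nn.Conv2d(4, 8, 3))

def find_nan_params(model):
    """Return the names of parameters that contain any NaN entries."""
    return [name for name, p in model.named_parameters()
            if torch.isnan(p).any().item()]

# Deliberately corrupt one filter to show the check firing.
with torch.no_grad():
    model[0].weight[0].fill_(float("nan"))

print(find_nan_params(model))  # ['0.weight']
```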


(Francisco Massa) #2

There can be several reasons.

  • Make sure your inputs are not uninitialized
  • Check that you don’t have exploding gradients, which can lead to nan/inf; a smaller learning rate can help here
  • Check that you don’t have a division by zero, etc.

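The gradient-explosion point above can be checked directly: inspect `p.grad` after `backward()` and before `optimizer.step()`, since non-finite values usually appear in the gradients before they reach the weights. A minimal sketch (the linear model and data are just placeholders):

```python
import torch
import torch.nn as nn

# Toy setup; the model, data, and lr are illustrative only.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Inspect gradients before stepping; exploding or nan/inf values
# show up here before the weights themselves go bad.
for name, p in model.named_parameters():
    if not torch.isfinite(p.grad).all():
        print(f"non-finite gradient in {name}")

# Clipping the global gradient norm is a common mitigation.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

`torch.autograd.set_detect_anomaly(True)` can also help locate the exact operation that first produces a nan, at the cost of slower training.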
It’s difficult to say more without further details.


(Shiv) #3

@fmassa I have already tried a very small learning rate, but there is no effect.


#4

I also had an issue with nan, and it was caused by my error function (the one I call backward on) containing torch.sqrt (I wanted to measure my error as a Euclidean distance). After removing that part (so now I use classical MSE), it all works without problems.
I should point out, though, that I had the very same code/algorithm implemented in TensorFlow, with sqrt included, and it worked without any nan problems.
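The likely mechanism, for anyone hitting the same thing: the derivative of sqrt(x) is 1/(2·sqrt(x)), which is infinite at x = 0, so when the squared error reaches exactly zero the chain rule multiplies inf by 0 and produces nan. A small demonstration, with a perfect prediction to force the error to zero:

```python
import torch

# d/dx sqrt(x) = 1/(2*sqrt(x)) is inf at x = 0, so a zero error
# makes the RMSE gradient blow up: inf * 0 = nan in the chain rule.
pred = torch.zeros(3, requires_grad=True)
target = torch.zeros(3)

loss = torch.sqrt(((pred - target) ** 2).mean())
loss.backward()
print(pred.grad)  # tensor([nan, nan, nan])

# Adding a small epsilon inside the sqrt keeps the gradient finite.
pred2 = torch.zeros(3, requires_grad=True)
loss2 = torch.sqrt(((pred2 - target) ** 2).mean() + 1e-8)
loss2.backward()
print(torch.isfinite(pred2.grad).all())  # tensor(True)
```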


(Dillon Davis) #6

Were you using torch.sqrt on the output of MSELoss or did you remove torch.sqrt from somewhere in the source code of MSELoss? I’m having the same issue with MSELoss right now but I’m not explicitly using torch.sqrt myself.


#7

I didn’t use MSELoss at all. I have my own cost function, in which I use (or don’t use) torch.sqrt directly.
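For a custom Euclidean-distance cost like the one described, the usual fix is an epsilon guard inside the sqrt. This is a sketch of one way to write it, not the poster's actual function:

```python
import torch

def euclidean_loss(pred, target, eps=1e-8):
    """Mean Euclidean distance per row, with an epsilon guard so the
    sqrt gradient stays finite even when pred == target exactly."""
    return torch.sqrt(((pred - target) ** 2).sum(dim=1) + eps).mean()

# Worst case for plain sqrt: prediction equals target everywhere.
pred = torch.zeros(4, 3, requires_grad=True)
target = torch.zeros(4, 3)
euclidean_loss(pred, target).backward()
print(torch.isfinite(pred.grad).all())  # tensor(True)
```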