I am checking my weights every 10 epochs. I have tried both Xavier and normal weight initialization and have varied the learning rate over a wide range. The validation error either remains constant or increases slowly. Regardless of the setup, I get ‘nan’ in some filters at the 10th epoch. What could be the issue, and how can I solve it?
(1 ,0 ,.,.) =
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
… ⋱ …
nan nan nan … nan nan nan
nan nan nan … nan nan nan
nan nan nan … nan nan nan
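One way to narrow this down is to check every parameter for non-finite values after each epoch (or each step) instead of every 10 epochs, so the first layer that goes bad can be identified. The following is only a minimal sketch: the helper `find_nonfinite_params` and the toy `model` are hypothetical stand-ins, not the network from the post, and one filter is corrupted on purpose to show what the check reports.

```python
import torch
import torch.nn as nn


def find_nonfinite_params(model):
    """Return the names of parameters that contain nan or inf values."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
    return bad


# Hypothetical toy model standing in for the real network.
model = nn.Sequential(
    nn.Conv2d(1, 2, kernel_size=3),
    nn.ReLU(),
    nn.Conv2d(2, 1, kernel_size=3),
)

# Corrupt one filter on purpose (illustration only).
with torch.no_grad():
    model[0].weight[1].fill_(float("nan"))

bad_names = find_nonfinite_params(model)
print(bad_names)
```

Calling the helper right after `optimizer.step()` shows at which iteration the weights first become non-finite. `torch.autograd.set_detect_anomaly(True)` may also help, since it makes the backward pass raise an error at the operation that produced the nan, at the cost of slower training.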
I also ran into nan, and in my case it was caused by my error function (the one I call backward() on), which contained torch.sqrt because I wanted to measure the error as a Euclidean distance. After removing the sqrt (so I now use plain MSE), everything works without problems.
I should point out, though, that I had the very same code/algorithm implemented in TensorFlow, with sqrt included, and it worked without any nan problems.
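A possible explanation for the torch.sqrt behaviour described above is that the derivative of sqrt(x) is 1 / (2 * sqrt(x)), which is infinite at x = 0. If the squared error is ever exactly zero, the chain rule multiplies that infinity by a zero gradient and yields nan. The sketch below reproduces this on a toy input; the function names and the `eps` value of 1e-8 are assumptions for illustration, not the poster's actual code.

```python
import torch


def grad_of_distance(eps=0.0):
    """Gradient of a Euclidean distance loss w.r.t. pred when pred == target."""
    pred = torch.tensor([1.0, 2.0], requires_grad=True)
    target = torch.tensor([1.0, 2.0])  # identical to pred, so squared error is 0
    # eps > 0 keeps the sqrt argument away from 0 (assumed stabilizer).
    dist = torch.sqrt(((pred - target) ** 2).sum() + eps)
    dist.backward()
    return pred.grad


grad_plain = grad_of_distance()         # sqrt evaluated at exactly 0
grad_eps = grad_of_distance(eps=1e-8)   # sqrt evaluated at a small positive value

print(grad_plain)
print(grad_eps)
```

With `eps=0.0` the gradient is nan, and a single optimizer step with it turns the affected weights into nan. Adding a small constant inside the sqrt keeps the gradient finite, so it is one way to keep a Euclidean distance loss, as an alternative to dropping the sqrt and using MSE as described above.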
Were you using torch.sqrt on the output of MSELoss or did you remove torch.sqrt from somewhere in the source code of MSELoss? I’m having the same issue with MSELoss right now but I’m not explicitly using torch.sqrt myself.