After finding the NaNs in the conv weight grad, I tried to compute the grad manually. I found only some Infs in the manually computed grad, but no NaNs; the other values (which are neither Infs nor NaNs) are very close to the autograd result.
I want to know why NaNs are produced by PyTorch. Thanks
NaN creation depends on the operations used and the backend. While e.g. cuDNN will propagate NaNs, it’s currently unclear which operation creates them in the first place. I’m unsure what your use case is for differentiating between Infs and NaNs, but feel free to post a minimal and executable code snippet reproducing the difference.
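In the meantime, a generic way to narrow it down (a sketch only, with a placeholder layer and random input rather than your actual model) is to run the backward pass with anomaly detection enabled, which raises an error at the first backward op that produces a NaN, and to rerun with cuDNN disabled to compare against the native kernels:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# placeholder layer and input; substitute your actual model and data
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False).to(device)
x = torch.randn(8, 3, 224, 224, device=device, requires_grad=True)

# anomaly detection raises an error at the backward op that first produces a NaN
# and prints the traceback of the corresponding forward op
with torch.autograd.detect_anomaly():
    out = conv(x)
    out.sum().backward()

# rerunning with cuDNN disabled compares against the native kernels,
# which shows whether the NaNs are backend-specific
with torch.backends.cudnn.flags(enabled=False):
    conv.zero_grad()
    out = conv(x)
    out.sum().backward()
```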
The operation is the Conv2d in the stem stage of ResNet.
The test steps were posted here
I don’t see any executable code snippet in the linked post, unfortunately, so I don’t know what’s causing the difference.
It may be difficult to reproduce; I just debugged it in my project and found the diff between the gradient from backward() and the value computed using the chain rule in the conv layer.
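Roughly, the comparison looked like the sketch below (simplified: it uses the ResNet stem conv with random input instead of my real training pipeline, and torch.nn.grad.conv2d_weight for the manual chain-rule computation):

```python
import torch
import torch.nn as nn
from torch.nn import grad as nn_grad

torch.manual_seed(0)
# ResNet stem conv; the input here is random, not my real data
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
x = torch.randn(8, 3, 224, 224, requires_grad=True)

out = conv(x)
loss = out.sum()
loss.backward()
autograd_grad = conv.weight.grad.detach().clone()

# manual chain-rule gradient: dL/dW computed from the input and dL/d(out)
grad_out = torch.ones_like(out)  # dL/d(out) for loss = out.sum()
manual_grad = nn_grad.conv2d_weight(x, conv.weight.shape, grad_out, stride=2, padding=3)

print("autograd grad has NaN:", torch.isnan(autograd_grad).any().item())
print("manual   grad has NaN:", torch.isnan(manual_grad).any().item())
print("autograd grad has Inf:", torch.isinf(autograd_grad).any().item())
print("manual   grad has Inf:", torch.isinf(manual_grad).any().item())
diff = autograd_grad - manual_grad
print("max abs diff over finite entries:", diff[torch.isfinite(diff)].abs().max().item())
```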
Thanks