After finding the NaNs in the conv weight grad, I tried to compute the grad manually. I found only some Infs in the manually computed grad, but no NaNs; the other values (which are neither Infs nor NaNs) are very close to the autograd result.
I want to know why NaNs are produced by PyTorch. Thanks
NaN creation depends on the operations used and the backend. While e.g. cuDNN will propagate NaNs, it’s currently unclear which operation creates them in the first place. I’m unsure what your use case is for differentiating between Infs and NaNs, but feel free to post a minimal and executable code snippet reproducing the difference.
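In the meantime, a generic way to narrow it down (a sketch only, with a placeholder layer and random input rather than your actual model) is to run the backward pass with anomaly detection enabled, which raises an error at the first backward op that produces a NaN, and to rerun with cuDNN disabled to compare against the native kernels:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# placeholder layer and input; substitute your actual model and data
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False).to(device)
x = torch.randn(8, 3, 224, 224, device=device, requires_grad=True)

# anomaly detection raises an error at the backward op that first produces a NaN
# and prints the traceback of the corresponding forward op
with torch.autograd.detect_anomaly():
    out = conv(x)
    out.sum().backward()

# rerunning with cuDNN disabled compares against the native kernels,
# which shows whether the NaNs are backend-specific
with torch.backends.cudnn.flags(enabled=False):
    conv.zero_grad()
    out = conv(x)
    out.sum().backward()
```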
The operation is the Conv2d in the stem stage of ResNet.
The test steps were posted here
I don’t see any executable code snippet in the linked post, unfortunately, so I don’t know what’s causing the difference.
It may be difficult to reproduce; I just debugged it in my project and found the diff between the gradient from backward() and the value computed using the chain rule in the conv layer.
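Roughly, the comparison looked like the sketch below (simplified: it uses the ResNet stem conv with random input instead of my real training pipeline, and torch.nn.grad.conv2d_weight for the manual chain-rule computation):

```python
import torch
import torch.nn as nn
from torch.nn import grad as nn_grad

torch.manual_seed(0)
# ResNet stem conv; the input here is random, not my real data
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
x = torch.randn(8, 3, 224, 224, requires_grad=True)

out = conv(x)
loss = out.sum()
loss.backward()
autograd_grad = conv.weight.grad.detach().clone()

# manual chain-rule gradient: dL/dW computed from the input and dL/d(out)
grad_out = torch.ones_like(out)  # dL/d(out) for loss = out.sum()
manual_grad = nn_grad.conv2d_weight(x, conv.weight.shape, grad_out, stride=2, padding=3)

print("autograd grad has NaN:", torch.isnan(autograd_grad).any().item())
print("manual   grad has NaN:", torch.isnan(manual_grad).any().item())
print("autograd grad has Inf:", torch.isinf(autograd_grad).any().item())
print("manual   grad has Inf:", torch.isinf(manual_grad).any().item())
diff = autograd_grad - manual_grad
print("max abs diff over finite entries:", diff[torch.isfinite(diff)].abs().max().item())
```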
Thanks