I’m running into the same NaN softmax issue in a modified version of VGG11. I have logged the offending input tensor (no NaNs or non-finite values), the corresponding output (all NaN), and the loss (NaN). I have passed the offending input tensors directly to the network one at a time, with grad enabled, and am unable to reproduce the issue on either CPU or GPU; I can only reproduce it when running my full training loop on GPU.
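For reference, this is roughly the kind of check I’ve been running to pinpoint where the NaNs first appear (a minimal sketch, not my actual code; `BadLayer` is just a toy module that injects a NaN to show the hooks firing):

```python
import torch
import torch.nn as nn

def register_nan_hooks(model):
    """Attach forward hooks that report any module whose output
    contains a NaN or Inf. Returns the hook handles for later removal."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"Non-finite output from {module.__class__.__name__}")
    return [m.register_forward_hook(hook) for m in model.modules()]

# Toy module that deliberately injects a NaN so the hooks have something to catch
class BadLayer(nn.Module):
    def forward(self, x):
        x = x.clone()
        x[0] = float("nan")
        return x

net = nn.Sequential(nn.Linear(4, 4), BadLayer(), nn.ReLU())
handles = register_nan_hooks(net)
out = net(torch.randn(2, 4))
print(torch.isfinite(out).all().item())  # → False
for h in handles:
    h.remove()
```

In the real training loop I also tried wrapping the backward pass with `torch.autograd.set_detect_anomaly(True)` to get a traceback at the op that first produces a NaN gradient, though that slows training down considerably.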
The modifications I’ve made to VGG are swapping out all ReLU layers for an exponential ReLU, which can be found here: https://github.com/briardoty/allen-inst-cell-types/blob/da39c554c7147e32c813dd51722055982bdfb826/modules/ActivationFunctions.py#L129. The NaN softmax issue occurs whether I implement my custom activation function as an nn.Module or as a torch.autograd.Function with my own backward().
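One failure mode I’ve been considering, since the activation involves an exponential: in float32, `exp()` overflows to inf for inputs above roughly 88, and a single inf can turn into NaN downstream (e.g. inf − inf inside softmax’s max-subtraction). A quick sketch of just that mechanism (not my actual activation):

```python
import torch

# float32 max is ~3.4e38; exp(88) is still finite, exp(89) overflows to inf
x = torch.tensor([10.0, 88.0, 89.0], dtype=torch.float32)
y = torch.exp(x)
print(y)  # last entry is inf

# An inf reaching softmax yields NaN: softmax subtracts the max,
# so the inf entry becomes inf - inf = nan
logits = torch.tensor([1.0, float("inf")])
probs = torch.softmax(logits, dim=0)
print(probs)  # contains NaN
```

If this is what’s happening, it would also be consistent with the issue only showing up mid-training (once some pre-activation drifts large) and not when I feed the saved input tensor through the network in isolation.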
Here is a link to a script I attempted to use to reproduce the issue using a single copy of an offending input tensor: https://github.com/briardoty/allen-inst-cell-types/blob/master/nan_check.py.
I’m using PyTorch 1.4.0, cudatoolkit 10.0.130, and NumPy 1.18.5.
I’m quite baffled. Any help would be greatly appreciated, and I’m happy to provide any further details.