Getting NaN in the softmax Layer

awalkingstick · July 28, 2020, 9:04pm

I’m running into the same NaN softmax issue in a modified version of VGG11. I have logged the offending input tensor (no NaNs or non-finite vals), the corresponding output (all NaN) and loss (NaN). I have passed the offending input tensors directly to the network one at a time, with grad enabled, and am unable to reproduce the issue on either CPU or GPU. I can only reproduce the issue when running my training loop on GPU.

The modifications I’ve made to VGG are swapping out all ReLU layers for layers comprised of an exponential ReLU that can be found here: https://github.com/briardoty/allen-inst-cell-types/blob/da39c554c7147e32c813dd51722055982bdfb826/modules/ActivationFunctions.py#L129. The NaN softmax issue occurs whether I run with my custom activation function implemented as a nn.Module or as a torch.autograd.Function with my own implementation of backward().

Here is a link to a script I attempted to use to reproduce the issue using a single copy of an offending input tensor: https://github.com/briardoty/allen-inst-cell-types/blob/master/nan_check.py.

I’m using pytorch 1.4.0, cudatoolkit 10.0.130, and numpy 1.18.5.

I’m quite baffled. Any help would be greatly appreciated. Happy to provide any more necessary details.

Lin_Jia · May 17, 2022, 2:45am

I am not sure whether or not this will be helpful. One way for Softmax to output nan is that its direct input are all negative numbers with high absolute values, such as [-100000, -100000]
From a theoretical perspective, this should not happen because this means that the neural net has learned both output options are highly unsuitable for the input.

RTG8055 · March 29, 2023, 2:13pm

Hi, I am facing with the exact same issue, i checked my input tensor to the network it is finite and does not have any nan. I have saved my model checkpoint and tensor, how do I get any more information from it, my model defination looks like this:

nn.Sequential(
nn.Linear(*input_dims, fc1_dims),
nn.ReLU(),
nn.Linear(fc1_dims, fc2_dims),
nn.ReLU(),
nn.Linear(fc2_dims, fc3_dims),
nn.ReLU(),
nn.Linear(fc3_dims, n_actions),
nn.Softmax(dim=-1)
)

ptrblck · March 30, 2023, 1:49am

You could add debugging modules after each layer of your model to check which layer is creating the invalid outputs.
Something like this should work:

class PrintLayer(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, x):
        print("torch.isfinite(x).all(): {}, min. {:.5f}, max. {:.5f}".format(
            torch.isfinite(x).all(), x.min(), x.max()))
        return x


model = nn.Sequential(
    PrintLayer(),
    nn.Linear(16, 32),
    PrintLayer(),
    nn.ReLU(),
    PrintLayer(),
    nn.Linear(32, 64),
    PrintLayer(),
    nn.ReLU(),
    PrintLayer(),
    nn.Linear(64, 128),
    PrintLayer(),
    nn.ReLU(),
    PrintLayer(),
    nn.Linear(128, 10),
    PrintLayer(),
    nn.Softmax(dim=-1)
)

x = torch.randn(1, 16)
out = model(x)
# torch.isfinite(x).all(): True, min. -1.55479, max. 2.18836
# torch.isfinite(x).all(): True, min. -1.00142, max. 0.82056
# torch.isfinite(x).all(): True, min. 0.00000, max. 0.82056
# torch.isfinite(x).all(): True, min. -0.42594, max. 0.34994
# torch.isfinite(x).all(): True, min. 0.00000, max. 0.34994
# torch.isfinite(x).all(): True, min. -0.16251, max. 0.24222
# torch.isfinite(x).all(): True, min. 0.00000, max. 0.24222
# torch.isfinite(x).all(): True, min. -0.14961, max. 0.09479