nn.ReLU outputs nan on forward

Averu · December 7, 2020, 12:09pm

nn.ReLU randomly outputs NaN on forward. The problem appears only on GPU, not on CPU.

I captured the ReLU inputs and outputs. This happens randomly in different parts of my torchvision VGG_16bn backbone, but always in the first half of the layers.
For example, in one run where the output contained a single NaN, the input tensor had size [2, 64, 1056, 800] and the value (-0.70703125000000000000, device='cuda:0', dtype=torch.float16) had turned into NaN. This happened with the network in training mode; I have not tested evaluation mode.
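As a quick sanity check on that value, here is a minimal sketch using only Python's standard `struct` module (format `'e'` is IEEE 754 half precision). The helper name `half_bits` is made up for illustration. It only shows that -0.70703125 is an ordinary, exactly representable float16 number, so ReLU should map it to 0 rather than NaN; it says nothing about what the GPU kernel actually did.

```python
import struct

def half_bits(x: float) -> int:
    """Return the 16-bit pattern of x encoded as IEEE 754 half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

value = -0.70703125
bits = half_bits(value)

# Decode the bit pattern back to a Python float to check for rounding.
roundtrip = struct.unpack("<e", struct.pack("<e", value))[0]

# A half-precision NaN has all exponent bits set and a non-zero mantissa.
exponent = (bits >> 10) & 0x1F
mantissa = bits & 0x3FF
is_nan = exponent == 0x1F and mantissa != 0

# Expected ReLU result for a finite negative input.
relu_value = max(0.0, roundtrip)

print(hex(bits), roundtrip, is_nan, relu_value)
```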

When I loaded the same input tensor and ran F.relu on it, the output did not contain NaN.
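A minimal sketch of that offline check, assuming a small random stand-in tensor instead of the saved file (in practice `x` would come from `torch.load` on the saved input). The helper name `relu_nan_counts` is hypothetical. It runs `F.relu` on the CPU and, if one is available, on the GPU, and counts NaNs in each result.

```python
import torch
import torch.nn.functional as F

def relu_nan_counts(x: torch.Tensor) -> dict:
    """Run F.relu on each available device and count NaNs in the output."""
    devices = ["cpu"]
    if torch.cuda.is_available():
        devices.append("cuda")

    counts = {}
    for device in devices:
        t = x.to(device)
        if device == "cpu":
            # Half-precision ops are not available on CPU in every PyTorch
            # version, so the CPU reference is computed in float32.
            t = t.float()
        out = F.relu(t)
        counts[device] = int(torch.isnan(out).sum())
    return counts

# Stand-in for the captured input: same dtype, much smaller spatial size.
x = torch.randn(2, 64, 32, 32).half()
x[0, 0, 0, 0] = -0.70703125  # the value reported to have turned into NaN

counts = relu_nan_counts(x)
print(counts)
```

If every count is zero, as in my case, the saved input alone does not reproduce the problem.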

More project and system information:
I’m performing object detection on images using torchvision's FasterRCNN with my own custom dataset. Currently I’m replacing the image data with random tensors and keeping the targets from my dataset. I replace the images to make sure that the NaN is not caused by my image data. I doubt it could be caused by the target data. Both image and target data are transformed by torchvision.models.detection.transforms.GeneralizedRCNNTransform.

For FasterRCNN I have my own custom backbone that uses VGG_16bn as part of it. The learning rate does not seem to affect whether NaNs appear. I think they appeared even with a learning rate of 0.0000000005, and they always appear on the ReLU forward.

Currently I’m using GradScaler() and autocast() with my model on GPU, but the NaNs appeared before I started using these. Anomaly detection does not detect anything, as the error happens on forward.
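Since anomaly detection does not flag anything here, one alternative is to check every module's forward output with forward hooks. The following is a minimal sketch, not a definitive diagnostic: the helper name `add_nan_hooks` and the tiny stand-in model are made up for illustration, and the NaN is injected by hand just to show the hook firing.

```python
import torch
from torch import nn

def add_nan_hooks(model: nn.Module) -> list:
    """Register a forward hook on every submodule that raises on a NaN output."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            # Only plain tensor outputs are checked in this sketch.
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                raise RuntimeError(
                    f"NaN in forward output of '{name}' ({mod.__class__.__name__})"
                )
        handles.append(module.register_forward_hook(hook))
    return handles

# Tiny stand-in for one conv/bn/relu stage of a backbone.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(inplace=True),
)
handles = add_nan_hooks(model)

clean = torch.randn(2, 3, 16, 16)
model(clean)  # no NaN, so no hook raises

bad = clean.clone()
bad[0, 0, 0, 0] = float("nan")  # inject a NaN by hand
message = ""
try:
    model(bad)
except RuntimeError as err:
    message = str(err)
print(message)

# Remove the hooks once debugging is done.
for handle in handles:
    handle.remove()
```

The hooks fire in the order the modules finish their forward pass, so the first error names the earliest module whose output contained a NaN.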

Torch version: 1.7.0
Torchvision version: 0.8.1
CUDA version: 11.0
cuDNN version: 8
OS and setup: I’m using Windows 10 as my main OS. I have installed an Ubuntu WSL distribution in which I’m running a Docker container. The container runs Ubuntu 18.04. I connect to the container via localhost from Visual Studio Code on Windows 10. I also get the same errors when I run the code directly on Windows 10.

Code for capturing the NaN in the nn.ReLU forward:

def forward(self, input: Tensor) -> Tensor:
    # Keep a copy of the input, since inplace=True would overwrite it.
    myinput = input.clone().detach()
    if torch.isnan(input).any():
        exit("Relu input is nan")

    out = F.relu(input, inplace=self.inplace)

    if torch.isnan(out).any():
        # Save the NaN-containing output and the original input for inspection.
        torch.save(out, "/workspaces/masters/code/CRAFT-pytorch/debug_relu_cpu_out")
        torch.save(myinput, "/workspaces/masters/code/CRAFT-pytorch/debug_relu_cpu")
        exit("Relu output is nan")

    return out

The code exits on “Relu output is nan”.

What could cause the NaNs on the ReLU forward?

ptrblck · December 10, 2020, 7:01am

Could you use the latest nightly, remove mixed-precision training (as it seems to also occur using float32), and rerun the script, please?
How long does a usual run take to yield the first NaN output?