CUDA error: device-side assert triggered at 56th epoch

My model was training and converging fine on both the training and test sets, and then at the 56th epoch it suddenly threw a “device-side assert triggered” error.
Here’s an image:


I’m performing a binary segmentation task.

I have checked and confirmed that my input values to the loss function are indeed within the range of (0, 1) as the final layer of my segmentation network has a sigmoid function.
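A check along the following lines can be run right before the loss call on every batch, rather than on a few inspected ones. It is only a minimal sketch: `check_loss_inputs` is a hypothetical helper name, and it assumes the predictions and targets are float tensors. Note that a NaN fails the range requirement too, so a single bad batch is enough even if all other batches are within (0, 1).

```python
import torch


def check_loss_inputs(pred: torch.Tensor, target: torch.Tensor) -> None:
    """Raise a readable Python error if the BCE inputs are invalid.

    Hypothetical helper: checks for NaN/Inf first, then for values
    outside the closed interval [0, 1].
    """
    for name, t in (("pred", pred), ("target", target)):
        if not torch.isfinite(t).all():
            raise ValueError(f"{name} contains NaN or Inf values")
        if t.min() < 0 or t.max() > 1:
            raise ValueError(f"{name} has values outside [0, 1]")
```

Calling this as `check_loss_inputs(output, mask)` before the loss would turn an opaque device-side assert into an ordinary exception that names the offending tensor.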

I devised a loss function specific to my problem after some research. It is a combination of BCELoss, Hausdorff distance, and Dice loss, like so:

Both the input and the output of the model are in float32 precision.

Initially I presumed that the model might be outputting NaN values, but then again if it did, the error would be pointing towards the loss computing function and not the batch_loss.backward() method.
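One caveat on where the error points: CUDA kernels are launched asynchronously, so the Python stack trace for a device-side assert can land on a later call (such as `backward()`) instead of the operation that actually failed. A minimal sketch of the usual debugging setup is below; the environment variable has to be set before CUDA is initialized, and anomaly detection slows training noticeably, so both are meant for debugging runs only.

```python
import os

# Make CUDA kernel launches synchronous so the stack trace points at the
# failing operation. Set this before CUDA is initialized (safest: before
# importing torch), or export it in the shell instead.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Optional: have autograd report the forward operation that produced a NaN
# when it shows up during backward(). This adds overhead.
torch.autograd.set_detect_anomaly(True)
```

Running the failing step on the CPU is another option, since CPU errors are raised synchronously with a regular Python traceback.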

I tried to reproduce this error with just a few hundred samples, because it takes hours for the model to train on all samples and reach this point, but with the few hundred samples it ran fine and didn’t hit any error.
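Since iterating over the data pipeline alone is much cheaper than training, one way to look for a rare bad sample is to scan the augmented batches without the model. The sketch below assumes the loader yields a single tensor per batch; `find_nonfinite_batches` is a hypothetical helper name, not part of PyTorch.

```python
import torch


def find_nonfinite_batches(batches, max_batches=None):
    """Return the indices of batches that contain NaN or Inf.

    `batches` is any iterable of tensors, for example a DataLoader that
    yields augmented samples. No model or GPU is involved.
    """
    bad = []
    for i, batch in enumerate(batches):
        if max_batches is not None and i >= max_batches:
            break
        if not torch.isfinite(batch).all():
            bad.append(i)
    return bad
```

Because the augmentations are random, several passes over the full dataset may be needed before a bad batch shows up.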

Has anyone experienced this before?
What could be the issue? @ptrblck

After hours of debugging, I have finally discovered that the cause of this RuntimeError is one of the torchvision transforms (transforms.RandomPerspective). However, I still don’t know the reason why this transform gives NaN values.

Edit: it seems that the actual cause of the NaN values is the order in which the transforms are arranged:

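import numpy as np
import torch
from torchvision import transforms


class FirstChannelRandomInvert(object):
    """Randomly invert only the first (image) channel; the mask channels are left untouched."""

    def __init__(self, p):
        self.p = p
        self.invert_color = transforms.RandomInvert(p=self.p)

    def __call__(self, sample):
        # input shape: (4, 224, 224); the indexing below assumes an unbatched tensor
        # channel 0 is the image, the remaining channels are the mask
        image, mask = sample[0].unsqueeze(dim=0), sample[1:]
        image = self.invert_color(image)
        sample = torch.cat((image, mask), dim=0)
        return sample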


class FirstChannelRandomGaussianBlur(object):
    """With probability p, blur only the first (image) channel; the mask channels are left untouched."""

    def __init__(self, p, kernel_size=(5, 9), sigma=(0.1, 5)):
        self.p = p
        self.kernel_size = kernel_size
        self.sigma = sigma
        self.gaussian_blur = transforms.GaussianBlur(kernel_size=self.kernel_size, sigma=self.sigma)

    def __call__(self, sample):
        # input shape: (4, 224, 224); the indexing below assumes an unbatched tensor
        image, mask = sample[0].unsqueeze(dim=0), sample[1:]
        # uniform draw in [0, 1): apply the blur with probability p
        randn = np.random.rand()
        if randn < self.p:
            image = self.gaussian_blur(image)
        sample = torch.cat((image, mask), dim=0)
        return sample


data_transforms = transforms.Compose([
  transforms.RandomHorizontalFlip(p=0.5),
  transforms.RandomVerticalFlip(p=0.5),
  FirstChannelRandomInvert(p=0.5),
  FirstChannelRandomGaussianBlur(p=0.5),
  transforms.RandomRotation((0, 360)),
  transforms.RandomPerspective(distortion_scale=0.6, p=0.5)
])

I see nothing wrong with this, but when I change the order a bit, it stops producing NaN values.
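To narrow down which step first introduces the NaN values, the transforms can be applied one at a time with a finiteness check after each. This is a rough sketch, not a torchvision feature: `first_nonfinite_step` is a hypothetical helper, and it assumes every step takes and returns a tensor.

```python
import torch


def first_nonfinite_step(steps, sample):
    """Apply `steps` in order and report where NaN/Inf first appears.

    Returns (index, class name) of the first step whose output contains
    NaN or Inf, or None if every output is finite.
    """
    for i, step in enumerate(steps):
        sample = step(sample)
        if not torch.isfinite(sample).all():
            return i, type(step).__name__
    return None
```

With the pipeline above, this could be called as `first_nonfinite_step(data_transforms.transforms, sample)`, since `Compose` keeps its steps in the `transforms` attribute. The transforms are random, so it would need to be repeated over many samples, and the step that first outputs NaN is not necessarily the root cause if an earlier step already produced unusual values.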

Does this mean that the NaN values are no longer created if you swap the order of the transformations, or what does “sequence” refer to?

Good day.
Yes, if I swap the order of the transforms, the NaN values are no longer created.