Input is NaN after transformation


So I have a dataset class responsible for reading from two directories of different sizes. I have the following code for my dataset class:

    def __getitem__(self, idx):

        idx_ood = random.randint(0, 5374)  # random.randrange(len(self.ood_paths)) would avoid the hard-coded size

        # NOTE: the path lookups were truncated in the original post;
        # `image_paths` / `ood_paths` are placeholder names for however the file paths are stored
        image =[idx])
        image = image.convert('RGB')

        image_ood =[idx_ood])
        image_ood = image_ood.convert('RGB')

        label = np.zeros(len(self.PRED_LABEL), dtype=int)
        for i in range(len(self.PRED_LABEL)):
            # can leave zero if zero, else make one
            if self.df[self.PRED_LABEL[i].strip()].iloc[idx].astype('int') > 0:
                label[i] = self.df[self.PRED_LABEL[i].strip()].iloc[idx].astype('int')

        if self.transform:
            image = self.transform(image)
            image_ood_tr = self.transform(image_ood)
        else:
            # keep image_ood_tr bound even when no transform is given
            image_ood_tr = self.totensor(image_ood)

        if torch.any(torch.isnan(image_ood_tr)):
            print("NAN in ood input image!")

        return (image, label, self.df.index[idx]), (image_ood_tr, idx_ood, self.totensor(image_ood))

When I fetch data via the DataLoader, one of the image tensors contains a NaN element. While debugging, I checked the original image before the transform and it does not contain any NaNs. When I ran the code again, there was no NaN in the same image tensor.

Could this be a hardware issue? Or can you find any error?

I would be grateful for your help.

Could you check the tensors for invalid values before and after applying the transform and run a few epochs? Based on your description, the issue is not deterministically reproducible but appears randomly?
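To make those checks fail loudly, a small helper could be dropped into `__getitem__` after each stage; a minimal sketch (the helper name and messages are my own):

```python
import torch

def assert_finite(t: torch.Tensor, name: str) -> torch.Tensor:
    """Raise if the tensor contains NaN or Inf, naming the failing stage."""
    bad = torch.isnan(t) | torch.isinf(t)
    if bad.any():
        raise ValueError(f"{name}: {int(bad.sum())} invalid value(s)")
    return t

# inside __getitem__ it would be used as, e.g.:
#   image = assert_finite(self.transform(image), f"sample {idx} after transform")

# quick self-check on dummy tensors
clean = torch.rand(3, 8, 8)
assert_finite(clean, "clean")                 # passes silently
broken = clean.clone()
broken[0, 0, 0] = float("nan")
try:
    assert_finite(broken, "broken")
except ValueError as e:
    print(e)                                  # broken: 1 invalid value(s)
```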

Hi @ptrblck, yes you are absolutely correct. The issue is visible and random.

I did check the tensors before and after applying the transformation. The tensor before the transformation has no NaNs. Interestingly, when I applied the transformation manually in the debug console, there was no NaN either, yet the same transformation randomly produces NaNs during a normal run.

On top of that, the same tensor does not produce any NaNs in another run of the same code. I cannot explain it.

Do you think the issue could be caused by reading two datasets of different sizes (though it should not be the case)? Or can you recommend anything else to check?

I don’t think the issue is created by the sizes/length of the datasets.
Were you able to narrow down which transformation could potentially cause the issue?
Are you using torchvision.transforms or any other library?
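If it is torchvision, note that `transforms.Compose` simply calls each stage in sequence, so any callable can be interleaved between the real transforms to localize where the first NaN appears. A sketch with dummy callables standing in for real transforms:

```python
import torch

class NanCheck:
    """Interleave between pipeline stages to find where NaNs first appear."""
    def __init__(self, tag: str):
        self.tag = tag

    def __call__(self, x):
        if torch.is_tensor(x) and torch.isnan(x).any():
            raise RuntimeError(f"first NaN appeared after: {self.tag}")
        return x

# dummy callables stand in for real transforms
pipeline = [
    lambda t: t * 2.0,       NanCheck("scale"),
    lambda t: t / t.sum(),   NanCheck("normalize"),   # 0/0 -> NaN when sum == 0
]

x = torch.zeros(4)           # deliberately triggers the NaN in "normalize"
for stage in pipeline:
    try:
        x = stage(x)
    except RuntimeError as e:
        print(e)             # first NaN appeared after: normalize
        break
```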