Input is NaN after transformation

Hi,

So I have a dataset class responsible for reading from two directories of different sizes. I have the following code for my dataset class:

    def __getitem__(self, idx):
        # sample a random out-of-distribution (OOD) image index
        idx_ood = random.randint(0, 5374)

        image = Image.open(
            os.path.join(self.path_to_images, self.df.index[idx]))
        image = image.convert('RGB')

        image_ood = Image.open(self.ood_names[idx_ood])
        image_ood = image_ood.convert('RGB')

        # build the multi-label target vector from the dataframe columns
        label = np.zeros(len(self.PRED_LABEL), dtype=int)
        for i in range(len(self.PRED_LABEL)):
            # leave zero if zero, else store the positive label value
            if self.df[self.PRED_LABEL[i].strip()].iloc[idx].astype('int') > 0:
                label[i] = self.df[self.PRED_LABEL[i].strip()].iloc[idx].astype('int')

        if self.transform:
            image = self.transform(image)
            image_ood_tr = self.transform(image_ood)

        if torch.any(torch.isnan(image_ood_tr)):
            print("NAN in ood input image!")

        return (image, label, self.df.index[idx]), \
               (image_ood_tr, idx_ood, self.totensor(image_ood))

When I fetch data via the DataLoader, one of the images has a NaN element inside its tensor. While debugging, I checked the original image before the transform and it does not contain any NaN. When I ran the code again, there was no NaN in the same image tensor.

Could this be a hardware issue? Or can you find any error?

I would be grateful for your help.

Could you check the tensors for invalid values before and after applying the transform and run a few epochs? Based on your description it seems that this issue is not reproducible, but appears randomly?
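
A minimal sketch of such a check, assuming your transform ends in ToTensor() so its output is already a tensor (check_finite and the commented-out call sites are just for illustration):

    import torch
    import torchvision.transforms.functional as TF

    def check_finite(name, tensor):
        # flags NaN as well as +/-Inf values
        if not torch.isfinite(tensor).all():
            print(f"Invalid values (NaN/Inf) in {name}")

    # inside __getitem__, around the existing transform call:
    # check_finite("raw ood image", TF.to_tensor(image_ood))
    # image_ood_tr = self.transform(image_ood)
    # check_finite("transformed ood image", image_ood_tr)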

Hi @ptrblck, yes, you are absolutely correct. The issue appears randomly and is not reproducible.

I did check the tensors before and after applying the transformation. The tensor before the transformation has no NaNs. Interestingly, I opened the debug console and applied the transformation there, and there was no NaN. But the same transformation produces NaN randomly during a normal run.

On top of that, the same tensor does not produce any NaN in another run of the same code. I am unable to make sense of it.

Do you think the issue could be caused by reading two datasets of different sizes (though it should not be the case)? Or could you recommend anything else to check?

I don’t think the issue is caused by the sizes/lengths of the datasets.
Were you able to narrow down which transformation could potentially cause the issue?
Are you using torchvision.transforms or any other library?
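
If it helps, here is a rough sketch of how you could narrow it down by applying each step of a Compose pipeline separately and checking the intermediate result. The pipeline below is only a hypothetical example; substitute your own self.transform:

    import torch
    import torchvision.transforms as T
    import torchvision.transforms.functional as TF

    # hypothetical pipeline for illustration only
    transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ColorJitter(brightness=0.2, contrast=0.2),
        T.ToTensor(),
    ])

    def find_offending_transform(pil_image, compose):
        # apply each step on its own and report the first one producing NaN/Inf
        x = pil_image
        for t in compose.transforms:
            x = t(x)
            check = x if torch.is_tensor(x) else TF.to_tensor(x)
            if not torch.isfinite(check).all():
                print(f"Invalid values after {t.__class__.__name__}")
                return t
        return None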

Are there any updates on this?
I have a similar issue under DDP with Python 3.9.
I saved the tensors right before they become NaN, applied the same operations to them manually afterwards, and still hit the same issue. Could it be a known environment/system issue?

For a float input image, ColorJitter produced NaN values if the pixel values were not strictly between 0 and 1. I had a rough time figuring this out. You might be hitting a similar issue. It might behave differently for uint8 (0-255) images, though. .clamp(min=0, max=1) does the trick.
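
A small sketch of that workaround, assuming a recent torchvision where ColorJitter accepts tensor inputs and a float image whose values drifted slightly outside the valid range (the example values are made up):

    import torch
    import torchvision.transforms as T

    jitter = T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3)

    # float image with values slightly outside [0, 1] (made-up example)
    img = torch.rand(3, 224, 224) * 1.1 - 0.05

    # clamp before the jitter so every pixel stays strictly in [0, 1]
    out = jitter(img.clamp(min=0, max=1))
    print(torch.isfinite(out).all())  # expected: tensor(True)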
