You are right, it doesn’t make sense to me either.

My task is semantic segmentation with two input modalities: RGB and a corresponding segmented image. The RGB input remains unchanged, but the second modality becomes nan after some epochs. I first noticed this while tracing the following error:

RuntimeError: Function ‘LogSoftmaxBackward’ returned nan values in its 0th output.

To find the source of the nan, I searched for nan and inf in the sum of the model parameters, but the sum over all parameters stayed finite. Then I checked the inputs and outputs of all network layers, and it turns out that the network’s second input is the source of the nan.
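For anyone wanting to reproduce these two checks, here is a minimal sketch. It assumes a toy `nn.Sequential` model in place of my actual network, and the helper names `params_are_finite` and `attach_nan_hooks` are made up for illustration:

```python
import torch
import torch.nn as nn

def params_are_finite(model):
    """Return True if every parameter value is finite (no nan, no inf)."""
    return all(torch.isfinite(p).all().item() for p in model.parameters())

def attach_nan_hooks(model, report):
    """Register forward hooks that record which layer sees nan/inf first.

    `report` is a plain list; each entry is (layer_name, "input i" | "output").
    """
    def make_hook(name):
        def hook(module, inputs, output):
            for i, t in enumerate(inputs):
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    report.append((name, f"input {i}"))
            if torch.is_tensor(output) and not torch.isfinite(output).all():
                report.append((name, "output"))
        return hook

    handles = []
    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Toy stand-in for the real network (assumption, not the actual model).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
report = []
handles = attach_nan_hooks(model, report)

x = torch.randn(3, 4)
x[0, 0] = float("nan")  # simulate a corrupted input
model(x)

print("parameters finite:", params_are_finite(model))
print("first offending tensor:", report[0])

for h in handles:
    h.remove()  # detach the hooks once the check is done
```

If the first entry in `report` is an input of the first layer, the nan is already present in the data fed to the network rather than produced inside it.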

Regarding your suggestion, I checked the data by printing np.isnan(input) in get_item and by calling torch.isnan on the tensors taken from the dataloader. During training, the data doesn’t become nan in get_item, but after about 38 epochs the trainloader returns tensors that include nan values.
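The two checks look roughly like the sketch below. `ToyPairDataset` and `first_bad_batch` are hypothetical names, and the random arrays only stand in for my real RGB and segmentation data:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPairDataset(Dataset):
    """Stand-in dataset returning (rgb, seg) pairs; the data is random."""

    def __init__(self, n=8):
        rng = np.random.default_rng(0)
        self.rgb = rng.random((n, 3, 4, 4), dtype=np.float32)
        self.seg = rng.random((n, 1, 4, 4), dtype=np.float32)

    def __len__(self):
        return len(self.rgb)

    def __getitem__(self, idx):
        rgb, seg = self.rgb[idx], self.seg[idx]
        # Check 1: the raw numpy arrays, before they become tensors.
        if np.isnan(rgb).any() or np.isnan(seg).any():
            print(f"nan in sample {idx} inside __getitem__")
        return torch.from_numpy(rgb), torch.from_numpy(seg)

def first_bad_batch(loader):
    """Return the index of the first batch containing nan, or None."""
    for i, (rgb, seg) in enumerate(loader):
        # Check 2: the collated tensors, as the training loop receives them.
        if torch.isnan(rgb).any() or torch.isnan(seg).any():
            return i
    return None

loader = DataLoader(ToyPairDataset(), batch_size=4, shuffle=False)
print("first batch with nan:", first_bad_batch(loader))
```

In my case the first check stays silent while the second one fires after about 38 epochs.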

The second input has values in [0, 200000], and I normalize them to [0, 1].
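A minimal sketch of that normalization, assuming a division by the fixed upper bound 200000 (the function name `normalize_seg` and the extra finiteness check are illustrative, not my exact code):

```python
import numpy as np

SEG_MAX = 200000.0  # upper bound of the second input, as stated above

def normalize_seg(seg, max_value=SEG_MAX):
    """Scale values from [0, max_value] to [0, 1] and verify the result."""
    seg = np.asarray(seg, dtype=np.float32)
    out = seg / max_value
    # Fail loudly if the scaled array contains nan or inf.
    if not np.isfinite(out).all():
        raise ValueError("normalized second input contains nan or inf")
    return out

example = normalize_seg(np.array([0, 100000, 200000]))
print(example)
```

Dividing by a fixed constant cannot produce nan from finite values, so a check like this mainly confirms that the raw values are finite before they reach the dataloader.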

Finally, it’s worth mentioning that when I resume from the saved checkpoint, training runs for another 38 epochs.

Thanks in advance for helping me out.