Training data becomes nan after several epochs

Hi everyone,

In a semantic segmentation network, I use one type of input data that is normalized to [0, 1] and saved as pickle files. After 23 epochs, at least one sample of this data becomes NaN before it enters the network as input. Changing the learning rate makes no difference, but when I set bias to False in one of the convolutions, the NaN appears after 38 epochs instead.
Does anybody have an idea about the reason or how to fix it?

Well, as you can imagine, if it becomes NaN before entering the network, the problem is in the data preprocessing you are applying.

If you are normalizing between 0 and 1, you are probably dividing by the maximum value of the tensor.
If that value is zero, you get 0/0 = NaN.
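
To make the division-by-zero case concrete, here is a minimal sketch. It assumes the normalization divides each array by its own maximum, which the thread does not actually show. The helper name `normalize_01` and the `eps` threshold are hypothetical.

```python
import numpy as np


def normalize_01(x, eps=1e-12):
    """Scale a non-negative array into [0, 1] by its maximum.

    Hypothetical helper: returns zeros when the maximum is (near) zero
    instead of computing 0/0, which would yield NaN.
    """
    x = np.asarray(x, dtype=np.float32)
    max_val = float(x.max())
    if max_val < eps:
        return np.zeros_like(x)
    return x / max_val


all_zero = np.zeros((2, 2), dtype=np.float32)

# Naive division: 0/0 produces NaN (NumPy would also emit a RuntimeWarning,
# which is silenced here so the example runs quietly).
with np.errstate(invalid="ignore"):
    naive = all_zero / all_zero.max()

# Guarded division: the all-zero sample stays all-zero.
safe = normalize_01(all_zero)
```

Whether an all-zero sample should map to zeros or be rejected outright depends on the dataset; the guard only shows where the NaN would otherwise come from.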

Thanks for the quick reply.
That would explain it if my input were NaN from the first epoch. But these inputs have finite values for the first 22 epochs, and suddenly at epoch 23 some of them become NaN.

So let me clarify.

What you say doesn’t make sense to me. You state that your data is fixed and that the sample is NaN before feeding the network. Those options are mutually exclusive. Either your dataloading pipeline has some randomness (for example, you are loading segments of audio randomly picked and you can get a full-zeros one) or your data is becoming NaN inside the network.

I would recommend checking whether the loss is Inf or NaN. If you backpropagate a NaN or an Inf, all the weights in the network will become NaN.
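
A minimal sketch of such a loss check, written against a plain Python float so it does not depend on any framework. The function name `assert_finite_loss` is hypothetical, and the training loop it would sit in is not shown in the thread; with PyTorch the value passed in would typically be `loss.item()`.

```python
import math


def assert_finite_loss(loss_value, step):
    """Raise before backpropagation if the loss is NaN or +/-inf.

    `loss_value` is a plain float (assumption: the caller converts the
    framework's loss object to a float first).
    """
    if not math.isfinite(loss_value):
        raise FloatingPointError(
            f"non-finite loss {loss_value!r} at step {step}"
        )
    return loss_value
```

Calling this right before the backward pass stops training at the first bad step, so the weights are still intact for inspection.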

Are you reading some sort of data which can lead to issues? For example reading from buffers which may be empty, video readers which aren’t returning frames… don’t know.

Anyway, you can add a simple check in the __getitem__ function that looks for NaNs in your data and raises an exception with the path if there is an issue. This way you can tell whether the data itself is wrong or the dataloader is having trouble reading it.
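
The check described above could look roughly like this. It is a sketch under assumptions: the wrapped dataset returns a dict of NumPy arrays per index, and a list of file paths is available for the error message. The class name `CheckedDataset` and the toy file names are made up for illustration.

```python
import numpy as np


class CheckedDataset:
    """Wrap a dataset and fail loudly on samples containing NaN or inf.

    Assumptions (not from the thread): `dataset[idx]` returns a dict of
    NumPy arrays, and `paths[idx]` is the file the sample was loaded from.
    """

    def __init__(self, dataset, paths):
        self.dataset = dataset
        self.paths = paths

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        for key, value in sample.items():
            arr = np.asarray(value, dtype=np.float64)
            if not np.isfinite(arr).all():
                raise ValueError(
                    f"non-finite values in '{key}' of sample {idx} "
                    f"({self.paths[idx]})"
                )
        return sample


# Toy usage with in-memory samples; the second mask contains a NaN.
toy = [
    {"image": np.ones((2, 2)), "mask": np.zeros((2, 2))},
    {"image": np.ones((2, 2)), "mask": np.array([[0.0, np.nan], [1.0, 2.0]])},
]
checked = CheckedDataset(toy, paths=["a.pkl", "b.pkl"])
```

Indexing `checked[1]` raises a `ValueError` naming `b.pkl`, while `checked[0]` returns the sample unchanged.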

Oh, a less likely cause that I once ran into (and nobody managed to debug) was NaNs appearing after moving tensors to the GPU on a GTX 1080. So another, less likely, thing to check is the tensors before and after moving them to the GPU.

You are right, it doesn’t make sense to me either.

My task is semantic segmentation and I have two modalities: RGB and a corresponding segmented image. The RGB input remains unchanged, but the second modality becomes NaN after some epochs. I found this by tracing the following error:
RuntimeError: Function 'LogSoftmaxBackward' returned nan values in its 0th output.
To find the source of the NaN, I searched for NaN and Inf in the sum of the model parameters, but the sum of all parameters stayed finite. Then I checked the inputs and outputs of every network layer, and it turns out that the network’s second input is the source of the NaN.
Regarding your suggestion, I checked the data by printing np.isnan(input) in __getitem__ and torch.isnan after taking the tensors from the DataLoader. During training, the data is never NaN in __getitem__, but after about 38 epochs the train loader returns tensors that contain NaN values.

The second input data type has values in [0, 200000] and I normalize them into [0, 1].

Finally, it’s worth mentioning that after resuming from the saved checkpoint, training continues for another 38 epochs before the same thing happens.

Thanks in advance for helping me out

If you are using random transformations, could you check, if some of them might e.g. crop an invalid input? Are you able to reproduce the invalid values in the segmented image by using the DataLoader alone (without the model training) by just iterating it for 38 epochs?
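
A sketch of that experiment: iterate the loader on its own, with no model involved, and record where non-finite values show up. It assumes each batch is a dict and that the field of interest converts to a NumPy array (CPU tensors generally do). The function name `scan_loader` is hypothetical, and the key name `"mask"` follows the transform code quoted later in this thread.

```python
import numpy as np


def scan_loader(loader, num_epochs, key="mask"):
    """Iterate `loader` without a model and collect (epoch, batch) hits.

    Assumes every batch is a dict whose `key` entry can be converted to a
    NumPy array; nothing here is specific to a particular framework.
    """
    bad = []
    for epoch in range(num_epochs):
        for batch_idx, batch in enumerate(loader):
            values = np.asarray(batch[key], dtype=np.float64)
            if not np.isfinite(values).all():
                bad.append((epoch, batch_idx))
    return bad


# Stand-in for a DataLoader: any re-iterable collection of batches works.
fake_loader = [
    {"mask": np.zeros((1, 2, 2))},
    {"mask": np.full((1, 2, 2), np.nan)},
]
hits = scan_loader(fake_loader, num_epochs=2)
```

If the scan stays empty over as many epochs as training needed to fail, the loader alone is unlikely to be the whole story, and the interaction with the training process becomes the next thing to look at.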

Thanks for your response.

I checked what you suggested: without training, the DataLoader produces no NaN tensors. Isn’t that odd?

Should I do something to prevent the inputs from changing?

I would assume the inputs are not changed by the Dataset or DataLoader, as each sample is loaded and processed on the fly. Or do you expect them to change somehow?
Are you using any random transformations, which could potentially create the invalid outputs depending on the actual random values used in this transformation?

Similarly, I think the DataLoader shouldn’t change the segmented data, just as it doesn’t change the RGB data.

There is some randomness in my transforms:
composed_trn = transforms.Compose(
    [
        ResizeShorterScale(shorter_side, low_scale, high_scale),
        Pad(crop_size, [123.675, 116.28, 103.53], ignore_label),
    ]
)

Thanks to you, I found out that the padding and the interpolation inside the resizing create NaN values in the segmented input image. After removing these two transformations, the network works well!
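
One way to narrow a problem like this down to a single transform is to validate the sample after every step of the pipeline. The sketch below is illustrative only: `CheckedCompose`, `AddOne` and `BreakMask` are hypothetical names, and it assumes each transform maps a dict of NumPy arrays to a dict of NumPy arrays, like the image/mask samples in this thread.

```python
import numpy as np


class CheckedCompose:
    """Apply transforms in order and report the first one yielding NaN/inf."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)
            for key, value in sample.items():
                arr = np.asarray(value, dtype=np.float64)
                if not np.isfinite(arr).all():
                    raise ValueError(
                        f"{type(t).__name__} produced non-finite values "
                        f"in '{key}'"
                    )
        return sample


class AddOne:
    """Toy transform that keeps everything finite."""

    def __call__(self, sample):
        return {k: v + 1 for k, v in sample.items()}


class BreakMask:
    """Toy transform that deliberately injects a NaN into the mask."""

    def __call__(self, sample):
        mask = sample["mask"].astype(np.float64)  # astype returns a copy
        mask[0, 0] = np.nan
        return {"image": sample["image"], "mask": mask}


pipeline = CheckedCompose([AddOne(), BreakMask()])
sample = {"image": np.zeros((2, 2)), "mask": np.zeros((2, 2))}
```

Calling `pipeline(sample)` raises a `ValueError` that names `BreakMask` and the `mask` field, which is the kind of message that would point at a specific resize or pad step.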

Good to hear you were able to isolate the issue.
Where does ResizeShorterScale come from?

Thanks a lot!

It is defined in the open source code of RefineNet implemented by DrSleep:

import cv2
import numpy as np


class ResizeShorterScale(object):
    """Resize shorter side to a given value and randomly scale."""

    def __init__(self, shorter_side, low_scale, high_scale):
        assert isinstance(shorter_side, int)
        self.shorter_side = shorter_side
        self.low_scale = low_scale
        self.high_scale = high_scale

    def __call__(self, sample):
        image, mask = sample["image"], sample["mask"]
        min_side = min(image.shape[:2])
        # Draw a random scale, then enlarge it if the shorter side would
        # otherwise end up below the requested minimum.
        scale = np.random.uniform(self.low_scale, self.high_scale)
        if min_side * scale < self.shorter_side:
            scale = self.shorter_side * 1.0 / min_side
        # Bicubic interpolation for the image, nearest-neighbour for the mask.
        image = cv2.resize(
            image, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC
        )
        mask = cv2.resize(
            mask, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST
        )
        return {"image": image, "mask": mask}

Dear @ptrblck, although the network issue is solved, I still wonder why the DataLoader doesn’t produce NaN when there is no training, even with the same random seed and otherwise identical conditions. Do you have any idea?

No, unfortunately I don’t know what’s causing this issue. In the past I’ve seen a similar issue caused by a potentially broken MKL version in combination with an AMD CPU (the CPU transformations via numpy were using this library). You could try to check which backends the transformations are using and either change the versions of these libs or disable multi-threading etc. to isolate it further.
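
As a rough sketch of what checking the backends and disabling multi-threading could look like: the environment variables below are the usual thread-count switches for OpenMP, MKL and OpenBLAS. Which of these libraries the transformations actually use on a given machine is an assumption that the printed NumPy configuration helps verify.

```python
import os

# Limit common CPU math backends to one thread. These variables are read
# when the libraries are first loaded, so they need to be set before
# numpy (or cv2, torch) is imported in a fresh process.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported after the variables are set on purpose

# Print the BLAS/LAPACK build configuration NumPy was compiled against.
np.show_config()
```

OpenCV keeps its own thread pool, which `cv2.setNumThreads(0)` disables, and setting `num_workers=0` on the DataLoader takes worker processes out of the picture as well.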

OK, thank you so much.

I am experiencing the same issue.

My data preprocessing steps are fine as I checked the data beforehand. But randomly, after a certain epoch, my target data will become a bunch of nans. But when I catch and ignore these nans and allow it to continue, the next data sample will be fine.

Also, when I rerun the entire training, this random behaviour will happen at a completely different epoch/data sample. There is no consistency.
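
Since the failure moves between runs, one option is to capture the offending target at the moment it appears instead of trying to reproduce it later. A minimal sketch, assuming the target converts to a NumPy array; the helper name `dump_if_bad` and the file naming scheme are made up for illustration.

```python
import os
import tempfile

import numpy as np


def dump_if_bad(target, epoch, index, out_dir):
    """Save a non-finite target to disk and return its path, else None."""
    arr = np.asarray(target, dtype=np.float64)
    if np.isfinite(arr).all():
        return None
    path = os.path.join(out_dir, f"bad_target_e{epoch}_i{index}.npy")
    np.save(path, arr)
    return path


# Toy usage: one clean target and one containing a NaN.
out_dir = tempfile.mkdtemp()
ok_path = dump_if_bad(np.zeros(3), epoch=0, index=0, out_dir=out_dir)
bad_path = dump_if_bad(
    np.array([0.0, np.nan]), epoch=5, index=17, out_dir=out_dir
)
```

The saved array, together with the epoch and sample index in the file name, gives something concrete to compare against the same sample loaded fresh from disk.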