NaNs in every output after some batches

Hello,

I’ve read a lot of topics related to my problem, but I haven’t found a solution yet.
I have a large model that combines a ResNet (for image processing) and ULMFiT (for text processing), joined at their outputs.
When I start training the model, everything seems fine. But after some time (and a lot of batches) the model starts giving NaN as the loss value. The model has several loss functions, all of them either CrossEntropyLoss or BCEWithLogitsLoss; I add them up before loss.backward() to train several outputs (“heads”) of the model. The NaN loss comes from the model outputs: I receive outputs in which every value is NaN.
My data seems to be OK. It looks like the weights in the model become NaN after some time, but I can’t understand why. I’ve tried printing every batch tensor with images, texts and outputs, and everything is fine. I train the model on a big dataset (about 1.2 million images + texts).

I’ve tried to debug with torch.autograd.set_detect_anomaly(True) and I got this output:

2020-08-13 00:28:22 UTC -- tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:25 UTC -- Traceback (most recent call last):

2020-08-13 00:28:25 UTC --   File "train.py", line 104, in <module>

2020-08-13 00:28:25 UTC --     train(model=top_model, data_loader=dataloader, criterion_categories=criterion_cats, criterion_tg=criterion_tags, optimize=optimizer, sgd_shed=sgdr_partial, device=device)

2020-08-13 00:28:25 UTC --   File "/code/helper_functions.py", line 156, in train

2020-08-13 00:28:25 UTC --     loss.backward()

2020-08-13 00:28:25 UTC --   File "/root/.local/lib/python3.6/site-packages/torch/tensor.py", line 184, in backward

2020-08-13 00:28:25 UTC --     torch.autograd.backward(self, gradient, retain_graph, create_graph)

2020-08-13 00:28:25 UTC --   File "/root/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 115, in backward

2020-08-13 00:28:25 UTC --     allow_unreachable=True)  # allow_unreachable flag

2020-08-13 00:28:25 UTC -- RuntimeError: Function 'BinaryCrossEntropyWithLogitsBackward' returned nan values in its 0th output.

The first line with tensor(nan) is just the printed loss value. How can I check what is causing this problem?

Best regards

I don’t think there is enough information to know exactly what the problem is. Some of the loss output from before it goes to NaN would be helpful, but I assume the loss is increasing during training and exploding to inf, which then gives the NaN. If that’s the case, the first thing is to see whether it makes sense to clip your values.
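One common reading of "clip your values" is gradient clipping. The following is a minimal sketch of that, assuming a toy model and random data (both hypothetical, not the poster's setup); it only shows where torch.nn.utils.clip_grad_norm_ sits relative to backward() and step().

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model and data, only to make the sketch runnable.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8, 1)).float()

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()

# Rescale all gradients in place so their global L2 norm is at most max_norm.
# The return value is the total norm measured before clipping.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
optimizer.step()

print("gradient norm before clipping:", float(norm_before))
```

Clipping only helps when the gradients are large but still finite; it does not repair a NaN that is already present in the forward pass.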

Nope, the loss is not exploding. I was monitoring it, and the values were decreasing. It looks like this (the last few batch losses):

2020-08-13 00:28:22 UTC -- tensor(3.1548, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(3.5346, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(3.6171, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(3.0142, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(2.5788, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(2.6942, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(2.9119, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(3.5253, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(3.5954, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(1.8311, device='cuda:0', grad_fn=<AddBackward0>)

2020-08-13 00:28:22 UTC -- tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)

You should check your dataset and ensure there are no singularities in the input.

Just by doing normalization you may find a zero channel or something similar.
If it’s too big, add an epsilon. Another option is to keep a record of which samples produce NaNs, to check whether NaNs always appear for the same samples. You can even put a checker inside forward (it will be slower, but it’s just for debugging) that looks for NaNs and infs.

Let me know if the dataset is ok

I mean, everyone thinks the data is OK, but you have to be sure it’s truly OK, because a single wrong sample messes everything up.

If this sanity check really passes, have a look at the batch normalization statistics.
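The input checks suggested above could look roughly like this. It is a sketch under assumptions: the helper names (bad_sample_indices, normalize_per_channel) and the tensor shapes are made up for illustration, and the normalization is a generic per-sample, per-channel one, not necessarily the one used in the poster's pipeline.

```python
import torch

def bad_sample_indices(batch: torch.Tensor) -> list:
    """Return the indices of samples in a batch that contain NaN or inf."""
    flat = batch.reshape(batch.shape[0], -1)
    bad = ~torch.isfinite(flat).all(dim=1)
    return torch.nonzero(bad).flatten().tolist()

def normalize_per_channel(images: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each channel of each image; eps guards zero-variance channels."""
    mean = images.mean(dim=(2, 3), keepdim=True)
    std = images.std(dim=(2, 3), keepdim=True)
    return (images - mean) / (std + eps)

torch.manual_seed(0)
images = torch.rand(4, 3, 8, 8)
images[1, 0] = 0.5                  # constant channel: its std is exactly 0
images[3, 2, 0, 0] = float("nan")   # a single corrupted pixel

print("bad raw samples:", bad_sample_indices(images))
print("bad after normalization without eps:",
      bad_sample_indices(normalize_per_channel(images, eps=0.0)))
print("bad after normalization with eps:",
      bad_sample_indices(normalize_per_channel(images)))
```

The constant channel is perfectly valid raw data, yet it turns into NaN (0 / 0) once it is normalized without an epsilon, which is why checking only the raw inputs can miss this case.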

OK, I’ve checked, and it seems that the data is OK.
I’ve also added something like this to the forward pass:

# Flag NaN/inf in each sub-model's output; if the matching input is also bad, print it first.
if not torch.isfinite(x).all():
    if not torch.isfinite(x1).all():
        print('image')
        for item in x1:
            print(item)
    print('image_output')
    for item in x:
        print(item)

if not torch.isfinite(txt).all():
    if not torch.isfinite(txt_input).all():
        print('text')
        for item in txt_input:
            print(item)
    print('text_output')
    for item in txt:
        print(item)

And it printed:

text_output
torch.tensor(nan, nan, nan, nan (...), device='cuda:0')
(...)
torch.tensor(nan, nan, nan, nan (...), device='cuda:0')

So it seems that the input text was OK (no NaNs or infs), but the text model gave NaNs in all of its outputs? That is strange, because I froze the text model like this:

for r in self.text.parameters():
     r.requires_grad = False

So in my opinion it shouldn’t receive any gradients. Am I right?
What do you think about that? What can cause this problem?
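For reference, requires_grad = False stops gradients for those parameters, but it does not put the module in eval mode: dropout stays active and batch normalization layers keep updating their running statistics while the module is in train mode. The sketch below shows this on a small stand-in module (hypothetical, not the actual ULMFiT encoder); whether it applies to the model in this thread depends on which layers it contains.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a frozen sub-model; not the actual ULMFiT encoder.
frozen = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Dropout(p=0.5))
for p in frozen.parameters():
    p.requires_grad = False

x = torch.randn(16, 4)

# Train mode: the BatchNorm running statistics change even without gradients.
frozen.train()
mean_before = frozen[1].running_mean.clone()
frozen(x)
stats_changed_in_train = not torch.equal(mean_before, frozen[1].running_mean)

# Eval mode: statistics stay fixed and dropout is disabled, so outputs repeat.
frozen.eval()
mean_before = frozen[1].running_mean.clone()
out_a, out_b = frozen(x), frozen(x)
stats_changed_in_eval = not torch.equal(mean_before, frozen[1].running_mean)

print("running stats changed in train mode:", stats_changed_in_train)
print("running stats changed in eval mode:", stats_changed_in_eval)
```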

Hmmm,
So it’s not all about the gradient. If there is a NaN in the forward pass, it gets combined with the image features.

Once a NaN appears, it spreads everywhere. It’s not about the backward pass or the gradients, but about the forward pass.

Which NLP model are you using?
The key is identifying why you are getting this.

Does it always happen for the same sample(s)?
How are you encoding the text?

You can use a debugger to allow you to have a look once the NaN appears so that you can try to figure out what happened.

In different runs it happens at different moments, and since I turned off shuffling the data, it doesn’t seem to be tied to specific samples.

I’m using pretrained ULMFiT, I’m encoding text with sentencepiece.

Is it possible that after optimize.step(), weights in the model are getting NaN values? Hmm, but why is it possible if I set requires_grad = False in this text model? Maybe it’s not that.
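One way to answer that question directly is to scan the parameters and buffers for non-finite values after each optimizer step (or right after loading a state dict). This is a rough sketch; the helper name nonfinite_tensor_names and the toy model are assumptions for illustration.

```python
import torch
from torch import nn

def nonfinite_tensor_names(model: nn.Module) -> list:
    """Return names of parameters and buffers that contain NaN or inf."""
    names = []
    tensors = list(model.named_parameters()) + list(model.named_buffers())
    for name, tensor in tensors:
        # Integer buffers (e.g. BatchNorm's num_batches_tracked) are skipped.
        if tensor.is_floating_point() and not torch.isfinite(tensor).all():
            names.append(name)
    return names

# Hypothetical toy model, only to make the sketch runnable.
model = nn.Sequential(nn.Linear(3, 3), nn.BatchNorm1d(3))
clean = nonfinite_tensor_names(model)

# Simulate a corrupted weight.
with torch.no_grad():
    model[0].weight[0, 0] = float("nan")
corrupted = nonfinite_tensor_names(model)

print("before corruption:", clean)
print("after corruption:", corrupted)
```

Calling such a check once per iteration is slow on a large model, so it is probably only worth enabling while debugging.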

What would you suggest next?

So it comes from the text side.
Can you check whether any output contains NaNs when you run the text model under torch.no_grad() with the text model in eval mode?

(That is, call .eval() on the text model.)

(Obviously, since gradients won’t be computed, you will have to remove the text model from the optimizer.)

The idea behind this is to check whether some batch normalization layer is exploding during training.

EDIT: yeah, the typical issue is that once there is a NaN anywhere in the forward pass, it infects the gradients and the weights through backprop.

Another possibility: does the model contain dangerous operations such as divisions, logarithms, etc.?

It would be nice if you could detect which layer inside your model generates the NaN.
(You can use forward hooks to get intermediate outputs.)
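A minimal sketch of the forward-hook idea, assuming a toy model with a deliberately unsafe logarithm layer (the Log module and attach_nan_hooks helper are hypothetical names). Each leaf module gets a hook that records its name whenever its output contains NaN or inf, so the first recorded name points at the layer where the problem starts.

```python
import torch
from torch import nn

def attach_nan_hooks(model: nn.Module, report: list) -> list:
    """Register hooks on leaf modules that log names of modules with bad outputs."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, check leaf modules only

        def hook(mod, inputs, output, name=name):
            # Only top-level tensors are checked; nested tuples
            # (e.g. LSTM hidden states) are ignored in this sketch.
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and t.is_floating_point() \
                        and not torch.isfinite(t).all():
                    report.append(name)
                    break

        handles.append(module.register_forward_hook(hook))
    return handles

class Log(nn.Module):
    """Deliberately unsafe layer: log of a negative number is NaN."""
    def forward(self, x):
        return torch.log(x)

model = nn.Sequential(nn.Linear(4, 4), Log(), nn.Linear(4, 2))
with torch.no_grad():
    model[0].weight.zero_()
    model[0].bias.fill_(-1.0)   # first layer always outputs -1

report = []
handles = attach_nan_hooks(model, report)
model(torch.randn(8, 4))
for h in handles:
    h.remove()                  # remove hooks once debugging is done

print("first module with non-finite output:", report[0] if report else None)
```

Here the hooks fire in forward order, so the Log layer is reported first and the layer after it only because it received NaN inputs.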

I’m shocked.

I tried the same thing with torch.no_grad() and the text model in evaluation mode, and I got something like this:

2020-08-16 05:13:52 UTC -- 5435
2020-08-16 05:13:52 UTC -- tensor(2.2348, device='cuda:0', grad_fn=<AddBackward0>)
text_output
2020-08-16 05:13:52 UTC -- tensor([nan, nan, nan, nan, nan, nan, nan, nan, (...)], device='cuda:0')
(...)
2020-08-16 05:13:53 UTC -- 5436
2020-08-16 05:13:53 UTC -- tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)

The number 5435 is the iteration counter. So we can see that the NaNs first appeared in the text output, and only after that did the loss become NaN! Since this ran under torch.no_grad(), something must be wrong with the text model itself. But what could it be, if I trained it on the same data?
I’m loading a state dict into this text model, so maybe it is corrupted? I’ll try to train the text model once again and check whether that helps. If not, maybe there is something wrong with the sentencepiece model?

And one more thing: I’ve also tried running the model with training again, and once more I see that the problem appears in different places in the dataset (at different iterations). I’m running without shuffling, so I think it’s not connected to the data.

Best regards

Hi,
But the point is that once you use torch.no_grad(), eval mode and no shuffling, the behaviour is deterministic (the same samples must fail every time).
You can identify those samples and check whether they have some issue or not. Try training without them. Try to identify the first layer inside the text model that produces this output.

In the end, using the same data doesn’t mean you are using the same preprocessing. There are always subtle details that are not written in the paper.
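The determinism check described above can be sketched as follows: run the model in eval mode under torch.no_grad() over an unshuffled loader and record which batch indices produce non-finite outputs. The model and data here are hypothetical stand-ins; if two runs over the real data give different indices, something other than the samples is varying between runs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def failing_batch_indices(model: nn.Module, loader: DataLoader) -> list:
    """Return indices of batches whose model output contains NaN or inf."""
    model.eval()
    failing = []
    with torch.no_grad():
        for i, (batch,) in enumerate(loader):
            if not torch.isfinite(model(batch)).all():
                failing.append(i)
    return failing

class LogModel(nn.Module):
    """Hypothetical stand-in model that fails on negative inputs."""
    def forward(self, x):
        return torch.log(x)

torch.manual_seed(0)
data = torch.rand(8, 3) + 0.5   # strictly positive, so log is finite
data[5, 0] = -1.0               # one bad sample, which lands in batch 2
loader = DataLoader(TensorDataset(data), batch_size=2, shuffle=False)

model = LogModel()
first_run = failing_batch_indices(model, loader)
second_run = failing_batch_indices(model, loader)
print("failing batches:", first_run, "same on second run:", first_run == second_run)
```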

Do you use the apex package, or mixed or half precision, anywhere in your code? I had a similar (though not identical) problem with NaN outputs that was caused by these. I should mention that my loss function was not BCE or CE.
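To illustrate why half precision can produce NaNs even when the loss looks healthy: float16 cannot represent values above 65504, so a large activation overflows to inf, and operations such as inf - inf then give NaN. A small sketch (the values are arbitrary and not taken from the model in this thread):

```python
import torch

# Largest finite value representable in float16.
fp16_max = torch.finfo(torch.float16).max

values = torch.tensor([60000.0, 70000.0])
as_half = values.to(torch.float16)

# Arithmetic is done in float32 here so the sketch also runs on CPU builds
# with limited float16 support; the overflow already happened in the cast.
overflowed = torch.isinf(as_half.float())
diff = as_half.float() - as_half.float()   # inf - inf is NaN

print("float16 max:", fp16_max)
print("overflowed to inf:", overflowed.tolist())
print("x - x is NaN:", torch.isnan(diff).tolist())
```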