I’ve read a lot of topics related to my problem, but I haven’t found a solution yet.
I have a big model that combines a ResNet (for image processing) and ULMFiT (for text processing), joined at their outputs.
When I start training the model, everything seems fine. But after some time (and many batches), the loss starts coming out as NaN. The model has several loss functions, all of them CrossEntropyLoss or BCEWithLogitsLoss - I sum them up before calling
loss.backward() to train the model’s several outputs (“heads”). The NaN loss traces back to the model’s outputs - the model starts producing outputs that are all NaN.
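For context, the multi-head loss is computed roughly like this (a minimal sketch with illustrative names and shapes, not my exact code):

```python
import torch
import torch.nn as nn

criterion_cats = nn.CrossEntropyLoss()   # single-label "categories" head
criterion_tags = nn.BCEWithLogitsLoss()  # multi-label "tags" head

# Illustrative batch: logits from the two heads plus targets
cat_logits = torch.randn(4, 10, requires_grad=True)  # (batch, num_categories)
cat_targets = torch.randint(0, 10, (4,))             # class indices
tag_logits = torch.randn(4, 25, requires_grad=True)  # (batch, num_tags)
tag_targets = torch.randint(0, 2, (4, 25)).float()   # multi-hot labels

# The per-head losses are summed into one scalar before backward()
loss = criterion_cats(cat_logits, cat_targets) + criterion_tags(tag_logits, tag_targets)
loss.backward()
```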
My data seems to be OK. It looks like the model’s weights become NaN after some time, but I can’t understand why. I’ve printed every batch tensor (images, texts, and outputs), and everything looks fine. I train the model on a big dataset (about 1.2 million images + texts).
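To confirm it really is the weights going NaN, I can run a check like this after each optimizer step (a minimal sketch; the `Linear` stand-in and the function name are just for illustration):

```python
import torch
import torch.nn as nn

def first_nan_parameter(model: nn.Module):
    """Return the name of the first parameter containing NaN/Inf, or None."""
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            return name
    return None

# Tiny stand-in model to demonstrate the check
model = nn.Linear(3, 2)
assert first_nan_parameter(model) is None  # fresh weights are finite

# Corrupt a weight to show the check fires
with torch.no_grad():
    model.weight[0, 0] = float('nan')
print(first_nan_parameter(model))  # -> 'weight'
```

In the real training loop this would be called on the full combined model right after `optimizer.step()`, so the step that first corrupts the weights can be pinpointed.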
I’ve tried debugging with
torch.autograd.set_detect_anomaly(True) and got this output:
tensor(nan, device='cuda:0', grad_fn=&lt;AddBackward0&gt;)
Traceback (most recent call last):
  File "train.py", line 104, in &lt;module&gt;
    train(model=top_model, data_loader=dataloader, criterion_categories=criterion_cats, criterion_tg=criterion_tags, optimize=optimizer, sgd_shed=sgdr_partial, device=device)
  File "/code/helper_functions.py", line 156, in train
    loss.backward()
  File "/root/.local/lib/python3.6/site-packages/torch/tensor.py", line 184, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 115, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'BinaryCrossEntropyWithLogitsBackward' returned nan values in its 0th output.
The first line,
tensor(nan, device='cuda:0', grad_fn=&lt;AddBackward0&gt;), is just the printed loss value. How can I check what is causing this problem?
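One check I’m considering is a forward hook that flags the first module whose output goes non-finite, so I can see where inside the network the NaNs originate (a rough sketch; the helper name and the tiny demo network are illustrative):

```python
import torch
import torch.nn as nn

def add_nan_hooks(model: nn.Module):
    """Register hooks that raise as soon as any module emits NaN/Inf."""
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise RuntimeError(f"Non-finite output detected in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Demo on a tiny stand-in network
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
add_nan_hooks(net)

net(torch.randn(2, 4))  # fine: finite inputs, finite weights

with torch.no_grad():
    net[2].weight.fill_(float('nan'))  # corrupt the last layer
try:
    net(torch.randn(2, 4))
except RuntimeError as e:
    print(e)  # names the module that first produced a NaN
```

Because inner modules finish their forward pass before their parents, the hook on the innermost offending layer fires first, which is exactly the information the anomaly-detection traceback doesn’t give me.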