NaN while training the model: RuntimeError: Function 'PowBackward0' returned nan values in its 0th output

Hi All,

I am using this custom loss function:
loss = torch.mean(torch.square(torch.sqrt(y_true + 1e-10) - torch.sqrt(y_predict + 1e-10)) + 10*torch.square(torch.square(torch.sqrt(y_true + 1e-10) - torch.sqrt(y_predict + 1e-10))))
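
For readability, here is the same loss as a standalone function (a sketch; the argument order is assumed from the my_cost(outputs, labels) call in the traceback below):

```python
import torch

def my_cost(y_predict, y_true, eps=1e-10):
    # Difference of square roots; eps is meant to keep sqrt away from zero.
    d = torch.sqrt(y_true + eps) - torch.sqrt(y_predict + eps)
    # Squared term plus a heavily weighted fourth-power term.
    return torch.mean(torch.square(d) + 10 * torch.square(torch.square(d)))
```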

After some iterations, I am getting the error below.

[W python_anomaly_mode.cpp:60] Warning: Error detected in PowBackward0. Traceback of forward call that caused the error:
File "main.py", line 38, in <module>
main()
File "main.py", line 34, in main
train(dataloader_train=train_dl, dataloader_eval=valid_dl, model=model, hyper_params=train_params, device='cuda')
File "train_model.py", line 81, in train
loss = my_cost(outputs,labels)
File "train_model.py", line 15, in my_cost
loss = torch.mean(torch.square(torch.sqrt(y_true + 1e-10) - torch.sqrt(y_predict + 1e-10)) + 10*torch.square(torch.square(torch.sqrt(y_true + 1e-10) - torch.sqrt(y_predict + 1e-10))))
(function print_stack)
Traceback (most recent call last):
File "main.py", line 38, in <module>
main()
File "main.py", line 34, in main
train(dataloader_train=train_dl, dataloader_eval=valid_dl, model=model, hyper_params=train_params, device='cuda')
File "train_model.py", line 83, in train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
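
(For context: the "Traceback of forward call" warning above is printed because anomaly detection is enabled somewhere before training, along these lines:)

```python
import torch

# With anomaly detection on, autograd records the forward stack trace of every
# op and replays it when a backward function returns NaN, which is what
# produces the "Traceback of forward call" warning above.
torch.autograd.set_detect_anomaly(True)
```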

My final output comes after a ReLU activation, so I am sending only positive values to the sqrt function.

Maybe you could output the values of y_true and y_predict when the exception happens, to be sure.
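
For example, something along these lines in the training loop (a sketch; `outputs` and `labels` are the names from the traceback above):

```python
try:
    loss = my_cost(outputs, labels)
    loss.backward()
except RuntimeError:
    # Dump the tensors that fed the loss when backward blows up.
    print("y_true::", labels)
    print("y_predict::", outputs)
    print("mins:", labels.min().item(), outputs.min().item())
    raise
```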

Thank you for the reply, Tom.

I have printed y_true and the estimated output when the exception happened:

y_true::tensor([[[0.9508, 0.9464, 0.9941, ..., 0.1872, 0.4230, 0.4505],
[0.9412, 0.9590, 0.9167, ..., 0.0199, 0.0446, 0.0476],
[1.0088, 0.9939, 1.1853, ..., 0.0752, 0.1353, 0.1411],
...,
[1.0073, 1.0330, 1.0652, ..., 0.3139, 0.7555, 0.8156],
[0.9773, 0.9773, 0.9945, ..., 0.6462, 0.8663, 0.8789],
[1.0590, 1.0328, 1.0088, ..., 0.8305, 0.9132, 0.9175]]],
device='cuda:0')

y_predict::tensor([[[0.2288, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.1711],
[0.2288, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.1711],
[0.2288, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.1711],
...,
[0.2288, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.1711],
[0.2288, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.1711],
[0.2288, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.1711]]],
device='cuda:0', grad_fn=<...>)

So if you take the min() of them, that's positive (well, non-negative) too, right?
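
Something like this quick check, using the tensors printed above:

```python
# Both minima are >= 0, though y_predict bottoms out at exactly 0.0
# (ReLU output), so sqrt itself never sees a negative input.
print(y_true.min().item(), y_predict.min().item())
```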

Sorry Tom, I didn’t understand. Can you elaborate?