RuntimeError: Function 'SolveBackward' returned nan values in its 0th output

My loss became NaN, so I debugged the autograd pass with torch.autograd.set_detect_anomaly(True) and got the trace below.

....
  File "main.py", line 374, in compute_plane
    ca_cb, LU = torch.solve(B, A + eps)

Traceback (most recent call last):
  File "main.py", line 879, in <module>
    main()
  File "main.py", line 420, in main
    train(fk, train_loader, val_loader, model, optimizer, lr_scheduler, last_iter+1, tb_logger, criterion=loss_fun)
  File "main.py", line 544, in train
    loss.backward()
  File "/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function 'SolveBackward' returned nan values in its 0th output.
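For anyone hitting this later, here is a minimal sketch of how anomaly mode is enabled. The tensors are made-up stand-ins for the A and B in the question, and torch.linalg.solve stands in for torch.solve, which was removed in recent PyTorch releases:

```python
import torch

# Enable anomaly detection: autograd then checks every backward op for NaN
# and raises a RuntimeError naming the offending function (e.g. SolveBackward),
# with a forward-pass traceback pointing at the line that created the op.
torch.autograd.set_detect_anomaly(True)

# Made-up, well-conditioned stand-ins for A and B from the question.
A = (torch.eye(3) * 2.0).requires_grad_()
B = torch.ones(3, 1)

# torch.solve(B, A) in old PyTorch corresponds to torch.linalg.solve(A, B).
x = torch.linalg.solve(A, B)
loss = x.sum()
loss.backward()  # would raise here if any backward op produced NaN
```

Note that anomaly mode adds significant overhead, so it should only be on while debugging.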

Does anyone have a solution to this problem? Any help is appreciated, thanks.

Could you post the inputs to torch.solve that produce the NaNs?

GitHub issue for reference.
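One quick way to check the inputs yourself: NaN gradients from SolveBackward usually mean A is singular or very badly conditioned. A sketch, using a hypothetical singular A (torch.linalg.solve replaces the removed torch.solve):

```python
import torch

# Hypothetical stand-in for the A in the question; row 2 = 2 * row 1, so A is singular.
A = torch.tensor([[1.0, 2.0],
                  [2.0, 4.0]])
B = torch.eye(2)

# A huge (or infinite) condition number means solve() is numerically unstable
# and its backward pass can produce NaN/Inf gradients.
print(torch.linalg.cond(A))

# Adding a scalar eps to every entry, as in `torch.solve(B, A + eps)`, does not
# reliably fix this; regularizing the diagonal is the usual remedy:
eps = 1e-6
A_reg = A + eps * torch.eye(A.shape[-1])
x = torch.linalg.solve(A_reg, B)
```

Whether eps-on-the-diagonal is acceptable depends on the problem, of course; if A should never be singular, the bug is upstream of the solve.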


Finally, I found the problem, which turned out to be a mathematical error in my earlier code. Thank you all the same.