Hi,
I want to trace why the losses and PSNR values of my model are NaN. Some of the posts here suggested wrapping the code in with torch.autograd.detect_anomaly(True): After doing this, I get the error “RuntimeError: Function ‘UnsafeViewBackward0’ returned nan values in its 0th output.” I’m not sure where to start, since I do not know what UnsafeViewBackward0 is, or how to trace where the values became NaN.
Thanks!
_unsafe_view is a lower-level operation that can be called by other ops. One way forward is to log the ops that run, to get a sense of which other ops surround it:
import torch
from torch.testing._internal.logging_tensor import capture_logs_with_logging_tensor_mode

x = torch.randn([])
y = torch.randn([])
with capture_logs_with_logging_tensor_mode() as logs:
    # Every ATen op dispatched inside this block is recorded in `logs`.
    torch.empty([])
    x + y
print('\n'.join(logs))
This prints:
$0 = torch._ops.aten.empty.memory_format([], device=device(type='cpu'), pin_memory=False)
$3 = torch._ops.aten.add.Tensor($1, $2)
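Another way to narrow down where the NaN first appears is to check each module's output during a single forward pass. The following is only a sketch under stated assumptions: `find_nonfinite_modules`, `BadLog`, and the toy `nn.Sequential` model are hypothetical names made up for illustration, not part of PyTorch or of the model in the question. It also only inspects the forward pass, so a NaN that arises purely in backward (as the UnsafeViewBackward0 error may indicate) would not show up here.

```python
import torch
import torch.nn as nn


def find_nonfinite_modules(model, *inputs):
    """Run one forward pass and return the names of modules whose
    tensor output contains NaN or Inf, in the order their hooks fired."""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # Only plain tensor outputs are checked in this sketch.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always remove the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return bad


class BadLog(nn.Module):
    """Hypothetical faulty layer: takes the log of a non-positive value."""

    def forward(self, x):
        return torch.log(x - 1.0)


# Toy model (an assumption for illustration): Sigmoid keeps values in [0, 1],
# so BadLog always produces NaN or -inf.
model = nn.Sequential(nn.Linear(4, 4), nn.Sigmoid(), BadLog(), nn.Linear(4, 1))
bad = find_nonfinite_modules(model, torch.randn(2, 4))
print(bad)
```

The first name in the returned list is the earliest module whose output was non-finite (here the `BadLog` layer, named "2" by `nn.Sequential`). Later entries are usually just downstream modules that received the bad values, and the root module appears under the empty name.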