When I try to train a model, it crashes after several iterations with the following error:
loss.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (launch_kernel at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/cuda/CUDALoops.cuh:217)
After I set CUDA_LAUNCH_BLOCKING to True, the error message becomes:
error in nnd get grad: an illegal memory access was encountered.
...
What does nnd mean? Is it NaN?
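
For reference, here is a minimal sketch of one way to turn on synchronous kernel launches from inside a Python script. It assumes the variable is set before CUDA is initialized (safest is before importing torch) and uses "1", the conventionally documented value, rather than True. The `train()` call is a hypothetical placeholder for the training entry point, not something from the code above.

```python
import os

# Set this before the first CUDA call (safest: before importing torch);
# the CUDA runtime reads it at initialization, so setting it later may have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
# train()       # hypothetical training entry point
```

With launches made synchronous, the Python traceback should point at the operation that actually failed instead of a later, unrelated call.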