Error in nnd get grad: an illegal memory access was encountered

Mendel123 · December 24, 2020, 11:37am

When I try to train a model, it crashes after several iterations with the following error:

loss.backward()
  File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (launch_kernel at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/cuda/CUDALoops.cuh:217)

After I set CUDA_LAUNCH_BLOCKING to True, the error becomes:

error in nnd get grad: an illegal memory access was encountered.
...
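For anyone reproducing this, here is a minimal sketch of one way to turn on blocking kernel launches from inside the training script, assuming the variable is not already exported in the shell. The value conventionally used is "1", and it has to be in place before CUDA is initialized, so the sketch sets it before torch is imported. The commented-out lines are placeholders, not the original training code.

```python
import os

# Make CUDA kernel launches synchronous so the reported error points at the
# call that actually failed. This must happen before CUDA is initialized,
# so set it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
# ... build the model and run the training loop as usual (placeholder) ...

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same effect can be had by prefixing the launch command in the shell, e.g. `CUDA_LAUNCH_BLOCKING=1 python train.py`, where `train.py` is a hypothetical script name.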

What does nnd mean? Is it NaN?

ptrblck · January 5, 2021, 9:07am

Are you using the latest PyTorch release (1.7.1)? If not, could you update to it and rerun your script?
If you are already on the latest version, could you post a minimal code snippet to reproduce this issue, along with your current setup, i.e. the GPU used, the CUDA and cuDNN versions, etc.?
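The setup details asked for above can be collected with a few lines of Python. This is only a sketch: the helper name `describe_setup` is made up for illustration, and it falls back gracefully if torch is not installed or no GPU is visible.

```python
import platform


def describe_setup():
    """Collect the version info usually requested in a bug report."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda               # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["gpu"] = (
        torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
    )
    return info


setup = describe_setup()
for key, value in setup.items():
    print(f"{key}: {value}")
```

PyTorch also ships a more complete report via `python -m torch.utils.collect_env`, which is usually the easiest thing to paste into a thread like this.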

Mendel123 · January 13, 2021, 10:12am

Sorry for the late reply… I was using PyTorch 1.5 before, and after I changed it to 1.6.0, the error disappeared. Thanks very much.