CUDA runtime error for mixed precision training

chinmay5 · February 11, 2022, 4:43pm

I am running into a strange issue when using mixed precision training with gradient scaling. The PyTorch version is 1.9, CUDA is 11.1, and I am using torchio for augmentation. I am not sure what this error indicates or how to go about debugging it. If I disable mixed precision training, the error does not occur. Any insights would be really helpful.

Here is the error message when running with CUDA_LAUNCH_BLOCKING=1:

File "/home/chinmayp/setup/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered

Thanks,
Chinmay Prabhakar

ptrblck · February 11, 2022, 4:46pm

Could you update to the latest stable or nightly release and check whether you are still hitting this issue?
If so, could you post a minimal, executable code snippet to reproduce the error?

chinmay5 · February 11, 2022, 5:56pm

Updating the CUDA version solved the issue. Thank you so much. But what are the general steps to follow in such debugging scenarios? There are cases where bumping the CUDA version is not desirable.

ptrblck · February 11, 2022, 6:12pm

The general steps for debugging an illegal memory access would be:

1. Update PyTorch to the nightly version and check whether this is a known and already fixed issue.
2. If that doesn’t help, rerun the code with CUDA_LAUNCH_BLOCKING=1 and check which operation is failing, to narrow down whether it’s coming from the framework or from e.g. a custom extension.
3. If you get stuck, create an issue and ping me on it.
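
As a rough illustration of the CUDA_LAUNCH_BLOCKING step, here is a minimal sketch of a wrapper that reruns a script with the variable set. The helper name run_with_blocking_launches and the script name train.py are hypothetical and not from this thread; the only thing taken from the discussion is the CUDA_LAUNCH_BLOCKING=1 setting itself. It uses only the Python standard library, so it should also work on the Python 3.6 environment shown in the traceback.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd):
    """Run cmd (a list of strings) with CUDA_LAUNCH_BLOCKING=1 in its environment.

    Hypothetical helper for illustration. Returns the CompletedProcess so the
    caller can inspect the return code, stdout, and stderr.
    """
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    # With this variable set, CUDA kernel launches are synchronous, so the
    # Python stack trace should point at the operation that actually failed
    # instead of at a later, unrelated call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )


# Example usage (assumes a script named train.py exists; replace with your own):
#   result = run_with_blocking_launches([sys.executable, "train.py"])
#   print(result.stderr[-2000:])  # the tail of stderr usually holds the traceback
```

The variable has to be in the environment before the process initializes CUDA, which is why the sketch sets it for a fresh subprocess rather than changing it midway through a running program. Expect the run to be noticeably slower, so this is a debugging aid and not something to leave enabled.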