During backward() | CUDA error: an illegal memory access was encountered

Hello,

I am encountering a CUDA error: "an illegal memory access was encountered", and I do not know how to deal with it.
When I googled this issue, I found a suggestion that the problem might arise from multiple users accessing the same GPU card. However, I am the sole user of the card (I checked nvidia-smi several times).
I also found that this issue may be caused by apex. I did use torch amp; however, even after removing amp and torch.autocast, the problem persisted.
Could you tell me any possible causes of this problem, or how to avoid it?
Thank you.

Traceback (most recent call last):
  File "/home/uu/@Research/BlockHH/main.py", line 130, in <module>
    main(config)
  File "/home/uu/@Research/BlockHH/main.py", line 100, in main
    trainer()
  File "/home/uu/@Research/BlockHH/trainer/trainer.py", line 243, in __call__
    scaler.scale(loss).backward()
  File "/home/uu/.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/uu/.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

That's the first time I've heard this suggestion, but no: multiple processes or users sharing a GPU do not cause memory violations in your script.

Could you run your code via compute-sanitizer python script.py and check if it can detect the kernel causing the issue?
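For reference, synchronous error reporting (the CUDA_LAUNCH_BLOCKING=1 hint in the traceback) can also be enabled from inside the script, as long as it is set before any CUDA work happens; a minimal sketch:

```python
import os

# Force synchronous kernel launches so the Python stack trace points at the
# op that actually faulted, rather than a later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and run all CUDA calls) only after setting this
```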

Hello,
Thank you.
I ran my code as you suggested.
The error message is as follows (Error: No attachable process found.).
By the way, I am training a neural network (a CNN). If I set initial_channel=32, this issue does not arise; however, it does arise whenever initial_channel is greater than 32.

========= COMPUTE-SANITIZER
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.
Traceback (most recent call last):
  File "/home/uu/@Research/BlockHH/main.py", line 130, in <module>
    main(config)
  File "/home/uu/@Research/BlockHH/main.py", line 100, in main
    trainer()
  File "/home/uu/@Research/BlockHH/trainer/trainer.py", line 243, in __call__
    scaler.scale(loss).backward()
  File "/home/uu/.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/uu/.pyenv/versions/3.11.1/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Could you post a minimal and executable code snippet reproducing the issue so that we could try to isolate the failing kernel?

Hello,

After several days of debugging, we have finally identified the problem.
The error originates from the function:
torch.linalg.solve()
During back-propagation the gradient turns into NaN, and I suspect an overflow there leads to the "an illegal memory access was encountered" error.
I opted for using inv() instead, which works well.
Thank you.
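
For anyone debugging a similar NaN gradient: autograd anomaly detection re-runs backward with extra checks and raises an error naming the forward op whose backward produced the bad value. A minimal sketch with a hypothetical solve-based forward pass (shapes and tensors are made up for illustration):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for a layer that solves a linear system A x = b.
A = torch.randn(4, 4, device=device, requires_grad=True)
b = torch.randn(4, 1, device=device)

# Anomaly mode slows training, so enable it only while debugging; if a
# NaN/inf gradient appears, the error points at the responsible forward op.
with torch.autograd.detect_anomaly():
    x = torch.linalg.solve(A, b)
    loss = x.sum()
    loss.backward()
```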


For anyone who has the same issue:
If your back-propagation involves inverse matrices, I recommend avoiding torch.linalg.solve() to prevent similar issues.
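
A minimal sketch of the substitution described above, with hypothetical shapes (the actual model is a CNN, so this only illustrates the API swap):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical batched linear systems A x = b inside the forward pass.
A = torch.randn(8, 4, 4, device=device, requires_grad=True)
b = torch.randn(8, 4, 1, device=device)

# Formulation that was failing during backward() in this thread:
# x = torch.linalg.solve(A, b)

# Workaround described above: form the explicit inverse and multiply.
x = torch.linalg.inv(A) @ b

loss = x.square().mean()
loss.backward()
```

Whether the explicit inverse is numerically preferable depends on the conditioning of A; here it simply sidesteps the backward path of torch.linalg.solve() that was misbehaving in this setup.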