Any way to debug this?
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
PyTorch 2.7.1, CUDA 11.8
I usually use cuda-memcheck. You can run it like this:
cuda-memcheck python your_script.py
Or enable synchronous error reporting:
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1  # build-time option: only has an effect when compiling PyTorch from source
export CUDA_DEVICE_WAITS_ON_EXCEPTION=1
# Then run your script
python your_script.py
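If you would rather not depend on the shell, the same setting can probably be applied from inside the script. This is only a minimal sketch, and it assumes the variable is set before CUDA is initialized (the safest place is before `import torch`); the torch import is left commented out so the snippet runs anywhere.

```python
import os

# Assumption: the CUDA runtime reads this variable when it initializes,
# so it has to be in the environment before torch touches the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set

# With blocking launches, the Python stack trace should point at the
# operation that actually launched the failing kernel.
launch_blocking = os.environ["CUDA_LAUNCH_BLOCKING"]
```

Expect the run to be noticeably slower with this enabled, so it is best kept for debugging sessions only.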
cuda-memcheck is deprecated, so use compute-sanitizer instead, and disable the caching allocator via PYTORCH_NO_CUDA_MEMORY_CACHING=1.
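As a rough sketch of how those two suggestions combine, here is a small launcher that runs a script under compute-sanitizer with the caching allocator disabled. The helper names (`sanitizer_command`, `sanitizer_env`, `run_under_sanitizer`) and `your_script.py` are placeholders for illustration, and it assumes `compute-sanitizer` from the CUDA toolkit is on your PATH.

```python
import os
import shutil
import subprocess
import sys


def sanitizer_command(script, tool="memcheck"):
    """Build the compute-sanitizer command line for a Python script."""
    # memcheck is the tool that reports out-of-bounds and misaligned accesses.
    return ["compute-sanitizer", "--tool", tool, sys.executable, script]


def sanitizer_env(base=None):
    """Copy the environment and turn off PyTorch's CUDA caching allocator."""
    env = dict(os.environ if base is None else base)
    env["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"
    return env


def run_under_sanitizer(script):
    """Run the script under compute-sanitizer and return its exit code."""
    if shutil.which("compute-sanitizer") is None:
        raise FileNotFoundError("compute-sanitizer not found on PATH")
    result = subprocess.run(sanitizer_command(script), env=sanitizer_env())
    return result.returncode
```

Calling `run_under_sanitizer("your_script.py")` should behave like running `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck python your_script.py` from the shell.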