CUDA error: an illegal memory access was encountered (pytorch CUDA extension)

I implemented xnor_gemm as a PyTorch CUDA extension.
When I run this GEMM in a small demo.py, there is no problem.
But I get a CUDA illegal memory access error when I call the same function inside the forward function of ALBERT (huggingface).

PyTorch version 1.4.0 (CUDA version 10.1)
CUDA nvcc 10.0
I tried other versions of PyTorch, such as 1.6.0, but the error still happens.

Here is the small demo

import torch
import xnor_cuda

def test():
    a = torch.ones(32,128).to(device='cuda')
    b = torch.ones(128,32).to(device='cuda')
    output1 = xnor_cuda.xnor_gemm(a,b)

    return 1

test()

Here is pseudocode for where the error happens; test() is the same as in the small demo:

class A:
    # stands in for the model class whose forward calls the extension
    def forward(self):
        test()

The error is

RuntimeError: CUDA error: an illegal memory access was encountered (copy_kernel_cuda at /pytorch/aten/src/ATen/native/cuda/Copy.cu:180)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f5316c95193 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x56e3912 (0x7f531c7cb912 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: <unknown function> + 0x1a2a41d (0x7f5318b1241d in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x1a266ff (0x7f5318b0e6ff in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x3e (0x7f5318b10bee in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x199af9d (0x7f5318a82f9d in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x56e2ed2 (0x7f531c7caed2 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #7: <unknown function> + 0x1a2a41d (0x7f5318b1241d in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x1a266ff (0x7f5318b0e6ff in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #9: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x3e (0x7f5318b10bee in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #10: <unknown function> + 0x436ecb8 (0x7f531b456cb8 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #11: <unknown function> + 0x199af9d (0x7f5318a82f9d in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #12: <unknown function> + 0x1cb836d (0x7f5318da036d in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #13: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) + 0x2a6 (0x7f5318da20d6 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #14: <unknown function> + 0x1ffdbf3 (0x7f53190e5bf3 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #15: <unknown function> + 0x3ce3db2 (0x7f531adcbdb2 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #16: <unknown function> + 0x204864a (0x7f531913064a in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #17: at::Tensor::to(c10::Device, c10::ScalarType, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x1fb (0x7f53622ca24b in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #18: at::print(std::ostream&, at::Tensor const&, long) + 0x917 (0x7f53189df5e7 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #19: <unknown function> + 0x366fd (0x7f52be2546fd in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/xnor_cuda-0.0.0-py3.6-linux-x86_64.egg/xnor_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #20: xnor_gemm_cuda(at::Tensor, at::Tensor) + 0x337 (0x7f52be25535a in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/xnor_cuda-0.0.0-py3.6-linux-x86_64.egg/xnor_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #21: xnor_gemm(at::Tensor, at::Tensor) + 0x57 (0x7f52be247417 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/xnor_cuda-0.0.0-py3.6-linux-x86_64.egg/xnor_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #22: <unknown function> + 0x2b6fd (0x7f52be2496fd in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/xnor_cuda-0.0.0-py3.6-linux-x86_64.egg/xnor_cuda.cpython-36m-x86_64-linux-gnu.so)
frame #23: <unknown function> + 0x32ae1 (0x7f52be250ae1 in /home/piaotairen/.conda/envs/piaoenv36/lib/python3.6/site-packages/xnor_cuda-0.0.0-py3.6-linux-x86_64.egg/xnor_cuda.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>


You could add assert statements to your custom CUDA extension and run the code with CUDA_LAUNCH_BLOCKING=1 python script.py args to get a proper error message, use cuda-memcheck to debug the illegal memory access, or run the code via cuda-gdb --args python script.py to get an interactive debugging session.
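
In the trace above, the error is raised in copy_, reached through at::print inside xnor_gemm_cuda. That fits how CUDA reports kernel faults asynchronously, at a later synchronizing call rather than at the launch that caused them. As a minimal sketch of the first suggestion, the helper below starts a script in a child process with CUDA_LAUNCH_BLOCKING=1 so that launches become synchronous. The names blocking_env and run_blocking are made up for illustration, and script.py is whatever script reproduces the failure.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base is None else base)
    # Kernel launches become synchronous, so an error is reported at the
    # launch that caused it instead of at a later copy or print.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script, *args):
    """Run `script` with the current Python interpreter and blocking launches.

    Hypothetical helper: equivalent to typing
    CUDA_LAUNCH_BLOCKING=1 python script.py args in a shell.
    """
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=blocking_env()).returncode
```

Setting the variable in a child process avoids having to set it before CUDA is initialized in the current one.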

Sorry for the late reply.
Thank you for your suggestions.
I initially wondered if it was a PyTorch version problem, but it was my fault :frowning:
In the end, I found that my kernel itself had a memory access error.
So thank you again. To anyone who runs into the same problem: check your own code first :slight_smile:
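
The thread does not say what the bug in the kernel was. As one hedged illustration of "check your code first", a kernel that works on a fixed-size demo can read out of bounds when a real model passes tensors with other shapes or memory layouts. The sketch below lists violated preconditions before the extension is called. The function name check_gemm_inputs and the pack=32 requirement (inner dimension a multiple of 32) are assumptions for illustration, not properties of the actual xnor_gemm kernel.

```python
def check_gemm_inputs(a, b, pack=32):
    """Return a list of problems that would make a @ b unsafe for the kernel.

    Works on torch tensors (uses .shape, .is_cuda, .is_contiguous()).
    `pack` is an assumed packing width, not taken from the real kernel.
    """
    problems = []
    if len(a.shape) != 2 or len(b.shape) != 2:
        problems.append("both inputs must be 2-D")
    elif a.shape[1] != b.shape[0]:
        problems.append("inner dimensions do not match")
    elif a.shape[1] % pack != 0:
        problems.append("inner dimension is not a multiple of %d" % pack)
    if not (a.is_cuda and b.is_cuda):
        problems.append("both inputs must be CUDA tensors")
    if not (a.is_contiguous() and b.is_contiguous()):
        problems.append("both inputs must be contiguous")
    return problems
```

Inside forward, something like `assert not check_gemm_inputs(a, b)` before `xnor_cuda.xnor_gemm(a, b)` would turn a silent out-of-bounds access into a readable Python error.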
