Strange CUDA error on a specific GPU: "CUDA error: unspecified launch failure"

Environment:

  • CUDA 9.0, cuDNN 7.4, PyTorch 1.0.0 or 0.4.1, Ubuntu 16.04 (nvidia-docker)
  • GPU: GTX 1080 Ti

This error occurs only on one specific GPU; my code works well on all other GPUs. Maybe it is a hardware issue :pensive:

I have tested the GPU memory with memtestG80 and got 0 errors :slight_smile:. The official MNIST example also runs fine on the problematic GPU.

I have also run other PyTorch projects on the problematic GPU: they work well with PyTorch 1.0.0 but fail with PyTorch 0.4.1.
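For what it's worth, below is a minimal sketch of the kind of check one could run to stress just the code path that cuda-memcheck complains about (an advanced-indexing assignment and its backward). The shapes, loop count, and device index are placeholders, not values from my project:

```python
import torch

# Hypothetical stress test for the advanced-indexing path
# (index_put_ in the forward, IndexPutBackward -> index in the backward),
# which is the kernel flagged by cuda-memcheck below.
device = torch.device("cuda:0")  # select the problematic card, e.g. via CUDA_VISIBLE_DEVICES

for _ in range(1000):
    base = torch.randn(4096, 128, device=device, requires_grad=True)
    idx = torch.randint(0, base.size(0), (8192,), device=device)
    val = torch.randn(8192, 128, device=device)
    out = base.clone()
    out[idx] = val            # advanced-indexing assignment (index_put_)
    out.sum().backward()      # backward runs IndexPutBackward, which indexes the grad
torch.cuda.synchronize()
print("indexing stress test finished without a CUDA error")
```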

Error messages from cuda-memcheck:

========= Invalid __global__ read of size 4
=========     at 0x00001958 in _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_17index_kernel_implINS0_10OpaqueTypeILi4EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESA_EUlPcSB_lE_EEvS7_SA_SA_RKT_EUliE_EEviT1_
=========     by thread (64,0,0) in block (2873,0,0)
=========     Address 0x7f4af9880d14 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/nvidia/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x22b40d]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x2625bdb]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x264333e]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so (_ZN2at6native16gpu_index_kernelI17__nv_dl_wrapper_tI11__nv_dl_tagIPFvRNS_14TensorIteratorEN3c108ArrayRefIlEES8_EXadL_ZNS0_17index_kernel_implINS0_10OpaqueTypeILi4EEEEEvS5_S8_S8_EELj1EEJEEEEvS5_S8_S8_RKT_ + 0x406) [0x20c6626]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x20c3f0e]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x20c4755]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2.so (_ZN2at6native5indexERKNS_6TensorEN3c108ArrayRefIS1_EE + 0x54f) [0x527e3f]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2.so (_ZNK2at11TypeDefault5indexERKNS_6TensorEN3c108ArrayRefIS1_EE + 0x8e) [0x823dee]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZNK5torch8autograd12VariableType5indexERKN2at6TensorEN3c108ArrayRefIS3_EE + 0x21f) [0x39da3f]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd9generated16IndexPutBackward5applyEOSt6vectorINS0_8VariableESaIS4_EE + 0x1df) [0x29beff]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine17evaluate_functionERNS0_12FunctionTaskE + 0x1d56) [0x268186]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE + 0xea) [0x268eca]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine11thread_initEi + 0xcc) [0x2656bc]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch_python.so (_ZN5torch8autograd6python12PythonEngine11thread_initEi + 0x2a) [0x2fb82a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
Traceback (most recent call last):
  File "main.py", line 180, in <module>
    main()
  File "main.py", line 176, in main
    exp.run()
  File "main.py", line 125, in run
    self.trainer.run(epoch, mode="Train")
  File ".../trainer.py", line 367, in run
    self.backward(loss)
  File ".../trainer.py", line 324, in backward
    loss.backward()
  File "/usr/local/lib/python2.7/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: unspecified launch failure
========= ERROR SUMMARY: 1 error

Can anyone help me to address the issue? Thanks!

More information:

========= Invalid __global__ read of size 4
=========     at 0x00005470 in _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_17index_kernel_implINS0_10OpaqueTypeILi4EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESA_EUlPcSB_lE_EEvS7_SA_SA_RKT_EUliE_EEviT1_
=========     by thread (96,0,0) in block (2588,0,0)
=========     Address 0x7f16ace49500 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/nvidia/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x22b40d]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x2625bdb]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x264333e]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so (_ZN2at6native16gpu_index_kernelI17__nv_dl_wrapper_tI11__nv_dl_tagIPFvRNS_14TensorIteratorEN3c108ArrayRefIlEES8_EXadL_ZNS0_17index_kernel_implINS0_10OpaqueTypeILi4EEEEEvS5_S8_S8_EELj1EEJEEEEvS5_S8_S8_RKT_ + 0x406) [0x20c6626]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x20c3f0e]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x20c4755]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2.so (_ZN2at6native5indexERKNS_6TensorEN3c108ArrayRefIS1_EE + 0x54f) [0x527e3f]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2.so (_ZNK2at11TypeDefault5indexERKNS_6TensorEN3c108ArrayRefIS1_EE + 0x8e) [0x823dee]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZNK5torch8autograd12VariableType5indexERKN2at6TensorEN3c108ArrayRefIS3_EE + 0x21f) [0x39da3f]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd9generated16IndexPutBackward5applyEOSt6vectorINS0_8VariableESaIS4_EE + 0x1df) [0x29beff]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine17evaluate_functionERNS0_12FunctionTaskE + 0x1d56) [0x268186]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE + 0xea) [0x268eca]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine11thread_initEi + 0xcc) [0x2656bc]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch_python.so (_ZN5torch8autograd6python12PythonEngine11thread_initEi + 0x2a) [0x2fb82a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Invalid __global__ read of size 4
=========     at 0x000036e8 in _ZN2at6native18elementwise_kernelILi128ELi4EZNS0_16gpu_index_kernelIZNS0_17index_kernel_implINS0_10OpaqueTypeILi4EEEEEvRNS_14TensorIteratorEN3c108ArrayRefIlEESA_EUlPcSB_lE_EEvS7_SA_SA_RKT_EUliE_EEviT1_
=========     by thread (96,0,0) in block (2556,0,0)
=========     Address 0x7f16ace3e108 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/local/nvidia/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x22b40d]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x2625bdb]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x264333e]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so (_ZN2at6native16gpu_index_kernelI17__nv_dl_wrapper_tI11__nv_dl_tagIPFvRNS_14TensorIteratorEN3c108ArrayRefIlEES8_EXadL_ZNS0_17index_kernel_implINS0_10OpaqueTypeILi4EEEEEvS5_S8_S8_EELj1EEJEEEEvS5_S8_S8_RKT_ + 0x406) [0x20c6626]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x20c3f0e]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2_gpu.so [0x20c4755]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2.so (_ZN2at6native5indexERKNS_6TensorEN3c108ArrayRefIS1_EE + 0x54f) [0x527e3f]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libcaffe2.so (_ZNK2at11TypeDefault5indexERKNS_6TensorEN3c108ArrayRefIS1_EE + 0x8e) [0x823dee]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZNK5torch8autograd12VariableType5indexERKN2at6TensorEN3c108ArrayRefIS3_EE + 0x21f) [0x39da3f]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd9generated16IndexPutBackward5applyEOSt6vectorINS0_8VariableESaIS4_EE + 0x1df) [0x29beff]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine17evaluate_functionERNS0_12FunctionTaskE + 0x1d56) [0x268186]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine11thread_mainEPNS0_9GraphTaskE + 0xea) [0x268eca]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch.so.1 (_ZN5torch8autograd6Engine11thread_initEi + 0xcc) [0x2656bc]
=========     Host Frame:/usr/local/lib/python2.7/dist-packages/torch/lib/libtorch_python.so (_ZN5torch8autograd6python12PythonEngine11thread_initEi + 0x2a) [0x2fb82a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
Traceback (most recent call last):
  File "main.py", line 180, in <module>
    main()
  File "main.py", line 176, in main
    exp.run()
  File "main.py", line 125, in run
    self.trainer.run(epoch, mode="Train")
  File ".../trainer.py", line 367, in run
    self.backward(loss)
  File ".../trainer.py", line 324, in backward
    loss.backward()
  File "/usr/local/lib/python2.7/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: unspecified launch failure
========= ERROR SUMMARY: 2 errors

"unspecified launch failure" usually indicates a driver-level failure, so it is most likely either an NVIDIA driver bug or a faulty GPU.


I had been looking for a fix to this problem until I saw @smth 's comment, which confirmed my suspicion about a faulty NVIDIA driver. Thank you for that information, btw!
I had a similar error in my case as well, but I didn't have multiple graphics cards to choose from; it occurred on my laptop with a GTX 1050 Ti. I am running PyTorch 1.3.1, CUDA 10.1, and cuDNN 7.6.4 on Windows 10. It turned out that my NVIDIA driver was the root of the problem, so a quick downgrade from 441.66 to 431.86 fixed it! The downside seems to be a longer runtime, for whatever reason.
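In case it helps anyone compare setups, a small sketch like this prints the versions PyTorch actually sees (device index 0 is just my single card; the installed driver version itself comes from nvidia-smi):

```python
import torch

# Print the CUDA/cuDNN versions PyTorch was built against and the device in use.
# Note: the installed NVIDIA driver version is reported by `nvidia-smi`, not by PyTorch.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
print("device:", torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
```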