CUDNN_STATUS_MAPPING_ERROR

Hi,
I’ve got the following error:

  File "/home/jfm/.local/lib/python3.6/site-packages/flerken/framework/framework.py", line 206, in backpropagate
    self.loss.backward()
  File "/home/jfm/.local/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/jfm/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
cuDNN error: CUDNN_STATUS_MAPPING_ERROR
Exception raised from operator() at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:980 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7e2106d1e2 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xebae82 (0x7f7e22390e82 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xebcdb5 (0x7f7e22392db5 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xeb800e (0x7f7e2238e00e in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xeb9bfb (0x7f7e2238fbfb in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0xb2 (0x7f7e22390152 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf1f35b (0x7f7e223f535b in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xf4f178 (0x7f7e22425178 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x1ad (0x7f7e5d2cd88d in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x223 (0x7f7e2238e823 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xf1f445 (0x7f7e223f5445 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0xf4f1d4 (0x7f7e224251d4 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #12: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7f7e5d2dc242 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x2ec9c62 (0x7f7e5ef9fc62 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2ede224 (0x7f7e5efb4224 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7f7e5d2dc242 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x258 (0x7f7e5ee26c38 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x3375bb7 (0x7f7e5f44bbb7 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f7e5f447400 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f7e5f447fa1 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::thread_init(int,std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f7e5f440119 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f7e6cbe04ba in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0xbd6df (0x7f7e6dd3c6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #23: <unknown function> + 0x76db (0x7f7e701f56db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #24: clone + 0x3f (0x7f7e7052ea3f in /lib/x86_64-linux-gnu/libc.so.6)

Just wanted to know what may cause this error.

Does it work with lower batch sizes? Also, can you run without CUDA and check? I have found that running on the CPU produces more informative errors.
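A generic sketch of what I mean by the CPU check (dummy model and data, not your pipeline):

    import torch
    import torch.nn as nn

    device = torch.device("cpu")  # force CPU execution
    model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
    x = torch.randn(2, 3, 32, 32, device=device, requires_grad=True)

    out = model(x)
    out.mean().backward()  # on the CPU, errors are raised synchronously with a readable message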

It’s a bit strange: the code runs for 2 epochs and then this error is raised. That’s why I would like to understand what can cause it.

Could you post an executable code snippet to reproduce this issue, as well as your setup (GPU, CUDA, cuDNN, and PyTorch versions), so that we could have a look at it?

Hi,
I’ve tried to reproduce it but I can’t.
I just know that disabling cuDNN fixes it.
It’s a very complex pipeline and I cannot identify the origin.
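(For reference, by “disabling cuDNN” I mean the standard global switch, roughly:)

    import torch

    # disable the cuDNN backend globally; convolutions fall back to native CUDA kernels
    torch.backends.cudnn.enabled = False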

BTW, is there anything useful in the error that gives me a hint about where to look?

So, it’s a bit strange. The code runs for 2 epochs and then (no matter how long the epochs are) it raises this error. I’m using the DALI library and I think it’s somehow related to instantiating DALI’s pipeline epoch-wise.
But I cannot really find what’s causing this.
I find it strange that it occurs in the backward pass.

So, if the error gives any clue, it would be really useful.

You could enable cuDNN API logging, rerun the script, and post the output here, so that I could debug the issue and see whether cuDNN is creating the crash or whether another sticky error is just being caught by cuDNN.
Also, if possible, rerun the script via CUDA_LAUNCH_BLOCKING=1 python script.py args, which should point to the line of code that causes the error.
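A minimal sketch of how both switches can be set (the cuDNN logging variables and CUDA_LAUNCH_BLOCKING have to be in the environment before the libraries are initialized, so either export them in the shell or set them at the very top of the script, before importing torch):

    import os

    # cuDNN API logging: record every cuDNN call and its return status
    os.environ["CUDNN_LOGINFO_DBG"] = "1"
    os.environ["CUDNN_LOGDEST_DBG"] = "cudnn_api.log"   # may also be "stdout" or "stderr"

    # synchronous kernel launches, so the stack trace points at the failing op
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # import torch only after the variables are in place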

You are going to kill me, but it seems that
CUDA_LAUNCH_BLOCKING=1
is not compatible with multi-GPU.
And I reach a deadlock when using
CUDNN_LOGDEST_DBG=filename.txt
The issue disappears if I run everything on a single GPU.
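(In case it matters, I restrict the run to a single GPU with the usual device masking; a rough sketch, the device id is just an example:)

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only one GPU to the process
    import torch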

So, thank you anyway for your help.

Sorry, I didn’t realize you are using DDP, so skip the CUDA_LAUNCH_BLOCKING=1 part (it might work, but could also create issues).
Are you sure you are running into a deadlock, or did a GPU die while the other processes were waiting for the dead process?
If you are using DDP, you could use ps -auxf and check the tree structure for a dead DDP process while the script is hanging.
If you see that a specific GPU-related process is indeed dead or hanging, you could send a SIGHUP to the main process, so that the logging might continue.
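A minimal sketch of that last step (the PID is a placeholder; take the parent PID of the hanging workers from the ps -auxf tree):

    import os
    import signal

    main_pid = 12345                  # placeholder: PID of the main DDP process
    os.kill(main_pid, signal.SIGHUP)  # ask the parent to shut down so logging can flush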

I understand that it’s not easy to create a reproducible code snippet, but if it were possible, I could take care of the debugging, as it’s not always trivial.