Autograd.detect_anomaly fixes CUDNN_STATUS_EXECUTION_FAILED

Hi, I am training a seq2seq RNN model and I keep getting CUDNN_STATUS_EXECUTION_FAILED errors. I’ve looked through numerous threads and solutions, and none of them apply here. I’ve checked that neither GPU memory nor RAM is running out, and I’ve triple-checked that my CUDA and cuDNN versions are compatible, but I still get the error. The strange thing is that when I use torch.autograd.detect_anomaly(), the problem goes away. Training is much slower because of all the extra checking that anomaly detection does, but it also takes care of my error. What does anomaly detection do that could remedy this?
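
For reference, this is roughly how I’m enabling it (a minimal sketch rather than my exact training loop; model, batch, target, and criterion are placeholders):

import torch

with torch.autograd.detect_anomaly():       # the error disappears when this context is active
    output = model(batch)                    # placeholder forward pass for my seq2seq model
    loss = criterion(output, target)
    loss.backward()                          # anomaly detection checks every backward op here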

Could you rerun your code via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and post the stack trace here, please?
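
If setting the variable on the command line is awkward (e.g. in Windows cmd.exe, where the VAR=value prefix doesn’t work), it can also be set inside the script itself, as long as that happens before the first CUDA call, for example:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before CUDA is initialized

import torch                               # import torch (and touch the GPU) only afterwards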

Thanks for the response. I’ve run it that way and it no longer gives an error. It has been running for two days now without throwing one. It does, however, take much longer to run, though I assume that’s expected since the launches are now synchronous.

@ptrblck Do you have any idea what could be causing this issue? Could it be a memory issue? If so, I’m not sure why I can fit a much larger batch size when I use CUDA_LAUNCH_BLOCKING.

I don’t know what could cause this issue.
Could you post an executable code snippet to reproduce this issue, so that we could debug it?

@ptrblck Unfortunately my project is large and deeply intertwined with a library I wrote, so it will take some time to put together a minimal reproduction. Here is the stack trace I get without CUDA_LAUNCH_BLOCKING:

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    if __name__ == "__main__": main()
  File "train.py", line 74, in main
    train_epoch(model, train_dataset, test_dataset, optimizer, epoch, eval_every=eval_every)
  File "train.py", line 113, in train_epoch
    loss.backward()
  File "C:\Users\Joe Fioti\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Joe Fioti\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\autograd\__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Exception raised from _cudnn_rnn_backward_input at ..\aten\src\ATen\native\cudnn\RNN.cpp:923 (most recent call first):
00007FF997E675A200007FF997E67540 c10.dll!c10::Error::Error [<unknown file> @ <unknown line number>]
00007FF94242529600007FF9424251E0 torch_cuda.dll!at::native::Descriptor<cudnnRNNStruct,&cudnnCreateRNNDescriptor,&cudnnDestroyRNNDescriptor>::Descriptor<cudnnRNNStruct,&cudnnCreateRNNDescriptor,&cudnnDestroyRNNDescriptor> [<unknown file> @ <unknown line number>]
00007FF94243C11B00007FF942439AD0 torch_cuda.dll!at::native::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FF94243A03000007FF942439AD0 torch_cuda.dll!at::native::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FF942492BA800007FF94244E400 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FF9424A13DD00007FF94244E400 torch_cuda.dll!at::native::set_storage_cuda_ [<unknown file> @ <unknown line number>]
00007FF93A51BBF100007FF93A48D9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FF93A56B9DA00007FF93A568FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FF93A552ECA00007FF93A552D40 torch_cpu.dll!at::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FF93B85088900007FF93B80E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FF93B85D12D00007FF93B80E010 torch_cpu.dll!torch::autograd::GraphRoot::apply [<unknown file> @ <unknown line number>]
00007FF93A51BBF100007FF93A48D9D0 torch_cpu.dll!at::native::mkldnn_sigmoid_ [<unknown file> @ <unknown line number>]
00007FF93A56B9DA00007FF93A568FA0 torch_cpu.dll!at::bucketize_out [<unknown file> @ <unknown line number>]
00007FF93A552ECA00007FF93A552D40 torch_cpu.dll!at::_cudnn_rnn_backward [<unknown file> @ <unknown line number>]
00007FF93B75C12D00007FF93B75BAF0 torch_cpu.dll!torch::autograd::generated::CudnnRnnBackward::apply [<unknown file> @ <unknown line number>]
00007FF93B747E9100007FF93B747B50 torch_cpu.dll!torch::autograd::Node::operator() [<unknown file> @ <unknown line number>]
00007FF93BCAF9BA00007FF93BCAF300 torch_cpu.dll!torch::autograd::Engine::add_thread_pool_task [<unknown file> @ <unknown line number>]
00007FF93BCB03AD00007FF93BCAFFD0 torch_cpu.dll!torch::autograd::Engine::evaluate_function [<unknown file> @ <unknown line number>]
00007FF93BCB4FE200007FF93BCB4CA0 torch_cpu.dll!torch::autograd::Engine::thread_main [<unknown file> @ <unknown line number>]
00007FF93BCB4C4100007FF93BCB4BC0 torch_cpu.dll!torch::autograd::Engine::thread_init [<unknown file> @ <unknown line number>]
00007FF97E3D08B700007FF97E3A9F90 torch_python.dll!THPShortStorage_New [<unknown file> @ <unknown line number>]
00007FF93BCABF1400007FF93BCAB780 torch_cpu.dll!torch::autograd::Engine::get_base_engine [<unknown file> @ <unknown line number>]
00007FF9C86E0E8200007FF9C86E0D40 ucrtbase.dll!beginthreadex [<unknown file> @ <unknown line number>]
00007FF9CA477BD400007FF9CA477BC0 KERNEL32.DLL!BaseThreadInitThunk [<unknown file> @ <unknown line number>]
00007FF9CA82CE5100007FF9CA82CE30 ntdll.dll!RtlUserThreadStart [<unknown file> @ <unknown line number>]

The exception points to a line in RNN.cpp containing:

auto datatype = getCudnnDataType(input);

Not sure if this helps or not.

I can’t remember seeing a similar issue pointing to this line of code, so I would need to reproduce it in order to debug it properly.

Are you trying to use the cudnn RNN in eval() mode during training? Also, are you using the latest PyTorch version?

@ptrblck I am using nn.GRU modules in multiple places throughout the model, in both training and eval modes. I am packing the input before passing it through and unpacking it afterwards. I am using PyTorch 1.6.0 with CUDA 10.1 on Windows.
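
For context, the pattern around each nn.GRU looks roughly like this (a minimal sketch; the sizes and names are placeholders, not my actual model):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True).cuda()

x = torch.randn(4, 10, 128, device="cuda")      # (batch, seq_len, features)
lengths = torch.tensor([10, 8, 6, 3])           # true length of each sequence in the batch

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, hidden = gru(packed)                # cuDNN RNN kernel runs here
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

out.sum().backward()    # the failure is raised from _cudnn_rnn_backward during this call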