RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

When I use two GPUs for training, I get the following error:
Exception has occurred: RuntimeError
cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)

My code is as follows:

import torch
import torch.nn as nn

# net, criterion, optimizer, train_loader and total_loss are defined earlier in train.py
use_gpu = torch.cuda.is_available()
if use_gpu and torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)   # wrap the model for multi-GPU training
    net.cuda()
    net.train()

for i, (images, target) in enumerate(train_loader):
    if use_gpu:
        images, target = images.cuda(), target.cuda()
    pred = net(images)
    loss = criterion(pred, target)
    total_loss += loss.data.item()
    optimizer.zero_grad()
    loss.backward()              # <- the RuntimeError is raised here

The error is raised when running loss.backward():

Exception has occurred: RuntimeError
cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f3c98237b5e in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd792a2 (0x7f3c991dd2a2 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd76225 (0x7f3c991da225 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd776bf (0x7f3c991db6bf in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd7b310 (0x7f3c991df310 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0x49 (0x7f3c991df569 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xde1ec0 (0x7f3c99245ec0 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe26138 (0x7f3c9928a138 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7f3c991e021c in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xde1bcb (0x7f3c99245bcb in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xe26194 (0x7f3c9928a194 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x29defc6 (0x7f3cc1df7fc6 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2a2ea54 (0x7f3cc1e47a54 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7f3cc1a0ff28 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2ae8215 (0x7f3cc1f01215 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f3cc1efe513 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f3cc1eff2f2 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f3cc1ef7969 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f3cc523e558 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0xc819d (0x7f3cfad4d19d in /home/cr7/anaconda3/bin/…/lib/libstdc++.so.6)
frame #20: + 0x76db (0x7f3d042ae6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x3f (0x7f3d03fd788f in /lib/x86_64-linux-gnu/libc.so.6)
File "/home/cr7/AI/github_projects/pytorch-YOLO-v1-master/train.py", line 136, in <module>
loss.backward()

By the way, when I train the network on the CPU only, it works just fine.

My software environment and hardware are as follows:
python: 3.7
pytorch: 1.5.0
cuda: 10.1.168
cudnn: 7.6.2.24

The two GPUs I use are a GeForce RTX 2070 and a GeForce GTX 1060.
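
For reference, these versions can be confirmed from within PyTorch itself (a small sketch; it only prints what is installed):

import torch

# Print the framework, CUDA and cuDNN versions PyTorch was built against,
# plus the GPUs that are visible to it.
print("pytorch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print("gpu", i, ":", torch.cuda.get_device_name(i))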

Hi,

Can you share a small code sample that we can use to reproduce this?
Also, what are the exact parameters of the convolution that causes the error (input size, kernel size, stride, padding, etc.)?
Thanks!
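
For example, a hypothetical helper along these lines (not from your code, just a sketch) would print each Conv2d's hyperparameters and the shape of the input it receives, which should identify the offending layer:

import torch.nn as nn

def report_conv_layers(model):
    # Print the static hyperparameters of every Conv2d in the model...
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            print(name,
                  "in/out:", module.in_channels, module.out_channels,
                  "kernel:", module.kernel_size,
                  "stride:", module.stride,
                  "padding:", module.padding)
            # ...and hook it to print the runtime input shape on the next forward pass.
            module.register_forward_hook(
                lambda m, inp, out, name=name:
                    print(name, "input shape:", tuple(inp[0].shape)))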

Thank you for your reply!
This is the repository I used:


I tried to call nn.DataParallel() to implement multi-GPU training, following the instructions here:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html#imports-and-parameters
Then I got the error mentioned above. I'm not sure whether it is because the two GPUs I use are different models, but I suspect that might be the cause.
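
Concretely, this is roughly the pattern from that tutorial that I applied (the tiny model below is only a placeholder for illustration, not the actual YOLO network):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Placeholder model; in my case this is the YOLO v1 net from the repository.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)   # replicate across both GPUs, split each batch
net = net.to(device)

# Inside the training loop the batch is moved to the same device:
# images, target = images.to(device), target.to(device)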

Hi, I found that this problem sometimes does not appear, but most of the time the error is raised. If I step through the program in an IDE debugger (e.g. VS Code), backpropagation works normally. If I run the program from the terminal or run it directly in the IDE, the error is almost always thrown.

Could you update to PyTorch 1.5.1 with CUDA 10.2.89 and cudnn 7.6.5.32?
Also, is the YOLO code working fine on its own, with the error only raised by your DataParallel code?
If so, could you post the code to reproduce this error (with input shapes, if possible)?
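
Something along these lines would be enough as a repro (a hypothetical skeleton with placeholder shapes; please fill in the actual layer configuration and input size from your model):

import torch
import torch.nn as nn

# Minimal stand-in model; replace with the failing YOLO configuration.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)
net = nn.DataParallel(net).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

# Placeholder batch; use the real input resolution (e.g. 448x448 for YOLO v1).
images = torch.randn(8, 3, 448, 448).cuda()
target = torch.randint(0, 10, (8,)).cuda()

pred = net(images)
loss = criterion(pred, target)
optimizer.zero_grad()
loss.backward()   # the call that fails in your setup
optimizer.step()
print("backward finished, loss =", loss.item())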