When I train on two GPUs, I get the following error:
Exception has occurred: RuntimeError
cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)
My code is as follows:
use_gpu = torch.cuda.is_available()
if use_gpu and torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)
net.cuda()
net.train()
for i, (images, target) in enumerate(train_loader):
    if use_gpu:
        images, target = images.cuda(), target.cuda()
    pred = net(images)
    loss = criterion(pred, target)
    total_loss += loss.data.item()
    optimizer.zero_grad()
    loss.backward()
The error is raised when running loss.backward():
Exception has occurred: RuntimeError
cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f3c98237b5e in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd792a2 (0x7f3c991dd2a2 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd76225 (0x7f3c991da225 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd776bf (0x7f3c991db6bf in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd7b310 (0x7f3c991df310 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0x49 (0x7f3c991df569 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xde1ec0 (0x7f3c99245ec0 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe26138 (0x7f3c9928a138 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7f3c991e021c in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xde1bcb (0x7f3c99245bcb in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xe26194 (0x7f3c9928a194 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x29defc6 (0x7f3cc1df7fc6 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2a2ea54 (0x7f3cc1e47a54 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7f3cc1a0ff28 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2ae8215 (0x7f3cc1f01215 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f3cc1efe513 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f3cc1eff2f2 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f3cc1ef7969 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f3cc523e558 in /home/cr7/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0xc819d (0x7f3cfad4d19d in /home/cr7/anaconda3/bin/…/lib/libstdc++.so.6)
frame #20: + 0x76db (0x7f3d042ae6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x3f (0x7f3d03fd788f in /lib/x86_64-linux-gnu/libc.so.6)
File "/home/cr7/AI/github_projects/pytorch-YOLO-v1-master/train.py", line 136, in <module>
loss.backward()
By the way, when I train the network on the CPU only, it works just fine.
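Since the CPU run works, one further check would be to run the same script on each GPU alone, to see whether the error only appears when both cards are used together. Below is a minimal sketch of how that could be set up. It assumes the standard CUDA_VISIBLE_DEVICES environment variable is respected; restrict_visible_gpus is a hypothetical helper for illustration, not something in my train.py.

```python
import os


def restrict_visible_gpus(indices):
    """Limit which GPUs CUDA exposes to this process (hypothetical helper).

    Call this before the first CUDA call (safest: before importing torch),
    because CUDA_VISIBLE_DEVICES is read when CUDA is initialized.
    """
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: expose only one GPU, then import torch and train as usual.
# With a single visible device, torch.cuda.device_count() is 1, so the
# nn.DataParallel branch in the code above is skipped.
restrict_visible_gpus([0])
```

Which physical card index 0 refers to depends on CUDA's device ordering, so it is worth printing the device name after importing torch to confirm which GPU was actually used in each run.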
My software environment and hardware type are as follows:
python: 3.7
pytorch: 1.5.0
cuda: 10.1.168
cudnn: 7.6.2.24
The two GPUs I use are a GeForce RTX 2070 and a GeForce GTX 1060.
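In case it helps anyone reproduce this, the versions and GPU details above can also be read from PyTorch directly. The sketch below shows one way to collect them; describe_environment is a hypothetical helper name, and the import of torch is guarded so the function still returns something on a machine without PyTorch.

```python
import sys


def describe_environment():
    """Collect Python, PyTorch, CUDA, cuDNN and GPU details (hypothetical helper)."""
    info = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    try:
        import torch
    except ImportError:
        info["pytorch"] = None  # PyTorch is not installed in this interpreter
        return info

    info["pytorch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    # One (name, (major, minor) compute capability) pair per visible GPU.
    info["gpus"] = [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

The compute capability is included because the two cards are different GPU generations, which may or may not be relevant to the error.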