Train jointly with Tesla P4 & Tesla T4

Hi all.

I'm trying to train a model simultaneously on two types of cards: a Tesla P4 and a Tesla T4.
I'm using the code from https://github.com/kuangliu/pytorch-cifar, but unfortunately it fails. The error log is:

==> Preparing data…
Files already downloaded and verified
Files already downloaded and verified
==> Building model…
/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:26: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))

Epoch: 0
Traceback (most recent call last):
File "main.py", line 150, in <module>
train(epoch)
File "main.py", line 102, in train
loss.backward()
File "/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fcb40ac6b5e in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd70c62 (0x7fcb41c90c62 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6dbe5 (0x7fcb41c8dbe5 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd6f07f (0x7fcb41c8f07f in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd72cd0 (0x7fcb41c92cd0 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0x49 (0x7fcb41c92f29 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdd9880 (0x7fcb41cf9880 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe1daf8 (0x7fcb41d3daf8 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7fcb41c93bdc in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xdd958b (0x7fcb41cf958b in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xe1db54 (0x7fcb41d3db54 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x29dee26 (0x7fcb6aab4e26 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2a2e634 (0x7fcb6ab04634 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7fcb6a6ccff8 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2ae7df5 (0x7fcb6abbddf5 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fcb6abbb0f3 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fcb6abbbed2 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fcb6abb4549 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fcb6e104638 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0xc819d (0x7fcb7095f19d in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/…/…/…/…/./libstdc++.so.6)
frame #20: + 0x9609 (0x7fcb89219609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x43 (0x7fcb89140103 in /lib/x86_64-linux-gnu/libc.so.6)

However, I remember this working earlier. Also, everything is fine if I run the code on a single type of card. I don't know whether there have been changes to the PyTorch source code. Could anyone be so kind as to help me?
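
For reference, the warning in the log suggests excluding one of the GPUs from nn.DataParallel. A minimal sketch of that (assuming the ResNet18/DataParallel setup from the linked repo; adjust to your actual model) could look like:

```python
import torch
import torch.nn as nn
from models import ResNet18  # model definitions from the linked repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = ResNet18().to(device)

if device == 'cuda':
    # Restrict DataParallel to a subset of the visible GPUs instead of
    # using all of them, as suggested by the imbalance warning.
    net = nn.DataParallel(net, device_ids=[0])
```

Alternatively, the same effect can be achieved by hiding a device via CUDA_VISIBLE_DEVICES before launching the script.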

Are you directly using the linked code or did you add any changes to it?
Also, could you post the PyTorch, CUDA, and cudnn versions you are using?
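
You can print them e.g. via:

```python
import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version
```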

I'm using the linked code without modification, since I'm just testing a new machine.

After some tests, I found:
cudatoolkit 10.2.89 + cudnn 7.6.5 + pytorch 1.5.1 fails;
cudatoolkit 10.0.130 + cudnn 7.4.2 + pytorch 1.0.1 works.

Could you halve the batch size and rerun the script on these devices in isolation with the failing version, please?

It works in isolation.

Does it even work in isolation with batch_size/2, as this would be the new per-device workload if you are using nn.DataParallel?
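
A rough sketch of that test (assuming the default batch size of 128 and the CIFAR-10 DataLoader from the linked main.py; only the relevant part is shown):

```python
# Launch on one device at a time, e.g.
#   CUDA_VISIBLE_DEVICES=0 python main.py
#   CUDA_VISIBLE_DEVICES=1 python main.py
import torch
import torchvision
import torchvision.transforms as transforms

trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True,
    transform=transforms.ToTensor())

# Half of the assumed default batch size of 128, i.e. the per-device
# workload when nn.DataParallel splits a batch across two GPUs.
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=64, shuffle=True, num_workers=2)
```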

Yes it works in isolation with halved batch size.

That's unfortunate, as we won't be able to reproduce it easily.
Could you try to create the cuDNN logs as described here and upload them?
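
For reference, cuDNN API logging is enabled via environment variables. A minimal sketch (they have to be set before cuDNN is initialized, or exported in the shell before launching the script):

```python
import os

# Enable cuDNN API logging and redirect it to a file.
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "cudnn.log"

import torch  # import after setting the variables
```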

The log is large, so I uploaded it here.