Train jointly with Tesla P4 & Tesla T4

Hi all.

I'm trying to train a model simultaneously on two types of cards: a Tesla P4 and a Tesla T4.
I'm using the code from https://github.com/kuangliu/pytorch-cifar, but unfortunately it fails. The error log is:

==> Preparing data…
Files already downloaded and verified
Files already downloaded and verified
==> Building model…
/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:26: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))

Epoch: 0
Traceback (most recent call last):
File "main.py", line 150, in <module>
train(epoch)
File "main.py", line 102, in train
loss.backward()
File "/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fcb40ac6b5e in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd70c62 (0x7fcb41c90c62 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6dbe5 (0x7fcb41c8dbe5 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd6f07f (0x7fcb41c8f07f in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd72cd0 (0x7fcb41c92cd0 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool) + 0x49 (0x7fcb41c92f29 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdd9880 (0x7fcb41cf9880 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe1daf8 (0x7fcb41d3daf8 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7fcb41c93bdc in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xdd958b (0x7fcb41cf958b in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xe1db54 (0x7fcb41d3db54 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x29dee26 (0x7fcb6aab4e26 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2a2e634 (0x7fcb6ab04634 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7fcb6a6ccff8 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2ae7df5 (0x7fcb6abbddf5 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fcb6abbb0f3 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fcb6abbbed2 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fcb6abb4549 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fcb6e104638 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0xc819d (0x7fcb7095f19d in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/…/…/…/…/./libstdc++.so.6)
frame #20: + 0x9609 (0x7fcb89219609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x43 (0x7fcb89140103 in /lib/x86_64-linux-gnu/libc.so.6)

However, I remember this working earlier. Also, everything is fine if I run the code on a single type of card. I don't know whether there have been changes to the PyTorch source code. Could anyone be so kind as to help me?
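
For reference, the warning in the log suggests excluding one of the GPUs from nn.DataParallel. A minimal sketch of that (assuming the ResNet18/DataParallel setup from the linked repo; adjust to your actual model) could look like:

```python
import torch
import torch.nn as nn
from models import ResNet18  # model definitions from the linked repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = ResNet18().to(device)

if device == 'cuda':
    # Restrict DataParallel to a subset of the visible GPUs instead of
    # using all of them, as suggested by the imbalance warning.
    net = nn.DataParallel(net, device_ids=[0])
```

Alternatively, the same effect can be achieved by hiding a device via CUDA_VISIBLE_DEVICES before launching the script.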

Are you directly using the linked code or did you add any changes to it?
Also, could you post the PyTorch, CUDA, and cudnn versions you are using?
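
You can print them e.g. via:

```python
import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version
```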

I'm using the linked code without modification, since I'm just testing a new machine.

After some tests, I found:
cudatoolkit 10.2.89 + cudnn 7.6.5 + pytorch 1.5.1 fails;
cudatoolkit 10.0.130 + cudnn 7.4.2 + pytorch 1.0.1 works.

Could you halve the batch size and rerun the script on these devices in isolation with the failing version, please?

It works in isolation.

Does it even work in isolation with batch_size/2, as this would be the new per-device workload if you are using nn.DataParallel?
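
A rough sketch of that test (assuming the default batch size of 128 and the CIFAR-10 DataLoader from the linked main.py; only the relevant part is shown):

```python
# Launch on one device at a time, e.g.
#   CUDA_VISIBLE_DEVICES=0 python main.py
#   CUDA_VISIBLE_DEVICES=1 python main.py
import torch
import torchvision
import torchvision.transforms as transforms

trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True,
    transform=transforms.ToTensor())

# Half of the assumed default batch size of 128, i.e. the per-device
# workload when nn.DataParallel splits a batch across two GPUs.
trainloader = torch.utils.data.DataLoader(
    trainset, batch_size=64, shuffle=True, num_workers=2)
```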

Yes it works in isolation with halved batch size.

That's unfortunate, as we won't be able to reproduce it easily.
Could you try to create the cuDNN logs as described here and upload them?
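
For reference, cuDNN API logging is enabled via environment variables. A minimal sketch (they have to be set before cuDNN is initialized, or exported in the shell before launching the script):

```python
import os

# Enable cuDNN API logging and redirect it to a file.
os.environ["CUDNN_LOGINFO_DBG"] = "1"
os.environ["CUDNN_LOGDEST_DBG"] = "cudnn.log"

import torch  # import after setting the variables
```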

The log is large, so I uploaded it here.