Hi all.
I'm trying to train a model simultaneously on two different types of cards: a Tesla P4 and a Tesla T4.
I'm using the code from https://github.com/kuangliu/pytorch-cifar. Unfortunately, training fails. The error log is:
==> Preparing data…
Files already downloaded and verified
Files already downloaded and verified
==> Building model…
/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py:26: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
Epoch: 0
Traceback (most recent call last):
File "main.py", line 150, in <module>
train(epoch)
File "main.py", line 102, in train
loss.backward()
File "/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM (operator() at /opt/conda/conda-bld/pytorch_1591914880026/work/aten/src/ATen/native/cudnn/Conv.cpp:1142)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fcb40ac6b5e in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xd70c62 (0x7fcb41c90c62 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6dbe5 (0x7fcb41c8dbe5 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xd6f07f (0x7fcb41c8f07f in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd72cd0 (0x7fcb41c92cd0 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_weight(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x49 (0x7fcb41c92f29 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdd9880 (0x7fcb41cf9880 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xe1daf8 (0x7fcb41d3daf8 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x2fc (0x7fcb41c93bdc in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #9: + 0xdd958b (0x7fcb41cf958b in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #10: + 0xe1db54 (0x7fcb41d3db54 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #11: + 0x29dee26 (0x7fcb6aab4e26 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: + 0x2a2e634 (0x7fcb6ab04634 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x378 (0x7fcb6a6ccff8 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: + 0x2ae7df5 (0x7fcb6abbddf5 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fcb6abbb0f3 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fcb6abbbed2 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fcb6abb4549 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fcb6e104638 in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #19: + 0xc819d (0x7fcb7095f19d in /home/yjh/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #20: + 0x9609 (0x7fcb89219609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #21: clone + 0x43 (0x7fcb89140103 in /lib/x86_64-linux-gnu/libc.so.6)
However, I remember this working earlier. Also, everything is fine if I run the code on a single type of card. I don't know whether something has changed in the PyTorch source code. Could anyone be so kind as to help me?
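For what it's worth, the UserWarning near the top of the log already suggests a possible workaround: keep training on matched GPUs only. Here is a minimal sketch of both options it mentions. The device indices are assumptions (I'm guessing GPU 0 is the T4 and GPU 1 is the P4; check with `nvidia-smi`), and the tiny model is just a placeholder:

```python
import os

# Option 1: hide the mismatched card (assumed to be GPU 1, the P4) so torch
# never sees it. This must be set before CUDA is initialized, i.e. before
# the first `import torch` in the process (or set it in the shell instead:
# CUDA_VISIBLE_DEVICES=0 python main.py).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder for the real network
if torch.cuda.is_available():
    # Option 2: pass explicit device_ids so DataParallel only uses GPU 0,
    # even if more GPUs are visible.
    model = nn.DataParallel(model.cuda(), device_ids=[0])
```

This sidesteps the mixed-architecture setup rather than explaining the CUDNN_STATUS_BAD_PARAM itself, but it may confirm whether the crash is specific to running the P4 and T4 together.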