cuDNN error: CUDNN_STATUS_EXECUTION_FAILED on one GPU and not the other

Hi,

I have the following error:

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Full stack trace:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-33f16e9e6467> in <module>
      6 trainer = Trainer(splitter, LeggedTCN)
      7 print("training")
----> 8 trainer.train()
      9 #with sprees

~/pycharms_remote/valx_trader/rrl_hm/training.py in train(self)
    281                 if self.flippers['mixup']: X, Y = self.mixup(X, Y, self.flippers['mixup'])
    282 
--> 283                 loss = self.train_iteration(X, Y, sym_perm, epoch) # crm,
    284                 self.iter_losses[sub_epoch] = loss
    285                 if batch_index % print_every == 0:

~/pycharms_remote/valx_trader/rrl_hm/training.py in train_iteration(self, X, Y, sym_perm, epoch)
    308         self.optimizer.zero_grad()
    309         loss = self.my_loss_func(None, Y, outputs) # crm, pps
--> 310         loss.backward()
    311         self.optimizer.step()
    312 

~/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    100                 products. Defaults to ``False``.
    101         """
--> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    103 
    104     def register_hook(self, hook):

~/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88     Variable._execution_engine.run_backward(
     89         tensors, grad_tensors, retain_graph, create_graph,
---> 90         allow_unreachable=True)  # allow_unreachable flag
     91 
     92 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I am running on a desktop with 2 GPUs. I have 2 jupyter lab instances that I start with:

CUDA_VISIBLE_DEVICES=[0 | 1] jupyter lab --port=802[0 | 1]

I ran the exact same notebook on each instance of jupyter lab. On GPU#0, everything works fine. On GPU#1, after 35 iterations (about 5 min of training), I get the EXECUTION_FAILED error.
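
For reference, here is a quick way to double-check the GPU mapping from inside each notebook. This is a minimal sketch, and the helper name `visible_gpu_ids` is made up for illustration. With `CUDA_VISIBLE_DEVICES=1`, the process sees only one device, and CUDA renumbers it as device 0, so the notebook always reports device 0 whichever physical card it is on:

```python
import os

def visible_gpu_ids(env=None):
    """Return the physical GPU ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all GPUs are visible).
    Position i in the returned list is what the process sees as device i.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Entries are kept as strings because they may also be GPU UUIDs.
    return [part.strip() for part in raw.split(",") if part.strip()]

# Example: the environment of the second jupyter lab instance
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "1"}))  # prints ['1']
```

Note that CUDA's default device ordering is not guaranteed to match the numbering shown by nvidia-smi. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` alongside `CUDA_VISIBLE_DEVICES` makes the two agree.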

Here is the nvidia-smi output at the time of the crash:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:24:00.0  On |                  N/A |
| 58%   74C    P2   224W / 250W |   3159MiB / 11175MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:25:00.0 Off |                  N/A |
| 47%   70C    P2    63W / 250W |   1527MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1514      G   /usr/lib/xorg/Xorg                            59MiB |
|    0      1724      G   /usr/bin/gnome-shell                          70MiB |
|    0      2348      C   ...ms_remote/valx_trader/.venv/bin/python3  3025MiB |
|    1      9360      C   ...ms_remote/valx_trader/.venv/bin/python3  1515MiB |
+-----------------------------------------------------------------------------+

Each GPU has 11GB of memory, and I use about 3GB, so the issue is not running out of GPU memory. Note that the error is with GPU#1, and there are no other processes on that GPU (so it’s not a conflict with other processes).

Sometimes, I get a CUDA illegal memory access error instead. The error is always with GPU#1, after 1-10 minutes of training, and it’s always related to CUDA/cuDNN, but the exact error message, stack trace, and timing can vary.

If I train much smaller models on that GPU, I get no error, or it only appears much later in training. I recently added 32GB of RAM to the machine, so RAM is not an issue: usage hovers around 20GB (out of 64GB total) during training.

Do you have an idea of what could be the issue? Can it be an issue with the GPU itself?
I don’t think it’s an issue with the code, because everything works fine on the first GPU and the code I run is exactly the same.

The fact that the error happens at random after training for a while, and that it’s not always the same error is very weird to me. How can I diagnose/debug this situation?
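
One generic first step, sketched here rather than offered as a confirmed fix: CUDA kernels are launched asynchronously, so the Python line shown in the stack trace is not necessarily the one that failed. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes launches synchronous, so the error surfaces at the call that caused it. It has to be set before the first CUDA call in the process, for example at the very top of the notebook or on the jupyter lab command line. The helper name below is only for illustration.

```python
import os

def enable_sync_cuda_errors():
    """Make CUDA kernel launches synchronous for easier debugging.

    Must run before anything initializes CUDA in this process
    (i.e. before the first tensor is moved to the GPU).
    Training will be slower while this is set.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]

enable_sync_cuda_errors()
```

This does not fix anything by itself. It only makes the reported location of the failure more trustworthy, which helps when comparing runs across the two GPUs.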

Thanks

I’m trying to upgrade PyTorch 1.0.1 to 1.1, CUDA 8 to CUDA 9, and cuDNN 6.0.21 to 7.0, to see if the error goes away.

Same issue with the newer CUDA/cuDNN versions.

After about 20min of training, I get this:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-33f16e9e6467> in <module>
      6 trainer = Trainer(splitter, LeggedTCN)
      7 print("training")
----> 8 trainer.train()
      9 #with sprees

~/pycharms_remote/valx_trader/rrl_hm/training.py in train(self)
    281                 if self.flippers['mixup']: X, Y = self.mixup(X, Y, self.flippers['mixup'])
    282 
--> 283                 loss = self.train_iteration(X, Y, sym_perm, epoch) # crm,
    284                 self.iter_losses[sub_epoch] = loss
    285                 if batch_index % print_every == 0:

~/pycharms_remote/valx_trader/rrl_hm/training.py in train_iteration(self, X, Y, sym_perm, epoch)
    311         self.optimizer.step()
    312 
--> 313         self.losses[epoch] += loss.item()
    314         # print('loss', loss)
    315         return loss

RuntimeError: CUDA error: an illegal memory access was encountered

I tried switching the two GPUs. Instead of
notebook#1 with GPU#0, notebook#2 with GPU#1,
I now run
notebook#1 with GPU#1, notebook#2 with GPU#0.

Now the error occurs in notebook#1, so the issue is clearly linked to the GPU. This time there is no stack trace in the notebook: the notebook simply freezes and the error shows up in the jupyter lab terminal. (This had already happened with the other notebook. The exact error, and whether it appears in the notebook or the terminal, seems random, but it always happens in the notebook running on GPU#1.)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (record at /pytorch/aten/src/ATen/cuda/CUDAEvent.h:118)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fb65e532441 in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fb65e531d7a in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x12b2612 (0x7fb669fd8612 in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: <unknown function> + 0x12a123a (0x7fb669fc723a in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: <unknown function> + 0x12a5dda (0x7fb669fcbdda in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: <unknown function> + 0x12a5ffe (0x7fb669fcbffe in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #6: <unknown function> + 0x9aa446 (0x7fb65f0f5446 in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #7: at::native::gru(at::Tensor const&, at::Tensor const&, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) + 0x10d (0x7fb65f0e93dd in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #8: at::TypeDefault::gru(at::Tensor const&, at::Tensor const&, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) const + 0xc9 (0x7fb65f42ab19 in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #9: torch::autograd::VariableType::gru(at::Tensor const&, at::Tensor const&, c10::ArrayRef<at::Tensor>, bool, long, double, bool, bool, bool) const + 0x4a9 (0x7fb65d567159 in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #10: <unknown function> + 0x25fb0a (0x7fb69d960b0a in /home/pinouchon/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #11: /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3() [0x5030d5]
frame #12: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3)
frame #13: /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3() [0x504c28]

[lots more frames]

frame #60: /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3() [0x502f3d]
frame #61: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3)
frame #62: /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3() [0x504c28]
frame #63: _PyFunction_FastCallDict + 0x2de (0x501b2e in /home/pinouchon/pycharms_remote/valx_trader/.venv/bin/python3)

Any ideas?

I went to a hardware shop to have the GPU tested, and the guy said that he could not get a display working with the card. That makes it more likely that the GPU has some hardware issue.

Thanks for getting back with this information!
I was following this thread, but couldn’t come up with other suggestions than what you’ve already tried.
Does it mean the card does not produce any video output at all?

The guy who ran the tests said that he couldn’t get any video output from the card (and he tried a few different motherboards/video cables).
So he didn’t run any specific testing/benchmarking.

Also, I didn’t mention that I originally bought this card from a miner, so he might have burned a few transistors here and there :)

With that said, the card is listed in nvidia-smi, and I can get it to run some models.
In my latest testing, the card seems to handle a model composed of just RNNs and a Wavenet model just fine. (I used this wavenet model: https://github.com/vincentherrmann/pytorch-wavenet/blob/master/wavenet_model.py, and built-in torch GRUs)
It seems to have issues with my TCN model (taken from here: https://github.com/locuslab/TCN/blob/master/TCN/tcn.py)
It’s very hard to debug because it usually fails after 5-20 min of training, so I’m never sure of anything.

There is no clear pattern as to which part of the model (batch norm layers, regularization, convs, concats, indexing…) is causing the GPU to fail, and I did not have the patience to figure it out.
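
A way to narrow this down without waiting on full training runs would be a repeatability check: run the same fixed computation many times and flag any run whose result differs from the first. On healthy hardware a deterministic computation should return the same value every time, so mismatches on one card but not the other would point at the hardware. This is only a sketch: `find_mismatches` and its parameters are made-up names, and the stand-in `checksum` function runs on the CPU and would need to be replaced with a real GPU computation.

```python
import math

def find_mismatches(run_once, repeats=1000, rel_tol=0.0):
    """Call run_once() repeatedly and return the indices of runs whose
    result differs from the first run by more than rel_tol.

    run_once must be deterministic: same inputs, same seed, every call.
    A NaN result is always reported as a mismatch.
    """
    reference = run_once()
    mismatches = []
    for i in range(1, repeats):
        value = run_once()
        if not math.isclose(value, reference, rel_tol=rel_tol, abs_tol=0.0):
            mismatches.append(i)
    return mismatches

# Stand-in workload (CPU only); replace with a fixed GPU computation.
def checksum():
    return sum(math.sin(k) for k in range(10_000))

print(find_mismatches(checksum, repeats=50))  # prints []
```

With PyTorch, `run_once` could for instance multiply two fixed matrices on the GPU and return the result of `.sum().item()`. A small nonzero `rel_tol` may be needed if the chosen operation is not bitwise reproducible from run to run.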

I think I’ll just get a brand new 2080 Ti and that will be it.
Thanks for jumping in, I appreciate it.
