Hi,
I have the following error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Full stack trace:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-33f16e9e6467> in <module>
6 trainer = Trainer(splitter, LeggedTCN)
7 print("training")
----> 8 trainer.train()
9 #with sprees
~/pycharms_remote/valx_trader/rrl_hm/training.py in train(self)
281 if self.flippers['mixup']: X, Y = self.mixup(X, Y, self.flippers['mixup'])
282
--> 283 loss = self.train_iteration(X, Y, sym_perm, epoch) # crm,
284 self.iter_losses[sub_epoch] = loss
285 if batch_index % print_every == 0:
~/pycharms_remote/valx_trader/rrl_hm/training.py in train_iteration(self, X, Y, sym_perm, epoch)
308 self.optimizer.zero_grad()
309 loss = self.my_loss_func(None, Y, outputs) # crm, pps
--> 310 loss.backward()
311 self.optimizer.step()
312
~/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
100 products. Defaults to ``False``.
101 """
--> 102 torch.autograd.backward(self, gradient, retain_graph, create_graph)
103
104 def register_hook(self, hook):
~/pycharms_remote/valx_trader/.venv/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
88 Variable._execution_engine.run_backward(
89 tensors, grad_tensors, retain_graph, create_graph,
---> 90 allow_unreachable=True) # allow_unreachable flag
91
92
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
I am running on a desktop with 2 GPUs. I start 2 JupyterLab instances with:
CUDA_VISIBLE_DEVICES=[0 | 1] jupyter lab --port=802[0 | 1]
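To spell out the shorthand above, the two instances are launched like this:

    CUDA_VISIBLE_DEVICES=0 jupyter lab --port=8020   # notebook pinned to GPU#0
    CUDA_VISIBLE_DEVICES=1 jupyter lab --port=8021   # notebook pinned to GPU#1

so inside each notebook PyTorch only sees a single device (cuda:0).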
I ran the exact same notebook on each instance: on GPU#0 everything works fine, but on GPU#1, after 35 iterations (about 5 minutes of training), I get the EXECUTION_FAILED error.
Here is the nvidia-smi output at the time of the crash:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116 Driver Version: 390.116 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:24:00.0 On | N/A |
| 58% 74C P2 224W / 250W | 3159MiB / 11175MiB | 84% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:25:00.0 Off | N/A |
| 47% 70C P2 63W / 250W | 1527MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1514 G /usr/lib/xorg/Xorg 59MiB |
| 0 1724 G /usr/bin/gnome-shell 70MiB |
| 0 2348 C ...ms_remote/valx_trader/.venv/bin/python3 3025MiB |
| 1 9360 C ...ms_remote/valx_trader/.venv/bin/python3 1515MiB |
+-----------------------------------------------------------------------------+
Each GPU has 11GB of memory and I use about 3GB, so the issue is not running out of GPU memory. Note that the error is on GPU#1, and there are no other processes on that GPU, so it is not a conflict with another process.
Sometimes I get a CUDA illegal memory access error instead. The error always occurs on GPU#1, after 1-10 minutes of training, and it is related to CUDA/cuDNN, but the exact error message, stack trace, and timing vary.
If I train much smaller models on that GPU, I get no error, or it shows up much later in training. I recently added 32GB of RAM to the machine, so system RAM is not an issue either (usage hovers around 20GB out of 64GB during training).
Do you have any idea what the issue could be? Could it be a problem with the GPU itself?
I don't think it is an issue with the code, because everything works fine on the first GPU and the code I run is exactly the same.
The fact that the error happens at random after training for a while, and that it is not always the same error, is very strange to me. How can I diagnose/debug this situation?
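In case it helps, this is the kind of run I was thinking of trying next to get a more precise error location; it is only a sketch, assuming CUDA_LAUNCH_BLOCKING behaves as documented, not something I have already done:

    # Relaunch the failing instance with synchronous kernel launches, so the
    # Python stack trace should point at the op that actually fails:
    CUDA_VISIBLE_DEVICES=1 CUDA_LAUNCH_BLOCKING=1 jupyter lab --port=8021

I could also set torch.backends.cudnn.enabled = False inside the notebook to see whether the error still happens without cuDNN. Is that a sensible way to narrow this down, or is there a better approach?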
Thanks