AZ Notebooks - RuntimeError on Standard_NC6 with Python 3.8 - PyTorch and TensorFlow kernel

Hello PyTorch community,

I have some scripts that run fine in Google Colab, and since I am working on my thesis I tried the Azure for Students promo.
I set up a Standard_NC6 with the PyTorch and TensorFlow kernel and I am getting the following error:


RuntimeError                              Traceback (most recent call last)
Input In [9], in <cell line: 64>()
     76 loss_critic = -(torch.mean(critic_real) - torch.mean(critic_fake))
     77 critic.zero_grad()
---> 78 loss_critic.backward(retain_graph=True)
     79 opt_critic.step()
     81 # clip critic weights between -0.01, 0.01

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/_tensor.py:396, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    387 if has_torch_function_unary(self):
    388     return handle_torch_function(
    389         Tensor.backward,
    390         (self,),
    (...)
    394         create_graph=create_graph,
    395         inputs=inputs)
--> 396 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torch/autograd/__init__.py:173, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    168     retain_graph = create_graph
    170 # The reason we repeat same the comment below is that
    171 # some Python versions print out the first line of a multi-line function
    172 # calls in the traceback and some print out the last line
--> 173 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    174     tensors, grad_tensors_, retain_graph, create_graph, inputs,
    175     allow_unreachable=True, accumulate_grad=True)

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
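
To take my GAN code out of the picture, this is the minimal check I run. It only assumes a CUDA-enabled torch build; the tensor sizes are arbitrary, just enough to exercise a cuDNN convolution:

import torch

# Versions of the pieces that have to agree with each other
print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))

# One cuDNN convolution forward/backward, independent of my training loop;
# if cuDNN itself is the problem here, this should raise the same RuntimeError
x = torch.randn(4, 3, 64, 64, device="cuda", requires_grad=True)
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to("cuda")
conv(x).sum().backward()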

I tried different PyTorch versions with both the cu113 and cu116 builds.
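
For reference, the installs followed the standard PyTorch index-URL pattern, roughly like this (exact versions varied across attempts):

pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu113
pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/cu116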

The nvidia-smi output is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000001:00:00.0 Off |                    0 |
| N/A   41C    P0    70W / 149W |    880MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11425      C   ...eml_py38_PT_TF/bin/python      877MiB |
+-----------------------------------------------------------------------------+

My guess is that the problem is driver/version related, since the same scripts work in the Google Colab environment.
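
If it helps to narrow that down, the installed wheel can also be checked against the K80 directly. This sketch only uses the standard torch.cuda API and compares the device's compute capability with the architectures the binary was built for:

import torch

# Tesla K80 should report compute capability (3, 7)
print(torch.cuda.get_device_capability(0))

# Architectures (sm_XX) this torch build was compiled for;
# if sm_37 is missing, that would fit a build/version mismatch
print(torch.cuda.get_arch_list())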

Thanks,
DP