While running a PyTorch script on a cluster, I'm getting the following error:
Traceback (most recent call last):
  File "/global/u2/a/anshuman/StructRepGen_Dev/diff_gpu.py", line 681, in <module>
    z , z_mu, z_var = unet(batch_noisy, t)
  File "/global/homes/a/anshuman/.conda/envs/srg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/global/u2/a/anshuman/StructRepGen_Dev/diff_gpu.py", line 560, in forward
    x = self.init_conv(x)
  File "/global/homes/a/anshuman/.conda/envs/srg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/global/homes/a/anshuman/.conda/envs/srg/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/global/homes/a/anshuman/.conda/envs/srg/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
I saw a post related to this error that mentioned a mismatch between the CUDA version and the PyTorch version. I tried reinstalling PyTorch, but the problem persists. Do I need to explicitly set the CUDA path and the PyTorch path? If so, how can I do that? Thanks.
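For reference, this is roughly how I am checking the runtime versions and GPU visibility from inside Python (just a quick sanity-check sketch; using device 0 is my assumption):

import torch

print(torch.__version__)                 # PyTorch build
print(torch.version.cuda)                # CUDA version PyTorch was compiled against
print(torch.backends.cudnn.version())    # cuDNN version bundled with this build
print(torch.cuda.is_available())         # does PyTorch see a GPU on this node?
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))      # which GPU (assuming device 0)
    free, total = torch.cuda.mem_get_info()   # free / total GPU memory in bytes
    print(free, total)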
I have the following versions:
torch.__version__ = 1.12.1
torch.version.cuda = 11.3
torch.backends.cudnn.version() = 8302
path of CUDA (which nvcc) = /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7/bin/nvcc
CUDA version reported in the terminal (nvcc --version):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
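In case it helps, I can also try to separate my model from the environment with a tiny standalone convolution along these lines (the channel counts and input shape below are made up, not the real ones from diff_gpu.py):

import torch
import torch.nn as nn

# Minimal stand-in for the failing self.init_conv(x) call;
# the sizes here are placeholders, not the ones from my model.
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=3, padding=1).cuda()
x = torch.randn(4, 8, 128, device="cuda")  # (batch, channels, length)
y = conv(x)
print(y.shape)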