Unable to find a valid cuDNN algorithm to run convolution

– just leaving this comment in case someone has the same issue.
hi,
my code, which was running fine, just hit this issue and raised RuntimeError: Unable to find a valid cuDNN algorithm to run convolution.
it runs fine on a small dataset but raises the error on a large dataset.
the code works fine on either dataset when using other methods; the newly tested method involves some gradient computations (more memory).
by default, i set torch.backends.cudnn.benchmark = True.
torch.backends.cudnn.enabled is also True by default.
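for reference, a minimal sketch of just those flags (the model and training loop are omitted):

```python
import torch

# let cuDNN benchmark several convolution algorithms per input shape;
# the autotuning search itself can temporarily use extra GPU memory
torch.backends.cudnn.benchmark = True

# enabled by default; set explicitly here only for completeness
torch.backends.cudnn.enabled = True
```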

reinstalling a fresh env or rebuilding pytorch is a lot of effort, given that the code works fine on the small dataset. pytorch (1.9.0) was installed using conda.

i noticed that the gpu memory (16gb) is maxed out after 97 minibatches during training… usually 11gb does the job.
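for anyone debugging something similar, this is roughly how usage can be checked from inside the training loop (a minimal sketch, not my actual instrumentation):

```python
import torch

# memory actually held by tensors vs. reserved by the caching allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")

# detailed allocator breakdown, useful for spotting steady growth over steps
print(torch.cuda.memory_summary())
```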

a quick search, and some answers to a similar tensorflow issue, seem to point toward a memory overload (retinanet - Tensorflow 2.1 Failed to get convolution algorithm. This is probably because cuDNN failed to initialize - Stack Overflow), which matches my case.

i double-checked and found a memory leak in my code:
i was tracking a loss without detaching it from the graph, which kept increasing memory usage until overload. fixing this solved the problem.
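a minimal sketch of the pattern with a toy model (the real code was different; names here are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

losses = []
for step in range(100):
    x = torch.randn(32, 10, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    loss = criterion(model(x), y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # BUG: appending the tensor itself accumulates autograd history across
    # iterations, so GPU memory grows every step until allocation fails:
    # losses.append(loss)

    # FIX: detach from the graph (or take the Python float) before tracking:
    losses.append(loss.detach())  # or: losses.append(loss.item())
```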

indeed, the error is vague and does not hint at a memory issue (pytorch - Unable to find a valid cuDNN algorithm to run convolution - Stack Overflow).

thanks

decreasing the batch size helped
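a minimal sketch of that kind of change, assuming a standard DataLoader setup (the dataset here is a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224))  # placeholder data

# halving the batch size roughly halves the activation memory per step
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # e.g. down from 32
```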

I tried smaller batches, but it did not help in my case. Perhaps even smaller batches are needed…

Had the same error:

```
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
```
I saw you suggested uninstalling all binary installations. How can I do that?
Or should I change something in my ~/.bashrc file?

I'll mention that I managed to run this code before; I suspect it started to fail after I installed a Python package that messed up my CUDA settings.

More details:

torch.cuda.is_available() => True
torch.__version__ => 2.0.1+cu117
torch.version.cuda => 11.7
torch.backends.cudnn.version() => 8500

I ran nvidia-smi and made sure the GPU was not occupied.

Could you post a minimal and executable code snippet to reproduce the issue as well as the output of python -m torch.utils.collect_env, please?