"Unable to find a valid cuDNN algorithm to run convolution" on backward() function

I am getting an error on the backward function.

  File "/home/kusuma/.conda/envs/pyssd/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

I looked up what might cause it, and some sources point to the batch size. I reduced the batch size from 8 to 1 and it worked.
But is there any other way to avoid this error? Training with a batch size of 1 doesn’t make much sense for me.
I’m using PyTorch 1.10 on a 1080 Ti.
Does anyone have any insight on this matter?

>>> print(torch.__version__) # => 1.10.1
>>> print(torch.version.cuda) # => 11.3
>>> print(torch.backends.cudnn.version()) # => 8200

Could you install the latest nightly binary? It should use cuDNN 8.3.2, which might have already fixed the issue. If you are still hitting the error, could you post the model definition as well as the input shapes so that we could try to reproduce it, please?

Thanks for the heads up and sorry for the late response! I wasn’t able to install the nightly binary, as it’s not my personal GPU and I don’t have administrator rights on the machine.

What I did try was debugging with CUDA_LAUNCH_BLOCKING=1, on the assumption that it would show me something I had missed, since I only use a single GPU.
It turned out to be a memory issue.
My next guess is that two things contribute: my training script computes quite a few losses from multiple loss functions, and there is a speed-memory trade-off controlled by torch.backends.cudnn.benchmark, which I have now set to False.
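
For anyone landing here later, the two settings mentioned above could be applied roughly like this. This is only a sketch: the helper name `cudnn_debug_settings` is made up for illustration, and the environment variable can just as well be set in the shell before launching the script.

```python
import os

# CUDA_LAUNCH_BLOCKING must be set before CUDA is initialised, so set it
# before importing torch. With it, kernel launches run synchronously and
# errors are reported at the call that caused them.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch

# Disable the cuDNN auto-tuner. False is PyTorch's default; the point here
# is to make sure no other part of the training script has set it to True.
torch.backends.cudnn.benchmark = False


def cudnn_debug_settings():
    """Hypothetical helper: collect the settings discussed in this thread."""
    return {
        "CUDA_LAUNCH_BLOCKING": os.environ.get("CUDA_LAUNCH_BLOCKING"),
        "cudnn.benchmark": torch.backends.cudnn.benchmark,
        "cuda_available": torch.cuda.is_available(),
    }


print(cudnn_debug_settings())
```

Printing the settings once at the start of training makes it easy to confirm that the flags were actually picked up.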

After some cleanup and disabling the built-in auto-tuner, the training worked just fine :)
But what would the latest nightly binary have shown in my case? Would it have given me a more specific error? I’m just curious, as I have no personal GPU to try this debugging alternative on.

It wouldn’t necessarily show a better error message, unfortunately, but it would tell me whether the issue still persists (in which case it might indeed be a functional issue). It seems you’ve narrowed it down to the memory requirement now.
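
If the memory requirement is the suspect, one way to check how it grows with batch size is to record the peak allocation around a single forward/backward pass. The sketch below is an assumption-laden illustration, not the model from this thread: the tiny convolutional model, the input shape, the `.sum()` placeholder loss, and the helper name `peak_memory_mib` are all hypothetical. It only reports memory allocated through PyTorch's allocator, and returns 0.0 when no GPU is present.

```python
import torch
import torch.nn as nn


def peak_memory_mib(model, batch, device):
    """Run one forward/backward pass; return peak allocated CUDA memory in MiB.

    Returns 0.0 when running on CPU, where no CUDA statistics exist.
    """
    model = model.to(device)
    batch = batch.to(device)
    if device.type == "cuda":
        torch.cuda.reset_peak_memory_stats(device)

    loss = model(batch).sum()  # placeholder loss, stands in for the real losses
    loss.backward()

    if device.type == "cuda":
        torch.cuda.synchronize(device)
        return torch.cuda.max_memory_allocated(device) / 2**20
    return 0.0


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in model; replace with the real network and input shape.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())

for batch_size in (1, 2, 4, 8):
    x = torch.randn(batch_size, 3, 64, 64)
    print(batch_size, f"{peak_memory_mib(model, x, device):.1f} MiB")
```

Comparing the numbers for batch sizes 1 to 8 against the memory available on the card gives a rough idea of which batch size still fits.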