– just leaving this comment in case someone hits the same issue.
my code, which had been running fine, suddenly raised the error:
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution.
it runs fine on a small dataset but raises the error on the larger one.
the code works fine on either dataset with other methods; the newly tested method involves extra gradient computation (more memory).
by default, i set
torch.backends.cudnn.benchmark = True
and torch.backends.cudnn.enabled is True.
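for reference, these are the actual `torch.backends.cudnn` flags; a minimal sketch of how they are set:

```python
import torch

# benchmark=True lets cuDNN time several convolution algorithms and
# cache the fastest one per input shape; this can slightly increase
# memory use on top of the model itself.
torch.backends.cudnn.benchmark = True

# enabled defaults to True; setting it to False falls back to
# PyTorch's native convolution kernels (slower, but a way to rule
# cuDNN in or out when debugging errors like the one above).
torch.backends.cudnn.enabled = True
```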
reinstalling a fresh env or rebuilding pytorch is a lot of work, especially since the code runs fine on the small dataset. pytorch (1.9.0) was installed using conda.
i noticed that gpu memory (16gb) was maxed out after 97 minibatches during training… usually 11gb does the job.
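one way to confirm that kind of climb is to log allocated memory once per minibatch; here is a sketch (the helper name `log_gpu_mem` is my own, not a pytorch API):

```python
import torch

def log_gpu_mem(tag: str) -> float:
    """Print and return allocated GPU memory in GB (0.0 when no GPU is present)."""
    if not torch.cuda.is_available():
        return 0.0
    gb = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{tag}: {gb:.2f} GB allocated (peak {peak:.2f} GB)")
    return gb

# call once per minibatch: a steadily climbing number between otherwise
# identical steps points at a leak rather than a genuinely bigger batch.
log_gpu_mem("after step")
```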
a quick search, plus some answers to a similar tensorflow issue, points toward a memory overload (retinanet - Tensorflow 2.1 Failed to get convolution algorithm. This is probably because cuDNN failed to initialize - Stack Overflow), which matches my case.
i double checked and found a memory leak in my code.
i was tracking a loss without detaching it from the graph, which kept increasing memory usage every minibatch until overload. fixing this solved the problem.
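my exact code isn't shown here, but the leak pattern looks roughly like this (minimal CPU sketch with a hypothetical toy model):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# buggy pattern: summing the loss *tensor* keeps every iteration's
# computation graph reachable, so memory grows with each minibatch.
running_loss_buggy = 0.0
for _ in range(3):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    running_loss_buggy = running_loss_buggy + loss  # still attached to the graph

# fixed pattern: .item() (or .detach()) drops the graph reference,
# so each iteration's graph can be freed after backward().
running_loss = 0.0
for _ in range(3):
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    opt.step()
    running_loss += loss.item()  # plain python float, no graph kept
```

on a GPU the buggy version shows up exactly as described above: allocated memory creeps up every minibatch until cuDNN can no longer find a workable algorithm.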
indeed, the error is vague and does not hint at a memory issue (pytorch - Unable to find a valid cuDNN algorithm to run convolution - Stack Overflow).