– just leaving this comment in case someone hits the same issue.
my code, which had been running fine, suddenly raised the error:
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution.
it runs fine on a small dataset but raises the error on the larger one.
the code works fine on either dataset with other methods; the newly tested method involves extra gradient computation (more memory).
by default, i set
torch.backends.cudnn.benchmark = True
and torch.backends.cudnn.enabled is True.
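for reference, these are the actual `torch.backends.cudnn` flags; a minimal sketch of how they are set:

```python
import torch

# benchmark=True lets cuDNN time several convolution algorithms and
# cache the fastest one per input shape; this can slightly increase
# memory use on top of the model itself.
torch.backends.cudnn.benchmark = True

# enabled defaults to True; setting it to False falls back to
# PyTorch's native convolution kernels (slower, but a way to rule
# cuDNN in or out when debugging errors like the one above).
torch.backends.cudnn.enabled = True
```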
reinstalling a fresh env or rebuilding pytorch is a lot of work, especially since the code runs fine on the small dataset. pytorch (1.9.0) was installed using conda.
i noticed that gpu memory (16gb) was maxed out after 97 minibatches during training… usually 11gb does the job.
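one way to confirm that kind of climb is to log allocated memory once per minibatch; here is a sketch (the helper name `log_gpu_mem` is my own, not a pytorch API):

```python
import torch

def log_gpu_mem(tag: str) -> float:
    """Print and return allocated GPU memory in GB (0.0 when no GPU is present)."""
    if not torch.cuda.is_available():
        return 0.0
    gb = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{tag}: {gb:.2f} GB allocated (peak {peak:.2f} GB)")
    return gb

# call once per minibatch: a steadily climbing number between otherwise
# identical steps points at a leak rather than a genuinely bigger batch.
log_gpu_mem("after step")
```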
a quick search, plus some answers to a similar tensorflow issue, points toward a memory overload (retinanet - Tensorflow 2.1 Failed to get convolution algorithm. This is probably because cuDNN failed to initialize - Stack Overflow), which matches my case.
i double checked and found a memory leak in my code.
i was tracking a loss without detaching it from the graph, which kept increasing memory usage every minibatch until overload. fixing this solved the problem.
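my exact code isn't shown here, but the leak pattern looks roughly like this (minimal CPU sketch with a hypothetical toy model):

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# buggy pattern: summing the loss *tensor* keeps every iteration's
# computation graph reachable, so memory grows with each minibatch.
running_loss_buggy = 0.0
for _ in range(3):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    running_loss_buggy = running_loss_buggy + loss  # still attached to the graph

# fixed pattern: .item() (or .detach()) drops the graph reference,
# so each iteration's graph can be freed after backward().
running_loss = 0.0
for _ in range(3):
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    opt.step()
    running_loss += loss.item()  # plain python float, no graph kept
```

on a GPU the buggy version shows up exactly as described above: allocated memory creeps up every minibatch until cuDNN can no longer find a workable algorithm.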
indeed, the error is vague and does not hint at a memory issue (pytorch - Unable to find a valid cuDNN algorithm to run convolution - Stack Overflow).