cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

Bahar_Nikpour · July 9, 2020, 4:40am

Hello everyone!
I’m training a policy gradient algorithm with three BLSTM networks and evaluating it with leave-one-out on 60 data samples, but I get the error "cuDNN error: CUDNN_STATUS_INTERNAL_ERROR" after 10 iterations of the leave-one-out for loop. I can’t run it on Google Colab either, because of an unknown error (it crashes after several hours). I don’t know whether it’s a memory error or a problem with my code. Training takes a long time on the CPU, so I can’t easily test there whether my code works properly, but I have checked it several times and it seems correct. Could you please give me an idea of what to try? I’m stuck in my research right now. I appreciate your help.

ptrblck · July 10, 2020, 9:31am

I would recommend updating to the latest PyTorch version if you are using an older one.
If you are already using the latest release, could you post an executable code snippet that reproduces the issue, along with your setup (GPU, CUDA version, cuDNN version, PyTorch version)?
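
A minimal sketch for collecting the setup details requested above, so they can be pasted into a reply. The helper name `describe_setup` is hypothetical (not part of PyTorch); it only reads version attributes that PyTorch exposes, and reports `None` for the GPU when CUDA is not available.

```python
import platform

import torch


def describe_setup():
    """Collect the version details usually requested in a bug report.

    `describe_setup` is a hypothetical helper name, not a PyTorch API.
    """
    gpu_available = torch.cuda.is_available()
    return {
        "python": platform.python_version(),
        "pytorch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "cuda": torch.version.cuda,
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        "cudnn": torch.backends.cudnn.version(),
        # Name of the first visible GPU, if any.
        "gpu": torch.cuda.get_device_name(0) if gpu_available else None,
    }


if __name__ == "__main__":
    for key, value in describe_setup().items():
        print(f"{key}: {value}")
```

Running the script on the machine where the error occurs and posting its output covers the GPU, CUDA, cuDNN, and PyTorch versions in one go.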

Also, as a workaround you could try setting torch.backends.cudnn.benchmark = True, or disable cuDNN via torch.backends.cudnn.enabled = False.
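
A rough sketch of how the two workarounds might be applied at the top of a training script, followed by a small bidirectional LSTM forward and backward pass to check that the chosen setting runs. The function name `apply_cudnn_workaround` and the layer and tensor sizes are made up for illustration and are not taken from the original code. Note that disabling cuDNN makes PyTorch fall back to its native kernels, which are typically slower.

```python
import torch
import torch.nn as nn


def apply_cudnn_workaround(disable_cudnn=False):
    """Apply one of the two suggested workarounds (hypothetical helper)."""
    if disable_cudnn:
        # Bypass cuDNN entirely; PyTorch uses its native kernels instead.
        torch.backends.cudnn.enabled = False
    else:
        # Let cuDNN benchmark candidate algorithms and pick one per input shape.
        torch.backends.cudnn.benchmark = True


def smoke_test_blstm(device):
    """Run one forward/backward pass through a small bidirectional LSTM."""
    # Sizes are arbitrary placeholders, not the ones from the original model.
    lstm = nn.LSTM(input_size=8, hidden_size=16,
                   bidirectional=True, batch_first=True).to(device)
    x = torch.randn(4, 10, 8, device=device)
    out, _ = lstm(x)
    out.sum().backward()
    # Bidirectional output has 2 * hidden_size features per time step.
    return tuple(out.shape)


if __name__ == "__main__":
    apply_cudnn_workaround(disable_cudnn=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(smoke_test_blstm(device))
```

If the error disappears with cuDNN disabled, that points toward the cuDNN code path rather than the training logic itself, which is useful information to include in a follow-up post.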