Got CUDNN_STATUS_NOT_INITIALIZED although PyTorch (and Lua Torch!) correctly recognize CUDA & CuDNN


(Amir A. Soltani) #1
OS: Ubuntu 16.04 LTS
PyTorch version: 0.5.0a0+1483bb7 (and also the latest ones from today)
How you installed PyTorch (conda, pip, source): source
Python version: 3.5.2
torch.backends.cudnn.version(): 7104
CUDA version: 9.0.176 (also tested with 9.1.85)
NVIDIA driver version: 390.48 (also tested with 390.67)
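For reference, the PyTorch-side version numbers above can be collected with a snippet like this (a minimal sketch; it runs on any PyTorch build, CPU or GPU):

```python
import torch

# Report the build and runtime versions relevant to this issue.
print("PyTorch:", torch.__version__)             # e.g. 0.5.0a0+1483bb7
print("CUDA (build):", torch.version.cuda)       # e.g. 9.0
print("CUDA available:", torch.cuda.is_available())
print("cuDNN enabled:", torch.backends.cudnn.enabled)
if torch.cuda.is_available():
    print("cuDNN version:", torch.backends.cudnn.version())  # e.g. 7104
    print("Device:", torch.cuda.get_device_name(0))
```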

I compiled PyTorch from source in my Singularity container and tried to run the CIFAR classification code from here. I only move the network and the mini-batches to the GPU before/during training. However, I get the following error on the first forward pass:

RuntimeError: CuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Doing torch.device("cuda:0" if torch.cuda.is_available() else "cpu") correctly gives me cuda:0. I then call net.to(device) and inputs, labels = inputs.to(device), labels.to(device), but the forward pass still fails with that error. I should note that I have been building PyTorch from source exactly the same way for the past couple of months and never encountered any issues: containers built around two months ago still work fine, while all of the new Singularity containers show this problem.
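To make the steps above concrete, here is a minimal sketch of what I am doing (the tiny network below is a placeholder, not the actual CIFAR tutorial model); on the affected containers the error is raised at the net(inputs) call:

```python
import torch
import torch.nn as nn

# Tiny stand-in for the CIFAR tutorial network (placeholder, not the real model).
# The conv layer is what routes through cuDNN on the GPU.
net = nn.Sequential(
    nn.Conv2d(3, 6, 5),
    nn.ReLU(),
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)  # shows cuda:0 on the affected machines

net.to(device)
inputs = torch.randn(4, 3, 32, 32).to(device)  # a dummy mini-batch

# On the broken containers this first forward pass raises:
# RuntimeError: CuDNN error: CUDNN_STATUS_NOT_INITIALIZED
outputs = net(inputs)
print(outputs.shape)
```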

Initially I thought this was only a PyTorch issue, but I noticed that running my Lua Torch code gives the exact same error. I have been using that code for a couple of months with earlier revisions of CuDNN v7 without any issues.

In the issue I opened on the PyTorch GitHub repo, Soumith suggested upgrading the NVIDIA driver. I upgraded to 390.67 (the latest from NVIDIA) but still have this issue. I suspect the new version of CuDNN is causing this, but I'm not entirely sure about that.

I also emailed Felix Abecassis at NVIDIA, described the issue, and asked whether he thinks the recent updates to CuDNN might be causing it. This is his reply:

The versions of cuBLAS and cuDNN for CUDA 9.0 were updated, but I think that’s all. These are the only potential culprits I see on the image side.

Has anyone else encountered this issue? Does anyone know what might be causing it?