CuDNN status internal error with Ubuntu 14.04

hyojin · October 23, 2017, 6:27pm

Hi,

I keep getting CUDNN_STATUS_INTERNAL_ERROR on my Ubuntu 14.04 machine with CUDA 8 (GPUs: Maxwell Titan X and Pascal Titan X; driver version: 384.66).
Rebooting the machine clears the error, but that is inconvenient because the error comes back every so often. Does anyone know why this would happen?

This seems identical to the problem in this post: https://stackoverflow.com/questions/45810356/runtimeerror-cudnn-status-internal-error

Traceback (most recent call last):
  File "train_univ_incremental_itersize.py", line 607, in <module>
    main()
  File "train_univ_incremental_itersize.py", line 341, in main
    train_loss, cur_learning_rate = train(train_loader, net, criterion, optimizer, epoch)
  File "train_univ_incremental_itersize.py", line 399, in train
    myOut = model(input_var)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 259, in __call__
    result = self.forward(*input, **kwargs)
  File "/playpen1/alternet.py", line 109, in forward
    low_out = self.alexnet_lower(img)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 259, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 259, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 254, in forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 52, in conv2d
    return f(input, weight, bias)

richard · October 23, 2017, 6:33pm

What is your PyTorch version (check torch.__version__), and what is your cuDNN version?

hyojin · October 23, 2017, 6:37pm

Pytorch version is 0.2.0_3
CuDNN version is v5

Thanks

richard · October 23, 2017, 6:40pm

Pytorch doesn’t support CuDNN v5. I would recommend upgrading to v6 (https://developer.nvidia.com/cudnn)

hyojin · October 23, 2017, 6:47pm

Oops, I checked the cuDNN version that PyTorch uses via torch.backends.cudnn.version() and it returns 6021, which means it is using cuDNN v6.
So I don’t know what the problem is now…

According to the following post, PyTorch ships its own cuDNN version (6.0)
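
For readers wondering how 6021 maps to "v6": cuDNN releases before 9.x encode their version as major * 1000 + minor * 100 + patch. The helper below is a small sketch of that decoding; the name decode_cudnn_version is made up for this illustration and is not part of PyTorch or cuDNN.

```python
def decode_cudnn_version(version):
    """Split a pre-9.x cuDNN version integer into (major, minor, patch).

    Assumes the encoding major * 1000 + minor * 100 + patch used by
    cuDNN releases before 9.x. Hypothetical helper, not a library API.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# 6021 is the value reported in this thread.
print("cuDNN %d.%d.%d" % decode_cudnn_version(6021))
```

With PyTorch installed, the value returned by torch.backends.cudnn.version() can be passed in directly.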

richard · October 23, 2017, 6:50pm

Interesting. Could you try the following: (any of these will be helpful in identifying the problem!)

1. Build PyTorch from master and see if it still happens.
2. Provide a minimal example that causes the cuDNN error. From the trace, it looks like it is happening in conv2d.

hyojin · October 23, 2017, 7:15pm

1. I can try this. The currently installed version is the official binary release, which I downloaded in August 2017.
2. This I cannot do. I don’t think the error is specific to my code. It does seem to be triggered by conv2d (or other convolution operations), but the same code usually runs without problems, and I have seen the error with several different scripts.

I do stop programs with a keyboard interrupt (Ctrl+C), though. Could that corrupt the cuDNN state? It also sometimes leaves GPU memory allocated.
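
Whether Ctrl+C is actually related to the error is not established in this thread. If you want to rule it out, one option is to catch the interrupt so the script unwinds and exits normally instead of dying mid-step. The sketch below uses only the standard library; run_training, steps, and cleanup are placeholder names for this illustration, not PyTorch APIs.

```python
def run_training(steps, cleanup):
    """Run each callable in `steps`; always call `cleanup`, even on Ctrl+C.

    Returns the number of steps that completed. Placeholder sketch:
    `steps` would be your per-iteration work, `cleanup` whatever
    teardown you need (closing files, saving a checkpoint, ...).
    """
    completed = 0
    try:
        for step in steps:
            step()
            completed += 1
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt in the main thread.
        print("Interrupted after %d step(s); shutting down cleanly." % completed)
    finally:
        cleanup()
    return completed
```

This only makes the Python side exit in an orderly way; it is not a confirmed fix for CUDNN_STATUS_INTERNAL_ERROR.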

hyojin · November 1, 2017, 4:45am

As mentioned in the following post, running sudo rm -r ~/.nv fixes this problem. I’m not sure why it happens, though.
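
On Linux, ~/.nv is typically where the NVIDIA driver keeps its JIT compute cache (under ~/.nv/ComputeCache), and it is normally rebuilt on the next run. For anyone who would rather not delete it outright, here is a sketch that renames the directory to a timestamped backup instead. The function name set_aside_nv_cache and the .bak suffix are choices made for this illustration.

```python
import os
import shutil
import time


def set_aside_nv_cache(home=None):
    """Rename <home>/.nv to a timestamped backup directory.

    Returns the backup path, or None if <home>/.nv does not exist.
    Hypothetical helper; a gentler alternative to `rm -r ~/.nv`.
    """
    home = home or os.path.expanduser("~")
    cache_dir = os.path.join(home, ".nv")
    if not os.path.isdir(cache_dir):
        return None  # nothing to set aside
    backup = "%s.bak-%d" % (cache_dir, int(time.time()))
    shutil.move(cache_dir, backup)
    return backup


if __name__ == "__main__":
    print(set_aside_nv_cache())
```

This runs with the current user's permissions. If ~/.nv is owned by root (for example, because training was run under sudo), the rename will fail and the sudo command above is still needed.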

ho0712 · June 2, 2018, 4:45am

Yes, it works. I had the same error.