I am getting CUDNN_STATUS_INTERNAL_ERROR on my Ubuntu 14.04 machine with CUDA 8 (GPUs: Maxwell Titan X and Pascal Titan X; driver version: 384.66).
The problem goes away when I reboot the machine, but that is inconvenient because the error recurs every so often. Does anyone know why this happens?
Traceback (most recent call last):
  File "train_univ_incremental_itersize.py", line 607, in <module>
    main()
  File "train_univ_incremental_itersize.py", line 341, in main
    train_loss, cur_learning_rate = train(train_loader, net, criterion, optimizer, epoch)
  File "train_univ_incremental_itersize.py", line 399, in train
    myOut = model(input_var)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 259, in __call__
    result = self.forward(*input, **kwargs)
  File "/playpen1/alternet.py", line 109, in forward
    low_out = self.alexnet_lower(img)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 259, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 259, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 254, in forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 52, in conv2d
    return f(input, weight, bias)
Oops, I checked the cuDNN version used by PyTorch via torch.backends.cudnn.version() and it returned 6021, which means it uses cuDNN v6.
I don't know what the problem is now…
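For reference, cuDNN encodes its version as a single integer (major × 1000 + minor × 100 + patch), which is why torch.backends.cudnn.version() returns 6021 for cuDNN v6.0.21. A small helper, written here just to illustrate the decoding, makes the value readable:

```python
def decode_cudnn_version(v):
    """Decode the integer returned by torch.backends.cudnn.version().

    cuDNN encodes its version as major*1000 + minor*100 + patch,
    so 6021 -> (6, 0, 21), i.e. cuDNN v6.0.21.
    """
    major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch

# The value reported above:
print(decode_cudnn_version(6021))  # -> (6, 0, 21)
```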
According to the following post, PyTorch ships its own cuDNN version (6.0).
I can try this. The currently installed version is the official distribution, which I downloaded in August 2017.
This I cannot do. I don't think the problem is specific to my code. It does seem to be triggered by conv2d (or other conv operations), but the same code usually runs without problems, and I have tested with different code as well.
But I do stop programs with a keyboard interrupt (Ctrl+C). Could this mess up the cuDNN state? Sometimes it does not clear up the GPU memory either.
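One pattern that can reduce the fallout from Ctrl+C is to make the training loop run its cleanup even when interrupted. The sketch below uses stub functions in place of a real training step and real cleanup (in an actual script, the cleanup might call torch.cuda.empty_cache() to release cached GPU memory before exiting); the names here are illustrative only:

```python
def run_with_cleanup(train_step, cleanup):
    """Run a loop so that Ctrl+C (KeyboardInterrupt) still triggers cleanup.

    `train_step` and `cleanup` are placeholders: in a real training
    script, cleanup would release GPU resources (e.g. by calling
    torch.cuda.empty_cache()) before the process exits.
    """
    try:
        while True:
            if not train_step():
                break
    except KeyboardInterrupt:
        print("interrupted, cleaning up GPU state...")
    finally:
        cleanup()  # runs on normal exit AND on interrupt

# Demo with stubs: simulate a Ctrl+C on the third step.
steps = []
def step():
    steps.append(1)
    if len(steps) == 3:
        raise KeyboardInterrupt  # simulate Ctrl+C
    return True

cleaned = []
run_with_cleanup(step, lambda: cleaned.append(True))
print(len(steps), cleaned)  # -> 3 [True]
```

Whether this actually prevents the CUDNN_STATUS_INTERNAL_ERROR from recurring is unclear, but it at least guarantees the process does not skip its teardown when interrupted.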