Hi,
I am trying to train GoogLeNet on a server with multiple GPUs, using Python 3.6 and PyTorch 0.3.1.post2. However, I keep getting 'RuntimeError: CUDNN_STATUS_INTERNAL_ERROR'. According to the traceback, the error is raised when conv3d() is called in the first block of the network (pre_layers). The complete error message is below. Could anyone help me figure out what is causing it?
Thanks
  File "main.py", line 220, in class_reg_eval
    outputs = combined_classifier.net(event_data)
  File "/home/junzel2/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/junzel2/Triforce_CaloML/Architectures/GoogLeNet.py", line 142, in forward
    x = self.pre_layers(x)
  File "/home/junzel2/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/junzel2/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/junzel2/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/junzel2/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 388, in forward
    self.padding, self.dilation, self.groups)
  File "/home/junzel2/anaconda2/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 126, in conv3d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR
Updates:
I have tried several fixes I found through Google, including running 'rm -rf ~/.nv' and setting 'torch.backends.cudnn.benchmark = False', among others. None of them worked.
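One way to narrow this down might be to run a single small Conv3d outside the full network, once with cuDNN enabled and once with it disabled. The script below is only a sketch of that idea: the layer sizes and input shape are placeholders, not the ones from GoogLeNet.py, and the helper name run_conv3d is made up for illustration. Setting CUDA_LAUNCH_BLOCKING=1 is an assumption on my part that it will make the traceback point at the call that actually fails.

```python
import os

# Assumption: synchronous kernel launches give a more accurate traceback.
# This has to be set before CUDA is initialised, hence before importing torch.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch
import torch.nn as nn
from torch.autograd import Variable  # needed on 0.3.x; still importable on newer versions


def run_conv3d(use_cudnn, use_cuda):
    """Run one small Conv3d forward pass and return the output size as a tuple.

    Hypothetical helper: shapes and channel counts are placeholders.
    Note that it changes the global cuDNN flags as a side effect.
    """
    torch.backends.cudnn.enabled = use_cudnn
    torch.backends.cudnn.benchmark = False
    conv = nn.Conv3d(1, 4, kernel_size=3, padding=1)  # placeholder layer
    x = Variable(torch.randn(2, 1, 8, 8, 8))          # placeholder (N, C, D, H, W) input
    if use_cuda:
        conv, x = conv.cuda(), x.cuda()
    return tuple(conv(x).size())


if __name__ == "__main__":
    use_cuda = torch.cuda.is_available()
    for use_cudnn in (True, False):
        try:
            print("cudnn enabled =", use_cudnn, "->", run_conv3d(use_cudnn, use_cuda))
        except RuntimeError as err:
            print("cudnn enabled =", use_cudnn, "-> failed:", err)
```

If the pass only fails with cuDNN enabled, that would point at the cuDNN install or GPU state rather than at the model code. If it fails in both cases, the message from the non-cuDNN run may be more informative than CUDNN_STATUS_INTERNAL_ERROR.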