RuntimeError: CUDNN_STATUS_INTERNAL_ERROR when I run the program a second time

Hello,

I’m running a PyTorch program as follows:

python2 crnn_main.py --trainroot="/home/ahmed/Downloads/training_data/output/train" --valroot="/home/ahmed/Downloads/training_data/output/valid" --imgH=32 --cuda --adadelta  --experiment="/home/ahmed/Downloads/training_data/output/"

It works correctly the first time I run it. However, when I try to run it a second time, I get the following error:

Traceback (most recent call last):
  File "crnn_main.py", line 200, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "crnn_main.py", line 183, in trainBatch
    preds = crnn(image)
  File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ahmed/crnn/models/crnn.py", line 78, in forward
    conv = utils.data_parallel(self.cnn, input, self.ngpu)
  File "/home/ahmed/crnn/models/utils.py", line 12, in data_parallel
    output = model(input)
  File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/torch/nn/modules/container.py", line 64, in forward
    input = module(input)
  File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 237, in forward
    self.padding, self.dilation, self.groups)
  File "/home/ahmed/anaconda3/envs/cv/lib/python2.7/site-packages/torch/nn/functional.py", line 40, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

To run it again, I need to reboot my laptop.
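In case it helps narrow things down, here is a minimal check I plan to run between attempts. This is only a sketch, not part of crnn_main.py: the layer and input sizes are placeholders roughly matching what CRNN feeds its first conv layer, and the Variable wrapper is for the 0.x API this environment uses.

import torch
from torch import nn
from torch.autograd import Variable

# One cuDNN convolution on the GPU with a CRNN-like input shape
# (batch, channels, height, width). If this already fails on the
# second run, the full training script isn't needed to trigger it.
conv = nn.Conv2d(1, 64, kernel_size=3, padding=1).cuda()
x = Variable(torch.randn(4, 1, 32, 100).cuda())
y = conv(x)
print(y.size())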

Hi!!!
Did you figure out how to solve this? I’m having a similar issue.
The first time (or first couple of times) I run a script it works fine, but afterwards it throws either the same error as yours (CUDNN_STATUS_INTERNAL_ERROR), a CUDA runtime error (4): unspecified launch failure, or a segmentation fault (core dumped).
I suppose there’s something wrong with my CUDA installation, but I’ve tried reinstalling it several times. I’ve even tried switching back to kernel 4.10.3 instead of 4.13.1 and reinstalling, but so far I cannot get rid of this weird behavior.
I’ve also tried building PyTorch from source and switching to PyTorch 0.1.12.

I suppose your versions are different from mine, but any pointers would be appreciated. I’m running out of ideas.

Btw, I’m using NVIDIA driver 384.69 and CUDA 8.0.61, both installed with the runfiles, on Ubuntu 16.04.3 with kernel 4.13.1, on a system with a single GTX 1080 and a Ryzen 1600 processor.
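One thing I still want to try, in case it helps anyone narrow this down (just an idea, not a confirmed fix): cuDNN can be switched off globally before any CUDA work, so if the crash still happens with it disabled, the problem probably isn’t in cuDNN itself. Assuming torch.backends.cudnn is available in your build:

import torch

# Put this at the very top of the script, before any model or tensor
# touches the GPU; convolutions then use the non-cuDNN fallback kernels.
torch.backends.cudnn.enabled = False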

I ran into the same problem today and found that my GPU didn’t free its memory.
Try this; it works fine for me:
https://discuss.pytorch.org/t/gpu-memory-not-returned/1311
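In case it helps, this is roughly how I checked for leftover processes still holding the GPU (an assumption here is that your nvidia-smi is new enough to support the --query-compute-apps flags):

import subprocess

# List processes that still hold a CUDA context; any PID shown here after
# your training script has exited is a candidate for a manual kill.
out = subprocess.check_output([
    'nvidia-smi',
    '--query-compute-apps=pid,process_name',
    '--format=csv,noheader',
])
print(out.decode() if isinstance(out, bytes) else out)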

How did you know it didn’t free the memory? Only by using nvidia-smi, or is there something else? Because, checking with nvidia-smi, the memory seems to be free after the script terminates.
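Is there a way to see it from inside the process instead? Something like the sketch below is what I had in mind, although I believe these allocator counters only exist in PyTorch releases newer than the ones discussed here, and they only cover memory held by the current process, not memory stranded by a dead one.

import torch

# Allocator statistics for the current process (newer PyTorch only).
print(torch.cuda.memory_allocated())      # bytes currently held by live tensors
print(torch.cuda.max_memory_allocated())  # peak allocation since start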

I have the same problem these days. How did you solve it? Thank you.

In my case it was a hardware-related issue.
My PC had a Ryzen processor affected by a production bug. I haven’t experienced these problems since replacing the CPU.


It would be nice to know which processor you replaced it with!

I got a replacement of the same model. If I recall correctly, there was a specific issue with the first batches of first-generation Ryzen processors.