cuDNN status execution failed and frequent crashes

Forward propagation in my model seems to work fine; however, as soon as I call loss.backward() I get:

  File "PyTorch1.py", line 265, in <module>
    loss.backward()
  File "/home/riccardo/.anaconda3/envs/PyTorch/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/riccardo/.anaconda3/envs/PyTorch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I’m running PyTorch 1.0.0 in a conda environment; torch.backends.cudnn.version() returns 7401, on a GeForce GTX 1080 with CUDA compilation tools release 9.1 and driver version 410.48. I tried replacing my custom loss function with a simple torch.sum() and that doesn’t seem to change anything. This is the code I use to test one forward pass and gradient computation (on random numbers):

import time

import numpy as np
import torch

# Unet and PARAMS are defined earlier in my script
A = Unet(PARAMS).cuda()

start_time = time.time()

# Random input volume and (roughly) binarized fake targets for the custom losses
X = torch.randn(1, 1, 256, 256, 4)
FakeLabel = torch.randn(1, 3, 256, 256, 4)
FakeLabel[FakeLabel > 0.5] = 1
FakeLabel[FakeLabel < 0.6] = 0
FakeMask = torch.randn(1, 1, 256, 256, 4)
FakeMask[FakeMask > 0.5] = 1
FakeMask[FakeMask < 0.6] = 0

X = X.cuda()
optimizer = torch.optim.Adam(A.parameters())

optimizer.zero_grad()
# Mask, Label = A(X)
OUT = A(X)
WW = np.array([1.1, 1.2, 4])  # class weights for the commented-out custom loss
loss = torch.sum(OUT)  # MonoLoss(FakeMask, Mask, 1, 1) + CateLoss(FakeLabel, Label, 1, WW)
loss = loss.cuda()
loss.backward()
optimizer.step()

Could you try to run on the CPU and see what error message it gives you? (It should be more informative.)
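
Something like this, reusing the Unet and PARAMS from your snippet (just a sketch, keeping everything on the CPU):

# Minimal CPU repro sketch -- Unet and PARAMS are your own definitions from above
A = Unet(PARAMS)                    # no .cuda(): model stays on the CPU
X = torch.randn(1, 1, 256, 256, 4)  # same random input, also on the CPU
loss = torch.sum(A(X))
loss.backward()                     # CPU kernels usually raise a more descriptive error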

This helped; at least now I know I have a memory issue, since I’m getting a “not enough memory” error. I wouldn’t have expected that a network taking less than 8 GB of GPU memory for the forward pass would need more than 60 GB for the backward pass. A 3D volume of size 200×200×4 instead of 256×256×4 seems to work on both CPU and GPU.

This is also puzzling me: running on the CPU takes up to 70% of my RAM, so at least 30 GB, while the GPU has absolutely no issue with this computation.

What is going on, and is there a way around this other than rescaling the input?
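
For the smaller 200×200×4 volume that does run, this is roughly how I’m checking where the memory goes (just a rough sketch; as far as I understand these counters only see memory allocated through PyTorch’s caching allocator, not cuDNN’s internal workspace):

OUT = A(X)
print('after forward :', torch.cuda.memory_allocated() / 1024**3, 'GB')
loss = torch.sum(OUT)
loss.backward()
print('peak so far   :', torch.cuda.max_memory_allocated() / 1024**3, 'GB')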

Ok, this gets stranger. I tried setting

torch.backends.cudnn.benchmark = True

and after running this code for the first time my PC just shut down immediately. Next time, provided benchmark mode is True, the code runs fine with the original data size.

It seems I’m getting several crashes like this, usually after I have restarted Spyder’s kernel once or twice with benchmark mode on. However, if I run the code after a fresh restart it seems to work. Sounds like a low-level issue.

It could be that Spyder does not free memory after one run is completed, and so after a while you do not have any free memory left. I don’t work with Spyder, so I don’t know how it operates.

You can try adding torch.cuda.empty_cache() at the end of your script to release memory and see if it helps on subsequent runs.
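
Something along these lines at the end of the script (just a sketch; dropping the references to the big tensors first matters, since empty_cache() only returns blocks that are no longer in use):

del OUT, loss             # drop references to the large tensors / graph first
torch.cuda.empty_cache()  # give the cached blocks back to the driver
print(torch.cuda.memory_allocated(), torch.cuda.memory_cached())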

Doesn’t seem to work. In fact, turning off benchmark mode and going back to the smaller input, I tried iterating over several gradient-descent steps, and eventually I get the same kind of sudden crash. The system’s log file offers no information about it. The computer doesn’t even reboot afterwards unless I physically disconnect it before trying again. torch.cuda.empty_cache() made no difference, nor did running the code from the interpreter.

This is also puzzling me: running on the CPU takes up to 70% of my RAM, so at least 30 GB, while the GPU has absolutely no issue with this computation.

The convolution is implemented differently on the CPU, which is the main reason. Convolution on the CPU corresponds to what you usually find in textbooks for discrete convolution / cross-correlation. On the GPU it uses CUDA/cuDNN, where convolutions are computed with cleverer algorithms, such as Winograd or FFT-based methods (or something newer/more heavily tuned).
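
If you want to check how much of this comes from cuDNN, you could disable it and let PyTorch fall back to its native CUDA kernels (just a sketch, reusing the Unet and PARAMS from the first post; expect it to be noticeably slower):

# Sketch: run the same forward/backward with cuDNN disabled
torch.backends.cudnn.enabled = False   # native CUDA kernels instead of cuDNN
A = Unet(PARAMS).cuda()
X = torch.randn(1, 1, 256, 256, 4).cuda()
loss = torch.sum(A(X))
loss.backward()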
