Why does torch.backends.cudnn only deal with recurrent layers?


I was reading through the documentation and was wondering why there only seems to be a file for rnn.py in the torch.backends.cudnn module. The reason I ask is that, while attempting to create a language model with torch.nn.GRU as one of the layers, I received the following error: “RuntimeError: cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED”. This error goes away when I run the same code with torch.backends.cudnn.enabled = False, but then a new error occurs: “RuntimeError: Input and parameters tensors are not at the same device, found input tensor at cuda:0 and parameters tensor at cpu”. I am using the following GPU setup:

  • RTX 2080ti
  • CUDA Version: ‘9.0.176’ (this is after running torch.version.cuda)
  • cuDNN: 7501 (after checking using torch.backends.cudnn.version())

And using:

  • torch 1.1.0
  • Ubuntu 18.04

The first part of the forward method’s code is the following:

    embed = self.embeddings(x)
    h0 = self.init_hidden(self.batch_size)
    embed = embed.permute(1, 0, 2)  # (batch, seq, emb) -> (seq, batch, emb)
    if self.cuda:  # note: nn.Module.cuda is a bound method, so this is always truthy
        embed = embed.cuda()
        h0 = h0.cuda()
    temp, hidden = self.gru(embed, h0)

Thank you!

cudnn is also used for other layers, e.g. convolutions.
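For instance, the global flags in torch.backends.cudnn affect every cuDNN-backed op, not just the RNN bindings in rnn.py. A minimal sketch (the flags are plain module attributes, so this also runs on a CPU-only build):

```python
import torch

# Disabling cudnn globally falls back to the native implementations
# for convolutions as well as RNNs.
torch.backends.cudnn.enabled = False
conv = torch.nn.Conv2d(3, 8, kernel_size=3)
out = conv(torch.randn(1, 3, 16, 16))  # runs without cuDNN
torch.backends.cudnn.enabled = True    # restore the default
```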

It was a good idea to disable cudnn, since cuDNN errors can sometimes mask the real underlying error.
As you can see in your example, the actual error is a device mismatch.
Make sure all parameters and inputs are on the same device.
While embed and h0 seem to be on the default GPU, self.gru or some other layers might still be on the CPU.
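One way to catch this is to move the whole module to the target device (which moves the GRU's weights along with everything else, instead of moving tensors one by one) and then check that every parameter ended up where you expect. A minimal sketch with a hypothetical stand-in for the model above; it falls back to the CPU when no GPU is present:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Simplified stand-in for the language model in the question.
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(100, 16)
        self.gru = nn.GRU(16, 32)

    def forward(self, x, h0):
        embed = self.embeddings(x).permute(1, 0, 2)  # -> (seq, batch, emb)
        return self.gru(embed, h0)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = ToyModel().to(device)  # moves *all* submodules, incl. the GRU

# Verify nothing was left behind on another device.
devices = {p.device for p in model.parameters()}
assert devices == {device}, f"parameters on multiple devices: {devices}"
```

Calling .to(device) (or .cuda()) on the top-level module is the usual fix; moving only the inputs, as in the snippet above, leaves the GRU's parameters on the CPU.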

Try running your code with CUDA_LAUNCH_BLOCKING=1 python script.py args, as this will point to the line of code that causes the error.


Thank you, this helped greatly in debugging my issue. The problem was that my GRU layer wasn’t being moved to cuda:0 correctly; it remained on the CPU.
