I am trying to run a simple RNN model with an LSTM unit, but I am getting a CUDA error (the same code works fine on the CPU).
The RNN model is shown below:

class BiRNN(nn.Module):   
    def forward(self, x):
        # Set initial states
        h0 = Variable(torch.zeros(self.num_layers*2, x.size(0), self.hidden_size)) # 2 for bidirection 
        c0 = Variable(torch.zeros(self.num_layers*2, x.size(0), self.hidden_size))
        # Forward propagate RNN
        out, _ = self.lstm(x, (h0, c0))
        # Decode hidden state of last time step
        out = self.fc(out[:, -1, :])
        return out

When I run the forward pass:

rnn = BiRNN(input_size, hidden_size, num_layers)
# Loss and Optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=learning_rate)

for epoch in range(100):
    for data in zip(train_loader_x,train_loader_y): 
         #Forward + Backward + Optimize
        #X_train_ = np.reshape(data[0], (data[0].shape[0], data[0].shape[1], 1))# as dimenion is 2 
        X_train_ = Variable(data[0].float().cuda())
        temp = data[1].float().cuda()
        labels = Variable(temp.view(temp.numel(),1))
        outputs = rnn(X_train_)
        loss = criterion(outputs, labels)
    print("Loss is {}".format([0]))

I am getting error:


 File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\spyder\utils\site\", line 705, in runfile
    execfile(filename, namespace)

  File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\spyder\utils\site\", line 102, in execfile
    exec(compile(, filename, 'exec'), namespace)

  File "D:/Users/Saurabh/Documents/thesis/codes/", line 118, in <module>
    outputs = rnn(X_train_)

  File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\torch\nn\modules\", line 491, in __call__
    result = self.forward(*input, **kwargs)

  File "D:/Users/Saurabh/Documents/thesis/codes/", line 96, in forward
    out, _ = self.lstm(x, (h0, c0))

  File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\torch\nn\modules\", line 491, in __call__
    result = self.forward(*input, **kwargs)

  File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\torch\nn\modules\", line 192, in forward
    output, hidden = func(input, self.all_weights, hx, batch_sizes)

  File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\torch\nn\_functions\", line 323, in forward
    return func(input, *fargs, **fkwargs)

  File "D:\Users\Saurabh\Anaconda2\envs\pytorch_gpu1\lib\site-packages\torch\nn\_functions\", line 287, in forward


I tried some of the solutions discussed here 1, 2, but I still got the same error. I also tried running a sample program, and it works fine on the GPU.

I am using Windows 10, an Nvidia GeForce GTX 1050 Ti, CUDA 9.1, and PyTorch 0.4.0.

The whole program is here

h0 and c0 are not moved to the GPU in your model, so the LSTM receives a CUDA input together with CPU initial states. You should try:

h0, c0 = h0.cuda(), c0.cuda()

after you create the variables.
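A device-agnostic variant of the same fix, as a minimal sketch: the initial states are created on whatever device and dtype the input already has, so the same forward runs on CPU and GPU. The constructor below is an assumption (the original post does not show `__init__`), including `batch_first=True` and the single output feature.

```python
import torch
import torch.nn as nn


class BiRNN(nn.Module):
    # Constructor is assumed for illustration; the original post omits it.
    def __init__(self, input_size, hidden_size, num_layers, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, output_size)  # 2 for bidirection

    def forward(self, x):
        # Create the initial states on the same device and dtype as the input.
        shape = (self.num_layers * 2, x.size(0), self.hidden_size)
        h0 = torch.zeros(shape, device=x.device, dtype=x.dtype)
        c0 = torch.zeros(shape, device=x.device, dtype=x.dtype)
        out, _ = self.lstm(x, (h0, c0))
        # Decode the hidden state of the last time step.
        return self.fc(out[:, -1, :])


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
rnn = BiRNN(input_size=1, hidden_size=8, num_layers=2).to(device)
x = torch.randn(4, 10, 1, device=device)  # (batch, seq_len, input_size)
out = rnn(x)
print(out.shape, out.device)
```

Since PyTorch 0.4 the `Variable` wrapper is no longer needed, so plain tensors are used here.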


Ahh, I missed that. Thanks @jet. It’s working great now.

Hi, I get the same error in a slightly different situation. My model was working fine with PyTorch 0.4 and Python 2.7. Now, after updating to Python 3.5 and PyTorch 1.0.1, it gives the same error when I add the RNN layer:

class BidirectionalLSTM(nn.Module):
    # Module to extract BLSTM features from convolutional feature map

    def __init__(self, nIn, nHidden, nOut):
        super(BidirectionalLSTM, self).__init__()

        self.rnn = nn.LSTM(nIn, nHidden, bidirectional=True)
        self.embedding = nn.Linear(nHidden * 2, nOut)

    def forward(self, input):
        recurrent, _ = self.rnn(input)
        T, b, h = recurrent.size()
        t_rec = recurrent.view(T * b, h)

        output = self.embedding(t_rec)  # [T * b, nOut]
        output = output.view(T, b, -1)

        return output

class RecognitionModel(nn.Module):

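    def __init__(self,feature_size = 256,pool_h=32,alphabet_len=38):
        # nn.Module must be initialized before submodules are assigned
        super(RecognitionModel, self).__init__()
        self.alphabet_len = alphabet_len
        nh= 256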
        self.conv1 = nn.Conv2d(feature_size, nh, kernel_size=3, padding=1)
        self.act1 = nn.ReLU()

        self.conv2 = nn.Conv2d(nh, nh, kernel_size=3, padding=1)
        self.act2 = nn.ReLU()

        self.rnn = nn.Sequential(
        BidirectionalLSTM(pool_h*nh, nh,nh),
        BidirectionalLSTM(nh,nh, pool_h*nh))

        self.output = nn.Linear(in_features=nh*pool_h,out_features=self.alphabet_len)

    def forward(self,x):
        x= self.conv1(x)
        x = self.act1(x)

        x = self.conv2(x)
        x = self.act2(x)


        x = self.rnn(x)
        output = self.output(x)
        return output

If I remove the declaration of self.rnn and its call in forward, the model works fine, but adding it back gives the error.
The code used to work with CUDA 8.0 and a GeForce GTX 1080 Ti GPU; I now use CUDA 10.0 with a TITAN RTX.
Any input is appreciated.

I had the same error after upgrading CUDA to version 10.1.

I went to the PyTorch site and selected my installation preferences, which in my case gave the following command:
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

Running the command reported some inconsistencies among the previously installed libraries, but the installation (upgrade) went smoothly and now everything works well.
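For anyone comparing setups after an upgrade like this, a small sketch of how the versions PyTorch actually uses could be printed. Note that the conda and pip binaries ship their own CUDA runtime, so `torch.version.cuda` reflects the build rather than the system-wide toolkit.

```python
import torch

# Versions the installed PyTorch binary was built with.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)        # None on CPU-only builds
print("cuDNN:", torch.backends.cudnn.version())   # None if cuDNN is unavailable
cuda_ok = torch.cuda.is_available()
print("CUDA available:", cuda_ok)
if cuda_ok:
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
```

Posting this output together with the GPU model usually makes it easier for others to spot a binary that does not match the hardware.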


Hi all, I’m posting here because I was also getting a RuntimeError: CUDNN_STATUS_EXECUTION_FAILED, also with an LSTM model, but for a different cause…

In my case, the issue was resolved by reducing the truncated backpropagation length in my LSTM training.

If I had to guess, I think I’m running out of GPU memory when it hits the LSTM call. If I set the truncated backpropagation length even higher, I get a more interpretable CUDA out of memory error instead, presumably because it runs out of memory before it reaches the RNN.

Hopefully this helps somebody out. I would be interested if somebody could confirm my hypothesis or explain what’s actually happening.
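For readers unfamiliar with the setting, here is a minimal sketch of truncated backpropagation through time on toy data. It is not the poster's code: the model, the sizes, and the name `tbptt_len` are all assumptions made for illustration. A shorter `tbptt_len` means fewer time steps of activations are kept for each backward pass, which is why lowering it can reduce GPU memory use.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.MSELoss()

x = torch.randn(2, 40, 3)   # (batch, seq_len, features), toy data
y = torch.randn(2, 40, 1)
tbptt_len = 10              # hypothetical truncation length

state = None
num_chunks = 0
for start in range(0, x.size(1), tbptt_len):
    chunk_x = x[:, start:start + tbptt_len]
    chunk_y = y[:, start:start + tbptt_len]
    out, state = lstm(chunk_x, state)
    loss = criterion(head(out), chunk_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Detach so the next chunk does not backpropagate into this one.
    state = tuple(s.detach() for s in state)
    num_chunks += 1

print("chunks processed:", num_chunks)
```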


This cudnn error might indeed mask an OOM error, so your hypothesis makes sense.
However, I cannot confirm it of course. :wink:
Did you monitor the memory usage using nvidia-smi?
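Besides nvidia-smi, memory can also be checked from inside the training script. A small sketch follows; the helper name `gpu_memory_report` is made up for this example. Keep in mind that nvidia-smi typically shows a larger number, since it also counts the CUDA context and memory cached by PyTorch's allocator.

```python
import torch


def gpu_memory_report(device=0):
    """Return allocated and peak GPU memory in GiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    gib = 1024.0 ** 3
    return {
        "allocated": torch.cuda.memory_allocated(device) / gib,
        "peak_allocated": torch.cuda.max_memory_allocated(device) / gib,
    }


report = gpu_memory_report()
print(report)
```

Calling it right before the LSTM forward pass shows how close the run is to the limit at that point.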

Hi @ptrblck, thanks. Yes, GPU usage is about 10.4 GB out of 11 GB with the working setup, so it definitely makes sense for it to OOM if I increase the batch size or truncation length even slightly.

Thanks for confirming that it is a reasonable hypothesis!

I had a similar RuntimeError, but when I changed the device to “CPU” the error became “Expected float got double”, which means that the model and the inputs did not have the same dtype. After I fixed this, the problem was solved and I could switch back to CUDA.
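A sketch of that debugging approach, with toy shapes chosen for illustration: run the failing call on the CPU to get a readable message, then cast the input to the dtype of the model parameters. The exact exception type and wording differ between PyTorch versions, so both `RuntimeError` and `ValueError` are caught here.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)  # float32 weights
# float64 input, as you would get from a default NumPy array
x = torch.randn(2, 5, 3, dtype=torch.float64)

# On the CPU the dtype mismatch is reported directly.
try:
    lstm(x)
    mismatch_error = None
except (RuntimeError, ValueError) as err:
    mismatch_error = err
print("CPU error:", mismatch_error)

# Fix: cast the input to the parameter dtype (or call lstm.double() instead).
param_dtype = next(lstm.parameters()).dtype
out, _ = lstm(x.to(param_dtype))
print(out.dtype, out.shape)
```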

I am also facing this issue. I did not see it with a small dataset (small in the number of features). When I ran the same code on a dataset with more features, I had to increase the parameter sizes of the RNN and CNN layers I am using, and that is when I got this error.
Setting torch.backends.cudnn.enabled = False resolves it, but execution becomes very slow.
I do not understand how to resolve this error properly, or why this “workaround” works.
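With cuDNN disabled, PyTorch falls back to its own native kernels, which avoids the failing cuDNN call but does not explain the root cause. If the slowdown is the main concern, one option (a sketch with made-up layer sizes, not a guaranteed fix) is to disable cuDNN only around the recurrent layer using the `torch.backends.cudnn.flags` context manager, so the convolutions can still use it.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv = nn.Conv1d(3, 8, kernel_size=3, padding=1).to(device)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)

x = torch.randn(4, 3, 20, device=device)   # (batch, channels, length)
features = conv(x).transpose(1, 2)         # cuDNN may still be used here on GPU

enabled_before = torch.backends.cudnn.enabled
# Disable cuDNN only for the recurrent layer.
with torch.backends.cudnn.flags(enabled=False):
    out, _ = lstm(features)
enabled_after = torch.backends.cudnn.enabled  # restored on exit

print(out.shape, enabled_before, enabled_after)
```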

Try reducing the batch size.

It would be great if PyTorch could distinguish an out-of-memory error from the other causes of CUDNN_STATUS_EXECUTION_FAILED. Is there any ongoing effort on this?
A better error message would be particularly useful, for instance, when doing neural architecture search, where you want to reduce the batch size automatically if an OOM error is detected.
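Until the messages are clearer, an automatic batch-size backoff could be sketched roughly as follows. Everything here is hypothetical: `looks_like_oom`, `run_with_backoff`, and `fake_step` are names invented for the example, and treating CUDNN_STATUS_EXECUTION_FAILED as a memory problem is only a heuristic based on the reports in this thread. The training step is replaced by a toy function so the logic can be run without a GPU.

```python
def looks_like_oom(err):
    """Heuristic (an assumption): treat these messages as memory-related."""
    msg = str(err)
    return "out of memory" in msg or "CUDNN_STATUS_EXECUTION_FAILED" in msg


def run_with_backoff(step_fn, batch_size, min_batch_size=1):
    """Call step_fn(batch_size), halving batch_size after memory-like errors."""
    while True:
        try:
            return batch_size, step_fn(batch_size)
        except RuntimeError as err:
            if not looks_like_oom(err) or batch_size <= min_batch_size:
                raise
            # In real training code, cached blocks could be freed here,
            # e.g. with torch.cuda.empty_cache().
            batch_size = max(min_batch_size, batch_size // 2)


# Toy stand-in for a training step that only fits batches of 16 or fewer.
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA error: out of memory")
    return "ok"


used_batch_size, result = run_with_backoff(fake_step, batch_size=128)
print(used_batch_size, result)
```

One caveat: some CUDA errors may leave the process in a state where retrying is not possible, so in practice the retry might have to happen in a fresh process.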


Great! This error is resolved.

I also encountered this error. My model is a combination of a residual CNN and an RNN.
The conditions under which the error occurs seem to be unpredictable: sometimes it happens in epoch 10, sometimes in epoch 200. It’s definitely not OOM, because PyTorch is using only about 2.8 GB of a total of 11 GB. The error also seems to happen more often when the GPU is doing 3D-related work at the same time, for example when I run my model and render CG simultaneously.

Is the rendering done on the GPU, and if so, did you make sure that the GPU still has enough memory?
If your GPU isn’t running out of memory, could you create the cublas logs using these env vars, post them here and post your setup (CUDA and PyTorch versions, GPU used)?

I have also run into this cuDNN error, but I don’t know how to solve it.

Check if you are running out of memory and if that’s not the case, please post a minimal, executable code snippet to reproduce the issue.

This is my code snippet:

def train_one_epoch(model, dataloader, optimizer, args, epoch):
    tloss = 0.
    tcnt = 0.
    st_time = time.time()
    with tqdm(dataloader, desc='Train Ep '+str(epoch), mininterval=60) as tq:
        for batch in tq:
            pred = model(batch)
            nll_loss = F.nll_loss(pred.view(-1, pred.shape[-1]), batch['tgt_text'].view(-1), ignore_index=0)
            loss = nll_loss

            nn.utils.clip_grad_norm_(model.parameters(), args.clip)

            loss = loss.item()
            if loss!=loss:
                raise ValueError('NaN appear')
            tloss += loss * len(batch['tgt_text'])
            tcnt += len(batch['tgt_text'])
            tq.set_postfix({'loss': tloss/tcnt}, refresh=False)
    print('Train Ep ', str(epoch), 'AVG Loss ', tloss/tcnt, 'Steps ', tcnt, 'Time ', time.time()-st_time, 'GPU', torch.cuda.max_memory_cached()/1024.0/1024.0/1024.0), args.save_model+str(epoch%100))

val_loss = 2**31

def eval_it(model, dataloader, args, epoch):
    global val_loss
    tloss = 0.
    tcnt = 0.
    st_time = time.time()
    with tqdm(dataloader, desc='Eval Ep '+str(epoch), mininterval=60) as tq:
        for batch in tq:
            with torch.no_grad():
                pred = model(batch)
                nll_loss = F.nll_loss(pred.view(-1, pred.shape[-1]), batch['tgt_text'].view(-1), ignore_index=0)
            loss = nll_loss
            loss = loss.item()
            tloss += loss * len(batch['tgt_text'])
            tcnt += len(batch['tgt_text'])
            tq.set_postfix({'loss': tloss/tcnt}, refresh=False)
    print('Eval Ep ', str(epoch), 'AVG Loss ', tloss/tcnt, 'Steps ', tcnt, 'Time ', time.time()-st_time)
    if tloss/tcnt < val_loss:
        print('Saving best model ', 'Ep ', epoch, ' loss ', tloss/tcnt), args.save_model+'best')
        val_loss = tloss/tcnt

def train(gpu,args):
    rank = * args.gpus + gpu
    cuda_string = 'cuda'+':'+str(gpu)
    device = torch.device(cuda_string if torch.cuda.is_available() else 'cpu')
    if args.world_size > 1:
        dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)

Unfortunately, your code is not executable so I cannot help in debugging it.

Oh, sorry, I forgot to post it properly. If you don’t mind, could you help me run this model (GraphWriter) on an RTX 3090 GPU? My task is to evaluate its performance on the GPU.
Link to the model repository: GraphWriter · master · Graph Neural Network for Arch / gnnmark · GitLab