RNN giving CUDNN_STATUS_SUCCESS Error

I have an RNN model that throws this strange exception. I am using PyTorch 0.4 since this is old code that I am still trying to upgrade (I would still really like to have it running for comparison).

I have CUDA 10.1 installed, and it seems only the LSTM-based model is causing issues. Any help would be highly appreciated.

    self.lang_model.cuda()
  File "/home/chinmay/Desktop/setup/3dsis/lib/python3.6/site-packages/torch/nn/modules/module.py", line 258, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/chinmay/Desktop/setup/3dsis/lib/python3.6/site-packages/torch/nn/modules/module.py", line 185, in _apply
    module._apply(fn)
  File "/home/chinmay/Desktop/setup/3dsis/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 112, in _apply
    self.flatten_parameters()
  File "/home/chinmay/Desktop/setup/3dsis/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 105, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: CuDNN error: CUDNN_STATUS_SUCCESS

Could you update to the latest stable PyTorch version (1.5.1) and post a code snippet to reproduce this issue, please?

Hi @ptrblck, the issue does not happen with the latest version of PyTorch; it's just that I have older code that I am still porting. Interestingly, I found an old issue which suggests trying to put the model on CUDA twice, and that got it working for me (5 minutes before your message :slightly_smiling_face: ).

Hi, my problem is similar to the post above. Specifically, my GPU server runs CentOS Linux 7.8.2003 (Core) with PyTorch 0.4.1 and two CUDA versions, 9.0 and 10.1 (10.1 is the one I use). I am reproducing someone else's program (cloned from GitHub), and when I run it the system throws a similar exception.

Thanks for any help!

Could you update to the latest PyTorch release (1.10.0) as 0.4.1 was released in July 2018 and is quite old by now?

The program I want to reproduce requires PyTorch 0.4.1 and other specific environment versions, so I have no choice!

In that case you could disable cuDNN via torch.backends.cudnn.enabled = False and see if your code could run.

You mean I need to set torch.backends.cudnn.enabled = False? Which file do I need to add this line to? Or do I need to modify some global config file, such as my .bashrc? Could you give me a suggestion? Thank you very much, I'll try it now!

You can add it into your main script after importing torch.
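For example, a minimal sketch of the top of the script that builds the model:

    import torch

    # Disable cuDNN globally; RNN layers will then fall back to PyTorch's
    # native CUDA implementation instead of calling into cuDNN.
    torch.backends.cudnn.enabled = False

    # ... build your model and call .cuda() as usual afterwards ...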

@bibo I honestly think catching the exception and putting the model on CUDA twice is perhaps the easiest option. Did it not work for you?

OK, I'll try it now! Thanks a lot!

Thank you! I'd also like to know how I can put the model on CUDA twice. Could you give me the steps?

This is what I used:

# Work around the cuDNN error by trying to put the model on CUDA twice
try:
    self.lang_model.cuda()
except RuntimeError:
    # the first call can fail with the CuDNN error; retrying once succeeds
    self.lang_model.cuda()

Where can I add these lines? Which file should this code go in? Thank you!

Can you please paste the snippet of your main.py file or any other file you are using for the initial setup?

Thanks for your help! The program I am reproducing is from GitHub (GitHub - zhangj111/rencos). Its main.py is run.py in that repository, with the following content:
import os
import sys
import time


def main(opt, mode=2):
    if opt == 'preprocess':
        command = "python preprocess.py -train_src samples/%s/train/train.spl.src \
                   -train_tgt samples/%s/train/train.txt.tgt \
                   -valid_src samples/%s/valid/valid.spl.src \
                   -valid_tgt samples/%s/valid/valid.txt.tgt \
                   -save_data samples/%s/preprocessed/baseline_spl \
                   -src_seq_length 10000 \
                   -tgt_seq_length 10000 \
                   -src_seq_length_trunc %d \
                   -tgt_seq_length_trunc %d" % (lang, lang, lang, lang, lang, src_len, tgt_len)
        os.system(command)
    elif opt == 'train':
        command = "python train.py -word_vec_size 256 \
                   -layers 1 \
                   -rnn_size 512 \
                   -rnn_type LSTM \
                   -global_attention mlp \
                   -data samples/%s/preprocessed/baseline_spl \
                   -save_model models/%s/baseline_spl \
                   -gpu_ranks 0 \
                   -batch_size 32 \
                   -optim adam \
                   -learning_rate 0.001 \
                   -dropout 0 \
                   -encoder_type brnn" % (lang, lang)
        os.system(command)
    elif opt == 'retrieval':
        print('Syntactic level...')
        command1 = "python syntax.py %s" % lang
        os.system(command1)
        print('Semantic level...')
        batch_size = 32 if lang == 'python' else 16
        command2 = "python translate.py -model models/%s/baseline_spl_step_100000.pt \
                    -src samples/%s/train/train.spl.src \
                    -output samples/%s/output/test.out \
                    -batch_size %d \
                    -gpu 0 \
                    -fast \
                    -max_sent_length %d \
                    -refer 0 \
                    -lang %s \
                    -search 2" % (lang, lang, lang, batch_size, src_len, lang)
        os.system(command2)
        command3 = "python translate.py -model models/%s/baseline_spl_step_100000.pt \
                    -src samples/%s/test/test.spl.src \
                    -output samples/%s/test/test.ref.src.1 \
                    -batch_size 32 \
                    -gpu 0 \
                    -fast \
                    -max_sent_length %d \
                    -refer 0 \
                    -lang %s \
                    -search 2" % (lang, lang, lang, src_len, lang)
        os.system(command3)
        print('Normalize...')
        command4 = "python normalize.py %s" % lang
        os.system(command4)
    elif opt == 'translate':
        command = "python translate.py -model models/%s/baseline_spl_step_100000.pt \
                   -src samples/%s/test/test.spl.src \
                   -output samples/%s/output/test.out \
                   -min_length 3 \
                   -max_length %d \
                   -batch_size 32 \
                   -gpu 0 \
                   -fast \
                   -max_sent_length %d \
                   -refer %d \
                   -lang %s \
                   -beam 5" % (lang, lang, lang, tgt_len, src_len, mode, lang)
        os.system(command)
    print('Done.')


if __name__ == '__main__':
    option = sys.argv[1]
    lang = sys.argv[2]
    assert option in ['preprocess', 'train', 'retrieval', 'translate', 'all']
    assert lang in ['python', 'java']
    if lang == 'python':
        src_len, tgt_len = 100, 50
    elif lang == 'java':
        src_len, tgt_len = 300, 30
    else:
        print("Unsupported Programming Language:", lang)
    if option == 'all':
        main('preprocess')
        main('train')
        main('retrieval')
        main('translate')
    else:
        if option == 'translate':
            mode = int(sys.argv[3])
            main(option, mode)
        else:
            main(option)

Hi, the problem I posted before has been solved, thank you for your help! Now I've run into another problem when I translate and generate output: "RuntimeError: CUDA error: out of memory". I think it may be caused by the limited memory of our GPU server's card, so I want to reduce the batch size until it fits into the GPU memory, but I don't know which file I need to change.

Based on your previously posted code, you are setting the batch size via:

-batch_size 32 

so you could reduce this value until the memory usage fits into the available GPU memory.
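For example, in the translate branch of the run.py you posted, a sketch of the change could look like this (8 is just an example value; pick whatever fits your GPU memory):

    # run.py, 'translate' branch: pass a smaller batch size to translate.py
    command = "python translate.py -model models/%s/baseline_spl_step_100000.pt \
               -src samples/%s/test/test.spl.src \
               -output samples/%s/output/test.out \
               -min_length 3 \
               -max_length %d \
               -batch_size 8 \
               -gpu 0 \
               -fast \
               -max_sent_length %d \
               -refer %d \
               -lang %s \
               -beam 5" % (lang, lang, lang, tgt_len, src_len, mode, lang)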

OK, so you mean I can modify run.py and reset the batch size there! Are there any other Python files I need to modify at the same time? Thanks! I'll try it!