Cudnn_status_execution_failed


#1

Hi,

I have a model that is running fine on 3 machines with Titan X. I’ve tried to run it on a Tesla P100-SXM2-16GB and get this error:

Traceback (most recent call last):
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/__main__.py", line 210, in <module>
    main(arg_tools.parse_args(config.Config, FIXED_TYPES, APP_NAME, APP_DESC))
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/__main__.py", line 207, in main
    exp.run()
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/experiment/base_experiment.py", line 431, in run
    self._train(epoch)
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/experiment/experiment_2.py", line 402, in _train
    decodings = self._model(batch.premise, batch.hypothesis, decode_with_tf=(not self._conf.no_teacher_forcing))
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/experiment/models/model_exp_2.py", line 123, in forward
    decoding_1 = self.autoenc(sentence_1, decode_with_tf=decode_with_tf)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/models/autoencoder.py", line 75, in forward
    enc = self.encoder(inputs)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/dgx1/oanuru/experiments/src/main/python/nips2017/models/simple_encoder.py", line 188, in forward
    output, (hidden, cell) = self.rnn(current_inputs, (hidden, cell))
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 91, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 343, in forward
    return func(input, *fargs, **fkwargs)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/autograd/function.py", line 202, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/autograd/function.py", line 224, in forward
    result = self.forward_extended(*nested_tensors)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 285, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 296, in forward
    ctypes.c_void_p(fn.reserve.data_ptr()), fn.reserve.size(0)
  File "/users/oanuru/anaconda3/envs/nips/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 249, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'

I printed the cuda and cudnn versions from pytorch and got:

+========================+
|        GPU INFO        |
+===============+========+
| CUDA version  | 8.0.44 |
| cuDNN version | 6021   |
+===============+========+

Any help would be appreciated.

Thanks!
Oana


#2

Hi Oana,

Can I reproduce this? This looks like a cuDNN bug, and if you can give me a way to reproduce it, NVIDIA will be interested in fixing it.


#3

Hi smth,

That’d be great. I’ve created this repo with the SNLI model from the PyTorch examples and the Dockerfile I used: https://github.com/OanaCamburu/cudnn_issue

I’ve got the following error this time:

55901af1922b:/home/trial$ python train.py 
downloading
extracting
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
glove.42B.300d: 100%|█████████▉| 1.88G/1.88G [02:32<00:00, 8.12MB/s]
extracting word vectors into /home/trial/.data_cache
glove.42B.300d: 1.88GB [03:26, 9.07MB/s]
Loading word vectors from /home/trial/.data_cache/glove.42B.300d.txt
100%|██████████| 1917494/1917494 [06:24<00:00, 4987.26it/s]

  Time Epoch Iteration Progress    (%Epoch)   Loss   Dev/Loss     Accuracy  Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 71, in <module>
    answer = model(batch)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/trial/model.py", line 71, in forward
    prem_embed = self.relu(self.projection(prem_embed))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/trial/model.py", line 12, in forward
    out = super(Bottle, self).forward(input.view(size[0]*size[1], -1))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 54, in forward
    return self._backend.Linear()(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/_functions/linear.py", line 10, in forward
    output.addmm_(0, 1, input, weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /py/conda-bld/pytorch_1493674854206/work/torch/lib/THC/THCBlas.cu:246

Hope this helps.

Thanks!
Oana


#4

Thank you. Reported to NVIDIA, will let you know if they suggest a workaround.


(ngimel) #5

The problem is with your Dockerfile. If you want to run on a P100, you are supposed to do
conda install -y pytorch cuda80 -c soumith
FWIW, you would not have this problem if you used either the devel Dockerfile (in the pytorch root directory) or the runtime Dockerfile (in tools/docker). You’d have to modify either of them to install torchtext and its dependencies. Alternatively, build a base image with the provided Dockerfile and have a separate Dockerfile that references this base image in its FROM line and installs whatever add-ons you need.


(Wasi Ahmad) #6

I am also getting the error, torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'. Did anyone find any workaround to solve the issue?


(Hon Weng Chong) #7

Yes me too, I get this bug sometimes when training to 400 epochs.


(Superhans) #8

I get this error as well from time to time. It’s hard to reproduce it, but it always happens in the middle of execution (so some epochs have correctly run without any trouble).

Never had this error before updating to the latest pytorch.
I’m doing one fairly complicated broadcasting operation, so I will take a look at the “Important Breakages and Workarounds” section in the docs and update.


(Greaber) #9

I encountered this issue too. It seems to be related to passing too large an input to the cuDNN RNN layer (in my case, a GRU). Reducing the batch size made the problem go away. The original batch size would have been far too big anyway: even with a smaller batch size that didn’t trigger this error, I still ran out of memory and had to reduce it further. Also, for me it was not an intermittent problem; it happened every time.


(Solomon K ) #10

This also happens in the Windows port of PyTorch. The only way I found to overcome it when using (in my case) large CNNs is to set torch.backends.cudnn.enabled=False.
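
For reference, a minimal sketch of that workaround. The `torch.backends.cudnn.enabled` flag is PyTorch's own; the helper name `disable_cudnn` and the import guard are only for illustration. Turning cuDNN off makes PyTorch fall back to its other kernels, which is typically slower, so treat this as a workaround rather than a fix.

```python
try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None


def disable_cudnn():
    """Globally turn off cuDNN so PyTorch uses its non-cuDNN kernels.

    Returns the new value of the flag, or None if PyTorch is unavailable.
    (Hypothetical helper; only the flag itself is PyTorch's API.)
    """
    if torch is None:
        return None
    torch.backends.cudnn.enabled = False
    return torch.backends.cudnn.enabled


# Call once, before building the model or running any forward pass.
print("cudnn enabled:", disable_cudnn())
```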


#11

Any update on this issue? I also encountered the same problem. The environment is:
GPU: GTX 1080
Driver Version: 384.111
PyTorch: 0.3.0.post4


(Greaber) #12

The most likely reason you are encountering CUDNN_STATUS_EXECUTION_FAILED is that you have run out of memory. Unfortunately, when you run out of memory on the GPU, you can get one of a few different error messages, so it is not always obvious that this is what happened.

If you are completely sure you are not running out of memory then you should open a new issue and give specifics about your problem.

Note that the OP was (assuming ngimel was correct) encountering a different problem from you relating to using a P100 with the wrong version of PyTorch. Probably no one else who posted in this thread has had the OP’s problem, and it may well be that everyone else who posted in this thread has just run out of memory.
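
One way to test the out-of-memory theory is to retry the failing step with a smaller batch and see whether the error disappears. The sketch below is only an illustration: `step` is a hypothetical callable that runs one training or evaluation step, the list of message fragments is an assumption based on this thread, and it assumes the failure surfaces as a RuntimeError. After a real CUDA failure the process may not be recoverable, so use this as a diagnostic aid rather than production error handling.

```python
# Message fragments to treat as possible out-of-memory symptoms.
# This list is an assumption based on this thread; extend it as needed.
POSSIBLE_OOM_MARKERS = (
    "out of memory",
    "CUDNN_STATUS_EXECUTION_FAILED",
)


def looks_like_oom(exc):
    """Heuristic: does the exception message match a known OOM symptom?"""
    message = str(exc)
    return any(marker in message for marker in POSSIBLE_OOM_MARKERS)


def run_with_smaller_batches(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size while the failure looks like OOM.

    Returns (result, batch_size_that_worked). Errors that do not look like
    OOM, or that persist at min_batch_size, are re-raised unchanged.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError as exc:
            if not looks_like_oom(exc) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in for a real training step: fails for batches larger than 16.
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("8: b'CUDNN_STATUS_EXECUTION_FAILED'")
    return "ok"


result, used = run_with_smaller_batches(fake_step, 64)
print(result, used)
```

If the error persists even at a batch size of 1, memory is probably not the cause and a new issue with specifics is the better route.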


(Usnik Chawla) #13

I had this error while installing:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/local/lib/python3.5/dist-packages/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/usr/local/lib/python3.5/dist-packages/pip/req/req_set.py", line 784, in install
    **kwargs
  File "/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py", line 851, in install
    self.move_wheel_files(self.source_dir, root=root, prefix=prefix)
  File "/usr/local/lib/python3.5/dist-packages/pip/req/req_install.py", line 1064, in move_wheel_files
    isolated=self.isolated,
  File "/usr/local/lib/python3.5/dist-packages/pip/wheel.py", line 345, in move_wheel_files
    clobber(source, lib_dir, True)
  File "/usr/local/lib/python3.5/dist-packages/pip/wheel.py", line 316, in clobber
    ensure_dir(destdir)
  File "/usr/local/lib/python3.5/dist-packages/pip/utils/__init__.py", line 83, in ensure_dir
    os.makedirs(path)
  File "/usr/lib/python3.5/os.py", line 241, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.5/dist-packages/torch'
Please can you help?


#14

Hi greaber, thanks for your reply. I found that this problem happens during evaluation (inference). During inference only one instance is used, which is pretty small. I also monitored the memory, and it did not run out. I am really confused.

I am just following this repo and trying to run the code: seq2seq_batched

The error is:

torch.backends.cudnn.CuDNNError: 8: b'CUDNN_STATUS_EXECUTION_FAILED'


(Greaber) #15

Hi ShilinHE, I can’t tell what the problem is from your description. I guess you will need to report more info to get help. But just monitoring memory usage won’t necessarily tell you if running out of memory is the issue since it could fail on a big allocation or very quickly rather than gradually running out of memory.


(Gayatri) #16

Hi, I had the same issue.

This is because of an error in the evaluate() function: the input_lengths array incorrectly contains the number of characters in the sentence instead of the number of words. Replace

input_lengths = [len(input_seq)]

with

input_lengths = [len(input_seq.split())]

and the error should go away.
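
To make the difference concrete, here is a small sketch with a made-up sentence, assuming the input is tokenized on whitespace as the fix implies. The variable names `char_lengths` and `word_lengths` are only for illustration.

```python
input_seq = "je suis content ."  # hypothetical example sentence

# Original code: counts characters, including spaces.
char_lengths = [len(input_seq)]

# Fixed code: counts whitespace-separated tokens, which should match the
# number of time steps actually fed to the encoder.
word_lengths = [len(input_seq.split())]

print(char_lengths, word_lengths)
```

With the character count, the declared length is much larger than the number of tokens, which is presumably what makes the RNN call fail.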


(Hamid) #17

I had a similar issue during evaluation only, but as @greaber mentioned, it was a memory issue. When I decreased the training batch size, it went away.


(Greaber) #18

@ngimel, it would be awesome if we could reliably get an “out of memory” error when running out of memory on the GPU.


#19

I couldn’t agree more! Sometimes the GPU runs out of memory suddenly, and monitoring it does not catch that.