RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED (batch_norm)

RuntimeError Traceback (most recent call last)
in ()
15 pretrained = pretrained,
16 dataset=dataset,
---> 17 noise_rate=0.45
18 )

in run_vae(train_loader, test_loader, batch_size, epochs, z_dim, est_loader, cls_model, out_dir, select_ratio, pretrained, dataset, noise_rate)
57
58 adjust_learning_rate(optimizers['vae2'], epoch)
---> 59 train(epoch, model, train_loader, optimizers, device)
60
61

in train(epoch, model, train_loader, optimizers, device)
19
20 #forward
---> 21 x_hat1, n_logits1, mu1, log_var1, c_logits1, y_hat1 = vae_model1(data)
22 x_hat2, n_logits2, mu2, log_var2, c_logits2, y_hat2 = vae_model2(data)
23 #calculate acc

/home/ubuntu/anaconda3/envs/idnl/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
---> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)

/media/ubuntu/Storage/Noisy_Labels/IDLN/mylib/models/vae.py in forward(self, x)
32 def forward(self, x):
33 ### trick 1, add a softmax function to logits
---> 34 c_logits = self.y_encoder(x)
35 y_hat = self._y_hat_reparameterize(c_logits)
36 mu, logvar = self.z_encoder(x, y_hat)

/home/ubuntu/anaconda3/envs/idnl/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
---> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)

/media/ubuntu/Storage/Noisy_Labels/IDLN/mylib/models/resnet.py in forward(self, x, revision, output_f)
220
221 def forward(self, x, revision=False, output_f=False):
---> 222 return self._forward_impl(x,revision,output_f)
223
224

/media/ubuntu/Storage/Noisy_Labels/IDLN/mylib/models/resnet.py in _forward_impl(self, x, revision, output_f)
199 # See note [TorchScript super()]
200 x = self.conv1(x)
---> 201 x = self.bn1(x)
202 x = self.relu(x)
203 x = self.maxpool(x)

/home/ubuntu/anaconda3/envs/idnl/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
---> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)

/home/ubuntu/anaconda3/envs/idnl/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
79 input, self.running_mean, self.running_var, self.weight, self.bias,
80 self.training or not self.track_running_stats,
---> 81 exponential_average_factor, self.eps)
82
83 def extra_repr(self):

/home/ubuntu/anaconda3/envs/idnl/lib/python3.6/site-packages/torch/nn/functional.py in batch_norm(input, running_mean, running_var, weight, bias, training, momentum, eps)
1654 return torch.batch_norm(
1655 input, weight, bias, running_mean, running_var,
---> 1656 training, momentum, eps, torch.backends.cudnn.enabled
1657 )
1658

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I am running this on pytorch==1.2 and I have tried both CUDA 9.0 and 10.0; I can't figure out the error.
@ptrblck

Did you ever figure out what was causing the RuntimeError? I'm running into a similar issue when using DataParallel.

Would you post a minimal and executable code snippet to reproduce the issue using the latest stable or nightly release, please?


Thank you for your response (as always).

This only happens when I use DataParallel (model = nn.DataParallel(model)), not on a single device.

If I disable cudnn (torch.backends.cudnn.enabled = False), I don’t get an error either.
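
Roughly the setup (a sketch; resnet18 is only a stand-in with BatchNorm layers so the snippet is self-contained, not my actual network):

import torch
import torch.nn as nn
import torchvision

# Placeholder model containing BatchNorm2d layers.
model = torchvision.models.resnet18().cuda()

model = nn.DataParallel(model)   # the error only shows up with this wrapper

# Workaround: disabling cuDNN globally avoids the error
# (the native batch-norm kernels are used instead).
torch.backends.cudnn.enabled = False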

After setting the environment variable CUDA_LAUNCH_BLOCKING=1, the error appears to occur at this line in co_attention.py (this is not my code): gap_p = self.bn1_p(gap_p).
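
The variable has to be in place before CUDA is initialized; a minimal sketch doing the same thing from Python, in case it's useful:

import os

# Setting this before the first CUDA call makes kernel launches synchronous,
# so the Python traceback points at the actual failing operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set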

gap_p looks like this when it happens.

Note: the error occurs the second time this line is executed, on device:1. No problem the first time.

Traceback:

Traceback (most recent call last):
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/foo_user/.vscode-server/extensions/ms-python.python-2022.16.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/foo_user/.vscode-server/extensions/ms-python.python-2022.16.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/foo_user/.vscode-server/extensions/ms-python.python-2022.16.1/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/foo_user/.vscode-server/extensions/ms-python.python-2022.16.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/foo_user/.vscode-server/extensions/ms-python.python-2022.16.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/foo_user/.vscode-server/extensions/ms-python.python-2022.16.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "train.py", line 302, in <module>
    main()
  File "train.py", line 184, in main
    loss, acc = train(model, clf, dataloader, crit)
  File "train.py", line 43, in train
    out = model(input)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/foo_user/foo/model.py", line 61, in forward
    fusion = self.co_attention(x, y)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/foo_user/foo/model.py", line 102, in forward
    out = self.split_conv(m, p_out, c_out)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/foo_user/foo/co_attention.py", line 137, in forward
    gap_p = self.bn1_p(gap_p)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 168, in forward
    return F.batch_norm(
  File "/home/foo_user/miniconda3/envs/also_foo/lib/python3.9/site-packages/torch/nn/functional.py", line 2438, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Here is the result of $ python -m torch.utils.collect_env

PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.15.5
Libc version: glibc-2.31

Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-1022-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-16GB
GPU 1: Tesla V100-PCIE-16GB

Nvidia driver version: 470.103.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.4.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.4.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.2
[pip3] torch==1.12.1+cu113
[pip3] torchvision==0.13.1+cu113
[conda] cudatoolkit               11.3.1              h9edb442_10    conda-forge
[conda] numpy                     1.23.2                   pypi_0    pypi
[conda] torch                     1.12.1+cu113             pypi_0    pypi
[conda] torchvision               0.13.1+cu113             pypi_0    pypi

For now, I can just disable batch normalization to avoid the problem. The model seems to converge well without it.
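
One way to do that on an existing module (a rough sketch, shown on a small stand-in network rather than my real model) is to swap every BatchNorm2d for an Identity:

import torch.nn as nn

def disable_batchnorm(module: nn.Module) -> nn.Module:
    # Recursively replace BatchNorm2d layers with Identity.
    # This is a crude workaround, not a fix for the cuDNN error itself.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.Identity())
        else:
            disable_batchnorm(child)
    return module

# Example on a stand-in network
net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
net = disable_batchnorm(net)
print(net)  # the BatchNorm2d is now nn.Identity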

Thanks for sharing the repository. Could you let me know how to execute it using random input tensors to reproduce the issue, please?
Also, do you see the issue using the latest PyTorch release with CUDA 11.7?
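
Something along these lines would be enough, i.e. feeding random input tensors through the model (sketched here with a tiny conv + BatchNorm stand-in under DataParallel rather than the repository's actual model):

import torch
import torch.nn as nn

# Stand-in module: the failing op in the traceback is a BatchNorm2d inside a
# DataParallel replica, so a small conv + BN block exercises the same path.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
).cuda()
model = nn.DataParallel(model)

x = torch.randn(16, 3, 224, 224, device="cuda")  # random input tensors
out = model(x)
out.mean().backward()
print(out.shape)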

I'll try to get around to reproducing it with a code snippet. For now I can say that the error also occurs with the nightly build (torch 1.14.0 with cu116).

Getting an environment with cu117 will be difficult, because I depend on Docker containers that somebody else builds. Thanks again.