Running MASK-RCNN 3D demo in my conda env raises "RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED"

ShiChenchen · November 2, 2021, 10:04am

When running the MASK-RCNN 3D demo in my conda env, the following error occurs: "RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED". Full output:
Backend Qt5Agg is interactive backend. Turning interactive mode on.
Opening ZED Camera…
Mask enabled!
Traceback (most recent call last):
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\Lenovo\.vscode\extensions\ms-python.python-2021.10.1365161279\pythonFiles\lib\python\debugpy\__main__.py", line 45, in <module>
    cli.main()
  File "c:\Users\Lenovo\.vscode\extensions\ms-python.python-2021.10.1365161279\pythonFiles\lib\python\debugpy/…\debugpy\server\cli.py", line 444, in main
    run()
  File "c:\Users\Lenovo\.vscode\extensions\ms-python.python-2021.10.1365161279\pythonFiles\lib\python\debugpy/…\debugpy\server\cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "d:\CloudDisk\OneDrive - nuaa.edu.cn\15_code\03_Python_Code\02_ZEDCamera\zed-pytorch-master\zed-pytorch-master\zed_object_detection.py", line 290, in <module>
    main()
  File "d:\CloudDisk\OneDrive - nuaa.edu.cn\15_code\03_Python_Code\02_ZEDCamera\zed-pytorch-master\zed-pytorch-master\zed_object_detection.py", line 254, in main
    prediction = coco_demo.select_top_predictions(coco_demo.compute_prediction(img))
  File "d:\CloudDisk\OneDrive - nuaa.edu.cn\15_code\03_Python_Code\02_ZEDCamera\zed-pytorch-master\zed-pytorch-master\predictor.py", line 246, in compute_prediction
    predictions = self.model(image_list)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\maskrcnn_benchmark-0.1-py3.9-win-amd64.egg\maskrcnn_benchmark\modeling\detector\generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\maskrcnn_benchmark-0.1-py3.9-win-amd64.egg\maskrcnn_benchmark\modeling\rpn\rpn.py", line 155, in forward
    objectness, rpn_box_regression = self.head(features)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\maskrcnn_benchmark-0.1-py3.9-win-amd64.egg\maskrcnn_benchmark\modeling\rpn\rpn.py", line 103, in forward
    t = F.relu(self.conv(feature))
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\torch\nn\modules\conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\Users\Lenovo\.conda\envs\pytorch_test\lib\site-packages\torch\nn\modules\conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn’t trigger the error, please include your original repro script when reporting this issue.

import torch

# Match the backend flags that were active when the error was raised
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True

# Same input shape and 3x3 conv as the failing RPN head layer
data = torch.randn([1, 256, 16, 30], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()  # force any asynchronous CUDA error to surface here

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0000029E84679850
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 1, 256, 16, 30,
strideA = 122880, 480, 30, 1,
output: TensorDescriptor 0000029E8467AC70
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 1, 256, 16, 30,
strideA = 122880, 480, 30, 1,
weight: FilterDescriptor 0000029E9D71DF70
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 256, 3, 3,
Pointer addresses:
input: 0000000BA9278000
output: 0000000BA92F0000
weight: 0000000B1C5C0000
Forward algorithm: 1
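The descriptor dump can be sanity-checked by hand. The sketch below (the helper names are hypothetical, not part of PyTorch) applies the standard Conv2d output-size formula, H_out = floor((H + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1, and computes row-major (contiguous) NCHW strides. Both agree with the dump: a 3x3 convolution with padding 1 keeps the 16 x 30 spatial size, and strideA = 122880, 480, 30, 1 is just a contiguous tensor. The weight descriptor 256, 256, 3, 3 is (out_channels, in_channels, kH, kW). This suggests, though it does not prove, that the failure is not caused by an unusual tensor shape or layout.

```python
def conv2d_output_size(size, kernel, padding=0, stride=1, dilation=1):
    """Output length of one spatial dimension of a 2-D convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def contiguous_strides(shape):
    """Element strides of a row-major (C-contiguous) tensor such as NCHW."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


n, c, h, w = 1, 256, 16, 30
h_out = conv2d_output_size(h, kernel=3, padding=1)
w_out = conv2d_output_size(w, kernel=3, padding=1)
print((n, 256, h_out, w_out))            # (1, 256, 16, 30)
print(contiguous_strides([n, c, h, w]))  # [122880, 480, 30, 1]
```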

I have tried several times to solve the problem, without success. Here is what I tried:

1. Disabled cuDNN in "predictor.py":
   torch.backends.cudnn.enabled = False

2. Checked the CUDA version on the host against the one in the conda env, and reinstalled cudatoolkit 11.3.1. My CUDA and conda installation info is listed below:

My system info:

OS: Windows 10
CUDA version: 11.3.109
GPU: RTX 3080 16 GB

My development environment:

conda env name: pytorch_test
conda package list:
cudatoolkit 11.3.1
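Version reports like this are easier to compare when they include the versions PyTorch itself was built with. PyTorch ships an official report script, python -m torch.utils.collect_env; the sketch below is a hypothetical minimal alternative (collect_env_info is an illustrative name). Note that PyTorch binaries typically bundle their own CUDA runtime and cuDNN, so torch.version.cuda and the cuDNN version printed here can differ from the host toolkit (11.3.109), while a locally compiled extension such as maskrcnn_benchmark is usually built against the host toolkit.

```python
import importlib.util
import platform


def collect_env_info():
    """Gather versions relevant to a CUDA/cuDNN issue (hypothetical helper)."""
    info = {"python": platform.python_version(), "torch": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch not installed in this interpreter
    import torch

    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda          # CUDA version PyTorch was built with
    info["cudnn"] = torch.backends.cudnn.version()   # None when cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```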

Could you please give me some advice?

ptrblck · November 3, 2021, 12:42am

What kind of error did you see when disabling cuDNN? If another error was raised, this could mean that cuDNN would also be running into this issue, although it’s not necessarily the case.
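One way to answer this question is to run the convolution from the repro snippet twice, once with cuDNN enabled and once with it disabled, and compare the outcomes. A minimal sketch, assuming a CUDA build of PyTorch; run_repro_conv is a hypothetical helper name, and it returns None when PyTorch or a GPU is unavailable.

```python
import importlib.util


def run_repro_conv(use_cudnn):
    """Run the repro conv forward/backward with cuDNN on or off (hypothetical helper).

    Returns None if PyTorch or CUDA is unavailable, "ok" on success,
    otherwise the RuntimeError message.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = use_cudnn
    try:
        data = torch.randn(1, 256, 16, 30, device="cuda", requires_grad=True)
        net = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda()
        out = net(data)
        out.backward(torch.randn_like(out))
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return "ok"
    except RuntimeError as exc:
        return str(exc)
    finally:
        torch.backends.cudnn.enabled = previous  # restore the global flag


if __name__ == "__main__":
    for flag in (True, False):
        print(f"cuDNN enabled={flag}: {run_repro_conv(flag)}")
```

Because a failed kernel can leave the CUDA context unusable, running each setting in a separate Python process gives a cleaner comparison.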

ShiChenchen · November 6, 2021, 6:38am

Thank you. After disabling cuDNN in the main function, the program works fine.
At first I had only disabled cuDNN in the module section.
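For reference, a minimal sketch of disabling cuDNN once at the entry point, before the model is built or any convolution runs. torch.backends.cudnn.enabled is a process-wide flag; main() here is a hypothetical placeholder for the demo's own main function in zed_object_detection.py, and the torch import is guarded only so the snippet also runs where PyTorch is absent.

```python
import importlib.util


def disable_cudnn_globally():
    """Turn cuDNN off for the whole process if PyTorch is installed.

    Returns the flag value after the change, or None when torch is missing.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    torch.backends.cudnn.enabled = False  # global flag, affects every later conv
    return torch.backends.cudnn.enabled


def main():
    # Hypothetical placeholder for the demo's main(): open the ZED camera,
    # build the Mask R-CNN predictor, and run inference in a loop.
    pass


if __name__ == "__main__":
    disable_cudnn_globally()  # run before any convolution is executed
    main()
```

With cuDNN disabled, PyTorch falls back to its native convolution kernels, which are typically slower, so this is a workaround rather than a fix for the underlying cuDNN failure.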

ptrblck · November 6, 2021, 6:50am

Could you post an executable code snippet to reproduce the issue, please?
I thought the issue might be related to this one, which I’m already debugging, but since disabling cuDNN seems to work I doubt it.