Need help to solve RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi everyone. I am trying to run FlowNet2 on custom data (CUDA 10.1, torch 1.4, Python 3.7), but I am getting this error:

Traceback (most recent call last):
  File "main.py", line 425, in <module>
    stats = inference(args=args, epoch=epoch - 1, data_loader=inference_loader, model=model_and_loss, offset=offset)
  File "main.py", line 383, in inference
    losses, output = model(data[0], target[0], inference=True)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 181, in forward
    output = self.model(data)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ice-kms1/DATA 1/javlonbek/ByteTrack-main/ByteTrack/flownet2-pytorch/models.py", line 129, in forward
    flownetc_flow2 = self.flownetc(x)[0]
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ice-kms1/DATA 1/javlonbek/ByteTrack-main/ByteTrack/flownet2-pytorch/networks/FlowNetC.py", line 75, in forward
    out_conv1a = self.conv1(x1)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/ice-kms1/anaconda3/envs/flownet2/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Could you update PyTorch to the latest stable or nightly release and check if you are seeing the same issue?

Thank you for your reply! I created a new environment (CUDA 10.1, torch 1.5.1) and made some modifications based on GitHub issues. The error is gone, but now I get a "Segmentation fault", and I don't think it is caused by running out of memory. There are some related issues, and the suggested solutions require CUDA 9 and PyTorch 1. I can't use those since I have an RTX 3090 and Ubuntu 20.04. I don't know what to do now…

python main.py --inference --model FlowNet2 --save_flow --save ./output --inference_dataset ImagesFromFolder --inference_dataset_root ./data/frames_seq/0_292_0_020501779/0 --resume ./models/FlowNet2_checkpoint.pth.tar
Parsing Arguments
[0.003s] batch_size: 8
[0.003s] crop_size: [256, 256]
[0.003s] fp16: False
[0.003s] fp16_scale: 1024.0
[0.003s] gradient_clip: None
[0.003s] inference: True
[0.004s] inference_batch_size: 1
[0.004s] inference_dataset: ImagesFromFolder
[0.004s] inference_dataset_iext: jpg
[0.004s] inference_dataset_replicates: 1
[0.004s] inference_dataset_root: ./data/frames_seq/0_292_0_020501779/0
[0.004s] inference_n_batches: -1
[0.004s] inference_size: [-1, -1]
[0.004s] inference_visualize: False
[0.004s] log_frequency: 1
[0.004s] loss: L1Loss
[0.004s] model: FlowNet2
[0.004s] model_batchNorm: False
[0.004s] model_div_flow: 20.0
[0.004s] name: run
[0.004s] no_cuda: False
[0.004s] number_gpus: 2
[0.004s] number_workers: 8
[0.004s] optimizer: Adam
[0.004s] optimizer_amsgrad: False
[0.004s] optimizer_betas: (0.9, 0.999)
[0.004s] optimizer_eps: 1e-08
[0.004s] optimizer_lr: 0.001
[0.004s] optimizer_weight_decay: 0
[0.004s] render_validation: False
[0.004s] resume: ./models/FlowNet2_checkpoint.pth.tar
[0.004s] rgb_max: 255.0
[0.004s] save: ./output
[0.004s] save_flow: True
[0.004s] schedule_lr_fraction: 10
[0.004s] schedule_lr_frequency: 0
[0.004s] seed: 1
[0.004s] skip_training: False
[0.004s] skip_validation: False
[0.004s] start_epoch: 1
[0.004s] total_epochs: 10000
[0.004s] train_n_batches: -1
[0.004s] training_dataset: MpiSintelFinal
[0.004s] training_dataset_replicates: 1
[0.004s] training_dataset_root: ./MPI-Sintel/flow/training
[0.004s] validation_dataset: MpiSintelClean
[0.004s] validation_dataset_replicates: 1
[0.004s] validation_dataset_root: ./MPI-Sintel/flow/training
[0.004s] validation_frequency: 5
[0.004s] validation_n_batches: -1
[0.006s] Operation finished

Source Code
Current Git Hash: b'2e9e010c98931bc7cef3eb063b195f1e0ab470ba'

Initializing Datasets
[0.004s] Inference Dataset: ImagesFromFolder
[0.009s] Inference Input: [3, 2, 320, 576]
[0.041s] Inference Targets: [3, 2, 320, 576]
[0.041s] Operation finished

Building FlowNet2 model
[1.342s] Effective Batch Size: 16
[1.343s] Number of parameters: 162518834
[1.343s] Initializing CUDA
Segmentation fault

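PyTorch 1.5.1 is also an old release, and its CUDA 10.1 binaries predate your RTX 3090 (Ampere GPUs need a build against CUDA 11 or newer), so update to 2.2.2 or a nightly.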
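
After updating, it may help to confirm that the installed build actually ships kernels for your GPU. Below is a minimal sketch, not an official check: `parse_sm`, `arch_supported`, and `report` are hypothetical helper names, and the rule used (same major compute capability, minor not newer than the device) is a simplification that ignores PTX forward compatibility. It assumes `torch.cuda.get_arch_list()` is present, which is not the case on very old releases, so the code guards for that.

```python
import re


def parse_sm(arch):
    """Parse an 'sm_XY' string into (major, minor); return None for anything else."""
    m = re.match(r"^sm_(\d+)(\d)[a-z]*$", arch)
    return (int(m.group(1)), int(m.group(2))) if m else None


def arch_supported(capability, arch_list):
    """Simplified rule: some sm_XY entry has the same major and a minor <= the device's."""
    major, minor = capability
    for arch in arch_list:
        parsed = parse_sm(arch)
        if parsed and parsed[0] == major and parsed[1] <= minor:
            return True
    return False


def report():
    """Print the PyTorch/CUDA build info and whether each visible GPU is covered."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available")
        return
    # get_arch_list() is missing on very old releases, hence the guard.
    archs = torch.cuda.get_arch_list() if hasattr(torch.cuda, "get_arch_list") else []
    print("compiled architectures:", archs or "unknown")
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        if not archs:
            status = "unknown"
        else:
            status = "covered" if arch_supported(cap, archs) else "NOT covered"
        print(i, torch.cuda.get_device_name(i), "capability", cap, "->", status)


if __name__ == "__main__":
    report()
```

An RTX 3090 reports compute capability (8, 6). If the architecture list has no `sm_8x` entry, the build does not target that GPU, and failures like the ones above would not be surprising.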