Runtime exception when using DataParallel on Nvidia A100, PyTorch 1.12

Hi,

I get the following error when trying to train on 2 GPUs (A100s):

File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 2.
Original Traceback (most recent call last):
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/scratch4/samo4615/Documents/codeworks/lidar_mtl/kitti_3d_det/model_custom.py", line 657, in forward
x_rot = self.bn2_rot(self.a2_rot(self.c2_rot(x_rot)))
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/scratch4/samo4615/miniconda3/envs/torchcon38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn’t trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 64, 576, 384], dtype=torch.float, device='cuda', requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(64, 64, kernel_size=[3, 3], padding=[2, 2], stride=[1, 1], dilation=[2, 2], groups=1)
net = net.cuda().float().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
memory_format = ChannelsLast
data_type = CUDNN_DATA_FLOAT
padding = [2, 2, 0]
stride = [1, 1, 0]
dilation = [2, 2, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7f10015e9880
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 4, 64, 576, 384,
strideA = 14155776, 1, 24576, 64,
output: TensorDescriptor 0x7f10015eb3d0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 4, 64, 576, 384,
strideA = 14155776, 1, 24576, 64,
weight: FilterDescriptor 0x7f100165a1f0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NHWC
nbDims = 4
dimA = 64, 64, 3, 3,
Pointer addresses:
input: 0x7f0e29800000
output: 0x7f0dd5800000
weight: 0x7f14efa1a600
Forward algorithm: 1

The model runs fine on a single GPU. The input shape is 16x3x576x320.

Please help.
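For context, the two-GPU run wraps the model in nn.DataParallel roughly like the sketch below; the layers, device ids, and input tensor here are illustrative placeholders, not the actual training code.

import torch
import torch.nn as nn

# Stand-in layers so the sketch is self-contained; the real model lives in model_custom.py.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(),
    nn.BatchNorm2d(64),
)

# The traceback reports replica 1 on device 2, so the run presumably uses
# something like device_ids=[1, 2]; the module must live on device_ids[0].
model = nn.DataParallel(model, device_ids=[1, 2]).to("cuda:1")

x = torch.randn(16, 3, 576, 320, device="cuda:1")  # reported input shape
out = model(x)  # DataParallel splits the batch across the listed GPUs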

Could you check if you are running out of memory and if this issue is still reproducible in the latest PyTorch release?
I cannot reproduce the cuDNN error using the posted code snippet on an A100.
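A quick way to rule out memory pressure would be to log the per-device usage right before the failing forward pass, along the lines of this minimal sketch (it assumes the devices visible to the process are the ones DataParallel uses):

import torch

# Print memory stats for every visible GPU; running close to the memory limit
# can show up as a cuDNN execution failure rather than a plain OOM error.
for dev in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(dev) / 1024**2
    reserved = torch.cuda.memory_reserved(dev) / 1024**2
    total = torch.cuda.get_device_properties(dev).total_memory / 1024**2
    print(f"cuda:{dev}: allocated {alloc:.0f} MiB, reserved {reserved:.0f} MiB, total {total:.0f} MiB")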