Today I want to use two GPUs (cuda:0 and cuda:1) with DataParallel (DP) to train a torch.float16 model.
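Roughly, the relevant part of my setup looks like the following minimal sketch (not my full training script; the model here is a placeholder built from the same kind of Conv2d layer that appears in the error log below):

import torch
import torch.nn as nn

# placeholder network; my real model also contains Conv2d layers like this one
model = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=5, stride=2),
    nn.ReLU(),
)
model = model.cuda().half()                         # convert weights to float16
model = nn.DataParallel(model, device_ids=[0, 1])   # DP over cuda:0 and cuda:1

data = torch.randn(4, 32, 119, 159, dtype=torch.half, device='cuda:0')
out = model(data)            # forward is split across the two GPUs
out.sum().backward()         # the cuDNN error is raised around here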
But I get the following error:
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 32, 119, 159], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(32, 64, kernel_size=[5, 5], padding=[0, 0], stride=[2, 2], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [2, 2, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 000001E834DE5180
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 4, 32, 119, 159,
strideA = 605472, 18921, 159, 1,
output: TensorDescriptor 000001E834DE3AC0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 4, 64, 58, 78,
strideA = 289536, 4524, 78, 1,
weight: FilterDescriptor 000001E8349A6610
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 64, 32, 5, 5,
Pointer addresses:
input: 0000002363108000
output: 00000023637A8800
weight: 0000002305E01600
Additional pointer addresses:
grad_output: 00000023637A8800
grad_input: 0000002363108000
Backward data algorithm: 1
When I use only one GPU (cuda:0), this error does not occur. Why?
So, how can I use multiple GPUs to train a float16 model?