CUDNN_STATUS_EXECUTION_FAILED from torch.float16

I am trying to train a torch.float16 model on multiple GPUs (cuda:0, cuda:1) with DP (DataParallel).

However, I get the following error:


RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 32, 119, 159], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(32, 64, kernel_size=[5, 5], padding=[0, 0], stride=[2, 2], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [2, 2, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 000001E834DE5180
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 4, 32, 119, 159, 
    strideA = 605472, 18921, 159, 1, 
output: TensorDescriptor 000001E834DE3AC0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 4, 64, 58, 78, 
    strideA = 289536, 4524, 78, 1, 
weight: FilterDescriptor 000001E8349A6610
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 64, 32, 5, 5, 
Pointer addresses: 
    input: 0000002363108000
    output: 00000023637A8800
    weight: 0000002305E01600
Additional pointer addresses: 
    grad_output: 00000023637A8800
    grad_input: 0000002363108000
Backward data algorithm: 1

When I use only cuda:0, this error does not occur.

Why?

So how can I train a float16 model on multiple GPUs?
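
To help narrow this down, the repro from the error message can be run on each visible GPU separately, to see whether the failure is tied to one device rather than to float16 itself. This is only a rough sketch: the helper names run_conv_repro and check_all_gpus are made up for illustration, and it assumes the failure can be reproduced outside the DP wrapper.

```python
import torch


def run_conv_repro(device: torch.device) -> str:
    """Run the fp16 Conv2d forward/backward from the error message on one device."""
    data = torch.randn(4, 32, 119, 159, dtype=torch.half,
                       device=device, requires_grad=True)
    # Same layer shape as in the generated repro (padding 0, dilation 1, groups 1).
    net = torch.nn.Conv2d(32, 64, kernel_size=5, stride=2).to(device).half()
    try:
        out = net(data)
        out.backward(torch.randn_like(out))
        torch.cuda.synchronize(device)
        return "ok"
    except RuntimeError as exc:
        return f"failed: {exc}"


def check_all_gpus() -> dict:
    """Return a {device name: status} map; empty when no CUDA device is visible."""
    results = {}
    for index in range(torch.cuda.device_count()):
        results[f"cuda:{index}"] = run_conv_repro(torch.device("cuda", index))
    return results


if __name__ == "__main__":
    # Same backend flags as in the generated repro.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.allow_tf32 = True
    for name, status in check_all_gpus().items():
        print(name, status)
```

Note that a CUDA error can leave the whole process in a bad state, so once one device fails, the results for later devices may not be reliable. Running the script once per device (for example by restricting CUDA_VISIBLE_DEVICES to a single GPU) is probably the safer check.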