nn.DataParallel with multiple outputs, weird gradient result

I have a simple network which produces a tuple in its forward() call, like below:

import torch
import torch.nn as nn
import torch.nn.functional as F

class NN4(nn.Module):
    def __init__(self):
        super(NN4, self).__init__()
        self.fc1 = nn.Linear(8, 4)
        self.fc21 = nn.Linear(4, 1)

    def forward(self, x):
        x = F.selu(self.fc1(x))
        x1 = torch.sigmoid(self.fc21(x))
        # return x, x   # grad of fc1 is not None
        return x, x1  # grad of fc1 is None under nn.DataParallel

Then I test this network with the following code (note that in the first half NN4 is wrapped with nn.DataParallel, while in the second half it is not):

DEVICE = torch.device('cuda:0')

def test_NN4():
    images = torch.randn(4, 8).to(DEVICE)
    fimages = torch.randn(4, 8).to(DEVICE)

    # Case 1: NN4 wrapped with nn.DataParallel
    D = NN4().to(DEVICE)
    D = nn.DataParallel(D)
    D.zero_grad()

    d_loss = D(images)[0].mean() - D(fimages)[0].mean()
    print('d_loss: -->', d_loss)
    d_loss.backward()

    print('-------->>>')
    aaa = list(D.named_parameters())
    print(aaa[0][0])
    print(aaa[0][1].grad)

    # Case 2: plain NN4, not wrapped
    D2 = NN4().to(DEVICE)
    D2.zero_grad()

    d2_loss = D2(images)[0].mean() - D2(fimages)[0].mean()
    print('d2_loss: -->', d2_loss)
    d2_loss.backward()

    print('-------->>>')
    aaa2 = list(D2.named_parameters())
    print(aaa2[0][0])
    print(aaa2[0][1].grad)

I run this code with two GPUs, ids [0, 1] (i.e. CUDA_VISIBLE_DEVICES=0,1 python test.py), and the result is:

d_loss: --> tensor(0.0098, device='cuda:0', grad_fn=<SubBackward0>)
-------->>>
module.fc1.weight
None
d2_loss: --> tensor(-0.0592, device='cuda:0', grad_fn=<SubBackward0>)
-------->>>
fc1.weight
tensor([[ 0.2356, -0.1217,  0.0502, -0.2524,  0.1167,  0.0295,  0.1135,  0.1423],
        [ 0.3054, -0.2515,  0.0074, -0.2933,  0.1163,  0.0952,  0.1906,  0.2290],
        [ 0.3524, -0.1401,  0.0276, -0.2763,  0.1148,  0.0307,  0.3021,  0.1994],
        [ 0.2883, -0.2090, -0.0485, -0.1937,  0.0650,  0.0781,  0.3529,  0.2433]],
       device='cuda:0')
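
Side note: nn.DataParallel keeps the original model under .module, so the same gradient can also be read directly from the underlying NN4 instance. A minimal cross-check that could go inside test_NN4() right after d_loss.backward():

    # inside test_NN4(), after d_loss.backward()
    print(D.module.fc1.weight.grad)  # does the underlying module show a grad, or None as well?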

I expect the gradient of fc1 under nn.DataParallel to be a valid tensor rather than None (just like when the model is not wrapped in nn.DataParallel). The strange thing is that if I switch the output of the NN4 forward() call to

return x, x

then the result is OK:

d_loss: --> tensor(0.1056, device='cuda:0', grad_fn=<SubBackward0>)
-------->>>
module.fc1.weight
tensor([[ 0.1904,  0.0461, -0.2445,  0.0530, -0.0502,  0.0738,  0.0506, -0.1648],
        [ 0.2761,  0.1007, -0.2761,  0.0436, -0.0724,  0.0660,  0.0267, -0.1630],
        [ 0.2097,  0.0416, -0.2006,  0.0426, -0.0496,  0.0706, -0.0654, -0.1262],
        [ 0.1848,  0.0789, -0.3042,  0.0943, -0.0567,  0.1234, -0.0341, -0.2012]],
       device='cuda:0')
d2_loss: --> tensor(0.1202, device='cuda:0', grad_fn=<SubBackward0>)
-------->>>
fc1.weight
tensor([[ 0.1592,  0.0493, -0.2680,  0.0611, -0.0546,  0.1066,  0.0206, -0.1425],
        [ 0.2109,  0.0573, -0.2443,  0.0503, -0.0348,  0.0786,  0.0665, -0.2017],
        [ 0.2091,  0.0704, -0.3194,  0.0410, -0.0809,  0.1483, -0.0061, -0.1214],
        [ 0.2173,  0.0366, -0.2628,  0.0207, -0.0380,  0.1162,  0.0384, -0.1626]],
       device='cuda:0')

Can anybody explain this? What is the correct way to return a tuple from an nn.Module? I am using PyTorch 1.0.0.
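
One more diagnostic I plan to try (just a sketch, not verified): add a zero-weighted term over the second tuple element so that both outputs participate in backward(), and see whether the fc1 gradient comes back under DataParallel. It assumes the NN4 class and DEVICE defined above.

import torch
import torch.nn as nn

# Sketch only -- assumes NN4 and DEVICE from the snippets above.
D = nn.DataParallel(NN4().to(DEVICE))
D.zero_grad()

images = torch.randn(4, 8).to(DEVICE)
fimages = torch.randn(4, 8).to(DEVICE)

out_r = D(images)
out_f = D(fimages)

# Same loss as before, plus a zero-weighted term that touches the second
# tuple element, so every output of forward() is used in the loss.
d_loss = out_r[0].mean() - out_f[0].mean() \
         + 0.0 * (out_r[1].sum() + out_f[1].sum())
d_loss.backward()

name, param = list(D.named_parameters())[0]
print(name)        # module.fc1.weight
print(param.grad)  # still None, or a valid tensor?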

I ran the same code with PyTorch v0.4, and the result is normal for both

return x, x

and

return x, x1

Gradients are all properly calculated, rather than returning None, for the nn.DataParallel-wrapped NN4.
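
For anyone trying to reproduce this, the version and visible GPU count in each run can be confirmed with:

import torch
print(torch.__version__)          # 1.0.0 in the failing run, 0.4.x in the working one
print(torch.cuda.device_count())  # should report 2 with CUDA_VISIBLE_DEVICES=0,1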