Questions about loss and backward process in Dataparallel

Hi,

I am trying to understand how data parallel distributes loss and gradient back to each device for a backward pass after the forward pass.

I did a little experiment.

import numpy as np
import torch
from torch.autograd import Variable


class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        modules = [
            torch.nn.Linear(10, 3),
            torch.nn.Linear(3, 4),
            torch.nn.Linear(4, 5),
        ]
        self.net = torch.nn.ModuleList(modules)

    def forward(self, inputs):
        # pass the input through each linear layer in turn
        for layer in self.net:
            inputs = layer(inputs)

        return inputs

def main():
    X = np.random.uniform(-1, 1, (20, 10)).astype(np.float32)
    y = np.random.randint(0, 5, (20,))
    print(X.shape)
    print(y.shape)

    model = Net()
    loss = torch.nn.CrossEntropyLoss()
    #print('Model:', type(model))
    #print('Loss:', type(loss))

    X = torch.from_numpy(X)
    y = torch.from_numpy(y)
    print('X', X.size(), 'y', y.size())

    if torch.cuda.is_available():
        # wrap the model so DataParallel splits the batch across all visible GPUs
        model = torch.nn.DataParallel(model)
        print('Model:', type(model))
        print('Devices:', model.device_ids)

        model = model.cuda()

        loss = loss.cuda()
        X = X.cuda()
        y = y.cuda()

    else:
        print('No devices available')

    X = Variable(X)
    y = Variable(y)

    outputs = model(X)
    l = loss(outputs, y)
    print('Loss:', l)

if __name__ == '__main__':
    main()

I got the following results: the loss printed as a single scalar tensor.

In my understanding, the loss should be in the format [loss1, loss2, loss3, loss4], according to the data parallel working diagram (the first picture in the second row).


Why is it just a single number in my experiment output?

The outputs will be gathered on the default device and thus the loss will also be calculated on the default device. Afterwards the gradients are scattered to all GPUs as shown in the diagram.
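To illustrate that point, here is a minimal sketch (assuming a machine with multiple GPUs; the model, tensor names, and shapes are just made up for the example): the forward pass of a DataParallel module returns one tensor for the whole batch on the default device, so the loss computed from it is a single scalar.

import torch

model = torch.nn.DataParallel(torch.nn.Linear(10, 5)).cuda()
x = torch.randn(20, 10).cuda()

out = model(x)
print(out.size())    # torch.Size([20, 5]) - the full batch, already gathered
print(out.device)    # cuda:0 - the default device

target = torch.randint(0, 5, (20,)).cuda()
loss = torch.nn.functional.cross_entropy(out, target)
print(loss)          # a single scalar tensor, also on cuda:0
loss.backward()      # gradients flow back through the gather, so each GPU gets its own chunk's gradient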

Thanks for the response! I wonder how the scattering process decomposes the loss on the default device back to each GPU?

I don’t fully understand the question as the gradients are scattered as shown in your figure.

Sorry for the confusion. What I meant was: how does the loss on the default device turn into [loss1, loss2, loss3, loss4]? How does the process divide the loss into 4 losses? (In my understanding, it can't just evenly divide the loss on the default device, because the losses for the batch chunks on the 4 GPUs are different.)

nn.DataParallel will initially create chunks of the input, which corresponds to the [i1, i2, i3, i4] tensors. Note that the input tensor on the default device will not be a list of 4 tensors in this case, but a single tensor; the figure uses the list notation to explain that this single tensor will be split.
The same applies to the output tensor, which will be a single tensor on the default device containing the splits from all GPUs, concatenated in the batch dimension (dim0).
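A rough CPU-only sketch of that split/concat idea (DataParallel uses its scatter/gather machinery internally to move the chunks between devices; plain chunk/cat here just illustrates how the batch dimension is handled, with made-up shapes):

import torch

X = torch.randn(20, 10)             # the single input tensor on the default device
chunks = torch.chunk(X, 4, dim=0)   # corresponds to "[i1, i2, i3, i4]" - one chunk per GPU
print([c.size(0) for c in chunks])  # [5, 5, 5, 5]

# after the per-GPU forward passes, the outputs are concatenated again in dim0
outs = [torch.randn(c.size(0), 5) for c in chunks]  # stand-ins for o1..o4
gathered = torch.cat(outs, dim=0)
print(gathered.size())              # torch.Size([20, 5]) - a single output tensor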
