A similar question on Stack Overflow is here, but none of the answers are helpful.
As this tutorial shows, the outputs from multiple GPUs should be concatenated along dimension 0, but I don't know why this doesn't work in my code.
import torch
from itertools import cycle
from torch.utils.data import DataLoader

model = T2T(......)  # T2T is a subclass of nn.Module
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs!")
    model = torch.nn.DataParallel(model)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
train_loader = DataLoader(......)
......
epoch_steps = len(train_loader)
STEPS = train_epochs * epoch_steps
train_loader = cycle(train_loader)  # endless iterator over batches
......
for step in range(STEPS):
    batch_data = next(train_loader)
    optimizer.zero_grad()
    labels = batch_data['labels'].to(device)
    ......
    probs = model(......)
    # calculate loss using labels and probs
When calculating the loss, I get an error: the batch sizes of labels and probs are not the same. The shape of labels is [64,…] and the shape of probs is [32,…]. I am using 2 GPUs, so I guess the outputs from the GPUs are not being concatenated.
What is the shape of the input? The nn.DataParallel model splits the input in dim0 and concatenates the outputs afterwards. If the input has a batch size of 32, the output will have the same shape.
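For illustration, here is a minimal sketch (the Toy module and the sizes are made up, not your T2T) showing that nn.DataParallel splits a [batch_size, *] input along dim0 and gathers the per-GPU outputs back along dim0, so dim0 of the output matches dim0 of the input:

import torch
import torch.nn as nn

class Toy(nn.Module):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 4)

    def forward(self, x):
        # on each replica, x is a chunk of shape [batch_size / num_gpus, 8]
        return self.linear(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.DataParallel(Toy()).to(device)

x = torch.randn(64, 8, device=device)  # batch_size = 64 in dim0
probs = model(x)
print(probs.shape)  # torch.Size([64, 4]): chunks are gathered back on dim0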
Sorry for the late reply; I have already fixed the problem, but the actual reason remains unknown. The model takes many input tensors, as shown above, and the first dimension of each of them is batch_size, which is currently 64.
One exception is the sequence_mask tensor: it has shape [1, n, n] and its elements on and below the main diagonal are all zeros (it is used for sequence masking in multi-head attention). Its batch dimension is set to 1 because I had noticed that the DataParallel model splits inputs in dim0, and I wanted all devices to receive the same sequence_mask.
After I changed the first dimension of sequence_mask to batch_size (64 for now), the problem was fixed. Specifically, I use tensor.repeat(batch_size, 1, 1) instead of tensor.unsqueeze(0) to create the batch dimension. After that, the first dimension of the DataParallel model's output (probs in the code) becomes 64 instead of 32. I don't know the exact reason due to my unfamiliarity with DataParallel; it would be very helpful if you could explain it, thanks a lot.
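For clarity, here is a rough sketch of the change described above (n and batch_size are placeholder values; the mask layout follows the description: zeros on and below the main diagonal):

import torch

n, batch_size = 5, 64  # placeholder sizes

# base causal mask: zeros on and below the main diagonal, ones above it
base_mask = torch.triu(torch.ones(n, n), diagonal=1)

# before: shape [1, n, n], so DataParallel cannot split it along dim0
sequence_mask = base_mask.unsqueeze(0)

# after: shape [batch_size, n, n], so every GPU receives its own chunk
sequence_mask = base_mask.repeat(batch_size, 1, 1)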
Your description is right. nn.DataParallel expects all input tensors in the shape [batch_size, *] and will split them in dim0. If you’ve applied broadcasting in the past, repeating the tensor sounds reasonable.
But I still don't understand why I only get the output from one GPU (instead of the concatenation of outputs from all GPUs) if I pass a tensor with shape [1, *].
This tensor won't be split, as dim0 has only a single sample, so only a single GPU will process this tensor.
If you want to make sure each GPU gets a chunk of the input tensor, make sure dim0 has a size of at least num_gpus.
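As a rough illustration (using tensor.chunk to approximate the scatter step; the internals of DataParallel differ slightly), a size-1 dim0 produces only a single chunk, so only one replica receives any data:

import torch

num_gpus = 2  # assume 2 devices

small = torch.randn(1, 5, 5)   # dim0 == 1
big = torch.randn(64, 5, 5)    # dim0 == batch_size

# splitting along dim0 into at most num_gpus chunks
print([c.shape for c in small.chunk(num_gpus, dim=0)])  # [torch.Size([1, 5, 5])], one chunk, one GPU
print([c.shape for c in big.chunk(num_gpus, dim=0)])    # [torch.Size([32, 5, 5]), torch.Size([32, 5, 5])]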