DataParallel does not concat outputs from multi-gpu

A similar problem was posted on Stack Overflow here, but none of the answers is useful.

As this tutorial shows, the outputs from multiple GPUs should be concatenated along dimension 0, but I don’t know why that doesn’t happen in my code.

    model = T2T(......)  # T2T is a sub class of nn.Module
    if torch.cuda.device_count() > 1:
        print("Using", torch.cuda.device_count(), "GPUs!")
        model = torch.nn.DataParallel(model)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    train_loader = DataLoader(......)
    epoch_steps = len(train_loader)
    STEPS = train_epochs * epoch_steps
    train_loader = cycle(train_loader)
    for step in range(STEPS):
        batch_data = next(train_loader)
        labels = batch_data['labels'].to(device)
        probs = model(......)
        # calculate loss using labels and probs

When calculating the loss, I get an error because the batch sizes of labels and probs differ: labels has shape [64, …] while probs has shape [32, …]. I am using 2 GPUs, so I guess the outputs from the GPUs are not being concatenated.

Any idea how to fix this? Thanks in advance!

What is the shape of the input? The nn.DataParallel model would split the input in dim0 and concatenate it afterwards. If the input has a batch size of 32, the output would have the same shape.

probs = model(inputrulelist, syn_inputrulelist, tree_path_vec, rule_mask, syn_rule_mask, inputrulelistnode, syn_inputrulelistnode, inputrulelistson, syn_inputrulelistson, sequence_mask, treemask, syn_treemask, path_lens)

Sorry for the late reply. I have already fixed the problem, but the actual reason remains unknown. The model takes many inputs, as shown above; the first dimension of each one is batch_size, which is currently 64.

One exception is the sequence_mask tensor: it has shape [1, n, n], and the elements on and below the main diagonal are all zeros (it is used for sequence masking in multi-head attention). I set its batch dimension to 1 because I had noticed that a DataParallel model splits its inputs along dim0, and I hoped every device would get the same sequence_mask.

After I changed the first dimension of sequence_mask to batch_size (64 for now), the problem was fixed. Specifically, I now use tensor.repeat(batch_size, 1, 1) instead of tensor.unsqueeze(0) to create the batch dimension. With that change, the first dimension of the output of the DataParallel model (probs in the code) becomes 64 instead of 32. I don’t know the exact reason, since I am unfamiliar with DataParallel; it would be very helpful if you could explain it. Thanks a lot.

Your description is right. nn.DataParallel expects all input tensors in the shape [batch_size, *] and will split them in dim0. If you’ve applied broadcasting in the past, repeating the tensor sounds reasonable.
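For anyone landing here later, the change described above can be sketched roughly as follows. The sizes and the way the mask is built with torch.triu are assumptions for illustration only; the thread does not show how sequence_mask is actually constructed.

```python
import torch

batch_size, n = 64, 10  # illustrative sizes, not taken from the original model

# Hypothetical stand-in for sequence_mask: zeros on and below the
# main diagonal, ones above it (matching the description in the thread).
base_mask = torch.triu(torch.ones(n, n), diagonal=1)

# Before: a single "batch" entry, which DataParallel cannot split across GPUs.
mask_before = base_mask.unsqueeze(0)             # shape [1, n, n]

# After: one copy of the mask per sample, so dim0 matches the other inputs.
mask_after = base_mask.repeat(batch_size, 1, 1)  # shape [batch_size, n, n]

print(mask_before.shape, mask_after.shape)
```

Note that repeat copies the data, so the mask takes batch_size times as much memory; for an [n, n] mask this is usually negligible.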

But I still don’t understand why I only get the output from one GPU (instead of the concatenation of the outputs from all GPUs) if I pass a tensor with shape [1, *].

A tensor with shape [1, *] can’t be split, as dim0 has only a single sample, so only a single GPU receives a full set of inputs and runs the forward pass; the output you see is that one GPU’s chunk.
If you want to make sure each GPU gets a chunk of every input tensor, make sure dim0 has a size of at least num_gpus.
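The effect can be imitated on the CPU without any GPUs. This is only a sketch of the splitting step, not DataParallel's real code: it assumes each argument is chunked along dim0 and that the per-device argument tuples are then paired up, so the argument with the fewest chunks decides how many devices get work. The tensor names and sizes are made up for the illustration.

```python
import torch

num_gpus = 2                      # assumed device count, as in the thread
inputs = torch.zeros(64, 8)       # an ordinary batched input
mask = torch.zeros(1, 10, 10)     # a mask whose dim0 is 1

# Imitate the per-argument split along dim0.
input_chunks = torch.chunk(inputs, num_gpus, dim=0)  # two chunks of 32
mask_chunks = torch.chunk(mask, num_gpus, dim=0)     # only one chunk

# Pair the chunks up per device; zip stops at the shortest sequence,
# so only one device ends up with a complete argument tuple.
per_device_args = list(zip(input_chunks, mask_chunks))
print(len(per_device_args), per_device_args[0][0].shape)

# With the mask repeated to the full batch size, both devices get work.
mask_fixed = mask.repeat(64, 1, 1)
fixed_args = list(zip(input_chunks, torch.chunk(mask_fixed, num_gpus, dim=0)))
print(len(fixed_args))
```

In the first case the single device sees a batch of 32, which matches the [32, …] output reported above.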

Oh I get it, thank you for your time!