DataParallel - How to collect data from each GPU?

Is there a function I can use to manually collect tensors from all the GPUs onto one common GPU? For example, if my batch size is 16 and I am using 4 GPUs, then the output of the network on each GPU has a batch size of 4. How can I manually collect the output from each thread so that I end up with a tensor with a batch size of 16?

And how do I make sure the results come back in the same order in which they were split?
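To make the setup concrete, here is roughly what I am seeing (a minimal sketch, assuming a machine with 4 visible GPUs; the toy module and the sizes are just placeholders):

import torch
import torch.nn as nn

class ShapePrinter(nn.Module):
    def forward(self, x):
        # each replica only sees its slice of the batch
        print(x.device, x.shape)   # e.g. cuda:0 torch.Size([4, 10]) on a 4-GPU machine
        return x

net = nn.DataParallel(ShapePrinter()).cuda()
data = torch.randn(16, 10).cuda()
out = net(data)          # the batch of 16 is split along dim 0 across the replicas
print(out.shape)         # I want to end up with the full batch of 16 again here

So each replica appears to get 4 of the 16 samples, and I would like to collect those 4-sample outputs back into one 16-sample tensor in the original order.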

I looked through https://pytorch.org/docs/stable/nn.html?highlight=nn%20dataparallel#torch.nn.DataParallel but could not find anything.

For example:

import torch
import torch.nn as nn

class Model(nn.Module):
    def forward(self, data):
        # re-stack the samples from this replica's slice of the batch
        datalist = []
        for x in data:
            datalist.append(x)
        return torch.stack(datalist)

net = torch.nn.DataParallel(Model())
datalist = net(data)  # data is the full batch, e.g. shape [16, ...]

but datalist does not have the same shape as data, and even if I reshape datalist to match data, the values are no longer equal because the appended pieces do not arrive in order (i.e. gpu0, gpu1, gpu2, gpu3, etc.). How can I fix this?
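One thing I was considering for checking the ordering is to do the scatter/gather round trip by hand, something like this (a rough sketch; I am assuming torch.nn.parallel.scatter and torch.nn.parallel.gather are the right helpers here, but there may be a better way):

import torch
from torch.nn.parallel import scatter, gather

data = torch.randn(16, 10, device="cuda:0")

# split the batch into one chunk per GPU (assumes 4 GPUs)
chunks = scatter(data, target_gpus=[0, 1, 2, 3])   # 4 tensors of shape [4, 10]

# collect the chunks back onto one device
collected = gather(chunks, target_device=0)         # shape [16, 10] on cuda:0

# check whether the round trip preserves the original order
print(torch.equal(data, collected))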

I have the same problem. Could someone provide any pointers?