How to concatenate different-size tensors from distributed processes?

Hi, I’m trying to implement object detection code with apex distributed data parallel. Actually, many of my codes are following the apex official example.

However, the problem is that when evaluating ROC. To do that I have to set a threshold which makes the number of False Positives (FPs) to fixed-size. Therefore, I need to gather (or concatenate) all the objectness score of predicted bounding boxes from each GPU and have to sort them to decide the threshold.

Here, the problem is

  1. Since the length of a tensor for objectness score is all different from GPUs I think I cannot use the function ‘distributed.all_gather’. Here’s my snippet.
scores = torch.from_numpy(scores).cuda()
tensor_list = [torch.zeros_like(scores) for _ in range(dist.get_world_size())
rt = scores.clone()
dist.all_gather(tensor_list, rt)
  1. Then, how can I concatenate these tensors? I have only found ‘dist.reduce’ which enables summation, multiplication … but no concatenation. That’s why I use all_gather.

  2. Is there any sample code or tutorial for object detection that I could refer to?


Hi @hanbit. Have you solved your issue? I am having the same question.

I solved it by transforming the data into pickle and torch.byteTensor. You can check the code in official facebook mask R-CNN.


I know this question is a little old, but here’s how I solved it:

def varsize_tensor_all_gather(tensor: torch.Tensor):
    tensor = tensor.contiguous()

    cuda_device = f'cuda:{torch.distributed.get_rank()}'
    size_tens = torch.tensor([tensor.shape[0]], dtype=torch.int64, device=cuda_device)

    size_tens = tensor_all_gather(size_tens).cpu()

    max_size = size_tens.max()

    padded = torch.empty(max_size, *tensor.shape[1:],
    padded[:tensor.shape[0]] = tensor

    ag = tensor_all_gather(padded)

    slices = []
    for i, sz in enumerate(size_tens):
        start_idx = i * max_size
        end_idx = start_idx + sz.item()

        if end_idx > start_idx:

    ret =, dim=0)


Essentially, the algorithm figures out which rank contains the largest tensor, and then all ranks allocate a tensor of that size, and fill in the relevant slice of that tensor. Then, we run the real gather on that tensor. And finally, we compact the resulting tensor.

Note: This is only implemented for dim=0 and also may be buggy.

solve problem for me. Thanks