How to concatenate tensors in a distributed multi-node setup?

I am trying to implement something like this for 2 nodes (each node with 2 GPUs):

Each parallel process is initiated with torch.distributed.init_process_group(). All GPUs work in parallel, and each one generates a list like:
    [20, 0, 1, 17] for GPU0 of node A 
    [1, 2, 3, 4] for GPU1 of node A
    [5, 6, 7, 8] for GPU0 of node B
    [0, 2, 4, 6] for GPU1 of node B
I tried torch.distributed.reduce(), but that gives me the element-wise sum of the four lists: [26, 10, 15, 35].
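Roughly, what I tried looks like the sketch below. The backend, the hard-coded per-rank values, and the rank-to-GPU mapping are just for illustration; my real launch script (torchrun-style environment variables) may differ:

    import os

    import torch
    import torch.distributed as dist

    # One process per GPU: 2 nodes x 2 GPUs -> world_size = 4.
    # MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK are assumed
    # to be set by the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank produces its own 4-element result (hard-coded here to match
    # the lists above; in my code these come from the actual computation).
    per_rank_values = {
        0: [20, 0, 1, 17],  # node A, GPU0
        1: [1, 2, 3, 4],    # node A, GPU1
        2: [5, 6, 7, 8],    # node B, GPU0
        3: [0, 2, 4, 6],    # node B, GPU1
    }
    t = torch.tensor(per_rank_values[rank], device="cuda")

    # This sums element-wise across ranks, so rank 0 ends up with
    # [26, 10, 15, 35] -- not the concatenation I am after.
    dist.reduce(t, dst=0, op=dist.ReduceOp.SUM)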

What I actually want is a concatenated version like [[20, 0, 1, 17], [1, 2, 3, 4], [5, 6, 7, 8], [0, 2, 4, 6]].
A flattened version, [20, 0, 1, 17, 1, 2, 3, 4, 5, 6, 7, 8, 0, 2, 4, 6], would also be fine.

Is it possible to achieve this with torch.distributed?
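My current guess is that dist.all_gather might be the right primitive, along the lines of the untested sketch below (continuing from the snippet above, and using t as created before the reduce call), but I am not sure this is the intended way to do it across nodes:

    # Rough, untested guess: gather every rank's tensor into a list,
    # then concatenate locally on each rank.
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)

    stacked = torch.stack(gathered)  # [[20, 0, 1, 17], [1, 2, 3, 4], ...]
    flat = torch.cat(gathered)       # [20, 0, 1, 17, 1, 2, 3, 4, ...]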

Thanks for checking this out. I believe this will be a very useful example of distributed operations in PyTorch. :)