I am trying to implement something like this for 2 nodes (each node with 2 GPUs):
#### Parallel process initiated with torch.distributed.init_process_group()
### All GPUs work in parallel, and generate lists like :
[20, 0, 1, 17] for GPU0 of node A
[1, 2, 3, 4] for GPU1 of node A
[5, 6, 7, 8] for GPU0 of node B
[0, 2, 4, 6] for GPU1 of node B
I tried torch.distributed.reduce() to get a sum of these 4: [26, 10, 15, 35]
But what I want is a concatenated version like this [[20, 0, 1, 17], [1, 2, 3, 4] , [5, 6, 7, 8] , [0, 2, 4, 6]]
Or [20, 0, 1, 17, 1, 2, 3, 4, 5, 6, 7, 8, 0, 2, 4, 6] is also OK with me.
Is it possible to achieve this from torch.distributed?