How to synchronize lists across GPUs using torch.distributed.launch

Hi.

I want to concatenate lists of different lengths across different GPUs using torch.distributed.launch. Is there an API like torch.distributed.all_reduce() that can help me?

Example Code (test.py):

import random
import torch
l = []
length = random.randint(5, 8)  # each process draws its own random length
for i in range(length):
    l.append(i)
print(l)  # each process prints only its local list

Run:

python -m torch.distributed.launch \
    --nproc_per_node=4  \
    --use_env \
    --master_port=$RANDOM \
    test.py

Result:

[0, 1, ..., length - 1 on GPU 0]
[0, 1, ..., length - 1 on GPU 1]
[0, 1, ..., length - 1 on GPU 2]
[0, 1, ..., length - 1 on GPU 3]

What I want (concat/synchronize the list in 4 different gpus together):

[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]
[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]
[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]
[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]

Thanks!

You can use all_gather for this purpose.
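
If your PyTorch is recent enough (1.8 or later), torch.distributed.all_gather_object can gather arbitrary picklable Python objects, so lists of different lengths work directly. Below is a minimal sketch, assuming the script is launched exactly as in the question (4 processes, NCCL backend, --use_env so that LOCAL_RANK is provided through the environment):

import os
import random

import torch
import torch.distributed as dist

# Launched via torch.distributed.launch --use_env: one process per GPU.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each process builds a list of a different random length, as in the question.
l = list(range(random.randint(5, 8)))

# all_gather_object collects the object from every rank into a pre-sized list;
# the lists do not need to have the same length.
gathered = [None] * world_size
dist.all_gather_object(gathered, l)

# Flatten the per-rank lists into one combined list, identical on every rank.
combined = [x for sub in gathered for x in sub]
print(f"rank {rank}: {combined}")

Note that with the NCCL backend, all_gather_object moves the pickled data through the current CUDA device, so torch.cuda.set_device must be called before the collective.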

Hi, thanks for your nice suggestion!

Another, harder problem for me: when there are many 1D tensors of different lengths on each GPU, is there a way to gather them without looping over them one by one?
Situation:

GPU 0: [torch.Tensor(101), torch.Tensor(102), torch.Tensor(103), ..., torch.Tensor(200)]
GPU 1: [torch.Tensor(201), torch.Tensor(202), torch.Tensor(203), ..., torch.Tensor(300)]
GPU 2: [torch.Tensor(301), torch.Tensor(302), torch.Tensor(303), ..., torch.Tensor(400)]
GPU 3: [torch.Tensor(401), torch.Tensor(402), torch.Tensor(403), ..., torch.Tensor(500)]

Result:

GPU 0: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]
GPU 1: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]
GPU 2: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]
GPU 3: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]

The order of tensors in the output list doesn’t matter.

Hi, this is indeed what I need!

However, in my situation the data is generated dynamically during training on each GPU. What I need to do is gather the data and then use DistributedSampler to sample it. I'm stuck at the gathering step.

Do you have any good ideas?

You can pad each tensor to the maximum size and then use all_gather. If you don't know the maximum size beforehand, you can first perform an all_gather to collect the size of each tensor and then compute the max.
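
For reference, here is a minimal sketch of that pad-then-all_gather idea for a single variable-length 1D CUDA tensor per rank (the helper name gather_varlen is hypothetical, not a torch.distributed API). A long list of tensors on each GPU can be handled the same way without looping over collectives: concatenate them into one flat tensor plus a tensor of per-item lengths, gather both, and split afterwards.

import torch
import torch.distributed as dist

def gather_varlen(t):
    """All-gather 1D CUDA tensors whose lengths differ across ranks."""
    world_size = dist.get_world_size()

    # Step 1: exchange lengths so every rank knows the maximum size.
    local_len = torch.tensor([t.numel()], device=t.device)
    all_lens = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(all_lens, local_len)
    max_len = int(torch.stack(all_lens).max())

    # Step 2: pad the local tensor to the maximum size.
    padded = torch.zeros(max_len, dtype=t.dtype, device=t.device)
    padded[: t.numel()] = t

    # Step 3: all_gather the now fixed-size padded tensors.
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)

    # Step 4: trim each gathered tensor back to its true length.
    return [g[: int(n)] for g, n in zip(gathered, all_lens)]

Each rank can then call, for example, gathered = gather_varlen(local_tensor) and build its sampling logic (e.g. a DistributedSampler over the combined data) on top of the result.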