How to synchronize lists across GPUs using torch.distributed.launch

Hi.

I want to concatenate lists of different lengths across different GPUs using torch.distributed.launch. Is there an API like torch.distributed.all_reduce() that can help me?

Example Code (test.py):

import random
import torch
l = []
length = random.randint(5, 8)  # each process draws its own random length
for i in range(length):
    l.append(i)
print(l)  # each process prints only its local list

Run:

python -m torch.distributed.launch \
    --nproc_per_node=4  \
    --use_env \
    --master_port=$RANDOM \
    test.py

Result:

[0, 1, ..., length - 1 on GPU 0]
[0, 1, ..., length - 1 on GPU 1]
[0, 1, ..., length - 1 on GPU 2]
[0, 1, ..., length - 1 on GPU 3]

What I want (concat/synchronize the list in 4 different gpus together):

[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]
[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]
[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]
[0, 1, ..., length - 1 on GPU 0, 0, 1, ..., length - 1 on GPU 1, 0, 1, ..., length - 1 on GPU 2, 0, 1, ..., length - 1 on GPU 3]

Thanks!

You can use all_gather for this purpose.
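
If your PyTorch is recent enough (1.8 or later), torch.distributed.all_gather_object can gather arbitrary picklable Python objects, so lists of different lengths work directly. Below is a minimal sketch, assuming the script is launched exactly as in the question (4 processes, NCCL backend, --use_env so that LOCAL_RANK is provided through the environment):

import os
import random

import torch
import torch.distributed as dist

# Launched via torch.distributed.launch --use_env: one process per GPU.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Each process builds a list of a different random length, as in the question.
l = list(range(random.randint(5, 8)))

# all_gather_object collects the object from every rank into a pre-sized list;
# the lists do not need to have the same length.
gathered = [None] * world_size
dist.all_gather_object(gathered, l)

# Flatten the per-rank lists into one combined list, identical on every rank.
combined = [x for sub in gathered for x in sub]
print(f"rank {rank}: {combined}")

Note that with the NCCL backend, all_gather_object moves the pickled data through the current CUDA device, so torch.cuda.set_device must be called before the collective.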

Hi, thanks for your nice suggestion!

Another, harder problem for me: when there are many 1D tensors of different lengths on each GPU, is there a way to gather them without looping over them one by one?
Situation:

GPU 0: [torch.Tensor(101), torch.Tensor(102), torch.Tensor(103), ..., torch.Tensor(200)]
GPU 1: [torch.Tensor(201), torch.Tensor(202), torch.Tensor(203), ..., torch.Tensor(300)]
GPU 2: [torch.Tensor(301), torch.Tensor(302), torch.Tensor(303), ..., torch.Tensor(400)]
GPU 3: [torch.Tensor(401), torch.Tensor(402), torch.Tensor(403), ..., torch.Tensor(500)]

Result:

GPU 0: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]
GPU 1: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]
GPU 2: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]
GPU 3: [torch.Tensor(101), torch.Tensor(102), ..., torch.Tensor(200), torch.Tensor(201), ..., torch.Tensor(500)]

The order of tensors in the output list doesn’t matter.

Hi, this is indeed what I need!

However, in my situation the data is generated dynamically during training on each GPU. What I need to do is gather the data and then use DistributedSampler to sample it. I'm stuck at the gathering step.

Do you have any good ideas?

You can pad each tensor to the maximum size and then use all_gather. If you don't know the maximum size beforehand, you can first perform an all_gather to collect the size of each tensor and then compute the max.
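
For reference, here is a minimal sketch of that pad-then-all_gather idea for a single variable-length 1D CUDA tensor per rank (the helper name gather_varlen is hypothetical, not a torch.distributed API). A long list of tensors on each GPU can be handled the same way without looping over collectives: concatenate them into one flat tensor plus a tensor of per-item lengths, gather both, and split afterwards.

import torch
import torch.distributed as dist

def gather_varlen(t):
    """All-gather 1D CUDA tensors whose lengths differ across ranks."""
    world_size = dist.get_world_size()

    # Step 1: exchange lengths so every rank knows the maximum size.
    local_len = torch.tensor([t.numel()], device=t.device)
    all_lens = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(all_lens, local_len)
    max_len = int(torch.stack(all_lens).max())

    # Step 2: pad the local tensor to the maximum size.
    padded = torch.zeros(max_len, dtype=t.dtype, device=t.device)
    padded[: t.numel()] = t

    # Step 3: all_gather the now fixed-size padded tensors.
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)

    # Step 4: trim each gathered tensor back to its true length.
    return [g[: int(n)] for g, n in zip(gathered, all_lens)]

Each rank can then call, for example, gathered = gather_varlen(local_tensor) and build its sampling logic (e.g. a DistributedSampler over the combined data) on top of the result.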