Distributed gather does not work properly?

Hi,
I am implementing a distributed training model where I need to gather tensors from other nodes.

Here is my scenario:

The master node is rank 0.
The worker nodes are ranks 1 and 2.

I want to gather the tensors from worker ranks 1 and 2.

On the master side I have:

dist.gather(W_Y_update, gather_list=g_list, group=group)

On the worker side I have:

dist.gather(W_Y_update, dst=0, group=group)

(note that group is [0,1,2])
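To make the calling pattern concrete, here is a stripped-down sketch of both sides (the helper name collect_updates and the way g_list is pre-allocated are only illustrative, not my exact code; W_Y_update is the tensor being collected):

import torch
import torch.distributed as dist

def collect_updates(rank, world_size, W_Y_update, group):
    if rank == 0:
        # Master: pre-allocate one distinct buffer per rank
        # (contents are overwritten by the gather).
        g_list = [W_Y_update.clone() for _ in range(world_size)]
        dist.gather(W_Y_update, gather_list=g_list, group=group)
        return g_list
    else:
        # Workers: only send W_Y_update to rank 0.
        dist.gather(W_Y_update, dst=0, group=group)
        return None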

If the gather operation were correct, g_list[1] and g_list[2] would hold the workers' tensors, but the result is as follows.

On the master side I received:


g_list[0]

-1.0264e+00
1.5552e-01
-2.5123e-01

5.0416e-01
-5.7919e-01
4.3203e-01
[torch.DoubleTensor of size 27648]

g_list[1]

-1.0264e+00
1.5552e-01
-2.5123e-01

5.0416e-01
-5.7919e-01
4.3203e-01
[torch.DoubleTensor of size 27648]

g_list[2]

-1.0264e+00
1.5552e-01
-2.5123e-01

5.0416e-01
-5.7919e-01
4.3203e-01
[torch.DoubleTensor of size 27648]


The tensor on worker 1 is:

-9.4636e-01
-1.0843e-01
-8.6267e-01

-7.3371e-02
-8.7356e-01
-1.4109e-01
[torch.DoubleTensor of size 27648]

and the tensor on worker 2 is:

-1.0264e+00
1.5552e-01
-2.5123e-01

5.0416e-01
-5.7919e-01
4.3203e-01
[torch.DoubleTensor of size 27648]

Can anyone help me understand what mistake I have made?

Thanks a lot!


Here is another simple example:

import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process
import sys

rank = int(sys.argv[1])

def run(rank, size):
    if rank == 0:
        group = dist.new_group(range(size))
        g_list = []
        alphak = torch.DoubleTensor([1.0])
        for i in range(size):
            g_list.append(alphak)

        dist.gather(gather_list=g_list, tensor=alphak, group=group)

        for i in range(size):
            print(g_list[i])
    else:
        group = dist.new_group(range(size))

        alphak = torch.DoubleTensor([rank])
        print(alphak)
        dist.gather(alphak, dst=0, group=group)

def init_processes(rank, size, fn, backend='tcp'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 3
    init_processes(rank, size, run, backend='tcp')
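I launch three separate processes and pass each rank on the command line, e.g. python gather_test.py 0, python gather_test.py 1 and python gather_test.py 2 (the script name here is just a placeholder).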


The master still receives three numbers that are all the same as worker 2's value, which is 2.


Could anyone help me understand what's wrong with this?

Thanks!

Hi, I am facing the exact same issue. Any resolution?

Hey, look at this:
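In short (a sketch, untested against your exact setup): the gather_list on rank 0 has to contain one distinct tensor per rank. In your second snippet, g_list.append(alphak) puts the same tensor object into every slot, so after the gather every slot ends up showing whichever value was copied in last (here, worker 2's). Allocating a separate buffer per rank should give the expected result; the function and names below are only illustrative:

import torch
import torch.distributed as dist

def gather_scalars(rank, size, group):
    # Every rank contributes its own scalar.
    alphak = torch.DoubleTensor([float(rank)])
    if rank == 0:
        # One distinct receive buffer per rank. Do not append the same
        # tensor object repeatedly, or every slot aliases the same storage.
        g_list = [torch.DoubleTensor([0.0]) for _ in range(size)]
        dist.gather(alphak, gather_list=g_list, group=group)
        for t in g_list:
            print(t)  # expected: 0.0, 1.0, 2.0
    else:
        # Workers only send their tensor to rank 0.
        dist.gather(alphak, dst=0, group=group)

The same applies to the first example: g_list needs size separate tensors of the right shape allocated before the call.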

Thanks… that worked! :)