Hi,
I am implementing a distributed training model where I need to gather tensors from other nodes.
Here is my scenario: the master node is rank 0, and the worker nodes are ranks 1 and 2. I want to gather the tensor W_Y_update from workers 1 and 2 on the master.
On the master side I have:
dist.gather(W_Y_update, gather_list=g_list, group=group)
and on the worker side I have:
dist.gather(W_Y_update, dst=0, group=group)
(note that group contains ranks [0, 1, 2])
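For reference, here is a complete minimal version of what I am trying to do (a sketch, not my exact script: it assumes the gloo backend, env:// style initialization with MASTER_ADDR/MASTER_PORT set, and one fresh zero tensor per rank in g_list):

import torch
import torch.distributed as dist

def run(rank, world_size=3):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    group = dist.new_group([0, 1, 2])

    # Dummy payload so each rank's tensor is distinguishable.
    W_Y_update = torch.full((27648,), float(rank), dtype=torch.double)

    if rank == 0:
        # One independent buffer per rank. Building the list as
        # [tensor] * world_size would alias a single tensor, and every
        # slot would then show the same (last written) values.
        g_list = [torch.zeros(27648, dtype=torch.double)
                  for _ in range(world_size)]
        dist.gather(W_Y_update, gather_list=g_list, dst=0, group=group)
        print([t[0].item() for t in g_list])  # expect [0.0, 1.0, 2.0]
    else:
        dist.gather(W_Y_update, dst=0, group=group)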
If the gather operation were correct, g_list[1] and g_list[2] would hold the tensors from workers 1 and 2 (and g_list[0] the master's own tensor), but the result is as follows.
On the master side, g_list[0], g_list[1], and g_list[2] all contain exactly the same values:
-1.0264e+00
1.5552e-01
-2.5123e-01
⋮
5.0416e-01
-5.7919e-01
4.3203e-01
[torch.DoubleTensor of size 27648]
The actual tensor on worker 1 is:
-9.4636e-01
-1.0843e-01
-8.6267e-01
⋮
-7.3371e-02
-8.7356e-01
-1.4109e-01
[torch.DoubleTensor of size 27648]
and the tensor on worker 2 is:
-1.0264e+00
1.5552e-01
-2.5123e-01
⋮
5.0416e-01
-5.7919e-01
4.3203e-01
[torch.DoubleTensor of size 27648]
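Note that every entry of g_list matches worker 2's tensor, not just g_list[2]. One thing I could check on the master side is whether the entries of g_list are actually distinct tensors (a quick sanity check; data_ptr() returns the address of a tensor's underlying storage):

print([t.data_ptr() for t in g_list])  # three different addresses expected; duplicates would mean the slots alias one tensor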
Can anyone help me understand what mistake I have made?
Thanks a lot!