Hi, I am using distributed data parallel with NCCL as the backend for the following workload.
There are 2 nodes; node 0 sends tensors to node 1.
The send / recv exchange runs 100 times in a for loop.
The problem is that node 0 finishes all 100 sends, but node 1 gets stuck after receiving roughly 40-50 tiles.
Here is the code:
import argparse

import torch
import torch.distributed as dist

# The parser definition was omitted from the original listing; this matches
# the --local_rank argument passed on the command line below.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)


def main():
    args = parser.parse_args()
    main_worker(args.local_rank, args)


def main_worker(rank, args):
    dist.init_process_group(backend='nccl')
    data_processing_group = dist.new_group([0])
    training_group = dist.new_group([1])
    torch.cuda.set_device(rank)
    if rank == 0:
        func_0(rank, args)
    else:
        func_1(rank, args)


def func_0(rank, args):
    # Sender: fill each tile with its tile_id, then send it to rank 1.
    for tile_id in range(100):
        print('GPU: 0 started sending tile %d to GPU: 1.' % tile_id)
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=rank)
        tile.fill_(tile_id)
        dist.send(tensor=tile, dst=1, tag=tile_id)


def func_1(rank, args):
    # Receiver: receive each tile and print its mean as a sanity check.
    for tile_id in range(100):
        print('GPU: 1 started receiving tile %d from GPU: 0.' % tile_id)
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float,
                           device=torch.device('cuda', rank))
        dist.recv(tensor=tile, src=0, tag=tile_id)
        print(torch.mean(tile))


if __name__ == '__main__':
    main()
Here are the launch commands (one per node):
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 test.py --local_rank 0
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 test.py --local_rank 1
Here are the results:
node 0:
GPU: 0 started sending tile 90 to GPU: 1.
GPU: 0 started sending tile 91 to GPU: 1.
GPU: 0 started sending tile 92 to GPU: 1.
GPU: 0 started sending tile 93 to GPU: 1.
GPU: 0 started sending tile 94 to GPU: 1.
GPU: 0 started sending tile 95 to GPU: 1.
GPU: 0 started sending tile 96 to GPU: 1.
GPU: 0 started sending tile 97 to GPU: 1.
GPU: 0 started sending tile 98 to GPU: 1.
GPU: 0 started sending tile 99 to GPU: 1.
algorithms7@oyysuoctr1613739854048-vvjzp:~/hms2-pytorch-ddp/hms2_pseudo$
node 1:
GPU: 1 started receiving tile 39 from GPU: 0.
tensor(39., device='cuda:1')
GPU: 1 started receiving tile 40 from GPU: 0.
tensor(40., device='cuda:1')
GPU: 1 started receiving tile 41 from GPU: 0.
tensor(41., device='cuda:1')
GPU: 1 started receiving tile 42 from GPU: 0.
tensor(42., device='cuda:1')
GPU: 1 started receiving tile 43 from GPU: 0.
tensor(43., device='cuda:1')
GPU: 1 started receiving tile 44 from GPU: 0.
tensor(44., device='cuda:1')
GPU: 1 started receiving tile 45 from GPU: 0.
tensor(45., device='cuda:1')
GPU: 1 started receiving tile 46 from GPU: 0.
Node 0 fills each tile (tensor) with its tile_id before sending, and node 1 checks the values of the received tiles. From the results, every tile that does arrive has the correct value.
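Concretely, since each tile is filled with a single constant, the mean equals the tile_id when the data arrives intact. A stricter elementwise check would look like this (a hypothetical addition to func_1, not in my original script):

        # Hypothetical stricter check: every element must equal tile_id exactly.
        expected = torch.full_like(tile, float(tile_id))
        assert torch.equal(tile, expected), 'tile %d arrived corrupted' % tile_id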
I also tried isend / irecv with req.wait() instead, but the same problem occurs.
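For reference, the non-blocking variant looked roughly like this (a sketch of what I described above; func_0_async and func_1_async are just illustrative names):

def func_0_async(rank, args):
    # Same sender loop, but with isend and an explicit wait per request.
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=rank)
        tile.fill_(tile_id)
        req = dist.isend(tensor=tile, dst=1, tag=tile_id)
        req.wait()


def func_1_async(rank, args):
    # Same receiver loop, but with irecv and an explicit wait per request.
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float,
                           device=torch.device('cuda', rank))
        req = dist.irecv(tensor=tile, src=0, tag=tile_id)
        req.wait()
        print(torch.mean(tile))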
I am also confused about some aspects of send / recv: the PyTorch distributed docs say NCCL doesn't support send / recv, yet data is clearly being transmitted here. So what is the underlying protocol behind send / recv?
My NCCL version is 2.8.02 and my PyTorch version is 1.8.0. Thank you so much!