PyTorch distributed send finishing before recv

Hi, I am using distributed data parallel with NCCL as the backend for the following workload.
There are 2 nodes; node 0 sends tensors to node 1.
The send / recv pair runs 100 times in a for loop.
The problem is that node 0 finishes all 100 sends, but node 1 gets stuck somewhere around iteration 40 - 50.

Here is the code:

import argparse

import torch
import torch.distributed as dist

# --local_rank is supplied on the command line (see the launch commands below).
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)

def main():
    args = parser.parse_args()
    main_worker(args.local_rank, args)

def main_worker(rank, args):
    dist.init_process_group(backend='nccl')
    data_processing_group = dist.new_group([0])
    training_group = dist.new_group([1])
    torch.cuda.set_device(rank)

    if rank == 0:
        func_0(rank, args)
    else:
        func_1(rank, args)

def func_0(rank, args):
    # Rank 0: send 100 tiles, each filled with its own tile_id.
    for tile_id in range(100):
        print('GPU: 0 started sending tile %d to GPU: 1.' % (tile_id))
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=rank)
        tile.fill_(tile_id)
        dist.send(tensor=tile, dst=1, tag=tile_id)

def func_1(rank, args):
    # Rank 1: receive 100 tiles and verify their values.
    for tile_id in range(100):
        print('GPU: 1 started receiving tile %d from GPU: 0.' % (tile_id))
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=torch.device('cuda', rank))
        dist.recv(tensor=tile, src=0, tag=tile_id)
        print(torch.mean(tile))

if __name__ == '__main__':
    main()

Here is the running command:

python -m torch.distributed.launch --nproc_per_node=1 --nnode=2 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 test.py --local_rank 0

python -m torch.distributed.launch --nproc_per_node=1 --nnode=2 --node_rank=1 --master_addr="127.0.0.1" --master_port=29500 test.py --local_rank 1

Here are the results:

node 0:

GPU: 0 started sending tile 90 to GPU: 1.
GPU: 0 started sending tile 91 to GPU: 1.
GPU: 0 started sending tile 92 to GPU: 1.
GPU: 0 started sending tile 93 to GPU: 1.
GPU: 0 started sending tile 94 to GPU: 1.
GPU: 0 started sending tile 95 to GPU: 1.
GPU: 0 started sending tile 96 to GPU: 1.
GPU: 0 started sending tile 97 to GPU: 1.
GPU: 0 started sending tile 98 to GPU: 1.
GPU: 0 started sending tile 99 to GPU: 1.
algorithms7@oyysuoctr1613739854048-vvjzp:~/hms2-pytorch-ddp/hms2_pseudo$

node 1:

GPU: 1 started receiving tile 39 from GPU: 0.
tensor(39., device='cuda:1')
GPU: 1 started receiving tile 40 from GPU: 0.
tensor(40., device='cuda:1')
GPU: 1 started receiving tile 41 from GPU: 0.
tensor(41., device='cuda:1')
GPU: 1 started receiving tile 42 from GPU: 0.
tensor(42., device='cuda:1')
GPU: 1 started receiving tile 43 from GPU: 0.
tensor(43., device='cuda:1')
GPU: 1 started receiving tile 44 from GPU: 0.
tensor(44., device='cuda:1')
GPU: 1 started receiving tile 45 from GPU: 0.
tensor(45., device='cuda:1')
GPU: 1 started receiving tile 46 from GPU: 0.

Node 0 fills the tile (tensor) with the value of its tile_id before sending, and node 1 checks the values of the received tiles. From the results, I can conclude that all of the tiles that were received do have the correct values.

I also tried switching to isend / irecv and calling req.wait(), but the same problem persists.
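
For reference, the non-blocking variant looked roughly like this (a minimal sketch of what I tried; func_0_isend / func_1_irecv are just placeholder names for the modified functions):

def func_0_isend(rank, args):
    # Same as func_0, but with isend + wait().
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=rank)
        tile.fill_(tile_id)
        req = dist.isend(tensor=tile, dst=1, tag=tile_id)
        req.wait()

def func_1_irecv(rank, args):
    # Same as func_1, but with irecv + wait(); rank 1 still hangs mid-run.
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=torch.device('cuda', rank))
        req = dist.irecv(tensor=tile, src=0, tag=tile_id)
        req.wait()
        print(torch.mean(tile))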

I am also confused by some parts of send / recv: the PyTorch distributed docs say NCCL doesn't support send / recv, yet data is clearly being transmitted. So what is the underlying protocol behind send / recv?

My NCCL version is 2.8.02 and my PyTorch version is 1.8.0. Thank you so much!

One workaround that currently works is to add:

dist.barrier()

at the end of the for loop, but send / recv should already be blocking on its own, IMO.
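
Concretely, the workaround is the following (a minimal sketch; dist.barrier() is a collective over the default group, so both ranks call it at the end of every iteration):

def func_0(rank, args):
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=rank)
        tile.fill_(tile_id)
        dist.send(tensor=tile, dst=1, tag=tile_id)
        dist.barrier()  # both ranks must reach this point before either continues

def func_1(rank, args):
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=torch.device('cuda', rank))
        dist.recv(tensor=tile, src=0, tag=tile_id)
        print(torch.mean(tile))
        dist.barrier()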

@richguybobby NCCL has supported point-to-point send/recv since version 2.7.3, and we added the send and recv APIs on top of it in PyTorch 1.8.0. The NCCL operations are asynchronous. To synchronize the stream, could you try calling cudaDeviceSynchronize() to synchronize all streams with the host, or cudaEventSynchronize() to synchronize a single event with the host? Meanwhile, we will update the PyTorch Distributed docs. Thanks
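
From Python, one way to try this (just a sketch on the receiving side) is torch.cuda.synchronize(), which corresponds to cudaDeviceSynchronize on the current device, or a torch.cuda.Event, whose synchronize() maps to cudaEventSynchronize:

def func_1(rank, args):
    for tile_id in range(100):
        tile = torch.zeros([1, 3, 3072, 3072], dtype=torch.float, device=torch.device('cuda', rank))
        dist.recv(tensor=tile, src=0, tag=tile_id)
        # Option 1: block the host until all streams on this device have finished.
        torch.cuda.synchronize()
        # Option 2 (event-based): record an event on the current stream and wait on it.
        # event = torch.cuda.Event()
        # event.record()
        # event.synchronize()
        print(torch.mean(tile))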
