Torch.distributed.broadcast deadlock

When I train a model with DDP on 4 GPUs and evaluate it on a single GPU (the process with args.local_rank == 0), I want to broadcast the top1 accuracy to the other GPUs, but I get a deadlock. The processes with local_rank = 1, 2, 3 just move on to the next command instead of blocking to receive the broadcast result.

The code is shown below. It is executed after each training epoch finishes.

if args.local_rank == 0:
    top1, top5 = test(net, testloader, criterion, False)
torch.distributed.broadcast(torch.tensor(top1).cuda(args.local_rank), src=0, async_op=False)
print("local rank:{}, top1:{}".format(args.local_rank, top1))

The result is shown below. The process hung after printing the following information:

Has anyone else run into the same problem?

Can you try it this way?

if args.local_rank == 0:
    top1, top5 = test(net, testloader, criterion, False)
    top1 = torch.tensor(top1).cuda(args.local_rank)
else:
    top1 = torch.tensor(0.).cuda(args.local_rank)
torch.distributed.broadcast(top1, src=0, async_op=False)
print("local rank:{}, top1:{}".format(args.local_rank, top1.item()))

Thanks for your reply! But it still doesn’t work.

Hey @cindybrain, that’s weird. Could you please share a self-contained repro? Thanks!

Hi @mrshenli,
I created a mini repro: GitHub - SHu0421/Question-Repo. You can run it directly with
bash train.sh. My torch version is 1.8.1 and the CUDA version is 10.2.
I ran the code on four Tesla V100 GPUs (one node). It hung as before, with all GPUs at 100% utilization.