`torch.distributed.barrier` used in multi-node distributed data-parallel training

I think your code is correct; there really isn’t any visible issue with:

    model = model.to(device)
    print("Local Rank: {} | Location: {}".format(local_rank, 2.1))
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
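
For reference, here is a minimal sketch of the surrounding multi-node setup I would expect around those lines (assumptions: an env:// rendezvous set up by the launcher, and `MyModel` as a placeholder for your model):

    import os
    import torch
    import torch.distributed as dist

    # Assumes a launcher (e.g. torchrun / torch.distributed.launch) that sets
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and the local rank.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    model = MyModel().to(device)   # MyModel is a placeholder
    dist.barrier()                 # make sure every rank reaches this point
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )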

My knowledge is not enough to explain this behavior, but here are some possible debugging steps:

  1. check whether the “gloo” backend also halts (a backend-switch sketch follows after this list)
  2. insert more print tracers into the PyTorch source code
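
For point 1, a quick test is to keep the launch command identical and change only the backend string (a sketch, assuming env:// initialization); exporting `NCCL_DEBUG=INFO` when you switch back to nccl also produces per-rank logs that often show where a collective gets stuck:

    import torch.distributed as dist

    # Same script and launcher, only the backend changes; if this also hangs,
    # the problem is probably not NCCL-specific.
    dist.init_process_group(backend="gloo", init_method="env://")

    # For the nccl run, set the debug variable in the launch environment, e.g.:
    #   NCCL_DEBUG=INFO python train.py ...
    # (train.py is a placeholder for your training script)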

It is most likely a problem with NCCL, because DDP basically does the following things during initialization:

  1. call dist._broadcast_coalesced to broadcast the parameters to all processes in the group (see the broadcast sketch after this list)

    dist._broadcast_coalesced is defined in torch/csrc/distributed/c10d/comm.cpp;
    however, since it is a private function, there is no documentation about whether it is blocking, etc. I only know that it is invoked by all processes.

  2. call _ddp_init_helper, which basically only does some local operations, as its docstring describes:

    Initialization helper function that does the following:
    
         (1) replicating the module from device[0] to the other devices
         (2) bucketing the parameters for reductions
         (3) resetting the bucketing states
         (4) registering the grad hooks
         (5) passing a handle of DDP to SyncBatchNorm Layer
    

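If you want to probe step 1 above directly, a rough stand-in is to broadcast a parameter over the default process group with the public dist.broadcast (only an approximation of the private _broadcast_coalesced, but if this hangs too, the collective itself is the problem):

    import torch.distributed as dist

    # Every rank must call this; broadcast sends rank 0's value to all ranks.
    # If some rank never returns from this call, the broadcast is hanging.
    param = next(model.parameters())
    dist.broadcast(param.data, src=0)
    print("broadcast finished on rank", dist.get_rank())
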
You can also check your NCCL installation, though this might not help you much if the “gloo” backend halts as well.
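
One quick check is to print the NCCL version that your PyTorch build ships with (this only confirms the build, not the connectivity between nodes):

    import torch

    # NCCL version bundled with this PyTorch build (an int or a tuple,
    # depending on the PyTorch version); raises if built without NCCL.
    print(torch.cuda.nccl.version())
    print("CUDA available:", torch.cuda.is_available(),
          "| device count:", torch.cuda.device_count())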

:slightly_frowning_face: Sorry that I cannot help you more with this problem.