`torch.distributed.barrier` used in multi-node distributed data-parallel training

I think your code is correct; there really isn’t any visible issue with:

    model = model.to(device)
    print("Local Rank: {} | Location: {}".format(local_rank, 2.1))
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
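
For reference, here is a minimal sketch of the surrounding multi-node setup I would expect around those lines (assumptions: an env:// rendezvous set up by the launcher, and `MyModel` as a placeholder for your model):

    import os
    import torch
    import torch.distributed as dist

    # Assumes a launcher (e.g. torchrun / torch.distributed.launch) that sets
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and the local rank.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    model = MyModel().to(device)   # MyModel is a placeholder
    dist.barrier()                 # make sure every rank reaches this point
    ddp_model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )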

My knowledge is not enough to explain this behavior, but here are some possible debugging steps:

  1. check whether the “gloo” backend also halts (a backend-switch sketch follows after this list)
  2. insert more print tracers into the PyTorch source code
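
For point 1, a quick test is to keep the launch command identical and change only the backend string (a sketch, assuming env:// initialization); exporting `NCCL_DEBUG=INFO` when you switch back to nccl also produces per-rank logs that often show where a collective gets stuck:

    import torch.distributed as dist

    # Same script and launcher, only the backend changes; if this also hangs,
    # the problem is probably not NCCL-specific.
    dist.init_process_group(backend="gloo", init_method="env://")

    # For the nccl run, set the debug variable in the launch environment, e.g.:
    #   NCCL_DEBUG=INFO python train.py ...
    # (train.py is a placeholder for your training script)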

It is most likely a problem with NCCL, because DDP basically does the following things during initialization:

  1. call dist._broadcast_coalesced to broadcast the parameters to all processes in the group (see the broadcast sketch after this list)

    dist._broadcast_coalesced is defined in torch/csrc/distributed/c10d/comm.cpp;
    however, since it is a private function, there is no documentation about whether it is blocking, etc. I only know that it is invoked by all processes.

  2. call _ddp_init_helper, which basically only does some local operations, as its docstring describes:

    Initialization helper function that does the following:
    
         (1) replicating the module from device[0] to the other devices
         (2) bucketing the parameters for reductions
         (3) resetting the bucketing states
         (4) registering the grad hooks
         (5) passing a handle of DDP to SyncBatchNorm Layer
    

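If you want to probe step 1 above directly, a rough stand-in is to broadcast a parameter over the default process group with the public dist.broadcast (only an approximation of the private _broadcast_coalesced, but if this hangs too, the collective itself is the problem):

    import torch.distributed as dist

    # Every rank must call this; broadcast sends rank 0's value to all ranks.
    # If some rank never returns from this call, the broadcast is hanging.
    param = next(model.parameters())
    dist.broadcast(param.data, src=0)
    print("broadcast finished on rank", dist.get_rank())
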
You can also check your NCCL installation, though this might not help you much if the “gloo” backend halts as well.
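
One quick check is to print the NCCL version that your PyTorch build ships with (this only confirms the build, not the connectivity between nodes):

    import torch

    # NCCL version bundled with this PyTorch build (an int or a tuple,
    # depending on the PyTorch version); raises if built without NCCL.
    print(torch.cuda.nccl.version())
    print("CUDA available:", torch.cuda.is_available(),
          "| device count:", torch.cuda.device_count())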

:slightly_frowning_face: Sorry that I cannot help you more with this problem.