I think your code is correct; there really isn't any visible issue with:
```python
model = model.to(device)
print("Local Rank: {} | Location: {}".format(local_rank, 2.1))
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
```
I don't know enough to explain this behavior, but here are some possible ways to debug it:
- try the "gloo" backend and see whether it also halts
- insert more print tracers into the PyTorch source code
It is most likely a problem with NCCL, because DDP basically does two things during initialization:

- call `dist._broadcast_coalesced` to broadcast parameters to all processes. `dist._broadcast_coalesced` is defined in `torch/csrc/distributed/c10d/comm.cpp`; however, since it is a private function, there is no documentation about whether it is blocking, etc. I only know that it is invoked by all processes.
- call `_ddp_init_helper`, which basically only does some local operations: "Initialization helper function that does the following: (1) replicating the module from device[0] to the other devices (2) bucketing the parameters for reductions (3) resetting the bucketing states (4) registering the grad hooks (5) passing a handle of DDP to SyncBatchNorm Layer"
You can also check your NCCL installation, though this might not help you much if the "gloo" backend also halts.
Sorry that I cannot help you more with this problem.