`torch.distributed.barrier` used in multi-node distributed data-parallel training

For now, resume is always False during my tests, i.e., the model is always trained from scratch, so we can safely ignore that code for now.


To put it simply: if you just want one process (e.g., rank 0) to execute mkdir, download, etc., then you should do:

import torch
import argparse


def main():
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    local_rank = args.local_rank
    
    torch.distributed.barrier()

    if local_rank == 0:
        print(local_rank)
    
    torch.distributed.barrier()

    print("{} exit".format(local_rank))


if __name__ == "__main__":
    main()

this will print:

0
0 exit
2 exit
1 exit3 exit

And you should not do this:

import torch
import argparse


def main():
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    local_rank = args.local_rank
    
    if local_rank != 0:
        torch.distributed.barrier()

    print(local_rank)
    
    if local_rank == 0:
        torch.distributed.barrier()

    print("{} exit".format(local_rank))


if __name__ == "__main__":
    main()

which will print

0
0 exit
2
2 exit
13
3 exit

1 exit

A barrier is just a barrier: it requires all processes in the group to reach a barrier call, no matter where that call is placed. So the second snippet basically just delays every process except rank 0. Unless the code between the two barriers becomes a no-op (equivalent to return / pass) once any one process (e.g., process 0) has executed it, you are not going to get the result you expect.
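For reference, a minimal sketch of the pattern that gives “only rank 0 does the one-time work, everyone else waits for it” (the process group is assumed to be initialized already, and download_dataset is a hypothetical placeholder):

import torch

# Assumes torch.distributed.init_process_group() has already been called.
if torch.distributed.get_rank() == 0:
    download_dataset()  # hypothetical placeholder for the one-time work (mkdir, download, ...)
torch.distributed.barrier()  # every other rank waits here until rank 0 arrives
# From here on, all ranks proceed together and can safely use the downloaded files.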

Also, please make sure that your CUDA runtime has the same major and minor version as the CUDA version your torch build was compiled with; CUDA 9 is not compatible with CUDA 10, so otherwise you are likely to run into issues when using “nccl” or doing CUDA tensor computations.
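If it helps, here is a quick sketch for printing the versions a given PyTorch build reports, to be run in the same environment you launch training from:

import torch

print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version this PyTorch build was compiled with
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.is_available())       # whether the CUDA runtime is usable at all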


Thank you very much for repeating all the experiments @iffiX. I wanted to download the CIFAR-10 dataset using local rank 0 only, and once local rank 0 has downloaded the dataset, local ranks 1, 2, and 3 could proceed and use the downloaded cache for data preprocessing.

    train_set = torchvision.datasets.CIFAR10(root="data", train=True, download=True, transform=transform) 
    test_set = torchvision.datasets.CIFAR10(root="data", train=False, download=True, transform=transform)

However, I don’t see how your solution,

import torch
import argparse


def main():
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    local_rank = args.local_rank
    
    torch.distributed.barrier()

    if local_rank == 0:
        print(local_rank)
    
    torch.distributed.barrier()

    print("{} exit".format(local_rank))


if __name__ == "__main__":
    main()

is able to do this.
The printout of your second code snippet, in particular,

def main():
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int)
    args = parser.parse_args()
    local_rank = args.local_rank
    
    if local_rank != 0:
        torch.distributed.barrier()

    print(local_rank)
    
    if local_rank == 0:
        torch.distributed.barrier()

    print("{} exit".format(local_rank))


if __name__ == "__main__":
    main()

is expected, and it is also what I was trying to implement. I want local rank 0 to do all that stuff once, and then have local ranks 1, 2, and 3 start doing the same stuff in their own processes.

I think my CUDA version is compatible with PyTorch. I am using CUDA 10.2 + PyTorch 1.5.1.

The “asynchronous barrier” was also used in the HuggingFace example that I mentioned above. Since many people are using HuggingFace, I think their code at least runs fine on a single node.

I thought of an inelegant way to work around it:

import torch
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

import argparse
import os
import random
import numpy as np

def set_random_seeds(random_seed=0):

    torch.manual_seed(random_seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(random_seed)
    random.seed(random_seed)

def evaluate(model, device, test_loader):

    model.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total

    return accuracy

def main():

    num_epochs_default = 100
    batch_size_default = 256 # 1024
    learning_rate_default = 0.1
    random_seed_default = 0
    model_dir_default = "saved_models"
    model_filename_default = "resnet_distributed.pth"

    # Each process runs on 1 GPU device specified by the local_rank argument.
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--local_rank", type=int, help="Local rank. Necessary for using the torch.distributed.launch utility.")
    parser.add_argument("--num_epochs", type=int, help="Number of training epochs.", default=num_epochs_default)
    parser.add_argument("--batch_size", type=int, help="Training batch size for one process.", default=batch_size_default)
    parser.add_argument("--learning_rate", type=float, help="Learning rate.", default=learning_rate_default)
    parser.add_argument("--random_seed", type=int, help="Random seed.", default=random_seed_default)
    parser.add_argument("--model_dir", type=str, help="Directory for saving models.", default=model_dir_default)
    parser.add_argument("--model_filename", type=str, help="Model filename.", default=model_filename_default)
    parser.add_argument("--resume", action="store_true", help="Resume training from saved checkpoint.")
    argv = parser.parse_args()

    local_rank = argv.local_rank
    num_epochs = argv.num_epochs
    batch_size = argv.batch_size
    learning_rate = argv.learning_rate
    random_seed = argv.random_seed
    model_dir = argv.model_dir
    model_filename = argv.model_filename
    resume = argv.resume

    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")
    # torch.distributed.init_process_group(backend="gloo")

    # torch.distributed.barrier()
    # Create directories outside the PyTorch program
    # Only create directory in one process because it is not multiprocess safe
    if local_rank == 0:
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

    # Prepare dataset and dataloader
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])


    if local_rank == 0:
        train_set = torchvision.datasets.CIFAR10(root="data", train=True, download=True, transform=transform) 
        test_set = torchvision.datasets.CIFAR10(root="data", train=False, download=True, transform=transform)
        
    torch.distributed.barrier()

    train_set = torchvision.datasets.CIFAR10(root="data", train=True, download=True, transform=transform) 
    test_set = torchvision.datasets.CIFAR10(root="data", train=False, download=True, transform=transform)

    model_filepath = os.path.join(model_dir, model_filename)


    # We need to use seeds to make sure that the models initialized in different processes are the same
    set_random_seeds(random_seed=random_seed)

    # Encapsulate the model on the GPU assigned to the current process
    model = torchvision.models.resnet18(pretrained=False)

    device = torch.device("cuda:{}".format(local_rank))
    model = model.to(device)
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

    # We only save the model who uses device "cuda:0"
    # To resume, the device for the saved model would also be "cuda:0"
    if resume:
        map_location = {"cuda:0": "cuda:{}".format(local_rank)}
        ddp_model.load_state_dict(torch.load(model_filepath, map_location=map_location))

    # Restricts data loading to a subset of the dataset exclusive to the current process
    train_sampler = DistributedSampler(dataset=train_set)

    train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=8)
    # Test loader does not have to follow distributed sampling strategy
    test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=8)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-5)

    # Loop over the dataset multiple times
    for epoch in range(num_epochs):

        print("Local Rank: {}, Epoch: {}, Training ...".format(local_rank, epoch))
        
        # Save and evaluate model routinely
        if epoch % 10 == 0:
            if local_rank == 0:
                accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
                torch.save(ddp_model.state_dict(), model_filepath)
                print("-" * 75)
                print("Epoch: {}, Accuracy: {}".format(epoch, accuracy))
                print("-" * 75)

        ddp_model.train()

        for data in train_loader:
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

if __name__ == "__main__":
    
    main()

But it still got stuck.
On node 0:

100.0%Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Local Rank: 3, Epoch: 0, Training ...
Local Rank: 2, Epoch: 0, Training ...
Local Rank: 1, Epoch: 0, Training ...
Local Rank: 0, Epoch: 0, Training ...

On node 1:

100.0%Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified

@mrshenli I commented out the model saving code but it still halted.

After reading your code a little bit more carefully, I agree that you may use the second solution, since all processes need to create the data loader, so the problem is not there.
Could you please try adding some print statements such as:

print("line230")
...
print("line232")

to show exactly where your code has halted? The current log is too limited to determine the exact statement that caused your code to halt.
And don’t forget to take care of ddp_model.load_state_dict(torch.load(model_filepath, map_location=map_location)) after solving the halting issue, as @mrshenli said.

@mrshenli In your tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints), I saw you were using ddp_model.load_state_dict to load model parameters. Is this method untested and unfavored?
I remember the example I documented in my blog post works perfectly. I tested model resuming a while ago and it worked fine. It only started having problems when I tried to add some barrier functions a few days ago.
Thank you.

@iffiX @mrshenli It seems that I have located where the halting is happening. Running the following code:

import torch
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

import argparse
import os
import random
import numpy as np

def set_random_seeds(random_seed=0):

    torch.manual_seed(random_seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(random_seed)
    random.seed(random_seed)

def evaluate(model, device, test_loader):

    model.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_loader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = correct / total

    return accuracy

def main():

    num_epochs_default = 100
    batch_size_default = 256 # 1024
    learning_rate_default = 0.1
    random_seed_default = 0
    model_dir_default = "saved_models"
    model_filename_default = "resnet_distributed.pth"

    # Each process runs on 1 GPU device specified by the local_rank argument.
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--local_rank", type=int, help="Local rank. Necessary for using the torch.distributed.launch utility.")
    parser.add_argument("--num_epochs", type=int, help="Number of training epochs.", default=num_epochs_default)
    parser.add_argument("--batch_size", type=int, help="Training batch size for one process.", default=batch_size_default)
    parser.add_argument("--learning_rate", type=float, help="Learning rate.", default=learning_rate_default)
    parser.add_argument("--random_seed", type=int, help="Random seed.", default=random_seed_default)
    parser.add_argument("--model_dir", type=str, help="Directory for saving models.", default=model_dir_default)
    parser.add_argument("--model_filename", type=str, help="Model filename.", default=model_filename_default)
    parser.add_argument("--resume", action="store_true", help="Resume training from saved checkpoint.")
    argv = parser.parse_args()

    local_rank = argv.local_rank
    num_epochs = argv.num_epochs
    batch_size = argv.batch_size
    learning_rate = argv.learning_rate
    random_seed = argv.random_seed
    model_dir = argv.model_dir
    model_filename = argv.model_filename
    resume = argv.resume

    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend="nccl")
    # torch.distributed.init_process_group(backend="gloo")

    if local_rank != 0:
        torch.distributed.barrier()
    
    print("Local Rank: {} | Location: {}".format(local_rank, 0))

    # Create directories outside the PyTorch program
    # Only create directory in one process because it is not multiprocess safe
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)

    # Prepare dataset and dataloader
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    train_set = torchvision.datasets.CIFAR10(root="data", train=True, download=True, transform=transform) 
    test_set = torchvision.datasets.CIFAR10(root="data", train=False, download=True, transform=transform)

    model_filepath = os.path.join(model_dir, model_filename)

    # We need to use seeds to make sure that the models initialized in different processes are the same
    set_random_seeds(random_seed=random_seed)

    # Encapsulate the model on the GPU assigned to the current process
    model = torchvision.models.resnet18(pretrained=False)

    print("Local Rank: {} | Location: {}".format(local_rank, 1))

    if local_rank == 0:
        torch.distributed.barrier()

    print("Local Rank: {} | Location: {}".format(local_rank, 2))

    device = torch.device("cuda:{}".format(local_rank))
    model = model.to(device)
    print("Local Rank: {} | Location: {}".format(local_rank, 2.1))
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
    print("Local Rank: {} | Location: {}".format(local_rank, 2.2))

    # We only save the model who uses device "cuda:0"
    # To resume, the device for the saved model would also be "cuda:0"
    if resume:
        map_location = {"cuda:0": "cuda:{}".format(local_rank)}
        ddp_model.load_state_dict(torch.load(model_filepath, map_location=map_location))

    
    # Restricts data loading to a subset of the dataset exclusive to the current process
    train_sampler = DistributedSampler(dataset=train_set)
    

    train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=8)
    # Test loader does not have to follow distributed sampling strategy
    test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=8)
    print("Local Rank: {} | Location: {}".format(local_rank, 2.3))

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=learning_rate, momentum=0.9, weight_decay=1e-5)

    # Loop over the dataset multiple times
    for epoch in range(num_epochs):

        print("Local Rank: {}, Epoch: {}, Training ...".format(local_rank, epoch))

        print("Local Rank: {} | Location: {}".format(local_rank, 3))
        
        # Save and evaluate model routinely
        if epoch % 10 == 0:
            if local_rank == 0:
                accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
                torch.save(ddp_model.state_dict(), model_filepath)
                print("-" * 75)
                print("Epoch: {}, Accuracy: {}".format(epoch, accuracy))
                print("-" * 75)

        print("Local Rank: {} | Location: {}".format(local_rank, 4))

        ddp_model.train()

        for data in train_loader:
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

if __name__ == "__main__":
    
    main()

For node 0:

Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified
Local Rank: 0 | Location: 1
Local Rank: 0 | Location: 2
Local Rank: 2 | Location: 0
Local Rank: 3 | Location: 0
Local Rank: 1 | Location: 0
Local Rank: 0 | Location: 2.1
Local Rank: 0 | Location: 2.2
Local Rank: 0 | Location: 2.3
Local Rank: 0, Epoch: 0, Training ...
Local Rank: 0 | Location: 3
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Local Rank: 2 | Location: 1
Local Rank: 2 | Location: 2
Local Rank: 1 | Location: 1
Local Rank: 1 | Location: 2
Local Rank: 3 | Location: 1
Local Rank: 3 | Location: 2
Local Rank: 2 | Location: 2.1
Local Rank: 1 | Location: 2.1
Local Rank: 3 | Location: 2.1
Local Rank: 2 | Location: 2.2
Local Rank: 2 | Location: 2.3
Local Rank: 1 | Location: 2.2
Local Rank: 1 | Location: 2.3
Local Rank: 2, Epoch: 0, Training ...
Local Rank: 2 | Location: 3
Local Rank: 2 | Location: 4
Local Rank: 1, Epoch: 0, Training ...
Local Rank: 1 | Location: 3
Local Rank: 1 | Location: 4
Local Rank: 3 | Location: 2.2
Local Rank: 3 | Location: 2.3
Local Rank: 3, Epoch: 0, Training ...
Local Rank: 3 | Location: 3
Local Rank: 3 | Location: 4

For node 1:

Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified
Local Rank: 0 | Location: 1
Local Rank: 0 | Location: 2
Local Rank: 2 | Location: 0
Local Rank: 3 | Location: 0
Local Rank: 1 | Location: 0
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Local Rank: 0 | Location: 2.1
Local Rank: 2 | Location: 1
Local Rank: 2 | Location: 2
Local Rank: 1 | Location: 1
Local Rank: 1 | Location: 2
Local Rank: 3 | Location: 1
Local Rank: 3 | Location: 2
Local Rank: 2 | Location: 2.1
Local Rank: 1 | Location: 2.1
Local Rank: 3 | Location: 2.1

So the second node got halted in

ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

Since you are running 1.5.1, I just dove into the 1.5.1 code and can verify that the current DistributedDataParallel does have a _sync_params method, which broadcasts all parameters and buffers and then sets the local params with the in-place operation set_:

def _sync_params(self):
    with torch.no_grad():
        # only do intra-node parameters sync for replicated single-device
        # CUDA modules
        if self.device_ids and len(self.device_ids) > 1:
            # intra-node parameter sync
            result = torch.cuda.comm.broadcast_coalesced(
                self.modules_params[0],
                self.device_ids,
                self.broadcast_bucket_size)
            for tensors, module_params in zip(result[1:],
                                              self.modules_params[1:]):
                for tensor, param in zip(tensors, module_params):
                    param.set_(tensor)
                    # Assume we have just run the optimizer and zeroed the
                    # grads of the parameters on the root model. We need
                    # to zero the grads on all model replicas as well.
                    # This snippet is copied from torch.optim.Optimizer.
                    if param.grad is not None:
                        param.grad.detach_()
                        param.grad.zero_()

And _sync_params will be invoked when you perform a forward operation, if syncing is enabled:

def forward(self, *inputs, **kwargs):
    if self.require_forward_param_sync:
        self._sync_params()

So load_state_dict() should work, theoretically, because the newly loaded params will be broadcast to the other processes.
Sorry about my outdated knowledge above.
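For completeness, a minimal sketch of the save/load pattern this implies, reusing the ddp_model, model_filepath, and local_rank names from your script (assumes the checkpoint path is visible to all ranks and the process group is initialized):

import torch

# Rank 0 writes the checkpoint; everyone waits until the file exists; then all ranks load it.
if torch.distributed.get_rank() == 0:
    torch.save(ddp_model.state_dict(), model_filepath)
torch.distributed.barrier()
map_location = {"cuda:0": "cuda:{}".format(local_rank)}
ddp_model.load_state_dict(torch.load(model_filepath, map_location=map_location))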

I think your code is correct; there really isn’t any visible issue with:

    model = model.to(device)
    print("Local Rank: {} | Location: {}".format(local_rank, 2.1))
    ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

My knowledge is not enough to explain this behavior. Some possible debugging steps:

  1. Does the “gloo” backend also halt? (See the sketch after this list for the one-line switch.)
  2. Insert some more print tracers into the PyTorch source code.
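For item 1, the backend switch is a one-line change, and turning on NCCL’s own logging may also show where the hang happens (a sketch, assuming the standard NCCL_DEBUG environment variable):

import os
import torch

# Ask NCCL to log what it is doing; must be set before init_process_group.
os.environ["NCCL_DEBUG"] = "INFO"

# Or simply switch the backend and check whether the hang persists.
torch.distributed.init_process_group(backend="gloo")  # instead of "nccl"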

It is most likely a problem with nccl, because DDP basically does these things during initialization:

  1. call dist._broadcast_coalesced to broadcast parameters to all groups

    dist._broadcast_coalesced is defined in torch/csrc/distributed/c10d/comm.cpp;
    however, since it is a private function, there is no indication of whether it is blocking, etc. I only know that it is invoked by all processes.

  2. call _ddp_init_helper, which basically only does some local operations, such as:

    Initialization helper function that does the following:
    
         (1) replicating the module from device[0] to the other devices
         (2) bucketing the parameters for reductions
         (3) resetting the bucketing states
         (4) registering the grad hooks
         (5) passing a handle of DDP to SyncBatchNorm Layer
    

You can also check your nccl installation, but this might not help you much if the “gloo” backend also halts.
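If it is useful, one way to print the NCCL and CUDA versions that your PyTorch build reports (my suggestion, not necessarily the exact check meant above):

import torch

print(torch.cuda.nccl.version())  # NCCL version bundled with this PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was compiled with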

Sorry that I cannot help you more with this problem.

Thank you very much @iffiX. I will try gloo tomorrow.
Best,
Lei

@iffiX @mrshenli I just got time to test the gloo backend. It seems that the training runs without significant problems. However, I do have concerns. I found that the number of processes is 7 on each node, despite the fact that I requested 4 GPUs on each node.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     53447      C   /opt/conda/bin/python                       1511MiB |
|    0     53448      C   /opt/conda/bin/python                        803MiB |
|    0     53449      C   /opt/conda/bin/python                        803MiB |
|    0     53450      C   /opt/conda/bin/python                        803MiB |
|    1     53448      C   /opt/conda/bin/python                       1511MiB |
|    2     53449      C   /opt/conda/bin/python                       1511MiB |
|    3     53450      C   /opt/conda/bin/python                       1511MiB |
+-----------------------------------------------------------------------------+

The GPU memory usage is not even across the GPUs either.

$ nvidia-smi
Tue Jul 21 19:49:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   38C    P0    49W / 163W |   3933MiB / 32480MiB |     11%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    46W / 163W |   1522MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   38C    P0    46W / 163W |   1522MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    48W / 163W |   1522MiB / 32480MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   36C    P0    42W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P0    43W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0    43W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0    41W / 163W |     11MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Can you guys explain what’s happening here?

Regarding the nccl backend problem, I currently don’t have time to troubleshoot at a lower level. But I believe it is a bug, either in the nccl library or in the PyTorch implementation.

Thank you.

Best,

Lei

This is not an error, as you can see:

 0     53448      C   /opt/conda/bin/python                        803MiB
 1     53448      C   /opt/conda/bin/python                       1511MiB

Their PIDs are the same; it seems that each “secondary process” (every process except the “primary process”) also allocates something on GPU 0, probably for receiving tensors etc. The 803MiB should be the base CUDA context memory used whenever a process initializes CUDA in PyTorch; actions such as moving a model to a GPU or creating a tensor on a GPU will initialize CUDA. See this issue for a detailed explanation: issue
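As a side note, a common way to keep those stray allocations off GPU 0 (a general pattern, not something specific to this thread) is to pin each process to its own device before doing any CUDA work:

import torch

# Run this early in each process, before creating any CUDA tensors or models,
# so that default allocations and collectives use this process's own GPU.
torch.cuda.set_device(local_rank)  # local_rank as parsed from --local_rank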

I can also replicate this behavior on my machine, so don’t worry about it:

The replication script is a slightly modified version of the one from the DDP tutorial:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
import os
os.environ["MASTER_ADDR"]="localhost"
os.environ["MASTER_PORT"]="9003"

def example(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    while True:
        # forward pass
        outputs = ddp_model(torch.randn(20, 10).to(rank))
        labels = torch.randn(20, 10).to(rank)
        # backward pass
        loss_fn(outputs, labels).backward()
        # update parameters
        optimizer.step()

def main():
    world_size = 2
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
    main()

Thank you @iffiX. I used to always use nccl. In my experience, the GPU memory occupancy was always the same for each GPU on each node.

Then it is implementation related.

Sorry for being late to the discussion.

I saw you were using ddp_model.load_state_dict to load model parameters. Is this method untested and unfavored?

Right, we don’t have tests for saving/loading DDP models yet, IIUC. Let me create an issue to track it.

So the second node got halted in

The DDP constructor does have a broadcast op; I believe that’s where it halted:

Looking at the log, some ranks proceed beyond 2.1 while others are waiting at 2.1, which suggests there is a desync across the processes. Curious: why is there no output for Location 0 at rank 0? Is it just because the print for Location 0 is actually inside the if clause?

For the log, can you also try printing dist.get_world_size(), and use dist.get_rank() instead of the local rank? Let’s verify whether the launching script did anything wrong.
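Something like this near the top of main() would do it (a sketch; dist here is torch.distributed):

print("world_size={}, global_rank={}, local_rank={}".format(
    torch.distributed.get_world_size(),
    torch.distributed.get_rank(),
    local_rank))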

I found that the number of processes is 7 on each node, despite the fact that I requested 4 GPUs on each node.

Looks like the other processes (local_rank != 0) also created a CUDA context and allocated some tensors on cuda:0. You can avoid this by setting the CUDA_VISIBLE_DEVICES variable for each subprocess, either directly on the command line or in the program before loading any CUDA logic. See Running on specific GPU device.
Note that after this change, you will also need to change all f'cuda:{local_rank}' to cuda:0, as each process now only sees one device.
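A minimal sketch of that change (assuming the torch.distributed.launch style --local_rank argument; the variable must be set before any CUDA call happens in the process):

import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)  # this process now sees only one GPU
device = torch.device("cuda:0")                       # so the visible device is always index 0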

Hmm, this is weird. The gloo backend working means that all ranks and world sizes are configured properly. Let’s still double check using dist.get_world_size() and dist.get_rank().

If this is the case, then the broadcast in DDP might not be the place that caused the hang. Do you have access to the PyTorch Python files in your local env? Can you try adding some prints to wrap this?

@leimao one more question regarding your test env. Would I be correct if I assume you have two 8-GPU machines, and you are using the first 4 GPUs (cuda:0-3) on those two machines, and you have exclusive access to those GPUs?

Yes. I have 8 GPUs on each node, but I just used 4 of them. I could have been using all of them.