Why is the second barrier used in the DDP tutorial?

I am reading the DistributedDataParallel tutorial. The last line of the following snippet confuses me:

    if rank == 0:
        # All processes should see same parameters as they all start from same
        # random parameters and gradients are synchronized in backward passes.
        # Therefore, saving it in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn = nn.MSELoss()
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Use a barrier() to make sure that all processes have finished reading the
    # checkpoint
    dist.barrier()

  1. If the last line is used to ensure all processes finish reading, why doesn't it directly follow ddp_model.load_state_dict?
  2. For each iteration, do we need to call dist.barrier()?

If the last line is used to ensure all processes finish reading, why doesn't it directly follow ddp_model.load_state_dict?

Good catch! The original reason for adding that barrier is to guard the file deletion below:

    if rank == 0:
        os.remove(CHECKPOINT_PATH)

But looking at it again, this barrier is not necessary, because backward() on the DDP model is also a synchronization point: it calls AllReduce internally. Let me remove that.
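To make that concrete, here is a minimal sketch of the revised ordering, reusing rank, ddp_model, optimizer, loss_fn, and CHECKPOINT_PATH from the snippet above; it illustrates the reasoning in this answer rather than quoting the tutorial itself:

    # Every rank loads the checkpoint and runs one forward/backward pass.
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))
    optimizer.zero_grad()
    loss_fn(ddp_model(torch.randn(20, 10)), torch.randn(20, 5).to(rank)).backward()
    optimizer.step()

    # Rank 0's AllReduce inside backward() cannot finish until every rank has
    # entered its own backward(), which happens only after that rank has loaded
    # the checkpoint. So deleting the file here is safe without a second barrier.
    if rank == 0:
        os.remove(CHECKPOINT_PATH)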

For each iteration, do we need to call dist.barrier()?

No. Two common reasons for using a barrier are:

  1. to avoid AllReduce timeouts caused by skewed workloads across DDP processes
  2. code after barrier() on rank A depends on the completion of code before barrier() on rank B (see the sketch after this list).

If neither of these is a concern in your use case, then the barrier shouldn't be necessary.
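A minimal sketch of the second case, assuming a hypothetical write_shared_file() helper and file name: rank 0 produces a file that every other rank reads, so the barrier orders the reads after the write.

    import torch
    import torch.distributed as dist

    if rank == 0:
        write_shared_file('shared_data.pt')  # hypothetical helper, run on rank 0 only

    # Every rank blocks here until rank 0 has finished writing the file.
    dist.barrier()

    # Now it is safe for all ranks to read what rank 0 produced.
    data = torch.load('shared_data.pt')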


If I just want to save a model, I don’t need dist.barrier(), right?

Yep, that should be fine. If only rank 0 saves the model and that might take a very long time, you can set the timeout argument in init_process_group to a larger value; the default is 30 minutes.
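For reference, a hedged sketch of raising that timeout, assuming a NCCL backend, environment-variable rendezvous, and rank/world_size coming from your launcher; adjust to your own setup:

    import datetime
    import torch.distributed as dist

    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
        # Default is 30 minutes; raise it if rank 0's checkpointing can take longer.
        timeout=datetime.timedelta(hours=1),
    )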