Distributed: other ranks not waiting for rank_0's evaluation

Hello!
I am using DistributedDataParallel on one node with 4 GPUs. Also using DistributedSampler for training.

# Wrap the per-process model replica for multi-GPU training on this node
self.model = torch.nn.parallel.DistributedDataParallel(
    self.model,
    device_ids=[self.local_rank],
    output_device=self.local_rank,
    find_unused_parameters=True
)

I do evaluation after every training epoch, only on rank_0.
During evaluation I observed (through nvidia-smi) that the other ranks/GPUs (1, 2, 3) appear to keep processing something at 100% load.
My questions:

  1. Is it possible that the other ranks continue training the next epoch without waiting for rank_0 to finish evaluation?
  2. If (1) is true, is it OK to leave it like this (will rank_0 process its part of the next epoch after it finishes evaluation)? Or is it better to set a barrier so that the other ranks wait for rank_0 to finish evaluation?

Thanks!

  1. Is it possible that the other ranks continue training the next epoch without waiting for rank_0 to finish evaluation?

Yes, it is possible, because all communication/synchronization happens in the backward pass. The other ranks will proceed with their next forward pass and local backward pass, and then block on the AllReduce operation in the backward pass.
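To make the timing concrete, here is a minimal sketch (not your actual code; num_epochs, sampler, train_loader, criterion, optimizer, model, and evaluate are placeholders) of where each rank ends up blocking:

import torch.distributed as dist

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)              # DistributedSampler: reshuffle each epoch
    for batch, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch), target)
        loss.backward()                   # gradient AllReduce happens here, across all ranks
        optimizer.step()
    if dist.get_rank() == 0:
        evaluate(model)                   # placeholder eval routine; runs only on rank_0
    # Ranks 1-3 start the next epoch immediately; their first backward() then
    # blocks on the AllReduce until rank_0 finishes evaluating and rejoins training.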

  2. If (1) is true, is it OK to leave it like this (will rank_0 process its part of the next epoch after it finishes evaluation)?

Yes, it should be OK to leave it this way. The other ranks just block, waiting on the AllReduce, until rank_0 finishes evaluation and runs its subsequent backward pass. It shouldn't affect correctness.
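If you do prefer the other ranks to wait explicitly, a barrier after the per-epoch evaluation would do it (a sketch, assuming the default process group is already initialized; evaluate is a placeholder):

import torch.distributed as dist

if dist.get_rank() == 0:
    evaluate(model)      # placeholder evaluation routine
dist.barrier()           # ranks 1-3 pause here until rank_0 also reaches the barrier

The trade-off is that with the barrier the other GPUs sit idle during evaluation instead of getting a head start on their next epoch.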

During evaluation I observed (through nvidia-smi) that the other ranks/GPUs (1, 2, 3) appear to keep processing something at 100% load.

Yes, while block-waiting on the AllReduce, the CUDA device shows as busy in nvidia-smi even though there may be no real computation running.


great answer, thanks!

Thank you for such a great explanation. It really helped me figure out how to do evaluation.

So is it possible to evaluate on all GPUs and combine the results, or will that be very complicated?
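One common pattern (a sketch, not something from this thread; device, model, and val_loader are placeholders, with val_loader built on a DistributedSampler over the validation set) is to let every rank evaluate its own shard and then all_reduce the summed statistics so each rank ends up with the global metric:

import torch
import torch.distributed as dist

correct = torch.zeros(1, device=device)
total = torch.zeros(1, device=device)
model.eval()
with torch.no_grad():
    for batch, target in val_loader:
        pred = model(batch.to(device)).argmax(dim=1)
        correct += (pred == target.to(device)).sum()
        total += target.numel()
dist.all_reduce(correct)     # default op is SUM: add counts from all ranks
dist.all_reduce(total)
accuracy = (correct / total).item()

Note that DistributedSampler may pad the dataset so every rank gets the same number of samples, which can skew the combined metric slightly for small validation sets.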