Unexpected behavior of synchronization while running DistributedDataParallel

Hi guys!
I have one PC with two GPUs on board. Below is a schematic version of my code for distributed training:

dist.init_process_group(backend="nccl", .....)
.....
model = DistributedDataParallel(model, device_ids=[rank])  # rank is 0 or 1
.....
for i in range(len(batches)):
    .....    
    outputs = model(inputs, .....)
    loss = criterion(outputs, ....)
    loss.backward()
    optimizer.step()
    print(rank, "-", i)
    .....

I expected to see the following print output (to save space, I quote it horizontally):
0-0 1-0, 0-1 1-1, 0-2 1-2, 0-3 1-3, ..., 0-n 1-n
However I got something unexpected:
0-0 1-0, 0-1 1-1, ..., 0-60 1-62, ..., 0-3125 1-3145, ...
Sometimes gpu_1 gets up to 20 iterations ahead of gpu_0!
I see two possibilities here:

  1. my code has an error somewhere
  2. it is normal behavior for synchronization

If 2 is the case, could you please explain why that is?
From what I've understood from the documentation, synchronization between processes happens at each iteration in the loss.backward() call. Suppose I have some model parameter w. At every iteration this parameter must be the same in the gpu_0 and gpu_1 model replicas.
For example:
at the first iteration w is:
w + (upd0_0 + upd1_0)/2
at the second iteration:
w + (upd0_0 + upd1_0)/2 + (upd0_1 + upd1_1)/2
at the i-th iteration:
w + (upd0_0 + upd1_0)/2 + (upd0_1 + upd1_1)/2 + ... + (upd0_i + upd1_i)/2
where upd0_i and upd1_i are the parameter updates computed during backprop at the i-th iteration on gpu_0 and gpu_1, respectively.
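
(For reference, here is a small sanity check I could add right after optimizer.step(), just a sketch reusing the model and dist objects from the snippet above, to confirm that the replicas really hold identical parameters:)

with torch.no_grad():
    for p in model.parameters():
        # element-wise max and min of this parameter across all ranks
        p_max = p.detach().clone()
        p_min = p.detach().clone()
        dist.all_reduce(p_max, op=dist.ReduceOp.MAX)
        dist.all_reduce(p_min, op=dist.ReduceOp.MIN)
        # if max == min everywhere, gpu_0 and gpu_1 hold the same values
        assert torch.equal(p_max, p_min), "parameter replicas diverged across ranks"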

Thanks!

Can you change the print statement to print(rank, "-", i, flush=True)? I guess some output is buffered, so it gives you the impression that one rank is behind the other.

Hi Yi Wang,
Thanks for the suggestion. I tried flush=True, but the situation remains the same. Looks like I have a bug in my code…

I realize now that the root cause is this: your print statement only requires input from the host, not from the device. If you print results from CUDA tensors, you should see synced output.

This is because, although DDP does sync across devices at each step, from the host's perspective the allreduce communication is just a non-blocking enqueue operation. There isn't anything wrong with your DDP code; your print statement just gives you an illusion, caused by the asynchronous enqueue on the host.
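
For example (a minimal sketch, reusing rank, i and loss from your loop), forcing a device-to-host sync before printing should make the counters line up:

# option 1: explicitly wait for all queued CUDA work, including the allreduce, to finish
torch.cuda.synchronize()
print(rank, "-", i, flush=True)

# option 2: print a value that lives on the GPU; .item() blocks until the device catches up
print(rank, "-", i, loss.item(), flush=True)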


I see, got it, thanks.