How Can DDP Processes Get Out of Sync?

I run a single-machine, 2-GPU, ResNet-based training job with one process per GPU. Each process prints a message at the start of each epoch.

After a few minutes, one process draws ahead of the other. The gap eventually grows to multiple epochs, even though both processes do finish.

How can this speed difference occur, given that DDP synchronizes the two processes during each call to backward()?

Based on a note on a website about different DDP processes receiving different numbers of inputs, I tried

```python
with model.join():
    ...  # <training loop>
```

But I observed no change in behavior. How does DDP sneak past that backprop synchronization point?

Unfortunately, I could not reproduce the problem in a simple test case. But maybe I am missing some logic?

- Ubuntu 20.04
- PyTorch torch-1.7.1-py3.8
- torch.cuda.nccl.version(): 2708
- 2x NVIDIA GTX Titan
- Single machine, 2 processes, one per GPU
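For reference, a stripped-down sketch of the kind of setup described above (the model, sizes, and port are placeholders, not my actual code):

```python
# Minimal single-machine DDP skeleton: one process per GPU, each printing
# a message at the start of every epoch. Model and data are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int, backend: str = "nccl") -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}" if backend == "nccl" else "cpu")
    model = DDP(nn.Linear(10, 2).to(device),
                device_ids=[rank] if backend == "nccl" else None)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(3):
        print(f"rank {rank}: starting epoch {epoch}", flush=True)
        x = torch.randn(8, 10, device=device)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()  # gradient allreduce is enqueued here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(run, args=(2,), nprocs=2)
```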

Hi, unless an explicit synchronization (such as a device-to-host copy or torch.cuda.synchronize) is triggered, this sort of skew is possible because the two GPUs run at different speeds, but such a significant skew (many epochs) seems unlikely. Are there any other jobs or processes running on your GPUs that may slow one down, and is the problem consistently reproducible?

As for model.join(): that API exists to support different processes having a different number of inputs. Is that your case here? If not, you should not need model.join().
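To make the "different number of inputs" case concrete, here is a small sketch (not your code; batch counts and sizes are made up) where one rank runs more training steps than the other, which is the situation join() exists for:

```python
# Sketch: ranks deliberately see different numbers of batches. Without
# join(), the rank with fewer batches would stop participating in the
# gradient allreduce and the other rank's backward() would hang.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(rank: int, world_size: int, backend: str = "gloo") -> int:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    model = DDP(nn.Linear(4, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    n_batches = 6 if rank == 0 else 4  # deliberately uneven input counts
    steps = 0
    with model.join():  # shadows collectives for ranks that finish early
        for _ in range(n_batches):
            loss = model(torch.randn(2, 4)).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
            steps += 1
    dist.destroy_process_group()
    return steps
```

With equal-length, drop_last loaders on every rank, this situation never arises, which is why join() is unnecessary there.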

Thank you for both answers, Rohan! Before responding I was trying to verify that moving the model save to after the optimizer step solved the issue, and I didn't have a chance for that yet.

The oddity with the out-of-sync processes is that they run in lockstep for quite a while, meaning several minutes, and then one of them slows way down. Often, once the faster process finishes, the slow one sits at 100% of both GPU and CPU (according to top). So something is seriously wrong.

What confuses me is that I thought DDP forces synchronization as part of the backward() call, to ensure the same gradients are used on all the processes going forward. So how could two processes get out of sync at all, even beyond a single training loop?

Nothing else is running on the machine besides these two processes.

Inputs should be of the same lengths; the model.join() attempt was just out of desperation. I run with drop_last in the dataloader, so even the final batch is fully populated. If anything, the two processes are too decoupled, rather than one waiting for the other due to unequal inputs.

The one difference from all the tutorial setups is that I am using k-fold cross-validation with a distributed sampler. So epochs are put together by train/validate passes over multiple splits (i.e. rotating the validation folds). But to the DDP mechanism that shouldn't make a difference, I would think.
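For what it's worth, the per-fold loader construction looks roughly like this (function and variable names here are placeholders, not my actual code):

```python
# Hypothetical sketch of a per-fold loader: a DistributedSampler over the
# rotating train split, with drop_last=True so the final batch is full,
# and set_epoch() so shuffling stays consistent across ranks per epoch.
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def fold_loader(dataset, fold_indices, rank, world_size, epoch, batch_size=4):
    train_split = Subset(dataset, fold_indices)
    sampler = DistributedSampler(train_split, num_replicas=world_size,
                                 rank=rank, shuffle=True)
    sampler.set_epoch(epoch)  # same shuffle order on every rank this epoch
    return DataLoader(train_split, batch_size=batch_size,
                      sampler=sampler, drop_last=True)
```

Since DistributedSampler pads the split so every rank gets the same number of samples, each rank should see the same number of batches per fold.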

For now both processes run on the same machine, to exclude an NCCL version mismatch, though my goal is cross-machine work. I did try updating NCCL and CUDA to (I think) version 11, but that did not fix the issue.

Does any of the above point to a misunderstanding on my part? Of course,
it could simply be a bug in the code somewhere.


The way NCCL allreduce calls (basically the backward sync point you mention) work is by enqueuing the ops on the GPU, after which the CPU host can continue. So, while unlikely, this desynchronization can occur.

Are you noticing any significant performance issues when this happens? If you do need to synchronize the GPUs, you can use torch.cuda.synchronize() or dist.barrier(), though this might affect performance, especially if called very frequently.
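For example, one way to bound the skew without paying a barrier on every iteration is to align the ranks once per epoch (a sketch, assuming a standard process-group setup; the helper name is made up):

```python
# Sketch: CPU-side alignment once per epoch rather than per step.
# First drain this rank's queued GPU work, then wait for the other ranks.
import torch
import torch.distributed as dist

def end_of_epoch_sync() -> None:
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this rank's enqueued kernels
    dist.barrier()                # then wait for every other rank
```

Calling this at each epoch boundary keeps the processes within one epoch of each other while leaving the per-step allreduce fully asynchronous.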

Sorry, Rohan; I needed to move forward, and will run on a single GPU for now. I did try synchronize() and barrier(), but somehow one process ends up taking 100% of a CPU and all the memory on its GPU. So something is wrong; I'll have to go through my own code again when I get the chance. Thank you nonetheless!