Best practice for uneven dataset sizes with DistributedDataParallel

I am wondering about the recommended approach to balancing dataset sizes across different devices while training with DDP. I have split my dataset across four GPUs, but one of them receives a single extra batch, which causes training to hang and wait indefinitely for gradient synchronization with the other devices. I have thought of a few fixes but each seems like it has a drawback:

1. Throw out the final batch to guarantee an equal number of iterations.
2. Use DDP's no_sync() context manager on the final batch. This will cause one device to have different model weights.
3. Proceed to the next epoch on the other devices and allow the first batch of epoch 2 to synchronize with this final batch from epoch 1.

I appreciate any suggestions you can give!

I have seen people go with option 1.

People usually report this issue because the application does not know, prior to training, how many batches each process will take. In your case, it seems you know deterministically which processes will take one more batch? If so, I think we might be able to do better. For example:

Option 1. Randomly skip one batch in each of the processes that takes one more input batch.
Option 2. Use no_sync on the first batch in each of the processes that takes one more input batch. no_sync won’t lead to parameter disparities; it just accumulates the gradient in param.grad. As long as you don’t run optimizer.step() in that iteration, it should be fine. The next forward-backward pass outside the no_sync context will accumulate more gradient into param.grad, and both are consumed together.
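
A minimal sketch of the no_sync idea, assuming ddp_model is a module wrapped in torch.nn.parallel.DistributedDataParallel and that each process already knows how many extra batches it holds. The function name train_epoch and the extra_batches argument are made up for illustration. Note that the forward pass has to sit inside no_sync() as well, not only backward().

```python
import contextlib


def train_epoch(ddp_model, optimizer, loss_fn, batches, extra_batches=0):
    """Run one epoch; return the number of synchronized optimizer steps.

    Assumes ddp_model is wrapped in DistributedDataParallel. The first
    `extra_batches` batches (hypothetical argument) run under no_sync().
    """
    synced_steps = 0
    for i, (inputs, targets) in enumerate(batches):
        local_only = i < extra_batches
        # no_sync() skips the gradient allreduce; grads just add up in param.grad.
        ctx = ddp_model.no_sync() if local_only else contextlib.nullcontext()
        with ctx:
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()
        if local_only:
            # No optimizer.step() here, so parameters stay identical across ranks.
            continue
        # This backward ran outside no_sync(), so the locally accumulated grads
        # were allreduced together with the new ones.
        optimizer.step()
        optimizer.zero_grad()
        synced_steps += 1
    return synced_steps
```

A process with one extra batch would call train_epoch(..., extra_batches=1) and the others extra_batches=0, so every process performs the same number of synchronized steps. Keep in mind that the process with the extra batch contributes the sum of two batches' gradients to that step, so you may want to rescale its loss.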

Would something like this work for you? This can be implemented in the application using allreduce.
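
For the allreduce part, one possible shape (a sketch, not necessarily the only way to do it) is to have every process report how many batches it holds and take the minimum, so that all ranks loop the same number of times. common_num_batches is a hypothetical helper name. It assumes the default process group is already initialized, and with the NCCL backend the tensor has to live on the rank's GPU.

```python
import torch
import torch.distributed as dist


def common_num_batches(local_num_batches, device="cpu"):
    """Return the smallest batch count across all ranks (hypothetical helper).

    Assumes dist.init_process_group() has already been called.
    With NCCL, pass the rank's CUDA device; "cpu" works with gloo.
    """
    count = torch.tensor([local_num_batches], dtype=torch.int64, device=device)
    # In-place reduction: afterwards every rank holds the global minimum.
    dist.all_reduce(count, op=dist.ReduceOp.MIN)
    return int(count.item())
```

Comparing the returned value with the local count tells each process how many extra batches it has, which is the number to skip or to run under no_sync.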

I think that would accomplish it but I basically adopted approach 3 and it has been working fine.
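
A rough sketch of what approach 3 could look like, for illustration only and not necessarily the exact code used here. Each rank treats its epochs as one continuous stream of batches and runs a fixed number of steps, so a rank that finishes an epoch early simply starts the next one. The names cross_epoch_batches, make_epoch_batches, and total_steps are made up, and total_steps has to be the same on every rank or the last step will still hang.

```python
import itertools


def cross_epoch_batches(make_epoch_batches, total_steps):
    """Yield (step, epoch, batch) for exactly total_steps steps.

    make_epoch_batches(epoch) is a hypothetical callable returning this
    rank's batches for one epoch; it must return at least one batch,
    otherwise the stream below never produces anything.
    """
    def stream():
        # Roll straight into the next epoch when the current one runs out.
        for epoch in itertools.count():
            for batch in make_epoch_batches(epoch):
                yield epoch, batch

    capped = itertools.islice(stream(), total_steps)
    for step, (epoch, batch) in enumerate(capped):
        yield step, epoch, batch
```

With 3 batches per epoch on one rank and 2 on another, total_steps=6 gives epochs 0, 0, 0, 1, 1, 1 on the first and 0, 0, 1, 1, 2, 2 on the second, so step k on every rank always has a partner to synchronize with.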
