Best practice for uneven dataset sizes with DistributedDataParallel

I have seen people use option 1 (described below).

People reporting this issue usually hit it because their applications do not know, prior to training, how many batches each process will take. It seems that in your case, you know deterministically which processes will take one more batch? If so, I think we can do better. For example:

option 1. randomly skip one batch in each of the processes that take one more input batch
option 2. use no_sync on the first batch in each of the processes that take one more input batch (see the sketch below). no_sync won't lead to parameter disparities; it just accumulates the gradient locally in param.grad. As long as you don't run optimizer.step() in that iteration, it should be fine. The next forward-backward pass outside the no_sync context will accumulate more gradient into param.grad, and the optimizer will consume them together.
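Here is a minimal sketch of option 2. It assumes a DistributedDataParallel-wrapped model, an optimizer, a per-rank dataloader, a loss_fn, and a hypothetical flag rank_has_extra_batch indicating that this rank will see one more batch than the others; those names are illustrative, not from your setup.

```python
def train_one_epoch(model, optimizer, dataloader, loss_fn, rank_has_extra_batch):
    it = iter(dataloader)

    if rank_has_extra_batch:
        # Run the extra batch under no_sync so its gradients are accumulated
        # locally in param.grad without triggering an allreduce.
        inputs, targets = next(it)
        with model.no_sync():
            loss = loss_fn(model(inputs), targets)
            loss.backward()
        # Do NOT call optimizer.step() here; the locally accumulated grads
        # will be consumed together with the next batch's grads.

    for inputs, targets in it:
        # Normal DDP iteration: backward() synchronizes gradients across
        # ranks and adds to whatever is already sitting in param.grad.
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The effect is the same as gradient accumulation: the extra batch's gradient stays in param.grad, the next backward pass adds the synchronized gradient on top of it, and a single optimizer.step() consumes both.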