I was wondering whether virtual batch sizing (as explained, for example, in "Opacus · Train PyTorch models with Differential Privacy") is supposed to work with multi-GPU training. For me, virtual batch sizing works well (even when the model is wrapped in DifferentiallyPrivateDistributedDataParallel) on a single GPU, but it fails with multiple GPUs. The first couple of iterations run fine, but at some point the DistributedDPOptimizer returns True for one rank while returning False for the other. The code then gets stuck because the first optimizer takes a "real step" while the second doesn't.
EDIT: After further investigation, the problem seems to be the following. Due to Poisson sampling, the BatchSplittingSampler may split a batch into n physical batches, where n is not divisible by the number of GPUs used.
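To illustrate what I mean, here is a minimal sketch (plain Python, no Opacus; the dataset size, sample rate, and physical batch cap are made-up numbers, and `n_physical_batches` is just a stand-in for the ceil-division split the BatchSplittingSampler performs) showing how Poisson sampling can give each rank a different logical batch size, and therefore a different number of physical batches:

```python
import math
import random

MAX_PHYSICAL = 16  # hypothetical max physical batch size

def n_physical_batches(logical_batch_size, max_physical=MAX_PHYSICAL):
    # Stand-in for the BatchSplittingSampler split: ceil division of the
    # logical (Poisson-sampled) batch into physical chunks.
    return math.ceil(logical_batch_size / max_physical)

random.seed(0)
dataset_size, sample_rate = 1000, 0.064  # made-up values

# Under Poisson sampling each example is included independently with
# probability sample_rate, so each rank draws its own random batch size.
rank0_size = sum(random.random() < sample_rate for _ in range(dataset_size))
rank1_size = sum(random.random() < sample_rate for _ in range(dataset_size))

print("rank 0:", rank0_size, "examples ->", n_physical_batches(rank0_size), "physical batches")
print("rank 1:", rank1_size, "examples ->", n_physical_batches(rank1_size), "physical batches")
```

When the two counts differ, one rank still has physical batches left while the other already wants to take the "real step", which matches the hang I'm seeing.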