Fake Distributed on 1 GPU
I have big samples, so I can't fit a big batch size in memory. I virtually increase the batch size by calling optimizer.step() only every N batches (gradient accumulation). However, that of course doesn't help BatchNorm, whose statistics are still computed per (small) batch and suffer for it. There is only so much I can do by tuning the BatchNorm momentum… I would like to simulate a distributed setup on a single GPU and sync the BN layers across multiple fake-parallel batches, the way torch.nn.SyncBatchNorm does across real processes.
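For context, here is a minimal sketch of the accumulation loop I mean (the model, sizes, and data are toy placeholders, not my real setup). The gradients end up equivalent to a batch of N × micro_bs, but each BatchNorm forward still only sees micro_bs samples:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model with a BatchNorm layer -- stands in for the real network.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

N = 4          # accumulate gradients over N micro-batches
micro_bs = 2   # the small per-step batch size that actually fits in memory

data = torch.randn(32, 8)
target = torch.randn(32, 1)

optimizer.zero_grad()
for i in range(0, data.size(0), micro_bs):
    x, y = data[i:i + micro_bs], target[i:i + micro_bs]
    # Scale the loss so the accumulated gradient matches one big batch.
    loss = loss_fn(model(x), y) / N
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (i // micro_bs + 1) % N == 0:
        optimizer.step()       # one "big-batch" update every N micro-batches
        optimizer.zero_grad()

# The weight update behaves like batch size N * micro_bs, but every
# BatchNorm forward normalized over (and updated running stats from)
# only micro_bs samples.
```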
Is that possible?