I am currently trying to perform gradient accumulation in a DistributedDataParallel setting to simulate large batch sizes on a small number of GPUs. As far as I understand, this should be possible by skipping the AllReduce operation on the intermediate steps.
Horovod offers the backward_passes_per_step option for such cases, and I read that the same should be possible in PyTorch DDP with the no_sync interface.
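My understanding of the documented no_sync usage is roughly the following (a minimal sketch reusing the names from my code below; micro_batches is just a placeholder for the small batches that together make up one large batch):

# Rough sketch of how I understand no_sync is meant to be used;
# `micro_batches` is a placeholder list of (images, target) pairs.
with model.no_sync():
    for images, target in micro_batches[:-1]:
        loss = criterion(model(images), target)
        loss.backward()              # gradients accumulate locally, no AllReduce
images, target = micro_batches[-1]
loss = criterion(model(images), target)
loss.backward()                      # this backward pass triggers the AllReduce
optimizer.step()
optimizer.zero_grad()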
My original batch size is 64, and I am trying to “simulate” a batch size that is four times as large (256) by accumulating gradients over four steps.
This is a snippet of my current code:
for i, (images, target) in enumerate(train_loader):
    images = images.to(device)
    target = target.to(device)

    if (i % 4 == 0):
        # perform regular AllReduce operation every four steps
        output = model(images)
        loss = criterion(output, target)
        train_loss += loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    else:
        # accumulate gradients using the no_sync interface
        with model.no_sync():
            output = model(images)
            loss = criterion(output, target)
            train_loss += loss
            # skipping the optimizer call here
            loss.backward()
However, when I compare the training loss of this “simulated” large-batch run (with gradient accumulation) against a run with an actual batch size of 256, the gradient accumulation run ends up with a lower training loss.
I suspect there are hidden synchronization steps in my code that occur despite the use of the no_sync interface (and therefore give me a better training loss). Is that the case here, and how could it be fixed?
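For what it's worth, the kind of check I had in mind to see whether gradients really stay rank-local during the no_sync steps is something like this (a sketch with a hypothetical helper; it assumes the process group is already initialized and is called right after a backward() inside the no_sync branch):

import torch
import torch.distributed as dist

def grads_identical_across_ranks(model):
    # Hypothetical helper: flatten the local gradients and gather them from all
    # ranks; if an AllReduce has already run, every rank holds identical values.
    local = torch.cat([p.grad.detach().flatten()
                       for p in model.parameters() if p.grad is not None])
    gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    return all(torch.equal(gathered[0], g) for g in gathered[1:])

With different data on each rank, I would expect this to return False during the accumulation steps and True only after the synchronized step.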