My application uses AMP, DDP, and gradient accumulation. I'm trying to use `no_sync` to optimize the training loop. However, when I use `no_sync`, it causes an extra ~2GB of VRAM usage and ultimately an OOM after the first optimizer step. Without `no_sync` it runs fine (albeit inefficiently). Here is a code snippet:
```python
# with nullcontext():
with self.unet.no_sync() if not is_last_device_step else nullcontext():
    # Forward pass
    loss = self.get_model_pred(batch)
    loss = loss / self.gradient_accumulation_steps
    # Backward pass
    self.scaler.scale(loss).backward()
```
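For context, here is a minimal, self-contained sketch of the pattern I'm following (a single-process `gloo` group on CPU so it runs anywhere; `GradScaler(enabled=False)` keeps the AMP call pattern as a no-op; the model, step counts, and `is_last_micro_step` naming are placeholders, not my real code):

```python
import os
from contextlib import nullcontext

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "distributed" setup so the sketch runs on CPU (gloo backend).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# enabled=False makes the scaler a pass-through, so this also runs without CUDA.
scaler = torch.cuda.amp.GradScaler(enabled=False)

accum_steps = 4
for step in range(8):
    is_last_micro_step = (step + 1) % accum_steps == 0
    # Skip DDP's gradient all-reduce on all but the last micro-step.
    ctx = nullcontext() if is_last_micro_step else model.no_sync()
    with ctx:
        loss = model(torch.randn(2, 4)).pow(2).mean() / accum_steps
        scaler.scale(loss).backward()
    if is_last_micro_step:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)

dist.destroy_process_group()
```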
When I just use `nullcontext`, memory usage before the optimizer step is 18186 MB as reported by `nvidia-smi`. When I use `no_sync()`, memory usage is 20262 MB.
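To narrow down where the extra memory goes, I can also measure from inside PyTorch rather than via `nvidia-smi` (which counts the CUDA context plus the caching allocator's reserved-but-free pool). A small helper sketch, guarded so it degrades to zero without a GPU:

```python
import torch

def allocated_mb() -> float:
    """MB actually held by live tensors (excludes the caching allocator's
    reserved-but-free pool that nvidia-smi also counts)."""
    if not torch.cuda.is_available():
        return 0.0
    torch.cuda.synchronize()
    return torch.cuda.memory_allocated() / 2**20

def reserved_mb() -> float:
    """MB reserved by the caching allocator (closer to the nvidia-smi figure)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_reserved() / 2**20

print(f"allocated: {allocated_mb():.0f} MB, reserved: {reserved_mb():.0f} MB")
```

Calling these right before the optimizer step in both configurations should show whether the ~2GB difference is live tensors (e.g. an extra gradient copy) or just allocator reserve.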
Based on my understanding of `no_sync`, the only difference should be that it disables the gradient all-reduce across ranks. So I don't see why it would increase memory use. Is it keeping a second copy of the gradients for some reason?
For testing purposes I'm currently running distributed across only a single GPU.
PyTorch 2.0.1, NVIDIA RTX 3090, driver 525.105.17, CUDA 12.0.