My application uses AMP, DDP, and gradient accumulation, and I'm trying to use `no_sync` to optimize the training loop. However, wrapping the backward pass in `no_sync` adds roughly 2 GB of VRAM usage and ultimately OOMs after the first optimizer step. Without `no_sync` it runs fine (albeit inefficiently). Here is a code snippet:
```python
# with nullcontext():
with self.unet.no_sync() if not is_last_device_step else nullcontext():
    # Forward pass
    loss = self.get_model_pred(batch)
    loss = loss / self.gradient_accumulation_steps
    # Backward pass
    self.scaler.scale(loss).backward()
```
When I just use `nullcontext`, memory usage before the optimizer step is 18186 MB as reported by `nvidia-smi`. When I use `no_sync()`, it is 20262 MB.
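(I'm aware `nvidia-smi` also counts the CUDA context and the caching allocator's free blocks; the gap is the same when I read the allocator-level numbers with a small helper like this. The helper name is my own; it returns 0 on a CPU-only build so it runs anywhere.)

```python
import torch

def report_allocated() -> int:
    # Bytes currently held by live tensors in PyTorch's caching allocator.
    # (nvidia-smi additionally counts the CUDA context and cached-but-free blocks.)
    if not torch.cuda.is_available():
        return 0  # CPU-only build: nothing to report
    torch.cuda.synchronize()  # make sure pending kernels have allocated
    return torch.cuda.memory_allocated()

print(f"allocated: {report_allocated() / 2**20:.1f} MiB")
```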
Based on my understanding of `no_sync`, the only difference should be that the gradient all-reduce is disabled, so I don't see why it would increase memory use. Is it keeping a second copy of the gradients for some reason?
For testing purposes I’m currently distributed across only a single GPU.
Environment: PyTorch 2.0.1, NVIDIA RTX 3090, driver 525.105.17, CUDA 12.0.
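In case it helps, here is a minimal standalone sketch of the loop structure I described. The model, data, and the `is_last_accum_step` flag are placeholders, and it uses gloo/CPU with a disabled `GradScaler` so it runs anywhere; my real run uses NCCL/CUDA with AMP enabled.

```python
import os
import torch
import torch.distributed as dist
from contextlib import nullcontext
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "distributed" setup for illustration (gloo works on CPU).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 1))  # placeholder for self.unet
opt = torch.optim.SGD(model.parameters(), lr=0.1)
# Scaler disabled so this sketch runs on CPU; enabled under CUDA AMP in the real run.
scaler = torch.cuda.amp.GradScaler(enabled=False)
accum_steps = 4

for step in range(accum_steps):
    is_last_accum_step = (step + 1) % accum_steps == 0
    # Skip the gradient all-reduce on all but the last accumulation step.
    ctx = nullcontext() if is_last_accum_step else model.no_sync()
    with ctx:
        loss = model(torch.randn(2, 8)).mean() / accum_steps  # placeholder forward
        scaler.scale(loss).backward()
    if is_last_accum_step:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)

dist.destroy_process_group()
```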