Sorry if this has been asked before. I spent hours searching but couldn't find the answer I want.
In my experiments, optimizer.step() (Adam, for example) can take a lot of time, sometimes as much as forward/backward. So it seems reasonable to me to parallelize the backward pass and the optimizer update. Is that doable with one GPU in PyTorch (no DDP or FSDP)? Would I get a speedup from the parallelism, or save memory because gradients can be released early? I came across this post before: Unified Authoring of Overlapped Optimizers, and per parameter optimizer settings · Issue #616 · pytorch/torchrec · GitHub. It seems that DDP/FSDP does that automatically. Are there any blog posts about how much gain we get from using it in DDP/FSDP?
Based on the linked issue, it seems the feature landed in TorchRec, not DDP. For your standard model you could try the foreach implementation of the optimizer (if available), as it would fuse many of the small update kernels into one.
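As a rough sketch of what that looks like: the built-in optimizers such as `torch.optim.Adam` accept a `foreach` argument. The toy model, sizes, and learning rate below are placeholders I made up for illustration, not anything from your setup. As far as I know, recent PyTorch versions already pick the foreach path on CUDA when `foreach` is left unspecified, so setting it explicitly mostly matters on older versions.

```python
import torch
from torch import nn

# Hypothetical toy model; substitute your own.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).to(device)

# foreach=True asks Adam to update all parameters with batched
# torch._foreach_* ops instead of a Python loop over parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

x = torch.randn(16, 64, device=device)
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

This keeps a single optimizer over all parameters, so the update can still be batched across tensors.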
Yes, by doing the optimizer step/accumulation as you do the backward, you avoid keeping all the gradients alive at once. In regimes where the parameters dominate memory, e.g. when your batch size/sequence length is small, this could reduce memory usage.
I see, that is the tutorial I was looking for. Thank you. A follow-up question: how does it affect performance? I timed the script in Google Colab (NVIDIA T4 GPU) and saw no speedup. I even saw some performance regression in other cases (different model + different GPU). Can having multiple optimizers hurt performance? If performance is my main concern, are there ways to speed up training by parallelizing backward and optimizer updates in PyTorch?
I’d say that this is mostly a memory-reducing technique. One way it hurts performance is that you no longer get the horizontal fusion from batching the optimizer operations. This would be mitigated if we were able to fuse the optimizer into the backward itself, but that may not be possible today.