Hello everyone!
I am studying distributed training with PyTorch, specifically Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), and I have two questions about how things work under the hood.
-
From the PyTorch documentation of DDP, it is clear that gradients are reduced with the "average" operation, i.e. the all-reduced gradients are divided by the world size. Does the same hold for FSDP? In other words, which reduction operation is passed to the reduce-scatter collective? Since I have not seen it mentioned anywhere, I guess it should be the same.
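To make the first question concrete, here is a minimal sketch of the kind of call I am asking about. This is not taken from the FSDP source; the `op=dist.ReduceOp.AVG` argument is just my guess of what FSDP might effectively be doing (equivalently, SUM followed by division by the world size):

```python
import torch
import torch.distributed as dist

# Assumes the default process group is already initialized (e.g. via torchrun)
# and that flat_grads.numel() is divisible by the world size.
def shard_and_reduce_grads(flat_grads: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grads.numel() // world_size,
                        dtype=flat_grads.dtype, device=flat_grads.device)
    # Is AVG (or SUM plus a division by world size) what FSDP actually uses here?
    dist.reduce_scatter_tensor(shard, flat_grads, op=dist.ReduceOp.AVG)
    return shard
```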
-
My other question, relevant to both settings, is whether the model.no_sync() context manager is necessary when accumulating gradients to simulate a larger batch size. As far as I understand (and after doing some math), it matters for efficiency and training speed, but it does not change the final gradients, so an implementation without it should also be correct, just slower. Is that reasoning right? A minimal sketch of the pattern I mean is below.
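This is the gradient-accumulation pattern I have in mind, adapted from the no_sync example in the DDP docs. The loss function and batch unpacking are just placeholders, and I show the DDP variant for concreteness (I assume the FSDP version would look the same with FSDP's own no_sync()):

```python
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulation_step(ddp_model: DDP, optimizer, batches):
    """One optimizer step over len(batches) micro-batches (placeholder loss)."""
    accum_steps = len(batches)
    optimizer.zero_grad()
    # Skip the gradient all-reduce for all but the last micro-batch ...
    with ddp_model.no_sync():
        for inp, target in batches[:-1]:
            loss = F.cross_entropy(ddp_model(inp), target)
            (loss / accum_steps).backward()  # grads accumulate locally on each rank
    # ... and synchronize only on the last backward pass.
    inp, target = batches[-1]
    loss = F.cross_entropy(ddp_model(inp), target)
    (loss / accum_steps).backward()  # triggers the all-reduce once
    optimizer.step()
```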
I would much appreciate any help with these; thank you in advance!
Forgot to mention: I am using PyTorch 2.1.0.