Hello everyone!
I am studying distributed training with PyTorch, specifically Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), and I have two questions about how things work under the hood.
-
From the PyTorch documentation of DDP, it is clear that gradients are reduced with the "average" operation, i.e. the all-reduced gradients are divided by the world size. Does the same hold for FSDP? In other words, which reduction operation is passed to the reduce-scatter collective? Since I have not seen it mentioned anywhere, I guess it should be the same.
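To make the first question concrete, here is a minimal sketch of the kind of call I am asking about. This is not taken from the FSDP source; the `op=dist.ReduceOp.AVG` argument is just my guess of what FSDP might effectively be doing (equivalently, SUM followed by division by the world size):

```python
import torch
import torch.distributed as dist

# Assumes the default process group is already initialized (e.g. via torchrun)
# and that flat_grads.numel() is divisible by the world size.
def shard_and_reduce_grads(flat_grads: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    shard = torch.empty(flat_grads.numel() // world_size,
                        dtype=flat_grads.dtype, device=flat_grads.device)
    # Is AVG (or SUM plus a division by world size) what FSDP actually uses here?
    dist.reduce_scatter_tensor(shard, flat_grads, op=dist.ReduceOp.AVG)
    return shard
```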
-
My other question, relevant to both settings, is whether the model.no_sync() context manager is necessary when accumulating gradients to simulate a larger batch size. As far as I understand (and after doing some math), it matters for efficiency and training speed, but it does not change the final gradients, so an implementation without it should also be correct, just slower. Is that reasoning right? A minimal sketch of the pattern I mean is below.
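This is the gradient-accumulation pattern I have in mind, adapted from the no_sync example in the DDP docs. The loss function and batch unpacking are just placeholders, and I show the DDP variant for concreteness (I assume the FSDP version would look the same with FSDP's own no_sync()):

```python
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulation_step(ddp_model: DDP, optimizer, batches):
    """One optimizer step over len(batches) micro-batches (placeholder loss)."""
    accum_steps = len(batches)
    optimizer.zero_grad()
    # Skip the gradient all-reduce for all but the last micro-batch ...
    with ddp_model.no_sync():
        for inp, target in batches[:-1]:
            loss = F.cross_entropy(ddp_model(inp), target)
            (loss / accum_steps).backward()  # grads accumulate locally on each rank
    # ... and synchronize only on the last backward pass.
    inp, target = batches[-1]
    loss = F.cross_entropy(ddp_model(inp), target)
    (loss / accum_steps).backward()  # triggers the all-reduce once
    optimizer.step()
```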
I would much appreciate any help with these; thank you in advance!
Forgot to mention: I am using PyTorch 2.1.0.