The FSDP wrapper seems to use the regular torch.distributed NCCL primitives, but distributed autograd requires RPC primitives instead. Is it possible to use these two in conjunction? Is there any example of this?
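For concreteness, this is roughly the mismatch I mean. A minimal sketch, not working code; `rank`, `world_size`, `MyModel`, and the input tensors are assumed to be defined elsewhere:

```python
import torch
import torch.distributed as dist
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# FSDP setup: a regular NCCL process group; backward is ordinary local autograd
# driven by NCCL collectives under the hood.
dist.init_process_group("nccl", rank=rank, world_size=world_size)
model = FSDP(MyModel().cuda())
model(inputs).sum().backward()

# Distributed autograd setup: requires the RPC framework and its own backward
# entry point instead of Tensor.backward().
rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
with dist_autograd.context() as context_id:
    out = rpc.rpc_sync(f"worker{(rank + 1) % world_size}", torch.matmul, args=(a, b))
    dist_autograd.backward(context_id, [out.sum()])
```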
FSDP doesn’t support RPC primitives. Out of curiosity, in what scenario do you need distributed autograd with FSDP?
It’s for a custom sharding strategy I am implementing as part of a research project, more or less a ring-attention alternative.
The current approach for implementing ring-attention algorithms is to use send/recv together with customized forward/backward functions to perform the necessary ring computations. One example is torch/distributed/tensor/experimental/_attention.py in the pytorch/pytorch repo: https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/experimental/_attention.py
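In case it helps, here is a rough sketch of that pattern (not taken from that file; the name `RingExchange` and the helper `_shift` are just for illustration, and a process group that supports P2P ops, e.g. NCCL with CUDA tensors, is assumed). A custom `autograd.Function` shifts a block one step around the ring in forward and routes the gradient back the other way in backward:

```python
import torch
import torch.distributed as dist


class RingExchange(torch.autograd.Function):
    """Shift a tensor one step around the ring: rank r sends its block to
    rank (r + 1) % world_size and receives the block from rank r - 1.
    Backward routes the gradient in the opposite direction."""

    @staticmethod
    def _shift(tensor, src, dst):
        recv_buf = torch.empty_like(tensor)
        ops = [
            dist.P2POp(dist.isend, tensor.contiguous(), dst),
            dist.P2POp(dist.irecv, recv_buf, src),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()
        return recv_buf

    @staticmethod
    def forward(ctx, block):
        rank, world = dist.get_rank(), dist.get_world_size()
        ctx.rank, ctx.world = rank, world
        # Receive the previous rank's block, pass our own block to the next rank.
        return RingExchange._shift(block, (rank - 1) % world, (rank + 1) % world)

    @staticmethod
    def backward(ctx, grad_output):
        rank, world = ctx.rank, ctx.world
        # Our forward output was rank r - 1's input block, so its gradient is sent
        # back to rank r - 1; the gradient for our own block arrives from rank r + 1.
        return RingExchange._shift(grad_output, (rank + 1) % world, (rank - 1) % world)
```

A ring-attention style step is then a loop that repeatedly applies `RingExchange` to the K/V block and accumulates the partial attention output; autograd replays the shifts in reverse during backward, so no RPC or distributed autograd is involved.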
Thanks for the reference! I’ll probably end up doing it like this then; implementing the gradient calculations manually with NCCL seems like less work than reimplementing FSDP on top of RPC.