Is there a way to use the FSDP system without using data parallelism? The tutorial here (Getting Started with Fully Sharded Data Parallel(FSDP) — PyTorch Tutorials 2.2.1+cu121 documentation) seems to use both. In my use case, I want the benefit of splitting large gradients across multiple GPUs during backprop, but I don't want to spawn multiple training processes. Is this possible?
From my understanding it should be possible (I haven't tried), but then you would be using the additional GPUs (all GPUs but one) just as storage, because your gradients don't fit in memory, right? Am I wrong?
Yes, that is the idea: the gradients are too large, and I want more storage for them by combining GPUs.
The FSDP implementation assumes that each process uses its own GPU, so if you are not spawning multiple processes, then FSDP probably will not work.
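For context, the usual FSDP workflow launches one process per GPU and wraps the model on each rank. This is only a minimal sketch of that pattern (placeholder model, hard-coded MASTER_ADDR/MASTER_PORT, made-up tensor sizes), not your exact setup:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def worker(rank: int, world_size: int):
    # One process per GPU: FSDP shards parameters and gradients across these ranks.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 1024).cuda(rank)  # placeholder model
    model = FSDP(model)

    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(8, 1024, device=rank)
    loss = model(x).sum()
    loss.backward()  # gradients are reduce-scattered; each rank keeps only its shard
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because each rank drives its own GPU inside its own process, dropping the multi-process launch would likely leave FSDP with nothing to shard across.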
Were you able to get more clarity on how to do this?
Not really. I ended up abandoning the FSDP approach and resolving the memory issues by handling my data differently.
Yeah. My use case is also very similar and FSDP seems very confusing to get into. What did you end up doing?
I was using very large graphs with PyTorch Geometric, but for my use case I was able to split the graphs in half, which helped. We also found better performance with smaller models, which reduced the computational burden further. This also meant we were no longer trying the larger models that were causing problems, since we concluded the smaller models were doing better.
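For anyone who lands here later, the splitting was conceptually along these lines. This is only a rough sketch with a toy graph and a made-up half/half node split using torch_geometric.utils.subgraph, not our actual pipeline:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

# Toy stand-in for one large graph (the real graphs were much bigger).
num_nodes = 10
x = torch.randn(num_nodes, 16)
edge_index = torch.randint(0, num_nodes, (2, 40))
data = Data(x=x, edge_index=edge_index)

# Split the node set in half and keep only the edges inside each half.
half = num_nodes // 2
splits = []
for subset in (torch.arange(0, half), torch.arange(half, num_nodes)):
    sub_edge_index, _ = subgraph(subset, data.edge_index,
                                 relabel_nodes=True, num_nodes=num_nodes)
    splits.append(Data(x=data.x[subset], edge_index=sub_edge_index))

# Each half now fits in memory and can be processed separately,
# at the cost of losing the edges that crossed the cut.
for part in splits:
    print(part)
```

Whether dropping the cross-cut edges is acceptable obviously depends on the task; it worked out for us.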
Great. Thank you so much for your help.