Is there a way to use the FSDP system without using data parallelism? The tutorial here (Getting Started with Fully Sharded Data Parallel(FSDP) — PyTorch Tutorials 2.2.1+cu121 documentation) seems to use both. In my use case, I want the benefit of splitting large gradients across multiple GPUs during backprop, but I don't want to spawn multiple training processes. Is this possible?
From my understanding it should be possible (I haven't tried), but then you would be using the additional GPUs (all GPUs but one) just as storage, because your gradients don't fit in memory, right? Am I wrong?
Yes, that is the idea: the gradients are too large, and I want more storage for them by combining GPUs.
The FSDP implementation assumes that each process uses its own GPU, so if you are not spawning multiple processes, then FSDP probably will not work.
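For context, the usual FSDP workflow launches one process per GPU and wraps the model on each rank. This is only a minimal sketch of that pattern (placeholder model, hard-coded MASTER_ADDR/MASTER_PORT, made-up tensor sizes), not your exact setup:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def worker(rank: int, world_size: int):
    # One process per GPU: FSDP shards parameters and gradients across these ranks.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 1024).cuda(rank)  # placeholder model
    model = FSDP(model)

    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(8, 1024, device=rank)
    loss = model(x).sum()
    loss.backward()  # gradients are reduce-scattered; each rank keeps only its shard
    optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Because each rank drives its own GPU inside its own process, dropping the multi-process launch would likely leave FSDP with nothing to shard across.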
Were you able to get more clarity on how to do this?
Not really. I ended up abandoning the FSDP approach and resolving the memory issues by handling my data differently.
Yeah. My use case is also very similar and FSDP seems very confusing to get into. What did you end up doing?
I was using very large graphs with PyTorch Geometric, but for my use case I was able to split the graphs in half, which helped. We also found better performance with smaller models, which reduced the computational burden further. This also meant we were no longer trying the larger models that were causing problems, since we concluded the smaller models were doing better.
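For anyone who lands here later, the splitting was conceptually along these lines. This is only a rough sketch with a toy graph and a made-up half/half node split using torch_geometric.utils.subgraph, not our actual pipeline:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

# Toy stand-in for one large graph (the real graphs were much bigger).
num_nodes = 10
x = torch.randn(num_nodes, 16)
edge_index = torch.randint(0, num_nodes, (2, 40))
data = Data(x=x, edge_index=edge_index)

# Split the node set in half and keep only the edges inside each half.
half = num_nodes // 2
splits = []
for subset in (torch.arange(0, half), torch.arange(half, num_nodes)):
    sub_edge_index, _ = subgraph(subset, data.edge_index,
                                 relabel_nodes=True, num_nodes=num_nodes)
    splits.append(Data(x=data.x[subset], edge_index=sub_edge_index))

# Each half now fits in memory and can be processed separately,
# at the cost of losing the edges that crossed the cut.
for part in splits:
    print(part)
```

Whether dropping the cross-cut edges is acceptable obviously depends on the task; it worked out for us.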
Great. Thank you so much for your help.