FSDP hybrid sharding on multiple nodes

Hello everyone,

does it work in principle to run FSDP training with the hybrid_shard policy in a multi-node setup where the nodes have different numbers of GPUs?

For example, I have two nodes: the first node has 8 GPUs and the second has 4 GPUs.

Thanks for any kind of help!

How would the sharding work in this case?

Node 1 (8 ranks) and node 2 (4 ranks) form the replicas, and the model is sharded within each node (imbalanced)? Or would you have 3 groups of 4 ranks each (balanced)?

In theory, I think both would work, but I don’t know whether there is a way to express this with device_mesh (Getting Started with DeviceMesh — PyTorch Tutorials 2.6.0+cu124 documentation). You may need to create the process groups yourself to express this.
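To make the "balanced" option concrete, here is a minimal sketch of how the rank lists for the two process-group dimensions could be computed for the 8+4 GPU setup (12 ranks total, 3 shard groups of 4). The function name `balanced_layout` is illustrative, not a real FSDP API; the rank lists it produces are what you would pass to `torch.distributed.new_group(ranks=...)` when building the groups manually:

```python
# Sketch: rank layout for the "balanced" option (3 shard groups of
# 4 ranks) on a 12-rank job (8 GPUs on node 1, ranks 0-7, plus
# 4 GPUs on node 2, ranks 8-11). Hypothetical helper, not FSDP API.

def balanced_layout(world_size: int, shard_group_size: int):
    """Return (shard_groups, replicate_groups) as lists of rank lists.

    shard_groups:     ranks that together hold one sharded model replica
                      (all_gather of parameters happens within these)
    replicate_groups: ranks that hold the *same* shard across replicas
                      (gradient all_reduce happens within these)
    """
    assert world_size % shard_group_size == 0
    n_replicas = world_size // shard_group_size
    shard_groups = [
        list(range(r * shard_group_size, (r + 1) * shard_group_size))
        for r in range(n_replicas)
    ]
    replicate_groups = [
        [r * shard_group_size + i for r in range(n_replicas)]
        for i in range(shard_group_size)
    ]
    return shard_groups, replicate_groups

shard_groups, replicate_groups = balanced_layout(12, 4)
print(shard_groups)      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(replicate_groups)  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

Note that with this layout each shard group stays within a single node (ranks 0–3 and 4–7 on node 1, ranks 8–11 on node 2), so the bandwidth-heavy all_gather stays intra-node, which is the usual motivation for hybrid sharding.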

Thanks for your reply.

My idea was to shard the model within the 8 GPUs on the first node and within the 4 GPUs on the second node, i.e. the imbalanced case. I have my reasons for that, but they are not relevant here.

I was just interested in whether the theory behind the FSDP implementation can handle this in principle, or whether the tensor shards must all have the same size. I had a look at the FSDP hybrid shard code, and the communication there uses all_reduce/all_gather. As far as I have understood it, this communication happens only between corresponding shards. If my shards have different sizes (because of the different numbers of GPUs), this communication could break. But I’m not sure, that’s why I asked here…