Node 1 (8 ranks) and node 2 (4 ranks) are replicated and the model is sharded within each (imbalanced)? Or would you have 3 groups of 4 ranks (balanced)?
My idea was to shard the model within the 8 GPUs on the first node and within the 4 GPUs on the second node, so it is the unbalanced case. I have some reasons for that, but they are not relevant here.
I was just interested in whether, in principle, the theory behind the FSDP implementation can handle this, or whether the shard tensors all need to have the same size. I had a look at the FSDP hybrid-shard code, and the communication there is done with all_reduce/all_gather. But as far as I have understood it, this communication only happens between the corresponding shards. And if my shards have different sizes (because of the different numbers of GPUs), this communication could blow up. But I'm not sure; that's why I asked here…
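To make the concern concrete, here is a minimal sketch (plain Python, not the actual FSDP code) under the assumption that FSDP pads the flat parameter to a multiple of the shard-group size and splits it into equal chunks; the parameter size below is made up just for illustration:

```python
# Hypothetical flat-parameter size, only for illustration.
numel = 1_000_003

def shard_numel(numel: int, group_size: int) -> int:
    """Per-rank shard size after padding numel to a multiple of group_size."""
    padded = ((numel + group_size - 1) // group_size) * group_size
    return padded // group_size

print(shard_numel(numel, 8))  # 125001 elements per rank on node 1 (8-way shard)
print(shard_numel(numel, 4))  # 250001 elements per rank on node 2 (4-way shard)

# In hybrid sharding the gradient all_reduce runs across the shard groups,
# so a rank on node 1 would have to reduce with a rank on node 2. Here the
# two sides hold differently sized (and differently sliced) shards, so the
# inputs to that collective would not line up.
```

That mismatch is what I meant by the communication "blowing up", but I may be misreading how the hybrid-shard groups are formed.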