Hi all,
I ran into unexpected behavior when using FSDP with `ShardingStrategy.HYBRID_SHARD` and a device mesh of shape `(n, 1)`, where `n` is the replicate dimension and `1` is the (trivial) shard dimension. As expected, FSDP falls back to `NO_SHARD` in this case, since sharding over a size-1 dimension is a no-op:
> `UserWarning: FSDP is switching to use NO_SHARD instead of ShardingStrategy.HYBRID_SHARD since the world size is 1.`
However, the fallback appears to build its process group with a world size of 1, so gradients are no longer synchronized across the replicated ranks, and model parameters diverge between GPUs.
To verify this, I ran a test on a single node with 2 GPUs and used a replicate dimension of 2. After one training step, I compared model parameters and optimizer states across ranks:
- With DDP, and with FSDP + `NO_SHARD`: parameters were consistent across ranks.
- With FSDP + `HYBRID_SHARD` (which falls back to `NO_SHARD`): parameters diverged.
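For reference, the comparison itself was straightforward. Below is a minimal sketch of the kind of helper I used (names are my own, not from any library): each rank flattens its parameters into one vector, the vectors are gathered (e.g. with `dist.all_gather`), and the max element-wise difference is checked. The `__main__` section just simulates divergence locally on CPU with two copies of a small model.

```python
import torch
import torch.nn as nn

def flatten_params(model: nn.Module) -> torch.Tensor:
    """Concatenate all of a model's parameters into a single 1-D CPU tensor."""
    with torch.no_grad():
        return torch.cat([p.detach().reshape(-1).cpu() for p in model.parameters()])

def max_param_diff(vec_a: torch.Tensor, vec_b: torch.Tensor) -> float:
    """Largest absolute element-wise difference between two parameter vectors."""
    return (vec_a - vec_b).abs().max().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    m1 = nn.Linear(4, 4)
    m2 = nn.Linear(4, 4)
    m2.load_state_dict(m1.state_dict())  # start from identical copies
    # Identical models: difference is exactly zero.
    print(max_param_diff(flatten_params(m1), flatten_params(m2)))
    with torch.no_grad():
        m2.weight.add_(0.1)  # simulate an unsynchronized gradient step
    # Diverged models: difference is now nonzero.
    print(max_param_diff(flatten_params(m1), flatten_params(m2)) > 0)
```

In the actual multi-GPU run, rank 0 gathered the flattened vectors from both ranks and applied the same check; parameters matched under `NO_SHARD` but not under the `HYBRID_SHARD` fallback.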
It seems like the fallback isn't preserving the DDP-like behavior as expected. I'm not sure whether this is a bug or intentional, but we were hoping to use a single FSDP script to test different mesh configurations (including `(n, 1)`), and this issue breaks that workflow.
Would appreciate any clarification on whether this is expected. Thanks!