Hi all,
I ran into unexpected behavior when using FSDP with `ShardingStrategy.HYBRID_SHARD` and a device mesh of shape `(n, 1)`, where `n` is the replicate dimension and `1` is the (trivial) shard dimension. As expected, FSDP falls back to `NO_SHARD` in this case, since sharding over a size-1 dimension is a no-op:
> `UserWarning: FSDP is switching to use NO_SHARD instead of ShardingStrategy.HYBRID_SHARD since the world size is 1.`
However, the fallback appears to build its process group with a world size of 1, so gradients are no longer synchronized across the replicated ranks, and model parameters diverge between GPUs.
To verify this, I ran a test on a single node with 2 GPUs and used a replicate dimension of 2. After one training step, I compared model parameters and optimizer states across ranks:
- With DDP, and with FSDP + `NO_SHARD`: parameters were consistent across ranks.
- With FSDP + `HYBRID_SHARD` (which falls back to `NO_SHARD`): parameters diverged.
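For reference, the comparison itself was straightforward. Below is a minimal sketch of the kind of helper I used (names are my own, not from any library): each rank flattens its parameters into one vector, the vectors are gathered (e.g. with `dist.all_gather`), and the max element-wise difference is checked. The `__main__` section just simulates divergence locally on CPU with two copies of a small model.

```python
import torch
import torch.nn as nn

def flatten_params(model: nn.Module) -> torch.Tensor:
    """Concatenate all of a model's parameters into a single 1-D CPU tensor."""
    with torch.no_grad():
        return torch.cat([p.detach().reshape(-1).cpu() for p in model.parameters()])

def max_param_diff(vec_a: torch.Tensor, vec_b: torch.Tensor) -> float:
    """Largest absolute element-wise difference between two parameter vectors."""
    return (vec_a - vec_b).abs().max().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    m1 = nn.Linear(4, 4)
    m2 = nn.Linear(4, 4)
    m2.load_state_dict(m1.state_dict())  # start from identical copies
    # Identical models: difference is exactly zero.
    print(max_param_diff(flatten_params(m1), flatten_params(m2)))
    with torch.no_grad():
        m2.weight.add_(0.1)  # simulate an unsynchronized gradient step
    # Diverged models: difference is now nonzero.
    print(max_param_diff(flatten_params(m1), flatten_params(m2)) > 0)
```

In the actual multi-GPU run, rank 0 gathered the flattened vectors from both ranks and applied the same check; parameters matched under `NO_SHARD` but not under the `HYBRID_SHARD` fallback.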
It seems like the fallback isn't preserving the DDP-like behavior as expected. I'm not sure whether this is a bug or intentional, but we were hoping to use a single FSDP script to test different mesh configurations (including `(n, 1)`), and this issue breaks that workflow.
Would appreciate any clarification on whether this is expected. Thanks!