Difference between DDP and FSDP with NO_SHARD

Does anyone know the difference between DistributedDataParallel and FSDP with

ShardingStrategy.NO_SHARD

They look pretty similar. Are they the same in terms of memory usage and training speed?

I think these 2 are the same. cc @agu @weifengpy

They are the same high-level algorithm: vanilla data parallelism. However, their implementation differs.

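DistributedDataParallel (DDP) uses a C++ reducer to bucket gradients for all-reduce. FullyShardedDataParallel (FSDP) with NO_SHARD instead buckets gradients according to the module wrapping: the parameters of each FSDP-wrapped module form one group whose gradients are all-reduced together.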
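To make the bucketing difference concrete, here is a toy single-process sketch in plain Python (no PyTorch involved). It is not how either library is implemented. All function names are hypothetical, the size cap is only loosely analogous to DDP's bucket_cap_mb, and "one bucket per name prefix" is a stand-in for FSDP's module wrapping. It only shows that the two bucketing schemes group gradients differently yet produce the same averaged gradients.

```python
# Toy illustration only: simulates ranks with plain lists, no real communication.
# All helper names below are hypothetical and not part of PyTorch.

def all_reduce_mean(flat_per_rank):
    """Average one flattened bucket element-wise across all simulated ranks."""
    world_size = len(flat_per_rank)
    return [sum(vals) / world_size for vals in zip(*flat_per_rank)]

def bucket_by_size(names, sizes, cap):
    """DDP-like (simplified): pack parameters into buckets of at most `cap` elements."""
    buckets, current, used = [], [], 0
    for name in names:
        if current and used + sizes[name] > cap:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += sizes[name]
    if current:
        buckets.append(current)
    return buckets

def bucket_by_module(names):
    """NO_SHARD-like (simplified): one bucket per wrapped module (name prefix)."""
    groups = {}
    for name in names:
        groups.setdefault(name.split(".")[0], []).append(name)
    return list(groups.values())

def reduce_with_buckets(buckets, grads_per_rank):
    """Flatten each bucket on every rank, average it, and split the result back."""
    result = {}
    for bucket in buckets:
        flat_per_rank = [
            [g for name in bucket for g in rank_grads[name]]
            for rank_grads in grads_per_rank
        ]
        reduced = all_reduce_mean(flat_per_rank)
        offset = 0
        for name in bucket:
            n = len(grads_per_rank[0][name])
            result[name] = reduced[offset:offset + n]
            offset += n
    return result

# Two simulated ranks holding local gradients for the same three parameters.
grads_per_rank = [
    {"layer0.weight": [1.0, 2.0], "layer0.bias": [3.0], "layer1.weight": [4.0, 5.0, 6.0]},
    {"layer0.weight": [3.0, 4.0], "layer0.bias": [5.0], "layer1.weight": [6.0, 7.0, 8.0]},
]
names = list(grads_per_rank[0])
sizes = {name: len(grad) for name, grad in grads_per_rank[0].items()}

ddp_like_buckets = bucket_by_size(names, sizes, cap=2)
fsdp_like_buckets = bucket_by_module(names)

ddp_like_result = reduce_with_buckets(ddp_like_buckets, grads_per_rank)
fsdp_like_result = reduce_with_buckets(fsdp_like_buckets, grads_per_rank)
```

With these inputs the two schemes produce different bucket layouts, but ddp_like_result and fsdp_like_result are identical. That matches the point above: the algorithm is the same, and only the grouping of gradients into all-reduce calls differs, which can affect how well communication overlaps with the backward pass.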

DDP is more mature and generally the preferred solution over NO_SHARD.


Since they are the same algorithm, is there a plan to replace DDP with the FSDP implementation?