Does anyone know the difference between DistributedDataParallel and fsdp.ShardingStrategy = NO_SHARD?
They look pretty similar. Are they the same in terms of memory usage and training speed?
I think these 2 are the same. cc @agu @weifengpy
They are the same high-level algorithm: vanilla data parallelism. However, their implementation differs.
DistributedDataParallel (DDP) uses a C++ reducer to bucket gradients for all-reduce. FullyShardedDataParallel with NO_SHARD follows the module wrapping to bucket gradients for all-reduce.
DDP is more mature and generally the preferred solution over NO_SHARD.
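To make the bucketing difference concrete, here is a rough pure-Python sketch. It is a toy model only, not the actual DDP reducer or FSDP code: the parameter list, the sizes, the cap value, and the helper names buckets_by_size and buckets_by_module are all made up for illustration, and the size-cap rule is a simplification of what DDP's bucket_cap_mb does.

```python
from collections import OrderedDict

# Hypothetical toy model: (wrapped_module, parameter_name, size_in_elements).
PARAMS = [
    ("block0", "block0.weight", 400),
    ("block0", "block0.bias", 20),
    ("block1", "block1.weight", 400),
    ("block1", "block1.bias", 20),
    ("head", "head.weight", 100),
]


def buckets_by_size(params, cap):
    """DDP-like toy rule: walk parameters in reverse order and start a
    new bucket whenever adding the next gradient would exceed the cap."""
    buckets, current, used = [], [], 0
    for _, name, size in reversed(params):
        if current and used + size > cap:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


def buckets_by_module(params):
    """NO_SHARD-like toy rule: one all-reduce group per wrapped module,
    regardless of how large or small that module is."""
    groups = OrderedDict()
    for module, name, _ in params:
        groups.setdefault(module, []).append(name)
    return list(groups.values())


size_buckets = buckets_by_size(PARAMS, cap=500)
module_buckets = buckets_by_module(PARAMS)

print("size-capped buckets:", size_buckets)
print("per-module buckets: ", module_buckets)
```

In this toy setup the size-capped buckets cut across module boundaries (block1.bias ends up grouped with head.weight), while the per-module buckets follow the wrapping exactly. So with NO_SHARD the all-reduce sizes depend on how you wrap the model, whereas with DDP they depend mainly on the bucket size cap.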
Since they are the same algorithm, is there a plan to replace DDP with the FSDP one?