Difference between DDP and FSDP.NO_SHARD

bigtree · September 18, 2024, 5:49am

Does anyone know the difference between DistributedDataParallel and FullyShardedDataParallel with NO_SHARD?

They look pretty similar. Are they the same in terms of memory usage and training speed?

XWu · September 20, 2024, 5:12pm

I think these 2 are the same. cc @agu @weifengpy

agu · September 23, 2024, 1:01am

They are the same high-level algorithm: vanilla data parallelism. However, their implementation differs.

DistributedDataParallel (DDP) uses a C++ reducer to bucket gradients for all-reduce. FullyShardedDataParallel with NO_SHARD follows the module wrapping to bucket gradients for all-reduce.
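To make the bucketing difference concrete, here is a toy sketch in plain Python. It is not the actual DDP reducer or FSDP code: the parameter names, byte sizes, the 50-byte cap, and both helper functions are hypothetical and chosen only for illustration. It assumes a simplified model in which DDP-style bucketing fills size-capped buckets in reverse registration order, while wrapping-style bucketing makes one group per wrapped module, with leftover parameters going to the root wrapper.

```python
from typing import List, Tuple

# Hypothetical toy model: (parameter name, size in bytes), in registration order.
PARAMS: List[Tuple[str, int]] = [
    ("embed.weight", 40),
    ("block1.linear.weight", 30),
    ("block1.linear.bias", 5),
    ("block2.linear.weight", 30),
    ("block2.linear.bias", 5),
    ("head.weight", 20),
]


def ddp_style_buckets(params: List[Tuple[str, int]], cap_bytes: int) -> List[List[str]]:
    """Size-capped bucketing (simplified): walk parameters in reverse
    registration order and start a new bucket when the cap would be exceeded."""
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in reversed(params):
        if current and used + size > cap_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


def wrap_style_buckets(params: List[Tuple[str, int]], wrapped: List[str]) -> List[List[str]]:
    """Wrapping-based bucketing (simplified): one group per wrapped module;
    parameters outside any wrapped module fall to the root wrapper."""
    groups = {prefix: [] for prefix in wrapped}
    root: List[str] = []
    for name, _ in params:
        for prefix in wrapped:
            if name.startswith(prefix + "."):
                groups[prefix].append(name)
                break
        else:
            root.append(name)
    buckets = [groups[p] for p in wrapped if groups[p]]
    if root:
        buckets.append(root)
    return buckets


if __name__ == "__main__":
    for bucket in ddp_style_buckets(PARAMS, cap_bytes=50):
        print("size-capped bucket:", bucket)
    for bucket in wrap_style_buckets(PARAMS, ["block1", "block2"]):
        print("wrapping bucket:   ", bucket)
```

With these made-up sizes, the size-capped scheme produces four buckets and the wrapping scheme three. Both cover exactly the same parameters, so the all-reduced gradients are the same; only the grouping of the communication calls differs.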

DDP is more mature and generally the preferred solution over NO_SHARD.

bigtree · September 23, 2024, 4:33pm

Since they are the same algorithm, is there a plan to replace DDP with the FSDP implementation?