Implementation Mismatch between DeepSpeed and PyTorch

I notice that PyTorch has added a zero redundancy optimizer since 1.8. However, it seems the ZeRO optimizer in PyTorch partitions optimizer states differently from DeepSpeed. DeepSpeed partitions each parameter in the param group into equal shards and then allocates them to different ranks, while PyTorch allocates each whole parameter to a different rank. In other words, DeepSpeed partitions at a finer granularity.
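To make the difference concrete, here is a minimal sketch (not library code) contrasting the two granularities. The parameter sizes are made up, and the round-robin assignment is a simplified stand-in for the greedy size-based heuristic that ZeroRedundancyOptimizer actually uses.

```python
import torch

world_size = 4
params = [torch.randn(10), torch.randn(6), torch.randn(3)]  # toy parameters

# DeepSpeed-style: flatten the whole param group, then split the flat buffer
# into equal-sized chunks, so a single parameter can span multiple ranks.
flat = torch.cat([p.flatten() for p in params])
chunks = flat.chunk(world_size)
print([c.numel() for c in chunks])  # each rank owns roughly numel/world_size elements

# PyTorch ZeroRedundancyOptimizer-style: assign whole parameters to ranks,
# so each optimizer state shard is aligned to parameter boundaries.
per_rank = {r: [] for r in range(world_size)}
for i, p in enumerate(params):
    per_rank[i % world_size].append(p)  # simplified assignment for illustration
print({r: [p.numel() for p in ps] for r, ps in per_rank.items()})
```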

I am wondering whether this is intentional and whether anyone has verified the memory performance.

Hey Frank, yep, the difference is intentional. ZeroRedundancyOptimizer in PyTorch is designed to be used in conjunction with DDP. Since DDP already holds the full model replica, it is more memory efficient for ZeroRedundancyOptimizer to use those parameters directly as broadcast buffers instead of creating new intra-parameter shards.
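For reference, a minimal sketch of that pairing, following the pattern from the PyTorch docs; the model shape, learning rate, gloo backend, and world size are illustrative assumptions.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(2000, 2000)
    ddp_model = DDP(model)  # DDP keeps a full parameter replica on every rank

    # ZeroRedundancyOptimizer shards only the optimizer states across ranks;
    # the DDP-held parameters serve as the broadcast buffers after the step.
    optimizer = ZeroRedundancyOptimizer(
        ddp_model.parameters(),
        optimizer_class=torch.optim.Adam,
        lr=1e-3,
    )

    loss = ddp_model(torch.randn(20, 2000)).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```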

I see. Are there any plans to support ZeRO stage 2 and 3, as well as ZeRO offloading, in the future?