I notice that PyTorch has added a zero redundancy optimizer since 1.8. However, it seems the zero optimizer in PyTorch partitions optimizer states differently from DeepSpeed. DeepSpeed splits the parameters in each param group into equal shards and allocates the shards to different ranks, while PyTorch assigns each whole parameter to a single rank. In other words, DeepSpeed partitions at a finer granularity.
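To make the difference concrete, here is a toy sketch of the two schemes as I understand them (the parameter sizes and helper functions are made up for illustration, not actual DeepSpeed or PyTorch code):

```python
# Hypothetical parameter sizes (number of elements) for one param group.
param_sizes = [7, 3, 5, 9]
world_size = 2

def partition_per_parameter(sizes, world_size):
    """Per-parameter assignment (PyTorch-ZeRO-style illustration):
    each whole parameter goes to one rank, here greedily to the
    least-loaded rank."""
    loads = [0] * world_size
    assignment = [[] for _ in range(world_size)]
    for idx, size in enumerate(sizes):
        rank = loads.index(min(loads))  # least-loaded rank takes the parameter
        assignment[rank].append(idx)
        loads[rank] += size
    return assignment, loads

def partition_flat_equal(sizes, world_size):
    """Flat equal shards (DeepSpeed-ZeRO-style illustration):
    the flattened param group is cut into equal contiguous shards,
    so a single parameter can span two ranks."""
    total = sum(sizes)
    shard = (total + world_size - 1) // world_size
    boundaries = [(r * shard, min((r + 1) * shard, total)) for r in range(world_size)]
    loads = [hi - lo for lo, hi in boundaries]
    return boundaries, loads

print(partition_per_parameter(param_sizes, world_size))
# e.g. ([[0, 3], [1, 2]], [16, 8])  -> per-rank load can be imbalanced
print(partition_flat_equal(param_sizes, world_size))
# e.g. ([(0, 12), (12, 24)], [12, 12])  -> shards are always balanced
```

With the coarser per-parameter scheme, per-rank optimizer state memory depends on how well whole parameters pack across ranks, whereas equal flat shards are balanced by construction.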
I am wondering whether this is intended and whether anyone has verified the memory usage of the two approaches.