What are the benefits of limiting param_group size?

As part of DeepSpeed's mixture-of-experts implementation, expert parameters are separated into their own param groups via the function split_params_into_different_moe_groups_for_optimizer, which has a parameter max_group_size = 178956971. What is the reasoning for limiting param group size? Does this help in a distributed setting, or something else? What are the benefits?

Per the PR that introduced this splitting, Improving memory utilization of Z2+MoE by siddharth9820 · Pull Request #2079 · microsoft/DeepSpeed · GitHub, the purpose is to save memory: with ZeRO stage 2, only one of these groups is held on the GPU in full precision at a time, so capping the group size caps the peak full-precision memory needed.
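To make the idea concrete, here is a minimal sketch (not DeepSpeed's actual implementation) of greedily packing parameter element counts into groups whose total size stays under a cap, which is the general mechanism a max_group_size limit implies. The function name and the toy sizes are my own for illustration.

```python
def split_into_capped_groups(param_numels, max_group_size):
    """Greedily pack parameter element counts into groups of bounded size.

    A hypothetical sketch: each group's total element count stays at or
    below max_group_size, except that a single oversized parameter still
    gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for numel in param_numels:
        # Start a new group if adding this param would exceed the cap.
        if current and current_size + numel > max_group_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(numel)
        current_size += numel
    if current:
        groups.append(current)
    return groups

# Toy example with a cap of 100 elements:
print(split_into_capped_groups([60, 50, 30, 120, 10], 100))
# → [[60], [50, 30], [120], [10]]
```

With smaller groups, an optimizer that materializes one group in full precision at a time (as described in the PR) bounds its peak working memory by the largest group rather than by the total parameter count.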

Please take the answer with a grain of salt; the question just piqued my curiosity.

Best regards

