Calling new_group takes a long time when training a model on a large cluster (e.g. 5000+ GPUs)

Calling new_group takes a long time when training a model on a large cluster; with 5000+ GPUs it takes about 30 minutes.
Our configuration is as follows:
1. TP=8, PP=12
2. torch 2.0
3. torch.distributed.init_process_group with env:// initialization
4. Megatron's mpu.initialize_model_parallel takes almost 30 minutes, calling new_group many times (see the rough count below)
We noticed a related feature in torch 2.1; does that feature fix the new_group slowness?
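
For context, here is a back-of-the-envelope count of how many process groups that call pattern creates for this configuration. The world size of 5760 is an assumed example (5000+ GPUs, divisible by TP*PP), and the exact set of groups depends on the Megatron version, but every new_group call is collective and must be executed by all ranks, so even a second or two per call adds up to tens of minutes at this scale:

```python
# Rough count of new_group calls made during mpu.initialize_model_parallel
# for TP=8, PP=12 (a sketch; world_size=5760 is an assumed example and the
# exact group set depends on the Megatron version).
world_size = 5760                  # assumed: 5000+ GPUs, divisible by TP*PP
tp, pp = 8, 12
dp = world_size // (tp * pp)       # data-parallel size = 60

tp_groups = world_size // tp       # 720 tensor-parallel groups
pp_groups = world_size // pp       # 480 pipeline-parallel groups
dp_groups = tp * pp                # 96 data-parallel groups
mp_groups = dp                     # 60 model-parallel (TP x PP) groups

total = tp_groups + pp_groups + dp_groups + mp_groups
print(total)  # ~1356 new_group calls, each collective across all ranks
```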

Did you profile what exactly takes this amount of time?

Not yet. Profiling on a small cluster is ongoing; the large cluster is in heavy use, and we have no chance to profile there because that would require rebuilding torch. We have only timed mpu.initialize_model_parallel in Megatron as a whole, and it takes 30 minutes when training restarts.
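
In case it helps, here is a minimal sketch of how the per-call latency could be measured from user code without rebuilding torch, by wrapping torch.distributed.new_group before mpu.initialize_model_parallel runs. The wrapper below is hypothetical, not part of torch or Megatron:

```python
# Minimal sketch: time each new_group call from user code, no torch rebuild
# needed. _timed_new_group is a hypothetical wrapper, not a torch/Megatron API.
import time
import torch.distributed as dist

_orig_new_group = dist.new_group

def _timed_new_group(ranks=None, **kwargs):
    start = time.perf_counter()
    group = _orig_new_group(ranks, **kwargs)
    elapsed = time.perf_counter() - start
    if dist.get_rank() == 0:
        print(f"new_group({ranks}) took {elapsed:.3f}s")
    return group

dist.new_group = _timed_new_group
# ... then call mpu.initialize_model_parallel(...) as usual; rank 0 logs each
# call's latency, showing where the 30 minutes is spent.
```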