What is the right way to use round_robin_process_groups?

I'm trying to improve the performance of multi-GPU BERT training with torch DDP, as described in section 5.4 of the paper “PyTorch Distributed: Experiences on Accelerating Data Parallel Training”.
But I ran into an error when I set the number of process group instances > 1 on 2 servers with 16 GPUs.
I'm trying to use round_robin_process_groups instead of the default process group initialized with torch.distributed.init_process_group.
Please correct me if I'm using round_robin_process_groups the wrong way. Thank you!
Here is my code:

import torch
from torch.distributed import distributed_c10d as c10d  # assumed import for _get_default_store / ProcessGroupNCCL / PrefixStore
from torch.nn.parallel import DistributedDataParallel as DDP

if args.num_process_groups > 1:
    # Reuse the default store, giving each NCCL process group its own key prefix.
    store = c10d._get_default_store()
    rr_pg = torch.distributed._round_robin_process_groups(
        [
            c10d.ProcessGroupNCCL(c10d.PrefixStore(str(i), store), args.local_rank, args.n_gpu)
            for i in range(args.num_process_groups)
        ]
    )

model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank,
            bucket_cap_mb=25, process_group=rr_pg)



and I got the following errors:

    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 942, in main
    model, optimizer, lr_scheduler, checkpoint, global_step = prepare_model_and_optimizer(args, device)
  File "/workspace/bert/run_pretraining.py", line 770, in prepare_model_and_optimizer
    bucket_cap_mb=25, gradient_as_bucket_view= False, process_group=rr_pg)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

(the same RuntimeError / ncclSystemError is printed by the other failing ranks, so these lines appear interleaved several times in the log)

Using round_robin_process_group with NCCL is not currently recommended. Check out the warning under Distributed communication package - torch.distributed — PyTorch master documentation:

Using multiple process groups with the NCCL backend concurrently is not safe and the user should perform explicit synchronization in their application to ensure only one process group is used at a time. This means collectives from one process group should have completed execution on the device (not just enqueued since CUDA execution is async) before collectives from another process group are enqueued. See Using multiple NCCL communicators concurrently for more details.
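
For context, the explicit synchronization that warning asks for would look roughly like the sketch below; the two process-group handles (pg_a, pg_b) and the helper function are illustrative assumptions, not part of the question or of the docs:

import torch
import torch.distributed as dist

def allreduce_on_two_nccl_groups(tensor_a, tensor_b, pg_a, pg_b):
    # Enqueue the collective on the first NCCL group and wait until it has
    # finished executing on the device, not just been enqueued on the stream.
    work = dist.all_reduce(tensor_a, group=pg_a, async_op=True)
    work.wait()
    torch.cuda.synchronize()

    # Only after the first group's kernel has completed is it safe to launch
    # a collective on the second NCCL group.
    dist.all_reduce(tensor_b, group=pg_b)

If you don't actually need overlapping communicators, sticking with the single default process group created by torch.distributed.init_process_group avoids the problem entirely.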