I'm trying to improve the performance of training a BERT model on multiple GPUs with torch DDP, as described in Section 5.4 of the paper "PyTorch Distributed: Experiences on Accelerating Data Parallel Training".
However, I hit an error when I set the number of process-group instances > 1 on 2 servers with 16 GPUs.
I'm trying to use round-robin process groups instead of the default process group created by torch.distributed.init_process_group.
Please correct me if I'm using round_robin_process_groups the wrong way. Thank you!
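For reference, the single default group I'm replacing is set up the usual way. This is only a minimal sketch of my baseline, not the full script; the env:// init method is a placeholder, and args.local_rank / bucket_cap_mb=25 are the same values I use below:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Baseline: one default NCCL process group, shared by DDP for all gradient buckets.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(args.local_rank)
    model = model.cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank,
                bucket_cap_mb=25)  # no process_group argument -> the default group is used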
Here is my code:
import torch
from torch.distributed import distributed_c10d as c10d
from torch.nn.parallel import DistributedDataParallel as DDP

if args.num_process_groups > 1:
    # Reuse the store behind the default process group; give each sub-group its
    # own key prefix so their NCCL rendezvous keys don't collide.
    store = c10d._get_default_store()
    rr_pg = torch.distributed._round_robin_process_groups([
        # ProcessGroupNCCL(store, rank, world_size); I pass local_rank / n_gpu here.
        c10d.ProcessGroupNCCL(c10d.PrefixStore(str(i), store), args.local_rank, args.n_gpu)
        for i in range(args.num_process_groups)
    ])
    model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank,
                bucket_cap_mb=25, process_group=rr_pg)
and I got the following error (the tracebacks from the different worker processes are interleaved in the raw log; here is one of them):
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 942, in main
    model, optimizer, lr_scheduler, checkpoint, global_step = prepare_model_and_optimizer(args, device)
  File "/workspace/bert/run_pretraining.py", line 770, in prepare_model_and_optimizer
    bucket_cap_mb=25, gradient_as_bucket_view=False, process_group=rr_pg)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
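If it helps, this is roughly how I understand the round-robin group would be exercised outside of DDP. It is only a sketch, not my actual script; the file name, env:// init, group count of 4, and the device selection are my assumptions:

    # check_rr_pg.py -- hypothetical standalone sketch, not the real training script
    import torch
    import torch.distributed as dist
    from torch.distributed import distributed_c10d as c10d

    def build_round_robin_pg(num_groups, rank, world_size):
        # Mirrors the construction in run_pretraining.py; here I pass the global
        # rank and world size taken from the default group.
        store = c10d._get_default_store()
        return torch.distributed._round_robin_process_groups([
            c10d.ProcessGroupNCCL(c10d.PrefixStore(str(i), store), rank, world_size)
            for i in range(num_groups)
        ])

    if __name__ == "__main__":
        dist.init_process_group(backend="nccl", init_method="env://")
        rank, world_size = dist.get_rank(), dist.get_world_size()
        torch.cuda.set_device(rank % torch.cuda.device_count())  # assumes one process per GPU
        pg = build_round_robin_pg(4, rank, world_size)
        t = torch.ones(1, device="cuda")
        pg.allreduce([t]).wait()  # the round-robin wrapper dispatches this to one of the sub-groups
        print(f"rank {rank}: allreduce result {t.item()} (expected {world_size})")

Is this the intended way to construct and use the sub-groups, or should the rank/world-size arguments be something else in my DDP setup above?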