Program freezes when doing distributed training

I am trying to add distributed training to my program because my model is relatively large. The program runs fine without distributed training; however, when I add distributed training, it freezes right at model = torch.nn.parallel.DistributedDataParallel(model) without returning any error messages. Has anyone else faced this situation before? Are there any possible solutions? Thanks!

When doing the above without specifying device_ids, DDP will try to replicate the model to all visible devices in each process (unless the model is on CPU). Is this intentional? The recommended use of DDP is to let each process exclusively operate on one GPU.
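
For reference, a minimal sketch of the recommended one-process-per-GPU setup (assuming the script is launched with torchrun or torch.distributed.launch, which set the LOCAL_RANK environment variable; the setup_ddp helper name is just illustrative):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the default
    # env:// init method works without extra arguments.
    dist.init_process_group(backend="nccl")

    # Pin this process to exactly one GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)

    # Passing device_ids keeps DDP from trying to replicate the model
    # across every visible device in this process.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```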

Besides, prior to v1.7, DDP creates communication buckets whose total size is the same as the model size, so the GPU memory needs to be at least 3X the model size.
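
As a rough back-of-the-envelope check (just a sketch of the 3X rule of thumb above; actual usage also depends on activations and optimizer state):

```python
import torch

def rough_ddp_memory_gb(model: torch.nn.Module) -> float:
    # Parameters + gradients + DDP communication buckets, each roughly
    # the size of the model's parameters, hence the ~3X rule of thumb.
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return 3 * param_bytes / 1024 ** 3
```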

Thanks for your reply! I have specified the CUDA ids when training. When I trained on a single GPU, I got an error message saying CUDA requires ~6 GB more memory, so I added one more 16 GB GPU to do the training. According to your suggestion, it seems like I need at least 4 GPUs in total? Thanks!

I got it to work, never mind, thanks!

Hey @xdwang0726, do you mind sharing what the cause of the problem was and how it was resolved, in case future users hit the same issue?

Before, I was using local_rank = torch.distributed.get_rank() and the program froze. I manually set local_rank = -1, which solved the problem.
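
For anyone hitting the same freeze: torch.distributed.get_rank() returns the global rank across all processes, not the GPU index on the local node. A sketch of a common way to get the local rank (assuming launch via torchrun, which sets LOCAL_RANK; falling back to -1 for a plain, non-distributed run):

```python
import os

import torch

# LOCAL_RANK is set by torchrun / torch.distributed.launch; -1 signals
# a non-distributed run.
local_rank = int(os.environ.get("LOCAL_RANK", -1))

if local_rank >= 0:
    # Use the local rank (GPU index on this node), not the global rank,
    # to pick the device for this process.
    torch.cuda.set_device(local_rank)
```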