Program freezes when doing distributed training

I am trying to add distributed training to my program because my model is relatively large. The program runs fine without distributed training; however, when I add distributed training, it freezes right at model = torch.nn.parallel.DistributedDataParallel(model) without returning any error messages. Has anyone else faced this situation before? Are there any possible solutions? Thanks!

When doing the above without specifying device_ids, DDP will try to replicate the model to all visible devices in each process (unless the model is on CPU). Is this intentional? The recommended use of DDP is to let each process exclusively operate on one GPU.
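
For reference, a minimal sketch of the recommended one-process-per-GPU setup (assuming the script is launched with torchrun or torch.distributed.launch, which set the LOCAL_RANK environment variable; the setup_ddp helper name is just illustrative):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the default
    # env:// init method works without extra arguments.
    dist.init_process_group(backend="nccl")

    # Pin this process to exactly one GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)

    # Passing device_ids keeps DDP from trying to replicate the model
    # across every visible device in this process.
    return DDP(model, device_ids=[local_rank], output_device=local_rank)
```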

Besides, prior to v1.7, DDP creates communication buckets whose total size is the same as the model size, so the GPU memory needs to be at least 3X the model size.
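
As a rough back-of-the-envelope check (just a sketch of the 3X rule of thumb above; actual usage also depends on activations and optimizer state):

```python
import torch

def rough_ddp_memory_gb(model: torch.nn.Module) -> float:
    # Parameters + gradients + DDP communication buckets, each roughly
    # the size of the model's parameters, hence the ~3X rule of thumb.
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return 3 * param_bytes / 1024 ** 3
```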

Thanks for your reply! I have specified the CUDA ids when training. When I trained on a single GPU, I got an error message saying CUDA requires ~6 GB more memory, so I added one more 16 GB GPU to do the training. According to your suggestion, it seems like I need at least 4 GPUs in total? Thanks!

I got it to work, never mind, thanks!

Hey @xdwang0726, do you mind sharing what the cause of the problem was and how it was resolved, in case future users hit the same issue?

Before, I was using local_rank = torch.distributed.get_rank() and the program froze. I manually set local_rank = -1, which solved the problem.
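
For anyone hitting the same freeze: torch.distributed.get_rank() returns the global rank across all processes, not the GPU index on the local node. A sketch of a common way to get the local rank (assuming launch via torchrun, which sets LOCAL_RANK; falling back to -1 for a plain, non-distributed run):

```python
import os

import torch

# LOCAL_RANK is set by torchrun / torch.distributed.launch; -1 signals
# a non-distributed run.
local_rank = int(os.environ.get("LOCAL_RANK", -1))

if local_rank >= 0:
    # Use the local rank (GPU index on this node), not the global rank,
    # to pick the device for this process.
    torch.cuda.set_device(local_rank)
```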