Getting a RuntimeError: Default process group has not been initialized, please make sure to call init_process_group

Hello,

I’ve been trying to move a model from a single GPU to a machine I’ve rented with four GPUs. I wrapped the model in DistributedDataParallel and I’m getting the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-30-d1064068ed08> in <module>
      6 exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
      7 combined_model = combined_model.cuda()
----> 8 combined_model = DDP(combined_model)

~/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py in __init__(self, module, device_ids, output_device, dim, broadcast_buffers, process_group, bucket_cap_mb, find_unused_parameters, check_reduction)
    271 
    272         if process_group is None:
--> 273             self.process_group = _get_default_group()
    274         else:
    275             self.process_group = process_group

~/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_default_group()
    266     """
    267     if not is_initialized():
--> 268         raise RuntimeError("Default process group has not been initialized, "
    269                            "please make sure to call init_process_group.")
    270     return _default_pg

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Here is the code I’m using to try and parallelize the model:

import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import lr_scheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau

torch.manual_seed(101)
combined_model = Image_Embedd(embedding_size=train_categorical_embedding_sizes)
criterion = torch.nn.NLLLoss().cuda()
optimizer = torch.optim.Adam(combined_model.parameters(), lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, 'min', patience=4, verbose=True, min_lr=1e-8)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
combined_model = combined_model.cuda()
# This line raises the RuntimeError: no process group has been initialized yet.
combined_model = DDP(combined_model)

Has anyone run into this error before? I’m on Ubuntu, working in a Jupyter notebook.

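DDP needs a default process group, so you have to call torch.distributed.init_process_group before wrapping the model. You would have to set up your environment as described here.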


This is a minimal example.
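
As a rough sketch of what that setup involves (this is not the linked example itself), the snippet below initializes the default process group and only then wraps a model in DDP. It makes several assumptions that are not in the original post: a single process (world_size=1), the CPU-friendly gloo backend, a local rendezvous address and port, a hypothetical helper named init_single_process_group, and a tiny nn.Linear standing in for combined_model.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def init_single_process_group(backend="gloo", port="29500"):
    """Hypothetical helper: create the default process group for one process."""
    # Rendezvous settings read by the default env:// init method.
    # The address and port are assumptions; existing values are kept.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", port)
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=0, world_size=1)


init_single_process_group()

# Stand-in for combined_model; a small CPU model keeps the sketch self-contained.
model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)  # the default process group exists, so this no longer raises

out = ddp_model(torch.randn(3, 4))
out_shape = tuple(out.shape)
initialized = dist.is_initialized()

# Release the process group when training is finished.
dist.destroy_process_group()
```

To actually use all four GPUs, the usual pattern is one process per GPU, each calling init_process_group with its own rank, world_size=4 and the nccl backend, and wrapping the model with device_ids set to that process's GPU. Those processes are typically started from a script (for example with torch.multiprocessing.spawn or the torch.distributed launcher) rather than from a single notebook cell.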