Hello,
I’ve been trying to move a model from a single GPU to a machine I’ve rented with four GPUs. I wrapped the model in DistributedDataParallel
and I’m getting the following error:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-30-d1064068ed08> in <module>
6 exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
7 combined_model = combined_model.cuda()
----> 8 combined_model = DDP(combined_model)
~/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py in __init__(self, module, device_ids, output_device, dim, broadcast_buffers, process_group, bucket_cap_mb, find_unused_parameters, check_reduction)
271
272 if process_group is None:
--> 273 self.process_group = _get_default_group()
274 else:
275 self.process_group = process_group
~/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_default_group()
266 """
267 if not is_initialized():
--> 268 raise RuntimeError("Default process group has not been initialized, "
269 "please make sure to call init_process_group.")
270 return _default_pg
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Here is the code I’m using to try and parallelize the model:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import lr_scheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau

torch.manual_seed(101)

# Image_Embedd and train_categorical_embedding_sizes are defined earlier in the notebook
combined_model = Image_Embedd(embedding_size=train_categorical_embedding_sizes)
criterion = torch.nn.NLLLoss().cuda()
optimizer = torch.optim.Adam(combined_model.parameters(), lr=0.001)
scheduler = ReduceLROnPlateau(optimizer, 'min', patience=4, verbose=True, min_lr=1e-8)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
combined_model = combined_model.cuda()
combined_model = DDP(combined_model)
Has anyone run into this error before? I’m on an Ubuntu machine, working in a Jupyter notebook.
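From the traceback, it looks like DDP expects torch.distributed.init_process_group to have been called in every process before the model is wrapped. Based on my reading of the docs, here is a minimal sketch of what I think the single-machine, four-GPU setup is supposed to look like; the MASTER_ADDR/MASTER_PORT values and the worker function are placeholders I made up, not code I’ve verified:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; rank selects the device for this process.
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Image_Embedd and train_categorical_embedding_sizes come from my own code
    model = Image_Embedd(embedding_size=train_categorical_embedding_sizes)
    model = model.cuda(rank)
    model = DDP(model, device_ids=[rank])

    # ... training loop would go here ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4  # one process per GPU
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

Even if that is the right shape, I’m not sure how it would work from inside Jupyter, since mp.spawn needs a picklable top-level function. Does this have to run as a standalone script instead of in the notebook?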