I’ve been trying to move a model from a single GPU to a machine I’ve rented with four GPUs. I wrapped the model in DistributedDataParallel and I’m getting the following error:
RuntimeError                              Traceback (most recent call last)
<ipython-input-30-d1064068ed08> in <module>
      6 exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
      7 combined_model = combined_model.cuda()
----> 8 combined_model = DDP(combined_model)

~/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py in __init__(self, module, device_ids, output_device, dim, broadcast_buffers, process_group, bucket_cap_mb, find_unused_parameters, check_reduction)
    272         if process_group is None:
--> 273             self.process_group = _get_default_group()
    274         else:
    275             self.process_group = process_group

~/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py in _get_default_group()
    267     if not is_initialized():
--> 268         raise RuntimeError("Default process group has not been initialized, "
    269                            "please make sure to call init_process_group.")
    270     return _default_pg

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Here is the relevant part of the code I’m using to try and parallelize the model (the lines shown in the traceback above):

exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
combined_model = combined_model.cuda()
combined_model = DDP(combined_model)
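The error itself points at the fix: torch.distributed.init_process_group has to be called in each process before the model is wrapped in DDP. A minimal sketch of what that could look like (the nccl backend, the environment variables, and the per-process rank handling are assumptions about the launch setup, not part of the original post):

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Rendezvous info for the default process group; values are placeholders
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '12355')
    # This must run before DDP(...) is constructed
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

# In each of the four worker processes (e.g. spawned via torchrun):
# setup(rank, world_size=4)
# combined_model = combined_model.cuda(rank)
# combined_model = DDP(combined_model, device_ids=[rank])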
@ptrblck I have trained my model with DistributedDataParallel and have saved the model like this: torch.save(model.module.state_dict(), 'TS_Model.pth').
I want to load my model on another machine that has no GPU, but I get various errors.
1- When I load the model like this:
model = Network()
state_dict = torch.load('TS_model.pth', map_location=torch.device('cpu')).module.state_dict()

I get this error: "Default process group has not been initialized, please make sure to call init_process_group."

2- When I initialize the environment just like in the training process and then load the model, I get this error: "Distributed package doesn't have NCCL built in"
I can run this code on my machine totally fine, but I cannot load the model on another machine.
How can I overcome this issue?
The error in case 1 is raised because you are calling .module.state_dict() on the output of torch.load.
This would mean that TS_model.pth (note the lowercase m in model) contains the entire DDP model stored via torch.save(model, path), so loading it would try to reinitialize the DDP setup and fail on the other machine.
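If the checkpoint instead contains only the state_dict (as in the torch.save(model.module.state_dict(), 'TS_Model.pth') call you posted), loading it on a CPU-only machine would be a plain load_state_dict without any .module access. A minimal sketch:

import torch

model = Network()  # the plain model, not wrapped in DDP
state_dict = torch.load('TS_Model.pth', map_location=torch.device('cpu'))
model.load_state_dict(state_dict)  # no process group needed
model.eval()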
When I try state_dict = torch.load('TS_model.pth', map_location=torch.device('cpu')) I get this error: "Missing key(s) in state_dict: branch1.backbone.conv1.0.weight, branch1.backbone.conv1.1.weight" and so on to the end of the model structure (i.e. branch1.backbone.conv2, ...).
This would mean that the keys of the stored state_dict and the current state_dict of the model do not match, which is caused by a change in the model architecture (e.g. if you are re-wrapping the model into another parent model etc.).
Check which keys are mismatches and where the additional or missing attributes are coming from.
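One quick way to see where they diverge is to load with strict=False, which reports the differences instead of raising (a sketch):

incompatible = model.load_state_dict(state_dict, strict=False)
print('Missing keys:', incompatible.missing_keys)
print('Unexpected keys:', incompatible.unexpected_keys)

# Or compare the key sets directly:
print(set(model.state_dict().keys()) - set(state_dict.keys()))
print(set(state_dict.keys()) - set(model.state_dict().keys()))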
import torch
from collections import OrderedDict

state_dict = torch.load('TSmodel.pth', map_location='cpu')

new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove the `module.` prefix added by DDP
    new_state_dict[name] = v

# load params
model.load_state_dict(new_state_dict)
But when I feed an input to the model:

output = model(img[None, :], step=1)

I receive this error: NVTX functions not installed. Are you sure you have a CUDA build?
Honestly, I do not know whether these NVTX markers have been added or not. How can I remove them from the environment? (Just a reminder: I want to load my model on another machine that has no GPU.)
BTW I have torch 1.13.0.dev20220614+cpu on this machine.
I don’t see how loading the state_dict would add them, so are you able to run the plain CPU model (without loading the pretrained state_dict) on the CPU using some random input?
If not, where does this model come from? PyTorch would not automatically add GPU-specific code to your model.
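I.e. something like this smoke test (the input shape and the step argument are assumptions based on your earlier call, so adjust them to your model):

import torch

model = Network()  # freshly initialized, no checkpoint loaded
img = torch.randn(3, 224, 224)  # dummy input; shape is a placeholder
with torch.no_grad():
    output = model(img[None, :], step=1)
print(output.shape)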
Thanks for your reply! The project is a bit big, so I couldn’t extract a small snippet for you. However, you said the problem may be related to the baseline part, so I checked it and you may be right! The code related to loading the model is below.
baseline = RolloutBaseline(model, problem, opts)

def load_state_dict(self, state_dict):
    # We make it such that it works whether model was saved as data parallel or not
    load_model = copy.deepcopy(self.model)
    self._update_model(load_model, state_dict['epoch'], state_dict['dataset'])

baseline = WarmupBaseline(baseline, opts.bl_warmup_epochs, warmup_exp_beta=opts.exp_beta)
# Checkpointing within warmup stage makes no sense, only save inner baseline
So when I save the baseline as baseline.state_dict(), I save model and not model.module! Therefore, when I try to load the whole checkpoint (with the saved baseline in it), it will try to load the entire DDP-wrapped model. Is this true?
If it is, is there any way to load only the model part without the baseline part (other than adding model.module to the baseline save code, since retraining would take a long time)?
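If the checkpoint is a dict that holds both the model weights and the baseline state, one option is to pull out only the model entry and strip the DDP prefix before loading, leaving the baseline entry untouched. A sketch, where the 'model' key and the checkpoint path are assumptions about your save code:

import torch

checkpoint = torch.load('checkpoint.pt', map_location='cpu')  # path is a placeholder
model_state = checkpoint['model']  # assumed key for the model weights

# Strip the 'module.' prefix in case the weights were saved from the DDP wrapper
model_state = {k.replace('module.', '', 1): v for k, v in model_state.items()}

model.load_state_dict(model_state)
# The baseline entry in the checkpoint is simply never loaded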