When I train a speech enhancement model on a single GPU with the PyTorch framework, training is rather slow, so I stopped it before it finished. Now I want to continue training from the saved checkpoint in data-parallel mode on multiple GPUs. What should I do?
You can refer to the example code in this reply.
First, thanks for your reply. Because a model trained on a single GPU has a different state dict from one trained on multiple GPUs, the multi-GPU model can't load the state dict from the single-GPU checkpoint directly.
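For illustration, here is a minimal sketch of why the two state dicts are incompatible (using a toy `nn.Linear` as a stand-in for the actual speech enhancement network): `DataParallel` wraps the model, so every parameter key gets a `module.` prefix.

```python
import torch
import torch.nn as nn

# Toy stand-in for the speech enhancement model
model = nn.Linear(4, 2)
print(list(model.state_dict().keys()))     # ['weight', 'bias']

# Wrapping in DataParallel prefixes every key with 'module.'
dp_model = nn.DataParallel(model)
print(list(dp_model.state_dict().keys()))  # ['module.weight', 'module.bias']
```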
Hi @duo_ma, sorry, I should have explained in more detail.
`DataParallel`/`DistributedDataParallel` models register their parameters as `self.module.xxx`, while single-GPU models register them directly as `self.xxx`. What I would recommend in general is always saving the unwrapped `model.module.state_dict()`, and loading it into the model itself (single-GPU) or into `model.module` (multi-GPU) accordingly.
```python
# saving
if isinstance(model, (nn.DataParallel, nn.DistributedDataParallel)):
    torch.save(model.module.state_dict(), model_save_name)
else:
    torch.save(model.state_dict(), model_save_name)
```
```python
# loading
model = nn.DataParallel(model, **gpu_device_arg)  # multi-gpu model
if isinstance(model, (nn.DataParallel, nn.DistributedDataParallel)):
    model.module.load_state_dict(state_dict)  # your model will be loaded into the multi-gpu model
else:
    model.load_state_dict(state_dict)
```
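Putting both halves together, here is a minimal runnable round trip, again with a toy `nn.Linear` standing in for the real network (an in-memory buffer replaces the checkpoint file for the demo): saving through `.module` produces unprefixed keys that load cleanly into either a single-GPU model or the `.module` of a wrapped one.

```python
import io
import torch
import torch.nn as nn

net = nn.Linear(4, 2)          # toy stand-in for the real network
dp_net = nn.DataParallel(net)  # multi-gpu wrapper

# saving: always save the unwrapped module's state dict
buf = io.BytesIO()             # in-memory file; use a path in practice
torch.save(dp_net.module.state_dict(), buf)

# loading into a single-gpu model: keys match, no 'module.' prefix
buf.seek(0)
single = nn.Linear(4, 2)
single.load_state_dict(torch.load(buf))

# loading into a multi-gpu model: go through .module
buf.seek(0)
multi = nn.DataParallel(nn.Linear(4, 2))
multi.module.load_state_dict(torch.load(buf))
```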
In the case where you have already saved the multi-GPU model's parameters with the `module.xxx` prefix and want to load them into a single-GPU model, you should do:
```python
model = nn.DataParallel(model, **gpu_device_arg)  # make it multi-gpu
model.load_state_dict(state_dict)
model = model.module  # make it single-gpu
```
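An alternative I've seen used, if you would rather not wrap the model just to load a checkpoint, is to rename the keys directly by stripping the `module.` prefix before calling `load_state_dict`. A sketch (the `strip_module_prefix` helper name is my own):

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that DataParallel adds to parameter
    keys, so the checkpoint loads into a plain single-gpu model."""
    return OrderedDict(
        (k[len("module."):] if k.startswith("module.") else k, v)
        for k, v in state_dict.items()
    )


# demo with a toy model: the wrapped state dict has prefixed keys
dp = nn.DataParallel(nn.Linear(4, 2))
wrapped_sd = dp.state_dict()          # keys: 'module.weight', 'module.bias'

plain = nn.Linear(4, 2)
plain.load_state_dict(strip_module_prefix(wrapped_sd))
```

The same idea works in reverse (prepending `module.` to each key) when loading a single-GPU checkpoint into an already-wrapped model without going through `.module`.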