How to save a model trained on multiple GPUs (DataParallel)?

qi_chang · August 21, 2017, 5:05pm

Hi,
I trained the model using DataParallel and saved only the submodule inside the DataParallel wrapper. When I validate the model, I load it on just one GPU. But it does not seem to work: the prediction accuracy is not as good as during training, even when using the same data.
Here’s my code snippet:

....
model.load_state_dict(torch.load(args.state))
....
if args.gpu_num > 1:
  device_ids = list(range(0, args.gpu_num))
  print(f'Use multiple GPUs: {device_ids}')
  model = torch.nn.DataParallel(model, device_ids=device_ids)

After each training epoch finishes, I save the submodule from DataParallel.

saveModule = model
if args.gpu_num > 1:
   # If multiple GPUs were used for training, save only the wrapped child module.
   saveModule = list(model.children())[0]
torch.save(saveModule.state_dict(), logpath)

Thanks
Qi

dhpollack · January 13, 2018, 10:45am

See here and here

Basically, you have to wrap your model in DataParallel again temporarily, or change the key names to remove the "module." prefix.
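
For the second option, here is a minimal sketch of the key renaming. The helper name `strip_module_prefix` is made up for illustration, and the example dictionary only stands in for a real checkpoint, so this part runs without PyTorch. It assumes the checkpoint was saved from the DataParallel wrapper itself, so every key starts with "module.".

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key.

    Hypothetical helper, not part of PyTorch. Keys that do not start with
    the prefix are copied unchanged.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint saved from a DataParallel-wrapped model;
# the values would normally be tensors.
example = OrderedDict([("module.conv1.weight", 1), ("module.fc.bias", 2)])
cleaned_example = strip_module_prefix(example)
print(list(cleaned_example))
```

With a real checkpoint, the result could then be passed to the unwrapped model, along the lines of `model.load_state_dict(strip_module_prefix(torch.load(path)))`, where `path` is whatever file you saved to. If the keys already lack the prefix (for example because only the wrapped submodule was saved), the helper leaves them as they are.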

yuangan.zhou · December 11, 2018, 6:46am

Hi, I have run into the same problem. Have you solved it? My test accuracy is extremely low compared with training and validation. I think the model works incorrectly on a single GPU… I need help, thank you. @qi_chang