DataParallel optim and saving correctness

Are the two marked lines 1) and 2) equivalent when using DataParallel?

net = torch.nn.DataParallel(net).cuda()
optim = torch.optim.Adam(net.parameters(), LR)         # 1)
optim = torch.optim.Adam(net.module.parameters(), LR)  # 2)

Or does one of them create weird synchronization issues?
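A quick sanity check (just a sketch with a toy model, not my real net) seems to show both calls hand the optimizer the very same tensors, since the wrapper only holds the parameters of net.module:

import torch

net = torch.nn.Linear(10, 2)          # toy model, only for the comparison
net = torch.nn.DataParallel(net)
if torch.cuda.is_available():
    net = net.cuda()

params_1 = list(net.parameters())         # 1) through the DataParallel wrapper
params_2 = list(net.module.parameters())  # 2) through the wrapped module

# Both iterators should yield the very same tensor objects.
print(len(params_1) == len(params_2))                       # expected: True
print(all(p1 is p2 for p1, p2 in zip(params_1, params_2)))  # expected: True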


Also, when saving parameters to a file, which one is preferred with a DataParallel module?

net = torch.nn.DataParallel(net).cuda()
...
torch.save({
  'epoch': epoch,
  'args': args,
  'state_dict': net.state_dict(),           # 1) OR
  # 'state_dict': net.module.state_dict(),  # 2)
  'loss_history': loss_history,
}, model_save_filename)

In Torch, an analogue of method 2) was preferred (https://github.com/facebook/fb.resnet.torch/blob/master/checkpoints.lua#L45-L48).

#1 is preferred in both cases. Unlike (Lua)Torch, you don't need the workarounds.

I tried #1; however, I ran into this issue while loading the state_dict back:
KeyError: 'unexpected key "module.cnn.0.weight" in state_dict'
Essentially, the model is nested inside the .module attribute of DataParallel, so every key is saved with a "module." prefix.
I guess it might be better to use #2 for saving?
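For checkpoints already saved with #1, one workaround (only a sketch, reusing model_save_filename and the checkpoint layout from above) is to strip the "module." prefix before loading into the unwrapped model:

import torch

checkpoint = torch.load(model_save_filename)
saved_state = checkpoint['state_dict']

# Drop the "module." prefix that DataParallel adds to every key.
stripped_state = {
    (k[len('module.'):] if k.startswith('module.') else k): v
    for k, v in saved_state.items()
}

net.load_state_dict(stripped_state)  # here net is the bare (unwrapped) model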

Oh, I see. Yeah, probably #2 for saving via state_dict.
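To make that concrete, a minimal sketch of #2, reusing the names from the first post (net is the DataParallel-wrapped model; bare_net below is a hypothetical unwrapped model of the same architecture):

import torch

# Save the inner module's state_dict so keys carry no "module." prefix.
torch.save({
    'epoch': epoch,
    'args': args,
    'state_dict': net.module.state_dict(),
    'loss_history': loss_history,
}, model_save_filename)

# Loading back then needs no key renaming:
checkpoint = torch.load(model_save_filename)
net.module.load_state_dict(checkpoint['state_dict'])   # into the wrapped model
# bare_net.load_state_dict(checkpoint['state_dict'])   # or into an unwrapped one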
