Hi @ptrblck, thank you for your reply.
The stored model was trained with DistributedDataParallel, and I am resuming training with DataParallel. If I load the state_dict into the plain model first, I have to strip the 'module.' prefix from the state_dict keys, and only then wrap the model in DataParallel.
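Roughly what I am doing for this first approach (here `MyModel` and `"checkpoint.pth"` are just placeholders for my actual model class and checkpoint file):

```python
import torch
from torch.nn.parallel import DataParallel

# Placeholders for my actual model class and checkpoint path.
model = MyModel()
state_dict = torch.load("checkpoint.pth", map_location="cpu")

# The DDP checkpoint stores keys as 'module.<param>'; strip the prefix
# so the keys match the plain, unwrapped model.
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)

# Wrap in DataParallel only after the weights have been loaded.
model = DataParallel(model).cuda()
```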
If I wrap the model in DataParallel first and then load the state_dict, I don't get any key mismatch issues, but I want to know whether this approach is correct. Loading the state_dict after wrapping the model in DataParallel essentially means changing the weights of the already-wrapped model. Will the model be replicated across all GPUs after loading the state_dict?
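And roughly what the second approach looks like (same placeholders as above):

```python
import torch
from torch.nn.parallel import DataParallel

# Placeholders for my actual model class and checkpoint path.
model = DataParallel(MyModel()).cuda()

# DataParallel keeps the inner model under 'module.', so the DDP-style
# keys ('module.<param>') match directly and no renaming is needed.
state_dict = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(state_dict)
```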