Loading model weights that were trained on multiple GPUs using nn.DataParallel

Yangmin · February 5, 2022, 2:45pm

Hello, I have an issue with using ResNet50 model weights that were trained on multiple GPUs.
My goal is to train a Generator using only one GPU and the ResNet.

First, I saved the ResNet model weights that were trained with nn.DataParallel().
Then I loaded the weights and initialized the model as in the code below.

model = nn.DataParallel(model)
checkpoint = torch.load(fpath)
model.module.load_state_dict(checkpoint['state_dict'])

Second, I froze the ResNet model and trained only the Generator on the feature outputs of the ResNet model.
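
For reference, a minimal sketch of what freezing a feature extractor and training only a second model can look like. The `backbone` and `generator` modules, the input shapes, and the loss are hypothetical stand-ins, not the models from this thread. The sketch assumes that freezing means both disabling gradients and switching to eval mode, so that BatchNorm layers use their stored running statistics. Whether any of this relates to the problem described here is not established.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the thread's ResNet50 and Generator.
backbone = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU())
generator = nn.Linear(8, 4)

# Freeze the backbone: disable gradients, and use eval mode so BatchNorm
# reads its stored running statistics instead of updating them.
backbone.requires_grad_(False)
backbone.eval()

# Only the generator's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(generator.parameters(), lr=0.1)

x = torch.randn(16, 4)  # placeholder input batch
with torch.no_grad():
    features = backbone(x)  # fixed features, no autograd graph kept

loss = generator(features).pow(2).mean()  # placeholder loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Calling `train()` on a parent module that contains the backbone would switch it back to training mode, so in this sketch `backbone.eval()` would need to be called again afterwards.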

The problem is that the Generator is not trained properly at all!
The weird thing is that when I load the weights of a ResNet50 model that was trained on only one GPU, the Generator is trained properly! I also checked the ResNet50 model's performance after loading through torch.load(), and the performance is still the same.

I really cannot figure out what the problem is here. I checked that the model weights trained with nn.DataParallel were loaded properly.

Is it not possible to use model weights trained with nn.DataParallel?
Do I need to do something more than just loading the ResNet50 model?

What am I doing wrong? Can someone tell me?

ptrblck · February 9, 2022, 1:39am

No, it’s possible to load model parameters trained with DataParallel.
Usually, you would load the state_dict before wrapping the model into nn.DataParallel, but I don’t think that’s the issue here.
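
As a rough sketch of the "load first, wrap afterwards" order: the snippet below assumes the checkpoint was saved from the DataParallel wrapper itself, so its keys carry a `module.` prefix. The helper `strip_module_prefix` and the small `build_model` network are hypothetical stand-ins for illustration, not code from this thread.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def strip_module_prefix(state_dict):
    """Return a copy of state_dict with a leading 'module.' removed from keys."""
    prefix = "module."
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


def build_model():
    # Hypothetical stand-in for the ResNet50 discussed in the thread.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


# Simulate a checkpoint saved from a DataParallel-wrapped model:
# its keys look like "module.0.weight".
trained = nn.DataParallel(build_model())
checkpoint = {"state_dict": trained.state_dict()}

# Load into the plain model first ...
plain = build_model()
plain.load_state_dict(strip_module_prefix(checkpoint["state_dict"]))

# ... and wrap afterwards, only if several GPUs should be used.
model = nn.DataParallel(plain) if torch.cuda.device_count() > 1 else plain
```

If the checkpoint was instead saved from `model.module.state_dict()`, the keys have no prefix and the helper returns them unchanged.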

Your code snippet does not define the Generator. Is it a separate model that receives the outputs of the pretrained ResNet, which is wrapped in nn.DataParallel?
If so, I’m not sure I fully understand the question, as the issue seems to come from the Generator, not the ResNet.