If I train a model with one GPU (without nn.DataParallel), the parameter names in saved models are something like:
features.0.weight
If the model was wrapped in nn.DataParallel, the saved parameter names have a prefix:
module.features.0.weight
During inference I only use one GPU, so the model fails to load the latter checkpoint because the parameter names do not match. I am wondering why the parameter names are prepended with this prefix. Can I trim the prefix and still use the model?
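For example (a minimal sketch; the tiny network below is just a stand-in for my actual model), the key names change as soon as the model is wrapped:

```python
import torch
import torch.nn as nn

# Stand-in model with a "features" submodule, mirroring the names above.
model = nn.Sequential()
model.add_module("features", nn.Sequential(nn.Conv2d(3, 8, 3)))

print(list(model.state_dict().keys()))
# ['features.0.weight', 'features.0.bias']

wrapped = nn.DataParallel(model)
print(list(wrapped.state_dict().keys()))
# ['module.features.0.weight', 'module.features.0.bias']
```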
I'm curious why the prefix is needed. It is inconvenient when we want to resume training a single-GPU-trained model on multiple GPUs, or pass a multi-GPU-trained model to inference code that only uses one GPU.
Also, it looks like the pretrained ResNet weights don't have the 'module.' prefix. Does that mean they were trained on a single GPU?
It's needed because that's how state_dicts work: the network is traversed recursively, and each container prepends its attribute name to the key. Since nn.DataParallel stores the wrapped network in an attribute called module, every key picks up a module. prefix. But maybe it's a good idea to override that for DataParallel.
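So yes, you can trim it. A minimal sketch of loading a DataParallel-saved checkpoint into a plain model by stripping the prefix (the checkpoint path and the small stand-in network are placeholders for your own):

```python
import torch
import torch.nn as nn

# Stand-in for the unwrapped network (same architecture as the trained one).
model = nn.Sequential()
model.add_module("features", nn.Sequential(nn.Conv2d(3, 8, 3)))

# Checkpoint saved from a DataParallel-wrapped model; path is a placeholder.
state_dict = torch.load("dp_checkpoint.pth", map_location="cpu")

# Drop the leading "module." that DataParallel adds to every key.
stripped = {k[len("module."):] if k.startswith("module.") else k: v
            for k, v in state_dict.items()}

model.load_state_dict(stripped)
```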
No, they probably had the prefixes trimmed before serialization.
"torch.nn.DataParallel is a model wrapper that enables parallel GPU utilization. To save a DataParallel model generically, save the model.module.state_dict() . This way, you have the flexibility to load the model any way you want to any device you want."