How to switch model trained on 2 GPUs to 1 GPU?

daemonslayer · June 21, 2018, 7:47am

I have trained my model on 2 GPUs. Now I want to retrain it on a single GPU so I can use the other GPU for other tasks. But doing so gives me this error:

RuntimeError: Error(s) in loading state_dict for Encoder:
        Missing key(s) in state_dict: "main.0.0.weight", "main.0.0.bias", "main.0.1.alpha", "main.0.1.beta", "main.1.0.weight", "main.1.0.bias", "main.1.1.alpha", "main.1.1.beta", "main.2.0.weight", "main.2.0.bias", "main.2.1.alpha", "main.2.1.beta", "main.3.0.weight", "main.3.0.bias", "main.3.1.alpha", "main.3.1.beta", "main.4.0.weight", "main.4.0.bias", "main.4.1.alpha", "main.4.1.beta". 
        Unexpected key(s) in state_dict: "module.main.0.0.weight", "module.main.0.0.bias", "module.main.0.1.alpha", "module.main.0.1.beta", "module.main.1.0.weight", "module.main.1.0.bias", "module.main.1.1.alpha", "module.main.1.1.beta", "module.main.2.0.weight", "module.main.2.0.bias", "module.main.2.1.alpha", "module.main.2.1.beta", "module.main.3.0.weight", "module.main.3.0.bias", "module.main.3.1.alpha", "module.main.3.1.beta", "module.main.4.0.weight", "module.main.4.0.bias", "module.main.4.1.alpha", "module.main.4.1.beta". 
Makefile:2: recipe for target 'all' failed
make: *** [all] Error 1

My model is still wrapped in DataParallel but uses a single GPU; PyTorch doesn't combine the weights trained on the separate GPUs. I have tried saving the model in different ways, such as moving it to the CPU before saving, using torch.save, and moving parts of the state_dict to the CPU and then saving. None of these helped; they all gave the same error.

How do I switch my model from 2 gpus to 1 gpu?

Joe1 · July 20, 2018, 5:13am

I have met the same problem. Have you solved it?

daemonslayer · August 2, 2018, 6:13am

No. As far as I was able to find, this needs a lot of manual work to convert the variables that were divided across 2 GPUs to 1 GPU. It was too much work, so I just retrained on 1 GPU.

string111 · August 2, 2018, 6:59am

The problem here is that you saved your model while it was wrapped in torch.nn.DataParallel. As a workaround, just wrap the model in DataParallel(model) before you load the weights and start training, and specify which GPUs to use (the device_ids argument). The more elegant method would be to change the saved state_dict.
The problem is the "module." prefix on each key in your saved state dict (for example "module.main.0.0.weight"). Just remove it and save the dict again (be sure to set the map_location argument in torch.load() so that everything is mapped to a single GPU). Here is the code to remove the "module." string from the start of every key:

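from collections import OrderedDict

import torch

# map_location puts every tensor on one device, whichever GPU it was saved from
state_dict = torch.load('path/to/weights', map_location='cuda:0')

prefix = 'module.'
new_dict = OrderedDict()  # new OrderedDict with (key, value) pairs
for key, value in state_dict.items():
    # remove the leading 'module.' and leave any other key unchanged
    new_key = key[len(prefix):] if key.startswith(prefix) else key
    new_dict[new_key] = value

torch.save(new_dict, 'path/to/save')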
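
The key renaming described above can be tried without PyTorch, since a state_dict is an ordinary mapping from parameter names to tensors. The following is only a sketch: the helper name strip_prefix is hypothetical and the string values are placeholders standing in for tensors. The point it illustrates is that the prefix is removed only when a key actually starts with it, and that the slice length comes from len(prefix) rather than a hand-counted number.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    stripped = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            key = key[len(prefix):]
        stripped[key] = value
    return stripped


# Placeholder values stand in for tensors; the key names come from the error message.
saved = OrderedDict([
    ("module.main.0.0.weight", "w0"),
    ("module.main.0.0.bias", "b0"),
])
converted = strip_prefix(saved)
print(list(converted))
```

Keys that do not carry the prefix pass through unchanged, so running the helper on an already converted checkpoint does no harm.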

masoud_mahdi · September 16, 2021, 5:12pm

Yes! I solved it in the simplest way.

For DataParallel models, the state_dict keys get 'module.' added as a prefix.
So a simple way to solve this is to create a new dictionary, copy the weights of the multi-GPU model into it with the 'module.' prefix removed from the keys, and load it into the single-GPU model.

import torch

state_dict = torch.load("Your multi GPU weights path", map_location="cpu")
new_state_dict = {}
for key in state_dict:
  # strip 'module.' only where it is the leading prefix of the key
  new_key = key[len('module.'):] if key.startswith('module.') else key
  new_state_dict[new_key] = state_dict[key]

new_single_GPU_model = sampleModel() # "Your model without nn.DataParallel"
new_single_GPU_model.load_state_dict(new_state_dict)

You can use an ordinary Python dictionary to load the weights in PyTorch.

To use an OrderedDict for the weights instead, simply import it and use it as the new_state_dict dictionary:

from collections import OrderedDict
new_state_dict = OrderedDict()
# ... same steps as in the code above

I hope it helps you
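
PyTorch also ships a helper for this renaming, torch.nn.modules.utils.consume_prefix_in_state_dict_if_present (available since version 1.9, as far as I know), which edits the dictionary in place. A minimal sketch follows; the small nn.Linear model is only a stand-in for the real network, and the checkpoint is simulated rather than loaded from disk.

```python
import torch
from torch import nn
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

# Stand-in model (an assumption for illustration); use your own architecture here.
model = nn.Linear(4, 2)

# Simulate a checkpoint that was saved from a DataParallel-wrapped model.
multi_gpu_state = nn.DataParallel(nn.Linear(4, 2)).state_dict()

# Strip the leading "module." from every key, in place.
consume_prefix_in_state_dict_if_present(multi_gpu_state, "module.")

# The keys now match the plain (unwrapped) model.
model.load_state_dict(multi_gpu_state)
```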

dan.torch · September 10, 2023, 11:33pm

I have found another straightforward way to handle this issue.
Just wrap your model in DataParallel on the single GPU as well, before calling model.load_state_dict, and it works fine.
For example:

model = torch.nn.DataParallel(model, device_ids=device_ids)

Here device_ids would be a list containing one GPU, e.g. [0].
I hope this helps; no model conversion is needed.
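
A slightly fuller sketch of this approach, under the assumption of a small stand-in model and a simulated checkpoint: the wrapped model accepts the prefixed keys directly, and the underlying network stays reachable through the wrapper's module attribute if an unprefixed state_dict is wanted later.

```python
import torch
from torch import nn

# Simulated checkpoint from multi-GPU training: its keys start with "module.".
checkpoint = nn.DataParallel(nn.Linear(4, 2)).state_dict()

# Use GPU 0 if one is present; on a CPU-only machine DataParallel
# simply calls the wrapped module.
device_ids = [0] if torch.cuda.is_available() else None

model = nn.DataParallel(nn.Linear(4, 2), device_ids=device_ids)
model.load_state_dict(checkpoint)  # keys match: both sides carry the prefix

# The original network is model.module; its state_dict has no prefix,
# so saving it produces a checkpoint that a plain model can load.
plain_state = model.module.state_dict()
```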