Model trained with DataParallel cannot be loaded for inference on a single GPU

Hi folks,
I have a question regarding DataParallel. When I use it to train a model on 2 GPUs and then try to load the model on gpu:0 only for prediction, I keep getting a lot of mismatched keys from the state dictionary. When I inspect the state dictionary I see that all params of the model are on gpu:0, so I don’t understand why I keep getting these mismatched keys.

If instead I load the model for prediction with DataParallel, everything seems to work fine. My initial thought on seeing the mismatched keys was that maybe some parts of the model were on one GPU and others on the other, but that’s not the case: after loading the checkpoint and inspecting all params, everything is on gpu:0.

When you save a model wrapped in DataParallel, it adds an extra `module` level, which is reflected in the state dict keys: the first component of every key is always `module.`. You can strip it like this:

        from collections import OrderedDict

        # DataParallel prefixes every parameter name with `module.`;
        # strip the prefix so the keys match a plain, unwrapped model.
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            name = k[7:] if k.startswith('module.') else k  # remove `module.`
            new_state_dict[name] = v
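
For context, here is a minimal sketch of how the stripped state dict would be used to load the checkpoint for single-GPU inference; the model (a plain `nn.Linear` stand-in) and the checkpoint path are placeholders for your own:

        import torch
        import torch.nn as nn
        from collections import OrderedDict

        # Hypothetical stand-in for the actual architecture used during training.
        model = nn.Linear(10, 2)

        # Load the checkpoint onto gpu:0 regardless of where it was saved from.
        state_dict = torch.load('checkpoint.pth', map_location='cuda:0')  # placeholder path

        # Strip the `module.` prefix as above, then load into the unwrapped model.
        state_dict = OrderedDict(
            (k[7:] if k.startswith('module.') else k, v) for k, v in state_dict.items()
        )
        model.load_state_dict(state_dict)
        model.to('cuda:0').eval()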

Thank you, Juan! That’s refreshing to know, especially when these things are not reflected in any of the documentation.
If you don’t mind me asking, do you have any idea what the proper usage of distributed.DataParallel is?

For instance, here I explain how I’ve used it and still end up getting an error during execution.

Some information regarding the serialization of a DataParallel model is given here. Where have you been looking, and where would you like to see another mention of this snippet?
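
For anyone landing here later: assuming you have access to the training code, a common way to sidestep the prefix issue entirely is to save the state dict of the unwrapped model (`model.module`) rather than of the DataParallel wrapper. A sketch with a placeholder model and path:

        import torch
        import torch.nn as nn

        model = nn.DataParallel(nn.Linear(10, 2)).cuda()  # hypothetical wrapped model

        # `model.module` is the underlying model, so its state dict keys carry
        # no `module.` prefix and load cleanly into a single-GPU model later.
        torch.save(model.module.state_dict(), 'checkpoint.pth')  # placeholder path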

Oops, my bad, I overlooked that. I guess I didn’t scroll all the way down to the end of the page.
