How do I test my parallel-trained model on one GPU?

SteveXWu · March 17, 2022, 3:16am

I trained a customized CNN model using
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])
and saved the whole model with torch.save(model, "./ours_5.pkl").
Now I would like to load this model and test it on a single GPU: model_net = torch.load(path, map_location="cuda:3").
This raises RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3

How do I fix this?

wanchaol · March 22, 2022, 4:59am

@SteveXWu Thanks for posting.

If you save your DataParallel model and want to load it in a different environment, you need to define the device mapping with the map_location argument of torch.load. You can check the following pointers and see if they resolve your issue.

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints
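
For illustration, here is a minimal sketch in the spirit of that checkpoint section: save only the state_dict of the wrapped module, then load it into a plain, unwrapped model with map_location. The make_model helper, the tiny network, and the temporary checkpoint path are hypothetical stand-ins and not the original poster's code. The sketch also assumes it is acceptable to fall back to CPU when no GPU is visible.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Hypothetical stand-in for the customized CNN from the question.
    return nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU())


# Assumption: fall back to CPU so the sketch also runs without GPUs.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Training side: wrap the model, then save only the inner module's weights.
# Saving wrapped.module.state_dict() keeps the keys free of a "module." prefix.
wrapped = nn.DataParallel(make_model().to(device))
ckpt_path = os.path.join(tempfile.mkdtemp(), "weights.pt")
torch.save(wrapped.module.state_dict(), ckpt_path)

# Test side: remap the saved tensors to a single device and load them
# into a plain model that was never wrapped in DataParallel.
state = torch.load(ckpt_path, map_location=device)
single = make_model()
single.load_state_dict(state)
single.to(device).eval()

with torch.no_grad():
    out = single(torch.randn(2, 3, 8, 8, device=device))
```

Because the test-side model is never wrapped, the device_ids[0] check from the error message does not come into play, and map_location only decides where the saved tensors land.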

By the way, we recommend using DDP instead of DataParallel; see "Distributed Data Parallel" in the PyTorch 1.11.0 documentation.

You can also check this similar issue: "Load DDP model trained with 8 gpus on only 2 gpus?" (#12 by kazem).

rfmac · March 22, 2022, 6:39pm

I think the model has to be loaded onto the first device in the device_ids list.
If I’m right, you have to do one of the following:

Retrain your model with device #3 in the first position: model = torch.nn.DataParallel(model, device_ids=[3, 0, 1, 2])
Load the model onto device #0: model_net = torch.load(path, map_location="cuda:0")

Assuming you don’t want to retrain just to change the device, I think the best option would be to load onto device #0, unwrap the DataParallel wrapper through its .module attribute, and then move the model to device #3, with something like:
model_net = torch.load(path, map_location="cuda:0").module.to("cuda:3")

I’m not sure if it works, it’s just a guess based on what is written in the official docs, but I hope it helps you!
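
To make the idea concrete, here is a rough sketch of loading a whole pickled DataParallel model and running its inner module on a single device. The small network, the temporary file, and the choice of target device are hypothetical. The sketch assumes PyTorch 1.13 or newer, where torch.load accepts weights_only, and it falls back to CPU when no GPU is visible.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Assumption: use the last visible GPU as the single test device, else CPU.
if torch.cuda.is_available():
    source = torch.device("cuda:0")
    target = torch.device(f"cuda:{torch.cuda.device_count() - 1}")
else:
    source = target = torch.device("cpu")

# Hypothetical small network standing in for the customized CNN.
net = nn.Sequential(nn.Linear(4, 2)).to(source)
wrapped = nn.DataParallel(net)

# Save the whole wrapped model, as in the original question.
path = os.path.join(tempfile.mkdtemp(), "whole_model.pkl")
torch.save(wrapped, path)

# Unpickling a full module needs weights_only=False on recent PyTorch.
loaded = torch.load(path, map_location=target, weights_only=False)

# Unwrap: the inner module is an ordinary nn.Module with no device_ids check.
plain = loaded.module if isinstance(loaded, nn.DataParallel) else loaded
plain = plain.to(target).eval()

with torch.no_grad():
    out = plain(torch.randn(3, 4, device=target))
```

Only unpickle whole-model files you trust, since weights_only=False allows arbitrary objects to be loaded.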

Best regards,
Rafael Macedo.