I trained a customized CNN model by using
model = torch.nn.DataParallel(model, device_ids=[0, 1, 2, 3])
and saved the whole model.
Now I would like to load this model and test it on a single GPU:
model_net = torch.load(path, map_location="cuda:3")
It gives me: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3
How do I fix this?
@SteveXWu Thanks for posting.
If you save your DataParallel model and want to load it in a different environment, you need to define the device mapping properly using the
map_location option of torch.load. You can check the following pointers and see if they resolve your issue.
btw, we recommend using DDP (DistributedDataParallel) instead of DataParallel: Distributed Data Parallel — PyTorch 1.11.0 documentation
You can also check this similar issue Load DDP model trained with 8 gpus on only 2 gpus? - #12 by kazem
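To make the map_location advice concrete, here is a minimal sketch of the pattern PyTorch's serialization docs recommend: save the unwrapped state_dict (dp_model.module) rather than the whole DataParallel object, then load it onto whatever single device you like. The toy two-layer CNN and the file name "cnn.pt" are hypothetical stand-ins for your custom model; the sketch falls back to CPU when four GPUs are not available so it runs anywhere.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the custom CNN from the question.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten())

# Wrap in DataParallel as in the question; on a CPU-only machine
# DataParallel simply runs on the bare module.
dp_model = nn.DataParallel(model)

# Saving the unwrapped state_dict (dp_model.module) avoids both the
# "module." key prefix and any tie to a specific GPU layout.
torch.save(dp_model.module.state_dict(), "cnn.pt")

# Target device: "cuda:3" on the poster's machine, CPU fallback here.
device = "cuda:3" if torch.cuda.device_count() > 3 else "cpu"

# map_location remaps every stored tensor onto the target device.
model_net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten())
model_net.load_state_dict(torch.load("cnn.pt", map_location=device))
model_net.to(device).eval()
```

Because only tensors are stored, this checkpoint loads cleanly on any number of GPUs (or none), which is why the docs prefer it over pickling the whole DataParallel module.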
I think the model is expected to be loaded on the first device in the device_ids list.
If I’m right, you must do one of the following:
- Retrain your model with device #3 in the first position:
model = torch.nn.DataParallel(model, device_ids=[3, 0, 1, 2])
- Load the model on device #0:
model_net = torch.load(path, map_location="cuda:0")
Assuming you don’t want to retrain just to change the device, I think the best option would be to load on device #0 and then move the model to device #3, with something like:
model_net = torch.load(path, map_location="cuda:0").to("cuda:3")
I’m not sure if it works, it’s just a guess based on what is written in the official docs, but I hope it helps you!