I have 4x P100s and saved an initial checkpoint after wrapping the model with nn.DataParallel(model), but when I tried to load it into another model I got an error:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/gpfs/alpine/world-shared/gen011/shubhankar/summitdev/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/gpfs/alpine/world-shared/gen011/shubhankar/summitdev/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs/alpine/world-shared/gen011/shubhankar/summitdev/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
"them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
It would be useful to have either your code (if it’s simple) or a minimal example to help understand why you get this message. Just some background: if I remember correctly, the actual model parameters must be stored on the default GPU (id=0), even though the calculations are carried out in parallel across devices.
My best guess for now is this: when loading a parallelized checkpoint, you must initialize the model, including any .cuda() calls, before loading the checkpoint. It may be that you’ve done this the other way around.
A scheme like this should work:
model = SomeModel()
model = nn.DataParallel(model).cuda()   # wrap and move to GPU first
model.load_state_dict(checkpoint)       # then load the DataParallel state_dict
I actually remember having this exact same issue after following that same doc. Could you try the scheme I gave? In either case, please post the part of your script where you initialize the model.
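One more thing that often trips people up with DataParallel checkpoints (just a sketch, assuming the checkpoint was saved with torch.save(model.state_dict(), PATH) from the wrapped model; PATH and SomeModel are placeholders here): DataParallel prefixes every key with module., so if you ever need to load that checkpoint into a plain, unwrapped model, you have to strip the prefix first:

import torch

checkpoint = torch.load(PATH, map_location="cuda:0")  # PATH is a placeholder for your checkpoint file

# DataParallel saves keys as "module.layer...", so strip the prefix once per key
state_dict = {k.replace("module.", "", 1): v for k, v in checkpoint.items()}

plain_model = SomeModel().cuda()
plain_model.load_state_dict(state_dict)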
I think the order matters in this case. Did you try wrapping in nn.DataParallel before either .to(device) or .cuda(), and before loading the checkpoint?
Hey everyone, I’m facing the same issue. I trained a resnet model on cuda:0 and I want to load it on cuda:1. I have the model stored in a .t7 file, so I don’t know whether the whole model was saved or just its parameters. How can I load the model on a GPU other than the one it was trained on?
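In case it helps: assuming the .t7 file was written with torch.save and contains a state_dict (PyTorch doesn’t care about the file extension), you can remap the saved tensors to another GPU at load time with map_location. A minimal sketch; the resnet constructor and the file name are placeholders:

import torch
import torchvision

model = torchvision.models.resnet18()  # placeholder; use whatever resnet variant you trained

# map_location remaps every tensor that was saved on cuda:0 to cuda:1 while loading
state_dict = torch.load("model.t7", map_location={"cuda:0": "cuda:1"})
model.load_state_dict(state_dict)
model.to("cuda:1")

If torch.load instead returns the whole model object (i.e. the full model was saved rather than its state_dict), you can skip load_state_dict and just call .to("cuda:1") on the loaded model.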
Hi, thanks for the reply. Actually, the model was originally trained on cuda:0, and all of its parameters and buffers are stored on cuda:0. If I change the model to load on cuda:1, will it cause any error?
Hey @AlphaBetaGamma96, I changed every device to cuda:1 in every file I import, but the code still gives me the above error that the model’s parameters and buffers are on cuda:0.
Can you post the exact error message? Also, are you using an optimizer here? The optimizer might still be holding references to the parameters and buffers on the original device.
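If there is an optimizer in the picture, a sketch of moving everything consistently (assuming SGD and that the checkpoint stores the optimizer state under a key like "optimizer"; both are placeholders) would be: move the model first, load the optimizer state, then push any loaded state tensors to the new device as well:

import torch

device = torch.device("cuda:1")

model.to(device)  # move parameters and buffers first
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder optimizer/hyperparameters
optimizer.load_state_dict(checkpoint["optimizer"])        # "optimizer" key is a placeholder

# loaded optimizer state (e.g. momentum buffers) may still sit on the old device,
# so move every tensor in the optimizer state to the target device too
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)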