I have 4x P100s and saved an initial checkpoint after wrapping the model with nn.DataParallel(model), but when I tried to load it into another model I got an error:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/gpfs/alpine/world-shared/gen011/shubhankar/summitdev/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/gpfs/alpine/world-shared/gen011/shubhankar/summitdev/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/gpfs/alpine/world-shared/gen011/shubhankar/summitdev/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 146, in forward
"them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
It would be useful to have either your code (if it’s simple) or a minimal example to help understand why you get this message. Just some background: if I remember correctly, the actual model parameters must be stored on the default GPU (id=0), even though the calculations are carried out in parallel across devices.
My best guess for now is this: when loading a parallelized checkpoint, you must initialize the model, including any .cuda() calls, before loading the checkpoint. It may be that you’ve done this the other way around.
A scheme like this should work:
model = SomeModel()
model = nn.DataParallel(model).cuda()   # wrap and move to GPU first
model.load_state_dict(checkpoint)       # then load the DataParallel state_dict
I actually remember having this exact same issue after following that same doc. Could you try the scheme I gave? In either case, please post the part of your script where you initialize the model.
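One more thing that often trips people up with DataParallel checkpoints (just a sketch, assuming the checkpoint was saved with torch.save(model.state_dict(), PATH) from the wrapped model; PATH and SomeModel are placeholders here): DataParallel prefixes every key with module., so if you ever need to load that checkpoint into a plain, unwrapped model, you have to strip the prefix first:

import torch

checkpoint = torch.load(PATH, map_location="cuda:0")  # PATH is a placeholder for your checkpoint file

# DataParallel saves keys as "module.layer...", so strip the prefix once per key
state_dict = {k.replace("module.", "", 1): v for k, v in checkpoint.items()}

plain_model = SomeModel().cuda()
plain_model.load_state_dict(state_dict)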
I think the order matters in this case. Did you try wrapping in nn.DataParallel before either .to(device) or .cuda(), and before loading the checkpoint?
Hey everyone, I’m facing the same issue. I trained a resnet model on cuda:0 and I want to load it on cuda:1. I have the model stored in a .t7 file, so I don’t know whether the whole model was saved or just its parameters. How can I load the model on a GPU other than the one it was trained on?
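In case it helps: assuming the .t7 file was written with torch.save and contains a state_dict (PyTorch doesn’t care about the file extension), you can remap the saved tensors to another GPU at load time with map_location. A minimal sketch; the resnet constructor and the file name are placeholders:

import torch
import torchvision

model = torchvision.models.resnet18()  # placeholder; use whatever resnet variant you trained

# map_location remaps every tensor that was saved on cuda:0 to cuda:1 while loading
state_dict = torch.load("model.t7", map_location={"cuda:0": "cuda:1"})
model.load_state_dict(state_dict)
model.to("cuda:1")

If torch.load instead returns the whole model object (i.e. the full model was saved rather than its state_dict), you can skip load_state_dict and just call .to("cuda:1") on the loaded model.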
Hi, thanks for the reply. Actually, the model was originally trained on cuda:0, and all of its parameters and buffers are stored on cuda:0. If I change the model to load on cuda:1, will it cause any error?
Hey @AlphaBetaGamma96, I changed every device to cuda:1 in every file I import, but the code still gives me the above error that the model’s parameters and buffers are on cuda:0.
Can you post the exact error message? Also, are you using an optimizer here? The optimizer might still be holding references to the parameters and buffers on the original device.
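If there is an optimizer in the picture, a sketch of moving everything consistently (assuming SGD and that the checkpoint stores the optimizer state under a key like "optimizer"; both are placeholders) would be: move the model first, load the optimizer state, then push any loaded state tensors to the new device as well:

import torch

device = torch.device("cuda:1")

model.to(device)  # move parameters and buffers first
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # placeholder optimizer/hyperparameters
optimizer.load_state_dict(checkpoint["optimizer"])        # "optimizer" key is a placeholder

# loaded optimizer state (e.g. momentum buffers) may still sit on the old device,
# so move every tensor in the optimizer state to the target device too
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)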