Problems loading model trained on one GPU onto another GPU with .to(device)

I have a model class myModel, which I pre-train on device cuda:1 and then save to file modelFile. I don’t understand the behaviour when trying to load this model onto another device, say cuda:0. Can someone help me understand what is going on behind the scenes when one has the following:

model = myModel()
model.load_state_dict(torch.load(modelFile))
model = model.eval().to("cuda:0")

and why this gives different results to:

model = myModel().to("cuda:0")
model.load_state_dict(torch.load(modelFile))
model = model.eval()

What does it mean for a saved model to be associated with a device, and why does setting model to a different device to the saved model cause problems? Should we always load the model first and then push to a new device?

EDIT: I should also say that I find it slightly disturbing that such a small, inconspicuous difference in the code has such huge implications for the model output. Would it be possible to flag a warning when the model and the state_dict have different devices?


Hi,

Saving Tensors does save their device, as it is a property of the object.
You can use the map_location argument of torch.load to avoid any issue if the device it was saved for is not available anymore. In this case, I guess you want to use "cuda:0" as the new location.
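
For example, something like this (a minimal sketch, reusing myModel and modelFile from your post):

import torch

# Remap every tensor in the checkpoint to cuda:0 while loading,
# regardless of which device it was saved from
state_dict = torch.load(modelFile, map_location="cuda:0")

model = myModel().to("cuda:0")
model.load_state_dict(state_dict)
model = model.eval()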

Thank you for replying. I understand what you have said, but I would really appreciate understanding what’s happening in the second piece of code that I wrote.
The model is put onto cuda:0, then it loads a state dict which contains tensors mapped to cuda:1. So what actually happens in model? What values will it contain after loading the state dict? If a state dict has to be loaded into a model to be useful, what is the use of it having its own device property? And would it be feasible to have a warning when the model and state dict devices do not match?

Hi,

When you load a state dict, you load it into the current parameters of your network. So if the network is on cuda:0, the resulting net will be on cuda:0 as well.
The state dict contains Tensors and Tensors always have a device. That’s the only reason why it maintains device information.

In general, it should not be a problem to have them on different devices, as whatever you asked for with .to() on the model is what counts, not what was in the state dict.
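
As a small illustration (a minimal sketch with a single nn.Linear layer instead of your model):

import torch
import torch.nn as nn

# Save a layer whose parameters live on cuda:1
layer = nn.Linear(4, 4).to('cuda:1')
torch.save(layer.state_dict(), 'layer.pt')

# Create a fresh layer on cuda:0 and load the cuda:1 state dict into it
layer0 = nn.Linear(4, 4).to('cuda:0')
layer0.load_state_dict(torch.load('layer.pt'))

# load_state_dict copies the values into the existing parameters,
# so they keep the device of the module they were loaded into
print(layer0.weight.device)  # cuda:0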

Thank you for the reply again, but it doesn’t match what I’ve seen in practice. When I run the second piece of code in the original question and then run the model, I get results that I do not expect, implying that the code failed to move the state dict correctly to cuda:0. Is this expected behaviour, and if not, should I try to make a minimal example and flag an issue?

What do you mean by “I get results that I do not expect”? Can you be more specific about what you observe?

Apologies for being vague. I had trained an autoencoder model on cuda:1, and then ran the equivalent of the second code section above. The reconstructed image from the autoencoder was not correct; it looked as if the wrong batchnorm statistics were being applied. When I changed the code to the first section it worked fine.

I’m sorry, I don’t have a tidy minimal example of the issue yet, but if you think it sounds suspicious I would be happy to try to make one.

When I changed the code to the first section it worked fine.

This is quite surprising indeed. They should give the same result.
I would first check that the .eval() mode is set properly. It has a large impact on the results for models with batchnorm layers.
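
For example, a minimal sketch with a standalone BatchNorm1d layer (not your model) shows how large the difference between the two modes can be:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(8)
x = torch.randn(16, 8)

bn.train()
out_train = bn(x)   # normalises with the statistics of this batch

bn.eval()
out_eval = bn(x)    # normalises with the running statistics instead

print((out_train - out_eval).abs().max())  # typically far from zero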

I have made a minimal example (without batchnorms) that seems to show the problem; sorry that it’s a bit long:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

#%% define the model
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            nn.ConvTranspose2d(     100, 2 * 8, 4, 1, 0, bias=False),
            nn.ReLU(True),
            nn.ConvTranspose2d(2 * 8, 2 * 4, 4, 2, 1, bias=False),
            nn.ReLU(True),
            nn.ConvTranspose2d(2 * 4, 2 * 2, 4, 2, 1, bias=False),
            nn.ReLU(True),
            nn.ConvTranspose2d(2 * 2,     2, 4, 2, 1, bias=False),
            nn.ReLU(True),
            nn.ConvTranspose2d(    2,      1, 4, 2, 1, bias=False),
            nn.Tanh()
        )

    def forward(self, input):
        output = self.main(input)
        return output

# make a version of the model on cuda:1 and save the model to file
netG = Generator().to('cuda:1')
torch.save(netG.state_dict(), 'netG.pt')
netG = netG.eval()

# load the model twice onto cuda:0, with .to(device) both before and after loading the state dict
netG_1 = Generator().to('cuda:0')
netG_1.load_state_dict(torch.load('netG.pt'))
netG_1 = netG_1.eval()

netG_2 = Generator()
netG_2.load_state_dict(torch.load('netG.pt'))
netG_2 = netG_2.eval().to('cuda:0')


#%% apply the three versions
z = torch.randn(1,100,1,1) # generator input

with torch.no_grad():
    
    output = netG(z.to('cuda:1'))
    output1 = netG_1(z.to('cuda:0'))
    output2 = netG_2(z.to('cuda:0'))

#%% display output
plt.figure()
plt.subplot(1,3,1)
plt.imshow(torch.squeeze(output.cpu()))
plt.subplot(1,3,2)
plt.imshow(torch.squeeze(output1.cpu()))
plt.subplot(1,3,3)
plt.imshow(torch.squeeze(output2.cpu()))

This is the figure I get when I run the code:

We can see that the middle image, from the model that was sent to cuda:0 before loading the state dict (netG_1), has a different output.

I don’t have a machine with 2 GPUs readily available here.
While I find one, could you check if changing netG_2 = netG_2.eval().to('cuda:0') to netG_2 = netG_2.to('cuda:0').eval() changes anything?

I’m training models on both GPUs so can’t perform your test at the moment. I will try to do it asap and let you know the result.


The results are identical in either case (making sure that the torch seed is the same for both runs too).


Ok, I’ll get a machine and check that.


Hi,

Sorry it took a bit of time to get the machine.
I’m afraid I cannot repro with the modified script below: I don’t have a screen on the other machine, so I compare the outputs with:

print(output.min(), output.mean(), output.max())
print(output1.min(), output1.mean(), output1.max())
print(output2.min(), output2.mean(), output2.max())
print((output - output1.to("cuda:1")).abs().max())
print((output - output2.to("cuda:1")).abs().max())

And they all seem to be the same.

What do these prints return for you?

If I add those lines to my example and run, I get the following output:

tensor(-0.0064, device='cuda:1') tensor(-0.0003, device='cuda:1') tensor(0.0064, device='cuda:1')
tensor(-0.0040, device='cuda:0') tensor(0.0005, device='cuda:0') tensor(0.0056, device='cuda:0')
tensor(-0.0064, device='cuda:0') tensor(-0.0003, device='cuda:0') tensor(0.0064, device='cuda:0')
tensor(2.0009, device='cuda:1')
tensor(3.6893e+19, device='cuda:1')

So, they are not the same.

Even more curiously, I find that if I re-run the last two lines you sent without killing the kernel, I get:

tensor(3.6893e+19, device='cuda:1')
tensor(3.6893e+19, device='cuda:1')

So, for some reason, re-running gives me a different answer for (output - output1.to("cuda:1")).abs().max().

Not sure where we go from here; please do let me know what other tests I should run, or if you need more information.

I ran my test with a nightly build on conda and that worked fine.
Can you try a nightly build?

I’ll try. Just to check, because I’ve never installed a nightly build before, is this the way to do so?

conda install -c pytorch pytorch-nightly

I never know :smiley: I usually check here: https://pytorch.org/get-started/locally/
Looks like the -c is not in the right place :smiley:

I also reran the code and got:

tensor(-0.0106, device='cuda:1') tensor(7.5541e-05, device='cuda:1') tensor(0.0082, device='cuda:1')
tensor(-0.0106, device='cuda:0') tensor(7.5541e-05, device='cuda:0') tensor(0.0082, device='cuda:0')
tensor(-0.0106, device='cuda:0') tensor(7.5541e-05, device='cuda:0') tensor(0.0082, device='cuda:0')
tensor(4.6566e-10, device='cuda:1')
tensor(9.3132e-10, device='cuda:1')

The issue of rerunning the same line of code and getting a different result seems really strange.
Was this setup running correctly before, or is this a new system?

Okay, I made a new conda environment and installed the nightly build with conda install pytorch torchvision cudatoolkit=9.2 -c pytorch-nightly as suggested by https://pytorch.org/get-started/locally/.

I re-ran the script, and I get a similar result to before, insofar as the outputs do not match:

tensor(-0.0064, device='cuda:1') tensor(-0.0003, device='cuda:1') tensor(0.0064, device='cuda:1')
tensor(-0.0040, device='cuda:0') tensor(0.0005, device='cuda:0') tensor(0.0056, device='cuda:0')
tensor(-0.0064, device='cuda:0') tensor(-0.0003, device='cuda:0') tensor(0.0064, device='cuda:0')
tensor(3.0040, device='cuda:1')
tensor(3.0049, device='cuda:1')

And again, if I re-run the last two lines manually, I get a different output:

tensor(3.0058, device='cuda:1')
tensor(3.0067, device='cuda:1')

In this case though, both of these output values change with every run, rather than just the first one.

@ptrblck I have never done these kinds of checks before, so I don’t know if it was running correctly before. This is the first time I remember trying to load a model trained on one GPU onto another.

If both of your set-ups get the correct results, and mine still doesn’t even with the nightly build, what else could it be? I haven’t explicitly set cudnn to deterministic, but I assume that neither of you did either? There must be some mistake I’m making somewhere.
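
For reference, this is the kind of determinism setup I could add at the top of the repro script (a minimal sketch; I have not actually run with these flags yet):

import torch

torch.manual_seed(0)                        # seed the CPU RNG
torch.cuda.manual_seed_all(0)               # seed the RNG on every GPU
torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable cuDNN autotuning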