Error when moving GPU-trained model to CPU

I trained an LSTM model on my GPU, and it works well in both the training and testing phases. The corresponding code is below.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import autograd

class LSTM(nn.Module):

    def __init__(self, feature_dim, hidden_dim, tagset_size):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(feature_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # (h_0, c_0) for the LSTM, created directly on the GPU
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim).cuda()),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim).cuda()))

    def forward(self, vec_seq):
        vec_seq = autograd.Variable(vec_seq)
        lstm_out, self.hidden = self.lstm(
            vec_seq.view(len(vec_seq), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(vec_seq), -1))
        tag_scores = F.log_softmax(tag_space)
        return tag_scores

To make it work on CUDA, I added the following code.

model = LSTM(FEATURE_DIM, HIDDEN_DIM, len(tags)).cuda()
loss_function = nn.NLLLoss().cuda()

In every training epoch, I also moved my data to CUDA.

ts = torch.Tensor(training_data[i]).float().cuda()
tt = autograd.Variable(torch.Tensor([training_targets[i]]).view(1).cuda())

The code above runs well on my GPU. However, when I try to run the testing phase on the CPU, I get errors.
Here is my code:

prec, nsamp = 0, 0  # counters for the final accuracy
for i in range(len(test_data)):
    ts = torch.FloatTensor(test_data[i])
    tt = autograd.Variable(torch.FloatTensor([test_targets[i]]).view(1))
    tag_scores = model(ts)
    last_output = tag_scores[-1].view(1, -1)
    pred_y = torch.max(last_output, 1)[1].data.numpy()
    print('id = ', tt, 'pred_y = ', pred_y)
    if test_targets[i] == pred_y[0]:
        prec += 1
    nsamp += 1
print('precision = ', prec * 1.0 / nsamp)

I printed model.state_dict(), and the outputs are all torch.FloatTensor of size 64, so it seems the model has been moved to the CPU?

The error messages are shown in the attached image (pytorch_error).

I also tried the save() -> load() approach, as the following code shows:

torch.save(model.state_dict(), '')
torch.load('', map_location=lambda storage, loc: storage)

But it still does not work.

I have checked my code and did not find anything of type torch.cuda.FloatTensor.
Could anyone help me with some tips?

torch.cuda.FloatTensor is a normal FloatTensor, except that it lives on a GPU. The only problem is that it is a different type, so if your model is on the CPU, you cannot assign CUDA tensors to it.
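
As a quick illustration of the type clash (a hypothetical minimal example, not from the original post):

import torch
import torch.nn as nn

lin = nn.Linear(4, 2)               # weights live on the CPU
gpu_w = lin.weight.data.cuda()      # same values, but a CUDA tensor

print(lin.weight.data.type())       # torch.FloatTensor
print(gpu_w.type())                 # torch.cuda.FloatTensor
# Combining the two in one operation, e.g. lin.weight.data + gpu_w,
# raises an error because the tensor types/devices do not match.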

So you either bring the model to the GPU and load the weights there (and potentially convert the loaded model back to the CPU), or you convert the CUDA tensors in the state dict to normal tensors. Something like this should work:

# model_dict is the state dict loaded from the GPU checkpoint
cpu_model_dict = {}
for key, val in model_dict.items():
    cpu_model_dict[key] = val.cpu()
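
The converted dict can then be loaded into a CPU instance of the model as usual, assuming the class itself no longer hard-codes .cuda() (see the replies below):

model = LSTM(FEATURE_DIM, HIDDEN_DIM, len(tags))  # plain CPU model
model.load_state_dict(cpu_model_dict)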

Hi antspy, thanks for your help.

However, there is still something I am confused about.
In your code, you converted all the parameters of cuda_model.state_dict() to the CPU version. What is the difference between your operation and mine? (I called cuda_model.cpu(), then printed all the parameters of my model, and found they had all been converted to torch.FloatTensor type.)


I think your issue here is that your model class hard-codes CUDA as the location, so even though you load the weights onto the CPU, they clash with the model class (assuming you don't have a separate CPU model class not shown in your post). You need to alter the model class so that you can choose whether it runs on CUDA or on the CPU, or make a separate CPU model class if you prefer.
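
For example, a minimal sketch of such a class, where use_cuda is an assumed constructor flag that is not in the original code (forward stays as in the original post):

import torch
import torch.nn as nn
from torch import autograd

class LSTM(nn.Module):

    def __init__(self, feature_dim, hidden_dim, tagset_size, use_cuda=False):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.use_cuda = use_cuda
        self.lstm = nn.LSTM(feature_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # build (h_0, c_0) on whichever device was chosen
        h = torch.zeros(1, 1, self.hidden_dim)
        c = torch.zeros(1, 1, self.hidden_dim)
        if self.use_cuda:
            h, c = h.cuda(), c.cuda()
        return (autograd.Variable(h), autograd.Variable(c))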

The variables returned by init_hidden are not registered with the module, so they are not moved to the CPU when you call model.cpu(). PyTorch has this magic where, if you assign a Parameter to a module object, it gets automatically registered. You need to turn the variables into parameters and assign them separately onto self.
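
One way to read that suggestion, as a sketch (the attribute names hidden_h and hidden_c are mine; note that nn.Parameter also makes the initial state trainable, which may or may not be what you want):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTM(nn.Module):

    def __init__(self, feature_dim, hidden_dim, tagset_size):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        # assigned directly onto self, so they are registered with the
        # module and follow model.cpu() / model.cuda()
        self.hidden_h = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.hidden_c = nn.Parameter(torch.zeros(1, 1, hidden_dim))

    def forward(self, vec_seq):
        lstm_out, _ = self.lstm(
            vec_seq.view(len(vec_seq), 1, -1),
            (self.hidden_h, self.hidden_c))
        tag_space = self.hidden2tag(lstm_out.view(len(vec_seq), -1))
        return F.log_softmax(tag_space, dim=1)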