Loading saved models gives inconsistent results each time

I have trained multiple LSTM models on different data, and I save them as shown below.

def save_checkpoints(state, file_name):
    torch.save(state, file_name)

save_checkpoints({
    'num_epochs': epoch,
    'num_hidden': number_hidden,
    'num_cells': number_cells,
    'device': device,
    'state_dict': model.state_dict()}, <ckpt_file>)

When I load multiple models one after another with the method below, only the first one gives the expected results on the test set; the others fail. But when I load each model individually, the test gives the expected results.

    checkpoint = torch.load(<ckpt_file>)
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    seq1 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq1.load_state_dict(checkpoint['state_dict'])
    seq1.to(seq1.device)
    seq1.eval()

I don’t understand why loading multiple models one after another does not work. I am running out of ideas.

Can you show some example results, both expected and unexpected ones?

When I load and test the first and second models individually, I get the following.
First model:

test loss: 13.146434557001005
Weighted mean absolute percentage error is : 2.941648706059142

Second model:

test loss: 193.99452499602967
Weighted mean absolute percentage error is : 106.5421036013515

When I load and test the first model and then the second, the loss of the first is correct but the loss of the second is wrong:

test loss: 13.146434557001005
Weighted mean absolute percentage error is : 2.941648706059142
test loss: 603501.1757331954
Weighted mean absolute percentage error is : 99.3929476222628

I also confirmed that the weights of an individually loaded model differ from the weights I get when I load the models in succession.


I see, that’s weird. So, we need to properly debug this to see where this comes from. We need to see the outputs of the two models in different settings, but using the same input. Can you do the following steps:

  1. Create a single batch of input, any batch-size is fine.
  2. Print out the outputs of the two models when loaded individually (one at a time), but on the same batch of data that was created in step 1.
  3. Print out the outputs of the two models when they are loaded in a loop, using the exact same batch of data that was created in step 1.

Note that we want to look at the output of the model, not the loss or the accuracy. For example, if the model returns probabilities of different classes, we want to see these probabilities.
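The three steps above could be sketched roughly as follows. This is only a minimal illustration: `build_model` and `load_model` are hypothetical helpers, and the small feed-forward network is a stand-in for the real model, so swap in your own constructor and checkpoint paths.

```python
import os
import tempfile

import torch
import torch.nn as nn


def build_model():
    # Hypothetical stand-in for the real network; replace with your own constructor.
    return nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1)).double()


def load_model(path):
    # Build a fresh model and fill it with the weights stored at `path`.
    model = build_model()
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model


# Save one model (randomly initialised here) to act as the reference checkpoint.
ckpt_path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(build_model().state_dict(), ckpt_path)

# Step 1: one fixed batch, reused for every comparison.
torch.manual_seed(0)
batch = torch.randn(2, 4, dtype=torch.double)

# Steps 2 and 3: load the same checkpoint several times and compare raw outputs.
with torch.no_grad():
    outputs = [load_model(ckpt_path)(batch) for _ in range(3)]

all_match = all(torch.allclose(outputs[0], out) for out in outputs[1:])
print("outputs identical across loads:", all_match)
```

If the outputs differ between loads on the same batch, the difference is coming from the loaded weights or the model state rather than from the data pipeline or the loss computation.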


Thanks, Vahid, for replying. For reference, I am doing time series forecasting, and I use the LBFGS optimizer.

To reduce the number of variables, I tried to reproduce the problem in a simpler setting and noticed that it does not actually depend on loading different models, although it happens with different models as well. So below I load and test a single model three times.
Each load-and-test run prints the following, in this order: test loss, error, input tensor, output (forecast) tensor.

Output on console:

test loss: 3.921948823121292
Weighted mean absolute error is : 3.519775891348272
input tensor([[ 31.0200,  31.0400,  31.0500,  31.0800,  31.0900,  31.1200,
          31.1400,  31.1600,  31.1800,  31.2000,  31.2300,  31.2500,
          31.2700,  31.2900,  31.3100,  31.3300,  31.3500,  31.3700,
          31.3900,  31.5400,  31.5600,  31.6900,  31.7100,  31.7300,
          31.7600,  31.7800,  31.7900,  31.8200,  31.8400,  31.8600,
          31.8900,  31.9000,  31.9200,  31.9400,  31.9600,  31.9800,
          32.0000,  32.0300,  32.0600,  32.0900]], dtype=torch.float64, device='cuda:0')
forecast tensor([ 41.3981,  35.4583,  30.2360,  28.8500,  29.3385,  30.0793,
         30.4554,  30.4568,  30.2663,  30.0545,  29.9258,  29.9056,
         29.9869,  30.1447,  30.3500,  30.5775,  30.8080,  31.0286,
         31.2320,  31.4916,  31.6959,  31.9196,  32.0829,  32.2035,
         32.3037,  32.3846,  32.4457,  32.5049,  32.5549,  32.5969,
         32.6384,  32.6662,  32.6908,  32.7136,  32.7347,  32.7543,
         32.7726,  32.7959,  32.8216,  32.8481], dtype=torch.float64, device='cuda:0')
test loss: 575158.8622441115
Weighted mean absolute error is : 95.97287345358812
input tensor([[ 31.0200,  31.0400,  31.0500,  31.0800,  31.0900,  31.1200,
          31.1400,  31.1600,  31.1800,  31.2000,  31.2300,  31.2500,
          31.2700,  31.2900,  31.3100,  31.3300,  31.3500,  31.3700,
          31.3900,  31.5400,  31.5600,  31.6900,  31.7100,  31.7300,
          31.7600,  31.7800,  31.7900,  31.8200,  31.8400,  31.8600,
          31.8900,  31.9000,  31.9200,  31.9400,  31.9600,  31.9800,
          32.0000,  32.0300,  32.0600,  32.0900]], dtype=torch.float64, device='cuda:0')
forecast tensor([  705.8100,   972.7171,  1061.4445,  1062.1935,  1023.6047,
          972.3282,   921.9674,   878.4381,   843.3888,   816.3613,
          796.0533,   780.9891,   769.8488,   761.5599,   755.3051,
          750.4875,   746.6833,   743.5984,   741.0307,   738.8185,
          736.7482,   734.8179,   732.9316,   731.1557,   729.5226,
          728.0333,   726.6906,   725.4937,   724.4122,   723.4327,
          722.5403,   721.7137,   720.9582,   720.2620,   719.6152,
          719.0100,   718.4400,   717.8991,   717.3720,   716.8506], dtype=torch.float64, device='cuda:0')
test loss: 45029.746996055845
Weighted mean absolute error is : 87.00255914868971
input tensor([[ 31.0200,  31.0400,  31.0500,  31.0800,  31.0900,  31.1200,
          31.1400,  31.1600,  31.1800,  31.2000,  31.2300,  31.2500,
          31.2700,  31.2900,  31.3100,  31.3300,  31.3500,  31.3700,
          31.3900,  31.5400,  31.5600,  31.6900,  31.7100,  31.7300,
          31.7600,  31.7800,  31.7900,  31.8200,  31.8400,  31.8600,
          31.8900,  31.9000,  31.9200,  31.9400,  31.9600,  31.9800,
          32.0000,  32.0300,  32.0600,  32.0900]], dtype=torch.float64, device='cuda:0')
forecast tensor([ 167.3744,  209.8192,  215.1611,  211.6978,  209.1109,  209.4348,
         212.1701,  216.3190,  221.0366,  225.7625,  230.1856,  234.1582,
         237.6411,  240.6542,  243.2452,  245.4709,  247.3874,  249.0455,
         250.4888,  251.7960,  252.9559,  254.0158,  254.9618,  255.8026,
         256.5561,  257.2335,  257.8430,  258.4001,  258.9106,  259.3803,
         259.8172,  260.2196,  260.5931,  260.9413,  261.2670,  261.5722,
         261.8588,  262.1313,  262.3911,  262.6389], dtype=torch.float64, device='cuda:0')

Process finished with exit code 0

Does anybody have an idea how I can debug this further? Or is there a workaround for saving the models?

I see that the models are generating different results! So, are you using LSTM? If that’s the case, are you sure that you are properly resetting the states?

Yes, I am using LSTM. I am not sure what you mean by resetting the states. Can you please elaborate?

Sure! Initially, the hidden states are probably initialized to zero, but after you run the forward pass multiple times, the hidden states may carry over and accumulate from one pass to the next. So, if you also print out some of the hidden states of your model, that may show that they have different values each time.

So, in that case, you can reinitialize the hidden states to zero and hopefully that will fix the problem. In the tutorial there is an example for resetting the hidden states:

    def init_hidden(self):
        # Before we've done anything, we dont have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

Then you can call the init_hidden() function to reinitialize the hidden states:

# Also, we need to clear out the hidden state of the LSTM,
# detaching it from its history on the last instance.
model.hidden = model.init_hidden()
## (copied from PyTorch tutorial)
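To see why resetting matters, here is a small self-contained sketch (not taken from the tutorial) using a single `nn.LSTMCell`. The sizes and the `zero_state` helper are arbitrary choices for illustration. With a fresh zero state the same input gives the same output every time; when the state from the previous call is fed back in, the output changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=1, hidden_size=3)
x = torch.ones(1, 1)  # one sample, one feature


def zero_state():
    # Fresh (h, c) pair of zeros, shaped (batch, hidden_size).
    return torch.zeros(1, 3), torch.zeros(1, 3)


with torch.no_grad():
    # Fresh state on every call: identical input gives identical output.
    h_a, _ = cell(x, zero_state())
    h_b, _ = cell(x, zero_state())

    # State carried over from the previous call: the output changes.
    state = cell(x, zero_state())
    h_c, _ = cell(x, state)

reset_matches = torch.equal(h_a, h_b)
carried_matches = torch.allclose(h_a, h_c)
print("same output with reset state:  ", reset_matches)
print("same output with carried state:", carried_matches)
```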

I don’t think I have initialized the hidden states in between forward passes.

    optimizer = optim.LBFGS(seq.parameters(), lr=lr)
    pkle_file = result_file_path + "tr_loss"
    for epoch in range(num_epochs):
        print('EPOCH: ', epoch)
        tr_loss = None

        def closure():
            global tr_loss
            optimizer.zero_grad()
            out = seq(iput)
            l_train = criteria(out, target)
            tr_loss = l_train.item()
            print('loss:', tr_loss)
            with open(pkle_file, 'wb') as file:
                pickle.dump(tr_loss, file)
            l_train.backward()
            return l_train

        optimizer.step(closure)

Let me try that and get back to you.

Oh, I overlooked it. I am already initializing the hidden states of the model to zeros in each forward pass.

class Seq2seq(nn.Module):
    def __init__(self, num_hidden, num_cells, device=None):
        """
        Initialize the LSTM predictor
        :param num_hidden: Number of hidden units of LSTM
        :param num_cells: Number of LSTM cells in the NN, equivalent to number of layers
        """
        super(Seq2seq, self).__init__()
        self.num_cells = num_cells
        self.num_hidden = num_hidden
        self.cell_list = []
        if device is None:
            if self.num_cells > 5 and self.num_hidden > 51:
                self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
            else:
                self.device = "cpu"
        else:
            if device == "gpu":
                self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
            else:
                self.device = "cpu"
        for i in range(0, num_cells):
            if i == 0:
                self.cell_list.append((nn.LSTMCell(1, num_hidden).double()).to(self.device))
            else:
                self.cell_list.append((nn.LSTMCell(num_hidden, num_hidden).double()).to(self.device))
        self.linear = nn.Linear(num_hidden, 1)

    def forward(self, iput, future=0):
        """
        Forward pass of the classifier
        :param iput: input dataframe for training/testing
        :param future: Number of future steps to be predicted
        :return: returns outputs
        """
        list_h = []
        list_c = []
        for i in range(0, self.num_cells):
            h_t = torch.zeros(iput.size(0), self.num_hidden, dtype=torch.double).to(self.device)
            list_h += [h_t]
            c_t = torch.zeros(iput.size(0), self.num_hidden, dtype=torch.double).to(self.device)
            list_c += [c_t]
        outputs = []
        for i, iput_t in enumerate(iput.chunk(iput.size(1), dim=1)):
            for j in range(0, self.num_cells):
                if j == 0:
                    h_t, c_t = (self.cell_list[j])(iput_t, (list_h[j], list_c[j]))
                    list_h[j] = h_t
                    list_c[j] = c_t
                else:
                    h_t, c_t = (self.cell_list[j])(list_h[j - 1], (list_h[j], list_c[j]))
                    list_h[j] = h_t
                    list_c[j] = c_t
            output = self.linear(list_h[self.num_cells - 1])
            outputs += [output]
        for i in range(future):  # if we should predict the future
            for j in range(0, self.num_cells):
                if j == 0:
                    list_h[j], list_c[j] = (self.cell_list[j])(output, (list_h[j], list_c[j]))
                else:
                    list_h[j], list_c[j] = (self.cell_list[j])(list_h[j - 1], (list_h[j], list_c[j]))
            output = self.linear(list_h[self.num_cells - 1])
            outputs += [output]
        outputs = torch.stack(outputs, 1).squeeze(2)
        return outputs
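One thing that may be worth checking in the class above: the `LSTMCell` objects are kept in a plain Python list (`self.cell_list = []`). `nn.Module` only registers submodules that are assigned as attributes or held in a container such as `nn.ModuleList`, so cells stored in a plain list do not show up in `state_dict()`. If that applies here, the checkpoint would contain only the `linear` weights, and every load would leave the cells at a fresh random initialization, which would be consistent with the differing weights reported in this thread. The two toy classes below are hypothetical and only illustrate the difference in what gets saved.

```python
import torch.nn as nn


class CellsInList(nn.Module):
    """Keeps its cells in a plain Python list, like the class above."""

    def __init__(self, num_hidden, num_cells):
        super().__init__()
        self.cell_list = [
            nn.LSTMCell(1 if i == 0 else num_hidden, num_hidden)
            for i in range(num_cells)
        ]
        self.linear = nn.Linear(num_hidden, 1)


class CellsInModuleList(nn.Module):
    """Same layout, but the cells are registered through nn.ModuleList."""

    def __init__(self, num_hidden, num_cells):
        super().__init__()
        self.cell_list = nn.ModuleList([
            nn.LSTMCell(1 if i == 0 else num_hidden, num_hidden)
            for i in range(num_cells)
        ])
        self.linear = nn.Linear(num_hidden, 1)


plain_keys = list(CellsInList(4, 2).state_dict().keys())
registered_keys = list(CellsInModuleList(4, 2).state_dict().keys())

print("plain list :", plain_keys)       # only the linear layer appears
print("ModuleList :", registered_keys)  # cell weights are included as well
```

A quick way to check a real checkpoint is to print `checkpoint['state_dict'].keys()` and see whether any LSTM cell weights are listed. Registering the cells also means that calls such as `model.to(...)`, `model.double()`, and `model.parameters()` (and therefore the optimizer) would cover them.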

Yes, that looks correct. One question: in the documentation, the dimension of the input is given as (seq_len, batch, input_size), so why are you creating the input chunks based on the second dimension, iput.chunk(iput.size(1), dim=1)? Shouldn’t that be the first dimension? Unless you have always assumed that your input has a different shape.

Anyway, something really weird is happening, and it is hard to find. Can you print the first 2 elements of h_t and c_t in that for loop? We need to find out when/where the difference starts.

I took the code from https://github.com/pytorch/examples/blob/master/time_sequence_prediction/train.py and modified it for my use. The example also uses input.chunk(), and the outputs are squeezed back to the original shape anyway, so it is just a matter of convention, at least here.

Anyway, as I stated earlier, I have confirmed that right after torch.load() we already get different weights for the same model, which answers the question of when/where the difference starts. So I don’t think printing h_t and c_t at test time would help, would it?

Oh, I missed this part. So, the weights are different when you load the checkpoints individually vs. when you load them in a loop, is that right?

That is correct, and not only in a loop: if I load it three times without a loop, I get different weights each time.

I see. When you load the model three times, do you keep each one in a separate object, or do you reuse the same object name for all of them?

Different.

    checkpoint = torch.load(_ckpt_files[0])
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    if dev is None:
        dev = "cpu"
    else:
        dev = "gpu"
    seq1 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq1.load_state_dict(checkpoint['state_dict'])
    seq1.to(seq1.device)
    seq1.double()
    seq1.eval()
    _, _ = test(csv_data=current_data[0], train_size=train_size, test_size=test_size,
                data_col=_data_col_list[0], time_col=_timestamp_col_list[0], seq=seq1,
                result_file=None, show=0)
    checkpoint = torch.load(_ckpt_files[0])
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    if dev is None:
        dev = "cpu"
    else:
        dev = "gpu"
    seq2 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq2.load_state_dict(checkpoint['state_dict'])
    seq2.to(seq2.device)
    seq2.double()
    seq2.eval()
    _, _ = test(csv_data=current_data[0], train_size=train_size, test_size=test_size,
                data_col=_data_col_list[0], time_col=_timestamp_col_list[0], seq=seq2,
                result_file=None, show=0)
    checkpoint = torch.load(_ckpt_files[0])
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    if dev is None:
        dev = "cpu"
    else:
        dev = "gpu"
    seq3 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq3.load_state_dict(checkpoint['state_dict'])
    seq3.to(seq3.device)
    seq3.double()
    seq3.eval()
    _, _ = test(csv_data=current_data[0], train_size=train_size, test_size=test_size,
                data_col=_data_col_list[0], time_col=_timestamp_col_list[0], seq=seq3,
                result_file=None, show=0)

Okay. What OS and PyTorch version are you using? It might be that the RAM gets almost full after loading the first model, so there is not enough memory left for the second and third, and the second model is not loaded completely. This is just a guess, but unexpected behaviour is possible if there is not enough memory available for loading the model.
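One way to test the "not loaded completely" guess is to load the checkpoint file more than once and compare the stored tensors directly, before any model is built. The sketch below assumes the checkpoint layout shown earlier in the thread (a dict with a `'state_dict'` entry); `checkpoints_equal` is a hypothetical helper, and the demo file is a throwaway stand-in for a real checkpoint.

```python
import os
import tempfile

import torch


def checkpoints_equal(path_a, path_b):
    """Return True if both files store the same tensor names and values."""
    a = torch.load(path_a, map_location="cpu")["state_dict"]
    b = torch.load(path_b, map_location="cpu")["state_dict"]
    if a.keys() != b.keys():
        return False
    return all(torch.equal(a[k], b[k]) for k in a)


# Demo with a throwaway checkpoint; point the paths at real files instead.
path = os.path.join(tempfile.mkdtemp(), "ckpt.pt")
weights = {"linear.weight": torch.arange(6, dtype=torch.double).reshape(2, 3)}
torch.save({"num_epochs": 1, "state_dict": weights}, path)

same = checkpoints_equal(path, path)
print("repeated loads give identical tensors:", same)
```

If repeated loads of the same file give identical tensors, the file and `torch.load()` are probably fine, and the difference is more likely to arise in how the model receives (or does not receive) those weights.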

Okay. I can try making a model with different weights and save the models. Then, I will load them similar to what you have done and see if I can reproduce the issue. But I am using PyTorch 1.0.0 on a Linux operating system.

Thanks a million for your time on this. I also opened an issue on GitHub under pytorch/examples yesterday: https://github.com/pytorch/examples/issues/505