Loading saved models gives inconsistent results each time

This looks like ordinary software-debugging work; there is no evidence so far that it's a PyTorch fault.

checkpoint = torch.load(_ckpt_files[0])
seq1.load_state_dict(checkpoint['state_dict'])
checkpoint = torch.load(_ckpt_files[0])
seq2.load_state_dict(checkpoint['state_dict'])
checkpoint = torch.load(_ckpt_files[0])
seq3.load_state_dict(checkpoint['state_dict'])

You said you ran this and the loaded weights were different? Can you run the above code, print the first 3 numbers inside checkpoint['state_dict'], and then the first 3 numbers of the seq1, seq2, and seq3 parameters, and post the result so that we can verify they're different?
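For reference, a minimal, self-contained sketch of that check (using a toy `nn.Linear` model and a hypothetical `ckpt.pt` path, not the original `Seq2seq` class or checkpoint files):

```python
import torch
import torch.nn as nn

# Save a state_dict, reload it into two fresh models, and compare.
model = nn.Linear(4, 1).double()
torch.save({'state_dict': model.state_dict()}, 'ckpt.pt')

checkpoint = torch.load('ckpt.pt')
copy1 = nn.Linear(4, 1).double()
copy1.load_state_dict(checkpoint['state_dict'])
copy2 = nn.Linear(4, 1).double()
copy2.load_state_dict(checkpoint['state_dict'])

# Print the first 3 numbers of the checkpoint and of each loaded copy;
# they should be identical.
print(checkpoint['state_dict']['weight'].flatten()[:3])
print(copy1.weight.flatten()[:3])
print(copy2.weight.flatten()[:3])
```

If the prints differ, the mismatch is in how the models are constructed or loaded, not in `torch.load` itself.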

Each time I load the checkpoint, it loads correctly, as shown below. But when I inspect the weights, the state dict contains only the linear layer's weights.

OrderedDict([('linear.weight', tensor([[  -41.0457, -1418.4844,  -972.4815,  -918.7100,   404.7451,
           145.0526,   268.7369,  -898.6894,   -63.3250,  -904.7896,
          -424.2258,  -509.7233,   218.7865,   311.9687,   -92.7278,
         -1364.6249, -2106.7388,   437.8537,  -588.3126,  -565.6002,
         -1535.2934,  -962.4911,  -515.7998,    -0.2963,  -732.9321,
           700.0198, -1275.2723,   311.5322, -1283.7908,   285.9134,
          -969.9243,  -621.8158,  1621.2073, -1284.5657,  1765.6793,
         -1103.9920,    79.7396,  -306.4757, -1038.7383, -1032.0752,
           -57.7447,  1127.6596,  2263.5248,  -886.7473,  -534.6647,
          -677.6693, -1380.0471,   159.1263,   125.1079,   294.6976,
          1363.7513]], dtype=torch.float64, device='cuda:0')), ('linear.bias', tensor([ 24.7377], dtype=torch.float64, device='cuda:0'))])
And all three models load the linear layer weights correctly.

seq1.linear.weight

Parameter containing:
tensor([[  -41.0457, -1418.4844,  -972.4814,  -918.7100,   404.7451,
           145.0526,   268.7369,  -898.6894,   -63.3250,  -904.7896,
          -424.2258,  -509.7233,   218.7865,   311.9687,   -92.7278,
         -1364.6249, -2106.7388,   437.8537,  -588.3126,  -565.6002,
         -1535.2935,  -962.4911,  -515.7999,    -0.2963,  -732.9321,
           700.0198, -1275.2722,   311.5322, -1283.7908,   285.9135,
          -969.9244,  -621.8158,  1621.2073, -1284.5658,  1765.6792,
         -1103.9921,    79.7396,  -306.4757, -1038.7383, -1032.0752,
           -57.7447,  1127.6597,  2263.5249,  -886.7473,  -534.6647,
          -677.6693, -1380.0470,   159.1263,   125.1079,   294.6976,
          1363.7513]], dtype=torch.float64, device='cuda:0')

seq2.linear.weight

Parameter containing:
tensor([[  -41.0457, -1418.4844,  -972.4814,  -918.7100,   404.7451,
           145.0526,   268.7369,  -898.6894,   -63.3250,  -904.7896,
          -424.2258,  -509.7233,   218.7865,   311.9687,   -92.7278,
         -1364.6249, -2106.7388,   437.8537,  -588.3126,  -565.6002,
         -1535.2935,  -962.4911,  -515.7999,    -0.2963,  -732.9321,
           700.0198, -1275.2722,   311.5322, -1283.7908,   285.9135,
          -969.9244,  -621.8158,  1621.2073, -1284.5658,  1765.6792,
         -1103.9921,    79.7396,  -306.4757, -1038.7383, -1032.0752,
           -57.7447,  1127.6597,  2263.5249,  -886.7473,  -534.6647,
          -677.6693, -1380.0470,   159.1263,   125.1079,   294.6976,
          1363.7513]], dtype=torch.float64, device='cuda:0')

seq3.linear.weight

Parameter containing:
tensor([[  -41.0457, -1418.4844,  -972.4814,  -918.7100,   404.7451,
           145.0526,   268.7369,  -898.6894,   -63.3250,  -904.7896,
          -424.2258,  -509.7233,   218.7865,   311.9687,   -92.7278,
         -1364.6249, -2106.7388,   437.8537,  -588.3126,  -565.6002,
         -1535.2935,  -962.4911,  -515.7999,    -0.2963,  -732.9321,
           700.0198, -1275.2722,   311.5322, -1283.7908,   285.9135,
          -969.9244,  -621.8158,  1621.2073, -1284.5658,  1765.6792,
         -1103.9921,    79.7396,  -306.4757, -1038.7383, -1032.0752,
           -57.7447,  1127.6597,  2263.5249,  -886.7473,  -534.6647,
          -677.6693, -1380.0470,   159.1263,   125.1079,   294.6976,
          1363.7513]], dtype=torch.float64, device='cuda:0')

But my sequence-to-sequence model contains 3 LSTM cells and one linear layer:

class Seq2seq(nn.Module):
    def __init__(self, num_hidden, num_cells, device=None):
        """
        Initialize the LSTM predictor
        :param num_hidden: Number of hidden units of LSTM
        :param num_cells: Number of LSTM cells in the NN, equivalent to number of layers
        """
        super(Seq2seq, self).__init__()
        self.num_cells = num_cells
        self.num_hidden = num_hidden
        self.cell_list = []
        if device is None:
            if self.num_cells > 5 and self.num_hidden > 51:
                self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
            else:
                self.device = "cpu"
        else:
            if device == "gpu":
                self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
            else:
                self.device = "cpu"
        for i in range(0, num_cells):
            if i == 0:
                self.cell_list.append((nn.LSTMCell(1, num_hidden).double()).to(self.device))
            else:
                self.cell_list.append((nn.LSTMCell(num_hidden, num_hidden).double()).to(self.device))
        self.linear = nn.Linear(num_hidden, 1)

So I think I need to save the parameters of the three LSTM cells as well; those are not included when I save the state with torch.save(model.state_dict()).

What is the correct way to save multiple sub-models?

I have corrected the code to save a dict containing each LSTM cell's state and to load them individually, as below:


            # Save the checkpoint
            save_checkpoints({
                'num_epochs': epoch,
                'num_hidden': number_hidden,
                'num_cells': number_cells,
                'device': device,
                'state_linear': model.state_dict(),
                'state_dict0': model.cell_list[0].state_dict(),
                'state_dict1': model.cell_list[1].state_dict(),
                'state_dict2': model.cell_list[2].state_dict()}, file_name)

def save_checkpoints(state, file_name):
    """
    Save the trained model and check points related to model
    :param state: state of the model to save
    :param file_name: file where to save the model
    :return:
    """
    torch.save(state, file_name)

And when I now load the same checkpoint file multiple times, as below, I get the same, correct results every time.

    checkpoint = torch.load(_ckpt_files[0])
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    if dev is None:
        dev = "cpu"
    else:
        dev = "gpu"
    seq1 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq1.load_state_dict(checkpoint['state_linear'])
    seq1.cell_list[0].load_state_dict(checkpoint['state_dict0'])
    seq1.cell_list[1].load_state_dict(checkpoint['state_dict1'])
    seq1.cell_list[2].load_state_dict(checkpoint['state_dict2'])
    seq1.to(seq1.device)
    seq1.double()
    seq1.eval()
    _, _ = test(csv_data=current_data[0], train_size=train_size, test_size=test_size,
                data_col=_data_col_list[0], time_col=_timestamp_col_list[0], seq=seq1,
                result_file=None, show=0)
    checkpoint = torch.load(_ckpt_files[0])
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    if dev is None:
        dev = "cpu"
    else:
        dev = "gpu"
    seq2 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq2.load_state_dict(checkpoint['state_linear'])
    seq2.cell_list[0].load_state_dict(checkpoint['state_dict0'])
    seq2.cell_list[1].load_state_dict(checkpoint['state_dict1'])
    seq2.cell_list[2].load_state_dict(checkpoint['state_dict2'])
    seq2.to(seq2.device)
    seq2.double()
    seq2.eval()
    _, _ = test(csv_data=current_data[0], train_size=train_size, test_size=test_size,
                data_col=_data_col_list[0], time_col=_timestamp_col_list[0], seq=seq2,
                result_file=None, show=0)
    checkpoint = torch.load(_ckpt_files[0])
    _epochs = checkpoint['num_epochs']
    num_hidden = checkpoint['num_hidden']
    num_cells = checkpoint['num_cells']
    dev = checkpoint['device']
    if dev is None:
        dev = "cpu"
    else:
        dev = "gpu"
    seq3 = Seq2seq(num_hidden=num_hidden, num_cells=num_cells, device=dev)
    seq3.load_state_dict(checkpoint['state_linear'])
    seq3.cell_list[0].load_state_dict(checkpoint['state_dict0'])
    seq3.cell_list[1].load_state_dict(checkpoint['state_dict1'])
    seq3.cell_list[2].load_state_dict(checkpoint['state_dict2'])
    seq3.to(seq3.device)
    seq3.double()
    seq3.eval()
    _, _ = test(csv_data=current_data[0], train_size=train_size, test_size=test_size,
                data_col=_data_col_list[0], time_col=_timestamp_col_list[0], seq=seq3,
                result_file=None, show=0)

Here are the results:

test loss: 122.80924618395184
Weighted mean absolute error is : 25.6365589979712
input tensor([[ 31.0200,  31.0400,  31.0500,  31.0800,  31.0900,  31.1200,
          31.1400,  31.1600,  31.1800,  31.2000,  31.2300,  31.2500,
          31.2700,  31.2900,  31.3100,  31.3300,  31.3500,  31.3700,
          31.3900,  31.5400,  31.5600,  31.6900,  31.7100,  31.7300,
          31.7600,  31.7800,  31.7900,  31.8200,  31.8400,  31.8600,
          31.8900,  31.9000,  31.9200,  31.9400,  31.9600,  31.9800,
          32.0000,  32.0300,  32.0600,  32.0900]], dtype=torch.float64, device='cuda:0')
forecast tensor([ 39.3661,  38.2005,  36.8705,  36.3623,  36.6181,  37.3330,
         38.2346,  39.1504,  39.9915,  40.7231,  41.3416,  41.8541,
         42.2743,  42.6168,  42.8953,  43.1216,  43.3058,  43.4561,
         43.5792,  43.7020,  43.8089,  43.9155,  44.0072,  44.0820,
         44.1438,  44.1943,  44.2343,  44.2692,  44.2998,  44.3269,
         44.3529,  44.3754,  44.3959,  44.4151,  44.4335,  44.4513,
         44.4686,  44.4871,  44.5071,  44.5282], dtype=torch.float64, device='cuda:0')
test loss: 122.80924618395184
Weighted mean absolute error is : 25.6365589979712
input tensor([[ 31.0200,  31.0400,  31.0500,  31.0800,  31.0900,  31.1200,
          31.1400,  31.1600,  31.1800,  31.2000,  31.2300,  31.2500,
          31.2700,  31.2900,  31.3100,  31.3300,  31.3500,  31.3700,
          31.3900,  31.5400,  31.5600,  31.6900,  31.7100,  31.7300,
          31.7600,  31.7800,  31.7900,  31.8200,  31.8400,  31.8600,
          31.8900,  31.9000,  31.9200,  31.9400,  31.9600,  31.9800,
          32.0000,  32.0300,  32.0600,  32.0900]], dtype=torch.float64, device='cuda:0')
forecast tensor([ 39.3661,  38.2005,  36.8705,  36.3623,  36.6181,  37.3330,
         38.2346,  39.1504,  39.9915,  40.7231,  41.3416,  41.8541,
         42.2743,  42.6168,  42.8953,  43.1216,  43.3058,  43.4561,
         43.5792,  43.7020,  43.8089,  43.9155,  44.0072,  44.0820,
         44.1438,  44.1943,  44.2343,  44.2692,  44.2998,  44.3269,
         44.3529,  44.3754,  44.3959,  44.4151,  44.4335,  44.4513,
         44.4686,  44.4871,  44.5071,  44.5282], dtype=torch.float64, device='cuda:0')
test loss: 122.80924618395184
Weighted mean absolute error is : 25.6365589979712
input tensor([[ 31.0200,  31.0400,  31.0500,  31.0800,  31.0900,  31.1200,
          31.1400,  31.1600,  31.1800,  31.2000,  31.2300,  31.2500,
          31.2700,  31.2900,  31.3100,  31.3300,  31.3500,  31.3700,
          31.3900,  31.5400,  31.5600,  31.6900,  31.7100,  31.7300,
          31.7600,  31.7800,  31.7900,  31.8200,  31.8400,  31.8600,
          31.8900,  31.9000,  31.9200,  31.9400,  31.9600,  31.9800,
          32.0000,  32.0300,  32.0600,  32.0900]], dtype=torch.float64, device='cuda:0')
forecast tensor([ 39.3661,  38.2005,  36.8705,  36.3623,  36.6181,  37.3330,
         38.2346,  39.1504,  39.9915,  40.7231,  41.3416,  41.8541,
         42.2743,  42.6168,  42.8953,  43.1216,  43.3058,  43.4561,
         43.5792,  43.7020,  43.8089,  43.9155,  44.0072,  44.0820,
         44.1438,  44.1943,  44.2343,  44.2692,  44.2998,  44.3269,
         44.3529,  44.3754,  44.3959,  44.4151,  44.4335,  44.4513,
         44.4686,  44.4871,  44.5071,  44.5282], dtype=torch.float64, device='cuda:0')

Process finished with exit code 0

So I think we need to save the multiple LSTM cells' state dicts separately.

This is because cell_list is not part of the model: it is a plain Python list, not a torch Module, so when you call state_dict() it doesn't include the parameters from cell_list. When loading, the cells in cell_list are therefore left with their random initialization.

The same thing happens when you call model.parameters() to pass to the optimizer: the cell_list parameters won't be included unless you explicitly add them to the optimizer.
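You can see this with a tiny toy module (a hypothetical class, not the original Seq2seq): sub-modules stored in a plain Python list are invisible to both state_dict() and parameters().

```python
import torch
import torch.nn as nn

class ListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: these sub-modules are NOT registered.
        self.cell_list = [nn.LSTMCell(1, 8)]
        self.linear = nn.Linear(8, 1)

m = ListModel()
# Only the linear layer shows up; the LSTMCell parameters are missing.
print(list(m.state_dict().keys()))   # ['linear.weight', 'linear.bias']
print(len(list(m.parameters())))     # 2
```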

I can think of 2 solutions:
(1) wrap cell_list in a torch.nn.Module container:
cell_list = torch.nn.ModuleList(cell_list)
You can still index and iterate it like a list, but its parameters get registered.
(2) save the whole model object instead of the state_dict(), but don't do this lol
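A minimal sketch of option (1), again with a toy class standing in for the original Seq2seq: once the cells live in an nn.ModuleList, a single torch.save(model.state_dict()) captures everything.

```python
import torch
import torch.nn as nn

class ListModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each cell as a proper sub-module.
        self.cell_list = nn.ModuleList([nn.LSTMCell(1, 8),
                                        nn.LSTMCell(8, 8)])
        self.linear = nn.Linear(8, 1)

m = ListModel()
# state_dict now includes keys like 'cell_list.0.weight_ih',
# 'cell_list.1.bias_hh', alongside 'linear.weight' and 'linear.bias'.
print(sorted(m.state_dict().keys()))
```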

Does this thread help?

Yes, I resolved the issue by adding parameters for each cell state separately.

Hi, how did you resolve this issue? I am facing the same problem. I have multiple checkpoints generated by the model, but all of them are giving me the worst results.