Why does the loaded LSTM not work correctly?

I want to train an LSTM to predict a word given the 2 preceding words. For example, if both ‘dog’ and ‘drink’ are fed to the LSTM, it is expected to predict ‘water’ as the next word. To do this, I read the examples ‘N-Gram Language Modeling’ and ‘An LSTM for Part-of-Speech Tagging’ in the official tutorials, and replaced the model in ‘N-Gram Language Modeling’ with an adjusted version of the LSTMTagger class from ‘An LSTM for Part-of-Speech Tagging’.
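To make the adaptation concrete, here is a minimal, self-contained sketch (toy word indices, separate from the actual script below) of the shape handling I rely on: the two context-word embeddings are flattened into a single LSTM time step, which is why the LSTM’s input size is embedding_dim * 2.

import torch
import torch.nn as nn

embedding_dim, hidden_dim, vocab_size = 10, 64, 5
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim * 2, hidden_dim)

context = torch.tensor([0, 1], dtype=torch.long)  # indices of the 2 context words
embeds = embedding(context)                       # shape (2, 10)
lstm_in = embeds.view(1, 1, -1)                   # shape (1, 1, 20): seq_len=1, batch=1
out, (h, c) = lstm(lstm_in)                       # out has shape (1, 1, 64)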

However, I found a very strange phenomenon associated with saving and loading models. Suppose a model reaches 60% accuracy after training and is saved to disk; if I then load the trained model and repeat training on the same data, the first epoch yields accuracy near 0. It seems the loaded model is being trained from scratch.

This is strange because I applied the same save-and-load strategy to the ‘N-Gram Language Modeling’ and ‘An LSTM for Part-of-Speech Tagging’ examples, and the loaded models for both yield the expected accuracies. For instance, if the ‘N-Gram Language Modeling’ model reaches 80% accuracy, then after loading it and training again, the accuracy in the first epoch is also about 80%.
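By “save-and-load strategy” I mean the plain state_dict pattern, roughly this minimal, self-contained sketch (the nn.Linear stand-in and the file name are only for illustration; the real script uses the LSTM model below):

import torch
import torch.nn as nn

net = nn.Linear(4, 2)                                 # tiny stand-in model
torch.save(net.state_dict(), 'demo_checkpoint.tar')   # save after training

# later run: rebuild the model with the same shapes, then load the weights
net2 = nn.Linear(4, 2)
net2.load_state_dict(torch.load('demo_checkpoint.tar'))
net2.train()  # continue training from the loaded weights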

I hope someone can explain what is wrong in my script; I cannot spot the bug because of my limited experience with PyTorch. The full script is below; it runs on 64-bit Windows 10 with PyTorch 0.4.

import re, string, time, os, subprocess
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

test_sentence = """A long time ago, there were four families who lived in a small village in Somalia. The first family would argue all of the time, the second family were very greedy, the third family were always away from the village exploring because they were never happy with what they had or where they lived. But the fourth family were calm and patient, and they enjoyed living in their small community.
One night, the daughter of the third family was out exploring when she discovered a well hidden among some trees in the wilderness. The daughter ran back to her family and told them about the well and so they started to use the well to get their water.
It was not long before the other families heard news of the well, and very soon all four families were using the well to get their water until it was in danger of running dry.
This went on for some time, and it was obvious that the water in the well was getting lower and lower, yet none of the families wanted to stop using the well as it was close to the village and meant that they did not have to walk so far to get the water which they used to drink and cook and clean with.
One day, the wise chief, who had always known about the secret well, spoke to each family in turn. The chief said to them, ‘Tonight you must stay in your homes. You must not use the well for one whole night, that way the water will have time to rise once more.’
Each of the families agreed to stay away from the well, especially as the wise chief warned that there would be a severe punishment for any family who disobeyed this simple rule.
But when night fell, the son of the first family could not resist visiting the well as he wanted to make sure he had plenty of water for the following day so that his family would not argue over who would walk the long distance to the usual well used by the rest of the villagers. He crept out to the well carrying two large buckets and filled them both to the top before returning to his home and hiding the buckets where they would not be seen.
Not long after, the son of the second family also crept out to the well and filled two large buckets all the way to the top as he was very greedy and wanted the water for his family alone.
Then the daughter of the third family also crept out to the well as she could not resist exploring at night and reasoned that it was she who had discovered the well in the first place so it was her family who deserved the extra water despite the warning from the wise chief.
The next day, the chief visited the well and was distressed to find that it was completely dry. He waited until he knew that all of the families were away from their homes, then he visited each home in turn.
In the first home he discovered the two buckets, one of which was already empty, but the other still contained the water which was stolen from the well. When he visited the second and third homes he also discovered the buckets of water hidden where nobody would see them. But when he visited the fourth home he discovered that the buckets were dry and realised that the patient family had remained in their beds all night. They had listened to his warning and had stayed away from the well so that the water might rise once more.
The wise chief called all four families to the meeting place in the village where he confronted them about the well. ‘You three families all stole water from the well even though I told you not to,’ said the chief in a stern voice. ‘I know this because I visited your homes this morning and discovered the buckets of water. Because you defied my instructions you will be forced to remain in your homes for thirty days and nights without food or water as punishment. I hope that you will spend this time thinking about the wrong you have done.’
To the fourth family he said, ‘You listened to my simple instructions and stayed in your home last night and did not visit the well. Take this letter and open it when you return to your home.’
The fourth family took the letter and returned home. When they opened the letter there was a map inside. The family followed the directions on the map and after travelling for many miles they discovered a well surrounded by an abundance of fruit trees and vegetable plants. There was enough food and water to last the family a whole lifetime!
The families who were forced to stay in their homes without food or water learned a valuable lesson that day. They learned that it was always best to listen to the advice of one’s elders and not to take things when you were told not to. They also realised that the fourth family were rewarded for their patience and their willingness to follow the simple rules which benefit a community.
""".lower().split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

SAVE_PATH = Path(os.path.join('.', 'NGramModel.tar')).resolve()  

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

IS_FORCE_TRAIN = True
EMBEDDING_DIM = 10
HIDDEN_DIM = 64
VOCAB_SIZE = len(vocab)
TAGSET_SIZE = VOCAB_SIZE
EPOCH_NUM = 10

def prepare_sequence(input_sentence, input_word_to_ix):
    """
        Given a list of words and the corresponding index map, return the word indices wrapped in a tensor.
        This provides the long-integer tensor expected by nn.Embedding().
    """
    idxs = [input_word_to_ix[w] for w in input_sentence]
    return torch.tensor(idxs, dtype=torch.long)

# Create the model:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs and outputs hidden states
        # with dimensionality hidden_dim. The two context-word embeddings are
        # flattened into a single input vector, so the input size is the
        # number of context words times the embedding dimension.
        self.lstm = nn.LSTM(embedding_dim * 2, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):        
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        # notice, if 'to(device)' is omitted, then LSTM will crash when GPU 
        # is the default device
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(
            embeds.view(1, 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(1, -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, TAGSET_SIZE)

try:
    model.load_state_dict(torch.load(SAVE_PATH))    
    print('model loaded successfully')
except FileNotFoundError:
    # no saved checkpoint yet; train from scratch
    pass


loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(EPOCH_NUM):    
    accuracy = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()
        model.hidden = model.init_hidden()            

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)
        _, predicted_class = torch.max(log_probs, 1)        
        accuracy = accuracy + 1 if predicted_class.item() == word_to_ix[target] else accuracy + 0

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
            
        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()
    
    print ('Epoch [%d/%d], Loss: %.4f, ACC: %.4f' %(epoch+1, EPOCH_NUM, loss.item(), accuracy / len(test_sentence)))  

torch.save(model.state_dict(), SAVE_PATH)

I modified your code a bit to load and save the model, and I can see that the training loss with the model loaded is consistent with the last checkpoint. However, the accuracy is always 0 even though the loss keeps decreasing, so maybe double-check the way the accuracy is computed (I sketched the kind of check I mean right after the script).

import re, string, time, os, subprocess
from pathlib import Path

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import shutil
torch.manual_seed(1)

test_sentence = """A long time ago, there were four families who lived in a small village in Somalia. The first family would argue all of the time, the second family were very greedy, the third family were always away from the village exploring because they were never happy with what they had or where they lived. But the fourth family were calm and patient, and they enjoyed living in their small community.
One night, the daughter of the third family was out exploring when she discovered a well hidden among some trees in the wilderness. The daughter ran back to her family and told them about the well and so they started to use the well to get their water.
It was not long before the other families heard news of the well, and very soon all four families were using the well to get their water until it was in danger of running dry.
This went on for some time, and it was obvious that the water in the well was getting lower and lower, yet none of the families wanted to stop using the well as it was close to the village and meant that they did not have to walk so far to get the water which they used to drink and cook and clean with.
One day, the wise chief, who had always known about the secret well, spoke to each family in turn. The chief said to them, ‘Tonight you must stay in your homes. You must not use the well for one whole night, that way the water will have time to rise once more.’
Each of the families agreed to stay away from the well, especially as the wise chief warned that there would be a severe punishment for any family who disobeyed this simple rule.
But when night fell, the son of the first family could not resist visiting the well as he wanted to make sure he had plenty of water for the following day so that his family would not argue over who would walk the long distance to the usual well used by the rest of the villagers. He crept out to the well carrying two large buckets and filled them both to the top before returning to his home and hiding the buckets where they would not be seen.
Not long after, the son of the second family also crept out to the well and filled two large buckets all the way to the top as he was very greedy and wanted the water for his family alone.
Then the daughter of the third family also crept out to the well as she could not resist exploring at night and reasoned that it was she who had discovered the well in the first place so it was her family who deserved the extra water despite the warning from the wise chief.
The next day, the chief visited the well and was distressed to find that it was completely dry. He waited until he knew that all of the families were away from their homes, then he visited each home in turn.
In the first home he discovered the two buckets, one of which was already empty, but the other still contained the water which was stolen from the well. When he visited the second and third homes he also discovered the buckets of water hidden where nobody would see them. But when he visited the fourth home he discovered that the buckets were dry and realised that the patient family had remained in their beds all night. They had listened to his warning and had stayed away from the well so that the water might rise once more.
The wise chief called all four families to the meeting place in the village where he confronted them about the well. ‘You three families all stole water from the well even though I told you not to,’ said the chief in a stern voice. ‘I know this because I visited your homes this morning and discovered the buckets of water. Because you defied my instructions you will be forced to remain in your homes for thirty days and nights without food or water as punishment. I hope that you will spend this time thinking about the wrong you have done.’
To the fourth family he said, ‘You listened to my simple instructions and stayed in your home last night and did not visit the well. Take this letter and open it when you return to your home.’
The fourth family took the letter and returned home. When they opened the letter there was a map inside. The family followed the directions on the map and after travelling for many miles they discovered a well surrounded by an abundance of fruit trees and vegetable plants. There was enough food and water to last the family a whole lifetime!
The families who were forced to stay in their homes without food or water learned a valuable lesson that day. They learned that it was always best to listen to the advice of one’s elders and not to take things when you were told not to. They also realised that the fourth family were rewarded for their patience and their willingness to follow the simple rules which benefit a community.
""".lower().split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

IS_FORCE_TRAIN = True
EMBEDDING_DIM = 10
HIDDEN_DIM = 64
VOCAB_SIZE = len(vocab)
TAGSET_SIZE = VOCAB_SIZE
EPOCH_NUM = 10

def prepare_sequence(input_sentence, input_word_to_ix):
    """
        Given a list of words and the corresponding index map, return the word indices wrapped in a tensor.
        This provides the long-integer tensor expected by nn.Embedding().
    """
    idxs = [input_word_to_ix[w] for w in input_sentence]
    return torch.tensor(idxs, dtype=torch.long)

# Create the model:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs and outputs hidden states
        # with dimensionality hidden_dim. The two context-word embeddings are
        # flattened into a single input vector, so the input size is the
        # number of context words times the embedding dimension.
        self.lstm = nn.LSTM(embedding_dim * 2, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):        
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        # notice, if 'to(device)' is omitted, then LSTM will crash when GPU 
        # is the default device
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(
            embeds.view(1, 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(1, -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, TAGSET_SIZE)

try:
    model.load_state_dict(torch.load("NGramModel.tar")['state'])    
    print('model loaded successfully')
except Exception as e:
    # checkpoint missing or failed to load; print the reason and train from scratch
    print(e)


loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(EPOCH_NUM):    
    accuracy = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()
        model.hidden = model.init_hidden()            

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)
        _, predicted_class = torch.max(log_probs, 1)
#         print(predicted_class.item(), word_to_ix[target])
        if predicted_class.item() == word_to_ix[target]:
            accuracy +=1

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
            
        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()
    
    print ('Epoch [%d/%d], Loss: %.4f, ACC: %.4f' %(epoch+1, EPOCH_NUM, loss.item(), accuracy / len(test_sentence)))  

torch.save({'state': model.state_dict()}, "NGramModel.tar")
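For reference, this is the kind of separate check I had in mind, sketched as a snippet you could append after the training loop above. It reuses the names model, trigrams and word_to_ix from that script, runs without gradient tracking, and normalizes by the number of trigrams rather than the number of words:

# appended after the training loop; reuses model, trigrams, word_to_ix from above
model.eval()
correct = 0
with torch.no_grad():
    for context, target in trigrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        model.hidden = model.init_hidden()
        log_probs = model(context_idxs)
        _, predicted_class = torch.max(log_probs, 1)
        if predicted_class.item() == word_to_ix[target]:
            correct += 1
print('eval ACC: %.4f' % (correct / len(trigrams)))
model.train()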

Thanks for the reply. I have run your code, but nothing improved. The outputs are listed below.

Output of the initial training, without loading any pre-trained model:

    [Errno 2] No such file or directory: 'NGramModel.tar'
    Epoch [1/10], Loss: 6.3162, ACC: 0.1006
    Epoch [2/10], Loss: 6.1846, ACC: 0.1345
    Epoch [3/10], Loss: 5.7875, ACC: 0.1446
    Epoch [4/10], Loss: 5.1115, ACC: 0.1661
    Epoch [5/10], Loss: 4.1613, ACC: 0.1921
    Epoch [6/10], Loss: 3.1089, ACC: 0.2576
    Epoch [7/10], Loss: 2.1107, ACC: 0.3153
    Epoch [8/10], Loss: 1.2641, ACC: 0.3887
    Epoch [9/10], Loss: 0.7448, ACC: 0.5062
    Epoch [10/10], Loss: 0.5428, ACC: 0.5977

Output of the second training, which loads the model produced by the initial training:

    model loaded successfully
    Epoch [1/10], Loss: 4.3901, ACC: 0.0723
    Epoch [2/10], Loss: 2.7416, ACC: 0.1706
    Epoch [3/10], Loss: 1.5863, ACC: 0.2633
    Epoch [4/10], Loss: 0.8832, ACC: 0.3514
    Epoch [5/10], Loss: 0.6362, ACC: 0.4621
    Epoch [6/10], Loss: 0.5090, ACC: 0.5672
    Epoch [7/10], Loss: 0.4624, ACC: 0.6328
    Epoch [8/10], Loss: 0.2824, ACC: 0.6881
    Epoch [9/10], Loss: 0.2198, ACC: 0.7130
    Epoch [10/10], Loss: 0.1668, ACC: 0.7164