Inconsistent results after loading torch model

Hey, so my model is trained to perform NER (sequence tagging) using a BiLSTM network with a CRF layer as the classifier.

I have followed this tutorial to recreate a model on my own dataset: intro-to-nlp-with-pytorch/NamedEntityRecognition.ipynb at master · PythonWorkshop/intro-to-nlp-with-pytorch · GitHub

I obtain high accuracy on my train and test sets when the model is loaded and tested in the same kernel session. However, when I restart my kernel, it produces terrible results. I have tried saving the model both ways (saving the full model and saving the parameters separately).
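For reference, the two standard saving approaches look like this (a minimal sketch using a toy `nn.Linear` as a stand-in for the BiLSTM_CRF; the file names are placeholders):

```python
import torch
import torch.nn as nn

# Toy stand-in for the trained model (the real BiLSTM_CRF takes more ctor args).
model = nn.Linear(4, 2)

# Approach 1 (recommended): save only the learned parameters.
torch.save(model.state_dict(), "model_state.pt")
restored = nn.Linear(4, 2)                      # rebuild the architecture first
restored.load_state_dict(torch.load("model_state.pt"))

# Approach 2: pickle the entire module (fragile if the class definition changes).
torch.save(model, "model_full.pt")
restored_full = torch.load("model_full.pt", weights_only=False)

x = torch.ones(1, 4)
with torch.no_grad():
    print(torch.allclose(model(x), restored(x)))       # True
    print(torch.allclose(model(x), restored_full(x)))  # True
```

Both round-trip the weights; the state_dict route is usually preferred because it does not depend on pickling the class itself.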

I have also tried loading all the layers individually (the BiLSTM, linear, embeddings), but the results are still very bad. I tried the experiment with the Adam, RMSprop, and SGD optimizers.
Adam and RMSprop produce very good results when the model is trained, saved, loaded, and tested in the same kernel.
However, after restarting the kernel, SGD is slightly better (still bad), whereas Adam and RMSprop are really bad.

Can someone please give me insights as to where I am going wrong?

class BiLSTM_CRF(nn.Module):

    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        """Initialize network."""
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)
        self.dropout = torch.nn.Dropout(0.15)

        # Maps the output of the LSTM into tag space.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Matrix of transition parameters.  Entry i,j is the score of
        # transitioning *to* i *from* j.
        state = torch.get_rng_state()
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))
        # print(self.transitions)

        # These two statements enforce the constraint that we never transfer
        # to the start tag and we never transfer from the stop tag.
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()
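One thing worth noting about the constructor above: because `transitions` is wrapped in `nn.Parameter`, it is registered with the module and included in its `state_dict`, so `load_state_dict` restores the learned matrix instead of leaving a fresh random one behind. A minimal sketch with a toy module:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Randomly initialized, like the CRF transition matrix.
        self.transitions = nn.Parameter(torch.randn(3, 3))

m = Toy()
print("transitions" in m.state_dict())             # True: nn.Parameter is tracked

m2 = Toy()                                         # different random init
m2.load_state_dict(m.state_dict())
print(torch.equal(m.transitions, m2.transitions))  # True: the matrix round-trips
```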


Training code:


torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
state = torch.get_rng_state()

losses = []
epochs = []

for epoch in range(10):
    for sentence, tags in train_data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance.
        model.zero_grad()

        # Step 2. Get our inputs ready for the network.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Compute the loss, gradients, and update the parameters.
        loss = model.neg_log_likelihood(sentence_in, targets)
        losses.append(loss.item())
        loss.backward()
        optimizer.step()

    print("Epoch: {} Loss: {}".format(epoch+1, np.mean(losses)))






Saving the parameters:

torch.save(model.transitions, '')
torch.save(model.word_embeds, '')
torch.save(model.lstm, '')
torch.save(model.hidden2tag, '')

Saving the model:

torch.save(model, 'saved_model/')

Loading the model and the parameters:

import random

model2 = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)

model2 = torch.load('saved_model/')

model2.transitions = torch.load('')
model2.word_embeds = torch.load('')
model2.lstm = torch.load('')
model2.hidden2tag = torch.load('')

state = torch.get_rng_state()

Testing the model:

torch.backends.cudnn.deterministic = True

accuracies = []
predicted_tags = []

# Testing the model; no need to accumulate gradients.
with torch.no_grad():
    for i in range(len(train_data)):
        precheck_sent = prepare_sequence(train_data[i][0], word_to_ix)
        pred = model2(precheck_sent)[1]
        prediction = [ix_to_tag[idx] for idx in pred]

        print('Prediction:   ', prediction)
        print('Ground truth: ', train_data[i][1])
        accuracy = sum(1 for x, y in zip(prediction, train_data[i][1]) if x == y) / float(len(train_data[i][1]))
        accuracies.append(accuracy)

    for i in range(len(test_data)):
        precheck_sent = prepare_sequence(test_data[i][0], word_to_ix)
        pred = model2(precheck_sent)[1]
        prediction = [ix_to_tag[idx] for idx in pred]

        print('Prediction:   ', prediction)
        print('Ground truth: ', test_data[i][1])
        accuracy = sum(1 for x, y in zip(prediction, test_data[i][1]) if x == y) / float(len(test_data[i][1]))
        accuracies.append(accuracy)



To debug the issue compare the outputs of a static tensor (either define one via e.g. torch.ones or save a specific input batch) before saving the model and after loading it in the other notebook. If the results are equal, this would point to a difference in the data processing. On the other hand, if the results are different, either the model creates “random” outputs e.g. via some random operations (call model.eval() to disable dropout layers and check for other random operations) or the saving/loading failed.
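The suggested check could be sketched like this (again with a toy `nn.Linear` standing in for the real model, and a placeholder file name):

```python
import torch
import torch.nn as nn

# Before saving (training kernel): record a fixed input and the model's output.
model = nn.Linear(4, 2)                      # stand-in for the trained BiLSTM_CRF
model.eval()                                 # disable dropout for a stable reference
fixed_input = torch.ones(1, 4)
with torch.no_grad():
    reference = model(fixed_input)
torch.save({"input": fixed_input, "output": reference,
            "state_dict": model.state_dict()}, "debug_ckpt.pt")

# After restarting (fresh kernel): reload and compare against the reference.
ckpt = torch.load("debug_ckpt.pt")
model2 = nn.Linear(4, 2)
model2.load_state_dict(ckpt["state_dict"])
model2.eval()
with torch.no_grad():
    out = model2(ckpt["input"])
print(torch.allclose(out, ckpt["output"]))   # True if save/load round-tripped
```

If this comparison passes in the fresh kernel but the real pipeline still degrades, the difference is upstream of the model, i.e. in the data processing.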

I saved a static input batch before training and saving the model. The results are consistent after loading the model as well. But please note that the input batch doesn't go through any data processing here. The randomness is mainly caused by the transitions variable in the BiLSTM_CRF class' constructor. Is there anything else I must check for in terms of randomness?
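One more source of randomness in this model is the `nn.Dropout(0.15)` layer in the constructor: in training mode it randomly zeroes activations on every forward pass, so unless `model2.eval()` is called before testing, outputs will vary even with identical weights. A minimal sketch of the difference:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(0.15)
x = torch.ones(10)

drop.train()                 # training mode: dropout is active
a, b = drop(x), drop(x)      # two calls usually differ

drop.eval()                  # eval mode: dropout is a no-op
c, d = drop(x), drop(x)
print(torch.equal(c, d))     # True: eval-mode outputs are deterministic
print(torch.equal(c, x))     # True: the input passes through unchanged
```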

Another thing I tried was to save the model's output for a single input batch before training and after training.
In the same kernel instance, after training, I get good results. When I restart the kernel, I get poor results for that single input batch. If I save the results right after training and load them in another instance, they do match, but only because I am saving the output itself, which is not what's required.

I need my model's outputs to stay consistent even after restarting a kernel, and to remove any randomness so that it's reproducible.
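One cross-kernel pitfall worth ruling out (an assumption on my side, since the preprocessing code isn't shown): if `word_to_ix` / `tag_to_ix` are rebuilt in the fresh kernel and the construction order isn't deterministic, the same word can map to a different embedding row, which would produce exactly this "good in-session, bad after restart" behaviour even when the weights load correctly. Saving the mappings alongside the weights rules it out (the dict contents below are hypothetical):

```python
import torch

# Hypothetical mappings; in the real code these come from the training data.
word_to_ix = {"the": 0, "dog": 1, "barked": 2}
tag_to_ix = {"O": 0, "B-ANIMAL": 1, "I-ANIMAL": 2}

# Save the vocabulary together with the checkpoint so a fresh kernel
# reuses the exact same index assignment instead of rebuilding it.
torch.save({"word_to_ix": word_to_ix, "tag_to_ix": tag_to_ix}, "vocab.pt")

ckpt = torch.load("vocab.pt")
print(ckpt["word_to_ix"] == word_to_ix)  # True: indices survive the restart
```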

Please give me some more insights as to what should be done next.