Two models with the same parameters, weights, and vocab produce different classifications when fed the same text

I implemented a transformer classification model following Peter Bloem’s blog. It works quite well and classifies text into 20 labels with a validation accuracy of 90%. I then saved the model, planning to test it after loading it back.

if acc > best_acc:
    best_acc = acc
    # save the current best model (file name here is illustrative)
    torch.save(model.state_dict(), 'data_directory/model.pt')

When I loaded the model, it performed poorly, so I read the discussions here and learned I should save the vocab as well:

torch.save(TEXT.vocab, 'data_directory/TEXTvocab', pickle_module=dill)
torch.save(LABEL.vocab, 'data_directory/LABELvocab', pickle_module=dill)

Great news: the validation accuracy with the loaded model is 90% in my Jupyter notebook with the device set to 'cpu' (I trained in Google Colab on the GPU).

There is one thing that still confuses me. I was very excited by the good results, so I made a toy example with sentences to classify. When I test my performance on these toy samples, I get 5 out of 5 correct in Google Colab using the freshly trained model. However, in my Jupyter notebook, I only get 3 out of 5 correct.

I double-checked that the weights are the same: I spot-checked several different layers, and the weights were exactly the same in the freshly trained model and the loaded model. The sentences I fed to both are exactly the same. I understand these results would not be surprising if I had trained the same model twice and tested on these toy samples. But shouldn’t the results be identical if the weights are exactly the same and the input sentences are exactly the same?
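Spot-checking a few layers by hand can miss a mismatch, so here is a sketch of comparing every tensor in the two state dicts at once (state_dicts_match is my own helper name, not a torch API; m1/m2 below are toy stand-ins for the fresh and loaded models):

```python
import copy
import torch
import torch.nn as nn

def state_dicts_match(model_a, model_b):
    """Return True iff both models have identical parameter names and values."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    if sd_a.keys() != sd_b.keys():
        return False
    return all(torch.equal(sd_a[k], sd_b[k]) for k in sd_a)

# quick self-check on a toy module
m1 = nn.Linear(4, 2)
m2 = copy.deepcopy(m1)
print(state_dicts_match(m1, m2))  # True: m2 is an exact copy
```

torch.equal is exact (bitwise) equality, so this also catches tiny float drift that eyeballing a printout would miss.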

I have a suspicion that it has something to do with the vocab. With the freshly trained model, I have the data Fields set up, so it’s easy to use them for my toy samples. I even use the BucketIterator with the same args, except with a batch size of 1. I couldn’t figure out how to recreate this with my saved vocab objects, so I used the code below (modified from Analytics Vidhya). I still get different results from the freshly trained model.
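One way to test that suspicion is to compare the two vocabs' token-to-index mappings directly. A sketch (vocabs_match is a hypothetical helper; the SimpleNamespace objects below just mimic the .itos/.stoi attributes of a torchtext Vocab so the example is self-contained):

```python
from types import SimpleNamespace

def vocabs_match(vocab_a, vocab_b):
    """Return True iff two vocabs map the same tokens to the same indices."""
    return (list(vocab_a.itos) == list(vocab_b.itos)
            and dict(vocab_a.stoi) == dict(vocab_b.stoi))

# quick self-check with stand-in vocab objects
v1 = SimpleNamespace(itos=['<unk>', 'the', 'cat'],
                     stoi={'<unk>': 0, 'the': 1, 'cat': 2})
v2 = SimpleNamespace(itos=['<unk>', 'the', 'cat'],
                     stoi={'<unk>': 0, 'the': 1, 'cat': 2})
print(vocabs_match(v1, v2))  # True: identical mappings
```

In the real setting this would be called on the freshly built Field’s vocab and the reloaded object; a False result would point straight at an indexing mismatch between the two environments.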

Am I incorrect in thinking that two models should produce the same output? If they should produce the same output, are there any thoughts on how to get the same output?

edit: I retrained, saved, and loaded the model again. This time the freshly trained model and loaded model both get 3 out of 5 correct on my toy example. I don’t recall doing anything different, but I suppose there is a chance I did.
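If the two models really do agree, their raw logits on the same input should match too, and comparing logits rather than argmax would show whether the 3-of-5 flips come from near-ties. A sketch (compare_logits is a hypothetical helper; the Linear models below are toy stand-ins for the fresh and loaded classifiers):

```python
import copy
import torch

def compare_logits(model_a, model_b, tensor):
    """Run the same input through both models in eval mode and return
    the largest absolute difference between their output logits."""
    model_a.eval()
    model_b.eval()
    with torch.no_grad():
        out_a = model_a(tensor)
        out_b = model_b(tensor)
    return (out_a - out_b).abs().max().item()

# quick self-check: identical toy models on the same input
m1 = torch.nn.Linear(4, 3)
m2 = copy.deepcopy(m1)
x = torch.randn(1, 4)
print(compare_logits(m1, m2, x))  # 0.0 for identical weights on CPU
```

A small but nonzero difference would suggest numeric drift (e.g. GPU vs CPU kernels); a large one would suggest the inputs are being indexed differently before they reach the model.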

import torch
import dill
from torchtext import vocab

# This workaround lets the saved vocab files load - the code in this
# cell fails without it, because the vocab's stoi defaultdict was
# pickled with a reference to _default_unk_index.
try:
    vocab._default_unk_index
except AttributeError:
    def _default_unk_index():
        return 0
    vocab._default_unk_index = _default_unk_index

TEXT = torch.load('data_directory/TEXTvocab', pickle_module=dill) 
LABEL = torch.load('data_directory/LABELvocab', pickle_module=dill)

model.eval()  # switch off dropout etc.; train mode can change predictions

with torch.no_grad():
    for i in range(len(test_text)):
        tokenized = tokenizer(test_text[i].lower())
        indexed = [TEXT.stoi[t] for t in tokenized]  # stoi returns the unk index for unseen tokens
        tensor = torch.LongTensor(indexed).to('cpu')
        tensor = tensor.unsqueeze(0)  # add a batch dimension: shape (1, seq_len)
        out = model(tensor).argmax(dim=1)
        print('Actual label is: ', test_units[i])
        print('Predicted label is: ', LABEL.itos[out.item() + 1])  # note the +1 offset into the label vocab