I am currently working on an NLP project and have trained a Seq2Seq Transformer model. Due to certain project requirements, I need to verify that my model (and code) works on different machines (with different operating systems).
I have noticed that when running evaluation on the different machines with the same model, code, and dataset, the results are all slightly different. Specifically, occasionally (roughly once per 200 cases), the output sequences differ by a few tokens across machines. Just to double check: is this behaviour, where evaluation on different machines produces slightly different results, expected? Additionally, what might be the cause of this phenomenon?
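In case it helps frame the question: I am aware that floating-point addition is not associative, so I would guess that different reduction orders (e.g. from different BLAS libraries or hardware) could produce tiny numeric differences that occasionally flip an argmax during decoding. A minimal Python illustration of the non-associativity itself (not from my project code):

```python
# Floating-point addition is not associative: grouping the same three
# terms differently can change the least-significant bits of the result.
# Different machines/libraries may reduce sums in different orders,
# which is one plausible source of tiny cross-machine discrepancies.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # prints False
```

My (unverified) assumption is that such bit-level differences usually cancel out, but near-tied token probabilities could resolve differently on different machines.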
Thank you in advance.