Slightly Different Results When Evaluating Same Model on Different Machines

I am currently working on an NLP project and have trained a Seq2Seq Transformer model. Due to certain project requirements, I need to verify that my model (and code) works on different machines (with different operating systems).

I have noticed that when I run evaluation on the different machines with the same model, code, and dataset, the results are slightly different. Specifically, once in a while (roughly once per 200 cases), the output sequences on the different machines differ by a few tokens. Just to double-check: is it expected that evaluation on different machines produces slightly different results? Additionally, what might be the cause of this phenomenon?

Thank you in advance.

Yes, this is generally expected: different hardware, algorithms, library versions, and optimizations can all lead to small numerical differences.
For example, a different CPU might use different SIMD instructions, and therefore a different floating-point reduction order, which yields slightly different results; the same applies to GPUs, especially since different kernels can be selected depending on the device.
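To make the mechanism concrete, here is a minimal sketch (plain NumPy, not your actual model; all numbers are made up) showing that the same float32 dot product, reduced in a different order, produces slightly different values. Different chunk sizes stand in for different SIMD widths or kernel reduction orders on different machines:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.standard_normal(4096).astype(np.float32)   # stand-in for a decoder hidden state
weight = rng.standard_normal(4096).astype(np.float32)   # stand-in for one vocabulary embedding column

def chunked_dot(h, w, chunk):
    # Summing in chunks of different sizes mimics different reduction orders
    # used by different SIMD instruction sets / GPU kernels.
    total = np.float32(0.0)
    for i in range(0, len(h), chunk):
        total += np.dot(h[i:i + chunk], w[i:i + chunk])
    return total

for chunk in (32, 256, 4096):
    print(f"chunk={chunk:4d}  logit={chunked_dot(hidden, weight, chunk):.8f}")
```

The printed "logits" agree to only a handful of significant digits. Whenever two vocabulary entries happen to have nearly tied logits at some decoding step, a difference of that size can flip the argmax in greedy decoding, and the sequences then diverge by a few tokens from that point on, which matches your roughly 1-in-200 observation. Note that reproducibility settings such as fixing seeds or PyTorch's `torch.use_deterministic_algorithms(True)` help make runs repeatable on a single machine, but they do not guarantee bitwise-identical results across different hardware.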