Frequency of evaluation steps affecting PyTorch model accuracy

Dear PyTorch community,

I observed that the frequency of evaluation steps affects the final accuracy of my PyTorch model. When I evaluate the model after each epoch of training (this is a dummy example where I train the model for only one epoch), I get an accuracy of 0.7551 and a loss of 1.0198748077556585.
Like below:

for epoch in range(num_epochs):
    train_one_epoch()
    evaluate()

However, when I evaluate the model before training, then train and evaluate again, I get an accuracy of 0.7656 and a loss of 0.9733265159995692.

for epoch in range(num_epochs):
    evaluate()
    train_one_epoch()
    evaluate()

Similarly, if I evaluate the model twice before training, then train and run a final evaluation, I get an accuracy of 0.7631 and a loss of 0.9899978056834762.

I would like to know why this is happening and how to ensure consistent results regardless of the order of training and evaluation steps. Any suggestions or insights from the PyTorch community would be appreciated.

Here is the full code - example.py

The small differences in accuracy might be caused by a different pseudorandom number generator behavior, which could be caused by the BaseDataLoaderIter as seen in this post.
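To illustrate the suspected mechanism: any operation that draws from the global RNG during evaluation (e.g. a DataLoader shuffling, or a layer left in training mode) advances the generator state, so the subsequent training epoch sees different random numbers. A minimal sketch (the `torch.rand` calls here are stand-ins, not code from the original post):

```python
import torch

# Training run with no evaluation beforehand:
torch.manual_seed(0)
first = torch.rand(3)   # the random numbers train_one_epoch() would consume

# Same seed, but something (a stand-in for evaluate()) draws from the
# global RNG first, advancing its state:
torch.manual_seed(0)
_ = torch.rand(1)
second = torch.rand(3)  # training now consumes different random numbers

print(torch.equal(first, second))  # False
```

This alone would explain why inserting extra `evaluate()` calls shifts the final metrics.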
Could you check if re-seeding the code directly before calling train_one_epoch would yield the same performance?

Re-seeding before train_one_epoch still gives slightly different numbers. For

for epoch in range(num_epochs):
    evaluate()
    seed_everything(seed)
    train_one_epoch()
    evaluate()

I got acc=0.7478 and loss=1.0340169899782556; before, it was acc=0.7656 and loss=0.9733265159995692.
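For reference, a typical `seed_everything` helper looks like the sketch below (the original post's helper is not shown, so the exact contents are an assumption):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Seed the common RNG sources (sketch; may differ from the post's helper)."""
    random.seed(seed)                         # Python's built-in RNG
    np.random.seed(seed)                      # NumPy's global RNG
    torch.manual_seed(seed)                   # PyTorch CPU (and CUDA) RNG
    torch.cuda.manual_seed_all(seed)          # explicit for all CUDA devices
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash randomization for new processes
```

Note that this resets only the global generators; per-worker DataLoader generators and CUDA kernel nondeterminism are not covered by it.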

This would point to a non-deterministic execution of the actual training loop. Could you remove the evaluate() calls and check if the train_one_epoch is giving the same results for the same seeds?

Yes, without the evaluate() calls, train_one_epoch() gives the same results for the same seed.
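Since training alone is deterministic for a fixed seed, the divergence apparently comes from evaluate() advancing shared RNG state. One possible mitigation (a sketch using the standard DataLoader API; the dataset here is a placeholder, not the original code) is to give the training loader its own Generator, so its shuffling no longer depends on the global RNG that evaluation may have consumed:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real training data.
dataset = TensorDataset(torch.arange(10).float())

# A dedicated Generator isolates the training loader's shuffle order
# from anything else that touches the global RNG (e.g. evaluate()).
g = torch.Generator()
g.manual_seed(42)
train_loader = DataLoader(dataset, batch_size=2, shuffle=True, generator=g)
```

This only decouples the data ordering; stochastic layers such as dropout still draw from the global RNG, so re-seeding immediately before training (or passing explicit generators there too) may still be needed for fully reproducible runs.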