Inconsistent behavior between "final evaluation" and "evaluation after each epoch" in the MNIST example

It is common sense that, during evaluation, the model is not trained on the dev/test dataset, so evaluating should not affect training. However, I noticed that the following two setups give different results:
(1) train for 10 epochs, then run a single final evaluation on the test data
(2) train for 10 epochs, evaluating on the test data after each training epoch
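To make the comparison concrete, here is a sketch (with hypothetical names, not the actual `main.py` code) of the two evaluation schedules being compared:

```python
def run(train_one_epoch, evaluate, epochs, eval_each_epoch):
    """Return the evaluation results produced by the chosen schedule."""
    results = []
    for epoch in range(1, epochs + 1):
        train_one_epoch(epoch)
        if eval_each_epoch:
            results.append(evaluate())   # setup (2): eval after each epoch
    if not eval_each_epoch:
        results.append(evaluate())       # setup (1): a single final eval
    return results
```

If evaluation truly had no side effects, the last entry of setup (2) should equal the single entry of setup (1).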

Prior knowledge:

Even if you set the seed for every RNG:

# seed every RNG the example touches
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
if use_cuda:
    torch.cuda.manual_seed_all(args.seed)  # also seed all GPU RNGs

running examples/mnist/main.py still gives different results on the GPU:

run 1
-------------
Test set: Average loss: 0.1018, Accuracy: 9660/10000 (97%)
Test set: Average loss: 0.0611, Accuracy: 9825/10000 (98%)
Test set: Average loss: 0.0555, Accuracy: 9813/10000 (98%)
Test set: Average loss: 0.0409, Accuracy: 9862/10000 (99%)
Test set: Average loss: 0.0381, Accuracy: 9870/10000 (99%)
Test set: Average loss: 0.0339, Accuracy: 9891/10000 (99%)
Test set: Average loss: 0.0340, Accuracy: 9877/10000 (99%)
Test set: Average loss: 0.0399, Accuracy: 9872/10000 (99%)
Test set: Average loss: 0.0291, Accuracy: 9908/10000 (99%)
Test set: Average loss: 0.0315, Accuracy: 9896/10000 (99%)

run 2
--------------
Test set: Average loss: 0.1016, Accuracy: 9666/10000 (97%)
Test set: Average loss: 0.0608, Accuracy: 9828/10000 (98%)
Test set: Average loss: 0.0567, Accuracy: 9810/10000 (98%)
Test set: Average loss: 0.0408, Accuracy: 9864/10000 (99%)
Test set: Average loss: 0.0382, Accuracy: 9868/10000 (99%)
Test set: Average loss: 0.0339, Accuracy: 9894/10000 (99%)
Test set: Average loss: 0.0349, Accuracy: 9871/10000 (99%)
Test set: Average loss: 0.0396, Accuracy: 9876/10000 (99%)
Test set: Average loss: 0.0294, Accuracy: 9911/10000 (99%)
Test set: Average loss: 0.0304, Accuracy: 9895/10000 (99%)

As long as you also set torch.backends.cudnn.deterministic = True,
you get consistent results across runs:
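The full determinism setup described above can be collected into one helper. This is a sketch; `use_cuda` mirrors the example's flag, and `cudnn.benchmark = False` is an extra setting I add here because cuDNN's autotuner can otherwise pick different kernels between runs:

```python
import random

import numpy as np
import torch


def set_full_determinism(seed, use_cuda):
    """Seed every RNG and force cuDNN to use deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if use_cuda:
        torch.cuda.manual_seed_all(seed)
        # Without this, cuDNN may select non-deterministic convolution
        # algorithms, so identical seeds can still diverge on GPU.
        torch.backends.cudnn.deterministic = True
        # Disable the autotuner so the kernel choice itself is fixed.
        torch.backends.cudnn.benchmark = False
```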

====== parameters ========
  batch_size: 64
  do_eval: True
  do_eval_each_epoch: True
  epochs: 10
  log_interval: 10
  lr: 0.01
  momentum: 0.5
  no_cuda: False
  save_model: False
  seed: 42
  test_batch_size: 1000
==========================
Test set: Average loss: 0.1034, Accuracy: 9679/10000 (97%)
Test set: Average loss: 0.0615, Accuracy: 9804/10000 (98%)
Test set: Average loss: 0.0484, Accuracy: 9847/10000 (98%)
Test set: Average loss: 0.0361, Accuracy: 9888/10000 (99%)
Test set: Average loss: 0.0341, Accuracy: 9887/10000 (99%)
Test set: Average loss: 0.0380, Accuracy: 9877/10000 (99%)
Test set: Average loss: 0.0302, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0315, Accuracy: 9884/10000 (99%)
Test set: Average loss: 0.0283, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0266, Accuracy: 9907/10000 (99%)  -> epoch 10


====== parameters ========
  batch_size: 64
  do_eval: True
  do_eval_each_epoch: True
  epochs: 20
  log_interval: 10
  lr: 0.01
  momentum: 0.5
  no_cuda: False
  save_model: False
  seed: 42
  test_batch_size: 1000
==========================
Test set: Average loss: 0.1034, Accuracy: 9679/10000 (97%)
Test set: Average loss: 0.0615, Accuracy: 9804/10000 (98%)
Test set: Average loss: 0.0484, Accuracy: 9847/10000 (98%)
Test set: Average loss: 0.0361, Accuracy: 9888/10000 (99%)
Test set: Average loss: 0.0341, Accuracy: 9887/10000 (99%)
Test set: Average loss: 0.0380, Accuracy: 9877/10000 (99%)
Test set: Average loss: 0.0302, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0315, Accuracy: 9884/10000 (99%)
Test set: Average loss: 0.0283, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0266, Accuracy: 9907/10000 (99%) -> epoch 10
Test set: Average loss: 0.0373, Accuracy: 9870/10000 (99%)
Test set: Average loss: 0.0286, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0309, Accuracy: 9908/10000 (99%)
Test set: Average loss: 0.0302, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0261, Accuracy: 9907/10000 (99%)
Test set: Average loss: 0.0258, Accuracy: 9913/10000 (99%)
Test set: Average loss: 0.0288, Accuracy: 9917/10000 (99%)
Test set: Average loss: 0.0280, Accuracy: 9904/10000 (99%)
Test set: Average loss: 0.0294, Accuracy: 9902/10000 (99%)
Test set: Average loss: 0.0257, Accuracy: 9914/10000 (99%) -> epoch 20

However, when you change the script to run only a single final evaluation after epoch 10 (do_eval_each_epoch: False), the result becomes:

====== parameters ========
  batch_size: 64
  do_eval: True
  do_eval_each_epoch: False
  epochs: 10
  log_interval: 10
  lr: 0.01
  momentum: 0.5
  no_cuda: False
  save_model: False
  seed: 42
  test_batch_size: 1000
==========================
Test set: Average loss: 0.0361, Accuracy: 9885/10000 (99%) -> epoch 10

Repeatability and consistent results are crucial in machine learning. Does anyone know the reason for this strange behavior?
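One way to narrow this down (a hypothetical debugging aid, not part of the example) is to check whether the evaluation pass advances the global RNG state; if it does, every interleaved eval shifts the random numbers that later training epochs see, which would explain the divergence:

```python
import torch


def rng_state_changed(fn):
    """Run fn() and report whether the global torch RNG state changed."""
    before = torch.get_rng_state().clone()
    fn()
    after = torch.get_rng_state()
    return not torch.equal(before, after)
```

Wrapping the example's `test(...)` call with this helper would show whether evaluation itself consumes random numbers.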

Attached is the code for your convenience:
PyTorch_mnist_example.zip