While training is progressing, the test accuracy barely changes and stays almost steady. However, when I save a checkpoint and resume training from it, the test accuracy jumps by a significant margin.
I wonder if anyone knows what is going on and could provide an explanation for such seemingly random behavior. @ptrblck
It’s hard to understand the situation in detail without any code; are you referring to the test accuracy computed on a separate test dataset?
It might be useful to verify that the test accuracy is consistent directly before saving a checkpoint vs. directly after loading the same checkpoint (without any additional training in between).
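A minimal sketch of such a consistency check might look like this. Note that `MyModel`, `test_loader`, and the `evaluate` helper are placeholders for your own model class, dataloader, and evaluation routine, since your actual code wasn't posted:

```python
import torch

def evaluate(model, loader, device="cpu"):
    # important: eval() disables dropout and uses running batchnorm stats;
    # forgetting this call is a common cause of accuracy jumps
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    return correct / total

# accuracy right before saving the checkpoint
acc_before = evaluate(model, test_loader)
torch.save(model.state_dict(), "checkpoint.pth")

# restore into a freshly constructed model and re-evaluate
model_restored = MyModel()  # placeholder for your model class
model_restored.load_state_dict(torch.load("checkpoint.pth"))
acc_after = evaluate(model_restored, test_loader)

print(f"before save: {acc_before:.4f}, after load: {acc_after:.4f}")
```

The two numbers should match exactly. If they don't, the discrepancy points to the checkpointing logic itself (e.g. a missing `model.eval()` call, buffers or optimizer state that aren't being saved, or a different preprocessing path after restoring) rather than to the training run.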