Training from checkpoint seems to be the same as training from epoch 0

Captaindoggo · May 3, 2021, 3:40pm

So I trained a GAN model for 400 epochs and saved G, D, and both optimizers as state_dicts. After loading the model from this checkpoint I evaluated it, and it showed the same score as during the first training session (~0.77; the closer to 0.5 the better). However, when I continue training from this checkpoint, the loss values are in a completely different range than at the end of the first training session, as if the model were training from random initialisation, and the score also goes back to ~0.97. I also checked the learning rates, and the values are exactly what they should be.

ptrblck · May 4, 2021, 6:25am

These issues are often caused either by a failure in saving the model or by a change in the data loading.
To check the first aspect, you could compare the outputs of the model (in .eval() mode) for a constant input (e.g. torch.ones) before saving and after loading. If the outputs differ by more than the expected floating point precision, then saving or loading the parameters and buffers is failing.
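A minimal sketch of that check, assuming a toy generator in place of the real G (make_generator, the layer sizes, and the learning rate are all hypothetical placeholders, and an in-memory buffer stands in for the checkpoint file):

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)


def make_generator():
    # Hypothetical stand-in for the real generator architecture.
    return nn.Sequential(
        nn.Linear(16, 32),
        nn.BatchNorm1d(32),  # carries buffers (running_mean / running_var)
        nn.ReLU(),
        nn.Linear(32, 8),
    )


G = make_generator()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

# A few dummy training steps so the BatchNorm buffers and Adam state are non-trivial.
for _ in range(3):
    opt_G.zero_grad()
    G(torch.randn(4, 16)).pow(2).mean().backward()
    opt_G.step()

fixed_input = torch.ones(4, 16)

# Reference output before saving, in eval mode.
G.eval()
with torch.no_grad():
    out_before = G(fixed_input)

# Save to an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
torch.save({"G": G.state_dict(), "opt_G": opt_G.state_dict()}, buffer)
buffer.seek(0)

# Restore into freshly constructed objects, as a resumed script would.
checkpoint = torch.load(buffer)
G_loaded = make_generator()
G_loaded.load_state_dict(checkpoint["G"])
opt_loaded = torch.optim.Adam(G_loaded.parameters(), lr=2e-4)
opt_loaded.load_state_dict(checkpoint["opt_G"])

G_loaded.eval()
with torch.no_grad():
    out_after = G_loaded(fixed_input)

max_abs_diff = (out_before - out_after).abs().max().item()
print(f"max abs difference: {max_abs_diff:.3e}")
```

The same comparison can be repeated for D. If the difference is well above floating point noise, a likely place to look is whether load_state_dict is actually called on the objects that are later used for training.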
On the other hand, if the outputs are equal, you could check the data loading pipeline and make sure the data is processed in the same way.
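One way to check the data side is to summarize the first batch in both runs and compare the numbers. The sketch below is only an illustration under assumed conditions: the random tensor, the normalize helper, and the "forgotten normalization" in the second loader are hypothetical and not taken from the original code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


def batch_stats(loader):
    """Summarize the first batch so two pipelines can be compared."""
    (x,) = next(iter(loader))
    return {
        "shape": tuple(x.shape),
        "min": x.min().item(),
        "max": x.max().item(),
        "mean": x.mean().item(),
        "std": x.std().item(),
    }


def normalize(x):
    # Hypothetical preprocessing: map [0, 1] images to [-1, 1].
    return (x - 0.5) / 0.5


raw = torch.rand(64, 3, 8, 8)  # placeholder "images" in [0, 1)

# Pipeline as used in the first training session (assumed).
loader_first = DataLoader(TensorDataset(normalize(raw)), batch_size=16, shuffle=False)
# Pipeline in the resumed session, with the normalization accidentally dropped.
loader_resumed = DataLoader(TensorDataset(raw), batch_size=16, shuffle=False)

stats_first = batch_stats(loader_first)
stats_resumed = batch_stats(loader_resumed)

# Collect every statistic that differs by more than a small tolerance.
mismatch = {
    key
    for key in stats_first
    if key != "shape" and abs(stats_first[key] - stats_resumed[key]) > 1e-6
}
print("first:  ", stats_first)
print("resumed:", stats_resumed)
print("mismatching stats:", sorted(mismatch))
```

In a real setup the statistics from the first session could be stored next to the checkpoint, so the resumed run can compare against them directly.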