Until now I didn’t bother to restore the previous state of my Adam Optimizer to continue training.
In fact, I observed that by doing this, a new best-performing model is often found after this “hard” continuation.
Is there anything I am missing that makes this practice absolutely wrong?
Well, restoring the optimizer state and continuing is equivalent to doing a single, larger training run.
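For reference, here is a minimal sketch (in PyTorch, with illustrative variable names) of checkpointing and restoring the optimizer state alongside the model, so that the continued run behaves like one longer run:

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on toy data so the optimizer has accumulated statistics.
x = torch.randn(8, 4)
y = torch.randn(8, 1)
torch.nn.MSELoss()(model(x), y).backward()
opt.step()

# Checkpoint both model and optimizer.
ckpt = {"model": model.state_dict(), "optim": opt.state_dict()}
torch.save(ckpt, "checkpoint.pt")

# Later: rebuild and restore, so continuation matches a single longer run.
model2 = torch.nn.Linear(4, 1)
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
ckpt = torch.load("checkpoint.pt")
model2.load_state_dict(ckpt["model"])
opt2.load_state_dict(ckpt["optim"])  # restores exp_avg, exp_avg_sq, step
```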
What happens is that during the first few steps, the statistics gathered by the optimizer are still “rubbish”, so you will take steps of a more or less uncontrolled size (they could be more precise, I guess). At the beginning of training, this has bothered some people enough to run a few steps with a learning rate of 0, or to re-initialize after a few steps (i.e. just update the statistics first), even if I cannot find the reference at the moment.
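The “learning rate of 0 at the beginning” trick can be sketched roughly like this (PyTorch, toy model and hyperparameters of my own choosing): Adam still updates its running moment estimates on every `step()`, while a zero learning rate keeps the parameters untouched.

```python
import torch

model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
base_lr = 1e-3
opt = torch.optim.Adam(model.parameters(), lr=base_lr)

warmup_steps = 5  # illustrative; the thread does not name a number
for step in range(100):
    # During the first few steps, set the learning rate to 0 so Adam
    # only accumulates exp_avg / exp_avg_sq without taking updates
    # of uncontrolled size.
    for group in opt.param_groups:
        group["lr"] = 0.0 if step < warmup_steps else base_lr

    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()  # with lr=0 this still updates the moment statistics
```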
So in a way, your method of re-starting Adam amounts to “take a step in a lucky direction”. You might try to get a similar effect more systematically by temporarily increasing the learning rate or some such.