Today I found some unexpected behavior when using the optimizer's load_state_dict(). Suppose branch A is a classifier for task A and branch B is a classifier for task B. Initially, total_loss = loss_A + loss_B, and after training for some time I save a checkpoint of the model and optimizer states. Then I resume training, but only on task A, so total_loss = loss_A.

If I call load_state_dict() on the optimizer, I find that even though branch B's parameter gradients are indeed all zeros, its parameters are still updated after optim.step(), as if branch A's gradients were somehow passed to branch B, although the two branches are completely independent. If I don't call load_state_dict() on the optimizer, branch B's parameters are not updated, as expected.

I can't figure out why this happens. Has anyone encountered this issue before? Does the optimizer's load_state_dict() memorize some gradient-dependency history that might cause this tricky behavior?
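For reference, here is a minimal sketch of the setup I mean (module names and sizes are made up for illustration; note I pass set_to_none=False to zero_grad() so branch B really has zero gradients rather than None):

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Two independent branches, standing in for the task-A and task-B classifiers.
branch_a = nn.Linear(4, 2)
branch_b = nn.Linear(4, 2)
params = list(branch_a.parameters()) + list(branch_b.parameters())
opt = optim.Adam(params, lr=0.1)

x = torch.randn(8, 4)

# Phase 1: train on both losses so the optimizer accumulates state for B.
for _ in range(3):
    opt.zero_grad()
    loss = branch_a(x).sum() + branch_b(x).sum()  # total_loss = loss_A + loss_B
    loss.backward()
    opt.step()

ckpt = opt.state_dict()  # checkpoint the optimizer state

# Phase 2: resume with a fresh optimizer, load the saved state, train only on task A.
opt2 = optim.Adam(params, lr=0.1)
opt2.load_state_dict(ckpt)

before = branch_b.weight.detach().clone()
opt2.zero_grad(set_to_none=False)   # fills all .grad tensors with zeros
branch_a(x).sum().backward()        # total_loss = loss_A; branch B's grads stay 0
opt2.step()

# branch_b.weight has moved even though its gradient is exactly zero.
moved = not torch.allclose(before, branch_b.weight)
```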
Depending on the optimizer you used, it might store some internal states, such as adaptive moments.
Since branch B was trained before, these states (momentum etc.) might still update its parameters even if the current gradients are zero.
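You can see the effect directly in the update formula. A minimal plain-Python sketch of a single Adam step (default betas, with bias correction; the restored moment values are made up):

```python
import math

def adam_step(p, exp_avg, exp_avg_sq, grad, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter p."""
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad          # first moment
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad**2  # second moment
    m_hat = exp_avg / (1 - beta1 ** step)                   # bias correction
    v_hat = exp_avg_sq / (1 - beta2 ** step)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), exp_avg, exp_avg_sq

# Moments left over from earlier training, restored via load_state_dict():
p, m, v = 1.0, 0.5, 0.25
p_new, m, v = adam_step(p, m, v, grad=0.0, step=100)
# p_new != p even though grad == 0: the first moment only decays by beta1
# per step, so it keeps driving updates for a while.
```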
Which optimizer did you use?
Oh right, I was using Adam. Thanks @ptrblck!