How important is saving the state dict of the optimizer?

I am training a language model, but due to time constraints I need to stop training and restart it later.

I am aware of the right way to do it, with checkpointing.

How big is the impact of not saving the optimizer's state? At the moment the model seems to start learning from scratch again when training resumes. Is this due to not loading the optimizer's state, or might it be a bug in how the model itself is loaded?

```
import copy

import torch

initial_dict = copy.deepcopy(model.state_dict())

if args.model_path is not None:
    print("--------USING PRETRAINED MODEL-----------", file=output)
    # strict=False because the model might have more or fewer heads than the pretrained one;
    # normally we are only interested in the LM part with the cloze head
    model.load_state_dict(torch.load(args.model_path), strict=False)

# Report which parameters were actually overwritten by the pretrained weights
for k, v in model.state_dict().items():
    if k in initial_dict and not torch.equal(initial_dict[k], v):
        print(f"Updated {k} using pretrained model", file=output)
    else:
        print(f"Leaving {k} untouched", file=output)
```

Optimizers and schedulers keep running statistics that affect a resumed run: Adam, for example, tracks first- and second-moment estimates per parameter, and a scheduler tracks its step count. Dropping that state can make the first steps after resuming behave quite differently, although it shouldn't reset the model to scratch on its own. It's hard to say more without seeing the full training loop.
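
A minimal checkpointing sketch, assuming a standard PyTorch setup; the `model`, `optimizer`, `scheduler`, and `epoch` names are placeholders for whatever your script actually uses:

```
import torch

# Save a full checkpoint: model weights plus optimizer and scheduler state
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "scheduler_state_dict": scheduler.state_dict(),
    },
    "checkpoint.pt",
)

# Resume: restore everything before continuing the training loop
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
scheduler.load_state_dict(checkpoint["scheduler_state_dict"])
start_epoch = checkpoint["epoch"] + 1
```

To rule out a bug in the model loading itself, note that `load_state_dict(..., strict=False)` returns the missing and unexpected keys; printing those alongside your tensor comparison should tell you whether the pretrained weights were actually applied.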