Loading saved Amp models into ensemble

I have separately trained several models with amp in FP16 and saved the state dicts for both the models and amp.

Now, as a continuation of that training, I would like to load those models into an ensemble, freeze all their gradients, and train a new model with a few final layers that learns from the outputs of those models plus several additional inputs.

My current code looks something like this:

modelA = ModelA()
modelB = ModelB()

# freeze the pretrained models
for a in modelA.parameters():
    a.requires_grad = False
for b in modelB.parameters():
    b.requires_grad = False

ensemble = EnsembleModel(modelA, modelB)
optimizer = FusedAdam(filter(lambda p: p.requires_grad, ensemble.parameters()), lr=learning_rate)

ensemble, optimizer = amp.initialize(ensemble, optimizer, opt_level=opt_level)

***perform training***

Now to my questions:

1. In the checkpoints for modelA and modelB I also saved the amp state dict. Should those be loaded in this case, and how?
2. Does amp.initialize support models embedded inside an ensemble?
3. If I want to unfreeze the lower models at a later point, do I need to reinitialize both the optimizer and amp?

This wouldn't be necessary if you are not planning to finetune the pretrained models.
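If you do unfreeze the lower models later, you don't have to rebuild everything; one option is to flip requires_grad back on and register those parameters with the existing optimizer via add_param_group. A minimal sketch with stand-in Linear modules (the layer sizes and learning rates are arbitrary):

```python
import torch

# stand-ins for the pretrained submodel and the new head
frozen = torch.nn.Linear(4, 4)
for p in frozen.parameters():
    p.requires_grad = False

head = torch.nn.Linear(4, 2)

# the optimizer initially only sees the trainable head
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# later: unfreeze the lower model and add its parameters as a new param group,
# optionally with a smaller learning rate for finetuning
for p in frozen.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": frozen.parameters(), "lr": 1e-4})
```

Since the optimizer keeps its existing state for the head's parameters, no reinitialization is needed.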

We recommend using the native amp implementation via torch.cuda.amp instead of apex/amp. Since there is no amp.initialize method, it should just work, but let us know if you encounter any issues.

The amp state_dict stores the loss scaler values etc., which is useful for continuing the training with the same setup. If you initialize a new scaler, the first iterations might be skipped if the scaling factor is too high, which wouldn't necessarily happen if you reuse the scaling factor from the last iteration of the pretraining.
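In native amp the same idea applies to the GradScaler: save its state_dict alongside the model and optimizer, and restore it when resuming. A sketch (the checkpoint filename is illustrative):

```python
import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler(enabled=torch.cuda.is_available())

# ... train, then checkpoint the scaler together with model/optimizer states
checkpoint = {"scaler": scaler.state_dict()}
torch.save(checkpoint, "amp_checkpoint.pt")

# later, in the new run: restore the scale factor from the previous training
new_scaler = GradScaler(enabled=torch.cuda.is_available())
new_scaler.load_state_dict(torch.load("amp_checkpoint.pt")["scaler"])
```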

I see, thanks for the help!