Is there any documentation, or are there special considerations, when switching between Amp and non-Amp (in either direction) for both training and inference?
The documentation states the following, but the effects it would have on the model are still not clear:
" If a checkpoint was created from a run without Amp, and you want to resume training with Amp, load model and optimizer states from the checkpoint as usual. The checkpoint won’t contain a saved scaler state, so use a fresh instance of
GradScaler
.If a checkpoint was created from a run with Amp and you want to resume training without Amp, load model and optimizer states from the checkpoint as usual, and ignore the saved scaler state."