I’m using automatic mixed precision with torch.cuda.amp.GradScaler() to train my model.
I am trying to build a training pipeline where I can stop/resume any training.
For that, at each epoch I save:
- the model state dict
- the optimizer state dict
- the GradScaler state dict
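Roughly like this; a minimal sketch of the saving step (the save_checkpoint helper, the file path and the 'epoch'/'model_state_dict' keys are just illustrative, the optimizer/scaler keys are the ones used in the loading code below):

import torch

def save_checkpoint(path, epoch, model, optimizer, scaler):
    # Sketch of the per-epoch checkpoint: model, optimizer and scaler state dicts
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scaler_state_dict': scaler.state_dict(),
    }, path)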
To resume the training I load the above state dicts for the model, optimizer and scaler. The following code highlights the optimizer state dict loading part:
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scaler.load_state_dict(checkpoint['scaler_state_dict'])

# I tried to move the optimizer state to CUDA, but I get the same error with or without this
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
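For completeness, the checkpoint itself is loaded and the model restored just before that snippet, roughly like this (the checkpoint_path variable and the 'model_state_dict' key are illustrative):

checkpoint = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])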
When I try to resume training, everything seems to be loaded correctly, but at the first iteration, when I step the scaler during backpropagation, I get the following error:
File ".../base_engine.py", line 97, in backprop_loss
self.scaler.step(self.optimizer)
File ".../site-packages/torch/cuda/amp/grad_scaler.py", line 318, in step
assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
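For reference, backprop_loss follows the usual GradScaler pattern; a simplified sketch (the training_step helper and the variable names are illustrative, the scaler.step(optimizer) call is the one that fails in the traceback above):

def training_step(model, criterion, optimizer, scaler, inputs, targets):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()   # backprop through the scaled loss
    scaler.step(optimizer)          # unscales grads, records the inf checks, then calls optimizer.step()
    scaler.update()                 # updates the scale factor for the next iteration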
However, when I disable the scaler when resuming training (with torch.cuda.amp.autocast(enabled=False)), it seems to work correctly.
I also tried different optimizers, such as PyTorch's SGD and Adam, as well as more custom implementations like RAdam or SGDP.
Yes, same issue without this part.
In fact, I didn’t have the state-moving part at first; I just added it to see if that could fix the error, but it doesn’t…