I’m using mixed-precision training with `torch.cuda.amp.GradScaler()` for my model.
I’m trying to build a training pipeline where I can stop and resume any training run.
For that, at each epoch I save:
- model state_dict
- optimizer state_dict
- scaler state_dict (where `scaler = torch.cuda.amp.GradScaler()`)
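Concretely, the per-epoch saving step looks roughly like this (a simplified sketch; `save_checkpoint` and the dict keys are just the names I use):

```python
import torch

def save_checkpoint(path, epoch, model, optimizer, scaler):
    # Bundle everything needed to resume: model weights, optimizer state
    # (momentum buffers etc.), and the GradScaler's scale/growth state.
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scaler_state_dict': scaler.state_dict(),
    }, path)
```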
To resume training, I load the above state dicts for the model, optimizer, and scaler. The following code highlights the optimizer and scaler state-dict loading:
```python
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scaler.load_state_dict(checkpoint['scaler_state_dict'])

# I tried to move the optimizer state to CUDA, but I get the same
# error with or without this:
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
```
When I try to resume training, everything seems to load correctly, but at the first iteration, when I step the scaler during backpropagation, I get the following error:
```
File ".../base_engine.py", line 97, in backprop_loss
    self.scaler.step(self.optimizer)
File ".../site-packages/torch/cuda/amp/grad_scaler.py", line 318, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
```
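For context, my `backprop_loss` follows the standard AMP pattern. Here is a simplified, self-contained sketch of that step (`train_step` and `loss_fn` are placeholder names, not my actual method):

```python
import torch

def train_step(model, optimizer, scaler, inputs, targets, loss_fn):
    optimizer.zero_grad()
    # Run the forward pass under autocast so eligible ops use half precision.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    # scale() multiplies the loss before backward(); step() then unscales the
    # gradients, records inf/NaN checks per device, and only applies the
    # optimizer update if the gradients are finite; update() adjusts the scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```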
However, when I disable the scaler on resume (with `torch.cuda.amp.autocast(enabled=False)`), training seems to work correctly.
I also tried different optimizers, such as PyTorch’s SGD and Adam, as well as more custom implementations like RAdam and SGDP.
Any idea how I could solve this?