Resume training with mixed precision leads to "No inf checks were recorded for this optimizer"

I’m using mixed (half) precision with torch.cuda.amp.GradScaler() for the training of my model.
I’m trying to build a training pipeline where I can stop and resume any training run.
For that, at each epoch I save the following (sketched below):

  • model state_dict
  • optimizer state_dict
  • scaler state_dict (scaler = torch.cuda.amp.GradScaler())

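The per-epoch save looks roughly like this (checkpoint_path is just a placeholder for wherever I write the file):

checkpoint = {
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scaler_state_dict': scaler.state_dict(),  # scaler = torch.cuda.amp.GradScaler()
}
torch.save(checkpoint, checkpoint_path)
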
To resume training, I load the above state dicts for the model, optimizer, and scaler. The following code highlights the optimizer and scaler state dict loading:

optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
scaler.load_state_dict(checkpoint['scaler_state_dict'])
# I also try to move the optimizer state to CUDA, but I get the same error with or without this
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

When I try to resume training, everything seems to load correctly, but at the first iteration, when I step the scaler during backpropagation, I get the following error:

  File ".../base_engine.py", line 97, in backprop_loss
    self.scaler.step(self.optimizer)
  File ".../site-packages/torch/cuda/amp/grad_scaler.py", line 318, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

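For reference, backprop_loss is meant to follow the standard GradScaler sequence; a simplified sketch of the relevant lines (the actual loss computation happens elsewhere in the engine):

self.scaler.scale(loss).backward()  # scaled backward pass
self.scaler.step(self.optimizer)    # line 97, the call that triggers the assertion
self.scaler.update()
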
However, when I disable mixed precision on resume (with torch.cuda.amp.autocast(enabled=False)), training seems to work correctly.
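
Concretely, the workaround is just running the forward pass with autocast disabled when resuming; roughly (resuming, inputs, targets and criterion are placeholder names from my pipeline):

with torch.cuda.amp.autocast(enabled=not resuming):
    output = model(inputs)
    loss = criterion(output, targets)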

I also tried different optimizers, such as PyTorch's SGD and Adam, as well as more custom implementations like RAdam and SGDP.

Any idea how I could solve this?

Are you seeing the same issue without moving the states to another device?
Could you also explain why this code is needed?

Yes, same issue without this part.
In fact, I didn’t have the moving-states part at first; I just tried it to see if that could fix the error, but it doesn’t…

I cannot reproduce this issue using:

import torch
import torchvision.models as models

model = models.resnet18().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(10):
    optimizer.zero_grad()
    
    with torch.cuda.amp.autocast():
        out = model(torch.randn(1, 3, 224, 224).cuda())
        loss = out.mean()
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'scaler': scaler.state_dict()
}
torch.save(checkpoint, 'tmp.pt')

# load
cp = torch.load('tmp.pt')
model = models.resnet18().cuda()
model.load_state_dict(cp['model'])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
optimizer.load_state_dict(cp['optimizer'])
scaler = torch.cuda.amp.GradScaler()
scaler.load_state_dict(cp['scaler'])

for epoch in range(10):
    optimizer.zero_grad()
    
    with torch.cuda.amp.autocast():
        out = model(torch.randn(1, 3, 224, 224).cuda())
        loss = out.mean()
    
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print('epoch {}, loss {}'.format(epoch, loss.item()))

Could you post an executable code snippet that reproduces this error?

I am having a similar issue. Could you check it out? https://discuss.pytorch.org/t/using-amp-with-cuda-graphs-and-ddp-scaler-error/163671