Training diverges if torch.autocast is not used

I was training a face recognition baseline model without changing any hyperparameters. After training for 100 epochs, the equal error rate (EER) on a validation set was 29.61567.

With the same hyperparameters but using the torch.autocast() context manager, the exact same model reaches an EER of 13.34276 in just 15 epochs.

with autocast():
    loss = self.model(data.cuda(), label.cuda())

torch.cuda.amp.GradScaler().scale(loss).backward()
torch.cuda.amp.GradScaler().step(self.optimizer)
torch.cuda.amp.GradScaler().update()

I have repeated the experiment several times, and the results are consistent.

What exactly does torch.autocast() do?

torch.cuda.amp.autocast() is a mixed-precision training utility and allows for op-specific dtype casts to speed up training while maintaining accuracy, as described in the docs. If used properly, it should neither improve nor decrease the model accuracy.
Based on your code snippet, you are recreating the GradScaler in each forward pass, which is also wrong; take a look at these examples to see how amp should be used.
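For reference, here is a minimal sketch of the usual amp setup, with a single GradScaler created once before the training loop; model, optimizer, and loader are placeholder names, not your code:

import torch
from torch.cuda.amp import autocast, GradScaler

# Create the scaler once, outside the training loop.
scaler = GradScaler()

for data, label in loader:
    optimizer.zero_grad()

    # Inside autocast, eligible ops (e.g. matmuls, convolutions) run in
    # float16, while ops that need the extra range stay in float32.
    with autocast():
        loss = model(data.cuda(), label.cuda())

    # Scale the loss to avoid gradient underflow in float16, then step
    # the optimizer and update the scale factor for the next iteration.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()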

You are right about recreating the GradScaler. It is actually initialized only once; I only wrote it that way to make the steps explicit in the snippet, which is why it looks awfully wrong.

class Net(nn.Module):
    ...
    self.scaler = GradScaler()  # created once, in __init__
    ...
    def forward(self, data, label, **kwargs):
        ...
        with autocast():
            loss = self.model(data.cuda(), label.cuda())
        self.scaler.scale(loss).backward()
        self.scaler.step(self.optimizer)
        self.scaler.update()
        ...

However, the accuracy is still very low if I don't use mixed-precision training.