Loss function precision with AMP

In the Transfer Learning tutorial, to correctly average the loss over an epoch (accounting for the last batch possibly being smaller than the others), a running loss weighted by the batch size is accumulated:

running_loss += loss.item() * inputs.size(0)
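
Since each term is weighted by inputs.size(0), the epoch average comes from dividing by the number of samples rather than the number of batches. A minimal sketch, assuming a standard DataLoader whose dataset defines __len__:

# per-sample average loss at the end of the epoch
epoch_loss = running_loss / len(dataloader.dataset)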

I would like to keep this approach even when using autocast and grad_scaler. In this case:

criterion = torch.nn.CrossEntropyLoss()

with autocast():
    outputs = model(batch_inputs)
    loss = criterion(outputs, batch_labels)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

what happens to loss?

Should it be unscaled before running_loss += loss.item() * inputs.size(0) and reported that way? Also, where in the code would that be implemented?

Thanks!

Code sample:

import torch
import torch.nn as nn


def train_one_epoch(dataloader, model, optimizer, scaler, device):
    model.train()
    criterion = nn.CrossEntropyLoss()
    running_loss = 0.0

    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = criterion(outputs, labels)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # is this correct or do we need unscaling somewhere?
        running_loss += loss.item() * inputs.size(0)

    return running_loss

You could unscale the loss by multiplying it with inv_scale = 1./scaler.get_scale() before calling scaler.update().
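
To illustrate, if you only kept the scaled loss tensor around, a minimal sketch of that suggestion could look like this (the names scaled_loss and inv_scale are illustrative, not part of the GradScaler API):

scaled_loss = scaler.scale(loss)
scaled_loss.backward()
scaler.step(optimizer)

# recover the unscaled value before scaler.update() may change the scale factor
inv_scale = 1. / scaler.get_scale()
running_loss += (scaled_loss.item() * inv_scale) * inputs.size(0)

scaler.update()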


Can’t I just use loss before scaling it?

Also, the torch docs say that cross-entropy is computed in fp32 anyway. Does that mean the AMP benefits do not apply when calculating the loss this way?

Yes, using the loss before scaling should also work.

Also yes, nn.CrossEntropyLoss needs the higher precision of float32 for numerical stability.
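
A quick way to see both points is to inspect the dtypes inside the autocast region; this is a minimal sketch assuming a CUDA device is available:

import torch
import torch.nn as nn

model = nn.Linear(10, 3).cuda()
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(4, 10, device="cuda")
labels = torch.randint(0, 3, (4,), device="cuda")

with torch.cuda.amp.autocast():
    outputs = model(inputs)            # the linear layer runs in float16 under autocast
    loss = criterion(outputs, labels)  # cross entropy is autocast to float32

print(outputs.dtype)  # torch.float16
print(loss.dtype)     # torch.float32 -- safe to use loss.item() directly, before scaling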
