In the Transfer Learning tutorial, to keep the reported loss exact even when the last batch is smaller than the others, the loss is accumulated as a running sum weighted by batch size:
running_loss += loss.item() * inputs.size(0)
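For reference, the full accumulation pattern looks roughly like this (a sketch; it assumes model and dataloader are already defined, and using len(dataloader.dataset) assumes a map-style dataset):

import torch
import torch.nn.functional as F

running_loss = 0.0
for inputs, labels in dataloader:
    outputs = model(inputs)
    loss = F.cross_entropy(outputs, labels)  # mean-reduced by default
    # weight each batch's mean loss by the actual batch size
    running_loss += loss.item() * inputs.size(0)
# dividing by the total sample count gives the exact epoch mean,
# even when the last batch is smaller
epoch_loss = running_loss / len(dataloader.dataset)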
I would like to keep this approach even when using autocast and GradScaler. In this case:
with autocast():
    outputs = model(batch_inputs)
    loss = torch.nn.functional.cross_entropy(outputs, batch_labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
What happens to loss here? Should it be unscaled before running_loss += loss.item() * inputs.size(0) and reported that way? And where in the loop would that logging code go?
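To make the question concrete, here is the kind of sanity probe I have in mind (just a check, not production code; it reuses model, batch_inputs, batch_labels, and scaler from above):

from torch.cuda.amp import autocast
import torch.nn.functional as F

with autocast():
    outputs = model(batch_inputs)
    loss = F.cross_entropy(outputs, batch_labels)

scaled_loss = scaler.scale(loss)  # scale() returns a new, scaled tensor
print(loss.item())                # the value I intend to log
print(scaled_loss.item())         # loss multiplied by the current scale factor
print(scaler.get_scale())         # the scale factor itself

And here is the full training loop where that logging would live: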
import torch
import torch.nn as nn

def train_one_epoch(dataloader, model, optimizer, scaler, device):
    model.train()
    running_loss = 0.0
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = nn.functional.cross_entropy(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        # is this correct, or do we need unscaling somewhere?
        running_loss += loss.item() * inputs.size(0)
    return running_loss
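For completeness, this is how I would call it and turn the sum into an epoch loss (train_loader is my placeholder name; len(...) again assumes a map-style dataset):

epoch_sum = train_one_epoch(train_loader, model, optimizer, scaler, device)
epoch_loss = epoch_sum / len(train_loader.dataset)  # same normalization as above
print(f"epoch loss: {epoch_loss:.4f}")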
Also, the torch docs say that cross-entropy is computed in fp32 under autocast anyway. Does that mean the AMP benefits do not apply to the loss calculation itself?
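To illustrate what I mean, a quick dtype probe (my own check; model and data as above, on a CUDA device):

with torch.cuda.amp.autocast():
    outputs = model(inputs)  # matmul/conv-heavy ops run in float16 here
    loss = nn.functional.cross_entropy(outputs, labels)
print(outputs.dtype)  # torch.float16
print(loss.dtype)     # torch.float32, since the loss is computed in fp32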