In the Transfer Learning tutorial, to keep the reported loss exact even when the last batch is smaller than the others, the loss is accumulated as a running sum weighted by batch size:
running_loss += loss.item() * inputs.size(0)
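For reference, the full accumulation pattern looks roughly like this (a sketch; it assumes model and dataloader are already defined, and using len(dataloader.dataset) assumes a map-style dataset):

import torch
import torch.nn.functional as F

running_loss = 0.0
for inputs, labels in dataloader:
    outputs = model(inputs)
    loss = F.cross_entropy(outputs, labels)  # mean-reduced by default
    # weight each batch's mean loss by the actual batch size
    running_loss += loss.item() * inputs.size(0)
# dividing by the total sample count gives the exact epoch mean,
# even when the last batch is smaller
epoch_loss = running_loss / len(dataloader.dataset)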
I would like to keep this approach even when using autocast and GradScaler. In this case:
with autocast():
    outputs = model(batch_inputs)
    loss = torch.nn.functional.cross_entropy(outputs, batch_labels)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
What happens to loss here? Should it be unscaled before running_loss += loss.item() * inputs.size(0) and reported that way? And where in the loop would that logging code go?
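To make the question concrete, here is the kind of sanity probe I have in mind (just a check, not production code; it reuses model, batch_inputs, batch_labels, and scaler from above):

from torch.cuda.amp import autocast
import torch.nn.functional as F

with autocast():
    outputs = model(batch_inputs)
    loss = F.cross_entropy(outputs, batch_labels)

scaled_loss = scaler.scale(loss)  # scale() returns a new, scaled tensor
print(loss.item())                # the value I intend to log
print(scaled_loss.item())         # loss multiplied by the current scale factor
print(scaler.get_scale())         # the scale factor itself

And here is the full training loop where that logging would live: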
import torch
import torch.nn as nn

def train_one_epoch(dataloader, model, optimizer, scaler, device):
    model.train()
    running_loss = 0.0
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(inputs)
            loss = nn.functional.cross_entropy(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        # is this correct, or do we need unscaling somewhere?
        running_loss += loss.item() * inputs.size(0)
    return running_loss
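For completeness, this is how I would call it and turn the sum into an epoch loss (train_loader is my placeholder name; len(...) again assumes a map-style dataset):

epoch_sum = train_one_epoch(train_loader, model, optimizer, scaler, device)
epoch_loss = epoch_sum / len(train_loader.dataset)  # same normalization as above
print(f"epoch loss: {epoch_loss:.4f}")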
Also, the torch docs say that cross-entropy is computed in fp32 under autocast anyway. Does that mean the AMP benefits do not apply to the loss calculation itself?
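To illustrate what I mean, a quick dtype probe (my own check; model and data as above, on a CUDA device):

with torch.cuda.amp.autocast():
    outputs = model(inputs)  # matmul/conv-heavy ops run in float16 here
    loss = nn.functional.cross_entropy(outputs, labels)
print(outputs.dtype)  # torch.float16
print(loss.dtype)     # torch.float32, since the loss is computed in fp32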