On running loss and average loss

oat · January 4, 2021, 2:52pm

[ 1. Context ]
I’m following Udacity’s tutorial Intro to Deep Learning with PyTorch. In Lesson 5 on Convolutional Neural Networks by Cezanne Camacho (Step 10, Training the Network), Cezanne uses the code quoted below to calculate the training loss and validation loss.

[ 2. Question ]
Although Cezanne explains it in the video, I’m still not clear why she uses train_loss += loss.item()*data.size(0) to accumulate the total training loss and train_loss = train_loss/len(train_loader.sampler) to compute the average training loss, with criterion being nn.CrossEntropyLoss(). In the earlier Fashion-MNIST tutorial in the same series, these steps were coded as running_loss += loss.item() and running_loss/len(trainloader), with criterion being nn.NLLLoss().

[ 3. My understanding ]

Andrew Ng distinguishes between “loss” (the difference between prediction and target for a single sample) and “cost” (the loss averaged over the entire sample set); I use both terms loosely below.
loss.item() is the per-sample cross-entropy, i.e. -sum of target*log(prediction) over the classes, averaged across all training examples of the current batch (the default reduction of nn.CrossEntropyLoss is 'mean').
Therefore, loss.item()*data.size(0) is the “total loss of the current batch (not averaged)”.
And train_loss accumulates these “total loss per batch” values over the entire epoch, i.e. it is the “total loss of the current epoch”.
Finally, train_loss = train_loss/len(train_loader.sampler) calculates the “loss averaged across all training examples for the current epoch”.
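
Written as a formula, the steps above amount to the following. This is only a sketch in my own notation: n_b is the number of samples in batch b, \ell_{b,i} is the loss of sample i in batch b, N is len(train_loader.sampler), and I assume the criterion uses the default 'mean' reduction and that the loader keeps the last partial batch.

```latex
\begin{align*}
L_b &= \frac{1}{n_b} \sum_{i=1}^{n_b} \ell_{b,i}
  && \text{value of loss.item() for batch } b \\
\text{train\_loss} &= \sum_{b} n_b \, L_b = \sum_{b} \sum_{i=1}^{n_b} \ell_{b,i}
  && \text{total loss of the epoch} \\
\frac{\text{train\_loss}}{N} &= \frac{1}{N} \sum_{b} \sum_{i=1}^{n_b} \ell_{b,i},
  \quad N = \sum_{b} n_b
  && \text{average loss per sample}
\end{align*}
```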

May I ask if the interpretation above is correct?

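import numpy as np
import torch
import torch.nn as nn

# model, train_loader and valid_loader are assumed to be defined earlier in the tutorial notebook

# specify loss function (categorical cross-entropy)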
criterion = nn.CrossEntropyLoss()

# specify optimizer (stochastic gradient descent) and learning rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# number of epochs to train the model
n_epochs = 50

# initialize tracker for minimum validation loss
valid_loss_min = np.inf # set initial "min" to infinity (np.Inf was removed in NumPy 2.0)

for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    valid_loss = 0.0
    
    ###################
    # train the model #
    ###################
    model.train() # prep model for training
    for data, target in train_loader:
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss
        train_loss += loss.item()*data.size(0)
        
    ######################    
    # validate the model #
    ######################
    model.eval() # prep model for evaluation
    with torch.no_grad(): # gradients are not needed during validation
        for data, target in valid_loader:
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the loss
            loss = criterion(output, target)
            # update running validation loss
            valid_loss += loss.item()*data.size(0)
        
    # print training/validation statistics 
    # calculate average loss over an epoch
    train_loss = train_loss/len(train_loader.sampler)
    valid_loss = valid_loss/len(valid_loader.sampler)

ptrblck · January 16, 2021, 9:11am

Yes, your explanation is correct.
The first approach (multiplying each averaged batch loss by its batch size, summing over the batches, and dividing by the number of samples) gives you the correct average sample loss for this particular epoch.
The second approach (summing the averaged batch losses and dividing by the number of batches) would yield the same result if each batch in the epoch contains batch_size samples. This is not always the case: if the length of the dataset is not divisible by batch_size, the last batch contains fewer samples but is still weighted like a full batch, so the loss calculation introduces a small bias.
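
To make the difference concrete, here is a minimal sketch in plain Python (no PyTorch needed). The per-sample losses are made-up numbers and all variable names are hypothetical; the batching just mimics a loader that keeps the last, smaller batch.

```python
# Made-up per-sample losses for a dataset of 10 samples (illustrative only).
sample_losses = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 2.0]
batch_size = 4  # 10 is not divisible by 4, so the last batch has only 2 samples

# Split into consecutive batches, keeping the smaller last batch.
batches = [sample_losses[i:i + batch_size]
           for i in range(0, len(sample_losses), batch_size)]

# What a mean-reducing criterion would return for each batch.
batch_means = [sum(b) / len(b) for b in batches]

# Approach 1: weight each batch mean by its batch size,
# then divide by the total number of samples.
weighted_avg = sum(m * len(b) for m, b in zip(batch_means, batches)) / len(sample_losses)

# Approach 2: sum the batch means and divide by the number of batches.
unweighted_avg = sum(batch_means) / len(batches)

# Reference value: plain average over all samples.
true_avg = sum(sample_losses) / len(sample_losses)

print("weighted by batch size:", weighted_avg)
print("mean of batch means:   ", unweighted_avg)
print("true per-sample mean:  ", true_avg)
```

With these numbers the first approach matches the true per-sample mean (up to floating-point rounding), while the second one is pulled towards the last batch, because its two samples count as much as four samples from a full batch.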