Loss and Accuracy Tracking

Hi everyone,

It is very common to see this scheme in the examples and tutorials (taken from the tutorial “How to train a classifier”):

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

However, I don’t understand why the loss and the accuracy are reset every epoch.
Don’t we want to see how they evolve over the entire training process? Is there any reason to do it this way?

This is my attempt, initializing all the control variables at the beginning:

import os
import time

import torch


def train_baseline(epochs, trainset, validset, model, criterion, optimizer,
                   log_file=None, save_frequency=1, validate=False):
    '''
    Train a (pre-trained) model, keeping the full loss/accuracy history
    instead of resetting it every epoch.
    '''

    print('Starting training...')
    f = open(log_file, 'w+') if log_file else None
    
    train_acc = []
    valid_acc = []
    
    train_loss = []
    valid_loss = []
    
    train_total = 0
    valid_total = 0
    
    train_correct = 0
    valid_correct = 0

    start = time.time()
    for epoch in range(1, epochs+1):
        
        for i, (images, labels) in enumerate(trainset, 1):

            # Forward pass: the DataLoader already yields tensors, so no
            # Variable wrapping or manual conversion is needed
            model.zero_grad()
            outputs = model(images)
            
            # Compute loss and accuracy
            loss = criterion(outputs, labels)
            train_loss.append(round(loss.item(), 2))

            scores, predictions = torch.max(outputs.data, 1)
            train_total += labels.size(0)
            train_correct += (predictions == labels).sum().item()
            acc = round(100 * train_correct / train_total, 2)  # percent: multiply by 100, don't divide
            train_acc.append(acc)
            
            # Backpropagation
            loss.backward()
            optimizer.step()
            
            # Log training statistics (print the current accuracy `acc`,
            # not the whole train_acc list)
            stats = 'Epoch [{}/{}], Step [{}], Loss: {:.4f}, Accuracy: {:.2f}%'.format(
                epoch, epochs, i, loss.item(), acc)
            print('\n' + stats)
            if f:
                f.write(stats + '\n')
                f.flush()  # flush() is a method call
                     
            
            # Validation step
            if validate:

                # validate is either False or an int: after how many
                # iterations we run a validation pass
                if i % validate == 0:

                    print('Entering validation...')
                    model.eval()
                    with torch.no_grad():  # no gradients needed for evaluation
                        for images, labels in validset:

                            outputs = model(images)

                            loss = criterion(outputs, labels)
                            valid_loss.append(round(loss.item(), 2))

                            scores, predictions = torch.max(outputs.data, 1)
                            valid_total += labels.size(0)
                            valid_correct += (predictions == labels).sum().item()
                    model.train()

                    acc = round(100 * valid_correct / valid_total, 2)
                    valid_acc.append(acc)
        if epoch % save_frequency == 0:
            torch.save(model.state_dict(),
                       os.path.join('./models', '%s-%d.pkl' % (model.name, epoch)))
                
    elapsed = time.time() - start
    print('Time: {:.0f} hours {:.0f} minutes'.format(elapsed // 3600, (elapsed % 3600) // 60))
    
    if f: f.close()
    train_history = {'loss': train_loss, 'accuracy': train_acc}
    valid_history = {'loss': valid_loss, 'accuracy': valid_acc} 
    return train_history, valid_history

What am I missing?

Thanks in advance.
Regards,
Pablo

In the first example, that’s the average loss over the last 2000 mini-batches. The reason it’s reset is that it’s used to track progress during training: does the loss go up or down over the next mini-batches?

Usually, the loss is orders of magnitude higher during the first mini-batches (or even epochs) when you start training, and you would likely lose that signal in all the noise if you kept a running average over all training examples across all epochs. That is, the longer you train, the harder it becomes to see what’s currently going on in the current epoch, because of the averaging effect.
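
To make that concrete, here is a minimal sketch contrasting the two (get_batch_loss is a hypothetical stand-in for one forward/backward/step; trainloader is the loader from the tutorial above):

window_loss, window = 0.0, 2000     # reset every `window` mini-batches
total_loss, total_batches = 0.0, 0  # never reset

for i, batch in enumerate(trainloader):
    loss = get_batch_loss(batch)    # one training step, returns a float

    window_loss += loss
    total_loss += loss
    total_batches += 1

    if i % window == window - 1:
        # reflects only the last `window` mini-batches: reacts quickly
        print('windowed avg:   %.3f' % (window_loss / window))
        window_loss = 0.0
        # dominated by the early, high-loss batches the longer you train
        print('cumulative avg: %.3f' % (total_loss / total_batches))

The windowed number tells you what the loss is doing right now; the cumulative one gets flatter and flatter as training goes on.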


Thanks for the response.

I get the point of seeing the performance per mini-batch.
But regarding the running average, you meant the accuracy, right? Because I am never dividing the loss by anything; I am just appending each new value to the list of losses, right?

Oh, I see. I only looked at the code you provided from the tutorial to address your question:

However, I don’t understand why the loss and the accuracy are reset every epoch.

It’s important to reset e.g. the accuracy as well, because you are interested in the accuracy over one pass of the training dataset, not e.g. the average accuracy over three passes through the training dataset.
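
For example, the usual per-epoch bookkeeping looks roughly like this (a sketch showing only the accuracy part, with the loss/backward step omitted; model, trainloader and epochs are assumed from the surrounding setup):

for epoch in range(epochs):
    correct, total = 0, 0   # reset: we want this epoch's accuracy only
    for images, labels in trainloader:
        outputs = model(images)
        _, predictions = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predictions == labels).sum().item()
    # accuracy over exactly one pass through the training set
    print('epoch %d accuracy: %.2f%%' % (epoch, 100 * correct / total))

Without the reset, `correct` and `total` would keep accumulating over several passes, and the printed number would be the average over all of them.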

However, based on your code, you are probably wondering why we don’t save the accuracy in a list or something like that, since you asked:

Don’t we want to see how they evolve over the entire training process?

Of course you can do that, as you have done in your code. It’s just a matter of what you do with it later. Sometimes you may want to create additional plots or analyze the training progress in other ways. However, in many cases it’s sufficient to just print the loss and accuracy “live” during training, as is done in the tutorials in the docs. I think the key here is to keep the code short and minimal to focus on the core ideas of the tutorials.
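
If you do want the full history, the pattern from your code is exactly right: append the per-epoch numbers to lists and look at them afterwards, for example with matplotlib (a sketch; train_one_epoch is a hypothetical helper returning the epoch’s average loss and accuracy):

import matplotlib.pyplot as plt

history = {'loss': [], 'accuracy': []}

for epoch in range(epochs):
    epoch_loss, epoch_acc = train_one_epoch()  # hypothetical helper
    history['loss'].append(epoch_loss)
    history['accuracy'].append(epoch_acc)

plt.plot(history['loss'], label='train loss')
plt.plot(history['accuracy'], label='train accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()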
