Why is the loss computed differently in the training function and the validation function?

Hi, I use this code for training and validation (for a regression problem), and it seems to work fine. But I don't understand why the validation loss is computed in a different way. Can anyone explain why? Or is the validation loss computed wrongly?

#%% network structure
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(config['num_features'], 350)
        self.fc2 = nn.Linear(350, 350)
        self.fc3 = nn.Linear(350, 350)
        self.fc4 = nn.Linear(350, 350)
        #self.bn1 = nn.BatchNorm1d(32) 
        #self.dropout1 = nn.Dropout(0.05) 
        self.fc5 = nn.Linear(350, config['num_output'])
        
        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        x = self.fc1(x)
        x = F.elu(x)
        x = self.fc2(x)
        x = F.elu(x)
        x = self.fc3(x)
        x = F.elu(x)
        x = self.fc4(x)
        x = F.elu(x)
        #x = self.bn1(x)
        #x = self.dropout1(x)
        x = self.fc5(x)
        return x

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: implement L1/L2 regularization here
        return self.criterion(pred, target)

#%% training
def train(train_set, valid_set, model, config, device):
    
    n_epochs = config['n_epochs']
    optimizer = getattr(torch.optim, config['optimizer'])(model.parameters(), **config['optim_hparas'])
    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}      # for recording training and validation loss
    early_stop_cnt = 0
    epoch = 0
    # training loop
    while epoch < n_epochs:
        model.train()                           # set model to training mode
        for x, y in train_set:                     # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = torch.tensor(x).type(torch.FloatTensor).to(device), torch.tensor(y).type(torch.FloatTensor).to(device)   # move data to device (cpu/cuda)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())
            

        # After each epoch, test my model on the validation (development) set.
        dev_mse = valid(valid_set, model, device)
        if dev_mse < min_mse:
            # Save model if my model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if my model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
#%% validation
def valid(valid_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in valid_set:                         # iterate through the dataloader
        x, y = torch.tensor(x).type(torch.FloatTensor).to(device), torch.tensor(y).type(torch.FloatTensor).to(device)       # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            
        total_loss += mse_loss.detach().cpu().item() * len(x)  # HERE IS MY QUESTION
        
    total_loss = total_loss / len(valid_set.dataset)  # AND HERE

    return total_loss

I hope someone can help me. Thanks!

It's not clear what you're getting at. They both compute the loss using MSELoss. Can you be more specific about what you think is different in the calculation?

Thank you @J_Johnson.
My question, specifically, is: why is the MSE loss in training just

mse_loss = model.cal_loss(pred, y)
loss_record['train'].append(mse_loss.detach().cpu().item())

while in the validation part it is multiplied by the length of x and accumulated, and later the accumulated loss is divided by the length of the validation set?

mse_loss = model.cal_loss(pred, y)                      # compute loss
total_loss += mse_loss.detach().cpu().item() * len(x)  # HERE IS MY QUESTION
total_loss = total_loss / len(valid_set.dataset)        # AND HERE

Why don't we compute it the same way as in the training part?

The choice of train and validation metrics is subjective. Without seeing where loss_record is being used, it's hard to say exactly what the intention was.

Perhaps they wanted to plot sub-epoch (per-batch) progress for the training set and epoch-level metrics for the validation set, but again, it's hard to say without knowing how loss_record is used.

I checked it again; does it make sense to say that in the training part the loss is recorded for just one batch at a time, while in the validation part the loss is computed over the entire validation set? So it has to accumulate the loss and then divide by the number of samples.

If you chart each one separately, with validation plotted per epoch and training plotted per batch, then it makes sense: during training, improvement *should* occur after each batch, while there is no such improvement between batches during validation. Additionally, your code uses the per-epoch validation metric to decide whether to save the model or not.
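
As for the scaling itself: MSELoss(reduction='mean') returns the average loss over the samples in one batch, so multiplying by len(x) and later dividing the accumulated sum by len(valid_set.dataset) gives the mean over the whole validation set, which weights a smaller final batch correctly. A minimal sketch with made-up numbers:

# Toy numbers (made up) showing why sum(batch_mean * batch_size) / dataset_size
# equals the dataset-wide mean even when the last batch is smaller.
batch_losses = [0.50, 0.30, 0.80]   # per-batch means, i.e. what MSELoss(reduction='mean') returns
batch_sizes = [32, 32, 8]           # the last batch is smaller
dataset_size = sum(batch_sizes)     # 72

weighted_mean = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / dataset_size
naive_mean = sum(batch_losses) / len(batch_losses)

print(weighted_mean)  # ~0.444, the mean over all 72 samples
print(naive_mean)     # ~0.533, over-weights the 8-sample batch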

That said, you can modify the code in whatever way you see fit. But if you want to normalize the metrics between train and validation, you'd probably be best off simply taking the mean of the stored train-batch metrics per epoch.
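
For example, a quick sketch (assuming loss_record['train'] holds one entry per batch and every epoch runs the same number of batches, as in the loop above):

# Collapse the per-batch training losses into one mean value per epoch,
# so they can be compared directly with loss_record['dev'].
batches_per_epoch = len(train_set)   # len() of a DataLoader is the number of batches

train_per_epoch = [
    sum(loss_record['train'][i:i + batches_per_epoch]) / batches_per_epoch
    for i in range(0, len(loss_record['train']), batches_per_epoch)
]
# train_per_epoch[k] is the mean training loss of epoch k, one value per epoch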
