Why is the loss computed differently in the training function than in the validation function?

Hi, I use this code for training and validation (for a regression problem), and it seems to work fine. But I don’t know why the validation loss is computed in a different way. Can anyone explain why? Or is the validation loss computed incorrectly?

#%% network structure
import torch
import torch.nn as nn
import torch.nn.functional as F

# NOTE: `config` is a dict assumed to be defined elsewhere in the script.

class Net(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(config['num_features'], 350)
        self.fc2 = nn.Linear(350, 350)
        self.fc3 = nn.Linear(350, 350)
        self.fc4 = nn.Linear(350, 350)
        #self.bn1 = nn.BatchNorm1d(32)
        #self.dropout1 = nn.Dropout(0.05)
        self.fc5 = nn.Linear(350, config['num_output'])

        # Mean squared error loss (averaged over the elements of a batch)
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        x = self.fc1(x)
        x = F.elu(x)
        x = self.fc2(x)
        x = F.elu(x)
        x = self.fc3(x)
        x = F.elu(x)
        x = self.fc4(x)
        x = F.elu(x)
        #x = self.bn1(x)
        #x = self.dropout1(x)
        x = self.fc5(x)
        return x

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: implement L1/L2 regularization here
        return self.criterion(pred, target)

#%% training
def train(train_set, valid_set, model, config, device):

    n_epochs = config['n_epochs']
    optimizer = getattr(torch.optim, config['optimizer'])(model.parameters(), **config['optim_hparas'])
    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}      # per-batch training loss, per-epoch validation loss
    early_stop_cnt = 0
    epoch = 0
    # training loop (early stopping is controlled by config['early_stop'])
    while epoch < n_epochs:
        model.train()                           # set model to training mode
        for x, y in train_set:                  # iterate through the dataloader
            optimizer.zero_grad()               # set gradient to zero
            x, y = torch.tensor(x).type(torch.FloatTensor).to(device), torch.tensor(y).type(torch.FloatTensor).to(device)   # move data to device (cpu/cuda)
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss (mean over this batch)
            mse_loss.backward()                 # compute gradient (backpropagation)
            optimizer.step()                    # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())

        # After each epoch, test my model on the validation (development) set.
        dev_mse = valid(valid_set, model, device)
        if dev_mse < min_mse:
            # Save model if my model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if my model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record
#%% validation
def valid(valid_set, model, device):
    model.eval()                                # set model to evaluation mode
    total_loss = 0
    for x, y in valid_set:                      # iterate through the dataloader
        x, y = torch.tensor(x).type(torch.FloatTensor).to(device), torch.tensor(y).type(torch.FloatTensor).to(device)       # move data to device (cpu/cuda)
        with torch.no_grad():                   # disable gradient calculation
            pred = model(x)                     # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss (mean over this batch)

        total_loss += mse_loss.detach().cpu().item() * len(x)  # HERE IS MY QUESTION

    total_loss = total_loss / len(valid_set.dataset)  # AND HERE

    return total_loss

I hope someone can help me
Thanks

Not clear what you’re getting at. They both compute using MSELoss. Can you be more specific about what you think is different in the calculation?

Thank you, @J_Johnson.
My specific question is:
why is the MSE loss in training just

mse_loss = model.cal_loss(pred, y)
loss_record['train'].append(mse_loss.detach().cpu().item())

while in the validation part it is multiplied by the length of ‘x’ and accumulated, and the accumulated loss is later divided by the length of the validation set?

mse_loss = model.cal_loss(pred, y)  # compute loss
total_loss += mse_loss.detach().cpu().item() * len(x)  # HERE IS MY QUESTION
total_loss = total_loss / len(valid_set.dataset)  # AND HERE

Why don’t we compute it the same way as in the training part?

The choice of train and validation metrics is subjective. Without seeing where loss_record is being used, it’s hard to say exactly what the intention was.

Perhaps they wanted to plot sub-epoch model development for the training set and epoch level metrics for the validation set. But hard to say without knowing how loss_record is being used.

I checked it again. Does it make sense to say that in the training part the loss is recorded for only one batch at a time, while in the validation part it is calculated over all the data in the validation set? In that case it has to accumulate the loss (each batch mean multiplied by its batch size) and divide by the number of samples in the dataset.
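To make the arithmetic concrete, here is a minimal sketch in plain Python (no PyTorch needed). The toy predictions, targets, and batch size are made up for illustration, and `mse` is a hypothetical helper standing in for what `nn.MSELoss(reduction='mean')` returns on a single batch. It suggests that multiplying each batch mean by `len(x)` and dividing by the dataset size recovers the exact per-sample MSE, whereas a plain average of the batch means can differ when the last batch is smaller.

```python
def mse(preds, targets):
    """Mean squared error over one batch (stand-in for nn.MSELoss(reduction='mean'))."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical toy data: 5 samples with batch size 2, so the last batch has 1 sample.
preds = [1.0, 2.0, 3.0, 4.0, 10.0]
targets = [1.0, 1.0, 1.0, 1.0, 1.0]
batch_size = 2

batches = [
    (preds[i:i + batch_size], targets[i:i + batch_size])
    for i in range(0, len(preds), batch_size)
]

total_loss = 0.0
batch_means = []
for p, t in batches:
    batch_mse = mse(p, t)             # mean over this batch only
    batch_means.append(batch_mse)
    total_loss += batch_mse * len(p)  # undo the per-batch mean: sum of squared errors

weighted = total_loss / len(preds)                # what valid() computes
unweighted = sum(batch_means) / len(batch_means)  # plain average of batch means
exact = mse(preds, targets)                       # MSE over the whole dataset at once

print(weighted, unweighted, exact)
```

Here `weighted` matches `exact`, while `unweighted` is pulled upward by the single-sample last batch. When every batch has the same size, the two averages coincide.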

If you chart train and validation separately, with validation plotted per epoch and train plotted per batch, then it makes sense: during training, improvement should occur after each batch, while there is no such improvement between batches during validation. Additionally, your code calls the per-epoch validation metric to decide whether to save the model.

That said, you can modify the code in whatever way you see fit. But if you want to put the train and validation metrics on the same footing, you’d probably be best off simply taking the mean of the stored train batch losses for each epoch.
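A minimal sketch of that idea, assuming `loss_record['train']` is the flat list of per-batch losses from the code above and that every epoch ran the same number of batches (for a DataLoader, `len(train_set)` gives that count). The helper name `per_epoch_means` and the numbers are hypothetical.

```python
def per_epoch_means(batch_losses, batches_per_epoch):
    """Average a flat list of per-batch losses into one value per epoch."""
    means = []
    for start in range(0, len(batch_losses), batches_per_epoch):
        chunk = batch_losses[start:start + batches_per_epoch]
        means.append(sum(chunk) / len(chunk))
    return means

# Hypothetical record: 2 epochs with 3 training batches each.
loss_record = {'train': [0.9, 0.7, 0.5, 0.4, 0.3, 0.2], 'dev': [0.6, 0.35]}

# In the real script, batches_per_epoch would be len(train_set).
train_epoch_loss = per_epoch_means(loss_record['train'], batches_per_epoch=3)

for epoch, (tr, dev) in enumerate(zip(train_epoch_loss, loss_record['dev']), start=1):
    print('epoch {:d}: train = {:.4f}, dev = {:.4f}'.format(epoch, tr, dev))
```

Note that this is a plain average of batch means, so it is not weighted by batch size if the last batch is smaller, and the training losses are measured while the weights are still changing within the epoch. The two curves become comparable in scale, but not identical in meaning.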
