I’m new to PyTorch and to CNNs in general, so I apologize if this is a strange question. I’ve fine-tuned a pretrained VGG-16 network and saved a checkpoint at each epoch along the way, but I’ve realized that while I printed the training loss for each epoch, I never recorded the validation losses.
So I’m wondering: is it possible to go back through the saved checkpoints in model.eval() and manually calculate the validation loss for each epoch? Would I need to call loss.backward() there as well?
You call loss.backward() when you need to backpropagate the loss to your weights, which happens during training; you don’t need it in the validation step. You can load the weights from each checkpoint into the model and compute the loss directly from the model’s output.
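A minimal sketch of that workflow. The checkpoint filename, the toy model, and the random validation data are all stand-ins; swap in torchvision.models.vgg16(), your real checkpoint paths, and your real DataLoader:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for VGG-16 so the sketch runs anywhere.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
# Pretend this file was saved during training with torch.save(model.state_dict(), ...).
torch.save(model.state_dict(), "checkpoint_epoch1.pth")

# Stand-in validation set: 8 random images, 10 classes.
val_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 8, 8), torch.randint(0, 10, (8,))),
    batch_size=2,
)

model.load_state_dict(torch.load("checkpoint_epoch1.pth"))
model.eval()  # evaluation mode: dropout off, batch norm uses running stats
criterion = nn.CrossEntropyLoss()

val_loss = 0.0
with torch.no_grad():  # no graph is built, so no loss.backward() is needed
    for inputs, targets in val_loader:
        val_loss += criterion(model(inputs), targets).item()

print(f"epoch validation loss: {val_loss / len(val_loader):.4f}")
```

Repeating the load/evaluate loop once per saved checkpoint gives you the missing per-epoch validation curve.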
When I printed loss.item() after calling loss.backward() during training, would the backward call have changed the value of the loss?
When I load the weights to the model from checkpoints and compute loss directly on the training data, I get a significantly different loss than when I was printing it during training. Is this normal or am I doing something wrong?
For the first case, I’m not sure; I save loss.item() in another variable before calling loss.backward(), just to be safe.
For the second case: during training, each loss.backward() followed by the optimizer step changes the model weights, so the loss you printed for each batch came from a model that was constantly changing. Recomputing the loss afterwards in model.eval() with a fixed checkpoint will therefore give different numbers.
Hope that answers your question, would be happy to elaborate if not.
Thanks for your answers, it’s very helpful in understanding how things work.
Just wondering one more thing: I’m still getting some weird numbers, and I noticed the code I’m using has different batch sizes in the dataloaders for the training data (batch_size = 2) and the validation data (batch_size = 1).
Since the loss is currently loss/len(dataloader), could that be why my validation losses are significantly below my training losses? What would be the correct way to record the per-epoch losses for both validation and training when the batch sizes differ?
By default, loss functions such as CrossEntropyLoss (or log_softmax + NLLLoss) use mean reduction over the batch losses.
N = number of datapoints
bs = batch_size
len(loader) = N / bs
Loss = (l_1 + … + l_bs)/bs + (l_(bs+1) + … + l_(2·bs))/bs + …
Loss / len(loader) = Loss · bs / N = (sum of all l_i) / N
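The derivation above can be checked numerically. Here the per-sample losses are made-up values (in practice each l_i would come from your loss function), and epoch_loss mimics summing per-batch mean losses and dividing by len(loader):

```python
import torch

# Made-up per-sample losses for N = 8 datapoints.
losses = torch.tensor([0.9, 0.4, 1.2, 0.3, 0.7, 0.5, 1.0, 0.6])

def epoch_loss(per_sample, bs):
    # Mimic reduction='mean': average within each batch,
    # then divide the running sum by len(loader) = N / bs.
    batches = per_sample.split(bs)
    total = sum(b.mean().item() for b in batches)
    return total / len(batches)

print(epoch_loss(losses, 2))    # batch_size = 2, as in training
print(epoch_loss(losses, 1))    # batch_size = 1, as in validation
print(losses.mean().item())     # all three match: the plain mean over all N
```

All three values coincide (up to floating-point noise) because every batch is full; only a partial last batch would shift the result slightly.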
So you can see that the batch size you use for validation doesn’t matter (the value will change slightly if the last batch is not full).
You can keep the default mean reduction in your loss function, sum up all the batch losses, and divide by len(loader) for both train and val, regardless of the different batch sizes. Another option for training is to append every batch loss and return the whole list; this gives better information about loss fluctuation, though your plots may get messy.
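The two recording options can be sketched in a few lines. The batch losses here are made-up numbers; in practice each one would be criterion(output, target).item() under mean reduction:

```python
# Made-up per-batch losses from one epoch (mean reduction per batch).
batch_losses = [0.9, 0.7, 0.8, 0.6]

# Option 1: a single number per epoch, comparable across train and val.
epoch_loss = sum(batch_losses) / len(batch_losses)

# Option 2: keep every batch loss for a finer-grained (noisier) curve.
history = list(batch_losses)

print(epoch_loss)
print(history)
```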
Thank you again, that makes a little more sense. I’m still a little confused about why the validation losses are significantly less than the training losses if that is the case. I can see that the reduction on the MSELoss was set to “sum”; would the correct fix here be to set it to “mean” so that validation and training are on the same scale?
Is there a reason why a different reduction would be used?
If it is set to sum, then you get the loss on a different scale:

L = (sum of all losses) · batch_size / N

With a lower batch size for validation, your validation loss will come out smaller than the training loss.
To correct this, you can either divide each batch loss by its batch size or, equivalently, switch the loss to mean reduction (reduction='mean'). This should give you comparable values.
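A quick check of that equivalence with random data (seeded for reproducibility):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(4)    # one scalar prediction per sample, batch of 4
target = torch.randn(4)

sum_loss = nn.MSELoss(reduction="sum")(pred, target)
mean_loss = nn.MSELoss(reduction="mean")(pred, target)

batch_size = pred.shape[0]
# Dividing the summed loss by the batch size recovers the mean reduction:
print(torch.allclose(sum_loss / batch_size, mean_loss))  # True
```

One caveat: for multi-dimensional outputs, reduction='mean' divides by the total number of elements, not just the batch size, so the two only coincide when each sample contributes a single scalar.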