How to translate pytorch training and validation loss tensorboard curves

ranaban · August 21, 2022, 1:47pm

I cannot understand the x-axis values. I used batch_size = 32, with 6153 training samples and 769 validation samples. Below is the training curve, plotted with TensorBoard.
[Image: train_loss curve]

Please tell me how to understand the plot. Thank you!

ptrblck · August 21, 2022, 5:44pm

The x-axis is most likely representing the iterations or epochs while the y-axis would represent the corresponding loss value for the training and validation datasets. Since no axis labels are used, I’m speculating, but based on your description and the plot it looks like a standard way to visualize losses.

The loss value should generally correspond to the performance metric of your model (and you have to make sure it does). E.g. if you are working on a multi-class classification, you should see that the loss and accuracy show an inverse relation: the lower the loss, the higher the accuracy (this is not a perfect representation as the model “confidence” in its predictions will also change the loss, but not necessarily the accuracy etc.).
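To make the "confidence" caveat concrete, here is a minimal sketch in plain Python. The probabilities are made up for illustration and the helper names `cross_entropy` and `accuracy` are hypothetical, not from any library. Both sets of predictions pick the correct class every time, so the accuracy is identical, yet the cross-entropy loss differs because one model is more confident in the correct class.

```python
import math

def cross_entropy(probs, targets):
    # Mean negative log-probability assigned to the true class.
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def accuracy(probs, targets):
    # Fraction of samples whose highest-probability class equals the target.
    hits = sum(max(range(len(p)), key=p.__getitem__) == t
               for p, t in zip(probs, targets))
    return hits / len(targets)

# Illustrative (made-up) predictions for three samples and three classes.
targets = [0, 1, 2]
unsure = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

for name, probs in [("unsure", unsure), ("confident", confident)]:
    print(name, "accuracy:", accuracy(probs, targets),
          "loss:", round(cross_entropy(probs, targets), 4))
```

Both rows report the same accuracy while the loss of the confident predictions is clearly lower, which is why the two curves do not have to move in lockstep.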

Besides that, you can also check the loss curves to see if your model still continues to train, if it’s overfitting (large gap between the training and validation losses), how stable the overall training is, if a learning rate scheduler does its job, if the training is stuck etc.

ranaban · August 23, 2022, 12:03pm

Thank you @ptrblck!
Yes, the y-axis shows the loss values.
I trained for 10 epochs and expected to see the epochs on the x-axis as 0, 1, 2, …, 9. Instead it shows 0, 200, 400, …, 1800, 2000, which confused me. I still can’t figure out why those large values are there.

ptrblck · August 23, 2022, 4:41pm

Maybe you are plotting the number of iterations on the x-axis and not the epochs?
Could you check where the actual plot call is used and which arrays are passed to it?
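As a quick sanity check of that guess, here is a small sketch using the numbers from the posts above (6153 training samples, batch size 32, 10 epochs). It assumes the x-axis counts one step per training batch and that the last, partial batch of each epoch is kept. The helper `step_to_epoch` is hypothetical and only there for illustration.

```python
import math

train_samples = 6153   # from the first post
batch_size = 32        # from the first post
max_epochs = 10        # from the follow-up post

# Batches (iterations) per epoch, assuming the last partial batch is kept.
steps_per_epoch = math.ceil(train_samples / batch_size)

# Total number of iterations over the whole training run.
total_steps = steps_per_epoch * max_epochs

def step_to_epoch(step):
    # Hypothetical helper: map an iteration index back to its 0-based epoch.
    return step // steps_per_epoch

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("epoch at step 1000:", step_to_epoch(1000))
```

This gives 193 steps per epoch and 1930 steps in total, which lines up with an x-axis running from 0 to about 2000. If the last partial batch were dropped instead, the total would be 1920, which is still in the same range.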

ranaban · August 24, 2022, 8:15am

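I wrote the code below to get the plot:
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# checkpoint_callback is defined earlier (not shown here)
logger = TensorBoardLogger('model_logs', name='My-model')

trainer = pl.Trainer(logger=logger, checkpoint_callback=checkpoint_callback, max_epochs=10,
                     gpus=1, progress_bar_refresh_rate=20)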

%load_ext tensorboard
%tensorboard --logdir ./model_logs

Should I check the log files also?