Object detection fine-tuning: interpreting the printout of train_one_epoch?

I’m custom-training a Mask R-CNN using the helpful tutorial:

https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

Things are basically working, and I’m at the stage of walking through the code to really understand and tweak things. Currently I’m confused by the MetricLogger output when I run engine.train_one_epoch() as follows:

metric_logger = train_one_epoch(model, optimizer, data_loader, device, epoch=0, print_freq=1);

When I run that (focusing on loss_classifier), I get the following for the eight batches:

Epoch: [0]  [0/8]  loss_classifier: 0.6603 (0.6603) 
Epoch: [0]  [1/8]  loss_classifier: 0.6557 (0.6580)  
Epoch: [0]  [2/8]  loss_classifier: 0.6557 (0.5927)  
Epoch: [0]  [3/8]  loss_classifier: 0.4622 (0.5132)  
Epoch: [0]  [4/8]  loss_classifier: 0.6557 (0.5601)  
Epoch: [0]  [5/8]  loss_classifier: 0.6557 (0.5904)  
Epoch: [0]  [6/8]  loss_classifier: 0.6557 (0.5562)  
Epoch: [0]  [7/8]  loss_classifier: 0.5268 (0.5525)

I thought the first number was the loss for each batch, with the running average in parentheses. But in the third row the value in parentheses drops below every value printed so far, so it can’t be a simple running average of those printed numbers. The meters do use this SmoothedValue class, so I suspect I’m just misunderstanding how things work at a basic level.
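
To make my expectation concrete, here is the bookkeeping I assumed MetricLogger was doing (a hypothetical sketch of my mental model, not the actual torchvision code):

# Hypothetical running-average logger (what I assumed was happening):
# print the raw batch loss, then the mean of everything printed so far.
printed = []
for step, batch_loss in enumerate([0.6603, 0.6557, 0.6557]):
    printed.append(batch_loss)
    running_avg = sum(printed) / len(printed)
    print(f"[{step}/8]  loss_classifier: {batch_loss:.4f} ({running_avg:.4f})")

Under that model the third row would read 0.6557 (0.6572), and the number in parentheses could never drop below the smallest value printed so far – yet the real output shows (0.5927).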

Also, my goal is to save values from each run of train_one_epoch(), and I’m not sure what the best single quantity is to pull at the end as a full summary of each epoch when I save checkpoints and gather summary statistics, or what the right syntax is for getting at those values.

Since I started with this tutorial, I feel like I might be missing some cache of discussion/documentation on this MetricLogger class – sorry for all the questions. Any link to a discussion or docs anyone can point me to would be greatly appreciated! I’m just piecing things together as best I can :grin: I’m coming over from TensorFlow to PyTorch – so far it has been great!

Why wouldn’t that be possible, if a lot of intermediate values were lower than the currently printed loss value (i.e. 0.6557)?


Thanks @ptrblck for replying: by setting print_freq to 1 I was explicitly trying to include every batch, so that I wouldn’t miss any values and the running mean would simply equal the mean of all rows so far in the epoch. I guess I’m missing something in how the calculations are performed, since it sounds like the behavior I reported is expected even with print_freq set to 1.

Incidentally, I’m planning to just use loss_classifier = metric_logger.meters['loss_classifier'].avg in my logger when saving checkpoints for validation.
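
In case it helps anyone later, here is roughly what I have in mind for the checkpoint step. One caveat I noticed while reading utils.py: avg appears to be the mean over SmoothedValue’s sliding window, while global_avg (total / count) covers the whole epoch, so global_avg may be the better end-of-epoch summary. This is a sketch under those assumptions – the attribute names (meters, global_avg) are from the copy of references/detection/utils.py I’m reading, so please double-check against your version:

import torch
from engine import train_one_epoch  # helper from torchvision references/detection

# model, optimizer, data_loader, device, epoch come from the usual
# training setup in the tutorial.
metric_logger = train_one_epoch(model, optimizer, data_loader, device, epoch=epoch, print_freq=1)

# One number per tracked quantity, averaged over the whole epoch:
epoch_summary = {name: meter.global_avg for name, meter in metric_logger.meters.items()}

torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "metrics": epoch_summary,
    },
    f"checkpoint_epoch_{epoch}.pt",
)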

You might not be missing anything, as I didn’t notice the print_freq=1 setting and would also have expected the result to be a running average per printed step.

EDIT: the print_freq argument seems to be passed to log_every here, which, if I’m not mistaken, prints the epoch time and the ETA.
So I’m still unsure where exactly the losses are printed. I’m not at my workstation, so I can’t just run the example. :confused:
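
EDIT 2: if SmoothedValue in utils.py is the deque-based helper I have in mind, the mystery might just be the format string: the default seems to be "{median:.4f} ({global_avg:.4f})", i.e. the first number is the median over a sliding window (not the raw per-batch loss) and the parenthesized number is the global average over the whole epoch. A minimal sketch of that reading (window size and format string may differ in your version, so treat this as an approximation, not the verbatim torchvision code):

from collections import deque
import statistics

class SmoothedValue:
    # Sketch of the meter as I understand it, NOT the verbatim code.
    def __init__(self, window_size=20, fmt="{median:.4f} ({global_avg:.4f})"):
        self.deque = deque(maxlen=window_size)  # recent values only
        self.total = 0.0                        # running sum over the epoch
        self.count = 0
        self.fmt = fmt

    def update(self, value):
        self.deque.append(value)
        self.total += value
        self.count += 1

    def __str__(self):
        # median_low mirrors torch.median, which returns the lower of the
        # two middle elements for even-length inputs.
        return self.fmt.format(
            median=statistics.median_low(self.deque),
            global_avg=self.total / self.count,
        )

meter = SmoothedValue()
for batch_loss in [0.6603, 0.6557, 0.4621]:
    meter.update(batch_loss)
    print(meter)

This prints 0.6603 (0.6603), 0.6557 (0.6580), 0.6557 (0.5927) – exactly the first three rows above. So if this reading is right, the step-2 batch loss was really ~0.4621: it pulled the global average down to 0.5927 while the window median stayed at 0.6557, and nothing was averaged incorrectly.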
