How to debug when TensorBoard validation loss doesn't match

whoab · July 2, 2020, 3:52am

I get a validation loss of 0.02 during training, as shown in TensorBoard.

Afterwards, I load the model’s checkpoint and set the gradients of all parameters to 0. Then I run my training script again with the backward pass commented out:

            with self.timers.record("grad"):
                self.optimizer.zero_grad()
                if self.use_fp16:
                    with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                        scaled_loss.backward()
                else:
                    loss.backward()

This is the plot I see:

The error is much higher and also constant (I would expect it to vary between minibatches).

Any clue what’s happening?

ptrblck · July 4, 2020, 6:38am

If I understand the issue correctly, you get a constant validation loss for the complete validation dataset after restoring the model, while the loss for the same setup changes before storing the model?

If so, could you post the model definition, so that we could have a look?
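
In the meantime, here is a minimal sketch of one way to check whether a restored checkpoint reproduces the per-batch validation loss. It is not based on your script: the small `nn.Sequential` model, the random `val_batches`, the `validation_losses` helper, and the in-memory buffer standing in for the checkpoint file are all hypothetical placeholders, so swap in your own model, data loader, and checkpoint path.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)


def build_model():
    # Hypothetical stand-in for the real model definition.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 1))


model = build_model()
criterion = nn.MSELoss()
# Hypothetical stand-in for the validation DataLoader: 5 random batches.
val_batches = [(torch.randn(16, 4), torch.randn(16, 1)) for _ in range(5)]


def validation_losses(net, batches):
    """Return one loss value per batch, without computing gradients."""
    net.eval()  # disable dropout and use BatchNorm running statistics
    losses = []
    with torch.no_grad():  # evaluation needs neither backward() nor an optimizer
        for inputs, targets in batches:
            losses.append(criterion(net(inputs), targets).item())
    return losses


# Per-batch loss of the in-memory model, i.e. what would be logged during training.
losses_before = validation_losses(model, val_batches)

# Round-trip the weights through a checkpoint. The buffer stands in for a file on disk.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

restored = build_model()
restored.load_state_dict(torch.load(buffer))
losses_after = validation_losses(restored, val_batches)

for before, after in zip(losses_before, losses_after):
    print(f"before={before:.6f}  after={after:.6f}")
```

If the two columns differ in your setup, the save/restore step would be the first thing to look at. If they match each other but not the value logged to TensorBoard, the difference is more likely in how the loss is computed or logged in the training script, for example the model being left in `train()` mode or the optimizer still being stepped.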