During training I see a validation loss of 0.02 in TensorBoard.
Afterwards, I load the model from its checkpoint, set all parameters' gradients to zero, and run my training script again with the backward pass commented out:
```python
with self.timers.record("grad"):
    self.optimizer.zero_grad()
    if self.use_fp16:
        with amp.scale_loss(loss, self.optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
```
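To be clear about what I mean by "set all parameters to have 0 gradient", here is a minimal sketch of the idea (simplified, not my actual training script; the `nn.Linear` model is just a stand-in):

```python
import torch
import torch.nn as nn

# Stand-in model; the real one is loaded from a checkpoint.
model = nn.Linear(4, 2)

# Freeze every parameter so no gradients are computed for it.
for p in model.parameters():
    p.requires_grad_(False)

x = torch.randn(8, 4)
with torch.no_grad():  # forward-only pass, no autograd graph
    out = model(x)
    loss = out.pow(2).mean()  # placeholder loss
```

With the backward pass commented out and the parameters frozen like this, the weights should never change, so I expected the loss curve to match the training-time validation loss.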
This is the plot I see:
The loss is much higher, and also constant (I would expect variation across minibatches).
Any clue what’s happening?