Forward method messes up the weights without stepping the optimizer

Hi guys, I noticed some weird behavior today.

If I load some weights into my model and compute my validation score, I get the same value as the previously trained model. However, if I ever compute self.forward(im) inside the train loop (I don't even need to compute the loss or step the optimizer), it messes up my loaded weights. For example:

self.set_mode('train')
# Begin epoch loop
for i, (index, im, mask) in enumerate(train_loader):
    self.step += 1
    self.optimizer.zero_grad()
    im = im.cuda()
    mask = mask.cuda()
    self.do_validation(val_loader)  # <--- if I do it here it's all ok
    # Forward propagation
    logit = self.forward(im)
    self.do_validation(val_loader)  # <--- if I do it here I get all 0 in score
    loss = self.criterion(logit, mask)
    loss.backward()
    self.optimizer.step()

For reference:

def do_validation(self, val_loader):
    '''Validation step after epoch end'''
    self.set_mode('valid')
    val_loss = []
    val_iou = []
    val_score = []
    for i, (index, im, mask, ind_mask) in enumerate(val_loader):
        im = im.cuda()
        mask = mask.cuda()

        with torch.no_grad():
            logit = self.forward(im)
            pred = torch.sigmoid(logit)

            loss = self.criterion(logit, mask)
            iou  = eval.dice_accuracy(pred.cpu().numpy(), mask.cpu().numpy(), is_average=False)
            score_i = eval.do_kaggle_metric(pred.cpu().numpy(), ind_mask)[0]

            val_loss.append(loss.item())
            val_iou.extend(iou)
            val_score.extend(score_i)

    # Inference stops here
    out = dict(loss=val_loss, iou=val_iou, score=val_score)

    # Append epoch data to metrics dict
    for metric, value in out.items():
        self.update_log(self.val_log, metric, np.mean(value))

I noticed this because every time I start a new epoch my metrics decrease considerably. Any help is appreciated.

Kind regards

Try setting the model to eval mode and then doing any of this. What happens then?

Indeed, if I call self.set_mode('valid') at the beginning of training, the weird behavior does not occur. Does that mean the problem has something to do with BatchNorm?

Yes, batchnorm behaves differently in train and eval mode.
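
As a minimal, self-contained sketch (independent of your class), you can see that in train mode a single forward pass already updates BatchNorm's running statistics, even without backward() or an optimizer step:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)      # running_mean starts at 0, running_var at 1
x = torch.randn(8, 3, 16, 16)

bn.train()                  # train mode: forward updates the running stats
print(bn.running_mean)      # tensor([0., 0., 0.])
with torch.no_grad():
    bn(x)                   # no loss, no backward, no optimizer step
print(bn.running_mean)      # already different after a single forward pass

bn.eval()                   # eval mode: forward only uses the stored stats
with torch.no_grad():
    bn(x)
print(bn.running_mean)      # unchanged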

Please correct me if I'm wrong: does that mean I need to train my model in eval mode?
I don't understand why BatchNorm behaves strangely when I set the mode to train.

You can read about batchnorm behavior here.

No, training is performed in train mode, and evaluation is performed in eval mode.

Yes, that's what I'd expect. But if I perform a single forward pass in train mode, it messes with my model when I evaluate in eval mode. That's what is weird: if you look at my first post, the do_validation function sets the model back to eval mode.

To validate your claim, can you check whether the weights are equal before and after a single forward pass by printing them?

Also check the running_mean, running_var, and num_batches_tracked buffers of the batch norm layers.
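
For example, something like this sketch, where model stands for whatever network your self.forward wraps and im is one input batch (adjust the names to your code):

import copy
import torch
import torch.nn as nn

# Snapshot parameters and buffers (running_mean/var live in the state_dict too)
before = copy.deepcopy(model.state_dict())

model.train()
with torch.no_grad():
    model(im)                          # a single forward pass in train mode

# Report every tensor that changed
for name, tensor in model.state_dict().items():
    if not torch.equal(before[name], tensor):
        print('changed:', name)

# Inspect the BatchNorm buffers directly
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        print(name, module.running_mean[:3], module.running_var[:3],
              module.num_batches_tracked)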

I found the problem! My training routine calls validation sporadically, but I forgot to set the mode back to train once validation finished. So basically I was training in eval mode without realizing it…
Thanks for the help!
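
For anyone hitting the same issue, the fix is simply to switch back to train mode after every validation call. A sketch based on the loop from the first post (validate_every is just a placeholder for whatever schedule triggers validation):

self.set_mode('train')
for i, (index, im, mask) in enumerate(train_loader):
    self.step += 1
    self.optimizer.zero_grad()
    im = im.cuda()
    mask = mask.cuda()

    logit = self.forward(im)
    loss = self.criterion(logit, mask)
    loss.backward()
    self.optimizer.step()

    if self.step % validate_every == 0:
        self.do_validation(val_loader)   # switches the model to 'valid' internally
        self.set_mode('train')           # <--- don't forget to switch back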