Forward method messes up the weights without stepping the optimizer

Hi guys, I noticed some weird behavior today.

If I load some weights into my model and compute my validation score, I get the same value as the previously trained model. However, if I ever compute self.forward(im) inside the train loop (I don't even need to compute the loss or step the optimizer), it messes up my loaded weights. For example:

self.set_mode('train')
# Begin epoch loop
for i, (index, im, mask) in enumerate(train_loader):
    self.step += 1
    self.optimizer.zero_grad()
    im = im.cuda()
    mask = mask.cuda()
    self.do_validation(val_loader)  # <--- if I do it here it's all ok
    # Forward propagation
    logit = self.forward(im)
    self.do_validation(val_loader)  # <--- if I do it here I get all 0 in score
    loss = self.criterion(logit, mask)
    loss.backward()
    self.optimizer.step()

For reference:

def do_validation(self, val_loader):
    '''Validation step after epoch end'''
    self.set_mode('valid')
    val_loss = []
    val_iou = []
    val_score = []
    for i, (index, im, mask, ind_mask) in enumerate(val_loader):
        im = im.cuda()
        mask = mask.cuda()

        with torch.no_grad():
            logit = self.forward(im)
            pred = torch.sigmoid(logit)

            loss = self.criterion(logit, mask)
            iou  = eval.dice_accuracy(pred.cpu().numpy(), mask.cpu().numpy(), is_average=False)
            score_i = eval.do_kaggle_metric(pred.cpu().numpy(), ind_mask)[0]

            val_loss.append(loss.item())
            val_iou.extend(iou)
            val_score.extend(score_i)

    # Inference stops here
    out = dict(loss=val_loss, iou=val_iou, score=val_score)

    # Append epoch data to metrics dict
    for metric, value in out.items():
        self.update_log(self.val_log, metric, np.mean(value))

I noticed this because every time I start a new epoch my metrics decrease considerably. Any help is appreciated.

Kind regards

Try setting the model to eval mode and then doing any of this. What happens then?

Indeed, if I call self.set_mode('valid') at the beginning of training, the weird behavior does not occur. Does that mean the problem has something to do with BatchNorm?

Yes, batchnorm behaves differently in train and eval mode.
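
As a minimal, self-contained sketch (independent of your class), you can see that in train mode a single forward pass already updates BatchNorm's running statistics, even without backward() or an optimizer step:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)      # running_mean starts at 0, running_var at 1
x = torch.randn(8, 3, 16, 16)

bn.train()                  # train mode: forward updates the running stats
print(bn.running_mean)      # tensor([0., 0., 0.])
with torch.no_grad():
    bn(x)                   # no loss, no backward, no optimizer step
print(bn.running_mean)      # already different after a single forward pass

bn.eval()                   # eval mode: forward only uses the stored stats
with torch.no_grad():
    bn(x)
print(bn.running_mean)      # unchanged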

Please correct me if I'm wrong: does that mean I need to train my model in eval mode?
I don't understand why BatchNorm behaves strangely when I set the mode to train.

You can read about batchnorm behavior here.

No, training is performed in train mode, and evaluation is performed in eval mode.

Yes, that's what I'd expect. But if I perform a single forward pass in train mode, it messes with my model when I evaluate in eval mode. That's what is weird: if you look at my first post, the do_validation function sets the model back to eval mode.

To validate your claim, can you check whether the weights are equal before and after a single forward pass by printing them?

Also check the running_mean, running_var, and num_batches_tracked buffers of the batch norm layers.
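
For example, something like this sketch, where model stands for whatever network your self.forward wraps and im is one input batch (adjust the names to your code):

import copy
import torch
import torch.nn as nn

# Snapshot parameters and buffers (running_mean/var live in the state_dict too)
before = copy.deepcopy(model.state_dict())

model.train()
with torch.no_grad():
    model(im)                          # a single forward pass in train mode

# Report every tensor that changed
for name, tensor in model.state_dict().items():
    if not torch.equal(before[name], tensor):
        print('changed:', name)

# Inspect the BatchNorm buffers directly
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm2d):
        print(name, module.running_mean[:3], module.running_var[:3],
              module.num_batches_tracked)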

I found the problem! My training routine calls validation sporadically, but I forgot to set the mode back to train once validation finished. So basically I was training in eval mode without realizing it…
Thanks for the help!
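
For anyone hitting the same issue, the fix is simply to switch back to train mode after every validation call. A sketch based on the loop from the first post (validate_every is just a placeholder for whatever schedule triggers validation):

self.set_mode('train')
for i, (index, im, mask) in enumerate(train_loader):
    self.step += 1
    self.optimizer.zero_grad()
    im = im.cuda()
    mask = mask.cuda()

    logit = self.forward(im)
    loss = self.criterion(logit, mask)
    loss.backward()
    self.optimizer.step()

    if self.step % validate_every == 0:
        self.do_validation(val_loader)   # switches the model to 'valid' internally
        self.set_mode('train')           # <--- don't forget to switch back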