Abnormal loss after adding BatchNorm1d

Hi, I’m training a model with BCEWithLogitsLoss. The model has a head like this:

self.head = nn.Linear(config.hidden_size, num_labels)

With this head, my training loss is about 0.3. When I use the head below instead:

self.head = nn.Sequential(
    nn.Linear(config.hidden_size, 150),
    nn.BatchNorm1d(150),
    nn.Linear(150, num_labels),
)

The training loss goes up to 0.6. Is this normal?
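For reference, here is a minimal self-contained version of the comparison I am doing (hidden_size, num_labels, and the random inputs are just placeholders for my real encoder output and multi-label targets):

import torch
import torch.nn as nn

hidden_size, num_labels, batch_size = 768, 10, 32  # placeholder sizes

plain_head = nn.Linear(hidden_size, num_labels)

bn_head = nn.Sequential(
    nn.Linear(hidden_size, 150),
    nn.BatchNorm1d(150),
    nn.Linear(150, num_labels),
)

criterion = nn.BCEWithLogitsLoss()

features = torch.randn(batch_size, hidden_size)                  # stand-in for the encoder output
targets = torch.randint(0, 2, (batch_size, num_labels)).float()  # stand-in for the multi-label targets

print("plain head loss:", criterion(plain_head(features), targets).item())
print("BN head loss:   ", criterion(bn_head(features), targets).item())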

Is only the initial loss higher?
What does the training look like? Are you able to reach the same or a lower loss with the batchnorm layer, or does it stay at a higher level throughout?

The training process is roughly as follows:

[training loss curves omitted]

The first curve uses BN, the second one does not. The training code is long, so I will send it to you in a message. Thank you.

While the training loss is higher, the validation score seems to be better with the batchnorm model.
In that case, I would stick to it and maybe play around with some hyperparameters.
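If you want to dig a bit deeper, you could also compare the loss of the batchnorm head in train() and eval() mode, since BatchNorm1d normalizes with batch statistics during training and with running statistics during evaluation; that difference alone can make the training loss and the validation metric move in different directions. A rough sketch (the sizes and random data are placeholders for your real features and targets):

import torch
import torch.nn as nn

hidden_size, num_labels, batch_size = 768, 10, 32  # placeholder sizes

head = nn.Sequential(
    nn.Linear(hidden_size, 150),
    nn.BatchNorm1d(150),
    nn.Linear(150, num_labels),
)
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(batch_size, hidden_size)
targets = torch.randint(0, 2, (batch_size, num_labels)).float()

head.train()   # uses batch statistics and updates the running estimates
print("train-mode loss:", criterion(head(features), targets).item())

head.eval()    # uses the running statistics instead
with torch.no_grad():
    print("eval-mode loss:", criterion(head(features), targets).item())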

Yes, the batchnorm model has a better validation score but a worse training loss. Could it be because of the multi-GPU training?

It’s hard to tell where exactly this effect comes from, and I don’t really have any idea either, so let’s wait for some experts on this topic :slight_smile:
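One thing that might be worth ruling out in the meantime is the per-GPU batch statistics: with DataParallel or DistributedDataParallel, each replica normalizes with the statistics of its own local batch, which can affect the training loss when the per-GPU batch size is small. If you are using DistributedDataParallel, converting the batchnorm layers to SyncBatchNorm makes them share statistics across processes. A rough sketch (the small Sequential stands in for your real model, and the commented DDP line assumes an initialized process group):

import torch.nn as nn

# Stand-in for the real model containing the batchnorm head.
model = nn.Sequential(
    nn.Linear(768, 150),
    nn.BatchNorm1d(150),
    nn.Linear(150, 10),
)

# Replace every BatchNorm layer with SyncBatchNorm so that, under
# DistributedDataParallel, statistics are computed over the global batch
# rather than per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
# (requires torch.distributed to be initialized; local_rank is this process's GPU index)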