Question about nn.BatchNorm2d


When testing a model using nn.BatchNorm2d between convolutional layers, I saw that the error (cross entropy loss) over the test set was lower than the error for the training set.

I read some topics on here about testing and I realized I hadn’t used model.eval() before feeding the test set to the model.

What still isn’t clear to me is how BatchNorm2d changes behaviour when the model is switched to evaluation mode with model.eval(), as those topics say it does.

My question is the following:
I would think that this behaviour change would imply that for

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

then “track_running_stats” should be false after setting the model to eval()

However I can see in the summary of model = model.eval() that “track_running_stats” is still true. Shouldn’t this be false, as the batchnorm layer should use the statistics learned from training?

Thanks in advance for any inputs.

Another way to state my question is with a code example, which I hope is fairly general, as I couldn’t find general examples in the PyTorch docs or on here.

Is this the right way to train and test for several epochs in PyTorch?

for epoch in range(no_epochs):
    unet.train()  # switch back to training mode after the previous epoch's eval()
    running_loss = 0
    for image, target in training_set:
        unet, loss = defnet.train(unet, image, target)
        running_loss += loss
    training_loss[epoch] = running_loss / len(training_set)

    running_test_loss = 0
    with torch.no_grad():
        criterion = nn.CrossEntropyLoss()
        evaluation_unet = unet.eval()
        for image, target in test_set:
            running_test_loss += criterion(evaluation_unet(image), target).item()
        loss_validation[epoch] = running_test_loss / len(test_set)

My concern is that I don’t want my model to “go on training” on the test set. Is “torch.no_grad” and “model.eval()” enough to make sure this does not happen?

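If you call model.eval(), the internal training flag (model.training) will be set to False, and e.g. the running estimates in batchnorm layers won’t be updated anymore but will be used to normalize the current batch. track_running_stats is a constructor argument and stays True; it only determines whether the layer keeps running estimates at all.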
Note that model.train() and model.eval() work recursively on all modules, so that you don’t have to reassign the model.
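
Here is a minimal sketch of that behaviour, assuming a standalone nn.BatchNorm2d layer with 3 channels and random input (both chosen only for illustration). It shows that eval() flips the training flag, leaves track_running_stats untouched, and stops the running mean from being updated.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)                  # hypothetical layer, 3 channels
x = torch.randn(4, 3, 8, 8) * 2 + 5     # random input with non-zero mean

# Training mode: the batch statistics normalize the input and
# the running estimates are updated.
bn.train()
bn(x)
mean_after_train = bn.running_mean.clone()

# Eval mode: the running estimates are used and left untouched.
bn.eval()
bn(x)
mean_after_eval = bn.running_mean.clone()

print(bn.training)                                      # False
print(bn.track_running_stats)                           # True
print(torch.equal(mean_after_train, mean_after_eval))   # True
```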

The training loop looks strange, as you are calling defnet.train with some arguments.
Are you using some high-level wrapper?

Also note that it looks as if you are storing the computation graph using running_loss += loss, which might increase your memory usage in each iteration.
If you would like to add the current loss to running_loss to print it later, use running_loss += loss.item() instead.
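
As a small illustration of the difference, assuming a stand-in nn.Linear model and random data (neither is from the original code): the loss tensor is still attached to the computation graph, while loss.item() gives a plain Python float.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                 # hypothetical stand-in model
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 4)
target = torch.randint(0, 2, (8,))

loss = criterion(model(x), target)

# The loss tensor is attached to the computation graph ...
print(loss.requires_grad)               # True

# ... while .item() returns a plain Python float with no graph attached.
running_loss = 0.0
running_loss += loss.item()
print(type(running_loss))               # <class 'float'>
```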

Hi ptrblck

Thank you for the reply. All right that makes sense.

My batch size is 1 (i.e. single images are passed forward). My understanding of the implementation of running mean and running variance is that it remembers the statistics of the previous batches and adjusts the statistics for the normalization layers as it goes through batches (i.e. in my case the single training images). This means that a batch size of 1 is not a problem.

Can you briefly state if this is correctly understood?

“defnet” is an import of a script with my network as well as this training function:

import torch as ts      # alias used throughout this script
import torch.nn as nn

def train(network, image, target):
    criterion = nn.CrossEntropyLoss()
    optimizer = ts.optim.Adam(network.parameters())
    # variable wrap
    image, target = ts.autograd.Variable(image), ts.autograd.Variable(target)
    # forward pass, backprop, optimization step
    optimizer.zero_grad()
    prediction = network(image)
    loss = criterion(prediction, target)
    loss.backward()
    optimizer.step()
    rloss = loss.item()
    return network, rloss

I would think that this would work, but I don’t know if there is a more efficient way to do the training.

To answer your last question: the training function returns loss.item().

Best regards,


If I train my model with model.train() on a training set and evaluate with model.eval() on the same training set, shouldn’t I be able to reproduce the exact same loss on the training set with model.eval()?

A small batch size might be a problem for batch norm layers, as the estimate of the mean and std might be quite off. If you can’t fit more than a single sample into a batch you might want to try other normalization layers, e.g. InstanceNorm (or GroupNorm for small batch sizes).
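
A minimal sketch of these alternatives, assuming a single 8-channel image and 4 groups (both picked only for illustration). GroupNorm and InstanceNorm2d compute their statistics per sample, so a batch size of 1 does not change the result, and with default arguments neither keeps running estimates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1, 8, 16, 16)           # a single image with 8 channels

group_norm = nn.GroupNorm(num_groups=4, num_channels=8)
instance_norm = nn.InstanceNorm2d(8)

# Both layers normalize each sample on its own.
out_gn_train = group_norm(x)
out_in = instance_norm(x)

# GroupNorm has no running statistics, so eval mode gives the same output.
group_norm.eval()
out_gn_eval = group_norm(x)

print(out_gn_train.shape, out_in.shape)
print(torch.allclose(out_gn_train, out_gn_eval))   # True
```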

The train function is re-initializing the optimizer in each iteration, which will reset the internal buffers of Adam. I would recommend passing the criterion and optimizer to this train method or just call the training procedure in the DataLoader loop.
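
One way this could look is sketched below. The names train_step and evaluate, the tiny network, and the random data are all hypothetical stand-ins, not the original U-Net code. The criterion and optimizer are created once and passed in, so Adam keeps its internal buffers across iterations.

```python
import torch
import torch.nn as nn

def train_step(network, image, target, criterion, optimizer):
    """Run one optimization step and return the loss as a float."""
    network.train()
    optimizer.zero_grad()
    loss = criterion(network(image), target)
    loss.backward()
    optimizer.step()
    return loss.item()

def evaluate(network, data, criterion):
    """Average loss over data without updating weights or running stats."""
    network.eval()
    total = 0.0
    with torch.no_grad():
        for image, target in data:
            total += criterion(network(image), target).item()
    return total / len(data)

# Hypothetical tiny model and random data, for illustration only.
torch.manual_seed(0)
network = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.BatchNorm2d(4),
                        nn.ReLU(), nn.Conv2d(4, 2, 1))
training_set = [(torch.randn(2, 1, 8, 8), torch.randint(0, 2, (2, 8, 8)))
                for _ in range(3)]
test_set = [(torch.randn(2, 1, 8, 8), torch.randint(0, 2, (2, 8, 8)))
            for _ in range(2)]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(network.parameters())   # created once

for epoch in range(2):
    train_loss = sum(train_step(network, img, tgt, criterion, optimizer)
                     for img, tgt in training_set) / len(training_set)
    test_loss = evaluate(network, test_set, criterion)
    print(epoch, train_loss, test_loss)
```

Because train_step calls network.train() and evaluate calls network.eval(), the mode is always set explicitly before each phase.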

As for reproducing the exact training loss with model.eval(): not really, as some layers, e.g. Dropout and BatchNorm, behave differently in training and evaluation mode, as explained before.
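
A short sketch of this effect, assuming a toy model made of a BatchNorm2d layer followed by Dropout and a random input (all hypothetical). The same input gives different outputs in the two modes, so the losses will generally differ as well.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.BatchNorm2d(4), nn.Dropout(p=0.5))  # toy model
x = torch.randn(8, 4, 5, 5) * 3 + 1

model.train()
out_train = model(x)    # batch statistics, random dropout mask

model.eval()
out_eval = model(x)     # running statistics, dropout disabled

print(torch.allclose(out_train, out_eval))      # False
print(torch.equal(out_eval, model(x)))          # True: eval mode is deterministic
```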


Thanks very much for the reply. There are some useful things to try out!