When testing a model using nn.BatchNorm2d between convolutional layers, I saw that the error (cross entropy loss) over the test set was lower than the error for the training set.
I read some topics on here about testing and I realized I hadn’t used model.eval() before feeding the test set to the model.
What still isn’t clear to me is the statement in these topics that BatchNorm2d changes its behaviour when the model is put into evaluation mode with model.eval().
My question is the following:
I would think that this behaviour change would imply that for
torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
then “track_running_stats” should be False after calling model.eval().
However, I can see in the printed summary after model = model.eval() that “track_running_stats” is still True. Shouldn’t this be False, as the batchnorm layer should use the statistics learned from training?
Thanks in advance for any inputs.
Another way to state my question is with a code example, which I hope is fairly general, as I couldn’t find general examples in the PyTorch docs or on here.
Is this the correct way of training and testing for several epochs in PyTorch?
for epoch in range(no_epochs):
    # training pass (eval() below persists, so switch back to training mode)
    unet.train()
    running_loss = 0
    for image, target in training_set:
        unet, loss = defnet.train(unet, image, target)
        running_loss += loss
    training_loss[epoch] = running_loss / len(training_set)
    # evaluation pass on the test set
    criterion = nn.CrossEntropyLoss()
    evaluation_unet = unet.eval()
    running_test_loss = 0
    for image, target in test_set:
        running_test_loss += criterion(evaluation_unet(image), target)
    loss_validation[epoch] = running_test_loss / len(test_set)
My concern is that I don’t want my model to “go on training” on the test set. Is “torch.no_grad” and “model.eval()” enough to make sure this does not happen?
If you call model.eval(), the internal self.training flag will be set to False, and e.g. the running estimates in batchnorm layers won’t be updated anymore but applied to the current batch.
model.eval() works recursively on all submodules and modifies the model in place, so you don’t have to reassign the model.
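A minimal sketch of what eval() changes, assuming a toy two-layer model (the layer sizes and input shape are made up for illustration). It suggests that the training flag flips on every submodule, that track_running_stats is a constructor setting which stays True, and that the running estimates are only updated in training mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model, chosen only for illustration
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.BatchNorm2d(4))
bn = model[1]
x = torch.randn(8, 3, 16, 16)

# Training mode: a forward pass updates the running estimates
model.train()
mean_before = bn.running_mean.clone()
model(x)
updated_in_train = not torch.equal(mean_before, bn.running_mean)

# eval() sets self.training = False recursively and returns the same object
same_object = model.eval() is model
flags_after_eval = (model.training, bn.training, bn.track_running_stats)

# Evaluation mode: a forward pass leaves the running estimates untouched
mean_before = bn.running_mean.clone()
with torch.no_grad():
    model(x)
updated_in_eval = not torch.equal(mean_before, bn.running_mean)

print(updated_in_train, same_object, flags_after_eval, updated_in_eval)
```

In other words, track_running_stats only says whether the layer keeps running estimates at all; whether they are updated or merely used is decided by the training flag.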
The training loop looks strange, as you are calling defnet.train with some arguments. Are you using some high-level wrapper?
Also note that it looks as if you are storing the computation graph using running_loss += loss, which might increase your memory usage in each iteration. If you would like to add the current loss to running_loss to print it later, use running_loss += loss.item() instead.
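To illustrate the difference, here is a small sketch with a stand-in linear layer and random data (all hypothetical, not the U-Net from the post). The loss tensor still carries a reference to its graph, while .item() gives a plain Python float:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for a network, a batch and its targets
net = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
x = torch.randn(4, 10)
y = torch.tensor([0, 2, 1, 0])

loss = criterion(net(x), y)

# The loss tensor still references the graph that produced it
has_graph = loss.grad_fn is not None

# .item() returns a plain Python float with no graph attached
running_loss = 0.0
running_loss += loss.item()

print(type(loss).__name__, has_graph, type(running_loss).__name__)
```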
Thank you for the reply. All right that makes sense.
My batch size is 1 (i.e. single images are passed forward). My understanding of the implementation of running mean and running variance is that it remembers the statistics of the previous batches and adjusts the statistics for the normalization layers as it goes through batches (i.e. in my case the single training images). This means that a batch size of 1 is not a problem.
Can you briefly state if this is correctly understood?
“defnet” is an import of a script with my network as well as this training function:
import torch as ts
import torch.nn as nn

def train(network, image, target):
    criterion = nn.CrossEntropyLoss()
    optimizer = ts.optim.Adam(network.parameters())
    # variable wrap (a no-op since PyTorch 0.4)
    image, target = ts.autograd.Variable(image), ts.autograd.Variable(target)
    # forward pass, backprop, optimization step
    optimizer.zero_grad()
    prediction = network(image)
    loss = criterion(prediction, target)
    loss.backward()
    optimizer.step()
    rloss = loss.item()
    return network, rloss
I would think that this would work, but I don’t know if there is a more efficient way around training.
The training function by the way returns loss.item() to answer your last question.
If I train my model with model.train() on a training set and evaluate with model.eval() on the same training set, shouldn’t I be able to reproduce the exact same loss on the training set with model.eval()?
A small batch size might be a problem for batch norm layers, as the estimate of the mean and std might be quite off. If you can’t fit more than a single sample into a batch, you might want to try other normalization layers (e.g. GroupNorm for small batch sizes).
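As a rough sketch of that alternative: GroupNorm computes its statistics per sample over groups of channels, so it does not depend on the batch size and keeps no running estimates. The channel and group counts below are arbitrary choices for the example:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 8 channels split into 4 groups; both numbers are arbitrary for this sketch
gn = nn.GroupNorm(num_groups=4, num_channels=8)
x = torch.randn(1, 8, 16, 16)  # batch size 1

gn.train()
out_train = gn(x)
gn.eval()
out_eval = gn(x)

# No running statistics are kept, so both modes give the same result
same_output = torch.allclose(out_train, out_eval)
has_running_stats = hasattr(gn, "running_mean")
print(same_output, has_running_stats)
```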
Your train function is re-initializing the optimizer in each iteration, which will reset the internal buffers of Adam. I would recommend passing the criterion and optimizer to this train method, or just calling the training procedure in the training loop directly.
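One way to follow this advice is sketched below: the criterion and optimizer are created once and passed in, so Adam keeps its internal state across iterations. The tiny network and random data are placeholders, and train_step is a hypothetical name, not the function from the post:

```python
import torch
import torch.nn as nn

def train_step(network, image, target, criterion, optimizer):
    """Run one optimization step and return the loss as a Python float."""
    network.train()
    optimizer.zero_grad()
    loss = criterion(network(image), target)
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
# Placeholder network and data (per-pixel 2-class segmentation on 8x8 images)
network = nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(), nn.Conv2d(4, 2, 1))
training_set = [(torch.randn(1, 1, 8, 8), torch.randint(0, 2, (1, 8, 8)))
                for _ in range(5)]

# Created once, outside the loop
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(network.parameters())

losses = [train_step(network, image, target, criterion, optimizer)
          for image, target in training_set]

# Adam's per-parameter step counter has advanced once per iteration
first_param = next(network.parameters())
steps_taken = int(optimizer.state[first_param]["step"])
print(losses, steps_taken)
```

If the optimizer were re-created inside train_step, that counter and Adam's moment estimates would start from scratch on every call.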
Not really, as some layers, e.g. BatchNorm, will behave differently, as explained before.
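A small sketch of that difference, using a single BatchNorm2d layer and a random batch whose statistics are deliberately far from the initial running estimates (the scale and shift values are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3)
x = torch.randn(4, 3, 8, 8) * 5 + 2  # batch statistics far from (0, 1)

bn.train()
out_train = bn(x)  # normalized with the statistics of this batch

bn.eval()
with torch.no_grad():
    out_eval = bn(x)  # normalized with the running estimates

outputs_match = torch.allclose(out_train, out_eval)
print(outputs_match)
```

The same input gives different activations in the two modes, so a loss computed on top of them will generally differ too. In addition, the loss logged during training is averaged over steps while the weights are still changing.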
Thanks very much for the reply. There are some useful things to try out!