If the running stats are saved, restored, and used at test time (as the docs say), then there should be no difference regardless of the batch size. I save and load the whole model with torch.save(model, 'model.pkl') and model = torch.load('model.pkl').
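As an aside, pickling the whole model works, but the docs recommend saving only the state_dict, which includes the BatchNorm running stats because they are registered as buffers. A minimal sketch with a toy model (not the DenseNet below):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))

# running_mean / running_var are buffers, so they are part of the state_dict
assert any('running_mean' in k for k in model.state_dict())

torch.save(model.state_dict(), 'model_state.pkl')

# restore into a freshly constructed model with the same architecture
model2 = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8))
model2.load_state_dict(torch.load('model_state.pkl'))
```

Either way, the running stats survive the round trip, so they cannot be the source of the discrepancy here.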
Here is the model, which is just a DenseNet:
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class DenseNet(nn.Module):
    def __init__(self, layers=169, pretrained=True, emb_size=256):
        super(DenseNet, self).__init__()
        if layers == 121:
            self.model = models.densenet121(pretrained=pretrained)
            fc_in = 1024
        elif layers == 161:
            self.model = models.densenet161(pretrained=pretrained)
            fc_in = 2208
        elif layers == 169:
            self.model = models.densenet169(pretrained=pretrained)
            fc_in = 1664
        elif layers == 201:
            self.model = models.densenet201(pretrained=pretrained)
            fc_in = 1920
        else:
            raise ValueError('unsupported number of layers: %d' % layers)
        # replace the ImageNet classifier with an embedding layer
        self.model.classifier = nn.Linear(fc_in, emb_size, bias=False)

    def forward(self, x, norm=True):
        x = self.model.features(x)
        x = F.relu(x, inplace=True)
        x = F.adaptive_avg_pool2d(x, (1, 1))
        x = x.view(x.size(0), -1)
        x = self.model.classifier(x)
        if norm:
            x = F.normalize(x)  # L2-normalize the embedding
        return x
I used it like this:
model = DenseNet(layers=169, pretrained=True, emb_size=256)
model = model.cuda()
model.train()
torch.set_grad_enabled(True)
# train
torch.save(model, 'model.pkl')
# Test time
model = torch.load('model.pkl')
model.eval()
torch.set_grad_enabled(False)
# do test
I assume model.train()/model.eval() on the DenseNet object recursively sets the mode on all the modules it contains. If that were not the case, it would explain the problem [update: no, model.train()/eval() works as expected and recurses into child modules].
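This is easy to verify directly: train() and eval() set the training flag on the module and recurse into its children. A quick sketch with a toy model (not the DenseNet above):

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

model.eval()
# every submodule, including the BatchNorm layer, is now in eval mode
assert all(not m.training for m in model.modules())

model.train()
# and train() flips all of them back
assert all(m.training for m in model.modules())
```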
Update: Here are some numbers, L1 distances between the DenseNet 169 embeddings of the same 5 images (out of 512) using different batch sizes in model.eval() mode.
# load the model, switch to eval(), run it on 512 images with different batch sizes, and compute L1 distances in memory, without saving to disk.
# batch sizes: 512 vs 1 - five runs, seems to be non-deterministic
tensor([1.2099e-05, 1.2104e-05, 5.7479e-06, 1.1539e-05, 9.9048e-06])
tensor([1.2099e-05, 1.2104e-05, 5.7479e-06, 1.1539e-05, 9.9048e-06])
tensor([1.4106e-05, 1.2881e-05, 6.3117e-06, 1.1826e-05, 1.0850e-05])
tensor([1.4106e-05, 1.2881e-05, 6.3117e-06, 1.1826e-05, 1.0850e-05])
tensor([1.2099e-05, 1.2104e-05, 5.7479e-06, 1.1539e-05, 9.9048e-06])
# batch sizes: 512 vs 1
# Forward 10 batches of 512 in train mode, then switch to eval mode
tensor([1.3889e-05, 1.3013e-05, 7.1302e-06, 1.2253e-05, 1.3526e-05])
# batch sizes: 512 vs 4
tensor([1.4723e-05, 1.4445e-05, 8.1474e-06, 1.2522e-05, 1.2616e-05])
# batch sizes: 512 vs 64
tensor([1.4181e-05, 1.3389e-05, 6.6453e-06, 1.1660e-05, 1.1504e-05])
# batch sizes: 512 vs 256
tensor([3.9932e-06, 4.5871e-06, 2.6375e-06, 3.3856e-06, 4.4151e-06])
# batch sizes: 512 vs 512
tensor([0., 0., 0., 0., 0.])
# ImageNet Pre-trained DenseNet 169 with new non-trained embedding layer
# batch sizes: 512 vs 1
tensor([1.6982e-05, 1.5111e-05, 1.6363e-05, 1.5956e-05, 1.8549e-05])
# Random non-trained DenseNet
# batch sizes: 512 vs 1
tensor([4.8725e-06, 4.9665e-06, 4.8112e-06, 4.9490e-06, 4.7400e-06])
# ImageNet Pre-trained ResNet50 with new non-trained embedding layer
# batch sizes: 512 vs 1
tensor([1.7733e-05, 1.3676e-05, 1.5779e-05, 1.1592e-05, 1.1086e-05])
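For reference, the comparison above can be reproduced with a small stand-in network (a hypothetical architecture, not the actual DenseNet) and random images:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# small stand-in for the embedding network (hypothetical, for illustration only)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 16, bias=False),
)
net.eval()

images = torch.randn(32, 3, 16, 16)

with torch.no_grad():
    emb_full = net(images)  # one batch of 32
    # same images, batch size 1
    emb_single = torch.cat([net(img.unsqueeze(0)) for img in images])

# in eval mode any difference should be at the level of float rounding
l1 = (emb_full - emb_single).abs().sum(dim=1)
print(l1[:5])
```

The distances are tiny but generally nonzero, because different batch sizes can dispatch to different (differently-ordered) floating-point kernels.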
Based on these numbers, the difference varies with batch size, so it is most probably related to batch norm. PyTorch's batchnorm layers use an epsilon of 1e-5, and the L1 distances are in the same range. Maybe that is related?
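The eps coincidence is probably a red herring: in eval mode, batchnorm normalizes every input with the stored running stats and eps, per element and independent of the batch, so the formula itself cannot produce a batch-size dependence. A sketch of the eval-mode computation:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
bn.eval()

with torch.no_grad():
    # fake running stats, as if the layer had been trained
    bn.running_mean.copy_(torch.tensor([0.5, -1.0, 2.0, 0.0]))
    bn.running_var.copy_(torch.tensor([1.5, 0.3, 2.2, 1.0]))

    x = torch.randn(8, 4)
    out = bn(x)

    # eval mode: (x - running_mean) / sqrt(running_var + eps) * weight + bias
    manual = (x - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
    manual = manual * bn.weight + bn.bias
```

Since the same running stats and eps are applied no matter how the 512 images are batched, the residual 1e-5-scale differences are more plausibly floating-point non-determinism in the convolutions.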