@ptrblck do you have any guess as to why my running means are all set to zero? My understanding is that they should never have that value. In eval mode we use the running_mean, and when running stats are not being tracked they should be None. So I am puzzled about where to even start looking in my code for the place where the running mean could have been set to zero before my checkpoint was saved. Do you have an idea how this could happen? Have you ever seen a checkpoint with zeros in the running means?
args.mdl1.model.features.norm1.running_mean
Out[6]:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0.])
args.mdl1.model.features.norm1.running_var
Out[7]:
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
I don’t think this should ever happen, based on the 3 hours or so I spent reading the docs and the PyTorch code in some detail.
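For reference, here is a minimal sketch of how these buffers behave on a standalone layer, assuming a layer built with the same hyperparameters as norm1 and a made-up input batch (the tensor `x` and the variable names are illustrative, not taken from my code). It checks the buffer values right after construction, after a forward pass in eval mode, and after a forward pass in train mode, using the layer's `num_batches_tracked` buffer as the update counter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same hyperparameters as norm1 in the printed model.
bn = nn.BatchNorm2d(32, eps=1e-3, momentum=0.95)

# Right after construction the buffers hold their initial values.
fresh_mean_is_zero = bool(torch.all(bn.running_mean == 0))
fresh_var_is_one = bool(torch.all(bn.running_var == 1))
fresh_count = int(bn.num_batches_tracked)

# Made-up batch with a clearly nonzero per-channel mean.
x = torch.randn(8, 32, 5, 5) + 3.0

# eval mode: the running stats are read but not updated.
bn.eval()
with torch.no_grad():
    bn(x)
count_after_eval = int(bn.num_batches_tracked)

# train mode: one forward pass updates the buffers.
bn.train()
bn(x)
count_after_train = int(bn.num_batches_tracked)
mean_changed = bool(torch.any(bn.running_mean != 0))

print(fresh_mean_is_zero, fresh_var_is_one, fresh_count)
print(count_after_eval, count_after_train, mean_changed)
```

If this holds, all-zero means together with all-one variances would be consistent with buffers that were never updated, rather than with something overwriting them later.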
A search over my code shows that the only places where reset_running_stats appears are batch_norm and /Users/brando/anaconda3/envs/metalearning/lib/python3.9/site-packages/torch/nn/intrinsic/qat/modules/conv_fused.py (code I didn’t write and I’m not calling). It is also called in the construction of the BN layer, in __init__ (torch.nn.modules.batchnorm — PyTorch 1.10.0 documentation)… so my model somehow wasn’t tracking them during training, but when I print the checkpoint, track_running_stats is True, which puzzles me even more.
Out[8]:
Learner(
  (model): ModuleDict(
    (features): Sequential(
      (conv1): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (norm1): BatchNorm2d(32, eps=0.001, momentum=0.95, affine=True, track_running_stats=True)
      (relu1): ReLU()
      (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (norm2): BatchNorm2d(32, eps=0.001, momentum=0.95, affine=True, track_running_stats=True)
      (relu2): ReLU()
      (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (norm3): BatchNorm2d(32, eps=0.001, momentum=0.95, affine=True, track_running_stats=True)
      (relu3): ReLU()
      (pool3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (conv4): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (norm4): BatchNorm2d(32, eps=0.001, momentum=0.95, affine=True, track_running_stats=True)
      (relu4): ReLU()
      (pool4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (cls): Linear(in_features=800, out_features=5, bias=True)
  )
)
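One way to narrow this down might be to check, for every BN layer in the loaded model, whether its `num_batches_tracked` buffer is still 0. Below is a rough sketch under stated assumptions: the helper name `untouched_batchnorm_layers` is hypothetical, and the small `nn.Sequential` is only a stand-in for the real checkpoint (in my case it would be called on `args.mdl1`).

```python
from collections import OrderedDict

import torch
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)


def untouched_batchnorm_layers(model):
    """Return names of BN layers that track stats but never updated them."""
    names = []
    for name, module in model.named_modules():
        if not isinstance(module, BN_TYPES):
            continue
        counter = module.num_batches_tracked
        # The counter is None when track_running_stats=False at construction.
        if counter is not None and int(counter) == 0:
            names.append(name)
    return names


# Stand-in model (hypothetical), mirroring the first block printed above.
model = nn.Sequential(OrderedDict([
    ("conv1", nn.Conv2d(3, 32, kernel_size=3, padding=1)),
    ("norm1", nn.BatchNorm2d(32, eps=1e-3, momentum=0.95)),
    ("relu1", nn.ReLU()),
]))

before = untouched_batchnorm_layers(model)

# A single forward pass in train mode should update the running stats.
model.train()
model(torch.randn(4, 3, 8, 8))
after = untouched_batchnorm_layers(model)

print(before, after)
```

If every BN layer of the checkpoint shows up in this list, that would suggest the module that was saved never ran a train-mode forward pass itself, which is where I would start looking.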