Torch version affects the network's training performance

I am opening this issue because the training result apparently differs depending on which version of PyTorch you use. Here are the 3px error evaluation curves for a minimal example: overfitting the network on a single image for 300 epochs:

[Screenshot from 2020-12-22 20-06-19: 3px error evaluation curves]

The purple line is trained with PyTorch 1.7.0 and the orange line with PyTorch 1.5.1. As you can see, with version 1.7.0 the error rate stays flat at 100%, while with version 1.5.1 it drops. The reason is that the BatchNorm behavior changed between versions 1.5.1 and 1.7.0. In version 1.5.1, if I disable track_running_stats here, both training and evaluation use batch statistics. In PyTorch 1.7.0, however, evaluation is forced to use running_mean and running_var, while training still uses batch statistics. With track_running_stats disabled, running_mean is 0 and running_var is 1, which is clearly different from the batch statistics.

How can I keep this behavior (evaluation using batch statistics) in versions beyond 1.5?
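For now I work around this by switching the BatchNorm layers back into train mode during evaluation, so they normalize with batch statistics. This is only a sketch; `model` is a placeholder for the actual network:

```python
import torch
import torch.nn as nn

# Placeholder model; replace with the real network
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8, track_running_stats=False),
)

model.eval()
# Re-enable train mode on the BatchNorm layers only, so they keep
# normalizing with batch statistics even during evaluation
for m in model.modules():
    if isinstance(m, nn.modules.batchnorm._BatchNorm):
        m.train()
```

All other layers (dropout etc.) stay in eval mode; only the BatchNorm layers are affected.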

I don’t think that’s true, as seen in the code for 1.5.0 vs. 1.7.0: in both versions the running stats are initialized as None when track_running_stats is disabled.
Also, this code snippet shows the same behavior in both versions:

import torch
import torch.nn as nn

print(torch.__version__)

# With track_running_stats=False, running_mean and running_var stay None
bn = nn.BatchNorm2d(3, track_running_stats=False)
print(bn.running_mean, bn.running_var)

# Input deliberately shifted and scaled away from N(0, 1)
x = torch.randn(2, 3, 24, 24) * 5 + 2

# Training mode: batch statistics are used
out = bn(x)
print(out.min(), out.max(), out.mean(), out.std())

# Eval mode: since no running stats exist, batch statistics are used again
bn.eval()
out = bn(x)
print(out.min(), out.max(), out.mean(), out.std())

Output:

1.5.0+cpu
None None
tensor(-3.5741, grad_fn=<MinBackward1>) tensor(3.5072, grad_fn=<MaxBackward1>) tensor(1.6557e-09, grad_fn=<MeanBackward0>) tensor(1.0001, grad_fn=<StdBackward0>)
tensor(-3.5741, grad_fn=<MinBackward1>) tensor(3.5072, grad_fn=<MaxBackward1>) tensor(1.6557e-09, grad_fn=<MeanBackward0>) tensor(1.0001, grad_fn=<StdBackward0>)

1.7.0+cpu
None None
tensor(-3.3744, grad_fn=<MinBackward1>) tensor(3.2699, grad_fn=<MaxBackward1>) tensor(3.8633e-09, grad_fn=<MeanBackward0>) tensor(1.0001, grad_fn=<StdBackward0>)
tensor(-3.3744, grad_fn=<MinBackward1>) tensor(3.2699, grad_fn=<MaxBackward1>) tensor(3.8633e-09, grad_fn=<MeanBackward0>) tensor(1.0001, grad_fn=<StdBackward0>)
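The train/eval equivalence can also be asserted directly (same setup as the snippet above):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3, track_running_stats=False)
x = torch.randn(2, 3, 24, 24) * 5 + 2

out_train = bn(x)
bn.eval()
out_eval = bn(x)

# Both passes normalize with batch statistics, so the outputs match
print(torch.allclose(out_train, out_eval))  # prints True
```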

So the difference might be coming from another part of your code.
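To localize where the outputs start to diverge between the two installs, one option is to dump per-layer output statistics with forward hooks and compare the printouts across versions. A sketch with a placeholder model (swap in the real network):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Placeholder model; replace with the actual network
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Record summary statistics of each layer's output
        stats[name] = (output.mean().item(), output.std().item())
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level container
        module.register_forward_hook(make_hook(name))

x = torch.randn(2, 3, 24, 24)
model(x)

for name, (mean, std) in stats.items():
    print(f"{name}: mean={mean:.4f}, std={std:.4f}")
# Run this under both PyTorch versions (same seed and input) and diff the output
```

The first layer whose statistics differ between versions points to the operation whose behavior changed.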