BatchNorm2d results not matching BatchNorm in libtorch 1.4

Hello,
I recently upgraded from libtorch 1.2 to 1.4 and, among other things, replaced torch::nn::BatchNorm with torch::nn::BatchNorm2d in my code, since the former is now deprecated. I started seeing vastly different results with the new library, and tracked them to batch norm. Below is a minimal example that already shows the issue.

I would expect some small delta between the two classes, but the differences I’m seeing in large runs are huge, going all the way from training fine to not converging at all. I did not see anything in the documentation that would explain such a change in behavior.

Could someone please advise if there was any intentional (possibly undocumented) change in behavior between BatchNorm and BatchNorm2d in libtorch 1.4? Thanks!

#include <torch/torch.h>
#include <iostream>

int main()
{
    int64_t const C(3), H(4), W(4);
    torch::manual_seed(1);
    auto const input = torch::rand({1, C, H, W});

    /*** The two outputs below should be identical, but are vastly different! ***/
    torch::manual_seed(1);  // reseed so each module's random init starts from the same RNG state
    std::cout << torch::nn::BatchNorm(C)->forward(input) << std::endl;

    torch::manual_seed(1);
    std::cout << torch::nn::BatchNorm2d(C)->forward(input) << std::endl;

    return 0;
}

Here’s the output of the above:

Warning: torch::nn::BatchNorm module is deprecated and will be removed in 1.5. Use BatchNorm{1,2,3}d instead. (BatchNormImpl at ../../torch/csrc/api/src/nn/modules/batchnorm.cpp:21)
(1,1,.,.) = 
  0.9024 -0.8736 -0.4140  0.8172
 -1.8019  1.0592 -0.4361  0.8903
  0.2040 -0.2815  0.4608  0.0375
  0.6239 -0.7776 -0.1895 -0.2213

(1,2,.,.) = 
  0.0272 -0.0509  0.4098  0.1145
 -0.2442 -0.3657 -0.1367 -0.2751
 -0.2169 -0.0237  0.2639  0.2363
 -0.5616  0.2764  0.0978  0.4488

(1,3,.,.) = 
  0.4244 -0.5917 -0.6174 -0.3144
  0.5654 -0.1102  0.2743 -0.3030
  0.6321 -0.2761  0.4825  0.0052
 -0.3432  0.3300 -0.3458  0.1877
[ CPUFloatType{1,3,4,4} ]
(1,1,.,.) = 
  1.1911 -1.1530 -0.5465  1.0787
 -2.3783  1.3981 -0.5756  1.1752
  0.2692 -0.3715  0.6082  0.0494
  0.8235 -1.0264 -0.2501 -0.2921

(1,2,.,.) = 
  0.0975 -0.1822  1.4672  0.4098
 -0.8742 -1.3091 -0.4895 -0.9849
 -0.7766 -0.0848  0.9450  0.8462
 -2.0108  0.9897  0.3500  1.6069

(1,3,.,.) = 
  1.0529 -1.4680 -1.5317 -0.7800
  1.4028 -0.2734  0.6805 -0.7517
  1.5682 -0.6850  1.1971  0.0129
 -0.8514  0.8186 -0.8578  0.4658
[ CPUFloatType{1,3,4,4} ]

For the simple repro example, the Python equivalent gives the same output as BatchNorm2d in libtorch 1.4:

>>> import torch
>>> torch.manual_seed(1)
<torch._C.Generator object at 0x7f3d14d2ddb0>
>>> input = torch.rand(1, 3, 4, 4)
>>> torch.manual_seed(1)
<torch._C.Generator object at 0x7f3d14d2ddb0>
>>> torch.nn.BatchNorm2d(3)(input)
tensor([[[[ 1.1911, -1.1530, -0.5465,  1.0787],
          [-2.3783,  1.3981, -0.5756,  1.1752],
          [ 0.2692, -0.3715,  0.6082,  0.0494],
          [ 0.8235, -1.0264, -0.2501, -0.2921]],
 
         [[ 0.0975, -0.1822,  1.4672,  0.4098],
          [-0.8742, -1.3091, -0.4895, -0.9849],
          [-0.7766, -0.0848,  0.9450,  0.8462],
          [-2.0108,  0.9897,  0.3500,  1.6069]],
 
         [[ 1.0529, -1.4680, -1.5317, -0.7800],
          [ 1.4028, -0.2734,  0.6805, -0.7517],
          [ 1.5682, -0.6850,  1.1971,  0.0129],
          [-0.8514,  0.8186, -0.8578,  0.4658]]]],
       grad_fn=<NativeBatchNormBackward>)

For model convergence in general, have you tried training the equivalent Python model and checking that it converges?

Hi, thanks for your response.
Interestingly, I get the old BatchNorm answer from Python (I have torch.__version__ = 1.1.0, you’re probably using a more recent version).
This indicates that both Python and libtorch changed behavior recently. Some simple math shows that the results of the new BatchNorm2d are actually correct. What I would like to understand, though, is what changed between the two. If I know what the old implementation did, hopefully I can understand why my model converged there, but not anymore.
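One way to see the relationship from the two libtorch printouts above: if the only difference between the implementations is the initial gamma, then the elementwise ratio old/new should be constant within each channel (and equal to that channel's gamma). A quick pure-Python check on a few values copied from the printouts (the sampled values and the 1e-3 tolerance are my own choices):

```python
# Pairs of (old BatchNorm output, new BatchNorm2d output), copied per channel
# from the two printouts above.
pairs = {
    1: [(0.9024, 1.1911), (-0.8736, -1.1530), (0.8172, 1.0787)],
    2: [(0.0272, 0.0975), (-0.0509, -0.1822), (0.4098, 1.4672)],
    3: [(0.4244, 1.0529), (-0.5917, -1.4680), (-0.3144, -0.7800)],
}

gamma = {}
for ch, vals in pairs.items():
    ratios = [old / new for old, new in vals]
    # Within a channel, all ratios agree: old = gamma[ch] * new
    assert max(ratios) - min(ratios) < 1e-3
    gamma[ch] = sum(ratios) / len(ratios)

print(gamma)  # roughly {1: 0.758, 2: 0.279, 3: 0.403} -- all in (0, 1), consistent with U(0, 1)
```

The recovered per-channel scale factors look exactly like draws from a uniform distribution on (0, 1), which fits a gamma-initialization change.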

Answering my own question: apparently there was a change in how the weights (gamma) are initialized from BatchNorm to BatchNorm2d, going from random uniform to a vector of ones. Unfortunately, I don’t think this was documented anywhere and, for reasons I don’t really understand, had a huge impact on convergence for the model I was training. The good news is that I can initialize those weights myself and recover the previous behavior.
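For anyone hitting the same issue, here is a minimal sketch of recovering the old behavior, shown in Python (assuming, as described above, that the old init drew gamma from a random uniform distribution; the libtorch analogue should be torch::nn::init::uniform_ applied to the module's weight tensor):

```python
import torch

bn = torch.nn.BatchNorm2d(3)
# Since the init change, gamma (bn.weight) starts as all ones
assert torch.all(bn.weight == 1.0).item()

# Re-draw gamma from U(0, 1) to mimic the old initialization
torch.nn.init.uniform_(bn.weight)
assert bn.weight.min().item() >= 0.0 and bn.weight.max().item() <= 1.0
```

Note that torch.nn.init.uniform_ modifies the tensor in place and already suppresses gradient tracking internally, so it can be called right after construction.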

Anyway, thanks for your help!

The change of the gamma initialization landed in the 1.2 release and was mentioned in the release notes.
Were you using a 1.2-pre-release?

Hi, I’m pretty sure I wasn’t using a pre-release version (the build-version file shows 1.2.0+cpu).
Assuming I’m not mistaken there, libtorch 1.2.0 doesn’t even seem to have a BatchNorm2d class. Are you sure you’re not talking about Python?

PyTorch 1.2.0 changed the weight init behavior of BatchNorm2d in Python (with reasons documented in https://github.com/pytorch/pytorch/issues/12259). BatchNorm2d in libtorch is mirroring how BatchNorm2d in Python behaves currently.

Thanks for the reference! Besides matching Python, initializing the weights to all ones intuitively seems like the right thing to do, although, technically, this is not suggested in the original paper, and I haven’t seen any study of the effects of BatchNorm weight initialization on convergence/performance. I’m still puzzled by the huge negative impact this had on my results!