BatchNorm2d essential?

This is a snippet from my network code:

conv_block += [nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(out_dim), nn.ELU()]

with BatchNorm2d. The network gave good results for the first few training epochs, but after that it became unstable.

During debugging, I removed BatchNorm2d from the network to analyze its effect:

conv_block += [nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=1, padding=1), nn.ELU()]

but the results were very bad and not even comparable to the version with BN.

Why is it so?

BatchNorm layers normalize the input activations, which accelerates learning (or makes it feasible at all).
The original paper claims this is due to the reduction of internal covariate shift.
More recent papers argue that this explanation is inaccurate and offer a different perspective on why BatchNorm helps.
Have a look at the original paper for more information.
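The core normalization step can be sketched in a few lines. This is a minimal numpy illustration of the per-channel statistics BatchNorm2d computes at training time; it omits the learnable affine parameters (gamma, beta) and the running statistics used at eval time:

```python
import numpy as np

np.random.seed(0)

# Toy activations shaped like a conv feature map: (batch, channels, height, width)
x = np.random.randn(8, 4, 5, 5) * 3.0 + 2.0  # deliberately off-center and scaled

# Per-channel mean and variance over the batch and spatial dimensions
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)

# Normalize; eps avoids division by zero for near-constant channels
eps = 1e-5
x_hat = (x - mean) / np.sqrt(var + eps)

# After normalization, each channel has roughly zero mean and unit variance,
# regardless of how the preceding conv layer scaled or shifted its outputs.
print(x_hat.mean(axis=(0, 2, 3)))
print(x_hat.std(axis=(0, 2, 3)))
```

Without this step, the scale of activations flowing into ELU depends entirely on the conv weights, which is one reason removing BN can make the same architecture much harder to train.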