We insert batch normalization layers between the layers of CNNs to reduce internal covariate shift, as per my understanding of this paper. However, in Section 3 there's a part that says:
normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation, a pair of parameters γ and β, which scale and shift the normalized value.
What does "linear regime of the nonlinearity" mean, and how do scaling and shifting help?
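For context, here is how I understand the transform the quote describes, sketched in NumPy (training-time batch statistics only; the `eps` value and shapes are my own assumptions, not from the paper). The point seems to be that with γ set to the batch standard deviation and β to the batch mean, the layer reduces to the identity:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch of shape (batch, features)."""
    # Normalize each feature to zero mean, unit variance over the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale and shift: gamma = sqrt(var), beta = mean
    # recovers (approximately) the identity transform.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(64, 4))
y = batch_norm(x, gamma=x.std(axis=0), beta=x.mean(axis=0))
print(np.allclose(y, x, atol=1e-3))  # True: identity is representable
```

So if I read this right, γ and β let the network undo the normalization when that is what minimizes the loss, rather than being forced to keep the sigmoid inputs near zero?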