We insert batch normalization layers between the layers of CNNs to reduce internal covariate shift, as per my understanding of this paper. However, in Section 3 there's a part that says:
normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation, a pair of parameters γ and β, which scale and shift the normalized value.
What does "linear regime of the nonlinearity" mean, and how do scaling and shifting help?
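For context, here is how I understand the transform the quote describes, sketched in NumPy (training-time batch statistics only; the `eps` value and shapes are my own assumptions, not from the paper). The point seems to be that with γ set to the batch standard deviation and β to the batch mean, the layer reduces to the identity:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch of shape (batch, features)."""
    # Normalize each feature to zero mean, unit variance over the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale and shift: gamma = sqrt(var), beta = mean
    # recovers (approximately) the identity transform.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(64, 4))
y = batch_norm(x, gamma=x.std(axis=0), beta=x.mean(axis=0))
print(np.allclose(y, x, atol=1e-3))  # True: identity is representable
```

So if I read this right, γ and β let the network undo the normalization when that is what minimizes the loss, rather than being forced to keep the sigmoid inputs near zero?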