We insert batch normalization layers between the layers of CNNs to reduce internal covariate shift, as per my understanding of this paper. However, in Section 3 there's a part that says:
normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation, a pair of parameters γ and β, which scale and shift the normalized value.
What does "linear regime of the nonlinearity" mean, and how do scaling and shifting help?
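For context, here is how I understand the transform the quote describes, sketched in NumPy (training-time batch statistics only; the `eps` value and shapes are my own assumptions, not from the paper). The point seems to be that with γ set to the batch standard deviation and β to the batch mean, the layer reduces to the identity:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch of shape (batch, features)."""
    # Normalize each feature to zero mean, unit variance over the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable scale and shift: gamma = sqrt(var), beta = mean
    # recovers (approximately) the identity transform.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(64, 4))
y = batch_norm(x, gamma=x.std(axis=0), beta=x.mean(axis=0))
print(np.allclose(y, x, atol=1e-3))  # True: identity is representable
```

So if I read this right, γ and β let the network undo the normalization when that is what minimizes the loss, rather than being forced to keep the sigmoid inputs near zero?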