What is the reason behind this?
The short answer is that we don’t need a mean of 0 and a standard
deviation of 1.
I assume that you are talking about “targets” – the output values you
are using to train with and that the trained network will then be used
to predict.
There can be some benefits to “normalizing” your targets – rescaling
your targets so that their mean and standard deviation are roughly
0 and 1, respectively, but this is in no way necessary.
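
To make "normalizing" concrete, here is a minimal sketch in plain Python. The target values and the helper names `normalize` and `denormalize` are made up for illustration; the point is only that you rescale before training and undo the rescaling on whatever the network predicts.

```python
import statistics

# Hypothetical raw targets, chosen only for illustration.
targets = [120.0, 135.0, 150.0, 165.0, 180.0]

mean = statistics.fmean(targets)
std = statistics.pstdev(targets)  # population standard deviation


def normalize(t):
    """Map a raw target onto the normalized scale."""
    return (t - mean) / std


def denormalize(z):
    """Map a normalized value (e.g. a prediction) back to the raw scale."""
    return z * std + mean


normalized = [normalize(t) for t in targets]

# The normalized targets have mean ~0 and standard deviation ~1.
print(statistics.fmean(normalized), statistics.pstdev(normalized))

# Undoing the rescaling recovers the original values.
restored = [denormalize(z) for z in normalized]
print(restored)
```

If you train on `normalized`, remember to pass the network's outputs through `denormalize` before comparing them with raw values.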
(My comments apply equally well to the question of normalizing
inputs, but let me speak in terms of targets for simplicity.)
If your targets are very large, you could run into overflow problems
(NaNs) when training.
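
As a rough illustration of how that can happen, here is a small NumPy sketch. The target value of 1e20 is a deliberately extreme, made-up number, and I am assuming single-precision (float32) arithmetic, which is a common default for network training.

```python
import numpy as np

# Hypothetical extreme target and an initial prediction near zero,
# both stored in single precision.
target = np.float32(1e20)
prediction = np.float32(0.0)

# Silence the overflow / invalid-value warnings so the demo runs quietly.
with np.errstate(over="ignore", invalid="ignore"):
    diff = prediction - target
    squared_error = diff * diff                  # ~1e40 exceeds the float32 maximum (~3.4e38)
    nan_value = squared_error - squared_error    # inf - inf is not a number

print(squared_error)  # inf
print(nan_value)      # nan
```

Once an `inf` like this shows up in the loss, later arithmetic on it can produce NaNs that spread through the gradients.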
If you start out with a sensible random initialization of your network
weights (and biases), your network won’t start out knowing anything
about the mean and standard deviation of your targets, and will
(sort of) start out tending to predict them to be 0 and 1, respectively.
So it will have to learn the mean and standard deviation, but this
isn’t a big deal.
If you change the scale of your targets you will often be changing the
scale of your loss function as a result. For example, if you double
your targets, you will, in effect, multiply the mean-squared-error
loss function by four. Since the gradients scale along with the loss,
this will then, in effect, multiply your learning rate by four.
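
The factor of four is easy to check by hand. This sketch uses made-up numbers and assumes the predictions are doubled along with the targets, which is what a network trained on the doubled targets would be aiming for.

```python
def mse(predictions, targets):
    """Mean-squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical targets and predictions, chosen only for illustration.
targets = [1.0, 2.0, 3.0]
predictions = [1.5, 1.5, 3.5]

loss = mse(predictions, targets)

# Double both the targets and the predictions.
scaled_loss = mse([2 * p for p in predictions], [2 * t for t in targets])

ratio = scaled_loss / loss
print(ratio)  # prints 4.0
```

Every error doubles, so every squared error (and therefore the mean) is multiplied by 2 squared.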
But (with the exception of overflow for extreme values) these are
minor issues and won’t significantly affect the training or predictive
performance of your network.
You can see this for yourself. Generate some multivariate regression
data (with a little noise in it). Your targets don’t have to be a Gaussian
distribution, but don’t make them too wacky. Rescale the targets
twice – once to a mean and standard deviation of 0 and 1, and
again to something else, say a mean of -5 and a standard deviation
of 10. Train a simple network, say a single hidden layer and a
mean-squared-error loss function. Train on both the normalized
and unnormalized datasets. You should get very similar results,
with (perhaps) slightly slower initial training for the unnormalized
data as the network learns the overall scale of the data.
The biggest difference will be that by scaling up the standard
deviation by a factor of ten you will have, in effect, multiplied
your learning rate by one hundred. So reducing your learning
rate for the unnormalized data will make the two training runs
even more similar.
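
Here is one way the experiment could be set up, as a sketch in plain NumPy rather than any particular framework. Everything specific in it is my own assumption: the toy data, the helper names `make_data`, `rescale`, and `train`, the tanh hidden layer of 8 units, the full-batch gradient descent, the step count, and the base learning rate. I have not tuned any of it, so treat the printed numbers as something to explore, not as a benchmark.

```python
import numpy as np


def make_data(n=256, d=3, seed=0):
    """Noisy, mildly nonlinear regression data (hypothetical toy problem)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    w_true = rng.normal(size=(d, 1))
    y = np.tanh(x @ w_true) + 0.05 * rng.normal(size=(n, 1))
    return x, y


def rescale(y, mean, std):
    """Shift and scale y to the requested mean and standard deviation."""
    return (y - y.mean()) / y.std() * std + mean


def train(x, y, lr, steps=3000, hidden=8, seed=1):
    """Full-batch gradient descent on MSE with one tanh hidden layer.

    Returns the final RMSE divided by the target standard deviation,
    so that runs on differently scaled targets can be compared.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(scale=1 / np.sqrt(d), size=(d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=1 / np.sqrt(hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(steps):
        # Forward pass.
        h = np.tanh(x @ w1 + b1)
        pred = h @ w2 + b2
        # Backward pass for the mean-squared-error loss.
        grad_pred = 2 * (pred - y) / n
        grad_w2 = h.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_z = (grad_pred @ w2.T) * (1 - h ** 2)
        grad_w1 = x.T @ grad_z
        grad_b1 = grad_z.sum(axis=0)
        # Plain gradient-descent update.
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2
    pred = np.tanh(x @ w1 + b1) @ w2 + b2
    return float(np.sqrt(np.mean((pred - y) ** 2)) / y.std())


x, y = make_data()
y_norm = rescale(y, 0.0, 1.0)    # mean 0, standard deviation 1
y_raw = rescale(y, -5.0, 10.0)   # mean -5, standard deviation 10

base_lr = 0.02  # assumed value, not tuned
err_norm = train(x, y_norm, lr=base_lr)
# Standard deviation is 10x larger, so divide the learning rate by 100.
err_raw = train(x, y_raw, lr=base_lr / 100)

print("relative RMSE, normalized targets:  ", err_norm)
print("relative RMSE, unnormalized targets:", err_raw)
```

How close the two numbers come out will depend on the learning rates, step count, and seed chosen here, so it is worth varying them, and logging the loss during training if you want to see the early-training behavior described above.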