What is the reason behind this?

Hi Vainaijr!

The short answer is that we don’t need a mean of 0 and a standard deviation of 1.

I assume that you are talking about “targets” – the output values you are training with and that you will then be using the trained network to predict.

There can be some benefits to “normalizing” your targets – rescaling your targets so that their mean and standard deviation are roughly 0 and 1, respectively – but this is in no way necessary.

(My comments apply equally well to the question of normalizing inputs, but let me speak in terms of targets for simplicity.)
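For concreteness, here is a minimal sketch of what such a rescaling looks like – the sample values are made up purely for illustration:

```python
# Rescale a list of targets to mean 0 and standard deviation 1.
# pstdev is the population standard deviation from the standard library.
from statistics import mean, pstdev

targets = [12.0, 15.0, 9.0, 18.0, 6.0]   # made-up example targets
mu, sigma = mean(targets), pstdev(targets)
normalized = [(t - mu) / sigma for t in targets]

# normalized now has mean 0 and standard deviation 1 (up to float error).
```

At prediction time you would map the network’s outputs back to the original scale with `pred * sigma + mu`.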

If your targets are *very* large, you could run into overflow problems (infs and NaNs) when training.
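As a tiny illustration of the overflow risk – the magnitudes here are chosen just for demonstration:

```python
import numpy as np

# Squaring a very large error in the loss can overflow:
# float32's maximum value is about 3.4e38.
err = np.float32(1e20)   # a "very large" prediction error
sq = err * err           # 1e40 overflows float32 to inf
print(np.isinf(sq))      # the inf then propagates into the loss
```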

If you start out with a sensible random initialization of your network weights (and biases), your network won’t start out knowing anything about the mean and standard deviation of your targets, and will (sort of) start out tending to predict values with a mean of roughly 0 and a standard deviation of roughly 1. So it will have to learn your targets’ mean and standard deviation, but this isn’t a big deal.

If you change the scale of your targets, you will often be changing the scale of your loss function as a result. For example, if you double your targets, you will, in effect, multiply the mean-squared-error loss function by four. This will then, in effect, multiply your learning rate by four.
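You can check this factor-of-four effect directly with a handwritten mean-squared-error (the numbers are arbitrary; doubling the targets means the trained network’s matching predictions double as well):

```python
# Handwritten mean-squared-error for illustration.
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

preds = [1.0, 2.0, 3.0]
targets = [1.5, 1.5, 2.0]

base = mse(preds, targets)
# Doubling the targets (and the predictions that fit them) doubles every
# error, so every squared error -- and the loss -- is multiplied by four.
doubled = mse([2 * p for p in preds], [2 * t for t in targets])
# doubled == 4 * base
```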

But (with the exception of overflow for extreme values) these are minor issues and won’t significantly affect the training or predictive performance of your network.

You can see this for yourself. Generate some multivariate regression data (with a little noise in it). Your targets don’t have to be Gaussian, but don’t make them too wacky. Rescale the targets twice – once to a mean and standard deviation of 0 and 1, and again to something else, say a mean of -5 and a standard deviation of 10. Train a simple network – say, one with a single hidden layer and a mean-squared-error loss function – on both the normalized and unnormalized datasets. You should get very similar results, with (perhaps) slightly slower initial training for the unnormalized data as the network learns the overall scale of the data.

The biggest difference will be that by scaling up the standard deviation by a factor of ten you will have, in effect, multiplied your learning rate by one hundred. So reducing your learning rate for the unnormalized data will make the two training runs even more similar.
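Here is one way the experiment above might be sketched in plain numpy – the architecture, data sizes, and learning rates are illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic multivariate regression data with a little noise.
X = rng.normal(size=(256, 4))
w_true = rng.normal(size=4)
y_raw = X @ w_true + 0.1 * rng.normal(size=256)

# Rescale the same targets twice.
y_norm = (y_raw - y_raw.mean()) / y_raw.std()   # mean 0, std 1
y_wide = 10.0 * y_norm - 5.0                    # mean -5, std 10

def train(y, lr, steps=8000, hidden=16, seed=1):
    """Single-hidden-layer tanh network, MSE loss, plain gradient descent."""
    r = np.random.default_rng(seed)
    W1 = r.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = r.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    n = len(y)
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)                # forward pass
        err = (h @ W2 + b2).ravel() - y
        g_out = (2.0 / n) * err[:, None]        # d(MSE)/d(prediction)
        g_h = (g_out @ W2.T) * (1.0 - h ** 2)   # backprop through tanh
        W2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    pred = (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()
    return np.mean((pred - y) ** 2)

loss_norm = train(y_norm, lr=0.02)
# A std of 10 scales the MSE (and so the effective learning rate) by 100,
# so divide the learning rate by 100 to compensate.
loss_wide = train(y_wide, lr=0.02 / 100)

# Both runs fit well relative to their targets' variance (1 and 100).
print(loss_norm, loss_wide)
```

The unnormalized run typically lags a little early on, as it has to grow its output-layer weights (and bias) to match the larger scale and shifted mean, which is the effect described above.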

Have fun!

K. Frank