How to deal with the labels having drastically different values?

Capo_Mestre · March 31, 2021, 12:23pm

Hello,

I am dealing with a regression problem where inputs are images 98x98, and outputs are vectors of 16 elements.

Some examples of the outputs are:

[12023, 0.1, 2.0, 11982, 0.8, 1.2, 0.3, 0.9, 1.9, 1.1, 0.4, 0.5, 1.0, 0.9, 0.9, 1.7]
[11975, 0.6, 2.1, 11145, 0.4, 1.1, 0.9, 0.2, 1.3, 1.6, 0.1, 0.4, 1.5, 0.4, 0.8, 1.0]
etc.

As you can see, the first and the fourth elements are several orders of magnitude larger than the rest of the elements.

The question is whether this is going to affect the learning process negatively, and whether the labels need to be preprocessed somehow (e.g. normalized, or something else).

omarfoq · March 31, 2021, 2:59pm

Hi,

This will affect your training if you simply use MSE, because the loss will be dominated by the first and the fourth dimensions and the model will neglect all the others. The easy solution is to normalize your outputs along each dimension.
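To make the point about MSE concrete, here is a minimal sketch in plain Python. It takes the first label vector from the question and a hypothetical prediction that is off by 1% on every element (that prediction is an assumption for illustration, not something from the thread), then checks how much of the squared error comes from the two large elements.

```python
# First label vector copied from the question above.
target = [12023, 0.1, 2.0, 11982, 0.8, 1.2, 0.3, 0.9,
          1.9, 1.1, 0.4, 0.5, 1.0, 0.9, 0.9, 1.7]

# Hypothetical prediction: every output is off by 1% of its true value.
prediction = [t * 1.01 for t in target]

# Squared error of each output dimension, and the resulting MSE.
per_dim = [(t - p) ** 2 for t, p in zip(target, prediction)]
mse = sum(per_dim) / len(per_dim)

# Fraction of the total squared error that comes from the first and
# the fourth elements (indices 0 and 3).
large_share = (per_dim[0] + per_dim[3]) / sum(per_dim)

print(f"MSE: {mse:.2f}")
print(f"share of the loss from the two large dimensions: {large_share:.8f}")
```

Even though the relative error is the same on every element, almost all of the loss (and therefore of the gradient signal) comes from the two large ones.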

Capo_Mestre · March 31, 2021, 4:31pm

Thanks! I will try that!

Capo_Mestre · April 6, 2021, 1:46pm

Hi @omarfoq ,
What would be better:

to normalize the labels only across those dimensions that have huge values, or
to normalize labels across all dimensions (separately), so that the labels across dimensions have mean 0 and standard deviation 1?

omarfoq · April 6, 2021, 2:01pm

Hello,

It’s better to standardize all dimensions, because:

You may have some dimension with a scale that is smaller than all the rest, and then it will be neglected.
I believe that standardizing helps during optimization.
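A minimal sketch of what standardizing every dimension separately could look like, in plain Python. The helper names `fit_scaler`, `standardize` and `unstandardize` are made up for this example, and it only uses the two label vectors from the question; in practice the means and standard deviations would presumably be computed over the whole training set.

```python
from statistics import mean, pstdev

# The two label vectors from the question above.
labels = [
    [12023, 0.1, 2.0, 11982, 0.8, 1.2, 0.3, 0.9,
     1.9, 1.1, 0.4, 0.5, 1.0, 0.9, 0.9, 1.7],
    [11975, 0.6, 2.1, 11145, 0.4, 1.1, 0.9, 0.2,
     1.3, 1.6, 0.1, 0.4, 1.5, 0.4, 0.8, 1.0],
]

def fit_scaler(rows, eps=1e-8):
    """Per-dimension mean and standard deviation of the labels."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # eps guards against division by zero for a constant dimension.
    stds = [max(pstdev(c), eps) for c in columns]
    return means, stds

def standardize(row, means, stds):
    """Map raw labels to mean 0 and standard deviation 1 per dimension."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def unstandardize(row, means, stds):
    """Map standardized values (e.g. model outputs) back to the raw scale."""
    return [v * s + m for v, m, s in zip(row, means, stds)]

means, stds = fit_scaler(labels)
scaled = [standardize(r, means, stds) for r in labels]
restored = unstandardize(scaled[0], means, stds)

print([round(v, 3) for v in scaled[0]])
print([round(v, 3) for v in restored])
```

The model is then trained on the standardized labels, and its predictions are passed through `unstandardize` to get values on the original scale. The same `means` and `stds` should be reused for validation and test labels.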