Incomplete training dataset of NN

srl123 · November 30, 2020, 10:59am

Hi all,
I have a (huge) theoretical question about a regression problem with neural networks: what approach should I follow if my training dataset is incomplete and/or my measurements are affected by strong biases?

Let’s assume I want my net to predict the variable Y given the variables X, W, Z. Unfortunately, I know that Y also depends on another variable, K, for which I have no records. Of course the training dataset is essential to produce an “optimal” net, but I am wondering whether there is any approach that, for example, includes a stochastic variable to account for this lack of data.

Related to the previous question: what if my training dataset is not exact? How can I account for the uncertainty in my measurements when training my net?

Thank you!

qmeeus · November 30, 2020, 11:19am

Assuming that K is related to X, W, Z (i.e. K is not independent of the observed data D = (X, W, Z), so P(K | D) ≠ P(K)), then it should not be a problem:

P(Y, K, D) = P(Y | K, D) * P(K, D) = P(Y | K, D) * P(K | D) * P(D)

where the first and second terms on the right-most side of the equation are estimated implicitly by your model and K is a hidden variable. Summing over K gives P(Y | D) = sum over K of P(Y | K, D) * P(K | D), which is the quantity the network approximates.

If the K variable cannot be predicted from the data (i.e. P(K, D) = P(K) * P(D) or equivalently P(K | D) = P(K)), then your model will be missing a variable. That does not mean it cannot predict anything useful, but rather that it does not have all the information needed to predict your target variable correctly.
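
To make the two cases concrete, here is a minimal simulation sketch. It uses a plain linear least-squares fit as a stand-in for the network, and all coefficients, noise levels and the function name `residual_rmse` are made up for illustration. It compares the residual error when the hidden K is almost determined by the inputs with the case where K is independent of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # number of synthetic samples (arbitrary)

def residual_rmse(k_depends_on_inputs):
    """Fit Y from the observed inputs only and return the residual RMSE."""
    # Observed inputs X, W, Z stacked as columns of D
    D = rng.normal(size=(n, 3))
    if k_depends_on_inputs:
        # K is almost fully determined by the inputs (hypothetical relation)
        K = D @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)
    else:
        # K is independent of the inputs: P(K | D) = P(K)
        K = rng.normal(size=n)
    # The target depends on the inputs AND on the unrecorded K
    Y = D @ np.array([2.0, 1.0, -1.0]) + 3.0 * K
    # The model never sees K: regress Y on D plus an intercept
    A = np.column_stack([D, np.ones(n)])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

rmse_dependent = residual_rmse(True)     # small: K is recoverable from D
rmse_independent = residual_rmse(False)  # large: K acts as irreducible noise
print(rmse_dependent, rmse_independent)
```

In the independent case the error floor is set by the spread of K itself, and no amount of training data or model capacity removes it.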

srl123 · November 30, 2020, 1:49pm

Correct, thank you for your answer. So you’re saying: train with what you have. And what about the uncertainty related to the input measurements?

qmeeus · November 30, 2020, 2:02pm

There are methods to account for uncertainty in the training data. If you can quantify or estimate it, then you can model the error, but this has more to do with statistics and statistical analysis in general than with neural networks (I mean, the answer with DNNs will not be different from the answer for any other predictive model).
Also, you should know whether it’s your inputs, your outputs or both that are biased. Unfortunately, if your outputs are biased, then your model will be biased. If your inputs are biased, then your model might learn to cope with that (to some extent of course: the expression “garbage in, garbage out” still holds).
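
As a rough illustration of both points, the sketch below assumes that each target comes with a known noise standard deviation `sigma` and that the targets carry a constant systematic `offset`. Both are hypothetical choices for this example, and a linear fit stands in for the network. Weighting each sample by 1/sigma uses the known uncertainty, but the offset in the targets still ends up in the fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

x = rng.uniform(-1.0, 1.0, size=n)
y_true = 2.0 * x + 1.0                  # hypothetical ground truth

offset = 0.5                            # assumed systematic bias in the measured targets
sigma = rng.uniform(0.1, 1.0, size=n)   # assumed known per-sample noise std
y_meas = y_true + offset + sigma * rng.normal(size=n)

# Weighted least squares: scaling each row by 1/sigma makes precise
# measurements count more than noisy ones.
A = np.column_stack([x, np.ones(n)])
w = 1.0 / sigma
slope, intercept = np.linalg.lstsq(A * w[:, None], y_meas * w, rcond=None)[0]

# The slope is recovered, but the intercept absorbs the target bias
# (it lands near 1.0 + offset rather than near 1.0).
print(slope, intercept)
```

The same weighting idea carries over to a network by scaling each sample's squared error by 1/sigma**2 in the loss; the bias in the targets can only be removed if it is measured or modelled separately.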

srl123 · November 30, 2020, 3:07pm

Can you provide the names of some of these statistical analyses you’re talking about? Thanks

qmeeus · November 30, 2020, 3:15pm

Here is one example that can give you a good starting point: https://en.wikipedia.org/wiki/Errors-in-variables_models
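
For a feel of what errors-in-variables models address, here is a small sketch of the simplest case: one noisy input, a linear relation, and an input noise standard deviation `noise_std` that is assumed to be known. All numbers are invented for the example. Noise in the input pulls the naive regression slope toward zero (attenuation), and a method-of-moments correction subtracts the known noise variance to undo it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x_true = rng.normal(0.0, 1.0, size=n)            # unobserved clean input
noise_std = 0.5                                  # assumed known measurement noise
x_obs = x_true + noise_std * rng.normal(size=n)  # what is actually recorded
y = 3.0 * x_true + 0.1 * rng.normal(size=n)      # hypothetical true slope of 3

cov_xy = np.cov(x_obs, y)[0, 1]
var_x_obs = np.var(x_obs, ddof=1)

# Ordinary regression on the noisy input: slope shrinks by the factor
# var(x_true) / (var(x_true) + noise_std**2) = 1 / 1.25 here.
naive_slope = cov_xy / var_x_obs

# Method-of-moments correction: remove the known noise variance
# from the denominator.
corrected_slope = cov_xy / (var_x_obs - noise_std ** 2)

print(naive_slope, corrected_slope)
```

This correction only works when the noise variance is known or can be estimated, for instance from repeated measurements of the same quantity.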