Varying number of inputs

Hello everyone,
I have a question about machine/deep learning in general rather than about PyTorch itself, but since I am using the amazing PyTorch library, I hope one of you can help.

I am trying to use GNSS observation samples (distance, Signal/Noise, etc.), which leads to the following data structure:

  • K epochs
  • N satellites (4~10 per constellation) are tracked at epoch k
  • F frequencies (1 to 3) are tracked for satellite n
  • observations {distance, Signal/Noise} per frequency f

I am trying to perform a per-satellite regression, so the problem is how to handle a varying number of tracked frequencies. I believe the loss of a frequency also carries important information about a potential threat, so I would really like to include this information (at least the Signal/Noise value, as it is valuable) instead of discarding it.

I know that common ways of handling missing data are:

  • removing data entries if something is missing
  • estimating the missing values via mean, regression, NN, etc.

But these approaches are not applicable: I would like to use these samples, and I do not know the underlying process that creates the observations.

Do you have any ideas or thoughts, or are you aware of a technique for handling this issue, especially for defining a fixed number of input neurons when the amount of actual data may vary?

Should one create an ensemble of networks, one for each possible case?
Would it be reasonable to model the missing Signal/Noise (NaN, to be precise) as 0?

Thank you very much in advance
Best
Nerolf

I would advise against using NaN values to represent missing data as they will likely break the implementation of many layers (e.g., consider what happens when you do a weighted sum or reduction in a fully-connected layer and one of the values is NaN).
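A minimal sketch of that failure mode, assuming a hypothetical three-input linear layer and made-up Signal/Noise values (neither comes from the original question):

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 2)  # hypothetical layer: 3 Signal/Noise inputs, 2 outputs

# Made-up Signal/Noise values; the third frequency was lost and is stored as NaN.
snr = torch.tensor([[42.0, 38.5, float("nan")]])
out = layer(snr)

# Every output is a weighted sum that includes the NaN, so every output is NaN.
print(torch.isnan(out).all().item())  # True

# Replacing the NaN with a fixed placeholder before the layer keeps outputs finite.
clean_out = layer(torch.nan_to_num(snr, nan=0.0))
print(torch.isfinite(clean_out).all().item())  # True
```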

In general, the model should be able to “learn around” the missing data, provided that the method used to indicate that it is missing is consistent. It could be as simple as padding missing values with zero, adding another scalar value (e.g., 1 or 0) for each frequency to indicate whether the frequency value is valid, etc.
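One way this encoding might look, as a sketch only: `MAX_FREQS`, `encode_satellite`, and the example values are hypothetical names and numbers chosen for illustration, and the inputs are assumed to be normalized already.

```python
import torch

MAX_FREQS = 3  # assumption: at most three frequencies per satellite


def encode_satellite(obs):
    """Turn per-frequency observations into a fixed-length input vector.

    obs has MAX_FREQS entries, each either a (distance, snr) tuple or
    None for a frequency that is not tracked.
    """
    values = torch.zeros(MAX_FREQS, 2)  # missing entries stay zero-padded
    valid = torch.zeros(MAX_FREQS)      # 1.0 = tracked, 0.0 = missing
    for f, o in enumerate(obs):
        if o is not None:
            values[f] = torch.tensor(o)
            valid[f] = 1.0
    # Layout per frequency: [distance, snr, valid]
    return torch.cat([values, valid.unsqueeze(1)], dim=1).flatten()


# Hypothetical satellite with the second frequency lost.
x = encode_satellite([(0.3, 45.0), None, (0.3, 39.0)])
print(x.shape)  # torch.Size([9])
```

The input size stays at `3 * MAX_FREQS` no matter how many frequencies are tracked, and the flag lets the network tell a padded zero apart from a measured value.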

Thank you for your fast reply @eqy

Yes, using “0” to indicate missing values seems to be a good idea, as zero cannot arise as a physical observation (whereas “1” might be possible).

I was also thinking about creating a feature ‘ratio of tracked frequencies’, i.e. the number of tracked frequencies relative to the number of possible frequencies, but this might be learned by the network itself.
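For reference, such a ratio could be computed directly from per-frequency validity flags. This is a small sketch with made-up flags for three satellites:

```python
import torch

# valid[n, f] = 1.0 if frequency f of satellite n is tracked (made-up values).
valid = torch.tensor([[1.0, 1.0, 1.0],
                      [1.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0]])

# Tracked frequencies divided by possible frequencies, one value per satellite.
ratio = valid.mean(dim=1)
print(ratio)
```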

A few thoughts:

  • To my mind, the most typical thing people do is to provide a mask that indicates which values are present.
  • The bad news is that you need to adapt your processing to consider the mask and that this typically is highly nontrivial.
  • The good news is that transformers and more specifically the attention mechanism they use have more or less established a standard way to use masks. In essence, they use an encoding of positions (of words or image patches) that at the highest level is similar to storing coordinates with entries rather than having a pre-defined grid of entries (similar to how (COO) sparse matrices work compared to dense ones).
  • As training a standard transformer architecture from scratch for your task may be prohibitive (in terms of data and compute requirements), you could either look at using a pre-trained transformer if you feel your problem domain is similar enough to them or you could use the attention mechanism inside some other (smaller) architecture.
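To make the last point concrete, here is a rough sketch of attention inside a small model, using `torch.nn.MultiheadAttention` with its `key_padding_mask` argument. The class name `FreqAttentionRegressor`, the layer sizes, and the idea of treating each frequency as one token with a learned frequency embedding are assumptions for illustration, not a recommended architecture.

```python
import torch
from torch import nn


class FreqAttentionRegressor(nn.Module):
    """Hypothetical per-satellite regressor: one token per frequency."""

    def __init__(self, n_freqs=3, n_features=2, dim=16, heads=2):
        super().__init__()
        self.embed = nn.Linear(n_features, dim)
        self.freq_id = nn.Embedding(n_freqs, dim)  # plays the role of a position encoding
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, obs, valid):
        # obs: (batch, n_freqs, n_features); valid: (batch, n_freqs) bool.
        # Assumes at least one valid frequency per satellite.
        ids = torch.arange(obs.shape[1], device=obs.device)
        tokens = self.embed(obs) + self.freq_id(ids)
        # key_padding_mask is True where a key should be ignored.
        out, weights = self.attn(tokens, tokens, tokens, key_padding_mask=~valid)
        # Average only over the valid frequencies.
        v = valid.unsqueeze(-1).to(out.dtype)
        pooled = (out * v).sum(dim=1) / v.sum(dim=1).clamp(min=1.0)
        return self.head(pooled).squeeze(-1), weights


torch.manual_seed(0)
model = FreqAttentionRegressor()
valid = torch.tensor([[True, True, True], [True, False, True]])
obs = torch.randn(2, 3, 2).masked_fill(~valid.unsqueeze(-1), 0.0)
pred, weights = model(obs, valid)
print(pred.shape, weights.shape)  # torch.Size([2]) torch.Size([2, 3, 3])
```

In this sketch the prediction does not depend on whatever placeholder sits in a masked slot, because masked keys get attention weight 0 and masked positions are left out of the pooling.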

Best regards

Thomas

Thanks a lot for your idea @tom,

As far as I understand the masking approach (please correct me if I got something wrong), it can be compared to a user-defined dropout in the first layer and can be summarized as zero padding wherever specific features are missing?

Do you know of any documentation or project where this approach is discussed more thoroughly? It sounds interesting. What is the name of this masking technique?

Best
Nerolf

The difference from dropout is that dropout replaces things with 0, while the attention mechanism (in a gross oversimplification) computes weighted averages of inputs at various locations, and a mask causes the masked locations to have weight 0.
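A tiny numeric sketch of that difference, with made-up Signal/Noise values and equal attention scores so that the weighted average reduces to a plain mean (dropout's rescaling is ignored here):

```python
import torch

values = torch.tensor([45.0, 0.0, 39.0])   # made-up Signal/Noise, missing one padded with 0
valid = torch.tensor([True, False, True])
scores = torch.zeros(3)                    # hypothetical attention scores, all equal

# Zeroing only: the padded 0 is averaged in as if it were a real observation.
naive = torch.softmax(scores, dim=0) @ values   # (45 + 0 + 39) / 3

# Masking: a score of -inf gives the masked location a softmax weight of exactly 0.
masked_scores = scores.masked_fill(~valid, float("-inf"))
weights = torch.softmax(masked_scores, dim=0)   # [0.5, 0.0, 0.5]
masked = weights @ values                       # (45 + 39) / 2

print(naive.item(), masked.item())
```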

I would not know a reference for masking in particular, but the “Annotated Transformer” has a short section on masking and is a very nice reference.
http://nlp.seas.harvard.edu/annotated-transformer/#batches-and-masking

(But it’s an interesting point, maybe masking should take a larger role in discussing data representation.)

Also, BERT uses masking and predicting the values at the masked positions for training, so if your missing values are benign (i.e. randomly missing, with conceptually the same distribution as the non-missing values), you could drop additional inputs and train a transformer-like model to predict the inputs that you yourself dropped.
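The data side of such a scheme might look roughly like this. The helper names `hide_random_inputs` and `reconstruction_loss`, the hiding probability, and the random example data are all hypothetical, and the sketch only makes sense under the benign-missingness assumption above.

```python
import torch


def hide_random_inputs(obs, valid, p=0.15, generator=None):
    """Hide some of the present inputs so a model can be trained to predict them.

    obs: (batch, n_freqs, n_features); valid: (batch, n_freqs) bool.
    Note: with bad luck a sample can end up with no visible input at all.
    """
    hide = (torch.rand(valid.shape, generator=generator) < p) & valid
    visible = valid & ~hide
    corrupted = obs.masked_fill(~visible.unsqueeze(-1), 0.0)
    return corrupted, visible, hide


def reconstruction_loss(pred, obs, hide):
    """Mean squared error, evaluated only at positions we hid ourselves."""
    sel = hide.unsqueeze(-1).expand_as(obs)
    if not sel.any():
        return pred.sum() * 0.0
    return ((pred - obs)[sel] ** 2).mean()


g = torch.Generator().manual_seed(0)
obs = torch.randn(4, 3, 2, generator=g)  # random stand-in for real observations
valid = torch.tensor([[True, True, True],
                      [True, False, True],
                      [True, True, False],
                      [True, False, False]])
corrupted, visible, hide = hide_random_inputs(obs, valid, p=0.5, generator=g)
```

A model would then receive `corrupted` together with `visible` as its mask, and its output would be scored with `reconstruction_loss` against the original `obs`.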

Best regards

Thomas