Handling missing sequences when working with RNNs: packing and padding


I work with time-series sequence data. My problem can be framed roughly as follows:

  • I will have N temporally aligned sequences in each forward pass.
  • Each of these sequences will be fed into an LSTM, and the last outputs of the LSTMs will be concatenated to form a tensor of size batch_size*(N*hidden_dimension).
  • The resulting tensor will be fed into a linear layer for prediction.
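To make the setup concrete, here is a minimal sketch of the architecture described above. The class name MultiStreamLSTM, the use of one separate single-layer LSTM per stream, and all the sizes are assumptions made for illustration, not the actual model.

```python
import torch
import torch.nn as nn


class MultiStreamLSTM(nn.Module):
    """Hypothetical model: one LSTM per stream, last states concatenated."""

    def __init__(self, n_streams, input_dim, hidden_dim, n_outputs):
        super().__init__()
        # Assumption: each stream gets its own single-layer LSTM.
        self.lstms = nn.ModuleList(
            [nn.LSTM(input_dim, hidden_dim, batch_first=True) for _ in range(n_streams)]
        )
        self.head = nn.Linear(n_streams * hidden_dim, n_outputs)

    def forward(self, streams):
        # streams: list of N tensors, each of shape (batch_size, seq_len, input_dim)
        last_states = []
        for lstm, x in zip(self.lstms, streams):
            _, (h_n, _) = lstm(x)        # h_n: (num_layers, batch_size, hidden_dim)
            last_states.append(h_n[-1])  # (batch_size, hidden_dim)
        features = torch.cat(last_states, dim=1)  # (batch_size, N * hidden_dim)
        return self.head(features)


torch.manual_seed(0)
model = MultiStreamLSTM(n_streams=3, input_dim=4, hidden_dim=8, n_outputs=2)
batch = [torch.randn(5, 10, 4) for _ in range(3)]  # N = 3 streams, batch_size = 5
out = model(batch)  # shape: (5, 2)
```

The linear layer's input size is fixed at N*hidden_dimension when the model is built, which is why every forward pass has to supply all N feature blocks.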

The problem is that, at random, some of these N sequences may be missing, so the concatenated tensor ends up with a different dimension (for example, with one sequence missing it will be batch_size*((N-1)*hidden_dimension)) and no longer matches the input size of the linear layer.

My first attempt at fixing this was to use pad_sequence and pack_padded_sequence. Basically, when a sequence is missing, I create an empty tensor and pad it to match the length of the other sequences, which guarantees N sequences in each pass, although some of them are empty. This way, I expected the empty sequences not to contribute to the loss. However, pack_padded_sequence raises the following error when a length of 0 is passed to it:
“Length of all samples has to be greater than 0, but found an element in ‘lengths’ that is <= 0”
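The error above can be reproduced with a few lines. This is a minimal sketch with made-up shapes (three sequences with 3 features each, the last one empty); the exact wording of the message may differ between PyTorch versions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three sequences with 3 features each; the last one is "missing" (length 0).
seqs = [torch.randn(4, 3), torch.randn(2, 3), torch.zeros(0, 3)]
lengths = torch.tensor([len(s) for s in seqs])  # lengths 4, 2, 0
padded = pad_sequence(seqs, batch_first=True)   # shape: (3, 4, 3)

try:
    pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
    error_message = None
except RuntimeError as err:
    # Packing rejects any length <= 0.
    error_message = str(err)

print(error_message)
```

Padding itself accepts the empty sequence; it is only the packing step that refuses the zero length.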

Is there a way to modify pack_padded_sequence so that it works with zero lengths? The other possibilities I can think of are as follows:

  • Masking hidden units when there is no data: Is there a way to mask the part of the final linear layer's input that the missing sequence would normally fill, similar to dropout?
  • Padding the empty streams with a value that does not affect training: I'm not sure whether filling the empty streams with zeros will work, since zero is a valid value that already occurs in my real sequences. I'm also not sure whether filling them with a value that never occurs in my dataset would guarantee that they don't affect training.
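One way to combine the two ideas above is to clamp the zero lengths to 1 so that packing succeeds, and then multiply the resulting last hidden state by a presence mask. The following is only a sketch under that assumption; the helper name masked_last_state and the shapes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


def masked_last_state(lstm, padded, lengths):
    """Hypothetical helper: last hidden state per sample, zeroed where lengths == 0."""
    present = lengths > 0
    # pack_padded_sequence rejects zero lengths, so clamp them to 1; the output
    # computed for those dummy steps is discarded by the mask below.
    safe_lengths = lengths.clamp(min=1).cpu()
    packed = pack_padded_sequence(
        padded, safe_lengths, batch_first=True, enforce_sorted=False
    )
    _, (h_n, _) = lstm(packed)  # h_n: (num_layers, batch_size, hidden_dim)
    mask = present.unsqueeze(1).to(device=h_n.device, dtype=h_n.dtype)
    return h_n[-1] * mask       # rows for missing samples become all zeros


torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=8, batch_first=True)
padded = torch.randn(4, 6, 3)         # batch of 4, padded to length 6
lengths = torch.tensor([6, 0, 3, 0])  # samples 1 and 3 are missing in this stream
features = masked_last_state(lstm, padded, lengths)  # shape: (4, 8)
```

Because the masked rows are exactly zero, they add nothing to the linear layer's output apart from its bias, and no gradient flows back into this stream's LSTM for the missing samples, whatever value was used as padding. Note that the linear layer then cannot tell a missing stream from one whose features happen to be zero, so this is a design choice rather than a guaranteed fix.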

What would be the best way to handle this? Thanks in advance!