Handling missing data when working with sequences with RNNs: pack and padding

ekingedik · January 30, 2020, 1:55pm

Hello,

I work with time-series sequence data. Hypothetically, my problem can be framed as follows:

- I will have N temporally aligned sequences in each forward pass.
- Each of these sequences will be fed into an LSTM, and the last outputs of the LSTMs will be concatenated to form a tensor of size batch_size*(N*hidden_dimension).
- The resulting tensor will be fed into a linear layer for prediction.
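To make the setup concrete, here is a minimal sketch of the architecture described above. The class name `MultiStreamLSTM`, the use of one separate LSTM per sequence, and all sizes are assumptions made for illustration only, not my actual model.

```python
import torch
import torch.nn as nn


class MultiStreamLSTM(nn.Module):
    """Hypothetical model: N LSTMs whose last outputs are concatenated."""

    def __init__(self, n_streams, input_dim, hidden_dim, out_dim):
        super().__init__()
        # Assumption: one LSTM per stream (they could also share weights).
        self.lstms = nn.ModuleList(
            [nn.LSTM(input_dim, hidden_dim, batch_first=True) for _ in range(n_streams)]
        )
        self.linear = nn.Linear(n_streams * hidden_dim, out_dim)

    def forward(self, streams):
        # streams: list of N tensors, each of shape (batch, time, input_dim)
        last_outputs = []
        for lstm, x in zip(self.lstms, streams):
            out, _ = lstm(x)                    # (batch, time, hidden_dim)
            last_outputs.append(out[:, -1, :])  # output at the last time step
        features = torch.cat(last_outputs, dim=1)  # (batch, N * hidden_dim)
        return self.linear(features)


torch.manual_seed(0)
model = MultiStreamLSTM(n_streams=3, input_dim=4, hidden_dim=8, out_dim=2)
streams = [torch.randn(5, 10, 4) for _ in range(3)]  # batch=5, time=10
prediction = model(streams)
```

The linear layer is built for exactly N * hidden_dimension input features, which is why a missing sequence is a problem for this design.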

The problem is that some of these N sequences may be randomly missing, so the final tensor ends up with a different dimension (for example, if one sequence is missing, it will be batch_size*((N-1)*hidden_dimension)).

My first attempt at fixing this was to use pad_sequence and pack_padded_sequence. Basically, when a sequence is missing, I created an empty tensor and padded it to match the lengths of the other sequences, guaranteeing N sequences in each pass, albeit with some of them empty. This way, I expected the empty sequences not to contribute to the loss. However, pack_padded_sequence raises the following error when a length of 0 is passed to it:
“Length of all samples has to be greater than 0, but found an element in ‘lengths’ that is <= 0”
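A minimal sketch that should reproduce the error, assuming a toy batch of three padded sequences where the last one is entirely padding (the shapes and values are made up for illustration):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Toy padded batch: (batch=3, max_time=5, features=3).
padded = torch.randn(3, 5, 3)
padded[1, 2:] = 0.0   # second sequence has 2 valid steps
padded[2] = 0.0       # third sequence is "missing": padding only
lengths = torch.tensor([5, 2, 0])

try:
    pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
    error_message = None
except RuntimeError as exc:
    # The zero entry in `lengths` is rejected.
    error_message = str(exc)

print(error_message)
```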

Is there a way to modify pack_padded_sequence so that it works with zero lengths? The other possibilities I can think of are as follows:

- Masking hidden units when there is no data: Is there a way to mask the part of the final linear layer's input that the missing sequence would normally fill, similar to dropout?
- Padding the empty streams with a value that would not affect training: I'm not sure filling the empty streams with zeros will work, because zero is a valid value that already occurs in my real sequences. I'm also not sure that filling them with a value that never occurs in my dataset would guarantee that they don't affect training.
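For the masking idea, this is roughly what I have in mind. It is only a sketch under my own assumptions, not a confirmed solution: the helper `encode_stream` is hypothetical, zero lengths are clamped to 1 just so that pack_padded_sequence accepts them, and the resulting hidden state is multiplied by a 0/1 mask so that missing samples contribute an all-zero feature block.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


def encode_stream(lstm, padded, lengths):
    """Hypothetical helper: last hidden state per sample, zeroed where missing.

    padded:  (batch, max_time, input_dim) padded inputs
    lengths: (batch,) true lengths, where 0 marks a missing sequence
    """
    present = lengths > 0
    # Clamp so that packing does not fail on zero-length entries.
    safe_lengths = lengths.clamp(min=1).cpu()
    packed = pack_padded_sequence(
        padded, safe_lengths, batch_first=True, enforce_sorted=False
    )
    _, (h_n, _) = lstm(packed)
    last_hidden = h_n[-1]  # (batch, hidden_dim), in the original batch order
    # Zero out rows of missing sequences; no gradient flows back through them.
    return last_hidden * present.unsqueeze(1).to(last_hidden.dtype)


torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=4, batch_first=True)
padded = torch.randn(3, 5, 3)
padded[2] = 0.0                      # third sample is missing
lengths = torch.tensor([5, 2, 0])
features = encode_stream(lstm, padded, lengths)
```

Concatenating N such blocks would keep the input of the linear layer at batch_size*(N*hidden_dimension). What I can't tell is whether an all-zero block is a reasonable stand-in for "no data", since the linear layer's bias and the other streams still produce a prediction.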

What will be the best way to handle this? Thanks in advance!