Missing values in a training sequence when training an LSTM

farhad-bat · November 19, 2020, 10:05pm

Hello,

I have worked with LSTMs in PyTorch, but I am facing a new challenge.

Assume you have some training sequences, each of which contains 10 time steps, like this:
a0, a1, a2, a3, a4, a5, a6, a7, a8, a9

In this form, we can easily feed these sequences to the LSTM and train it.

Now my problem is this: in some of the sequences, some time steps are not available. For example, the above sequence can look like this:
a0, a1, a2, a3, , a5, a6, , a8, a9

Here, a4 and a7 are not available during training.

How should I handle this?

vdw · November 19, 2020, 11:08pm

It depends on your exact data and task.

In text processing, one often has to deal with “missing” words, i.e., words that are not available in the index (e.g., rare words, typos, named entities). Such missing words are generally represented by a special word or token. For example, “the loudd noise woke me up” becomes “the <unk> noise woke me up”. Maybe such a special “not available” token/value works for you as well.
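For numeric sequences, one way to apply the same idea is to put a placeholder value into each gap and add a second feature that flags the step as missing. The sketch below is only an illustration under assumptions: each time step is a single float, missing steps are marked with `None`, and the names `FILL_VALUE` and `encode_with_missing_flag` are hypothetical, not part of PyTorch.

```python
from typing import List, Optional

# Assumed placeholder for a missing step; choose a value that suits your data.
FILL_VALUE = 0.0


def encode_with_missing_flag(seq: List[Optional[float]]) -> List[List[float]]:
    """Turn every time step into a pair [value, is_missing]."""
    encoded = []
    for step in seq:
        if step is None:
            # Missing step: placeholder value, flag set to 1.
            encoded.append([FILL_VALUE, 1.0])
        else:
            # Observed step: real value, flag set to 0.
            encoded.append([float(step), 0.0])
    return encoded


# Ten time steps where a4 and a7 are not available.
sequence = [0.1, 0.2, 0.3, 0.4, None, 0.6, 0.7, None, 0.9, 1.0]
features = encode_with_missing_flag(sequence)
```

The result still has 10 time steps, but each step now has 2 features, so the LSTM would need `input_size=2`. The nested list can be converted with `torch.tensor(features)` before it is given to the model. The flag lets the network tell a placeholder apart from a real observation that happens to equal the fill value.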

Alternatively, is it a problem if the sequence is simply shorter? RNNs can handle sequences of different lengths.
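A minimal sketch of the "simply shorter" option, again assuming scalar time steps with `None` marking the gaps: the missing steps are dropped, and the sequences are then padded to a common length while their true lengths are recorded. `PAD_VALUE` and `drop_missing_and_pad` are hypothetical names used only for illustration.

```python
from typing import List, Optional, Tuple

# Assumed padding value; padded positions are meant to be ignored by the model.
PAD_VALUE = 0.0


def drop_missing_and_pad(
    batch: List[List[Optional[float]]],
) -> Tuple[List[List[float]], List[int]]:
    """Remove missing steps, then right-pad every sequence to the longest one."""
    shortened = [[float(x) for x in seq if x is not None] for seq in batch]
    lengths = [len(seq) for seq in shortened]
    max_len = max(lengths)
    padded = [seq + [PAD_VALUE] * (max_len - len(seq)) for seq in shortened]
    return padded, lengths


batch = [
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],    # complete sequence
    [0.0, 0.1, 0.2, 0.3, None, 0.5, 0.6, None, 0.8, 0.9],  # a4 and a7 missing
]
padded, lengths = drop_missing_and_pad(batch)
```

In PyTorch, `padded` (as a tensor with a trailing feature dimension) and `lengths` can then be passed to `torch.nn.utils.rnn.pack_padded_sequence` with `batch_first=True`, and with `enforce_sorted=False` if the lengths are not in descending order, so that the LSTM skips the padded positions. Note that dropping steps also discards the information that there was a time gap, which may or may not matter for your task.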