How to use binary data of different length with LSTM?

I have a huge list of binary strings: [010101, 011111111, 0101111100011…]. The length of each string is different: from 1 to 10000 characters. The output of the Neural Network is simple: yes/no.

Can you please provide an example of how to define and train a simple LSTM that works with this data? The LSTM should read the input character by character and at each step predict yes/no.

There are a couple of things to unpack:

  • Sequences of 10,000 are extremely long. Sure, LSTMs pride themselves on long-term memory, but 10k time steps seems awfully long. I’m not saying that it won’t work, it’s just something to keep in mind.

  • Since the lengths of your sequences vary a lot, have a look at BucketIterator to generate batches where all sequences within one batch have the same, or at least very similar, length.

  • Since batches can still contain sequences of different lengths, you should have a look at pack_padded_sequence.

  • Once the batches are right, setting up the network model is relatively straightforward.
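To make the padding/packing part concrete, here is a minimal sketch in PyTorch. All names (`CharLSTM`, the hidden and embedding sizes, the dummy labels) are my own choices for illustration, not anything prescribed; the point is the flow: pad a batch of variable-length 0/1 strings, pack it for the LSTM, unpack the outputs, and mask out the padding positions when computing a per-step yes/no loss.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (
    pad_sequence, pack_padded_sequence, pad_packed_sequence
)

class CharLSTM(nn.Module):
    def __init__(self, embed_dim=8, hidden_size=32):
        super().__init__()
        # Two possible characters: '0' and '1' (a one-hot would also work).
        self.embed = nn.Embedding(2, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # one yes/no logit per step

    def forward(self, padded, lengths):
        x = self.embed(padded)                              # (B, T, E)
        packed = pack_padded_sequence(
            x, lengths, batch_first=True, enforce_sorted=False
        )
        out, _ = self.lstm(packed)                          # runs only real steps
        out, _ = pad_packed_sequence(out, batch_first=True) # back to (B, T, H)
        return self.head(out).squeeze(-1)                   # (B, T) logits

# Toy batch of binary strings of different lengths.
strings = ["010101", "011111111", "0101111100011"]
seqs = [torch.tensor([int(c) for c in s]) for s in strings]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)  # (B, max_len), zero-padded

model = CharLSTM()
logits = model(padded, lengths)

# Mask so that padding positions do not contribute to the loss.
mask = torch.arange(padded.size(1))[None, :] < lengths[:, None]
targets = torch.zeros_like(logits)  # dummy per-step yes/no labels
loss = nn.functional.binary_cross_entropy_with_logits(
    logits[mask], targets[mask]
)
loss.backward()
```

Note that `enforce_sorted=False` lets you pass batches in any order; if you sort each batch by length yourself (which a bucketing iterator effectively does for you), you can drop it.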