Input sequence for RNNs

I am building a sequence tagger for pages of a book. The task is to partition the sequence of pages into contiguous segments.
These pages have been digitized by OCR, and I am using a multi-modal tagger to compute a vector representation of each page. These page vectors are then fed into a (bi-directional) GRU (or an LSTM, or a plain RNN).

The labels for each page are hence BEGIN, IN, or END, marking the beginning and the end of a segment (and everything in between). This is analogous to other sequence tagging tasks such as named entity recognition.

Regarding (training) data, I have human-annotated segments, hence a sequence of pages. These segments are not related to each other and can come from different books.

By design, the first page of such a segment is labelled BEGIN, the last page END, and all other pages IN. I am now wondering how to use this data to train the RNN. If I feed it all the pages with their respective labels (in random batches), the network learns from arbitrary page sequences that might span multiple segments.

I could divide the data into the original segments again and feed the pages of each segment as one batch. However, that would mean that the first page of a batch is always a BEGIN, etc. – an assumption that does not hold on unseen data.

I could also concatenate all sequences (in random order) and pretend they came from one book. This, however, poses technical challenges (holding all page embeddings in memory at once). Furthermore, it does not seem to make sense intuitively, because these segments really originate from different books. Perhaps, however, this is the best I can do given the data?
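To make the third option concrete, here is a minimal sketch of building such a pseudo-book by shuffling and concatenating the annotated segments (all names are illustrative, not from any particular library):

```python
import random

def make_pseudo_book(segments, labels):
    """Concatenate segments in random order into one pseudo-book,
    so that segment boundaries (END followed by BEGIN) appear at
    arbitrary positions within a training sequence."""
    order = list(range(len(segments)))
    random.shuffle(order)
    pages, tags = [], []
    for i in order:
        pages.extend(segments[i])
        tags.extend(labels[i])
    return pages, tags
```

The memory problem could then be mitigated by slicing this pseudo-book into fixed-length windows rather than feeding it in one piece.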

For illustration, each training sequence looks like this:

Seq: BEGIN, IN, IN, IN, IN, IN, …, END

(the sequences differ only in the number of IN pages)

Unseen data, however, is a whole book, i.e. several segments in a row:

Book: BEGIN, IN, …, END, BEGIN, IN, …, END, …

The task is to retrieve the labels in that sequence.

Any hints on the best way to use the available training data to effectively train an RNN?

As you mentioned, feeding whole segments to the model would present a case where the model could easily learn that time step 0 is always BEGIN, i.e. whenever the hidden state is initialized to zeros. One way to address this is to randomly choose the starting point of each training sample from within the available pages. So one sample might start on page 10 while another might start on page 1, and so on.
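A minimal sketch of that sampling strategy, assuming the pages and labels are already concatenated into one long sequence (function and variable names are illustrative):

```python
import random

def sample_window(pages, labels, window_size):
    """Pick a training window starting at a random page, so the model
    cannot learn that a zero-initialized hidden state implies BEGIN."""
    start = random.randrange(0, max(1, len(pages) - window_size + 1))
    end = start + window_size
    return pages[start:end], labels[start:end]
```

Drawing a fresh window each epoch also acts as a mild form of data augmentation, since the model sees segment boundaries in many different positions.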

Additionally, you’ll run into the issue of having a highly unbalanced dataset. Suppose the average book is 400 pages long, but only one page per book is labelled “BEGIN” and one “END”. The model could then score 99.5% (398/400) accuracy by simply always guessing “IN”.

To avoid finding yourself in this scenario, you’ll want to weight the classes in the loss function – for the three-class case that is the `weight` argument of `torch.nn.CrossEntropyLoss` (`pos_weight` on `BCEWithLogitsLoss` is the binary equivalent).

Additionally, you’ll need a method to separate out the accuracy by class, because on an unbalanced dataset the overall accuracy hides poor performance on the rare BEGIN/END classes. With a metric that exposes a weighted_mean flag, set weighted_mean = False to get class-separated accuracy.
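A minimal sketch of such a metric (the `weighted_mean` flag follows the name mentioned above; the rest of the naming is mine):

```python
from collections import defaultdict

def class_accuracy(y_true, y_pred, weighted_mean=True):
    """Accuracy per class, plus a mean. With weighted_mean=False every
    class counts equally (macro average), which exposes a model that
    only ever predicts the majority class IN."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    if weighted_mean:
        # frequency-weighted mean equals plain overall accuracy
        mean = sum(correct.values()) / sum(total.values())
    else:
        mean = sum(per_class.values()) / len(per_class)
    return per_class, mean
```

On the 398/400 example above, the weighted mean would report ~99.5% while the macro average would reveal 0% on BEGIN and END.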

If you do this, you may hit a similar problem as above, where the model just learns to always guess “BEGIN” right after “END”. It is better to start each sample at a random point but always finish it at an “END”.
