Hi,
I am building a sequence tagger for pages of a book. The task is to segment the sequence of pages into segments.
These pages have been digitized by OCR, and I am using a multi-modal tagger to compute a vector representation of each page. Subsequently, the page vectors are fed into a (bi-directional) GRU (or an LSTM, or a plain RNN).
The labels for each page are hence BEGIN, IN, or END, marking the beginning and the end (and everything in between) of a segment. This is analogous to other sequence tagging tasks such as named entity recognition.
Regarding (training) data, I have human-annotated segments, each of which is a sequence of pages. These segments are not related to each other and can come from different books.
By design, the first page of such a segment is labelled as BEGIN, the last page as END, and all other pages as IN.
I am now wondering how to use this data to train the RNN. If I feed it all the pages with their respective labels (in random batches), the network learns from arbitrary page sequences that might span multiple segments.
I could divide the data into the original segments again, and feed the pages of each segment as one batch. However, that would mean that the first page of a batch is always a BEGIN, the last page always an END, and so on, an assumption that does not hold on unseen data.
I could also concatenate all sequences (in random order) and pretend they came from one book. This, however, poses technical challenges (all page embeddings would have to be held in memory). Furthermore, it does not seem to make sense intuitively, because these segments really originate from different books. Perhaps, however, this is the best I can do given the data?
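To make the concatenation idea concrete, here is a minimal sketch of what I have in mind. It is only an illustration under assumptions: the helper names (`label_segment`, `pseudo_book`, `windows`) are made up, pages are represented by placeholder IDs rather than real embeddings, and cutting the stream into fixed-size windows is my own guess at how to avoid holding everything in memory. Single-page segments are not covered by my labelling scheme, so the sketch rejects them.

```python
import random
from typing import Iterable, Iterator, List, Sequence, Tuple


def label_segment(n_pages: int) -> List[str]:
    """BEGIN for the first page, END for the last, IN for everything between."""
    if n_pages < 2:
        # Assumption: the labelling scheme does not define single-page segments.
        raise ValueError("a segment needs at least two pages")
    return ["BEGIN"] + ["IN"] * (n_pages - 2) + ["END"]


def pseudo_book(
    segments: Sequence[Sequence[str]], seed: int
) -> Iterator[Tuple[str, str]]:
    """Yield (page_id, label) pairs for all segments, concatenated in random order."""
    order = list(range(len(segments)))
    random.Random(seed).shuffle(order)  # a different seed gives a different "book"
    for i in order:
        pages = segments[i]
        for page, label in zip(pages, label_segment(len(pages))):
            yield page, label


def windows(
    stream: Iterable[Tuple[str, str]], size: int
) -> Iterator[List[Tuple[str, str]]]:
    """Cut the stream into consecutive chunks of at most `size` pages."""
    chunk: List[Tuple[str, str]] = []
    for item in stream:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # the last chunk may be shorter
        yield chunk
```

Because both `pseudo_book` and `windows` are generators, only one window of page IDs exists at a time, and the embeddings for a window could be looked up just before it is fed to the RNN. A window then starts and ends at arbitrary positions relative to the segment boundaries, which is closer to the unseen data than one segment per batch.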
For illustration, each sequence looks like this:
Seq1: BEGIN, IN, IN, IN, END
Seq2: BEGIN, IN, IN, IN, IN, IN, … END
(the only difference is the number of IN pages)
Unseen data, however, looks like this:
BEGIN, IN, IN, IN, END, BEGIN, IN, IN, IN, …, END, …
The task is to retrieve the labels in that sequence.
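Once the labels are predicted, the segments follow directly from them. The sketch below shows that last step; the function name `labels_to_segments` is made up for illustration, and ignoring an END without a preceding BEGIN (or restarting on a repeated BEGIN) is an assumption about how to treat malformed predictions, not something I have settled on.

```python
from typing import List, Optional, Tuple


def labels_to_segments(labels: List[str]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) page indices for each BEGIN ... END span."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == "BEGIN":
            # Assumption: a new BEGIN discards any segment that was never closed.
            start = i
        elif label == "END" and start is not None:
            segments.append((start, i))
            start = None
        # IN labels, and END labels without an open segment, are skipped.
    return segments


example = ["BEGIN", "IN", "IN", "IN", "END", "BEGIN", "IN", "IN", "END"]
print(labels_to_segments(example))  # [(0, 4), (5, 8)]
```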
Any hints on the best way to use the available training data to effectively train an RNN?