As you mentioned, passing BEGIN to the model makes it easy for the model to assume that time step 0 is always BEGIN, i.e. whenever the hidden state is initialized to zeros. One way to address this is to randomly choose the starting page of each training sample from the available pages. So one sample might start on page 10 while another might start on page 1, and so on.
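A minimal sketch of random-start sampling, assuming each book is available as a Python list of per-page items (features or labels). The function name `sample_window` and its parameters are hypothetical, not part of any library:

```python
import random

def sample_window(book_pages, max_len=None, rng=random):
    """Return a contiguous run of pages that starts at a random page.

    With max_len=None the window runs to the last page of the book.
    """
    if not book_pages:
        raise ValueError("book_pages must not be empty")
    # Any page may become time step 0, so step 0 is not always BEGIN.
    start = rng.randrange(len(book_pages))
    end = len(book_pages) if max_len is None else start + max_len
    return book_pages[start:end]

# Toy book with 400 pages: one BEGIN, 398 IN, one END.
labels = ["BEGIN"] + ["IN"] * 398 + ["END"]
rng = random.Random(0)  # seeded only to make the example repeatable
short_window = sample_window(labels, max_len=50, rng=rng)
to_the_end = sample_window(labels, rng=rng)
```

Drawing a fresh window each epoch means the model sees zero-initialized hidden states paired with all three classes, not only with BEGIN.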
Additionally, you’ll run into the issue of a highly unbalanced dataset. Suppose the average book is 400 pages long. Only one page per book is class “BEGIN” and only one is class “END”, so the model could score 99.5% (398/400) accuracy by simply always guessing “IN”.
To avoid finding yourself in this scenario, you’ll want to pass the pos_weight argument to the loss function.
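As a rough illustration, assuming PyTorch, one logit per class, and one-hot float targets, the weighting could be set up with `torch.nn.BCEWithLogitsLoss`, whose `pos_weight` is conventionally the ratio of negative to positive examples per class. The class order (BEGIN, IN, END) and the counts below are assumptions taken from the 400-page example:

```python
import torch
import torch.nn as nn

# Assumed label counts for one 400-page book, in the order BEGIN, IN, END.
pos_counts = torch.tensor([1.0, 398.0, 1.0])
total_pages = 400.0

# pos_weight[c] = (pages that are not class c) / (pages that are class c)
pos_weight = (total_pages - pos_counts) / pos_counts

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Dummy batch: 4 pages, 3 logits per page, one-hot float targets.
logits = torch.zeros(4, 3)
targets = torch.tensor([
    [1.0, 0.0, 0.0],  # BEGIN
    [0.0, 1.0, 0.0],  # IN
    [0.0, 1.0, 0.0],  # IN
    [0.0, 0.0, 1.0],  # END
])
loss = criterion(logits, targets)
```

If the model instead uses a single softmax over the three classes, `torch.nn.CrossEntropyLoss` takes a per-class `weight` tensor rather than `pos_weight`.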
Additionally, you’ll need a method to separate out the accuracy by class. Here is one way to get class separated accuracy for an unbalanced dataset:
Set weighted_mean = False to get class separated accuracy.
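A minimal sketch of what such a function could look like, assuming predictions and targets are sequences of integer class ids (call `.tolist()` on tensors first). The name `class_accuracy` is hypothetical, and the `weighted_mean` flag is modeled on the description above:

```python
from collections import Counter

def class_accuracy(preds, targets, num_classes, weighted_mean=True):
    """Accuracy overall (weighted_mean=True) or as a list with one entry per class."""
    correct = Counter()
    total = Counter()
    for pred, target in zip(preds, targets):
        total[target] += 1
        if pred == target:
            correct[target] += 1
    if weighted_mean:
        # Frequency-weighted mean of the per-class accuracies,
        # which is the same as plain overall accuracy.
        n = sum(total.values())
        return sum(correct.values()) / n if n else float("nan")
    # One accuracy per class; NaN for classes that never occur in targets.
    return [
        correct[c] / total[c] if total[c] else float("nan")
        for c in range(num_classes)
    ]

# Assumed class ids: 0 = BEGIN, 1 = IN, 2 = END. The model always guesses IN.
targets = [0] + [1] * 398 + [2]
preds = [1] * 400
overall = class_accuracy(preds, targets, num_classes=3)
per_class = class_accuracy(preds, targets, num_classes=3, weighted_mean=False)
# overall is 398/400, while per_class is [0.0, 1.0, 0.0]
```

The per-class view makes the failure visible: the always-“IN” model looks strong overall but never gets “BEGIN” or “END” right.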
If you do this, you may run into a problem similar to the one mentioned above, where the model just learns to always guess “BEGIN” after “END”. It is better to start at a random point and always finish at “END”.