Machine learning text segmentation

julkhami · October 9, 2021, 7:32pm

I would like to segment a text into sentences, and I am looking for ways to handle plaintext files in which each paragraph is hard-wrapped, i.e. made up of lines of text with a newline character at the end of each line. I believe standard rule-based (regular-expression) segmenters would treat the end of every line as the end of a segment. If I instead joined the file on newlines first, I might merge two lines that are separate but have no punctuation between them (like a title and a subtitle), and the rule-based segmenter would then not realise they should be kept apart.

I think I need an algorithm which somehow knows if a line is a segment even if it’s not punctuated - it could be a title or a line of code, for example - and can join all lines that are part of a continuous paragraph, then segment them by sentences.
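To make the idea concrete, here is a rough rule-based baseline for what I mean: decide per line whether it stands alone, join the rest into paragraphs, then split those into sentences. Everything in it (the names `is_standalone`, `merge_lines`, `split_sentences`, the six-word cutoff, the regexes) is my own guess for illustration, not something taken from a library, and I expect it to fail on plenty of real files.

```python
import re
from typing import List

# Crude, untuned guesses: a line "looks finished" if it ends in . ! or ?,
# and "looks like code" if it is indented or ends in a brace or semicolon.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*$')
CODE_HINT = re.compile(r'^\s{4,}|[{};]\s*$')

def is_standalone(line: str) -> bool:
    """Guess whether a line is a title or code rather than wrapped prose."""
    if CODE_HINT.search(line):
        return True
    stripped = line.strip()
    # Short line without sentence-final punctuation: assume it is a title.
    return len(stripped.split()) <= 6 and not SENTENCE_END.search(stripped)

def merge_lines(text: str) -> List[str]:
    """Join wrapped lines into paragraphs; keep standalone lines separate."""
    blocks: List[str] = []
    current: List[str] = []

    def flush() -> None:
        if current:
            blocks.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():          # blank line closes the paragraph
            flush()
        elif is_standalone(line):     # title or code: its own block
            flush()
            blocks.append(line.rstrip())
        else:                         # continuation of the paragraph
            current.append(line.strip())
    flush()
    return blocks

def split_sentences(block: str) -> List[str]:
    """Naive sentence split on whitespace that follows . ! or ?"""
    return [s for s in re.split(r'(?<=[.!?])\s+', block) if s]
```

The weakness is obvious: the title test is just a length threshold, which is exactly the part I would like a model to learn instead.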

I am open to different strategies for this but the simplest that come to mind are:

1. some very general model which has been trained on well-segmented documents, to which I can just give the file without consciously preparing it in some way; or
2. a model which understands whether a line of text is a continuation of the sentence in the previous line or whether the two are discursively/sententially separate.

In either case, is it likely I could train such a model myself? I’d need to find a large source of relevant data somehow, wouldn’t I? Maybe if I could find the original marked-up source from which the plaintext of various documents was dumped, I could train a model to effectively reconstruct that markup from the plaintext. In the source, the text would be more structured, with markup tags.
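As a sketch of how such structured sources might be turned into training data: if I already know the true block boundaries (titles, paragraphs), I could hard-wrap each block myself and label every pair of adjacent lines as "same block" or "new block". The function name `make_line_pairs`, the wrap width of 60 and the 1/0 labelling are all my own assumptions, and wrapping text artificially may not match how real dumped files look.

```python
import textwrap
from typing import List, Tuple

def make_line_pairs(blocks: List[str], width: int = 60) -> List[Tuple[str, str, int]]:
    """Hard-wrap each known block and label adjacent line pairs.

    Label 1: the second line continues the same block.
    Label 0: the second line starts a new block.
    """
    lines: List[str] = []
    block_ids: List[int] = []
    for i, block in enumerate(blocks):
        # textwrap.wrap breaks the block into lines of at most `width` characters
        for line in textwrap.wrap(block, width=width):
            lines.append(line)
            block_ids.append(i)
    return [
        (lines[k], lines[k + 1], int(block_ids[k] == block_ids[k + 1]))
        for k in range(len(lines) - 1)
    ]
```

Each `(previous_line, next_line, label)` triple could then feed a binary classifier of the second kind described above, though I have not tried this.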

If training my own is impractical, I guess it makes more sense to look for a pre-trained model.

Does anyone know of any pre-trained model that has the kind of knowledge I describe here?

Or could an unsupervised algorithm potentially separate headers and lines of code from paragraphs of text? Perhaps unsupervised methods are already quite good at identifying the structural components of a document (header, paragraph, etc.)?

Thanks very much to anyone who can help me figure out a way to approach this.