PubMed Named Entity Recognition / Generative Model


I have an interesting project: analyzing PubMed PDFs and programmatically extracting certain fields from each one, such as author, PubMed ID, URL, citations, etc. Obviously an easier way of doing this would be simple regex or some type of tree-based lexical analysis engine, but the goal for this platform is to train a machine learning model on unstructured input text sequences and have it return, for each category, the probability that a given piece of text belongs to that label.
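For comparison, here is a minimal sketch of what the regex baseline might look like. It assumes the PDF text has already been extracted to a plain string; the patterns, the `extract_fields` helper, and the sample string (including its PubMed ID) are made up for illustration and will not cover every layout.

```python
import re

# Assumed patterns, for illustration only; real PDFs vary a lot in layout.
PMID_RE = re.compile(r"\bPMID:\s*(\d{1,8})\b")
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def extract_fields(text):
    """Return the PubMed IDs and URLs found in already-extracted PDF text."""
    return {
        "pmid": PMID_RE.findall(text),
        # Trailing sentence punctuation is usually not part of the URL.
        "url": [u.rstrip(".,;") for u in URL_RE.findall(text)],
    }


# Made-up sample text, not taken from a real article.
sample = (
    "Doe J, Roe A. An example title. PMID: 12345678. "
    "Available at https://pubmed.ncbi.nlm.nih.gov/12345678/."
)
fields = extract_fields(sample)
print(fields)
```

This works for rigidly formatted fields like IDs and URLs, but it gives no probability and has no obvious extension to free-form fields such as author names, which is the motivation for a learned model.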

I’ve been looking at Named Entity Recognition models, or perhaps a VAE that would take a sequence of input text and respond with the appropriate chunk of text from that sequence. Would either approach be a reasonable fit?
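To make the NER framing concrete, here is a rough sketch of how "probability of label ownership" could fall out of token classification: the model emits one score per label for each token, a softmax turns those scores into probabilities, and adjacent tokens with the same label are merged into a chunk. The label set, the tokens, and the hand-written scores are all hypothetical stand-ins for the output of a trained model.

```python
import math

LABELS = ["O", "AUTHOR", "PMID", "URL"]  # hypothetical label set; "O" = no entity


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_tokens(tokens, logits):
    """Pair each token with its most probable label and that probability."""
    out = []
    for tok, row in zip(tokens, logits):
        probs = softmax(row)
        best = max(range(len(LABELS)), key=probs.__getitem__)
        out.append((tok, LABELS[best], probs[best]))
    return out


def extract_spans(labelled):
    """Merge adjacent tokens sharing a non-O label into (label, text, min_prob)."""
    spans = []
    prev = "O"
    for tok, label, prob in labelled:
        if label != "O":
            if label == prev:
                _, text, p = spans[-1]
                spans[-1] = (label, text + " " + tok, min(p, prob))
            else:
                spans.append((label, tok, prob))
        prev = label
    return spans


# Hand-written scores standing in for a trained model's per-token output.
tokens = ["Jane", "Doe", "PMID", "12345678"]
logits = [
    [0.1, 3.0, 0.0, 0.0],
    [0.2, 2.5, 0.0, 0.0],
    [2.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 4.0, 0.1],
]
spans = extract_spans(label_tokens(tokens, logits))
for label, text, prob in spans:
    print(f"{label}: {text!r} (p >= {prob:.2f})")
```

Taking the minimum token probability as the span's confidence is just one simple choice; averaging, or a proper BIO tagging scheme to separate back-to-back entities of the same type, would be reasonable alternatives.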

Any input would be greatly appreciated!