Extracting metadata from PDF

MagaMakhauri · August 19, 2022, 6:52am

Hi!
I have used PyPDF2 with regex to extract metadata from the different invoice PDFs my company receives. Then I tried tesseract-ocr and thought some manipulation improved the ‘PDF reading capabilities’.
But it's still not enough. I was wondering if it's at all possible to train an ML model to do that. For example, write out the metadata (address, total amount, etc.) by hand and ‘show’ it to the model as the desired result.
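
To make the setup concrete, here is a minimal sketch of the regex approach next to a hand-written ‘desired result’. It is not the actual code: the sample text, field names, and patterns are all made up for illustration, and the text is assumed to have already been pulled out of the PDF (by PyPDF2 or Tesseract).

```python
import re

# Hypothetical text, standing in for what PyPDF2 or Tesseract would return.
SAMPLE_TEXT = """ACME Supplies Ltd
12 Harbour Road, Springfield
Invoice No: INV-2022-0817
Total amount: 1,234.50 EUR
"""

# Hypothetical patterns; real invoices vary in layout and wording,
# which is where a fixed set of regexes starts to break down.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+No[.:]?\s*(\S+)", re.IGNORECASE),
    "total_amount": re.compile(r"Total\s+amount[.:]?\s*([\d.,]+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return the first match for each pattern, or None if it is absent."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Hand-written labels for this document: the 'desired result'.
EXPECTED = {"invoice_number": "INV-2022-0817", "total_amount": "1,234.50"}

result = extract_fields(SAMPLE_TEXT)
print(result)
print("matches hand labels:", result == EXPECTED)
```

Pairs like `SAMPLE_TEXT` and `EXPECTED` are also roughly the shape of data a supervised model would need: the input text plus the fields written out by hand.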

My question is: is it possible? Is it something PyTorch can actually do?