I have used PyPDF2 with regex to extract metadata from the different invoice PDFs my company receives. Then I tried tesseract-ocr, and some manipulation improved the ‘PDF reading capabilities’,
but it's still not enough. I was wondering if it's at all possible to train an ML model to do that: write out the metadata (address, total amount, etc.) by hand and ‘show’ it to the model as the desired result.
My question is: is it possible? Is it something PyTorch can actually do?
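
To make the setup concrete, here is a rough sketch of the regex approach and of what a hand-written ‘desired result’ could look like. Everything in it is made up for illustration: the invoice text, the field names, the patterns and the helper functions `extract_fields` and `field_accuracy` are assumptions, not the real invoices or code.

```python
import re

# Hypothetical text, standing in for what PyPDF2 or tesseract-ocr would return.
INVOICE_TEXT = """ACME Supplies Ltd
12 Example Street, Springfield
Invoice No: INV-0042
Total amount: 1,234.50 EUR
"""

# Hand-written "desired result" for the text above (one labelled example).
EXPECTED = {
    "address": "12 Example Street, Springfield",
    "total_amount": "1,234.50",
}

# Regex baseline; these illustrative patterns only fit this one layout.
PATTERNS = {
    "address": re.compile(r"^(\d+ .+)$", re.MULTILINE),
    "total_amount": re.compile(r"Total amount:\s*([\d.,]+)"),
}


def extract_fields(text):
    """Return the first match for each field, or None if nothing matched."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


def field_accuracy(predicted, expected):
    """Fraction of expected fields that were extracted exactly."""
    correct = sum(predicted.get(name) == value for name, value in expected.items())
    return correct / len(expected)


if __name__ == "__main__":
    predicted = extract_fields(INVOICE_TEXT)
    print(predicted)
    print(field_accuracy(predicted, EXPECTED))
```

The idea would be for a trained model to take the place of `extract_fields`, with the same kind of (text, expected fields) pairs serving as training and evaluation data.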