I’m wondering how one would go about creating a neural net to retrieve text from a document.
A practical example would be web crawling. Let’s say we want to track real estate listings across multiple websites. We don’t want to maintain 10+ different crawlers, one per site, as that’s not scalable, so instead we would want to train a single neural net that can adapt to different sites.
I was thinking that, the same way CNNs slide a kernel across a 2D image, a 1D convolution could slide across an extremely long 1D array of embedded HTML tokens, and we could put bounding boxes (start and end positions in the sequence) around certain HTML elements.
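To make the idea concrete, here is a rough sketch of what I have in mind, in plain Python with no ML library. Everything in it is my own assumption for illustration: the helper names (`tokenize`, `embed`, `conv1d`, `spans_above`), the sample HTML, the embedding size, and the kernel weights, which are random and untrained. In a real system the embedding and kernel would be learned from labelled pages, presumably with a proper framework, so the spans this prints are meaningless. It only shows the mechanics of flattening HTML to a 1D sequence, sliding a kernel over it, and turning per-token scores into 1D "bounding boxes".

```python
import random
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Flatten HTML into a 1D token sequence: open tags, close tags, text words.
    Attributes (e.g. class names) are dropped here to keep the sketch short."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.extend(data.split())


def tokenize(html):
    parser = TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def embed(tokens, dim=8, seed=0):
    """Give each distinct token a fixed random vector (stand-in for a learned embedding)."""
    rng = random.Random(seed)
    table, vectors = {}, []
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(dim)]
        vectors.append(table[tok])
    return vectors


def conv1d(vectors, kernel, bias=0.0):
    """Slide a (width x dim) kernel along the sequence with zero padding,
    producing one score per token position."""
    width, dim = len(kernel), len(kernel[0])
    pad = width // 2
    padded = [[0.0] * dim] * pad + vectors + [[0.0] * dim] * pad
    scores = []
    for i in range(len(vectors)):
        window = padded[i:i + width]
        scores.append(bias + sum(w * x
                                 for krow, vrow in zip(kernel, window)
                                 for w, x in zip(krow, vrow)))
    return scores


def spans_above(scores, threshold=0.0):
    """Group consecutive positions scoring above threshold into (start, end) spans,
    end exclusive. These spans play the role of 1D bounding boxes."""
    spans, start = [], None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans


# Hypothetical listing snippet, not taken from any real site.
html = '<div><h2>3 bed house</h2><span class="price">$450,000</span></div>'
tokens = tokenize(html)
vectors = embed(tokens)

# Untrained kernel of width 3; training would replace these random weights.
rng = random.Random(1)
kernel = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]

scores = conv1d(vectors, kernel)
for start, end in spans_above(scores):
    print(start, end, tokens[start:end])
```

If this is roughly the right shape, I assume the training target would be a per-token label (inside or outside the element we want), with the spans recovered afterwards as above.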
Full disclosure: I’m a newbie