Neural Net for text retrieval

GabrielGhe · June 23, 2020, 5:39pm

I’m wondering how one would go about creating a neural net to retrieve text from a document.

A practical example would be web crawling. Let’s say we want to track real estate listings across multiple websites. We don’t want to maintain a separate crawler for each of 10+ sites, as that’s not scalable, so we would want to train a neural net that can adapt to different sites.

I was thinking that, the same way CNNs slide a grid across a 2D image, perhaps a similar network could slide along an extremely long 1D embedded array of HTML and put bounding boxes around certain HTML elements.
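
To make the sliding idea concrete, here is a minimal sketch of a single filter moving along a 1D sequence. Everything in it is an assumption made for illustration: the `conv1d` helper, the toy `embedded` values (one number per token instead of a real embedding vector), and the hand-picked `kernel`, which a real network would learn from data. The highest-scoring window plays the role of a 1D "bounding box".

```python
def conv1d(sequence, kernel):
    """Slide `kernel` along `sequence` and return one score per window.

    This is the cross-correlation that CNN layers compute, restricted to
    positions where the kernel fits entirely inside the sequence.
    """
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


# Toy stand-in for an embedded HTML string: one number per token (hypothetical).
embedded = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]

# Hypothetical filter; in a trained network these weights would be learned.
kernel = [1.0, 2.0, 1.0]

scores = conv1d(embedded, kernel)

# The best-scoring window gives a (start, end) span, end exclusive.
start = scores.index(max(scores))
box = (start, start + len(kernel))
print(scores, box)
```

A real model would stack many such filters over multi-dimensional embeddings, but the mechanics of sliding along one axis are the same.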

Full disclosure: I’m a newbie

vdw · June 24, 2020, 2:34am

I’m not really sure what you’re trying to do, but for your example I would argue that writing 10+ crawlers/scrapers is a perfectly good solution.

Training deep networks such as CNNs requires a large amount of data to get good results. Where do you get that training data from? In principle, what you’re doing is sequence labeling, where each token/word/tag/etc. in your HTML string is associated with a label (similar to named entity recognition). But again, this is arguably not an easy task to learn.
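
As a rough illustration of the sequence-labeling framing, the sketch below splits an HTML string into tokens and attaches one label per token. It uses the BIO tagging scheme that is common in named entity recognition, which is an assumption here rather than something specified above. The `tokenize` and `bio_labels` helpers, the `PRICE` and `DESC` field names, and the sample HTML are all hypothetical, and the spans are annotated by hand, which is exactly the training data a model would need in bulk.

```python
import re

# A token is either a whole HTML tag or a run of non-space text outside tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def bio_labels(tokens, spans):
    """Build one BIO label per token.

    `spans` is a list of (start, end, field) token index ranges, end exclusive.
    Tokens outside every span get the label "O".
    """
    labels = ["O"] * len(tokens)
    for start, end, field in spans:
        labels[start] = "B-" + field
        for i in range(start + 1, end):
            labels[i] = "I-" + field
    return labels


# Hypothetical listing snippet and hand-made annotations.
html = '<div class="listing"><span>$450,000</span> <p>3 bed house</p></div>'
tokens = tokenize(html)
spans = [(2, 3, "PRICE"), (5, 8, "DESC")]
labels = bio_labels(tokens, spans)

for token, label in zip(tokens, labels):
    print(label, token)
```

A model trained on many such (token, label) pairs from different sites is what would have to replace the per-site scrapers, which is why the amount of labeled data matters so much.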