I used Scrapy to crawl a lot of HTML source from many different websites, and there are many kinds of pages: home pages, list pages, article pages (I call them detail pages), and other mixed pages, etc.
I stored the HTML source in a JSON Lines (.jl) file, one record per line with url, title, and html as fields.
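To show what I mean, here is a minimal sketch of how such a file can be read back record by record (the file name `pages.jl` and the exact field names are just my assumptions for illustration):

```python
import json

def read_pages(path):
    """Yield (url, title, html) from a JSON Lines file:
    one JSON object per line with url/title/html fields."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            yield record["url"], record["title"], record["html"]

# One line of the file would be parsed like this:
sample = '{"url": "http://example.com/a/1", "title": "Hello", "html": "<p>x</p>"}'
record = json.loads(sample)
print(record["url"])  # http://example.com/a/1
```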
My goal is to use AI to recognize all the detail pages and locate the so-called main content (the summary part), then store them in a MySQL database.
I am so tired of writing crawl rules and regexes one site at a time.
I suppose I need two steps to get there:
- Create a model to judge which pages are detail pages among a lot of mixed pages, maybe 100,000,000 pages in total.
Method 1: parse the HTML source and mark HTML tags as 1 and non-tag text as 0, which turns the entire HTML source into a sequence of numbers that can serve as features.
Method 2: use one-hot encoding to give each distinct HTML tag its own type, with non-tag text as another type.
I think the second way is the better feature representation.
Then I would label some pages as detail or not-detail, and use them as the training dataset.
I don't know whether this will work or not.
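The second method above could be sketched like this with the standard library's `html.parser` (just my own rough attempt, not a recommendation; the class and id scheme are made up for illustration, and each integer id would later be expanded into a one-hot vector):

```python
from html.parser import HTMLParser

TEXT = 0  # id reserved for non-tag text

class TagSequencer(HTMLParser):
    """Turn an HTML document into a sequence of integer ids:
    one id per distinct tag name, plus a shared id (0) for text."""
    def __init__(self, vocab):
        super().__init__()
        self.vocab = vocab       # maps tag name -> integer id
        self.sequence = []
    def _tag_id(self, tag):
        if tag not in self.vocab:
            self.vocab[tag] = len(self.vocab) + 1  # 0 is taken by text
        return self.vocab[tag]
    def handle_starttag(self, tag, attrs):
        self.sequence.append(self._tag_id(tag))
    def handle_endtag(self, tag):
        self.sequence.append(self._tag_id(tag))
    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text nodes
            self.sequence.append(TEXT)

vocab = {}
parser = TagSequencer(vocab)
parser.feed("<div><h1>Title</h1><p>Body text</p></div>")
print(parser.sequence)  # [1, 2, 0, 2, 3, 0, 3, 1]
print(vocab)            # {'div': 1, 'h1': 2, 'p': 3}
```

Each labeled page would then become one such sequence plus a detail / not-detail label.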
I am not very good at building AI/machine-learning models; I have only read a little about CNN and RNN models.
As I understand it, a CNN is a classification model whose features are numbers, while an RNN is a sequence (time-series) model whose features are one-hot vectors, and the length of each sample can differ.
So it seems what I need is a classification model with one-hot features.
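One thing I tried to work out is the variable-length problem: a fixed-input classifier needs every sample to have the same shape, so the tag-id sequences would be padded or truncated first and only then expanded to one-hot. A plain-Python sketch of that preprocessing (all names here are my own, and I'm not sure this is the right approach):

```python
def pad_or_truncate(seq, max_len, pad_id=0):
    """Force every tag-id sequence to a fixed length so a classifier
    can take it as input. Note: here 0 doubles as the text id and the
    pad id; a real setup might reserve a separate padding id."""
    return (seq + [pad_id] * max_len)[:max_len]

def one_hot(seq, vocab_size):
    """Expand integer ids into one-hot rows: shape (len(seq), vocab_size)."""
    rows = []
    for tag_id in seq:
        row = [0] * vocab_size
        row[tag_id] = 1
        rows.append(row)
    return rows

sample = [1, 2, 0, 2, 1]              # a short tag-id sequence
fixed = pad_or_truncate(sample, 8)
print(fixed)                          # [1, 2, 0, 2, 1, 0, 0, 0]
matrix = one_hot(fixed, vocab_size=4)
print(len(matrix), len(matrix[0]))    # 8 4
```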
I hope I can get some advice and help here.