java - On web scraping best practices -
Given:
Client side:
1 - database
2 - coding language
Server side:
1 - a website displaying its own database results (a chronologically descending list of blog posts, let's say; a post can change if the site/author updates it), each post having a unique id that does not change.
By and large, does anyone have resources that helped them understand how best to approach this problem?
Computationally, this type of work is time consuming due to the nature of crawling a website and waiting for results (for me, at least).
My typical process (pseudo):
    for each page on url:
        for each post on page:
            id = getid(post)
            data1...datan = getdata(post)
            call sql.execute("insert ... on duplicate key update")
The SQL part is tedious and not efficient, and I sense there must be a better way to accomplish what I'm doing in this flow.
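To make the SQL step concrete, here is a minimal sketch of batching the writes instead of calling execute once per post. It assumes SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`) as a stand-in for whatever database is actually in use; the `posts` table, its columns, and the names `make_db` and `upsert_posts` are all hypothetical. With MySQL the statement would be the `insert ... on duplicate key update` form from the pseudocode above.

```python
import sqlite3

# Hypothetical schema: one row per post, keyed by the site's stable post id.
UPSERT_SQL = """
INSERT INTO posts (id, title, body) VALUES (?, ?, ?)
ON CONFLICT(id) DO UPDATE SET title = excluded.title, body = excluded.body
"""


def make_db(path=":memory:"):
    """Open the database and create the (assumed) posts table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def upsert_posts(conn, rows, batch_size=500):
    """Write (id, title, body) tuples in batches, one transaction per batch.

    Returns the number of rows sent to the database.
    """
    batch = []
    written = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            with conn:  # commits on success, rolls back on error
                conn.executemany(UPSERT_SQL, batch)
            written += len(batch)
            batch = []
    if batch:  # flush whatever is left over
        with conn:
            conn.executemany(UPSERT_SQL, batch)
        written += len(batch)
    return written
```

Because `rows` can be any iterable, the scraper can be written as a generator that yields tuples, and the database round trips drop from one per post to one per batch.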
The overall goal is always:
1 - grab the info I care about from the site (acknowledging here that the CSS/site can change; on average I'm using XPath to find the data), and stop when I have gotten to the point where I have already captured the info (meaning: at a post id that is already in the database, assuming posts are in descending order and the id does not change).
2 - feed that info through for later analysis in whatever language best fits the type of problem I'm trying to solve.
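The stop condition in goal 1 can be sketched as a generator that walks pages newest-first and returns at the first id already stored. This is only an illustration: `new_posts`, `fetch_page`, and `is_known` are hypothetical names, and a post is assumed to be an `(id, payload)` tuple. Note the limitation: stopping at the first known id will miss older posts that the author edited later, so those would need a separate periodic re-check.

```python
from typing import Callable, Iterator, List, Tuple

Post = Tuple[str, str]  # (post_id, payload); an assumed shape for illustration


def new_posts(
    fetch_page: Callable[[int], List[Post]],
    is_known: Callable[[str], bool],
    max_pages: int = 1000,
) -> Iterator[Post]:
    """Yield posts newest-first and stop at the first id already captured.

    fetch_page(n) is assumed to return the posts on page n in descending
    chronological order, or an empty list once the pages run out.
    """
    for page_no in range(1, max_pages + 1):
        posts = fetch_page(page_no)
        if not posts:
            return  # ran past the last page
        for post in posts:
            if is_known(post[0]):
                return  # everything from here on is already in the database
            yield post
```

Here `is_known` could be a membership test against a set of ids loaded from the database once at startup, which avoids a query per post, and `fetch_page` would wrap whatever does the actual fetching and XPath extraction.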
On average (and it doesn't matter, but) the packages I utilize here are: Beautiful Soup, Selenium, Mechanize, etc.
java python language-agnostic web-scraping