java - On web scraping best practices -

Given:

Client side:

1 - a database

2 - a coding language

Server side:

1 - a website displaying its own database results (a chronologically descending list of blog posts, let's say, which can change if the site/author updates a post), each post having a unique id that does not change.

By and large, does anyone have resources that have helped them understand how best to approach this problem?

Computationally, this type of work is time consuming due to the nature of crawling a website and waiting for results (for me, at least).

My typical process (pseudocode):

for each page on url:
    for each post on page:
        id = getid(post)
        data1...datan = getdata(post)
        call sql.execute("insert ... on duplicate key update")

The SQL part is tedious and not efficient, and I sense there must be a better way to accomplish what I am doing in this flow.
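One possible way to cut down the per-post SQL calls is to collect the scraped rows for a page and write them in a single batched upsert. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than MySQL, so "on duplicate key update" becomes SQLite's "ON CONFLICT ... DO UPDATE" (available from SQLite 3.24), and the posts table, its columns, and the upsert_posts helper are hypothetical names.

```python
import sqlite3


def upsert_posts(conn, posts):
    """Insert new posts and update changed ones in a single transaction.

    `posts` is a list of dicts with hypothetical keys: id, title, body.
    """
    rows = [(p["id"], p["title"], p["body"]) for p in posts]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            """
            INSERT INTO posts (id, title, body) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                title = excluded.title,
                body = excluded.body
            """,
            rows,
        )


def open_db(path=":memory:"):
    """Open a database and create the (assumed) posts table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn
```

Calling upsert_posts once per page (or once per crawl) replaces one round trip per post with one batched statement, which is usually where most of the database time goes.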

The overall goal is always:

1 - Grab the info I care about from the site (acknowledging here that the CSS/site can change; on average I am using XPath to find the data), and stop when I have reached the point where I have already captured the info (meaning: at a post id already in my database, assuming posts are in descending order and the id does not change).

2 - Feed that info through for later analysis in whatever language best fits the type of problem I am trying to solve.
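The stop condition in goal 1 can be sketched with a lazy generator, so that pages past the first already-stored post are never requested. This is a minimal illustration, not a full crawler: pages is assumed to be any iterable that yields one list of post dicts per page (newest first), and iter_posts and new_posts are hypothetical helper names.

```python
from itertools import takewhile


def iter_posts(pages):
    """Flatten an iterable of pages into a stream of posts, page by page."""
    for page in pages:
        for post in page:
            yield post


def new_posts(pages, known_ids):
    """Return posts up to (not including) the first id already stored.

    Because `pages` is consumed lazily, iteration stops as soon as a
    known id is seen and later pages are never fetched.
    """
    unseen = takewhile(lambda post: post["id"] not in known_ids,
                       iter_posts(pages))
    return list(unseen)
```

In practice pages would be a generator that downloads and parses one page per iteration. Note that stopping at the first known id skips older posts that were edited after being stored, so an occasional full re-crawl may still be needed if updates matter.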

On average (and it doesn't matter, but) the packages I utilize here are: Beautiful Soup, Selenium, mechanize, etc.

java python language-agnostic web-scraping
