java - On web scraping best practices -




Given:

Client side:

1 - a database

2 - a coding language

Server side:

1 - a website displaying results from its own database (say, a chronologically descending list of blog posts, which can change if the site or author updates a post), with each post having a unique id that does not change.

By and large, does anyone have resources that helped them understand how best to approach this problem?

Computationally, this type of work is time consuming due to the nature of crawling a website and waiting for results (for me, at least).

My typical process (pseudo):

    for each page on url:
        for each post on page:
            id = getid(post)
            data1...datan = getdata(post)
            call sql.execute("insert ... on duplicate key update")

The SQL part is tedious and not efficient, and I sense there must be a better way to accomplish what I am doing in this flow.
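One common way to reduce the per-row SQL overhead is to collect the scraped rows and upsert them in a single batch inside one transaction. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` with SQLite's `ON CONFLICT ... DO UPDATE` (available in SQLite 3.24 and later) as a stand-in for MySQL's `ON DUPLICATE KEY UPDATE`, and the `posts` table, its columns, and the `save_posts` helper are hypothetical names, not part of the original process.

```python
import sqlite3


def save_posts(conn, posts):
    """Upsert a batch of (id, title, body) tuples in one transaction.

    Hypothetical helper: the table and column names are assumptions.
    """
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            """
            INSERT INTO posts (id, title, body) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                title = excluded.title,
                body = excluded.body
            """,
            posts,
        )


# In-memory database so the sketch runs without any setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")

# First crawl stores two posts; a later crawl sees post 101 edited.
save_posts(conn, [(101, "First", "old text"), (102, "Second", "text")])
save_posts(conn, [(101, "First", "edited text")])
```

Batching this way means one commit per page (or per crawl) rather than one per post, which is usually where most of the database time goes.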

The overall goal is always:

1 - Grab the info I care about from the site (acknowledging that the CSS/site can change; on average I use XPath to find the data) and stop when I reach the point where I have already captured the info (meaning: I am at a post id that is already in my database, assuming posts are in descending order and the id does not change).

2 - Feed that info through later analysis in whichever language best fits the type of problem I am trying to solve.
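The stopping rule in the first goal can be sketched as a loop that walks pages newest-first and returns as soon as it meets an id that is already stored. This is a minimal illustration, not a definitive implementation: `collect_new_posts`, `fake_pages`, and the sample data are hypothetical, and the generator stands in for whatever actually fetches and parses each page.

```python
def collect_new_posts(pages, known_ids):
    """Return (post_id, data) tuples seen before the first known id.

    pages: iterable of pages, each a list of (post_id, data), newest first.
    known_ids: set of post ids already in the database.
    """
    new_posts = []
    for page in pages:
        for post_id, data in page:
            if post_id in known_ids:
                # Everything from here on was captured by an earlier crawl.
                return new_posts
            new_posts.append((post_id, data))
    return new_posts


# Hypothetical sample data standing in for three fetched pages.
SAMPLE_PAGES = [
    [(105, "e"), (104, "d")],
    [(103, "c"), (102, "b")],
    [(101, "a")],
]
fetched = []  # records which pages were actually requested


def fake_pages():
    """Lazily yield pages so that stopping early skips later fetches."""
    for number, page in enumerate(SAMPLE_PAGES, start=1):
        fetched.append(number)
        yield page


new = collect_new_posts(fake_pages(), known_ids={101, 102, 103})
```

Because the pages come from a generator, the third page is never requested once id 103 is recognised. Note that this rule will not notice an older post that was edited after it was stored, which matters if updated posts need to be picked up.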

On average (and it doesn't really matter, but) the packages I use here are: Beautiful Soup, Selenium, mechanize, etc.
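For reference, the XPath-based extraction step might look roughly like the sketch below. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup, so it is a stand-in for the packages named above rather than a replacement for them. The page layout (`div` elements with class `post` and a `data-id` attribute) and the `extract_posts` helper are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real sites will differ.
SAMPLE_PAGE = """
<html><body>
  <div class="post" data-id="105"><h2>Newest post</h2><p>Body text</p></div>
  <div class="post" data-id="104"><h2>Older post</h2><p>More text</p></div>
</body></html>
"""


def extract_posts(markup):
    """Return a list of dicts with the id, title, and body of each post."""
    root = ET.fromstring(markup.strip())
    posts = []
    # Limited XPath: every <div> whose class attribute is exactly "post".
    for node in root.findall(".//div[@class='post']"):
        posts.append({
            "id": int(node.get("data-id")),
            "title": node.findtext("h2"),
            "body": node.findtext("p"),
        })
    return posts


posts = extract_posts(SAMPLE_PAGE)
```

Keeping the XPath expressions in one small function like this limits how much code has to change when the site's markup changes.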

java python language-agnostic web-scraping
