java - On web scraping best practices




Given:

Client side:

1 - A database

2 - A coding language

Server side:

1 - A website displaying its own database results (a chronologically descending list of blog posts, let's say, which can change if the site/author updates a post), each post having a unique id that does not change.

By and large, does anyone have resources that helped them understand how best to approach this problem?

Computationally, this type of work is time consuming due to the nature of crawling a website and waiting for results (for me, at least).

My typical process (pseudo):

for each page on url:
    for each post on page:
        id = getid(post)
        data1...datan = getdata(post)
        call sql.execute("insert ... on duplicate key update")

The SQL part is tedious and not efficient, and I sense there must be a better way to accomplish what I am doing in this flow.
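One direction that might make the SQL step less tedious is to collect the rows for a page and send them as a single batched upsert inside one transaction, rather than one statement per post. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the real database (`on duplicate key update` is MySQL syntax; SQLite's counterpart is `ON CONFLICT ... DO UPDATE`, available from SQLite 3.24), and the `posts` table, its columns, and the `upsert_posts` helper are hypothetical names.

```python
import sqlite3

def upsert_posts(conn, rows):
    """Insert or update a batch of (id, title, body) rows in one transaction.

    Hypothetical helper: the table and column names are assumptions.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            """
            INSERT INTO posts (id, title, body) VALUES (?, ?, ?)
            ON CONFLICT(id) DO UPDATE SET
                title = excluded.title,
                body = excluded.body
            """,
            rows,
        )

# In-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")

# First crawl stores two posts; the second crawl edits one and adds another.
upsert_posts(conn, [(1, "first", "a"), (2, "second", "b")])
upsert_posts(conn, [(2, "second (edited)", "b2"), (3, "third", "c")])

stored = conn.execute("SELECT id, title FROM posts ORDER BY id").fetchall()
print(stored)
```

Batching this way means one round trip and one commit per page instead of one per post, which is usually where the per-row approach loses time.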

The overall goal is always:

1 - Grab the info I care about from the site (acknowledging here that the CSS/site can change; on average I am using XPath to find the data) and stop when I have gotten to the point where the info has already been captured (meaning: I am at a post id that is already in my database, assuming posts are in descending order and the id does not change).

2 - Feed this info through to later analysis in whatever language best fits the type of problem I am trying to solve.

On average (and it doesn't matter, but) the packages I utilize here are: Beautiful Soup, Selenium, Mechanize, etc.
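The stopping rule in goal 1 could be sketched roughly as follows, assuming posts arrive newest first and ids never change. Pages are represented as plain lists of dicts so the example runs without network access; `fetch_pages`, `collect_new_posts`, and the sample data are hypothetical, and a real version would fetch and parse each page (for example with XPath) inside `fetch_pages`.

```python
def fetch_pages():
    """Hypothetical stand-in for the crawler: yields pages of posts, newest first."""
    yield [{"id": 105, "title": "e"}, {"id": 104, "title": "d"}]
    yield [{"id": 103, "title": "c"}, {"id": 102, "title": "b"}]
    yield [{"id": 101, "title": "a"}]

def collect_new_posts(pages, known_ids):
    """Return posts not yet stored, stopping at the first already-seen id."""
    new_posts = []
    for page in pages:
        for post in page:
            if post["id"] in known_ids:
                # Everything from here on was captured by an earlier crawl.
                return new_posts
            new_posts.append(post)
    return new_posts

known_ids = {101, 102, 103}  # ids assumed to be in the database already
new_posts = collect_new_posts(fetch_pages(), known_ids)
print([p["id"] for p in new_posts])
```

Because `fetch_pages` is a generator, returning early means later pages are never requested. One caveat of this design: stopping at the first known id will not pick up edits to older posts, so a site whose authors update posts may still need an occasional full pass.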

java python language-agnostic web-scraping
