Web scraping frameworks?
james.laver at gmail.com
Tue Mar 4 23:32:22 GMT 2014
On 4 Mar 2014, at 23:15, Hernan Lopes <hernanlopes at gmail.com> wrote:
> Another common problem is data coercion. It can be done at the moment the data
> is read, or right before the moment the data is written/saved. Of
> course the latter is probably the best option in most cases.
> The creation of the crawler for each site must be as fast as possible and
> with as little hassle as possible, i.e. data coercion, encoding problems, parsing data
> according to content types. If each crawler must implement all that, it's not
> going to be fun. It's going to be a nightmare.
> The web scraping frameworks exist to take care of all those parts and
> simply provide objects your recipe can use to scrape each site, already with
> the correct encoding and already parsed.
> Imagine if you read prices, and every time you read a price "$ 20,000.00"
> you must coerce it into 20000.00. That operation would have to be done for each
> site/template, which would not be fun. Better to just grab "$ 20,000.00" and,
> right before writing to the database, coerce it into 20000.00. This is more reusable,
> and if needed it can evolve into better parsing methods.
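The write-time coercion described above can be sketched roughly like this (a Python illustration; the helper name and the regex are my own, not from any particular scraping framework, and it assumes US-style separators):

```python
import re

def coerce_price(raw):
    """Strip the currency symbol and thousands separators from a price
    string like "$ 20,000.00" and return a float (20000.0).

    Assumes US-style formatting: comma as thousands separator,
    dot as decimal point.
    """
    cleaned = re.sub(r"[^\d.]", "", raw)
    return float(cleaned)

# Each scraper just stores the raw string it saw on the page;
# coercion happens once, in one place, right before the write.
record = {"price": "$ 20,000.00"}
record["price"] = coerce_price(record["price"])
print(record["price"])  # 20000.0
```

Because the coercion lives at the write step rather than in every scraper, swapping in a smarter parser later (locale-aware separators, currency detection) touches one function instead of every site template.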
Soooo, $startup is importing data from an external webservice which is nominally JSON (but it’s so shit it may as well be screen-scraping). We elected to clean it up on entry. It really simplifies the massive amounts of other processing that happen down the line. In fact, I’d say it’s just not doable any other way.
More information about the london.pm mailing list