Web scraping frameworks?
Hernan Lopes
hernanlopes at gmail.com
Tue Mar 4 23:47:11 GMT 2014
Agreed. After sending I realized I meant 'the first' and not 'the latter'.
So cleaning it up on entry would be best, yes.
On Tue, Mar 4, 2014 at 8:32 PM, James Laver <james.laver at gmail.com> wrote:
>
> On 4 Mar 2014, at 23:15, Hernan Lopes <hernanlopes at gmail.com> wrote:
>
> > Another common problem is data coercion. It can be done at the moment the
> > data is read, or right before the moment the data is written/saved. Of
> > course the latter is probably the best option in most cases.
> >
> > The creation of the crawler for each site must be as fast as possible and
> > with the least hassle, i.e., data coercion, encoding problems, parsing data
> > according to content types. If each crawler must implement all that, it's
> > not going to be fun. It's going to be a nightmare.
> >
> > Web scraping frameworks exist to take care of all those parts and
> > simply hand your recipe objects it can use to scrape each site, already
> > with the correct encoding and parsed content.
> >
> > Imagine if you read prices, and every time you read a price "$ 20,000.00"
> > you must coerce it into 20000.00. That operation would have to be done for
> > each site/template, which would not be fun. Better to just grab "$ 20,000.00"
> > and, before writing to the database, coerce it into 20000.00. This is more
> > reusable, and if needed it can evolve into better parsing methods.
>
> Soooo, $startup is importing data from an external web service which is
> actually JSON (but it's so shit it may as well be screen scraping). We
> elected to clean it up on entry. It really simplifies the massive amounts
> of other processing that happen down the line. In fact, I'd say it's just
> not doable any other way.
>
> James
>
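Whether you clean on entry (James) or just before the write (Hernan), the point is the same: the coercion lives in one reusable place rather than in every per-site crawler. A minimal sketch of such a price normaliser in Python (the function name is my own, and it assumes US-style "$ 20,000.00" formatting; European "20.000,00" styles would need locale-aware handling):

```python
import re

def coerce_price(raw: str) -> float:
    """Turn a scraped price string like "$ 20,000.00" into a plain number.

    Illustrative only: assumes '.' is the decimal point and ',' is a
    thousands separator, as in the "$ 20,000.00" example from the thread.
    """
    # Drop everything except digits, dots and commas ("$", spaces, "USD"...).
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # Drop thousands separators, leaving only the decimal point.
    cleaned = cleaned.replace(",", "")
    return float(cleaned)

# Called once, e.g. just before writing to the database, so each
# per-site crawler can keep storing the raw string it scraped.
print(coerce_price("$ 20,000.00"))  # 20000.0
```

For real money values you would likely want `decimal.Decimal` rather than `float` to avoid rounding surprises; `float` is used here only to keep the sketch short.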
More information about the london.pm
mailing list