Web scraping frameworks?

Hernan Lopes hernanlopes at gmail.com
Tue Mar 4 22:25:48 GMT 2014


Why LWP? What if you want to change the interface module and test with
others? That's where the abstraction comes into place: web scraping can
be organized.
A web scraping framework will let you create a recipe for each website
and focus on only that. Your recipes must describe where each block of
information sits within the page/document/CSV/JSON/whatever, and which
pages must be followed. There is usually an order, i.e.:
- first read the list index and grab some info,
- later go deeper into the hrefs (detail pages) and grab more info,
- then mix info from the index and detail pages and save it as an object.
Such a recipe can be written as a class for each site, along the lines of
the sketch below. The recipe does not need to know whether it will use LWP
or whatever; it only describes the specific pieces which map data on a
webpage.
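
For example, a per-site recipe class could look roughly like this (the
package name and the hooks start_url/parse_index/parse_detail are made up
for illustration, not HTML::Robot::Scrapper's real API):

package My::Recipe::ExampleSite;
use strict;
use warnings;

sub new { bless {}, shift }

# Where the crawl starts.
sub start_url { 'http://example.com/list' }

# Index page: grab some info and enqueue the detail pages.
sub parse_index {
    my ($self, $tree, $queue) = @_;   # $tree is an HTML::TreeBuilder::XPath
    for my $href ($tree->findvalues('//div[@class="item"]/a/@href')) {
        push @$queue, { url => $href, method => 'parse_detail' };
    }
}

# Detail page: mix in more info and return one record.
sub parse_detail {
    my ($self, $tree) = @_;
    return {
        title => $tree->findvalue('//h1'),
        price => $tree->findvalue('//span[@class="price"]'),
    };
}

1;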

Another thing is the queue. What is a good queue solution? An array of
URLs, or a Redis FIFO? The scraping recipe doesn't need to know this
either. But if you want more crawlers, it is better to use Redis as the
queue and have as many crawler workers as you like, each retrieving a
queue task that says which URL to crawl and which class and
method/subroutine to use for it.
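
A rough sketch of that with the Redis and JSON::XS CPAN modules (the
queue key and the task shape are just examples):

use strict;
use warnings;
use Redis;
use JSON::XS qw(encode_json decode_json);

my $redis = Redis->new(server => '127.0.0.1:6379');

# Producer: push a task naming the URL, recipe class and method.
$redis->lpush('scrape_queue', encode_json({
    url    => 'http://example.com/list',
    class  => 'My::Recipe::ExampleSite',
    method => 'parse_index',
}));

# Worker: block until a task arrives, then dispatch to the recipe.
my (undef, $raw) = $redis->brpop('scrape_queue', 0);
my $task = decode_json($raw);
# ... load $task->{class}, fetch $task->{url}, call $task->{method}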

Another problem that comes into play is content types. If you receive an
HTML document you will probably want to parse it with XPath. If it's CSV,
you will want a CSV parser instead, and so on, i.e.:
'text/html'        => 'HTML::TreeBuilder::XPath'
'application/json' => 'JSON::XS'
'application/csv'  => 'Text::CSV_XS'
The framework can automate all this for you, and after parsing it can hand
the result to your recipe class as a reference, so you can use it however
you need. Your recipes don't need to handle this either; they only see the
parsed objects.
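
A sketch of that dispatch done by hand, with the map above hard-coded (a
real framework would let you register the parsers):

use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use JSON::XS qw(decode_json);
use Text::CSV_XS;

sub parse_body {
    my ($content_type, $body) = @_;
    if ($content_type =~ m{^text/html}) {
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse($body);
        $tree->eof;
        return $tree;                     # query it with XPath later
    }
    if ($content_type =~ m{^application/json}) {
        return decode_json($body);        # hashref or arrayref
    }
    if ($content_type =~ m{^application/csv}) {
        my $csv = Text::CSV_XS->new({ binary => 1 });
        open my $fh, '<', \$body or die $!;
        return $csv->getline_all($fh);    # arrayref of row arrayrefs
    }
    die "no parser registered for $content_type";
}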

Very commonly the charset encoding matters as well, so depending on the
charset the content must be decoded into the solution's encoding, i.e.
UTF-8.
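
With LWP that is mostly one call; done by hand it looks something like:

use strict;
use warnings;
use LWP::UserAgent;
use Encode qw(decode);

my $response = LWP::UserAgent->new->get('http://example.com/');

# Easiest: decoded_content honours the charset from the Content-Type
# header and returns Perl character data.
my $text = $response->decoded_content;

# Or explicitly with Encode:
my ($charset) = ($response->header('Content-Type') || '') =~ /charset=([\w-]+)/i;
$text = decode($charset || 'UTF-8', $response->content);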

The user agent is another layer that can be replaced. If you need to run
benchmarks, it is better if the engine can be swapped out with ease.
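
A sketch of that via constructor injection; the crawler only assumes the
engine has a get() method:

package My::Crawler;
use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    my $ua = $args{ua}
        || do { require LWP::UserAgent; LWP::UserAgent->new };
    return bless { ua => $ua }, $class;
}

sub fetch {
    my ($self, $url) = @_;
    return $self->{ua}->get($url);    # any engine with get() will do
}

1;

# Default engine, or drop in a different one for a benchmark:
# my $crawler = My::Crawler->new;
# my $crawler = My::Crawler->new(ua => WWW::Mechanize->new);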


And after all the parsing, your recipe (reader) can send that info to a
writer class which will save it to a database, Elasticsearch, MongoDB,
etc.
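
A sketch of a writer, here backed by DBI (the table and columns are
invented for the example; an Elasticsearch or MongoDB writer would just
offer the same save() method):

package My::Writer::DBI;
use strict;
use warnings;
use DBI;

sub new {
    my ($class, %args) = @_;
    my $dbh = DBI->connect($args{dsn}, $args{user}, $args{pass},
                           { RaiseError => 1 });
    return bless { dbh => $dbh }, $class;
}

sub save {
    my ($self, $record) = @_;
    $self->{dbh}->do(
        'INSERT INTO items (title, price) VALUES (?, ?)',
        undef, $record->{title}, $record->{price},
    );
}

1;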

That's how HTML::Robot::Scrapper works.


On Tue, Mar 4, 2014 at 6:50 PM, ๏̯͡๏ Guido Barosio <gbarosio at gmail.com> wrote:

> Curious about this one. How far would a scraping framework be from LWP?
>
>
>
> On Tuesday, March 4, 2014, DAVID HODGKINSON <davehodg at gmail.com> wrote:
>
> >
> > Does something exist?
> >
> > If it doesn't, does anyone want to help make it happen?
> >
> > I *really* don't want to have to write the code all over again ten
> times...
> >
> >
> >
>
> --
> Guido Barosio
> Thinking of the students of Venezuela, for a better future for all of
> those people.
>
> http://www.ted.com/profiles/1085580
>

