Web scraping frameworks?
hernanlopes at gmail.com
Tue Mar 4 23:16:40 GMT 2014
Last but not least: request headers. Some sites require specific headers, so
the framework must let you set them per page when needed.
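As a rough illustration of per-page request headers, here is a minimal sketch in Python using only the standard library. The names DEFAULT_HEADERS and build_request, the user-agent string, and the example.com URLs are hypothetical and are not part of any framework mentioned in this thread.

```python
from urllib.request import Request

# Hypothetical defaults applied to every page the scraper requests.
DEFAULT_HEADERS = {"User-Agent": "example-scraper/0.1"}


def build_request(url, page_headers=None):
    """Build a request whose headers are the defaults plus per-page overrides."""
    headers = dict(DEFAULT_HEADERS)
    headers.update(page_headers or {})
    # No network traffic happens here; the Request object is only constructed.
    return Request(url, headers=headers)


# A page that needs an extra header simply passes it in.
req = build_request("http://example.com/list", {"Referer": "http://example.com/"})
```

Keeping the defaults in one place and the overrides next to each page means a recipe only mentions headers when a site actually needs them.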
On Tue, Mar 4, 2014 at 8:15 PM, Hernan Lopes <hernanlopes at gmail.com> wrote:
> Another common problem is data coercion. It can be done at the moment the
> data is read, or right before the data is written/saved. The latter is
> probably the best option in most cases.
> Creating the crawler for each site must be as fast and hassle-free as
> possible. If each crawler has to implement data coercion, encoding
> handling, and content-type-specific parsing itself, it's not going to be
> fun. It's going to be a nightmare.
> Web scraping frameworks exist to take care of all those parts and simply
> hand your recipe objects it can use to scrape each site, already decoded
> with the correct encoding and parsed by the right module.
> Imagine you read prices, and every time you read a price "$ 20,000.00"
> you must coerce it into 20000.00. Doing that separately for each
> site/template would not be fun. Better to just grab "$ 20,000.00" and
> coerce it into 20000.00 right before writing to the database. This is more
> reusable, and if needed it can evolve into better parsing methods.
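The price example above could look roughly like the following Python sketch, where coercion happens once in a shared step just before saving. The function name coerce_price is hypothetical, and the code assumes US-style formatting (comma as thousands separator, dot as decimal point); other locales would need a different rule.

```python
import re
from decimal import Decimal


def coerce_price(raw):
    """Turn a scraped price string such as "$ 20,000.00" into a Decimal.

    Assumes commas are thousands separators and the dot is the decimal point.
    """
    cleaned = re.sub(r"[^0-9.]", "", raw)  # drop currency symbol, spaces, commas
    return Decimal(cleaned)


# Recipes keep the raw string; the writer coerces right before saving.
price = coerce_price("$ 20,000.00")
```

Because every recipe hands over the raw string, improving the parsing later means changing this one function rather than each site's code.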
> On Tue, Mar 4, 2014 at 7:25 PM, Hernan Lopes <hernanlopes at gmail.com> wrote:
>> Why LWP? What if you want to swap the interface module, or test with
>> others? That's where the abstraction comes in. Web scraping can be
>> organized.
>> A web scraping framework lets you create a recipe for each website and
>> focus only on that. Your recipes must describe where each block of
>> information sits within the page/document/csv/json/whatever, and which
>> pages to follow. There is usually an order, e.g.:
>> - first read the list index and grab some info,
>> - then follow the hrefs (detail pages) and grab more info,
>> - then mix info from the index and detail pages and save it as an object.
>> This recipe can be described as a class for each site. The recipe does
>> not need to know whether it will use LWP or anything else. It only
>> describes the specific pieces that map data on a web page.
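The index, detail, merge flow described above can be sketched as follows in Python. Everything here is hypothetical: ExampleSiteRecipe, crawl, the fetch callable, and the dict-shaped "pages" (stand-ins for whatever parsed objects a framework would hand over). This is not the API of HTML::Robot::Scrapper or any other module, and the in-memory deque only stands in for a real queue.

```python
from collections import deque


class ExampleSiteRecipe:
    """Hypothetical recipe: knows where the data sits, not how pages are fetched."""

    start_url = "http://example.com/list"

    def parse_index(self, page):
        # Grab some info from the list page plus the link to each detail page.
        for row in page["items"]:
            yield {"title": row["title"]}, row["href"]

    def parse_detail(self, page, partial):
        # Mix info from the index with info from the detail page.
        return {**partial, "price": page["price"]}


def crawl(recipe, fetch):
    """Drive a recipe; `fetch` is injected, so the recipe never sees the HTTP layer."""
    queue = deque([("index", recipe.start_url, None)])  # simple in-memory FIFO
    results = []
    while queue:
        kind, url, partial = queue.popleft()
        page = fetch(url)
        if kind == "index":
            for item, href in recipe.parse_index(page):
                queue.append(("detail", href, item))
        else:
            results.append(recipe.parse_detail(page, partial))
    return results
```

Since the fetcher is passed in, the same recipe can be exercised against canned pages in a test or against a real HTTP client in production.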
>> Another thing is the queue. What is a good queue solution? An array of
>> URLs? Or a Redis FIFO? The scraping recipe doesn't need to know this
>> either. But if you want more crawlers, better to use Redis as the queue
>> and have as many crawlers (workers) as needed each retrieve a queue task
>> that says which URL to crawl and which class must be used, together with
>> the respective
>> Another problem that comes up is content types. If you receive HTML, you
>> will probably want to parse it using XPath. If it's CSV, you might want to
>> parse it with Excel too, etc., e.g.:
>> 'text/html' => HTML::TreeBuilder::XPath
>> 'application/json' => JSON::XS
>> 'application/csv' => Text::CSV_XS
>> So the framework can automate all this for you and, after parsing, put
>> the result into a reference your recipe class can access, so you can use
>> it however you need. Your recipes don't need to handle this either; they
>> only need the parsed objects.
>> Very commonly the charset encoding matters too, so depending on the
>> charset the content must be decoded into the solution's encoding, e.g.
>> UTF-8.
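The content-type mapping and charset decoding described above might be sketched like this in Python, using only the standard library. The names PARSERS, parse_body, parse_html, parse_csv, and LinkCollector are hypothetical. The standard library has no XPath-capable HTML parser, so the HTML branch merely collects link targets as a stand-in for a real tree parser.

```python
import csv
import io
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Stand-in HTML parser: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def parse_html(text):
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links


def parse_csv(text):
    return list(csv.reader(io.StringIO(text)))


# Content type -> parser, mirroring the mapping in the message above.
PARSERS = {
    "text/html": parse_html,
    "application/json": json.loads,
    "application/csv": parse_csv,
}


def parse_body(content_type, body, default_charset="utf-8"):
    """Decode raw bytes using the declared charset, then dispatch on the MIME type."""
    mime, _, params = content_type.partition(";")
    mime = mime.strip().lower()
    charset = default_charset
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value:
            charset = value.strip().strip('"')
    parser = PARSERS.get(mime)
    if parser is None:
        raise ValueError(f"no parser registered for {mime!r}")
    return parser(body.decode(charset))
```

With this shape, a recipe only ever receives the parsed result, and supporting a new content type means adding one entry to the mapping.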
>> The user agent is another layer that can be replaced. If you need to run
>> benchmarks, it had better be built so the engine can be swapped with ease.
>> And after all the parsing, your recipe (reader) can send that info to a
>> (writer) class that saves it to a database, Elasticsearch, Mongo, etc.
>> That's how HTML::Robot::Scrapper works.
>> On Tue, Mar 4, 2014 at 6:50 PM, ๏̯͡๏ Guido Barosio <gbarosio at gmail.com> wrote:
>>> Curious about this one. How far would a scraping framework be from LWP?
>>> On Tuesday, March 4, 2014, DAVID HODGKINSON <davehodg at gmail.com> wrote:
>>> > Does something exist?
>>> > If it doesn't, does anyone want to help make it happen?
>>> > I *really* don't want to have to write the code all over again ten
>>> Guido Barosio
>>> Thinking of the students of Venezuela, for a better future for all
More information about the london.pm