Web scraping frameworks?
Jérôme Étévé
jerome.eteve at gmail.com
Tue Mar 4 23:35:52 GMT 2014
Web::Scraper is great to hack something together quickly.
I use it regularly to do some quick ad-hoc data scraping.
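For anyone who hasn't tried it, a minimal sketch of that kind of quick ad-hoc job (the HTML snippet and field names here are invented for illustration — normally you'd pass a URI instead of an inline string):

```perl
use strict;
use warnings;
use Web::Scraper;

# A made-up HTML snippet standing in for a fetched page.
my $html = <<'HTML';
<html><body>
  <ul>
    <li class="item">First</li>
    <li class="item">Second</li>
  </ul>
</body></html>
HTML

# Declare what to extract: the text of every li.item node.
my $scraper = scraper {
    process 'li.item', 'items[]' => 'TEXT';
};

my $result = $scraper->scrape($html);
print join(", ", @{ $result->{items} }), "\n";   # First, Second
```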
For heavier work, I prefer a combination of the following tools:
- Curl (Net::Curl, or its LWP-style incarnation LWP::Curl). I've found
it more resilient than LWP against dodgy HTTP server responses.
- For the page scraping itself, XML::LibXML (using load_html in
recover mode) + XPath. Again, for its resilience against crap HTML. We
all know correct HTML is the exception rather than the norm on the big
bad web.
- For queuing jobs, I'm a big fan of Gearman. It's light, very stable
and very simple.
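To illustrate the recover-mode point, a small sketch (the broken HTML is deliberately invented): with recover on, load_html builds the best tree it can from tag soup, and XPath then works as usual.

```perl
use strict;
use warnings;
use XML::LibXML;

# Deliberately broken HTML: unquoted attribute, unclosed p and a tags.
my $crap_html = q{<html><body><p class=intro>Hello<p>World<a href="/x">link</body>};

# recover => 2 parses as much as possible without dying on errors;
# suppress_errors keeps libxml2's complaints off stderr.
my $dom = XML::LibXML->load_html(
    string          => $crap_html,
    recover         => 2,
    suppress_errors => 1,
);

# XPath works on the recovered tree.
my ($href) = map { $_->value } $dom->findnodes('//a/@href');
print "$href\n";   # /x
```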
Of course, it's only a toolbox. I doubt you can find a ready made
"framework" that fits your specific business needs out of the box.
J.
On 4 March 2014 22:55, Pierre M <piemas25 at gmail.com> wrote:
> I love using
> Web::Scraper
> It's so simple and intuitive to use!
> But it only "goes down" (unless I've missed something), and it doesn't
> allow you to interact with the page (fill forms, click buttons, etc.), so
> it doesn't handle complex scraping scenarios. For these, I like
> Mojo::UserAgent
> which gives me more control. An example here:
>
> http://blog.kraih.com/post/43198036449/mojolicious-hack-of-the-day-web-scraping-with
--
Jerome Eteve
+44(0)7738864546
http://www.eteve.net/
More information about the london.pm mailing list