Web scraping frameworks?
creaktive at gmail.com
Wed Mar 5 12:23:06 GMT 2014
Shameless self-promotion, but I could not resist when "parallel" was
mentioned.
My point is: forking parallel workers to crawl a single domain is a
terrible way of doing things, because it throws away connection
persistence. Reopening a connection for each worker defeats the speed
gain of parallelism in the first place.
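The persistence point can be shown in miniature: with HTTP keep-alive, one
TCP handshake is paid once and then amortised over many requests, whereas a
fresh worker per request pays it every time. A self-contained sketch using
only the Python standard library (the throwaway local server is purely
illustrative, not any particular scraping framework):

```python
# Sketch: several sequential requests over ONE persistent HTTP/1.1
# connection. Uses only the standard library; the local server exists
# just so the example is runnable without the network.
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 => keep-alive by default

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One connection object, reused: the TCP (and, for HTTPS, TLS) setup
# cost is paid exactly once for all three requests.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
statuses = []
for _ in range(3):
    conn.request("GET", "/")
    resp = conn.getresponse()
    statuses.append(resp.status)
    resp.read()  # drain the body so the socket can be reused
conn.close()
server.shutdown()
print(statuses)  # three responses over a single TCP connection
```

Forking a worker per URL replaces that single reused socket with one
handshake per request, which is exactly the overhead persistence avoids.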
On Wed, Mar 5, 2014 at 12:31 PM, Dave Hodgkinson <davehodg at gmail.com> wrote:
> I've tended to use Parallel::Process where remote sites have been able to
> keep up and haven't been throttled, otherwise just let it run.
> On Tue, Mar 4, 2014 at 11:49 PM, Kieren Diment <diment at gmail.com> wrote:
> > Gearman's fine until you need a reliable queue. It's certainly less of a
> > pain to set up than rabbitmq, but if you start with gearman and find you
> > need reliability after a while there's substantial pain to be experienced
> > (unless you already know all about your reliable job queue implementation
> > of choice).
> > On 05/03/2014, at 10:35 AM, Jérôme Étévé wrote:
> > > - For queuing jobs, I'm a big fan of Gearman. It's light, very stable
> > > and very simple.
More information about the london.pm mailing list