wiki scraping

Nic Gibson nicg at corbas.net
Fri Feb 29 12:04:14 GMT 2008


On 28 Feb 2008, at 16:04, Chris Devers wrote:

> On Thu, 28 Feb 2008, Nic Gibson wrote:
>
>> I've been asked to generate some pdf docs for one of our projects. Not
>> too hard. The problem is that the docs are currently in a trac wiki.
>
> I had to do this recently, and just used wget.
>
> I forget the exact invocation now, but the -k flag ("make links in
> downloaded HTML point to local files") was key, as was -m "mirror". In
> long form then, `wget --convert-links --mirror http://whatever`.
>
> This produced a local folder with a full mirror of the wiki, along with
> some degenerate cruft (e.g. it was somehow following search links for
> every term imaginable, which wasn't really relevant for my purposes),
> and the HTML documents it produced would open & display properly in
> Safari (correct layout with CSS & images, etc; links mostly worked
> correctly).

I did actually try this, but I really want to avoid the cruft bit - I'm
sure this will be a job that I have to repeat on a regular basis. Also I
seemed to be doing badly at controlling the 'don't go *there*' bit with
wget.
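For reference, this is roughly the sort of invocation I mean - the
excluded paths and the base URL below are only guesses at a typical trac
layout, not our actual setup:

    # mirror just the wiki pages, skipping the search/timeline/changeset
    # cruft; example.com/trac/wiki and the -X list are placeholders
    wget --mirror --convert-links --page-requisites --no-parent \
         --exclude-directories=/trac/search,/trac/timeline,/trac/changeset,/trac/browser \
         http://example.com/trac/wiki/

The -X/--exclude-directories list is the 'don't go *there*' part, and
--no-parent should stop it wandering above /trac/wiki/ in the first place,
but I haven't yet found a combination that catches everything.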

nic

