wiki scraping

Chris Devers cdevers at pobox.com
Thu Feb 28 16:04:27 GMT 2008


On Thu, 28 Feb 2008, Nic Gibson wrote:

> I've been asked to generate some pdf docs for one of our projects. Not 
> too hard. The problem is that the docs are currently in a trac wiki.

I had to do this recently, and just used wget.

I forget the exact invocation now, but the -k flag ("make links in 
downloaded HTML point to local files") was key, as was -m ("mirror"). 
In long form, that's `wget --convert-links --mirror http://whatever`.
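One caveat: -m on its own only follows hyperlinks, so if stylesheets 
and images aren't reachable that way, adding -p ("page requisites") 
tells wget to also fetch everything each page needs to render:

    wget --convert-links --mirror --page-requisites http://whatever

(The -p flag is my suggestion here, not part of the invocation I 
actually ran, which worked without it.)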

This produced a local folder with a full mirror of the wiki, along with 
some degenerate cruft (e.g. it somehow followed the search links for 
every term imaginable, which wasn't really relevant for my purposes), 
and the HTML documents it produced opened & displayed properly in 
Safari (correct layout with CSS & images, and links that mostly worked).
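If I were doing it again, I'd tell wget to skip the search pages 
entirely. In a stock Trac install the search lives under /search, so 
something like this ought to prune that cruft (the path is an 
assumption about the wiki's layout, so check yours first):

    wget --convert-links --mirror --exclude-directories=/search http://whatever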

I was considering taking the next step of making PDFs out of all of 
this, but laziness won the day and I decided this was Good Enough For Me.
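For what it's worth, if anyone does want to push on to PDFs, a loop 
over the mirrored HTML with a converter such as wkhtmltopdf would be a 
reasonable sketch of it (the mirror/ directory name is just a 
placeholder for wherever wget put things):

    # convert each mirrored HTML page to a PDF alongside it
    find mirror/ -name '*.html' | while read -r f; do
        wkhtmltopdf "$f" "${f%.html}.pdf"
    done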



-- 
Chris Devers
DO NOT LEAVE IT IS NOT REAL

