wiki scraping

IvorW combobulus at xemaps.com
Thu Feb 28 15:42:10 GMT 2008


Nic Gibson wrote:
> Afternoon all
>
> I'm after a bit of advice (and rough plan knocking down) and it's sort of
> perlish. Well, I plan to use perl to do it...
>
> I've been asked to generate some pdf docs for one of our projects. Not too
> hard. The problem is that the docs are currently in a trac wiki. I don't
> have access to the database (assuming trac keeps the wiki in a db) or the
> server (big internationals being what they are) so I'm going to have to grab
> it in some sort of mirroring manner. Now, iirc, trac lets you append
> 'format=text' to the url and get the content so I plan to do it that way.
>
> I'm planning to put together a little script using LWP::UserAgent and so on,
> convert the wiki markup to xml, feed it through FOP and hand over a pdf.
>
> Does that sound sane? Is there some little tool lurking somewhere that can
> do any of this for me? Have I missed an obvious solution?
>   
Reasonably sane. If any feeds are available, such as RDF, RSS or
Atom, they may help you get at the raw data while ignoring the
formatting. Still, format=text may give you this.
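
For the mirroring step, a minimal sketch with LWP::UserAgent might look
like the following. The base URL and page names below are placeholders,
and I'm assuming the raw-export parameter is format=txt, which is what
the Trac versions I've seen use (adjust if your instance wants
format=text as you recall):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Placeholder base URL and page names -- substitute the real wiki's.
    my $base  = 'http://trac.example.com/wiki';
    my @pages = qw(WikiStart UserGuide);

    my $ua = LWP::UserAgent->new( agent => 'wiki-mirror/0.1' );

    for my $page (@pages) {
        # Appending format=txt asks Trac for the raw wiki source
        # rather than the rendered HTML page.
        my $resp = $ua->get("$base/$page?format=txt");
        unless ( $resp->is_success ) {
            warn "Couldn't fetch $page: " . $resp->status_line . "\n";
            next;
        }
        open my $fh, '>', "$page.txt"
            or die "Can't write $page.txt: $!";
        print {$fh} $resp->decoded_content;
        close $fh;
    }

From there you can point your wiki-markup-to-XML converter at the saved
.txt files before handing the result to FOP.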

See http://search.cpan.org/~ivorw/OpenGuides-RDF-Reader/ (in particular
the og_mirror script that comes with it) for how I used this approach
for OpenGuides.

Ivor.
