Jurgen Pletinckx jurgen.pletinckx at gmail.com
Mon Jun 30 21:34:33 BST 2008

2008/6/30 David Cantrell <david at cantrell.org.uk>:
> I have a very large PDF document.  It's something like 600 bitmaps,
> one per page.  Most pages contain both text and diagrams, and the
> layout is quite important.  Does anyone know of any software which
> will read the text and build an index in the PDF file, so that it's
> easily searchable?
> All I need is for it to be able to spot that the word "frobnitz"
> occurs on pages 13, 200, 255 and 432, not to attempt to convert the
> file to Wyrd or anything like that.

I've just tested our OCR app, ABBYY FineReader 8, on the first page
of US patent 5159081 - see http://tinyurl.com/3h68wq (for full text)
or http://tinyurl.com/4bg4ts (for the first page as TIFF - may need
a plugin). I actually started from a bitmap PDF version of this
patent, but that is not freely available online, and is a bit too
large to send to the list as an attachment.

Attached is the OCR-to-PDF output from FineReader. As you can see, the
OCR is not faultless, but it contains indexed text. Extracting the
text for out-of-band searching is feasible with e.g. the xpdf tools.
Adding the indexed text to the bitmap version might be slightly more
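
For the out-of-band search, something like the following might do as a
starting point. It is only a sketch, not something I have run on the
600-page document: it assumes pdftotext (from the xpdf tools) is on the
PATH and that it separates pages with form feed characters, which is its
default behaviour. The function names build_page_index and extract_text
are my own, made up for illustration.

```python
import re
import subprocess
from collections import defaultdict


def build_page_index(text):
    """Map each lowercased word to the sorted 1-based pages it occurs on."""
    pages = text.split("\f")
    # pdftotext ends every page with a form feed, so drop the empty tail.
    if pages and not pages[-1].strip():
        pages.pop()
    index = defaultdict(set)
    for page_number, page_text in enumerate(pages, start=1):
        # Crude tokenizer: runs of ASCII letters and digits only.
        for word in re.findall(r"[a-z0-9]+", page_text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def extract_text(pdf_path):
    """Run pdftotext on an OCRed PDF; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With that, build_page_index(extract_text("scan.pdf")).get("frobnitz", [])
would give the list of pages for a word ("scan.pdf" being a placeholder
file name). OCR errors will of course cause misses, so the index is only
as good as the recognition.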

I don't know the status of freeware OCR. FineReader has been well
worth the money to me, even if I think the actual text recognition
could be better.

Apologies if sending attachments is considered gauche, or if the
mailing list software eats the attachment.

Jurgen Pletinckx

More information about the london.pm mailing list