OCRing PDFs

Mon Jun 30 21:34:33 BST 2008

2008/6/30 David Cantrell <david at cantrell.org.uk>:
> I have a very large PDF document.  It's something like 600 bitmaps,
> one per page.  Most pages contain both text and diagrams, and the
> layout is quite important.  Does anyone know of any software which
> will read the text and build an index in the PDF file, so that it's
> easily searchable?
>
> All I need is for it to be able to spot that the word "frobnitz"
> occurs on pages 13, 200, 255 and 432, not to attempt to convert the
> file to Wyrd or anything like that.

I've just tested our OCR app, ABBYY FineReader 8, on the first page
of US patent 5159081 - see http://tinyurl.com/3h68wq (for the full
text) or http://tinyurl.com/4bg4ts (for the first page as a TIFF -
may need a plugin). I actually started from a bitmap PDF version of
this patent, but that is not freely available online, and it is a bit
too large to send to the list as an attachment.

Attached is the OCR-to-PDF output from FineReader. As you can see, the
OCR is not faultless, but the file does contain indexed text. Extracting
the text for out-of-band searching is feasible with e.g. the xpdf tools.
Adding the indexed text to the bitmap version might be slightly more
challenging...
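
For the out-of-band route, here is a rough sketch of what I have in
mind, not something I have run against the full document. It assumes
pdftotext (from xpdf) is on the PATH and that it separates pages with
form feeds in its output. The function names build_index and index_pdf
are my own inventions for illustration.

```python
import re
import subprocess
from collections import defaultdict


def build_index(text):
    """Map each lowercased word to the sorted list of 1-based pages it occurs on.

    Assumes pages are separated by form feeds, as in pdftotext output.
    """
    index = defaultdict(set)
    pages = text.split("\f")
    # A form feed after the last page leaves an empty trailing chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    for number, page in enumerate(pages, start=1):
        for word in re.findall(r"[a-z0-9]+", page.lower()):
            index[word].add(number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def index_pdf(path):
    """Hypothetical helper: run pdftotext on `path` and index what it prints."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        check=True,
    )
    return build_index(result.stdout.decode("utf-8", errors="replace"))
```

With that, something like index_pdf("scan.pdf").get("frobnitz", [])
(the file name is made up) would give the list of pages for a word.
It only finds what the OCR got right, so misrecognised words will be
missed.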

I don't know the status of freeware OCR. FineReader has been well
worth the money to me, even if I think the actual text recognition
could be better.

Apologies if sending attachments is considered gauche, or if the
mailing list software eats the attachment.

--
Jurgen Pletinckx