OCRing PDFs
Jurgen Pletinckx
jurgen.pletinckx at gmail.com
Mon Jun 30 21:34:33 BST 2008
2008/6/30 David Cantrell <david at cantrell.org.uk>:
> I have a very large PDF document. It's something like 600 bitmaps,
> one per page. Most pages contain both text and diagrams, and the
> layout is quite important. Does anyone know of any software which
> will read the text and build an index in the PDF file, so that it's
> easily searchable?
>
> All I need is for it to be able to spot that the word "frobnitz"
> occurs on pages 13, 200, 255 and 432, not to attempt to convert the
> file to Wyrd or anything like that.
I've just tested our OCR app, ABBYY FineReader 8, on the first page
of US patent 5159081 - see http://tinyurl.com/3h68wq (for the full
text) or http://tinyurl.com/4bg4ts (for the first page as a TIFF - may
need a plugin). I actually started from a bitmap PDF version of this
patent, but that is not freely available online, and it is a bit too
large to send to the list as an attachment.
Attached is the OCR-to-PDF output from FineReader. As you can see, the
OCR is not faultless, but the file does contain indexed text.
Extracting the text for out-of-band searching is feasible with e.g.
the xpdf tools. Adding the indexed text to the bitmap version might be
slightly more challenging...
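
For the out-of-band route, here is a rough sketch of the "which pages
mention frobnitz" lookup. It assumes pdftotext (from the xpdf tools) is
on the PATH and that it separates pages with form feed characters,
which is its default behaviour. The function names and the
word-splitting rule are my own choices for illustration, not anything
FineReader or xpdf provides.

```python
import re
import subprocess
import sys
from collections import defaultdict


def extract_text(pdf_path):
    """Run pdftotext and return the text; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def build_page_index(text):
    """Map each lowercased word to a sorted list of 1-based page numbers.

    Pages are assumed to be separated by form feeds ("\f").
    """
    index = defaultdict(set)
    for page_number, page in enumerate(text.split("\f"), start=1):
        # Crude tokenisation: runs of ASCII letters and digits only.
        for word in re.findall(r"[A-Za-z0-9]+", page.lower()):
            index[word].add(page_number)
    return {word: sorted(nums) for word, nums in index.items()}


if __name__ == "__main__":
    # Usage: python pageindex.py document.pdf frobnitz
    pdf_path, word = sys.argv[1], sys.argv[2].lower()
    index = build_page_index(extract_text(pdf_path))
    print(word, "->", index.get(word, []))
```

OCR errors will of course leak into the index, so a misread word
simply won't be found under its correct spelling.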
I don't know the status of freeware OCR. FineReader has been well
worth the money to me, even if I think the actual text recognition
could be better.
Apologies if sending attachments is considered gauche, or if the
mailing list software eats the attachment.
--
Jurgen Pletinckx