OCRing PDFs

Mon Jun 30 16:24:39 BST 2008

On Mon, Jun 30, 2008 at 03:57:49PM +0100, David Cantrell wrote:
>I have a very large PDF document.  It's something like 600 bitmaps, one
>per page.  Most pages contain both text and diagrams, and the layout is
>quite important.  Does anyone know of any software which will read the
>text and build an index in the PDF file, so that it's easily searchable?

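Extracting the text one page at a time is easy with pdftotext, provided
the pages carry a text layer; pages that are pure bitmaps would need an
OCR pass first.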
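
A rough sketch of the per-page extraction, in Python purely for illustration. It assumes the pdftotext binary (from xpdf or poppler) is on the PATH; the function names and the file name in the usage note are made up for the example. The -f and -l options select the first and last page to convert, -layout tries to preserve the physical layout, and "-" as the output file sends the text to stdout.

```python
import subprocess


def pdftotext_command(pdf_path, page):
    """Build the pdftotext command line for a single 1-based page number."""
    return [
        "pdftotext",
        "-f", str(page),  # first page to convert
        "-l", str(page),  # last page to convert (same page, so just one)
        "-layout",        # keep the physical layout of the text
        pdf_path,
        "-",              # write to stdout instead of a .txt file
    ]


def extract_page_text(pdf_path, page):
    """Run pdftotext on one page and return whatever text it produces."""
    result = subprocess.run(
        pdftotext_command(pdf_path, page),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Something like extract_page_text("manual.pdf", 42) would then return the text of page 42 as a string.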

Building the index sounds like a job for Perl.
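
The index itself is just a map from each word to the pages it appears on. A minimal sketch follows, in Python only for illustration; a Perl version would be a hash of page lists with the same shape. The names build_index, format_index and the min_length cutoff are assumptions for the example, and the word pattern is deliberately crude (plain ASCII letters only).

```python
import re
from collections import defaultdict


def build_index(pages, min_length=3):
    """Map each lowercased word to the sorted list of pages it occurs on.

    pages: dict of page number -> extracted text of that page.
    min_length: illustrative cutoff so very short words are skipped.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[A-Za-z]+", text):
            if len(word) >= min_length:
                index[word.lower()].add(page_number)
    # Sort words alphabetically and page numbers numerically.
    return {word: sorted(nums) for word, nums in sorted(index.items())}


def format_index(index):
    """Render the index as 'word  1, 4, 9' lines, one per word."""
    return [
        f"{word}  {', '.join(str(n) for n in nums)}"
        for word, nums in index.items()
    ]
```

The lines from format_index are what would go onto the extra page or two at the end. A real index would probably also want a stop-word list, which is left out here.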

Appending a new page or two to a PDF can be done by PDF::API2.

R