Michael Lush mjlush at gmail.com
Thu Dec 12 11:38:02 GMT 2013

PDF is where data goes to die.

I've been peripherally involved in extracting data from tables in
scientific papers. It is fairly easy to extract text from a PDF, but not
the formatting, which is liable to get *horribly* scrambled.

If I were actually given the job I'd be inclined to convert the table to an
image and use OCR to extract the text and formatting, and then use the text
directly extracted from the PDF to correct the misreads.  Either that or
look at getting the Mechanical Turk to do it.
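
A rough sketch of the "correct the misreads" step, in Python rather than
Perl and using only the standard library. It assumes you already have the
OCR output as a list of tokens and the PDF's extracted text as one string;
the function name and the 0.6 cutoff are made up for illustration. Each OCR
token is snapped to the closest word that actually occurs in the PDF text.

```python
import difflib


def correct_ocr_tokens(ocr_tokens, pdf_text, cutoff=0.6):
    """Replace each OCR token with its closest match from the PDF text.

    ocr_tokens: tokens in table order, as read by OCR (may contain misreads).
    pdf_text:   text pulled straight from the PDF (right characters,
                scrambled layout).
    cutoff:     similarity threshold (assumed value; tune on real data).
    """
    vocabulary = sorted(set(pdf_text.split()))
    corrected = []
    for token in ocr_tokens:
        if token in vocabulary:
            # OCR already agrees with the PDF text.
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        # Keep the OCR reading if nothing in the PDF text is close enough.
        corrected.append(matches[0] if matches else token)
    return corrected


if __name__ == "__main__":
    row = ["Sa1ary", "1O0.00"]
    print(correct_ocr_tokens(row, "Salary 100.00 Rent 450.00"))
```

The OCR pass supplies the row and column order, the PDF text supplies the
trustworthy characters. Short numeric cells that differ by one digit can
still be snapped to the wrong value, so anything touched by the fuzzy match
deserves a second look.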

You'll probably be able to hack something up to work with the bank
statements, since they will all be in the same format generated by the same
program, but that format is liable to break regularly when the statement
layout is altered or the program is updated/changed or the stars are wrong.
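
For the "hack something up" route, a minimal sketch in Python (standard
library only), assuming the statement text has already been extracted from
the PDF by some other tool. The line layout here (date, description, amount
and balance separated by two or more spaces) is invented for illustration
and is not the real Barclays format. Lines that don't match are handed back
rather than silently dropped, so a layout change shows up straight away.

```python
import csv
import io
import re

# Hypothetical layout: "12 Dec 2013  TESCO STORES  -12.34  1,234.56"
LINE_RE = re.compile(
    r"^(?P<date>\d{2} \w{3} \d{4})\s{2,}"
    r"(?P<description>.+?)\s{2,}"
    r"(?P<amount>-?[\d,]+\.\d{2})\s{2,}"
    r"(?P<balance>-?[\d,]+\.\d{2})$"
)


def statement_to_csv(text):
    """Return (csv_string, unmatched_lines) for already-extracted statement text."""
    rows = []
    unmatched = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            rows.append([
                match.group("date"),
                match.group("description"),
                # Drop thousands separators so the values parse as numbers.
                match.group("amount").replace(",", ""),
                match.group("balance").replace(",", ""),
            ])
        else:
            # Headers, footers, or a sign that the layout has changed.
            unmatched.append(line)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "description", "amount", "balance"])
    writer.writerows(rows)
    return out.getvalue(), unmatched


if __name__ == "__main__":
    sample = "12 Dec 2013  TESCO STORES  -12.34  1,234.56\nPage 1 of 2\n"
    csv_text, leftovers = statement_to_csv(sample)
    print(csv_text)
    print("unmatched:", leftovers)
```

If the unmatched list suddenly grows from one statement to the next, that is
the cue that the layout or the generating program has changed and the
pattern needs updating.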


On Thu, Dec 12, 2013 at 10:47 AM, Dave Hodgkinson <davehodg at gmail.com> wrote:

> I'm about to hit CPAN, but any wisdom from you lovely people
> would be nice!
> I've got bank statements in PDF from Barclays. Would it be easy
> to produce a CSV of the statement parts from them?
> What's the go-to PDF module?

More information about the london.pm mailing list