andrew-perl08@mail.black1.org.uk andrew-perl08 at mail.black1.org.uk
Thu Dec 12 16:12:01 GMT 2013

On Thu, Dec 12, 2013 at 11:38:02AM +0000, Michael Lush wrote:
> pdf is where data goes to die.
> I've been peripherally involved in extracting data from tables in
> scientific papers, it is fairly easy to extract text from a pdf, but not
> the formatting, which is liable to get *horribly* scrambled.

I have tried extracting words from a 4-part vocal item and ended up with

We / We / We / We / wish / wish / wish / wish / you / you / you / you /

(where / represents a newline).  The problem is that the order of the text elements seems to be at the whim of the program producing the PDF, though I haven't investigated in detail.
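
If the extraction tool also reports a position for each text element, the reading order can often be rebuilt by sorting on position rather than trusting the order in the file. A minimal Python sketch of that idea follows. It assumes each fragment arrives as an (x, y, text) tuple in PDF-style coordinates (y grows upwards); the function name `reading_order`, the `line_tolerance` parameter and the sample coordinates are all made up for illustration and do not come from any particular PDF library.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Rebuild lines of text from positioned fragments.

    fragments: iterable of (x, y, text) tuples (a hypothetical format).
    Fragments whose y is within line_tolerance of the first fragment
    of a line are treated as belonging to that line.
    """
    lines = []  # each entry is (y of first fragment, [(x, text), ...])
    # Visit fragments from the top of the page downwards (largest y first).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order the words left to right by x.
    return [" ".join(t for _, t in sorted(words)) for _, words in lines]


# Made-up fragments emitted in the scrambled order described above:
# the same word for all four vocal parts, then the next word, and so on.
fragments = [
    (x, y, word)
    for x, word in [(10, "We"), (40, "wish"), (80, "you")]
    for y in (700, 600, 500, 400)
]

for line in reading_order(fragments):
    print(line)
```

This only helps when positions are available and the layout is simple rows of text; multi-column pages or rotated text would need more than a sort on y then x.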

More information about the london.pm mailing list