PDF to CSV?
andrew-perl08@mail.black1.org.uk
andrew-perl08 at mail.black1.org.uk
Thu Dec 12 16:12:01 GMT 2013
On Thu, Dec 12, 2013 at 11:38:02AM +0000, Michael Lush wrote:
> pdf is where data goes to die.
>
> I've been peripherally involved in extracting data from tables in
> scientific papers. It is fairly easy to extract text from a PDF, but not
> the formatting, which is liable to get *horribly* scrambled.
I have tried extracting words from a 4-part vocal item and ended up with
We / We / We / We / wish / wish / wish / wish / you / you / you / you /
(where / represents a newline). The problem is that the order of the text
elements seems to be at the whim of the program producing the PDF, though
I haven't investigated in detail.
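
If the extraction tool can also report where each text element sits on the page, one possible workaround is to ignore the order in the file and rebuild the lines from the coordinates. The following is only a rough sketch in Python. The (x, y, text) tuples, the function name reorder_fragments, and the line_tolerance parameter are all made up for illustration and do not come from any particular PDF library. It also assumes y grows upward from the bottom of the page, as in PDF user space.

```python
def reorder_fragments(fragments, line_tolerance=2.0):
    """Rebuild reading order from hypothetical (x, y, text) fragments.

    Fragments whose y values lie within line_tolerance of the first
    fragment on a line are treated as the same line; each line is then
    sorted left to right by x.
    """
    lines = []  # each entry: (y of the line's first fragment, [(x, text), ...])
    # Highest y first, i.e. top of the page first (assumes y grows upward).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [
        " ".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for _, items in lines
    ]


# Made-up coordinates imitating the 4-part score: one lyric line per voice,
# but stored in the file word by word across all four voices.
fragments = [
    (10, 700, "We"), (10, 600, "We"), (10, 500, "We"), (10, 400, "We"),
    (50, 700, "wish"), (50, 600, "wish"), (50, 500, "wish"), (50, 400, "wish"),
    (90, 700, "you"), (90, 600, "you"), (90, 500, "you"), (90, 400, "you"),
]

for line in reorder_fragments(fragments):
    print(line)
```

With these invented coordinates each voice comes out as its own "We wish you" line. Real documents would need more care (columns, rotated text, words split into several fragments), so treat this as a starting point rather than a fix.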