Jurgen Pletinckx jurgen.pletinckx at gmail.com
Tue Mar 30 13:54:18 BST 2010

Dear lazyweb,

I have a problem. Let me illustrate it by pointing at the man page for
HTML::Entities at
http://search.cpan.org/~gaas/HTML-Parser-3.64/lib/HTML/Entities.pm. I happen
to know that, in the synopsis, the source for that page says

A. $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";

Which, as I write this, contains nice and well-formed accented characters:
a-accent-grave, e-accent-aigu, i-trema, a-accent-circonflexe. Dunno what it
will look like in your mail reader or in the archives, of course.

However, if I look at that page using any browser I have easy access to
(Chrome 4.1 on WinXP, FF 3.0 on Ubuntu, IE8 on WinXP), I see instead  

B. $input = "vis-Ã -vis BeyoncÃ©'s naÃ¯ve\npapier-mÃ¢chÃ© rÃ©sumÃ©";

with all the accented characters from the original replaced by 'interesting'
combinations of two characters. I assume somewhere along the line, a piece
of software either cannot handle the encoding used, or gives the wrong
information about the encoding it uses to output text. Could well be on my
end.
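
For what it's worth, here is my guess at how the mangling happens, as a
small Perl sketch. It assumes the text is stored as UTF-8 and then read by
something that believes it is ISO-8859-1; both of those are assumptions on
my part, not something I have confirmed. The variable names are made up,
and the string is a shortened version of A.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Shortened version A, written with escapes so that the encoding of
# this script file itself does not matter.
my $version_a = "vis-\x{e0}-vis Beyonc\x{e9}'s na\x{ef}ve";

# Assumption: the text gets stored as UTF-8 bytes...
my $bytes = encode('UTF-8', $version_a);

# ...and is later read back as if those bytes were ISO-8859-1, so each
# two-byte accented character turns into two separate characters.
my $version_b = decode('ISO-8859-1', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print "$version_b\n";
```

On a UTF-8 terminal this prints something that looks very much like B, which
is why I suspect this is at least one of the ways my data got chewed up.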

But that's not my actual problem, just the illustration. My actual problem
is: I have a metric ton of data that looks like the chewed-up example B.

Is there a way of restoring version B to version A? I understand it's not
possible in the general case, when the encoding of a particular original
character is not known. However, I have three things going for me:
* We're only talking about a handful of different sources and original
encodings
* I still have *some* of the original data
* I know that putting these text fragments through the PHP/symfony function
'esc_entities' resolves the problem. Except in cases where it doesn't, or
blows up. Still, I'd prefer a solution that doesn't involve putting PHP on
_my_ production machine. 

On the negative side, I should say I'm out of my depth. This mail contains
just about the sum total of my knowledge about UTF-8, Unicode etc. I have
messed around a bit with Encode and with HTML::Entities (misguided by the
mangled webpage), but to no real effect. 

Suggestions? Using small words, if possible? Ta.

Jurgen Pletinckx
