djk at tobit.co.uk
Tue Mar 30 14:45:56 BST 2010
Jurgen Pletinckx wrote:
> Dear lazyweb,
> I have a problem. Let me illustrate it by pointing at the man page for
> HTML::Entities at
> http://search.cpan.org/~gaas/HTML-Parser-3.64/lib/HTML/Entities.pm. I happen
> to know that, in the synopsis, the source for that page says
> A. $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> Which, as I write this, contains nice and well-formed accented characters:
> a-accent-grave, e-accent-aigu, i-trema, a-accent-circonflexe. Dunno what it
> will look like in your mail reader or in the archives, of course.
> However, if I look at that page using any browser I have easy access to
> (Chrome 4.1 on WinXP, FF 3.0 on Ubuntu, IE8 on WinXP), I see instead
> B. $input = "vis-Ã -vis BeyoncÃ©'s naÃ¯ve\npapier-mÃ¢chÃ© rÃ©sumÃ©";
And doing this will tell you why:
> But that's not my actual problem, just the illustration. My actual problem
> is: I have a metric ton of data that looks like the chewed-up example B
It's (probably) not actually chewed up. It is what UTF-8 looks like when
you display it as ISO-8859-* or some other single-byte encoding such as
an M$/IBM codepage.
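To see the effect for yourself, here is a minimal Python sketch. The function names garble and repair are made up for illustration, and it assumes the wrong encoding was ISO-8859-1. Browsers often use Windows-1252 instead, which differs for a few bytes, so treat this as a demonstration and not a general fix.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(text: str) -> str:
    """Undo garble(): recover the original bytes, then decode them as UTF-8.

    Assumes the text was garbled exactly once, via ISO-8859-1.
    """
    return text.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    sample = "papier-mâché résumé"   # illustrative sample text
    broken = garble(sample)
    print(broken)                    # papier-mÃ¢chÃ© rÃ©sumÃ©
    print(repair(broken) == sample)  # True
```

If repair() raises UnicodeEncodeError or UnicodeDecodeError on your data, then the data was not garbled in this particular way, and guessing further is risky.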
There may actually be nothing to do other than make sure that the
language environment variable is set correctly (if using something like
a terminal window); I have "LANG=en_US.UTF-8" set on mine.
Or, if we are talking web pages, make sure that (unlike CPAN) you have a
character set declaration in the head, such as:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
is a wordy but comprehensive guide.