Encoding/decoding

Tue Mar 30 14:45:56 BST 2010

Jurgen Pletinckx wrote:
> Dear lazyweb,
> 
> 
> I have a problem. Let me illustrate it by pointing at the man page for
> HTML::Entities at
> http://search.cpan.org/~gaas/HTML-Parser-3.64/lib/HTML/Entities.pm. I happen
> to know that, in the synopsis, the source for that page says
> 
> A. $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> 
> Which, as I write this, contains nice and well-formed accented characters:
> a-accent-grave, e-accent-aigu, i-trema, a-accent-circonflexe. Dunno what it
> will look like in your mail reader or in the archives, of course.
> 
> However, if I look at that page using any browser I have easy access to
> (Chrome 4.1 on WinXP, FF 3.0 on Ubuntu, IE8 on WinXP), I see instead  
> 
> B. $input = "vis-Ã -vis BeyoncÃ©'s naÃ¯ve\npapier-mÃ¢chÃ© rÃ©sumÃ©";
> 

And doing this will tell you why:

http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2Fsearch.cpan.org%2F~gaas%2FHTML-Parser-3.64%2Flib%2FHTML%2FEntities.pm

> 
> But that's not my actual problem, just the illustration. My actual problem
> is: I have a metric ton of data that looks like the chewed-up example B
> above. 
>

It's (probably) not actually chewed up. It is what UTF-8 looks like when 
you display it as ISO-8859-* or as some other 8-bit extension of ASCII, 
such as a Microsoft or IBM code page.
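
To make that concrete, here is a minimal Python sketch of the mix-up and of
undoing it. The function names are mine, purely for illustration, and the
sketch assumes the data went through exactly one wrong decoding step, with
ISO-8859-1 as the wrong encoding. The strings are written with escapes so
the example does not depend on how this message itself gets displayed.

```python
# \u00e9 is e-acute; the word below is "resume" with both accents.
word = "r\u00e9sum\u00e9"


def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def undo_misread(garbled: str) -> str:
    """Reverse a single UTF-8-read-as-ISO-8859-1 mix-up."""
    return garbled.encode("iso-8859-1").decode("utf-8")


garbled = misread_as_latin1(word)
# Each e-acute became two characters: A-tilde (\xc3) and copyright sign (\xa9).
print(ascii(garbled))                 # 'r\xc3\xa9sum\xc3\xa9'
print(undo_misread(garbled) == word)  # True
```

If the data really has been stored in the garbled form, undo_misread() only
works while every byte is still intact. The a-grave in example B, for
instance, turns into A-tilde followed by a no-break space, and if anything
along the way replaced that with an ordinary space, the final decode raises
UnicodeDecodeError instead of returning the original text.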

There may actually be nothing to do other than make sure that the 
language environment variable is set correctly (if using something like 
a terminal window). I have "LANG=en_US.UTF-8" set on mine.
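
A rough way to check this from a script is sketched below. It relies on the
POSIX locale name layout, language[_territory][.codeset][@modifier], and on
LC_ALL and LC_CTYPE taking precedence over LANG. Those two extra variables
are my addition, not something mentioned above, and both function names are
made up for the example.

```python
import os
from typing import Mapping, Optional


def codeset_from_locale(value: str) -> Optional[str]:
    """Return the codeset part of a POSIX locale name, or None if absent."""
    name = value.split("@", 1)[0]          # drop any @modifier
    if "." not in name:
        return None
    return name.split(".", 1)[1] or None


def terminal_claims_utf8(environ: Mapping[str, str] = os.environ) -> bool:
    """Look at LC_ALL, LC_CTYPE and LANG in that order; first non-empty wins."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(var)
        if value:
            codeset = codeset_from_locale(value)
            if codeset is None:
                return False
            # Accept the common spellings "UTF-8" and "utf8".
            return codeset.lower().replace("-", "") == "utf8"
    return False


print(codeset_from_locale("en_US.UTF-8"))             # UTF-8
print(terminal_claims_utf8({"LANG": "en_US.UTF-8"}))  # True
```

This only reports what the environment claims. The terminal emulator itself
still has to be configured to interpret its input as UTF-8.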

Or, if we are talking about web pages, make sure that (unlike that CPAN 
page) you have a character set declaration in the head, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

http://www.w3.org/International/tutorials/tutorial-char-enc/#Slide0250
is a wordy but comprehensive guide.
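
If you want to check a pile of pages for a declaration like the one above,
something along these lines might do as a starting point. It is only a
sketch using Python's standard html.parser module. The class and function
names are hypothetical, and it also accepts the shorter meta charset form
that HTML5 allows, which is not the form shown above.

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetFinder(HTMLParser):
    """Record the first character set declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # Attribute names arrive lowercased; valueless attributes give None.
        attr = {name: (value or "") for name, value in attrs}
        if "charset" in attr:
            # Short form: <meta charset="utf-8">
            self.charset = attr["charset"].strip() or None
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Long form: content="text/html; charset=utf-8"
            for part in attr.get("content", "").split(";"):
                key, _, value = part.partition("=")
                if key.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip()
                    break


def declared_charset(html: str) -> Optional[str]:
    """Return the charset named in the markup, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"/></head><body></body></html>')
print(declared_charset(page))  # utf-8
```

Note that this looks at the markup only. A charset given in the HTTP
Content-Type header takes precedence over the meta declaration, so a page
can still be served correctly without one.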