Encoding/decoding
Jurgen Pletinckx
jurgen.pletinckx at gmail.com
Wed Mar 31 17:50:57 BST 2010
Dave Hodgkinson wrote
| Welcome to perl encoding hell.
Nice and toasty in here, isn't it?
Dirk Koopman wrote
| It's (probably) not actually chewed up. It is what utf8 looks like
| when you display it in iso-8859-* or some form of ascii or M$/IBM
| codepage.
| There may actually be nothing to do other than make sure that the
| language environment variable is set correctly (if using something
| like a terminal window), I have "LANG=en_US.UTF-8" set on mine.
| Or, if we are talking web pages, make sure that (unlike CPAN) you
| have a character set declaration in the head, such as:
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
Aha! That does help. Yes, web pages is where it's at. Except I assumed the
text was mangled long before I pushed it out to the website and it got close
to a user's eyeballs.
When I add the meta tag above to a page's head, the W3 validator complains
that meta/http-equiv says UTF-8, but the actual HTTP headers say ISO-8859-1,
and it's inclined to believe those. And the httpd.conf for the site contains
AddDefaultCharset UTF-8. Bah. Can't even trust what you read in a conf file
these days.
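If the pages happen to be generated by a Perl script (an assumption; the post does not say how they are served), one way to sidestep the disagreement is to state the charset in the real HTTP header and send bytes that match it. AddDefaultCharset only fills in a charset when the response's Content-Type lacks one. A minimal sketch, with a made-up page body:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical page body, held as Perl characters (not bytes).
my $body = "<p>Plat pr\x{e9}f\x{e9}r\x{e9}</p>\n";

# Declare the charset in the actual HTTP header, then send matching bytes.
my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";
my $octets = encode('UTF-8', $body);

binmode STDOUT;    # raw bytes, no translation layers
print $header, $octets;
```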
And Mark Fowler wrote
| It's very hard for anyone to work out the solution to this unless we
| know *exactly* what is in the files, not how it's being rendered.
|
| What's the exact bytes stored in the files? Or more bluntly, what
| does this print:
|
| perl -e 'use Devel::Peek; open my $fh, "<:bytes", "filename" or die
| $!; undef $/; Dump <$fh>'
You say blunt, I call it idiot-proof. Anyway,
SV = PV(0x703ae8) at 0x72bbb0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x733bb0 "Plat pr\303\251f\303\251r\303\251\nDas M\344dchen Jeanne d\264Arc (Kr\374ck von Poturzyn, Maria J.)\n"\0
  CUR = 72
  LEN = 88
Am I correct in thinking that \303\251 is correct utf-8 for é (e-aigu), and
\344 correct latin-1 for ä (a-trema)? And that I'm going to burn for using
them mixed up with one another, as \303\251 is _also_ correct latin-1 for Ã©
(A-tilde copyright)?
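The ambiguity can be checked directly with the core Encode module. This is only an illustrative sketch; the variable names are made up, and the sample bytes are taken from the dump above:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same two bytes, read under two different encodings.
my $bytes = "\303\251";

my $as_utf8   = decode('UTF-8',      $bytes);   # 1 character:  U+00E9
my $as_latin1 = decode('ISO-8859-1', $bytes);   # 2 characters: U+00C3 U+00A9

for my $pair (['UTF-8', $as_utf8], ['Latin-1', $as_latin1]) {
    my ($name, $chars) = @$pair;
    printf "%-8s %d char(s): %s\n", $name, length($chars),
        join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# \344 followed by a plain letter is fine as Latin-1 but malformed as UTF-8.
# Decode a copy, because decode() with a CHECK value may modify its input.
my $copy       = "M\344dchen";
my $valid_utf8 = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
print "M\\344dchen valid UTF-8? ", ($valid_utf8 ? "yes" : "no"), "\n";
```

So both readings of \303\251 are legal, and only the \344 line gives itself away by failing strict UTF-8 decoding.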
Thanks, I feel positively enlightened! Of course, I would still like all
that text to use a single encoding. "How hard could it be?"
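One possible approach, sketched under the assumption that each line is either entirely UTF-8 or entirely Latin-1: try a strict UTF-8 decode per line and fall back to Latin-1 when that fails. The helper name to_utf8_bytes is hypothetical, and the sample lines are the two from the dump above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take one line of raw bytes, return it as UTF-8 bytes.
sub to_utf8_bytes {
    my ($line) = @_;

    # Try strict UTF-8 first. Work on a copy, because decode() with a
    # CHECK value may modify its input.
    my $copy  = $line;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };

    # Not well-formed UTF-8: assume Latin-1, where every byte is a character.
    $chars = decode('ISO-8859-1', $line) unless defined $chars;

    return encode('UTF-8', $chars);
}

my @lines = (
    "Plat pr\303\251f\303\251r\303\251\n",
    "Das M\344dchen Jeanne d\264Arc (Kr\374ck von Poturzyn, Maria J.)\n",
);

my @fixed = map { to_utf8_bytes($_) } @lines;

binmode STDOUT;    # emit raw bytes
print @fixed;
```

The heuristic cannot tell a genuine Latin-1 "Ã©" from UTF-8 "é", since both are the same bytes; it simply bets that well-formed UTF-8 was meant as UTF-8.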
--
Jurgen Pletinckx