Encode::Mangled?
Richard Huxton
dev at archonet.com
Fri May 29 09:55:30 BST 2009
I'm dealing with data from a web page that claims to be ISO-8859-1 but
actually has some Win-1252 embedded in it. I can convert it to UTF-8 and
all seems well; however, the Win-1252 characters need remapping. It's
straightforward enough to handle the dozen or so chars I know about, but
I can't believe there isn't something on CPAN for this.
Concrete example:
Page claims 8859-1 but has the character equivalent to &#149; in it
(displays as a bullet). Note this isn't the HTML entity, it's a single
byte = 149. It looks fine in a web-browser because presumably the
browser special-cases it.
I can happily convert this to UTF-8 and store it (0xC2 0x95), but that is
U+0095, a C1 control character rather than a printable Unicode character
(and certainly not the bullet point). The equivalent should be U+2022
(decimal 8226).
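
To make the difference concrete, here is a minimal sketch using the core
Encode module. It decodes the same single byte twice, once as the declared
charset and once as Windows-1252 (Encode's name for it is "cp1252"). The
variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $byte = "\x95";    # the single byte 149 from the page

# Decoded as the declared charset: U+0095, a C1 control character.
my $as_latin1 = decode('iso-8859-1', $byte);

# Decoded as Windows-1252: U+2022 BULLET (decimal 8226).
my $as_cp1252 = decode('cp1252', $byte);

# Show the code point and the UTF-8 bytes that would end up in storage.
printf "iso-8859-1: U+%04X, UTF-8 bytes %s\n",
    ord($as_latin1), uc(unpack('H*', encode('UTF-8', $as_latin1)));
printf "cp1252:     U+%04X, UTF-8 bytes %s\n",
    ord($as_cp1252), uc(unpack('H*', encode('UTF-8', $as_cp1252)));
```

The first line reports U+0095 stored as C295, the second U+2022 stored as
E280A2, so the choice of source encoding is what decides whether the bullet
survives.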
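I *think* I'm safe in treating 8859-1 as win1252, since the two differ only
in the range 0x80-0x9F, where 8859-1 has C1 control codes and win1252 has
printable characters. That's not going to work with 8859-15 though, which
disagrees with win1252 at bytes such as 0xA4 (the euro sign in 8859-15).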
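
One way to sketch that approach, assuming the raw bytes and the declared
charset name are both to hand: decode a page declared as ISO-8859-1 with
cp1252 instead, and leave every other charset (8859-15 included) alone.
decode_page is a hypothetical helper name, not something from CPAN, and
this is a sketch rather than a tested solution for real-world pages.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode page bytes into a Perl character string,
# reading a declared ISO-8859-1 as Windows-1252. Other declared charsets
# are decoded exactly as stated.
sub decode_page {
    my ($bytes, $declared) = @_;

    # find_encoding resolves aliases such as "latin1" to a canonical name.
    my $enc = find_encoding($declared)
        or die "Unknown charset: $declared";

    # Assumption: a page labelled ISO-8859-1 may really be Windows-1252.
    $enc = find_encoding('cp1252') if $enc->name eq 'iso-8859-1';

    return $enc->decode($bytes);
}

# Byte 0x95 becomes U+2022 BULLET; 0xE9 is e-acute in both encodings.
my $text = decode_page("caf\xE9 \x95 item", 'ISO-8859-1');
```

Encode the result with encode('UTF-8', $text), or set an :encoding(UTF-8)
layer on the output handle, before storing or printing it.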
Now the *correct* solution is to track down the people responsible for
this travesty and beat them with sticks. Failing that, are people just
rolling their own three-line function each time?
--
Richard Huxton
Archonet Ltd