Richard Huxton dev at archonet.com
Fri May 29 09:55:30 BST 2009

I'm dealing with data from a web-page that claims to be ISO-8859-1 but
actually has some Win-1252 embedded in it. I can convert it to UTF-8 and
all seems well; however, the Win-1252 characters need mapping. It's
straightforward enough to handle the dozen or so chars I know about but
I can't believe there isn't something on CPAN for this.

Concrete example:
Page claims 8859-1 but has the character equivalent to &#149; in it
(displays as a bullet). Note this isn't the HTML entity, it's a single
byte = 149 (0x95). It looks fine in a web-browser because presumably the
browser special-cases it.
I can happily convert this to UTF-8 and store it (0xC2 0x95), but it's
not a displayable Unicode character (and certainly not the bullet-point).
The equivalent should be: 8226 (U+2022).
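I *think* I'm safe in treating 8859-1 as win1252 since the latter is a
superset for all printable characters (it only differs in the 0x80-0x9F
range, which 8859-1 leaves to control codes). That's not going to work
with 8859-15 though, which changes printable characters too (e.g. the
euro sign at 0xA4).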

Now the *correct* solution is to track down the people responsible for 
this travesty and beat them with sticks. Failing that, are people just 
rolling their own three-line function each time?
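
For illustration, a roll-your-own function of that kind might look like
the sketch below. It is written in Python rather than Perl, and
fix_c1_controls is a hypothetical name, not a function from any library.
It assumes the text was already decoded as ISO-8859-1, so any code point
in U+0080..U+009F is really a stray Windows-1252 byte.

```python
def fix_c1_controls(text):
    """Remap C1 control characters to their Windows-1252 meanings.

    Hypothetical helper: assumes `text` was decoded as ISO-8859-1.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            # Re-interpret the original byte as Windows-1252. The five
            # bytes cp1252 leaves undefined are passed through unchanged.
            try:
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                pass
        out.append(ch)
    return "".join(out)


# Bullet (0x95) and curly quotes (0x93, 0x94) mixed with a real Latin-1 e-acute.
fixed = fix_c1_controls("\x95 caf\xe9 \x93quoted\x94")
print([hex(ord(c)) for c in fixed])
```

Working on decoded text rather than raw bytes means the same function can
clean up data that has already been stored as UTF-8, at the cost of
assuming no genuine C1 control codes were intended.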

   Richard Huxton
   Archonet Ltd
