robin at berjon.com
Fri May 29 10:59:09 BST 2009
On May 29, 2009, at 10:55 , Richard Huxton wrote:
> Concrete example:
> Page claims 8859-1 but has the character equivalent to • in it
> (displays as a bullet). Note this isn't the HTML entity, it's a
> single byte = 149. It looks fine in a web-browser because presumably
> the browser special-cases it.
> I can happily convert this to UTF-8 and store it (xC295), but it's
> not a displaying unicode character (and certainly not the bullet-
> point). The equivalent should be: 8226.
> I *think* I'm safe in treating 8859-1 as win1252 since the latter is
> a strict superset. That's not going to work with 8859-15 though.
Sorry, I'm not sure I understand precisely what the issue is. Can you
not simply use Encode to convert it from CP-1252 to UTF-8? In "legacy
situations" (i.e. broken HTML, which is most of the web), browsers
normally fall back to CP-1252 when they haven't detected another
encoding, since it generally works, even for pages labelled ISO-8859-1.
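
To make the byte-149 case concrete, here is a minimal sketch in Python
rather than Perl (the same idea should carry over to Encode's decode and
encode). The sample bytes and variable names are made up for
illustration; only the codec names are real.

```python
# Hypothetical sample: byte 149 (0x95) sitting in otherwise plain text.
raw = b"item \x95 one"

# Decoded as ISO-8859-1, 0x95 becomes U+0095, an invisible C1 control.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as CP-1252, the same byte becomes U+2022, the bullet.
as_cp1252 = raw.decode("cp1252")

# Re-encoding to UTF-8 shows the two outcomes side by side:
# C2 95 for the control character, E2 80 A2 for the real bullet.
utf8_from_latin1 = as_latin1.encode("utf-8")
utf8_from_cp1252 = as_cp1252.encode("utf-8")
```

The C2 95 sequence is what you end up storing if you trust the 8859-1
label; decoding as CP-1252 first gives the bullet you actually wanted.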
If the page claims to be in ISO-8859-15 then the chances are that
whoever is sending it to you knows what they're doing, and you can just
use the real thing.
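
A rough sketch of that rule, again in Python and with a hypothetical
helper name (codec_for_label is not from any library): treat a declared
ISO-8859-1 as CP-1252, and take any other label the codec registry
recognises at face value.

```python
import codecs

def codec_for_label(label):
    """Pick a codec for a declared charset label (hypothetical helper).

    Assumption: ISO-8859-1 and ASCII labels are decoded as CP-1252,
    unknown labels also fall back to CP-1252, anything else is trusted.
    """
    try:
        name = codecs.lookup(label.strip()).name
    except LookupError:
        return "cp1252"
    if name in ("iso8859-1", "ascii"):
        return "cp1252"
    return name

# Byte 0xA4 is where the two encodings visibly disagree:
# the euro sign in ISO-8859-15, the currency sign in CP-1252.
euro = b"\xa4".decode(codec_for_label("ISO-8859-15"))
currency = b"\xa4".decode(codec_for_label("ISO-8859-1"))
```

Note that a handful of bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined
in Python's cp1252 codec, so real input probably wants
errors="replace" on the decode.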
Or am I missing something?
If you're trying to process this in as much as possible the same way
that browsers do, the algorithm to follow is pretty scary, but should
get you covered:
That might actually be worth an HTML5::DetectEncoding contribution to
CPAN, as it would certainly help improve scrapers and friends.
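
For a feel of what such a detector might look like, here is a heavily
simplified sketch in Python. It is not the browser algorithm, only an
approximation of its general shape under my own assumptions: check for
a BOM, then a charset from the HTTP header, then a <meta> declaration
near the top of the page, and finally fall back to CP-1252. The
function name, the 1024-byte scan window and the regex are all made up
for illustration.

```python
import codecs
import re

# Byte-order marks checked first (UTF-32 deliberately left out).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Crude pattern for charset=... inside a <meta> tag; real pages are messier.
_META = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.I)

def guess_encoding(body, http_charset=None):
    """Guess a codec name for raw page bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    label = http_charset
    if label is None:
        # Only look at the start of the document, as an arbitrary limit.
        match = _META.search(body[:1024])
        label = match.group(1).decode("ascii") if match else None
    if label is not None:
        try:
            name = codecs.lookup(label.strip()).name
        except LookupError:
            name = None
        if name in ("iso8859-1", "ascii"):
            return "cp1252"  # legacy labels are decoded as CP-1252
        if name is not None:
            return name
    return "cp1252"  # nothing usable declared: legacy default
```

Typical use would be something like
body.decode(guess_encoding(body, header_charset), errors="replace"),
where header_charset is whatever you pulled out of Content-Type.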
Robin Berjon - http://berjon.com/
Feel like hiring me? Go to http://robineko.com/