Encode::Mangled?

Fri May 29 10:59:09 BST 2009

Hi,

On May 29, 2009, at 10:55 , Richard Huxton wrote:
> Concrete example:
> Page claims 8859-1 but has the character equivalent to &#149; in it  
> (displays as a bullet). Note this isn't the HTML entity, it's a  
> single byte = 149. It looks fine in a web-browser because presumably  
> the browser special-cases it.
> I can happily convert this to UTF-8 and store it (xC295), but it's  
> not a displaying unicode character (and certainly not the bullet- 
> point). The equivalent should be: 8226.
> I *think* I'm safe in treating 8859-1 as win1252 since the latter is  
> a strict superset. That's not going to work with 8859-15 though.

Sorry, I'm not sure I understand precisely what the issue is. Can you  
not simply use Encode to convert it from CP-1252 to UTF-8? In "legacy  
situations" (i.e. broken HTML, most of the web), browsers normally  
default to CP-1252 if they haven't detected another encoding as it  
generally works, even for ISO-8859-1.
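
To make the difference concrete, here is a minimal sketch in Python (used purely for illustration; Perl's Encode provides the same two encodings). It decodes the single byte 149 from the quoted example under both ISO-8859-1 and CP-1252 and re-encodes each result as UTF-8. The variable names are illustrative.

```python
# The single byte from the quoted example: decimal 149, hex 0x95.
raw = bytes([149])

# Decoded as ISO-8859-1, 0x95 maps to U+0095, a C1 control character
# with no visible glyph. Re-encoded as UTF-8 it becomes C2 95.
as_latin1 = raw.decode("iso-8859-1")
latin1_utf8 = as_latin1.encode("utf-8")

# Decoded as CP-1252, the same byte maps to U+2022 (decimal 8226),
# the bullet. Re-encoded as UTF-8 it becomes E2 80 A2.
as_cp1252 = raw.decode("cp1252")
cp1252_utf8 = as_cp1252.encode("utf-8")

print(hex(ord(as_latin1)), latin1_utf8.hex())
print(ord(as_cp1252), cp1252_utf8.hex())
```

The first result is the non-displaying xC295 sequence described in the quoted message; the second is the bullet at code point 8226.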

If the page claims to be in ISO-8859-15 then the chances are that  
whoever is sending it to you knows what they're doing, and you can just  
use the real thing.

Or am I missing something?
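
The two cases above (decode pages that claim ISO-8859-1 as CP-1252, take ISO-8859-15 at its word) could be sketched roughly as follows, again in Python for illustration. The helper names `pick_codec` and `to_utf8` and the set of labels are hypothetical, and `errors="replace"` is an assumption to cope with the handful of bytes that Python's CP-1252 codec leaves undefined; it is not a description of what browsers do with those bytes.

```python
# Hypothetical helpers; the label set below is illustrative, not exhaustive.
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin1", "latin-1", "l1"}

def pick_codec(declared: str) -> str:
    """Return the codec to decode with, given the charset a page claims."""
    label = declared.strip().lower()
    # A page claiming ISO-8859-1 is decoded as CP-1252 instead.
    if label in LATIN1_LABELS:
        return "cp1252"
    # Anything else, ISO-8859-15 included, is used as declared.
    return label

def to_utf8(data: bytes, declared: str) -> bytes:
    """Decode bytes using the chosen codec and re-encode them as UTF-8."""
    # Assumption: undefined bytes become U+FFFD rather than raising.
    text = data.decode(pick_codec(declared), errors="replace")
    return text.encode("utf-8")

bullet = to_utf8(bytes([0x95]), "ISO-8859-1")   # byte 149 from the example
euro = to_utf8(bytes([0xA4]), "ISO-8859-15")    # 0xA4 is the euro sign there
print(bullet.hex(), euro.hex())
```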

If you're trying to process this as closely as possible to the way  
browsers do, the algorithm to follow is pretty scary, but it should  
have you covered:

   http://dev.w3.org/html5/spec/#determining-the-character-encoding

That might actually be worth an HTML5::DetectEncoding contribution to  
CPAN, as it would certainly help improve scrapers and friends.

-- 
Robin Berjon - http://berjon.com/
     Feel like hiring me? Go to http://robineko.com/