character set detection?

Dominic Mitchell dom at happygiraffe.net
Sun Jan 7 15:16:19 GMT 2007


Dominic Mitchell wrote:
> ben at bpfh.net wrote:
>> On Sun, Jan 07, 2007 at 01:13:12PM +0000, Dominic Mitchell wrote:
>>> That is, treat the input as UTF-8 by default (which *is* reliably 
>>> recognisable, and also catches plain ASCII), and failing that, treat 
>>> it as Windows-1252, which is (more-or-less) a superset of ISO-8859-1.
>>
>> Um. UTF-8 has some multi-byte characters which are also valid ISO-8859-1,
>> I believe, although this is a corner case.
> 
> Well, all UTF-8 is also valid ISO-8859-1.  But, if the string is correct 
> UTF-8, it's exceedingly unlikely that you are also looking at valid 
> ISO-8859-1.  The probability gets smaller the longer the length of the 
> UTF-8 string.

I mean "It's exceedingly unlikely that you are looking at _actual_ 
ISO-8859-1"...
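
For illustration, here is a minimal sketch of the fallback order described 
above, written in Python. The function name decode_guess is hypothetical, 
and the final ISO-8859-1 step is an added assumption: Python's cp1252 codec 
rejects the five byte values that Windows-1252 leaves undefined, so something 
has to catch those.

```python
from typing import Tuple


def decode_guess(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for a byte string of unknown encoding.

    Hypothetical helper: tries strict UTF-8 first, then Windows-1252.
    """
    # Strict UTF-8 decoding fails on malformed sequences, which is what
    # makes it a usable test. Plain ASCII passes here as well.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Not valid UTF-8, so assume Windows-1252. Python's cp1252 codec
    # raises on the undefined bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D.
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # Assumption, not part of the original suggestion: ISO-8859-1
        # maps every byte to a code point, so this step cannot fail.
        return data.decode("latin-1"), "latin-1"
```

For example, decode_guess(b"caf\xe9") falls through to cp1252, because a 
lone 0xE9 byte starts a multi-byte UTF-8 sequence that never completes.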

-Dom


More information about the london.pm mailing list