character set detection?
Dominic Mitchell
dom at happygiraffe.net
Sun Jan 7 14:46:36 GMT 2007
ben at bpfh.net wrote:
> On Sun, Jan 07, 2007 at 01:13:12PM +0000, Dominic Mitchell wrote:
>> That is, treat the input as UTF-8 by default (which *is* reliably
>> recognisable, and also catches plain ASCII), and failing that, treat it
>> as Windows-1252, which is (more-or-less) a superset of ISO-8859-1.
>
> Um. UTF-8 has some multi-byte characters which are also valid ISO-8859-1,
> I believe, although this is a corner case.
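Well, all UTF-8 also decodes as ISO-8859-1, since every byte value maps
to something there (counting the C1 controls). But if the string is
correct UTF-8, it's exceedingly unlikely that it was really meant as
ISO-8859-1: every multi-byte character would have to make sense as a run
of accented letters and symbols. The probability gets smaller the longer
the UTF-8 string.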
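
For illustration, here is a rough Python sketch of the fallback described
above: try strict UTF-8 first, and only if that fails decode as
Windows-1252. The function name decode_guess is made up for this example,
and replacing the handful of bytes that Windows-1252 leaves undefined is
my own assumption about what you'd want, not something settled in this
thread.

```python
from typing import Tuple


def decode_guess(data: bytes) -> Tuple[str, str]:
    """Return (text, codec_name), trying UTF-8 first, then Windows-1252.

    decode_guess is a hypothetical helper written for this example.
    """
    try:
        # Strict decoding rejects anything that is not well-formed UTF-8.
        # Plain ASCII passes too, since ASCII is a subset of UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: Python's cp1252 codec leaves a few byte values
        # undefined, so replace those with U+FFFD instead of failing.
        return data.decode("cp1252", errors="replace"), "cp1252"


# The corner case from the thread: these two bytes are valid UTF-8 for
# "é", but they also decode as ISO-8859-1, giving "Ã©" instead.
ambiguous = "é".encode("utf-8")
as_latin1 = ambiguous.decode("iso-8859-1")
```

Here decode_guess(ambiguous) picks UTF-8, which is almost always the right
call: "Ã©" is legal ISO-8859-1 but rarely what anyone meant to write.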
-Dom