[OT] Encode woes
Daniel Pittman
daniel at rimspace.net
Fri Sep 25 10:36:06 BST 2009
Philip Newton <philip.newton at gmail.com> writes:
> On Fri, Sep 25, 2009 at 09:54, Dirk Koopman <djk at tobit.co.uk> wrote:
>> Dirk Koopman wrote:
>>>
>>> Now, is there a reasonably reliable way of determining what we have, on a
>>> string by string basis, to at least tell whether we are dealing with utf8
>>> or iso-8859 (not caring which variant) so that I can drive Encode
>>> appropriately to avoid crashes of the above type.
There isn't one. You /can/ check for valid or invalid UTF-8, and make a guess
about it, or perhaps use something like Encode::Detect, but nothing can
completely reliably determine which is which.
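A minimal sketch of the "check for valid UTF-8" guess, using the core Encode
module. The helper name looks_like_utf8 is made up for illustration, and a
true result is only a hint: plain ASCII is valid UTF-8, and some Latin-1 byte
pairs (for example "\xC3\xA9") happen to be valid UTF-8 as well.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
# This is a guess about the encoding, not a proof of it.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes while it checks.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}
```

Called on a string read straight from the database, this says "could be
UTF-8" or "definitely is not", which is about as much as the data can reveal.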
>>> Or how do I completely switch off utf8 encoding/decoding - everywhere - in
>>> an 80,000 line perl app.
I am honestly surprised it got turned on anywhere; I fear that I don't know a
mechanism for doing this universally short of modifying all the code, sorry.
>> As no-one seems interested in this, or maybe no-one else has had these
>> problems themselves, can anyone suggest a better mailing list to poll?
>
> I was going to suggest Encode::is_utf8 and/or utf8::is_utf8, but I wasn't
> sure whether it would actually solve your problem so I thought I'd rather
> stay quiet and hope someone with real-world experience in utf8 woes would
> pipe up.
From my real-world experience, by the time you have a database that contains a
mixture of text with a random encoding, and you don't have a way to
distinguish them, you have already lost.
Personally, I would consider encoding everything to UTF-8 based on a basic "is
it valid UTF-8 text? If so, UTF-8, if not, Latin1" test, and then work from
there.
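That conversion could be sketched roughly as follows, assuming the core Encode
module and treating every non-UTF-8 string as ISO-8859-1. The helper name
to_utf8_bytes is hypothetical, and the Latin-1 fallback is an assumption: it
will quietly mangle text that was really in some other 8-bit encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Hypothetical helper: takes raw bytes of unknown encoding, returns UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    # Try strict UTF-8 first; eval yields undef if the bytes are malformed.
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    # Not valid UTF-8: assume Latin-1, where every byte maps to a character.
    $text = decode('ISO-8859-1', $bytes) unless defined $text;
    return encode('UTF-8', $text);
}
```

Running each stored string through something like this once, and writing the
result back, leaves the database in a single known encoding to work from.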
Plus, manually fix instances where people actually complain.
Daniel
--
✣ Daniel Pittman ✉ daniel at rimspace.net ☎ +61 401 155 707
♽ made with 100 percent post-consumer electrons
Looking for work? Love Perl? In Melbourne, Australia? We are hiring.