[OT] Encode woes

Fri Sep 25 10:36:06 BST 2009

Philip Newton <philip.newton at gmail.com> writes:
> On Fri, Sep 25, 2009 at 09:54, Dirk Koopman <djk at tobit.co.uk> wrote:
>> Dirk Koopman wrote:
>>>
>>> Now, is there a reasonably reliable way of determining what we have, on a
>>> string by string basis, to at least tell whether we are dealing with utf8
>>> or iso-8859 (not caring which variant) so that I can drive Encode
>>> appropriately to avoid crashes of the above type.

There isn't one.  You /can/ check for valid or invalid UTF-8, and make a guess
about it, or perhaps use something like Encode::Detect, but nothing can
determine which is which with complete reliability.
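
A minimal sketch of that validity check, assuming the strings arrive as raw
bytes.  The helper name looks_like_utf8 is made up for illustration; only
Encode::decode and the FB_CROAK check value come from Encode itself.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# A true result is only a guess: pure ASCII always passes, and some
# Latin-1 byte sequences happen to be well-formed UTF-8 as well.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input under FB_CROAK
    # Strict 'UTF-8' decoding croaks on malformed input; eval turns the
    # croak into undef.
    return defined eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
    };
}

print looks_like_utf8("caf\xc3\xa9") ? "utf8\n" : "not utf8\n";
print looks_like_utf8("caf\xe9")     ? "utf8\n" : "not utf8\n";
```

Note that the two bytes "\xc3\xa9" pass the check even if the writer meant
the two Latin-1 characters they also spell, which is why this stays a guess.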

>>> Or how do I completely switch off utf8 encoding/decoding - everywhere - in
>>> an 80,000 line perl app.

I am honestly surprised it got turned on anywhere; I fear that I don't know a
mechanism for doing this universally short of modifying all the code, sorry.

>> As no-one seems interested in this, or maybe no-one else has had these
>> problems themselves, can anyone suggest a better mailing list to poll?
>
> I was going to suggest Encode::is_utf8 and/or utf8::is_utf8, but I wasn't
> sure whether it would actually solve your problem so I thought I'd rather
> stay quiet and hope someone with real-world experience in utf8 woes would
> pipe up.

From my real-world experience, by the time you have a database that contains
text in a random mixture of encodings, and you don't have a way to
distinguish them, you have already lost.

Personally, I would consider encoding everything to UTF-8 based on a basic "is
it valid UTF-8 text?  If so, UTF-8, if not, Latin1" test, and then work from
there.
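
Roughly, and only as a sketch: the helper name to_utf8_bytes is invented
here, and it assumes the input is a raw byte string and that anything that is
not well-formed UTF-8 really is Latin1 (ISO-8859-1), which is the guess
described above, not a certainty.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the input re-encoded as UTF-8 bytes,
# treating it as UTF-8 if it is well-formed and as Latin-1 otherwise.
sub to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input under FB_CROAK
    my $text = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
    };
    # Fallback: every byte sequence is valid ISO-8859-1, so this cannot fail.
    $text = Encode::decode('ISO-8859-1', $bytes) unless defined $text;
    return Encode::encode('UTF-8', $text);
}

# A Latin-1 "e acute" (0xE9) comes out as the UTF-8 pair 0xC3 0xA9;
# input that is already UTF-8 passes through unchanged.
my $fixed = to_utf8_bytes("caf\xe9");
```

Run once over the stored data, this leaves a single known encoding to work
from afterwards.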

        Daniel

Plus, manually fix instances where people actually complain.
-- 
✣ Daniel Pittman            ✉ daniel at rimspace.net            ☎ +61 401 155 707