deduping mangled UTF8 strings
djk at tobit.co.uk
Mon Jun 18 12:50:15 BST 2007
Nicholas Clark wrote:
> If it's just the three forms you suggest, the heuristic seems to be that
> you need to
> 1: see if the octet stream is valid UTF-8. If so, convert it to ISO-8859-1
> 2: now strip the top bit from all characters
> so the smashed to ASCII version is your "canonical" form. But obviously there
> could be two or more accented strings that mangle to the same ASCII.
This seems like a reasonable solution, in my case. I probably need to
start using Encode specifically (where available) and normalise on input
now that I seem to have a definite split between utf8 users and
iso-something-or-other users. It had to happen sometime...
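
The two-step heuristic quoted above can be sketched roughly as follows. This is an illustration in Python rather than Perl's Encode, and the function name canonical_key is made up for the example. It assumes that anything which fails to decode as UTF-8, or which decodes to characters outside ISO-8859-1, is already single-byte data and should be left as it is before the top bit is stripped.

```python
def canonical_key(octets: bytes) -> bytes:
    """Smash a byte string to a 7-bit "canonical" form for deduping.

    Hypothetical helper: the name and the fallback behaviour are
    assumptions, not something specified in the thread.
    """
    try:
        # Step 1: if the octet stream is valid UTF-8, convert it to ISO-8859-1.
        latin1 = octets.decode("utf-8").encode("iso-8859-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Assumption: not valid UTF-8 (or not representable in ISO-8859-1),
        # so treat the input as already being single-byte data.
        latin1 = octets
    # Step 2: strip the top bit from every octet.
    return bytes(b & 0x7F for b in latin1)


if __name__ == "__main__":
    utf8_form = "caf\u00e9".encode("utf-8")         # b"caf\xc3\xa9"
    latin1_form = "caf\u00e9".encode("iso-8859-1")  # b"caf\xe9"
    print(canonical_key(utf8_form) == canonical_key(latin1_form))
```

Both encodings of "café" reduce to the same key, but so does the plain ASCII string "cafi" (0xE9 with the top bit cleared is 0x69, "i"), which is the collision risk mentioned in the quoted text. A short ISO-8859-1 string can also happen to be valid UTF-8, in which case step 1 would convert it wrongly.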
However, on requiring Encode (which I have to do because there are
installations out there that are on 5.005004 and 5.6.x), I get this
warning (on my Ubuntu install):
Can't locate Encode/ConfigLocal.pm in @INC...
Should I be bothered by this? (Obviously I can deal with it one way or