deduping mangled UTF8 strings

Dirk Koopman djk at tobit.co.uk
Mon Jun 18 12:50:15 BST 2007


Nicholas Clark wrote:
> If it's just the three forms you suggest, then the heuristic seems to be that
> you need to
> 
> 1: see if the octet stream is valid UTF-8. If so, convert it to ISO-8859-1
> 2: now strip the top bit from all characters
> 
> so the smashed to ASCII version is your "canonical" form. But obviously there
> could be two or more accented strings that mangle to the same ASCII.
> 

This seems like a reasonable solution in my case. I probably need to
start using Encode explicitly (where available) and normalise on input,
now that I seem to have a definite split between utf8 users and
iso-something-or-other users. It had to happen sometime...
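
For illustration only, the two-step heuristic quoted above can be sketched as follows. This is a minimal sketch in Python rather than the Perl/Encode code I would actually deploy; the function name canonical_form is made up for the example, and falling back to the raw octets when they are not valid UTF-8 (or do not fit in ISO-8859-1) is my assumption, not something spelled out in the quoted steps.

```python
def canonical_form(octets: bytes) -> bytes:
    """Smash a possibly mangled byte string to a 7-bit comparison key."""
    try:
        # Step 1: if the octet stream is valid UTF-8, convert it to ISO-8859-1.
        latin1 = octets.decode("utf-8").encode("iso-8859-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Assumption: anything that is not valid UTF-8, or that decodes to
        # characters outside ISO-8859-1, is used as-is.
        latin1 = octets
    # Step 2: strip the top bit from every octet.
    return bytes(b & 0x7F for b in latin1)


if __name__ == "__main__":
    # The same word as UTF-8 octets and as ISO-8859-1 octets.
    for raw in (b"caf\xc3\xa9", b"caf\xe9"):
        print(raw, "->", canonical_form(raw))
```

Both inputs reduce to the same key, which is what makes it usable for deduping. The collision caveat is real, though: the plain ASCII string "cafi" reduces to that key as well, so the key is only good for finding candidate duplicates, not for display.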

However, on requiring Encode (which I have to do because there are 
installations out there that are on 5.005004 and 5.6.x), I get this 
warning (on my ubuntu install):

Can't locate Encode/ConfigLocal.pm in @INC...

Should I be bothered by this? (Obviously I can deal with it one way or 
another).



