Character set sucktitude

Tue May 22 01:00:13 BST 2007

Randy J. Ray writes:
> David Cantrell writes:
> > I have some data that is unfortunately in "Western (Mac OS Roman)",
> > whatever the fuck that is.  I need to turn it into ISO-8859-1.
> 
> http://search.cpan.org/~dankogai/Encode-2.21/
> 
> I don't know for certain that it covers "Western (Mac OS Roman)", but I
> would be surprised if it didn't.

It does; the name is 'MacRoman'.

  $ perl -MEncode -le'print for grep { /mac.*rom/i } Encode->encodings(":all")'
  MacCentralEurRoman
  MacRoman
  MacRomanian

And one of Encode's allowable names for ISO-8859-1 is "latin-1".
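
For illustration, here is a minimal sketch of the conversion using those
two names.  The helper macroman_to_latin1 is a made-up name for this
example, not something Encode provides; decode and encode are Encode's
own functions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Encode): re-encode MacRoman bytes as
# ISO-8859-1 by going through a Perl character string.
sub macroman_to_latin1 {
    my ($bytes) = @_;
    my $text = decode('MacRoman', $bytes);   # MacRoman bytes -> characters
    return encode('latin-1', $text);         # characters -> Latin-1 bytes
}

# 0x8E is e-acute in MacRoman; Latin-1 puts the same character at 0xE9.
my $latin1 = macroman_to_latin1("caf\x8E");
print join(' ', map { sprintf('%02X', ord($_)) } split //, $latin1), "\n";
# prints "63 61 66 E9"
```

If you would rather convert a buffer in place, Encode::from_to($bytes,
'MacRoman', 'iso-8859-1') should do the same job in one call.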

Encode should work -- subject to the characters in your MacRoman data
actually being present in Latin-1, that is.  By my reckoning, there are 48
MacRoman characters that might cause you problems; I can produce a list of
them on request.  Encode's default in this situation is to use a question
mark as a substitution character.  If you want something more clever, see
the "Handling Malformed Data" section of the Encode pod.
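
As a rough sketch of both points, the following lists the troublesome
MacRoman bytes and then tries a couple of the CHECK values described in
that section.  It assumes that "unmappable" simply means "decodes to a
code point above U+00FF"; the sample string and variable names are
invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes whose MacRoman character lies outside U+0000..U+00FF, i.e. the
# ones Latin-1 cannot represent.
my @unmappable = grep { ord(decode('MacRoman', chr $_)) > 0xFF } 0x80 .. 0xFF;
printf "0x%02X -> U+%04X\n", $_, ord(decode('MacRoman', chr $_))
    for @unmappable;

# 0xA0 is DAGGER (U+2020) in MacRoman, so it is one of the troublemakers.
my $text = decode('MacRoman', "caf\x8E \xA0");

# LEAVE_SRC stops encode() from modifying $text in place, which it may
# otherwise do when a CHECK value is supplied.
my $keep = Encode::LEAVE_SRC();

# Default behaviour: each unmappable character becomes "?".
my $default = encode('iso-8859-1', $text);

# FB_HTMLCREF: emit a numeric character reference ("&#8224;") instead.
my $htmlcref = encode('iso-8859-1', $text, Encode::FB_HTMLCREF() | $keep);

# FB_CROAK: die at the first unmappable character rather than substitute.
my $strict = eval { encode('iso-8859-1', $text, Encode::FB_CROAK() | $keep) };
print defined $strict ? "clean\n" : "failed: $@";
```

Which fallback is appropriate depends on where the Latin-1 output is
going; FB_CROAK is the safest choice if silent data loss would be worse
than a failed run.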

-- 
Aaron Crane