Perl 5.16 vs Ruby 2.0 UTF-8 support

Joel Bernstein joel at fysh.org
Thu Aug 22 17:27:04 BST 2013


What problematic char? Why not just tell Ruby your strings are Latin-1? BTW,
Latin-1 is not ASCII. If your data really *were* ASCII (a 7-bit charset), as
you claimed, it would also be perfectly valid UTF-8.
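
A quick way to see the difference, as a minimal sketch (the sample strings
are made up for illustration): pure ASCII bytes are already valid UTF-8,
while the Latin-1 byte for "é" is not, and becomes two bytes once it is
properly transcoded.

```ruby
# A pure-ASCII string: every byte is below 128.
ascii = "plain text".encode("US-ASCII")
ascii_bytes_all_7bit = ascii.bytes.all? { |b| b < 128 }

# Relabelling the same bytes as UTF-8 is valid; no conversion is needed.
ascii_valid_as_utf8 = ascii.dup.force_encoding("UTF-8").valid_encoding?

# Latin-1 "é" is the single byte 0xE9, which is outside the ASCII range.
e_acute_latin1 = "\xE9".b.force_encoding("ISO-8859-1")
latin1_bytes = e_acute_latin1.bytes

# Relabelled as UTF-8 without conversion, that lone byte is malformed.
mislabelled_valid = e_acute_latin1.dup.force_encoding("UTF-8").valid_encoding?

# Properly transcoded, the same character takes two bytes in UTF-8.
utf8_bytes = e_acute_latin1.encode("UTF-8").bytes

puts "ASCII relabelled as UTF-8 valid?   #{ascii_valid_as_utf8}"
puts "Latin-1 relabelled as UTF-8 valid? #{mislabelled_valid}"
puts "Latin-1 bytes: #{latin1_bytes.inspect}, UTF-8 bytes: #{utf8_bytes.inspect}"
```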

To be clear, Ruby is correct. If you tell it your data is in the encoding it
actually is, rather than the one Ruby assumes, your problem will go away. Any
Latin-1 character >127 is a single byte that is encoded differently (as a
two-byte sequence) in UTF-8, and that is your problem, which Ruby is
correctly complaining about. You should use File.open's mode argument
(inherited from IO.open), e.g. "r:ISO-8859-1", to set the filehandle to
Latin-1.
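
Something along these lines, as a rough sketch rather than a drop-in fix.
The temp file and its contents are invented stand-ins for your data; the
mode string "r:ISO-8859-1:UTF-8" means "the file is Latin-1, hand me UTF-8
strings". The last part scans the raw bytes to find where the non-ASCII
characters are, in case you still want to locate them.

```ruby
require "tempfile"

# Hypothetical sample data: "café" in Latin-1, where é is the byte 0xE9.
sample = Tempfile.new("latin1-demo")
sample.binmode
sample.write("caf\xE9\n".b)
sample.close

# Read as UTF-8 (what Ruby 2.0 typically assumes): the string is tagged
# UTF-8, but the lone 0xE9 byte is not a valid UTF-8 sequence.
as_utf8 = File.read(sample.path, encoding: "UTF-8")
utf8_valid = as_utf8.valid_encoding?

# Declare the real external encoding and transcode to UTF-8 on the way in.
as_latin1 = File.open(sample.path, "r:ISO-8859-1:UTF-8") { |f| f.read }
latin1_valid = as_latin1.valid_encoding?

# Locate every byte above 127 as [line number, byte offset in line, byte].
offending = []
File.open(sample.path, "rb") do |f|
  f.each_line.with_index(1) do |line, lineno|
    line.each_byte.with_index do |byte, offset|
      offending << [lineno, offset, byte] if byte > 127
    end
  end
end

puts "valid when read as UTF-8?   #{utf8_valid}"
puts "valid when read as Latin-1? #{latin1_valid}"
offending.each do |lineno, offset, byte|
  puts format("line %d, byte offset %d: 0x%02X", lineno, offset, byte)
end
```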

/joel


On 22 August 2013 18:13, gvim <gvimrc at gmail.com> wrote:

> On 22/08/2013 16:59, Dave Cross wrote:
>
>> Without seeing your data (or knowing anything much about Ruby's
>> string-handling) I'd guess that your file is in one of the extended
>> ASCII character sets (probably ISO-8859-1 or cp1252). You haven't told
>> Perl to decode the data in any way, so it's just treating it as a stream
>> of bytes. Perhaps Ruby defaults to assuming the input is utf8 and tries
>> to decode it as such. And then barfs when one of the characters is in
>> the range 128-255 - which is invalid for utf8.
>>
>> All a guess though.
>>
>> Dave...
>>
>>
> Great. That makes sense. The character set is ISO-8859-1 but I can't
> locate the problematic char.
>
> gvim
>
>

