Perl 5.16 vs Ruby 2.0 UTF-8 support

Dave Cross dave at dave.org.uk
Thu Aug 22 16:59:07 BST 2013


Quoting gvim <gvimrc at gmail.com>:

> Can anyone who also uses Ruby enlighten me? For benchmarking  
> purposes this Perl 5.16 script works fine parsing a large Maildir  
> folder:
>
> ------------------------------------------------------------
> use 5.016;
> use autodie;
>
> my $dir = 'my/mail/path';
> chdir $dir;
> opendir my $dh, $dir;
>
> while (readdir $dh) {
>   next unless /^\d{4}/;
>   open my $fh, '<', $_;
>   say "\n\n************* Opening $_ *************";
>   while (<$fh>) {
>     chomp;
>     say if /^\w{4}\s/;
>   }
>   close $fh;
> }
> closedir $dh;
>
> -------------------------------------------------------------
>
> However, the equivalent Ruby 2.0 script produces a UTF-8 error  
> after parsing 7 files:
>
> ---------------------------------------------------------
> dir = 'my/maildir/path'
> Dir.chdir(dir)
>
> Dir.foreach(dir) do |file|
>   next unless file =~ /^\d{4}/
>   print "\n\n************* Opening #{file} *************\n"
>   fh = File.open(file)
>   while fh.gets do
>     print if $_ =~ /^\w{4}\b/
>   end
>   fh.close
> end
>
> -------------------------------------------------------------
>
> The problematic mail file doesn't display any non-ASCII characters  
> when opened in Vim. Here's the Ruby 2.0 error message:
>
>
> ************* Opening  
> 1270516984.M407293P18051.mac,S=1601,W=1645:2,Sb *************
> Paul
> ./1.rb:13:in `block in <main>': invalid byte sequence in UTF-8  
> (ArgumentError)
>     from ./1.rb:8:in `foreach'
>     from ./1.rb:8:in `<main>'

Without seeing your data (or knowing anything much about Ruby's  
string-handling) I'd guess that your file is in one of the extended  
ASCII character sets (probably ISO-8859-1 or cp1252). You haven't told  
Perl to decode the data in any way, so it's just treating it as a  
stream of bytes. Perhaps Ruby defaults to assuming the input is UTF-8  
and then barfs in the regex match when it hits a byte in the range  
128-255 that isn't part of a valid UTF-8 multi-byte sequence.

All a guess though.
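
If that guess is right, something like the following minimal Ruby sketch
might reproduce the failure and show two possible ways around it. The
sample line is made up (an assumption, not taken from the real mail
file): it contains a single ISO-8859-1 byte, 0xE9, which is not valid
UTF-8 on its own.

```ruby
# Hypothetical sample line: "Paul café" stored as ISO-8859-1,
# so the final character is the lone byte 0xE9.
raw = "Paul caf\xE9\n".b   # String#b returns a binary (ASCII-8BIT) copy

# Roughly what Ruby sees if it assumes the file is UTF-8.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8_ok = as_utf8.valid_encoding?   # false: 0xE9 alone is not valid UTF-8

# Matching a regexp against the mislabelled string raises ArgumentError.
match_error = nil
begin
  as_utf8 =~ /^\w{4}\b/
rescue ArgumentError => e
  match_error = e.message
end

# Option 1: treat the data as plain bytes, which is what the Perl script does.
binary_match = (raw =~ /^\w{4}\b/)

# Option 2: declare the real encoding (assumed here to be ISO-8859-1)
# and transcode to UTF-8 before matching.
decoded = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
decoded_match = (decoded =~ /^\w{4}\b/)

puts "valid as UTF-8?  #{utf8_ok}"
puts "error on match:  #{match_error.inspect}"
puts "binary match at: #{binary_match.inspect}"
puts "decoded match at: #{decoded_match.inspect}"
```

In the original script the equivalent change would probably be to open
each file in binary mode (File.open(file, 'rb')) or with an explicit
external encoding such as 'r:ISO-8859-1:UTF-8'. Which one is appropriate
depends on what the mail files really contain, and a Maildir can easily
mix several encodings.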

Dave...



