Perl 5.16 vs Ruby 2.0 UTF-8 support
Dave Cross
dave at
Thu Aug 22 16:59:07 BST 2013
Quoting gvim <gvimrc at>:
> Can anyone who also uses Ruby enlighten me? For benchmarking
> purposes this Perl 5.16 script works fine parsing a large Maildir
> folder:
> ------------------------------------------------------------
> use 5.016;
> use autodie;
> my $dir = 'my/mail/path';
> chdir $dir;
> opendir my $dh, $dir;
> while (readdir $dh) {
>     next unless /^\d{4}/;
>     open my $fh, '<', $_;
>     say "\n\n************* Opening $_ *************";
>     while (<$fh>) {
>         chomp;
>         say if /^\w{4}\s/;
>     }
>     close $fh;
> }
> closedir $dh;
> -------------------------------------------------------------
> However, the equivalent Ruby 2.0 script produces a UTF-8 error
> after parsing 7 files:
> ---------------------------------------------------------
> dir = 'my/maildir/path'
> Dir.chdir(dir)
> Dir.foreach(dir) do |file|
>   next unless file =~ /^\d{4}/
>   print "\n\n************* Opening #{file} *************\n"
>   fh =
>   while fh.gets do
>     print if $_ =~ /^\w{4}\b/
>   end
>   fh.close
> end
> -------------------------------------------------------------
> The problematic mail file doesn't display any non-ASCII characters
> when opened in Vim. Here's the Ruby 2.0 error message:
> ************* Opening
> 1270516984.M407293P18051.mac,S=1601,W=1645:2,Sb *************
> Paul
> ./1.rb:13:in `block in <main>': invalid byte sequence in UTF-8
> (ArgumentError)
> from ./1.rb:8:in `foreach'
> from ./1.rb:8:in `<main>'
Without seeing your data (or knowing anything much about Ruby's
string-handling) I'd guess that your file is in one of the extended
ASCII character sets (probably ISO-8859-1 or cp1252). You haven't told
Perl to decode the data in any way, so it's just treating it as a
stream of bytes. Perhaps Ruby defaults to assuming the input is UTF-8
and tries to decode it as such, and then barfs when it meets a byte in
the range 128-255 that isn't part of a valid UTF-8 multi-byte sequence
(a lone ISO-8859-1 accented character, for example).
All a guess though.
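
If that guess is right, something like the following Ruby sketch might show the difference. It is only an illustration under assumptions: the sample line, the temporary file and the helper names (broken_as_utf8?, read_lines, header_like?) are all made up for the example, and ISO-8859-1 is simply my guess at the real encoding. It writes one Latin-1 line to a temporary file, then reads it back once as raw bytes (roughly what the Perl script does) and once transcoded from ISO-8859-1 to UTF-8.

```ruby
require 'tempfile'

# Hypothetical sample line: "From café" stored as ISO-8859-1,
# so the "é" is the single byte 0xE9.
SAMPLE = "From caf\xE9\n".b  # .b returns a copy tagged as binary (ASCII-8BIT)

# True when the bytes do not form valid UTF-8.
def broken_as_utf8?(bytes)
  !bytes.dup.force_encoding('UTF-8').valid_encoding?
end

# Read every line of a file using the given open mode.
def read_lines(path, mode)
  File.open(path, mode) { |fh| fh.readlines }
end

# Same pattern as the Ruby script in the question.
def header_like?(line)
  !(line =~ /^\w{4}\b/).nil?
end

Tempfile.open('mail') do |tmp|
  tmp.binmode
  tmp.write(SAMPLE)
  tmp.flush

  # Option 1: read raw bytes, with no decoding at all.
  bytes = read_lines(tmp.path, 'rb')
  # Option 2: assume ISO-8859-1 on disk and transcode to UTF-8 while reading.
  text = read_lines(tmp.path, 'r:ISO-8859-1:UTF-8')

  puts "invalid if treated as UTF-8: #{broken_as_utf8?(bytes.first)}"
  puts "binary line matches:         #{header_like?(bytes.first)}"
  puts "transcoded line is valid:    #{text.first.valid_encoding?}"
  puts "transcoded line matches:     #{header_like?(text.first)}"
end
```

Opening the mail files with 'rb' is probably the closest match to the Perl benchmark, since neither script then decodes anything. The 'r:ISO-8859-1:UTF-8' mode only makes sense if the files really are Latin-1, which I can't tell without seeing the data.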