LWP output encoding
Matt Lawrence
matt.lawrence at virgin.net
Wed Nov 23 15:43:26 GMT 2005
Struan Donald wrote:
> * at 23/11 14:49 +0000 Andy Armstrong said:
>
>>Googled. Can't figure. Can anyone update me on what the current
>>semantics of HTTP::Response->decoded_content are?
>>
>>Specifics: I'm parsing a bunch of RSS feeds. I have two, both of
>>which claim to be encoded UTF-8. I'm generating a hash for the
>>contents of the feeds like this
>>
>>my $content = $res->decoded_content;
>>my $hash = md5_base64($content);
>>
>>md5_base64() barfs on one of the feeds with
>>
>>"Wide character in subroutine entry"
>
>
> I can't really answer the question of what it returns but I found
> that I cured the same issues with Digest::MD5 by encoding the content
> passed to it before to make sure that it's plain old octets and all
> was well.
>
> i.e one does:
>
> my $content = encode( 'utf8', $res->decoded_content );
> my $hash = md5_base64($content);
>
> I am sure someone who understands more about this will be along to
> explain why this is not a good idea...
>
You might want to keep a copy of the decoded (character) string too, so
you can still use character semantics on it, and encode only what you hash:

    use Digest::MD5 qw( md5_base64 );
    use Encode qw( encode LEAVE_SRC :fallbacks );
    my $content = $res->decoded_content;
    # LEAVE_SRC: with a CHECK flag set, encode() may otherwise modify $content in place
    my $hash = md5_base64(encode('UTF-8', $content, FB_CROAK | LEAVE_SRC));

I can't see a downside unless you want to treat messages that are
identical in every way except for being in different encodings
differently.
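
To illustrate the point about encodings, here is a minimal sketch using made-up sample data (the two byte strings are hypothetical, not taken from the feeds in question). It assumes the same text arrives once as UTF-8 and once as ISO-8859-1; once both are decoded to characters and re-encoded as UTF-8 for hashing, the digests should match.

```perl
use strict;
use warnings;
use Encode qw( encode decode );
use Digest::MD5 qw( md5_base64 );

# Hypothetical sample data: "cafe" with an acute accent, as raw octets
# in two different encodings.
my $from_utf8   = decode('UTF-8',      "caf\xC3\xA9");
my $from_latin1 = decode('ISO-8859-1', "caf\xE9");

# Both are now the same character string, so the UTF-8 octets that
# get hashed are the same too.
my $hash_a = md5_base64(encode('UTF-8', $from_utf8));
my $hash_b = md5_base64(encode('UTF-8', $from_latin1));

print $hash_a eq $hash_b ? "same digest\n" : "different digest\n";
```

If you do want the two to hash differently, hash the raw octets as received instead of the re-encoded string.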
That still doesn't explain why the Arabic one succeeded. I get that
error on any string containing characters above 255. The source of
HTTP::Message shows Encode::decode being used to generate the return
value, so the Arabic feed should really trigger it too. *shrug*
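
A quick way to check the "characters above 255" observation is a sketch like the following. The two strings are made-up test data, not the contents of either feed, and it says nothing about what decoded_content actually returned for the Arabic feed; it only shows when Digest::MD5 complains.

```perl
use strict;
use warnings;
use Encode qw( encode );
use Digest::MD5 qw( md5_base64 );

my $latin  = "caf\x{E9}";                         # highest code point is 233
my $arabic = "\x{0633}\x{0644}\x{0627}\x{0645}";  # all code points above 255

# Every character fits in one byte, so this is accepted as octets.
my $latin_ok = eval { md5_base64($latin); 1 } ? 1 : 0;

# Characters above 255 cannot be treated as octets, so this croaks.
my $arabic_ok = eval { md5_base64($arabic); 1 } ? 1 : 0;
my $error     = $@;

# Encoding to UTF-8 octets first avoids the error.
my $encoded_ok = eval { md5_base64(encode('UTF-8', $arabic)); 1 } ? 1 : 0;

print "latin: $latin_ok, arabic: $arabic_ok, encoded arabic: $encoded_ok\n";
print "error was: $error" unless $arabic_ok;
```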
Matt