LWP output encoding

Wed Nov 23 15:30:51 GMT 2005

Struan Donald wrote:
> * at 23/11 14:49 +0000 Andy Armstrong said:
> 
>>Googled. Can't figure. Can anyone update me on what the current  
>>semantics of HTTP::Response->decoded_content are?
>>
>>Specifics: I'm parsing a bunch of RSS feeds. I have two, both of  
>>which claim to be encoded UTF-8. I'm generating a hash for the  
>>contents of the feeds like this
>>
>>my $content = $res->decoded_content;
>>my $hash    = md5_base64($content);
>>
>>md5_base64() barfs on one of the feeds with
>>
>>"Wide character in subroutine entry"
> 
> 
> I can't really answer the question of what it returns but I found
> that I cured the same issues with Digest::MD5 by encoding the content
> passed to it before to make sure that it's plain old octets and all
> was well. 
> 
> i.e one does:
> 
> my $content = encode( 'utf8', $res->decoded_content );
> my $hash    = md5_base64($content);
> 
> I am sure someone who understands more about this will be along to
> explain why this is not a good idea...
> 

You might want to keep a copy of the decoded Unicode string too, so you 
can still use all your nice character semantics on it:

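use Digest::MD5 qw( md5_base64 );
use Encode qw( encode FB_CROAK LEAVE_SRC );
my $content = $res->decoded_content;
# LEAVE_SRC stops encode() from modifying $content in place when a CHECK is given
my $hash = md5_base64(encode('UTF-8', $content, FB_CROAK | LEAVE_SRC));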

I can't see a downside unless you want messages that are identical in 
every way except for their original encodings to be treated 
differently.
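
A rough sketch of that point, using made-up sample data rather than the 
real feeds: the same text arrives as octets in two encodings, is decoded 
to characters (standing in for what decoded_content hands back), and is 
then re-encoded to UTF-8 before hashing. The variable names are mine, 
purely for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_base64 );
use Encode qw( decode encode FB_CROAK LEAVE_SRC );

# Hypothetical sample data: "cafe" with an acute accent, in two encodings.
my $latin1_octets = "caf\xE9";        # ISO-8859-1 octets
my $utf8_octets   = "caf\xC3\xA9";    # UTF-8 octets

# Decode both to character strings (a stand-in for decoded_content).
my $from_latin1 = decode('ISO-8859-1', $latin1_octets);
my $from_utf8   = decode('UTF-8',      $utf8_octets);

# Hashing the raw octets distinguishes the two encodings.
my $raw_differs = md5_base64($latin1_octets) ne md5_base64($utf8_octets);

# Hashing the re-encoded characters does not.
my $hash_a = md5_base64(encode('UTF-8', $from_latin1, FB_CROAK | LEAVE_SRC));
my $hash_b = md5_base64(encode('UTF-8', $from_utf8,   FB_CROAK | LEAVE_SRC));

print "raw octet hashes differ:  ", ($raw_differs       ? "yes" : "no"), "\n";
print "normalised hashes match:  ", ($hash_a eq $hash_b ? "yes" : "no"), "\n";
```

So if you do care about the original encoding, hash the undecoded 
octets instead of the re-encoded characters.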

That still doesn't explain why the Arabic one succeeded. I get that 
error on any string containing characters over 255. The source of 
HTTP::Message shows Encode::decode being used to generate the return 
value, so the Arabic feed should really cause it too. *shrug*
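
For what it's worth, here is a small sketch of which strings trip the 
error. The helper name hash_croaks and the sample strings are mine, not 
anything from LWP. One guess, unverified against the actual feeds, is 
that the Arabic response never got charset-decoded and so was still raw 
octets, which would hash without complaint.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_base64 );

# Hypothetical helper: true if md5_base64 dies on the given string.
sub hash_croaks {
    my ($string) = @_;
    return !eval { md5_base64($string); 1 };
}

my $raw_octets  = "\xD8\xB3";     # UTF-8 octets for U+0633, never decoded
my $latin_chars = "caf\x{E9}";    # characters, but none above 255
my $wide_chars  = "\x{0633}";     # a decoded character above 255

for my $case (
    [ 'raw octets',  $raw_octets  ],
    [ 'chars <= 255', $latin_chars ],
    [ 'chars > 255',  $wide_chars  ],
) {
    my ($label, $string) = @$case;
    printf "%-13s %s\n", $label,
        hash_croaks($string) ? 'croaks' : 'hashes fine';
}
```

Only the last case should die with "Wide character in subroutine 
entry"; checking $content =~ /[^\x00-\xFF]/ on each feed would show 
which situation you are actually in.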

Matt