Web weirdness

Wed Jun 20 13:34:32 BST 2007

On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> David Cantrell writes:
> > Lots of URLs on my web site contain the ASCII string "&image" ...
> HTML has a named entity "&image;" which corresponds to the character U+2111
> BLACK-LETTER CAPITAL I.  And, lo, that's the funny-looking character that
> appears in the broken places.

&image=... != &image;

Not even in EBCDIC.

> I'm not sure exactly where the brokenness is here, but I'm pretty sure that
> something somewhere is failing to entity-encode this:
>   .../photodetails.tt2?set=york-xmas-2005&image=commondale
> to this:
>   .../photodetails.tt2?set=york-xmas-2005&amp;image=commondale
> when it's used in HTML source.

My understanding is that you don't need to encode ampersands in URLs
unless they would otherwise look like the beginning of an entity - so
the string '&quot;' would have to be represented as '%XXquot;' or
somesuch.
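
For comparison, here is a minimal sketch of the two encodings being
discussed, using only the Python standard library. It is an
illustration, not what the site's templates actually do: the variable
names are made up, and the leading "..." in the URL is kept elided as in
the quoted text.

```python
import html
from urllib.parse import quote

# URL as quoted above; the leading "..." is left elided on purpose.
url = ".../photodetails.tt2?set=york-xmas-2005&image=commondale"

# Entity-encode the URL for use inside an HTML attribute value:
# every bare "&" becomes "&amp;".
attr = html.escape(url, quote=True)
print(attr)

# Percent-encode a literal ampersand that is part of the data, so the
# text "&quot;" cannot be mistaken for an entity.
amp = quote("&", safe="")
print(amp + "quot;")
```

Decoding the attribute value with html.unescape() gives back the
original URL, so the &amp; form loses nothing.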

> Then, presumably, something else is converting named entities to the
> characters they represent.

&image isn't a named entity though.  Anything that thinks it is is
broken.  Browsers, for example, treat &image=blah correctly; it's only
stuff pretending to be a browser (like WWW::Robot or one of its friends
and relations) and other stuff that tries to parse HTML (like the
Livejournal comment thingy) that gets it wrong.
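
A rough sketch of the difference, in Python. The naive_decode() function
is hypothetical: it is my guess at the kind of over-eager matching that
would produce this symptom, not the actual code of any tool named above.
html.unescape() stands in for a decoder that only accepts "&image" when
the semicolon is really there.

```python
import html
import re
from html.entities import html5

def naive_decode(text):
    """Hypothetical broken decoder: treats the ';' as optional for
    every entity name, so '&image=' is read as '&image;'."""
    def repl(match):
        name = match.group(1)
        # Look the name up as if the semicolon had been present.
        return html5.get(name + ";", match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);?", repl, text)

query = "?set=york-xmas-2005&image=commondale"

# With the semicolon, "&image;" is U+2111 BLACK-LETTER CAPITAL I.
with_semicolon = html.unescape("&image;")

# Without it, the standard library leaves the query string alone...
careful = html.unescape(query)

# ...while the naive decoder swallows "&image" and mangles the URL.
mangled = naive_decode(query)

print(careful)
print(mangled)
```

The first line printed is the query string unchanged; the second has
U+2111 where "&image" used to be, which matches the funny-looking
character described above.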

-- 
David Cantrell | London Perl Mongers Deputy Chief Heretic

I think the most difficult moment that anyone could face is seeing
their domestic servants, whether maid or drivers, run away
  -- Abdul Rahman Al-Sheikh, writing at
     http://www.arabnews.com/?article=38558