Web weirdness

Wed Jun 20 14:15:45 BST 2007

David Cantrell writes:
> On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> > HTML has a named entity "&image;" which corresponds to the character U+2111
> > BLACK-LETTER CAPITAL I.  And, lo, that's the funny-looking character that
> > appears in the broken places.
> 
> &image=... != &image;
> 
> Not even in EBCDIC.

True, but HTML lets you omit the semicolon when the next character can't be
part of the entity name, and "=" can't, so you still get bitten.

> My understanding is that you didn't need to encode ampersands in URLs
> unless they would otherwise look like the beginning of an entity - so
> the string '&quot;' would have to be represented as '%XXquot;' or
> somesuch.

I've just checked the HTML4 spec, and I can't find any such relaxation of
the normal rules.  But since the semicolon is optional, "&image=" _does_
look like the beginning of an entity.

> > Then, presumably, something else is converting named entities to the
> > characters they represent.
> 
> &image isn't a named entity though.  Anything that thinks it is is
> broken.

I'm inclined to disagree on the question of that brokenness.  But, even if
I'm wrong, it's undeniably the case that failing to encode & as &amp; in
HTML source will impede interoperability.  As you've observed.
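
For illustration, here is a minimal sketch of the failure and the fix, in
Python rather than in a browser.  The URL is hypothetical.  It uses "&copy"
instead of "&image" because Python's html.unescape only honours the
semicolon-less form for a fixed list of older entity names, and "copy" is one
of them; treat it as a stand-in for whatever lenient decoder is mangling the
real page, not as a description of that decoder.

```python
from html import escape, unescape

# Hypothetical URL whose second parameter name happens to match an entity name.
url = "http://example.com/view?id=1&copy=2"

# Raw ampersand in the HTML source: a lenient decoder reads "&copy" as the
# copyright-sign entity even though no semicolon follows it.
broken = unescape(url)

# Ampersand written as &amp; in the HTML source: decoding restores the URL.
href = escape(url, quote=True)
fixed = unescape(href)

print(broken)
print(href)
print(fixed)
```

The second form is what belongs inside an href attribute; the decoder turns
"&amp;" back into "&" and leaves the rest of the query string alone.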

-- 
Aaron Crane