Web weirdness

Wed Jul 4 20:52:28 BST 2007

On Wed, Jun 20, 2007 at 01:34:32PM +0100, David Cantrell wrote:
> On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> > David Cantrell writes:
> > > Lots of URLs on my web site contain the ASCII string "&image" ...
> > HTML has a named entity "&image;" which corresponds to the character U+2111
> > BLACK-LETTER CAPITAL I.  And, lo, that's the funny-looking character that
> > appears in the broken places.
> 
> &image=... != &image;
> 
> Not even in EBCDIC.

Right.

And 

  do {print "foo"}  !=  do {print "foo";}

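yet, semantically, they are identical. In Perl, the semicolon terminating
the statement is optional if Perl can deduce from what follows where the
statement ends. Same in HTML: the semicolon ending an entity reference can
be left off when what follows cannot be part of the entity name, and "="
cannot.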
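
To make that concrete, here is a minimal sketch in Python of a decoder that
treats the trailing semicolon as optional. It is an illustration only, not
the code of any particular robot or comment parser; the function name
decode_entities_leniently and the simplified name pattern (letters and
digits only) are my own assumptions. The entity table is the HTML 4 one
that ships with Python as html.entities.name2codepoint.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

# Assumed, simplified pattern: "&", an entity name, and an OPTIONAL ";".
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def decode_entities_leniently(text: str) -> str:
    """Expand named entity references whether or not a ';' follows the name."""

    def expand(match: re.Match) -> str:
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown name: leave the text untouched

    return _ENTITY_REF.sub(expand, text)


# The leading part of the path is left off; only the query string matters.
url = "photodetails.tt2?set=york-xmas-2005&image=commondale"
decoded = decode_entities_leniently(url)
print(decoded)
```

With the raw URL, "&image" is read as an entity reference and replaced by
U+2111, leaving "=commondale" behind. Feed the same function the properly
encoded "&amp;image=commondale" and the original "&image=commondale" comes
back out.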

> > I'm not sure exactly where the brokenness is here, but I'm pretty sure that
> > something somewhere is failing to entity-encode this:
> >   .../photodetails.tt2?set=york-xmas-2005&image=commondale
> > to this:
> >   .../photodetails.tt2?set=york-xmas-2005&amp;image=commondale
> > when it's used in HTML source.
> 
> My understanding is that you didn't need to encode ampersands in URLs
> unless they would otherwise look like the beginning of an entity - so
> the string '&quot;' would have to be represented as '%XXquot;' or
> somesuch.

You are conflating two different things here. You have a URL embedded in
an HTML document. & doesn't need any *URL* encoding. However, HTML has no
specific knowledge about URLs. HTML knows PCDATA, CDATA, element names,
attribute names, attribute values, and such. And in attribute values, it
will expand &entities, just as it does in PCDATA.
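
The two layers can be kept apart mechanically. As a rough sketch, assuming
the page is generated by a program and using Python's standard html module
(the variable names and the link text are made up for the example), the URL
is left alone as a URL and only entity-encoded at the point where it is
written into the attribute value:

```python
import html

# The URL as a URL: "&" needs no URL encoding here.
# The leading part of the path is left off.
url = "photodetails.tt2?set=york-xmas-2005&image=commondale"

# Entity-encode it for use inside an HTML attribute value ("&" -> "&amp;").
attribute_value = html.escape(url, quote=True)
link = '<a href="' + attribute_value + '">Commondale</a>'

# An HTML parser undoes that encoding before the URL is ever requested.
recovered = html.unescape(attribute_value)

print(link)
print(recovered == url)
```

The server never sees "&amp;"; that encoding exists only in the HTML
source and is gone again by the time the attribute value is used as a URL.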

The choice of & to separate CGI parameters (which are themselves a
minilanguage inside URLs) was a pretty poor one, considering the role of
& in HTML documents. Luckily, for the past decade or so, CGI processors
have been able to parse query strings that use ; instead of & to separate
parameters.
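
For illustration, here is how a ;-separated query string might be parsed
with Python's urllib.parse.parse_qs. This is a sketch under one assumption:
a Python recent enough to have the separator argument (it was added in
3.9.2 and backported to some earlier security releases). Your own CGI
processor will have its own way of doing this.

```python
import html
from urllib.parse import parse_qs

# Same parameters as before, separated by ";" instead of "&".
query = "set=york-xmas-2005;image=commondale"

# Tell the parser which separator the query string uses.
params = parse_qs(query, separator=";")
print(params)

# With no "&" in it, the query string needs no entity encoding in HTML.
print(html.escape(query) == query)
```

Because the string contains no ampersand, it can be dropped into an
attribute value as it stands and nothing can mistake part of it for an
entity reference.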

> > Then, presumably, something else is converting named entities to the
> > characters they represent.
> 
> &image isn't a named entity though.  Anything that thinks it is is

Perhaps you should notify the W3C so they can update their standard.

   http://www.w3.org/TR/REC-html40/sgml/entities.html

> &image isn't a named entity though.  Anything that thinks it is is
> broken.  Browsers, for example, treat &image=blah correctly, it's only
> stuff pretending to be a browser (like WWW::Robot or one of its friends
> and relations) and other stuff that tries to parse HTML (like the
> Livejournal comment thingy) that gets it wrong.

"Browsers" typically fail to parse HTML 2.0 correctly - most browsers have
been coded by programmers who couldn't parse their way out of a wet paper
bag if their lives depended on it.

Abigail