Web weirdness
Abigail
abigail at abigail.be
Wed Jul 4 20:52:28 BST 2007
On Wed, Jun 20, 2007 at 01:34:32PM +0100, David Cantrell wrote:
> On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> > David Cantrell writes:
> > > Lots of URLs on my web site contain the ASCII string "&image" ...
> > HTML has a named entity "&image;" which corresponds to the character U+2111
> > BLACK-LETTER CAPITAL I. And, lo, that's the funny-looking character that
> > appears in the broken places.
>
> &image=... != &image;
>
> Not even in EBCDIC.
Right.
And
do {print "foo"} != do {print "foo";}
yet, semantically, they are identical. In Perl, the semi-colon terminating
the statement is optional if Perl can deduce from what follows where the
statement ends. Same in HTML.
> > I'm not sure exactly where the brokenness is here, but I'm pretty sure that
> > something somewhere is failing to entity-encode this:
> > .../photodetails.tt2?set=york-xmas-2005&image=commondale
> > to this:
> > .../photodetails.tt2?set=york-xmas-2005&amp;image=commondale
> > when it's used in HTML source.
>
> My understanding is that you didn't need to encode ampersands in URLs
> unless they would otherwise look like the beginning of an entity - so
> the string '&quot;' would have to be represented as '%XXquot;' or
> somesuch.
You are embedding two different things here. You have an URL embedded into
an HTML document. & doesn't need any *URL* encoding. However, HTML has no
specific knowledge about URLs. HTML knows PCDATA, CDATA, element names,
attribute names, attribute values, and such. And in attribute values, it
will consider &entities. Just as in PCDATA.
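To make the two layers concrete, here is a minimal sketch in Python (not something from the thread, just an illustration). It takes the URL quoted above, entity-encodes it for use in an attribute value, and checks that decoding gives the original URL back. The helper name href_attribute is made up for this example, and Python's html module follows HTML5 decoding rules, which are more lenient about unterminated entities than the SGML rules discussed here.

```python
import html

# URL as quoted in the thread; the leading "..." is elided in the original.
url = ".../photodetails.tt2?set=york-xmas-2005&image=commondale"


def href_attribute(raw_url: str) -> str:
    """Hypothetical helper: build an href attribute with the URL entity-encoded."""
    return 'href="' + html.escape(raw_url, quote=True) + '"'


# The URL needs no URL-level encoding, but inside HTML the "&" becomes "&amp;".
encoded_value = html.escape(url, quote=True)
print(href_attribute(url))

# An HTML parser that decodes the attribute value recovers the original URL.
decoded = html.unescape(encoded_value)
print(decoded == url)

# With its terminating semicolon, "&image;" names the character U+2111.
print(html.unescape("&image;") == "\u2111")
```

The point of the sketch is only that the encoding belongs to the HTML layer: the URL itself is unchanged once the attribute value has been decoded.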
The choice of & to separate CGI parameters (which in themselves are a
minilanguage inside URLs) was a pretty poor one considering the role of
& in HTML documents. Luckily, for the past decade or so, CGI processors have
been able to parse query strings that use ; instead of & to separate
parameters.
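As a rough illustration of the ; alternative, the following Python sketch builds and parses a query string that uses ; as the separator, so no & ever appears in the HTML. The helper build_query is hypothetical, and the separator argument to parse_qs is assumed to be available, which is only the case in reasonably recent Python releases.

```python
from urllib.parse import parse_qs, quote_plus


def build_query(params: dict, sep: str = ";") -> str:
    """Hypothetical helper: join URL-encoded key=value pairs with `sep`."""
    return sep.join(
        f"{quote_plus(key)}={quote_plus(value)}" for key, value in params.items()
    )


# Parameter names and values taken from the URL quoted in the thread.
query = build_query({"set": "york-xmas-2005", "image": "commondale"})
print(query)

# Tell the parser explicitly that ";" separates the parameters.
parsed = parse_qs(query, separator=";")
print(parsed)
```

A query string built this way contains nothing that an HTML parser could mistake for the start of an entity, so it can be pasted into an attribute value as it stands.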
> > Then, presumably, something else is converting named entities to the
> > characters they represent.
>
> &image isn't a named entity though. Anything that thinks it is is
Perhaps you should notify the W3C so they can update their standard.
http://www.w3.org/TR/REC-html40/sgml/entities.html
> &image isn't a named entity though. Anything that thinks it is is
> broken. Browsers, for example, treat &image=blah correctly, it's only
> stuff pretending to be a browser (like WWW::Robot or one of its friends
> and relations) and other stuff that tries to parse HTML (like the
> Livejournal comment thingy) that gets it wrong.
"Browsers" typically fail to parse HTML 2.0 correctly - most browsers have
been coded by programmers who can't parse their way out of a wet paper
bag if their lives depended on it.
Abigail