Web weirdness
Aaron Crane
perl at aaroncrane.co.uk
Wed Jun 20 14:15:45 BST 2007
David Cantrell writes:
> On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> > HTML has a named entity "&image;" which corresponds to the character U+2111
> > BLACK-LETTER CAPITAL I. And, lo, that's the funny-looking character that
> > appears in the broken places.
>
> &image=... != &image;
>
> Not even in EBCDIC.
True, but the semicolon is mostly optional in HTML, so you still get bitten.
> My understanding is that you didn't need to encode ampersands in URLs
> unless they would otherwise look like the beginning of an entity - so
> the string '&quot;' would have to be represented as '%XXquot;' or
> somesuch.
I've just checked the HTML4 spec, and I can't find any such relaxation of
the normal rules. But since the semicolon is optional, "&image=" _does_
look like the beginning of an entity.
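To illustrate how a decoder that treats the semicolon as optional would mangle such a URL, here is a minimal Python sketch. The `lenient_decode` function, its three-entry `ENTITIES` table, and the `show.cgi` URL are all hypothetical, made up for this example; this is not a description of what any particular browser or library does.

```python
import re

# Hypothetical three-entry table; real HTML defines many more entity names.
ENTITIES = {"amp": "&", "quot": '"', "image": "\u2111"}

def lenient_decode(text):
    """Replace &name or &name; with its character when the name is known."""
    def substitute(match):
        name = match.group(1)
        # Unknown names are left exactly as they were written.
        return ENTITIES.get(name, match.group(0))
    # The trailing ";?" is what makes the semicolon optional.
    return re.sub(r"&([A-Za-z]+);?", substitute, text)

# "&image" is consumed as an entity even though "=" follows it, not ";".
mangled = lenient_decode("show.cgi?id=1&image=2")
print(mangled)
```

Under those assumptions the "&image" in the query string is replaced by U+2111, which is the kind of breakage described above.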
> > Then, presumably, something else is converting named entities to the
> > characters they represent.
>
> &image isn't a named entity though. Anything that thinks it is is
> broken.
I'm inclined to disagree on the question of that brokenness. But, even if
I'm wrong, it's undeniably the case that failing to encode & as &amp; in
HTML source will impede interoperability. As you've observed.
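As a sketch of the fix, here is one way to encode the ampersand before a URL goes into HTML source, using Python's standard `html` module. The URL and the link text are made up for the example.

```python
from html import escape, unescape

# Hypothetical URL with an unencoded ampersand in the query string.
url = "show.cgi?id=1&image=2"

# escape() turns "&" into "&amp;" (and also encodes <, > and quotes).
href = escape(url)
link = '<a href="{}">picture</a>'.format(href)
print(link)

# Decoding the attribute value gives back the original URL.
assert unescape(href) == url
```

With the ampersand written as &amp;, there is no bare "&image" left in the source for anything to misread.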
--
Aaron Crane