perl at aaroncrane.co.uk
Wed Jun 20 14:15:45 BST 2007
David Cantrell writes:
> On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> > HTML has a named entity "&image;" which corresponds to the character U+2111
> > BLACK-LETTER CAPITAL I. And, lo, that's the funny-looking character that
> > appears in the broken places.
> &image=... != &image;
> Not even in EBCDIC.
True, but the semicolon is mostly optional in HTML, so you still get bitten.
> My understanding is that you didn't need to encode ampersands in URLs
> unless they would otherwise look like the beginning of an entity - so
> the string '&quot;' would have to be represented as '%XXquot;' or
I've just checked the HTML4 spec, and I can't find any such relaxation of
the normal rules. But since the semicolon is optional, "&image=" _does_
look like the beginning of an entity.
> > Then, presumably, something else is converting named entities to the
> > characters they represent.
> &image isn't a named entity though. Anything that thinks it is is
I'm inclined to disagree on the question of that brokenness. But, even if
I'm wrong, it's undeniably the case that failing to encode & as &amp; in
HTML source will impede interoperability. As you've observed.
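
To make the failure concrete, here is a rough sketch in Python. The URL and the lenient_decode helper are made up for illustration; the helper only stands in for whatever tool is expanding entities without insisting on the semicolon, and is not a claim about how any particular browser or library behaves. Only html.escape is a real standard-library call.

```python
import html
import re

# Hypothetical URL with a raw "&" separating query parameters.
url = "http://example.com/view?id=42&image=cat.png"

# Toy entity table (an assumption for illustration): just enough to show
# the problem. U+2111 is BLACK-LETTER CAPITAL I.
ENTITIES = {"image": "\u2111", "amp": "&"}

def lenient_decode(text):
    """Expand known named entities, treating the trailing ';' as optional."""
    def repl(match):
        # Unknown names are left exactly as they were written.
        return ENTITIES.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z]+);?", repl, text)

raw_href = url                # "&" written straight into the HTML source
safe_href = html.escape(url)  # "&" written as "&amp;"

broken = lenient_decode(raw_href)   # "&image" is taken for an entity
intact = lenient_decode(safe_href)  # "&amp;" decodes back to a plain "&"

print(broken)
print(intact)
```

With the raw ampersand the lenient decoder eats "&image" and the URL is mangled; with "&amp;" in the source the original URL comes back unchanged.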