Web weirdness
David Cantrell
david at cantrell.org.uk
Wed Jun 20 13:34:32 BST 2007
On Wed, Jun 20, 2007 at 12:36:08AM +0100, Aaron Crane wrote:
> David Cantrell writes:
> > Lots of URLs on my web site contain the ASCII string "&image" ...
> HTML has a named entity "&image;" which corresponds to the character U+2111
> BLACK-LETTER CAPITAL I. And, lo, that's the funny-looking character that
> appears in the broken places.
&image=... != &image;
Not even in EBCDIC.
> I'm not sure exactly where the brokenness is here, but I'm pretty sure that
> something somewhere is failing to entity-encode this:
> .../photodetails.tt2?set=york-xmas-2005&image=commondale
> to this:
> .../photodetails.tt2?set=york-xmas-2005&amp;image=commondale
> when it's used in HTML source.
My understanding is that you didn't need to encode ampersands in URLs
unless they would otherwise look like the beginning of an entity - so
the string '&quot;' would have to be represented as '%XXquot;' or
somesuch.
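
For comparison, here is a minimal Python sketch of the entity-encoding that
Aaron describes. Python and the variable names are used only for
illustration and are not the site's actual code; html.escape is the standard
library function that rewrites a bare ampersand as an entity.

```python
import html

# The URL as it would be typed into a browser (path elided as in the thread).
url = ".../photodetails.tt2?set=york-xmas-2005&image=commondale"

# The same URL made safe for use inside HTML source, e.g. in an href.
escaped = html.escape(url)
print(escaped)  # .../photodetails.tt2?set=york-xmas-2005&amp;image=commondale

# Decoding the entity gives back the original URL unchanged.
assert html.unescape(escaped) == url
```

Once the ampersand is written this way there is nothing left in the markup
that a parser could mistake for the start of another entity.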
> Then, presumably, something else is converting named entities to the
> characters they represent.
&image isn't a named entity though. Anything that thinks it is is
broken. Browsers, for example, treat &image=blah correctly; it's only
stuff pretending to be a browser (like WWW::Robot or one of its friends
and relations) and other stuff that tries to parse HTML (like the
Livejournal comment thingy) that gets it wrong.
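
A rough Python sketch of the difference, assuming the broken tools do
something like the hypothetical naive_decode below, which treats the
trailing semicolon as optional. This is a guess at the failure mode, not the
code of WWW::Robot or Livejournal. The standard library's html.unescape
stands in for the browser-like behaviour.

```python
import html
import re
from html.entities import name2codepoint

def naive_decode(text):
    """Hypothetical sloppy decoder: accepts entity names with no ';'."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown name: leave the text alone
    return re.sub(r"&(\w+);?", repl, text)

query = "?set=york-xmas-2005&image=commondale"

# Browser-like decoding leaves the unterminated "&image" untouched.
print(html.unescape(query))  # ?set=york-xmas-2005&image=commondale

# The sloppy decoder turns "&image" into U+2111.
print(naive_decode(query))   # ?set=york-xmas-2005ℑ=commondale

# With the semicolon present, it really is the named entity.
assert html.unescape("&image;") == "\u2111"
```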
--
David Cantrell | London Perl Mongers Deputy Chief Heretic
I think the most difficult moment that anyone could face is seeing
their domestic servants, whether maid or drivers, run away
-- Abdul Rahman Al-Sheikh, writing at
http://www.arabnews.com/?article=38558