This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 9352 - Make unescaped & conforming in attribute values in some cases
Status: CLOSED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Depends on: 9207
Blocks:
Reported: 2010-03-27 22:41 UTC by Maciej Stachowiak
Modified: 2010-10-04 14:46 UTC

Description Maciej Stachowiak 2010-03-27 22:41:36 UTC
HTML syntax and URL syntax have an unfortunate conflict. HTML interprets & as the start of an entity reference, while in URLs it has special meaning as a separator in the query portion of a URL.

HTML5 disallows the & character in attribute values unless it is actually the start of an entity reference. That means markup like this is nonconforming:

<a href="http://images.google.com/imghp?hl=en&tab=wi">

In this specific case, there is no chance that &tab= could be mistaken for an entity reference, and parsing will proceed exactly as the author expects.

The spec explains that the reason for this syntax error is markup fragility:

"For example, the parsing of certain named character references in attributes happens even with the closing semicolon being omitted. It is safe to include an ampersand followed by letters that do not form a named character reference, but if the letters are changed to a string that does form a named character reference, they will be interpreted as that character instead."

http://dev.w3.org/html5/spec/Overview.html#conformance-requirements-for-authors
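
To make the fragility concrete, here is a rough sketch in Python. It uses the standard library's html.unescape as a stand-in for a browser's entity handling; that function applies the rules for text content, and real attribute-value parsing has extra conditions, so treat this only as an approximation of the behaviour described above. The helper name decoded_href and the example.com URL are hypothetical.

```python
from html import unescape


def decoded_href(raw):
    """Approximate how an href value reads after entity decoding.

    Assumption: html.unescape (text-content rules) stands in for a
    browser's attribute-value parsing.
    """
    return unescape(raw)


# "tab" is not a named character reference, so the URL is left untouched.
safe = "http://images.google.com/imghp?hl=en&tab=wi"

# "copy" is recognized even without a closing semicolon, so the
# query separator is consumed and replaced by the copyright sign.
fragile = "http://example.com/search?hl=en&copy=1"  # hypothetical URL

if __name__ == "__main__":
    print(decoded_href(safe))
    print(decoded_href(fragile))
```

Renaming a parameter from "tab" to "copy" is the kind of later edit the spec text is warning about: the first URL survives, the second does not.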

However, for an author to be aware of this kind of error, they must be regularly using a conformance checker (or equivalently, a tool that ensures conformance at the output stage). Then the conformance checker can tell them if they have used a construct that actually will be interpreted as an entity reference, rather than merely one that might be, if edited.

As a result of getting the error, authors who want the full benefits of conformance checking must write in a more awkward style, and must bloat their markup by replacing instances of "&" with "&amp;".

7 of the Alexa top 15 sites have this error: http://www.w3.org/html/wg/wiki/index.php?title=HTML5_Authoring_Conformance_Study

In many cases it appears an inordinate number of times, close to 100, and is the single most frequent error on the site.

It seems that many authors, even on prominent sites, have not found the markup bloat and awkward syntax of consistently using &amp; to be a cost worth paying for the benefit of speculatively avoiding future errors.

Thus, I think HTML5 should reconsider and only make href="&foo=" an error in the case where foo is an entity name, since that is the only case where author expectations will actually be defeated.
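
A minimal sketch of the check proposed here, in Python. It assumes that "entity name" means a name that parsers accept without a trailing semicolon, taken from the standard library's html.entities.html5 table; the function name proposed_errors is hypothetical, and a real conformance checker would work on parsed attributes rather than raw strings.

```python
import re
from html.entities import html5

# Assumption: names listed without a trailing ";" in the html5 table are the
# ones that can be recognized even when the semicolon is omitted.
LEGACY_NAMES = {name for name in html5 if not name.endswith(";")}

# "&" followed by a full run of alphanumerics that is not closed by ";".
UNESCAPED_AMP = re.compile(r"&([A-Za-z0-9]+)(?![A-Za-z0-9;])")


def proposed_errors(attr_value):
    """Return the names after an unescaped "&" that are real entity names."""
    return [
        match.group(1)
        for match in UNESCAPED_AMP.finditer(attr_value)
        if match.group(1) in LEGACY_NAMES
    ]


if __name__ == "__main__":
    print(proposed_errors("http://images.google.com/imghp?hl=en&tab=wi"))
    print(proposed_errors("search?hl=en&copy=1"))
```

Under this rule the Google Images URL above produces no error, while a parameter named "copy" or "amp" still does.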
Comment 1 Maciej Stachowiak 2010-03-27 22:42:50 UTC
Relatedly, bug 9351 proposes changing how actual unterminated entities followed by an equals sign are parsed. If that change were made, the risk of fragile markup would apply only to legacy UAs, not to HTML5 UAs. I think the case for a blanket ban on & would then be even weaker.
Comment 2 Ian 'Hixie' Hickson 2010-04-02 06:50:56 UTC
> However, for an author to be aware of this kind of error, they must be
> regularly using a conformance checker (or equivalently, a tool that ensures
> conformance at the output stage). Then the conformance checker can tell them if
> they have used a construct that actually will be interpreted as an entity
> reference, rather than merely one that might be, if edited.

This is incorrect. It is possible for an author to be aware of this error due to regular usage of a validator but to still write markup that is not tested by a validator. For instance, I am an example of such an author: I run the validator every time I edit the HTML5 spec, but I almost never use the validator otherwise. Making this an error means that authors are more likely to always use the right style, even on projects where they don't use the validator.

Making the mistake with a real entity such as "amp" or "copy" can be really hard to track down, especially with long URLs with lots of query parameters. Unless we change how such strings are parsed, we are not doing authors any favours by hiding this problem. Good practices here are a huge aid to reducing problems in real work, and we should be encouraging those practices.

(I'm not closing this yet; I'll revisit this once I've addressed bug 9351.)
Comment 3 Maciej Stachowiak 2010-04-02 07:03:14 UTC
(In reply to comment #2)
> > However, for an author to be aware of this kind of error, they must be
> > regularly using a conformance checker (or equivalently, a tool that ensures
> > conformance at the output stage). Then the conformance checker can tell them if
> > they have used a construct that actually will be interpreted as an entity
> > reference, rather than merely one that might be, if edited.
> 
> This is incorrect. It is possible for an author to be aware of this error due
> to regular usage of a validator but to still write markup that is not tested by
> a validator. For instance, I am an example of such an author: I run the
> validator every time I edit the HTML5 spec, but I almost never use the
> validator otherwise. Making this an error means that authors are more likely to
> always use the right style, even on projects where they don't use the
> validator.

I think the average author is far less likely than you to learn in that way. For the few that are, they are probably aware enough to think about what might be a real entity. Giving an error that 99.99% of the time is not a real bug is a very inefficient way to teach authors about the possible cases that are potential bugs.

> Making the mistake with a real entity such as "amp" or "copy" can be really
> hard to track down, especially with long URLs with lots of query parameters.
> Unless we change how such strings are parsed, we are not doing authors any
> favours by hiding this problem. Good practices here are a huge aid to reducing
> problems in real work, and we should be encouraging those practices.

This supposedly good practice seems to be the most violated single HTML5 conformance requirement by orders of magnitude. It seems to me that: (a) authors do not see the value of escaping & in cases where it makes no difference, solely to train themselves; and (b) this makes validator output so noisy that in many cases it is hard to see real errors. Thus, requiring this to be an error even in cases where there is no harm makes the validator a *less* useful tool rather than *more*. If the motive is to help authors, then I think this particular form of help is misguided.

> (I'm not closing this yet; I'll revisit this once I've addressed bug 9351.)

Fair enough.
Comment 4 Ian 'Hixie' Hickson 2010-04-02 23:17:38 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Fair enough. I've made it not an error to have an unescaped & except if it is followed by alphanumeric characters and a semicolon, since that's an extension point.
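
The rule in this rationale could be checked roughly as follows. This is a sketch in Python, not the spec's wording: it assumes that a sequence which is already a valid named character reference (such as "&amp;") does not count as an unescaped &, and it uses the standard library's html.entities.html5 table for that lookup. The function name flagged_ampersands is hypothetical.

```python
import re
from html.entities import html5

# "&", one or more alphanumerics, then ";".
AMP_NAME_SEMI = re.compile(r"&([A-Za-z0-9]+);")


def flagged_ampersands(attr_value):
    """Return "&name;" sequences that are not known named references.

    Assumption: known references like "&amp;" are proper escapes, so only
    unknown names followed by ";" are reported.
    """
    return [
        match.group(0)
        for match in AMP_NAME_SEMI.finditer(attr_value)
        if match.group(1) + ";" not in html5
    ]


if __name__ == "__main__":
    print(flagged_ampersands("http://images.google.com/imghp?hl=en&tab=wi"))
    print(flagged_ampersands("a?x=1&foo;y=2"))
```

With this rule a plain query separator such as "&tab=" is no longer reported, and only the "&name;" shape, which a future entity name could claim, is.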
Comment 5 Ian 'Hixie' Hickson 2010-04-02 23:26:42 UTC
http://html5.org/tools/web-apps-tracker?from=4959&to=4960
Comment 6 Maciej Stachowiak 2010-04-03 01:28:38 UTC
Looks good to me.