This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 8606 - ambiguous ampersand does not include character references
Status: RESOLVED NEEDSINFO
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/spec/Overview...
Reported: 2010-01-03 19:25 UTC by Don Brutzman
Modified: 2010-10-04 14:28 UTC

Description Don Brutzman 2010-01-03 19:25:34 UTC
9.1.4 Character references

Draft document sayeth:

        "An ambiguous ampersand is a U+0026 AMPERSAND character (&)
        that is followed by some text other than a space character,
        a U+003C LESS-THAN SIGN character (<), or another
        U+0026 AMPERSAND character (&)."

probably should insert 2nd line as follows

        An ambiguous ampersand is a U+0026 AMPERSAND character (&)
        that is not a valid character reference, and
Comment 1 Ian 'Hixie' Hickson 2010-01-06 12:03:04 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: An ambiguous ampersand is text. A character reference is not text. Therefore an ambiguous ampersand can never be a character reference.
Comment 2 Don Brutzman 2010-01-26 07:56:10 UTC
wellll, i guess if you strictly parse the linked definition of "text" (#syntax-text), then the complete character reference only comprises a single "text" character.  nevertheless the individual characters that follow that initial ampersand within a character reference would otherwise be considered plain text if they weren't in that context.

due to overloaded terminology, this logic can get convoluted and doesn't seem immediately obvious to a reader trying to understand the definition.

on re-reading, the definition for ambiguous ampersand still seems to me to include the characters making up a character reference.

inserting the phrase "that is not a valid character reference" explicitly disambiguates such a possible misperception and reinforces the sense of the definition.  thus i again suggest inserting that phrase.
Comment 3 Ian 'Hixie' Hickson 2010-02-14 02:55:29 UTC

Status: Did Not Understand Request
Change Description: no spec change
Rationale: I am completely at a loss as to what comment 2 is trying to say.

The concept of ambiguous ampersands is used to restrict what values "text" can have. Its purpose is to make it non-conforming to have an ampersand followed by something that would, when parsed, be confused for a character reference. As such, the only characters that are allowed after & are space characters, "<" characters, and other "&" characters. All other characters, including all the characters that would form a character reference, are not allowed, and thus an & followed by any such character (e.g. "a" or "#") is an ambiguous ampersand.
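
The rule described above can be sketched as a small check over a string that is already being treated as "text". This is only an illustration of the draft wording quoted in this bug, not code from the spec; the function name `ambiguous_ampersands` and the constant `HTML_SPACE_CHARACTERS` are hypothetical, and an "&" at the very end of the input is assumed not to count because nothing follows it.

```python
from typing import List

# HTML5 "space characters": space, tab, line feed, form feed, carriage return.
HTML_SPACE_CHARACTERS = " \t\n\f\r"


def ambiguous_ampersands(text: str) -> List[int]:
    """Return the index of every '&' that the quoted draft wording calls ambiguous.

    An '&' is flagged when it is followed by a character other than a space
    character, '<', or another '&'. Assumption: a trailing '&' with nothing
    after it is not flagged.
    """
    allowed_followers = HTML_SPACE_CHARACTERS + "<&"
    hits = []
    for i, ch in enumerate(text):
        if ch != "&":
            continue
        following = text[i + 1 : i + 2]  # empty string at end of input
        if following and following not in allowed_followers:
            hits.append(i)
    return hits
```

Under this sketch, `ambiguous_ampersands("a & b")` returns `[]`, while `ambiguous_ampersands("AT&T")` returns `[2]` and `ambiguous_ampersands("&gt;")` returns `[0]`, which matches the point that the characters of a would-be character reference are themselves disallowed after "&" in text.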

If we were to _exclude_ characters that formed character references, then this would completely fail to achieve the stated goal. If "&" followed by "gt;" was _not_ an ambiguous ampersand, then there'd be no way to distinguish the text consisting of the four characters "&", "g", "t", ";" from a single character reference "&gt;", and yet both would be legal.
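
The confusion described above can be seen with any decoder that follows HTML character reference rules. The sketch below uses Python's standard `html` module purely as a stand-in for an HTML parser (an assumption made for illustration; it is not the spec's parsing algorithm) to show that the four characters "&", "g", "t", ";" cannot be written literally and still survive decoding.

```python
import html

# Written literally, the four characters are decoded as one character reference.
literal = "&gt;"
print(html.unescape(literal))   # prints ">"

# To carry the four-character text, the ampersand itself has to be escaped.
escaped = html.escape(literal)
print(escaped)                  # prints "&amp;gt;"
print(html.unescape(escaped))   # prints "&gt;"

# An ampersand followed by a space is left alone, so it is unambiguous.
print(html.unescape("& gt;"))   # prints "& gt;"
```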

This is why ambiguous ampersands are defined as they are.