This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 17418 - & did not start a character reference and Errors involving fragile syntax constructs
Status: NEW
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael[tm] Smith
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-06-05 15:31 UTC by rasamassen
Modified: 2015-08-23 06:58 UTC

See Also:


Attachments
Test case (289 bytes, text/html)
2012-10-26 12:33 UTC, rasamassen

Description rasamassen 2012-06-05 15:31:46 UTC
As explained in the non-normative section of HTML5 (obviously based on normative sections), http://www.w3.org/TR/html5/introduction.html#syntax-errors, under "Errors involving fragile syntax constructs":

The correct way to express the above cases is as follows:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->


Thus, the error "& did not start a character reference" should only appear when the "&" precedes a named character reference.
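
As a rough illustration of the rule proposed above, the following Python sketch checks whether the text after an "&" begins with a known named character reference. It relies on the html5 table in Python's standard html.entities module rather than on the validator's own Java sources, and the helper name reference_after_ampersand is hypothetical.

```python
from html.entities import html5  # keys look like "copy", "copy;", "amp;", "dollar;"

# Longest name in the table, so we never scan further than necessary.
_MAX_NAME_LENGTH = max(len(name) for name in html5)


def reference_after_ampersand(text, amp_index):
    """Return the longest known reference name following text[amp_index], or None.

    Hypothetical helper for illustration only; not the validator's logic.
    """
    if text[amp_index] != "&":
        raise ValueError("amp_index must point at an ampersand")
    candidate = text[amp_index + 1 : amp_index + 1 + _MAX_NAME_LENGTH]
    # Try the longest prefix first, mirroring longest-match lookup.
    for length in range(len(candidate), 0, -1):
        if candidate[:length] in html5:
            return candidate[:length]
    return None
```

With this sketch, the "&" in "?bill&ted" matches nothing, while the "&" in "?art&copy" matches "copy", which is the case the error message is meant to catch.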
Comment 1 Michael[tm] Smith 2012-10-26 11:21:32 UTC
I wrote an experimental patch for this and pushed it to http://qa-dev.w3.org:8888/

So for now you can test it there, and please let me know if you find any problems.

I'll try to get the patch landed in the sources soon and pushed out to the production validator.
Comment 2 rasamassen 2012-10-26 12:33:34 UTC
Created attachment 1240
Test case
Comment 3 rasamassen 2012-10-26 12:39:19 UTC
Tested with the attached test case. The error never showed up where it shouldn't. Tested it on other sites as well. Looks like the patch is working.

Based on http://www.w3.org/TR/html5/named-character-references.html, an error should have shown up for "&dollar" and "&minus", but the live validator (http://validator.w3.org) does not recognize them as named character references, so I imagine that is a separate bug.
Comment 4 Michael[tm] Smith 2012-10-26 12:48:52 UTC
(In reply to comment #3)
> Tested with the attached test case. The error never showed up where it
> shouldn't. Tested it on other sites as well. Looks like the patch is working.

Excellent. Thanks very much for taking the time to test -- I really appreciate it.

> Based on http://www.w3.org/TR/html5/named-character-references.html, an
> error should have shown up for "&dollar" and "&minus", but the live
> validator (http://validator.w3.org) does not recognize them as named
> character references, so I imagine that is a separate bug.

Yes, I can confirm from inspection of the current validator source code that the code currently does not recognize "dollar" and "minus" as named characters. The only characters it recognizes as such are the ones in the NAMES array in this file:

  http://hg.mozilla.org/projects/htmlparser/raw-file/default/src/nu/validator/htmlparser/impl/NamedCharacters.java

So please do file a bug noting that "dollar" and "minus" are missing from that (along with any other missing ones you might find).
Comment 5 rasamassen 2012-10-26 15:40:09 UTC
Bug 19718 created to address the issue.
Comment 6 Michael[tm] Smith 2012-11-04 10:23:38 UTC
(In reply to comment #3)
> Tested with the attached test case. The error never showed up where it
> shouldn't. Tested it on other sites as well. Looks like the patch is working.
> 
> Based on http://www.w3.org/TR/html5/named-character-references.html, an
> error should have shown up for "&dollar" and "&minus", but the live
> validator (http://validator.w3.org) does not recognize them as named
> character references, so I imagine that is a separate bug.

The validator does recognize "&dollar;" and "&minus;" as valid named character references. The current spec actually does not require it to recognize semicolon-less "&dollar" and "&minus" as special in any way, and they are not errors, so the per-spec behavior for them is to report nothing at all.

I realize that the validator (actually the HTML parser used by the validator) does report "Named character reference was not terminated by a semicolon" errors for semicolon-less versions of some named character references such as "&reg". I'd need to look at the code more to figure out why it does that for some and not for others. I suspect it just has to do with length. But regardless, the current spec doesn't actually define "&reg" as a parse error, so I think the actual bug here might be that the parser is emitting any error message at all for the "&reg" case.
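
One way to see the difference between names like "dollar" and names like "reg" is to look at which forms appear in a named character reference table. The sketch below uses the html5 table from Python's standard html.entities module as a stand-in for the spec's list; it is not the NAMES array from NamedCharacters.java, and semicolon_forms is a hypothetical helper. It suggests that the table itself, rather than name length, may separate the two groups, though that is an assumption about the parser, not a confirmed explanation.

```python
from html import unescape
from html.entities import html5


def semicolon_forms(name):
    """Return (listed_with_semicolon, listed_without_semicolon) for a name.

    Hypothetical helper; it only inspects Python's copy of the table.
    """
    return (name + ";" in html5, name in html5)


for name in ("dollar", "minus", "reg", "copy"):
    with_semicolon, without_semicolon = semicolon_forms(name)
    print(f"{name}: with ';' = {with_semicolon}, without ';' = {without_semicolon}")

# Python's decoder behaves the same way: "&reg" is expanded without a
# semicolon, while "&dollar" is left as literal text.
print(unescape("&reg"), unescape("&dollar"))
```

In that table, "dollar" and "minus" are listed only with a trailing semicolon, whereas "reg" and "copy" are listed both with and without one.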