This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9300 - Remove the requirement to close NCR and entities with a semicolon
Summary: Remove the requirement to close NCR and entities with a semicolon
Status: CLOSED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: LC
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL: http://dev.w3.org/html5/spec/syntax#c...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-03-22 19:31 UTC by Leif Halvard Silli
Modified: 2011-01-21 07:11 UTC

Attachments
Test document of the issue (15.30 KB, text/html)
2010-03-22 21:25 UTC, Leif Halvard Silli
Test document of the issue (15.30 KB, text/html)
2010-03-22 21:25 UTC, Leif Halvard Silli
test case (14.93 KB, text/html)
2010-03-22 22:05 UTC, Leif Halvard Silli
test case (14.94 KB, text/html)
2010-03-22 22:08 UTC, Leif Halvard Silli
Test cases for bug 9300 (19.81 KB, text/html)
2010-04-03 17:05 UTC, Leif Halvard Silli

Description Leif Halvard Silli 2010-03-22 19:31:19 UTC
Lynx, IE, Mozilla, WebKit, Konqueror, Opera and Chrome all support decimal numeric character references (NCRs) without a semicolon.

The same user agents also all support hexadecimal NCRs without a semicolon inside attributes.

The only issue is directly in text, where IE does not support hexadecimal NCRs without a semicolon.

Tests: http://målform.no/ncr-test/

Hence the requirement to terminate NCRs with a semicolon should be removed - it is not a MUST in text/HTML.
Comment 1 Leif Halvard Silli 2010-03-22 20:28:20 UTC
Corrected the subject to say "semicolon" (instead of "colon" ...)
Comment 2 Leif Halvard Silli 2010-03-22 21:25:02 UTC
Created attachment 838
Test document of the issue

Attached the test case page.
Comment 3 Leif Halvard Silli 2010-03-22 21:25:24 UTC
Created attachment 839
Test document of the issue

Attached the test case page.
Comment 4 Leif Halvard Silli 2010-03-22 21:40:44 UTC
(In reply to comment #0)

> Hence the requirement to terminate NCRs with a semicolon should be removed - it
> is not a MUST in text/HTML.

And of course HTML4 doesn't require the semicolon:

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.w3.org%2FBugs%2FPublic%2Fattachment.cgi%3Fid%3D839&charset=%28detect+automatically%29&doctype=Inline&group=0

Comment 5 Leif Halvard Silli 2010-03-22 22:05:39 UTC
Created attachment 840
test case

Corrected and beautified the result table.
Comment 6 Leif Halvard Silli 2010-03-22 22:08:55 UTC
Created attachment 841
test case
Comment 7 Ian 'Hixie' Hickson 2010-04-02 01:25:51 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: It's a bad idea to get used to not closing one's character references, because it only works so long as the next character isn't numeric. This is why we have a restriction here: to help authors not get used to risky practices.
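
The risk named in this rationale can be illustrated with a minimal sketch. It assumes Python's standard `html.unescape`, which is documented as following the HTML5 rules for character references, as a stand-in for a browser's parser. It is not taken from the spec or from the test cases attached to this bug.

```python
import html

# An unterminated decimal NCR followed by a space decodes as intended.
followed_by_space = html.unescape("&#229 kr")

# Followed by a digit, the digit is read as part of the number,
# so the reference no longer means U+00E5.
followed_by_digit = html.unescape("&#2291")

# With the semicolon, the digit stays ordinary text.
terminated = html.unescape("&#229;1")
```

Here `followed_by_digit` is a single character (code point 2291), while `terminated` is "å" followed by "1".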
Comment 8 Leif Halvard Silli 2010-04-03 16:56:54 UTC
(In reply to comment #7)

> Status: Rejected
> Change Description: no spec change
> Rationale: It's a bad idea to get used to not closing one's character
> references, because it only works so long as the next character isn't numeric.

CORRECTION: to avoid the ambiguity problem that you point out, it is also necessary to avoid that the next character is an alphabetic letter in the range a-f/A-F (for hexadecimal NCRs).
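
A short sketch of the hexadecimal case, again assuming Python's standard `html.unescape` as an approximation of HTML5 parsing (an assumption made for illustration, not something used in this bug):

```python
import html

# "g" is not a hex digit, so the unterminated reference ends before it.
safe = html.unescape("&#xe5g")

# "f" is a hex digit, so it is read as part of the number (0xE5F, not 0xE5).
swallowed = html.unescape("&#xe5f")

# With the semicolon the boundary is explicit.
explicit = html.unescape("&#xe5;f")
```

Only `swallowed` differs from what the author presumably intended: it is the single code point U+0E5F rather than "å" followed by "f".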

> This is why we have a restriction here: to help authors not get used to risky
> practices.

NEW info for the Editor to consider

FIRST:  Today only XHTML requires the semicolon, due to XML restrictions.  For HTML, authors have not experienced such error messages. But despite this, and contrary to the Editor's rationale, this has not led to any big trend where authors drop the semicolon.

SECOND: Confusing error messages. If an author uses an NCR which points to an illegal (too big) Unicode number, then he/she gets a single error message. But if this illegal (too big) NCR is closed with a semicolon, then he/she gets TWO error messages: one error will say that the NCR is too long, while the other will say that it lacks the semicolon. My own experience as well as what I've seen in the mailing list for the W3C validator show that authors get confused by interdependent error messages. In this case, telling that the NCR was too long would already have caught the error.

THIRD: Often caught by other errors + ambiguous error message. Like the SECOND example above shows, these errors are often caught by a related error: The number becomes too big. Or it ends up pointing to a point in the Private Unicode range etc. A too big number creates an error. A private number creates a warning. Both of them are related to the length of the NCR. If the NCR is not too long and also not in the private range, then simply adding a semicolon could make the code valid, even if the semicolon is not placed in the correct place. Therefore, whenever a validator detects lack of semicolon, it is better to display a *warning* that points out the *ambiguity* that this creates (an ambiguity which is related to length!) rather than telling the author that a semicolon is lacking.
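
The three outcomes described in this point can be sketched as follows. The helper `classify_decimal_ncr` and its verdict strings are hypothetical, written only to illustrate the argument. The limits used are the Unicode maximum U+10FFFF and the three Private Use ranges. No validator's actual logic is reproduced here.

```python
import re

MAX_CODE_POINT = 0x10FFFF
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
]

def classify_decimal_ncr(text):
    """Hypothetical helper: read a decimal NCR at the start of `text`,
    taking as many digits as follow, and report what it points to."""
    match = re.match(r"&#([0-9]+)(;?)", text)
    if match is None:
        raise ValueError("text does not start with a decimal NCR")
    code_point = int(match.group(1))
    terminated = match.group(2) == ";"
    if code_point > MAX_CODE_POINT:
        verdict = "too big"
    elif any(low <= code_point <= high for low, high in PRIVATE_USE_RANGES):
        verdict = "private use"
    else:
        verdict = "ok"
    return code_point, terminated, verdict

# "&#229" run together with the digits "12345" exceeds the Unicode range.
too_big = classify_decimal_ncr("&#22912345")
# "&#5734" run together with "4" lands on U+E000, a private-use code point.
private = classify_decimal_ncr("&#57344")
# "&#229" run together with "1" is a valid but unintended code point.
silent = classify_decimal_ncr("&#2291")
```

The last case is the one no length check catches: the result is neither too big nor private, so only the missing semicolon signals that something may be wrong.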

FOURTH: User agent myth. Today, many give the advice that one needs to use semicolon due to lack of user agent support. My test case shows that this is almost a myth. Displaying a warning that points out the ambiguity issue that you point out in your rationale, more accurately makes authors understand the real issue.

FIFTH: Length is a real issue. The myth I mentioned under the FOURTH point hides another user agent issue: that the length of the NCR is considered by a group of user agents. I suggest displaying a warning whenever the NCR contains superfluous zeros that make the NCR longer than it needs to be.
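
A check of the kind suggested here could look roughly like the sketch below. The function name `find_padded_ncrs` and the regular expression are assumptions made for illustration and are not part of any existing validator.

```python
import re

# Matches decimal or hexadecimal NCRs whose number starts with one or more
# superfluous zeros, with or without a terminating semicolon.
PADDED_NCR = re.compile(r"&#(?:[xX]0+[0-9a-fA-F]+|0+[0-9]+);?")

def find_padded_ncrs(markup):
    """Hypothetical helper: return every NCR in `markup` that is longer
    than necessary because of leading zeros."""
    return PADDED_NCR.findall(markup)

padded = find_padded_ncrs("a &#0229; b &#x00e5 c &#229; d &#xe5;")
```

With this input only the first two references are reported. The unpadded `&#229;` and `&#xe5;` are left alone.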

SIXTH: Vendors. One of the little used user agents in my test cases (Lobo) did not have any support for lack of semicolon at all. If authors are required to use semicolon, then why should vendors make sure that their parsers support lack of semicolon? It is a better strategy, also w.r.t. vendors, to not display an error when semicolon is lacking, if the goal is to make them support the lack of semicolon.
Comment 9 Leif Halvard Silli 2010-04-03 17:00:15 UTC
(In reply to comment #8)

TYPO

This: "But if this illegal (too big) NCR is closed with a semicolon"
Should be: "But if this illegal (too big) NCR is **NOT** closed with a semicolon"
Comment 10 Leif Halvard Silli 2010-04-03 17:05:56 UTC
Created attachment 856
Test cases for bug 9300

The test case now has the results for tests with 15 user agents/user agent groups.
Comment 11 Sam Ruby 2010-04-04 00:36:36 UTC
(In reply to comment #8)
> 
> this has not led to any big trend where
> authors drop the semicolon.

Leif: do we know of *any* author that routinely produces content that omits needless semicolons?  I have yet to encounter any.

I'm interested in rerunning tests against a number of web sites after the validator has been updated based on the following change:

http://html5.org/tools/web-apps-tracker?from=4958&to=4959

I'm not a fan of imposing requirements that large number of people will routinely and willfully ignore; that being said, I see no evidence that this is one of those cases.
Comment 12 Leif Halvard Silli 2010-04-06 00:22:41 UTC
(In reply to comment #11)
> (In reply to comment #8)
> > 
> > this has not led to any big trend where
> > authors drop the semicolon.
> 
> Leif: do we know of *any* author that routinely produces content that omits
> needless semicolons?  I have yet to encounter any.

I agree. They are probably few.
 
> I'm interested in rerunning tests against a number of web sites after the
> validator has been updated based on the following change:
> 
> http://html5.org/tools/web-apps-tracker?from=4958&to=4959

Thanks for the heads up. Seems very relevant.

> I'm not a fan of imposing requirements that large number of people will
> routinely and willfully ignore; that being said, I see no evidence that this is
> one of those cases.

I understand that priority.  But I think that such things as CSS escapes are also very little used. I don't think that is a reason to warn against them.

My motivation for filing this bug had to do with the following: Uncommon syntax can sometimes be used for targeting or un-targeting specific user agents.

For instance, we are all familiar with the zillions of hacks for targeting legacy versions of Internet Explorer.

But did you know that if you do this: <div class=" _classname ">, then IE6 is unable to select it based on the class name? As it happens, inside CSS selectors, IE6 doesn't accept class names which begin with the underscore character - unless you escape it:

div._classname{/*for all other UAs*/}
div.\_classname{/*for IE6 only */}

Do Web authors routinely start their class names with the underscore? No. But that is not a reason to forbid it.  And how did I discover this way to target IE6? Because I put my head in the tube to try to discover a way to target IE6 without triggering any errors in a validator. (I of course knew that IE6 had problems with certain characters that it ignored, and so on, inside CSS properties.)

Now, as you understand, by writing a class name  using unterminated NCRs, I can target those user agents which support it. For instance, by writing
                     <div class=" &#xe5 ">

instead of writing
                     <div class=" &#xe5; ">

I could set up the CSS selector
                     div.å {color:red;}

which would target only those Web browsers that *do* support dropping the semicolon. (Thus the above CSS selector would not target the Java-based Lobo browser, for instance.)
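
As a rough sketch of why the two attribute values above end up as the same class name in a parser that accepts the unterminated form: the helper `class_tokens` below is hypothetical, and Python's standard `html.unescape` is assumed as a stand-in for HTML5 character reference decoding.

```python
import html
import re

def class_tokens(raw_attribute_value):
    """Hypothetical helper: decode character references in a raw class
    attribute value, then split the result on ASCII whitespace."""
    decoded = html.unescape(raw_attribute_value)
    return [token for token in re.split(r"[ \t\n\f\r]+", decoded) if token]

unterminated = class_tokens(" &#xe5 ")
terminated = class_tokens(" &#xe5; ")
```

Both calls yield the single token "å", which the selector `div.å` would match. A parser that does not decode the unterminated reference would instead be left with the literal token `&#xe5`, which that selector would not match.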

As it happens, and as my test document shows, the length of the NCR is far more relevant than the semicolon. So with that ammunition in hand, one can use long NCRs to target particular user agents.

Do you, with this explanation, understand why I do not want unterminated NCRs to become illegal? There is always some unthought-of reason not to be eager to forbid things just because one, for the moment, is unable to think of a reason why they should be legal.

If I had been asked to suggest a compromise solution, then I would have suggested that the dropping of the semicolon should be forbidden, EXCEPT in attributes which are meant for "computer semantics". E.g. such attributes as @class, @id, @data-*, @src, @data et cetera et cetera.  (I'm not sure where to draw the line - I have not tried to dig through it.)

Does this sound like a reasonable compromise to any of you?  Does any of this make sense to you?
Comment 13 Ian 'Hixie' Hickson 2010-04-13 01:07:52 UTC
> NEW info for the Editor to consider
> 
> FIRST:  Today only XHTML requires the semicolon, due to XML restrictions.  For
> HTML, authors have not experienced such error messages. But despite this,
> and contrary to the Editor's rationale, this has not led to any big trend where
> authors drop the semicolon.

Granted.


> SECOND: Confusing error messages. If an author uses an NCR which points to an
> illegal (too big) Unicode number, then he/she gets a single error message. But
> if this illegal (too big) NCR is closed with a semicolon, then he/she gets TWO
> error messages: one error will say that the NCR is too long, while the other
> will say that it lacks the semicolon. My own experience as well as what I've
> seen in the mailing list for the W3C validator show that authors get confused by
> interdependent error messages. In this case, telling that the NCR was too long
> would already have caught the error.

That's an issue for the validator. The spec doesn't say what the error messages should be.


> THIRD: Often caught by other errors + ambiguous error message. Like the SECOND
> example above shows, these errors are often caught by a related error: The
> number becomes too big. Or it ends up pointing to a point in the Private Unicode
> range etc. A too big number creates an error. A private number creates a
> warning. Both of them are related to the length of the NCR. If the NCR is not
> too long and also not in the private range, then simply adding a semicolon
> could make the code valid, even if the semicolon is not placed in the correct
> place. Therefore, whenever a validator detects lack of semicolon, it is better
> to display a *warning* that points out the *ambiguity* that this creates (an
> ambiguity which is related to length!) rather than telling the author that a
> semicolon is lacking.

Either it's ok or it isn't. If it's not ok (as indicated by a warning) then why allow it?

We shouldn't rely on the chance that people will screw this up in a particular way (hitting the PUA, e.g.) to help them. That's poor and unpredictable language design.


> FOURTH: User agent myth. Today, many give the advice that one needs to use
> semicolon due to lack of user agent support. My test case shows that this is
> almost a myth. Displaying a warning that points out the ambiguity issue that
> you point out in your rationale, more accurately makes authors understand the
> real issue.

Having an error or a warning here makes no difference to this case.


> FIFTH: Length is a real issue. The myth I mentioned under the FOURTH point
> hides another user agent issue: that the length of the NCR is considered by a
> group of user agents. I suggest displaying a warning whenever the NCR contains
> superfluous zeros that make the NCR longer than it needs to be.

This seems unrelated. Please keep one bug per issue.


> SIXTH: Vendors. One of the little used user agents in my test cases (Lobo) did
> not have any support for lack of semicolon at all. If authors are required to
> use semicolon, then why should vendors make sure that their parsers support
> lack of semicolon? It is a better strategy, also w.r.t. vendors, to not display
> an error when semicolon is lacking, if the goal is to make them support the
> lack of semicolon.

I have no idea what this means.


EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: It's a bad idea to get used to not closing one's character
references, because it only works so long as the next character isn't numeric.
This is why we have a restriction here: to help authors not get used to risky
practices.
Comment 14 Leif Halvard Silli 2010-04-13 19:19:20 UTC
(In reply to comment #13)

First of all, I miss a comment from you to my reply to Sam in comment #12.

> > THIRD: Often caught by other errors + ambiguous error message. Like the SECOND
> > example above shows, these errors are often caught by a related error: The
> > number becomes too big. Or it ends up pointing to a point in the Private Unicode
> > range etc. [ ....]

> Either it's ok or it isn't. If it's not ok (as indicated by a warning) then why
> allow it?

Do you prefer error messages over warning messages? If so, why does the private Unicode range only trigger a warning? Why not make PUA an error also? I mean, if the logic is "either it is ok or it is not".


> > FOURTH: User agent myth. Today, many give the advice that one needs to use
> > semicolon due to lack of user agent support. My test case shows that this is
> > almost a myth. Displaying a warning that points out the ambiguity issue that
> > you point out in your rationale, more accurately makes authors understand the
> > real issue.
> 
> Having an error or a warning here makes no difference to this case.

Well, it is only *almost* a myth.  A warning that speaks about UA compatibility (with some less used user agents) is different from a total forbidding. Such a thing also feels more helpful for authors.

Otherwise, I can't see that a warning is useful for the use of the private Unicode area either. To be consistent with your own "either ok or not ok", PUA should be an outright error as well.

> > SIXTH: Vendors. One of the little used user agents in my test cases (Lobo) did
> > not have any support for lack of semicolon at all. If authors are required to
> > use semicolon, then why should vendors make sure that their parsers support
> > lack of semicolon? It is a better strategy, also w.r.t. vendors, to not display
> > an error when semicolon is lacking, if the goal is to make them support the
> > lack of semicolon.
> 
> I have no idea what this means.

The idea is very simple: Vendors could use the fact that it is forbidden for authors as an excuse for not implementing support for entities without semicolon.

> Status: Rejected
> Change Description: no spec change
> Rationale: It's a bad idea to get used to not closing one's character
> references, because it only works so long as the next character isn't numeric.
> This is why we have a restriction here: to help authors not get used to risky
> practices.

I am an author. I don't feel that this is a help. I don't want this help. And you agreed that authors have not developed this habit despite the current permission. So I cannot accept this rationale. For Web authors it would be just as good help, perhaps better, if the Error simply gave a warning. We do not need an Error message with the justification you gave.

Also, to permit PUAs (which is a vendor extension) while at the same time forbidding the lack of semicolon (with the justification that "it is for your own best, my dear Web author") seems very vendor biased.
Comment 15 Leif Halvard Silli 2010-04-13 19:22:05 UTC
Substitute: if the Error simply gave a warning.
With: if the validator simply gave a warning.
Comment 16 Ian 'Hixie' Hickson 2010-04-14 00:04:59 UTC
> First of all, I miss a comment from you to my reply to Sam in comment #12.

That comment was gigantic. Anything in particular you want me to reply to?


> > > THIRD: Often caught by other errors + ambiguous error message. Like the SECOND
> > > example above shows, these errors are often caught by a related error: The
> > > number becomes too big. Or it ends up pointing to a point in the Private Unicode
> > > range etc. [ ....]
> 
> > Either it's ok or it isn't. If it's not ok (as indicated by a warning) then why
> > allow it?
> 
> Do you prefer error messages over warning messages? If so, why does the private
> Unicode range only trigger a warning? Why not make PUA an error also? I mean,
> if the logic is "either it is ok or it is not".

That's a can of worms I don't think we should broach in this bug. However, in general, I think that there shouldn't be any warnings. Most of the warnings in the spec are things I disagree with.


> Well, it is only *almost* a myth.  A warning that speaks about UA compatibility
> (with some less used user agents) is different from a total forbidding. Such a
> thing also feels more helpful for authors.

Such a warning is perfectly reasonable for a validator to give, but it seems somewhat out of scope for the spec, especially if we keep this as an error.


> Vendors could use the fact that it is forbidden for
> authors, as an excuse for not implementing support for entities without
> semicolon.

The same could be said of all conformance requirements in the spec; I don't think it's a real problem.


> > Rationale: It's a bad idea to get used to not closing one's character
> > references, because it only works so long as the next character isn't numeric.
> > This is why we have a restriction here: to help authors not get used to risky
> > practices.
> 
> I am an author. I don't feel that this is a help. I don't want this help. And
> you agreed that authors have not developed this habit despite the current
> permission. So I cannot accept this rationale. For Web authors it would be just
> as good help, perhaps better, if the Error simply gave a warning. We do not need
> an Error message with the justification you gave.

I understand that you might not want this help, but you may be more competent than most authors.


> Also, to permit PUAs (which is a vendor extension) while at the same time
> forbidding the lack of semicolon (with the justification that "it is for your
> own best, my dear Web author") seems very vendor biased.

This seems like a complete non-sequitur.


EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: no new information
Comment 17 Leif Halvard Silli 2011-01-21 07:11:58 UTC
This was a very interesting bug for me to file and research ... ;-) So if I were to count the time I spent on it, I should not accept that it wasn't solved ... ;-) But - I am eating this baby. :)