This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 19718 - Missing Named Character References
Summary: Missing Named Character References
Status: RESOLVED WORKSFORME
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael[tm] Smith
QA Contact: qa-dev tracking
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-10-26 15:39 UTC by rasamassen
Modified: 2015-08-23 07:07 UTC
CC List: 3 users

See Also:


Description rasamassen 2012-10-26 15:39:38 UTC
Per Bug 17418, Comments 3 & 4

http://hg.mozilla.org/projects/htmlparser/raw-file/default/src/nu/validator/htmlparser/impl/NamedCharacters.java does not contain the complete list of named character references in HTML5 per http://www.w3.org/TR/html5/named-character-references.html

"dollar" and "minus" are two named character references that are definitely missing. Other named character references may be missing as well. Either the two lists need to be compared and NamedCharacters.java should be updated to reflect HTML5 standards or a separate HTML5 NamedCharacters file needs to be created for the validator to use with HTML5 documents.
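
The comparison suggested above amounts to a set difference. A minimal sketch in Java, assuming both lists have already been loaded into sets of names; the class name, method name, and sample sets are hypothetical placeholders and do not reflect the actual contents of either list:

```java
import java.util.Set;
import java.util.TreeSet;

final class MissingNames {
    /** Returns the names present in specNames but absent from parserNames, sorted. */
    static Set<String> missing(Set<String> specNames, Set<String> parserNames) {
        Set<String> result = new TreeSet<>(specNames);
        result.removeAll(parserNames);
        return result;
    }

    public static void main(String[] args) {
        // Placeholder data only; not taken from the spec or from NamedCharacters.java.
        Set<String> spec = Set.of("alpha;", "beta;", "gamma;");
        Set<String> parser = Set.of("beta;");
        System.out.println(missing(spec, parser)); // prints [alpha;, gamma;]
    }
}
```

A comparison like this is only meaningful if both sets hold names in the same form, for example full names including the trailing semicolon.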
Comment 1 Michael[tm] Smith 2012-11-04 09:08:39 UTC
Cc'ing Henri

Henri, as this bug notes, the named character references &dollar; and &minus; are reported as errors by the HTML parser. From inspecting the NamedCharacters.java source in the parser, I see that those are in fact not in that file at all. It's possible that some other named references currently allowed by the spec are also not yet in the NamedCharacters.java source. Maybe Hixie added more named references a while back, or the Math WG added them to the upstream source.

Anyway, I also notice that the parser source is missing the ones that were added in spec r5557 http://html5.org/r/5557

I'm happy to write up a patch for this but I assume you're generating the NamedCharacters.java source, and I could not find the code for making it in the parser repo. Maybe it would be good if the repo also had whatever code you're using to generate that file?
Comment 2 Michael[tm] Smith 2012-11-04 09:22:37 UTC
Ah I now find the generator code in the parser sources:

http://hg.mozilla.org/projects/htmlparser/file/c83141518cf3/translator-src/nu/validator/htmlparser/generator/GenerateNamedCharacters.java

So I'll attempt to run that on the current named-character-references list and see what I end up with.
Comment 3 Michael[tm] Smith 2012-11-04 09:54:28 UTC
(In reply to comment #2)
> Ah I now find the generator code in the parser sources:
> 
> http://hg.mozilla.org/projects/htmlparser/file/c83141518cf3/translator-src/nu/validator/htmlparser/generator/GenerateNamedCharacters.java
> 
> So I'll attempt to run that on the current named-character-references list
> and see what I end up with.

So I just now ran it on a copy of the named-character-references table in the current spec, and I got output that is identical to what's currently in the NamedCharacters.java source. That is, lacking &dollar; and &minus;.

So I think there might be a bug in the GenerateNamedCharacters code that's causing it to drop some items in the table.
Comment 4 Michael[tm] Smith 2012-11-04 10:11:24 UTC
Um, after looking more carefully at the NAMES array in the NamedCharacters.java source, I see now that I was misunderstanding it. The items in the array are not exact matches for named character references; each one is missing the first two characters of its name. And I see "llar;" and "nus;" in there, as expected.
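
The layout described above can be illustrated with a minimal Java sketch. It is not the parser's actual code: the class name, the sample names, and the map-based index are assumptions made for illustration. It shows why searching for the full string "dollar" finds nothing even though the name is present: only the tail "llar;" is stored, under a key formed from the first two characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PrefixSplitNames {
    // Hypothetical sample; the real table has far more entries.
    static final String[] SAMPLE = { "dollar;", "minus;", "reg;", "reg" };

    /** Groups each name's tail (everything after the first two characters) under its two-character prefix. */
    static Map<String, List<String>> index(String[] names) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String name : names) {
            String prefix = name.substring(0, 2);
            String tail = name.substring(2);
            byPrefix.computeIfAbsent(prefix, k -> new ArrayList<>()).add(tail);
        }
        return byPrefix;
    }

    /** Looks a full name up by splitting it the same way the index was built. */
    static boolean contains(Map<String, List<String>> byPrefix, String name) {
        if (name.length() < 2) {
            return false;
        }
        List<String> tails = byPrefix.get(name.substring(0, 2));
        return tails != null && tails.contains(name.substring(2));
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = index(SAMPLE);
        System.out.println(table);                      // {do=[llar;], mi=[nus;], re=[g;, g]}
        System.out.println(contains(table, "dollar;")); // true
        System.out.println(contains(table, "dollar"));  // false
    }
}
```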

And anyway, in the meantime I got around to actually testing with the validator and found that "&dollar;" and "&minus;" work as expected. So there's no bug here after all. The test file posted to bug 17418 appears to be testing whether semicolon-less "&dollar" and "&minus" work. But that test doesn't conform to the spec. The named-character-references table only allows "&dollar;" and "&minus;" (with the semicolon), not "&dollar" and "&minus" (without the semicolon).

So I'm moving this bug to resolved.
Comment 5 Michael[tm] Smith 2012-11-04 10:26:10 UTC
OK, I went back and re-read
https://www.w3.org/Bugs/Public/show_bug.cgi?id=17418#c3 which says:

> Based on http://www.w3.org/TR/html5/named-character-references.html, an
> error should have shown up for "&dollar" and "&minus", but the live
> validator (http://validator.w3.org) does not recognize them as named
> character references, so I imagine that is a separate bug.

The current spec actually does not require parsers to recognize semicolon-less "&dollar" and "&minus" as special in any way, and they are not errors, so the actual per-spec behavior for them is to report nothing at all.

I realize that the validator (actually the HTML parser used by the validator) does report "Named character reference was not terminated by a semicolon" errors for semicolon-less versions of some named character references such as "&reg". I'd need to look at the code more to figure out why it does that for some and not for others. I suspect it just has to do with length. But regardless, the current spec doesn't actually define "&reg" as a parse error, so I think the actual bug here might be that the parser is emitting any error message at all for the "&reg" case.
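
One way to picture the difference between "&reg" and "&dollar" is a longest-prefix lookup against a table that lists some names both with and without the trailing semicolon. The sketch below is an illustration only, not the validator's code: the class, the method names, and the four-entry table are hypothetical, and it assumes that "reg" appears in the table in both forms while "dollar" and "minus" appear only with the semicolon.

```java
import java.util.Set;

final class SemicolonlessMatch {
    // Hypothetical excerpt. Assumption: "reg" is listed with and without ';',
    // while "dollar" and "minus" are listed only with ';'.
    static final Set<String> TABLE = Set.of("dollar;", "minus;", "reg;", "reg");

    /** Returns the longest table entry that is a prefix of the text following '&', or null if none. */
    static String longestMatch(String afterAmpersand) {
        String best = null;
        for (int end = 1; end <= afterAmpersand.length(); end++) {
            String candidate = afterAmpersand.substring(0, end);
            if (TABLE.contains(candidate)) {
                best = candidate;
            }
        }
        return best;
    }

    /** Classifies the text following '&' according to the matched table entry. */
    static String classify(String afterAmpersand) {
        String match = longestMatch(afterAmpersand);
        if (match == null) {
            return "not a reference";
        }
        return match.endsWith(";") ? "reference" : "reference without semicolon";
    }

    public static void main(String[] args) {
        System.out.println(classify("reg;"));    // reference
        System.out.println(classify("reg"));     // reference without semicolon
        System.out.println(classify("dollar;")); // reference
        System.out.println(classify("dollar"));  // not a reference
    }
}
```

Under that assumption, a check that only looks at table membership would flag "&reg" and stay silent on "&dollar" without any rule about name length.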
Comment 6 Henri Sivonen 2012-11-05 10:12:38 UTC
(In reply to comment #5)
> The current spec actually does not require parsers to recognize
> semicolon-less "&dollar" and "&minus" as special in any way, and they are
> not errors, so the actual per-spec behavior for them is to report nothing
> at all.

Them not being errors seems unfortunate. Is that an accident or by design in response to feedback to paper over unescaped ampersands in href?
Comment 7 Michael[tm] Smith 2012-11-05 10:19:34 UTC
(In reply to comment #6)
> (In reply to comment #5)
> > The current spec actually does not require parsers to recognize
> > semicolon-less "&dollar" and "&minus" as special in any way, and they are
> > not errors, so the actual per-spec behavior for them is to report nothing
> > at all.
> 
> Them not being errors seems unfortunate. Is that an accident or by design in
> response to feedback to paper over unescaped ampersands in href?

In response to the href case, I think. See the discussion over at bug 19102