Bugzilla – Bug 16106
Clarify paragraph about character references in tokenization.html
Last modified: 2012-09-28 05:17:35 UTC
In the tokenization.html page, in the section "Tokenizing character references", after the table, it says:
Otherwise, return a character token for the Unicode character whose code point is that number. If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
As far as I understand, the character is still returned even if it's a parse error, but this is not clear. The current wording might suggest that the character is returned, /but/ if the number is in those ranges, then it's a parse error (and it doesn't say what should be returned).
I suggest rephrasing it a bit to state explicitly that the character corresponding to that value is returned in both cases.
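To make the intended reading concrete, here is a small Python sketch of how I understand the quoted paragraph. It is only an illustration of that interpretation, not text from the spec: the names `is_flagged` and `otherwise_branch` and the (character, parse_error) return shape are made up for the example, and the earlier steps (the table, and any checks that run before the "Otherwise" paragraph) are assumed to have been handled already.

```python
from typing import Tuple


def is_flagged(number: int) -> bool:
    """True if the quoted paragraph calls this number a parse error."""
    if 0x0001 <= number <= 0x0008 or 0x000E <= number <= 0x001F:
        return True
    if 0x007F <= number <= 0x009F or 0xFDD0 <= number <= 0xFDEF:
        return True
    if number == 0x000B:
        return True
    # 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, ..., 0x10FFFE, 0x10FFFF:
    # the last two code points of each plane up to 0x10FFFF.
    return number <= 0x10FFFF and (number & 0xFFFE) == 0xFFFE


def otherwise_branch(number: int) -> Tuple[str, bool]:
    """Return (character, parse_error) for the "Otherwise" paragraph.

    The character is returned in both cases; the flag only records
    whether a parse error was reported as well.
    """
    return chr(number), is_flagged(number)
```

Under this reading, `otherwise_branch(0x0B)` still yields the U+000B character, just with the parse error flag set, which is the behaviour I think the text should state explicitly.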
(In reply to comment #0)
> As far as I understand, the character is still returned even if it's a parse
> error, but this is not clear.
It's pretty clear to me that the first sentence already covers all cases. Otherwise, the first sentence and the second long long sentence would have been switched.
Having said that, I am not the editor and he might agree with you.
> I suggest rephrasing it a bit to state explicitly that the character
> corresponding to that value is returned in both the cases.
Why don't you propose some text by the way?
I'm with Kenny on this. I don't really see how to make it clearer. If you have any proposals though I'm happy to entertain them.
One solution would be to use a list like the ones in the rest of the page, so something like:
→ 0xD800 to 0xDFFF
→ greater than 0x10FFFF
Parse error. Return U+FFFD.
→ 0x0001 to 0x0008
→ 0x000E to 0x001F
Parse error. Treat it as per the "anything else" entry below.
→ Anything else
Return a character token for the Unicode character whose code point is that number.
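As a rough check that the list form carries the same behaviour, it maps fairly directly onto code. The Python below is a sketch only: `consume_number` and the `errors` list are invented for the example, only the entries written out above are included (the other flagged values from the quoted paragraph would go in the same entry), and the table from the spec is not modelled.

```python
from typing import List

REPLACEMENT = "\uFFFD"


def consume_number(number: int, errors: List[str]) -> str:
    """Dispatch on the list entries above, in order (hypothetical helper)."""
    # -> 0xD800 to 0xDFFF
    # -> greater than 0x10FFFF
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        errors.append("parse error")
        return REPLACEMENT

    # -> 0x0001 to 0x0008
    # -> 0x000E to 0x001F
    if 0x0001 <= number <= 0x0008 or 0x000E <= number <= 0x001F:
        errors.append("parse error")
        # No return here: treat it as per the "anything else" entry.

    # -> Anything else
    return chr(number)
```

With the entries laid out this way, the second group visibly falls through to the last one, so there is no doubt that a character is returned after the parse error is reported.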
This bug was cloned to create bug 18021 as part of operation convergence.
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so.
Change Description: applied patch
Rationale: adopted resolution by WHATWG