Bug 16106 - Clarify paragraph about character references in tokenization.html
Status: RESOLVED FIXED
Product: HTML WG
Classification: Unclassified
Component: HTML5 spec
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assigned To: Silvia Pfeiffer
QA Contact: HTML WG Bugzilla archive list
Reported: 2012-02-24 11:37 UTC by Ezio Melotti
Modified: 2012-09-28 05:17 UTC (History)
Description Ezio Melotti 2012-02-24 11:37:49 UTC
In the tokenization.html page, in the section "8.2.4.69 Tokenizing character references", after the table, it says:

"""
Otherwise, return a character token for the Unicode character whose code point is that number. If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
"""

As far as I understand, the character is still returned even if it's a parse error, but this is not clear.  The current wording might suggest that the character is returned, /but/ if the number is in those ranges, then it's a parse error (and it doesn't say what should be returned).
I suggest rephrasing it a bit to state explicitly that the character corresponding to that value is returned in both the cases.
Comment 1 Kang-Hao (Kenny) Lu 2012-02-24 12:47:08 UTC
(In reply to comment #0)
> As far as I understand, the character is still returned even if it's a parse
> error, but this is not clear.  

It's pretty clear to me that the first sentence already covers all cases. Otherwise, the first sentence and the second long long sentence would have been switched.

Having said that, I am not the editor and he might agree with you. 

> I suggest rephrasing it a bit to state explicitly that the character
> corresponding to that value is returned in both the cases.

Why don't you propose some text by the way?
Comment 2 Ian 'Hixie' Hickson 2012-02-24 17:16:38 UTC
I'm with Kenny on this. I don't really see how to make it clearer. If you have any proposals though I'm happy to entertain them.
Comment 3 Ezio Melotti 2012-02-28 10:59:04 UTC
One solution would be to use a list like the ones in the rest of the page, so something like:
...
→ 0xD800 to 0xDFFF
→ greater than 0x10FFFF
    Parse error.  Return U+FFFD.
→ 0x0001 to 0x0008
→ 0x000E to 0x001F
→ ...
    Parse error. Treat it as per the "anything else" entry below.
→ Anything else
    Return a character token for the Unicode character whose code point is that number.
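
The list above can be sketched in code to show that the character is returned whether or not a parse error is flagged. This is only an illustrative Python sketch, not the spec's algorithm: the names resolve_numeric_reference and is_flagged are hypothetical, the replacement table that precedes the quoted paragraph is assumed to have been consulted already and is omitted, and the ranges are copied from the paragraph quoted in the description.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_flagged(n: int) -> bool:
    """True for code points the quoted paragraph marks as a parse error
    while still returning the character itself (hypothetical helper)."""
    return (
        0x0001 <= n <= 0x0008
        or 0x000E <= n <= 0x001F
        or 0x007F <= n <= 0x009F
        or 0xFDD0 <= n <= 0xFDEF
        or n == 0x000B
        # 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, ..., 0x10FFFE, 0x10FFFF
        or (n <= 0x10FFFF and (n & 0xFFFE) == 0xFFFE)
    )


def resolve_numeric_reference(n: int) -> Tuple[str, bool]:
    """Return (character, parse_error) for a numeric character reference.

    Assumes the table lookup that comes before this step found no match.
    """
    # Surrogates and out-of-range numbers: parse error, return U+FFFD.
    if 0xD800 <= n <= 0xDFFF or n > 0x10FFFF:
        return REPLACEMENT, True
    # Anything else: return the character; flagged ranges add a parse error.
    return chr(n), is_flagged(n)
```

For example, resolve_numeric_reference(0x0B) gives the U+000B character together with a parse error flag, while resolve_numeric_reference(0x41) gives "A" with no error.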
Comment 4 contributor 2012-07-18 15:06:18 UTC
This bug was cloned to create bug 18021 as part of operation convergence.
Comment 5 Silvia Pfeiffer 2012-09-28 05:17:35 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
document:   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: applied patch
https://github.com/w3c/html/commit/6ce78faff3937f156ea217bba6d290de3f456de0
Rationale: adopted resolution by WHATWG