18021 – Clarify paragraph about character references in tokenization.html

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: Clarify paragraph about character references in tokenization.html

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

Reported:	2012-07-18 15:06 UTC by contributor
Modified:	2012-09-01 00:17 UTC
CC List:	4 users

Description contributor 2012-07-18 15:06:14 UTC

This was cloned from bug 16106 as part of operation convergence.
Originally filed: 2012-02-24 11:37:00 +0000
Original reporter: Ezio Melotti <ezio.melotti@gmail.com>

================================================================================
 #0   Ezio Melotti                                    2012-02-24 11:37:49 +0000 
--------------------------------------------------------------------------------
In the tokenization.html page, in the section "8.2.4.69 Tokenizing character references", after the table, it says:

"""
Otherwise, return a character token for the Unicode character whose code point is that number. If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or 0x10FFFF, then this is a parse error.
"""

As far as I understand, the character is still returned even if it's a parse error, but this is not clear.  The current wording might suggest that the character is returned, /but/ if the number is in those ranges, then it's a parse error (and it doesn't say what should be returned).
I suggest rephrasing it a bit to state explicitly that the character corresponding to that value is returned in both the cases.
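
For illustration, the reading described above (the character is returned whether or not the number triggers a parse error) can be sketched in Python. This is only a sketch under stated assumptions, not spec text: the names PARSE_ERROR_RANGES, PARSE_ERROR_VALUES and otherwise_branch are hypothetical, and the function covers only the "Otherwise" sentence quoted above, assuming the earlier steps of the algorithm have already been handled by the caller.

```python
# Ranges and single values copied from the quoted spec paragraph.
PARSE_ERROR_RANGES = [
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xFDD0, 0xFDEF),
]

# 0x000B plus 0xFFFE/0xFFFF in every plane from 0x0 to 0x10
# (0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFE, 0x10FFFF).
PARSE_ERROR_VALUES = {0x000B} | {
    (plane << 16) | low
    for plane in range(0x11)
    for low in (0xFFFE, 0xFFFF)
}


def otherwise_branch(number):
    """Hypothetical helper: return (character, is_parse_error).

    Assumes `number` is a valid, non-surrogate code point that reached
    the "Otherwise" sentence; earlier cases are not modelled here.
    """
    is_parse_error = number in PARSE_ERROR_VALUES or any(
        low <= number <= high for low, high in PARSE_ERROR_RANGES
    )
    # The character is returned in both cases; the parse error is
    # reported in addition to it, not instead of it.
    return chr(number), is_parse_error
```

Under this reading, otherwise_branch(0x0B) yields the U+000B character together with a parse-error flag, while otherwise_branch(0x41) yields "A" with no error.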
================================================================================
 #1   Kang-Hao (Kenny) Lu                             2012-02-24 12:47:08 +0000 
--------------------------------------------------------------------------------
(In reply to comment #0)
> As far as I understand, the character is still returned even if it's a parse
> error, but this is not clear.  

It's pretty clear to me that the first sentence already covers all cases. Otherwise, the first sentence and the second long long sentence would have been switched.

Having said that, I am not the editor and he might agree with you. 

> I suggest rephrasing it a bit to state explicitly that the character
> corresponding to that value is returned in both the cases.

Why don't you propose some text by the way?
================================================================================
 #2   Ian 'Hixie' Hickson                             2012-02-24 17:16:38 +0000 
--------------------------------------------------------------------------------
I'm with Kenny on this. I don't really see how to make it clearer. If you have any proposals though I'm happy to entertain them.
================================================================================
 #3   Ezio Melotti                                    2012-02-28 10:59:04 +0000 
--------------------------------------------------------------------------------
One solution would be to use a list like the ones in the rest of the page, so something like:
...
 0xD800 to 0xDFFF
 greater than 0x10FFFF
    Parse error.  Return U+FFFD.
 0x0001 to 0x0008
 0x000E to 0x001F
 ...
    Parse error. Treat it as per the "anything else" entry below.
 Anything else
    Return a character token for the Unicode character whose code point is that number.
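
As a rough sketch, the case order of the list proposed above could be written in Python as follows. The names here (resolve_numeric_reference, _flagged) are hypothetical, the entries elided by the leading "..." are deliberately not modelled, and the flagged ranges are taken from the paragraph quoted in comment #0. It is meant only to show the fall-through to "anything else", not to be a definitive implementation.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def _flagged(number):
    """True if `number` is in the parse-error set quoted in comment #0."""
    ranges = [(0x0001, 0x0008), (0x000E, 0x001F),
              (0x007F, 0x009F), (0xFDD0, 0xFDEF)]
    if any(low <= number <= high for low, high in ranges):
        return True
    # 0x000B, or a code point ending in FFFE/FFFF in any plane.
    return number == 0x000B or (number & 0xFFFF) in (0xFFFE, 0xFFFF)


def resolve_numeric_reference(number):
    """Hypothetical helper: return (character, is_parse_error)."""
    # Case: 0xD800 to 0xDFFF, or greater than 0x10FFFF.
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return REPLACEMENT_CHARACTER, True
    # Case: flagged ranges. Record the parse error, then treat the
    # number as per the "anything else" entry.
    is_parse_error = _flagged(number)
    # Case: anything else.
    return chr(number), is_parse_error
```

Only the first case changes what is returned; the flagged ranges add a parse error but still reach the final return.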
================================================================================

Comment 1 Ian 'Hixie' Hickson 2012-09-01 00:15:49 UTC

I don't think that would really work here. But I've tried to make it clearer nonetheless by adding the word "Additionally" to the second sentence. Is that enough? If not, please don't hesitate to reopen the bug, and I'll see what else I can do.

Comment 2 contributor 2012-09-01 00:17:25 UTC

Checked in as WHATWG revision r7309.
Check-in comment: Clarify that the second sentence doesn't override the first.
http://html5.org/tools/web-apps-tracker?from=7308&to=7309