10067 – this only lists entities whose replacement text is a single character, for example many of the negated operators, for example

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 10067 - this only lists entities whose replacement text is a single character, for example many of the negated operators, for example

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	http://www.whatwg.org/specs/web-apps/...

Reported:	2010-07-02 16:04 UTC by contributor
Modified:	2010-10-04 14:29 UTC
CC List:	12 users

Description contributor 2010-07-02 16:04:17 UTC

Section: http://www.whatwg.org/specs/web-apps/current-work/#named-character-references-table

Comment:
this only lists entities whose replacement text is a single character, for
example many of the negated operators, for example

Posted from: 62.231.145.254

Comment 1 David Carlisle 2010-07-02 16:09:50 UTC

(In reply to comment #0)
Sorry, the comment form on the WHATWG spec sent the comment before I expected.

I intended to add...


a typical example is

nsupset; U02283-020D2  entity missing from html5 spec

(a full list could be supplied on demand)

If this is intentional, some documentation of the difference between the built-in entities for text/html and the supplied DTD for XHTML+MathML for use with the XML serialisation should be given. If it's not intentional, the list should be extended, ideally to match

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

David

Comment 2 Ian 'Hixie' Hickson 2010-09-24 14:39:15 UTC

Currently my script explicitly skips any character references that expand to more than one code point, because multiple code points would presumably be a lot harder for implementations, which would have to use more expensive data structures for this.
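
As an illustration of the filter described here, the following is a minimal Python sketch, not the editor's actual script (which is not shown in this bug). It assumes the `U02283-020D2` notation used elsewhere in this thread, where each hyphen-separated group is one hexadecimal code point. The names `parse_value`, `single_code_point_only`, and the sample table are hypothetical and chosen only for the example.

```python
def parse_value(value: str) -> str:
    """Convert a value such as 'U02283-020D2' into the string it denotes."""
    if not value.startswith("U"):
        raise ValueError(f"unexpected value format: {value!r}")
    # Each hyphen-separated group is one code point in hexadecimal.
    return "".join(chr(int(group, 16)) for group in value[1:].split("-"))


def single_code_point_only(table: dict[str, str]) -> dict[str, str]:
    """Keep only the names whose expansion is exactly one code point."""
    return {name: value for name, value in table.items()
            if len(parse_value(value)) == 1}


# Hypothetical sample: 'nsupset' is the example from this bug; the other two
# are ordinary single-character references.
sample = {"lt": "U0003C", "sup": "U02283", "nsupset": "U02283-020D2"}
kept = single_code_point_only(sample)
skipped = sorted(set(sample) - set(kept))
```

With this sample, `nsupset` is the only name that ends up in `skipped`, which is the behaviour the bug report is about.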

I'm still not really convinced we want to add the thousands of characters we have added so far... do we really want to add even more?

Input from browser vendors would be useful here. Henri? Adam/Eric? James?

Comment 3 Anne 2010-09-24 14:45:26 UTC

Stig said it should be no problem for us.

Comment 4 Adam Barth 2010-09-24 17:49:57 UTC

Entities that expand to multiple code points are no problem for WebKit to implement.  Now, whether we want to have more entities is a separate question.

Comment 5 Henri Sivonen 2010-09-27 08:28:32 UTC

How many named character names would the change add? Would the first two letters of the additional names be evenly distributed? How long would the additional expansions be in terms of a) UTF-16 code units and b) UTF-8 code units? Are the names that aren't currently in HTML5 actually shown to be useful for XML MathML authoring?

For the implementation in Gecko, it would be bad to introduce a large number of names that shared the first two letters with commonly used named characters. (Names starting with lt, gt, qu, nb or am would probably be the worst.) Also, the implementation in Gecko now assumes the expansion is always one or two UTF-16 code units.

I'd be the most OK with adding names whose first two letters don't collide with pre-existing names and whose expansions aren't longer than two UTF-16 code units. For other kinds of additional names, I'd be interested in the expected benefit of the complication.
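
The two measurements asked about here (how names distribute over their first two letters, and how long expansions are in UTF-16 code units) can be sketched in Python as below. This is only a way to compute the numbers; it is not Gecko's data structure, and the helper names and the small sample of names taken from this bug are assumptions made for the example.

```python
from collections import Counter


def utf16_units(expansion: str) -> int:
    """Number of UTF-16 code units needed to encode the expansion."""
    return len(expansion.encode("utf-16-le")) // 2


def prefix_histogram(names) -> Counter:
    """Count how many names share each two-letter prefix."""
    return Counter(name[:2] for name in names)


# A few of the two-code-point names that appear in this bug.
proposed = {
    "nLt": "\u226a\u20d2",
    "nsupset": "\u2283\u20d2",
    "NotSubset": "\u2282\u20d2",
    "NotSuperset": "\u2283\u20d2",
}
histogram = prefix_histogram(proposed)
longest = max(utf16_units(value) for value in proposed.values())
```

For this sample, the `No` prefix is shared by two names and every expansion is two UTF-16 code units long, the same length as a single astral character encoded as a surrogate pair.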

Comment 6 Ian 'Hixie' Hickson 2010-09-27 09:10:28 UTC

The character reference names and values would be:

 name: nvlt; value: U0003C-020D2
 name: bne; value: U0003D-020E5
 name: nvgt; value: U0003E-020D2
 name: fjlig; value: U00066-0006A
 name: ThickSpace; value: U0205F-0200A
 name: nrarrw; value: U0219D-00338
 name: npart; value: U02202-00338
 name: nang; value: U02220-020D2
 name: caps; value: U02229-0FE00
 name: cups; value: U0222A-0FE00
 name: nvsim; value: U0223C-020D2
 name: race; value: U0223D-00331
 name: acE; value: U0223E-00333
 name: NotEqualTilde; value: U02242-00338
 name: nesim; value: U02242-00338
 name: napid; value: U0224B-00338
 name: nvap; value: U0224D-020D2
 name: NotHumpDownHump; value: U0224E-00338
 name: nbump; value: U0224E-00338
 name: nbumpe; value: U0224F-00338
 name: NotHumpEqual; value: U0224F-00338
 name: nedot; value: U02250-00338
 name: bnequiv; value: U02261-020E5
 name: nvle; value: U02264-020D2
 name: nvge; value: U02265-020D2
 name: nlE; value: U02266-00338
 name: nleqq; value: U02266-00338
 name: ngE; value: U02267-00338
 name: ngeqq; value: U02267-00338
 name: NotGreaterFullEqual; value: U02267-00338
 name: lvnE; value: U02268-0FE00
 name: lvertneqq; value: U02268-0FE00
 name: gvnE; value: U02269-0FE00
 name: gvertneqq; value: U02269-0FE00
 name: nLtv; value: U0226A-00338
 name: NotLessLess; value: U0226A-00338
 name: nLt; value: U0226A-020D2
 name: nGtv; value: U0226B-00338
 name: NotGreaterGreater; value: U0226B-00338
 name: nGt; value: U0226B-020D2
 name: NotSucceedsTilde; value: U0227F-00338
 name: vnsub; value: U02282-020D2
 name: nsubset; value: U02282-020D2
 name: NotSubset; value: U02282-020D2
 name: vnsup; value: U02283-020D2
 name: nsupset; value: U02283-020D2
 name: NotSuperset; value: U02283-020D2
 name: vsubne; value: U0228A-0FE00
 name: varsubsetneq; value: U0228A-0FE00
 name: vsupne; value: U0228B-0FE00
 name: varsupsetneq; value: U0228B-0FE00
 name: NotSquareSubset; value: U0228F-00338
 name: NotSquareSuperset; value: U02290-00338
 name: sqcaps; value: U02293-0FE00
 name: sqcups; value: U02294-0FE00
 name: nvltrie; value: U022B4-020D2
 name: nvrtrie; value: U022B5-020D2
 name: nLl; value: U022D8-00338
 name: nGg; value: U022D9-00338
 name: lesg; value: U022DA-0FE00
 name: gesl; value: U022DB-0FE00
 name: notindot; value: U022F5-00338
 name: notinE; value: U022F9-00338
 name: nrarrc; value: U02933-00338
 name: NotLeftTriangleBar; value: U029CF-00338
 name: NotRightTriangleBar; value: U029D0-00338
 name: ncongdot; value: U02A6D-00338
 name: napE; value: U02A70-00338
 name: nles; value: U02A7D-00338
 name: NotLessSlantEqual; value: U02A7D-00338
 name: nleqslant; value: U02A7D-00338
 name: nges; value: U02A7E-00338
 name: NotGreaterSlantEqual; value: U02A7E-00338
 name: ngeqslant; value: U02A7E-00338
 name: NotNestedLessLess; value: U02AA1-00338
 name: NotNestedGreaterGreater; value: U02AA2-00338
 name: smtes; value: U02AAC-0FE00
 name: lates; value: U02AAD-0FE00
 name: npre; value: U02AAF-00338
 name: npreceq; value: U02AAF-00338
 name: NotPrecedesEqual; value: U02AAF-00338
 name: nsce; value: U02AB0-00338
 name: nsucceq; value: U02AB0-00338
 name: NotSucceedsEqual; value: U02AB0-00338
 name: nsubE; value: U02AC5-00338
 name: nsubseteqq; value: U02AC5-00338
 name: nsupE; value: U02AC6-00338
 name: nsupseteqq; value: U02AC6-00338
 name: vsubnE; value: U02ACB-0FE00
 name: varsubsetneqq; value: U02ACB-0FE00
 name: vsupnE; value: U02ACC-0FE00
 name: varsupsetneqq; value: U02ACC-0FE00
 name: nparsl; value: U02AFD-020E5

Comment 7 Henri Sivonen 2010-09-27 12:12:42 UTC

(In reply to comment #6)
> The character reference names and values would be:

>  name: nLt; value: U0226A-020D2

Given that these proposed names always expand to 2 BMP characters, they aren't worse than the pre-existing astral characters in UTF-16. Also, it seems the first two letters don't collide too badly with the first two letters of the most common named characters, so that seems OK, too. It also looks like the longest of the proposed names is substantially longer than the longest existing name, so the need to buffer in case of mismatch at the last possible point doesn't get substantially worse.

There are a couple of unfortunate characteristics, but I guess they aren't too bad:
 1) Many names start with "No". That is, the first two letters don't provide as much uniqueness as one might hope. Anyway, chances are these names won't become too popular on the Web scale, so it probably won't matter if these aren't carefully optimized in Gecko.

 2) &nLt; is 5 bytes in UTF-8 but its expansion is 6 bytes. This changes the buffering nature of named characters when the buffers are in UTF-8: The output buffer may have to be larger than the input buffer. However, this problem already exists when U+0000 is turned into U+FFFD, so the worst case for UTF-8 is already worse (output 3 times the size of input) than what these new names require (output 1.2 times the size of input).
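
The byte counts in point 2 can be checked with a short Python sketch. The function name `utf8_growth` is hypothetical; the two cases are the ones named above.

```python
def utf8_growth(source: str, output: str) -> float:
    """Ratio of UTF-8 output size to UTF-8 input size."""
    return len(output.encode("utf-8")) / len(source.encode("utf-8"))


# "&nLt;" is 5 bytes; U+226A followed by U+20D2 is 3 + 3 = 6 bytes.
named_reference = utf8_growth("&nLt;", "\u226a\u20d2")

# U+0000 is 1 byte; its replacement U+FFFD is 3 bytes.
nul_replacement = utf8_growth("\x00", "\ufffd")
```

Here `named_reference` comes out at 1.2 and `nul_replacement` at 3.0, so the pre-existing U+0000 case remains the worst case for sizing a UTF-8 output buffer.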

I don't have immediate objections to the addition of these named characters.

However, the sheer size of the list is already rather excessive. I hope the Math WG isn't planning on adding more names over time. If this list is just going to grow and grow, maybe we should just say "no" now. OTOH, if there's a promise that the list doesn't get bigger after this, I guess these additions can be lived with.

Comment 8 Henri Sivonen 2010-09-27 12:15:24 UTC

s/is substantially/is NOT substantially/

Comment 9 David Carlisle 2010-09-27 13:24:35 UTC

(In reply to comment #7)
> I don't have immediate objections to the addition of these named characters.
> 
> However, the sheer size of the list is already rather excessive. I hope the
> Math WG isn't planning on adding more names over time. If this list is just
> going to grow and grow, maybe we should just say "no" now. OTOH, if there's a
> promise that the list doesn't get bigger after this, I guess these additions
> can be lived with.

It's dangerous to predict the future but I can promise there is absolutely no intention of ever extending this list. MathML3 added no new names, MathML2 added just 1 (I think) so all but asympeq come from MathML1 in 1998 (and the vast majority of them come from the earlier ISO entity sets).

As I commented this morning in IRC (but it didn't make the log for some reason), we have the (self-imposed) constraint that we never remove a name, because if an XML document gets used with a catalog that switches in a newer DTD, the entity would become undefined and the entire document would be rejected as not well-formed.

HTML doesn't have the draconian error handling and the names were not previously in html so the pressures on you are slightly different.
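
The difference in error handling can be seen with the Python standard library, used here only as a stand-in for an XML parser and an HTML-style entity decoder. The helper `xml_accepts` is a hypothetical name; `html.unescape` is documented as following the HTML5 rules and its named character reference list.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(document: str) -> bool:
    """Return True if the XML parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# With no DTD declaring it, a named entity is a fatal error in XML.
undefined_in_xml = xml_accepts("<m>&nsupset;</m>")
# The five predefined XML entities need no DTD.
predefined_in_xml = xml_accepts("<m>&lt;</m>")

# HTML-style decoding has a built-in table and leaves unknown names alone.
known_in_html = html.unescape("&nsupset;")
unknown_in_html = html.unescape("&zzzz;")
```

The XML parser rejects the whole document for the undefined entity, while the HTML-style decoder expands `&nsupset;` to U+2283 U+20D2 and passes `&zzzz;` through unchanged.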

Some workflows (and my sanity) are probably helped if the lists are exactly the same on the html and xml sides, but if the html entities are going to be a subset, then (a) this should be mentioned somewhere in the html5 spec (and I'd mention it and list the html5 ones in the editor's draft (at least) of the xml entities spec) and (b) there are probably some other ones that you could drop in addition to the multiple character ones, specifically
NegativeMediumSpace; 	U+0200B 	&#8203;
NegativeThickSpace; 	U+0200B 	&#8203;
NegativeThinSpace; 	U+0200B 	&#8203;
NegativeVeryThinSpace; 	U+0200B 	&#8203;
All of these expand to zero width space. The only reason they are there is that they were in MathML1 and were kept, as noted above.
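
As a quick check of the four names listed above, the sketch below decodes each one with Python's `html.unescape`, which uses the HTML5 named character reference list. The variable names are hypothetical, and the check reflects the table as shipped with Python rather than the draft under discussion in this bug.

```python
import html

negative_spaces = [
    "NegativeMediumSpace",
    "NegativeThickSpace",
    "NegativeThinSpace",
    "NegativeVeryThinSpace",
]

# Decode each reference and collect the distinct expansions.
expansions = {name: html.unescape(f"&{name};") for name in negative_spaces}
distinct = set(expansions.values())
```

All four names decode to the same single character, U+200B ZERO WIDTH SPACE, so `distinct` holds exactly one value.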

MathML1 used the private use area for the majority of its characters, based on the STIX submission to Unicode. These negative spaces were in the submission but not accepted into Unicode when the other math characters went in in Unicode 3.1 and 3.2, which left them with nowhere to go to once we stopped using the private use area. (Arguably they should have gone to the replacement character, but zero width space had better behaviour in the systems of the time).

Comment 10 Ian 'Hixie' Hickson 2010-09-29 18:47:00 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments.

Please keep an eye out for parts of the spec that assume a character reference is one codepoint and let me know of any I missed.

Comment 11 contributor 2010-09-29 18:48:11 UTC

Checked in as WHATWG revision r5557.
Check-in comment: Add the remaining MathML entities -- the ones that expand to two characters.
http://html5.org/tools/web-apps-tracker?from=5556&to=5557

Comment 12 David Carlisle 2010-09-30 12:14:19 UTC

Thanks, just to confirm that the list of entity names extracted from the editors draft of html5 now matches a list constructed from the source files of the xml entities REC.