XML Entity Definitions for Characters Last Call Dispositions

Public Version [Member-Confidential Version]

6 February 2010, Version 1.2

Editors: David Carlisle, NAG, Editor; Patrick Ion, MR/AMS, Math WG Co-chair

bfsfit glyphs
missing entities in w3centities-f.ent
U02220-020D2 , five hexadecimal digits,...
isotech error with lowast?
Series of comments from I18N direction
htmlmath entities collection
Possible bug in the operator dictionary and mmlalias.ent?

7 comments of which 7 fully resolved, 0 resolved but not yet formally accepted and 0 unresolved.

XML Entity Definitions for Characters Last Call Comments and Responses

The XML Entity Definitions for Characters draft has received several comments. We itemize the comments received and their responses.

Entities

Summary:	bfsfit glyphs
Submitted:	Will Robertson, http://lists.w3.org/Archives/Public/www-math/2009Nov/0056.html
Response:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Nov/0057.html Will Robertson, http://lists.w3.org/Archives/Public/www-math/2009Nov/0060.html
Discussion:	The glyphs shown as examples for sans serif bold italic look as if they're from the regular weight: http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/sans-serif-bold-italic.html Not that there exists a regular weight for Greek sans :) -- Will

It's 10 years ago and I'm not sure I have the sources, but I did tweak the metafont parameters taking a creative merge of the bold, slanted and sans serif parameters to come up with a bold sans serif slanted. Perhaps I wasn't creative enough, I agree it could be more bold. I recently discarded my script and bold script glyphs (which were identical) and replaced them with glyphs derived from the STIX beta. Perhaps I should do the same here? David

It wasn't the absolute weight that I was worried about, just that it looked like the same glyphs were being used for normal weight slanted sans serif as well. The results would certainly look better; not the end of the world either way, of course :) -- Will
Resolution:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Nov/0070.html Actually it was only 9 years, October 2000 seems to have been the last time I touched those sources... They didn't only look similar: cmp confirmed the png files were identical. So despite there being a start of some MF sources for this, it appears that I just used the normal weight ones. Rather than switch to STIX beta here I decided to stick with the Computer Modern heritage of (almost) all the rest and have tweaked the metafont a bit more resulting in an updated set of bold slanted sans glyphs as can be seen in the editors' draft: http://www.w3.org/2003/entities/2007doc/sans-serif-bold-italic.html Thanks for reporting this, David


Summary:	missing entities in w3centities-f.ent
Submitted:	John Cowan, http://lists.w3.org/Archives/Public/www-math/2009Nov/0061.html
Response:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Nov/0062.html
Discussion:	The definitions of the lowercase equivalents of the HTML5-UPPERCASE set (namely amp, copy, gt, lt, quot, reg, and trade) do not exist in w3centities-f.ent, even though they still appear in the xhtml1-lat1.ent and xhtml1-special.ent files. Please fix.
Resolution:	Yes, sorry, I just noticed that last night. I messed up the arguments to sort. I wanted to change the sort order to bring AMP and amp together but I actually coalesced them. Corrected file has been committed.
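
The failure mode described in this resolution can be illustrated with a small sketch. The source does not record the actual sort arguments that were used, so the sample names and both helper functions below are hypothetical; the sketch only contrasts a case-insensitive ordering that keeps both spellings with one that also de-duplicates case-insensitively and so silently drops the lowercase names.

```python
# Hypothetical sample: names that differ only by case, as in the
# HTML5-UPPERCASE set and its lowercase equivalents.
names = ["amp", "AMP", "copy", "COPY", "gt", "GT", "lt", "LT"]


def sort_adjacent(names):
    """Case-insensitive ordering that keeps every distinct name."""
    # The original spelling is used as a tie-breaker, so "AMP" and "amp"
    # end up next to each other but both survive.
    return sorted(set(names), key=lambda n: (n.casefold(), n))


def sort_coalesced(names):
    """Case-insensitive ordering that also de-duplicates case-insensitively."""
    seen = {}
    for n in sorted(names):
        # The first spelling seen for a case-folded key wins; later ones are dropped.
        seen.setdefault(n.casefold(), n)
    return sorted(seen.values(), key=str.casefold)


adjacent = sort_adjacent(names)
coalesced = sort_coalesced(names)
print(adjacent)
print(coalesced)
```

In the second result only one spelling per name remains, which matches the symptom John Cowan reported: the lowercase definitions were absent from the generated file.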


Summary:	U02220-020D2 , five hexadecimal digits,...
Submitted:	Matthias Mittelstein, http://lists.w3.org/Archives/Public/www-math/2009Nov/0058.html
Response:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Nov/0059.html
Discussion:	(1) http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/glyphs/022/U02220-020D2.png seems to point to an existing file. Nevertheless it shows a placeholder only. (2) I prefer hexadecimal Unicode code point numbers to have four or six digits. May be that is old-fashioned and byte-oriented. But five digit numbers hurt my eyes especially in columns with the title "BMP". (3) http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/U0FE00.html , http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/U020D2.html , http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/U020D2.html show a variable number of Entity Names. What is the reason to show duplicates like " oplus, oplus, CirclePlus ".
Resolution:	> http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/glyphs/022/U02220-020D2.png
> seems to point to an existing file. Nevertheless it shows a placeholder only.

Ah, thanks for that. The PNG works in Firefox but not in IE. That has happened before occasionally; previously, when I used ImageMagick convert to convert the PNG (to anything and back again), it warned of some internal inconsistency and fixed it up:

    $ convert U02220-020D2.png x.gif
    convert: Incorrect tRNS chunk length `U02220-020D2.png'.

    David Carlisle@dcarlisle /home/w3c/WWW/2003/entities/2007doc/glyphs/022
    $ convert x.gif U02220-020D2.png

    David Carlisle@dcarlisle /home/w3c/WWW/2003/entities/2007doc/glyphs/022
    $ cvs commit -m "bad chunk length" U02220-020D2.png
    Checking in U02220-020D2.png;
    /w3ccvs/WWW/2003/entities/2007doc/glyphs/022/U02220-020D2.png,v  <--  U02220-020D2.png
    new revision: 1.2; previous revision: 1.1
    done

Yes, it seems to work now; try the editors' draft at http://www.w3.org/2003/entities/2007doc/U020D2.html Thanks.

> I prefer hexadecimal Unicode code point numbers to have four or six
> digits. May be that is old-fashioned and byte-oriented. But five digit
> numbers hurt my eyes especially in columns with the title "BMP".

Original versions of these tables (in MathML, going back a decade or so) used the internal U01234 form pretty much everywhere. This form has advantages in the internal build, as it is a valid XML ID (unlike the U+ form, which can't be used as an XML ID value), and consistently using 5 digits allows things to be sorted naively (until someone pushes some interesting characters into the 6 digit range ;-). However, in the visible text of the specification we've almost completely switched to using the Unicode U+1234 form, just using the original form for internal identifiers and PNG file names, so I suppose it makes sense to catch the remaining cases as well.
All the tables are generated, so changing notation isn't a big deal, just a matter of dropping in a suitable regular expression replace. I'll see what I can do.

> What is the reason to show duplicates like " oplus, oplus, CirclePlus
> ".

Well, they are duplicated because (in the case of oplus) the name is both in xhtml-symbol and in isoamsb, but since I don't show the set name there the duplication is not very helpful. I just checked in the stylesheet with the distinct-values() XPath function inserted, and the editors' draft now shows these just once: http://www.w3.org/2003/entities/2007doc/U020D2.html

Thanks for your comments.

David

And from the commenter on a private list:

From: "Mittelstein, Matthias" <matthias.mittelstein@sap.com>
To: David Carlisle <davidc@nag.co.uk>
Date: Mon, 23 Nov 2009 09:27:09 +0100
Subject: RE: [Entities-last-call] U02220-020D2 , five hexadecimal digits,...

Hello David,

I withdraw my proposal not to use five digits for plane 1 and 2. Even if I preferred hexadecimal Unicode code point numbers to have four or six digits, I will try to use five digits, where suitable. It is hard enough to explain Unicode to everybody. Using a consistent notation makes it a bit easier. Somehow I did not remember that sentence. But it was also in older Unicode books.
Matthias Mittelstein
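
The two mechanical changes mentioned in this resolution (a regular expression replace from the internal U01234 form to U+1234 notation, and removal of repeated entity names) can be sketched as follows. This is not the stylesheet code that was actually committed; the function names are hypothetical, and rendering a sequence as space-separated code points is an assumption made here for illustration.

```python
import re

# Internal form: "U" followed by exactly five hex digits. Sequences are joined
# with "-" and later members drop the "U" (as in "U02220-020D2").
_INTERNAL_ID = re.compile(r"\bU([0-9A-F]{5})((?:-[0-9A-F]{5})*)\b")


def _u_plus(hex5):
    # Unicode convention: at least four hex digits, no further leading zeros.
    return "U+{:04X}".format(int(hex5, 16))


def to_u_plus(text):
    """Rewrite internal U01234 identifiers found in text to U+1234 notation."""
    def repl(match):
        parts = [match.group(1)] + [p for p in match.group(2).split("-") if p]
        # Assumption: sequences are shown space-separated rather than hyphenated.
        return " ".join(_u_plus(p) for p in parts)
    return _INTERNAL_ID.sub(repl, text)


def distinct(names):
    """Order-preserving de-duplication, similar in spirit to XPath distinct-values()."""
    return list(dict.fromkeys(names))


print(to_u_plus("U02220-020D2"))
print(to_u_plus("see U00009 and U1D53F"))
print(distinct(["oplus", "oplus", "CirclePlus"]))
```

Such a rewrite would apply only to visible text; the five-digit form remains in use for XML IDs and PNG file names, as the response explains.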


Summary:	isotech error with lowast?
Submitted:	Bruce Rosenbloom, http://lists.w3.org/Archives/Public/www-math/2009Nov/0043.html
Response:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Nov/0045.html http://lists.w3.org/Archives/Public/www-math/2009Dec/0001.html
Discussion:	> Shouldn't this be defined as:
>
> <!ENTITY lowast "⁎" ><!--low asterisk -->

One might expect that, but like some others (most notably asymp) the definitions are somewhat skewed by HTML compatibility. Many of the HTML4 entity definitions are somewhat strange, but the HTML4 symbol.ent file linked from http://www.w3.org/TR/html4/sgml/dtd.html defines lowast to be

<!ENTITY lowast CDATA "∗" -- asterisk operator, U+2217 ISOtech -->

The HTML entity sets are baked into code in multiple browsers of every desktop and mobile phone on the planet, and nothing that is put in a DTD entity file will change that, as most HTML systems don't read these definitions from any kind of declarative file. Thus it seems futile to publish an HTML DTD with definitions different from the ones actually implemented, and it would be odd to try to make XHTML incompatible with HTML with respect to entity definitions.

If the MathML set were incompatible with XHTML then the meaning of lowast throughout an xhtml+mathml document would depend on the technical details of how the DTDs were combined, but whichever definition "won" it would mean the same thing throughout the document; you can't make the expansion of the entity sensitive to which element the entity is in.

The main aim of taking the entity set definitions out of MathML into their own spec at http://www.w3.org/TR/xml-entity-names/ is to get a common set of definitions that can be used across languages. Where existing usage in different communities is incompatible, getting a common set means something has to change, and the above considerations mean that essentially, if there was a conflict, the HTML definition was taken.

I hope that explains why things are as they are.
Resolution:	David, Thanks for the clear and complete explanation. It's helpful, and I appreciate your time to reply. Best regards, Bruce
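
The point made in the discussion above, that whichever definition "wins" applies throughout the document regardless of the containing element, can be checked with a small sketch. In XML 1.0 the first declaration of an entity is the binding one. The document string and the element name m are made up for illustration, and the order of the two declarations is an assumption; Python's standard library is used only as a convenient XML parser and as a record of the HTML definition.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Two conflicting declarations of the same name in one internal DTD subset.
# The first (U+2217, the HTML4 value) is binding; the second is ignored.
DOC = """<!DOCTYPE p [
  <!ENTITY lowast "&#x2217;">
  <!ENTITY lowast "&#x204E;">
]>
<p>&lowast;<m>&lowast;</m></p>"""

root = ET.fromstring(DOC)
outer = root.text            # expansion directly inside <p>
inner = root.find("m").text  # expansion inside the nested, hypothetical <m>

print("U+%04X" % ord(outer), "U+%04X" % ord(inner))
print("HTML4 lowast: U+%04X" % name2codepoint["lowast"])
```

Both references expand to the same character, so the only way to keep lowast consistent between HTML and XHTML+MathML is for the entity sets themselves to agree.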


Summary:	Series of comments from I18N direction
Submitted:	Martin Dürst, http://lists.w3.org/Archives/Member/member-i18n-core/2009Nov/0006.html
Response:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Nov/0054.html
Discussion:	Since the comments were offered on a non-public list, but were valuable, it seems correct to reproduce essentially all of them here.

Please proceed on any of my comments either as a quick fix before LC publication or as an input for LC. ..

Now for the comments themselves:

Title: "XML Entity definitions for Characters" looks very ambiguous. I think something like "XML Entity Definitions for Characters used by MathML" or so would help the general public a lot to understand the context and coverage of the document.

Abstract: "This document defines several sets of names which are assigned to Unicode characters. Each of these sets is also implemented as a file of XML entity declarations.":

First, this says that the names are the main stuff, and the XML entities are just an implementation detail. This is a contradiction to the title, where XML entities are the main thing.

Second, "sets of names which are assigned to Unicode characters" is unclear as to whether a set of names is assigned to a Unicode character, or something else. The same problem is present elsewhere (e.g. first sentence of the Introduction).

Third, all Unicode characters have official names (e.g. LATIN CAPITAL LETTER A for U+0041). These are a very important part of nailing down the identity of a character. It would be good if either the abstract or the Introduction or both would make clear that what you are dealing with are short mnemonic names that are different from the official Unicode names.

Fourth, names being assigned to Unicode characters doesn't sound right. This may be a programmer's viewpoint, but what you are doing, in terms of an average programming language, is to assign Unicode codepoints/characters to entity names, not the other way round. XML entities in this sense are not much different from variables in a programming language, so it would help a lot to keep things straight.

Introduction: "The W3C Math Working Group has been invited to take over the maintenance and development of these sets by the original standards committee (ISO/IECJTC1 SC34).": It should say somewhere that this document is the result of this "taking over".

There should be a section on Notation, which explains things such as U+ and leading slashes (is that TEX?).

Tables:

http://www.w3.org/2003/entities/2007doc/bycodes.html:
- Instead of U00009 and the like, please use the official U+0009 notation, and do not use a hyphen for character sequences, as this may look like a character range.
- Use a <table> so that this displays decently even with non-proportional fonts (you can then eliminate the ugly commas). There are lots of cases where <table> is misused in Web pages, but this is clearly a case where it is "misunused" or "misnonused" or whatever one would call the absence of the use of a feature when such use is clearly warranted.
- Use proper table headings
- For character sequences, use e.g. "LESS-THAN SIGN with COMBINING LONG VERTICAL LINE OVERLAY" rather than "LESS-THAN SIGN with vertical line"

http://www.w3.org/2003/entities/2007doc/byalpha.html:
- Similar comments as for bycodes.html
- I don't understand why this table contains the origins/collections, but bycodes.html doesn't.
- I don't understand the lowercase stuff at the end of each line. It seems to be some kind of annotations, but in some cases is totally useless (e.g. [LATIN SMALL LETTER A WITH CIRCUMFLEX], latin capital letter A with circumflex)
- This table puts the official Unicode names in "[" and "]", but bycodes.html doesn't. Why? There should be no such gratuitous differences.

http://www.w3.org/2003/entities/2007doc/000.html and similar:

Please add a note to all the pages with lots of small glyphs that it may take time to load all the images to see all the glyphs. (one test run with Mozilla Firebug took 37 seconds on a broadband connection).

Please use a stable, final location for all these GIFs.
It's okay to have an occasional "301 Moved Permanently" for a page, but it essentially doubles the number of objects your page has to download from 256 to 512. Even the former isn't pretty, the latter is definitely bad and totally unnecessary. (the redirects come from URIs of the form http://www.w3.org/2003/entities/glyphs/003/U003FF.png, the actual images seem to be at places such as http://www.w3.org/2003/entities/2007doc/glyphs/003/U003FF.png)

Codepoints U+0000 through U+0010 (with three exceptions) are shown as "Unicode or XML Non-Character". They are valid control characters in Unicode. Strangely enough, there are also such cases (red background color) in the U+1D4xx and U+1D5xx 'blocks'. A codepoint such as U+1D53F is simply <reserved> in Unicode, the Unicode consortium could decide to allocate a character there in the future. This is no different at all from all the characters that you marked with a yellow background. The only codepoints that are actually non-characters in Unicode are cases such as U+FFFF and the like, but you don't have any of these. I therefore suggest that the red backgrounds in the U+1D4xx and U+1D5xx 'blocks' have to be turned to yellow, and the text for the red background should be changed to "Characters not representable in XML 1.0" or some such (most of them would be representable in XML 1.1).

For codepoints with a yellow background, the legend says "XML Character not currently described in Unicode". The term "XML Character" is really strange. XML uses Unicode, there are no "XML Characters". The cells with yellow backgrounds represent unassigned (reserved) Unicode codepoints. So the best legend would be "reserved Unicode codepoint (no character currently assigned)" or something similar.

Putting the "Next" link above the "Previous" link at the top and bottom of these tables seems counterintuitive, because the overall flow is from top to bottom.

For http://www.w3.org/2003/entities/2007doc/double-struck.html and similar:

Why do some rows have a yellow background? There's no explanation, so the reader is left guessing.

Why do some of these characters not have any corresponding entity names at all?

Section 3:

Title: An "Unicode Character Block": As you can see from http://unicode.org/Public/UNIDATA/Blocks.txt, Unicode blocks are not of equal size of 256 characters, and are not all aligned on boundaries divisible by 256. But the reader can easily get such an impression. The title, or the text below it, should be changed to reflect this, unless (which would be more appropriate for the document (see next comment), but may be difficult in terms of production costs) actual Unicode blocks are used.

I don't understand why Arabic presentation forms are (as indicated by the yellow background) available in the STIX fonts, when basic Arabic isn't. Turning things around, would a font for Math or Science have to support these? The sentence "The following tables display Unicode ranges containing the characters that are most used in mathematics." at the start of section 3 seems to suggest so.

Turning things around: Are these tables for all the 256-character-sized, aligned parts that contain one or more of the characters for which entities have been defined in this document? If yes, please say so. If no, please say what the differences are.

Section 5, first sentence: "there are some that use multiple character combinations": "multiple character combinations" is "multiple combinations of characters". However, characters are used in sequences, not in combinations. So "a sequence of multiple characters" or so would be better.

Editorial:
- Please change 'definitions' to 'Definitions' in the title, or adopt any other W3C approved consistent casing convention. That such an inconsistency is 'traditional for this document' shouldn't be a reason to keep it.
Section 1, first sentence: "especially in scientific documents, especially in mathematics": Repetition; unclear about the relationship between the two clauses introduced by 'especially'.
Section 1, second sentence: "has grown in part because its notation continually changes": I suggest changing "changes" to "changed" to align the tenses.
Section 1, first paragraph: "It is difficult to write science fluently" -> "It is difficult to write scientific texts fluently"; same later for "read science".
Section 3, first sentence: "Certain characters are of of particular relevance": "of of" -> "of"
Section 5, first sentence: "character, however" -> "character. However" or "character, but" (however starts a new sentence)

Hope this helps, Martin.

This plethora of points was answered by David Carlisle on the public Math list:

Martin, Thanks for your comments on the entities draft. I've changed the CC list and will handle them as LC comments (as the LC draft publication is imminent). These are _personal_ first impressions, not a formal response to the comments. ...

Then see below under the Resolution section for details.
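
The categories of codepoint that the comments above ask the charts to distinguish can be told apart mechanically. The following is a minimal sketch, not the logic of the stylesheet that generates the charts: the function names and returned labels are hypothetical, the XML test follows the Char production of XML 1.0, and the "reserved" test relies on the Unicode data bundled with the Python interpreter, so its answers depend on that Unicode version.

```python
import unicodedata


def is_xml10_char(cp):
    """True if cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_noncharacter(cp):
    """True for Unicode noncharacters: U+FDD0..U+FDEF and every U+nFFFE / U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Return a hypothetical legend label for a codepoint."""
    if not is_xml10_char(cp):
        return "not allowed as XML 1.0 character data"
    if is_noncharacter(cp):
        return "Unicode noncharacter"
    if unicodedata.category(chr(cp)) == "Cn":
        # General category Cn: no character currently assigned.
        return "reserved (no character currently assigned)"
    return "assigned character"


for cp in (0x0000, 0x0009, 0x1D538, 0x1D53F, 0xFDD0):
    print("U+%04X: %s" % (cp, classify(cp)))
```

Under these definitions U+0000 falls in the first class, U+0009 is an ordinary (control) character usable in XML 1.0, and U+1D53F is merely reserved, which is the distinction Martin draws between the red and yellow cells.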
Resolution:	Note the final response was not sent by the commenter to the public list but in response to a prompt from Chris Lilley at W3C just before the Director's meeting on advancing the specification to a Proposed Recommendation. Relevant extracts are therefore reproduced below: From: duerst Subject: Re: URGENT: your comments on "XML Entity definitions for Characters" Date: February 4, 2010 5:58:08 AM EST (CA) Great, thanks! Regards, Martin. On 2010/02/04 18:52, David Carlisle wrote: \| Sorry, I just changed it to use exactly the wording you suggest. sh-3.2$ cvs commit -m "update unassigned legend for MD" characters.xsl Checking in characters.xsl; /w3ccvs/WWW/2003/entities/2007xml/characters.xsl,v <-- characters.xsl new revision: 1.54; previous revision: 1.53 From: duerst Subject: Re: URGENT: your comments on "XML Entity definitions for Characters" Date: February 4, 2010 3:25:26 AM EST (CA) To: chris@w3.org Sorry to be so late in responding. I don't really feel like I have been able to read email the last two/three weeks, very busy time of the year. I'm quite satisfied with how my comments were addressed. However, what in fact originally got me to look closer at the document unfortunately hasn't been fixed at all. It is the legend given to the light yellow cells in the "code range" charts (see e.g. http://www.w3.org/2003/entities/2007doc/003.html). It still says "XML Character not currently described in Unicode". The previous exchange on this was: >> For codepoints with a yellow background, the legend says "XML Character >> not currently described in Unicode". The term "XML Character" is really >> strange. XML uses Unicode, there are no "XML Characters". > >"XML Characters" is intended to mean something matching the XML char >production, that is, a character usable as character data in XML, >which is a bit less than full unicode range as you know. However >the legend has been reworded as noted in the previous comment. Something seems to be missing, because e.g. 
for the cells with light purple background (e.g. at http://www.w3.org/2003/entities/2007doc/000.html), the text now says "Codepoint not allowed as XML 1.0 character data", so I would have expected the text for the light yellow cells to say something like "Codepoint allowed as XML 1.0 character data; no Unicode character defined" or something similar. Anyway, David's response says it has been changed but those changes seem to have been lost. I'm sure that can be fixed quickly, and the document can go ahead. Regards, Martin. -- #-# Martin J. Dürst, Professor, Aoyama Gakuin University Martin, thanks again for your comments on the last call draft of XML Entity Definitions for Characters The last call draft is at the URI: http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/ An Editors' draft showing the changes made in response to LC comments so far is available at the URI: http://www.w3.org/2003/entities/2007doc/Overview.html I hope we have addressed all the points that you have raised. As you will know, the W3C process requires that we log the resolution of every last call comment, so we would appreciate it if you could confirm via an email to www-math list whether all the points you have raised have been addressed satisfactorily. David > > Now for the comments themselves: > > Title: "XML Entity definitions for Characters" looks very ambigous. I > think something like "XML Entity Definitions for Characters used by > MathML" or so would help the general public a lot to understand the > context and coverage of the document. Although parts of this document were derived from the MathML2 spec sources, this is explicitly _not_ just for MathML. 
It includes several entity sets that are not included in the MathML DTD (isogrk1, isogrk2, isogrk4, xhtml1-lat1, xhtml1-special, xhtml1-symbol, html5-uppercase) So as well as being used for MathML it can be used for HTML (HTML5 uses these definitions for example) and serves as an update for the (now cancelled) ISO/IEC document 9573-13 defining the ISO entity sets. It was for example cited in the docbook documentation for use with docbook (now that docbook5 is RelaxNG defined and does not have its own set of entity definitions). Thus it is important that the title does not mention MathML as it is explicitly not just for MathML. > > abstract: "This document defines several sets of names which are > assigned to Unicode characters. Each of these sets is also implemented > as a file of XML entity declarations.": > First, this says that the names are the main stuff, and the XML entities > are just an implementation detail. This is a contradiction to the title, > where XML entities are the main thing. The statement you quote is factually true, however we have reworded it to remove the implied relative importance of the different aspects. > Second, "sets of names which are assigned to Unicode characters" is > unclear as to whether a set of names is assigned to a Unicode > character, or something else. The same problem is present elsewhere > (e.g. first sentence of the Introduction) This has been reworded to clarify this. > Third, all Unicode characters have official names (e.g. LATIN CAPITAL > LETTER A for U+0041). These are a very important part of nailing down > the identity of a character. It would be good if either the abstract or > the Introduction or both would make clear that what you are dealing with > are short mnemotic names that are different from the official Unicode names. A comment pointing this out has been added to the introduction. > Fourth, names being assigned to Unicode characters doesn't sound > right. 
This may be a programmer's viewpoint, but what you are doing, in > terms of an average programmig language, is to assign Unicode > codepoints/characters to entity names, not the other way round. XML > entities in this sense are not much different from variables in a > programming language, so it would help a lot to keep things straight. > It is of course possible to view this mapping in either direction. and in fact the mappings are implemented in both directions by the xml entity files and the xslt character maps respectively. Although being a many-many map these are not exact inverses. However as you say, it is probably clearer to use the wording of assigning codepoints to names rather than the other way round, and the document has been edited accordingly wherever it used "assigned". > > Introduction: > "The W3C Math Working Group has been invited to take over the > maintenance and development of these sets by the original standards > committee (ISO/IECJTC1 SC34).": It should say somewhere that this > document is the result of this "taking over". > Well historically the document began before SC34 considered updating 9573-13 and a long time before they decided to cancel that project. Informally they cancelled the project because this set was being more actively maintained and although I was editing both documents I couldn't keep to SC34 timescales as I couldn't get ahead of mathml3 and html5, however we shouldn't speculate on the reasons behind the SC34 decision in the W3C REC track document. > > There should be a section on Notation, which explains things such as U+ > and leading slashes (is that TEX?). > It's pseudo TeX used (without explanation) in the original ISO standard. The original ISO entity definitions only gave those descriptions (and no unicode mappings) and the job really is to match those to unicode in the most sane way possible subject to compatibility constraints. 
So I don't want to change the entity description texts in any way as they are the reference point for comparison to the ISO standards. > > Tables: > http://www.w3.org/2003/entities/2007doc/bycodes.html: > - Instead of U00009 and the like, please use the official U+0009 > notation, and do not use a hyphen for character sequences, as this may > look like a character range. We have revised the document to use U+ notation consistently. The U12345 ID form is just now used for internal linking, and for filenames, not for referring to codepoints on the text or tables. > - Use a <table> so that this displays decently even with > non-proportional fonts (you can then eliminate the ugly commas). There > are lots of cases where <table> is misused in Web pages, but this is > clearly a case where it is "misunused" or "misnonused" or whatever one > would call the absence of the use of a feature when such use is clearly > warranted. > - Use proper table headings > - For character sequences, use e.g. "LESS-THAN SIGN with COMBINING LONG > VERTICAL LINE OVERLAY" rather than "LESS-THAN SIGN with vertical line" > There were explicit requests from developers (when this table was in MathML2) for an ascii file that could easily be tested against code, the format that developed with the monospace layout but including some hyperlinking is a compromise. > http://www.w3.org/2003/entities/2007doc/byalpha.html: > - Similar comments as for bycodes.html > - I don't understand why this table contains the origins/collections, > but bycodes.html doesn't. > - I don't understand the lowercase stuff at the end of each line. It > seems to be some kind of annotations, but in some cases is totally > useless (e.g. [LATIN SMALL LETTER A WITH CIRCUMFLEX], latin capital > letter A with circumflex) The final field is the original ISO entity description. 
If it looks the same as the unicode formal name then that is good; it isn't superfluous: it is confirmation that the entity has been paired with the right unicode character. We note again that the original ISO entity definitions _only_ gave those lower case descriptions not any unicode mapping. However the order of the columns has now been changed so that this entity description now comes after the entity name, with the Unicode codepoint and formal name being the last two columns. Also information has been added to the top of the file explaining what is in each column. > - This table puts the official Unicode names in "[" and "]", but > bycodes.html doesn't. Why? There should be no such gratutious differences. Accepted as an editorial improvement. Also the order of the columns has been changed to put the entity description after the entity name rather than after the Unicode formal name, and a paragraph describing the column format has been added at the start of the page. > http://www.w3.org/2003/entities/2007doc/000.html and similar: > Please add a note to all the pages with lots of small glyphs that it may > take time to load all the images to see all the glyphs. (one test run > with Mozilla Firebug took 37 seconds on a broadband connection). A suitable warning note has been added. > Please use a stable, final location for all these GIFs. It's okay to > have an occasional "301 Moved Permanently" for a page, but it > essentially doubles the number of objects your page has to download from > 256 to 512. Even the former isn't pretty, the later is definitely bad > and totally unnecessary. (the redirects come from URIs of the form > http://www.w3.org/2003/entities/glyphs/003/U003FF.png, the actual images > seem to be at places such as > http://www.w3.org/2003/entities/2007doc/glyphs/003/U003FF.png) You happened to review the document while it was in transition, and the redirects were put in place to keep everything working. 
Current builds directly reference the new location of the png images, and the redirects would only be used if someone has linked to the old locations. > Codepoints U+0000 through U+0010 (with three exceptions) are shown as > "Unicode or XML Non-Character". They are valid control characters in > Unicode. Yes, they are valid in Unicode but not in XML 1.0, hence "Unicode or XML", but see below. > Strangely enough, there are also such cases (red background > color) in the U+1D4xx and U+1D5xx 'blocks'. A codepoint such as U+1D53F > is simply <reserved> in Unicode, the Unicode consortium could decide to > allocate a character there in the future. This is no different at all > from all the characters that you marked with a yellow background. The > only codepoints that are actually non-characters in Unicode are cases > such as U+FFFF and the like, but you don't have any of these. I > therefore suggest that the red backgrounds in the U+1D4xx and U+1D5xx > 'blocks' have to be turned to yellow, and the text for the red > background should be changed to "Characters not representable in XML > 1.0" or some such (most of them would be representable in XML 1.1). All except 0000 would be representable in XML 1.1 as numeric references I think. XML 1.1 came out after that text was written... We don't want to mark the reserved "holes" in the 1Dxxx blocks the same as completely unallocated codepoints. The various cases are now separately distinguished (codepoint not usable in XML 1.0, reserved codepoint in plane 1, unallocated codepoint); these have been given different CSS classes and colours, and the key on each table identifies the cases that occur on that page. > > For codepoints with a yellow background, the legend says "XML Character > not currently described in Unicode". The term "XML Character" is really > strange. XML uses Unicode, there are no "XML Characters". 
"XML Characters" is intended to mean something matching the XML char production, that is, a character usable as character data in XML, which is a bit less than full unicode range as you know. However the legend has been reworded as noted in the previous comment. > The cells with > yellow backgrounds represent unassigned (reserved) Unicode codepoints. > So the best legend would be "reserved Unicode codepoint (no character > currently assigned)" or something similar. Looking at it from a unicode viewpoint it makes sense to say it's a codepoint to which no character is currently assigned. But looking at it from an xml viewpoint it _is_ a character (or more exactly it corresponds to well formed character data matching the char production) but unicode has not assigned any interpretation for that character. As noted above the tables now distinguish more cases, separating out the control characters (not usable directly in XML) from the reserved codepoints. > > Putting the "Next" link above the "Previous" link at the top and bottom > of these tables seems counterintuitive, because the overall flow is from > top to bottom. > The ordering was inconsistent, we have now consistently ordered these links as suggested. > > For http://www.w3.org/2003/entities/2007doc/double-struck.html and similar: > > Why do some rows have a yellow background? There's no explanation, so > the reader is left guessing. They are highlighting the cases that are in the BMP not in the (possibly?) expected runs in the 1Dxxx block. This was explained in a note at the start of the section (in the overview document) however we have added an additional footnote at the bottom of each affected page. > > Why do some of these characters not have any corresponding entity names > at all? Because, as stated explicitly in the introduction, this specification doesn't define any new names, it only allocates unicode code points to names previously thought up by ISO or the W3C. 
> > > Section 3: > > Title: An "Unicode Character Block": As you can see from > http://unicode.org/Public/UNIDATA/Blocks.txt, Unicode blocks are not of > equal size of 256 characters, and are not all alligned on boundaries > divisible by 256. But the reader can easily get such an impression. The > title, or the text below it, should be changed to reflect this, unless > (which would be more appropriate for the document (see next comment), > but may be difficult in terms of production costs) actual Unicode blocks > are used. > Yes in the table of contents all the block names that occur in the 256 square are listed, with "(continued)" added when the blocks run over. The section title has been changed to use "Ranges" rather than "Blocks" to avoid any impression that the 256 squares are Blocks. > I don't understand why Arabic presentation forms are (as indicated by > the yellow background) available in the STIX fonts, when basic Arabic > isn't. Turning things around, would a font for Math or Science have to > support these? The sentence "The following tables display Unicode ranges > containing the characters that are most used in mathematics." at the > start of section 3 seems to suggest so. Given the list of blocks most used in science/mathematics (eg as listed in unicode report 25) every 256-aligned range that covers those blocks is listed, which means that some additional characters are shown in the tables. The exact details of the Arabic support are somewhat in flux as there are unicode proposals to add variant forms (in a similar manner to the variants for latin and greek in 1d4xx and 1d5xx) and as for the latin/greek cases there is some discussion as to whether existing variant letters in the BMP should be reused. > > Turning things around: Are these tables for all the 256-character-sized, > aligned parts that contain one or more of the characters for which > entities have been defined in this document? If yes, please say so. 
> If no, please say what the differences are.
As above; they are tables for all the 256-character-sized, aligned parts that contain a math/science related block as listed in Unicode TR 25.
> Section 5, first sentence: "there are some that use multiple character combinations": "multiple character combinations" is "multiple combinations of characters". However, characters are used in sequences, not in combinations. So "a sequence of multiple characters" or so would be better.
OK, change made.
> Editorial:
> - Please change 'definitions' to 'Definitions' in the title, or adopt any other W3C approved consistent casing convention. That such an inconsistency is 'traditional for this document' shouldn't be a reason to keep it.
Agreed, d changed to D.
> Section 1, first sentence: "especially in scientific documents, especially in mathematics": Repetition; unclear about the relationship between the two clauses introduced by 'especially'.
Agreed, this has been reworded.
> Section 1, second sentence: "has grown in part because its notation continually changes": I suggest changing "changes" to "changed" to align the tenses.
The tense of "changes" is intentional here. The evolution is still in progress.
> Section 1, first paragraph: "It is difficult to write science fluently" -> "It is difficult to write scientific texts fluently"; same later for "read science".
This has been reworded.
> Section 3, first sentence: "Certain characters are of of particular relevance": "of of" -> "of"
The spurious "of" has been deleted.
> Section 5, first sentence: "character, however" -> "character. However" or "character, but" (however starts a new sentence)
I think "however" is being used as a conjunction there rather than starting a new sentence; however the phrase has been reworded.
Thanks again for the comments,
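
Two points from the exchange above lend themselves to a small illustration: which code points match the XML Char production (the sense in which "XML Character" was used), and which 256-aligned ranges are needed to cover a Unicode block. The following is only a sketch. The function names are invented for this example, the Char ranges are transcribed from the XML 1.0 and XML 1.1 recommendations, and the three block boundaries are sample values taken from Blocks.txt that should be checked against the current file.

```python
def is_xml10_char(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp):
    """True if the code point matches the XML 1.1 Char production."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def aligned_ranges(first, last):
    """Start code points of every 256-aligned range overlapping [first, last]."""
    return list(range(first & ~0xFF, last + 1, 0x100))


# Sample block boundaries (assumed to match Blocks.txt; not the full list).
SAMPLE_BLOCKS = {
    "Greek and Coptic": (0x0370, 0x03FF),
    "Arrows": (0x2190, 0x21FF),
    "Mathematical Alphanumeric Symbols": (0x1D400, 0x1D7FF),
}

for name, (first, last) in SAMPLE_BLOCKS.items():
    starts = ", ".join(f"U+{s:05X}" for s in aligned_ranges(first, last))
    print(f"{name}: {starts}")
```

Under these definitions U+0001 is excluded by XML 1.0 but allowed by XML 1.1, U+0000 is excluded by both, and a reserved code point such as U+1D53F matches Char in either version even though Unicode assigns it no character. The range helper also shows why a 256 square can display characters from neighbouring blocks: Greek and Coptic starts at U+0370 but its covering range starts at U+0300.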


Summary:	htmlmath entities collection
Submitted:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2009Dec/0000.html
Response:	,
Discussion:	This is a last call comment for the public record, noting an addition made to the Editors' draft of the XML Entity Definitions for Characters specification. The change has already been made to the Editors' draft, so it will be assumed resolved; however, if anyone would like to comment on this addition, please do reply to this message (on the www-math list).
The entities last call draft
http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/
includes a combined entity file w3centities-f that defines all the entities defined in the specification. As noted in a thread a while ago on the public-html list
http://lists.w3.org/Archives/Public/public-html/2009Nov/0305.html
for some uses it would be more useful to have a combined entity set just including those sets used in MathML and HTML and omitting the other ISO entity sets that are not typically used in a web context. This would correspond closely with an updated version of the entity file that Firefox uses for MathML documents, for example.
In the Editors' draft
http://www.w3.org/2003/entities/2007doc/Overview.html#sets
I've now added two files: htmlmathml.ent (which references each of the entity sets used by MathML or HTML) and htmlmathml-f.ent, which directly contains each definition, sorted into alphabetic order, with duplicates removed.
http://www.w3.org/2003/entities/2007/htmlmathml.ent
http://www.w3.org/2003/entities/2007/htmlmathml-f.ent
There are no changes to any of the defined entity definitions resulting from this addition.
David
Resolution:	Self-closing editorial comment.
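
As a rough illustration of what "directly contains each definition, sorted into alphabetic order, with duplicates removed" could mean in practice, the sketch below flattens several entity sets into one list. It is not the process used to build htmlmathml-f.ent, which is not described here. The function name, the regular expression, and the two sample sets are all made up for this example, and the sort shown is plain code point order, which may differ from the ordering of the real file. Keeping the first declaration of a name follows the XML rule that the first declaration encountered is binding.

```python
import re

# Matches simple declarations of the form <!ENTITY name "value" >
ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+"([^"]*)"\s*>')


def flatten_entity_sets(sets):
    """Merge declarations from several entity sets into one sorted list.

    When a name is declared more than once, the first declaration wins.
    """
    merged = {}
    for text in sets:
        for name, value in ENTITY_DECL.findall(text):
            merged.setdefault(name, value)
    return [f'<!ENTITY {name} "{merged[name]}" >' for name in sorted(merged)]


# Illustrative stand-ins only; these are not the contents of any real file.
set_a = '<!ENTITY lE "&#x02266;" >\n<!ENTITY gE "&#x02267;" >'
set_b = '<!ENTITY LessFullEqual "&#x02266;" >\n<!ENTITY gE "&#x02267;" >'

flat = flatten_entity_sets([set_a, set_b])
for declaration in flat:
    print(declaration)
```

The duplicate gE declaration appears once in the result, so a file built this way can be read by an XML parser without relying on which copy of a repeated declaration comes first.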


Summary:	Possible bug in the operator dictionary and mmlalias.ent?
Submitted:	Bobby Thomale, http://lists.w3.org/Archives/Public/www-math/2010Feb/0005.html
Response:	David Carlisle, http://lists.w3.org/Archives/Public/www-math/2010Feb/0007.html
Discussion:	I have been going through the operator dictionary writing text descriptions of each mathematical operator. While I was doing this, I noticed what appears to be an error in the mmlalias.ent file, as well as the entity tables on the MathML standard website. At the top of the file it says to report bugs on this list.
If you look at the following entities in mmlalias.ent:
<!ENTITY NotGreater "≯" ><!--alias ISOAMSN ngt -->
<!ENTITY NotGreaterEqual "≱" ><!--alias ISOAMSN nge -->
<!ENTITY NotGreaterFullEqual "≦̸" ><!--alias ISOAMSN nlE -->
<!ENTITY NotGreaterGreater "≫̸" ><!--alias ISOAMSN nGtv -->
NotGreaterFullEqual is, I believe, in error. If you look at the glyphs, the symbol being negated, ≦, is actually a less than sign over a full equals.
<!ENTITY LessFullEqual "≦" ><!--alias ISOAMSR lE -->
http://www.fileformat.info/info/unicode/char/2266/index.htm
The other "not greater" glyphs are, correctly, variants of greater than symbols with a negation line through them.
You can also see this problem if you search the following table for "NotGreaterFullEqual":
http://www.w3.org/TR/MathML/isoamsn.html
It's clearly mapped to "LESS-THAN OVER EQUAL TO with slash." Seems wrong.
Also odd is the fact that NotGreaterFullEqual is defined here and listed in the operator dictionary, but NotLessFullEqual isn't. There are 7 entities beginning with "NotGreater..." defined in the operator dictionary, and only 6 "NotLess..." ones. NotGreaterFullEqual is the one that does not have a corresponding NotLess... defined.
http://www.w3.org/TR/MathML2/appendixf.html
"&NotGreater;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotGreaterEqual;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotGreaterFullEqual;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotGreaterGreater;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotGreaterLess;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotGreaterSlantEqual;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotGreaterTilde;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotLess;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotLessEqual;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotLessGreater;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotLessLess;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotLessSlantEqual;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
"&NotLessTilde;" form="infix" lspace="thickmathspace" rspace="thickmathspace"
-- Bobby Thomale Vital Source Technologies http://www.vitalsource.com
http://lists.w3.org/Archives/Public/www-math/2010Feb/0008.html
Oh, thanks. It would seem that that has been wrong forever, at least since MathML 1 in 1998
http://www.w3.org/TR/1998/REC-MathML-19980407/chap6/MMALIAS2.html
We have a working group phone call this afternoon and I'll put this on the agenda, but it seems like this is definitely wrong unless there's some subtlety I'm missing.
David
http://lists.w3.org/Archives/Public/www-math/2010Feb/0009.html
Yes, that sounds right to me too. Also you might want to consider adding NotLessFullEqual as the negation of:
<!ENTITY LessFullEqual "≦" ><!--alias ISOAMSR lE -->
for completeness, since all of the other variants of greater and not greater have corresponding less than and not less than entities defined. Somehow those two symbols probably got mixed up and combined when the list was initially being compiled.
-- Bobby Thomale
http://lists.w3.org/Archives/Public/www-math/2010Feb/0010.html
It's clear that this is "missing" in some sense, but we're taking a fairly hard line in not adding any new names. If we add a name and someone uses it, then any fragments of MathML using it that get moved to a document using an older DTD are not well formed and the XML parser would reject the entire document. This is why, for example, the double struck letters only have entities for upper case, not lower case, even though there are now defined Unicode code points for upper and lower case.
David
http://lists.w3.org/Archives/Public/www-math/2010Feb/0011.html
Interesting. Ok, that makes sense. Fixing the entity that's named wrong is the important part anyhow.
-- Bobby Thomale
Resolution:	On Feb 4, 2010, at 8:53 PM, David Carlisle wrote:
I fixed the "typos" relating to Unicode and NotGreaterFullEqual. It affected several tables in the doc and I thought I'd better record it in the change log in both B.1 and C.2. Altogether the files below changed....
David
Checking in 2007/htmlmathml-f.ent; /w3ccvs/WWW/2003/entities/2007/htmlmathml-f.ent,v <-- htmlmathml-f.ent new revision: 1.4; previous revision: 1.3 done
Checking in 2007/mmlalias.ent; /w3ccvs/WWW/2003/entities/2007/mmlalias.ent,v <-- mmlalias.ent new revision: 1.14; previous revision: 1.13 done
Checking in 2007/mmlaliasmap.xsl; /w3ccvs/WWW/2003/entities/2007/mmlaliasmap.xsl,v <-- mmlaliasmap.xsl new revision: 1.10; previous revision: 1.9 done
Checking in 2007/w3centities-f.ent; /w3ccvs/WWW/2003/entities/2007/w3centities-f.ent,v <-- w3centities-f.ent new revision: 1.13; previous revision: 1.12 done
Checking in 2007doc/Overview.html; /w3ccvs/WWW/2003/entities/2007doc/Overview.html,v <-- Overview.html new revision: 1.25; previous revision: 1.24 done
Checking in 2007doc/U00338.html; /w3ccvs/WWW/2003/entities/2007doc/U00338.html,v <-- U00338.html new revision: 1.10; previous revision: 1.9 done
Checking in 2007doc/byalpha.html; /w3ccvs/WWW/2003/entities/2007doc/byalpha.html,v <-- byalpha.html new revision: 1.26; previous revision: 1.25 done
Checking in 2007doc/bycodes.html; /w3ccvs/WWW/2003/entities/2007doc/bycodes.html,v <-- bycodes.html new revision: 1.23; previous revision: 1.22 done
Checking in 2007doc/isoamsn.html; /w3ccvs/WWW/2003/entities/2007doc/isoamsn.html,v <-- isoamsn.html new revision: 1.17; previous revision: 1.16 done
Checking in 2007doc/mmlalias.html; /w3ccvs/WWW/2003/entities/2007doc/mmlalias.html,v <-- mmlalias.html new revision: 1.20; previous revision: 1.19 done
Checking in 2007xml/character-set.xml; /w3ccvs/WWW/2003/entities/2007xml/character-set.xml,v <-- character-set.xml new revision: 1.85; previous revision: 1.84 done
Checking in 2007xml/unicode.xml; /w3ccvs/WWW/2003/entities/2007xml/unicode.xml,v <-- unicode.xml new revision: 1.47; previous revision: 1.46 done
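
Both halves of this thread can be checked with a few lines of Python, offered here as a sketch rather than as part of the resolution. The first part looks up the Unicode names behind the reported value of NotGreaterFullEqual, read from the glyphs in the comment as U+2266 followed by U+0338, and behind what is presumably the corrected value, U+2267 followed by U+0338; that corrected value is an assumption, since the resolution does not spell it out. The second part illustrates the reason given for not adding NotLessFullEqual: a document whose DTD lacks the name is rejected by the parser. The helper names and the tiny inline DTD are invented for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET


def describe(chars):
    """Return the Unicode character name of each character in the string."""
    return [unicodedata.name(c) for c in chars]


reported_value = "\u2266\u0338"   # value questioned in the comment
assumed_fix = "\u2267\u0338"      # assumption: greater-than form plus overlay

print(describe(reported_value))
print(describe(assumed_fix))

# A stand-in for an "older DTD": it declares LessFullEqual only.
DOC = (
    '<!DOCTYPE math [<!ENTITY LessFullEqual "&#x2266;">]>'
    "<math><mo>&{name};</mo></math>"
)


def parses(name):
    """True if a document referencing &name; is accepted by the parser."""
    try:
        ET.fromstring(DOC.format(name=name))
        return True
    except ET.ParseError:
        return False


print(parses("LessFullEqual"))
print(parses("NotLessFullEqual"))
```

The undeclared reference makes the whole document fail to parse, not just the one operator, which is the behaviour the working group cited when declining to add new names.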
