This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5486 - Regex: Names of Unicode code blocks
Summary: Regex: Names of Unicode code blocks
Status: CLOSED FIXED
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Datatypes: XSD Part 2
Version: 1.1 only
Hardware: PC Windows XP
Importance: P3 normal
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: XML Schema comments list
URL:
Whiteboard: cluster: regex
Keywords: editorial
Depends on:
Blocks:
 
Reported: 2008-02-16 10:43 UTC by Michael Kay
Modified: 2009-03-11 23:29 UTC

Description Michael Kay 2008-02-16 10:43:52 UTC
Part 2 of XML Schema 1.1 references Unicode 4.1 rather than Unicode 3.1.

The names of some code blocks, as used in constructs like \p{IsGreek}, have changed between versions of Unicode; for example, the Greek block is now named "Greek and Coptic". This change has not been reflected in the list of allowed block names, nor is it clear how the change should be reflected, given the need for backwards compatibility.
Comment 1 Michael Kay 2008-02-16 15:20:31 UTC
Looking at the list more closely, it seems that new blocks that have been added since Unicode 3.1 are present in the table in the 1.1 spec, but blocks that have been renamed in Unicode have not been renamed in the table. This is probably the right thing to do, but it merits a note. The statement (part of a Definition no less) "The set containing all characters that have block name X (with all white space stripped out), can be identified with a block escape \p{IsX}." appears not in fact to be normative; I think we must assume that it is intended that the table should contain the normative names of the blocks.

I'm also puzzled by a bit of history. XML Schema 1.0 First Edition contained a number of blocks in the non-BMP area, such as MusicalSymbols. These disappeared in the second edition of 1.0, but there appears to have been no erratum, and the changes are not highlighted in the change-marked version of the (1.0 2e) spec. How can this have happened?
Comment 2 Paul Biron 2008-02-16 15:48:03 UTC
For what it's worth, the original table in 1.0 was generated by a Perl script I wrote that read the Unicode DB and output the table. I'll look and see if I still have that, although it may have gotten lost when I left Kaiser.
Comment 3 C. M. Sperberg-McQueen 2008-05-09 20:01:54 UTC
The XML Schema working group discussed this issue today and reached the
following tentative conclusions:

  - Implementations are required to support a particular version of
    the Unicode Database and may support later versions.
  - In supporting any version of the Unicode Database, the normative
    statement of block names is that of the database (modified for use
    in regular expressions by the space-removal algorithm given in the
    spec).
  - (As a result:) The list of block names given in XSD 1.0 and 1.1 is
    informative, not normative.

As a result, we are classifying the issue as editorial and instructing the
editors to clean up the problem by aligning the block names with the
appropriate version of the Unicode Database and by making clearer that the
list given is informative, not normative.  The editors are also instructed
to spend a little time trying to figure out why the changes made between
1.0 1E and 2E aren't shown with change coloring in 2E's diffed version.
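
For illustration, the derivation of a block escape from a Unicode block name can be sketched as below. This is a minimal sketch, not text from the spec: the function name block_escape is hypothetical, and it assumes the space-removal algorithm means deleting every white space character from the block name and nothing else.

```python
def block_escape(block_name: str) -> str:
    # Hypothetical helper: build the \p{IsX} block escape for a Unicode
    # block name. Assumption: "space removal" deletes all white space
    # and leaves every other character (hyphens, digits) untouched.
    stripped = "".join(block_name.split())
    return r"\p{Is" + stripped + "}"


# The rename discussed in this bug yields two different escapes, which
# is why the choice of Unicode Database version matters.
print(block_escape("Greek"))
print(block_escape("Greek and Coptic"))
```

Under the tentative conclusions above, which of the two escapes a processor accepts would depend on the version of the Unicode Database it supports.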
Comment 4 Michael Kay 2008-10-28 12:06:59 UTC
See also the following message from James Clark:

http://lists.w3.org/Archives/Public/www-xml-schema-comments/2008OctDec/0076.html
Comment 5 Michael Kay 2009-03-11 23:29:36 UTC
The proposed fix appears to be present in the Jan 2009 draft. I am therefore marking this fixed and closed.