This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 8744 - Regex character classes C, L, M, etc.
Status: CLOSED FIXED
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Datatypes: XSD Part 2
Version: 1.0/1.1 both
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: David Ezell
QA Contact: XML Schema comments list
Keywords: decided
 
Reported: 2010-01-14 12:38 UTC by Michael Kay
Modified: 2010-04-26 14:59 UTC

Description Michael Kay 2010-01-14 12:38:58 UTC
The specification states:

<quote>
[Definition:] [Unicode Database] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X, can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X}. ([\P{X}] = [^\p{X}]).
</quote>
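
The relationship between \p{X} and \P{X} in the quoted definition can be sketched as a pair of predicates. This is only an illustration: it assumes Python's unicodedata module as a stand-in for the Unicode database (its Unicode version is newer than 5.1), and the function names in_category and not_in_category are made up for this example.

```python
import unicodedata

def in_category(category: str, ch: str) -> bool:
    # Rough analogue of \p{X} for a two-character General Category value.
    return unicodedata.category(ch) == category

def not_in_category(category: str, ch: str) -> bool:
    # Rough analogue of \P{X}: the complement of the set above.
    return not in_category(category, ch)

sample = ["a", "A", "1", " "]
print([ch for ch in sample if in_category("Lu", ch)])
print([ch for ch in sample if not_in_category("Lu", ch)])
```

Every character falls into exactly one of the two lists, which is what [\P{X}] = [^\p{X}] expresses.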

It then gives a table purporting to show the values of "General Category" that occur in Unicode 5.1. This includes single-character categories such as "C", "L", and "M". As far as I can see, however, Unicode only defines the two-character categories such as Ll, Lu, Mc and so on. The single-character categories are an invention of the regex language, and therefore need to be described in our specification, rather than by reference to Unicode.

There are two possible definitions of these categories, which give different results.

At least one XML Schema implementation has interpreted the single-character category X as the union of all two-character categories starting with X; for example, C is the union of Cc, Cf, Co, and Cn. Another interpretation (the one used by the Java regex library) is that X is the set of all characters listed in the Unicode database as belonging to a category starting with that letter. The two interpretations differ for category C, because Cn is the set of characters that are not listed in the relevant section of the Unicode database.
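
To make the difference concrete, here is a minimal sketch of the two interpretations for category C. It assumes Python's unicodedata module as a stand-in for the Unicode database (it reports unlisted code points as "Cn", and its Unicode version is newer than 5.1). The names C_SUBCATEGORIES, in_c_union and in_c_listed are illustrative only, and U+0378 is assumed to be unassigned in the Unicode version in use.

```python
import unicodedata

# Two-character categories that the report lists under C.
C_SUBCATEGORIES = {"Cc", "Cf", "Co", "Cn"}

def in_c_union(ch: str) -> bool:
    # Interpretation 1: C is the union of its two-character categories, Cn included.
    return unicodedata.category(ch) in C_SUBCATEGORIES

def in_c_listed(ch: str) -> bool:
    # Interpretation 2: only characters the database actually lists under a
    # C category. Unlisted code points show up here as "Cn" and are excluded.
    return unicodedata.category(ch) in C_SUBCATEGORIES - {"Cn"}

# Control, format, private-use, (assumed) unassigned, and a letter.
for ch in ["\u0000", "\u00ad", "\ue000", "\u0378", "A"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          in_c_union(ch), in_c_listed(ch))
```

Under these assumptions the two predicates agree on every sample except the unassigned code point, which only the union interpretation places in C.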
Comment 1 Michael Kay 2010-01-14 12:47:06 UTC
For reference, the Unicode 5.1 definition of character categories is here:

http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
Comment 2 David Ezell 2010-01-15 16:34:56 UTC
On the telcon, the WG agreed that C should include Cn, and that a note should be added explaining that Java does it differently.