This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 13607 - some regex tests depend on the unicode version used
Status: RESOLVED FIXED
Alias: None
Product: XML Schema Test Suite
Classification: Unclassified
Component: Microsoft tests
Version: 2006-11-06
Hardware: All All
Importance: P2 normal
Target Milestone: XSD 1.1 PR
Assignee: C. M. Sperberg-McQueen
QA Contact: XML Schema Test Suite mailing list
Keywords: decided
Reported: 2011-08-03 13:41 UTC by Andreas Meissl
Modified: 2011-10-24 22:53 UTC

Attachments
Comparison of character categories between Unicode 4.0.0 and 6.0.0 (42.65 KB, application/xml)
2011-10-24 17:40 UTC, Michael Kay

Description Andreas Meissl 2011-08-03 13:41:10 UTC
The following test cases from testSet 'MS-Regex2006-07-15' have the wrong expected result when using the latest Unicode database:

group='reS17' test='reS17.v' (፩ not in \p{Nd})
group='reS38' test='reS38.v' (፱ not in \p{Nd})
group='reS51' test='reS51.i' (௦ in \p{Nd})
group='reT17' test='reT17.i' (፩ not in \p{Nd})
group='reT38' test='reT38.i' (፱ not in \p{Nd})
group='reT51' test='reT51.v' (௦ in \p{Nd})
group='reU6' test='reU6.i' (ȿ not in \p{Cn})
group='reZ004v' test='reZ004v.v' (፩-፱ not in \p{Nd})

XSD 1.1 allows implementors to use later versions of the Unicode database, so the test cases above give different results with the latest version.

When using Unicode database 3.1 (as referenced from the XSD 1.0 spec), the following test case has the wrong expected result:

group='reZ003v' test='reZ003v.v' (Ƞ in \p{Cn})
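
The version dependence can be checked outside a schema processor. The following is a minimal sketch, assuming Python's standard `unicodedata` module as a stand-in for the Unicode database: it looks up the general category of the characters named above in the database bundled with the interpreter and in the frozen Unicode 3.2.0 copy (`unicodedata.ucd_3_2_0`). Python ships neither 3.1 nor 4.0.0, so this only approximates the comparison described in this report; in particular U+0220 was added in 3.2, so its difference against 3.1 will not show up here. The `CHARS` table and the `categories` helper are illustrative names, not part of the test suite.

```python
import unicodedata

# Characters mentioned in the failing tests, keyed by code point label.
CHARS = {
    "U+1369": "\u1369",  # ETHIOPIC DIGIT ONE
    "U+1371": "\u1371",  # ETHIOPIC DIGIT NINE
    "U+0BE6": "\u0be6",  # TAMIL DIGIT ZERO
    "U+023F": "\u023f",  # LATIN SMALL LETTER S WITH SWASH TAIL
    "U+0220": "\u0220",  # LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
}


def categories(db=unicodedata):
    """Return {label: general category} using the given Unicode database."""
    return {label: db.category(ch) for label, ch in CHARS.items()}


current = categories()                     # database bundled with Python
old = categories(unicodedata.ucd_3_2_0)    # frozen Unicode 3.2.0 snapshot

print("current database:", unicodedata.unidata_version)
for label in CHARS:
    print(label, "3.2.0:", old[label], "current:", current[label])
```

Any character whose two categories differ is one for which a `\p{...}` test can change outcome between processors built on different Unicode versions.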
Comment 1 David Ezell 2011-10-21 16:18:31 UTC
Decided: mark tests as depending on a specific Unicode version where required.
Comment 2 Michael Kay 2011-10-24 17:40:45 UTC
Created attachment 1037
Comparison of character categories between Unicode 4.0.0 and 6.0.0
Comment 3 Michael Kay 2011-10-24 17:41:53 UTC
I have confirmed these can be accounted for by differences between Unicode 4.0.0 and 6.0.0. I attach a file that summarises the differences between character categories in these two Unicode versions.
Comment 4 Michael Kay 2011-10-24 21:32:55 UTC
I can further confirm that after converting Saxon to use the Unicode 6.0.0 character categories, the tests that fail in the Microsoft regex test set are exactly those listed.
Comment 5 Michael Kay 2011-10-24 22:53:27 UTC
Fixed by making expected results conditional on Unicode version. Schema for metadata updated to accommodate this.
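
For a test harness, the idea of version-conditional expected results can be sketched roughly as below. This is an illustration only: the `EXPECTED` table, its layout, and the `expected_outcome` helper are hypothetical and do not reproduce the actual metadata schema. The outcomes shown are assumptions inferred from the test names (`.v` for a valid instance, `.i` for an invalid one) and from the description above.

```python
from typing import Optional

# Hypothetical table: test name -> {Unicode version: expected outcome}.
# The outcomes are assumed for illustration, not copied from the test suite.
EXPECTED = {
    "reS17.v": {"4.0.0": "valid", "6.0.0": "invalid"},
    "reS51.i": {"4.0.0": "invalid", "6.0.0": "valid"},
}


def expected_outcome(test: str, unicode_version: str) -> Optional[str]:
    """Return the expected outcome for a test under a given Unicode version.

    Returns None when no expectation is recorded for that combination,
    which a harness could treat as "not applicable to this processor".
    """
    return EXPECTED.get(test, {}).get(unicode_version)


# A processor built on Unicode 6.0.0 would compare against the 6.0.0 entry.
print(expected_outcome("reS17.v", "6.0.0"))
```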