13607 – some regex tests depend on the unicode version used

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	XML Schema Test Suite
Classification:	Unclassified
Component:	Microsoft tests
Version:	2006-11-06
Hardware:	All (OS: All)

Importance:	P2 normal
Target Milestone:	XSD 1.1 PR
Assignee:	C. M. Sperberg-McQueen
QA Contact:	XML Schema Test Suite mailing list

Keywords:	decided

Reported:	2011-08-03 13:41 UTC by Andreas Meissl
Modified:	2011-10-24 22:53 UTC
CC List:	2 users

Attachments
Comparison of character categories between Unicode 4.0.0 and 6.0.0 (42.65 KB, application/xml) 2011-10-24 17:40 UTC, Michael Kay

Description Andreas Meissl 2011-08-03 13:41:10 UTC

The following test cases from testSet 'MS-Regex2006-07-15' produce a result that differs from the expected one when the latest Unicode database is used:

group='reS17' test='reS17.v' (&#x1369; not in \p{Nd})
group='reS38' test='reS38.v' (&#x1371; not in \p{Nd})
group='reS51' test='reS51.i' (&#x0BE6; in \p{Nd})
group='reT17' test='reT17.i' (&#x1369; not in \p{Nd})
group='reT38' test='reT38.i' (&#x1371; not in \p{Nd})
group='reT51' test='reT51.v' (&#x0BE6; in \p{Nd})
group='reU6' test='reU6.i' (&#x023F; not in \p{Cn})
group='reZ004v' test='reZ004v.v' (&#x1369;-&#x1371; not in \p{Nd})

XSD 1.1 allows implementors to use later versions of the Unicode database. With the latest version, the character categories shown in parentheses above apply, so these test cases no longer produce their expected results.


When using Unicode database 3.1 (as referenced by the XSD 1.0 spec), the following test case produces a result that differs from the expected one:

group='reZ003v' test='reZ003v.v' (&#x0220; in \p{Cn})
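
To see how the categories of the code points named above shift between database versions, a short Python sketch can help. It is only an approximation of the situation in this report: Python's `unicodedata` module ships its current database plus a frozen 3.2.0 copy (`unicodedata.ucd_3_2_0`), not the 3.1, 4.0.0 or 6.0.0 databases discussed here. The helper name `category_report` and the `CODE_POINTS` list are made up for this illustration.

```python
import unicodedata

# Code points mentioned in the report (illustrative selection).
CODE_POINTS = [0x1369, 0x1371, 0x0BE6, 0x023F, 0x0220]


def category_report(code_points=CODE_POINTS):
    """Map each code point to (category in Unicode 3.2.0, category in current database)."""
    old_db = unicodedata.ucd_3_2_0  # frozen 3.2.0 database bundled with Python
    return {
        cp: (old_db.category(chr(cp)), unicodedata.category(chr(cp)))
        for cp in code_points
    }


if __name__ == "__main__":
    print("current database:", unicodedata.unidata_version)
    for cp, (old_cat, new_cat) in category_report().items():
        status = "changed" if old_cat != new_cat else "same"
        print(f"U+{cp:04X}  3.2.0={old_cat}  current={new_cat}  {status}")
```

Any code point reported as "changed" is a candidate for a regex test whose outcome depends on which database the processor was built against.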

Comment 1 David Ezell 2011-10-21 16:18:31 UTC

Decided: mark tests as depending on a specific Unicode version where required.

Comment 2 Michael Kay 2011-10-24 17:40:45 UTC

Created attachment 1037
Comparison of character categories between Unicode 4.0.0 and 6.0.0

Comment 3 Michael Kay 2011-10-24 17:41:53 UTC

I have confirmed these can be accounted for by differences between Unicode 4.0.0 and 6.0.0. I attach a file that summarises the differences between character categories in these two Unicode versions.

Comment 4 Michael Kay 2011-10-24 21:32:55 UTC

I can further confirm that after converting Saxon to use the Unicode 6.0.0 character categories, the tests that fail in the Microsoft regex test set are exactly those listed.

Comment 5 Michael Kay 2011-10-24 22:53:27 UTC

Fixed by making expected results conditional on Unicode version. Schema for metadata updated to accommodate this.
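
As a rough illustration of what "expected results conditional on Unicode version" could look like from a test harness's point of view, here is a minimal Python sketch. It does not reproduce the actual test suite metadata schema: the `EXPECTED` table, the `expected_outcome` helper, and the outcomes listed are hypothetical, inferred only from the test names (`.v` / `.i`) and from the category changes described in this bug.

```python
# Hypothetical table: test name -> {Unicode version: expected outcome}.
# Only the two versions compared in this bug are listed; the values are assumptions.
EXPECTED = {
    "reS17.v": {"4.0.0": "valid", "6.0.0": "invalid"},
    "reS51.i": {"4.0.0": "invalid", "6.0.0": "valid"},
}


def expected_outcome(test_name, unicode_version):
    """Return the expected outcome for this Unicode version, or None if unspecified."""
    return EXPECTED.get(test_name, {}).get(unicode_version)


if __name__ == "__main__":
    for name in EXPECTED:
        for version in ("4.0.0", "6.0.0"):
            print(name, version, expected_outcome(name, version))
```

Returning `None` for an unlisted version lets a harness skip the comparison rather than guess at an outcome.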