29865 – [FO31] UCA collation in substring matching

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 3.1
Version:	Candidate Recommendation
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2016-09-23 08:59 UTC by Michael Kay
Modified:	2016-12-16 19:55 UTC
CC List:	0 users

Description Michael Kay 2016-09-23 08:59:02 UTC

We say on the one hand:

All implementations must recognize URIs in this family [UCA collations] in the collation argument of functions that take a collation argument.

and on the other

It is possible to define collations that do not have the ability to decompose a string into units suitable for substring matching. An argument to a function defined in this section may be a URI that identifies a collation that is able to compare two strings, but that does not have the capability to split the string into collation units. Such a collation may cause the function to fail, or to give unexpected results or it may be rejected as an unsuitable argument. The ability to decompose strings into collation units is an ·implementation-defined· property of the collation.

I think we should be explicit (unless there is a technical reason to the contrary) that the UCA family of collations can be used for substring matching.

I have added some tests to contains(), starts-with(), and ends-with() to use UCA collations. Saxon fails on these tests, but I think that's an implementation shortcoming.

Comment 1 Michael Kay 2016-09-23 15:05:43 UTC

Just to confirm that I have now implemented substring matching on UCA collations (using the ICU library). It's tricky, but do-able, so I think there is no reason not to say that the UCA collations are always substring-capable.

I've submitted a slew of tests - my apologies if it spoils your weekend.

Comment 2 Michael Kay 2016-09-25 19:25:24 UTC

I'd like to extend the scope of this issue: we have a number of examples of how functions such as contains() handle ignorable characters such as punctuation and spaces, and this area seems to have evolved in recent drafts of UCA. As always with UCA, it's very difficult to understand the full complexity of the specs, but here's an attempt.

We define three values for the collation parameter "alternate": non-ignorable, shifted, and blanked. Although no-one would guess it from the name, "alternate" is about how "noise" characters like punctuation and whitespace are handled.

In the spec we're very coy about saying what these mean, and the reason for that is that it's difficult and dangerous to paraphrase something so complex. But here's an attempt at a summary:

"non-ignorable" - noise characters are treated like ordinary primary characters, generally with a sort order lower than other characters. So for example "data type" sorts before "database".

"shifted" - noise characters are less significant than other differences, for example they are less significant than accents (secondary differences) or case (tertiary differences) - that is, they are treated as quaternary differences. In turn this means (I think) that if the strength of the collation is less than quaternary, then noise characters are ignored entirely.

"blanked" - I think this means that if the strength of the collation is less than "identical", noise characters are ignored entirely.
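To make the three settings concrete, here is a toy sketch in Python. It is not the UCA algorithm: weights are plain code points, accents (secondary differences) are left out, and the "noise" set is assumed to be ASCII whitespace and punctuation. The names sort_key, VARIABLE, LEVELS and HIGH are invented for this illustration.

```python
import string

# Assumption: a toy "noise" set; real UCA takes this from its weight tables.
VARIABLE = set(string.whitespace) | set(string.punctuation)
LEVELS = {"primary": 1, "secondary": 2, "tertiary": 3,
          "quaternary": 4, "identical": 5}
HIGH = 0x110000  # quaternary weight given to every non-noise character


def sort_key(text, alternate="non-ignorable", strength="tertiary"):
    """Build a toy multi-level sort key (secondary level omitted)."""
    depth = LEVELS[strength]
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if ch in VARIABLE and alternate in ("shifted", "blanked"):
            # Noise characters vanish from the primary..tertiary levels.
            if alternate == "shifted":
                # "shifted" keeps them as a quaternary difference only.
                quaternary.append(ord(ch))
            continue
        primary.append(ord(ch.lower()))
        tertiary.append(1 if ch.isupper() else 0)
        quaternary.append(HIGH)
    key = [tuple(primary)]
    if depth >= 3:
        key.append(tuple(tertiary))
    if depth >= 4:
        key.append(tuple(quaternary))
    if depth >= 5:
        key.append(tuple(ord(ch) for ch in text))  # identical level
    return tuple(key)
```

In this model "data type" sorts before "database" under non-ignorable, but after it under shifted (the space drops out and "t" is compared with "b"). "data-type" and "datatype" compare equal under shifted at primary strength and differ at quaternary; under blanked they differ only at identical strength.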

The other question is, what characters are treated as noise (which is my term, not a UCA or LDML term)? The UCA/LDML term for these (completely unintuitively) is "variable" characters. In older versions of LDML and UCA this is defined by something called variableTop; in the most recent versions it is instead defined by maxVariable. I think this is a useful parameter and we should add it as follows:

maxVariable=space|punct|symbol|currency - indicates that all characters in the specified group and earlier groups are "noise" characters, to be handled as defined by the "alternate" parameter. For example, maxVariable=punct indicates that characters classified as whitespace or punctuation get this treatment.

alternate=non-ignorable|shifted|blanked - indicates how "noise" characters (as defined by the maxVariable parameter) are to be treated: non-ignorable indicates that they are significant characters in their own right; shifted indicates that they affect the comparison of strings only at the quaternary level; blanked indicates that they affect the comparison of strings only at the identical level.
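The cumulative nature of maxVariable can be sketched as follows. This is only an approximation: it assumes Unicode general categories are a reasonable stand-in for the UCA/LDML groups, which they are not exactly. The names GROUPS, group_of and is_variable are hypothetical.

```python
import unicodedata

# Groups in the order given by the proposed parameter values.
GROUPS = ["space", "punct", "symbol", "currency"]


def group_of(ch):
    """Assumed mapping from a character to a maxVariable group, or None."""
    cat = unicodedata.category(ch)
    if ch.isspace() or cat.startswith("Z"):
        return "space"
    if cat.startswith("P"):
        return "punct"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    return None


def is_variable(ch, max_variable="punct"):
    """True if ch falls in the named group or any earlier group."""
    group = group_of(ch)
    if group is None:
        return False
    return GROUPS.index(group) <= GROUPS.index(max_variable)
```

With maxVariable=punct, a hyphen or a space counts as noise while "+" and "$" do not; raising it to symbol adds "+", and currency adds "$" as well.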

In addition, I think that interoperability demands that we define some defaults, especially as the defaults are in some cases different between UCA and LDML. I suggest: strength=tertiary, alternate=non-ignorable, maxVariable=punct, backwards=no, caseLevel=no, normalization=no, numeric=no, caseFirst unspecified (the default then depends on other parameters, e.g. lang).
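As an illustration of how such defaults would combine with an explicit collation URI, here is a minimal sketch. It assumes the semicolon-separated parameter syntax of the UCA collation URI family and uses the defaults proposed in this comment, which are not necessarily those of the final specification. The names PROPOSED_DEFAULTS and effective_parameters are hypothetical.

```python
UCA_BASE = "http://www.w3.org/2013/collation/UCA"

# Defaults as proposed in this comment; lang and caseFirst are deliberately
# left out because the comment leaves them unspecified.
PROPOSED_DEFAULTS = {
    "strength": "tertiary",
    "alternate": "non-ignorable",
    "maxVariable": "punct",
    "backwards": "no",
    "caseLevel": "no",
    "normalization": "no",
    "numeric": "no",
}


def effective_parameters(uri):
    """Merge the parameters given in a UCA collation URI over the defaults."""
    base, _, query = uri.partition("?")
    if base != UCA_BASE:
        raise ValueError("not a UCA collation URI: " + uri)
    params = dict(PROPOSED_DEFAULTS)
    for part in query.split(";"):
        if part:
            name, _, value = part.partition("=")
            params[name] = value
    return params
```

For example, a URI ending in ?lang=en;alternate=blanked;strength=primary overrides alternate and strength, adds lang, and leaves maxVariable at punct.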

Question: should lang default to the default language from the dynamic context? There are usability arguments in favour of this, but on the whole I think not: in XQuery "order by" the collation is defined statically and I think it's assumed that if a collation is specified as a literal, we know statically what collation is being used and can optimize accordingly, e.g. by selecting database indexes. Leave it implementation-defined, and an implementation can take it from the dynamic context if it chooses.

Comment 3 Michael Kay 2016-09-25 21:59:49 UTC

It might be worth mentioning why I am exploring this area at this particular stage: in the run-up to publication of F+O as a PR, I have been running a script that tests all the examples in the spec. In particular I have been trying to test examples such as the one for fn:contains:

The expression fn:contains ( "abcdefghi", "-d-e-f-", "http://example.com/CollationA") returns true().

by substituting a real collation 

"http://www.w3.org/2013/collation/UCA?lang=en;alternate=blanked;strength=primary"

that should have the desired properties.
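Why the example should return true under such a collation can be seen in a toy model of substring matching. The sketch assumes each non-ignorable character is one collation unit, which is not true of real UCA collations (contractions and expansions break that assumption), and it is not how Saxon or ICU implement matching. The names IGNORABLE, collation_units and contains are made up for this illustration.

```python
import string

# Assumption: with alternate=blanked and strength=primary, whitespace and
# punctuation are ignored and case differences do not matter.
IGNORABLE = set(string.whitespace) | set(string.punctuation)


def collation_units(text):
    """Split text into toy collation units, dropping ignorable characters."""
    return [ch.lower() for ch in text if ch not in IGNORABLE]


def contains(haystack, needle):
    """True if the needle's units occur contiguously in the haystack's units."""
    h, n = collation_units(haystack), collation_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))
```

Here "-d-e-f-" reduces to the units d, e, f, which occur contiguously in "abcdefghi", so the match succeeds; "-d-f-" reduces to d, f, which do not.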

This doesn't work in the current release of Saxon, and I have been investigating what is needed to make it work.

In the course of this I also found that UCA collations were insufficiently tested in the QT3 test suite and I have been extending the test coverage and exploring the test failures that arise in Saxon.

Comment 4 Michael Kay 2016-09-27 21:55:26 UTC

The agreed changes have been applied to both the F+O 3.1 and XSLT 3.0 specifications.