This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 3698 - [FT] Interaction between FTDiacriticsOption and collation unclear
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Working drafts
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2006-09-11 16:40 UTC by Jochen Doerre
Modified: 2007-02-18 22:47 UTC

Description Jochen Doerre 2006-09-11 16:40:13 UTC
Editorial

Some of the entries of the Diacritics Matrix in 3.2.2 do not clearly describe what the intended comparison operation for the given case should be. In particular:
 - the entry for UCC / "insensitive", which states "compare as if with and without" (well, what???)
 - the 4 entries for UCC+CDS / "with" + "without diacritics", which each use an example query.
The reader has no clue how to interpret those example queries. Even if they are meant to show how to reduce the "with" and "without" options to the other options, there are several problems with them.

E.g. in the entry for CDS / "with diacritics" the query stated there: 

  "resume diacritics insensitive" not in  "resume"

(i) is syntactically not what it was meant to be (probably: "resume" diacritics insensitive not in  "resume"), 
(ii) depends on diacritics options higher up the query tree, or on a specified default for the diacritics option (note that the second "resume" term is matched according to that setting); 
and (iii) can never have a match in the default case, where the second "resume" is matched insensitively as well.

So maybe, this query should be: 

  "resume" diacritics insensitive not in  "resume" diacritics sensitive

(which would indeed be an equivalent rewrite for "resume" with diacritics, because the term "resume" is spelled deliberately without diacritics in the second subquery), but then what would be the case for "without diacritics"?
Also, the rewriting relies on our having control over whether the query term itself contains diacritics, and over how it would need to be transposed if it did. In general, however, we cannot assume this. E.g. consider the query:
 $node ftcontains $term with diacritics
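
To make the rewrite discussed above concrete, here is a rough Python sketch. It is not XQuery Full Text: it assumes that "not in" can be approximated by a set difference over token positions, and that "diacritics" means Unicode combining marks. The helper names and the sample tokens are made up for illustration.

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def match_positions(tokens, term, sensitive):
    """Positions of tokens equal to term under the given diacritics setting."""
    if sensitive:
        wanted = unicodedata.normalize("NFC", term)
        return {i for i, tok in enumerate(tokens)
                if unicodedata.normalize("NFC", tok) == wanted}
    wanted = strip_diacritics(term)
    return {i for i, tok in enumerate(tokens) if strip_diacritics(tok) == wanted}

# Hypothetical token stream; \u00e9 is "e" with an acute accent.
tokens = ["resume", "r\u00e9sum\u00e9", "resum\u00e9", "report"]

insensitive = match_positions(tokens, "resume", sensitive=False)
sensitive = match_positions(tokens, "resume", sensitive=True)

# "resume" diacritics insensitive not in "resume" diacritics sensitive
with_diacritics = insensitive - sensitive

# Point (iii): if the second operand is matched insensitively too,
# the difference is always empty.
default_case = insensitive - insensitive
```

Under these assumptions, `with_diacritics` keeps only the accented spellings, but only because the literal "resume" was typed without accents. With a variable such as $term there is no such guarantee.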

/jochen
Comment 1 Jochen Doerre 2006-10-13 11:25:54 UTC
Here is my proposal to fix the matrix.

1. UCC/"insensitive" should read:
compare base characters only, disregarding diacritics

Rows 3 and 4 (for with+without diacritics) should be dropped.
Instead add the following sentence after the table and the Note:

For the options "with diacritics" and "without diacritics", the underlying
comparison is the same as for "diacritics insensitive"; however, only tokens that contain (respectively, do not contain) characters with diacritical marks are considered.
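
As an illustration only, the proposed wording could be read as the following Python sketch. It assumes that "characters with diacritical marks" can be detected as Unicode combining marks after NFD decomposition, which is a simplification and not something the spec states. The function names are hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def has_diacritics(text):
    """True if any character decomposes to include a combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)

def token_matches(token, term, option):
    """Compare one document token with a query term under a diacritics option."""
    if option == "sensitive":
        return (unicodedata.normalize("NFC", token)
                == unicodedata.normalize("NFC", term))
    # All remaining options share the insensitive comparison.
    base_equal = strip_diacritics(token) == strip_diacritics(term)
    if option == "insensitive":
        return base_equal
    if option == "with":
        # Additionally require the *token* to carry a diacritic.
        return base_equal and has_diacritics(token)
    if option == "without":
        # Additionally require the *token* to carry no diacritic.
        return base_equal and not has_diacritics(token)
    raise ValueError(f"unknown diacritics option: {option!r}")
```

In this reading the filter looks only at the document token, so the result does not depend on how the query term happens to be spelled. That would also cover the `$term with diacritics` case raised in the description.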

I hope this improves it.

/Jochen
Comment 2 Pat Case 2007-01-31 13:54:27 UTC
I support Jochen's proposal to reduce the diacritics options in v.1 to sensitive and insensitive.

If this proposal is accepted, it may also prompt closing of Bug 3927.
Comment 3 Jim Melton 2007-02-18 22:46:57 UTC
The proposed change was discussed by the TF at its February 1/2 F2F and was adopted in principle.  We are marking this bug FIXED.  Since you were present when we adopted this resolution, and agreed to that resolution, we are also marking it CLOSED.