This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 6830 - [FT] Thesaurus vs other Match Options
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2009-04-16 19:30 UTC by Christian Gruen
Modified: 2009-06-09 13:20 UTC

Description Christian Gruen 2009-04-16 19:30:32 UTC
Hi again,

I noticed that combining several match options with the Thesaurus option can be interpreted in different ways. My main question is whether other match options influence how the thesaurus works. An example:

 "improving" ftcontains "improve" with stemming

This query should return true. If we add a thesaurus here...

 "improving" ftcontains "optimizing" with stemming with thesaurus..

...and if the thesaurus resolves "optimize" to "improve", I am wondering whether this query will return true, as the thesaurus entries would have to be stemmed as well.

The same question arises with the default match options. For example: are diacritics to be removed in the thesaurus?

As a Thesaurus can get pretty large, similar to index structures, I would recommend applying all match options while building and BEFORE querying the Thesaurus; otherwise, Thesaurus requests could get pretty expensive. This is why I would propose to extend section 3.4 of the specification:

   1. The Language Option must be applied first
   2. The Stemming Option must be applied before the Case Option and the 
      Diacritics Option
-> 3. The Thesaurus Option must be applied after all other options

This also makes sense because the Thesaurus might not need to be accessed at all if the query and document terms are equal anyway...

  "A" ftcontains "A" with thesaurus...
  -> should yield true without even checking the thesaurus
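
The proposed ordering can be sketched roughly as follows. This is Python rather than XQuery and only an illustration under stated assumptions: `toy_stem` is a made-up suffix stripper (not a real stemming algorithm), the one-entry thesaurus and all function names are hypothetical, and nothing here is taken from the specification or from an existing implementation.

```python
import unicodedata


def strip_diacritics(term):
    # Decompose characters and drop combining marks (e.g. "é" -> "e").
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(term):
    # Toy stemmer (assumption): strips "ing" or a trailing "e" only.
    for suffix in ("ing", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def normalize(term):
    # Stemming first, then case, then diacritics.
    return strip_diacritics(toy_stem(term).lower())


def build_thesaurus(raw):
    # Apply the match options once, while building the thesaurus.
    return {
        normalize(key): {normalize(value) for value in values}
        for key, values in raw.items()
    }


def matches(doc_term, query_term, thesaurus):
    doc, query = normalize(doc_term), normalize(query_term)
    if doc == query:
        # Equal terms: the thesaurus is never consulted.
        return True
    # Thesaurus lookup happens last, on already normalized terms.
    return doc in thesaurus.get(query, set())


# Hypothetical thesaurus in which "optimize" resolves to "improve".
thesaurus = build_thesaurus({"optimize": ["improve"]})
print(matches("improving", "optimizing", thesaurus))
```

Because the thesaurus keys and values are normalized at build time, each query needs only a single dictionary lookup, which is the cost argument made above.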


I just discovered the following sentence in the first section of the specification:

"The WGs particularly solicit feedback regarding how thesauri are to be used in combination."

So I hope that my discussion here contributes a little to this issue.

Christian
Comment 1 Christian Gruen 2009-05-11 20:22:28 UTC
Hi Mary,

"full-text-composability-queries-results-q3.xq" is a test suite example that looks ambiguous to me:

   ... "quote.{0,5}" with wildcards with thesaurus ... 


The implementation could...

a) either yield all thesaurus entries that match the "quote.{0,5}" wildcard expression - which implies that the thesaurus itself must be able to handle wildcards and is aware of the other match options

b) or only look up "quote.{0,5}" in the thesaurus
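
The difference between the two readings can be sketched like this. It is Python rather than XQuery, and everything in it is an assumption made for illustration: the thesaurus entries are invented, the function names are hypothetical, and Python's `re` module stands in for the Full Text wildcard syntax (this particular pattern appears to mean the same thing in both).

```python
import re


def expand_with_wildcards(pattern, thesaurus):
    # Reading (a): every thesaurus entry matching the wildcard
    # expression contributes its synonyms.
    regex = re.compile(pattern)
    synonyms = set()
    for entry, values in thesaurus.items():
        if regex.fullmatch(entry):
            synonyms |= values
    return synonyms


def expand_literal(pattern, thesaurus):
    # Reading (b): the wildcard expression is looked up as a plain string.
    return set(thesaurus.get(pattern, ()))


# Hypothetical thesaurus entries.
thesaurus = {
    "quote": {"cite"},
    "quoted": {"cited"},
    "remark": {"comment"},
}

print(expand_with_wildcards("quote.{0,5}", thesaurus))
print(expand_literal("quote.{0,5}", thesaurus))
```

With these entries, reading (a) collects the synonyms of "quote" and "quoted", while reading (b) finds nothing unless the thesaurus happens to contain the literal string "quote.{0,5}".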


What do you think?

Christian
Comment 2 Pat Case 2009-06-09 12:52:35 UTC
Christian,

I have corrected full-text-composability-queries-results-q3 and q3b by adding a second, empty result to each, to cover implementations that process the thesaurus match option first.

I believe we decided that these were the only test cases where the order of processing for the thesaurus and wildcard match options was an issue, so I am marking this bug fixed.

If you agree with my fix, would you mark this bug closed?

Pat
Comment 3 Christian Gruen 2009-06-09 13:20:04 UTC
Thanks Pat,

I've tested the latest changes, and closed this bug.
Christian