This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 11444 - [FT] FTThesaurusOption "levels" default should be implementation-defined
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2010-12-01 04:03 UTC by Paul J. Lucas
Modified: 2011-01-04 17:27 UTC

Description Paul J. Lucas 2010-12-01 04:03:38 UTC
The spec section 3.4.3 says in part:

> FTThesaurusID specifies the relationship sought between tokens and phrases written in the query and terms in the thesaurus and the number of levels to be queried in hierarchical relationships by including an FTRange "levels". If no levels are specified, the default is to query all levels in hierarchical relationships.

The problem with defaulting to "all levels" is that it makes queries too broad.  For example, using the WordNet data for the word "canary":

$ wn canary -n1 -hypen

Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun canary

Sense 1
fink, snitch, snitcher, stoolpigeon, stool pigeon, stoolie, sneak, sneaker, canary
       => informer, betrayer, rat, squealer, blabber
           => informant, source
               => communicator
                   => person, individual, someone, somebody, mortal, soul
                       => organism, being
                           => living thing, animate thing
                               => whole, unit
                                   => object, physical object
                                       => physical entity
                                           => entity
                       => causal agent, cause, causal agency
                           => physical entity
                               => entity

then every query, e.g.:

	.//book/content contains text "canary" using
	thesaurus at "http://wordnet.princeton.edu"

would return true if the content contains any of the words "whole", "object", "entity", etc., which is, IMHO, not a useful result and most likely not what the user would want because those words are so far removed from "canary".
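
To make the breadth concrete, here is a minimal Python sketch (not part of the spec or of any implementation) that expands "canary" through the hypernym chain shown above, with an optional bound on the number of levels. The HYPERNYMS table is hand-copied from the wn output, keeping only the first term of each synset, and expand is a hypothetical helper. Treating one "=>" step as one level is an assumption, since the spec does not say what a level is.

```python
from typing import Dict, List, Optional, Set

# Hand-copied from the wn output above; only the first term of each
# synset is kept. Each "=>" step is assumed to be one level.
HYPERNYMS: Dict[str, List[str]] = {
    "canary": ["informer"],
    "informer": ["informant"],
    "informant": ["communicator"],
    "communicator": ["person"],
    "person": ["organism", "causal agent"],
    "organism": ["living thing"],
    "living thing": ["whole"],
    "whole": ["object"],
    "object": ["physical entity"],
    "causal agent": ["physical entity"],
    "physical entity": ["entity"],
}


def expand(term: str, max_levels: Optional[int] = None) -> Set[str]:
    """Return term plus every hypernym reachable within max_levels steps.

    max_levels=None means no bound, i.e. "all levels".
    """
    seen = {term}
    frontier = [term]
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = []
        for current in frontier:
            for parent in HYPERNYMS.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
        level += 1
    return seen


if __name__ == "__main__":
    print(sorted(expand("canary", 2)))  # bounded expansion
    print(sorted(expand("canary")))     # unbounded: reaches "entity"
```

Under these assumptions, a bound of 2 levels keeps the match set to "canary", "informer" and "informant", while the unbounded expansion pulls in all twelve terms of the table, including "whole", "object" and "entity".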

My suggestion is to change the last sentence of the cited paragraph from the spec to read:

> If no levels are specified, the default number of levels to query in hierarchical relationships is implementation-defined.

- Paul
Comment 1 Pat Case 2010-12-21 16:53:06 UTC
Paul,

The XQuery and XSL working groups met today and considered your proposal to change the default for thesaurus levels.

We decided to leave the default (to all levels) as is, not making any change to the document. 

We believe there is enough flexibility in the spec to accomplish what you want whether the default is changed or not.

Please mark this bug closed if you accept this resolution.

Pat Case, Library of Congress, Member XML Working Group.
Comment 2 Paul J. Lucas 2010-12-21 17:11:59 UTC
Please cite other relevant excerpts from the spec that can be used to accomplish what I want.

No, I do not accept this resolution, so I am not marking it as closed.
Comment 3 Paul J. Lucas 2011-01-03 18:13:35 UTC
Since I didn't get a response on how, exactly, "there is enough flexibility in the spec to accomplish what [I] want," I'm re-opening the bug.

To me, the spec as currently written is quite clear in that "the default is to query *ALL* levels in hierarchical relationships" [emphasis mine].

Of course, the spec doesn't specify what a "level" is either.  Although the spec doesn't explicitly specify that a "level" is implementation-defined, its silence on that point makes it de facto implementation-defined.  I'm OK with this, although the spec could say so explicitly.  However, I don't feel strongly enough on that point to file a separate bug.

But whatever an implementation defines a "level" to be, it shouldn't be forced to query *ALL* of them by default.  In addition to the likely semantically useless results I gave an example of already ("entity" is a synonym of "canary"), querying *ALL* levels would be more computationally expensive and hence shouldn't be the default for that reason also.

In practice, hierarchical relationships wouldn't be longer than a dozen or so levels anyway (but this doesn't negate either of my reasons above for not querying all of them by default).  If a user really wants *ALL* levels, s/he can specify an arbitrarily high range of, say, "at most 100 levels" and get the same effect.
Comment 4 Pat Case 2011-01-04 17:15:55 UTC
Paul,

As you know, the XML Query and XSL Working Groups decided to change the level default to:
The default is either to all levels or to an implementation-defined number of levels.

I will add your issue of refining the language about levels to Bug 5624, "[FT] TRACKER: deferred functionality", where we keep issues for the next version of the specification.

I am marking this bug as resolved/fixed. If you are satisfied, please mark it closed.

Pat Case, Library of Congress, XML Query Working Group