This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 9858 - [FT] FTStopWordOption and FTCaseOption interaction clarification
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2010-06-05 00:17 UTC by Paul J. Lucas
Modified: 2011-01-06 01:30 UTC

Description Paul J. Lucas 2010-06-05 00:17:10 UTC
I originally sent this as an ordinary (non-bug) e-mail to the public-qt-comments@w3.org list hoping for some kind of comment.  I prefer not to write bugs willy-nilly, but to get at least one other person who agrees with me before filing a bug.  However, in this case, after having gotten no comments, I'm submitting this as a bug anyway because I think it's warranted.  That said....

As far as I can tell, nowhere in the spec does it say anything specifically about case sensitivity and the FTStopWordOption.

Does the value of FTCaseOption affect stop-word comparisons?  E.g.:

	let $x := <p>BEST OF TIMES</p>
	return $x contains text "BEST ANY TIMES"
	  using stop words ("any")
	  using case sensitive

Should that query return true or false?

The spec should say explicitly what the interaction between those two match options is supposed to be or at the very least explicitly state that it's implementation-defined.

Comment 1 Michael Dyck 2010-07-14 08:04:47 UTC
This matter was discussed during the joint teleconference of the XML Query WG and the XSL WG on 2010-06-15.

The result of your example query hinges on whether the query token "ANY" is a stop word. This in turn depends on whether "ANY" is in the list of stop words defined by the stop word option
    using stop words ("any")
i.e., whether that involves a case-insensitive comparison.

Interaction with the 'case' option is not the issue here, because it (like
all match options) only affects the matching between query tokens and tokens
*in the text being searched*, and the stop word "any" is not in the text being searched.

Instead, the WGs decided that this is an implementation-defined matter. Specifically:
    An implementation-defined comparison is used to determine whether
    a query token appears in the collection of stop words defined by
    the applicable stop word option.
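
One way to picture an "implementation-defined comparison" is as a pluggable equality test. The following is a minimal Python sketch under that assumption; the names `is_stop_word`, `exact`, and `caseless` are hypothetical and come from neither the spec nor any implementation.

```python
from typing import Callable, Iterable

# A comparison takes (query_token, stop_word) and reports whether they count
# as the same word. Which comparison is used is up to the implementation.
Comparison = Callable[[str, str], bool]


def exact(a: str, b: str) -> bool:
    """Case-sensitive comparison: the strings must be identical."""
    return a == b


def caseless(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode case folding."""
    return a.casefold() == b.casefold()


def is_stop_word(query_token: str, stop_words: Iterable[str], same: Comparison) -> bool:
    """True if query_token appears in stop_words under the chosen comparison."""
    return any(same(query_token, word) for word in stop_words)


stop_words = ["any"]  # as in: using stop words ("any")
print(is_stop_word("ANY", stop_words, exact))     # False
print(is_stop_word("ANY", stop_words, caseless))  # True
```

Two conforming implementations could therefore disagree on whether the query token "ANY" is a stop word, provided each documents the comparison it uses.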

I was directed to make the necessary changes to the Full Text spec, which I have done. Therefore, I am marking this issue resolved-FIXED. If you are satisfied with this outcome, please mark it CLOSED.

Comment 2 Paul J. Lucas 2010-07-14 15:39:07 UTC
Even though this bug has been "resolved" by making the answer "implementation dependent," the issue, despite Mr. Dyck's statement to the contrary, really does have to do with the query tokens. So, for the record....

From the spec, section 3.4.7:

> Stop words are tokens in the *query* that match any token in the text being searched.

> Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms.

If my query were instead:

    let $x := <p>BEST OF TIMES</p>
    return $x contains text "best any times"
      using stop words ("any")

then the query term would effectively become:

    "best .* times" using wildcards

which matches "BEST OF TIMES" because:

> The "stop words" option specifies that if a token is within the specified collection of stop words, it is removed from the search and any token may be substituted for it.

Using .* as a replacement for each stop word satisfies the semantics of "any token may be substituted for it."
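
The rewrite described above can be sketched in Python. This is only an illustration under stated assumptions: tokens are split on whitespace (real tokenization is implementation-defined), each stop word becomes `\S+` so that it stands for exactly one token rather than the looser `.*`, and `re.IGNORECASE` models a case-insensitive match between query tokens and text tokens. The function name `query_to_regex` is hypothetical.

```python
import re
from typing import Iterable


def query_to_regex(query: str, stop_words: Iterable[str]) -> re.Pattern:
    """Turn a query phrase into a regex in which stop words match any one token."""
    stops = set(stop_words)
    parts = []
    for token in query.split():
        if token in stops:
            parts.append(r"\S+")            # stop word: any single token may stand here
        else:
            parts.append(re.escape(token))  # ordinary token: must match literally
    # The lookarounds keep the phrase aligned to whitespace-delimited tokens.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)", re.IGNORECASE)


pattern = query_to_regex("best any times", ["any"])
print(pattern.search("BEST OF TIMES") is not None)  # True
```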

Now, if we return to my original query: if "using case sensitive" were to apply to stop-word determination, then "ANY" would not be found in the list of stop-words of "any"; hence, "ANY" would not be considered a stop-word and therefore it would not be "removed from the search and [allow] any token [to] be substituted for it."  So "BEST ANY TIMES" would not match "BEST OF TIMES" and the query would return false.

If "using case sensitive" were not to be considered during stop-word determination, then "ANY" would be found in the list of stop-words of "any"; hence "ANY" would be considered a stop-word and therefore would be "removed from the search and [allow] any token [to] be substituted for it."  So "BEST .* TIMES" would match "BEST OF TIMES" and the query would return true.
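
The two outcomes can be reproduced with a small Python model. It is a sketch, not the spec's algorithm: it assumes whitespace tokenization and exact phrase adjacency, and it exposes stop-word determination as a separate switch (`stop_words_case_sensitive`, a hypothetical name) so that both readings can be run side by side.

```python
from typing import Iterable


def contains_text(text: str, query: str, stop_words: Iterable[str],
                  case_sensitive: bool, stop_words_case_sensitive: bool) -> bool:
    """Phrase match in which query tokens that are stop words match any text token."""

    def norm(token: str, sensitive: bool) -> str:
        return token if sensitive else token.casefold()

    stops = {norm(word, stop_words_case_sensitive) for word in stop_words}
    text_tokens = text.split()
    query_tokens = query.split()
    n = len(query_tokens)

    def token_ok(q: str, t: str) -> bool:
        if norm(q, stop_words_case_sensitive) in stops:
            return True  # stop word: any text token may be substituted for it
        return norm(q, case_sensitive) == norm(t, case_sensitive)

    # Slide the query phrase over every window of n consecutive text tokens.
    return any(
        all(token_ok(q, t) for q, t in zip(query_tokens, text_tokens[i:i + n]))
        for i in range(len(text_tokens) - n + 1)
    )


text, query, stops = "BEST OF TIMES", "BEST ANY TIMES", ["any"]
# Stop-word determination is case-sensitive: "ANY" is not a stop word.
print(contains_text(text, query, stops, True, True))   # False
# Stop-word determination ignores case: "ANY" is a stop word.
print(contains_text(text, query, stops, True, False))  # True
```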

Also, and very importantly, it's intentional and entirely the point that "any" is *not* in the text being searched.  If the query were instead:

    let $x := <p>BEST ANY TIMES</p>
    return $x contains text "BEST ANY TIMES"
      using stop words ("any")
      using case sensitive

then it would be equivalent to:

    let $x := <p>BEST OF TIMES</p>
    return $x contains text "BEST ANY TIMES"

since the query text matches the search context tokens exactly whether "ANY" is considered a stop-word or not.

Comment 3 Paul J. Lucas 2010-07-14 15:46:10 UTC
It seems a typo crept in.  My last example query should have been:

    let $x := <p>BEST ANY TIMES</p>
    return $x contains text "BEST ANY TIMES"

Sorry.  Regardless, my points still stand.

Comment 4 Michael Dyck 2010-07-14 17:10:15 UTC
(In reply to comment #2)
> Even though this bug has been "resolved" by making the answer "implementation
> dependent,"

Implementation-defined, actually; "implementation-dependent" means something else.

> the issue, despite Mr. Dyck's statement to the contrary, really
> does have to do with the query tokens.

Indeed it does. If you think I said it didn't, then it seems you misunderstood me.

> [...]
> If my query were instead:
> 
>     let $x := <p>BEST OF TIMES</p>
>     return $x contains text "best any times"
>       using stop words ("any")
> 
> then the query term would effectively become:
> 
>     "best .* times" using wildcards
> 
> which matches "BEST OF TIMES" [...]

Agreed.

> Now, if we return to my original query: if "using case sensitive" were to
> apply to stop-word determination, then "ANY" would not be found in the list
> of stop-words of "any"; hence, "ANY" would not be considered a stop-word and
> therefore it would not be "removed from the search and [allow] any token [to]
> be substituted for it."  So "BEST ANY TIMES" would not match "BEST OF TIMES"
> and the query would return false.

Agreed.

> If "using case sensitive" were not to be considered during stop-word
> determination, then "ANY" would be found in the list of stop-words of "any";
> hence "ANY" would be considered a stop-word and therefore would be "removed
> from the search and [allow] any token [to] be substituted for it."  So
> "BEST .* TIMES" would match "BEST OF TIMES" and the query would return true.

Agreed, more or less.

In the second paragraph of my comment #1, I summarized what I saw to be the point of your example, and I believe it's consistent with what you've said above.

My subsequent point was that, although the matter certainly hinges on whether a particular comparison is case-[in]sensitive, it's incorrect to bring the case option into the discussion, because the case option is not defined to govern comparisons of the two things being compared here. Specifically, the case option governs the matching of
   a query token vs. a token in the text being searched,
not the comparison of
   a query token vs. a stop word in the collection of stop words
                     defined by a stop word option

> Also, and very importantly, it's intentional and entirely the point that "any"
> is *not* in the text being searched.

Agreed. I think I see the problem. When I said:
    and the stop word "any" is not in the text being searched.
I did *not* mean:
    and the token "any" does not occur in the text being searched.
Rather, I meant something more like:
    and, in the stop word option
        using stop words ("any")
    that "any" is a StringLiteral in an FTStopWordOption,
    not a token in the text being searched (and so, is not
    something that the case option is defined to deal with).

Comment 5 Paul J. Lucas 2011-01-05 22:10:01 UTC
I'd be happy to mark this as CLOSED, but I can't see the updated specification from the "outside".

Comment 6 Paul J. Lucas 2011-01-06 01:30:30 UTC
Verified as fixed.